Learn Topic 5: Pandas
Lesson 1: Creating Data and Reading/Writing Files
1. Import the pandas package
import pandas as pd
2. Create a Series (note the capitalization)
s1=pd.Series([1,2,3,4],index=["one","two","three","four"],name="s1")
3. Create a DataFrame
- From a dictionary
Method 1: set index directly when constructing the DataFrame
d = {'a': 1, 'b': 2, 'c': 3}
data = pd.DataFrame(d, index=[0])
print(data)
Method 2: use from_dict to convert a dictionary whose values are scalars into a DataFrame
d = {'a': 1, 'b': 2, 'c': 3}
data = pd.DataFrame.from_dict(d, orient='index').T
print(data)
Method 3: avoid scalar values altogether by wrapping each value in a list before passing the dictionary in
d = {'a': [1], 'b': [2], 'c': [3]}
data = pd.DataFrame(d)
print(data)
Method 4: pull the keys and values out as a list of (key, value) pairs
d = {'a': 1, 'b': 2, 'c': 3}
data = pd.DataFrame(list(d.items()))
print(data)
Incorrect usage: passing a dictionary of scalar values directly fails unless you also set index when constructing the DataFrame (pandas raises "ValueError: If using all scalar values, you must pass an index").
d = {'a': 1, 'b': 2, 'c': 3}
data = pd.DataFrame(d)  # raises the ValueError above
- Direct creation
Method 1: the index is generated automatically as 0, 1, ...
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})
Method 2: custom index
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
'Sue': ['Pretty good.', 'Bland.']},
index=['Product A', 'Product B'])
Bob Sue
Product A I liked it. Pretty good.
Product B It was awful. Bland.
As the examples above show, when a DataFrame is created from a dictionary the keys become the column names, while the row labels are either generated automatically or supplied via index; make sure you understand how index is used.
4. Reading and writing common file formats
- CSV (comma-separated values)
#read
wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
#write
wine_reviews.head().to_csv("wine_reviews.csv")
- Excel (single sheet)
#read
wic = pd.read_excel("../input/publicassistance/xls_files_all/WICAgencies2013ytd.xlsx",
sheet_name='Total Women')
#write
wic.to_excel('wic.xlsx', sheet_name='Total Women')
- Excel (multiple sheets)
#read
wic = pd.ExcelFile("../input/publicassistance/xls_files_all/WICAgencies2013ytd.xls")
df = pd.read_excel(wic, sheet_name='Total Women')  # any name from wic.sheet_names works here
- SQL (Structured Query Language; the example below uses an SQLite database, though pandas can also read from other databases via an SQLAlchemy connection)
#read
import sqlite3
conn = sqlite3.connect("../input/188-million-us-wildfires/FPA_FOD_20170508.sqlite")
fires = pd.read_sql_query("SELECT * FROM fires", conn)
#write
conn = sqlite3.connect("fires.sqlite")
fires.head(10).to_sql("fires", conn)
Note: when creating a DataFrame you add the index with index=[0], whereas when reading a file you use index_col=0. For the issue of to_csv/read_csv writing and reading the index as an extra column, see DemonHunter211.
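A quick sketch of that index round trip (demo.csv is an illustrative file name):
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.to_csv("demo.csv")                  # the row index is written as an extra, unnamed first column
pd.read_csv("demo.csv")                # ...which comes back as a column named "Unnamed: 0"
pd.read_csv("demo.csv", index_col=0)   # fix A: treat that first column as the index when reading
df.to_csv("demo.csv", index=False)     # fix B: drop the index when writing instead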
For more detail, see pandas-docs.
Lesson 2: Indexing, Selecting and Assigning
1. Indexing: Series and DataFrame can be indexed in many ways; DataFrame is used as the example here, and Series works similarly.
- Directly via position or label, similar to list indexing
- Via the built-in accessors: .iloc, .loc, .ix, .iat, .index, .columns, etc. (note that .ix is deprecated and has been removed in recent pandas versions). For the differences between .iloc, .loc, .ix and .iat, see roamer314; a minimal sketch of the loc/iloc distinction follows.
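A minimal sketch of the loc/iloc distinction (toy DataFrame, values are illustrative):
df = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])
df.iloc[0]        # positional: the first row
df.loc['a']       # label-based: the row labelled 'a'
df.iloc[0:2]      # positions 0 and 1 (end exclusive)
df.loc['a':'b']   # labels 'a' through 'b' (end inclusive)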
For example, given a DataFrame named reviews (the wine reviews dataset loaded above), typical operations look like this:
Select the description column from reviews and assign the result to the variable desc.
desc = reviews["description"]
desc = reviews.description
Select the first value from the description column of reviews, assigning it to variable first_description.
first_description = reviews["description"][0]
first_description = reviews.description.iloc[0]
Select the first 10 values from the description column in reviews, assigning the result to variable first_descriptions.
first_descriptions = reviews["description"][:10]
reviews.loc[:9, "description"]
reviews.description.head(10)
Select the records with index labels 1, 2, 3, 5, and 8, assigning the result to the variable sample_reviews.
sample_reviews = reviews.iloc[[1,2,3,5,8]]
indices = [1, 2, 3, 5, 8]
sample_reviews = reviews.loc[indices]
Create a variable df containing the country, province, region_1, and region_2 columns of the records with the index labels 0, 1, 10, and 100.
cols = ['country', 'province', 'region_1', 'region_2']
indices = [0, 1, 10, 100]
df = reviews.loc[indices, cols]
Create a variable df containing the country and variety columns of the first 100 records.
df = reviews.loc[:99][["country","variety"]]
df=reviews.iloc[:100][["country","variety"]]
df=reviews[["country","variety"]][:100]
df = reviews.loc[:99, ["country","variety"]]
df = reviews.iloc[:100, [0,11]]
2. Selection: conditional selection usually relies on boolean expressions, for example:
Create a DataFrame italian_wines containing reviews of wines made in Italy.
italian_wines = reviews[reviews.country=="Italy"]
Create a DataFrame top_oceania_wines containing all reviews with at least 95 points (out of 100) for wines from Australia or New Zealand.
top_oceania_wines = reviews.loc[reviews.country.isin(["Australia", "New Zealand"])
                                & (reviews.points >= 95)]
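Two related selectors worth knowing here are isin and isnull/notnull; a brief sketch on the same reviews DataFrame:
oceania = reviews.loc[reviews.country.isin(["Australia", "New Zealand"])]  # membership test
priced = reviews.loc[reviews.price.notnull()]                              # rows with a non-missing price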
Lesson 3: Summary Functions and Maps
1. Summary functions
pandas has many built-in summary methods: .sum(), .cumsum() (running totals over the rows), .min(), .max(), .describe(), .mean(), .median(), .idxmin() (the index label of the minimum value), .idxmax(), and so on.
Other useful methods include .unique(), .count(), .isna(), .isnull(), .value_counts(), and so on; a few of them are sketched below.
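A quick sketch of several of these methods on a toy Series (values are illustrative):
s = pd.Series([3, 1, 4, 1, 5], index=['a', 'b', 'c', 'd', 'e'])
s.cumsum()        # running total: 3, 4, 8, 9, 14
s.idxmin()        # 'b' -- the index label of the smallest value, not the smallest label
s.value_counts()  # how many times each distinct value occurs
s.describe()      # count / mean / std / quartiles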
- Example:
Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.
bargain_wine = reviews.loc[((reviews.points)/(reviews.price)).idxmax(),"title"]
# print(type(reviews.points / reviews.price))  # the points-to-price ratio is a Series
2. map (mapping)
- Example 1:
There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series descriptor_counts counting how many times each of these two words appears in the description column in the dataset.
One implementation (an explicit loop):
import time
time_start = time.time()
ind1 = []
for i in range(len(reviews["description"])):
    if "tropical" in reviews["description"][i]:
        ind1.append(i)
ind2 = []
for i in range(len(reviews["description"])):
    if "fruity" in reviews["description"][i]:
        ind2.append(i)
descriptor_counts = pd.Series([len(ind1), len(ind2)], index=['tropical', 'fruity'])
time_end = time.time()
print('totally cost', time_end - time_start)
Elapsed time: 3.5854461193084717 s
Another implementation (using map):
import time
time_start=time.time()
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
time_end=time.time()
print('totally cost',time_end-time_start)
Elapsed time: 0.15914034843444824 s
Clearly the built-in map gives a large speed-up over an explicit loop: prefer map to hand-written loops.
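pandas' vectorized string methods are another fast option for this kind of substring test, e.g.:
n_trop = reviews.description.str.contains("tropical").sum()
n_fruity = reviews.description.str.contains("fruity").sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])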
- Example 2:
Create a Series star_ratings with the number of stars corresponding to each review in the dataset.
One implementation (apply with a row-wise function):
def stars(row):
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1
star_ratings = reviews.apply(stars, axis='columns')
Elapsed time: 5.851803779602051 s
Another implementation (vectorized with map):
s1 = reviews.country.map(lambda c: c == "Canada") | reviews.points.map(lambda p: p >= 95)
s2 = reviews.points.map(lambda p: p >= 85) & ~s1
s3 = ~s1 & ~s2
star_ratings = (s1 * 3 + s2 * 2 + s3 * 1).astype(int)
Elapsed time: 0.20263314247131348 s
Remember: use map on a Series and apply on a DataFrame, and apply needs an explicit axis.
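A minimal sketch of the map/apply distinction on a toy DataFrame (values are illustrative):
df = pd.DataFrame({'points': [80, 90], 'price': [10, 20]})
df.points.map(lambda p: p + 1)                                 # Series: element-wise map
df.apply(lambda row: row.points / row.price, axis='columns')   # DataFrame: one row at a time
df.apply(lambda col: col.max(), axis='index')                  # DataFrame: one column at a time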
Lesson 4: Grouping and Sorting
1. Grouping
The main grouping tool in pandas is groupby. It is an extremely useful function and is usually combined with other methods; here are a few examples:
- Number of reviews written per taster_twitter_handle
reviews_written = reviews.groupby('taster_twitter_handle').size()
- Count of reviews for each points value
reviews.groupby('points').points.count()
- Group by points and get the minimum price for each points value
reviews.groupby('points').price.min()
- Group by price and get the maximum points for each price, sorted by price in ascending order
best_rating_per_price = reviews.groupby('price')['points'].max().sort_index()
- Group by variety and output the minimum and maximum price for each variety as a DataFrame
price_extremes = reviews.groupby('variety').price.agg([min, max])
- Group by country and variety and count the wines in each combination; there are at least three ways to do it, the first producing a DataFrame and the second and third a Series
country_variety_counts = reviews.groupby(['country', 'variety']).description.agg([len]).sort_values(by='len', ascending=False)
country_variety_counts = reviews.groupby(['country', 'variety']).size().sort_values(ascending=False)
country_variety_counts = reviews.groupby(["country","variety"]).variety.count().sort_values(ascending=False)
- Combined with apply: group by winery and get the title of the first wine of each winery
reviews.groupby('winery').apply(lambda df: df.title.iloc[0])
- Group by several features (country and province), then pick the best-scoring wine within each group.
reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])
- Combined with agg: group by country and output, for price, the number of entries plus the minimum and maximum price
reviews.groupby(['country']).price.agg([len, min, max])
- Grouping by several keys produces a multi-level index, which can be restored to ordinary columns with reset_index()
countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed
Grouping returns data in index order, not in value order. That is to say, when outputting the result of a groupby, the order of the rows is dependent on the values in the index, not the data.
2. Sorting
The commonly used sorting methods in pandas are sort_values and sort_index; sort_values can sort by several columns at once.
sorted_varieties = price_extremes.sort_values(["min","max"],ascending=False)
countries_reviewed.sort_index()
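sort_values also combines naturally with reset_index; a sketch using the countries_reviewed table built above:
countries_reviewed.reset_index().sort_values(by=['len', 'country'], ascending=[False, True])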
Lesson 5: Data Types and Missing Data
1. Data types
- dtypes include int64, float64 and object (there is no str dtype; strings are stored as object)
Note the difference between dtype (a single column) and dtypes (all columns):
reviews.price.dtype
reviews.price.dtypes
reviews.dtypes
- Converting data types: astype
reviews.points.astype('float64')
In IPython/Jupyter, the TAB key combined with wildcards can be used to look up more built-in methods:
reviews.*type*?
reviews.astype
reviews.dtypes
reviews.ftypes
reviews.get_dtype_counts
reviews.get_ftype_counts
reviews.select_dtypes
(Note that ftypes, get_dtype_counts and get_ftype_counts have been removed in recent pandas versions.)
2. Missing data
- Detecting and counting missing data
reviews.isna()
reviews.notna()
reviews.price.isnull()
reviews[reviews.country.isnull()]
reviews.price.isnull().sum()
- Filling and dropping missing data
reviews["region_1"].fillna("Unknown")
reviews.dropna(subset=["region_1"])  # drop rows where region_1 is missing
Note: use fillna for NaN values; replace is better suited to substituting non-missing sentinel values, as sketched below.
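A brief sketch of the fillna/replace division of labour (the Twitter handle value is illustrative):
reviews.region_1.fillna("Unknown")                                # fill missing values
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")  # substitute a non-missing value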
Lesson 6: Renaming and Combining
1. Renaming
- Renaming columns
renamed = reviews.rename(columns={"region_1":"region","region_2":"locale"})
renamed = reviews.rename(columns=dict(region_1='region', region_2='locale'))
- Renaming the index axis name: use rename_axis (rename only relabels rows or columns, so rename(columns=dict(index="wines")) would merely rename a column literally called "index")
reindexed = reviews.rename_axis('wines', axis='rows')
- Renaming index values
reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})
2. Combining (joining)
pandas provides three main ways of combining data, in order of increasing complexity: concat, join and merge; there is also the combine_first instance method.
- concat
When the DataFrames share the same columns, stacking them with concat is recommended.
canadian_youtube = pd.read_csv("../input/youtube-new/CAvideos.csv")
british_youtube = pd.read_csv("../input/youtube-new/GBvideos.csv")
pd.concat([canadian_youtube, british_youtube])
- join
Convenient for joining on the index; it can also combine several DataFrames with the same or similar index, whether or not their columns overlap.
left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])
left.join(right, lsuffix='_CAN', rsuffix='_UK')
Here lsuffix and rsuffix are used to distinguish the columns coming from the two datasets.
- merge
Joins rows from different DataFrames on one or more keys; it is quite powerful. For details see pandas.DataFrame.merge.
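A minimal merge sketch on two toy DataFrames (names and values are illustrative):
left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['a', 'c'], 'y': [3, 4]})
left.merge(right, on='key', how='inner')   # keeps key 'a' only
left.merge(right, on='key', how='outer')   # keeps 'a', 'b' and 'c', with NaN where data is missing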
- combine_first
Splices overlapping data together, filling missing values in one object with values from the other.
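A minimal combine_first sketch (toy data):
import numpy as np
a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([10.0, 20.0, 30.0])
a.combine_first(b)   # gives 1.0, 20.0, 3.0 -- the NaN in a is filled from b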
Reference 1: 小白白白又白cdllp
Reference 2: DemonHunter211
Reference 3: roamer314
Reference 4: pandas-docs
Reference 5: pandas.DataFrame.merge
Reference 6: Python for Data Analysis, by Wes McKinney (Chinese translation by Tang Xuetao et al.)