Kaggle

Learn专题五——Pandas

2019-06-03  本文已影响0人  Python解决方案

第一课 创建和读写工作薄


1.导入Pandas包

import pandas as pd

2.创建Series结构的数据(注意大小写)

s1=pd.Series([1,2,3,4],index=["one","two","three","four"],name="s1")

3.创建DataFrame结构数据

方法一:直接在创建DataFrame时设置index即可

dict = {'a':1,'b':2,'c':3}
data = pd.DataFrame(dict,index=[0])
print(data)

方法二:通过from_dict函数将value为标称变量的字典转换为DataFrame对象

dict = {'a':1,'b':2,'c':3}
pd.DataFrame.from_dict(dict,orient='index').T
print(data)

方法三:输入字典时不要让Value为标称属性,把Value转换为list对象再传入即可

dict = {'a':[1],'b':[2],'c':[3]}
data = pd.DataFrame(dict)
print(data)

方法四:直接将key和value取出来,都转换成list对象

dict = {'a':1,'b':2,'c':3}
pd.DataFrame(list(dict.items()))
print(data)

错误方法:直接传入标称属性为value的字典需要写入index,也就是说,需要在创建DataFrame对象时设定index。

dict = {'a':1,'b':2,'c':3}
data = pd.DataFrame(dict)

方法一:自动产生序号为1,2……的index

pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})

方法二:自定义index

pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])
              Bob             Sue
Product A   I liked it. Pretty good.
Product B   It was awful.   Bland.

由上面的例子可以看出创建DataFrame时,字典的键对应DataFrame的列名,行名要么是自动生成,要么自定义,同时掌握index的用法。

4.读写常见文件

#read
wine=pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
#write
wine_reviews.head().to_csv("wine_reviews.csv")
#read
wic = pd.read_excel("../input/publicassistance/xls_files_all/WICAgencies2013ytd.xlsx", 
                    sheet_name='Total Women')
#write
wic.to_excel('wic.xlsx', sheet_name='Total Women')
#read
wic = pd.ExcelFile("../input/publicassistance/xls_files_all/WICAgencies2013ytd.xls", 
                    sheet_name='Total Women')
df=pd.read_excel(wic,"sheet1")
#read
import sqlite3
conn = sqlite3.connect("../input/188-million-us-wildfires/FPA_FOD_20170508.sqlite")

fires = pd.read_sql_query("SELECT * FROM fires", conn)

#write
conn = sqlite3.connect("fires.sqlite")
fires.head(10).to_sql("fires", conn)

注意:创建DataFrame时加索引用index=[0],而读取文件的时候则是index_col=0。此外,python to_csv和read_csv索引加入内容的问题可以参考:DemonHunter211

更详细的内容参考:pandas-docs


第二课 索引、选择与分配


1.索引:Series和DataFrame的索引方式有很多种;这里以DataFrame为例作简略介绍,Series与之相似。

例如由一个名为reviews的数据表如下:

reviews
操作如下:
Select the description column from reviews and assign the result to the variable desc.
desc = reviews["description"]

desc = reviews.description

Select the first value from the description column of reviews, assigning it to variable first_description.

first_description = reviews["description"][0]

first_row=reviews.index[0]

Select the first 10 values from the description column in reviews, assigning the result to variable first_descriptions.

first_descriptions = reviews["description"][:10]

reviews.loc[:9, "description"]

review.head(10)

Select the records with index labels 1, 2, 3, 5, and 8, assigning the result to the variable sample_reviews.

sample_reviews = reviews.iloc[[1,2,3,5,8]]

indices = [1, 2, 3, 5, 8]
sample_reviews = reviews.loc[indices]

Create a variable df containing the country, province, region_1, and region_2 columns of the records with the index labels 0, 1, 10, and 100.

dict=["country","province","region1","region2"]
df = pd.DataFrame(dict,index=[0,1,10,100],columns=[0])

cols = ['country', 'province', 'region_1', 'region_2']
indices = [0, 1, 10, 100]
df = reviews.loc[indices, cols]

Create a variable df containing the country and variety columns of the first 100 records.

df = reviews.loc[:99][["country","variety"]]
df=reviews.iloc[:100][["country","variety"]]
df=reviews[["country","variety"]][:100]
df = reviews.loc[:99, ["country","variety"]]
df = reviews.iloc[:100, [0,11]]

2.选择:通常利用关系表达式来进行判断选择,例如:
Create a DataFrame italian_wines containing reviews of wines made in Italy.

italian_wines = reviews[reviews.country=="Italy"]

Create a DataFrame top_oceania_wines containing all reviews with at least 95 points (out of 100) for wines from Australia or New Zealand.

top_oceania_wines = reviews.loc[(reviews.country in (["Australia","New Zealand"])) 
&  (reviews.points>=95)]

第三课 总结的方法与映射


1.总结
pandas内置了很多总结的方法:.sum()、.cumsum()(用于计算各行累加值)、.min()、.max()、.describe()、.mean()、.median()、.idxmin()(用于计算最小的索引值)、.idxmax()等。
其他的方法包括:.unique()、.count()、.isna()、isnull()、.value_counts()等等。

bargain_wine = reviews.loc[((reviews.points)/(reviews.price)).idxmax(),"title"]
# print(type(((reviews.points)/(reviews.price))))#Series类型

2.map(映射)

import time
time_start=time.time()

ind1=[]
for i in range(len(reviews["description"])):
    if "tropical" in reviews["description"][i]:
        ind1.append(i)

ind2=[]
for i in range(len(reviews["description"])):
    if "fruity" in reviews["description"][i]:
        ind2.append(i)

descriptor_counts = pd.Series([len(ind1), len(ind2)], index=['tropical', 'fruity'])
time_end=time.time()
print('totally cost',time_end-time_start)

耗时:3.5854461193084717
另一种实现方法:

import time
time_start=time.time()

n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])

time_end=time.time()
print('totally cost',time_end-time_start)

耗时:0.15914034843444824
可见内置的map函数对于提高程序执行速度有很大提升作用,多用map少用循环。

def stars(row):
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

star_ratings = reviews.apply(stars, axis='columns')

耗时:5.851803779602051
另一种实现方法:

s1=reviews.country.map(lambda y: y=="Canada")|reviews.points.map(lambda x: x>=95)
s2=reviews.points.map(lambda z: (z>=85)) & reviews.country.map(lambda l: l!="Canada")
s3=reviews.points.map(lambda z: z<85) & reviews.country.map(lambda l: l!="Canada")
star_ratings=(s1.replace(True,3)+s2.replace(True,2)+s3.replace(True,2)).astype(int)

耗时:0.20263314247131348
记住:Series用map,DataFrame用apply,apply要指定axis.


第四课 分组和排序


1.分组
pandas中进行分组的主要方法是groupby,它会是一个非常有用的函数,常常和其他方法组合在一起使用,来看几个例子:

reviews_written = reviews.groupby('taster_twitter_handle').size()
reviews.groupby('points').points.count()
reviews.groupby('points').price.min()
best_rating_per_price = reviews.groupby('price')['points'].max().sort_index()
price_extremes = reviews.groupby('variety').price.agg([min, max])
country_variety_counts = reviews.groupby(['country', 'variety']).description.agg([len]).sort_values(by='len', ascending=False)
scountry_variety_counts = reviews.groupby(['country', 'variety']).size().sort_values(ascending=False)
country_variety_counts = reviews.groupby(["country","variety"]).variety.count().sort_values(ascending=False)
reviews.groupby('winery').apply(lambda df: df.title.iloc[0])
reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.argmax()])
reviews.groupby(['country']).price.agg([len, min, max])
countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed

Grouping returns data in index order, not in value order. That is to say, when outputting the result of a groupby, the order of the rows is dependent on the values in the index, not the data.

2.排序
pandas的中常用的排序是sort_values以及sort_index,可以一次对多个特征进行排序。

sorted_varieties = price_extremes.sort_values(["min","max"],ascending=False)

countries_reviewed.sort_index()

第五课 数据类型和缺失数据


1.数据类型

注意用dtype和dtypes的区别

reviews.price.dtype
reviews.price.dtypes

reviews.dtypes
reviews.points.astype('float64')

使用TAB键结合通配符可以查询个多的内置方法。

reviews.*type*?
reviews.astype
reviews.dtypes
reviews.ftypes
reviews.get_dtype_counts
reviews.get_ftype_counts
reviews.select_dtypes

2.缺失数据

reviews.isna
reviews.notna

reviews.price.isnull()
reviews.price.isnull()

reviews[reviews.country.isnull()]

reviews.price.isnull().sum()
reviews["region_1"].fillna("Unknown")

reviews.dropna("region_1")

NaN值无法用replace方法进行替换.


第六课 重命名和合并


1.重命名

renamed = reviews.rename(columns={"region_1":"region","region_2":"locale"})
renamed = reviews.rename(columns=dict(region_1='region', region_2='locale'))
reindexed = reviews.rename(columns=dict(index="wines"))

reindexed = reviews.rename_axis('wines', axis='rows')
reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})

2.合并(连接)

pandas提供了三种主要的数据结合的方式,依据复杂由低到高的顺序依次是:concat、join、merge,另外还有一个combine_first实例方法。

当数据具有相同的列属性时,进行数据堆叠,建议使用concat。

canadian_youtube = pd.read_csv("../input/youtube-new/CAvideos.csv")
british_youtube = pd.read_csv("../input/youtube-new/GBvideos.csv")

pd.concat([canadian_youtube, british_youtube])

方便实现按索引合并,还可以用于合并多个带有相同或者相似索引的DataFrame对象,而不管它们之间有没有重叠的列。

left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])

left.join(right, lsuffix='_CAN', rsuffix='_UK')

这里lsuffix、rsuffix用来区别两个数据集。

最后贴上一张图: Pandas基础使用方法表

参考1:小白白白又白cdllp
参考2:DemonHunter211
参考3:roamer314
参考4:pandas-docs
参考5:pandas.DataFrame.merge
参考6:《利用Python进行数据分析》Wes McKinney 著 唐学韬 等译

上一篇下一篇

猜你喜欢

热点阅读