Learn Topic 5: Pandas
Lesson 1: Creating Data and Reading/Writing Files
1. Import the pandas package
import pandas as pd
2. Create a Series (note the capitalization)
s1=pd.Series([1,2,3,4],index=["one","two","three","four"],name="s1")
3. Create a DataFrame
- From a dictionary
Method 1: set index directly when constructing the DataFrame
d = {'a': 1, 'b': 2, 'c': 3}
data = pd.DataFrame(d, index=[0])
print(data)
Method 2: use from_dict to convert a dictionary whose values are scalars into a DataFrame
d = {'a': 1, 'b': 2, 'c': 3}
data = pd.DataFrame.from_dict(d, orient='index').T
print(data)
Method 3: avoid scalar values altogether by wrapping each value in a list before passing the dictionary in
d = {'a': [1], 'b': [2], 'c': [3]}
data = pd.DataFrame(d)
print(data)
Method 4: pull the keys and values out as a list of (key, value) pairs
d = {'a': 1, 'b': 2, 'c': 3}
data = pd.DataFrame(list(d.items()))
print(data)
Incorrect usage: passing a dictionary of scalar values directly fails unless you also set index when constructing the DataFrame (pandas raises "ValueError: If using all scalar values, you must pass an index").
d = {'a': 1, 'b': 2, 'c': 3}
data = pd.DataFrame(d)  # raises the ValueError above
- Direct creation
Method 1: the index is generated automatically as 0, 1, ...
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})
Method 2: custom index
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
'Sue': ['Pretty good.', 'Bland.']},
index=['Product A', 'Product B'])
Bob Sue
Product A I liked it. Pretty good.
Product B It was awful. Bland.
As the examples above show, when a DataFrame is created from a dictionary the keys become the column names, while the row labels are either generated automatically or supplied via index; make sure you understand how index is used.
4. Reading and writing common file formats
- CSV (comma-separated values)
#read
wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
#write
wine_reviews.head().to_csv("wine_reviews.csv")
- Excel (single sheet)
#read
wic = pd.read_excel("../input/publicassistance/xls_files_all/WICAgencies2013ytd.xlsx",
sheet_name='Total Women')
#write
wic.to_excel('wic.xlsx', sheet_name='Total Women')
- Excel (multiple sheets)
#read
wic = pd.ExcelFile("../input/publicassistance/xls_files_all/WICAgencies2013ytd.xls")
df = pd.read_excel(wic, sheet_name='Total Women')  # any name from wic.sheet_names works here
- SQL (Structured Query Language; the example below uses an SQLite database, though pandas can also read from other databases via an SQLAlchemy connection)
#read
import sqlite3
conn = sqlite3.connect("../input/188-million-us-wildfires/FPA_FOD_20170508.sqlite")
fires = pd.read_sql_query("SELECT * FROM fires", conn)
#write
conn = sqlite3.connect("fires.sqlite")
fires.head(10).to_sql("fires", conn)
Note: when creating a DataFrame you add the index with index=[0], whereas when reading a file you use index_col=0. For the issue of to_csv/read_csv writing and reading the index as an extra column, see DemonHunter211.
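A quick sketch of that index round trip (demo.csv is an illustrative file name):
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.to_csv("demo.csv")                  # the row index is written as an extra, unnamed first column
pd.read_csv("demo.csv")                # ...which comes back as a column named "Unnamed: 0"
pd.read_csv("demo.csv", index_col=0)   # fix A: treat that first column as the index when reading
df.to_csv("demo.csv", index=False)     # fix B: drop the index when writing instead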
For more detail, see pandas-docs.
Lesson 2: Indexing, Selecting and Assigning
1. Indexing: Series and DataFrame can be indexed in many ways; DataFrame is used as the example here, and Series works similarly.
- Directly via position or label, similar to list indexing
- Via the built-in accessors: .iloc, .loc, .ix, .iat, .index, .columns, etc. (note that .ix is deprecated and has been removed in recent pandas versions). For the differences between .iloc, .loc, .ix and .iat, see roamer314; a minimal sketch of the loc/iloc distinction follows.
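A minimal sketch of the loc/iloc distinction (toy DataFrame, values are illustrative):
df = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])
df.iloc[0]        # positional: the first row
df.loc['a']       # label-based: the row labelled 'a'
df.iloc[0:2]      # positions 0 and 1 (end exclusive)
df.loc['a':'b']   # labels 'a' through 'b' (end inclusive)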
For example, given a DataFrame named reviews (the wine reviews dataset loaded above), typical operations look like this:
Select the description column from reviews and assign the result to the variable desc.
desc = reviews["description"]
desc = reviews.description
Select the first value from the description column of reviews, assigning it to variable first_description.
first_description = reviews["description"][0]
first_description = reviews.description.iloc[0]
Select the first 10 values from the description column in reviews, assigning the result to variable first_descriptions.
first_descriptions = reviews["description"][:10]
reviews.loc[:9, "description"]
reviews.description.head(10)
Select the records with index labels 1, 2, 3, 5, and 8, assigning the result to the variable sample_reviews.
sample_reviews = reviews.iloc[[1,2,3,5,8]]
indices = [1, 2, 3, 5, 8]
sample_reviews = reviews.loc[indices]
Create a variable df containing the country, province, region_1, and region_2 columns of the records with the index labels 0, 1, 10, and 100.
cols = ['country', 'province', 'region_1', 'region_2']
indices = [0, 1, 10, 100]
df = reviews.loc[indices, cols]
Create a variable df containing the country and variety columns of the first 100 records.
df = reviews.loc[:99][["country","variety"]]
df=reviews.iloc[:100][["country","variety"]]
df=reviews[["country","variety"]][:100]
df = reviews.loc[:99, ["country","variety"]]
df = reviews.iloc[:100, [0,11]]
2. Selection: conditional selection usually relies on boolean expressions, for example:
Create a DataFrame italian_wines containing reviews of wines made in Italy.
italian_wines = reviews[reviews.country=="Italy"]
Create a DataFrame top_oceania_wines containing all reviews with at least 95 points (out of 100) for wines from Australia or New Zealand.
top_oceania_wines = reviews.loc[reviews.country.isin(["Australia", "New Zealand"])
                                & (reviews.points >= 95)]
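Two related selectors worth knowing here are isin and isnull/notnull; a brief sketch on the same reviews DataFrame:
oceania = reviews.loc[reviews.country.isin(["Australia", "New Zealand"])]  # membership test
priced = reviews.loc[reviews.price.notnull()]                              # rows with a non-missing price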
Lesson 3: Summary Functions and Maps
1. Summary functions
pandas has many built-in summary methods: .sum(), .cumsum() (running totals over the rows), .min(), .max(), .describe(), .mean(), .median(), .idxmin() (the index label of the minimum value), .idxmax(), and so on.
Other useful methods include .unique(), .count(), .isna(), .isnull(), .value_counts(), and so on; a few of them are sketched below.
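A quick sketch of several of these methods on a toy Series (values are illustrative):
s = pd.Series([3, 1, 4, 1, 5], index=['a', 'b', 'c', 'd', 'e'])
s.cumsum()        # running total: 3, 4, 8, 9, 14
s.idxmin()        # 'b' -- the index label of the smallest value, not the smallest label
s.value_counts()  # how many times each distinct value occurs
s.describe()      # count / mean / std / quartiles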
- Example:
Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.
bargain_wine = reviews.loc[((reviews.points)/(reviews.price)).idxmax(),"title"]
# print(type(reviews.points / reviews.price))  # the points-to-price ratio is a Series
2. map (mapping)
- Example 1:
There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series descriptor_counts counting how many times each of these two words appears in the description column in the dataset.
One implementation (an explicit loop):
import time
time_start = time.time()
ind1 = []
for i in range(len(reviews["description"])):
    if "tropical" in reviews["description"][i]:
        ind1.append(i)
ind2 = []
for i in range(len(reviews["description"])):
    if "fruity" in reviews["description"][i]:
        ind2.append(i)
descriptor_counts = pd.Series([len(ind1), len(ind2)], index=['tropical', 'fruity'])
time_end = time.time()
print('totally cost', time_end - time_start)
Elapsed time: 3.5854461193084717 s
Another implementation (using map):
import time
time_start=time.time()
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
time_end=time.time()
print('totally cost',time_end-time_start)
Elapsed time: 0.15914034843444824 s
Clearly the built-in map gives a large speed-up over an explicit loop: prefer map to hand-written loops.
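pandas' vectorized string methods are another fast option for this kind of substring test, e.g.:
n_trop = reviews.description.str.contains("tropical").sum()
n_fruity = reviews.description.str.contains("fruity").sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])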
- Example 2:
Create a Series star_ratings with the number of stars corresponding to each review in the dataset.
One implementation (apply with a row-wise function):
def stars(row):
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1
star_ratings = reviews.apply(stars, axis='columns')
Elapsed time: 5.851803779602051 s
Another implementation (vectorized with map):
s1 = reviews.country.map(lambda c: c == "Canada") | reviews.points.map(lambda p: p >= 95)
s2 = reviews.points.map(lambda p: p >= 85) & ~s1
s3 = ~s1 & ~s2
star_ratings = (s1 * 3 + s2 * 2 + s3 * 1).astype(int)
Elapsed time: 0.20263314247131348 s
Remember: use map on a Series and apply on a DataFrame, and apply needs an explicit axis.
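A minimal sketch of the map/apply distinction on a toy DataFrame (values are illustrative):
df = pd.DataFrame({'points': [80, 90], 'price': [10, 20]})
df.points.map(lambda p: p + 1)                                 # Series: element-wise map
df.apply(lambda row: row.points / row.price, axis='columns')   # DataFrame: one row at a time
df.apply(lambda col: col.max(), axis='index')                  # DataFrame: one column at a time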
Lesson 4: Grouping and Sorting
1. Grouping
The main grouping tool in pandas is groupby. It is an extremely useful function and is usually combined with other methods; here are a few examples:
- Number of reviews written per taster_twitter_handle
reviews_written = reviews.groupby('taster_twitter_handle').size()
- Count of reviews for each points value
reviews.groupby('points').points.count()
- Group by points and get the minimum price for each points value
reviews.groupby('points').price.min()
- Group by price and get the maximum points for each price, sorted by price in ascending order
best_rating_per_price = reviews.groupby('price')['points'].max().sort_index()
- Group by variety and output the minimum and maximum price for each variety as a DataFrame
price_extremes = reviews.groupby('variety').price.agg([min, max])
- Group by country and variety and count the wines in each combination; there are at least three ways to do it, the first producing a DataFrame and the second and third a Series
country_variety_counts = reviews.groupby(['country', 'variety']).description.agg([len]).sort_values(by='len', ascending=False)
country_variety_counts = reviews.groupby(['country', 'variety']).size().sort_values(ascending=False)
country_variety_counts = reviews.groupby(["country","variety"]).variety.count().sort_values(ascending=False)
- Combined with apply: group by winery and get the title of the first wine of each winery
reviews.groupby('winery').apply(lambda df: df.title.iloc[0])
- Group by several features (country and province), then pick the best-scoring wine within each group.
reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])
- Combined with agg: group by country and output, for price, the number of entries plus the minimum and maximum price
reviews.groupby(['country']).price.agg([len, min, max])
- Grouping by several keys produces a multi-level index, which can be restored to ordinary columns with reset_index()
countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed
Grouping returns data in index order, not in value order. That is to say, when outputting the result of a groupby, the order of the rows is dependent on the values in the index, not the data.
2. Sorting
The commonly used sorting methods in pandas are sort_values and sort_index; sort_values can sort by several columns at once.
sorted_varieties = price_extremes.sort_values(["min","max"],ascending=False)
countries_reviewed.sort_index()
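sort_values also combines naturally with reset_index; a sketch using the countries_reviewed table built above:
countries_reviewed.reset_index().sort_values(by=['len', 'country'], ascending=[False, True])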
Lesson 5: Data Types and Missing Data
1. Data types
- dtypes include int64, float64 and object (there is no str dtype; strings are stored as object)
Note the difference between dtype (a single column) and dtypes (all columns):
reviews.price.dtype
reviews.price.dtypes
reviews.dtypes
- Converting data types: astype
reviews.points.astype('float64')
In IPython/Jupyter, the TAB key combined with wildcards can be used to look up more built-in methods:
reviews.*type*?
reviews.astype
reviews.dtypes
reviews.ftypes
reviews.get_dtype_counts
reviews.get_ftype_counts
reviews.select_dtypes
(Note that ftypes, get_dtype_counts and get_ftype_counts have been removed in recent pandas versions.)
2. Missing data
- Detecting and counting missing data
reviews.isna()
reviews.notna()
reviews.price.isnull()
reviews[reviews.country.isnull()]
reviews.price.isnull().sum()
- Filling and dropping missing data
reviews["region_1"].fillna("Unknown")
reviews.dropna(subset=["region_1"])  # drop rows where region_1 is missing
Note: use fillna for NaN values; replace is better suited to substituting non-missing sentinel values, as sketched below.
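A brief sketch of the fillna/replace division of labour (the Twitter handle value is illustrative):
reviews.region_1.fillna("Unknown")                                # fill missing values
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")  # substitute a non-missing value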
Lesson 6: Renaming and Combining
1. Renaming
- Renaming columns
renamed = reviews.rename(columns={"region_1":"region","region_2":"locale"})
renamed = reviews.rename(columns=dict(region_1='region', region_2='locale'))
- Renaming the index axis name: use rename_axis (rename only relabels rows or columns, so rename(columns=dict(index="wines")) would merely rename a column literally called "index")
reindexed = reviews.rename_axis('wines', axis='rows')
- Renaming index values
reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})
2. Combining (joining)
pandas provides three main ways of combining data, in order of increasing complexity: concat, join and merge; there is also the combine_first instance method.
- concat
When the DataFrames share the same columns, stacking them with concat is recommended.
canadian_youtube = pd.read_csv("../input/youtube-new/CAvideos.csv")
british_youtube = pd.read_csv("../input/youtube-new/GBvideos.csv")
pd.concat([canadian_youtube, british_youtube])
- join
Convenient for joining on the index; it can also combine several DataFrames with the same or similar index, whether or not their columns overlap.
left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])
left.join(right, lsuffix='_CAN', rsuffix='_UK')
Here lsuffix and rsuffix are used to distinguish the columns coming from the two datasets.
- merge
Joins rows from different DataFrames on one or more keys; it is quite powerful. For details see pandas.DataFrame.merge.
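A minimal merge sketch on two toy DataFrames (names and values are illustrative):
left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['a', 'c'], 'y': [3, 4]})
left.merge(right, on='key', how='inner')   # keeps key 'a' only
left.merge(right, on='key', how='outer')   # keeps 'a', 'b' and 'c', with NaN where data is missing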
- combine_first
Splices overlapping data together, filling missing values in one object with values from the other.
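A minimal combine_first sketch (toy data):
import numpy as np
a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([10.0, 20.0, 30.0])
a.combine_first(b)   # gives 1.0, 20.0, 3.0 -- the NaN in a is filled from b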
Reference 1: 小白白白又白cdllp
Reference 2: DemonHunter211
Reference 3: roamer314
Reference 4: pandas-docs
Reference 5: pandas.DataFrame.merge
Reference 6: Python for Data Analysis, by Wes McKinney (Chinese translation by Tang Xuetao et al.)