
(8) pandas Study Notes 3 - Python Data Analysis and Machine Learning in Practice

2018-05-03  by 努力奋斗的durian

Original article. Last updated: 2018-05-03

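Introduction: about Series

To make it easy to follow this Series example, the file fandango_score_comparison.csv is shared on Baidu Netdisk. Link: https://pan.baidu.com/s/1U6z7OvXK75L1AGm1vYlN4w Password: qe1a

Course source: python数据分析与机器学习实战 (Python Data Analysis and Machine Learning in Practice), 唐宇迪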

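A DataFrame is comparable to a matrix, and a Series is comparable to a single column (or row) of that matrix. A Series consists of a set of values together with the index labels associated with them.
Here is a small example: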

import pandas as pd
a=pd.Series([9,8,7,6])
a
Out[19]: 
0    9
1    8
2    7
3    6
dtype: int64

Below is a dataset of film scores and related data. Let's see whether anything stands out when we work with it as a Series.

import pandas as pd

fandango=pd.read_csv('fandango_score_comparison.csv')

series_film = fandango['FILM']

type(series_film)
Out[85]: pandas.core.series.Series

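As shown above, fandango is a DataFrame. Selecting one of its columns with fandango['FILM'] yields a Series.

How does selecting data from a Series differ from selecting data from a DataFrame?

The usage is the same: indexing and slicing.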

series_film = fandango['FILM']
series_film[0:5]
Out[84]: 
0    Avengers: Age of Ultron (2015)
1                 Cinderella (2015)
2                    Ant-Man (2015)
3            Do You Believe? (2015)
4     Hot Tub Time Machine 2 (2015)
Name: FILM, dtype: object

series_rt = fandango['RottenTomatoes']
series_rt[0:5]
Out[87]: 
0    74
1    85
2    80
3    18
4    14
Name: RottenTomatoes, dtype: int64

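How do you create a new Series?

First, look at what series.values returns: the result is an ndarray. In other words, the values held by a Series are stored in a NumPy ndarray. So a DataFrame is made up of Series, and a Series is built on an ndarray. pandas is a layer on top of NumPy.

Many pandas operations combine NumPy operations into more convenient forms, and the two libraries interoperate in many places.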

film_names=series_film.values

type(film_names)
Out[89]: numpy.ndarray
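
The layering described above can be checked on a small table. This is only a sketch: the two-row DataFrame and the film names "Film A" and "Film B" are made up for illustration and are not taken from the Fandango file.

```python
import pandas as pd

# Hypothetical two-row table standing in for the Fandango data.
df = pd.DataFrame({"FILM": ["Film A", "Film B"],
                   "RottenTomatoes": [74, 85]})

column = df["RottenTomatoes"]   # one column of a DataFrame is a Series
raw = column.values             # the values of a Series are a NumPy ndarray

# Names of the three types, from the outer layer to the inner one.
layers = (type(df).__name__, type(column).__name__, type(raw).__name__)
print(layers)

# The ndarray supports ordinary NumPy arithmetic.
doubled = raw * 2
print(doubled)
```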

The next step creates a Series. First, import the Series class from pandas.

from pandas import Series

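The string representation of a Series shows the index on the left and the values on the right. When no index is specified for the data, pandas automatically creates an integer index from 0 to N-1 (where N is the length of the data). The values and index attributes of a Series return its array representation and its index object.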
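
As a minimal sketch of the default index and of the values and index attributes (the four scores are arbitrary sample numbers, not a claim about the dataset):

```python
import pandas as pd

# No index is passed, so pandas builds the labels 0..N-1 automatically.
scores = pd.Series([74, 85, 80, 18])

values = scores.values              # array representation (NumPy ndarray)
index_labels = list(scores.index)   # labels of the default integer index

print(values)
print(scores.index)
```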

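Unlike a plain NumPy array, a Series lets you use index labels to select a single value or a group of values:

Example: create a Series in which each film name maps to the score given by one of the review sites.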

from pandas import Series
rt_scores = series_rt.values
series_custom = Series(rt_scores, index=film_names)

series_custom[['Minions (2015)', 'Leviathan (2014)']]
Out[96]: 
Minions (2015)      54
Leviathan (2014)    99
dtype: int64

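How do you sort a Series?

reindex is less about changing the labels of a pandas object than about changing their order. If a requested label does not exist, that row is filled with the missing value NaN. reindex does not modify the original object; to keep the result, assign it to a variable.

First, extract the film names, that is, convert the index to a list.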

original_index = series_custom.index.tolist()

original_index
Out[110]: 
['Avengers: Age of Ultron (2015)',
 'Cinderella (2015)',
 'Ant-Man (2015)',
 'Do You Believe? (2015)',
 'Hot Tub Time Machine 2 (2015)',
 ....
 'Mr. Holmes (2015)',
 "'71 (2015)",
 'Two Days, One Night (2014)',
 'Gett: The Trial of Viviane Amsalem (2015)',
 'Kumiko, The Treasure Hunter (2015)']

Sort the film names. The sorted result:

sorted_index = sorted(original_index)

sorted_index
Out[112]: 
["'71 (2015)",
 '5 Flights Up (2015)',
 'A Little Chaos (2015)',
 'A Most Violent Year (2014)',
 'About Elly (2015)',
....
 'What We Do in the Shadows (2015)',
 'When Marnie Was There (2015)',
 "While We're Young (2015)",
 'Wild Tales (2014)',
 'Woman in Gold (2015)']

Use reindex to reorder the index of series_custom according to the sorted film names:

sorted_by_index = series_custom.reindex(sorted_index)

sorted_by_index
Out[114]: 
'71 (2015)                                         97
5 Flights Up (2015)                                52
A Little Chaos (2015)                              40
A Most Violent Year (2014)                         90
About Elly (2015)                                  97
....
When Marnie Was There (2015)                       89
While We're Young (2015)                           83
Wild Tales (2014)                                  96
Woman in Gold (2015)                               52
Length: 146, dtype: int64
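
The reindex behaviour described earlier, including what happens to a label that does not exist, can be sketched on a tiny Series. The film names below are hypothetical placeholders.

```python
import pandas as pd

s = pd.Series([74, 85, 80], index=["Film C", "Film A", "Film B"])

# "Film D" is not in s, so its row is filled with NaN.
reordered = s.reindex(["Film A", "Film B", "Film C", "Film D"])
missing = reordered.isna()["Film D"]

# reindex returns a new object; the original order is untouched.
original_order = list(s.index)

print(reordered)
print(original_order)
```

Because NaN is a floating-point value, the reordered Series is converted from int64 to float64 in this case.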

How do you sort a Series by its index or by its values?

Use sort_index() to sort by index, producing sc2

sc2 = series_custom.sort_index()

sc2
Out[116]: 
'71 (2015)                                         97
5 Flights Up (2015)                                52
A Little Chaos (2015)                              40
A Most Violent Year (2014)                         90
About Elly (2015)                                  97
....
What We Do in the Shadows (2015)                   96
When Marnie Was There (2015)                       89
While We're Young (2015)                           83
Wild Tales (2014)                                  96
Woman in Gold (2015)                               52
Length: 146, dtype: int64

Use sort_values() to sort by value, producing sc3

sc3 = series_custom.sort_values()

sc3
Out[118]: 
Paul Blart: Mall Cop 2 (2015)                    5
Hitman: Agent 47 (2015)                          7
Hot Pursuit (2015)                               8
Fantastic Four (2015)                            9
Taken 3 (2015)                                   9
....
Song of the Sea (2014)                          99
Phoenix (2015)                                  99
Selma (2014)                                    99
Seymour: An Introduction (2015)                100
Gett: The Trial of Viviane Amsalem (2015)      100
Length: 146, dtype: int64
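
Both sorting methods sort in ascending order by default and return a new Series. A short sketch with made-up film names, which also shows descending order through the ascending parameter:

```python
import pandas as pd

s = pd.Series([74, 18, 85], index=["Film B", "Film C", "Film A"])

by_label = s.sort_index()                        # alphabetical by label
by_score_desc = s.sort_values(ascending=False)   # highest score first
top_film = by_score_desc.index[0]

print(by_label)
print(by_score_desc)
```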

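How do you add two Series?

Adding two Series produces a new Series. pandas matches the values by index label and adds them pairwise. A label that appears in only one of the two Series gets NaN in the result.

Use NumPy's add function to add series_custom to itself.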

series_custom
Out[123]: 
Avengers: Age of Ultron (2015)                     74
Cinderella (2015)                                  85
Ant-Man (2015)                                     80
Do You Believe? (2015)                             18
Hot Tub Time Machine 2 (2015)                      14
....
Mr. Holmes (2015)                                  87
'71 (2015)                                         97
Two Days, One Night (2014)                         97
Gett: The Trial of Viviane Amsalem (2015)         100
Kumiko, The Treasure Hunter (2015)                 87
Length: 146, dtype: int64

np.add(a, b) is equivalent to a + b. The result:

import numpy as np

np.add(series_custom, series_custom)  # equivalent to series_custom + series_custom
Out[124]: 
Avengers: Age of Ultron (2015)                    148
Cinderella (2015)                                 170
Ant-Man (2015)                                    160
Do You Believe? (2015)                             36
Hot Tub Time Machine 2 (2015)                      28
....
Mr. Holmes (2015)                                 174
'71 (2015)                                        194
Two Days, One Night (2014)                        194
Gett: The Trial of Viviane Amsalem (2015)         200
Kumiko, The Treasure Hunter (2015)                174
Length: 146, dtype: int64
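
Adding a Series to itself hides what happens when the labels differ. The sketch below uses two small hypothetical Series that share only one label, and also shows Series.add with fill_value as one way to treat a missing entry as 0.

```python
import pandas as pd

critics = pd.Series([74, 85], index=["Film A", "Film B"])
users = pd.Series([86, 80], index=["Film B", "Film C"])

# Only "Film B" exists on both sides; the other labels become NaN.
plain_sum = critics + users

# fill_value=0 substitutes 0 for the side where a label is missing.
filled_sum = critics.add(users, fill_value=0)

print(plain_sum)
print(filled_sum)
```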

Use np.sin() to take the sine of each value in the Series

np.sin(series_custom)
Out[126]: 
Avengers: Age of Ultron (2015)                   -0.985146
Cinderella (2015)                                -0.176076
Ant-Man (2015)                                   -0.993889
Do You Believe? (2015)                           -0.750987
Hot Tub Time Machine 2 (2015)                     0.990607
....
Mr. Holmes (2015)                                -0.821818
'71 (2015)                                        0.379608
Two Days, One Night (2014)                        0.379608
Gett: The Trial of Viviane Amsalem (2015)        -0.506366
Kumiko, The Treasure Hunter (2015)               -0.821818
Length: 146, dtype: float64

Find the maximum of series_custom with np.max()

np.max(series_custom)
Out[127]: 100
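
As the outputs above suggest, an element-wise NumPy function returns a Series that keeps the original index, while a reduction returns a single value. A small sketch with arbitrary numbers:

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 9, 16], index=["Film A", "Film B", "Film C"])

roots = np.sqrt(s)     # element-wise: still a Series with the same labels
highest = s.max()      # reduction: a single value
same_as_numpy = np.max(s) == highest

print(roots)
print(highest)
```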

Test which values in series_custom are greater than 50

series_custom > 50
Out[128]: 
Avengers: Age of Ultron (2015)                     True
Cinderella (2015)                                  True
Ant-Man (2015)                                     True
Do You Believe? (2015)                            False
Hot Tub Time Machine 2 (2015)                     False
....
Mr. Holmes (2015)                                  True
'71 (2015)                                         True
Two Days, One Night (2014)                         True
Gett: The Trial of Viviane Amsalem (2015)          True
Kumiko, The Treasure Hunter (2015)                 True
Length: 146, dtype: bool

Select the values in series_custom that are greater than 50

series_greater_than_50 = series_custom[series_custom > 50]

series_greater_than_50
Out[130]: 
Avengers: Age of Ultron (2015)                                             74
Cinderella (2015)                                                          85
Ant-Man (2015)                                                             80
The Water Diviner (2015)                                                   63
Top Five (2014)                                                            86
....
Mr. Holmes (2015)                                                          87
'71 (2015)                                                                 97
Two Days, One Night (2014)                                                 97
Gett: The Trial of Viviane Amsalem (2015)                                 100
Kumiko, The Treasure Hunter (2015)                                         87
Length: 94, dtype: int64

Select the values in series_custom that are greater than 50 and less than 75

criteria_one = series_custom > 50

criteria_two = series_custom < 75

both_criteria = series_custom[criteria_one & criteria_two]

both_criteria
Out[134]: 
Avengers: Age of Ultron (2015)                                            74
The Water Diviner (2015)                                                  63
Unbroken (2014)                                                           51
Southpaw (2015)                                                           59
Insidious: Chapter 3 (2015)                                               59
The Man From U.N.C.L.E. (2015)                                            68
....
Woman in Gold (2015)                                                      52
The Last Five Years (2015)                                                60
Jurassic World (2015)                                                     71
Minions (2015)                                                            54
Spare Parts (2015)                                                        52
dtype: int64
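
Boolean masks can also be combined inline. In that form each comparison needs its own parentheses, because & and | bind more tightly than > and <. The sketch below uses made-up scores and also shows | (or) and ~ (not).

```python
import pandas as pd

s = pd.Series([74, 18, 63, 99],
              index=["Film A", "Film B", "Film C", "Film D"])

mid_range = s[(s > 50) & (s < 75)]        # both conditions hold
extremes = s[(s <= 50) | (s >= 75)]       # either condition holds
not_mid = s[~((s > 50) & (s < 75))]       # negation of the first mask

print(mid_range)
print(extremes)
```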

How do you give two Series the same index, and how are they combined?

When the indexes match, values at corresponding labels are combined, producing a new Series.

rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])

rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])

rt_mean = (rt_critics + rt_users)/2

rt_mean
Out[138]: 
FILM
Avengers: Age of Ultron (2015)                    80.0
Cinderella (2015)                                 82.5
Ant-Man (2015)                                    85.0
Do You Believe? (2015)                            51.0
Hot Tub Time Machine 2 (2015)                     21.0
....
Inside Out (2015)                                 94.0
Mr. Holmes (2015)                                 82.5
'71 (2015)                                        89.5
Two Days, One Night (2014)                        87.5
Gett: The Trial of Viviane Amsalem (2015)         90.5
Kumiko, The Treasure Hunter (2015)                75.0
Length: 146, dtype: float64
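
The matching is done by label rather than by position, so the two Series do not need to list the films in the same order. A sketch with hypothetical films and scores:

```python
import pandas as pd

critics = pd.Series([74, 85, 80], index=["Film A", "Film B", "Film C"])
# Same labels, deliberately stored in a different order.
users = pd.Series([90, 86, 80], index=["Film C", "Film A", "Film B"])

# Each film's two scores are paired by label before averaging.
mean_score = (critics + users) / 2

print(mean_score)
```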

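How do you specify an index?

More on the set_index function:
A DataFrame can be given a single-level or a multi-level (compound) index with the set_index method.
DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
append=True adds the new keys to the existing index instead of replacing it. drop=False keeps the key columns as regular columns as well. inplace=True modifies the DataFrame itself instead of returning a new one.

The index of fandango runs from 0 to 145.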

fandango=pd.read_csv('fandango_score_comparison.csv')
fandango.index
Out[149]: RangeIndex(start=0, stop=146, step=1)

Use set_index to replace the default 0-145 index with the values of the 'FILM' column. The result:

fandango_films = fandango.set_index('FILM', drop=False)
fandango_films.index
Out[140]: 
Index(['Avengers: Age of Ultron (2015)', 'Cinderella (2015)', 'Ant-Man (2015)',
       'Do You Believe? (2015)', 'Hot Tub Time Machine 2 (2015)',
       'The Water Diviner (2015)', 'Irrational Man (2015)', 'Top Five (2014)',
       'Shaun the Sheep Movie (2015)', 'Love & Mercy (2015)',
       ...
       'The Woman In Black 2 Angel of Death (2015)', 'Danny Collins (2015)',
       'Spare Parts (2015)', 'Serena (2015)', 'Inside Out (2015)',
       'Mr. Holmes (2015)', "'71 (2015)", 'Two Days, One Night (2014)',
       'Gett: The Trial of Viviane Amsalem (2015)',
       'Kumiko, The Treasure Hunter (2015)'],
      dtype='object', name='FILM', length=146)
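
The effect of the drop parameter is easiest to see side by side. This sketch uses a hypothetical two-row table and also shows reset_index, which turns the index back into an ordinary column.

```python
import pandas as pd

df = pd.DataFrame({"FILM": ["Film A", "Film B"],
                   "RottenTomatoes": [74, 85]})

kept = df.set_index("FILM", drop=False)   # FILM is the index and still a column
moved = df.set_index("FILM")              # default drop=True: FILM only in the index
restored = moved.reset_index()            # index becomes a column again

print(kept.columns.tolist())
print(moved.columns.tolist())
print(restored.columns.tolist())
```

Neither call changes df itself, since inplace is left at its default of False.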

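Slicing on the new index

A numeric index can be sliced, and string labels can be sliced with a colon in the same way. A label slice follows the order of the labels in the index (in an alphabetically ordered index, a:c selects a, b and c) and includes both endpoints. It returns all the data in the rows for those labels, much like slicing with a numeric index.

Example: slice the rows from "Avengers: Age of Ultron (2015)" to "Hot Tub Time Machine 2 (2015)".

fandango_films["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"] is equivalent to fandango_films.loc["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"].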

fandango_films["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]
Out[147]: 
                                                          FILM  \
FILM                                                             
Avengers: Age of Ultron (2015)  Avengers: Age of Ultron (2015)   
Cinderella (2015)                            Cinderella (2015)   
Ant-Man (2015)                                  Ant-Man (2015)   
Do You Believe? (2015)                  Do You Believe? (2015)   
Hot Tub Time Machine 2 (2015)    Hot Tub Time Machine 2 (2015)   


                                RT_user_norm         ...           IMDB_norm  \
FILM                                                 ...                       
Avengers: Age of Ultron (2015)           4.3         ...                3.90   
Cinderella (2015)                        4.0         ...                3.55   
Ant-Man (2015)                           4.5         ...                3.90   
Do You Believe? (2015)                   4.2         ...                2.70   
Hot Tub Time Machine 2 (2015)            1.4         ...                2.55   

                                RT_norm_round  RT_user_norm_round  \

                                Fandango_Difference  
FILM                                                 
Avengers: Age of Ultron (2015)                  0.5  
Cinderella (2015)                               0.5  
Ant-Man (2015)                                  0.5  
Do You Believe? (2015)                          0.5  
Hot Tub Time Machine 2 (2015)                   0.5  

[5 rows x 22 columns]

Similar small exercises:

# look up the row for one index label
fandango_films.loc['Kumiko, The Treasure Hunter (2015)']
# look up the rows for three index labels
movies = ['Kumiko, The Treasure Hunter (2015)', 'Do You Believe? (2015)', 'Ant-Man (2015)']
fandango_films.loc[movies]
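
Label slices and position slices treat the end point differently. The sketch below uses a hypothetical four-row table whose labels are deliberately not in alphabetical order, to show that a label slice follows the order of the index.

```python
import pandas as pd

df = pd.DataFrame({"RottenTomatoes": [74, 85, 80, 18]},
                  index=["Film D", "Film A", "Film C", "Film B"])

# Label slice: runs in index order and includes the end label.
by_label = df.loc["Film D":"Film C"]

# Position slice: the end position is excluded, as in plain Python.
by_position = df.iloc[0:2]

print(by_label)
print(by_position)
```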

How do you work with data types?

Use the dtypes attribute to look up the data type of each column of the DataFrame. The result:

import numpy as np

types = fandango_films.dtypes

types
Out[158]: 
FILM                           object
RottenTomatoes                  int64
RottenTomatoes_User             int64
Metacritic                      int64
Metacritic_User               float64
....
IMDB_norm_round               float64
Metacritic_user_vote_count      int64
IMDB_user_vote_count            int64
Fandango_votes                  int64
Fandango_Difference           float64
dtype: object

Get the index labels (column names) whose data type is float64

float_columns = types[types.values == 'float64'].index

float_columns
Out[160]: 
Index(['Metacritic_User', 'IMDB', 'Fandango_Stars', 'Fandango_Ratingvalue',
       'RT_norm', 'RT_user_norm', 'Metacritic_norm', 'Metacritic_user_nom',
       'IMDB_norm', 'RT_norm_round', 'RT_user_norm_round',
       'Metacritic_norm_round', 'Metacritic_user_norm_round',
       'IMDB_norm_round', 'Fandango_Difference'],
      dtype='object')
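
pandas also offers DataFrame.select_dtypes, which filters columns by type in one step. The sketch below compares it with the manual approach on a hypothetical three-column table; it is an alternative, not the method used in the course.

```python
import pandas as pd

df = pd.DataFrame({"FILM": ["Film A", "Film B"],
                   "RottenTomatoes": [74, 85],
                   "IMDB": [7.8, 7.1]})

# Manual approach: filter the dtypes Series, then take its labels.
float_cols_manual = df.dtypes[df.dtypes == "float64"].index

# One-step alternative: returns the matching columns directly.
float_df = df.select_dtypes(include=["float64"])

print(list(float_cols_manual))
print(float_df)
```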

Use those float64 column names to select the corresponding columns, with all of their rows

float_df = fandango_films[float_columns]

float_df
Out[162]: 
                                                Metacritic_User  IMDB  \
FILM                                                                    
Avengers: Age of Ultron (2015)                              7.1   7.8   
Cinderella (2015)                                           7.5   7.1   
Ant-Man (2015)                                              8.1   7.8   
Do You Believe? (2015)                                      4.7   5.4   
Hot Tub Time Machine 2 (2015)                               3.4   5.1   
The Water Diviner (2015)                                    6.8   7.2   
Irrational Man (2015)                                       7.6   6.9   
Top Five (2014)                                             6.8   6.5   
Shaun the Sheep Movie (2015)                                8.8   7.4   
Love & Mercy (2015)                                         8.5   7.8   
Far From The Madding Crowd (2015)                           7.5   7.2   
Black Sea (2015)                                            6.6   6.4   
Leviathan (2014)                                            7.2   7.7   
Unbroken (2014)                                             6.5   7.2   
The Imitation Game (2014)                                   8.2   8.1   
Taken 3 (2015)                                              4.6   6.1   
Ted 2 (2015)                                                6.5   6.6   
Southpaw (2015)                                             8.2   7.8   
Night at the Museum: Secret of the Tomb (2014)              5.8   6.3   
Pixels (2015)                                               5.3   5.6   
McFarland, USA (2015)                                       7.2   7.5   
Insidious: Chapter 3 (2015)                                 6.9   6.3   
The Man From U.N.C.L.E. (2015)                              7.9   7.6   
Run All Night (2015)                                        7.3   6.6   
Trainwreck (2015)                                           6.0   6.7   
Selma (2014)                                                7.1   7.5   
Ex Machina (2015)                                           7.9   7.7   
Still Alice (2015)                                          7.8   7.5   
Wild Tales (2014)                                           8.8   8.2   
The End of the Tour (2015)                                  7.5   7.9   
                                                            ...  
Clouds of Sils Maria (2015)                                     0.1  
Testament of Youth (2015)                                       0.1  
Infinitely Polar Bear (2015)                                    0.1  
Phoenix (2015)                                                  0.1  
The Wolfpack (2015)                                             0.1  
The Stanford Prison Experiment (2015)                           0.1  
Tangerine (2015)                                                0.1  
Magic Mike XXL (2015)                                           0.1  
Home (2015)                                                     0.1  
The Wedding Ringer (2015)                                       0.1  
Woman in Gold (2015)                                            0.1  
The Last Five Years (2015)                                      0.1  
Mission: Impossible – Rogue Nation (2015)                     0.1  
Amy (2015)                                                      0.1  
Jurassic World (2015)                                           0.0  
Minions (2015)                                                  0.0  
Max (2015)                                                      0.0  
Paul Blart: Mall Cop 2 (2015)                                   0.0  
The Longest Ride (2015)                                         0.0  
The Lazarus Effect (2015)                                       0.0  
The Woman In Black 2 Angel of Death (2015)                      0.0  
Danny Collins (2015)                                            0.0  
Spare Parts (2015)                                              0.0  
Serena (2015)                                                   0.0  
Inside Out (2015)                                               0.0  
Mr. Holmes (2015)                                               0.0  
'71 (2015)                                                      0.0  
Two Days, One Night (2014)                                      0.0  
Gett: The Trial of Viviane Amsalem (2015)                       0.0  
Kumiko, The Treasure Hunter (2015)                              0.0  

[146 rows x 15 columns]

Compute the standard deviation of each column by passing np.std() to apply()

deviations = float_df.apply(lambda x: np.std(x))

deviations
Out[165]: 
Metacritic_User               1.505529
IMDB                          0.955447
Fandango_Stars                0.538532
Fandango_Ratingvalue          0.501106
RT_norm                       1.503265
RT_user_norm                  0.997787
Metacritic_norm               0.972522
Metacritic_user_nom           0.752765
IMDB_norm                     0.477723
RT_norm_round                 1.509404
RT_user_norm_round            1.003559
Metacritic_norm_round         0.987561
Metacritic_user_norm_round    0.785412
IMDB_norm_round               0.501043
Fandango_Difference           0.152141
dtype: float64

A similar exercise:

rt_mt_user = float_df[['RT_user_norm', 'Metacritic_user_nom']]
rt_mt_user.apply(lambda x: np.std(x), axis=1)
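
Two details are worth keeping in mind here. The axis argument decides whether the result has one value per column or one per row, and NumPy's np.std defaults to the population standard deviation (ddof=0) while the pandas std method defaults to the sample standard deviation (ddof=1). The sketch below uses the built-in std method on a hypothetical two-film table, with invented values, to make both points explicit.

```python
import pandas as pd

df = pd.DataFrame({"RT_user_norm": [4.0, 2.0],
                   "Metacritic_user_nom": [2.0, 2.0]},
                  index=["Film A", "Film B"])

per_column = df.std(ddof=0)          # one value per column
per_row = df.std(axis=1, ddof=0)     # one value per film
sample_per_column = df.std()         # pandas default, ddof=1

print(per_column)
print(per_row)
print(sample_per_column)
```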