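Python Data Analysis: Pandas 1.x Cookbook, Second Edition

Pandas 1.x Cookbook, Second Edition: Chapter 01

2021-02-02  SeanCheney

Chapter 01: Pandas Basics
Chapter 02: Basic DataFrame Operations
Chapter 03: Creating and Persisting DataFrames
Chapter 04: Beginning Data Analysis
Chapter 05: Exploratory Data Analysis
Chapter 06: Selecting Subsets of Data
Chapter 07: Filtering Rows
Chapter 08: Index Alignment

Download the book: https://www.jianshu.com/p/62524f4c240e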


1.1 Importing Pandas and NumPy

>>> import pandas as pd
>>> import numpy as np

1.2 The Pandas DataFrame

The read_csv() function reads data from disk into an in-memory DataFrame object.

All of the data can be downloaded from GitHub.

>>> movies = pd.read_csv("data/movie.csv")
>>> movies
      color      director_name  ...  aspect_ratio  movie_facebook_likes
0     Color      James Cameron  ...          1.78                 33000
1     Color     Gore Verbinski  ...          2.35                     0
2     Color         Sam Mendes  ...          2.35                 85000
3     Color  Christopher Nolan  ...          2.35                164000
4       NaN        Doug Walker  ...           NaN                     0
...     ...                ...  ...           ...                   ...
4911  Color        Scott Smith  ...           NaN                    84
4912  Color                NaN  ...         16.00                 32000
4913  Color   Benjamin Roberds  ...           NaN                    16
4914  Color        Daniel Hsia  ...          2.35                   660
4915  Color           Jon Gunn  ...          1.85                   456

Figure: Structure of a DataFrame

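In a DataFrame, the index is axis 0 and the columns are axis 1.

Pandas uses NaN (not a number) to represent missing values.

movies.head(n) returns the first n rows, and movies.tail(n) returns the last n rows.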
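
As a quick illustration of .head and .tail that does not depend on the movie file, here is a minimal sketch on a small made-up DataFrame. The `toy` frame and its values are assumptions for illustration only. Both methods default to n=5 when no argument is given.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not taken from data/movie.csv
toy = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D", "E", "F"],
        "score": [7.9, 7.1, np.nan, 8.5, 6.3, 6.6],
    }
)

first_two = toy.head(2)   # rows labelled 0 and 1
last_two = toy.tail(2)    # rows labelled 4 and 5
print(first_two)
print(last_two)
print(toy.head().shape)   # default n=5 -> (5, 2)
```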


1.3 DataFrame attributes

Extract the columns, the index, and the data of a DataFrame:

>>> movies = pd.read_csv("data/movie.csv")
>>> columns = movies.columns
>>> index = movies.index
>>> data = movies.to_numpy()

Display the columns, the index, and the data:

>>> columns
Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'], dtype='object')
>>> index 
RangeIndex(start=0, stop=4916, step=1)
>>> data
array([['Color', 'James Cameron', 723.0, ..., 7.9, 1.78, 33000],
       ['Color', 'Gore Verbinski', 302.0, ..., 7.1, 2.35, 0],
       ['Color', 'Sam Mendes', 602.0, ..., 6.8, 2.35, 85000],
       ...,
       ['Color', 'Benjamin Roberds', 13.0, ..., 6.3, nan, 16],
       ['Color', 'Daniel Hsia', 14.0, ..., 6.3, 2.35, 660],
       ['Color', 'Jon Gunn', 43.0, ..., 6.6, 1.85, 456]], dtype=object)

The types of the index, the columns, and the data:

>>> type(index)
<class 'pandas.core.indexes.range.RangeIndex'>
>>> type(columns)
<class 'pandas.core.indexes.base.Index'>
>>> type(data)
<class 'numpy.ndarray'>

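The index and the columns are both Index objects (RangeIndex is a subclass of Index); they are sometimes called the row index and the column index: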

>>> issubclass(pd.RangeIndex, pd.Index)
True
>>> issubclass(columns.__class__, pd.Index)
True

The .values attribute (or the .to_numpy() method) converts the index, the columns, or the data to an ndarray, NumPy's n-dimensional array:

>>> index.to_numpy()
array([   0,    1,    2, ..., 4913, 4914, 4915], dtype=int64)
>>> columns.to_numpy()
array(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes',
       'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres',
       'actor_1_name', 'movie_title', 'num_voted_users',
       'cast_total_facebook_likes', 'actor_3_name',
       'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link',
       'num_user_for_reviews', 'language', 'country', 'content_rating',
       'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score',
       'aspect_ratio', 'movie_facebook_likes'], dtype=object)
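
The data array shown earlier has dtype object because the movie table mixes string and numeric columns. A small sketch with made-up data (the `toy` frame is hypothetical) contrasts a mixed frame with a purely numeric one:

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame(
    {"name": ["x", "y"], "likes": [10, 20], "ratio": [1.78, 2.35]}
)

mixed = toy.to_numpy()                        # strings + numbers
numeric = toy[["likes", "ratio"]].to_numpy()  # numbers only

print(mixed.dtype)    # object
print(numeric.dtype)  # float64 (the ints are upcast to match the floats)
print(toy.index.to_numpy())    # the row labels as an ndarray
print(toy.columns.to_numpy())  # the column labels as an ndarray
```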

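1.4 Understanding data types

Broadly speaking, data can be divided into continuous data and discrete, categorical data.

The .dtypes attribute displays each column name together with its data type: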

>>> movies = pd.read_csv("data/movie.csv")
>>> movies.dtypes
color                       object
director_name               object
num_critic_for_reviews     float64
duration                   float64
director_facebook_likes    float64
                            ...   
title_year                 float64
actor_2_facebook_likes     float64
imdb_score                 float64
aspect_ratio               float64
movie_facebook_likes         int64
Length: 28, dtype: object

Use the .value_counts method to count the columns of each data type:

>>> movies.dtypes.value_counts()
float64    13
int64       3
object     12
dtype: int64

Use the .info method to view the data types together with the non-null counts:

>>> movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4916 entries, 0 to 4915
Data columns (total 28 columns):
color                        4897 non-null object
director_name                4814 non-null object
num_critic_for_reviews       4867 non-null float64
duration                     4901 non-null float64
director_facebook_likes      4814 non-null float64
actor_3_facebook_likes       4893 non-null float64
actor_2_name                 4903 non-null object
actor_1_facebook_likes       4909 non-null float64
gross                        4054 non-null float64
genres                       4916 non-null object
actor_1_name                 4909 non-null object
movie_title                  4916 non-null object
num_voted_users              4916 non-null int64
cast_total_facebook_likes    4916 non-null int64
actor_3_name                 4893 non-null object
facenumber_in_poster         4903 non-null float64
plot_keywords                4764 non-null object
movie_imdb_link              4916 non-null object
num_user_for_reviews         4895 non-null float64
language                     4904 non-null object
country                      4911 non-null object
content_rating               4616 non-null object
budget                       4432 non-null float64
title_year                   4810 non-null float64
actor_2_facebook_likes       4903 non-null float64
imdb_score                   4916 non-null float64
aspect_ratio                 4590 non-null float64
movie_facebook_likes         4916 non-null int64
dtypes: float64(13), int64(3), object(12)
memory usage: 1.1+ MB

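By default, Pandas represents numeric types with 64 bits, which is why int64 and float64 appear above.

Starting straight away with the DataFrame, the most widely used structure, is a distinctive choice of this book; the more conventional order would begin with Series.

A column of object type may hold any Python object, and it may also contain missing values. A Pandas Series that holds both strings and missing values has dtype O (object):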

>>> pd.Series(["Paul", np.nan, "George"]).dtype
dtype('O')
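
Missing values likely also explain why count-like columns such as actor_1_facebook_likes appear as float64 above: with the default NumPy-backed dtypes, NaN is a float, so an integer column containing a missing value is stored as float64. A minimal sketch with made-up values (not taken from the movie data):

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration
ints = pd.Series([1000, 40000, 11000])
ints_with_gap = pd.Series([1000, np.nan, 11000])

print(ints.dtype)            # int64
print(ints_with_gap.dtype)   # float64: NaN forces a float representation
print(ints_with_gap.isna().sum())   # 1 missing value
```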

1.5 Selecting a column

Select a column with the index operator and the column name:

>>> movies = pd.read_csv("data/movie.csv")
>>> movies["director_name"]
0           James Cameron
1          Gore Verbinski
2              Sam Mendes
3       Christopher Nolan
4             Doug Walker
              ...        
4911          Scott Smith
4912                  NaN
4913     Benjamin Roberds
4914          Daniel Hsia
4915             Jon Gunn
Name: director_name, Length: 4916, dtype: object

Select a column with attribute access:

>>> movies.director_name
0           James Cameron
1          Gore Verbinski
2              Sam Mendes
3       Christopher Nolan
4             Doug Walker
              ...        
4911          Scott Smith
4912                  NaN
4913     Benjamin Roberds
4914          Daniel Hsia
4915             Jon Gunn
Name: director_name, Length: 4916, dtype: object

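Select a column with .loc or .iloc; the former takes the column name, the latter the integer position:

# The colon (:) selects all rows, from the first to the last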
>>> movies.loc[:, "director_name"]
0           James Cameron
1          Gore Verbinski
2              Sam Mendes
3       Christopher Nolan
4             Doug Walker
              ...        
4911          Scott Smith
4912                  NaN
4913     Benjamin Roberds
4914          Daniel Hsia
4915             Jon Gunn
Name: director_name, Length: 4916, dtype: object
>>> movies.iloc[:, 1]
0           James Cameron
1          Gore Verbinski
2              Sam Mendes
3       Christopher Nolan
4             Doug Walker
              ...        
4911          Scott Smith
4912                  NaN
4913     Benjamin Roberds
4914          Daniel Hsia
4915             Jon Gunn
Name: director_name, Length: 4916, dtype: object

Inspect the attributes of the column:

>>> movies["director_name"].index
RangeIndex(start=0, stop=4916, step=1)
>>> movies["director_name"].dtype
dtype('O')
>>> movies["director_name"].size
4916
>>> movies["director_name"].name
'director_name'

Confirm that the output is a Series object:

>>> type(movies["director_name"])
<class 'pandas.core.series.Series'>

Every column of a DataFrame can be pulled out and worked with as a Series.
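
The four selection styles above return the same Series. The sketch below checks this on a made-up frame and also shows one caveat of attribute access: a column whose name collides with a DataFrame method can only be reached with brackets. The `toy` frame and its "count" column are assumptions chosen to trigger that collision.

```python
import pandas as pd

# Hypothetical toy data; "count" deliberately collides with a method name
toy = pd.DataFrame(
    {"director_name": ["Ann", "Bob"], "count": [3, 5]}
)

by_bracket = toy["director_name"]
by_attr = toy.director_name
by_loc = toy.loc[:, "director_name"]
by_iloc = toy.iloc[:, 0]

# All four routes return the same Series
print(by_bracket.equals(by_attr))
print(by_bracket.equals(by_loc))
print(by_bracket.equals(by_iloc))

# toy.count is the DataFrame.count method, not the column
print(callable(toy.count))       # True
print(toy["count"].to_list())    # [3, 5]
```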


1.6 Calling Series methods

Use dir() to list the attributes and methods of pd.Series and pd.DataFrame:

>>> s_attr_methods = set(dir(pd.Series))
>>> len(s_attr_methods)
471
>>> df_attr_methods = set(dir(pd.DataFrame))
>>> len(df_attr_methods)
458
>>> len(s_attr_methods & df_attr_methods)
400

First, select two columns:

>>> movies = pd.read_csv("data/movie.csv")
>>> director = movies["director_name"]
>>> fb_likes = movies["actor_1_facebook_likes"]
>>> director.dtype
dtype('O')
>>> fb_likes.dtype
dtype('float64')

Besides .head, which lists the first 5 rows of a Series, .sample can also be used to look at the data:

>>> director.head()
0        James Cameron
1       Gore Verbinski
2           Sam Mendes
3    Christopher Nolan
4          Doug Walker
Name: director_name, dtype: object
>>> director.sample(n=5, random_state=42)
2347      Brian Percival
4687         Lucio Fulci
691        Phillip Noyce
3911       Sam Peckinpah
2488    Rowdy Herrington
Name: director_name, dtype: object
>>> fb_likes.head()
0     1000.0
1    40000.0
2    11000.0
3    27000.0
4      131.0
Name: actor_1_facebook_likes, dtype: float64

The data type of a Series determines which methods are most useful. For example, the most commonly used method on an object Series is .value_counts:

>>> director.value_counts()
Steven Spielberg    26
Woody Allen         22
Clint Eastwood      20
Martin Scorsese     20
Ridley Scott        16
                    ..
Eric England         1
Moustapha Akkad      1
Jay Oliva            1
Scott Speer          1
Leon Ford            1
Name: director_name, Length: 2397, dtype: int64

Numeric data can also use .value_counts:

>>> fb_likes.value_counts()
1000.0     436
11000.0    206
2000.0     189
3000.0     150
12000.0    131
          ... 
362.0        1
216.0        1
859.0        1
225.0        1
334.0        1
Name: actor_1_facebook_likes, Length: 877, dtype: int64

.size, .shape, and len() report the number of elements, and .unique() returns the unique values:

>>> director.size
4916
>>> director.shape
(4916,)
>>> len(director)
4916
>>> director.unique()
array(['James Cameron', 'Gore Verbinski', 'Sam Mendes', ...,
       'Scott Smith', 'Benjamin Roberds', 'Daniel Hsia'], dtype=object)

.count() returns the number of non-missing values:

>>> director.count()
4814
>>> fb_likes.count()
4909

The .min, .max, .mean, .median, and .std methods return summary statistics:

>>> fb_likes.min()
0.0
>>> fb_likes.max()
640000.0
>>> fb_likes.mean()
6494.488490527602
>>> fb_likes.median()
982.0
>>> fb_likes.std()
15106.986883848309

.describe also returns summary statistics:

>>> fb_likes.describe()
count      4909.000000
mean       6494.488491
std       15106.986884
min           0.000000
25%         607.000000
50%         982.000000
75%       11000.000000
max      640000.000000
Name: actor_1_facebook_likes, dtype: float64
>>> director.describe()
count                 4814
unique                2397
top       Steven Spielberg
freq                    26
Name: director_name, dtype: object

The .quantile() method returns quantiles:

>>> fb_likes.quantile(0.2)
510.0
>>> fb_likes.quantile(
...     [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
... )
0.1      240.0
0.2      510.0
0.3      694.0
0.4      854.0
0.5      982.0
0.6     1000.0
0.7     8000.0
0.8    13000.0
0.9    18000.0
Name: actor_1_facebook_likes, dtype: float64
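
As the two calls above suggest, .quantile returns a scalar for a scalar argument and a Series for a list. A minimal sketch on made-up values, assuming the default linear interpolation:

```python
import pandas as pd

# Hypothetical values for illustration
values = pd.Series([1, 2, 3, 4, 5])

median = values.quantile(0.5)                # scalar in -> scalar out
quartiles = values.quantile([0.25, 0.75])    # list in -> Series out

print(median)      # 3.0
print(quartiles)   # index 0.25 and 0.75, values 2.0 and 4.0
```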

.isna() flags which values are missing:

>>> director.isna()
0       False
1       False
2       False
3       False
4       False
        ...  
4911    False
4912     True
4913    False
4914    False
4915    False
Name: director_name, Length: 4916, dtype: bool

.fillna() fills in missing values:

>>> fb_likes_filled = fb_likes.fillna(0)
>>> fb_likes_filled.count()
4916

.dropna() removes missing values:

>>> fb_likes_dropped = fb_likes.dropna()
>>> fb_likes_dropped.size
4909

When the normalize parameter of .value_counts() is set to True, it returns relative frequencies:

>>> director.value_counts(normalize=True)
Steven Spielberg    0.005401
Woody Allen         0.004570
Clint Eastwood      0.004155
Martin Scorsese     0.004155
Ridley Scott        0.003324
                      ...
Eric England        0.000208
Moustapha Akkad     0.000208
Jay Oliva           0.000208
Scott Speer         0.000208
Leon Ford           0.000208
Name: director_name, Length: 2397, dtype: float64
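
The relative frequencies are computed over the non-missing values only, because .value_counts excludes NaN by default. A small sketch with made-up names shows this, along with the dropna=False option that counts NaN as its own entry:

```python
import numpy as np
import pandas as pd

# Hypothetical names for illustration
names = pd.Series(["Ann", "Bob", "Ann", np.nan])

counts = names.value_counts()                 # NaN is excluded by default
shares = names.value_counts(normalize=True)   # fractions of non-missing values
with_nan = names.value_counts(dropna=False)   # NaN counted as its own entry

print(counts["Ann"], counts["Bob"])   # 2 1
print(shares.sum())                   # 1.0, up to floating-point rounding
print(with_nan.sum())                 # 4
```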

Another way to check for missing values is the .hasnans attribute:

>>> director.hasnans
True

The .notna() method flags which values are not missing:

>>> director.notna()
0        True
1        True
2        True
3        True
4        True
        ...  
4911     True
4912    False
4913     True
4914     True
4915     True
Name: director_name, Length: 4916, dtype: bool

.isnull() does the same thing as .isna(). Because Pandas uses NaN to represent missing values, .isna() is the easier name to remember.
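
The missing-value helpers from this section fit together as follows. This is a minimal sketch on a made-up Series (the `likes` values are hypothetical), not a rerun of the movie data:

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration
likes = pd.Series([1000.0, np.nan, 131.0, np.nan])

n_missing = likes.isna().sum()   # True counts as 1
print(likes.size)      # 4, missing values included
print(likes.count())   # 2, missing values excluded
print(n_missing)       # 2
print(likes.hasnans)   # True

filled = likes.fillna(0)   # same length, gaps replaced by 0
dropped = likes.dropna()   # shorter, gaps removed
print(filled.to_list())    # [1000.0, 0.0, 131.0, 0.0]
print(dropped.to_list())   # [1000.0, 131.0]

# isna and notna are complements of each other
complements = (likes.isna() == ~likes.notna()).all()
print(complements)         # True
```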


1.7 Series operations

Load the imdb_score column:

>>> movies = pd.read_csv("data/movie.csv")
>>> imdb_score = movies["imdb_score"]
>>> imdb_score
0       7.9
1       7.1
2       6.8
3       8.5
4       7.1
       ... 
4911    7.7
4912    7.5
4913    6.3
4914    6.3
4915    6.6
Name: imdb_score, Length: 4916, dtype: float64

Addition, subtraction, multiplication, division, and exponentiation can be applied directly to the column:

>>> imdb_score + 1
0       8.9
1       8.1
2       7.8
3       9.5
4       8.1
       ... 
4911    8.7
4912    8.5
4913    7.3
4914    7.3
4915    7.6
Name: imdb_score, Length: 4916, dtype: float64

// and % return the integer part and the remainder of a division, respectively:

>>> imdb_score // 7
0       1.0
1       1.0
2       0.0
3       1.0
4       1.0
       ... 
4911    1.0
4912    1.0
4913    0.0
4914    0.0
4915    0.0
Name: imdb_score, Length: 4916, dtype: float64
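
To see how the two operators relate, the sketch below uses made-up integer values (to avoid floating-point noise) and checks that quotient and remainder recombine into the original Series:

```python
import pandas as pd

# Hypothetical integer values for illustration
scores = pd.Series([7, 10, 15])

quotient = scores // 4    # 1, 2, 3
remainder = scores % 4    # 3, 2, 3

# The two parts recombine into the original values
rebuilt = quotient * 4 + remainder
print(quotient.to_list())
print(remainder.to_list())
print(rebuilt.equals(scores))   # True
```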

The six comparison operators (>, <, >=, <=, ==, and !=) return Boolean values:

>>> imdb_score > 7
0        True
1        True
2       False
3        True
4        True
        ...  
4911     True
4912     True
4913    False
4914    False
4915    False
Name: imdb_score, Length: 4916, dtype: bool
>>> director = movies["director_name"]
>>> director == "James Cameron"
0        True
1       False
2       False
3       False
4       False
        ...  
4911    False
4912    False
4913    False
4914    False
4915    False
Name: director_name, Length: 4916, dtype: bool

The .add() method is equivalent to +, and .gt() is equivalent to >:

>>> imdb_score.add(1)  # imdb_score + 1
0       8.9
1       8.1
2       7.8
3       9.5
4       8.1
       ... 
4911    8.7
4912    8.5
4913    7.3
4914    7.3
4915    7.6
Name: imdb_score, Length: 4916, dtype: float64
>>> imdb_score.gt(7)  # imdb_score > 7
0        True
1        True
2       False
3        True
4        True
        ...  
4911     True
4912     True
4913    False
4914    False
4915    False
Name: imdb_score, Length: 4916, dtype: bool

The reason to use the methods is that they accept extra parameters. For example, the .sub method takes a fill_value parameter:

>>> money = pd.Series([100, 20, None])
>>> money - 15
0    85.0
1     5.0
2     NaN
dtype: float64
>>> money.sub(15, fill_value=0)
0    85.0
1     5.0
2   -15.0
dtype: float64

The arithmetic methods are .add, .sub, .mul, .div, .floordiv, .mod, and .pow.

The comparison methods are .lt, .gt, .le, .ge, .eq, and .ne.
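
The fill_value parameter is also useful when two Series with partly different labels are combined. The following is a minimal sketch with made-up data; the `jan` and `feb` Series are hypothetical:

```python
import pandas as pd

# Hypothetical figures; note that the labels only partly overlap
jan = pd.Series([1, 2], index=["a", "b"])
feb = pd.Series([10, 20], index=["b", "c"])

plain = jan + feb                     # a label missing on either side -> NaN
filled = jan.add(feb, fill_value=0)   # a missing side is treated as 0

print(plain)    # a: NaN, b: 12.0, c: NaN
print(filled)   # a: 1.0, b: 12.0, c: 20.0
```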


1.8 Chaining methods

Methods can be chained, with each call operating on the result of the previous one.

>>> movies = pd.read_csv("data/movie.csv")
>>> fb_likes = movies["actor_1_facebook_likes"]
>>> director = movies["director_name"]
>>> director.value_counts().head(3)
Steven Spielberg    26
Woody Allen         22
Clint Eastwood      20
Name: director_name, dtype: int64

Count the missing values, then fill them with 0 and convert the column to integers:

>>> fb_likes.isna().sum()
7
>>> fb_likes.dtype
dtype('float64')
>>> (fb_likes.fillna(0).astype(int).head())
0     1000
1    40000
2    11000
3    27000
4      131
Name: actor_1_facebook_likes, dtype: int64

.pipe() can be used to inspect intermediate values in a method chain:

>>> def debug_ser(ser):
...     print("BEFORE")
...     print(ser)
...     print("AFTER")
...     return ser
>>> (fb_likes.fillna(0).pipe(debug_ser).astype(int).head())
BEFORE
0        1000.0
1       40000.0
2       11000.0
3       27000.0
4         131.0
         ...   
4911      637.0
4912      841.0
4913        0.0
4914      946.0
4915       86.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
AFTER
0     1000
1    40000
2    11000
3    27000
4      131
Name: actor_1_facebook_likes, dtype: int64

.pipe can also be used to store an intermediate value in a global variable:

>>> intermediate = None
>>> def get_intermediate(ser):
...     global intermediate
...     intermediate = ser
...     return ser
>>> res = (
...     fb_likes.fillna(0)
...     .pipe(get_intermediate)
...     .astype(int)
...     .head()
... )
>>> intermediate
0        1000.0
1       40000.0
2       11000.0
3       27000.0
4         131.0
         ...   
4911      637.0
4912      841.0
4913        0.0
4914      946.0
4915       86.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
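
One possible variant avoids the global variable by using the fact that .pipe forwards extra positional and keyword arguments to the function. The `capture` helper and the `snapshots` dict below are hypothetical names introduced for this sketch; it uses astype("int64") rather than astype(int) so the resulting dtype is the same on every platform.

```python
import numpy as np
import pandas as pd

def capture(ser, store, key):
    """Hypothetical helper: save ser under store[key], then pass it through."""
    store[key] = ser
    return ser

# Hypothetical values for illustration
likes = pd.Series([1000.0, np.nan, 131.0])
snapshots = {}

result = (
    likes.fillna(0)
    .pipe(capture, snapshots, "after_fillna")   # extra args go to capture
    .astype("int64")
    .pipe(capture, snapshots, "after_astype")
)

print(snapshots["after_fillna"].dtype)   # float64
print(snapshots["after_astype"].dtype)   # int64
print(result.to_list())                  # [1000, 0, 131]
```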

1.9 Renaming columns

>>> movies = pd.read_csv("data/movie.csv")

First, define a dictionary that maps old column names to new ones:

>>> col_map = {
...     "director_name": "director",
...     "num_critic_for_reviews": "critic_reviews",
... }

Pass the dictionary to the .rename method:

>>> movies.rename(columns=col_map).head()
   color           director  ...  aspect_ratio  movie_facebook_likes
0  Color      James Cameron  ...          1.78                 33000
1  Color     Gore Verbinski  ...          2.35                     0
2  Color         Sam Mendes  ...          2.35                 85000
3  Color  Christopher Nolan  ...          2.35                164000
4    NaN        Doug Walker  ...           NaN                     0

Rename row index labels (here together with two column labels):

>>> idx_map = {
...     "Avatar": "Ratava",
...     "Spectre": "Ertceps",
...     "Pirates of the Caribbean: At World's End": "POC",
... }
>>> col_map = {
...     "aspect_ratio": "aspect",
...     "movie_facebook_likes": "fblikes",
... }
>>> (
...     movies.set_index("movie_title")
...     .rename(index=idx_map, columns=col_map)
...     .head(3)
... )
             color   director_name  ...  aspect  fblikes
movie_title                         ...                 
Ratava       Color   James Cameron  ...    1.78    33000
POC          Color  Gore Verbinski  ...    2.35        0
Ertceps      Color      Sam Mendes  ...    2.35    85000

Another way to rename the row and column labels is to assign directly to the .index and .columns attributes:

>>> movies = pd.read_csv(
...     "data/movie.csv", index_col="movie_title"
... )
>>> ids = movies.index.to_list()
>>> columns = movies.columns.to_list()
# rename the row and column labels with list assignments
>>> ids[0] = "Ratava"
>>> ids[1] = "POC"
>>> ids[2] = "Ertceps"
>>> columns[1] = "director"
>>> columns[-2] = "aspect"
>>> columns[-1] = "fblikes"
>>> movies.index = ids
>>> movies.columns = columns
>>> movies.head(3)
         color        director  ...  aspect  fblikes
Ratava   Color   James Cameron  ...    1.78    33000
POC      Color  Gore Verbinski  ...    2.35        0
Ertceps  Color      Sam Mendes  ...    2.35    85000

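Yet another way is to pass a function to the .rename method. The example below strips surrounding whitespace from the column names, converts them to lowercase, and replaces spaces with underscores: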

>>> def to_clean(val):
...     return val.strip().lower().replace(" ", "_")
>>> movies.rename(columns=to_clean).head(3)
         color        director  ...  aspect  fblikes
Ratava   Color   James Cameron  ...    1.78    33000
POC      Color  Gore Verbinski  ...    2.35        0
Ertceps  Color      Sam Mendes  ...    2.35    85000

Use a list comprehension to rename the column labels:

>>> cols = [
...     col.strip().lower().replace(" ", "_")
...     for col in movies.columns
... ]
>>> movies.columns = cols
>>> movies.head(3)
         color        director  ...  aspect  fblikes
Ratava   Color   James Cameron  ...    1.78    33000
POC      Color  Gore Verbinski  ...    2.35        0
Ertceps  Color      Sam Mendes  ...    2.35    85000
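
Because the movie columns are already clean, the cleaning function has no visible effect in the output above. The sketch below applies the same function to a made-up frame with untidy names (the `toy` frame is an assumption), and also shows that .rename returns a new DataFrame and leaves the original untouched:

```python
import pandas as pd

# Hypothetical frame with untidy column names
toy = pd.DataFrame({" Director Name ": ["Ann"], "IMDB Score": [7.9]})

def to_clean(val):
    return val.strip().lower().replace(" ", "_")

renamed = toy.rename(columns=to_clean)

print(renamed.columns.to_list())   # ['director_name', 'imdb_score']
print(toy.columns.to_list())       # unchanged: [' Director Name ', 'IMDB Score']

# A dict only touches the labels it mentions
partial = renamed.rename(columns={"imdb_score": "score"})
print(partial.columns.to_list())   # ['director_name', 'score']
```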

1.10 Creating and deleting columns

The simplest way to create a column is by assignment:

>>> movies = pd.read_csv("data/movie.csv")
>>> movies["has_seen"] = 0

Use the .assign method to add the column:

>>> movies = pd.read_csv("data/movie.csv")
>>> idx_map = {
...     "Avatar": "Ratava",
...     "Spectre": "Ertceps",
...     "Pirates of the Caribbean: At World's End": "POC",
... }
>>> col_map = {
...     "aspect_ratio": "aspect",
...     "movie_facebook_likes": "fblikes",
... }
>>> (
...     movies.rename(
...         index=idx_map, columns=col_map
...     ).assign(has_seen=0)
... )
      color      director_name  ...  fblikes  has_seen
0     Color      James Cameron  ...    33000         0
1     Color     Gore Verbinski  ...        0         0
2     Color         Sam Mendes  ...    85000         0
3     Color  Christopher Nolan  ...   164000         0
4       NaN        Doug Walker  ...        0         0
...     ...                ...  ...      ...       ...
4911  Color        Scott Smith  ...       84         0
4912  Color                NaN  ...    32000         0
4913  Color   Benjamin Roberds  ...       16         0
4914  Color        Daniel Hsia  ...      660         0
4915  Color           Jon Gunn  ...      456         0

To operate on several columns and assign the result to a new column:

The simplest way is to combine the columns with + first:

>>> total = (
...     movies["actor_1_facebook_likes"]
...     + movies["actor_2_facebook_likes"]
...     + movies["actor_3_facebook_likes"]
...     + movies["director_facebook_likes"]
... )
>>> total.head(5)
0     2791.0
1    46563.0
2    11554.0
3    95000.0
4        NaN
dtype: float64

The second way is to use the .sum method:

>>> cols = [
...     "actor_1_facebook_likes",
...     "actor_2_facebook_likes",
...     "actor_3_facebook_likes",
...     "director_facebook_likes",
... ]
>>> sum_col = movies.loc[:, cols].sum(axis="columns")
>>> sum_col.head(5)
0     2791.0
1    46563.0
2    11554.0
3    95000.0
4      274.0
dtype: float64
>>> movies.assign(total_likes=sum_col).head(5)
   color      director_name  ...  movie_facebook_likes  total_likes
0  Color      James Cameron  ...                 33000       2791.0
1  Color     Gore Verbinski  ...                     0      46563.0
2  Color         Sam Mendes  ...                 85000      11554.0
3  Color  Christopher Nolan  ...                164000      95000.0
4    NaN        Doug Walker  ...                     0        274.0

Another way is to pass a function to the .assign method:

>>> def sum_likes(df):
...     return df[
...         [
...             c
...             for c in df.columns
...             if "like" in c
...             and ("actor" in c or "director" in c)
...         ]
...     ].sum(axis=1)
>>> movies.assign(total_likes=sum_likes).head(5)
   color      director_name  ...  movie_facebook_likes  total_likes
0  Color      James Cameron  ...                 33000       2791.0
1  Color     Gore Verbinski  ...                     0      46563.0
2  Color         Sam Mendes  ...                 85000      11554.0
3  Color  Christopher Nolan  ...                164000      95000.0
4    NaN        Doug Walker  ...                     0        274.0

If any of the columns has a missing value, adding them with + yields NaN for that row, whereas the .sum method treats NaN as 0:

>>> (
...     movies.assign(total_likes=sum_col)["total_likes"]
...     .isna()
...     .sum()
... )
0
>>> (
...     movies.assign(total_likes=total)["total_likes"]
...     .isna()
...     .sum()
... )
122
# After the missing values are filled, the count of NaN drops to 0.
>>> (
...     movies.assign(total_likes=total.fillna(0))[
...         "total_likes"
...     ]
...     .isna()
...     .sum()
... )
0
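
The difference between + and .sum can be reproduced on a tiny made-up frame. The sketch below (the `toy` frame is hypothetical) also shows two .sum parameters that change how NaN is handled: skipna=False makes .sum behave like +, and min_count=1 returns NaN only when a row has no valid values at all.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one complete row, one partial row, one empty row
toy = pd.DataFrame(
    {"x": [1.0, np.nan, np.nan], "y": [2.0, 5.0, np.nan]}
)

added = toy["x"] + toy["y"]                       # NaN if either side is NaN
summed = toy.sum(axis="columns")                  # NaN skipped
strict = toy.sum(axis="columns", skipna=False)    # behaves like +
guarded = toy.sum(axis="columns", min_count=1)    # NaN only if the row is all NaN

print(added.to_list())     # [3.0, nan, nan]
print(summed.to_list())    # [3.0, 5.0, 0.0]
print(strict.to_list())    # [3.0, nan, nan]
print(guarded.to_list())   # [3.0, 5.0, nan]
```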

The movies DataFrame has a column named cast_total_facebook_likes. Let's compare it with the total_likes column just created:

>>> def cast_like_gt_actor(df):
...     return (
...         df["cast_total_facebook_likes"]
...         >= df["total_likes"]
...     )
>>> df2 = movies.assign(
...     total_likes=total,
...     is_cast_likes_more=cast_like_gt_actor,
... )

Use the .all method to check whether is_cast_likes_more is True for every row:

>>> df2["is_cast_likes_more"].all()
False
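
The call above works because .assign evaluates its keyword arguments in order, so a function passed for a later keyword already sees the columns created by earlier ones. A minimal sketch on made-up data (the `toy` frame and its column names are assumptions):

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({"a": [1, 2], "b": [10, 20], "cast": [11, 21]})

def cast_at_least_total(df):
    # df already contains the "total" column created earlier in the same call
    return df["cast"] >= df["total"]

out = toy.assign(
    total=lambda df: df["a"] + df["b"],   # 11, 22
    cast_ge_total=cast_at_least_total,    # True, False
)

print(out["cast_ge_total"].to_list())   # [True, False]
print("total" in toy.columns)           # False: toy itself is unchanged
```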

At least one row has total_likes greater than cast_total_facebook_likes. This may be because the director's Facebook likes are not counted in cast_total_facebook_likes. So first drop the total_likes column:

>>> df2 = df2.drop(columns="total_likes")

Create a new column that sums only the actor likes:

>>> actor_sum = movies[
...     [
...         c
...         for c in movies.columns
...         if "actor_" in c and "_likes" in c
...     ]
... ].sum(axis="columns")
>>> actor_sum.head(5)
0     2791.0
1    46000.0
2    11554.0
3    73000.0
4      143.0
dtype: float64

Check again whether cast_total_facebook_likes is greater than or equal to actor_sum:

>>> movies["cast_total_facebook_likes"] >= actor_sum
0       True
1       True
2       True
3       True
4       True
        ... 
4911    True
4912    True
4913    True
4914    True
4915    True
Length: 4916, dtype: bool
>>> movies["cast_total_facebook_likes"].ge(actor_sum)
0       True
1       True
2       True
3       True
4       True
        ... 
4911    True
4912    True
4913    True
4914    True
4915    True
Length: 4916, dtype: bool
>>> movies["cast_total_facebook_likes"].ge(actor_sum).all()
True

Finally, compute actor_sum as a percentage of cast_total_facebook_likes:

>>> pct_like = actor_sum.div(
...     movies["cast_total_facebook_likes"]
... ).mul(100)

Check that the values of pct_like lie between 0 and 100:

>>> pct_like.describe()
count    4883.000000
mean       83.327889
std        14.056578
min        30.076696
25%        73.528368
50%        86.928884
75%        95.477440
max       100.000000
dtype: float64

Create a Series that uses movie_title as the index:

>>> pd.Series(
...     pct_like.to_numpy(), index=movies["movie_title"]
... ).head()
movie_title
Avatar                                         57.736864
Pirates of the Caribbean: At World's End       95.139607
Spectre                                        98.752137
The Dark Knight Rises                          68.378310
Star Wars: Episode VII - The Force Awakens    100.000000
dtype: float64

The .insert method inserts a column at a specified position. It modifies the DataFrame in place and does not return a new object.

>>> profit_index = movies.columns.get_loc("gross") + 1
>>> profit_index
9
>>> movies.insert(
...     loc=profit_index,
...     column="profit",
...     value=movies["gross"] - movies["budget"],
... )

The del statement can also delete a column; it likewise does not return a new object.

>>> del movies["director_name"]
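
To make the in-place behaviour concrete, the sketch below contrasts .insert and del, which change the DataFrame itself, with .drop, which returns a new object. The `toy` frame and its numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({"gross": [100.0, 50.0], "budget": [60.0, 70.0]})

# insert mutates toy and returns None
position = toy.columns.get_loc("gross") + 1
returned = toy.insert(
    loc=position, column="profit", value=toy["gross"] - toy["budget"]
)
print(returned)                # None
print(toy.columns.to_list())   # ['gross', 'profit', 'budget']

# drop returns a new DataFrame; toy keeps all of its columns
trimmed = toy.drop(columns="budget")
print(trimmed.columns.to_list())   # ['gross', 'profit']
print(toy.columns.to_list())       # ['gross', 'profit', 'budget']

# del removes the column from toy itself
del toy["budget"]
print(toy.columns.to_list())       # ['gross', 'profit']
```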
