Pandas Basics: A Summary

2023-01-13  敬子v

Series

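Series: a one-dimensional labeled array that can hold any data type (int, str, float, Python objects, ...). Its data labels are called the index.
Ways to create a Series:
1. From an array
2. From a Python dict

# The examples in this article assume these imports
import numpy as np
import pandas as pd

# Create a Series from an array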
arr1 = np.arange(1,6)
s1 = pd.Series(arr1)
s1
0    1
1    2
2    3
3    4
4    5
dtype: int32
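# The index is on the left and the values are on the right. Because no index was specified, a default one is generated,
# running from 0 to N-1 (N is the length of the data)

# The values and index attributes return the values and the index of the Series
s1.values
array([1, 2, 3, 4, 5]) # the values are the underlying array

s1.index
RangeIndex(start=0, stop=5, step=1) # similar to range(5)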

Usually we create the index ourselves, giving each data point a label.

# The index must be the same length as the data.
s2 = pd.Series([1,3,-4,8],index=['a','b','c','d'])
s2
a    1
b    3
c   -4
d    8
dtype: int64
  
s2.index
Index(['a', 'b', 'c', 'd'], dtype='object')

Creating a Series from a Python dict

dict1 = {'name':'hah','age':18,"city":'cs'}
s3 = pd.Series(dict1)
s3
name    hah
age      18
city     cs
dtype: object
# When a Series is built from a dict, its index is the dict's keys, in insertion order

# You can also pass the index in the order you want
s4 = pd.Series(dict1, index=['city','name','age','sex'])
s4
city     cs
name    hah
age      18
sex     NaN
dtype: object
# The first three rows follow the order we gave, but 'sex' is not a key in the dict, so its value is NaN.
# NaN is how pandas marks a missing value

isnull and notnull: checking for missing values

s4.isnull() # True where the value is missing
city    False
name    False
age     False
sex      True
dtype: bool

s4.notnull() # True where the value is present
city     True
name     True
age      True
sex     False
dtype: bool

# Both return a Series

Indexing and slicing

s5 = pd.Series(np.random.rand(5),index=['a','b','c','d','e'])
s5
a    0.968340
b    0.727041
c    0.607197
d    0.134053
e    0.240239
dtype: float64
  
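# By position: use iloc. Plain s5[1] on a labeled Series is deprecated in recent pandas versions.
# Negative positions work too: s5.iloc[-1] is the last element
s5.iloc[1]
0.7270408328885498

# By label
s5['c']
0.6071966171492978

# Selecting several elements still returns a Series
s5.iloc[[1,3]]  # or s5[['b','d']]; [1,3] is a list of positions, ['b','d'] a list of labels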
b    0.727041
d    0.134053
dtype: float64

# Slicing: a positional slice excludes the end point; a label slice includes it
s5[1:3]
b    0.727041
c    0.607197
dtype: float64
    
s5['b':'d']
b    0.727041
c    0.607197
d    0.134053
dtype: float64
 
# Boolean indexing
s5[s5>0.5] # keeps the elements where the mask is True
a    0.968340
b    0.727041
c    0.607197
dtype: float64
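
The difference between positional and label-based access is easiest to see with fixed values. The following is a minimal sketch using made-up data (the Series `s` is not from the examples above); it relies only on `iloc` and `loc`.

```python
import pandas as pd

# A small Series with string labels (illustrative data)
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

second = s.iloc[1]             # by position
last = s.iloc[-1]              # negative positions count from the end
by_label = s.loc['c']          # by label
pos_slice = s.iloc[1:3]        # positions 1 and 2; the end is excluded
label_slice = s.loc['b':'d']   # labels b, c and d; the end is included

print(second, last, by_label)
print(list(pos_slice.index), list(label_slice.index))
```

Spelling out `iloc` or `loc` avoids any ambiguity about whether an integer is meant as a position or as a label.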

The name attribute

# Both the Series itself and its index have a name attribute
s6 = pd.Series({'apple':7.6,'banana':9.6,'watermelon':6.8,'orange':3.6})
s6.name = 'fruit_price'  # set the name of the Series
s6.index.name = 'fruit'  # set the name of the index
s6
fruit
apple         7.6
banana        9.6
watermelon    6.8
orange        3.6
Name: fruit_price, dtype: float64
        
# Inspect the index
s6.index
Index(['apple', 'banana', 'watermelon', 'orange'], dtype='object', name='fruit')

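DataFrame

DataFrame: a two-dimensional labeled data structure whose columns may hold different value types (numeric, string, boolean), with both a row index and a column index. You can think of it as an Excel sheet, a SQL table, or a dict of Series objects.
Ways to build a DataFrame:

From dict-like input: a dict of arrays, lists, or tuples; a dict of Series; a dict of dicts

From list-like input: a 2D ndarray; a list of dicts; a list of Series

Viewing data: head() and tail()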

pd5 = pd.DataFrame(np.arange(20).reshape(10,2))
pd5
 0   1
0    0   1
1    2   3
2    4   5
3    6   7
4    8   9
5    10  11
6    12  13
7    14  15
8    16  17
9    18  19

# head() shows the first 5 rows by default; pass N to see the first N rows
pd5.head()
 0   1
0    0   1
1    2   3
2    4   5
3    6   7
4    8   9

# tail() shows the last 5 rows by default; pass N to see the last N rows
pd5.tail()
 0   1
5    10  11
6    12  13
7    14  15
8    16  17
9    18  19

# Transposing works as in NumPy
pd6 = pd.DataFrame(np.arange(4).reshape(2,2),index=['a','b'],columns=['A','B'])
pd6
 A   B
a    0   1
b    2   3

pd6.T # swaps rows and columns
 a   b
A    0   2
B    1   3

Building a DataFrame from a dict of arrays, lists, or tuples

# Build a dict
data = {'a':[1,2,3,4],
        'b':(5,6,7,8),
        'c':np.arange(9,13)}
# Build the DataFrame
frame = pd.DataFrame(data)
frame
 a   b   c
0    1   5   9
1    2   6   10
2    3   7   11
3    4   8   12  

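# The index attribute gives the row index
frame.index
RangeIndex(start=0, stop=4, step=1)
# The columns attribute gives the column index
frame.columns
Index(['a', 'b', 'c'], dtype='object')
# The values attribute gives the data
frame.values
array([[ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11],
       [ 4,  8, 12]], dtype=int64)
# Each sequence becomes one column of the DataFrame, and all sequences must have the same length. The columns are the dict keys; the index defaults to integer labels, which can be changed through the index attribute

# Specifying the index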
frame = pd.DataFrame(data,index=['A','B','C','D'])
frame
 a   b   c
A    1   5   9
B    2   6   10
C    3   7   11
D    4   8   12

# Specifying columns selects those columns in the given order; a column that is not in the data ('e') is filled with NaN
frame = pd.DataFrame(data,index=['A','B','C','D'],columns=['a','b','c','e'])
frame
 a   b   c   e
A    1   5   9   NaN
B    2   6   10  NaN
C    3   7   11  NaN
D    4   8   12  NaN

Building a DataFrame from a 2D ndarray

# Build a 2D array
arr1 = np.arange(12).reshape(4,3)
# Build the DataFrame
frame1 = pd.DataFrame(arr1)
frame1
 0   1   2
0    0   1   2
1    3   4   5
2    6   7   8
3    9   10  11
# When building from a 2D array, index and columns are both optional; if given, their lengths must match the shape of the array. The defaults run from 0 to N-1
frame2 = pd.DataFrame(arr1,index=['a','b','c','d'],columns=['A','B','C'])
frame2
 A   B   C
a    0   1   2
b    3   4   5
c    6   7   8
d    9   10  11

Building a DataFrame from a dict of Series

pd1 = pd.DataFrame({'a':pd.Series(np.arange(3)),
                   'b':pd.Series(np.arange(3,5)), 
                   })
pd1
     a    b
0    0  3.0
1    1  4.0
2    2  NaN

# Setting the index
pd1 = pd.DataFrame({'a':pd.Series(np.arange(3),index=['a','b','c']),
                   'b':pd.Series(np.arange(3,5),index=['a','b']),
                   })
pd1
 a   b
a    0   3.0
b    1   4.0
c    2   NaN
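# When building from a dict of Series, set the row labels on the Series themselves: the DataFrame index is the union of the Series labels. The Series may differ in length; missing positions are filled with NaN

Building a DataFrame from a dict of dicts

# Nested dicts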
data = {
    'a':{'apple':3.6,'banana':5.6},
    'b':{'apple':3,'banana':5},
    'c':{'apple':3.2}
}
# Build the DataFrame
pd2 = pd.DataFrame(data)
pd2
     a   b   c
apple    3.6 3   3.2
banana   5.6 5   NaN
# Each inner dict is one column: the inner keys become the row index and the outer keys become the columns

Building a DataFrame from a list of dicts

l1 = [{'apple':3.6,'banana':5.6},{'apple':3,'banana':5},{'apple':3.2}]
pd3 = pd.DataFrame(l1)
pd3
 apple   banana
0    3.6     5.6
1    3.0     5.0
2    3.2     NaN
# Each element of the list is one row; the dict keys become the columns

# If you specify the row index, it must match the length of the data
pd3 = pd.DataFrame(l1,index=['a','b','c'])
pd3
 apple   banana
a    3.6     5.6
b    3.0     5.0
c    3.2     NaN

Building a DataFrame from a list of Series

l2 = [pd.Series(np.random.rand(3)),pd.Series(np.random.rand(2))]
pd4=pd.DataFrame(l2)
pd4
    0           1           2
0   0.482106    0.025374    0.020586
1   0.912417    0.229153    NaN
# Each element of the list is one row
# Set the row index; its length must match the number of Series
pd4=pd.DataFrame(l2,index=['a','b'])
pd4
    0           1           2
a   0.482106    0.025374    0.020586
b   0.912417    0.229153    NaN
# To set the column labels, set the index on each Series
l2 = [pd.Series(np.random.rand(3),index=['A','B','C']),pd.Series(np.random.rand(2),index=['A','B'])]

pd4=pd.DataFrame(l2,index=['a','b'])
pd4

      A           B          C
a   0.999713    0.507880    0.091274
b   0.798486    0.268391    NaN

Reindexing

reindex: creates a new object that conforms to a new index

# Series
s1 = pd.Series(np.random.rand(4),index=['b','c','a','d'])
s1
b    0.714204
c    0.139476
a    0.362383
d    0.046476
dtype: float64

s2 = s1.reindex(['a','b','c','d','e'])
s2
a    0.362383
b    0.714204
c    0.139476
d    0.046476
e         NaN
dtype: float64
    
# reindex rearranges the data to follow the new index; a label that does not exist in the original is filled with NaN

# DataFrame
pd1 = pd.DataFrame(np.arange(9).reshape(3,3),index=['a','c','b'],columns=['A','B','C'])
pd1
 A   B   C
a    0   1   2
c    3   4   5
b    6   7   8
# Reindexing rows works the same way as for a Series
pd2 = pd1.reindex(['a','b','c','d'])
pd2
 A   B   C
a    0.0 1.0 2.0
b    6.0 7.0 8.0
c    3.0 4.0 5.0
d    NaN NaN NaN
# To reindex columns, pass the columns argument
pd3=pd1.reindex(columns=['C','B','A'])
pd3
 C   B   A
a    2   1   0
c    5   4   3
b    8   7   6
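
reindex can also fill the gaps it creates instead of leaving NaN. The sketch below uses made-up data and two standard options: `fill_value` for a constant, and `method='ffill'` to carry the previous value forward (the latter needs a monotonic index).

```python
import pandas as pd

# Values observed at positions 0, 2 and 4 only (illustrative data)
s = pd.Series([1.0, 2.0, 3.0], index=[0, 2, 4])

# New labels are filled with a constant
filled_const = s.reindex(range(6), fill_value=0.0)

# New labels take the last preceding value
filled_ffill = s.reindex(range(6), method='ffill')

print(filled_const.tolist())
print(filled_ffill.tolist())
```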

drop: removes entries from an axis

# Series
# Drop one entry
s1.drop('b')
c    0.139476
a    0.362383
d    0.046476
dtype: float64
# Drop several entries
s1.drop(['b','c'])
a    0.362383
d    0.046476
dtype: float64

# DataFrame
pd1.drop('a')
 A   B   C
c    3   4   5
b    6   7   8

pd1.drop(['a','c'])
 A   B   C
b    6   7   8

# To drop columns, pass axis=1 or axis='columns'
pd1.drop('A',axis=1)
 B   C
a    1   2
c    4   5
b    7   8

pd1.drop(['A','B'],axis='columns')
 C
a    2
c    5
b    8

# inplace=True modifies the original object and returns None instead of a new object
s1.drop('b',inplace=True)
s1
c    0.139476
a    0.362383
d    0.046476
dtype: float64

Indexing a DataFrame

# Passing a single label or a list to [] (e.g. ['A'] or [['A','B']]) selects columns; passing a slice to [] (e.g. [:2] or [:'three']) selects rows
data= pd.DataFrame(np.arange(16).reshape(4,4),index=['one','two','three','four'],columns=['A','B','C','D'])

data
     A   B   C   D
one      0   1   2   3
two      4   5   6   7
three    8   9   10  11
four 12  13  14  15

# Plain [] indexing goes column first, then row
data['A'] # returns column A as a Series
one       0
two       4
three     8
four     12
Name: A, dtype: int32
# Select several columns
data[['A','C']]
 A   C
one  0   2
two  4   6
three    8   10
four 12  14

# Select a single value
data['A']['one']
0

# Slicing
data[:2] # selects rows
 A   B   C   D
one  0   1   2   3
two  4   5   6   7


# loc and iloc select data NumPy-style (row first, then column)
# loc uses axis labels
# iloc uses integer positions
data.loc['one','B'] # row 'one', column 'B'
1

data.loc['one',['B','D']] # row 'one', columns 'B' and 'D'
B    1
D    3
Name: one, dtype: int32

data.iloc[2,[1,3]] # third row, second and fourth columns
B     9
D    11
Name: three, dtype: int32

# Both also accept slices
data.loc[:'three',:'B'] # first three rows, first two columns (label slices include the end point)
     A   B
one      0   1
two      4   5
three    8   9

data.iloc[:2,:2] # first two rows, first two columns
 A   B
one  0   1
two  4   5
# Slices and single indexers can be mixed, just as in NumPy
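
loc also accepts a boolean mask as the row selector, which combines filtering and column selection in one step. A minimal sketch, rebuilding the same 4x4 frame so that it runs on its own:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['one', 'two', 'three', 'four'],
                    columns=['A', 'B', 'C', 'D'])

# Rows where column A is greater than 4, restricted to columns B and D
subset = data.loc[data['A'] > 4, ['B', 'D']]
print(subset)
```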

Assignment

data['D'] = 8 # select column 'D' and assign 8 to every row
data
     A   B   C   D
one      0   1   2   8
two      4   5   6   8
three    8   9   10  8
four 12  13  14  8

# or
data.D = 6
data
     A   B   C   D
one      0   1   2   6
two      4   5   6   6
three    8   9   10  6
four 12  13  14  6

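# There are two ways to assign to an existing column: index with [], or use attribute access (object.column). Attribute access cannot create a new column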
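
Attribute-style assignment has one pitfall worth seeing in code: it only works for columns that already exist. The sketch below uses a small made-up frame; the column names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])

df.B = 0       # 'B' exists, so the column is overwritten
df.C = 1       # 'C' does not exist: this only sets a plain Python attribute
df['D'] = 2    # bracket assignment creates the column

print(list(df.columns))
```

For that reason bracket assignment is the safer habit whenever a column may be new.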

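Summary
- reindex: rebuild the index
- drop: remove entries from an axis
- Indexing and slicing
- loc and iloc (NumPy-style)
- Assignment

pandas: Merging Datasets

In real applications the data is often spread across several tables, and we need to combine them before analysis.
pandas objects can be combined in several ways:

pd.merge: joins rows on one or more keys

pd.concat: stacks objects along an axis

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None)

left: the left DataFrame

right: the right DataFrame

how: the join type, one of 'inner' (the default), 'outer', 'left', 'right'

on: the column name(s) to join on; they must exist in both frames. If on is not given (and neither are left_on/right_on), the columns common to left and right are used as the join keys

left_on: the column(s) of the left DataFrame to use as join keys

right_on: the column(s) of the right DataFrame to use as join keys

inner join: uses the intersection of the keys from both tables

outer join: uses the union of the keys from both tables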
```
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

pd.merge(left,right,on='key') # join on the key column
  key A   B   C   D
0 K0  A0  B0  C0  D0
1 K1  A1  B1  C1  D1
2 K2  A2  B2  C2  D2
3 K3  A3  B3  C3  D3
```


left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                    'key2': ['K0', 'K1', 'K0', 'K1'],
                    'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

pd.merge(left,right,on=['key1','key2']) # join on several keys
    key1    key2    A   B   C   D
0   K0      K0      A0  B0  C0  D0
1   K1      K0      A2  B2  C1  D1
2   K1      K0      A2  B2  C2  D2

# Right join

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                    'key2': ['K0', 'K1', 'K0', 'K1'],
                    'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, how='right', on=['key1', 'key2'])
    key1    key2    A   B   C   D
0   K0      K0      A0  B0  C0  D0
1   K1      K0      A2  B2  C1  D1
2   K1      K0      A2  B2  C2  D2
3   K2      K0      NaN NaN C3  D3

# Left join

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                    'key2': ['K0', 'K1', 'K0', 'K1'],
                    'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

pd.merge(left, right, how='left', on=['key1', 'key2'])
    key1    key2    A   B   C   D
0   K0      K0      A0  B0  C0  D0
1   K0      K1      A1  B1  NaN NaN
2   K1      K0      A2  B2  C1  D1
3   K1      K0      A2  B2  C2  D2
4   K2      K1      A3  B3  NaN NaN

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                    'key2': ['K0', 'K1', 'K0', 'K1'],
                    'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left,right,how='outer',on=['key1','key2'])
    key1    key2    A   B   C   D
0   K0      K0      A0  B0  C0  D0
1   K0      K1      A1  B1  NaN NaN
2   K1      K0      A2  B2  C1  D1
3   K1      K0      A2  B2  C2  D2
4   K2      K1      A3  B3  NaN NaN
5   K2      K0      NaN NaN C3  D3
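
The left_on and right_on parameters described earlier are needed when the key column has a different name in each table. The sketch below uses made-up `orders` and `customers` frames; `indicator=True` is a standard merge option that adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical tables whose key columns are named differently
orders = pd.DataFrame({'order_id': [1, 2, 3],
                       'cust': ['c1', 'c2', 'c4']})
customers = pd.DataFrame({'customer_id': ['c1', 'c2', 'c3'],
                          'city': ['cs', 'bj', 'sh']})

merged = pd.merge(orders, customers, how='outer',
                  left_on='cust', right_on='customer_id',
                  indicator=True)

# _merge is 'both', 'left_only' or 'right_only' for each row
print(merged)
```

Filtering on `_merge` is a quick way to find rows that failed to match on either side.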

pd.concat([data1, data2], axis=0, join='outer')

[data1, data2]: the pandas objects to concatenate

axis: the axis to concatenate along

join: how to handle the other axis, 'outer' or 'inner'

df1 = pd.DataFrame(np.arange(6).reshape(3,2),index=list('abc'),columns=['one','two'])

df2 = pd.DataFrame(np.arange(4).reshape(2,2)+5,index=list('ac'),columns=['three','four'])

pd.concat([df1,df2]) # defaults: outer join, axis=0 (older pandas versions sorted the columns alphabetically)
   one  two  three  four
a  0.0  1.0    NaN   NaN
b  2.0  3.0    NaN   NaN
c  4.0  5.0    NaN   NaN
a  NaN  NaN    5.0   6.0
c  NaN  NaN    7.0   8.0

pd.concat([df1,df2],axis='columns') # concatenate along axis=1
  one two three   four
a 0   1   5.0     6.0
b 2   3   NaN     NaN
c 4   5   7.0     8.0

# The join can also be set to inner
pd.concat([df1,df2],axis=1,join='inner')

  one two three   four
a 0   1   5       6
c 4   5   7       8
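
When rows are stacked with concat, the original row labels are kept, which often produces repeated labels. Two standard options deal with this; the sketch below uses made-up frames `q1` and `q2`.

```python
import pandas as pd

q1 = pd.DataFrame({'fruit': ['apple', 'banana'], 'price': [7.6, 9.6]})
q2 = pd.DataFrame({'fruit': ['orange'], 'price': [3.6]})

# ignore_index=True discards the old labels and renumbers from 0
renumbered = pd.concat([q1, q2], ignore_index=True)

# keys=... adds an outer index level recording the source of each row
labelled = pd.concat([q1, q2], keys=['q1', 'q2'])

print(renumbered)
print(labelled.loc['q2'])
```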

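pandas: Duplicate Index Labels

In the examples so far every index has been unique, but a pandas index does not have to be. With duplicate labels, selection behaves a little differently: indexing by a duplicated label returns a Series rather than a scalar. The is_unique attribute reports whether the labels of an index are unique, as a boolean.

# Build a Series with a repeated index label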
s1 = pd.Series(range(5),index=list('abcda'))
s1
a    0
b    1
c    2
d    3
a    4
dtype: int64
# Indexing by label returns a Series
s1['a']  # returns every value labeled 'a'
a    0
a    4
dtype: int64
    
# Build a DataFrame with repeated row and column labels
df1 = pd.DataFrame(np.random.rand(4,3),index=list('abca'),columns=list('ABA'))
df1
 A           B           A
a    0.550783    0.855232    0.159598
b    0.497246    0.196161    0.499925
c    0.157437    0.038210    0.808912
a    0.245046    0.212659    0.757890
df1['A']
 A           A
a    0.550783    0.159598
b    0.497246    0.499925
c    0.157437    0.808912
a    0.245046    0.757890

df1.loc['a']
 A           B           A
a    0.550783    0.855232    0.159598
a    0.245046    0.212659    0.757890

# With a lot of data, how can we tell whether the index has duplicates?
df1.index.is_unique
False

df1.columns.is_unique
False
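
Once is_unique reports duplicates, `Index.duplicated` can locate them. This is a small sketch on the same kind of Series as above; keeping the first occurrence is an arbitrary choice made for illustration.

```python
import pandas as pd

s = pd.Series(range(5), index=list('abcda'))

# True for every label that has already appeared earlier in the index
dup_mask = s.index.duplicated(keep='first')

# Keep only the first occurrence of each label
deduped = s[~dup_mask]

print(dup_mask)
print(deduped)
```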

pandas: Sorting

sort_index: sorts by the row or column index. Do not confuse it with reindex(), which conforms the data to a new index

# Series
s1 = pd.Series(np.arange(4),index=list('dbca'))
s1
d    0
b    1
c    2
a    3
dtype: int32

s1.sort_index() # ascending by default
a    3
b    1
c    2
d    0
dtype: int32
    
s1.sort_index(ascending=False) # ascending=False sorts in descending order
d    0
c    2
b    1
a    3
dtype: int32

# DataFrame
pd1=pd.DataFrame(np.arange(12).reshape(4,3),index=list('bdca'),columns=list('BCA'))
pd1
 B   C   A
b    0   1   2
d    3   4   5
c    6   7   8
a    9   10  11

# Sort by the row index
pd1.sort_index()
 B   C   A
a    9   10  11
b    0   1   2
c    6   7   8
d    3   4   5

# To sort by the column labels, specify the axis: axis=1 or axis='columns'
pd1.sort_index(axis=1)
 A   B   C
b    2   0   1
d    5   3   4
c    8   6   7
a    11  9   10
# As before, pass ascending=False for descending order

sort_values: sorts by value

# Series
s2 = pd.Series([5,6,np.nan,1,-1])
s2
0    5.0
1    6.0
2    NaN
3    1.0
4   -1.0
dtype: float64

s2.sort_values() # sorts by value; missing values go last by default
4   -1.0
3    1.0
0    5.0
1    6.0
2    NaN
dtype: float64

# DataFrame: sorting a DataFrame by value requires the by argument
pd2 = pd.DataFrame({'a':[3,7,9,0],'b':[1,-1,4,8],'c':[0,6,-3,2]})
pd2
 a   b   c
0    3   1   0
1    7   -1  6
2    9   4   -3
3    0   8   2

pd2.sort_values(by='b') # sort by column b
 a   b   c
1    7   -1  6
0    3   1   0
2    9   4   -3
3    0   8   2
# Sort by several columns by passing a list of column names
pd2.sort_values(by=['a','c'])
 a   b   c
3    0   8   2
0    3   1   0
1    7   -1  6
2    9   4   -3
# ascending=False again gives descending order
pd2.sort_values(by=['a','c'],ascending=False)
 a   b   c
2    9   4   -3
1    7   -1  6
0    3   1   0
3    0   8   2
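
sort_values offers two further controls that the examples above do not use: `ascending` accepts one flag per column, and `na_position` decides where missing values go. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [2, 1, 2, 1], 'b': [4, 9, 5, 3]})

# Ascending on 'a', and descending on 'b' within each value of 'a'
mixed = df.sort_values(by=['a', 'b'], ascending=[True, False])

# Put missing values first instead of last
s = pd.Series([5.0, np.nan, 1.0])
nan_first = s.sort_values(na_position='first')

print(mixed)
print(nan_first)
```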

pandas: Unique Values and Membership

unique(): returns the unique values of a Series

s1 = pd.Series([2,6,8,9,8,3,6])
s2 = s1.unique()
s2 # an array is returned
array([2, 6, 8, 9, 3], dtype=int64)

value_counts: counts how often each value occurs in a Series

s1.value_counts() # returns a Series
6    2
8    2
3    1
2    1
9    1
dtype: int64
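# The index on the left holds the values of the Series; the right-hand side holds their counts

isin: tests membership element by element and returns booleans

s1.isin([8]) # which elements of s1 equal 8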
0    False
1    False
2     True
3    False
4     True
5    False
6    False
dtype: bool
# Testing against several values
s1.isin([8,2]) # which elements of s1 equal 8 or 2
0     True
1    False
2     True
3    False
4     True
5    False
6    False
dtype: bool
  
# DataFrame
data = pd.DataFrame({'a':[1,2,3,2],'b':[2,4,3,4],'c':[2,3,2,1]})
data.isin([2,4]) # which elements of data equal 2 or 4
 a       b       c
0    False   True    True
1    True    True    False
2    False   False   True
3    True    True    False

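pandas: Arithmetic and Data Alignment

An important feature of pandas is that objects with different indexes can take part in arithmetic together (addition, subtraction, multiplication, division, ...)

# Start with an example: adding objects whose indexes differ

# Series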
s1 = pd.Series(np.random.rand(4),index=['a','b','c','d'])*100
s2 = pd.Series(np.random.rand(4),index=['a','d','e','f'])*100

s1
a    72.868475
b    86.868903
c    74.800287
d    59.727382
dtype: float64

s2
a    90.683489
d    34.670433
e    12.572728
f    82.626782
dtype: float64
# Add s1 and s2
s1 + s2
a    163.551964
b           NaN
c           NaN
d     94.397815
e           NaN
f           NaN
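dtype: float64
"""
When two Series are combined arithmetically, values with matching labels are operated on, and labels present in only one Series produce missing values after alignment. The index of the result is the union of the two indexes, so the Series do not need to be the same size
"""

# DataFrame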
df1 = pd.DataFrame(np.random.rand(12).reshape(4,3)*100, index=['a','b','c','d'],columns=list('ABC'))

df2 = pd.DataFrame(np.random.rand(9).reshape(3,3)*100, index=['a','d','f'],columns=list('ABD'))

df1
 A           B           C
a    26.869112   70.599906   32.586599
b    46.996796   19.524614   74.472748
c    94.605620   94.174812   29.406223
d    93.409041   56.094164   21.006926

df2
     A       B           D
a    81.508766   61.112238   94.539634
d    7.888379    90.192787   3.874301
f    8.415423    52.471031   43.082653

df1+df2
 A           B           C      D
a    108.377878  131.712144   NaN    NaN
b    NaN         NaN         NaN    NaN
c    NaN         NaN         NaN    NaN
d    101.297420  146.286951   NaN    NaN
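f    NaN         NaN         NaN    NaN
"""
Arithmetic between two different DataFrames returns a DataFrame whose row index and columns are the unions of those of the operands. Values with matching labels are operated on; labels present in only one operand produce missing values after alignment
"""

Arithmetic methods with a fill value:

As the example above shows, when a label exists in one object but not in the other, the result has a missing value there. The arithmetic methods in the table below can fill such gaps.

Arithmetic methods:

Method	Description
add, radd	addition (+)
sub, rsub	subtraction (-)
div, rdiv	division (/)
floordiv, rfloordiv	floor division (//)
mul, rmul	multiplication (*)
pow, rpow	exponentiation (**)

# Redo the example above with the arithmetic methods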
s1.add(s2) 
a    163.551964
b           NaN
c           NaN
d     94.397815
e           NaN
f           NaN
dtype: float64
# This is equivalent to s1 + s2. The arithmetic methods take a fill_value argument that fills in missing values
s1.add(s2,fill_value=0)
a    163.551964
b     86.868903
c     74.800287
d     94.397815
e     12.572728
f     82.626782
dtype: float64

    
df1.add(df2)
 A           B           C      D
a    108.377878  131.712144   NaN    NaN
b    NaN         NaN         NaN    NaN
c    NaN         NaN         NaN    NaN
d    101.297420  146.286951   NaN    NaN
f    NaN         NaN         NaN    NaN
    
df1.add(df2,fill_value=0)
 A           B           C           D
a    108.377878  131.712144  32.586599   94.539634
b    46.996796   19.524614   74.472748   NaN
c    94.605620   94.174812   29.406223   NaN
d    101.297420  146.286951  21.006926   3.874301
f    8.415423    52.471031   NaN         43.082653

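# With fill_value, a value that has no counterpart in the other object is combined with the fill value instead
# In other words, a label missing from one object is treated as fill_value there; if it is missing from both, the result is still NaN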
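
The rule is easier to check with small fixed numbers than with random data. The sketch below uses made-up Series; label 'c' is missing on both sides (NaN in one, NaN in the other), so it stays NaN even with a fill value.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0, np.nan], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, np.nan], index=['a', 'c'])

plain = s1 + s2                     # 'b' exists only in s1, so the result is NaN
filled = s1.add(s2, fill_value=0)   # 'b' is combined with 0 instead

print(plain)
print(filled)
```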



# Every method in the table has a counterpart starting with r, which swaps the operands
100 / df1 # 100 divided by df1
     A       B           C
a    3.721746    1.416432    3.068746
b    2.127805    5.121740    1.342773
c    1.057020    1.061855    3.400641
d    1.070560    1.782717    4.760335

df1.rdiv(100) # also 100 divided by df1
     A       B           C
a    3.721746    1.416432    3.068746
b    2.127805    5.121740    1.342773
c    1.057020    1.061855    3.400641
d    1.070560    1.782717    4.760335
# The two forms are equivalent



# When reindexing, labels that do not exist are filled with NaN by default; the fill_value argument lets us choose the fill value ourselves
df1.reindex(columns=list('ABCE'),fill_value=0)
 A           B           C         E
a    26.869112   70.599906   32.586599   0
b    46.996796   19.524614   74.472748   0
c    94.605620   94.174812   29.406223   0
d    93.409041   56.094164   21.006926   0

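Operations between a DataFrame and a Series

# Arithmetic between a DataFrame and a Series works like NumPy operations between arrays of different dimensions
# The index of the Series is matched against the columns of the DataFrame and the operation is broadcast down the rows, just like broadcasting a 1D array against a 2D array in NumPy
df1 # the same df1 as above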
 A           B           C
a    26.869112   70.599906   32.586599
b    46.996796   19.524614   74.472748
c    94.605620   94.174812   29.406223
d    93.409041   56.094164   21.006926

s3 = df1.loc['a'] # row 'a' of df1, as a Series
s4 = df1.A
# Operate on the DataFrame and the Series
df1 + s3
 A           B           C
a    53.738224   141.199812  65.173199
b    73.865908   90.124520   107.059347
c    121.474731  164.774718  61.992822
d    120.278153  126.694070  53.593525

df1 + s4
 A   B   C   a   b   c   d
a    NaN NaN NaN NaN NaN NaN NaN
b    NaN NaN NaN NaN NaN NaN NaN
c    NaN NaN NaN NaN NaN NaN NaN
d    NaN NaN NaN NaN NaN NaN NaN
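# This shows that the index of the Series is matched against the columns of the DataFrame and broadcast down the rows


# To match on the row index instead and broadcast across the columns:
# 1. Use an arithmetic method with axis=0 or axis='index'; axis names the axis to match on
s4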
a    26.869112
b    46.996796
c    94.605620
d    93.409041
Name: A, dtype: float64

df1.add(s4,axis='index')
 A           B           C
a    53.738224   97.469018   59.455711
b    93.993592   66.521410   121.469543
c    189.211239  188.780432  124.011843
d    186.818082  149.503204  114.415967

# 2. Transpose the DataFrame, do the arithmetic, then transpose back to the original layout
df1.T
 a           b           c           d
A    26.869112   46.996796   94.605620   93.409041
B    70.599906   19.524614   94.174812   56.094164
C    32.586599   74.472748   29.406223   21.006926

(df1.T+s4).T
 A           B           C
a    53.738224   97.469018   59.455711
b    93.993592   66.521410   121.469543
c    189.211239  188.780432  124.011843
d    186.818082  149.503204  114.415967
# Both approaches work. The transpose route is roundabout, so the arithmetic method with an explicit axis is recommended
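
A common use of matching on the row index is subtracting each row's mean from that row. The sketch below uses a small made-up frame so the result can be checked by hand.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6.0).reshape(2, 3),
                  index=['x', 'y'], columns=list('ABC'))

# One mean per row, indexed by the row labels
row_means = df.mean(axis='columns')

# axis='index' matches row_means against the row labels of df
centered = df.sub(row_means, axis='index')
print(centered)
```

Writing `df - row_means` instead would match the Series against the column labels and give all NaN, as in the `df1 + s4` example.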

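pandas: Statistics

Common statistical methods:

Method	Description
count	number of non-NA values
describe	summary statistics for each column
sum	sum
mean	mean
std	standard deviation
var	variance
min	minimum
max	maximum
idxmin	index label of the minimum
idxmax	index label of the maximum
median	median
pct_change	percent change

arr1 = np.random.rand(4,3)
pd1 = pd.DataFrame(arr1,columns=list('ABC'),index=list('abcd'))
f = lambda x: '%.2f'% x
pd2 = pd1.applymap(f).astype(float) # format to two decimals, then convert back to float; pd1.round(2) is a simpler alternative
pd2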
pd2
 A       B       C
a    0.87    0.26    0.67
b    0.69    0.89    0.17
c    0.94    0.33    0.04
d    0.35    0.46    0.29

pd2.sum() # by default each column is summed over all its rows
A    2.85
B    1.94
C    1.17
dtype: float64
    
pd2.sum(axis='columns') # sum each row across all its columns
a    1.80
b    1.75
c    1.31
d    1.10
dtype: float64
    
pd2.idxmax() # index label of the maximum in each column; pass axis='columns' to get the column label of each row's maximum
A    c
B    b
C    a
dtype: object

pd2.describe() # summary statistics
     A           B       C
count    4.000000    4.00000 4.000000
mean 0.712500    0.48500 0.292500
std      0.263613    0.28243 0.271585
min      0.350000    0.26000 0.040000
25%      0.605000    0.31250 0.137500
50%      0.780000    0.39500 0.230000
75%      0.887500    0.56750 0.385000
max      0.940000    0.89000 0.670000

# Percent change: (current - previous) / previous
pd2.pct_change() # change from one row to the next; pass axis='columns' for the change from one column to the next
 A           B           C
a    NaN         NaN         NaN
b    -0.206897   2.423077    -0.746269
c    0.362319    -0.629213   -0.764706
d    -0.627660   0.393939    6.250000

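pandas: Grouping and Aggregation

What is group-by aggregation? The data is split into groups by key, a function is applied to each group, and the results are combined.
groupby(by=None, as_index=True)

by: what to group by; determines the groups

as_index: for aggregated output, return an object with the group labels as the index (DataFrame input only)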

df1 = pd.DataFrame({'fruit':['apple','banana','orange','apple','banana'],
                    'color':['red','yellow','yellow','cyan','cyan'],
                   'price':[8.5,6.8,5.6,7.8,6.4]})
# Check the type
type(df1.groupby('fruit'))
pandas.core.groupby.groupby.DataFrameGroupBy  # a GroupBy object (the module path varies across pandas versions); iterating over it yields (group name, data chunk) 2-tuples
for name, group in df1.groupby('fruit'):
    print(name)        # the group name
    print(group)       # the data chunk
    print(type(group)) # each chunk is a DataFrame
apple
   fruit color  price
0  apple   red    8.5
3  apple  cyan    7.8
<class 'pandas.core.frame.DataFrame'>
banana
    fruit   color  price
1  banana  yellow    6.8
4  banana    cyan    6.4
<class 'pandas.core.frame.DataFrame'>
orange
    fruit   color  price
2  orange  yellow    5.6
<class 'pandas.core.frame.DataFrame'>

# Select any single chunk
dict(list(df1.groupby('fruit')))['apple']  # the chunk for the 'apple' group
   fruit color  price
0  apple   red    8.5
3  apple  cyan    7.8

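Aggregation

Function	Description
count	number of non-NA values in the group
sum	sum of non-NA values
mean	mean of non-NA values
median	median of non-NA values
std, var	standard deviation and variance
min, max	minimum and maximum of non-NA values
prod	product of non-NA values
first, last	first and last non-NA values

# GroupBy objects provide the aggregation methods in the table above

# Mean price for each fruit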
df1['price'].groupby(df1['fruit']).mean()
fruit
apple     8.15
banana    6.60
orange    5.60
Name: price, dtype: float64     
# or
df1.groupby('fruit')['price'].mean()

#as_index=False
df1.groupby('fruit',as_index=False)['price'].mean()
  fruit   price
0 apple   8.15
1 banana  6.60
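2 orange  5.60

"""
Suppose we now need the price spread (maximum minus minimum) for each fruit.
1. None of the aggregation functions in the table does this, so we need a custom aggregation function
2. We apply that custom function to the grouped object
"""
# A function that computes the spread
def diff_value(arr):
    return arr.max() - arr.min()
# Pass a custom aggregation function to agg (or aggregate). Custom functions are generally much slower than the built-in ones in the table because of the extra function calls and data rearrangement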
df1.groupby('fruit')['price'].agg(diff_value)
fruit
apple     0.7
banana    0.4
orange    0.0
Name: price, dtype: float64
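
agg also accepts a list of built-in function names, which allows the same spread to be derived from built-in aggregations alone. The following is a sketch on the same fruit data, rebuilt here so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange', 'apple', 'banana'],
                   'price': [8.5, 6.8, 5.6, 7.8, 6.4]})

# Several built-in aggregations at once: one output column per function
summary = df.groupby('fruit')['price'].agg(['min', 'max', 'mean'])

# The spread via a custom function ...
spread = df.groupby('fruit')['price'].agg(lambda x: x.max() - x.min())

# ... and the same quantity from the built-in results
builtin_spread = summary['max'] - summary['min']

print(summary)
print(builtin_spread)
```

On large data the second route is usually preferable, since it avoids calling a Python function once per group.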

pandas: Reading and Writing Files

CSV files

Reading a CSV file: read_csv(filepath_or_buffer, usecols, encoding). filepath_or_buffer: the file path; usecols: the columns to read; encoding: the file encoding

data = pd.read_csv('d:/test_data/food_rank.csv',encoding='utf8')
data.head()
 name    num
0    酥油茶 219.0
1    青稞酒 95.0
2    酸奶  62.0
3    糌粑  16.0
4    琵琶肉 2.0

# Read only the named columns
data = pd.read_csv('d:/test_data/food_rank.csv',usecols=['name'])
data.head()
 name
0    酥油茶
1    青稞酒
2    酸奶
3    糌粑
4    琵琶肉

# If the file path contains Chinese characters, older pandas versions may need engine='python'
data = pd.read_csv('d:/数据/food_rank.csv',engine='python',encoding='utf8')
data.head()
 name    num
0    酥油茶 219.0
1    青稞酒 95.0
2    酸奶  62.0
3    糌粑  16.0
4    琵琶肉 2.0
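# It is best to avoid Chinese characters in file paths and file names

Writing a CSV file

DataFrame: to_csv(path_or_buf, sep, columns, header, index, na_rep, mode)
path_or_buf: where to save the file, default None (the CSV text is then returned as a string)
sep: the field separator, default ','
columns: the columns to write, default None (all columns)
header: whether to write the column names, default True
index: whether to write the row index, default True
na_rep: the string that represents missing values, default ''
mode: the file mode, default 'w'; use 'a' to append

Series: Series.to_csv(path_or_buf=None, sep=',', na_rep='', header=True, index=True, mode='w', encoding=None) # header defaulted to False in older pandas versions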
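
The effect of index and na_rep is easiest to see by writing to an in-memory buffer and reading the text back. This is a minimal sketch with made-up data; `StringIO` stands in for a real file path.

```python
from io import StringIO

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b'], 'num': [1.5, np.nan]})

# Write without the row index, marking missing values as NULL
buf = StringIO()
df.to_csv(buf, index=False, na_rep='NULL')
csv_text = buf.getvalue()
print(csv_text)

# Read it back, telling read_csv that NULL means missing
restored = pd.read_csv(StringIO(csv_text), na_values=['NULL'])
print(restored)
```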

JSON files

Reading a JSON file: read_json(path_or_buf, orient, typ, lines, encoding). lines: boolean, default False, read the file as one JSON object per line; typ: default 'frame', the type of object to build, 'series' or 'frame'

orient: the expected layout of the JSON string

'split' : dict like {index : [index], columns : [columns], data : [values]}
'records' : list like [{column : value}, ... , {column : value}]
'index' : dict like {index : {column : value}}
'columns' : dict like {column : {index : value}} (the default)
'values' : just the values array

# The split layout must look like this: JSON with an index, the column names, and a data matrix
from io import StringIO  # recent pandas versions expect a file-like object rather than a raw JSON string
s1 = '{"index":[1,2,3],"columns":["A","B"],"data":[[1,2],[4,5],[7,8]]}'
pd.read_json(StringIO(s1),orient='split')
 A   B
1    1   2
2    4   5
3    7   8


# records: a list of dicts, each mapping column name to value; every dict becomes one row of the DataFrame
s2 = '[{"title":"人生","price":19.8,"author":"路遥"},{"title":"人生哲思录","price":49.8,"author":"周国平"}]'
pd.read_json(StringIO(s2),orient='records')
 title      price   author
0    人生       19.8    路遥
1    人生哲思录   49.8    周国平


data3 = pd.read_json('d:/test_data/p2p_data.json',orient='records',lines=True,encoding='utf8') # specify the encoding when the data contains Chinese text
data3.head()
 上线时间    出问题原因   城市         平台名称     问题发生时间
0    2015-12-01  平台清盘    浙江省杭州市  京圆柚理财   2018-08-18
1    2017-07-01  争议平台    浙江省杭州市  聚富蛙      2018-08-17
2    2017-08-01  平台展期    广东省深圳市  索星金服     2018-08-17
3    2014-12-19  平台展期    广东省深圳市  中融投      2018-08-17
4    --          平台清盘    浙江省杭州市  众源理财    2018-08-16

Writing a JSON file: to_json(path_or_buf=None, orient=None, lines=False)
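
When path_or_buf is left as None, to_json returns the JSON text, which makes the effect of orient and lines easy to inspect. The sketch below uses made-up data and reads the line-delimited form back through `StringIO`.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({'title': ['t1', 't2'], 'price': [19.8, 49.8]})

# One JSON array holding one object per row
records_text = df.to_json(orient='records')

# One JSON object per line (requires orient='records')
lines_text = df.to_json(orient='records', lines=True)

restored = pd.read_json(StringIO(lines_text), orient='records', lines=True)

print(records_text)
print(lines_text)
print(restored)
```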

pandas: apply and applymap

apply(func, axis): applies func to the one-dimensional data of each column or each row

# apply on a DataFrame
# Build a DataFrame
df1 = pd.DataFrame({'Tom':{'English':88,'Math':68},
                  'Joke':{'English':68,'Math':98},
                  'Mabuqi':{'English':58,'Math':48},
                  'Ohio':{'English':48,'Math':78}})

df1
     Tom Joke Mabuqi Ohio
English  88   68   58      48
Math 68   98   48      78
# Two tasks: compute each student's average score, and compute the average score for each subject. How?
# 1. Average score per student
f1 = lambda x: x.mean()
df1.apply(f1)
Tom       78.0
Joke      83.0
Mabuqi    53.0
Ohio      63.0
dtype: float64
# 2. Average score per subject: specify axis='columns' or axis=1
df1.apply(f1,axis='columns')
English    65.5
Math       73.0
dtype: float64
# The function passed to DataFrame.apply receives a Series

# Series
s1 = pd.Series(['Tom','Joke','Mabuqi','Ohio'])
s1
0       Tom
1      Joke
2    Mabuqi
3      Ohio
dtype: object
# Task: filter out names of 3 characters or fewer
f2 = lambda x: len(x)>3
s1.apply(f2) # gives a boolean Series
0    False
1     True
2     True
3     True
dtype: bool
    
s1[s1.apply(f2)] # keeps only the names longer than 3 characters
1      Joke
2    Mabuqi
3      Ohio
dtype: object 
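# The function passed to Series.apply receives each value of the Series

applymap(func): applies func to every element of a DataFrame (pandas 2.1 and later rename it to DataFrame.map)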

df2 = pd.DataFrame(np.random.rand(4,3))
df2

 0           1           2
0    0.386162    0.178801    0.283059
1    0.386132    0.765665    0.256415
2    0.829829    0.052328    0.344845
3    0.849074    0.949973    0.306215
# Task: keep only two decimal places. How?
f3 = lambda x:'%.2f' % x
df2.applymap(f3) # note that the result holds strings, not floats
 0       1       2
0    0.39    0.18    0.28
1    0.39    0.77    0.26
2    0.83    0.05    0.34
3    0.85    0.95    0.31

Summary:
- DataFrame.applymap() and Series.apply() pass each individual value to the function
- DataFrame.apply() passes a Series: one row or one column of the DataFrame
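
The two calling conventions can be compared side by side. The sketch below uses made-up numbers and avoids applymap itself (whose name differs across pandas versions) by mapping over each column; it also shows that string formatting yields text, whereas `round` keeps the values numeric.

```python
import pandas as pd

df = pd.DataFrame({'x': [0.386162, 0.829829], 'y': [0.178801, 0.052328]})

# Element-wise: the inner function receives one value at a time
as_text = df.apply(lambda col: col.map(lambda v: '%.2f' % v))

# Keeps floats instead of producing strings
as_numbers = df.round(2)

# Column-wise: the function receives a whole column as a Series
col_range = df.apply(lambda col: col.max() - col.min())

print(as_text)
print(as_numbers)
print(col_range)
```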

😊 These are my study notes; I hope they help anyone who needs them!
