02-pandas-Ⅰ

2018-12-12 郑元吉

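I. What is pandas

1. pandas (the Python Data Analysis Library) is a tool built on NumPy, created to handle data analysis tasks.
2. pandas incorporates a number of libraries and some standard data models, and provides the tools needed to work with large datasets efficiently.
3. pandas provides many functions and methods for processing data quickly and conveniently.
4. It is one of the main reasons Python is a powerful and efficient environment for data analysis.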

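II. Imports

Series is an array-like, one-dimensional data structure.
DataFrame is a tabular data structure.
Much like an Excel sheet, a DataFrame organizes and processes data in rows and columns.

# pandas is built on numpy; the two are almost always used together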
import numpy as np
import pandas as pd
from pandas import Series,DataFrame

III. Series

3.1 Creating a Series

3.1.1 From a list or numpy array

The default index is an integer index running from 0 to N-1.

obj = Series([1,2,3,4])
print(obj)

Output:
0    1
1    2
2    3
3    4
dtype: int64

# The index parameter sets the labels explicitly
obj2 = Series([1,2,3,4],index=['a','b','c','d'])
obj2
Output:
a    1
b    2
c    3
d    4
dtype: int64

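Note that in the pandas version used for this post, a Series built from an ndarray references the array's data rather than copying it, so changing an element of the Series also changes the original ndarray. (A Series built from a list does not behave this way. Newer pandas releases with Copy-on-Write enabled copy the array by default, so the output below may differ there.)
a = np.array([1,2,3])
obj = Series(a)
obj
Output: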
0    1
1    2
2    3
dtype: int64


obj[0]=0
print(a)
print(obj)
Output:
[0 2 3]
0    0
1    2
2    3
dtype: int64
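
If the shared data described above is not wanted, one option is to ask the constructor for a copy. The following is a minimal sketch, assuming the copy parameter of the Series constructor; the variable name independent is only illustrative.

```python
import numpy as np
import pandas as pd

a = np.array([1, 2, 3])

# copy=True asks the constructor for its own copy of the data,
# so later writes to the Series cannot reach the original array.
independent = pd.Series(a, copy=True)
independent.iloc[0] = 0

print(a)                      # the array still starts with 1
print(independent.tolist())   # [0, 2, 3]
# False: the Series and the array use separate buffers
print(np.shares_memory(a, independent.to_numpy()))
```

Requesting the copy explicitly gives the same result whether or not the installed pandas version shares memory by default.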

3.1.2 From a dictionary

obj = Series({'a':1,'b':2})
obj

Output:
a    1
b    2
dtype: int64

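3.2 Indexing and slicing a Series

3.2.1 Explicit (label-based) indexing:

- Use elements of the index as keys
- Use .loc[] (recommended)

Note that label slices are closed intervals: both endpoints are included.

obj = Series({'a':10,'b':12,'c':17})
obj.loc['a':'b']
Output:
a    10
b    12
dtype: int64

obj['a']
Output:
10

obj['a':'c']
Output: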
a    10
b    12
c    17
dtype: int64

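3.2.2 Implicit (position-based) indexing:

- Use integer positions as keys
- Use .iloc[] (recommended)

Note that position slices are half-open intervals: the stop position is excluded.

obj[0:1]
Output: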
a    10
dtype: int64


obj.iloc[0]
Output:
10


obj.iloc[0:1]
Output:
a    10
dtype: int64
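
The closed versus half-open distinction is easy to mix up, so here is a small side-by-side sketch. The Series t with an integer index is an added illustration, not part of the original example.

```python
import pandas as pd

s = pd.Series([10, 12, 17], index=['a', 'b', 'c'])

by_label = s.loc['a':'b']      # label slice: both endpoints included
by_position = s.iloc[0:1]      # position slice: the stop is excluded

print(by_label.tolist())       # [10, 12]
print(by_position.tolist())    # [10]

# With an integer index, labels and positions can disagree,
# which is why spelling out .loc or .iloc helps.
t = pd.Series([10, 12, 17], index=[1, 2, 3])
print(t.loc[1])                # 10, the value labelled 1
print(t.iloc[1])               # 12, the value at position 1
```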

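3.3 Basic concepts of Series

- A Series can be thought of as a fixed-length, ordered dictionary.
- The shape, size, index, and values attributes describe a Series.
- head() and tail() give a quick look at the first and last entries.
- When an index label has no corresponding value, the entry is missing and shows as NaN (not a number).
- pd.isnull() and pd.notnull(), or the isnull() and notnull() methods, detect missing data.
- Both the Series object and its index have a name attribute.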
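
The attributes and helpers listed above can be tried on a small Series. This is only a sketch; the values and the names 'score' and 'key' are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 4, np.nan], index=['a', 'b', 'c'], name='score')

print(s.shape)             # (3,)
print(s.size)              # 3
print(list(s.index))       # ['a', 'b', 'c']
print(s.values)            # the underlying numpy array
print(s.head(2))           # first two entries
print(s.tail(1))           # last entry

# Missing-value checks: the top-level function and the method agree.
print(pd.isnull(s).tolist())   # [False, False, True]
print(s.notnull().tolist())    # [True, True, False]

# Both the Series and its index carry a name.
s.index.name = 'key'
print(s.name, s.index.name)    # score key
```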

obj = Series([10,4,np.nan])
# notnull() returns False where a value is missing
notnull = pd.notnull(obj)
# Set every missing value to 0
for i,d in enumerate(notnull):
    if not d:
        obj[i] = 0
print(obj)

obj.isnull()

obj.name='123'
print(obj)
# The index has its own name attribute
obj.index.name = "Hello World"
print(obj.index.name)

Output:
0    10.0
1     4.0
2     0.0
dtype: float64
0    10.0
1     4.0
2     0.0
Name: 123, dtype: float64
Hello World

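3.4 Series operations

Array operations that work on numpy arrays also work on Series.
In operations between two Series:
- data is aligned automatically on the index labels
- labels present in only one Series produce NaN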

A = pd.Series([2,4,6],index=[0,1,2])
B = pd.Series([1,3,5],index=[1,2,3])
display(A,B)

Output:
0    2
1    4
2    6
dtype: int64
1    1
2    3
3    5
dtype: int64


A+B
Output:
0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

Note: to get a value for every index label instead of NaN, use the .add() method with fill_value

A.add(B,fill_value=0)

Output:
0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
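
Section 3.4 also states that numpy array operations apply to Series, which the example above does not show. A brief sketch with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])

roots = np.sqrt(s)         # a numpy ufunc applied element-wise; the index is kept
doubled = s * 2            # scalar broadcasting
large = s[s > 2]           # boolean mask

print(roots.tolist())      # [1.0, 2.0, 3.0]
print(doubled.tolist())    # [2.0, 8.0, 18.0]
print(list(large.index))   # ['b', 'c']

# Alignment is by label, not by position: 'a' pairs with 'a', 'b' with 'b'.
other = pd.Series([100.0, 10.0], index=['b', 'a'])
total = s.add(other, fill_value=0)
print(total['a'], total['b'], total['c'])   # 11.0 104.0 9.0
```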

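IV. DataFrame

4.1 Creating a DataFrame

The most common way is to pass a dictionary. The DataFrame uses the dictionary's keys as the column names and the dictionary's values (each an array) as the column data.
A DataFrame also adds an index for the rows automatically (just as a Series does).
As with Series, if a requested column does not match any key in the dictionary, its values are NaN.

import pandas as pd
from pandas import Series,DataFrame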
data = {'color':['blue','green','yellow','red','white'],
       'object':['ball','pen','pencil','paper','mug'],
       'price':[1.2,1.0,0.6,0.9,1.7]}
frame = DataFrame(data,columns=['color','object','price','weight'],
                 index = ['one','two','three','four','five'])
frame

Output:

        color  object  price weight
one      blue    ball    1.2    NaN
two     green     pen    1.0    NaN
three  yellow  pencil    0.6    NaN
four      red   paper    0.9    NaN
five    white     mug    1.7    NaN

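4.2 Indexing a DataFrame

4.2.1 Indexing columns

- Dictionary-style access
- Attribute-style access
Either way returns the column as a Series. The Series has the same index as the original DataFrame, and its name attribute is already set to the column name.


print(frame)
frame['color']

Output: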
        color  object  price weight
one      blue    ball    1.2    NaN
two     green     pen    1.0    NaN
three  yellow  pencil    0.6    NaN
four      red   paper    0.9    NaN
five    white     mug    1.7    NaN

Out[14]:
one        blue
two       green
three    yellow
four        red
five      white
Name: color, dtype: object

Method 2, attribute access: frame.color

Output:
one        blue
two       green
three    yellow
four        red
five      white
Name: color, dtype: object

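4.2.2 Indexing rows

- .ix[] was used for row indexing in older pandas; it was deprecated in 0.20 and removed in 1.0, so use .loc[] or .iloc[] instead
- Use .loc[] with an index label
- Use .iloc[] with an integer position
Each returns a Series whose index is the original columns.

frame.loc['one']  # the original post used frame.ix['one']

Output: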
color     blue
object    ball
price      1.2
weight     NaN
Name: one, dtype: object

type(frame.loc['one'])  # the original post used frame.ix['one']
Output:
Out[52]:
pandas.core.series.Series


frame.loc["two"]
Output:
color     green
object      pen
price         1
weight      NaN
Name: two, dtype: object

print(frame)
frame.iloc[0:10]  # a slice that runs past the last row is clipped
Output:
        color  object  price weight
one      blue    ball    1.2    NaN
two     green     pen    1.0    NaN
three  yellow  pencil    0.6    NaN
four      red   paper    0.9    NaN
five    white     mug    1.7    NaN
Out[41]:
        color  object  price weight
one      blue    ball    1.2    NaN
two     green     pen    1.0    NaN
three  yellow  pencil    0.6    NaN
four      red   paper    0.9    NaN
five    white     mug    1.7    NaN
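
Because .ix no longer exists in current pandas, here is a short sketch of row selection using only .loc and .iloc. The three-row frame is a cut-down stand-in for the one above.

```python
import pandas as pd

frame = pd.DataFrame(
    {'color': ['blue', 'green', 'yellow'],
     'price': [1.2, 1.0, 0.6]},
    index=['one', 'two', 'three'],
)

row_by_label = frame.loc['one']          # a Series indexed by the column names
row_by_position = frame.iloc[0]          # the same row, selected by position
rows = frame.loc[['one', 'three']]       # a list of labels returns a DataFrame

print(row_by_label['color'])             # blue
print(row_by_position.name)              # one
print(list(rows.index))                  # ['one', 'three']
print(frame.iloc[0:10].shape)            # (3, 2): the slice is clipped to 3 rows
```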

4.2.3 Indexing individual elements

- Index the column first
- Index the row first
- Use the values attribute (a two-dimensional numpy array)

print(frame)
print("Column first")
print(frame['color']['one'])
print(frame.color['one'])
print("Row first")
print(frame.loc['one']['color'])  # the original post used frame.ix['one']
print(frame.loc['one','color'])
print(frame.iloc[0][0:2])
print("Using values")
print(frame.values[0])  # first row; the original wrote frame.values[[0][0]], which evaluates to the same thing
print(frame.values[0][1:3])

Output:
        color  object  price weight
one      blue    ball    1.2    NaN
two     green     pen    1.0    NaN
three  yellow  pencil    0.6    NaN
four      red   paper    0.9    NaN
five    white     mug    1.7    NaN
Column first
blue
blue
Row first
blue
blue
color     blue
object    ball
Name: one, dtype: object
Using values
['blue' 'ball' 1.2 nan]
['ball' 1.2]

[Note] With plain square brackets, a single key selects a column, while a slice selects rows.

Column indexing:
print(frame['color'])

Output:
one        blue
two       green
three    yellow
four        red
five      white
Name: color, dtype: object


Slicing selects rows:

frame['one':'two']
Output:
     color object  price weight
one   blue   ball    1.2    NaN
two  green    pen    1.0    NaN
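
The bracket rule (a key selects a column, a slice selects rows) can be checked directly. A minimal sketch on a cut-down frame; the try/except only demonstrates that a row label is not accepted as a plain key.

```python
import pandas as pd

frame = pd.DataFrame(
    {'color': ['blue', 'green', 'yellow'],
     'price': [1.2, 1.0, 0.6]},
    index=['one', 'two', 'three'],
)

col = frame['color']               # single key: a column, returned as a Series
first_two = frame['one':'two']     # label slice: rows, both ends included
cell = frame.loc['two', 'price']   # row and column selected in one step

print(col.name)                    # color
print(list(first_two.index))       # ['one', 'two']
print(cell)                        # 1.0

# A row label used as a plain key is looked up among the columns and fails.
try:
    frame['one']
    row_label_as_key_failed = False
except KeyError:
    row_label_as_key_failed = True
print(row_label_as_key_failed)     # True
```

Using .loc with both a row and a column label avoids the ambiguity and also avoids chaining two lookups.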

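4.3 DataFrame operations

4.3.1 Operations between DataFrames

As with Series:
- data is aligned automatically on both the index and the columns
- labels that do not match produce NaN

A = DataFrame(np.random.randint(0,20,(2,2)),columns = list('ab'))
A
Output: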
    a       b
0   8       9
1   4       7

B = DataFrame(np.random.randint(0,10,(3,3)),columns = list('abc'))
B

Output:
    a   b   c
0   6   3   8
1   2   7   4
2   6   7   4

A+B

Output:
        a       b           c
0       14.0    12.0        NaN
1       6.0     14.0        NaN
2       NaN     NaN         NaN

A.add(B,fill_value=0)  # a cell missing from only one frame is treated as 0

Output:
      a     b    c
0  14.0  12.0  8.0
1   6.0  14.0  4.0
2   6.0   7.0  4.0

Convert to int:
A.add(B,fill_value=0).astype('int')

Output:

    a   b  c
0  14  12  8
1   6  14  4
2   6   7  4
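
Since A and B above come from np.random, the results are hard to reproduce. The sketch below hard-codes the values shown in the outputs of A and B so the effect of fill_value can be checked by hand.

```python
import numpy as np
import pandas as pd

# Same values as the A and B displayed above, written out explicitly.
A = pd.DataFrame([[8, 9], [4, 7]], columns=list('ab'))
B = pd.DataFrame([[6, 3, 8], [2, 7, 4], [6, 7, 4]], columns=list('abc'))

plain = A + B                      # a cell missing from either frame becomes NaN
filled = A.add(B, fill_value=0)    # a cell missing from one frame counts as 0

print(plain)
print(filled)
print(filled.astype('int'))
```

fill_value stands in only for the side that is missing; a cell absent from both frames would still be NaN.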

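4.3.2 Operations between a Series and a DataFrame

[Important]
With Python operators, the Series index is matched against the DataFrame's columns, so the Series must be a row, and the operation applies to every row. (This resembles an operation between a 2-D and a 1-D numpy array, except that mismatched labels produce NaN.)
With pandas methods such as add():
  axis=0: the Series is matched against the row index, so it must be a column; the operation applies to every column.
  axis=1: the Series is matched against the columns, so it must be a row; the operation applies to every row.

df
Output:

         Python     数学    物理     英语
Michael    13.0   15.0  74.5   12.0
张三         69.0  130.0  43.5   47.5
李四         95.5   67.5   6.0  114.5
王五        105.5   81.5  64.0   99.0

s2 = df.loc['张三']
s2
Output:
    Python     69.0
    数学        130.0
    物理         43.5
    英语         47.5
    Name: 张三, dtype: float64

# s2 is a row, so axis=1 aligns it with the columns and adds it to every row
df.add(s2,axis = 1)
Output:

         Python     数学     物理     英语
Michael    82.0  145.0  118.0   59.5
张三        138.0  260.0   87.0   95.0
李四        164.5  197.5   49.5  162.0
王五        174.5  211.5  107.5  146.5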
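
The example above covers only axis=1. The sketch below contrasts both axes and the plain operator on a two-row frame; the labels 'Zhang' and 'Math' and all values are illustrative stand-ins, not the original data.

```python
import pandas as pd

df = pd.DataFrame(
    {'Python': [13.0, 69.0], 'Math': [15.0, 130.0]},
    index=['Michael', 'Zhang'],
)
row = df.loc['Zhang']      # indexed by the column names
col = df['Python']         # indexed by the row labels

by_row = df.add(row, axis=1)   # row matched to the columns, added to every row
by_col = df.add(col, axis=0)   # col matched to the row index, added to every column
mismatch = df + col            # operator matches col against the columns: no overlap

print(by_row)
print(by_col)
print(mismatch.isna().all().all())   # True: every cell is NaN
```

The last line shows the NaN case mentioned above: the operator always aligns on columns, so a column-shaped Series finds no matching labels.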

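V. Handling missing data in a DataFrame

5.1 Two kinds of missing data

5.1.1 None

None is a built-in Python object, so an array that contains it has dtype object. None cannot take part in arithmetic.

Operations on object arrays are much slower than on int arrays:
%timeit sum_int = np.arange(1E6,dtype=int).sum()
%timeit sum_float = np.arange(1E6,dtype=float).sum()
%timeit sum_object = np.arange(1E6,dtype=object).sum()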

5.1.2 np.nan (NaN)

np.nan is a floating-point value, so it can take part in arithmetic, but the result of that arithmetic is always NaN.
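
The difference between the two kinds of missing value shows up as soon as an array is summed. A small sketch using plain numpy; np.nansum is brought in here as the NaN-aware alternative and is not mentioned in the original text.

```python
import numpy as np

with_none = np.array([1, None, 3])       # dtype falls back to object
with_nan = np.array([1, np.nan, 3])      # dtype is float64

print(with_none.dtype, with_nan.dtype)   # object float64

# Summing the object array fails because 1 + None is not defined.
try:
    with_none.sum()
    none_sum_failed = False
except TypeError as exc:
    none_sum_failed = True
    print('TypeError:', exc)

print(with_nan.sum())                    # nan: NaN propagates through arithmetic
print(np.nansum(with_nan))               # 4.0: the NaN-aware variant skips NaN
```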

5.2 None and NaN in pandas

5.2.1 pandas treats both None and np.nan as missing values

a = Series([1,np.nan,2,None])  # in numeric data, None is converted to NaN
a

Output:
0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

5.2.2 Working with None and np.nan in pandas

isnull(), notnull()

data = Series([1,np.nan,'hello',None])
data

Output:
0        1
1      NaN
2    hello
3     None
dtype: object


data.isnull()

Output:
0    False
1     True
2    False
3     True
dtype: bool

data[data.notnull()]

Output:
0        1
2    hello
dtype: object

dropna(): drop missing data

df = DataFrame([[1,np.nan,2],
               [2,3,7],
               [np.nan,4,6],
               [314,299,1024]],columns = ['昨天','今天','明天'],
               index = ['吃饭','睡觉','过家家','小桥流水人家'])
df

Output:
               昨天     今天     明天
吃饭           1.0     NaN      2
睡觉           2.0     3.0      7
过家家         NaN     4.0      6
小桥流水人家   314.0   299.0   1024


df.dropna()

Output:
               昨天     今天     明天
睡觉           2.0     3.0      7
小桥流水人家   314.0   299.0   1024
The axis parameter chooses whether rows or columns are dropped (rows by default)

df.dropna(axis=1)

Output:
            明天
吃饭          2
睡觉          7
过家家        6
小桥流水人家  1024

df.dropna(axis=0)  # the default: drop rows
Output:
             昨天    今天    明天
睡觉          2.0    3.0      7
小桥流水人家  314.0   299.0   1024



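The how parameter chooses the drop condition: 'any' (the default) drops a row with at least one NaN, 'all' drops only a row that is entirely NaN

df[3]=np.nan
display(df)
df.dropna(axis='index',how='all')  # no row is entirely NaN, so nothing is dropped

Output: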
              昨天     今天     明天     3
吃饭           1.0     NaN      2      NaN
睡觉           2.0     3.0      7      NaN
过家家         NaN     4.0      6      NaN
小桥流水人家   314.0   299.0   1024     NaN
Out[44]:
               昨天     今天     明天    3
吃饭           1.0     NaN      2      NaN
睡觉           2.0     3.0      7      NaN
过家家         NaN     4.0      6      NaN
小桥流水人家   314.0   299.0   1024     NaN
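
In the example above how='all' drops nothing, so the effect is hard to see. The sketch below uses a small made-up frame in which one row is entirely NaN; thresh and subset are two further dropna parameters added here for comparison, not part of the original text.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'x': [1.0, np.nan, np.nan],
     'y': [2.0, 5.0, np.nan],
     'z': [3.0, 6.0, np.nan]},
    index=['r1', 'r2', 'r3'],
)

any_nan = df.dropna()                  # how='any' (default): drop rows with any NaN
all_nan = df.dropna(how='all')         # drop only rows that are entirely NaN
at_least_two = df.dropna(thresh=2)     # keep rows with at least 2 non-missing values
by_subset = df.dropna(subset=['y'])    # only look at column y when deciding

print(list(any_nan.index))        # ['r1']
print(list(all_nan.index))        # ['r1', 'r2']
print(list(at_least_two.index))   # ['r1', 'r2']
print(list(by_subset.index))      # ['r1', 'r2']
```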

fillna(): fill missing data

data = Series([1,np.nan,2,None,4],index = list('abcdf'))
data

Output:
a    1.0
b    NaN
c    2.0
d    NaN
f    4.0
dtype: float64



data.fillna(10)
Output:
a     1.0
b    10.0
c     2.0
d    10.0
f     4.0
dtype: float64

Forward fill

'''method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
    Method to use for filling holes in reindexed Series
    pad / ffill: propagate last valid observation forward to next valid
    backfill / bfill: use NEXT valid observation to fill gap'''
display(data)
data.fillna(method='ffill')  # pandas 2.1 deprecates method=; data.ffill() is the newer spelling

Output:
a    1.0
b    NaN
c    2.0
d    NaN
f    4.0
dtype: float64
Out[60]:
a    1.0
b    1.0
c    2.0
d    2.0
f    4.0
dtype: float64

axis=0: fill down the rows, so each NaN takes the value from the row above

display(df)
df.fillna(method='ffill',axis=0)  # along the rows

Output:
               昨天     今天     明天    3
吃饭           1.0     NaN      2      NaN
睡觉           2.0     3.0      7      NaN
过家家         NaN     4.0      6      NaN
小桥流水人家   314.0   299.0   1024     NaN


Out[68]:
               昨天     今天     明天    3
吃饭           1.0     NaN      2      NaN
睡觉           2.0     3.0      7      NaN
过家家         2.0     4.0      6      NaN
小桥流水人家   314.0   299.0   1024     NaN

Backward fill

data.fillna(method='bfill')

Output:
Out[64]:
a    1.0
b    2.0
c    2.0
d    4.0
f    4.0
dtype: float64

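axis=1: fill across the columns, so each NaN takes the value from the next column in its row

import numpy as np
import pandas as pd
from pandas import Series,DataFrame
df = DataFrame(index = ['吃饭','睡觉','写代码','人生'],columns = ["今天","昨天","明天"],
               data = np.random.randint(0,1024,(4,3)),dtype="object")
display(df)
print(df.loc['睡觉'])
# Assign through .loc in one step; chained assignment such as
# df.loc["睡觉"][::2] = np.nan may write to a temporary copy instead
df.loc["睡觉", ["今天","明天"]] = np.nan
df.loc[["吃饭","写代码"], "今天"] = np.nan
display(df)
df.fillna(method='bfill',axis=1)  # along the columns


Output: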
      今天    昨天    明天
吃饭   673    893     436
睡觉   632    660     783
写代码 783    112      89
人生   1020   619     698

今天    632
昨天    660
明天    783
Name: 睡觉, dtype: object

      今天    昨天    明天
吃饭   NaN    893     436
睡觉   NaN    660     NaN 
写代码 NaN    112      89
人生   1020   619     698

Out[39]:
      今天    昨天    明天
吃饭   893    893     436.0
睡觉   660    660     NaN
写代码 112    112      89.0
人生   1020   619     698.0

For a DataFrame you also choose the axis to fill along. Remember, for a DataFrame:
axis=0: index/rows
axis=1: columns
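
To pull the fill options together, here is a sketch that uses the ffill() and bfill() methods rather than fillna(method=...), which recent pandas versions deprecate. The small frame and the fill-with-column-mean step are illustrative additions.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 2, None, 4], index=list('abcdf'))

forward = s.ffill()        # each gap takes the previous valid value
backward = s.bfill()       # each gap takes the next valid value
constant = s.fillna(10)    # each gap takes a fixed value

print(forward.tolist())    # [1.0, 1.0, 2.0, 2.0, 4.0]
print(backward.tolist())   # [1.0, 2.0, 2.0, 4.0, 4.0]
print(constant.tolist())   # [1.0, 10.0, 2.0, 10.0, 4.0]

df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, 5.0, 6.0]})

down = df.ffill(axis=0)            # axis=0: fill from the row above
across = df.bfill(axis=1)          # axis=1: fill from the next column in the row
col_means = df.fillna(df.mean())   # fill each column with its own mean

print(down)
print(across)
print(col_means)
```

A gap with no valid neighbour in the fill direction stays NaN, as with the first row of y under the forward fill.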