Pandas

2018-01-20 Gaius_Yao

1. Introduction to pandas

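To quote the official introduction: "pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language." I won't go into more detail here. Readers who want a quick overview can try the official tutorial Ten_Minutes_to_Pandas (a Chinese translation is available). Let's go straight to a worked example that shows what pandas can do. First, import the relevant libraries: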

# Import the relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import datetime
import re

2. Reading files

This section reads CSV files as its main example and lists three commonly used data-reading functions:

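df = pd.read_csv('file.csv')  # the path is the first positional argument
df = pd.read_json('file.json')  # older pandas also accepts a JSON string here; newer versions expect it wrapped in io.StringIO
df = pd.read_excel('file.xls', sheet_name=[0, 1])  # a list of sheets returns a dict of DataFrames; the keyword was sheetname before pandas 0.21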

# Load the datasets
dc = pd.read_csv('data/dc.csv')
marvel = pd.read_csv('data/marvel.csv')
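
To try the reading functions without any files on disk, the sketch below feeds them in-memory text through io.StringIO. The sample rows and column names are made up for illustration and are not taken from data/dc.csv or data/marvel.csv.

```python
import io

import pandas as pd

# Hypothetical sample data, not the real dataset contents
csv_text = "name,YEAR\nBatman,1939\nSuperman,1938\n"
json_text = '[{"name": "Batman", "YEAR": 1939}, {"name": "Superman", "YEAR": 1938}]'

# read_csv and read_json accept file-like objects as well as paths
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv.shape)
print(df_json.columns.tolist())
```

Wrapping the text in io.StringIO keeps the example independent of the working directory, and it is also the form newer pandas versions expect for JSON strings.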

3. Inspecting a DataFrame

After loading the datasets, we can inspect a DataFrame with the following methods and attributes:

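df.info()  # summary of the DataFrame (dtypes, non-null counts, memory usage)
df.describe()  # descriptive statistics
df.columns  # column names
df.index  # index
df.head()  # first five rows of the DataFrame
df.tail()  # last five rows of the DataFrame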

dc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6896 entries, 0 to 6895
Data columns (total 13 columns):
page_id             6896 non-null int64
name                6896 non-null object
urlslug             6896 non-null object
ID                  4883 non-null object
ALIGN               6295 non-null object
EYE                 3268 non-null object
HAIR                4622 non-null object
SEX                 6771 non-null object
GSM                 64 non-null object
ALIVE               6893 non-null object
APPEARANCES         6541 non-null float64
FIRST APPEARANCE    6827 non-null object
YEAR                6827 non-null float64
dtypes: float64(2), int64(1), object(10)
memory usage: 700.5+ KB

marvel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16376 entries, 0 to 16375
Data columns (total 13 columns):
page_id             16376 non-null int64
name                16376 non-null object
urlslug             16376 non-null object
ID                  12606 non-null object
ALIGN               13564 non-null object
EYE                 6609 non-null object
HAIR                12112 non-null object
SEX                 15522 non-null object
GSM                 90 non-null object
ALIVE               16373 non-null object
APPEARANCES         15280 non-null float64
FIRST APPEARANCE    15561 non-null object
Year                15561 non-null float64
dtypes: float64(2), int64(1), object(10)
memory usage: 1.6+ MB

dc.columns

Index(['page_id', 'name', 'urlslug', 'ID', 'ALIGN', 'EYE', 'HAIR', 'SEX',
       'GSM', 'ALIVE', 'APPEARANCES', 'FIRST APPEARANCE', 'YEAR'],
      dtype='object')

marvel.columns

Index(['page_id', 'name', 'urlslug', 'ID', 'ALIGN', 'EYE', 'HAIR', 'SEX',
       'GSM', 'ALIVE', 'APPEARANCES', 'FIRST APPEARANCE', 'Year'],
      dtype='object')

# Make the column names consistent ('Year' in marvel vs. 'YEAR' in dc)
marvel.rename(columns={'Year':'YEAR'}, inplace = True)
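
As a small aside on the rename step, the sketch below shows what happens when inplace=True is left out: rename() returns a new DataFrame and the original keeps its column names. The two-row frame is hypothetical and only mimics the 'Year' column of marvel.

```python
import pandas as pd

# Hypothetical two-row frame whose column is spelled 'Year', as in marvel
df = pd.DataFrame({"name": ["Thor", "Hulk"], "Year": [1950.0, 1962.0]})

# Without inplace=True, rename() returns a new DataFrame and leaves df unchanged
renamed = df.rename(columns={"Year": "YEAR"})

print(df.columns.tolist())
print(renamed.columns.tolist())
```

Assigning the result to a name is an alternative to inplace=True and makes it easy to keep the original frame around.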

dc.index

RangeIndex(start=0, stop=6896, step=1)

dc.head()  # output omitted

marvel.tail()  # output omitted
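
The non-null counts printed by info() can also be turned around into missing-value counts. The following is a minimal sketch on a made-up three-row frame (the values are assumptions, not rows from the datasets), using isna() and sum().

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing values
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "EYE": ["Blue Eyes", np.nan, np.nan],
    "YEAR": [1990.0, np.nan, 2001.0],
})

# isna() marks missing cells; summing the booleans counts them per column
missing_per_column = df.isna().sum()
print(missing_per_column)
```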

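4. Handling missing values

In the previous step we saw that a fair amount of data in the DataFrames is missing (shown as NaN). The dropna() function removes rows that contain missing data, but here we want to keep those rows, so we fill the missing values with fillna() instead: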

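# Fill missing values in the EYE column of marvel with 'UnKnown'
# Note: fillna() returns a new Series; assign it back to marvel['EYE'] to keep the change
marvel['EYE'].fillna('UnKnown')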

0         Hazel Eyes
1          Blue Eyes
2          Blue Eyes
3          Blue Eyes
4          Blue Eyes
5          Blue Eyes
6         Brown Eyes
7         Brown Eyes
8         Brown Eyes
9          Blue Eyes
10         Blue Eyes
11         Blue Eyes
12        Green Eyes
13         Blue Eyes
14         Blue Eyes
15         Blue Eyes
16         Grey Eyes
17        Green Eyes
18         Blue Eyes
19        Brown Eyes
20         Blue Eyes
21         Blue Eyes
22         Blue Eyes
23         Blue Eyes
24        Green Eyes
25        Brown Eyes
26         Blue Eyes
27        Green Eyes
28        Green Eyes
29       Yellow Eyes
            ...     
16346        UnKnown
16347        UnKnown
16348        UnKnown
16349     White Eyes
16350        UnKnown
16351        UnKnown
16352        UnKnown
16353        UnKnown
16354        UnKnown
16355        UnKnown
16356        UnKnown
16357        UnKnown
16358        UnKnown
16359     Black Eyes
16360     Black Eyes
16361        UnKnown
16362       Red Eyes
16363     Black Eyes
16364     Hazel Eyes
16365        UnKnown
16366     Brown Eyes
16367     Hazel Eyes
16368        UnKnown
16369      Blue Eyes
16370        UnKnown
16371     Green Eyes
16372      Blue Eyes
16373     Black Eyes
16374        UnKnown
16375        UnKnown
Name: EYE, Length: 16376, dtype: object

dc['EYE'].fillna('UnKnown')  # output omitted
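
To make the difference between dropna() and fillna() concrete, here is a minimal sketch on a hypothetical three-row frame (the rows are invented for illustration). It also shows that the filled Series has to be assigned back for the change to persist.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the values are illustrative only
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "EYE": ["Blue Eyes", np.nan, "Red Eyes"],
})

# dropna() returns a copy without the rows that contain a missing value
dropped = df.dropna()

# fillna() returns a new Series, so assign it back to keep the result
df["EYE"] = df["EYE"].fillna("UnKnown")

print(len(dropped))
print(df["EYE"].tolist())
```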

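5. Adding and inserting rows and columns

Next we add a COMPANY column to both DataFrames and show how to insert rows and columns. Inserting a row is slightly more involved: the DataFrame has to be split first and then concatenated back together.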

marvel['COMPANY'] = 'Marvel'

dc['COMPANY'] = 'DC'

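# Check that the column was added
dc.head()  # output omitted

# Remove the page_id column and keep it as a Series
page_id = dc.pop('page_id')

# Check that the column was removed
dc.head()  # output omitted

# Insert the page_id column back at position 0
dc.insert(0, 'page_id', page_id)

# Check that the column was inserted
dc.head()  # output omitted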
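
The pop() and insert() round trip can be checked on a tiny frame. The sketch below uses a hypothetical two-row DataFrame with the same column names as above and records the column order after each step.

```python
import pandas as pd

# Hypothetical frame that reuses the column names from the text
df = pd.DataFrame({"page_id": [1, 2], "name": ["A", "B"], "COMPANY": ["DC", "DC"]})

# pop() removes the column from df and returns it as a Series
page_id = df.pop("page_id")
columns_after_pop = df.columns.tolist()

# insert() puts the column back at position 0, modifying df in place
df.insert(0, "page_id", page_id)

print(columns_after_pop)
print(df.columns.tolist())
```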

# Build the row to insert as a one-row DataFrame
insertRow = pd.DataFrame([[42, 'Gaius', 'Unknown', 'Secret Identity', 'Good Characters','Blake Eyes', 'Black Hair', 'Male Characters', 'Unknown', 'Living Characters', '2333.0', '1995, August', '1995', 'Unknown']],
                         columns=['page_id', 'name', 'urlslug', 'ID', 'ALIGN', 'EYE', 'HAIR', 'SEX', 'GSM', 'ALIVE', 'APPEARANCES', 'FIRST APPEARANCE', 'YEAR', 'COMPANY'])
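# Split dc into above and below (.loc slices by label and includes both endpoints)
above = dc.loc[:2]
below = dc.loc[3:]
# Concatenate above, the new row, and below (DataFrame.append was removed in pandas 2.0)
dc = pd.concat([above, insertRow, below], ignore_index=True)
dc.head()  # output omitted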
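
The split-and-concatenate steps can be wrapped in a small helper. The function name insert_row and the sample data below are hypothetical and not part of pandas. This is a sketch that slices by position with iloc rather than by label, on the assumption that the row should go before a given position.

```python
import pandas as pd


def insert_row(df, position, row):
    """Hypothetical helper: return a new DataFrame with `row` (a dict) inserted before `position`."""
    new_row = pd.DataFrame([row], columns=df.columns)
    # iloc slices by position, so this works regardless of the index labels
    above = df.iloc[:position]
    below = df.iloc[position:]
    return pd.concat([above, new_row, below], ignore_index=True)


# Made-up data for illustration
df = pd.DataFrame({"name": ["A", "B", "C"], "YEAR": [1990, 1991, 1992]})
result = insert_row(df, 1, {"name": "X", "YEAR": 2000})
print(result["name"].tolist())
```

Because pd.concat builds a new object, the original df is left untouched and the result gets a fresh 0..n-1 index through ignore_index=True.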

6. Combining DataFrames

Use the concat() function to combine DataFrames: pd.concat(list), where list contains the DataFrames to combine.

comic = pd.concat([dc, marvel], ignore_index=True)

comic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23273 entries, 0 to 23272
Data columns (total 14 columns):
page_id             23273 non-null int64
name                23273 non-null object
urlslug             23273 non-null object
ID                  23273 non-null object
ALIGN               23273 non-null object
EYE                 23273 non-null object
HAIR                23273 non-null object
SEX                 23273 non-null object
GSM                 23273 non-null object
ALIVE               23273 non-null object
APPEARANCES         23273 non-null object
FIRST APPEARANCE    23273 non-null object
YEAR                23273 non-null object
COMPANY             23273 non-null object
dtypes: int64(1), object(13)
memory usage: 2.5+ MB

comic.head()  # output omitted
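
Two details of concat() are easy to miss: what ignore_index does, and what happens when column names differ, which is why the 'Year' column was renamed earlier. The sketch below illustrates both on two hypothetical frames.

```python
import pandas as pd

# Made-up frames; b spells its year column differently on purpose
a = pd.DataFrame({"name": ["A", "B"], "YEAR": [1990, 1991]})
b = pd.DataFrame({"name": ["C"], "Year": [2000]})

# Without ignore_index the original labels are kept, so 0 appears twice
kept = pd.concat([a, b])
# With ignore_index=True the result gets a fresh 0..n-1 index
fresh = pd.concat([a, b], ignore_index=True)

print(kept.index.tolist())
print(fresh.index.tolist())
print(fresh.columns.tolist())
```

Columns that exist in only one input are kept and filled with NaN for the rows from the other input, so mismatched names silently produce extra, half-empty columns.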

7. Exporting data

Use the to_csv() function to export the data:

comic.to_csv('data/comic_characters.csv')
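
By default to_csv() also writes the index as an extra, unnamed first column. Whether that is wanted depends on how the file will be read back. A minimal sketch on a hypothetical frame, writing to a string instead of a file:

```python
import io

import pandas as pd

# Made-up frame for illustration
df = pd.DataFrame({"name": ["A", "B"], "COMPANY": ["DC", "Marvel"]})

# Called without a path, to_csv() returns the CSV text;
# by default the index is written as an unnamed first column
with_index = df.to_csv()
# index=False writes only the data columns
without_index = df.to_csv(index=False)

# Reading the index-free text back gives the original columns
round_trip = pd.read_csv(io.StringIO(without_index))
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```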

At this point we have used pandas to combine the two datasets into one. The next installment will use seaborn to visualize this dataset.