Data Analysis (Part 2): pandas

2018-01-25, by 哈喽小生

I: Overview

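pandas is a data analysis package for Python. It provides a large set of functions and methods for working with data quickly and conveniently, and it is one of the main reasons Python has become a powerful and efficient environment for data analysis.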

II: pandas data structures

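1. Series: a one-dimensional array, similar to a one-dimensional NumPy array. Both resemble Python's built-in list. The difference is that a list can freely hold elements of different types, whereas an array or a Series stores its values under a single dtype, which uses memory more efficiently and speeds up computation. (Mixed values are accepted, but the dtype then falls back to the generic object and those benefits are lost.)
2. Time Series: a Series indexed by time.
3. DataFrame: a two-dimensional, tabular data structure. Much of its functionality resembles data.frame in R. A DataFrame can be thought of as a container of Series.
4. Panel: a three-dimensional array that can be thought of as a container of DataFrames. Panel was deprecated in pandas 0.20 and removed in 0.25.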
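
As a rough illustration of the structures listed above, the sketch below checks how a Series settles on one dtype, what a time-indexed Series looks like, and how a DataFrame column comes back as a Series. The variable names are made up for this example.

```python
import pandas as pd

# A Series built from integers gets a single integer dtype (typically int64).
ints = pd.Series([1, 2, 3])
print(ints.dtype)

# Mixed Python values are accepted, but the dtype falls back to object.
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)

# A Series whose index is made of timestamps is a time series.
ts = pd.Series([1.0, 2.0, 3.0], index=pd.date_range("2018-01-24", periods=3))
print(type(ts.index).__name__)

# Each column of a DataFrame is itself a Series.
df = pd.DataFrame({"x": ints, "y": mixed})
print(type(df["x"]).__name__)
```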

III: Using pandas

Creating a Series from a list-like object (here a range)

>>> import numpy as np
>>> import pandas as pd
>>> n1 = pd.Series(range(10,30))
>>> n1
0     10
1     11
2     12
3     13
4     14
5     15
6     16
7     17
8     18
9     19
10    20
11    21
12    22
13    23
14    24
15    25
16    26
17    27
18    28
19    29
dtype: int64
>>> n1.head(3)  # first three rows
0    10
1    11
2    12
dtype: int64

Building a Series from a dict

>>> n2 = {'python':100,'php':90,'java':80}  # create a dict
>>> s1 = pd.Series(n2)  # older pandas sorts the keys, as below; 0.23+ on Python 3.6+ keeps insertion order
>>> s1
java       80
php        90
python    100
dtype: int64
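
Because the key order of a dict-built Series depends on the pandas version, it can help to state the order explicitly. The following is a minimal sketch; the extra key "go" is a made-up label used only to show what happens when a key is missing.

```python
import pandas as pd

scores = {"python": 100, "php": 90, "java": 80}

# index= selects and orders the keys explicitly. A label that is not in the
# dict ("go" here) becomes NaN, which turns the dtype into float.
s = pd.Series(scores, index=["java", "php", "python", "go"])
print(s)

# sort_index() gives alphabetical order regardless of the pandas version.
alphabetical = pd.Series(scores).sort_index()
print(alphabetical)
```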

Getting the values and the index

>>> s1.values
array([ 80,  90, 100], dtype=int64)
>>> s1.index
Index(['java', 'php', 'python'], dtype='object')
>>> type(s1.values)
<class 'numpy.ndarray'>
>>> type(s1.index)
<class 'pandas.core.indexes.base.Index'>

Getting data by index label, and naming the Series and its index

>>> s1['java']
80
>>> s1.name = 'score'
>>> s1.index.name='subject'
>>> s1.name
'score'
>>> s1.index.name
'subject'
>>> s1
subject
java       80
php        90
python    100
Name: score, dtype: int64
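
One place where these names become useful is when the Series is turned into a table. A small sketch, rebuilding the same data under hypothetical variable names:

```python
import pandas as pd

s = pd.Series({"java": 80, "php": 90, "python": 100}, name="score")
s.index.name = "subject"

# reset_index() turns the index into an ordinary column. The index name and
# the Series name become the two column labels.
table = s.reset_index()
print(list(table.columns))  # ['subject', 'score']
print(table)
```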

DataFrame: building a DataFrame from an ndarray

>>> n3 = np.random.rand(3,4)
>>> n3
array([[0.35242218, 0.64135929, 0.51317041, 0.81487747],
       [0.02967373, 0.80212644, 0.7703746 , 0.18438255],
       [0.0362117 , 0.53756172, 0.3605709 , 0.73973935]])

>>> s2 = pd.DataFrame(n3)
>>> s2
          0         1         2         3
0  0.352422  0.641359  0.513170  0.814877
1  0.029674  0.802126  0.770375  0.184383
2  0.036212  0.537562  0.360571  0.739739

Building a DataFrame from a dict

>>> n4 = {"A": 1,
...       "B": pd.Timestamp("20180124"),
...       "C": pd.Series(range(10, 14), dtype="float64"),
...       "D": ['Java', 'Python', 'C++', "php"],
...       "E": np.array([3] * 4)
... }

>>> s3 = pd.DataFrame(n4)
>>> s3
   A          B     C       D  E
0  1 2018-01-24  10.0    Java  3
1  1 2018-01-24  11.0  Python  3
2  1 2018-01-24  12.0     C++  3
3  1 2018-01-24  13.0     php  3
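
The table above shows two things worth checking directly: scalar values are repeated on every row, and each column keeps its own dtype. A short sketch that rebuilds the same frame as a script:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": 1,                                         # scalar, repeated on every row
    "B": pd.Timestamp("20180124"),                  # scalar timestamp, also repeated
    "C": pd.Series(range(10, 14), dtype="float64"),
    "D": ["Java", "Python", "C++", "php"],
    "E": np.array([3] * 4),
})

print(df.shape)   # (4, 5)
print(df.dtypes)  # one dtype per column
```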

Getting column data by column label

>>> s3['D']
0      Java
1    Python
2       C++
3       php
Name: D, dtype: object
>>> s3['A']
0    1
1    1
2    1
3    1
Name: A, dtype: int64
>>> type(s3['D'])
<class 'pandas.core.series.Series'>
>>> type(s3['A'])
<class 'pandas.core.series.Series'>
>>> type(s3['C'][2])
<class 'numpy.float64'>
>>> s3['D'][2]
'C++'

Adding columns

>>> s3['F'] = {'html':80,'js':90,'css':85,'xml':75}  # as the output shows, this stores the dict's keys, not its values
>>> s3
   A          B     C       D  E     F
0  1 2018-01-24  10.0    Java  3   xml
1  1 2018-01-24  11.0  Python  3    js
2  1 2018-01-24  12.0     C++  3   css
3  1 2018-01-24  13.0     php  3  html
>>> s3['G'] = 'haha'
>>> s3
   A          B     C       D  E     F     G
0  1 2018-01-24  10.0    Java  3   xml  haha
1  1 2018-01-24  11.0  Python  3    js  haha
2  1 2018-01-24  12.0     C++  3   css  haha
3  1 2018-01-24  13.0     php  3  html  haha
>>> s3['H'] = s3['C'] + 10
>>> s3
   A          B     C       D  E     F     G     H
0  1 2018-01-24  10.0    Java  3   xml  haha  20.0
1  1 2018-01-24  11.0  Python  3    js  haha  21.0
2  1 2018-01-24  12.0     C++  3   css  haha  22.0
3  1 2018-01-24  13.0     php  3  html  haha  23.0
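
When the goal is to fill a new column with the values of a dict, looking each row up in the dict is usually what is wanted. The sketch below uses Series.map for that and DataFrame.assign for a derived column. The scores dict and the column names score and bonus are hypothetical and only serve this example.

```python
import pandas as pd

df = pd.DataFrame({"D": ["Java", "Python", "C++", "php"]})

# Hypothetical lookup table: language name -> score.
scores = {"Java": 80, "Python": 100, "C++": 85}

# map() looks up each value of D in the dict. "php" has no entry, so it gets NaN.
df["score"] = df["D"].map(scores)

# assign() returns a new DataFrame and leaves df itself unchanged.
df2 = df.assign(bonus=df["score"] + 10)
print(df2)
```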

Deleting a column

>>> del s3['G']
>>> s3
   A          B     C       D  E     F     H
0  1 2018-01-24  10.0    Java  3   xml  20.0
1  1 2018-01-24  11.0  Python  3    js  21.0
2  1 2018-01-24  12.0     C++  3   css  22.0
3  1 2018-01-24  13.0     php  3  html  23.0
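
Besides del, pandas offers drop and pop for removing columns. A minimal sketch on a small made-up frame, showing which of them modify the original:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "G": ["haha", "haha"], "H": [20.0, 21.0]})

# drop() returns a copy without the column and leaves df untouched.
smaller = df.drop(columns=["G"])

# pop() removes the column in place and returns it as a Series.
h = df.pop("H")

print(list(smaller.columns))  # ['A', 'H']
print(list(df.columns))       # ['A', 'G']
print(h.tolist())             # [20.0, 21.0]
```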

Series indexing: naming the row index with index

>>> import numpy as np
>>> import pandas as pd
>>> m1 = pd.Series([1,2,3,4,5,6,7,8,9], index=['a','b','c','d','e','f','g','h','i'])
>>> m1
a    1
b    2
c    3
d    4
e    5
f    6
g    7
h    8
i    9
dtype: int64

Row indexing

>>> m1['c']
3
>>> m1.iloc[3]  # by position; plain m1[3] on a label index is deprecated in recent pandas
4
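
The difference between label and position matters most when the index itself holds integers. The sketch below uses the explicit accessors loc (label) and iloc (position); the Series n is a made-up example whose labels are deliberately out of order.

```python
import pandas as pd

m = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
print(m.loc["c"])  # by label    -> 3
print(m.iloc[3])   # by position -> 4

# With integer labels the two lookups can disagree.
n = pd.Series([10, 20, 30], index=[2, 1, 0])
print(n.loc[0])    # label 0    -> 30
print(n.iloc[0])   # position 0 -> 10
```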

Slicing

>>> m1['b':'e']  # label slice: both endpoints are included
b    2
c    3
d    4
e    5
dtype: int64

>>> m1[1:4]  # positional slice: the end is excluded
b    2
c    3
d    4
dtype: int64

>>> m1[['b','e','f']]  # non-contiguous labels
b    2
e    5
f    6
dtype: int64
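
The two slices above differ in how they treat the end point. A short sketch that spells this out with the explicit loc and iloc accessors, rebuilding the same Series under a different name:

```python
import pandas as pd

m = pd.Series(range(1, 10), index=list("abcdefghi"))

by_label = m.loc["b":"e"]   # label slice, "e" is included
by_position = m.iloc[1:4]   # positional slice, position 4 is excluded

print(list(by_label.index))     # ['b', 'c', 'd', 'e']
print(list(by_position.index))  # ['b', 'c', 'd']
```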

Boolean indexing

>>> m1>3
a    False
b    False
c    False
d     True
e     True
f     True
g     True
h     True
i     True
dtype: bool

>>> m1[m1>3]
d    4
e    5
f    6
g    7
h    8
i    9
dtype: int64
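
Boolean masks can also be combined. The sketch below assumes the same values as m1 and uses the element-wise operators & (and) and | (or); each comparison needs its own parentheses because these operators bind more tightly than > or <.

```python
import pandas as pd

m = pd.Series(range(1, 10), index=list("abcdefghi"))

# Values greater than 3 and less than 8.
middle = m[(m > 3) & (m < 8)]
print(middle.tolist())  # [4, 5, 6, 7]

# Values that are even or greater than 8.
even_or_big = m[(m % 2 == 0) | (m > 8)]
print(even_or_big.tolist())  # [2, 4, 6, 8, 9]
```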

>>> s1 = pd.DataFrame(np.random.randn(5,4),columns=['a','b','c','d'])
>>> s1
          a         b         c         d
0 -3.234718 -1.067102  0.253580  0.996599
1 -0.063340 -1.144646  0.196663 -0.066152
2  1.101328 -0.525478 -0.733888 -0.235543
3 -0.944941  0.178218  0.451760 -0.994914
4 -0.992325  0.176311 -0.439556 -0.291685

Naming the columns with columns (and the rows with index)

>>> s2 = pd.DataFrame(np.random.randn(5,4),index=['A','B','C','D','E'],columns=['a','b','c','d'])
>>> s2
          a         b         c         d
A  1.906068  0.133637 -0.036032  0.297770
B  0.864639  1.030426  0.283934 -0.346310
C -0.303209 -0.487499  0.665573 -1.040874
D -1.320571 -0.443308  0.049811 -0.701478
E -1.103991  1.046340 -0.134866 -0.594568

>>> s2['b']
A    0.133637
B    1.030426
C   -0.487499
D   -0.443308
E    1.046340
Name: b, dtype: float64

>>> s2['b']['D']
-0.44330773944073404
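
The lookup s2['b']['D'] works in two steps: it first selects the column, then the row. For reading or writing a single cell, loc and at do the same in one step. The sketch below uses a frame filled with np.arange instead of random numbers so that the result is predictable; that substitution is an assumption made for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20.0).reshape(5, 4),
                  index=list("ABCDE"), columns=list("abcd"))

chained = df["b"]["D"]      # column first, then row
via_loc = df.loc["D", "b"]  # row label, column label in one step
via_at = df.at["D", "b"]    # scalar accessor for a single cell
print(chained, via_loc, via_at)  # 13.0 13.0 13.0

# Writing a cell is best done through loc or at in a single step.
df.loc["D", "b"] = 0.0
print(df.at["D", "b"])  # 0.0
```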

>>> s2[['a','c']]
          a         c
A  1.906068 -0.036032
B  0.864639  0.283934
C -0.303209  0.665573
D -1.320571  0.049811
E -1.103991 -0.134866

>>> s2[['a','c','d']]  # non-contiguous columns
          a         c         d
A  1.906068 -0.036032  0.297770
B  0.864639  0.283934 -0.346310
C -0.303209  0.665573 -1.040874
D -1.320571  0.049811 -0.701478
E -1.103991 -0.134866 -0.594568

Label-based indexing with loc

>>> s2[['b','d']]
          b         d
A  0.133637  0.297770
B  1.030426 -0.346310
C -0.487499 -1.040874
D -0.443308 -0.701478
E  1.046340 -0.594568
>>> s2.loc['A':'D','b':'d']  # label slice on rows and columns, endpoints included
          b         c         d
A  0.133637 -0.036032  0.297770
B  1.030426  0.283934 -0.346310
C -0.487499  0.665573 -1.040874
D -0.443308  0.049811 -0.701478
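
loc has a positional counterpart, iloc, and it also accepts a boolean mask for the rows. A minimal sketch, again assuming a frame filled with np.arange so the results can be checked by hand:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  index=list("ABCDE"), columns=list("abcd"))

by_label = df.loc["A":"D", "b":"d"]  # rows A..D and columns b..d, inclusive
by_position = df.iloc[0:4, 1:4]      # the same region by position, end excluded
print(by_label.equals(by_position))  # True

# Rows chosen by a condition, columns chosen by label.
picked = df.loc[df["a"] > 4, ["a", "c"]]
print(picked)
```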
