Data Analysis (Part 2): pandas
2018-01-25
哈喽小生
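I. Overview
- pandas is a Python data analysis package. It provides a large set of functions and methods for processing data quickly and conveniently, and it is one of the main reasons Python is a powerful and efficient environment for data analysis.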
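II. pandas data structures
1. Series: a one-dimensional labeled array, similar to a one-dimensional NumPy array. Both resemble Python's built-in list. The difference is that a list can hold elements of different types, whereas an array or a Series stores its elements under a single dtype, which uses memory more efficiently and speeds up computation.
2. Time Series: a Series indexed by time.
3. DataFrame: a two-dimensional, tabular data structure. Much of its functionality resembles R's data.frame. A DataFrame can be thought of as a container of Series.
4. Panel: a three-dimensional array that can be thought of as a container of DataFrames. Panel was deprecated and later removed from pandas, so it is not available in current releases.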
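As a rough illustration of the single-dtype point and of the "container of Series" idea, here is a minimal sketch. The variable names are illustrative and do not come from the original post.

```python
import pandas as pd

# A plain Python list may mix element types freely.
mixed = [1, "two", 3.0]

# A Series built from homogeneous data gets one concrete dtype.
ints = pd.Series([1, 2, 3])
print(ints.dtype)               # an integer dtype, typically int64

# Mixed data still works, but falls back to the generic object dtype.
objs = pd.Series(mixed)
print(objs.dtype)               # object

# Each column of a DataFrame is a Series sharing the same row index.
df = pd.DataFrame({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})
print(type(df["x"]).__name__)   # Series
```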
III. Using pandas
- Creating a Series from a list-like sequence
>>> import numpy as np
>>> import pandas as pd
>>> n1 = pd.Series(range(10,30))
>>> n1
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
10 20
11 21
12 22
13 23
14 24
15 25
16 26
17 27
18 28
19 29
dtype: int64
>>> n1.head(3)  # get the first three rows
0 10
1 11
2 12
dtype: int64
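For completeness, a small sketch of the companion calls: head() with no argument returns five rows, and tail() works from the other end. The data is the same range as in the session above.

```python
import pandas as pd

n1 = pd.Series(range(10, 30))   # same data as in the session above

print(len(n1))        # 20
print(n1.head())      # first five rows (the default is n=5)
print(n1.tail(3))     # last three rows: 27, 28, 29
```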
- Building a Series from a dict
>>> n2 = {'python':100,'php':90,'java':80}  # create a dict
>>> s1 = pd.Series(n2)
>>> s1
java 80
php 90
python 100
dtype: int64
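In the output above the keys come out in alphabetical order. Whether that happens seems to depend on the pandas and Python versions in use (recent versions generally keep the dict's insertion order), so it may be safer not to rely on it. The sketch below shows two ways to make the order explicit; the "go" key is a made-up example of a label missing from the dict.

```python
import pandas as pd

scores = {"python": 100, "php": 90, "java": 80}

# Passing index= selects and orders the keys. A label that is missing
# from the dict ("go" here, purely hypothetical) becomes NaN, which
# also turns the dtype into float.
s = pd.Series(scores, index=["java", "python", "go"])
print(s)

# sort_index() gives alphabetical order regardless of pandas version.
ordered = pd.Series(scores).sort_index()
print(ordered)
```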
- Getting the values and the index
>>> s1.values
array([ 80, 90, 100], dtype=int64)
>>> s1.index
Index(['java', 'php', 'python'], dtype='object')
>>> type(s1.values)
<class 'numpy.ndarray'>
>>> type(s1.index)
<class 'pandas.core.indexes.base.Index'>
- Getting data by index label
>>> s1['java']
80
>>> s1.name = 'score'
>>> s1.index.name='subject'
>>> s1.name
'score'
>>> s1.index.name
'subject'
>>> s1
subject
java 80
php 90
python 100
Name: score, dtype: int64
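A few related lookups that may be handy alongside `s1['java']`, sketched on the same data. The "ruby" label is a hypothetical missing key used only to show the fallback behaviour.

```python
import pandas as pd

s1 = pd.Series({"java": 80, "php": 90, "python": 100}, name="score")
s1.index.name = "subject"

print("java" in s1)          # membership tests the index labels: True
print(s1.get("ruby"))        # None instead of a KeyError for a missing label
print(s1.get("ruby", 0))     # or a caller-chosen default: 0
print(s1.to_numpy())         # the values as a NumPy array
print(s1.index.tolist())     # ['java', 'php', 'python']
```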
- DataFrame: building a DataFrame from an ndarray
>>> n3 = np.random.rand(3,4)
>>> n3
array([[0.35242218, 0.64135929, 0.51317041, 0.81487747],
[0.02967373, 0.80212644, 0.7703746 , 0.18438255],
[0.0362117 , 0.53756172, 0.3605709 , 0.73973935]])
>>> s2 = pd.DataFrame(n3)
>>> s2
0 1 2 3
0 0.352422 0.641359 0.513170 0.814877
1 0.029674 0.802126 0.770375 0.184383
2 0.036212 0.537562 0.360571 0.739739
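Without further arguments the rows and columns are simply numbered from 0. A possible variation, sketched below, names them explicitly. Note that it swaps `np.random.rand` for a seeded `np.random.default_rng` so the numbers are repeatable; the seed and the labels are arbitrary choices, not part of the original example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)          # seed chosen arbitrarily
data = rng.random((3, 4))               # 3 rows, 4 columns, values in [0, 1)

# index= names the rows, columns= names the columns.
df = pd.DataFrame(data, index=["r1", "r2", "r3"], columns=list("abcd"))
print(df.shape)      # (3, 4)
print(df)
```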
- Building a DataFrame from a dict
>>> n4 = { "A":1,
... "B":pd.Timestamp("20180124"),
... "C":pd.Series(range(10,14),dtype="float64"),
... "D":['Java','Python','C++',"php"],
... "E":np.array([3]*4)
... }
>>> s3 = pd.DataFrame(n4)
>>> s3
A B C D E
0 1 2018-01-24 10.0 Java 3
1 1 2018-01-24 11.0 Python 3
2 1 2018-01-24 12.0 C++ 3
3 1 2018-01-24 13.0 php 3
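What happens here is that the scalar values ("A" and "B") are repeated for every row, while the list-like values must all have the same length. A short sketch that rebuilds the same frame and inspects the result:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": 1,                                          # scalar, repeated on every row
    "B": pd.Timestamp("20180124"),                   # scalar timestamp, also repeated
    "C": pd.Series(range(10, 14), dtype="float64"),  # supplies the row index 0..3
    "D": ["Java", "Python", "C++", "php"],
    "E": np.array([3] * 4),
})

print(df.dtypes)   # one dtype per column
print(df.shape)    # (4, 5)
```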
- Getting a column by its column label
>>> s3['D']
0 Java
1 Python
2 C++
3 php
Name: D, dtype: object
>>> s3['A']
0 1
1 1
2 1
3 1
Name: A, dtype: int64
>>> type(s3['D'])
<class 'pandas.core.series.Series'>
>>> type(s3['A'])
<class 'pandas.core.series.Series'>
>>> type(s3['C'][2])
<class 'numpy.float64'>
>>> s3['D'][2]
'C++'
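`s3['D'][2]` selects the column first and then the row. The same cell can also be reached in one step with `loc` or `at`, which is usually the safer form when assigning. A minimal sketch on a cut-down copy of the frame (the replacement value "Go" is just an example):

```python
import pandas as pd

df = pd.DataFrame({"C": [10.0, 11.0, 12.0, 13.0],
                   "D": ["Java", "Python", "C++", "php"]})

# Chained form used above: select the column, then the row.
print(df["D"][2])          # C++

# Single-step equivalents: row label first, column label second.
print(df.loc[2, "D"])      # C++
print(df.at[2, "D"])       # C++ (at is meant for single scalar lookups)

# Writing through one call avoids assigning into a temporary object.
df.loc[2, "D"] = "Go"
print(df["D"].tolist())    # ['Java', 'Python', 'Go', 'php']
```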
- Adding columns
>>> s3['F']={'html':80,'js':90,'css':85,'xml':75}  # as the output shows, only the dict keys end up in the column, not the values
>>> s3
A B C D E F
0 1 2018-01-24 10.0 Java 3 xml
1 1 2018-01-24 11.0 Python 3 js
2 1 2018-01-24 12.0 C++ 3 css
3 1 2018-01-24 13.0 php 3 html
>>> s3['G'] = 'haha'
>>> s3
A B C D E F G
0 1 2018-01-24 10.0 Java 3 xml haha
1 1 2018-01-24 11.0 Python 3 js haha
2 1 2018-01-24 12.0 C++ 3 css haha
3 1 2018-01-24 13.0 php 3 html haha
>>> s3['H'] = s3['C'] + 10
>>> s3
A B C D E F G H
0 1 2018-01-24 10.0 Java 3 xml haha 20.0
1 1 2018-01-24 11.0 Python 3 js haha 21.0
2 1 2018-01-24 12.0 C++ 3 css haha 22.0
3 1 2018-01-24 13.0 php 3 html haha 23.0
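If the intent is to attach values rather than dict keys, a Series (aligned on the row index) or `map()` (a lookup on an existing column) are two options. The sketch below is one way to do it; the `marks` dict and the column names "F", "M", and "H" are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"D": ["Java", "Python", "C++", "php"],
                   "C": [10.0, 11.0, 12.0, 13.0]})

# A Series is aligned on the row index, so values land on matching labels;
# row 3 has no entry here and therefore becomes NaN.
df["F"] = pd.Series({0: 80, 1: 90, 2: 85})

# map() looks up each value of an existing column in a dict.
marks = {"Java": 80, "Python": 90, "C++": 85, "php": 75}   # made-up scores
df["M"] = df["D"].map(marks)

# assign() returns a new DataFrame and leaves df untouched.
df2 = df.assign(H=df["C"] + 10)
print(df2)
```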
- Deleting a column
>>> del s3['G']
>>> s3
A B C D E F H
0 1 2018-01-24 10.0 Java 3 xml 20.0
1 1 2018-01-24 11.0 Python 3 js 21.0
2 1 2018-01-24 12.0 C++ 3 css 22.0
3 1 2018-01-24 13.0 php 3 html 23.0
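`del` changes the DataFrame in place. Two alternatives, sketched on a small stand-in frame: `drop()` returns a copy without the column, and `pop()` removes the column and hands it back.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1], "G": ["haha", "haha"], "H": [20.0, 21.0]})

# drop() returns a copy without the column; df itself is unchanged.
trimmed = df.drop(columns=["G"])
print(list(trimmed.columns))   # ['A', 'H']
print(list(df.columns))        # ['A', 'G', 'H']

# pop() removes the column in place and returns it as a Series.
h = df.pop("H")
print(list(df.columns))        # ['A', 'G']
print(h.tolist())              # [20.0, 21.0]
```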
- Series indexing: using index to set the row labels
>>> import numpy as np
>>> import pandas as pd
>>> m1 = pd.Series([1,2,3,4,5,6,7,8,9], index=['a','b','c','d','e','f','g','h','i'])
>>> m1
a 1
b 2
c 3
d 4
e 5
f 6
g 7
h 8
i 9
dtype: int64
- Indexing by row
>>> m1['c']
3
>>> m1[3]  # treated as a position here; newer pandas versions prefer m1.iloc[3]
4
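Because plain brackets accept both labels and positions on a Series like this one, it can be clearer to say which is meant. A minimal sketch using `loc` (labels) and `iloc` (positions) on the same data:

```python
import pandas as pd

m1 = pd.Series(range(1, 10), index=list("abcdefghi"))

print(m1.loc["c"])    # label-based: 3
print(m1.iloc[3])     # position-based (0-indexed): 4
print(m1.iloc[-1])    # last element: 9
```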
- Slicing
>>> m1['b':'e']
b 2
c 3
d 4
e 5
dtype: int64
>>> m1[1:4]
b 2
c 3
d 4
dtype: int64
>>> m1[['b','e','f']]  # non-contiguous labels
b 2
e 5
f 6
dtype: int64
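Note the difference visible in the two slices above: the label slice 'b':'e' includes its end label, while the positional slice 1:4 stops before position 4. The sketch below spells that out with `loc` and `iloc`.

```python
import pandas as pd

m1 = pd.Series(range(1, 10), index=list("abcdefghi"))

by_label = m1.loc["b":"e"]   # both endpoints included
by_pos = m1.iloc[1:4]        # stop position excluded, as with Python lists
every_other = m1.iloc[::2]   # a step works too

print(by_label.tolist())     # [2, 3, 4, 5]
print(by_pos.tolist())       # [2, 3, 4]
print(every_other.tolist())  # [1, 3, 5, 7, 9]
```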
- Boolean indexing
>>> m1>3
a False
b False
c False
d True
e True
f True
g True
h True
i True
dtype: bool
>>> m1[m1>3]
d 4
e 5
f 6
g 7
h 8
i 9
dtype: int64
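Conditions can also be combined. A short sketch: use `&`, `|`, and `~` (not `and`, `or`, `not`), and wrap each comparison in parentheses.

```python
import pandas as pd

m1 = pd.Series(range(1, 10), index=list("abcdefghi"))

middle = m1[(m1 > 3) & (m1 < 8)]     # both conditions must hold
edges = m1[(m1 <= 2) | (m1 >= 8)]    # either condition may hold
not_big = m1[~(m1 > 3)]              # negation of a mask

print(middle.tolist())    # [4, 5, 6, 7]
print(edges.tolist())     # [1, 2, 8, 9]
print(not_big.tolist())   # [1, 2, 3]
```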
>>> s1 = pd.DataFrame(np.random.randn(5,4),columns=['a','b','c','d'])
>>> s1
a b c d
0 -3.234718 -1.067102 0.253580 0.996599
1 -0.063340 -1.144646 0.196663 -0.066152
2 1.101328 -0.525478 -0.733888 -0.235543
3 -0.944941 0.178218 0.451760 -0.994914
4 -0.992325 0.176311 -0.439556 -0.291685
- Using columns to set the column labels
>>> s2 = pd.DataFrame(np.random.randn(5,4),index=['A','B','C','D','E'],columns=['a','b','c','d'])
>>> s2
a b c d
A 1.906068 0.133637 -0.036032 0.297770
B 0.864639 1.030426 0.283934 -0.346310
C -0.303209 -0.487499 0.665573 -1.040874
D -1.320571 -0.443308 0.049811 -0.701478
E -1.103991 1.046340 -0.134866 -0.594568
>>> s2['b']
A 0.133637
B 1.030426
C -0.487499
D -0.443308
E 1.046340
Name: b, dtype: float64
>>> s2['b']['D']
-0.44330773944073404
>>> s2[['a','c']]
a c
A 1.906068 -0.036032
B 0.864639 0.283934
C -0.303209 0.665573
D -1.320571 0.049811
E -1.103991 -0.134866
>>> s2[['a','c','d']]  # non-contiguous columns
a c d
A 1.906068 -0.036032 0.297770
B 0.864639 0.283934 -0.346310
C -0.303209 0.665573 -1.040874
D -1.320571 0.049811 -0.701478
E -1.103991 -0.134866 -0.594568
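One detail worth noting from the session above: a single label in brackets gives back a Series, while a list of labels gives back a DataFrame, even when the list has one entry. A small sketch, using `np.arange` instead of random numbers so the values are predictable:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  index=list("ABCDE"), columns=list("abcd"))

col = df["b"]        # single label: a Series
sub = df[["b"]]      # list of labels: a DataFrame with one column

print(type(col).__name__)   # Series
print(type(sub).__name__)   # DataFrame
print(df.loc["D", "b"])     # same cell as df["b"]["D"]: 13
```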
- Label-based indexing with loc
>>> s2[['b','d']]
b d
A 0.133637 0.297770
B 1.030426 -0.346310
C -0.487499 -1.040874
D -0.443308 -0.701478
E 1.046340 -0.594568
>>> s2.loc['A':'D','b':'d']  # slicing
b c d
A 0.133637 -0.036032 0.297770
B 1.030426 0.283934 -0.346310
C -0.487499 0.665573 -1.040874
D -0.443308 0.049811 -0.701478
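As with a Series, `loc` slices include both end labels. The position-based counterpart is `iloc`, and `loc` also accepts a boolean mask for the rows. A minimal sketch on a frame with predictable values (the data is illustrative, not the random frame from the session):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  index=list("ABCDE"), columns=list("abcd"))

by_label = df.loc["A":"D", "b":"d"]    # rows A..D and columns b..d, ends included
by_pos = df.iloc[0:4, 1:4]             # the same block selected by position
print(by_label.equals(by_pos))         # True

# Rows chosen by a condition, columns chosen by label.
rows = df.loc[df["a"] > 4, ["a", "d"]]
print(rows)
```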