pandas Study Notes (2): Selecting Data

2018-08-02, by 不做废物

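Jianshu does not render this post well (its markdown parser differs from the one in Youdao Cloud Notes), so I suggest reading the notebook version instead:
jupyter notebook: pandas Study Notes (2): Selecting Data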

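This series collects the notes I took while studying the Python Data Science Handbook.
They are mainly a personal reference.
I am sharing them along the way, so some parts may be imprecise or unclear.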

Selecting Data in a Series

Indexing
Slicing
Masking (boolean indexing)
Fancy indexing
Indexers: loc, iloc, ix

import numpy as np
import pandas as pd
data = pd.Series([0.25,0.5,0.75,1],
                index = ['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

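1. Indexing

Indexing retrieves a single scalar.
There are two ways to index: 1) treat the Series as a dictionary; 2) treat the Series as a one-dimensional array.

data['b']  # dictionary-style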

0.5

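data[1]  # array-style (by position); recent pandas versions deprecate this form, and data.iloc[1] is the explicit equivalent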

0.5
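
The two views can be checked side by side. The following is a minimal sketch, not part of the original notebook; the variable names (has_b, labels, by_label, by_position) are made up for illustration, and loc/iloc are the indexers introduced later in these notes.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Dictionary-like behaviour: the index labels act as keys.
has_b = 'b' in data            # membership test against the labels
labels = list(data.keys())     # keys() returns the index

# Explicit accessors state which view is meant.
by_label = data.loc['b']       # label-based lookup
by_position = data.iloc[1]     # position-based lookup

print(has_b, labels, by_label, by_position)
```

Both lookups return the same scalar here because the label 'b' sits at position 1.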

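2. Slicing

Slicing retrieves a range of values (usually more than one).
There are two ways to slice: 1) treat the Series as a dictionary; 2) treat the Series as a one-dimensional array.

data['b':'d']  # dictionary-style, explicit index; both endpoints are included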

b    0.50
c    0.75
d    1.00
dtype: float64

data[0:2]  # array-style, implicit index; half-open interval (the end position is excluded)

a    0.25
b    0.50
dtype: float64
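
The endpoint difference is easy to miss, so here is a small sketch that slices "from the first element to the third" in both styles and compares the results. The variable names are illustrative only.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Label slice: the end label 'c' is included.
label_slice = data['a':'c']

# Position slice: the end position 2 is excluded.
position_slice = data[0:2]

print(list(label_slice.index))     # three labels
print(list(position_slice.index))  # two labels
```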

3. Masking (boolean indexing)

data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64
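
It may help to look at the mask on its own. In this sketch (the names mask and selected are mine), the comparison produces a boolean Series with the same index as data, and only the rows marked True are kept. The parentheses are required because & binds more tightly than the comparison operators.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Each comparison yields a boolean Series; & combines them element-wise.
mask = (data > 0.3) & (data < 0.8)

# Indexing with the mask keeps only the entries where it is True.
selected = data[mask]

print(mask.tolist())
print(selected.to_dict())
```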

4. Fancy indexing

With fancy indexing, the shape of the result matches the shape of the index array, not the shape of the array being indexed.

select = np.array([[0,1],
                     [2,3]])
select

array([[0, 1],
       [2, 3]])

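data.values[select]  # fancy indexing on the underlying NumPy array; the result has the shape of select (recent pandas versions do not accept a 2-D key passed directly as data[select])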

array([[0.25, 0.5 ],
       [0.75, 1.  ]])
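
On the Series itself, fancy indexing normally takes a one-dimensional list of labels. The sketch below shows that form, plus a 2-D index array applied to the NumPy array obtained with to_numpy(). Going through to_numpy() is my assumption for keeping the example working on recent pandas releases; the names picked, positions and grid are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# A list of labels selects those entries and returns a Series.
picked = data[['a', 'd']]

# A 2-D integer array indexes the underlying NumPy array;
# the result takes the shape of the index array.
positions = np.array([[0, 1],
                      [2, 3]])
grid = data.to_numpy()[positions]

print(picked.to_dict())
print(grid.shape)
```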

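5. Indexers

Having two different conventions for indexing and slicing (explicit and implicit indexes) often causes confusion.
For example, if a Series has an explicit integer index, an indexing operation such as data[1] uses the explicit index, whereas a slicing operation such as data[1:3] uses the implicit index.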

data = pd.Series(['a','b','c'], index = [1, 3, 5])
data

1    a
3    b
5    c
dtype: object

data[1]  # result is 'a': explicit index

'a'

data[1:3]  # result is b, c: implicit (positional) index

3    b
5    c
dtype: object

These conventions are easy to mix up, so pandas provides indexers.
The first indexer is loc, which always uses the explicit index.

data.loc[1]

'a'

data.loc[3:5]

3    b
5    c
dtype: object

The second indexer is iloc, which always uses the implicit (positional) index.

data.iloc[0]

'a'

data.iloc[0:2]  # half-open interval

1    a
3    b
dtype: object

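The third indexer is ix, a hybrid of the other two that behaves like standard [] indexing and was mostly used with DataFrames. It is not recommended: ix was deprecated in pandas 0.20 and removed in pandas 1.0.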
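
To see the difference between the two indexers, the same numbers can be passed to both. This is a minimal sketch on the integer-indexed Series used above; the variable names are illustrative.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

explicit_item = data.loc[1]       # label 1 -> 'a'
implicit_item = data.iloc[1]      # position 1 -> 'b'

explicit_slice = data.loc[1:3]    # labels 1 through 3, end included -> a, b
implicit_slice = data.iloc[1:3]   # positions 1 and 2, end excluded -> b, c

print(explicit_item, implicit_item)
print(explicit_slice.tolist(), implicit_slice.tolist())
```

Because loc and iloc each stick to one convention, the same key 1 or 1:3 gives different but predictable results.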

Selecting Data in a DataFrame

Treating the DataFrame as a dictionary
Treating the DataFrame as a two-dimensional array


x = {'a': 10,'b':20, 'c':30}
y = {'a':2, 'b':4, 'c':6}
data = pd.DataFrame({'x':x,'y':y})
data

    x  y
a  10  2
b  20  4
c  30  6

1. Treating the DataFrame as a dictionary

Use dictionary-style lookup on a column name to get that column's data

data['x']

a    10
b    20
c    30
Name: x, dtype: int64

Attribute-style access also works

data.x

a    10
b    20
c    30
Name: x, dtype: int64

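Note: to modify values, use data['x'] = value rather than data.x = value. Attribute assignment only reaches a column that already exists; for a new name it sets a plain Python attribute instead of creating a column.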
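
A short sketch of why item assignment is the safer habit. The attribute name w is hypothetical, and the warning filter is only there because pandas emits a warning on the attribute assignment.

```python
import warnings

import pandas as pd

x = {'a': 10, 'b': 20, 'c': 30}
y = {'a': 2, 'b': 4, 'c': 6}
data = pd.DataFrame({'x': x, 'y': y})

# Item assignment always targets the column.
data['x'] = data['x'] * 2

# Attribute assignment with a new name does NOT create a column;
# it only attaches a plain attribute to the object.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    data.w = [1, 2, 3]

print(list(data.columns))
```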

You can also add a column

data['z'] = data['x'] + data['y']
data

    x  y   z
a  10  2  12
b  20  4  24
c  30  6  36

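2. Treating the DataFrame as a two-dimensional array

This view is quite practical.
By analogy with a two-dimensional NumPy array, the underlying array is available through the values attribute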

data.values

array([[10,  2, 12],
       [20,  4, 24],
       [30,  6, 36]], dtype=int64)
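
One consequence of the two views is worth spelling out: a single index on the NumPy array gives a row, while a single key on the DataFrame gives a column. A small sketch, using to_numpy() as an equivalent of values and illustrative variable names:

```python
import pandas as pd

x = {'a': 10, 'b': 20, 'c': 30}
y = {'a': 2, 'b': 4, 'c': 6}
data = pd.DataFrame({'x': x, 'y': y})
data['z'] = data['x'] + data['y']

arr = data.to_numpy()      # 2-D NumPy array holding the same values

first_row = arr[0]         # array view: index 0 is the first row
first_col = data['x']      # dictionary view: key 'x' is a column

print(arr.shape)
print(first_row.tolist(), first_col.tolist())
```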

Transpose it

data.T  # transpose; creates a new copy

    a   b   c
x  10  20  30
y   2   4   6
z  12  24  36

For array-style selection, use the indexers described above: loc and iloc (ix was removed in pandas 1.0)

data.iloc[0:2]

    x  y   z
a  10  2  12
b  20  4  24

data.loc['b']

x    20
y     4
z    24
Name: b, dtype: int64
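
Both indexers also accept a second key for the columns. The sketch below selects the same top-left block once by position and once by label; the variable names are illustrative.

```python
import pandas as pd

x = {'a': 10, 'b': 20, 'c': 30}
y = {'a': 2, 'b': 4, 'c': 6}
data = pd.DataFrame({'x': x, 'y': y})
data['z'] = data['x'] + data['y']

# By position: first two rows and first two columns, ends excluded.
top_left = data.iloc[:2, :2]

# By label: up to and including row 'b' and column 'y'.
same_by_label = data.loc[:'b', :'y']

print(top_left.equals(same_by_label))
```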

Masking and fancy indexing can be combined in the loc indexer

data.loc[data.z > 20, ['x','y']]

    x  y
b  20  4
c  30  6

data.loc[['a','b'],['y','z']]

   y   z
a  2  12
b  4  24

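Note

Indexing with a single label selects a column, whereas slicing selects rows

data['x']  # data['a'] would raise a KeyError, because 'a' is a row label, not a column name; try it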

a    10
b    20
c    30
Name: x, dtype: int64

data['a':'c']  # slicing selects rows; data['x':'y'] would not give the result you want, because it also slices rows rather than columns

    x  y   z
a  10  2  12
b  20  4  24
c  30  6  36
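
The rule in this note can be verified with a short sketch. It also shows the loc forms that express the same selections explicitly; the variable names and the missing flag are mine.

```python
import pandas as pd

x = {'a': 10, 'b': 20, 'c': 30}
y = {'a': 2, 'b': 4, 'c': 6}
data = pd.DataFrame({'x': x, 'y': y})

column = data['x']                  # single label: column lookup
rows = data['a':'b']                # slice: row selection, end label included

# The same selections with the axis spelled out.
column_explicit = data.loc[:, 'x']
rows_explicit = data.loc['a':'b']

# A row label used as a single key is looked up among the columns and fails.
try:
    data['a']
    missing = False
except KeyError:
    missing = True

print(list(rows.index), missing)
```

Writing data.loc[...] avoids having to remember which axis plain [] picks for each kind of key.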
