Pandas - String Operations

2021-10-12 山药鱼儿

Basic string operations

First, construct a Series:

>> import numpy as np
>> import pandas as pd
>> s = pd.Series(['a','B','CdEfG','PythON',np.nan])
>> s
0         a
1         B
2     CdEfG
3    PythON
4       NaN
dtype: object

Convert the values in the Series to lowercase:

>> s.str.lower()
0         a
1         b
2     cdefg
3    python
4       NaN
dtype: object

They can be converted to uppercase in the same way:

>> s.str.upper()
0         A
1         B
2     CDEFG
3    PYTHON
4       NaN
dtype: object

Get the length of each string (the result is float64 because the missing value stays NaN):

>> s.str.len()
0    1.0
1    1.0
2    5.0
3    6.0
4    NaN
dtype: float64
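
The calls above can be combined in a short self-contained sketch. It assumes the Series uses the object dtype shown in the output above; the variable names are only illustrative. Missing values pass through each `.str` method unchanged, which is why the lengths come back as floats.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'B', 'CdEfG', 'PythON', np.nan])

lowered = s.str.lower()   # the NaN entry is left untouched
lengths = s.str.len()     # float64 under the object dtype, since NaN is not an int

# Methods can be chained: normalize case, then measure length
print(s.str.upper().str.len())

# Drop the missing value first if integer lengths are wanted
int_lengths = s.dropna().str.len().astype(int)
print(int_lengths.tolist())   # [1, 1, 5, 6]
```

Dropping or filling missing values before calling `.str` methods is one option; keeping NaN in place is often preferable when the result must stay aligned with the original index.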

strip, lstrip, rstrip

String methods apply not only to values but also to the index and columns. Construct an Index:

>> index = pd.Index([' a ','b ',' c'])
>> index
Index([' a ', 'b ', ' c'], dtype='object')

Strip whitespace from both ends of each string:

>> index.str.strip()
Index(['a', 'b', 'c'], dtype='object')

Strip whitespace from the left side only:

>> index.str.lstrip()
Index(['a ', 'b ', 'c'], dtype='object')

Strip whitespace from the right side only:

>> index.str.rstrip()
Index([' a', 'b', ' c'], dtype='object')
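
These methods also accept a `to_strip` argument naming the characters to remove, much like Python's built-in `str.strip`. A minimal sketch, using a made-up Index whose labels are padded with spaces and underscores:

```python
import pandas as pd

# Hypothetical labels padded with spaces and underscores
index = pd.Index(['__a__', ' b ', '_c '])

# With no argument, only whitespace is stripped
print(index.str.strip().tolist())   # ['__a__', 'b', '_c']

# Pass the set of characters to remove from both ends
cleaned = index.str.strip(' _')
print(cleaned.tolist())             # ['a', 'b', 'c']
```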

replace

Construct a DataFrame:

>> df = pd.DataFrame(np.random.randn(3,2), columns=['c 0', 'c 1'])
>> df

Get the columns of this DataFrame:

>> df.columns
Index(['c 0', 'c 1'], dtype='object')

The columns of a DataFrame are an Index object, so string methods work on them as well:

>> df.columns = df.columns.str.replace(' ','')
>> df

The spaces in the column names have been removed, leaving the columns 'c0' and 'c1'.
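
A compact way to check the renaming, assuming a small DataFrame with fixed values in place of the random ones. The `regex=False` argument is an addition here to make the literal match explicit; it does not appear in the call above.

```python
import pandas as pd

# Fixed values so the result is reproducible
df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['c 0', 'c 1'])

# regex=False treats ' ' as a literal string rather than a pattern
df.columns = df.columns.str.replace(' ', '', regex=False)

print(df.columns.tolist())   # ['c0', 'c1']
print(df['c0'].sum())        # 9
```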

split

Construct a Series:

>> s = pd.Series(['a b c', 'd e f', 'g h i'])
>> s
0    a b c
1    d e f
2    g h i
dtype: object

Split each value on whitespace:

>> s.str.split()
0    [a, b, c]
1    [d, e, f]
2    [g, h, i]
dtype: object

You can also pass expand=True when splitting, which expands the result into a DataFrame:

>> s.str.split(expand=True)

In addition, the parameter n limits the number of splits:

>> s.str.split(expand=True, n=1)
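
The effect of `expand` and `n` is easier to see with the results inspected directly. The sketch below reuses the same Series; `rsplit` is shown as a related method that is not covered above, and the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(['a b c', 'd e f', 'g h i'])

# expand=True spreads the pieces across columns 0, 1, 2
full = s.str.split(expand=True)
print(full.shape)                # (3, 3)

# n=1 splits only once, so the remainder stays together
first_rest = s.str.split(n=1, expand=True)
print(first_rest[1].tolist())    # ['b c', 'e f', 'h i']

# rsplit does the same, starting from the right
last = s.str.rsplit(n=1, expand=True)
print(last[1].tolist())          # ['c', 'f', 'i']
```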

contains

>> s = pd.Series(['Py game', 'Py web', 'Java web', 'Python', 'C++'])
>> s
0     Py game
1      Py web
2    Java web
3      Python
4         C++
dtype: object

The contains method returns a boolean Series:

>> s.str.lower().str.contains('p')
0     True
1     True
2    False
3     True
4    False
dtype: bool
>> s.str.lower().str.contains('py')
0     True
1     True
2    False
3     True
4    False
dtype: bool
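
By default `contains` treats its argument as a regular expression and is case sensitive. The following sketch shows three commonly used parameters, `case`, `na`, and `regex`; the extra NaN entry is added for illustration and is not part of the Series above.

```python
import numpy as np
import pandas as pd

s = pd.Series(['Py game', 'Py web', 'Java web', 'Python', 'C++', np.nan])

# case=False avoids the explicit lower(); na=False maps missing values to False
has_py = s.str.contains('py', case=False, na=False)
print(has_py.tolist())      # [True, True, False, True, False, False]

# '+' is a regex quantifier, so match 'C++' literally with regex=False
has_cpp = s.str.contains('C++', regex=False, na=False)
print(has_cpp.tolist())     # [False, False, False, False, True, False]

# A boolean Series can be used directly as a filter
print(s[has_py].tolist())   # ['Py game', 'Py web', 'Python']
```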

get_dummies

Construct a Series:

>> s = pd.Series(['a', 'a|b', 'a|b|c', 'b|c'])
>> s
0        a
1      a|b
2    a|b|c
3      b|c
dtype: object

get_dummies splits each string on the given separator sep and returns a DataFrame of 0s and 1s:

>> s.str.get_dummies(sep='|')

This works much like a binary encoding: each column is an indicator showing whether a record contains the corresponding value.
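
To make the encoding concrete, the sketch below prints the indicator matrix for the same Series. The column sums at the end are an illustrative extra, not part of the original walkthrough.

```python
import pandas as pd

s = pd.Series(['a', 'a|b', 'a|b|c', 'b|c'])

dummies = s.str.get_dummies(sep='|')

# One column per distinct token, in sorted order
print(dummies.columns.tolist())   # ['a', 'b', 'c']

# Each row marks which tokens the record contains
print(dummies.values.tolist())
# [[1, 0, 0], [1, 1, 0], [1, 1, 1], [0, 1, 1]]

# Column sums count how many records contain each token
print(dummies.sum().tolist())     # [3, 3, 2]
```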