10 Pandas String Handling

2022-11-06 Viterbi


Earlier we already used a string-handling function:
df["bWendu"].str.replace("℃", "").astype('int32')

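Pandas string handling:

How to use it: first get the Series' str attribute, then call functions on that attribute;
It works only on string columns, not on numeric columns;
DataFrame has no str attribute or string-handling methods;
Series.str is not Python's built-in str but pandas' own set of methods, although most of them closely resemble the built-in ones.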
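
The points above can be checked on a tiny hand-made example. This is only a sketch: the sample values and the names `temps` and `frame` are made up for illustration and are not taken from the weather dataset.

```python
import pandas as pd

# Hypothetical sample data, not read from the weather CSV
temps = pd.Series(["3℃", "-6℃", "12℃"])
frame = pd.DataFrame({"temp": temps})

# .str applies the string method to every element of the Series
lengths = temps.str.len()

# Strip the unit as a literal string, then convert to integers
cleaned = temps.str.replace("℃", "", regex=False).astype("int32")

# A DataFrame does not expose a .str accessor; only its Series columns do
frame_has_str = hasattr(frame, "str")
column_has_str = hasattr(frame["temp"], "str")
```

Here `frame_has_str` is False while `column_has_str` is True, so string methods have to be called column by column.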

For the full list of Series.str methods, see the reference documentation: https://pandas.pydata.org/pandas-docs/stable/reference/series.html#string-handling

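This section demonstrates:

Getting the str attribute of a Series and using various string-handling functions
Using the boolean Series returned by str methods such as startswith and contains for conditional queries
Chaining operations that need several str steps
Processing with regular expressions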

### 0. Read the 2018 Beijing weather data

import pandas as pd

fpath = "./datas/beijing_tianqi/beijing_tianqi_2018.csv"
df = pd.read_csv(fpath)

df.head()


	ymd	bWendu	yWendu	tianqi	fengxiang	fengli	aqi	aqiInfo	aqiLevel
0	2018-01-01	3℃	-6℃	晴~多云	东北风	1-2级	59	良	2
1	2018-01-02	2℃	-5℃	阴~多云	东北风	1-2级	49	优	1
2	2018-01-03	2℃	-5℃	多云	北风	1-2级	28	优	1
3	2018-01-04	0℃	-8℃	阴	东北风	1-2级	28	优	1
4	2018-01-05	3℃	-6℃	多云~晴	西北风	1-2级	50	优	1

df.dtypes


    ymd          object
    bWendu       object
    yWendu       object
    tianqi       object
    fengxiang    object
    fengli       object
    aqi           int64
    aqiInfo      object
    aqiLevel      int64
    dtype: object



### 1. Get the str attribute of a Series and use string-handling functions


```python
df["bWendu"].str
```

    <pandas.core.strings.StringMethods at 0x1af21871808>


# String replacement function
df["bWendu"].str.replace("℃", "")


    0       3
    1       2
    2       2
    3       0
    4       3
           ..
    360    -5
    361    -3
    362    -3
    363    -2
    364    -2
    Name: bWendu, Length: 365, dtype: object

# Check whether each value is numeric (all False here because of the ℃ suffix)
df["bWendu"].str.isnumeric()

    0      False
    1      False
    2      False
    3      False
    4      False
           ...  
    360    False
    361    False
    362    False
    363    False
    364    False
    Name: bWendu, Length: 365, dtype: bool


# aqi is an int64 column, so accessing .str on it raises AttributeError
df["aqi"].str.len()


    ---------------------------------------------------------------------------

    AttributeError                            Traceback (most recent call last)

    <ipython-input-8-12cdcbdb6f81> in <module>
    ----> 1 df["aqi"].str.len()
    

    d:\appdata\python37\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
       5173             or name in self._accessors
       5174         ):
    -> 5175             return object.__getattribute__(self, name)
       5176         else:
       5177             if self._info_axis._can_hold_identifiers_and_holds_name(name):
    

    d:\appdata\python37\lib\site-packages\pandas\core\accessor.py in __get__(self, obj, cls)
        173             # we're accessing the attribute of the class, i.e., Dataset.geo
        174             return self._accessor
    --> 175         accessor_obj = self._accessor(obj)
        176         # Replace the property with the accessor object. Inspired by:
        177         # http://www.pydanny.com/cached-property.html
    

    d:\appdata\python37\lib\site-packages\pandas\core\strings.py in __init__(self, data)
       1915 
       1916     def __init__(self, data):
    -> 1917         self._inferred_dtype = self._validate(data)
       1918         self._is_categorical = is_categorical_dtype(data)
       1919 
    

    d:\appdata\python37\lib\site-packages\pandas\core\strings.py in _validate(data)
       1965 
       1966         if inferred_dtype not in allowed_types:
    -> 1967             raise AttributeError("Can only use .str accessor with string " "values!")
       1968         return inferred_dtype
       1969 
    

    AttributeError: Can only use .str accessor with string values!
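
If string methods are really needed on a numeric column, one common workaround is to convert it to strings first. The sketch below uses a hypothetical stand-in for the `aqi` column rather than the real data.

```python
import pandas as pd

# Hypothetical AQI values standing in for the int64 "aqi" column
aqi = pd.Series([59, 49, 214])

# Convert the numbers to strings; after that the .str accessor is available
aqi_digits = aqi.astype(str).str.len()
```

`aqi_digits` then holds the number of digits of each value.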

### 2. Use the boolean Series from startswith, contains, etc. for conditional queries

condition = df["ymd"].str.startswith("2018-03")

condition



    0      False
    1      False
    2      False
    3      False
    4      False
           ...  
    360    False
    361    False
    362    False
    363    False
    364    False
    Name: ymd, Length: 365, dtype: bool


df[condition].head()


	ymd	bWendu	yWendu	tianqi	fengxiang	fengli	aqi	aqiInfo	aqiLevel
59	2018-03-01	8℃	-3℃	多云	西南风	1-2级	46	优	1
60	2018-03-02	9℃	-1℃	晴~多云	北风	1-2级	95	良	2
61	2018-03-03	13℃	3℃	多云~阴	北风	1-2级	214	重度污染	5
62	2018-03-04	7℃	-2℃	阴~多云	东南风	1-2级	144	轻度污染	3
63	2018-03-05	8℃	-3℃	晴	南风	1-2级	94	良	2
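
Boolean Series produced by different str methods can also be combined before filtering. A minimal sketch, assuming a small hand-made DataFrame (the `sample` rows are hypothetical and only mimic the `ymd` and `tianqi` columns):

```python
import pandas as pd

# Hypothetical rows that mimic the "ymd" and "tianqi" columns
sample = pd.DataFrame({
    "ymd": ["2018-03-01", "2018-03-02", "2018-04-01"],
    "tianqi": ["多云", "晴~多云", "晴"],
})

# Each str method returns a boolean Series aligned with the rows
in_march = sample["ymd"].str.startswith("2018-03")
has_sunny = sample["tianqi"].str.contains("晴", regex=False)

# Combine the masks with & (and), | (or), ~ (not), then index the DataFrame
march_sunny = sample[in_march & has_sunny]
```

Only the row for 2018-03-02 satisfies both conditions.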

### 3. Chaining multiple str operations

How do we extract a numeric month such as 201803?
1. First convert the date 2018-03-31 to the form 20180331
2. Then extract the month string 201803

df["ymd"].str.replace("-", "")




    0      20180101
    1      20180102
    2      20180103
    3      20180104
    4      20180105
             ...   
    360    20181227
    361    20181228
    362    20181229
    363    20181230
    364    20181231
    Name: ymd, Length: 365, dtype: object



# Each call returns a new Series
df["ymd"].str.replace("-", "").slice(0, 6)



    ---------------------------------------------------------------------------

    AttributeError                            Traceback (most recent call last)

    <ipython-input-13-ae278fb12255> in <module>
          1 # Each call returns a new Series
    ----> 2 df["ymd"].str.replace("-", "").slice(0, 6)
    

    d:\appdata\python37\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
       5177             if self._info_axis._can_hold_identifiers_and_holds_name(name):
       5178                 return self[name]
    -> 5179             return object.__getattribute__(self, name)
       5180 
       5181     def __setattr__(self, name, value):
    

    AttributeError: 'Series' object has no attribute 'slice'



# Go through .str again on the new Series before slicing
df["ymd"].str.replace("-", "").str.slice(0, 6)


    0      201801
    1      201801
    2      201801
    3      201801
    4      201801
            ...  
    360    201812
    361    201812
    362    201812
    363    201812
    364    201812
    Name: ymd, Length: 365, dtype: object

# slice is equivalent to slicing syntax, which can be used directly on .str
df["ymd"].str.replace("-", "").str[0:6]




    0      201801
    1      201801
    2      201801
    3      201801
    4      201801
            ...  
    360    201812
    361    201812
    362    201812
    363    201812
    364    201812
    Name: ymd, Length: 365, dtype: object
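
Note that the result above still has dtype object, that is, the months are strings. If an actual integer such as 201803 is wanted, the chain can end with a type conversion. A small sketch on hypothetical sample dates:

```python
import pandas as pd

# Hypothetical sample dates in the same format as the "ymd" column
dates = pd.Series(["2018-01-01", "2018-03-31", "2018-12-31"])

# Remove the dashes, keep the first six characters, then convert to integers
year_month = (
    dates.str.replace("-", "", regex=False)
    .str[0:6]
    .astype("int64")
)
```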

### 4. Processing with regular expressions

# Add a new column
def get_nianyueri(x):
    year,month,day = x["ymd"].split("-")
    return f"{year}年{month}月{day}日"
df["中文日期"] = df.apply(get_nianyueri, axis=1)

df["中文日期"]



    0      2018年01月01日
    1      2018年01月02日
    2      2018年01月03日
    3      2018年01月04日
    4      2018年01月05日
              ...     
    360    2018年12月27日
    361    2018年12月28日
    362    2018年12月29日
    363    2018年12月30日
    364    2018年12月31日
    Name: 中文日期, Length: 365, dtype: object

Question: how do we remove the three Chinese characters 年, 月, and 日 from "2018年12月31日"?

# Method 1: chained replace
df["中文日期"].str.replace("年", "").str.replace("月","").str.replace("日", "")


    0      20180101
    1      20180102
    2      20180103
    3      20180104
    4      20180105
             ...   
    360    20181227
    361    20181228
    362    20181229
    363    20181230
    364    20181231
    Name: 中文日期, Length: 365, dtype: object

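In older pandas versions, Series.str.replace treated the pattern as a regular expression by default. Since pandas 2.0 the default is regex=False, so pass regex=True explicitly.

# Method 2: regular-expression replacement
df["中文日期"].str.replace("[年月日]", "", regex=True)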


    0      20180101
    1      20180102
    2      20180103
    3      20180104
    4      20180105
             ...   
    360    20181227
    361    20181228
    362    20181229
    363    20181230
    364    20181231
    Name: 中文日期, Length: 365, dtype: object
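
The effect of the `regex` parameter of `Series.str.replace` can be seen on a small hand-made Series. This is an illustrative sketch; `dates_cn` is a hypothetical sample, not the real column.

```python
import pandas as pd

# Hypothetical sample in the same format as the "中文日期" column
dates_cn = pd.Series(["2018年01月01日", "2018年12月31日"])

# regex=True: "[年月日]" is a character class matching any one of the three characters
as_regex = dates_cn.str.replace("[年月日]", "", regex=True)

# regex=False: the pattern is a literal five-character string, which never occurs here
as_literal = dates_cn.str.replace("[年月日]", "", regex=False)
```

With `regex=True` the three characters are removed; with `regex=False` the strings come back unchanged, which is why the flag should be set explicitly.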
