Pandas - 7. Data Types

2022-05-23  陈天睡懒觉
import seaborn as sns
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
tips = sns.load_dataset('tips')
print(tips.dtypes)
total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object
print(tips.head())
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

astype() converts data types and works on both Series and DataFrame. The target can be a Python built-in type (str, float, int, complex, bool) or any dtype supported by NumPy.
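As a minimal sketch of the three ways of calling astype() described above, the snippet below uses a small made-up DataFrame (a hypothetical stand-in for the tips data, so it runs without downloading anything). The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the tips dataset
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "size": [2, 3, 3],
    "sex": ["Female", "Male", "Male"],
})

# Series.astype with a Python built-in type
size_as_float = df["size"].astype(float)

# Series.astype with a NumPy dtype
bill_as_float32 = df["total_bill"].astype(np.float32)

# DataFrame.astype with a dict converts several columns in one call
converted = df.astype({"size": "int32", "sex": "category"})

print(size_as_float.dtype)    # float64
print(bill_as_float32.dtype)  # float32
print(converted.dtypes)
```

astype() returns a new object in each case, so df itself keeps its original dtypes unless the result is assigned back.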

# Convert the sex column to strings
tips['sex_str'] = tips['sex'].astype(str)
print(tips.dtypes)
total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
sex_str         object
dtype: object

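Converting to numeric types

Some numeric columns use a string such as 'missing' or 'null' in place of missing values, which turns the whole column into a string (object) column.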

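# Take a copy so the assignment below does not write into a slice of tips
tips_sub_miss = tips.head(10).copy()
# Cast to object first: recent pandas versions warn or raise when a string is assigned into a float64 column
tips_sub_miss['total_bill'] = tips_sub_miss['total_bill'].astype(object)
tips_sub_miss.loc[[1, 3, 5, 7], 'total_bill'] = 'missing'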
print(tips_sub_miss)
  total_bill   tip     sex smoker  day    time  size sex_str
0      16.99  1.01  Female     No  Sun  Dinner     2  Female
1    missing  1.66    Male     No  Sun  Dinner     3    Male
2      21.01  3.50    Male     No  Sun  Dinner     3    Male
3    missing  3.31    Male     No  Sun  Dinner     2    Male
4      24.59  3.61  Female     No  Sun  Dinner     4  Female
5    missing  4.71    Male     No  Sun  Dinner     4    Male
6       8.77  2.00    Male     No  Sun  Dinner     2    Male
7    missing  3.12    Male     No  Sun  Dinner     4    Male
8      15.04  1.96    Male     No  Sun  Dinner     2    Male
9      14.78  3.23    Male     No  Sun  Dinner     2    Male
print(tips_sub_miss.dtypes)
total_bill      object
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
sex_str         object
dtype: object
# astype(float) cannot convert the placeholder string to float
tips_sub_miss['total_bill'].astype(float) # ValueError: could not convert string to float: 'missing'
# to_numeric with its default settings fails as well
pd.to_numeric(tips_sub_miss['total_bill']) # ValueError: Unable to parse string "missing" at position 1

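to_numeric() converts values to a numeric type
Parameters:
  errors: what to do with values that cannot be parsed. 'raise' (the default) raises an exception; 'coerce' replaces them with NaN.
  downcast: 'integer', 'signed', 'unsigned' or 'float'. After conversion, the result is downcast to the smallest numeric dtype of that kind that can hold the values.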

tips_sub_miss['total_bill'] = pd.to_numeric(tips_sub_miss['total_bill'],
                                           errors='coerce')
print(tips_sub_miss.dtypes)
total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
sex_str         object
dtype: object
print(tips_sub_miss)
   total_bill   tip     sex smoker  day    time  size sex_str
0       16.99  1.01  Female     No  Sun  Dinner     2  Female
1         NaN  1.66    Male     No  Sun  Dinner     3    Male
2       21.01  3.50    Male     No  Sun  Dinner     3    Male
3         NaN  3.31    Male     No  Sun  Dinner     2    Male
4       24.59  3.61  Female     No  Sun  Dinner     4  Female
5         NaN  4.71    Male     No  Sun  Dinner     4    Male
6        8.77  2.00    Male     No  Sun  Dinner     2    Male
7         NaN  3.12    Male     No  Sun  Dinner     4    Male
8       15.04  1.96    Male     No  Sun  Dinner     2    Male
9       14.78  3.23    Male     No  Sun  Dinner     2    Male
tips_sub_miss['total_bill'] = pd.to_numeric(tips_sub_miss['total_bill'],
                                           errors='coerce',
                                           downcast='float')
print(tips_sub_miss.dtypes)
total_bill     float32
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
sex_str         object
dtype: object
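The walkthrough above can be condensed into one self-contained sketch. It uses a short made-up Series (hypothetical, not taken from the tips data) so the effect of errors and downcast can be compared side by side.

```python
import pandas as pd

# Hypothetical column where the string 'missing' marks absent values
raw = pd.Series([16.99, "missing", 21.01, "missing"], dtype=object)

# Default errors='raise': the unparseable string raises ValueError
try:
    pd.to_numeric(raw)
    raised = False
except ValueError:
    raised = True

# errors='coerce': unparseable values become NaN, result is float64
coerced = pd.to_numeric(raw, errors="coerce")

# downcast='float': additionally shrink the result to float32
small = pd.to_numeric(raw, errors="coerce", downcast="float")

print(raised)                      # True
print(coerced.dtype)               # float64
print(small.dtype)                 # float32
print(int(coerced.isna().sum()))   # 2
```

The float32 version stores each value in half the bytes of float64, at the cost of precision, so downcasting is mainly worthwhile when memory matters more than the last few decimal digits.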

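Categorical data

The category dtype encodes categorical values. It is worth using for the following reasons:

  1. It saves memory and can speed up operations
  2. Values that have a meaningful order can be stored as an ordered categorical
  3. Some Python libraries can work with categorical data directly (for example, when fitting statistical models)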
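The first two points in the list can be illustrated with a short sketch. The data here is made up (a hypothetical day column repeated many times), and pd.CategoricalDtype is used to declare the order explicitly. Actual memory savings depend on the data and the pandas version.

```python
import pandas as pd

# Hypothetical column with few distinct values repeated many times
days = pd.Series(["Thur", "Fri", "Sat", "Sun"] * 10_000)

# Compare memory use of plain strings vs. category codes
string_bytes = days.memory_usage(deep=True)
category_bytes = days.astype("category").memory_usage(deep=True)
print(string_bytes > category_bytes)  # True

# An ordered categorical gives the values a meaningful order
day_type = pd.CategoricalDtype(["Thur", "Fri", "Sat", "Sun"], ordered=True)
ordered_days = days.astype(day_type)
print(ordered_days.min(), ordered_days.max())  # Thur Sun
print(int((ordered_days > "Fri").sum()))       # 20000
```

Without ordered=True, comparisons such as > raise a TypeError, because an unordered categorical has no defined ranking.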
tips['sex'] = tips['sex'].astype('str')
print(tips.dtypes)
total_bill     float64
tip            float64
sex             object
smoker        category
day           category
time          category
size             int64
sex_str         object
dtype: object
tips['sex'] = tips['sex'].astype('category')
print(tips.dtypes)
total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
sex_str         object
dtype: object

Operations on a categorical Series
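A categorical Series exposes category-specific attributes and methods through the .cat accessor. The sketch below shows a few of them on made-up data resembling the sex column. It is only a sample of what the accessor offers.

```python
import pandas as pd

# Hypothetical data resembling the 'sex' column of the tips dataset
sex = pd.Series(["Female", "Male", "Male", "Female"], dtype="category")

print(list(sex.cat.categories))  # ['Female', 'Male']
print(sex.cat.codes.tolist())    # [0, 1, 1, 0]
print(sex.cat.ordered)           # False

# rename_categories returns a new Series; the original is unchanged
short = sex.cat.rename_categories({"Female": "F", "Male": "M"})
print(short.tolist())            # ['F', 'M', 'M', 'F']
```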
