Loan Default Prediction: Data Exploration
1. Dataset overview:
a. Load the datasets and check their size and the number of raw features;
1)data_test_a.shape
2)data_train.shape
3)data_train.columns
b. Use info() to get familiar with the data types;
1)data_train.info()
c. Take a quick look at the basic statistics of each feature;
1)data_train.describe()
2)pd.concat([data_train.head(3), data_train.tail(3)])
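The overview steps above can be tried on a small stand-in frame. This is only a sketch: the toy `data_train` and its column names are assumptions for illustration, not the real dataset. It also uses `pd.concat` for the head/tail preview, since `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# Toy stand-in for data_train; the column names are illustrative assumptions.
data_train = pd.DataFrame({
    "loanAmnt": [1000.0, 2500.0, 4000.0, 1200.0, 800.0, 3000.0, 5000.0, 1500.0],
    "term": [3, 5, 3, 3, 5, 3, 5, 3],
    "grade": ["A", "B", "C", "A", "D", "B", "C", "A"],
})

n_rows, n_cols = data_train.shape   # dataset size and number of features
data_train.info()                   # dtype and non-null count per column
summary = data_train.describe()     # statistics for the numeric columns only

# First three and last three rows in one preview frame.
preview = pd.concat([data_train.head(3), data_train.tail(3)])
print(preview)
```

On the real data the same calls apply unchanged once `data_train` has been loaded.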
2. Missing values and unique values:
a. Check missing values
1)print(f'There are {data_train.isnull().any().sum()} columns in train dataset with missing values.')
2)have_null_fea_dict = (data_train.isnull().sum()/len(data_train)).to_dict()
fea_null_moreThanHalf = {}
for key,value in have_null_fea_dict.items():
    if value > 0.5:
        fea_null_moreThanHalf[key] = value
3)fea_null_moreThanHalf
4)missing = data_train.isnull().sum()/len(data_train)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()
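The missing-value checks can also be written more compactly with `isnull().mean()`, which gives the same ratio as `isnull().sum()/len(...)`. The sketch below uses a made-up frame `df` (an assumption for illustration) so the result is easy to verify by eye.

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration only.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],   # 75% missing
    "b": [1.0, 2.0, np.nan, 4.0],         # 25% missing
    "c": [1.0, 2.0, 3.0, 4.0],            # nothing missing
})

# Fraction of missing values per column, keeping only columns that have any.
missing_ratio = df.isnull().mean()
missing_ratio = missing_ratio[missing_ratio > 0].sort_values()

# Columns where more than half of the values are missing.
mostly_missing = missing_ratio[missing_ratio > 0.5].to_dict()

print(missing_ratio)
print(mostly_missing)
```

Calling `missing_ratio.plot.bar()` afterwards would draw the same bar chart as in step 4) above.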
b. Check features that take only a single unique value
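One possible way to do this check is to count distinct values per column with `nunique()` and collect the columns that have at most one. The frame `df` and its column names below are illustrative assumptions.

```python
import pandas as pd

# Illustrative frame; the column names are assumptions.
df = pd.DataFrame({
    "policyCode": [1, 1, 1, 1],        # constant column
    "grade": ["A", "B", "A", "C"],
    "term": [3, 5, 3, 3],
})

# Number of distinct (non-null) values in each column.
unique_counts = df.nunique()

# Columns with a single value carry no information for a model.
one_value_fea = [col for col in df.columns if df[col].nunique() <= 1]

print(unique_counts)
print(one_value_fea)
```

Such constant columns are usually candidates for removal before modelling.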
3. Digging deeper: feature data types
a. Categorical data
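1)def get_numerical_serial_fea(data,feas):
    # Split numerical features by number of distinct values:
    # more than 10 -> continuous (serial), otherwise discrete (noserial).
    numerical_serial_fea = []
    numerical_noserial_fea = []
    for fea in feas:
        temp = data[fea].nunique()
        if temp <= 10:
            numerical_noserial_fea.append(fea)
            continue
        numerical_serial_fea.append(fea)
    return numerical_serial_fea,numerical_noserial_fea
# numerical_fea is assumed to be the list of numerical column names, defined beforehand.
numerical_serial_fea,numerical_noserial_fea = get_numerical_serial_fea(data_train,numerical_fea)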
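The notes do not show how `numerical_fea` is built. A minimal sketch, assuming it is derived from column dtypes with `select_dtypes`, is given below. The frame `df`, the name `category_fea`, and the helper `split_by_cardinality` are hypothetical. The helper applies the same 10-distinct-values rule as the function above.

```python
import pandas as pd

# Illustrative frame; the column names are assumptions.
df = pd.DataFrame({
    "loanAmnt": [float(i) for i in range(1, 13)],   # 12 distinct values
    "term": [3, 5] * 6,                             # 2 distinct values
    "grade": list("ABCABCABCABC"),                  # text column
})

# Numeric columns by dtype; everything else is treated as categorical.
numerical_fea = list(df.select_dtypes(include="number").columns)
category_fea = [col for col in df.columns if col not in numerical_fea]


def split_by_cardinality(data, feas, threshold=10):
    """Return (continuous, discrete) feature names by distinct-value count."""
    continuous, discrete = [], []
    for fea in feas:
        if data[fea].nunique() <= threshold:
            discrete.append(fea)
        else:
            continuous.append(fea)
    return continuous, discrete


continuous_fea, discrete_fea = split_by_cardinality(df, numerical_fea)
print(category_fea, continuous_fea, discrete_fea)
```

The threshold of 10 is a heuristic. A numeric column with few distinct values, such as a loan term, often behaves more like a category than a measurement.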
b. Numerical data
Discrete numerical features
1)data_train['term'].value_counts()
Continuous numerical features
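1)f = pd.melt(data_train, value_vars=numerical_serial_fea)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable histogram with a density curve
g = g.map(sns.histplot, "value", kde=True)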
4. Relationships in the data
a. Relationships between features
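The notes give no code for this step. One common approach, shown here as a sketch rather than the author's method, is a Pearson correlation matrix over the numerical features followed by a list of strongly correlated pairs. The frame `df`, its column names, and the 0.9 threshold are assumptions for illustration.

```python
import pandas as pd

# Illustrative numeric frame; names and values are assumptions.
df = pd.DataFrame({
    "loanAmnt": [1000, 2000, 3000, 4000, 5000],
    "installment": [35, 68, 102, 140, 171],   # almost proportional to loanAmnt
    "dti": [20.0, 5.0, 18.0, 7.0, 12.0],
})

# Pairwise Pearson correlation between all columns.
corr = df.corr()

# Collect each pair once (upper triangle) when |r| exceeds the threshold.
threshold = 0.9
high_corr_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(corr.round(3))
print(high_corr_pairs)
```

Highly correlated pairs carry largely redundant information, which is worth knowing before feature selection.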
b. Relationships between features and the target variable
1)fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 8))
# train_loan_fr / train_loan_nofr are assumed to be the subsets of data_train with and without default
train_loan_fr.groupby('grade')['grade'].count().plot(kind='barh', ax=ax1, title='Count of grade fraud')
train_loan_nofr.groupby('grade')['grade'].count().plot(kind='barh', ax=ax2, title='Count of grade non-fraud')
train_loan_fr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh', ax=ax3, title='Count of employmentLength fraud')
train_loan_nofr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh', ax=ax4, title='Count of employmentLength non-fraud')
plt.show()
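The two subsets used in the plots are not defined in these notes. A minimal sketch of how they might be built is shown below, assuming a binary target column. The name `isDefault` and the toy frame `df` are assumptions. The sketch also computes the default rate per grade, which is often easier to compare than two separate count plots.

```python
import pandas as pd

# Toy frame; "isDefault" is an assumed name for the binary target column.
df = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "B", "C", "C", "C"],
    "isDefault": [0, 0, 0, 1, 0, 1, 1, 0],
})

# Split rows by target value.
train_loan_fr = df.loc[df["isDefault"] == 1]
train_loan_nofr = df.loc[df["isDefault"] == 0]

# Per-grade counts in each subset (these are the series the bar plots draw).
counts_fr = train_loan_fr.groupby("grade")["grade"].count()
counts_nofr = train_loan_nofr.groupby("grade")["grade"].count()

# Share of defaults within each grade.
default_rate = df.groupby("grade")["isDefault"].mean()

print(counts_fr)
print(counts_nofr)
print(default_rate)
```

A grade with no defaults does not appear in `counts_fr` at all, so the two count plots can have different category axes. The rate series keeps every grade.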
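5. Generate a data report with pandas_profiling
import pandas_profiling  # newer releases of this package are published as ydata-profiling
pfr = pandas_profiling.ProfileReport(data_train)
pfr.to_file("./example.html")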