Loan Default Prediction: Data Exploration

2020-09-17  58506fd3fbed

1. Overview of the data:

a. Load the datasets and check their size and the original feature dimensions;

1)data_test_a.shape

2)data_train.shape

3)data_train.columns
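The post never shows how data_train and data_test_a are created. A minimal sketch of the loading step follows, assuming the data comes as CSV files. The in-memory CSV text and the column names loanAmnt and isDefault are stand-ins invented for illustration; with real files the calls would be pd.read_csv on the actual paths.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the real CSV files; with files on disk these
# would be paths passed to pd.read_csv instead of StringIO objects.
train_csv = io.StringIO(
    "id,loanAmnt,term,grade,isDefault\n"
    "0,35000.0,5,E,1\n"
    "1,18000.0,5,D,0\n"
    "2,12000.0,3,B,0\n"
)
test_csv = io.StringIO(
    "id,loanAmnt,term,grade\n"
    "3,14000.0,3,B\n"
    "4,20000.0,5,C\n"
)

data_train = pd.read_csv(train_csv)
data_test_a = pd.read_csv(test_csv)

# shape is (number of rows, number of columns)
print(data_train.shape)
print(data_test_a.shape)
print(list(data_train.columns))
```

In this toy setup the test set has one column fewer than the training set because it carries no target column.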

b. Use info() to get familiar with the data types;

1)data_train.info()

c. Take a rough look at the basic statistics of each feature;

1)data_train.describe()

2)pd.concat([data_train.head(3), data_train.tail(3)])

2. Missing values and unique values:

a. Check for missing values

1)print(f'There are {data_train.isnull().any().sum()} columns in train dataset with missing values.')

2)have_null_fea_dict = (data_train.isnull().sum()/len(data_train)).to_dict()
# Keep only the features with more than 50% missing values
fea_null_moreThanHalf = {}
for key,value in have_null_fea_dict.items():
    if value > 0.5:
        fea_null_moreThanHalf[key] = value

3)fea_null_moreThanHalf

4)missing = data_train.isnull().sum()/len(data_train)
# Plot only the features that have missing values, sorted by missing ratio
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()

b. Check for features with a single unique value
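The post gives no code for this step. One possible sketch is to flag every column whose number of distinct non-null values is at most one, since such a column cannot help separate the classes. The helper name find_one_value_features and the toy column policyCode are assumptions made for illustration.

```python
import pandas as pd

def find_one_value_features(data: pd.DataFrame) -> list:
    """Return the columns that hold at most one distinct non-null value."""
    # nunique() ignores NaN by default, so an all-NaN column is flagged too
    return [col for col in data.columns if data[col].nunique() <= 1]

# Toy frame: policyCode is constant, the other two columns vary
toy = pd.DataFrame({
    "grade": ["A", "B", "A"],
    "policyCode": [1.0, 1.0, 1.0],
    "term": [3, 5, 3],
})

one_value_fea = find_one_value_features(toy)
print(one_value_fea)
```

On the real data the same helper would be called as find_one_value_features(data_train).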

3. Digging deeper: checking data types

a. Categorical data
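Before looking at individual features it helps to separate column names by type. The sketch below is one way to do that, assuming that every non-numeric column should be treated as categorical. The helper name split_by_dtype and the toy values are hypothetical.

```python
import pandas as pd

def split_by_dtype(data: pd.DataFrame):
    """Split column names into numerical and categorical lists."""
    # Integer and float columns count as numerical
    numerical_fea = list(data.select_dtypes(include="number").columns)
    # Everything else (strings, dates stored as text, ...) is treated as categorical
    category_fea = [col for col in data.columns if col not in numerical_fea]
    return numerical_fea, category_fea

toy = pd.DataFrame({
    "grade": ["A", "B", "C"],
    "employmentLength": ["2 years", "10+ years", None],
    "term": [3, 5, 3],
    "loanAmnt": [12000.0, 35000.0, 18000.0],
})

numerical_fea, category_fea = split_by_dtype(toy)
print(numerical_fea)
print(category_fea)
```

The resulting numerical_fea list is the kind of input that get_numerical_serial_fea expects.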

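1)def get_numerical_serial_fea(data,feas):
    # Split the numerical features into continuous and discrete ones
    numerical_serial_fea = []
    numerical_noserial_fea = []
    for fea in feas:
        temp = data[fea].nunique()
        # Features with at most 10 distinct values are treated as discrete
        if temp <= 10:
            numerical_noserial_fea.append(fea)
            continue
        numerical_serial_fea.append(fea)
    return numerical_serial_fea,numerical_noserial_fea

# numerical_fea is not defined in this post; it is assumed to be the list of numerical column names
numerical_serial_fea,numerical_noserial_fea = get_numerical_serial_fea(data_train,numerical_fea)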

b. Numerical data

Discrete numerical data

1)data_train['term'].value_counts()

Continuous numerical data

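1)import pandas as pd
import seaborn as sns
f = pd.melt(data_train, value_vars=numerical_serial_fea)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable plot
g = g.map(sns.histplot, "value", kde=True, stat="density")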

4. Relationships in the data

a. Relationships between features
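The post leaves this step without code. A common way to look at feature-to-feature relationships is a pairwise Pearson correlation matrix over the numerical columns, sketched below. The helper name feature_correlation and the toy columns loanAmnt and installment are assumptions for illustration, not taken from the post.

```python
import pandas as pd

def feature_correlation(data: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation between all pairs of numerical columns."""
    # Non-numerical columns are dropped first so corr() only sees numbers
    return data.select_dtypes(include="number").corr()

toy = pd.DataFrame({
    "loanAmnt": [10000.0, 20000.0, 30000.0, 40000.0],
    "installment": [300.0, 610.0, 890.0, 1210.0],
    "grade": ["A", "B", "C", "D"],
})

corr = feature_correlation(toy)
print(corr)
```

Pairs with an absolute correlation close to 1 carry largely redundant information. The matrix can also be passed to a heatmap plot for a visual check.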

b. Relationships between features and the target variable

1)import matplotlib.pyplot as plt
# train_loan_fr / train_loan_nofr are not defined in this post; they are assumed to be
# the subsets of data_train with and without default
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 8))
train_loan_fr.groupby('grade')['grade'].count().plot(kind='barh', ax=ax1, title='Count of grade fraud')
train_loan_nofr.groupby('grade')['grade'].count().plot(kind='barh', ax=ax2, title='Count of grade non-fraud')
train_loan_fr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh', ax=ax3, title='Count of employmentLength fraud')
train_loan_nofr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh', ax=ax4, title='Count of employmentLength non-fraud')
plt.show()
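Raw counts per group are hard to compare when the two subsets differ in size. As a complement, the sketch below computes the default rate per category. It assumes a binary target column, here called isDefault (a name not given in the post), where 1 marks a default. The helper name default_rate_by is hypothetical.

```python
import pandas as pd

def default_rate_by(data: pd.DataFrame, feature: str, target: str = "isDefault") -> pd.Series:
    """Share of defaulted loans within each category of `feature`."""
    # The mean of a 0/1 target equals the fraction of rows with value 1
    return data.groupby(feature)[target].mean().sort_index()

toy = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "B", "C"],
    "isDefault": [0, 0, 0, 1, 1, 1],
})

rates = default_rate_by(toy, "grade")
print(rates)
```

On the real data the same call would be default_rate_by(data_train, 'grade') or default_rate_by(data_train, 'employmentLength').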

5. Generate a data report with pandas_profiling

import pandas_profiling
pfr = pandas_profiling.ProfileReport(data_train)

pfr.to_file("./example.html")
