27 How to find the features that most affect the outcome with Pandas

2022-11-16 Viterbi

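Use cases:

Feature selection in machine learning: dropping uninformative features can improve model performance, shorten training time, and so on.
Data analysis: finding the factors that contribute most to, for example, fluctuations in revenue.

Worked example: in the sinking of the Titanic, which factors most affected survival?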

1. Import the required packages

import pandas as pd
import numpy as np

# Selects the K features that most affect the outcome
from sklearn.feature_selection import SelectKBest

# Chi-squared test, passed to SelectKBest as its score function
from sklearn.feature_selection import chi2

2. Load the Titanic data

df = pd.read_csv("./datas/titanic/titanic_train.csv")
df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S


df = df[["PassengerId", "Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]].copy()
df.head()

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked
0	1	0	3	male	22.0	1	0	7.2500	S
1	2	1	1	female	38.0	1	0	71.2833	C
2	3	1	3	female	26.0	0	0	7.9250	S
3	4	1	1	female	35.0	1	0	53.1000	S
4	5	0	3	male	35.0	0	0	8.0500	S

3. Data cleaning and transformation

3.1 Check which columns contain missing values

df.info()


    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 9 columns):
    PassengerId    891 non-null int64
    Survived       891 non-null int64
    Pclass         891 non-null int64
    Sex            891 non-null object
    Age            714 non-null float64
    SibSp          891 non-null int64
    Parch          891 non-null int64
    Fare           891 non-null float64
    Embarked       889 non-null object
    dtypes: float64(2), int64(5), object(2)
    memory usage: 62.8+ KB
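
The `df.info()` output shows non-null counts, so the missing values have to be inferred by comparing against the 891 rows. As a small sketch of a more direct check, the snippet below counts missing values per column with `isna().sum()`. The `toy` frame is a made-up stand-in for the Titanic data, not the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for a few Titanic columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing values per column, then keep only columns that have any
missing_counts = toy.isna().sum()
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```

On the real `df`, the same two lines should single out `Age` and `Embarked`, matching the non-null counts above.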

3.2 Fill missing Age values with the median

df["Age"] = df["Age"].fillna(df["Age"].median())

df.head()

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked
0	1	0	3	male	22.0	1	0	7.2500	S
1	2	1	1	female	38.0	1	0	71.2833	C
2	3	1	3	female	26.0	0	0	7.9250	S
3	4	1	1	female	35.0	1	0	53.1000	S
4	5	0	3	male	35.0	0	0	8.0500	S
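
The code in this step fills with `median()` rather than `mean()`. A minimal sketch of why that choice can matter, using an invented `ages` series with one large value (these numbers are illustrative only, not taken from the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one large value and one missing entry
ages = pd.Series([20.0, 22.0, 24.0, 80.0, np.nan], name="Age")

# The mean is pulled toward the large value; the median is not
mean_fill = ages.fillna(ages.mean())
median_fill = ages.fillna(ages.median())

print(mean_fill.iloc[-1], median_fill.iloc[-1])
```

With skewed data like this, the median-filled value stays close to the typical age, which is one common reason to prefer it.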

3.3 Convert the Sex column to numbers

# Sex
df.Sex.unique()


    array(['male', 'female'], dtype=object)



df.loc[df["Sex"] == "male", "Sex"] = 0
df.loc[df["Sex"] == "female", "Sex"] = 1

df.head()

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked
0	1	0	3	0	22.0	1	0	7.2500	S
1	2	1	1	1	38.0	1	0	71.2833	C
2	3	1	3	1	26.0	0	0	7.9250	S
3	4	1	1	1	35.0	1	0	53.1000	S
4	5	0	3	0	35.0	0	0	8.0500	S
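
Assigning through `df.loc` works, but it typically leaves the column with `object` dtype even though every value is now a number. One possible alternative, sketched here on a made-up `toy` frame, is `Series.map` with a dictionary, which does the whole conversion in one step and yields an integer column:

```python
import pandas as pd

# Hypothetical stand-in for the Sex column
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Map each label to its numeric code in a single pass
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})

print(toy["Sex"].tolist(), toy["Sex"].dtype)
```

Any label missing from the dictionary would become NaN, so this sketch assumes the column contains only the two values shown by `unique()`.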

3.4 Fill missing values in the Embarked column and convert its strings to numbers

# Embarked
df.Embarked.unique()


    array(['S', 'C', 'Q', nan], dtype=object)


# Fill missing values
df["Embarked"] = df["Embarked"].fillna(0)

# Convert strings to numbers
df.loc[df["Embarked"] == "S", "Embarked"] = 1
df.loc[df["Embarked"] == "C", "Embarked"] = 2
df.loc[df["Embarked"] == "Q", "Embarked"] = 3

df.head()

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked
0	1	0	3	0	22.0	1	0	7.2500	1
1	2	1	1	1	38.0	1	0	71.2833	2
2	3	1	3	1	26.0	0	0	7.9250	1
3	4	1	1	1	35.0	1	0	53.1000	1
4	5	0	3	0	35.0	0	0	8.0500	1
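
The codes 1, 2 and 3 impose an order on the ports that the data does not really have. As an optional variation (not what this article does), the sketch below one-hot encodes a made-up `toy` Embarked column with `pd.get_dummies`, giving one 0/1 indicator per port plus one for missing values:

```python
import pandas as pd

# Hypothetical stand-in for the Embarked column, including a missing value
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", None]})

# One indicator column per port, plus one for missing entries (dummy_na=True)
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked", dummy_na=True, dtype=int)

print(dummies)
```

Each row has exactly one indicator set to 1, and the resulting columns are non-negative, which is what the chi-squared score function expects.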

4. Split the feature columns from the target column

y = df.pop("Survived")
X = df

X.head()

	PassengerId	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked
0	1	3	0	22.0	1	0	7.2500	1
1	2	1	1	38.0	1	0	71.2833	2
2	3	3	1	26.0	0	0	7.9250	1
3	4	1	1	35.0	1	0	53.1000	1
4	5	3	0	35.0	0	0	8.0500	1

y.head()




    0    0
    1    1
    2    1
    3    1
    4    0
    Name: Survived, dtype: int64

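5. Use the chi-squared test to select the top-K features

# Keep all features (k = number of columns) so that every feature's score can be ranked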
bestfeatures = SelectKBest(score_func=chi2, k=len(X.columns))
fit = bestfeatures.fit(X, y)
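
Here `k` is set to the number of columns so that nothing is dropped. When the goal is to actually keep only the best features, a smaller `k` can be combined with `get_support()` to read back the selected column names. The sketch below uses synthetic data; `X_toy`, `y_toy` and the column names are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Synthetic binary target and three non-negative features (all hypothetical)
y_toy = pd.Series(rng.integers(0, 2, size=200), name="target")
X_toy = pd.DataFrame({
    "informative": y_toy * 3 + rng.integers(0, 2, size=200),  # depends on y
    "noise_a": rng.integers(0, 5, size=200),                  # unrelated to y
    "noise_b": rng.integers(0, 5, size=200),                  # unrelated to y
})

# Keep only the single highest-scoring feature
selector = SelectKBest(score_func=chi2, k=1).fit(X_toy, y_toy)

# get_support() returns a boolean mask over the input columns
selected = X_toy.columns[selector.get_support()].tolist()
print(selected)
```

Calling `selector.transform(X_toy)` would then return only the selected columns as an array.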

6. Print the features in order of importance

df_scores = pd.DataFrame(fit.scores_)
df_scores

	0
0	3.312934
1	30.873699
2	170.348127
3	21.649163
4	2.581865
5	10.097499
6	4518.319091
7	2.771019

df_columns = pd.DataFrame(X.columns)
df_columns

	0
0	PassengerId
1	Pclass
2	Sex
3	Age
4	SibSp
5	Parch
6	Fare
7	Embarked

# Concatenate the two DataFrames side by side
df_feature_scores = pd.concat([df_columns,df_scores],axis=1)
# Name the columns
df_feature_scores.columns = ['feature_name','Score']

# Inspect the result
df_feature_scores

	feature_name	Score
0	PassengerId	3.312934
1	Pclass	30.873699
2	Sex	170.348127
3	Age	21.649163
4	SibSp	2.581865
5	Parch	10.097499
6	Fare	4518.319091
7	Embarked	2.771019
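
The name and score frames can also be combined in one step by building a `pd.Series` indexed by feature name. A minimal sketch, where `scores` and `names` are stand-ins for `fit.scores_` and `X.columns` (the three values are rounded from the table above):

```python
import numpy as np
import pandas as pd

# Stand-ins for fit.scores_ and X.columns, limited to three features
scores = np.array([30.9, 170.3, 21.6])
names = pd.Index(["Pclass", "Sex", "Age"])

# Index the scores by feature name and sort from highest to lowest
ranking = pd.Series(scores, index=names, name="Score").sort_values(ascending=False)
print(ranking)
```

This avoids the intermediate frames and the manual column renaming, at the cost of returning a Series rather than a two-column DataFrame.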

df_feature_scores.sort_values(by="Score", ascending=False)

	feature_name	Score
6	Fare	4518.319091
2	Sex	170.348127
1	Pclass	30.873699
3	Age	21.649163
5	Parch	10.097499
0	PassengerId	3.312934
7	Embarked	2.771019
4	SibSp	2.581865
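
One caveat when reading this ranking: the chi-squared score in scikit-learn is computed from sums of feature values, so it grows with a feature's magnitude. Fare has by far the largest values here, which may account for part of its lead. The sketch below illustrates the effect on synthetic data; `feature` and `y_toy` are invented and unrelated to the Titanic set.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)

# Synthetic binary target and one non-negative feature weakly tied to it
y_toy = rng.integers(0, 2, size=300)
feature = rng.integers(0, 4, size=300) + y_toy

# Second column carries the same information, only multiplied by 10
X_toy = np.column_stack([feature, feature * 10])

scores, p_values = chi2(X_toy, y_toy)
ratio = scores[1] / scores[0]
print(scores, ratio)
```

Both columns carry identical information, yet the rescaled one scores ten times higher, so comparing raw scores across features with very different ranges deserves some care.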
