27 How to find the features that most influence the outcome with Pandas

2022-11-16  Viterbi

Use case:

Worked example: in the sinking of the Titanic, which factors most affected who lived and who died?

1. Import the required packages

import pandas as pd
import numpy as np

# Selects the K features that most influence the outcome
from sklearn.feature_selection import SelectKBest

# Chi-squared test, passed to SelectKBest as its scoring function
from sklearn.feature_selection import chi2

2. Load the Titanic data

df = pd.read_csv("./datas/titanic/titanic_train.csv")
df.head()

       PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
    0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
    1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
    2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
    3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
    4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S

df = df[["PassengerId", "Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]].copy()
df.head()

       PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch     Fare  Embarked
    0            1         0       3    male  22.0      1      0   7.2500         S
    1            2         1       1  female  38.0      1      0  71.2833         C
    2            3         1       3  female  26.0      0      0   7.9250         S
    3            4         1       1  female  35.0      1      0  53.1000         S
    4            5         0       3    male  35.0      0      0   8.0500         S

3. Clean and transform the data

3.1 Check which columns contain null values

df.info()


    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 9 columns):
    PassengerId    891 non-null int64
    Survived       891 non-null int64
    Pclass         891 non-null int64
    Sex            891 non-null object
    Age            714 non-null float64
    SibSp          891 non-null int64
    Parch          891 non-null int64
    Fare           891 non-null float64
    Embarked       889 non-null object
    dtypes: float64(2), int64(5), object(2)
    memory usage: 62.8+ KB
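
`df.info()` reports non-null counts, so the number of missing values has to be worked out by subtraction (891 - 714 = 177 for Age, 891 - 889 = 2 for Embarked). A minimal sketch of a more direct check with `isna().sum()`, using a small made-up frame (`toy` and its values are illustrative assumptions, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing values per column, then keep only the columns that have any.
null_counts = toy.isna().sum()
cols_with_nulls = null_counts[null_counts > 0]
print(cols_with_nulls)
```

Running the same two lines on `df` should list only the columns that need filling before the chi-squared step.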

3.2 Fill the missing Age values with the median

df["Age"] = df["Age"].fillna(df["Age"].median())

df.head()
       PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch     Fare  Embarked
    0            1         0       3    male  22.0      1      0   7.2500         S
    1            2         1       1  female  38.0      1      0  71.2833         C
    2            3         1       3  female  26.0      0      0   7.9250         S
    3            4         1       1  female  35.0      1      0  53.1000         S
    4            5         0       3    male  35.0      0      0   8.0500         S
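
The fill above uses `df["Age"].median()`. A short sketch of how that differs from filling with the mean, on a made-up series with one outlier (the `ages` values are an assumption chosen to make the gap visible):

```python
import numpy as np
import pandas as pd

# Made-up ages: two gaps and one unusually large value.
ages = pd.Series([20.0, 22.0, 24.0, np.nan, 80.0, np.nan])

# The median ignores NaN and is barely moved by the outlier.
filled_median = ages.fillna(ages.median())
# The mean also ignores NaN but is pulled toward the outlier.
filled_mean = ages.fillna(ages.mean())

print(filled_median.tolist())
print(filled_mean.tolist())
```

The median is usually the safer default for a skewed column, which is one plausible reason to prefer it here.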

3.3 Convert the Sex column to numbers

# Sex
df.Sex.unique()


    array(['male', 'female'], dtype=object)


df.loc[df["Sex"] == "male", "Sex"] = 0
df.loc[df["Sex"] == "female", "Sex"] = 1

df.head()
       PassengerId  Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
    0            1         0       3    0  22.0      1      0   7.2500         S
    1            2         1       1    1  38.0      1      0  71.2833         C
    2            3         1       3    1  26.0      0      0   7.9250         S
    3            4         1       1    1  35.0      1      0  53.1000         S
    4            5         0       3    0  35.0      0      0   8.0500         S

3.4 Fill the missing Embarked values and convert the strings to numbers

# Embarked
df.Embarked.unique()


    array(['S', 'C', 'Q', nan], dtype=object)


# Fill the missing values
df["Embarked"] = df["Embarked"].fillna(0)

# Convert the strings to numbers
df.loc[df["Embarked"] == "S", "Embarked"] = 1
df.loc[df["Embarked"] == "C", "Embarked"] = 2
df.loc[df["Embarked"] == "Q", "Embarked"] = 3

df.head()
       PassengerId  Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
    0            1         0       3    0  22.0      1      0   7.2500         1
    1            2         1       1    1  38.0      1      0  71.2833         2
    2            3         1       3    1  26.0      0      0   7.9250         1
    3            4         1       1    1  35.0      1      0  53.1000         1
    4            5         0       3    0  35.0      0      0   8.0500         1
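
Assigning codes with `df.loc[...] = value` can leave the column with `object` dtype, depending on the pandas version. A minimal alternative sketch uses `Series.map` with a lookup dict and an explicit integer cast. The frame `toy` and the dict names `sex_codes` and `embarked_codes` are hypothetical; the codes mirror the ones used in this article:

```python
import numpy as np
import pandas as pd

# Toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", np.nan, "Q"],
})

# Hypothetical lookup tables mirroring the codes used in the article.
sex_codes = {"male": 0, "female": 1}
embarked_codes = {"S": 1, "C": 2, "Q": 3}

toy["Sex"] = toy["Sex"].map(sex_codes).astype("int64")
# Values absent from the mapping (including NaN) become NaN, then get code 0.
toy["Embarked"] = toy["Embarked"].map(embarked_codes).fillna(0).astype("int64")

print(toy.dtypes)
```

Note that the codes 1, 2, 3 for Embarked are arbitrary labels rather than quantities, so any score computed from them depends on which port happens to get which number.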

4. Split the feature columns from the target column

y = df.pop("Survived")
X = df

X.head()
       PassengerId  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
    0            1       3    0  22.0      1      0   7.2500         1
    1            2       1    1  38.0      1      0  71.2833         2
    2            3       3    1  26.0      0      0   7.9250         1
    3            4       1    1  35.0      1      0  53.1000         1
    4            5       3    0  35.0      0      0   8.0500         1

y.head()

    0    0
    1    1
    2    1
    3    1
    4    0
    Name: Survived, dtype: int64
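
`df.pop("Survived")` removes the column from `df` in place, so `X = df` is the same object with the target already gone. If the original frame is still needed afterwards, a non-mutating split is one option. A small sketch on made-up data (`toy`, `X_toy` and `y_toy` are illustrative names):

```python
import pandas as pd

# Toy frame; the values are made up.
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Fare": [7.25, 71.28, 7.93],
})

# drop() returns a new frame, so toy still contains the target afterwards.
X_toy = toy.drop(columns=["Survived"])
y_toy = toy["Survived"]

print(list(X_toy.columns))
print("Survived" in toy.columns)
```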

5. Use the chi-squared test to select the top-K features

# Keep all features so that the full importance ranking is visible
bestfeatures = SelectKBest(score_func=chi2, k=len(X.columns))
fit = bestfeatures.fit(X, y)
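
A fitted `SelectKBest` exposes more than `scores_`. The sketch below, on made-up data, uses `k="all"` (equivalent to passing the column count) and also reads `pvalues_` and `get_support()`. `X_toy` and `y_toy` are assumptions for illustration. scikit-learn's `chi2` requires every feature value to be non-negative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Made-up non-negative features: column 0 tracks the label, column 1 does not.
X_toy = np.array([
    [9, 1],
    [8, 2],
    [1, 1],
    [0, 2],
    [9, 1],
    [1, 2],
])
y_toy = np.array([1, 1, 0, 0, 1, 0])

selector = SelectKBest(score_func=chi2, k="all").fit(X_toy, y_toy)

print(selector.scores_)        # chi-squared statistic per feature
print(selector.pvalues_)       # p-value per feature
print(selector.get_support())  # boolean mask of the kept features
```

With a smaller `k`, `get_support()` marks only the top-scoring columns, which is how the ranking turns into an actual selection.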

6. Print the features in order of importance

df_scores = pd.DataFrame(fit.scores_)
df_scores
                 0
    0     3.312934
    1    30.873699
    2   170.348127
    3    21.649163
    4     2.581865
    5    10.097499
    6  4518.319091
    7     2.771019

df_columns = pd.DataFrame(X.columns)
df_columns
                 0
    0  PassengerId
    1       Pclass
    2          Sex
    3          Age
    4        SibSp
    5        Parch
    6         Fare
    7     Embarked

# Merge the two DataFrames side by side
df_feature_scores = pd.concat([df_columns, df_scores], axis=1)
# Name the columns
df_feature_scores.columns = ['feature_name', 'Score']

# Inspect the result
df_feature_scores
       feature_name        Score
    0   PassengerId     3.312934
    1        Pclass    30.873699
    2           Sex   170.348127
    3           Age    21.649163
    4         SibSp     2.581865
    5         Parch    10.097499
    6          Fare  4518.319091
    7      Embarked     2.771019

df_feature_scores.sort_values(by="Score", ascending=False)
       feature_name        Score
    6          Fare  4518.319091
    2           Sex   170.348127
    1        Pclass    30.873699
    3           Age    21.649163
    5         Parch    10.097499
    0   PassengerId     3.312934
    7      Embarked     2.771019
    4         SibSp     2.581865
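
Two things about this ranking are worth sketching. First, the same table can be built in one step from a `pd.Series` indexed by column name. Second, scikit-learn's `chi2` treats feature values as counts, so multiplying a column by a constant multiplies its score by the same constant. This is probably a large part of why Fare scores so far above the rest, and why an identifier like PassengerId gets a non-zero score at all. The data and the column names `fare_pounds` and `fare_pence` below are made up to show the effect:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2

# Made-up data: "fare_pence" carries the same information as "fare_pounds",
# it is just the same column multiplied by 100.
X_toy = pd.DataFrame({
    "fare_pounds": [7.0, 71.0, 8.0, 53.0, 8.0, 30.0],
    "sex": [0, 1, 1, 1, 0, 0],
})
X_toy["fare_pence"] = X_toy["fare_pounds"] * 100
y_toy = np.array([0, 1, 1, 1, 0, 0])

scores, _ = chi2(X_toy, y_toy)

# One-step ranking: index the scores by column name and sort.
ranking = pd.Series(scores, index=X_toy.columns, name="Score")
ranking = ranking.sort_values(ascending=False)
print(ranking)

# Ratio between the scores of the rescaled and the original fare column.
print(ranking["fare_pence"] / ranking["fare_pounds"])
```

Because of this, the scores are easier to compare across features that share a scale, for example after binning or min-max scaling; treat the raw ordering above as a first look rather than a final verdict.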
