[阿旭 Machine Learning in Practice] [3] KNN Regression Model: Annual Income Prediction in Practice
2022-11-10
阿旭123
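This series covers hands-on machine learning. It aims to teach machine learning algorithms through practice, which makes them easier to learn and master. More content is on the way.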
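Problem description
Train a model with the KNN algorithm, then use it to predict whether a person's annual income exceeds 50K. Since the target has two categories, this is a binary classification task.
Loading the dataset and inspecting the data
# Import the required libraries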
import pandas as pd
from pandas import Series,DataFrame
import numpy as np
df = pd.read_csv("./adults.txt")
df.head()
| | age | workclass | final_weight | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | salary |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
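The dataset contains 14 features: age, workclass, final_weight, education, education_num, marital_status, occupation, relationship, race, sex, capital_gain, capital_loss, hours_per_week, and native_country.
The last column, salary, is the label: it records the person's annual income as a category (for example <=50K) rather than as an exact amount.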
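To see which of these columns hold strings and will need encoding later, one option is to check the column dtypes. The sketch below uses a tiny hand-made frame rather than the real file; the name `sample` and its three rows are illustrative, copied from the preview above.

```python
import pandas as pd

# Illustrative three-row sample with a few of the columns shown above
sample = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
    "education_num": [13, 13, 9],
    "sex": ["Male", "Male", "Male"],
})

# Columns that are not numeric must be encoded before KNN can use them
string_cols = sample.select_dtypes(exclude="number").columns.tolist()
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

print(string_cols)   # ['workclass', 'sex']
print(numeric_cols)  # ['age', 'education_num']
```

On the real data, the same two calls could be run on `df` directly.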
Feature engineering
Separating features and labels
# Feature data
data = df.iloc[:,:-1].copy()
data.head()
| | age | workclass | final_weight | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States |
4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba |
# Label data
target = df[["salary"]].copy()
target.head()
| | salary |
---|---|
0 | <=50K |
1 | <=50K |
2 | <=50K |
3 | <=50K |
4 | <=50K |
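Encoding non-numeric features
KNN works by computing distances between samples, so it can only operate on numeric values. Non-numeric features therefore need to be encoded as numbers first.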
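To make the reason concrete: KNN classifies a sample by measuring its distance to every training sample and letting the k closest ones vote. Below is a minimal NumPy sketch of that idea with made-up points and a hypothetical helper `knn_predict`. It illustrates the principle and is not scikit-learn's implementation.

```python
import numpy as np
from collections import Counter

def knn_predict(x_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every training sample
    dists = np.sqrt(((x_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up data: two numeric features (age, education_num) and an income label
x_train = np.array([[25, 9], [30, 10], [45, 13], [50, 14], [52, 16]], dtype=float)
y_train = np.array(["<=50K", "<=50K", ">50K", ">50K", ">50K"])

pred = knn_predict(x_train, y_train, np.array([48.0, 13.0]), k=3)
print(pred)  # >50K
```

The subtraction and squaring in the distance step have no meaning for strings such as "Private", which is why those columns need numeric codes first.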
Encoding string-valued features
Encoding the workclass feature
# List the distinct work classes
ws = data.workclass.unique()
ws
array(['State-gov', 'Self-emp-not-inc', 'Private', 'Federal-gov',
'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'],
dtype=object)
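As shown, there are 9 work classes in total, including the unknown "?". Below we encode them with the integers 0 to 8.
# Define the conversion function
def convert_ws(item):
    # np.argwhere returns the positions where ws equals item; [0, 0] takes the first match
    return np.argwhere(ws == item)[0, 0]
# Replace each work class with its index in ws
data.workclass = data.workclass.map(convert_ws)
# Inspect the data after encoding workclass
data.head()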
| | age | workclass | final_weight | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | 0 | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States |
1 | 50 | 1 | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States |
2 | 38 | 2 | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States |
3 | 53 | 2 | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States |
4 | 28 | 2 | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba |
np.argwhere returns the index of the matching work class. For example, np.argwhere(ws == "?")[0, 0] returns 5.
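As a cross-check, pandas has a built-in that performs the same kind of first-appearance encoding: `pd.factorize`. The short sketch below runs on a made-up Series (the values in `s` are illustrative) and shows that it yields the same codes as the `np.argwhere` approach.

```python
import numpy as np
import pandas as pd

# Made-up column with repeated categories, including the unknown marker "?"
s = pd.Series(["State-gov", "Private", "?", "Private", "State-gov"], dtype=object)

# Approach used in this post: index of each value in the array of unique values
uniques = s.unique()
codes_argwhere = s.map(lambda item: np.argwhere(uniques == item)[0, 0])

# pandas built-in: codes are assigned in order of first appearance
codes_factorize, categories = pd.factorize(s)

print(codes_argwhere.tolist())   # [0, 1, 2, 1, 0]
print(codes_factorize.tolist())  # [0, 1, 2, 1, 0]
```

Both approaches number the categories in the order they first appear, so the resulting codes have no meaning beyond identifying the category.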
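Encoding the other string features
The process is the same as for workclass above.
# Columns that need to be encoded
cols = ['education',"marital_status","occupation","relationship","race","sex","native_country"]
# Encode each column in turn
def convert_item(item):
    # uni is the global array of unique values for the column currently being processed
    return np.argwhere(uni == item)[0, 0]
for col in cols:
    uni = data[col].unique()
    data[col] = data[col].map(convert_item)
# Inspect the data after encoding all columns
data.head()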
| | age | workclass | final_weight | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | 0 | 77516 | 0 | 13 | 0 | 0 | 0 | 0 | 0 | 2174 | 0 | 40 | 0 |
1 | 50 | 1 | 83311 | 0 | 13 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 13 | 0 |
2 | 38 | 2 | 215646 | 1 | 9 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 40 | 0 |
3 | 53 | 2 | 234721 | 2 | 7 | 1 | 2 | 1 | 1 | 0 | 0 | 0 | 40 | 0 |
4 | 28 | 2 | 338409 | 0 | 13 | 1 | 3 | 2 | 1 | 1 | 0 | 0 | 40 | 1 |
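Because the integer codes replace the original strings, it can be handy to keep the category-to-code mapping for each column so the numbers can be read back later. Here is a sketch under that assumption; `encode_columns` and the toy frame are hypothetical and not part of the original code.

```python
import pandas as pd

def encode_columns(frame, cols):
    """Return an encoded copy of frame plus a {column: {category: code}} dict."""
    encoded = frame.copy()
    mappings = {}
    for col in cols:
        # Categories in order of first appearance, matching unique()
        categories = frame[col].unique()
        mappings[col] = {cat: code for code, cat in enumerate(categories)}
        encoded[col] = frame[col].map(mappings[col])
    return encoded, mappings

toy = pd.DataFrame({
    "sex": ["Male", "Male", "Female"],
    "race": ["White", "Black", "White"],
    "age": [39, 53, 28],
})

encoded, mappings = encode_columns(toy, ["sex", "race"])
print(encoded)
print(mappings["sex"])  # {'Male': 0, 'Female': 1}
```

Passing the list of columns explicitly also avoids relying on a global variable inside the mapping function.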
Modeling and evaluation
Now that all features have been encoded as numbers, we can build a model with the KNN algorithm.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
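# Create the model
knn = KNeighborsClassifier(n_neighbors=8)
# Split into training and test sets (1% of the rows are held out for testing)
x_train,x_test,y_train,y_test = train_test_split(data,target,test_size=0.01)
# Train the model
knn.fit(x_train,y_train)
# Check the model's accuracy on the test set
knn.score(x_test,y_test)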
0.7822085889570553
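An accuracy figure is easier to judge next to a trivial baseline that always predicts the most frequent class. Below is a sketch using scikit-learn's `DummyClassifier` on made-up labels. The 75/25 split is illustrative and is not the real class balance of this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 75 samples of one class, 25 of the other
y = np.array(["<=50K"] * 75 + [">50K"] * 25)
# The baseline ignores the features, so a placeholder column is enough
X = np.zeros((len(y), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

baseline_acc = baseline.score(X, y)
print(baseline_acc)  # 0.75
```

On the real data, the same check could be run with `data` and `target` in place of `X` and `y`. A KNN score is informative only to the extent that it beats this number.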
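Model optimization
As we can see, training the KNN model directly on the encoded features, without any further processing, gives an accuracy of only about 78%. One likely reason is that features with large numeric ranges, such as final_weight, dominate the distance calculation.
Next we normalize the feature data, then build and test the KNN model again to see how the results change.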
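# Normalize all features
# Define the min-max normalization function
def func(x):
    # Rescale a column to the range [0, 1]
    return (x - x.min()) / (x.max() - x.min())
# Apply the normalization to every feature column
data[data.columns] = data[data.columns].transform(func)
data.head()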
| | age | workclass | final_weight | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.301370 | 0.000 | 0.044302 | 0.000000 | 0.800000 | 0.000000 | 0.000000 | 0.0 | 0.00 | 0.0 | 0.02174 | 0.0 | 0.397959 | 0.00000 |
1 | 0.452055 | 0.125 | 0.048238 | 0.000000 | 0.800000 | 0.166667 | 0.071429 | 0.2 | 0.00 | 0.0 | 0.00000 | 0.0 | 0.122449 | 0.00000 |
2 | 0.287671 | 0.250 | 0.138113 | 0.066667 | 0.533333 | 0.333333 | 0.142857 | 0.0 | 0.00 | 0.0 | 0.00000 | 0.0 | 0.397959 | 0.00000 |
3 | 0.493151 | 0.250 | 0.151068 | 0.133333 | 0.400000 | 0.166667 | 0.142857 | 0.2 | 0.25 | 0.0 | 0.00000 | 0.0 | 0.397959 | 0.00000 |
4 | 0.150685 | 0.250 | 0.221488 | 0.000000 | 0.800000 | 0.166667 | 0.214286 | 0.4 | 0.25 | 1.0 | 0.00000 | 0.0 | 0.397959 | 0.02439 |
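To see why scaling matters for a distance-based model, compare which training point is nearest to a query before and after min-max scaling. The numbers below are made up: the first column stands in for a large-range feature such as final_weight and the second for a small-range one such as education_num.

```python
import numpy as np

# Columns: [large-range feature, small-range feature]
points = np.array([
    [100000.0, 16.0],   # A: close to the query in column 0, far in column 1
    [105000.0, 9.0],    # B: slightly farther in column 0, identical in column 1
    [300000.0, 1.0],    # C: supplies the max of column 0 and the min of column 1
])
query = np.array([101000.0, 9.0])

def nearest(train, q):
    """Index of the row of train closest to q (Euclidean distance)."""
    return int(np.argmin(np.sqrt(((train - q) ** 2).sum(axis=1))))

# Raw values: column 0 dominates the distance
raw_nearest = nearest(points, query)

# Min-max scaling using the training points' own min and max
lo, hi = points.min(axis=0), points.max(axis=0)
scaled_nearest = nearest((points - lo) / (hi - lo), (query - lo) / (hi - lo))

print(raw_nearest, scaled_nearest)  # 0 1
```

On the raw values the query's nearest neighbor is A, purely because of the large-range column. After scaling, both columns contribute comparably and B becomes the nearest.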
# Split into training and test sets
x_train,x_test,y_train,y_test = train_test_split(data,target,test_size=0.01)
# Create the model
knn = KNeighborsClassifier(n_neighbors=8)
# Train the model
knn.fit(x_train,y_train)
# Check the model's accuracy on the test set
knn.score(x_test,y_test)
0.8374233128834356
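After normalizing all the features, accuracy rose from about 78% to about 84%, a decent improvement. Keep in mind that the two scores come from different random 1% test splits, so part of the gap may be sampling noise.
There are of course other ways to optimize the model, and later posts will cover them.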
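One way to run such a before/after comparison on a fixed split, with the scaling parameters learned from the training portion only, is to combine scikit-learn's `MinMaxScaler` and the classifier in a pipeline. The sketch below uses synthetic data. The feature construction and the `random_state` values are assumptions for illustration, so the scores say nothing about the income dataset itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: a small-range feature that decides the label,
# and a large-range feature that is pure noise
rng = np.random.default_rng(0)
signal = rng.uniform(0, 1, size=2000)
noise = rng.uniform(0, 100000, size=2000)
X = np.column_stack([signal, noise])
y = (signal > 0.5).astype(int)

# Fixed random_state so both models are evaluated on the same split
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# KNN on the raw features
raw_knn = KNeighborsClassifier(n_neighbors=8).fit(x_train, y_train)
raw_score = raw_knn.score(x_test, y_test)

# KNN with min-max scaling fitted on the training split only
scaled_knn = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=8))
scaled_knn.fit(x_train, y_train)
scaled_score = scaled_knn.score(x_test, y_test)

print(raw_score, scaled_score)
```

Because the scaler sits inside the pipeline, the test rows never influence the min and max used for scaling.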
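Summary
This post covered the following points:
- How to encode string-valued data as numbers
- Using a KNN model to predict a person's annual income category
- Model optimization: normalizing the data helps improve the model's accuracy.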
If this post was helpful, a like and a follow are appreciated!
More content is on the way.