GitHub Code Study --- Breast Cancer Classification Prediction
The source code comes from:
https://github.com/Jean-njoroge/Breast-cancer-risk-prediction
The analysis is divided into four parts, saved as Jupyter notebooks in the repository:
1. Identify the problem and the data source
2. Exploratory data analysis
3. Data preprocessing
4. Build models to predict whether breast tissue cells are malignant or benign
- The author split the work into 6 Jupyter notebooks
Notebook 01: Load the dataset and frame the analysis problem
Breast cancer is the most common malignant tumor in women, accounting for nearly one third of cancers diagnosed in women in the United States, and it is the second leading cause of cancer death in women. Breast cancer results from abnormal growth of cells in breast tissue, commonly referred to as a tumor. A tumor does not necessarily mean cancer -- tumors can be benign (non-cancerous), pre-malignant (pre-cancerous), or malignant (cancerous). Tests such as MRI, mammography, ultrasound, and biopsy are commonly used to diagnose breast cancer.
1.1 Background
Principle: fine needle aspiration (FNA) of the breast is used to identify breast cancer. It is a quick and simple procedure that withdraws a small amount of fluid or cells from a breast lesion or cyst (a lump, sore, or swelling), using a needle similar to the one used for blood samples.
From the measured features and labels we build a model that classifies breast tumors as:
- 1 = malignant (cancerous)
- 0 = benign (non-cancerous)
Clearly, this is a binary classification problem.
1.2 The data
The breast cancer dataset is available from the UCI Machine Learning Repository maintained by the University of California, Irvine (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29). The dataset contains 569 samples of malignant and benign tumor cells.
- The first two columns store each sample's unique ID number and the corresponding diagnosis (M = malignant, B = benign).
- Columns 3-32 contain 30 real-valued features computed from digitized images of the cell nuclei; they can be used to build a model that predicts whether a tumor is benign or malignant.
Ten real-valued features are computed for each cell nucleus:
a) radius (mean distance from the center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
#load libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
data = pd.read_csv('data/data.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 569 non-null int64
1 diagnosis 569 non-null object
2 radius_mean 569 non-null float64
3 texture_mean 569 non-null float64
4 perimeter_mean 569 non-null float64
5 area_mean 569 non-null float64
6 smoothness_mean 569 non-null float64
7 compactness_mean 569 non-null float64
8 concavity_mean 569 non-null float64
9 concave points_mean 569 non-null float64
10 symmetry_mean 569 non-null float64
11 fractal_dimension_mean 569 non-null float64
12 radius_se 569 non-null float64
13 texture_se 569 non-null float64
14 perimeter_se 569 non-null float64
15 area_se 569 non-null float64
16 smoothness_se 569 non-null float64
17 compactness_se 569 non-null float64
18 concavity_se 569 non-null float64
19 concave points_se 569 non-null float64
20 symmetry_se 569 non-null float64
21 fractal_dimension_se 569 non-null float64
22 radius_worst 569 non-null float64
23 texture_worst 569 non-null float64
24 perimeter_worst 569 non-null float64
25 area_worst 569 non-null float64
26 smoothness_worst 569 non-null float64
27 compactness_worst 569 non-null float64
28 concavity_worst 569 non-null float64
29 concave points_worst 569 non-null float64
30 symmetry_worst 569 non-null float64
31 fractal_dimension_worst 569 non-null float64
dtypes: float64(30), int64(1), object(1)
memory usage: 142.4+ KB
- 30 features in total: the mean, se (standard error), and worst value of each of the 10 real-valued features
- The diagnosis column is the label
- The data contains no null values
# View the first few rows of the data
data.head()
# Count the labels
data.diagnosis.value_counts().plot(kind = "bar")
(Figure: bar chart of diagnosis counts)
- Benign : malignant is roughly 2:1. In machine learning a 1:1 ratio of positive to negative samples is ideal, but a 2:1 ratio still allows normal classification. A quick check of the exact counts is sketched below.
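As a quick sanity check, here is a minimal sketch (assuming `data` is loaded as above) that prints the exact class counts and proportions:
# Minimal sketch: quantify the class balance (assumes `data` is loaded as above)
counts = data['diagnosis'].value_counts()                 # B: 357, M: 212 in this dataset
ratios = data['diagnosis'].value_counts(normalize=True)
print(counts)
print(ratios.round(3))                                    # roughly 0.63 benign vs 0.37 malignant, about 2:1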
#check for missing variables
data.isnull().any()
data.isnull().any().sum()
0
- No missing values in the data
Notebook 02: Exploratory Data Analysis (EDA)
Exploratory data analysis (EDA) is a very important step that should be done before any modeling, because it lets the data scientist understand the nature of the data without making assumptions. Data exploration is mainly about grasping the structure of the data, the distribution of values, whether there are outliers, and the relationships between features.
It mainly includes:
- Descriptive statistics
- Data visualization
2.1 Descriptive statistics
%matplotlib inline
import matplotlib.pyplot as plt
#Load libraries for data processing
import pandas as pd
import numpy as np  # needed later for np.log1p / np.zeros_like
from scipy.stats import norm
import seaborn as sns # visualization
plt.rcParams['figure.figsize'] = (15,8)
plt.rcParams['axes.titlesize'] = 'large'
data = pd.read_csv('data/data.csv')
#basic descriptive statistics
data.iloc[:,2:32].describe()
radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean fractal_dimension_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
count 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 ... 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000
mean 14.127292 19.289649 91.969033 654.889104 0.096360 0.104341 0.088799 0.048919 0.181162 0.062798 ... 16.269190 25.677223 107.261213 880.583128 0.132369 0.254265 0.272188 0.114606 0.290076 0.083946
std 3.524049 4.301036 24.298981 351.914129 0.014064 0.052813 0.079720 0.038803 0.027414 0.007060 ... 4.833242 6.146258 33.602542 569.356993 0.022832 0.157336 0.208624 0.065732 0.061867 0.018061
min 6.981000 9.710000 43.790000 143.500000 0.052630 0.019380 0.000000 0.000000 0.106000 0.049960 ... 7.930000 12.020000 50.410000 185.200000 0.071170 0.027290 0.000000 0.000000 0.156500 0.055040
25% 11.700000 16.170000 75.170000 420.300000 0.086370 0.064920 0.029560 0.020310 0.161900 0.057700 ... 13.010000 21.080000 84.110000 515.300000 0.116600 0.147200 0.114500 0.064930 0.250400 0.071460
50% 13.370000 18.840000 86.240000 551.100000 0.095870 0.092630 0.061540 0.033500 0.179200 0.061540 ... 14.970000 25.410000 97.660000 686.500000 0.131300 0.211900 0.226700 0.099930 0.282200 0.080040
75% 15.780000 21.800000 104.100000 782.700000 0.105300 0.130400 0.130700 0.074000 0.195700 0.066120 ... 18.790000 29.720000 125.400000 1084.000000 0.146000 0.339100 0.382900 0.161400 0.317900 0.092080
max 28.110000 39.280000 188.500000 2501.000000 0.163400 0.345400 0.426800 0.201200 0.304000 0.097440 ... 36.040000 49.540000 251.200000 4254.000000 0.222600 1.058000 1.252000 0.291000 0.663800 0.207500
# Group by diagnosis and review the output.
# Typically used for within-group aggregation, e.g. group medians or means
diag_gr = data.groupby('diagnosis', axis=0)
diag_gr.median()
diag_gr.size() # equivalent to data.diagnosis.value_counts()
2.2 Data Visualizations
- Histograms
- Density plots
- Box plots
- Heatmaps
# Set the plot background style and figure size globally
sns.set_style("white")
sns.set_context({"figure.figsize": (10, 8)})
## Bar chart of the label counts
sns.countplot(data['diagnosis'], label='Count', palette="Set3", order=["B","M"])  # order sets the plotting order of the categories
(Figure: count plot of diagnosis)
Split the features into three groups: mean, se, and worst
#For a merge + slice:
data_mean=data.iloc[:,2:12]
data_se=data.iloc[:,12:22]
data_worst=data.iloc[:,22:]
print(data_mean.columns)
print(data_se.columns)
print(data_worst.columns)
Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean'],
dtype='object')
Index(['radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
'fractal_dimension_se'],
dtype='object')
Index(['radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst',
'smoothness_worst', 'compactness_worst', 'concavity_worst',
'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst'],
dtype='object')
Visualize each feature group with histograms
#Plot histograms of each feature group
data_mean.hist(bins=10, figsize=(15, 10),grid=False,color = "pink")
data_se.hist(bins=10, figsize=(15, 10),grid=False,color = "orange")
data_worst.hist(bins=10, figsize=(15, 10),grid=False,color = "blue")
(Figures: histograms of data_mean, data_se, and data_worst)
- We can see that concavity and concave points may follow an exponential-like distribution, while texture, smoothness, and symmetry appear Gaussian or nearly Gaussian. Many machine learning techniques assume a Gaussian univariate distribution of the input variables (see the skewness sketch below).
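To back this visual impression with numbers, here is a small sketch (assuming `data_mean` as defined above) that computes the skewness of each mean feature:
# Minimal sketch: measure how far each mean feature is from a symmetric/Gaussian shape (assumes data_mean as above)
skewness = data_mean.skew().sort_values(ascending=False)
print(skewness)   # larger positive values indicate stronger right skew, i.e. an exponential-like shape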
Probability density curves
#Density Plots
# assign the result to a separate name rather than `plt`, otherwise the matplotlib module reference is overwritten
axes = data_mean.plot(kind='density', subplots=True,
                      layout=(4,3), sharex=False,
                      sharey=False, fontsize=15, figsize=(15,10))
axes = data_se.plot(kind='density', subplots=True,
                    layout=(4,3), sharex=False,
                    sharey=False, fontsize=15, figsize=(15,10))
axes = data_worst.plot(kind='density', subplots=True,
                       layout=(4,3), sharex=False,
                       sharey=False, fontsize=15, figsize=(15,10))
(Figures: density plots of data_mean, data_se, and data_worst)
- Perimeter, radius, area, concavity, and concave points may follow exponential-like distributions; texture, smoothness, and symmetry appear Gaussian or nearly Gaussian.
The central limit theorem tells us that the distribution of sample means approaches a normal distribution as the sample size grows, but some variables are simply not normally distributed themselves. For tests or models that assume normality, such variables need to be transformed beforehand.
Another benefit is that, after the transformation, extremely large or small values move closer to the typical range, which reduces the influence of extreme values on the model.
- Features with an exponential-like distribution can be made approximately Gaussian with a log transform (e.g. log()).
# transform exponential distribution to Gaussian univariate distribution
data_mean['area_mean'].plot(kind = "hist", figsize=(8,6))
np.log1p(data_mean['area_mean']).plot(kind = "hist", figsize=(8,6))
np.log10(data_mean['area_mean']).plot(kind = "hist", figsize=(8,6))
(Figures: histograms of area_mean, np.log1p(area_mean), and np.log10(area_mean))
Visualize the distributions and outliers with box plots
# box and whisker plots
# again, assign to a separate name instead of `plt`
axes = data_mean.plot(kind='box', subplots=True, layout=(4,4),
                      sharex=False, sharey=False, fontsize=12)
axes = data_se.plot(kind='box', subplots=True, layout=(4,4),
                    sharex=False, sharey=False, fontsize=12)
axes = data_worst.plot(kind='box', subplots=True, layout=(4,4),
                       sharex=False, sharey=False, fontsize=12)
(Figures: box plots of data_mean, data_se, and data_worst)
2.3 Multimodal Data Visualizations
- Scatter plots
- Correlation matrix
# plot correlation matrix
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
plt.style.use('fivethirtyeight')
sns.set_style("white")
# Compute the correlation matrix
corr = data_mean.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # np.bool is deprecated; use the builtin bool
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
fig, ax = plt.subplots(figsize=(8, 8))  # do not overwrite the `data` DataFrame here
plt.title('Breast Cancer Feature Correlation')
# Generate a custom diverging colormap
cmap = sns.diverging_palette(260, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, vmax=1.2, square=True, cmap=cmap, mask=mask,
            ax=ax, annot=True, fmt='.2g', linewidths=2)
(Figure: correlation heatmap of the mean features)
- We can see strong positive correlations (r between 0.75 and 1) among the mean parameters:
the mean area of the nucleus is strongly positively correlated with the mean radius and mean perimeter;
some parameters are moderately positively correlated (r between 0.5 and 0.75), such as concavity with area and concavity with perimeter; similarly, fractal_dimension shows fairly strong negative correlations with the means of radius, texture, and perimeter. The sketch below extracts these pairs programmatically.
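To read these relationships off programmatically instead of from the heatmap, here is a small sketch (assuming `corr = data_mean.corr()` from the code above):
# Minimal sketch: rank the pairwise correlations among the mean features (assumes `corr` and np as above)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep each feature pair only once
pairs = upper.stack()
print(pairs.sort_values(ascending=False).head(8))   # strongest positive correlations
print(pairs.sort_values().head(5))                  # strongest negative correlations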
data = pd.read_csv("data/data.csv")
g = sns.PairGrid(data[data.columns.tolist()[1:12]],
hue ='diagnosis')
g = g.map_diag(plt.hist)
g = g.map_offdiag(plt.scatter, s = 3)
(Figure: pair grid of the mean features, colored by diagnosis)
As the plots show, most features separate benign from malignant tumors quite well.
Summary:
- The mean values of cell radius, perimeter, area, compactness, concavity, and concave points can be used for classifying the cancer. Larger values of these parameters tend to be associated with malignant tumors (see the group-mean sketch after this list).
- The mean values of texture, smoothness, symmetry, and fractal dimension do not show a clear diagnostic preference.
- None of the histograms show obvious outliers that would need further cleaning.
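As a quick numerical check of the first point, here is a small sketch (assuming `data` is loaded) comparing the group means of a few of these features between benign and malignant samples:
# Minimal sketch: group means of selected mean features by diagnosis (assumes `data` as above)
cols = ['radius_mean', 'perimeter_mean', 'area_mean',
        'compactness_mean', 'concavity_mean', 'concave points_mean']
print(data.groupby('diagnosis')[cols].mean().round(3))   # the M row shows clearly larger values than the B row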
Notebook 03: Preprocessing and feature engineering
Load the data
%matplotlib inline
import matplotlib.pyplot as plt
#Load libraries for data processing
import pandas as pd
import numpy as np
from scipy.stats import norm
# visualization
import seaborn as sns
plt.style.use('fivethirtyeight')
sns.set_style("white")
plt.rcParams['figure.figsize'] = (8,4)
#plt.rcParams['axes.titlesize'] = 'large'
data = pd.read_csv('data/data.csv', index_col=False)
Split into training and test sets
#Assign predictors to a variable of ndarray (matrix) type
X = data.iloc[:,2:32]
y = data.iloc[:,1].apply(lambda x: 1 if x == "M" else 0)
from sklearn.model_selection import train_test_split
##Split the data set into 75% train and 25% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7)
X_train.shape, y_train.shape, X_test.shape, y_test.shape
((426, 30), (426,), (143, 30), (143,))
Standardize the data
from sklearn.preprocessing import StandardScaler
# Normalize the data (center around 0 and scale to remove the variance).
scaler =StandardScaler()
Xs = scaler.fit_transform(X)
Dimensionality reduction with PCA
from sklearn.decomposition import PCA
# reduce from 30 dimensions to 10
pca = PCA(n_components=10)
fit = pca.fit(Xs)
X_pca = pca.transform(Xs)
Plot the first two PCs to see how well the classes separate after dimensionality reduction
PCA_df = pd.DataFrame()
PCA_df['PCA_1'] = X_pca[:,0]
PCA_df['PCA_2'] = X_pca[:,1]
## Visualization
plt.figure(figsize=(8,6))
plt.plot(PCA_df['PCA_1'][data.diagnosis == 'M'],
PCA_df['PCA_2'][data.diagnosis == 'M'],
'o', alpha = 0.7, color = 'r')
plt.plot(PCA_df['PCA_1'][data.diagnosis == 'B'],
PCA_df['PCA_2'][data.diagnosis == 'B'],
'o', alpha = 0.7, color = 'b')
plt.xlabel('PCA_1')
plt.ylabel('PCA_2')
plt.legend(['Malignant','Benign'])
plt.show()
(Figure: scatter plot of the first two principal components, malignant vs. benign)
Use the elbow of the scree plot to decide how many principal components to keep for subsequent modeling
#The amount of variance that each PC explains
var = pca.explained_variance_ratio_
### pick the number of PCs from the elbow
plt.plot(var)
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Eigenvalue')
leg = plt.legend(['Eigenvalues from PCA'],
loc='best',
borderpad=0.3,
shadow=False,
markerscale=0.4)
leg.get_frame().set_alpha(0.4)
leg.set_draggable(True)  # leg.draggable() was removed from matplotlib
plt.show()
(Figure: scree/elbow plot of the explained variance per principal component)
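Besides eyeballing the elbow, a small sketch (assuming `pca` has been fitted as above) can print the cumulative explained variance, which gives a more direct criterion for choosing the number of components:
# Minimal sketch: cumulative explained variance (assumes `pca` fitted as above)
cum_var = np.cumsum(pca.explained_variance_ratio_)
for k, v in enumerate(cum_var, start=1):
    print(f'{k:2d} components: {v:.1%} of the variance explained')
# the first few components already account for a large share of the total variance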
Notebook 04: Modeling with SVM
A support vector machine (SVM) learning algorithm will be used to build the predictive model. SVMs are among the most popular classification algorithms and have an elegant way of transforming nonlinear data so that a linear algorithm can fit a linear model to it (Cortes and Vapnik, 1995).
The kernel functions of SVMs are very powerful, which lets the models perform well on a wide range of datasets.
- SVMs allow complex decision boundaries even when the data has only a few features.
- They work well on both low-dimensional and high-dimensional data (i.e. few and many features), but they do not scale well to large numbers of samples.
Running an SVM on data with up to 10,000 samples may work well, but datasets of 100,000 samples or more can become challenging in terms of runtime and memory usage.
- SVMs require careful data preprocessing and parameter tuning. This is why many people nowadays use tree-based models such as random forests or gradient boosted trees (which need little preprocessing) in many applications instead.
- SVM models are hard to inspect; it can be difficult to understand why a particular prediction was made, so the interpretability of the model is limited.
4.1 Important SVM parameters
The important parameters in SVMs are
- the regularization parameter C,
- the choice of kernel: linear, radial basis function (rbf), or polynomial (poly),
- the RBF-specific parameter gamma.
gamma and C both control the complexity of the model; larger values of either lead to a more complex model. Good settings of the two parameters are therefore usually strongly correlated, and C and gamma should be tuned together (see the sketch below).
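To illustrate the interaction between C and gamma, here is a small sketch (not from the original notebook; it assumes a standardized train/test split X_train/X_test/y_train/y_test like the one created later in this section) comparing train and test accuracy across a few settings:
# Minimal sketch: joint effect of C and gamma on an RBF SVM
# (assumes X_train, X_test, y_train, y_test from a standardized split as created below)
from sklearn.svm import SVC
for C in (0.1, 1, 100):
    for gamma in (0.01, 1, 10):
        m = SVC(kernel='rbf', C=C, gamma=gamma).fit(X_train, y_train)
        print(f'C={C:<6} gamma={gamma:<6} '
              f'train acc={m.score(X_train, y_train):.3f}  test acc={m.score(X_test, y_test):.3f}')
# large C together with large gamma tends to push train accuracy to 1.0 while test accuracy drops: overfitting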
4.2 Data processing
Load packages and the dataset
# load package
%matplotlib inline
import matplotlib.pyplot as plt
#Load libraries for data processing
import pandas as pd
import numpy as np
from scipy.stats import norm
## Supervised learning.
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix
from sklearn import metrics, preprocessing
from sklearn.metrics import classification_report
# visualization
import seaborn as sns
plt.style.use('fivethirtyeight')
sns.set_style("white")
plt.rcParams['figure.figsize'] = (8,4)
# load dataset
data = pd.read_csv('data/data.csv')
Data preprocessing
# split features and label
X = data.iloc[:,2:32] # features
y = data.iloc[:,1] # label
# transform the class labels from their original string representation (M and B) into integers
le = LabelEncoder()
y = le.fit_transform(y)
# Normalize the data (center around 0 and scale to remove the variance).
scaler =StandardScaler()
Xs = scaler.fit_transform(X)
4.3 Cross-validation
Train : test = 7 : 3
# 5. Divide records in training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(Xs, y, stratify=y,
test_size=0.3,
random_state=33)
# 6. Create an SVM classifier and train it on 70% of the data set.
clf = SVC(probability=True)
clf.fit(X_train, y_train)
#7. Analyze accuracy of predictions on 30% of the holdout test sample.
classifier_score = clf.score(X_test, y_test)*100
print ('The classifier accuracy score is {:03.2f}% \n'.format(classifier_score))
The classifier accuracy score is 96.49%
Cross-validation
n_folds = 5
cv_error = np.average(cross_val_score(SVC(), Xs, y, cv=n_folds)) * 100
print('The {}-fold cross-validation accuracy score for this classifier is {:.2f} % \n'.format(n_folds, cv_error))
The 5-fold cross-validation accuracy score for this classifier is 97.36 %
- The cross-validation accuracy is slightly better than the single random split, which shows that how the data is split still matters for the model; the sketch below shows the individual fold scores.
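To see how stable that estimate is, here is a small sketch (assuming `Xs` and `y` as above) that prints the individual fold scores using stratified folds instead of just the average:
# Minimal sketch: per-fold scores with stratified 5-fold CV (assumes Xs, y as above)
from sklearn.model_selection import StratifiedKFold, cross_val_score
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
fold_scores = cross_val_score(SVC(), Xs, y, cv=skf)
print(fold_scores.round(4), 'mean =', round(fold_scores.mean(), 4))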
SVM pipeline
We already saw that a handful of well-chosen features can predict whether a tumor is benign or malignant quite well, so feature selection (here SelectKBest keeps the 3 most informative features) and the model can be chained into a single pipeline, which makes training and prediction more convenient.
from sklearn.feature_selection import SelectKBest, f_regression
# clf2 is a pipeline
clf2 = make_pipeline(SelectKBest(f_regression, k=3),
SVC(probability=True))
scores = cross_val_score(clf2, Xs, y, cv=3)  # CV scores of the pipeline itself
# Get average of 3-fold cross-validation score using an SVC estimator.
# (note: this re-runs CV with a full-feature SVC for comparison; the pipeline's scores are in `scores`)
n_folds = 3
cv_error = np.average(cross_val_score(SVC(), Xs, y, cv=n_folds)) * 100
print('The {}-fold cross-validation accuracy score for this classifier is {:.2f} %\n'.format(n_folds, cv_error))
The 3-fold cross-validation accuracy score for this classifier is 97.36 %
4.4 Model evaluation
- Accuracy: Overall, how often is the classifier correct?
  - Accuracy = (TP+TN)/total
- Misclassification Rate: Overall, how often is it wrong?
  - Error Rate = (FP+FN)/total
- True Positive Rate: When it's actually 1, how often does it predict 1?
  - TPR = TP/actual yes, also known as "Sensitivity" or "Recall"
- False Positive Rate: When it's actually 0, how often does it predict 1?
  - FPR = FP/actual no
- Specificity: When it's actually 0, how often does it predict 0? Also known as the true negative rate
  - Specificity = TN/actual no = 1 - FPR
- Precision: When it predicts 1, how often is it correct?
  - Precision = TP/predicted yes
- Prevalence: How often does the yes condition actually occur in our sample?
  - Prevalence = actual yes/total
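All of these quantities can be read directly off the confusion matrix; here is a small sketch (assuming `clf`, `X_test`, and `y_test` from above) that computes them explicitly:
# Minimal sketch: derive the metrics above from the confusion matrix (assumes clf, X_test, y_test as above)
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()
total = tn + fp + fn + tp
print('Accuracy   :', (tp + tn) / total)
print('Error rate :', (fp + fn) / total)
print('Recall/TPR :', tp / (tp + fn))
print('FPR        :', fp / (fp + tn))
print('Specificity:', tn / (tn + fp))
print('Precision  :', tp / (tp + fp))
print('Prevalence :', (tp + fn) / total)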
ROC curve
def ROC_plot(y_true, y_proba):
    from sklearn.metrics import roc_curve, auc
    plt.figure(figsize=(10,8))
    # use the function arguments instead of the globals y_test / probas_
    fpr, tpr, thresholds = roc_curve(y_true, y_proba[:, 1])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=2, label='ROC (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Random')
    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.gca().set_aspect(1)

probas_ = clf.predict_proba(X_test)
ROC_plot(y_test, probas_)
(Figure: ROC curve)
Notebook 05: SVM hyperparameter tuning
As before: read the data, split features and labels, and standardize
data = pd.read_csv('data/data.csv', index_col=False)
X = data.iloc[:,2:32] # features
y = data.iloc[:,1] # label
# transform the class labels from their original string representation (M and B) into integers
le = LabelEncoder()
y = le.fit_transform(y)
# Normalize the data (center around 0 and scale to remove the variance).
scaler =StandardScaler()
Xs = scaler.fit_transform(X)
Here the author uses PCA for dimensionality reduction.
The original data has 30 features; the first 10 principal components are kept.
from sklearn.decomposition import PCA
# feature extraction
pca = PCA(n_components=10)
fit = pca.fit(Xs)
X_pca = pca.transform(Xs)
Split into training and validation sets, train the model, and evaluate it
# Divide records in training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X_pca, y,
test_size=0.3,
random_state=2,
stratify=y)
# Create an SVM classifier and train it on 70% of the data set.
clf = SVC(probability=True)
clf.fit(X_train, y_train)
y_pred = clf.fit(X_train, y_train).predict(X_test)
print(classification_report(y_test, y_pred ))
(Output: classification report)
Visualize the confusion matrix of the test-set predictions
## plot confusion matrix
cm = metrics.confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(5, 5))
ax.matshow(cm, cmap=plt.cm.Reds, alpha=0.3)
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(x=j, y=i,
                s=cm[i, j],
                va='center', ha='center')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.show()
(Figure: confusion matrix)
Use GridSearchCV to select the parameter combination
# Train classifiers.
from sklearn.model_selection import GridSearchCV
kernel_values = ['linear','rbf']
param_grid = {'C': np.logspace(-3, 1, 100),
'gamma': np.logspace(-3, 2, 100),
'kernel': kernel_values}
grid = GridSearchCV(SVC(), scoring="roc_auc",
param_grid=param_grid,
cv=5)
grid.fit(X_train, y_train)
print("The best parameters are %s with a score of %0.2f"
% (grid.best_params_, grid.best_score_))
The best parameter combination:
The best parameters are {'C': 10.0, 'gamma': 0.01830738280295368, 'kernel': 'rbf'} with a score of 1.00
Evaluate model performance with the best parameters
clf = SVC(**grid.best_params_,
probability=True,
random_state=33)
y_pred = clf.fit(X_train, y_train).predict(X_test)
print(classification_report(y_test, y_pred ))
Visualizing SVM decision boundaries with different kernels
def meshgrid(feat1, feat2):
    x_min, x_max = feat1.min() - 1, feat1.max() + 1
    y_min, y_max = feat2.min() - 1, feat2.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                         np.arange(y_min, y_max, 0.1))
    return xx, yy
Xtrain = X_train[:, :2]  # only the first two principal components (X_train comes from X_pca)
xx, yy = meshgrid(Xtrain[:, 0], Xtrain[:, 1])
from matplotlib.colors import ListedColormap
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
# title for the plots
titles = ['SVC with linear kernel',
'SVC with RBF kernel',
'SVC with polynomial (degree 3) kernel']
svm = SVC(kernel='linear',C=1,random_state=0).fit(Xtrain, y_train)
rbf_svc = SVC(kernel='rbf',gamma=0.7, C=1, random_state=0).fit(Xtrain, y_train)
poly_svc = SVC(kernel='poly',degree=3, C=1, random_state=0).fit(Xtrain, y_train)
for i, clf in enumerate((svm, rbf_svc, poly_svc)):
    plt.subplot(2, 2, i + 1)
    plt.subplots_adjust(wspace=0.1, hspace=0.1)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
    # Plot also the training points
    plt.scatter(Xtrain[:, 0], Xtrain[:, 1], c=y_train, cmap=plt.cm.coolwarm)
    # the two plotted dimensions are the first two principal components, not raw features
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i])
plt.show()
(Figure: decision boundaries of the linear, RBF, and polynomial kernels)
Notebook 06: Comparing different models
from sklearn.model_selection import KFold, cross_val_score

def bxplots(results, names):
    fig = plt.figure()
    fig.suptitle('Algorithm Comparison')
    ax = fig.add_subplot(111)
    plt.boxplot(results)
    ax.set_xticklabels(names)
    plt.show()

def piplinecompare(models, X_train, y_train):
    results = []
    names = []
    for name, model in models:
        # the old KFold(n=..., n_folds=...) API is gone; use n_splits with shuffling
        kfold = KFold(n_splits=10, shuffle=True, random_state=7)
        cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='roc_auc')
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)
    return results, names
Compare the performance of different models (raw data vs. standardized data)
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# Standardize the dataset
pipelines = []
pipelines.append(('ScaledLR', Pipeline([('Scaler', StandardScaler()), ('LR', LogisticRegression())])))
pipelines.append(('ScaledLDA', Pipeline([('Scaler', StandardScaler()), ('LDA', LinearDiscriminantAnalysis())])))
pipelines.append(('ScaledKNN', Pipeline([('Scaler', StandardScaler()), ('KNN', KNeighborsClassifier())])))
pipelines.append(('ScaledCART', Pipeline([('Scaler', StandardScaler()), ('CART', DecisionTreeClassifier())])))
pipelines.append(('ScaledNB', Pipeline([('Scaler', StandardScaler()), ('NB', GaussianNB())])))
pipelines.append(('ScaledSVM', Pipeline([('Scaler', StandardScaler()), ('SVM', SVC())])))
results,names = piplinecompare(models, X_train, y_train)
bxplots(results,names)
results1,names1 = piplinecompare(pipelines, X_train, y_train)
bxplots(results1,names1)
(Figures: algorithm comparison box plots on raw data and on standardized data)
- Tree-based models (CART) are unaffected by whether the data is standardized.
- LDA and NB are only slightly affected.
- LR, KNN, and SVM need proper standardization before modeling, because it has a large impact on how these models train.