scikit-learn Machine Learning: Logistic Regression
The models discussed in earlier sections were all generalized linear models for regression; now let's look at logistic regression, which is used for classification.
Binary Classification with Logistic Regression
Ordinary linear regression assumes that the response variable is normally distributed. In logistic regression, the response variable describes the probability that the outcome is the positive case. If the response variable equals or exceeds a discrimination threshold, the positive class is predicted; otherwise, the negative class is predicted. The response variable is modeled as a function of a linear combination of the features using the logistic function.
The logistic function always returns a value between 0 and 1. It is given by the following formula, where e is Euler's number, approximately equal to 2.718:

$F(t) = \frac{1}{1 + e^{-t}}$

For logistic regression, $t$ is a linear combination of the explanatory variables:

$t = \beta_0 + \beta x$
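As a quick illustration (a minimal sketch with made-up coefficients, not part of the original example), the logistic function is easy to compute with numpy:
import numpy as np

def logistic(t):
    # Map any real-valued t to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-t))

# Illustrative coefficients: t = beta_0 + beta_1 * x
beta_0, beta_1 = -1.0, 2.0
for x in (-2.0, 0.0, 2.0):
    t = beta_0 + beta_1 * x
    print('x=%5.1f  t=%5.1f  P(positive)=%.3f' % (x, t, logistic(t)))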
Spam Filtering
Let's look at a binary classification task that uses logistic regression: spam filtering. The dataset comes from the UCI Machine Learning Repository, at http://archive.ics.uci.edu/ml/datasets/sms+spam+collection .
import pandas as pd
df = pd.read_csv('./SMSSpamCollection', delimiter='\t', header=None)
print(df.head())
0 1
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
print('Number of spam messages: %s' % df[df[0]=='spam'][0].count())
print('Number of ham messages: %s' % df[df[0]=='ham'][0].count())
Number of spam messages: 747
Number of ham messages: 4825
Each row of the dataset consists of a binary label (ham for a legitimate message, spam for a spam message) and the text of an SMS message. The dataset contains 4825 ham messages and 747 spam messages. Next, we use scikit-learn's LogisticRegression class to make some predictions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
X = df[1].values
y = df[0].values
# First, convert the labels to 0 and 1
y = [1 if yy == 'spam' else 0 for yy in y]
# Split into training and test sets
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y)
# Vectorize the text
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
# Train the model and make predictions
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
for i, prediction in enumerate(predictions[:5]):
    print('Predicted: %s, message: %s' % (prediction, X_test_raw[i]))
Predicted: 0, message: R u over scratching it?
Predicted: 0, message: Babe! How goes that day ? What are you up to ? I miss you already, my Love ... * loving kiss* ... I hope everything goes well.
Predicted: 0, message: I'm going 2 orchard now laready me reaching soon. U reaching?
Predicted: 0, message: ... Are you in the pub?
Predicted: 0, message: I dont thnk its a wrong calling between us
Binary Classification Evaluation Metrics
Metrics for evaluating a binary classifier include accuracy, precision, recall, F1 score, and ROC AUC. All of these measures are based on the notions of true positives, true negatives, false positives, and false negatives. Positive and negative refer to the classes; true and false indicate whether the predicted class matches the actual class.
These terms are defined as follows:
- True positive (TP): the sample's actual class is positive, and the model predicts positive
- True negative (TN): the sample's actual class is negative, and the model predicts negative
- False positive (FP): the sample's actual class is negative, but the model predicts positive
- False negative (FN): the sample's actual class is positive, but the model predicts negative
A confusion matrix visualizes these counts; here is a simple example.
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
y_test1 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred1 = [0, 1, 0, 0, 0, 0, 0, 1, 1, 1]
cm = confusion_matrix(y_test1, y_pred1)  # avoid shadowing the imported function
print(cm)
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
[[4 1]
[2 3]]
(Figure: confusion matrix plot)
Accuracy
Accuracy measures the fraction of the classifier's predictions that are correct. The LogisticRegression.score method predicts the labels for a test set and scores them using accuracy.
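For example, with the classifier and test split already created above, the score method gives the held-out accuracy directly (a one-line sketch):
print('Test set accuracy: %s' % classifier.score(X_test, y_test))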
scores = cross_val_score(classifier, X_train, y_train, cv=5)
print('Accuracies: %s' % scores)
print('Mean accuracy: %s' % np.mean(scores))
Accuracies: [0.95101553 0.95221027 0.94850299 0.96167665 0.95449102]
Mean accuracy: 0.9535792930268496
Although accuracy measures the overall correctness of a classifier, it does not distinguish between false positives and false negatives. It can also be misleading when the classes are imbalanced, as they are here: a classifier that always predicted ham would already achieve roughly 87% accuracy (4825/5572).
Precision and Recall
Precision is the fraction of positive predictions that are correct; recall is the fraction of truly positive instances that the classifier identifies.
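In terms of the confusion matrix counts defined earlier:

$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$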
precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
print('Mean Precision: %s' % np.mean(precisions))
recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
print('Mean recall: %s' % np.mean(recalls))
Mean Precision: 0.991777693186144
Mean recall: 0.6476554536187563
F1 Score
The F1 score is the harmonic mean of precision and recall.
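That is:

$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$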
f1s = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
print('Mean F1 score: %s' % np.mean(f1s))
Mean F1 score: 0.7829760388268829
ROC AUC
A receiver operating characteristic (ROC) curve visualizes a classifier's performance across discrimination thresholds. The ROC curve plots the classifier's recall against its fall-out, the false positive rate FP/(FP + TN). AUC is the area under the ROC curve; it reduces the curve to a single value that summarizes the classifier's expected performance.
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
predictions = classifier.predict_proba(X_test)
false_positive_rate, recall, thresholds = roc_curve(y_test, predictions[:, 1])
roc_auc = auc(false_positive_rate, recall)
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()
(Figure: ROC curve; the dashed diagonal shows a random classifier)
Fine-Tuning Models with Grid Search
In scikit-learn, hyperparameters are set through the constructors of estimators and transformers. In the previous examples we did not set any arguments of the LogisticRegression class; we used the default values for all hyperparameters.
Grid search is a common method for selecting the hyperparameter values that produce the best model. It takes a set of possible values for every hyperparameter that should be tuned and evaluates a model trained on every element of the Cartesian product of those sets. In other words, grid search is an exhaustive search: it trains and evaluates a model for each possible combination of the specified hyperparameter values.
We can use scikit-learn's GridSearchCV class to find good hyperparameter values. GridSearchCV takes an estimator, a parameter space, and a scoring metric. The grid below contains 3 × 2 × 4 × 2 × 2 × 2 × 2 × 4 = 1536 combinations, and each is evaluated with 3-fold cross-validation, for 4608 fits in total.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, accuracy_score
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, 10000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}
df = pd.read_csv('./SMSSpamCollection', delimiter='\t', header=None)
X = df[1].values
y = df[0].values
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y)
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
grid_search.fit(X_train, y_train)
print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))
predictions = grid_search.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, predictions))
print('Precision: ', precision_score(y_test, predictions))
print('Recall: ', recall_score(y_test, predictions))
Fitting 3 folds for each of 1536 candidates, totalling 4608 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 10.4s
[Parallel(n_jobs=-1)]: Done 192 tasks | elapsed: 32.6s
[Parallel(n_jobs=-1)]: Done 442 tasks | elapsed: 1.0min
[Parallel(n_jobs=-1)]: Done 792 tasks | elapsed: 1.9min
[Parallel(n_jobs=-1)]: Done 1242 tasks | elapsed: 3.2min
[Parallel(n_jobs=-1)]: Done 1792 tasks | elapsed: 4.8min
[Parallel(n_jobs=-1)]: Done 2442 tasks | elapsed: 6.9min
[Parallel(n_jobs=-1)]: Done 3192 tasks | elapsed: 8.5min
[Parallel(n_jobs=-1)]: Done 4042 tasks | elapsed: 13.8min
[Parallel(n_jobs=-1)]: Done 4608 out of 4608 | elapsed: 15.0min finished
Best score: 0.984
Best parameters set:
clf__C: 10
clf__penalty: 'l2'
vect__max_df: 0.5
vect__max_features: 5000
vect__ngram_range: (1, 2)
vect__norm: 'l2'
vect__stop_words: None
vect__use_idf: True
Accuracy:  0.9856424982053122
Precision: 0.9748427672955975
Recall: 0.9064327485380117
Multi-Class Classification
In many classification problems there are more than two classes. scikit-learn supports multi-class classification with the one-vs-rest (one-vs-all) strategy: one binary classifier is trained for each class, treating that class as positive and all other classes as negative. The LogisticRegression class supports one-vs-rest multi-class classification out of the box.
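A minimal sketch of the idea on a toy three-class problem (make_classification is used purely for illustration; multi_class='ovr' spells out the one-vs-rest behavior explicitly):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy three-class dataset, for illustration only.
X_toy, y_toy = make_classification(n_samples=300, n_features=10,
                                   n_informative=5, n_classes=3,
                                   random_state=0)
# One binary classifier is fit per class (class k vs. the rest);
# predict returns the class whose classifier scores highest.
clf = LogisticRegression(multi_class='ovr', solver='liblinear')
clf.fit(X_toy, y_toy)
print(clf.coef_.shape)  # (3, 10): one row of coefficients per class
print(clf.predict(X_toy[:5]))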
Let's look at a Kaggle example of classifying the sentiment of phrases from Rotten Tomatoes movie reviews (sentiment-analysis-on-movie-reviews). Each phrase is labeled with one of the following sentiments: negative, somewhat negative, neutral, somewhat positive, positive.
df = pd.read_csv('./sentiment-analysis-on-movie-reviews/train.tsv', header=0, delimiter='\t')
print(df.count())
PhraseId 156060
SentenceId 156060
Phrase 156060
Sentiment 156060
dtype: int64
print(df.head())
PhraseId SentenceId Phrase \
0 1 1 A series of escapades demonstrating the adage ...
1 2 1 A series of escapades demonstrating the adage ...
2 3 1 A series
3 4 1 A
4 5 1 series
Sentiment
0 1
1 2
2 2
3 2
4 2
print(df['Phrase'].head(10))
0 A series of escapades demonstrating the adage ...
1 A series of escapades demonstrating the adage ...
2 A series
3 A
4 series
5 of escapades demonstrating the adage that what...
6 of
7 escapades demonstrating the adage that what is...
8 escapades
9 demonstrating the adage that what is good for ...
Name: Phrase, dtype: object
print(df['Sentiment'].describe())
count 156060.000000
mean 2.063578
std 0.893832
min 0.000000
25% 2.000000
50% 2.000000
75% 3.000000
max 4.000000
Name: Sentiment, dtype: float64
print(df['Sentiment'].value_counts())
2 79582
3 32927
1 27273
4 9206
0 7072
Name: Sentiment, dtype: int64
print(df['Sentiment'].value_counts() / df['Sentiment'].count())
2 0.509945
3 0.210989
1 0.174760
4 0.058990
0 0.045316
Name: Sentiment, dtype: float64
We can see that about half of the phrases are neutral. Next, let's train a classifier with scikit-learn.
X, y = df['Phrase'], df['Sentiment'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__max_df': (0.25, 0.5),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10),
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
grid_search.fit(X_train, y_train)
print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
Fitting 3 folds for each of 24 candidates, totalling 72 fits
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 1.5min
[Parallel(n_jobs=-1)]: Done 72 out of 72 | elapsed: 3.9min finished
Best score: 0.620
Best parameters set:
clf__C: 10
vect__max_df: 0.25
vect__ngram_range: (1, 2)
vect__use_idf: False
As with binary classification, a confusion matrix is useful for visualizing the classifier's errors. Precision, recall, and F1 score can also be computed per class, and accuracy is computed over all predictions.
from sklearn.metrics import classification_report, confusion_matrix
predictions = grid_search.predict(X_test)
print('Accuracy: %s' % accuracy_score(y_test, predictions))
print('Confusion Matrix:')
print(confusion_matrix(y_test, predictions))
print('Classification Report:')
print(classification_report(y_test, predictions))
Accuracy: 0.6364603357682942
Confusion Matrix:
[[ 1136 1734 597 71 1]
[ 904 6027 6070 552 21]
[ 231 3116 32634 3535 160]
[ 28 402 6732 8156 1351]
[ 7 34 549 2272 1710]]
Classification Report:
              precision    recall  f1-score   support

           0       0.49      0.32      0.39      3539
           1       0.53      0.44      0.48     13574
           2       0.70      0.82      0.76     39676
           3       0.56      0.49      0.52     16669
           4       0.53      0.37      0.44      4572

    accuracy                           0.64     78030
   macro avg       0.56      0.49      0.52     78030
weighted avg       0.62      0.64      0.62     78030
Multi-Label Classification and Problem Transformation
In the preceding classification problems, each instance had to be assigned to exactly one class from a set of classes. In multi-label classification, each instance can be assigned a subset of the set of labels; for example, a post on a forum can carry multiple tags.
There are two problem transformation approaches to multi-label classification (a sketch of the second follows the list):
- The first converts the original multi-label problem into a series of single-label classification problems by treating each label set that appears in the training data as a single label.
- The second trains one binary classifier for each label in the training set; each classifier predicts whether an instance belongs to its label.
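Here is a minimal sketch of the second approach using scikit-learn's MultiLabelBinarizer and OneVsRestClassifier; the tiny posts/tags dataset is invented for illustration:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical forum posts, each tagged with a set of labels.
posts = [
    'tuning logistic regression hyperparameters with grid search',
    'sklearn pipelines for text classification',
    'python string formatting question',
    'choosing evaluation metrics for classifiers',
]
tags = [
    {'machine-learning', 'python'},
    {'machine-learning', 'python'},
    {'python'},
    {'machine-learning'},
]

# Convert each label set to a binary indicator row, one column per label.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(tags)

# OneVsRestClassifier fits one binary LogisticRegression per label;
# each predicts independently whether an instance carries that label.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(posts)
classifier = OneVsRestClassifier(LogisticRegression())
classifier.fit(X, Y)

test = vectorizer.transform(['grid search for logistic regression'])
print(binarizer.inverse_transform(classifier.predict(test)))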