scikit-learn Machine Learning: Logistic Regression
The models discussed in earlier sections were all generalized linear models for regression; now let's look at logistic regression, which is used for classification.
Binary Classification with Logistic Regression
Ordinary linear regression assumes that the response variable is normally distributed. In logistic regression, the response variable describes the probability that the outcome is the positive case. If the response variable equals or exceeds a discrimination threshold, the positive class is predicted; otherwise, the negative class is predicted. The response variable is modeled as a function of a linear combination of the features using the logistic function.
The logistic function always returns a value between 0 and 1. It is given by the following formula, where e is Euler's number, approximately equal to 2.718:

$F(t) = \frac{1}{1 + e^{-t}}$

For logistic regression, $t$ is a linear combination of the explanatory variables:

$t = \beta_0 + \beta x$
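As a quick illustration (a minimal sketch with made-up coefficients, not part of the original example), the logistic function is easy to compute with numpy:
import numpy as np

def logistic(t):
    # Map any real-valued t to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-t))

# Illustrative coefficients: t = beta_0 + beta_1 * x
beta_0, beta_1 = -1.0, 2.0
for x in (-2.0, 0.0, 2.0):
    t = beta_0 + beta_1 * x
    print('x=%5.1f  t=%5.1f  P(positive)=%.3f' % (x, t, logistic(t)))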
Spam Filtering
Let's look at a binary classification task that uses logistic regression: spam filtering. The dataset comes from the UCI Machine Learning Repository, at http://archive.ics.uci.edu/ml/datasets/sms+spam+collection .
import pandas as pd
df = pd.read_csv('./SMSSpamCollection', delimiter='\t', header=None)
print(df.head())
0 1
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
print('Number of spam messages: %s' % df[df[0]=='spam'][0].count())
print('Number of ham messages: %s' % df[df[0]=='ham'][0].count())
Number of spam messages: 747
Number of ham messages: 4825
Each row of the dataset consists of a binary label (ham for a legitimate message, spam for a spam message) and the text of an SMS message. The dataset contains 4825 ham messages and 747 spam messages. Next, we use scikit-learn's LogisticRegression class to make some predictions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
X = df[1].values
y = df[0].values
# First, convert the labels to 0 and 1
y = [1 if yy == 'spam' else 0 for yy in y]
# Split into training and test sets
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y)
# Vectorize the text
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
# Train the model and make predictions
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
for i, prediction in enumerate(predictions[:5]):
    print('Predicted: %s, message: %s' % (prediction, X_test_raw[i]))
Predicted: 0, message: R u over scratching it?
Predicted: 0, message: Babe! How goes that day ? What are you up to ? I miss you already, my Love ... * loving kiss* ... I hope everything goes well.
Predicted: 0, message: I'm going 2 orchard now laready me reaching soon. U reaching?
Predicted: 0, message: ... Are you in the pub?
Predicted: 0, message: I dont thnk its a wrong calling between us
Binary Classification Evaluation Metrics
Metrics for evaluating a binary classifier include accuracy, precision, recall, F1 score, and ROC AUC. All of these measures are based on the notions of true positives, true negatives, false positives, and false negatives. Positive and negative refer to the classes; true and false indicate whether the predicted class matches the actual class.
These terms are defined as follows:
- True positive (TP): the sample's actual class is positive, and the model predicts positive
- True negative (TN): the sample's actual class is negative, and the model predicts negative
- False positive (FP): the sample's actual class is negative, but the model predicts positive
- False negative (FN): the sample's actual class is positive, but the model predicts negative
A confusion matrix visualizes these counts; here is a simple example.
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
y_test1 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred1 = [0, 1, 0, 0, 0, 0, 0, 1, 1, 1]
cm = confusion_matrix(y_test1, y_pred1)  # avoid shadowing the imported function
print(cm)
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
[[4 1]
[2 3]]
(Figure: confusion matrix plot)
Accuracy
Accuracy measures the fraction of the classifier's predictions that are correct. The LogisticRegression.score method predicts the labels for a test set and scores them using accuracy.
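For example, with the classifier and test split already created above, the score method gives the held-out accuracy directly (a one-line sketch):
print('Test set accuracy: %s' % classifier.score(X_test, y_test))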
scores = cross_val_score(classifier, X_train, y_train, cv=5)
print('Accuracies: %s' % scores)
print('Mean accuracy: %s' % np.mean(scores))
Accuracies: [0.95101553 0.95221027 0.94850299 0.96167665 0.95449102]
Mean accuracy: 0.9535792930268496
Although accuracy measures the overall correctness of a classifier, it does not distinguish between false positives and false negatives. It can also be misleading when the classes are imbalanced, as they are here: a classifier that always predicted ham would already achieve roughly 87% accuracy (4825/5572).
Precision and Recall
Precision is the fraction of positive predictions that are correct; recall is the fraction of truly positive instances that the classifier identifies.
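In terms of the confusion matrix counts defined earlier:

$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$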
precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
print('Mean Precision: %s' % np.mean(precisions))
recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
print('Mean recall: %s' % np.mean(recalls))
Mean Precision: 0.991777693186144
Mean recall: 0.6476554536187563
F1 Score
The F1 score is the harmonic mean of precision and recall.
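That is:

$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$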
f1s = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
print('Mean F1 score: %s' % np.mean(f1s))
Mean F1 score: 0.7829760388268829
ROC AUC
A receiver operating characteristic (ROC) curve visualizes a classifier's performance across discrimination thresholds. The ROC curve plots the classifier's recall against its fall-out, the false positive rate FP/(FP + TN). AUC is the area under the ROC curve; it reduces the curve to a single value that summarizes the classifier's expected performance.
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
predictions = classifier.predict_proba(X_test)
false_positive_rate, recall, thresholds = roc_curve(y_test, predictions[:, 1])
roc_auc = auc(false_positive_rate, recall)
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()
(Figure: ROC curve; the dashed diagonal shows a random classifier)
Fine-Tuning Models with Grid Search
In scikit-learn, hyperparameters are set through the constructors of estimators and transformers. In the previous examples we did not set any arguments of the LogisticRegression class; we used the default values for all hyperparameters.
Grid search is a common method for selecting the hyperparameter values that produce the best model. It takes a set of possible values for every hyperparameter that should be tuned and evaluates a model trained on every element of the Cartesian product of those sets. In other words, grid search is an exhaustive search: it trains and evaluates a model for each possible combination of the specified hyperparameter values.
We can use scikit-learn's GridSearchCV class to find good hyperparameter values. GridSearchCV takes an estimator, a parameter space, and a scoring metric. The grid below contains 3 × 2 × 4 × 2 × 2 × 2 × 2 × 4 = 1536 combinations, and each is evaluated with 3-fold cross-validation, for 4608 fits in total.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, accuracy_score
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, 10000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}
df = pd.read_csv('./SMSSpamCollection', delimiter='\t', header=None)
X = df[1].values
y = df[0].values
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y)
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
grid_search.fit(X_train, y_train)
print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))
predictions = grid_search.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, predictions))
print('Precision: ', precision_score(y_test, predictions))
print('Recall: ', recall_score(y_test, predictions))
Fitting 3 folds for each of 1536 candidates, totalling 4608 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 10.4s
[Parallel(n_jobs=-1)]: Done 192 tasks | elapsed: 32.6s
[Parallel(n_jobs=-1)]: Done 442 tasks | elapsed: 1.0min
[Parallel(n_jobs=-1)]: Done 792 tasks | elapsed: 1.9min
[Parallel(n_jobs=-1)]: Done 1242 tasks | elapsed: 3.2min
[Parallel(n_jobs=-1)]: Done 1792 tasks | elapsed: 4.8min
[Parallel(n_jobs=-1)]: Done 2442 tasks | elapsed: 6.9min
[Parallel(n_jobs=-1)]: Done 3192 tasks | elapsed: 8.5min
[Parallel(n_jobs=-1)]: Done 4042 tasks | elapsed: 13.8min
[Parallel(n_jobs=-1)]: Done 4608 out of 4608 | elapsed: 15.0min finished
Best score: 0.984
Best parameters set:
clf__C: 10
clf__penalty: 'l2'
vect__max_df: 0.5
vect__max_features: 5000
vect__ngram_range: (1, 2)
vect__norm: 'l2'
vect__stop_words: None
vect__use_idf: True
Accuracy:  0.9856424982053122
Precision: 0.9748427672955975
Recall: 0.9064327485380117
Multi-Class Classification
In many classification problems there are more than two classes. scikit-learn supports multi-class classification with the one-vs-rest (one-vs-all) strategy: one binary classifier is trained for each class, treating that class as positive and all other classes as negative. The LogisticRegression class supports one-vs-rest multi-class classification out of the box.
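A minimal sketch of the idea on a toy three-class problem (make_classification is used purely for illustration; multi_class='ovr' spells out the one-vs-rest behavior explicitly):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy three-class dataset, for illustration only.
X_toy, y_toy = make_classification(n_samples=300, n_features=10,
                                   n_informative=5, n_classes=3,
                                   random_state=0)
# One binary classifier is fit per class (class k vs. the rest);
# predict returns the class whose classifier scores highest.
clf = LogisticRegression(multi_class='ovr', solver='liblinear')
clf.fit(X_toy, y_toy)
print(clf.coef_.shape)  # (3, 10): one row of coefficients per class
print(clf.predict(X_toy[:5]))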
Let's look at a Kaggle example of classifying the sentiment of phrases from Rotten Tomatoes movie reviews (sentiment-analysis-on-movie-reviews). Each phrase is labeled with one of the following sentiments: negative, somewhat negative, neutral, somewhat positive, positive.
df = pd.read_csv('./sentiment-analysis-on-movie-reviews/train.tsv', header=0, delimiter='\t')
print(df.count())
PhraseId 156060
SentenceId 156060
Phrase 156060
Sentiment 156060
dtype: int64
print(df.head())
PhraseId SentenceId Phrase \
0 1 1 A series of escapades demonstrating the adage ...
1 2 1 A series of escapades demonstrating the adage ...
2 3 1 A series
3 4 1 A
4 5 1 series
Sentiment
0 1
1 2
2 2
3 2
4 2
print(df['Phrase'].head(10))
0 A series of escapades demonstrating the adage ...
1 A series of escapades demonstrating the adage ...
2 A series
3 A
4 series
5 of escapades demonstrating the adage that what...
6 of
7 escapades demonstrating the adage that what is...
8 escapades
9 demonstrating the adage that what is good for ...
Name: Phrase, dtype: object
print(df['Sentiment'].describe())
count 156060.000000
mean 2.063578
std 0.893832
min 0.000000
25% 2.000000
50% 2.000000
75% 3.000000
max 4.000000
Name: Sentiment, dtype: float64
print(df['Sentiment'].value_counts())
2 79582
3 32927
1 27273
4 9206
0 7072
Name: Sentiment, dtype: int64
print(df['Sentiment'].value_counts() / df['Sentiment'].count())
2 0.509945
3 0.210989
1 0.174760
4 0.058990
0 0.045316
Name: Sentiment, dtype: float64
We can see that about half of the phrases are neutral. Next, let's train a classifier with scikit-learn.
X, y = df['Phrase'], df['Sentiment'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__max_df': (0.25, 0.5),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10),
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
grid_search.fit(X_train, y_train)
print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
Fitting 3 folds for each of 24 candidates, totalling 72 fits
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 1.5min
[Parallel(n_jobs=-1)]: Done 72 out of 72 | elapsed: 3.9min finished
Best score: 0.620
Best parameters set:
clf__C: 10
vect__max_df: 0.25
vect__ngram_range: (1, 2)
vect__use_idf: False
As with binary classification, a confusion matrix is useful for visualizing the classifier's errors. Precision, recall, and F1 score can also be computed per class, and accuracy is computed over all predictions.
from sklearn.metrics import classification_report, confusion_matrix
predictions = grid_search.predict(X_test)
print('Accuracy: %s' % accuracy_score(y_test, predictions))
print('Confusion Matrix:')
print(confusion_matrix(y_test, predictions))
print('Classification Report:')
print(classification_report(y_test, predictions))
Accuracy: 0.6364603357682942
Confusion Matrix:
[[ 1136 1734 597 71 1]
[ 904 6027 6070 552 21]
[ 231 3116 32634 3535 160]
[ 28 402 6732 8156 1351]
[ 7 34 549 2272 1710]]
Classification Report:
              precision    recall  f1-score   support

           0       0.49      0.32      0.39      3539
           1       0.53      0.44      0.48     13574
           2       0.70      0.82      0.76     39676
           3       0.56      0.49      0.52     16669
           4       0.53      0.37      0.44      4572

    accuracy                           0.64     78030
   macro avg       0.56      0.49      0.52     78030
weighted avg       0.62      0.64      0.62     78030
Multi-Label Classification and Problem Transformation
In the preceding classification problems, each instance had to be assigned to exactly one class from a set of classes. In multi-label classification, each instance can be assigned a subset of the set of labels; for example, a post on a forum can carry multiple tags.
There are two problem transformation approaches to multi-label classification (a sketch of the second follows the list):
- The first converts the original multi-label problem into a series of single-label classification problems by treating each label set that appears in the training data as a single label.
- The second trains one binary classifier for each label in the training set; each classifier predicts whether an instance belongs to its label.
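Here is a minimal sketch of the second approach using scikit-learn's MultiLabelBinarizer and OneVsRestClassifier; the tiny posts/tags dataset is invented for illustration:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical forum posts, each tagged with a set of labels.
posts = [
    'tuning logistic regression hyperparameters with grid search',
    'sklearn pipelines for text classification',
    'python string formatting question',
    'choosing evaluation metrics for classifiers',
]
tags = [
    {'machine-learning', 'python'},
    {'machine-learning', 'python'},
    {'python'},
    {'machine-learning'},
]

# Convert each label set to a binary indicator row, one column per label.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(tags)

# OneVsRestClassifier fits one binary LogisticRegression per label;
# each predicts independently whether an instance carries that label.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(posts)
classifier = OneVsRestClassifier(LogisticRegression())
classifier.fit(X, Y)

test = vectorizer.transform(['grid search for logistic regression'])
print(binarizer.inverse_transform(classifier.predict(test)))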