Implementing the Logistic Regression Algorithm

2019-01-30 此间不留白

Preface

In earlier posts we worked through the mathematics behind logistic regression and a simple derivation of it. We can now implement the algorithm in Python.

Environment

Python 3.6
Jupyter Notebook

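Plotting the data as a scatter plot

We first load the training samples and draw a scatter plot of them, so that we can see the structure of the data and choose a suitable algorithm.

First, import the libraries we need: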

import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize as opt

Loading the data by column
We slice the data in ex2data1.txt by column: X holds the features and y holds the class label (positive or negative), as shown below:

data = np.loadtxt('ex2data1.txt', delimiter=',')
X = data[:, 0:2]    # first two columns: features
y = data[:, 2]      # third column: class label
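Since ex2data1.txt is not included with this post, the column slicing can be tried on a small in-memory sample instead. The sketch below assumes the same comma-separated "feature, feature, label" layout; the three rows and the names `sample_text`, `X_sample` and `y_sample` are made up for illustration.

```python
import io
import numpy as np

# Hypothetical rows in the assumed "feature1,feature2,label" layout
sample_text = "30.0,40.0,0\n60.0,85.0,1\n75.0,50.0,1\n"

# np.loadtxt accepts a file-like object as well as a file name
sample = np.loadtxt(io.StringIO(sample_text), delimiter=',')
X_sample = sample[:, 0:2]   # first two columns: features
y_sample = sample[:, 2]     # third column: class label

print(X_sample.shape, y_sample.shape)   # (3, 2) (3,)
```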

Drawing the scatter plot
The following function draws the positive and negative samples with different markers:

def plotData(X, y):
    plt.figure()
    pos = np.where(y == 1)[0]   # indices of the positive class
    neg = np.where(y == 0)[0]   # indices of the negative class

    plt.scatter(X[pos, 0], X[pos, 1], marker="+", c='b')   # positive samples
    plt.scatter(X[neg, 0], X[neg, 1], marker="o", c='y')   # negative samples

After calling plotData(), we get the scatter plot shown below, which makes the structure of the data easy to see.

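Implementing gradient descent

We have already covered the formulas for the logistic regression cost function and gradient descent in detail (see the earlier post on logistic regression). Below we implement them in Python.

Initializing the parameters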

(m, n) = X.shape   # m samples, n features
X = np.c_[np.ones(m), X]    # prepend a column of ones for the intercept term
initial_theta = np.zeros(n + 1)  # initialize theta as a zero vector

Defining the sigmoid() function
The sigmoid function is defined as $g(z)=\frac{1}{1+e^{-z}}$. In Python it can be written as:

def sigmoid(z):
    # element-wise sigmoid; works for scalars and NumPy arrays
    g = 1 / (1 + np.exp(-z))
    return g
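A quick sanity check helps confirm that the function behaves as the formula says. The sketch below checks that g(0) = 0.5 and g(-z) = 1 - g(z), and compares the result with `scipy.special.expit`, SciPy's built-in logistic function. Using `expit` is a suggestion here, not something the original post does.

```python
import numpy as np
from scipy.special import expit

def sigmoid(z):
    # same formula as in the post: g(z) = 1 / (1 + e^(-z))
    return 1 / (1 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])

print(sigmoid(0.0))                              # 0.5
print(np.allclose(sigmoid(-z), 1 - sigmoid(z)))  # symmetry of the sigmoid
print(np.allclose(sigmoid(z), expit(z)))         # agrees with SciPy's version
```

For inputs with a very large magnitude, `expit` avoids the overflow warning that `np.exp(-z)` can raise, so it may be a convenient drop-in replacement.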

Implementing the cost function
Following the cost function formula, the Python implementation is:

def cost_function(theta, X, y):
    m = y.size    # number of training samples
    # predicted probabilities h = g(X * theta)
    hypothesis = sigmoid(np.dot(X, theta))
    # average cross-entropy cost over all samples
    cost = np.sum(-y * np.log(hypothesis) - (1 - y) * np.log(1 - hypothesis)) / m
    # gradient of the cost with respect to theta
    grad = np.dot(X.T, (hypothesis - y)) / m
    return cost, grad
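For reference, the quantities that cost_function computes can be written out as follows. The notation ($h_\theta$ for the hypothesis, $m$ samples indexed by $i$, parameters indexed by $j$) is assumed here, since the formulas themselves appear only in the earlier post; each term corresponds to one line of the code above.

```latex
h_\theta(x) = g(\theta^{T}x), \qquad g(z) = \frac{1}{1+e^{-z}}

J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\Big[-y^{(i)}\log h_\theta(x^{(i)}) - (1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big)\Big]

\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)\,x_j^{(i)}
```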

With this function and the initial values set above, we obtain the cost and grad values shown below:
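The post imports scipy.optimize but does not show the call that turns the cost function into fitted parameters. One plausible way to do this is sketched below with `scipy.optimize.minimize`. This is an assumption, not necessarily the author's method, and the data is a synthetic stand-in generated on the spot rather than ex2data1.txt. With theta set to all zeros every prediction is 0.5, so the initial cost equals log(2) ≈ 0.693 whatever the data.

```python
import numpy as np
import scipy.optimize as opt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_function(theta, X, y):
    m = y.size
    hypothesis = sigmoid(np.dot(X, theta))
    cost = np.sum(-y * np.log(hypothesis) - (1 - y) * np.log(1 - hypothesis)) / m
    grad = np.dot(X.T, (hypothesis - y)) / m
    return cost, grad

# Synthetic stand-in for the training set (assumption: not the real data)
rng = np.random.default_rng(0)
m = 100
features = rng.normal(size=(m, 2))
noise = rng.normal(scale=1.0, size=m)            # keeps the classes overlapping
y = (features[:, 0] + features[:, 1] + noise > 0).astype(float)

X = np.c_[np.ones(m), features]                  # add the intercept column
initial_theta = np.zeros(X.shape[1])

cost0, grad0 = cost_function(initial_theta, X, y)
print(cost0)                                     # log(2), about 0.6931

# jac=True tells minimize that cost_function returns (cost, gradient)
result = opt.minimize(cost_function, initial_theta, args=(X, y),
                      jac=True, method='BFGS')
theta = result.x
print(result.fun, theta)
```

Passing `jac=True` lets the optimizer reuse the analytic gradient instead of estimating it numerically, which is why cost_function returns both values.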

Plotting the decision boundary

Now that the cost function is implemented, the decision boundary can be plotted with the following code:

def map_feature(x1, x2):
    # Map two features to all polynomial terms x1^(i-j) * x2^j up to degree 6
    degree = 6

    x1 = x1.reshape((x1.size, 1))
    x2 = x2.reshape((x2.size, 1))
    result = np.ones(x1[:, 0].shape)   # first column: the intercept term

    for i in range(1, degree + 1):
        for j in range(0, i + 1):
            result = np.c_[result, (x1**(i-j)) * (x2**j)]

    return result

def plot_decision_boundary(theta, X, y):
    plotData(X[:, 1:3], y)

    if X.shape[1] <= 3:
        # Linear boundary: two end points are enough to define the line
        plot_x = np.array([np.min(X[:, 1]) - 2, np.max(X[:, 1]) + 2])

        # Solve theta[0] + theta[1]*x1 + theta[2]*x2 = 0 for x2
        plot_y = (-1/theta[2]) * (theta[1]*plot_x + theta[0])

        plt.plot(plot_x, plot_y)

        plt.legend(['Decision Boundary', 'Admitted', 'Not admitted'], loc=1)
        plt.axis([30, 100, 30, 100])
    else:
        # Nonlinear boundary: evaluate theta*x on a grid of mapped features
        u = np.linspace(-1, 1.5, 50)
        v = np.linspace(-1, 1.5, 50)

        z = np.zeros((u.size, v.size))

        for i in range(0, u.size):
            for j in range(0, v.size):
                # map_feature returns a single row, so the product has one element
                z[i, j] = np.dot(map_feature(u[i], v[j]), theta)[0]

        # contour expects z indexed as [row = v, column = u]
        z = z.T

        # The decision boundary is the contour where theta*x = 0
        cs = plt.contour(u, v, z, levels=[0], colors='r')
        handles, _ = cs.legend_elements()
        plt.legend(handles, ['Decision Boundary'])

Running plot_decision_boundary() gives the decision boundary shown below:
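The line drawn in the linear branch of plot_decision_boundary follows from the 0.5 threshold. The hypothesis equals 0.5 exactly when its argument is zero, so with an intercept and two features the boundary can be solved for $x_2$ (assuming $\theta_2 \neq 0$), which is the expression used for plot_y:

```latex
h_\theta(x) = g(\theta^{T}x) = 0.5 \iff \theta^{T}x = 0

\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0 \implies x_2 = -\frac{1}{\theta_2}\,(\theta_1 x_1 + \theta_0)
```

With polynomial features there is no such closed form, which is why the other branch evaluates $\theta^{T}x$ on a grid and draws its zero contour.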

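Classifying data

With the code above we have implemented logistic regression. We can now use the fitted parameters and the hypothesis function to do a simple classification, as shown below: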

def predict(theta, X):
    # Predicted probability of the positive class for each sample
    p = sigmoid(np.dot(X, theta))

    # Threshold at 0.5: 1 for the positive class, 0 for the negative class
    pos = np.where(p >= 0.5)
    neg = np.where(p < 0.5)

    p[pos] = 1
    p[neg] = 0
    return p

The result is shown in the figure below:
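A common way to summarize the predictions is the training accuracy, i.e. the share of samples whose predicted label matches the true one. The sketch below uses a compact but equivalent form of predict together with a hand-picked parameter vector and four made-up samples (`theta_demo`, `X_demo` and `y_demo` are hypothetical), so the numbers are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict(theta, X):
    # 1.0 where the predicted probability is at least 0.5, otherwise 0.0
    return (sigmoid(np.dot(X, theta)) >= 0.5).astype(float)

# Hypothetical parameters and samples (the first column is the intercept term)
theta_demo = np.array([-3.0, 1.0, 1.0])
X_demo = np.array([[1.0, 0.5, 0.5],
                   [1.0, 2.0, 2.0],
                   [1.0, 1.0, 3.0],
                   [1.0, 0.0, 1.0]])
y_demo = np.array([0.0, 1.0, 1.0, 1.0])

p = predict(theta_demo, X_demo)
accuracy = np.mean(p == y_demo) * 100

print(p)          # [0. 1. 1. 0.]
print(accuracy)   # 75.0
```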

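Regularized logistic regression

So far we have implemented plain logistic regression. For classification problems with a nonlinear boundary, however, the model may overfit. Regularized logistic regression was introduced to deal with this (see the earlier post on regularization). The code below implements it.

Loading the new data
We load the new training set into the same variables and draw its scatter plot: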

plt.ion()
# The first two columns contain the features; the third column contains the label.
data = np.loadtxt('ex2data2.txt', delimiter=',')
X = data[:, 0:2]
y = data[:, 2]

plotData(X, y)

plt.xlabel('Microchip Test 1')
plt.ylabel('Microchip Test 2')
plt.legend(['y = 1', 'y = 0'])

input('Program paused. Press ENTER to continue')

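As the plot below shows, the decision boundary is clearly nonlinear. To avoid overfitting, we need regularization.

Implementing the regularized cost function
As the algorithm requires, we first set the initial values: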

X = map_feature(X[:, 0], X[:, 1])
initial_theta = np.zeros(X.shape[1])
lmd = 1    # set the lambda parameter to 1
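It can help to check what map_feature actually produces before fitting. With degree 6 it generates every term x1^(i-j) * x2^j for i from 1 to 6, plus a leading column of ones, which gives 1 + (2 + 3 + ... + 7) = 28 columns. The sketch below re-implements the function with an extra `degree` argument and `np.asarray` conversion, both added here for illustration, and runs it on three made-up points.

```python
import numpy as np

def map_feature(x1, x2, degree=6):
    # Same mapping as in the post; `degree` as an argument is an addition here
    x1 = np.asarray(x1, dtype=float).reshape(-1, 1)
    x2 = np.asarray(x2, dtype=float).reshape(-1, 1)
    result = np.ones(x1.shape[0])              # intercept column
    for i in range(1, degree + 1):
        for j in range(i + 1):
            result = np.c_[result, (x1 ** (i - j)) * (x2 ** j)]
    return result

# Hypothetical sample points
x1_demo = np.array([0.5, -0.2, 1.0])
x2_demo = np.array([0.1, 0.7, -1.0])

mapped = map_feature(x1_demo, x2_demo)
print(mapped.shape)    # (3, 28)
print(mapped[:, :3])   # columns: 1, x1, x2
```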

The regularized cost function and its gradient, which gradient descent relies on, are implemented as follows:

def cost_function_reg(theta, X, y, lmd):
    m = y.size
    cost = 0
    grad = np.zeros(theta.shape)

    hypothesis = sigmoid(np.dot(X, theta))

    reg_theta = theta[1:]

    cost = np.sum(-y * np.log(hypothesis) - (1 - y) * np.log(1 - hypothesis)) / m \
           + (lmd / (2 * m)) * np.sum(reg_theta * reg_theta)

    normal_grad = (np.dot(X.T, hypothesis - y) / m).flatten()

    grad[0] = normal_grad[0]
    grad[1:] = normal_grad[1:] + reg_theta * (lmd / m)
    return cost, grad

With the $\theta$ vector set above, the result is as follows:
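One way to gain confidence in a hand-written gradient is to compare it with a finite-difference approximation of the cost. The sketch below does this for cost_function_reg on small random data; the helper `numerical_gradient` and the random inputs are assumptions introduced for the check and are not part of the original post.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_function_reg(theta, X, y, lmd):
    m = y.size
    hypothesis = sigmoid(np.dot(X, theta))
    reg_theta = theta[1:]                      # the intercept is not regularized
    cost = np.sum(-y * np.log(hypothesis) - (1 - y) * np.log(1 - hypothesis)) / m \
           + (lmd / (2 * m)) * np.sum(reg_theta * reg_theta)
    grad = np.dot(X.T, hypothesis - y) / m
    grad[1:] = grad[1:] + reg_theta * (lmd / m)
    return cost, grad

def numerical_gradient(f, theta, eps=1e-5):
    # Hypothetical helper: central differences, one coordinate at a time
    approx = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        approx[k] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return approx

# Small random problem (assumption: stands in for the mapped training data)
rng = np.random.default_rng(1)
X = np.c_[np.ones(20), rng.normal(size=(20, 3))]
y = (rng.random(20) > 0.5).astype(float)
theta = rng.normal(scale=0.1, size=4)
lmd = 1

cost, grad = cost_function_reg(theta, X, y, lmd)
approx = numerical_gradient(lambda t: cost_function_reg(t, X, y, lmd)[0], theta)

print(np.max(np.abs(grad - approx)))           # should be very close to zero
```

If the two gradients disagree noticeably, the usual suspects are a regularization term applied to theta[0] or a missing division by m.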

Plotting the decision boundary
The decision boundary can be drawn with the following code:

print('Plotting decision boundary ...')
# theta is the parameter vector obtained by minimizing cost_function_reg
plot_decision_boundary(theta, X, y)
plt.title('lambda = {}'.format(lmd))

plt.xlabel('Microchip Test 1')
plt.ylabel('Microchip Test 2')

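The resulting decision boundary is shown below:

Classifying data
We have now implemented regularized logistic regression and can use it for a simple classification, as shown below!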
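To make the full regularized workflow concrete, the sketch below strings the pieces together: map the features, minimize the regularized cost, then measure training accuracy. It makes several assumptions. The data is synthetic (points labelled by whether they fall inside a circle) rather than ex2data2.txt, the optimizer call via `scipy.optimize.minimize` is one plausible choice rather than the author's confirmed method, and predict is written in a compact form.

```python
import numpy as np
import scipy.optimize as opt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def map_feature(x1, x2, degree=6):
    x1 = np.asarray(x1, dtype=float).reshape(-1, 1)
    x2 = np.asarray(x2, dtype=float).reshape(-1, 1)
    result = np.ones(x1.shape[0])
    for i in range(1, degree + 1):
        for j in range(i + 1):
            result = np.c_[result, (x1 ** (i - j)) * (x2 ** j)]
    return result

def cost_function_reg(theta, X, y, lmd):
    m = y.size
    hypothesis = sigmoid(np.dot(X, theta))
    reg_theta = theta[1:]
    cost = np.sum(-y * np.log(hypothesis) - (1 - y) * np.log(1 - hypothesis)) / m \
           + (lmd / (2 * m)) * np.sum(reg_theta * reg_theta)
    grad = np.dot(X.T, hypothesis - y) / m
    grad[1:] = grad[1:] + reg_theta * (lmd / m)
    return cost, grad

def predict(theta, X):
    return (sigmoid(np.dot(X, theta)) >= 0.5).astype(float)

# Synthetic data with a circular class boundary (assumption: not the real data)
rng = np.random.default_rng(0)
points = rng.uniform(-1, 1, size=(200, 2))
y = (points[:, 0] ** 2 + points[:, 1] ** 2 < 0.5).astype(float)

X = map_feature(points[:, 0], points[:, 1])
initial_theta = np.zeros(X.shape[1])
lmd = 1

result = opt.minimize(cost_function_reg, initial_theta, args=(X, y, lmd),
                      jac=True, method='BFGS')
theta = result.x

accuracy = np.mean(predict(theta, X) == y) * 100
print('Train accuracy: {:.1f}%'.format(accuracy))
```

Re-running the fit with a smaller or larger `lmd` shows the trade-off that regularization controls: a small value lets the boundary follow the training points more closely, while a large value smooths it out.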

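Summary

We have now implemented logistic regression in Python. As the derivation shows, its formulas are very similar to those of linear regression. The key to the implementation is understanding the formulas and expressing them as matrix operations in code, which requires a solid grasp of the algorithm as well as fluency in Python and its NumPy library.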