CS231n Spring 2019 Assignment 1

2019-09-29 · 赖子啊

It has been almost half a month since the previous Assignment 1 post on KNN, because in between I finished the Assignment 3 posts first, and coming back for another pass really makes things sink in much deeper. KNN is not used in practice (it only gives an intuitive first feel for image classification), for two reasons: (1) it spends far too much time at test time; (2) distances measured on raw pixels are not informative. So we introduce the linear classification model, whose guiding idea is to find a suitable loss function and an optimization method that drives the loss down. The reading is Linear classification: Support Vector Machine, Softmax, and there are two main parts to complete: the SVM classifier and the Softmax classifier.

Below I mostly follow the order of the assignment; for more on understanding the content itself, see here.


SVM

In this part we have to compute the loss in the forward pass and the gradient in the backward pass, in two ways: with naive loops and vectorized. A linear model produces scores through a linear map f: R^{D} \rightarrow R^{K}. For an input sample x_{i}, the score of the j-th class is s_{j}=f(x_{i},W)_{j} (note that the full score vector f(x_{i},W) has length equal to the number of classes; s_{j} is just its j-th component). The multiclass support vector machine loss (also called the hinge loss) is:

L_{i}=\sum_{j \neq y_{i}} \max(0,\,s_{j}-s_{y_{i}}+1)

where L_{i} is the loss of sample x_{i} and s_{y_{i}} is the score of the correct class of x_{i}. The margin threshold here is 1 (written \Delta below), which is also a hyperparameter. From this definition you can see that the loss is 0 only when the correct class's score is higher than every other class's score by at least the margin \Delta; otherwise the loss is positive, since our goal is for the object to be classified correctly.
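To make the formula concrete, here is a quick toy example in numpy with made-up scores (not from the assignment): suppose a sample gets score 3.2 for the correct class and 5.1 and -1.7 for the other two classes, with \Delta = 1.

import numpy as np

# Toy example with made-up scores; class 0 is the correct class, delta = 1
scores = np.array([3.2, 5.1, -1.7])
correct = 0
delta = 1.0

margins = np.maximum(0, scores - scores[correct] + delta)
margins[correct] = 0        # the j == y_i term is excluded from the sum
loss_i = margins.sum()      # max(0, 5.1-3.2+1) + max(0, -1.7-3.2+1) = 2.9 + 0
print(loss_i)               # ~2.9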

The backward pass is slightly more involved, but Optimization: Stochastic Gradient Descent already gives the corresponding gradient formulas:

\left\{\begin{aligned} \nabla_{w_{y_i}} L_i = & -\left(\sum_{j \ne y_i}\mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0)\right)x_i & j = y_i \\ \nabla_{w_j} L_i = & \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0)\, x_i & j \ne y_i \end{aligned}\right.

where \mathbb{1}(\cdot) is the indicator function, which returns 1 when its argument is true and 0 otherwise.

At first glance the formula looks complicated, but a closer look shows it is not: it is the derivative of one sample's loss L_{i} with respect to w_{y_i} or w_{j} (i.e. one row of w, or one column, depending on the order in which x and w are multiplied; in the assignment it is a column). Because s_{y_{i}} appears in every term of L_{i} for that sample, the contributions to \nabla_{w_{y_i}} accumulate, while each of the other w_{j} appears only once. Once the formula makes intuitive sense, the code follows naturally; here is the svm_loss_naive() function:

import numpy as np


def svm_loss_naive(W, X, y, reg):
    """
    Structured SVM loss function, naive implementation (with loops).

    Inputs have dimension D, there are C classes, and we operate on minibatches
    of N examples.

    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength

    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    dW = np.zeros(W.shape) # initialize the gradient as zero

    # compute the loss and the gradient
    num_classes = W.shape[1]
    num_train = X.shape[0]
    num_features = W.shape[0]
    loss = 0.0
    for i in range(num_train):
        scores = X[i].dot(W)
        correct_class_score = scores[y[i]]
        for j in range(num_classes):
            if j == y[i]:
                continue
            margin = scores[j] - correct_class_score + 1 # note delta = 1
            if margin > 0:
                loss += margin
                dW[:, y[i]] += -X[i, :]
                dW[:, j] += X[i, :]
                

    # Right now the loss is a sum over all training examples, but we want it
    # to be an average instead so we divide by num_train.
    loss /= num_train

    # Add regularization to the loss.
    loss += reg * np.sum(W * W)

    #############################################################################
    # TODO:                                                                     #
    # Compute the gradient of the loss function and store it dW.                #
    # Rather than first computing the loss and then computing the derivative,   #
    # it may be simpler to compute the derivative at the same time that the     #
    # loss is being computed. As a result you may need to modify some of the    #
    # code above to compute the gradient.                                       #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    dW /= num_train
    dW += 2 * reg * W
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    
    return loss, dW
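Before moving on to the vectorized version, it is worth sanity-checking the analytic gradient against a numerical one. The notebook does this with a sparse gradient checker; the snippet below is only a minimal centered-difference sketch of the same idea on random data, not the assignment's own checker.

import numpy as np

np.random.seed(0)
D, C, N = 3073, 10, 50                       # CIFAR-10-with-bias sizes, small random batch
W = np.random.randn(D, C) * 0.0001
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)

loss, dW = svm_loss_naive(W, X, y, reg=0.0)

# Check a handful of randomly chosen entries of W with centered differences
h = 1e-5
for _ in range(5):
    ix = tuple(np.random.randint(s) for s in W.shape)
    W[ix] += h
    loss_plus, _ = svm_loss_naive(W, X, y, reg=0.0)
    W[ix] -= 2 * h
    loss_minus, _ = svm_loss_naive(W, X, y, reg=0.0)
    W[ix] += h                               # restore the original value
    grad_numeric = (loss_plus - loss_minus) / (2 * h)
    print('numerical: %f  analytic: %f' % (grad_numeric, dW[ix]))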

For the vectorized version, the loss part is fairly easy to come up with, but the gradient part is harder. This note on matrix-to-matrix backprop, linear backprop example, makes it click. In short, for a linear operation f = xW (where x and W are both matrices), if we already know the gradient of the loss with respect to f, \frac{\partial L}{\partial f}, then (matching dimensions is a handy memory aid):

\frac{\partial L}{\partial W}=x^{T} \frac{\partial L}{\partial f}, \qquad \frac{\partial L}{\partial x}=\frac{\partial L}{\partial f} W^{T}

So we can first compute the gradient of the loss with respect to the scores (stored in margins) and then use the formula above to get dW:

def svm_loss_vectorized(W, X, y, reg):
    """
    Structured SVM loss function, vectorized implementation.

    Inputs and outputs are the same as svm_loss_naive.
    """
    loss = 0.0
    dW = np.zeros(W.shape) # initialize the gradient as zero

    #############################################################################
    # TODO:                                                                     #
    # Implement a vectorized version of the structured SVM loss, storing the    #
    # result in loss.                                                           #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    scores = X.dot(W)        
    num_classes = W.shape[1]
    num_train = X.shape[0]

    scores_correct = scores[np.arange(num_train), y]   # shape (N,)
    scores_correct = np.reshape(scores_correct, (num_train, -1))  # N by 1
    margins = scores - scores_correct + 1    # N by C
    margins = np.maximum(0,margins)
    margins[np.arange(num_train), y] = 0
    loss += np.sum(margins) / num_train
    loss += reg * np.sum(W * W)


    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    #############################################################################
    # TODO:                                                                     #
    # Implement a vectorized version of the gradient for the structured SVM     #
    # loss, storing the result in dW.                                           #
    #                                                                           #
    # Hint: Instead of computing the gradient from scratch, it may be easier    #
    # to reuse some of the intermediate values that you used to compute the     #
    # loss.                                                                     #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    # compute the gradient
    margins[margins > 0] = 1
    row_sum = np.sum(margins, axis=1)                  # shape (N,)
    margins[np.arange(num_train), y] = -row_sum        
    dW += np.dot(X.T, margins)/num_train + 2 * reg * W     # D by C

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    return loss, dW
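In the same spirit as the notebook's timing cells, we can check on random data that the two implementations agree (the losses should match exactly, the gradients up to floating-point error); the reporting code here is my own sketch.

import time
import numpy as np

np.random.seed(1)
W = np.random.randn(3073, 10) * 0.0001
X = np.random.randn(500, 3073)
y = np.random.randint(10, size=500)

tic = time.time()
loss_naive, grad_naive = svm_loss_naive(W, X, y, reg=5e-6)
print('naive:      loss %f, took %fs' % (loss_naive, time.time() - tic))

tic = time.time()
loss_vec, grad_vec = svm_loss_vectorized(W, X, y, reg=5e-6)
print('vectorized: loss %f, took %fs' % (loss_vec, time.time() - tic))

# Loss difference should be zero, gradient difference tiny
print('loss difference: %e' % (loss_naive - loss_vec))
print('gradient difference (Frobenius norm): %e' % np.linalg.norm(grad_naive - grad_vec, ord='fro'))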

After that we still have to fill in LinearClassifier.train(), which mainly uses SGD to update the parameters W (strictly speaking it is mini-batch gradient descent, of which SGD is the batch_size = 1 special case; the underlying idea is still vanilla gradient descent). Here are the relevant parts:

sample_index = np.random.choice(num_train, batch_size, replace=False)
X_batch = X[sample_index, :]   # select the batch sample
y_batch = y[sample_index]      # select the batch label

# SGD
self.W += -learning_rate * grad
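For context, here is a rough paraphrase of how those two pieces sit inside the iteration loop of LinearClassifier.train() (only a sketch; the real method also initializes W, handles verbose printing, and so on; self.loss stands for the loss function of the subclass, e.g. the vectorized SVM loss for LinearSVM):

# Sketch of the surrounding loop (paraphrased, not the full method)
loss_history = []
for it in range(num_iters):
    # sample a mini-batch of training data
    sample_index = np.random.choice(num_train, batch_size, replace=False)
    X_batch = X[sample_index, :]
    y_batch = y[sample_index]

    # evaluate loss and gradient with the loss function implemented above
    loss, grad = self.loss(X_batch, y_batch, reg)
    loss_history.append(loss)

    # vanilla (mini-batch) gradient descent parameter update
    self.W += -learning_rate * grad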

In this part we also have to use the validation set to tune the hyperparameters (mainly the learning rate and the regularization strength). results is a dictionary whose keys are hyperparameter combinations and whose values are the corresponding training and validation accuracies. The part to fill in:

for lr in learning_rates:
    for rs in regularization_strengths:
        mySvm = LinearSVM()
        loss_hist = mySvm.train(X_train, y_train, learning_rate=lr, reg=rs,
                      num_iters=1000, verbose=True)
        y_train_pred = mySvm.predict(X_train)
        train_accuracy = np.mean(y_train == y_train_pred)
        y_val_pred = mySvm.predict(X_val)
        val_accuracy = np.mean(y_val == y_val_pred)
        
        results[(lr, rs)] = (train_accuracy, val_accuracy)
        
        if val_accuracy > best_val:
            best_val = val_accuracy
            best_svm = mySvm
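A small optional addition of my own: sorting the results dictionary by validation accuracy makes the best combination easy to spot.

# Optional: print the grid sorted by validation accuracy (my own addition)
for (lr, rs), (train_acc, val_acc) in sorted(results.items(),
                                             key=lambda kv: kv[1][1],
                                             reverse=True):
    print('lr %e reg %e  train accuracy: %f  val accuracy: %f'
          % (lr, rs, train_acc, val_acc))
print('best validation accuracy achieved: %f' % best_val)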

Softmax

Compared with the SVM classifier, the softmax classifier replaces the hinge loss with the cross-entropy loss; it is also the generalization of the binary logistic regression classifier to multiple classes. The formula first:

L_i = -\log\left(\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }\right) \hspace{0.05in} \text{or equivalently} \hspace{0.05in} L_i = -f_{y_i} + \log\sum_j e^{f_j}

where f is what we previously called the score, except that here it is referred to as the logits (unnormalized log probabilities). Intuitively, the goal is to maximize the probability of the correct class, i.e. push it toward 1, which makes the loss small. From the information-theoretic point of view, the cross-entropy is

H(p,q) = - \sum_x p(x) \log q(x) = H(p) + D_{KL}(p\|q)

where p is the ground-truth distribution, which plays the role of the label, and q is the estimated distribution, i.e. the classifier's output. H(p) is the entropy of p; with one-hot labels H(p)=0, so H(p,q) = D_{KL}(p\|q), the Kullback-Leibler divergence between p and q. As the loss decreases, the distribution estimated by the classifier moves closer to the true distribution.
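A tiny numeric example with made-up logits shows both the loss and the standard trick of subtracting the maximum score, which the code below relies on for numerical stability (shifting the logits by a constant does not change the probabilities):

import numpy as np

# Made-up logits for 3 classes; suppose class 2 is the correct one
f = np.array([-2.85, 0.86, 0.28])
y_i = 2

f_shifted = f - np.max(f)                       # stability: largest logit becomes 0
probs = np.exp(f_shifted) / np.sum(np.exp(f_shifted))
loss_i = -np.log(probs[y_i])
print(probs)                                    # ~[0.016, 0.631, 0.353]
print(loss_i)                                   # ~1.04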

For the backward pass of the naive loss computation, the gradient takes a short derivation; the result is:
\left\{\begin{aligned} \nabla_{w_{y_i}} L_i = & -x_i + \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} x_i & j = y_i \\ \nabla_{w_j} L_i = & \frac{e^{f_j}}{\sum_j e^{f_j}} x_i & j \ne y_i \end{aligned}\right.
As for the vectorized version, we can again first compute the gradient of the loss with respect to f, \frac{\partial L}{\partial f} (see this course note: Putting it together: Minimal Neural Network Case Study; it is consistent with the two formulas above), and then obtain \frac{\partial L}{\partial W} in the same way as for the SVM. Both versions are given below:

def softmax_loss_naive(W, X, y, reg):
    """
    Softmax loss function, naive implementation (with loops)

    Inputs have dimension D, there are C classes, and we operate on minibatches
    of N examples.

    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength

    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)

    #############################################################################
    # TODO: Compute the softmax loss and its gradient using explicit loops.     #
    # Store the loss in loss and the gradient in dW. If you are not careful     #
    # here, it is easy to run into numeric instability. Don't forget the        #
    # regularization!                                                           #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    num_classes = W.shape[1]
    num_train = X.shape[0]
    for i in range(num_train):
        scores = X[i].dot(W)
        # subtract the maximum score for numerical stability
        scores -= np.max(scores)
        exp_scores = np.exp(scores)
        prob = exp_scores / np.sum(exp_scores)
        for j in range(num_classes):
            if j == y[i]:
                correct_class_prob = prob[j]
                correct_class_logprob = -np.log(correct_class_prob)
                loss += correct_class_logprob
                dW[:,j] += (correct_class_prob - 1) * X[i,:]
            # forgot this else at first and could not find the bug for ages
            else: 
                dW[:,j] += prob[j] * X[i,:]


    loss /= num_train 
    loss += 0.5 * reg * np.sum(W * W)

    dW /= num_train
    dW += reg * W

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    return loss, dW


def softmax_loss_vectorized(W, X, y, reg):
    """
    Softmax loss function, vectorized version.

    Inputs and outputs are the same as softmax_loss_naive.
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)

    #############################################################################
    # TODO: Compute the softmax loss and its gradient using no explicit loops.  #
    # Store the loss in loss and the gradient in dW. If you are not careful     #
    # here, it is easy to run into numeric instability. Don't forget the        #
    # regularization!                                                           #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    num_classes = W.shape[1]
    num_train = X.shape[0]
    scores = X.dot(W)
    # subtract the maximum score for numerical stability
    scores = scores - np.max(scores,axis=1).reshape(-1,1)
    exp_scores = np.exp(scores)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    # compute the loss: average cross-entropy loss and regularization
    correct_logprobs = -np.log(probs[range(num_train),y])
    data_loss = np.sum(correct_logprobs)/num_train
    reg_loss = 0.5*reg*np.sum(W*W)
    loss = data_loss + reg_loss

    # compute the gradient on scores
    dscores = probs
    dscores[range(num_train),y] -= 1
    dscores /= num_train
    dW = np.dot(X.T, dscores) + reg * W


    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    return loss, dW
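A quick sanity check (the notebook also asks about this as an inline question): with near-zero weights and no regularization, every class gets probability roughly 1/C, so on CIFAR-10 the loss should be close to -log(1/10) ≈ 2.302. A sketch with random data:

import numpy as np

np.random.seed(2)
W = np.random.randn(3073, 10) * 0.0001          # near-zero weights
X = np.random.randn(500, 3073)
y = np.random.randint(10, size=500)

loss, _ = softmax_loss_vectorized(W, X, y, reg=0.0)
print(loss, -np.log(1.0 / 10))                  # both should be close to 2.302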

We will use this softmax loss many more times later on, so the vectorized version is worth getting very familiar with. The hyperparameter tuning part is very similar to the SVM part, so I will not include it here.

Results

Note that earlier we used the bias trick: we added one extra dimension to the training data so that only W needs to be optimized (W also gains one extra dimension accordingly, so when visualizing the weights that bias dimension has to be stripped off). The figure below illustrates the process:

[Figure: the bias trick]
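In code the trick is just appending a constant-1 column to the data; the preprocessing cell in the notebook does roughly this (variable names may differ):

# Append a constant-1 feature so the bias folds into W:
# X goes from N x D to N x (D+1), and W from D x C to (D+1) x C
X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
X_val   = np.hstack([X_val,   np.ones((X_val.shape[0],   1))])
X_test  = np.hstack([X_test,  np.ones((X_test.shape[0],  1))])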
This is the visualization of the parameters after training the SVM classifier. You can see that the learned weight matrix is essentially a template for each class: the matrix product is just a set of inner products between vectors, and for the resulting score to be large the two vectors have to be similar, so each learned weight vector ends up being a blended template of its class:

[Figure: per-class visualization of the weights learned by the SVM classifier]

This is the visualization of the parameters after training the softmax classifier; it looks a bit less smooth:

[Figure: per-class visualization of the weights learned by the softmax classifier]

Links

For the previous and next posts in this assignment series, see:

Previous post · Next post
