CS231n Spring 2019 Assignment 1

2019-09-29 · 赖子啊

It has been almost half a month since the previous Assignment 1 post on KNN, because in between I finished the Assignment 3 posts first, and coming back for another pass really makes things sink in much deeper. KNN is not used in practice (it only gives an intuitive first feel for image classification), for two reasons: (1) it spends far too much time at test time; (2) distances measured on raw pixels are not informative. So we introduce the linear classification model, whose guiding idea is to find a suitable loss function and an optimization method that drives the loss down. The reading is Linear classification: Support Vector Machine, Softmax, and there are two main parts to complete: the SVM classifier and the Softmax classifier.

Below I mostly follow the order of the assignment; for more on understanding the content itself, see here.


SVM

In this part we have to compute the loss in the forward pass and the gradient in the backward pass, in two ways: with naive loops and vectorized. A linear model produces scores through a linear map f: R^{D} \rightarrow R^{K}. For an input sample x_{i}, the score of the j-th class is s_{j}=f(x_{i},W)_{j} (note that the full score vector f(x_{i},W) has length equal to the number of classes; s_{j} is just its j-th component). The multiclass support vector machine loss (also called the hinge loss) is:

L_{i}=\sum_{j \neq y_{i}} \max(0,\,s_{j}-s_{y_{i}}+1)

where L_{i} is the loss of sample x_{i} and s_{y_{i}} is the score of the correct class of x_{i}. The margin threshold here is 1 (written \Delta below), which is also a hyperparameter. From this definition you can see that the loss is 0 only when the correct class's score is higher than every other class's score by at least the margin \Delta; otherwise the loss is positive, since our goal is for the object to be classified correctly.
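To make the formula concrete, here is a quick toy example in numpy with made-up scores (not from the assignment): suppose a sample gets score 3.2 for the correct class and 5.1 and -1.7 for the other two classes, with \Delta = 1.

import numpy as np

# Toy example with made-up scores; class 0 is the correct class, delta = 1
scores = np.array([3.2, 5.1, -1.7])
correct = 0
delta = 1.0

margins = np.maximum(0, scores - scores[correct] + delta)
margins[correct] = 0        # the j == y_i term is excluded from the sum
loss_i = margins.sum()      # max(0, 5.1-3.2+1) + max(0, -1.7-3.2+1) = 2.9 + 0
print(loss_i)               # ~2.9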

The backward pass is slightly more involved, but Optimization: Stochastic Gradient Descent already gives the corresponding gradient formulas:

\left\{\begin{aligned} \nabla_{w_{y_i}} L_i = & -\left(\sum_{j \ne y_i}\mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0)\right)x_i & j = y_i \\ \nabla_{w_j} L_i = & \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0)\, x_i & j \ne y_i \end{aligned}\right.

where \mathbb{1}(\cdot) is the indicator function, which returns 1 when its argument is true and 0 otherwise.

At first glance the formula looks complicated, but a closer look shows it is not: it is the derivative of one sample's loss L_{i} with respect to w_{y_i} or w_{j} (i.e. one row of w, or one column, depending on the order in which x and w are multiplied; in the assignment it is a column). Because s_{y_{i}} appears in every term of L_{i} for that sample, the contributions to \nabla_{w_{y_i}} accumulate, while each of the other w_{j} appears only once. Once the formula makes intuitive sense, the code follows naturally; here is the svm_loss_naive() function:

import numpy as np


def svm_loss_naive(W, X, y, reg):
    """
    Structured SVM loss function, naive implementation (with loops).

    Inputs have dimension D, there are C classes, and we operate on minibatches
    of N examples.

    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength

    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    dW = np.zeros(W.shape) # initialize the gradient as zero

    # compute the loss and the gradient
    num_classes = W.shape[1]
    num_train = X.shape[0]
    num_features = W.shape[0]
    loss = 0.0
    for i in range(num_train):
        scores = X[i].dot(W)
        correct_class_score = scores[y[i]]
        for j in range(num_classes):
            if j == y[i]:
                continue
            margin = scores[j] - correct_class_score + 1 # note delta = 1
            if margin > 0:
                loss += margin
                dW[:, y[i]] += -X[i, :]
                dW[:, j] += X[i, :]
                

    # Right now the loss is a sum over all training examples, but we want it
    # to be an average instead so we divide by num_train.
    loss /= num_train

    # Add regularization to the loss.
    loss += reg * np.sum(W * W)

    #############################################################################
    # TODO:                                                                     #
    # Compute the gradient of the loss function and store it dW.                #
    # Rather than first computing the loss and then computing the derivative,   #
    # it may be simpler to compute the derivative at the same time that the     #
    # loss is being computed. As a result you may need to modify some of the    #
    # code above to compute the gradient.                                       #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    dW /= num_train
    dW += 2 * reg * W
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    
    return loss, dW
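Before moving on to the vectorized version, it is worth sanity-checking the analytic gradient against a numerical one. The notebook does this with a sparse gradient checker; the snippet below is only a minimal centered-difference sketch of the same idea on random data, not the assignment's own checker.

import numpy as np

np.random.seed(0)
D, C, N = 3073, 10, 50                       # CIFAR-10-with-bias sizes, small random batch
W = np.random.randn(D, C) * 0.0001
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)

loss, dW = svm_loss_naive(W, X, y, reg=0.0)

# Check a handful of randomly chosen entries of W with centered differences
h = 1e-5
for _ in range(5):
    ix = tuple(np.random.randint(s) for s in W.shape)
    W[ix] += h
    loss_plus, _ = svm_loss_naive(W, X, y, reg=0.0)
    W[ix] -= 2 * h
    loss_minus, _ = svm_loss_naive(W, X, y, reg=0.0)
    W[ix] += h                               # restore the original value
    grad_numeric = (loss_plus - loss_minus) / (2 * h)
    print('numerical: %f  analytic: %f' % (grad_numeric, dW[ix]))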

For the vectorized version, the loss part is fairly easy to come up with, but the gradient part is harder. This note on matrix-to-matrix backprop, linear backprop example, makes it click. In short, for a linear operation f = xW (where x and W are both matrices), if we already know the gradient of the loss with respect to f, \frac{\partial L}{\partial f}, then (matching dimensions is a handy memory aid):

\frac{\partial L}{\partial W}=x^{T} \frac{\partial L}{\partial f}, \qquad \frac{\partial L}{\partial x}=\frac{\partial L}{\partial f} W^{T}

So we can first compute the gradient of the loss with respect to the scores (stored in margins) and then use the formula above to get dW:

def svm_loss_vectorized(W, X, y, reg):
    """
    Structured SVM loss function, vectorized implementation.

    Inputs and outputs are the same as svm_loss_naive.
    """
    loss = 0.0
    dW = np.zeros(W.shape) # initialize the gradient as zero

    #############################################################################
    # TODO:                                                                     #
    # Implement a vectorized version of the structured SVM loss, storing the    #
    # result in loss.                                                           #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    scores = X.dot(W)        
    num_classes = W.shape[1]
    num_train = X.shape[0]

    scores_correct = scores[np.arange(num_train), y]   # shape (N,)
    scores_correct = np.reshape(scores_correct, (num_train, -1))  # N by 1
    margins = scores - scores_correct + 1    # N by C
    margins = np.maximum(0,margins)
    margins[np.arange(num_train), y] = 0
    loss += np.sum(margins) / num_train
    loss += reg * np.sum(W * W)


    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    #############################################################################
    # TODO:                                                                     #
    # Implement a vectorized version of the gradient for the structured SVM     #
    # loss, storing the result in dW.                                           #
    #                                                                           #
    # Hint: Instead of computing the gradient from scratch, it may be easier    #
    # to reuse some of the intermediate values that you used to compute the     #
    # loss.                                                                     #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    # compute the gradient
    margins[margins > 0] = 1
    row_sum = np.sum(margins, axis=1)                  # shape (N,)
    margins[np.arange(num_train), y] = -row_sum        
    dW += np.dot(X.T, margins)/num_train + 2 * reg * W     # D by C

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    return loss, dW
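In the same spirit as the notebook's timing cells, we can check on random data that the two implementations agree (the losses should match exactly, the gradients up to floating-point error); the reporting code here is my own sketch.

import time
import numpy as np

np.random.seed(1)
W = np.random.randn(3073, 10) * 0.0001
X = np.random.randn(500, 3073)
y = np.random.randint(10, size=500)

tic = time.time()
loss_naive, grad_naive = svm_loss_naive(W, X, y, reg=5e-6)
print('naive:      loss %f, took %fs' % (loss_naive, time.time() - tic))

tic = time.time()
loss_vec, grad_vec = svm_loss_vectorized(W, X, y, reg=5e-6)
print('vectorized: loss %f, took %fs' % (loss_vec, time.time() - tic))

# Loss difference should be zero, gradient difference tiny
print('loss difference: %e' % (loss_naive - loss_vec))
print('gradient difference (Frobenius norm): %e' % np.linalg.norm(grad_naive - grad_vec, ord='fro'))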

After that we still have to fill in LinearClassifier.train(), which mainly uses SGD to update the parameters W (strictly speaking it is mini-batch gradient descent, of which SGD is the batch_size = 1 special case; the underlying idea is still vanilla gradient descent). Here are the relevant parts:

sample_index = np.random.choice(num_train, batch_size, replace=False)
X_batch = X[sample_index, :]   # select the batch sample
y_batch = y[sample_index]      # select the batch label

# SGD
self.W += -learning_rate * grad
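For context, here is a rough paraphrase of how those two pieces sit inside the iteration loop of LinearClassifier.train() (only a sketch; the real method also initializes W, handles verbose printing, and so on; self.loss stands for the loss function of the subclass, e.g. the vectorized SVM loss for LinearSVM):

# Sketch of the surrounding loop (paraphrased, not the full method)
loss_history = []
for it in range(num_iters):
    # sample a mini-batch of training data
    sample_index = np.random.choice(num_train, batch_size, replace=False)
    X_batch = X[sample_index, :]
    y_batch = y[sample_index]

    # evaluate loss and gradient with the loss function implemented above
    loss, grad = self.loss(X_batch, y_batch, reg)
    loss_history.append(loss)

    # vanilla (mini-batch) gradient descent parameter update
    self.W += -learning_rate * grad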

In this part we also have to use the validation set to tune the hyperparameters (mainly the learning rate and the regularization strength). results is a dictionary whose keys are hyperparameter combinations and whose values are the corresponding training and validation accuracies. The part to fill in:

for lr in learning_rates:
    for rs in regularization_strengths:
        mySvm = LinearSVM()
        loss_hist = mySvm.train(X_train, y_train, learning_rate=lr, reg=rs,
                      num_iters=1000, verbose=True)
        y_train_pred = mySvm.predict(X_train)
        train_accuracy = np.mean(y_train == y_train_pred)
        y_val_pred = mySvm.predict(X_val)
        val_accuracy = np.mean(y_val == y_val_pred)
        
        results[(lr, rs)] = (train_accuracy, val_accuracy)
        
        if val_accuracy > best_val:
            best_val = val_accuracy
            best_svm = mySvm
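A small optional addition of my own: sorting the results dictionary by validation accuracy makes the best combination easy to spot.

# Optional: print the grid sorted by validation accuracy (my own addition)
for (lr, rs), (train_acc, val_acc) in sorted(results.items(),
                                             key=lambda kv: kv[1][1],
                                             reverse=True):
    print('lr %e reg %e  train accuracy: %f  val accuracy: %f'
          % (lr, rs, train_acc, val_acc))
print('best validation accuracy achieved: %f' % best_val)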

Softmax

Compared with the SVM classifier, the softmax classifier replaces the hinge loss with the cross-entropy loss; it is also the generalization of the binary logistic regression classifier to multiple classes. The formula first:

L_i = -\log\left(\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }\right) \hspace{0.05in} \text{or equivalently} \hspace{0.05in} L_i = -f_{y_i} + \log\sum_j e^{f_j}

where f is what we previously called the score, except that here it is referred to as the logits (unnormalized log probabilities). Intuitively, the goal is to maximize the probability of the correct class, i.e. push it toward 1, which makes the loss small. From the information-theoretic point of view, the cross-entropy is

H(p,q) = - \sum_x p(x) \log q(x) = H(p) + D_{KL}(p\|q)

where p is the ground-truth distribution, which plays the role of the label, and q is the estimated distribution, i.e. the classifier's output. H(p) is the entropy of p; with one-hot labels H(p)=0, so H(p,q) = D_{KL}(p\|q), the Kullback-Leibler divergence between p and q. As the loss decreases, the distribution estimated by the classifier moves closer to the true distribution.
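A tiny numeric example with made-up logits shows both the loss and the standard trick of subtracting the maximum score, which the code below relies on for numerical stability (shifting the logits by a constant does not change the probabilities):

import numpy as np

# Made-up logits for 3 classes; suppose class 2 is the correct one
f = np.array([-2.85, 0.86, 0.28])
y_i = 2

f_shifted = f - np.max(f)                       # stability: largest logit becomes 0
probs = np.exp(f_shifted) / np.sum(np.exp(f_shifted))
loss_i = -np.log(probs[y_i])
print(probs)                                    # ~[0.016, 0.631, 0.353]
print(loss_i)                                   # ~1.04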

For the backward pass of the naive loss computation, the gradient takes a short derivation; the result is:
\left\{\begin{aligned} \nabla_{w_{y_i}} L_i = & -x_i + \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} x_i & j = y_i \\ \nabla_{w_j} L_i = & \frac{e^{f_j}}{\sum_j e^{f_j}} x_i & j \ne y_i \end{aligned}\right.
As for the vectorized version, we can again first compute the gradient of the loss with respect to f, \frac{\partial L}{\partial f} (see this course note: Putting it together: Minimal Neural Network Case Study; it is consistent with the two formulas above), and then obtain \frac{\partial L}{\partial W} in the same way as for the SVM. Both versions are given below:

def softmax_loss_naive(W, X, y, reg):
    """
    Softmax loss function, naive implementation (with loops)

    Inputs have dimension D, there are C classes, and we operate on minibatches
    of N examples.

    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength

    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)

    #############################################################################
    # TODO: Compute the softmax loss and its gradient using explicit loops.     #
    # Store the loss in loss and the gradient in dW. If you are not careful     #
    # here, it is easy to run into numeric instability. Don't forget the        #
    # regularization!                                                           #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    num_classes = W.shape[1]
    num_train = X.shape[0]
    for i in range(num_train):
        scores = X[i].dot(W)
        # subtract the maximum score for numerical stability
        scores -= np.max(scores)
        exp_scores = np.exp(scores)
        prob = exp_scores / np.sum(exp_scores)
        for j in range(num_classes):
            if j == y[i]:
                correct_class_prob = prob[j]
                correct_class_logprob = -np.log(correct_class_prob)
                loss += correct_class_logprob
                dW[:,j] += (correct_class_prob - 1) * X[i,:]
            # forgot this else at first and could not find the bug for ages
            else: 
                dW[:,j] += prob[j] * X[i,:]


    loss /= num_train 
    loss += 0.5 * reg * np.sum(W * W)

    dW /= num_train
    dW += reg * W

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    return loss, dW


def softmax_loss_vectorized(W, X, y, reg):
    """
    Softmax loss function, vectorized version.

    Inputs and outputs are the same as softmax_loss_naive.
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)

    #############################################################################
    # TODO: Compute the softmax loss and its gradient using no explicit loops.  #
    # Store the loss in loss and the gradient in dW. If you are not careful     #
    # here, it is easy to run into numeric instability. Don't forget the        #
    # regularization!                                                           #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    num_classes = W.shape[1]
    num_train = X.shape[0]
    scores = X.dot(W)
    # subtract the maximum score for numerical stability
    scores = scores - np.max(scores,axis=1).reshape(-1,1)
    exp_scores = np.exp(scores)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    # compute the loss: average cross-entropy loss and regularization
    correct_logprobs = -np.log(probs[range(num_train),y])
    data_loss = np.sum(correct_logprobs)/num_train
    reg_loss = 0.5*reg*np.sum(W*W)
    loss = data_loss + reg_loss

    # compute the gradient on scores
    dscores = probs
    dscores[range(num_train),y] -= 1
    dscores /= num_train
    dW = np.dot(X.T, dscores) + reg * W


    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    return loss, dW
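A quick sanity check (the notebook also asks about this as an inline question): with near-zero weights and no regularization, every class gets probability roughly 1/C, so on CIFAR-10 the loss should be close to -log(1/10) ≈ 2.302. A sketch with random data:

import numpy as np

np.random.seed(2)
W = np.random.randn(3073, 10) * 0.0001          # near-zero weights
X = np.random.randn(500, 3073)
y = np.random.randint(10, size=500)

loss, _ = softmax_loss_vectorized(W, X, y, reg=0.0)
print(loss, -np.log(1.0 / 10))                  # both should be close to 2.302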

We will use this softmax loss many more times later on, so the vectorized version is worth getting very familiar with. The hyperparameter tuning part is very similar to the SVM part, so I will not include it here.

Results

Note that earlier we used the bias trick: we added one extra dimension to the training data so that only W needs to be optimized (W also gains one extra dimension accordingly, so when visualizing the weights that bias dimension has to be stripped off). The figure below illustrates the process:

[Figure: the bias trick]
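In code the trick is just appending a constant-1 column to the data; the preprocessing cell in the notebook does roughly this (variable names may differ):

# Append a constant-1 feature so the bias folds into W:
# X goes from N x D to N x (D+1), and W from D x C to (D+1) x C
X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
X_val   = np.hstack([X_val,   np.ones((X_val.shape[0],   1))])
X_test  = np.hstack([X_test,  np.ones((X_test.shape[0],  1))])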
This is the visualization of the parameters after training the SVM classifier. You can see that the learned weight matrix is essentially a template for each class: the matrix product is just a set of inner products between vectors, and for the resulting score to be large the two vectors have to be similar, so each learned weight vector ends up being a blended template of its class:

[Figure: per-class visualization of the weights learned by the SVM classifier]

This is the visualization of the parameters after training the softmax classifier; it looks a bit less smooth:

[Figure: per-class visualization of the weights learned by the softmax classifier]

Links

For the previous and next posts in this assignment series, see:

Previous post · Next post
