CS231n Spring 2019 Assignment 1: SVM & Softmax
It has been almost half a month since the previous Assignment 1 post on KNN, because in between I finished the Assignment 3 posts first; coming back for another pass really does deepen the understanding. The KNN classifier from last time is never used in practice (it only gives an intuitive feel for image classification), for two reasons: (1) it spends too much time at test time; (2) distances measured on raw pixels are not informative. So we introduce linear classification models, and the guiding idea is to find a suitable loss function and an optimization method to reduce that loss. The reading is Linear classification: Support Vector Machine, Softmax, and there are two notebooks to complete:
- svm.ipynb: Multiclass Support Vector Machine exercise
- softmax.ipynb: Softmax exercise
I will mostly follow the order of the assignment below; for more on understanding the material itself, see here.
SVM
In this part we have to implement the forward pass (computing the loss) and the backward pass (computing the gradient) in two ways: with naive loops and fully vectorized. A linear model produces class scores through a linear mapping $s = x_i W$; after feeding in a sample $x_i$, the classifier's score for the $j$-th class is $s_j = (x_i W)_j$ [note that $s$ is a vector whose length equals the number of classes], and the multiclass support vector machine loss (also called the hinge loss) is:

$$L_i = \sum_{j \neq y_i} \max\big(0,\; s_j - s_{y_i} + \Delta\big)$$
Here $L_i$ is the loss for sample $i$ and $s_{y_i}$ is the score of the correct class; the threshold $\Delta$ is set to 1 and is itself a hyperparameter. From the definition you can see that the loss is 0 only when the score of the correct class exceeds every other class's score by at least the margin $\Delta$; otherwise the loss is a positive value, because our goal is for every example to be classified correctly.
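To make the formula concrete, here is a tiny made-up example (the scores and label are invented purely for illustration, not taken from the assignment data):

```python
import numpy as np

# made-up scores for one sample over 3 classes, correct class y_i = 0, delta = 1
scores = np.array([3.2, 5.1, -1.7])
y_i, delta = 0, 1.0

# hinge loss: sum over wrong classes of max(0, s_j - s_{y_i} + delta)
margins = np.maximum(0, scores - scores[y_i] + delta)
margins[y_i] = 0              # the correct class contributes nothing
L_i = margins.sum()           # max(0, 5.1-3.2+1) + max(0, -1.7-3.2+1) = 2.9 + 0
print(L_i)                    # 2.9
```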
Backpropagation is a bit more involved, but the course note Optimization: Stochastic Gradient Descent already gives the corresponding gradient formulas:

$$\nabla_{w_{y_i}} L_i = -\Big(\sum_{j \neq y_i} \mathbb{1}\big(s_j - s_{y_i} + \Delta > 0\big)\Big)\, x_i, \qquad \nabla_{w_j} L_i = \mathbb{1}\big(s_j - s_{y_i} + \Delta > 0\big)\, x_i \quad (j \neq y_i)$$
Here $\mathbb{1}(\cdot)$ is the indicator function: it returns 1 when the condition inside is true and 0 otherwise.
The formula looks complicated at first glance, but it is not bad once you break it down: this is the derivative of one sample's loss with respect to $w_j$, i.e. one row of $W$ (or one column, depending on the order in which $W$ and $x_i$ are multiplied; in the assignment it is a column). For a single training sample, $w_{y_i}$ shows up in every margin term when computing the loss, so its gradient accumulates all of those terms, while each other $w_j$ shows up only once. Once the formula makes intuitive sense, the code follows directly; here is the svm_loss_naive() function:
import numpy as np

def svm_loss_naive(W, X, y, reg):
    """
    Structured SVM loss function, naive implementation (with loops).

    Inputs have dimension D, there are C classes, and we operate on minibatches
    of N examples.

    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength

    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    dW = np.zeros(W.shape)  # initialize the gradient as zero

    # compute the loss and the gradient
    num_classes = W.shape[1]
    num_train = X.shape[0]
    num_features = W.shape[0]
    loss = 0.0
    for i in range(num_train):
        scores = X[i].dot(W)
        correct_class_score = scores[y[i]]
        for j in range(num_classes):
            if j == y[i]:
                continue
            margin = scores[j] - correct_class_score + 1  # note delta = 1
            if margin > 0:
                loss += margin
                dW[:, y[i]] += -X[i, :]
                dW[:, j] += X[i, :]

    # Right now the loss is a sum over all training examples, but we want it
    # to be an average instead so we divide by num_train.
    loss /= num_train

    # Add regularization to the loss.
    loss += reg * np.sum(W * W)

    #############################################################################
    # TODO:                                                                     #
    # Compute the gradient of the loss function and store it dW.                #
    # Rather that first computing the loss and then computing the derivative,   #
    # it may be simpler to compute the derivative at the same time that the     #
    # loss is being computed. As a result you may need to modify some of the    #
    # code above to compute the gradient.                                       #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    dW /= num_train
    dW += 2 * reg * W

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    return loss, dW
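Before moving on, it is worth sanity-checking the analytic gradient against a numerical one. The assignment does this with grad_check_sparse from cs231n.gradient_check; the small hand-rolled check below does the same thing on a tiny made-up problem (the helper name numerical_grad and the random shapes are my own, not part of the assignment):

```python
import numpy as np

def numerical_grad(f, W, h=1e-5):
    """Central-difference approximation of the gradient of f at W."""
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old = W[ix]
        W[ix] = old + h
        fxph = f(W)
        W[ix] = old - h
        fxmh = f(W)
        W[ix] = old
        grad[ix] = (fxph - fxmh) / (2 * h)
        it.iternext()
    return grad

# tiny random problem: D=10 features, C=4 classes, N=5 samples (made-up sizes)
np.random.seed(0)
W = np.random.randn(10, 4) * 0.01
X = np.random.randn(5, 10)
y = np.random.randint(4, size=5)

loss, analytic_grad = svm_loss_naive(W, X, y, reg=0.1)
numeric_grad = numerical_grad(lambda w: svm_loss_naive(w, X, y, reg=0.1)[0], W)
print(np.max(np.abs(analytic_grad - numeric_grad)))  # should be tiny, ~1e-9
```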
For the vectorized version, the loss part is fairly easy to come up with, but the gradient part is harder. It becomes much clearer after reading this note on matrix-to-matrix differentiation: linear backprop example. In short, for a linear operation $S = XW$ (where $X$ and $W$ are both matrices), if we already know the gradient of the loss with respect to $S$, then [dimension matching is a handy way to remember this]:

$$\frac{\partial L}{\partial W} = X^{T}\,\frac{\partial L}{\partial S}, \qquad \frac{\partial L}{\partial X} = \frac{\partial L}{\partial S}\,W^{T}$$

So we can first compute the gradient of the loss with respect to the scores (stored in margins), and then obtain $dW$ from the formula above:
def svm_loss_vectorized(W, X, y, reg):
    """
    Structured SVM loss function, vectorized implementation.

    Inputs and outputs are the same as svm_loss_naive.
    """
    loss = 0.0
    dW = np.zeros(W.shape)  # initialize the gradient as zero

    #############################################################################
    # TODO:                                                                     #
    # Implement a vectorized version of the structured SVM loss, storing the    #
    # result in loss.                                                           #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    scores = X.dot(W)
    num_classes = W.shape[1]
    num_train = X.shape[0]
    scores_correct = scores[np.arange(num_train), y]              # shape (N,)
    scores_correct = np.reshape(scores_correct, (num_train, -1))  # N by 1
    margins = scores - scores_correct + 1                         # N by C
    margins = np.maximum(0, margins)
    margins[np.arange(num_train), y] = 0
    loss += np.sum(margins) / num_train
    loss += reg * np.sum(W * W)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    #############################################################################
    # TODO:                                                                     #
    # Implement a vectorized version of the gradient for the structured SVM     #
    # loss, storing the result in dW.                                           #
    #                                                                           #
    # Hint: Instead of computing the gradient from scratch, it may be easier    #
    # to reuse some of the intermediate values that you used to compute the     #
    # loss.                                                                     #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    # compute the gradient
    margins[margins > 0] = 1
    row_sum = np.sum(margins, axis=1)                      # shape (N,)
    margins[np.arange(num_train), y] = -row_sum
    dW += np.dot(X.T, margins) / num_train + 2 * reg * W   # D by C

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    return loss, dW
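The notebook then compares the two implementations; a quick check along the lines below (the random shapes here are made up for illustration) should show matching losses and gradients, with a large speedup for the vectorized version:

```python
import time
import numpy as np

# CIFAR-10-like shapes: 3073 features (with bias), 10 classes, 500 samples
np.random.seed(1)
W = np.random.randn(3073, 10) * 0.0001
X = np.random.randn(500, 3073)
y = np.random.randint(10, size=500)

tic = time.time()
loss_naive, grad_naive = svm_loss_naive(W, X, y, reg=5e-6)
toc = time.time()
print('naive loss: %e, took %fs' % (loss_naive, toc - tic))

tic = time.time()
loss_vec, grad_vec = svm_loss_vectorized(W, X, y, reg=5e-6)
toc = time.time()
print('vectorized loss: %e, took %fs' % (loss_vec, toc - tic))

print('loss difference: %f' % (loss_naive - loss_vec))
print('gradient difference: %f' % np.linalg.norm(grad_naive - grad_vec, ord='fro'))
```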
After that we still have to fill in LinearClassifier.train(), which updates the parameters with SGD (strictly speaking it is mini-batch gradient descent; SGD is just the special case with batch_size=1, and the underlying idea is still vanilla gradient descent). Here are only the relevant lines:
sample_index = np.random.choice(num_train, batch_size, replace=False)
X_batch = X[sample_index, :] # select the batch sample
y_batch = y[sample_index] # select the batch label
# SGD
self.W += -learning_rate * grad
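For context, here is a rough sketch of what the whole training loop does (a paraphrase written from memory, not the literal starter code; the function name sgd_train and its default arguments are my own):

```python
import numpy as np

def sgd_train(model, X, y, learning_rate=1e-3, reg=1e-5,
              num_iters=1000, batch_size=200, verbose=False):
    """Rough paraphrase of LinearClassifier.train(): vanilla mini-batch SGD.

    `model` is assumed to expose .W and a .loss(X_batch, y_batch, reg) method
    returning (loss, grad), as the assignment's LinearClassifier does.
    """
    num_train = X.shape[0]
    loss_history = []
    for it in range(num_iters):
        # sample a mini-batch of data and labels
        idx = np.random.choice(num_train, batch_size, replace=False)
        X_batch, y_batch = X[idx], y[idx]

        # forward/backward pass, then one vanilla gradient step
        loss, grad = model.loss(X_batch, y_batch, reg)
        loss_history.append(loss)
        model.W += -learning_rate * grad

        if verbose and it % 100 == 0:
            print('iteration %d / %d: loss %f' % (it, num_iters, loss))
    return loss_history
```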
In this part we also have to use the validation set to tune the hyperparameters (mainly the learning rate and the regularization strength). results is a dictionary whose keys are (learning rate, regularization strength) pairs and whose values are the corresponding training and validation accuracies. The part to fill in:
for lr in learning_rates:
    for rs in regularization_strengths:
        mySvm = LinearSVM()
        loss_hist = mySvm.train(X_train, y_train, learning_rate=lr, reg=rs,
                                num_iters=1000, verbose=True)
        y_train_pred = mySvm.predict(X_train)
        train_accuracy = np.mean(y_train == y_train_pred)
        y_val_pred = mySvm.predict(X_val)
        val_accuracy = np.mean(y_val == y_val_pred)
        results[(lr, rs)] = (train_accuracy, val_accuracy)
        if val_accuracy > best_val:
            best_val = val_accuracy
            best_svm = mySvm
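The notebook's provided code then prints every entry of results together with the best validation accuracy; if you want to eyeball the sweep yourself, something like this works (this printing loop is my own helper, not part of the graded code):

```python
# print the sweep sorted by validation accuracy (my own helper, not graded code)
for (lr, rs), (train_acc, val_acc) in sorted(results.items(), key=lambda kv: kv[1][1]):
    print('lr %e reg %e train accuracy: %f val accuracy: %f' % (lr, rs, train_acc, val_acc))
print('best validation accuracy achieved during cross-validation: %f' % best_val)
```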
Softmax
Compared with the SVM classifier, the softmax classifier simply replaces the hinge loss with the cross-entropy loss; it is also the generalization of the binary logistic regression classifier to multiple classes. The loss is:

$$L_i = -\log\!\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)$$

where $s$ is exactly the score vector from before, except that here the scores are called logits (unnormalized log probabilities). Intuitively, the goal is to maximize the probability of the correct class: the closer that probability is to 1, the smaller the loss. From the information-theory point of view, the cross-entropy is

$$H(p, q) = -\sum_x p(x)\log q(x) = H(p) + D_{KL}(p\,\|\,q)$$

where $p$ is the ground-truth distribution, which plays the role of the label, and $q$ is the estimated distribution, i.e. the classifier's output. $H(p)$ is the entropy of $p$; with one-hot encoding $H(p) = 0$, so $H(p, q) = D_{KL}(p\,\|\,q)$, the Kullback-Leibler divergence between $p$ and $q$. As the loss decreases, the distribution estimated by the classifier moves closer to the true distribution.
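A tiny made-up example (scores invented for illustration), including the max-subtraction trick used later for numerical stability:

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])   # made-up logits, correct class y_i = 0
y_i = 0

shifted = scores - np.max(scores)     # subtract the max for numerical stability
probs = np.exp(shifted) / np.sum(np.exp(shifted))
print(probs)                          # roughly [0.130, 0.869, 0.001]
L_i = -np.log(probs[y_i])
print(L_i)                            # roughly 2.04
```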
For backpropagation, the naive loss computation needs a short derivation; working it out gives:

$$\frac{\partial L_i}{\partial w_{y_i}} = (p_{y_i} - 1)\,x_i, \qquad \frac{\partial L_i}{\partial w_j} = p_j\,x_i \;\;(j \neq y_i), \qquad \text{where } p_j = \frac{e^{s_j}}{\sum_k e^{s_k}}$$
As for the vectorized version, we can again first compute the gradient of the loss with respect to the scores (see this course note: Putting it together: Minimal Neural Network Case Study; it is really the same thing as the two formulas above), and then obtain the gradient of $W$ with the same $X^T$ trick as in the SVM part. Both versions are given together below:
def softmax_loss_naive(W, X, y, reg):
    """
    Softmax loss function, naive implementation (with loops)

    Inputs have dimension D, there are C classes, and we operate on minibatches
    of N examples.

    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength

    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)

    #############################################################################
    # TODO: Compute the softmax loss and its gradient using explicit loops.     #
    # Store the loss in loss and the gradient in dW. If you are not careful     #
    # here, it is easy to run into numeric instability. Don't forget the        #
    # regularization!                                                           #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    num_classes = W.shape[1]
    num_train = X.shape[0]
    for i in range(num_train):
        scores = X[i].dot(W)
        # subtract the maximum score to keep the computation numerically stable
        scores -= np.max(scores)
        exp_scores = np.exp(scores)
        prob = exp_scores / np.sum(exp_scores)
        for j in range(num_classes):
            if j == y[i]:
                correct_class_prob = prob[j]
                correct_class_logprob = -np.log(correct_class_prob)
                loss += correct_class_logprob
                dW[:, j] += (correct_class_prob - 1) * X[i, :]
            # I forgot this else branch at first and spent a long time hunting the bug
            else:
                dW[:, j] += prob[j] * X[i, :]

    loss /= num_train
    loss += 0.5 * reg * np.sum(W * W)
    dW /= num_train
    dW += reg * W

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    return loss, dW


def softmax_loss_vectorized(W, X, y, reg):
    """
    Softmax loss function, vectorized version.

    Inputs and outputs are the same as softmax_loss_naive.
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)

    #############################################################################
    # TODO: Compute the softmax loss and its gradient using no explicit loops.  #
    # Store the loss in loss and the gradient in dW. If you are not careful     #
    # here, it is easy to run into numeric instability. Don't forget the        #
    # regularization!                                                           #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    num_classes = W.shape[1]
    num_train = X.shape[0]
    scores = X.dot(W)
    # subtract the per-row maximum to keep the computation numerically stable
    scores = scores - np.max(scores, axis=1).reshape(-1, 1)
    exp_scores = np.exp(scores)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

    # compute the loss: average cross-entropy loss and regularization
    correct_logprobs = -np.log(probs[range(num_train), y])
    data_loss = np.sum(correct_logprobs) / num_train
    reg_loss = 0.5 * reg * np.sum(W * W)
    loss = data_loss + reg_loss

    # compute the gradient on scores, then backprop through scores = X.dot(W)
    dscores = probs
    dscores[range(num_train), y] -= 1
    dscores /= num_train
    dW = np.dot(X.T, dscores) + reg * W

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    return loss, dW
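One quick sanity check from the notebook: with near-zero random weights and 10 classes, every class gets probability roughly 0.1, so the initial data loss should be close to -log(0.1) ≈ 2.302. A minimal version of that check (the random shapes are placeholders of my own):

```python
import numpy as np

# with near-zero weights each of the 10 classes gets probability ~0.1,
# so the (unregularized) loss should be close to -log(0.1) ~= 2.302
np.random.seed(2)
W = np.random.randn(3073, 10) * 0.0001
X = np.random.randn(500, 3073)
y = np.random.randint(10, size=500)

loss, _ = softmax_loss_vectorized(W, X, y, reg=0.0)
print('loss: %f (expected roughly %f)' % (loss, -np.log(0.1)))
```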
We will use this softmax loss many more times later on, so it is worth getting very familiar with the vectorized version. The hyperparameter tuning part is almost identical to the SVM part, so I will not repeat it here.
Results
Note that earlier we used the bias trick: an extra dimension (all ones) was appended to the training data, so that only W needs to be optimized (W also gains one extra dimension accordingly, which is why the bias dimension has to be stripped off when visualizing the weights). The figure below illustrates this process:
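In code, the bias trick boils down to stacking a column of ones onto each data matrix, roughly like this (X_test here is simply assumed to be the corresponding test matrix):

```python
import numpy as np

# append a constant 1 feature to every sample so the bias is folded into W
X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
X_val   = np.hstack([X_val,   np.ones((X_val.shape[0],   1))])
X_test  = np.hstack([X_test,  np.ones((X_test.shape[0],  1))])
# W now has shape (D + 1, C); its last row plays the role of the bias b
```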
Below is the visualization of the weights after training the SVM classifier. You can see that each learned weight vector is essentially a template for its class: matrix multiplication is just inner products between vectors, so the score is large only when the input is close to the template, and what each row of the learned weight matrix ends up being is an averaged template for that class:
Visualization of the weights learned for each class by the SVM classifier
Below is the corresponding visualization after training the softmax classifier; it looks a bit less smooth:
Visualization of the weights learned for each class by the softmax classifier
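For reference, the visualization is produced along these lines (a sketch written from memory of the notebook code, not a verbatim copy; the class name list and the min-max rescaling into [0, 255] are the usual CIFAR-10 conventions):

```python
import numpy as np
import matplotlib.pyplot as plt

# strip the bias row, reshape each column of W back into a 32x32x3 image,
# and min-max rescale it into [0, 255] for display
w = best_svm.W[:-1, :]                 # drop the bias dimension, shape (3072, 10)
w = w.reshape(32, 32, 3, 10)
w_min, w_max = np.min(w), np.max(w)
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
for i in range(10):
    plt.subplot(2, 5, i + 1)
    wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)
    plt.imshow(wimg.astype('uint8'))
    plt.axis('off')
    plt.title(classes[i])
plt.show()
```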
Links
For the previous and next assignment posts, see:
- Previous post: KNN
- Next post: two_layer_net/features