Stanford cs231n Assignment #1 (c
This post is about implementing classification with a softmax classifier. In many cases softmax and SVM classifiers perform about the same; the difference is in how each is understood mathematically. Softmax frames everything in terms of probabilities, so the final outputs can be read as percentages.
A good article on the relationship between softmax and logistic regression makes it clear that softmax differs from binary logistic regression only in how the probabilities are computed: http://blog.csdn.net/zhangliyao22/article/details/48379291
Another one: http://www.cnblogs.com/guyj/p/3800519.html
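As a quick check of that connection (my own toy sketch, not from either article): with only two classes, the softmax probability of one class is exactly the logistic sigmoid of the score difference.

import numpy as np

def softmax(scores):
    # shift by the max for numerical stability
    e = np.exp(scores - np.max(scores))
    return e / np.sum(e)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([2.0, -1.0])           # scores for class 0 and class 1
print(softmax(scores)[1])                # P(class 1) under two-class softmax
print(sigmoid(scores[1] - scores[0]))    # same value from the logistic sigmoid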
loss function:
The score function f(x, W) = Wx itself does not change; what changes is the interpretation, which becomes probabilistic, and the loss function changes along with it.
What I personally like best is the cross-entropy explanation of what the softmax loss means:
To elaborate: for one input, the model computes a score for each of the n classes, while the "true" distribution is just a one-hot vector [0, 0, ..., 0, 1, 0, ..., 0, 0]. Plugging this into the cross-entropy formula shows that the softmax loss and the cross-entropy loss are the same thing: because the true distribution is zero everywhere except at class y_i, all terms but one drop out and the loss reduces to L_i = -log( exp(s_{y_i}) / sum_j exp(s_j) ). When the predicted probability of the correct class approaches 1, this loss approaches 0, its ideal minimum, so training means minimizing the cross-entropy cost.
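As a concrete illustration (my own toy numbers, not from the assignment), here is the loss for one example with three classes:

import numpy as np

scores = np.array([3.2, 5.1, -1.7])     # raw scores Wx for one example
y = 0                                   # index of the correct class

shifted = scores - np.max(scores)       # shift for numerical stability
probs = np.exp(shifted) / np.sum(np.exp(shifted))
loss = -np.log(probs[y])
print(probs)   # roughly [0.13, 0.87, 0.00]
print(loss)    # -log(0.13), roughly 2.04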
gradient
Here I relied on an excellently written blog post: http://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/
Conclusion: dS_i/da_j = S_i * (1 - S_j) when i == j, and -S_i * S_j when i != j.
This result is the derivative of the softmax function itself, i.e. of the probability expression inside the loss (without the minus sign and before the log is applied), not of the full log-loss. Here S_i and S_j are the scores obtained from the matrix product Wx and then squashed into probabilities between 0 and 1 by the softmax; i indexes the class whose probability we are computing for the current input, and j indexes the pre-softmax value a_j with respect to which the partial derivative is taken. This expression is not the final dL/dW; it is only dP/dA, where A is the raw output of Wx before it is mapped to probabilities and P is the softmax of A. The map from A to P sends N values to N values, so its Jacobian is an N-by-N matrix (N is the number of classes). By the chain rule, getting dL/dW still requires dL/dP and dA/dW.
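A quick numerical check of that Jacobian (my own sketch, not part of the assignment): the analytic result S_i * (1{i==j} - S_j) should match central differences.

import numpy as np

def softmax(a):
    # shift by the max for numerical stability
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

a = np.array([1.0, 2.0, 0.5])
p = softmax(a)

# analytic Jacobian: dS_i/da_j = S_i * (delta_ij - S_j)
jac = np.diag(p) - np.outer(p, p)

# numerical Jacobian by central differences
h = 1e-5
num = np.zeros((3, 3))
for j in range(3):
    d = np.zeros(3)
    d[j] = h
    num[:, j] = (softmax(a + d) - softmax(a - d)) / (2 * h)

print(np.max(np.abs(jac - num)))  # should be on the order of 1e-10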
The detailed derivation is as follows:
(derivation figure: softmax.png)
The code is as follows:
import numpy as np


def softmax_loss_naive(W, X, y, reg):
    """
    Softmax loss function, naive implementation (with loops).

    Inputs have dimension D, there are C classes, and we operate on minibatches
    of N examples.

    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength

    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)
    num_train = X.shape[0]
    num_classes = W.shape[1]
    #############################################################################
    # TODO: Compute the softmax loss and its gradient using explicit loops.     #
    # Store the loss in loss and the gradient in dW. If you are not careful     #
    # here, it is easy to run into numeric instability. Don't forget the        #
    # regularization!                                                           #
    #############################################################################
    for i in range(num_train):
        scores = X[i].dot(W)
        log_c = np.max(scores)  # subtract the max score for numerical stability
        p = []
        for j in range(num_classes):
            p.append(np.exp(scores[j] - log_c))
        # cross-entropy loss of example i: -log of the correct-class probability
        loss += -np.log(p[y[i]] / np.sum(p))
        for j in range(num_classes):
            # gradient w.r.t. column j of W: (p_j - 1{j == y_i}) * x_i
            dW[:, j] += (p[j] / np.sum(p) - (j == y[i])) * X[i, :]
    loss /= num_train
    loss += 0.5 * reg * np.sum(W * W)
    dW /= num_train
    dW += reg * W
    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################
    return loss, dW
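Before vectorizing, it helps to check the analytic gradient of the naive version numerically (my own sketch with a simple central-difference check and made-up shapes, not the assignment's gradient-check helper):

import numpy as np

np.random.seed(1)
W = np.random.randn(10, 5) * 0.01     # arbitrary small weights: D=10, C=5
X = np.random.randn(20, 10)           # 20 random examples
y = np.random.randint(5, size=20)     # random labels

loss, dW = softmax_loss_naive(W, X, y, 0.0)

# numerically check a few random entries of dW with central differences
h = 1e-5
for _ in range(5):
    i, j = np.random.randint(W.shape[0]), np.random.randint(W.shape[1])
    W[i, j] += h
    loss_plus, _ = softmax_loss_naive(W, X, y, 0.0)
    W[i, j] -= 2 * h
    loss_minus, _ = softmax_loss_naive(W, X, y, 0.0)
    W[i, j] += h                      # restore the original weight
    grad_num = (loss_plus - loss_minus) / (2 * h)
    print(grad_num, dW[i, j])         # the two numbers should match closely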
def softmax_loss_vectorized(W, X, y, reg):
    """
    Softmax loss function, vectorized version.

    Inputs and outputs are the same as softmax_loss_naive.
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)
    num_train = X.shape[0]
    num_classes = W.shape[1]
    #############################################################################
    # TODO: Compute the softmax loss and its gradient using no explicit loops.  #
    # Store the loss in loss and the gradient in dW. If you are not careful     #
    # here, it is easy to run into numeric instability. Don't forget the        #
    # regularization!                                                           #
    #############################################################################
    scores = X.dot(W)
    # subtract each row's max score for numerical stability
    scores -= np.max(scores, axis=1).reshape(num_train, 1)
    scores_exp = np.exp(scores)
    sum_p = np.sum(scores_exp, axis=1).reshape(num_train, 1)
    p = scores_exp / sum_p                            # (N, C) class probabilities
    # average cross-entropy loss over the minibatch
    loss = np.mean(-np.log(p[np.arange(num_train), y]))
    # one-hot encoding of the labels
    binary = np.zeros(p.shape)
    binary[np.arange(num_train), y] = 1
    dW = X.T.dot(p - binary)
    loss += 0.5 * reg * np.sum(W * W)
    dW /= num_train
    dW += reg * W
    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################
    return loss, dW
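Finally, a quick way to sanity-check the two implementations against each other (a sketch with made-up shapes, not part of the assignment harness): on random data the losses should agree and the gradients should be nearly identical.

import numpy as np

np.random.seed(0)
D, C, N = 10, 5, 50                   # feature dim, classes, batch size (arbitrary)
W = np.random.randn(D, C) * 0.01
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)
reg = 1e-3

loss_naive, dW_naive = softmax_loss_naive(W, X, y, reg)
loss_vec, dW_vec = softmax_loss_vectorized(W, X, y, reg)

print(abs(loss_naive - loss_vec))           # should be ~0
print(np.max(np.abs(dW_naive - dW_vec)))    # should be ~0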