CS224d Pset 1
title: CS224d Pset 1
date: 2017-04-04 20:02:21
mathjax: true
categories: NLP/CS224d
tags: [CS224d,NLP]
(Figure: network diagram, image.png)
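From the problem statement:
- h=sigmoid(xw_1+b_1)
- \widehat y=softmax(hw_2+b_2)
- The input x has m dimensions and is given in the following matrix form, with one example per row:
\begin{bmatrix} x^{(1)}_{1} & x^{(1)}_{2} & \cdots & x^{(1)}_{m}\\ x^{(2)}_{1} & x^{(2)}_{2} & \cdots & x^{(2)}_{m}\\ \vdots & \vdots & \ddots & \vdots\\ x^{(N)}_{1} & x^{(N)}_{2} & \cdots & x^{(N)}_{m} \end{bmatrix}
- So, following the network diagram above:
Suppose there are N examples, the hidden layer has D_h=h units, and the output layer y has D_y=n dimensions.
The input matrix X is N \times m
The weight matrix w_1 is m \times h
The weight matrix w_2 is h \times n
The final output is an N \times n matrix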
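The shape bookkeeping above can be sketched in a few lines of plain Python. This is only an illustration under assumed sizes (N=4, m=3, h=5, n=2 are made up), and the helper names `matmul`, `add_bias`, `forward`, and `random_matrix` are hypothetical, not part of the assignment code. It uses nested lists instead of NumPy to stay dependency-free.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def add_bias(z, b):
    """Add the bias vector b to every row of z."""
    return [[v + b_j for v, b_j in zip(row, b)] for row in z]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def softmax_row(row):
    """Softmax of one row, shifted by its max for numerical stability."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def forward(X, W1, b1, W2, b2):
    """h = sigmoid(X W1 + b1), y_hat = softmax(h W2 + b2), applied row by row."""
    H = [[sigmoid(v) for v in row] for row in add_bias(matmul(X, W1), b1)]
    Y_hat = [softmax_row(row) for row in add_bias(matmul(H, W2), b2)]
    return H, Y_hat

def shape(mat):
    return (len(mat), len(mat[0]))

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes: N examples, m inputs, h hidden units, n outputs.
N, m, h, n = 4, 3, 5, 2
rng = random.Random(0)
X = random_matrix(N, m, rng)
W1, b1 = random_matrix(m, h, rng), [0.0] * h
W2, b2 = random_matrix(h, n, rng), [0.0] * n

H, Y_hat = forward(X, W1, b1, W2, b2)
print(shape(X), shape(H), shape(Y_hat))  # (4, 3) (4, 5) (4, 2)
```

Each row of `Y_hat` is a probability distribution over the n outputs, so it sums to 1.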
Softmax (10 points)
(a) Prove that adding a constant c to the softmax input does not change the result.
(softmax ( \mathbf{x} + c ))_i = \frac{\exp (\mathbf{x}_i + c ) }{\sum_{j=1}^{dim( \mathbf{x})}{\exp(\mathbf{x}_j+c )}} = \frac{\exp(c )\exp ( x_i)}{\exp ( c )\sum_{j=1}^{dim(x )}\exp( x_j)} = \frac{\exp ( x_i)}{\sum_{j=1}^{dim(x )}\exp( x_j)} = ( softmax ( \mathbf{x}) )_i
(b) Implement softmax. See GitHub.
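The invariance from part (a) is what makes a numerically stable softmax possible: choosing c = -max(x) keeps every exponent at or below zero. The following is a minimal sketch of that idea in plain Python. It is not the code from the GitHub repository, and the function name `softmax` and the sample inputs are assumptions for illustration.

```python
import math

def softmax(x):
    """Softmax of a list of floats, shifted by max(x) so exp never overflows."""
    shift = max(x)
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

x = [1.0, 2.0, 3.0]
shifted = [v + 100.0 for v in x]   # same input plus the constant c = 100

p = softmax(x)
q = softmax(shifted)
print(p)
print(q)

# A naive math.exp(1001.0) raises OverflowError; the shifted version does not.
large = softmax([1001.0, 1002.0])
print(large)
```

Both printed distributions for `x` and `shifted` agree, and the large-input case gives the same result as `softmax([1.0, 2.0])`.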
Neural Network Basics (30 points)
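(a) Derive the derivative of the sigmoid function.
With \sigma(x)=\frac{1}{1+e^{-x}}:
\sigma'(x)=\frac{e^{-x}}{(1+e^{-x})^2}=\frac{1}{1+e^{-x}}\cdot\frac{e^{-x}}{1+e^{-x}}=\sigma(x)(1-\sigma(x))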
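The identity for the sigmoid derivative can be sanity-checked against a central finite difference. This is a small sketch, not the assignment's q2_sigmoid.py; the names `sigmoid_grad` and `numeric_derivative` and the test points are assumptions. Note that `sigmoid_grad` here takes the sigmoid output s, not the raw input x.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(s):
    """Derivative of sigmoid expressed through its output s = sigmoid(x)."""
    return s * (1.0 - s)

def numeric_derivative(f, x, eps=1e-5):
    """Central finite difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

points = [-2.0, 0.0, 0.5, 3.0]
errors = []
for x in points:
    analytic = sigmoid_grad(sigmoid(x))
    numeric = numeric_derivative(sigmoid, x)
    errors.append(abs(analytic - numeric))
    print(x, analytic, numeric)
```

At x = 0 the sigmoid equals 0.5, so the derivative is 0.25, its maximum value.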
(b) Derivative of the cross-entropy
First, compared with the squared-error cost function, the cross-entropy cost function paired with a sigmoid output gives larger gradient updates, because the \sigma'(z) factor cancels and the gradient does not shrink when the unit saturates.
softmax:
S_i=softmax(x)_i=\frac{e^{x_i}}{\sum_{k=1}^m e^{x_k}}
Derivative of softmax:
When j = i:
\frac{\partial S_j}{\partial x_i} =\frac{\partial }{\partial x_i}\left(\frac{e^{x_i}}{\sum_{k=1}^m e^{x_k}}\right)\\ =\frac{\partial }{\partial x_i}\left(\frac{e^{x_i}}{e^{x_1}+e^{x_2}+\cdots+e^{x_m}}\right)\\ =\frac{e^{x_i}(e^{x_1}+e^{x_2}+\cdots+e^{x_m})-e^{x_i}e^{x_i}}{(e^{x_1}+e^{x_2}+\cdots+e^{x_m})^2}\\ =S_i-S_i^2\\ =S_i(1-S_i)
When j \neq i:
\frac{\partial S_j}{\partial x_i}=\frac{\partial }{\partial x_i}\left(\frac{e^{x_j}}{\sum_{k=1}^m e^{x_k}}\right)\\ =\frac{\partial }{\partial x_i}\left(\frac{e^{x_j}}{e^{x_1}+e^{x_2}+\cdots+e^{x_m}}\right)\\ =\frac{0\cdot(e^{x_1}+e^{x_2}+\cdots+e^{x_m})-e^{x_j}e^{x_i}}{(e^{x_1}+e^{x_2}+\cdots+e^{x_m})^2}\\ =-S_jS_i
Therefore
\frac{\partial S_j}{\partial x_i} = \begin{cases} S_i(1 - S_i),&\quad i = j \\ -S_i S_j,&\quad i \neq j \end{cases}
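The two cases of the softmax derivative can be collected into a Jacobian matrix and compared with finite differences. The sketch below is illustrative only; `softmax_jacobian`, `numeric_jacobian`, and the sample vector are hypothetical names and values, and entry `[j][i]` stores the derivative of S_j with respect to x_i.

```python
import math

def softmax(x):
    shift = max(x)
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(x):
    """J[j][i] = dS_j/dx_i = S_i (1 - S_i) if i == j, else -S_i S_j."""
    s = softmax(x)
    size = len(s)
    return [[s[j] * ((1.0 if i == j else 0.0) - s[i]) for i in range(size)]
            for j in range(size)]

def numeric_jacobian(x, eps=1e-5):
    """Central differences: perturb x_i and record the change in every S_j."""
    size = len(x)
    jac = [[0.0] * size for _ in range(size)]
    for i in range(size):
        plus = list(x)
        minus = list(x)
        plus[i] += eps
        minus[i] -= eps
        s_plus, s_minus = softmax(plus), softmax(minus)
        for j in range(size):
            jac[j][i] = (s_plus[j] - s_minus[j]) / (2.0 * eps)
    return jac

x = [0.3, -1.2, 2.0, 0.7]
analytic = softmax_jacobian(x)
numeric = numeric_jacobian(x)
max_error = max(abs(analytic[j][i] - numeric[j][i])
                for j in range(len(x)) for i in range(len(x)))
print(max_error)
```

Every row of the Jacobian sums to zero, which reflects the fact that the softmax outputs always sum to 1.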
CE (cross-entropy):
CE(y,\widehat y)=-\sum_k y_k \log(\widehat y_k)
where \widehat y_k = softmax(x)_k
Derivative:
\begin{split} \frac{\partial CE(y,\widehat y)}{\partial x_i} &=-\sum_k y_k\frac{\partial \log \widehat y_k}{\partial x_i} \\ &=-\sum_ky_k\frac{1}{\widehat y_k}\frac{\partial \widehat y_k}{\partial x_i} \\ \end{split}
where each term of the sum is
y_k\frac{1}{\widehat y_k}\frac{\partial \widehat y_k}{\partial x_i} = \begin{cases} y_i(1-\widehat y_i),&\quad k = i \\ -y_k\widehat y_i,&\quad k \neq i \end{cases}
Substituting:
\begin{split} \frac{\partial CE(y,\widehat y)}{\partial x_i} &=-y_i(1-\widehat y_i)+\sum_{k\neq i}y_k\widehat y_i \\ &=-y_i+y_i \widehat y_i+\sum_{k\neq i}y_k\widehat y_i \\ &=\widehat y_i\left(\sum_ky_k\right)-y_i \\ &=\widehat y_i-y_i \end{split}
The last step uses the fact that y is one-hot, so \sum_ky_k=1.
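The closed form \widehat y_i - y_i can be confirmed numerically. This is a minimal sketch assuming a one-hot label and softmax outputs; the function names `cross_entropy` and `analytic_grad` and the sample logits are made up for illustration.

```python
import math

def softmax(x):
    shift = max(x)
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, x):
    """CE(y, softmax(x)) for a one-hot label y and logits x."""
    y_hat = softmax(x)
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)

def analytic_grad(y, x):
    """Gradient of CE with respect to the logits: y_hat - y."""
    return [p - t for p, t in zip(softmax(x), y)]

def numeric_grad(y, x, eps=1e-5):
    grad = []
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += eps
        minus[i] -= eps
        grad.append((cross_entropy(y, plus) - cross_entropy(y, minus)) / (2.0 * eps))
    return grad

x = [0.5, -0.3, 1.7]
y = [0.0, 1.0, 0.0]   # one-hot label: class 1 is correct
g_analytic = analytic_grad(y, x)
g_numeric = numeric_grad(y, x)
print(g_analytic)
print(g_numeric)
```

The gradient components sum to zero, and only the component for the correct class is negative.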
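(c) Derive the gradient of the single-hidden-layer neural network
(Figure: network diagram, image.png)
From the problem statement:
h=sigmoid(xw_1+b_1)
\widehat y=softmax(hw_2+b_2)
Let:
z_1=xw_1+b_1
z_2=hw_2+b_2
- Derivative of the network's cost function with respect to the input x (adjacent factors are matrix products, \circ is the elementwise product):
\begin{split} \frac{\partial CE}{\partial x} &= \frac{\partial CE}{\partial z_2}\frac{\partial z_2}{\partial h}\frac{\partial h}{\partial z_1}\frac{\partial z_1}{\partial x}\\ &= \left((\widehat y-y)w_2^T\circ sigmoid'(z_1)\right)w_1^T\\ &= \left((\widehat y-y)w_2^T\circ sigmoid(z_1)\circ(1-sigmoid(z_1))\right)w_1^T\\ &= \left((\widehat y-y)w_2^T\circ sigmoid(xw_1+b_1)\circ(1-sigmoid(xw_1+b_1))\right)w_1^T \end{split}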
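The chain-rule result for the gradient with respect to the input x can be checked on a single example. The sketch below is not the assignment's q2_neural.py: the sizes (m=3, h=4, n=2), the random weights, and the helper names `vec_mat`, `transpose`, `grad_input`, and `numeric_grad_input` are all assumptions. It treats x as a row vector, multiplies by the transposed weight matrices, and applies the sigmoid derivative elementwise.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def softmax(x):
    shift = max(x)
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def vec_mat(v, M):
    """Row vector v times matrix M (list of rows)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def transpose(M):
    return [list(col) for col in zip(*M)]

def forward(x, W1, b1, W2, b2):
    h = [sigmoid(a + b) for a, b in zip(vec_mat(x, W1), b1)]
    y_hat = softmax([a + b for a, b in zip(vec_mat(h, W2), b2)])
    return h, y_hat

def cost(x, y, params):
    _, y_hat = forward(x, *params)
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)

def grad_input(x, y, params):
    W1, b1, W2, b2 = params
    h, y_hat = forward(x, *params)
    delta2 = [p - t for p, t in zip(y_hat, y)]               # dCE/dz_2
    back_h = vec_mat(delta2, transpose(W2))                  # dCE/dh
    delta1 = [g * s * (1.0 - s) for g, s in zip(back_h, h)]  # dCE/dz_1
    return vec_mat(delta1, transpose(W1))                    # dCE/dx

def numeric_grad_input(x, y, params, eps=1e-5):
    grad = []
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += eps
        minus[i] -= eps
        grad.append((cost(plus, y, params) - cost(minus, y, params)) / (2.0 * eps))
    return grad

rng = random.Random(1)
m, hidden, n = 3, 4, 2   # hypothetical layer sizes
W1 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(m)]
b1 = [rng.uniform(-1, 1) for _ in range(hidden)]
W2 = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(hidden)]
b2 = [rng.uniform(-1, 1) for _ in range(n)]
params = (W1, b1, W2, b2)

x = [0.2, -0.4, 0.9]
y = [1.0, 0.0]
g_analytic = grad_input(x, y, params)
g_numeric = numeric_grad_input(x, y, params)
print(g_analytic)
print(g_numeric)
```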
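(d) Counting the network's parameters
The input has D_x=m dimensions and the output has D_y=n dimensions
w_1 contributes D_x*h weights and w_2 contributes h*D_y weights, plus h biases in b_1 and D_y biases in b_2
The total number of parameters is (D_x+1)*h+(h+1)*D_y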
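The count can be spelled out term by term. In this short sketch the function name `count_parameters` and the example sizes are hypothetical.

```python
def count_parameters(d_x, h, d_y):
    """Total parameters of a one-hidden-layer network with biases."""
    w1 = d_x * h    # input-to-hidden weights
    b1 = h          # hidden biases
    w2 = h * d_y    # hidden-to-output weights
    b2 = d_y        # output biases
    total = w1 + b1 + w2 + b2
    # Same number as the closed form (D_x + 1) * h + (h + 1) * D_y.
    assert total == (d_x + 1) * h + (h + 1) * d_y
    return total

print(count_parameters(10, 5, 10))  # 115
```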
(e) Code for the sigmoid activation function and its gradient
Complete q2_sigmoid.py
(f) Gradient check code
Complete q2_gradcheck.py
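The idea behind a gradient check is to compare an analytic gradient with central differences, one coordinate at a time. What follows is a rough sketch, not the contents of q2_gradcheck.py: the function name `gradcheck`, its signature, the step size, and the tolerance are all assumptions chosen for illustration.

```python
def gradcheck(f, x, eps=1e-4, tol=1e-5):
    """Return True if the gradient from f matches central differences.

    f takes a list of floats and returns (cost, grad); x is modified
    in place during the check and restored afterwards.
    """
    _, grad = f(x)
    for i in range(len(x)):
        original = x[i]
        x[i] = original + eps
        cost_plus, _ = f(x)
        x[i] = original - eps
        cost_minus, _ = f(x)
        x[i] = original   # restore before moving to the next coordinate
        numeric = (cost_plus - cost_minus) / (2.0 * eps)
        # Relative difference, guarded so tiny gradients do not blow it up.
        scale = max(1.0, abs(numeric), abs(grad[i]))
        if abs(numeric - grad[i]) / scale > tol:
            return False
    return True

def quadratic(x):
    """Cost sum(x_i^2) with the correct gradient 2 x_i."""
    return sum(v * v for v in x), [2.0 * v for v in x]

def wrong_quadratic(x):
    """Same cost with a deliberately wrong gradient, to show a failure."""
    return sum(v * v for v in x), [3.0 * v for v in x]

print(gradcheck(quadratic, [1.0, -2.0, 0.5]))        # True
print(gradcheck(wrong_quadratic, [1.0, -2.0, 0.5]))  # False
```

Central differences are used rather than one-sided ones because their error shrinks with the square of the step size.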
(g) Neural network code
Complete q2_neural.py
word2vec(40 points + 5 bonus)
(a)
TODO
(b)
TODO
(c)
TODO
(d)
TODO
(e)
q3_word2vec.py
(f)
q3_sgd.py
(g) Test run
python q3_run.py
(h)
Implement CBOW in q3_word2vec.py
Sentiment Analysis
(a)
q4_softmaxreg.py
(b)
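Explain, in fewer than three sentences, why regularization should be introduced when doing classification (in fact, this applies to most machine learning tasks)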
(c)
q4_sentiment.py
(d)
Plot
Appendix
Pset1 tutorial
Pset1 assignment1
Pset1 solutions Code
Pset1 solutions
Stanford CS224d Assignment 1 with solutions
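import numpy as np
if __name__ == "__main__":
    # Experiment: axis may be a tuple; the listed axes are collapsed and max is taken over them.
    # For a 2-D array, axis=0 collapses the rows (max of each column); axis=1 collapses the columns (max of each row).
    # keepdims=True keeps each collapsed axis as a size-1 dimension.
    y = np.array([[1, 2],
                  [3, 4]])
    # print(len(y.shape))
    print(np.max(y, axis=0, keepdims=True))   # [[3 4]], shape (1, 2)
    print(np.max(y, axis=0, keepdims=False))  # [3 4], shape (2,)
    print(np.max(y, axis=1, keepdims=True))   # [[2], [4]], shape (2, 1)
    x = np.array([[[1, 2],
                   [3, 4]],
                  [[5, 6],
                   [7, 8]]])
    print(np.max(x, axis=(0, 1), keepdims=True))  # [[[7 8]]], shape (1, 1, 2)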