CS231n assignment 3: LSTM

target

Implement the LSTM update rule and use it for image captioning.

LSTM

  • In a vanilla RNN we can process long sequences, but repeatedly multiplying by the same weight matrix makes the gradients explode or vanish; the LSTM addresses this by using a different update rule.
  • As in the vanilla RNN, each step receives the current input x and the previous hidden state. The LSTM additionally maintains an H-dimensional cell state, so it also receives the previous cell state.
    • The LSTM learns an input-to-hidden matrix of shape 4H x D, a hidden-to-hidden matrix of shape 4H x H, and a bias of size 4H (in the code these are stored transposed, as (D, 4H), (H, 4H), and (4H,)).
  • At each step it first computes the activation vector a (size 4H), then splits a into four parts, each of size H.
    • From these four parts it computes the input gate, forget gate, output gate, and block input.
    • The first three use a sigmoid activation, the last one uses tanh.
  • These four values are then used to compute the next cell state and hidden state; the update rule is sketched right after this list.
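
Written out (a sketch of the standard LSTM step, using the same names as the code further down; \odot is elementwise multiplication):

a = x W_x + h_{t-1} W_h + b \in \mathbb{R}^{N \times 4H}, \qquad a = [\, a_i \mid a_f \mid a_o \mid a_g \,]
i = \sigma(a_i), \quad f = \sigma(a_f), \quad o = \sigma(a_o), \quad g = \tanh(a_g)
c_t = f \odot c_{t-1} + i \odot g
h_t = o \odot \tanh(c_t)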

step forward

  • The input has dimension D, the hidden state has dimension H, and the minibatch size is N.
  • input
    • x (N,D)
    • prev_h (N,H)
    • prev_c (N,H)
    • Wx: input-to-hidden weights, (D, 4H)
    • Wh: hidden-to-hidden weights, (H, 4H)
    • b: biases, (4H,)
  • output
    • next_h (N,H)
    • next_c (N,H)
    • cache
  • Just compute directly from the formulas given above: the activation is split into four parts, the four gate values are computed from them, and those four values give the next cell state c and hidden state h.

step backward

  • input
    • dnext_h (N,H), upstream gradient w.r.t. the next hidden state
    • dnext_c (N,H), upstream gradient w.r.t. the next cell state (both flow back from the step above)
    • cache
  • output
    • dx (N,D)
    • dprev_h (N,H)
    • dprev_c (N,H)
    • dWx (D,4H)
    • dWh (H,4H)
    • db (4H)
  • Just walk back through the forward computation in reverse order. One thing to be careful about: next_c is also used to compute next_h, so the derivative of next_h with respect to next_c has to be computed and added onto the incoming dnext_c before propagating further.
  • The shapes of the final matrix-multiply gradients are noted as comments in the code; the backward equations are sketched right below this list.
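
For reference, the gradients the backward step computes (all products elementwise except the final matrix multiplies; the second line adds the contribution that reaches c_t through h_t, which is the accumulation mentioned above):

do = dh_t \odot \tanh(c_t)
dc_t \mathrel{+}= dh_t \odot o \odot (1 - \tanh^2(c_t))
df = dc_t \odot c_{t-1}, \quad dc_{t-1} = dc_t \odot f, \quad di = dc_t \odot g, \quad dg = dc_t \odot i
da = [\, di \odot i(1-i) \mid df \odot f(1-f) \mid do \odot o(1-o) \mid dg \odot (1-g^2) \,]
dx = da\, W_x^{\top}, \quad dW_x = x^{\top} da, \quad dh_{t-1} = da\, W_h^{\top}, \quad dW_h = h_{t-1}^{\top} da, \quad db = \textstyle\sum_n da
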
import numpy as np
# sigmoid() used below is the numerically stable helper already provided in the
# assignment's rnn_layers.py, where these functions live.


def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
    """
    Forward pass for a single timestep of an LSTM.

    The input data has dimension D, the hidden state has dimension H, and we use
    a minibatch size of N.

    Note that a sigmoid() function has already been provided for you in this file.

    Inputs:
    - x: Input data, of shape (N, D)
    - prev_h: Previous hidden state, of shape (N, H)
    - prev_c: previous cell state, of shape (N, H)
    - Wx: Input-to-hidden weights, of shape (D, 4H)
    - Wh: Hidden-to-hidden weights, of shape (H, 4H)
    - b: Biases, of shape (4H,)

    Returns a tuple of:
    - next_h: Next hidden state, of shape (N, H)
    - next_c: Next cell state, of shape (N, H)
    - cache: Tuple of values needed for backward pass.
    """
    next_h, next_c, cache = None, None, None
    #############################################################################
    # TODO: Implement the forward pass for a single timestep of an LSTM.        #
    # You may want to use the numerically stable sigmoid implementation above.  #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    N, H = prev_h.shape
    # activation vector: (N, 4H)
    a = x.dot(Wx) + prev_h.dot(Wh) + b
    # split into the four gate pre-activations, each (N, H)
    a_i = a[:, :H]
    a_f = a[:, H:2 * H]
    a_o = a[:, 2 * H:3 * H]
    a_g = a[:, 3 * H:]

    i = sigmoid(a_i)   # input gate
    f = sigmoid(a_f)   # forget gate
    o = sigmoid(a_o)   # output gate
    g = np.tanh(a_g)   # block input

    next_c = f * prev_c + i * g
    next_h = o * np.tanh(next_c)

    cache = (x, prev_h, prev_c, Wx, Wh, a, i, f, o, g, next_c, next_h)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################

    return next_h, next_c, cache


def lstm_step_backward(dnext_h, dnext_c, cache):
    """
    Backward pass for a single timestep of an LSTM.

    Inputs:
    - dnext_h: Gradients of next hidden state, of shape (N, H)
    - dnext_c: Gradients of next cell state, of shape (N, H)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient of input data, of shape (N, D)
    - dprev_h: Gradient of previous hidden state, of shape (N, H)
    - dprev_c: Gradient of previous cell state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, 4H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, 4H)
    - db: Gradient of biases, of shape (4H,)
    """
    dx, dprev_h, dprev_c, dWx, dWh, db = None, None, None, None, None, None
    #############################################################################
    # TODO: Implement the backward pass for a single timestep of an LSTM.       #
    #                                                                           #
    # HINT: For sigmoid and tanh you can compute local derivatives in terms of  #
    # the output value from the nonlinearity.                                   #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    N, H = dnext_h.shape

    x, prev_h, prev_c, Wx, Wh, a, i, f, o, g, next_c, next_h = cache

    # dnext_h/do -> do
    do = np.tanh(next_c) * dnext_h
    # dnext_h/dnext_c: add the contribution that flows in through next_h
    dnext_c = o * (1 - np.tanh(next_c) ** 2) * dnext_h + dnext_c
    # dnext_c/df -> df
    df = prev_c * dnext_c
    # dnext_c/dprev_c
    dprev_c = f * dnext_c
    # dnext_c/di
    di = g * dnext_c
    # dnext_c/dg
    dg = i * dnext_c

    da = np.zeros((N, 4 * H))

    # local derivatives of sigmoid/tanh expressed via their output values
    da[:, :H] = i * (1 - i) * di
    da[:, H:2 * H] = f * (1 - f) * df
    da[:, 2 * H:3 * H] = o * (1 - o) * do
    da[:, 3 * H:] = (1 - g * g) * dg

    # a = x.dot(Wx) + prev_h.dot(Wh) + b
    dx = da.dot(Wx.T)          # (N, 4H) x (4H, D) -> (N, D)
    dWx = x.T.dot(da)          # (D, N) x (N, 4H) -> (D, 4H)
    dprev_h = da.dot(Wh.T)     # (N, 4H) x (4H, H) -> (N, H)
    dWh = prev_h.T.dot(da)     # (H, N) x (N, 4H) -> (H, 4H)
    db = np.sum(da, axis=0)    # sum over the batch -> (4H,)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################

    return dx, dprev_h, dprev_c, dWx, dWh, db
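
A quick way to sanity-check the two step functions is a centered-difference gradient check. This is a minimal sketch using only numpy; the helper num_grad and the lambdas fx/fWx are my own names (the assignment notebook uses its own eval_numerical_gradient_array / rel_error helpers instead):

import numpy as np

def num_grad(f, x, df, eps=1e-5):
    # Centered-difference estimate of d(sum(f(x) * df)) / dx, element by element.
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + eps
        pos = f(x).copy()
        x[idx] = old - eps
        neg = f(x).copy()
        x[idx] = old
        grad[idx] = np.sum((pos - neg) * df) / (2 * eps)
        it.iternext()
    return grad

np.random.seed(0)
N, D, H = 3, 4, 5
x = np.random.randn(N, D)
prev_h = np.random.randn(N, H)
prev_c = np.random.randn(N, H)
Wx = np.random.randn(D, 4 * H)
Wh = np.random.randn(H, 4 * H)
b = np.random.randn(4 * H)

next_h, next_c, cache = lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)
dnext_h = np.random.randn(N, H)
dnext_c = np.zeros((N, H))  # only check the gradient that flows in through next_h
dx, dprev_h, dprev_c, dWx, dWh, db = lstm_step_backward(dnext_h, dnext_c, cache)

fx = lambda v: lstm_step_forward(v, prev_h, prev_c, Wx, Wh, b)[0]
fWx = lambda w: lstm_step_forward(x, prev_h, prev_c, w, Wh, b)[0]
print(np.max(np.abs(dx - num_grad(fx, x, dnext_h))))     # should be tiny, ~1e-8 or less
print(np.max(np.abs(dWx - num_grad(fWx, Wx, dnext_h))))  # should be tiny, ~1e-8 or less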

forward

  • Run the LSTM over an entire sequence of data: the input consists of T vectors, each of dimension D; the hidden size is H and we work over a minibatch of N sequences. Return the hidden states for all timesteps.
  • The initial cell state is zero. The cell state is not returned; it is an internal variable of the LSTM.
  • input
    • x (N,T,D)
    • h0, (N,H)
    • Wx (D,4H)
    • Wh (H,4H)
    • b (4H)
  • out
    • h (N,T,H)
    • cache
  • Note that h has to be initialized to zeros; each iteration of the for loop writes that step's hidden state into the corresponding slice of h.

backward

  • Much like the RNN version; note that the gradients for the weights and biases are accumulated across timesteps, so they have to be initialized to zeros first.
  • Also, the backward pass iterates over the timesteps in reversed order.
def lstm_forward(x, h0, Wx, Wh, b):
    """
    Forward pass for an LSTM over an entire sequence of data. We assume an input
    sequence composed of T vectors, each of dimension D. The LSTM uses a hidden
    size of H, and we work over a minibatch containing N sequences. After running
    the LSTM forward, we return the hidden states for all timesteps.

    Note that the initial cell state is passed as input, but the initial cell
    state is set to zero. Also note that the cell state is not returned; it is
    an internal variable to the LSTM and is not accessed from outside.

    Inputs:
    - x: Input data of shape (N, T, D)
    - h0: Initial hidden state of shape (N, H)
    - Wx: Weights for input-to-hidden connections, of shape (D, 4H)
    - Wh: Weights for hidden-to-hidden connections, of shape (H, 4H)
    - b: Biases of shape (4H,)

    Returns a tuple of:
    - h: Hidden states for all timesteps of all sequences, of shape (N, T, H)
    - cache: Values needed for the backward pass.
    """
    h, cache = None, None
    #############################################################################
    # TODO: Implement the forward pass for an LSTM over an entire timeseries.   #
    # You should use the lstm_step_forward function that you just defined.      #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    N, T, D = x.shape
    N, H = h0.shape

    prev_h = h0
    prev_c = np.zeros((N, H))   # initial cell state is zero
    cache = {}
    h = np.zeros((N, T, H))
    for step in range(T):
        prev_h, prev_c, cache_step = lstm_step_forward(
            x[:, step, :], prev_h, prev_c, Wx, Wh, b)
        h[:, step, :] = prev_h
        cache[step] = cache_step

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################

    return h, cache


def lstm_backward(dh, cache):
    """
    Backward pass for an LSTM over an entire sequence of data.

    Inputs:
    - dh: Upstream gradients of hidden states, of shape (N, T, H)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient of input data of shape (N, T, D)
    - dh0: Gradient of initial hidden state of shape (N, H)
    - dWx: Gradient of input-to-hidden weight matrix of shape (D, 4H)
    - dWh: Gradient of hidden-to-hidden weight matrix of shape (H, 4H)
    - db: Gradient of biases, of shape (4H,)
    """
    dx, dh0, dWx, dWh, db = None, None, None, None, None
    #############################################################################
    # TODO: Implement the backward pass for an LSTM over an entire timeseries.  #
    # You should use the lstm_step_backward function that you just defined.     #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    x = cache[0][0]
    # note: this is the x for a single step, of shape (N, D)
    N, D = x.shape
    _, T, H = dh.shape

    dx = np.zeros((N, T, D))
    dprev_h = np.zeros((N, H))
    dprev_c = np.zeros((N, H))

    dh0 = np.zeros((N, H))
    dWx = np.zeros((D, 4 * H))
    dWh = np.zeros((H, 4 * H))
    db = np.zeros(4 * H)

    for step in reversed(range(T)):
        # gradient flowing into this step's hidden state: upstream gradient from
        # the loss at this timestep plus the gradient from the following timestep
        dnext_h = dh[:, step, :] + dprev_h
        dnext_c = dprev_c

        dx[:, step, :], dprev_h, dprev_c, dWx_temp, dWh_temp, db_temp = lstm_step_backward(
            dnext_h, dnext_c, cache[step])

        # weight and bias gradients accumulate across timesteps
        dWx += dWx_temp
        dWh += dWh_temp
        db += db_temp
    dh0 = dprev_h

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################

    return dx, dh0, dWx, dWh, db
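
A quick shape check for the sequence-level pair (arbitrary small sizes; the expected shapes follow the docstrings above):

import numpy as np

N, T, D, H = 2, 3, 4, 5
x = np.random.randn(N, T, D)
h0 = np.random.randn(N, H)
Wx = np.random.randn(D, 4 * H)
Wh = np.random.randn(H, 4 * H)
b = np.random.randn(4 * H)

h, cache = lstm_forward(x, h0, Wx, Wh, b)
print(h.shape)                          # (2, 3, 5) == (N, T, H)

dh = np.random.randn(*h.shape)
dx, dh0, dWx, dWh, db = lstm_backward(dh, cache)
print(dx.shape, dh0.shape)              # (2, 3, 4) (2, 5)
print(dWx.shape, dWh.shape, db.shape)   # (4, 20) (5, 20) (20,)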

The rest is no different from the RNN part: the main change is adding an 'lstm' option wherever the captioning code chooses between cell types, as sketched below.
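
The change amounts to branching on the cell type wherever the RNN layers were called. A minimal sketch of that dispatch (these helper functions are illustrative and not part of the starter code; inside CaptioningRNN you would branch on self.cell_type directly):

def recurrent_forward(cell_type, x, h0, Wx, Wh, b):
    # Whole-sequence forward pass for the chosen cell type.
    if cell_type == 'rnn':
        return rnn_forward(x, h0, Wx, Wh, b)
    elif cell_type == 'lstm':
        return lstm_forward(x, h0, Wx, Wh, b)
    raise ValueError('unknown cell_type: %r' % (cell_type,))


def recurrent_backward(cell_type, dh, cache):
    # Matching backward pass; both branches return dx, dh0, dWx, dWh, db.
    if cell_type == 'rnn':
        return rnn_backward(dh, cache)
    elif cell_type == 'lstm':
        return lstm_backward(dh, cache)
    raise ValueError('unknown cell_type: %r' % (cell_type,))

In the sampling loop, the single-step LSTM call also needs a cell state: keep a prev_c initialized to np.zeros_like(prev_h) and pass it to lstm_step_forward at every step, carrying the returned next_c forward.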