CS231N Assignment 3: RNN

Goal

In this exercise you will implement a vanilla recurrent neural network and use it to train a model that can generate novel captions for images.

Microsoft COCO

  • This assignment uses Microsoft's COCO dataset, a standard benchmark for image captioning, with 80,000 training images and 40,000 validation images; each image comes with five captions.
  • The data has already been preprocessed: features have been extracted for every image from the fc7 layer of a VGG-16 pretrained on ImageNet, and stored in train2014_vgg16_fc7.h5 and val2014_vgg16_fc7.h5.
  • To reduce processing time and memory, the feature dimensionality has been cut from 4096 down to 512.
  • The raw images are too large to include, so their URLs are stored in text files; during visualization the images can be downloaded on the fly (network access required).
  • Operating directly on strings is inefficient, so we work with an encoded version of the captions in which each string is represented as a sequence of integers. The dataset also includes the mappings for converting between the two representations, and some extra tokens are added during conversion.
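The notebook loads all of this through the helper provided in cs231n/coco_utils.py; a minimal sketch of poking at the result (the printout below is illustrative, not the notebook's exact output):

from cs231n.coco_utils import load_coco_data

data = load_coco_data(pca_features=True)  # use the 512-d PCA-reduced features
for k, v in data.items():
    if hasattr(v, 'shape'):
        print(k, type(v), v.shape)  # e.g. train_features, train_captions, ...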

Before starting, I took a look at a few of the images together with their corresponding captions.

RNN

  • In this part we use an RNN language model to do image captioning
  • The code lives in cs231n/rnn_layers.py

step forward

A single timestep of a vanilla RNN with a tanh activation. The input data has dimension D, the hidden layer has dimension H, and the minibatch size is N.

  • Inputs
    • x: (N,D)
    • prev_h: hidden state from the previous timestep, (N,H)
    • Wx: input-to-hidden connections, (D,H)
    • Wh: hidden-to-hidden connections, (H,H)
    • b: bias, (H,)
  • Returns (tuple):
    • next_h: the next hidden state, (N,H)
    • cache: values needed for the backward pass
  • How it works:
    • The RNN multiplies the previous hidden state h and the current input x by separate weight matrices, then combines them to predict the current h (see the formula after this list)
    • Producing the state at a given timestep therefore requires both the current input and the previous state h, each multiplied by its weights
    • The same weights W are shared across every timestep
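In symbols, the step implemented in the code below computes

$$h_t = \tanh(x_t W_x + h_{t-1} W_h + b)$$

with the bias broadcast across the N rows of the minibatch.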
import numpy as np  # rnn_layers.py imports this once at module level


def rnn_step_forward(x, prev_h, Wx, Wh, b):
    """
    Run the forward pass for a single timestep of a vanilla RNN that uses a tanh
    activation function.

    The input data has dimension D, the hidden state has dimension H, and we use
    a minibatch size of N.

    Inputs:
    - x: Input data for this timestep, of shape (N, D).
    - prev_h: Hidden state from previous timestep, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - next_h: Next hidden state, of shape (N, H)
    - cache: Tuple of values needed for the backward pass.
    """
    next_h, cache = None, None
    ##############################################################################
    # TODO: Implement a single forward step for the vanilla RNN. Store the next  #
    # hidden state and any values you need for the backward pass in the next_h   #
    # and cache variables respectively.                                          #
    ##############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    x_1 = x.dot(Wx)        # input-to-hidden: (N, D) x (D, H) -> (N, H)
    h_1 = prev_h.dot(Wh)   # hidden-to-hidden: (N, H) x (H, H) -> (N, H)
    x_raw = x_1 + h_1 + b  # pre-activation; bias broadcasts over the N rows
    next_h = np.tanh(x_raw)
    cache = (x, prev_h, Wx, Wh, x_raw, next_h)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return next_h, cache

step backward

  • Inputs:
    • dnext_h: gradient of the loss with respect to the next hidden state, (N,H)
    • cache
  • Outputs:
    • dx: gradient of the input, (N,D)
    • dprev_h: gradient of the previous hidden state, (N,H)
    • dWx: gradient of Wx, (D,H)
    • dWh: gradient of Wh, (H,H)
    • db: gradient of the bias, (H,)
  • This gradient is actually quite easy to compute, because every local derivative is simple; you mostly just have to get the matrix shapes right (the derivatives are spelled out after this list)
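Writing $z = x_t W_x + h_{t-1} W_h + b$ so that $h_t = \tanh(z)$, the chain rule gives exactly the quantities computed in the code:

$$dz = (1 - h_t \odot h_t) \odot dh_t, \qquad dW_x = x_t^\top dz, \qquad dW_h = h_{t-1}^\top dz,$$

$$dx_t = dz\, W_x^\top, \qquad dh_{t-1} = dz\, W_h^\top, \qquad db = \sum_n dz_{n,:}$$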
def rnn_step_backward(dnext_h, cache):
    """
    Backward pass for a single timestep of a vanilla RNN.

    Inputs:
    - dnext_h: Gradient of loss with respect to next hidden state, of shape (N, H)
    - cache: Cache object from the forward pass

    Returns a tuple of:
    - dx: Gradients of input data, of shape (N, D)
    - dprev_h: Gradients of previous hidden state, of shape (N, H)
    - dWx: Gradients of input-to-hidden weights, of shape (D, H)
    - dWh: Gradients of hidden-to-hidden weights, of shape (H, H)
    - db: Gradients of bias vector, of shape (H,)
    """
    dx, dprev_h, dWx, dWh, db = None, None, None, None, None
    ##############################################################################
    # TODO: Implement the backward pass for a single step of a vanilla RNN.      #
    #                                                                            #
    # HINT: For the tanh function, you can compute the local derivative in terms #
    # of the output value from tanh.                                             #
    ##############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    x, prev_h, Wx, Wh, x_raw, next_h = cache
    # d(tanh(z))/dz = 1 - tanh(z)^2, expressed via the cached output next_h

    # gradient w.r.t. the pre-activation: (N, H)
    dx_raw = (1 - next_h * next_h) * dnext_h
    # (H,): sum over the batch dimension
    db = np.sum(dx_raw, axis=0)
    # (D, N) x (N, H) -> (D, H)
    dWx = x.T.dot(dx_raw)
    # (N, H) x (H, D) -> (N, D)
    dx = dx_raw.dot(Wx.T)
    # (H, N) x (N, H) -> (H, H)
    dWh = prev_h.T.dot(dx_raw)
    # (N, H) x (H, H) -> (N, H)
    dprev_h = dx_raw.dot(Wh.T)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return dx, dprev_h, dWx, dWh, db

forward + backward

  • So far we have only implemented a single step of the forward and backward passes; now we implement the full sequence

forward

  • We assume the input is a sequence of T vectors, each of dimension D
  • The minibatch size is N and the hidden size is H; we return the hidden states for all timesteps

  • Inputs
    • x: data for the entire timeseries, (N,T,D)
    • h0: the initial hidden state, (N,H)
    • Wx (D,H)
    • Wh (H,H)
    • b (H,)
  • Outputs
    • h: hidden states for the entire timeseries, (N,T,H)
    • cache
  • In practice: start from the initial input h0, then loop over the T timesteps, repeatedly calling the step function written above, updating prev_h each time and storing the per-step values in cache
  • Note that h must be initialized (e.g. with np.zeros) before the loop!

backward

  • Given dh and the cache, we need to output the gradients of everything
  • The main idea is that each step contributes additively: since the weights are shared, dWx, dWh, and db must accumulate the per-step gradients, adding the new contribution on every iteration of the loop
  • Going backward, computing the gradient at the current step needs the gradient flowing in from the next step; at the end of each iteration, "next" becomes the current one
def rnn_forward(x, h0, Wx, Wh, b):
    """
    Run a vanilla RNN forward on an entire sequence of data. We assume an input
    sequence composed of T vectors, each of dimension D. The RNN uses a hidden
    size of H, and we work over a minibatch containing N sequences. After running
    the RNN forward, we return the hidden states for all timesteps.

    Inputs:
    - x: Input data for the entire timeseries, of shape (N, T, D).
    - h0: Initial hidden state, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - h: Hidden states for the entire timeseries, of shape (N, T, H).
    - cache: Values needed in the backward pass
    """
    h, cache = None, None
    ##############################################################################
    # TODO: Implement forward pass for a vanilla RNN running on a sequence of    #
    # input data. You should use the rnn_step_forward function that you defined  #
    # above. You can use a for loop to help compute the forward pass.            #
    ##############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    N, T, D = x.shape
    N, H = h0.shape

    h = np.zeros((N, T, H))  # must be initialized before the loop
    prev_h = h0

    cache = {}

    for i in range(T):
        # step forward through time, carrying the hidden state along
        prev_h, cache_i = rnn_step_forward(x[:, i, :], prev_h, Wx, Wh, b)
        h[:, i, :] = prev_h
        cache[i] = cache_i

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return h, cache


def rnn_backward(dh, cache):
    """
    Compute the backward pass for a vanilla RNN over an entire sequence of data.

    Inputs:
    - dh: Upstream gradients of all hidden states, of shape (N, T, H).

    NOTE: 'dh' contains the upstream gradients produced by the
    individual loss functions at each timestep, *not* the gradients
    being passed between timesteps (which you'll have to compute yourself
    by calling rnn_step_backward in a loop).

    Returns a tuple of:
    - dx: Gradient of inputs, of shape (N, T, D)
    - dh0: Gradient of initial hidden state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, H)
    - db: Gradient of biases, of shape (H,)
    """
    dx, dh0, dWx, dWh, db = None, None, None, None, None
    ##############################################################################
    # TODO: Implement the backward pass for a vanilla RNN running an entire      #
    # sequence of data. You should use the rnn_step_backward function that you   #
    # defined above. You can use a for loop to help compute the backward pass.   #
    ##############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    N, T, H = dh.shape
    N, D = cache[0][0].shape  # recover D from the x cached at the first step

    dx = np.zeros((N, T, D))
    dh0 = np.zeros((N, H))
    dWx = np.zeros((D, H))
    dWh = np.zeros((H, H))
    db = np.zeros(H)
    dprev_h = np.zeros((N, H))

    for i in reversed(range(T)):
        cache_i = cache[i]
        # upstream gradient at step i plus the gradient flowing back from step i+1
        dnext_h = dh[:, i, :] + dprev_h
        dx[:, i, :], dprev_h, dWx_tmp, dWh_tmp, db_tmp = rnn_step_backward(
            dnext_h, cache_i)
        # the weights are shared across timesteps, so their gradients accumulate
        dWx += dWx_tmp
        dWh += dWh_tmp
        db += db_tmp

    dh0 = dprev_h

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return dx, dh0, dWx, dWh, db

word embedding

Deep learning systems mostly represent words as vectors: each word in the vocabulary is associated with a vector, and these vectors are learned jointly with the rest of the system.
In this part we need to convert words represented as integers into vectors.

  • Intuition:
    • In a sentence, every word is its own dimension, and the core of word embedding is dimensionality reduction
    • Compose words into passages, then use the passages to distill the final core content

forward

  • A minibatch has size N and each sequence has length T; every word is assigned a vector of dimension D
  • input
    • x (N,T): N sequences of T words each, where each entry is the index of a word
    • W (V,D): the word vectors for the whole vocabulary
  • return
    • out (N,T,D): a D-dimensional vector for every input word
    • cache
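The whole forward pass is a single fancy-indexing operation (W[x], as in the code below); a quick self-contained shape check:

import numpy as np

V, D, N, T = 5, 3, 2, 4
W = np.random.randn(V, D)              # one row per vocabulary word
x = np.random.randint(V, size=(N, T))  # integer word indices
out = W[x]                             # each index picks a row of W
print(out.shape)                       # (2, 4, 3) == (N, T, D)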

backward

  • We cannot backpropagate into the words themselves (they are integers), so we only need the gradient of the embedding matrix (see the np.add.at demo after the code below)
def word_embedding_forward(x, W):
    """
    Forward pass for word embeddings. We operate on minibatches of size N where
    each sequence has length T. We assume a vocabulary of V words, assigning each
    word to a vector of dimension D.

    Inputs:
    - x: Integer array of shape (N, T) giving indices of words. Each element idx
      of x must be in the range 0 <= idx < V.
    - W: Weight matrix of shape (V, D) giving word vectors for all words.

    Returns a tuple of:
    - out: Array of shape (N, T, D) giving word vectors for all input words.
    - cache: Values needed for the backward pass
    """
    out, cache = None, None
    ##############################################################################
    # TODO: Implement the forward pass for word embeddings.                      #
    #                                                                            #
    # HINT: This can be done in one line using NumPy's array indexing.           #
    ##############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    out = W[x, :]  # fancy indexing: each index in x picks a row of W -> (N, T, D)
    cache = x, W

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return out, cache


def word_embedding_backward(dout, cache):
    """
    Backward pass for word embeddings. We cannot back-propagate into the words
    since they are integers, so we only return gradient for the word embedding
    matrix.

    HINT: Look up the function np.add.at

    Inputs:
    - dout: Upstream gradients of shape (N, T, D)
    - cache: Values from the forward pass

    Returns:
    - dW: Gradient of word embedding matrix, of shape (V, D).
    """
    dW = None
    ##############################################################################
    # TODO: Implement the backward pass for word embeddings.                     #
    #                                                                            #
    # Note that words can appear more than once in a sequence.                   #
    # HINT: Look up the function np.add.at                                       #
    ##############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    x, W = cache
    dW = np.zeros_like(W)
    np.add.at(dW, x, dout)  # scatter-add: repeated indices accumulate correctly

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return dW
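np.add.at matters because the same word index can appear several times in x; buffered fancy-indexed assignment (dW[x] += dout) would silently drop the repeated contributions. A tiny check:

import numpy as np

V, D = 4, 2
dW = np.zeros((V, D))
x = np.array([[0, 1, 1]])   # word 1 appears twice
dout = np.ones((1, 3, 2))

np.add.at(dW, x, dout)      # unbuffered: repeats accumulate
print(dW[1])                # [2. 2.] -- both occurrences counted

dW2 = np.zeros((V, D))
dW2[x] += dout              # buffered: the repeated index is lost
print(dW2[1])               # [1. 1.]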

Temporal Affine layer

  • At every timestep we need an affine layer to turn the RNN's hidden vector into scores over every word in the vocabulary (these scores are what each step uses to pick a suitable word)
  • Since it is the same as what we have implemented before, it is provided directly (a rough sketch of the idea follows)
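The provided version lives in cs231n/rnn_layers.py; the trick is just to fold the time axis into the batch axis, apply an ordinary affine transform, and unfold. A minimal sketch of the forward direction (not the assignment's exact code):

import numpy as np

def temporal_affine_forward_sketch(x, w, b):
    # x: (N, T, D), w: (D, V), b: (V,)  ->  out: (N, T, V)
    N, T, D = x.shape
    V = b.shape[0]
    out = x.reshape(N * T, D).dot(w).reshape(N, T, V) + b
    return out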

Temporal Softmax loss

  • In the RNN architecture, every timestep produces a score for every word in the vocabulary (a score matrix)
  • We know the ground truth at every step, so we use softmax to compute the loss and gradient at each step, then average the loss over all timesteps of the minibatch
  • Sentences are not all the same length, so a NULL token is used to pad everything to equal length, but we do not want the NULL positions to count toward the loss. The function therefore also receives a mask telling it which positions to include and which to skip
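This function is also provided; a condensed sketch of the idea (flatten time into the batch, per-position softmax cross-entropy, zero out masked positions, divide by N), under the assumption that this mirrors the provided code rather than reproducing it exactly:

import numpy as np

def temporal_softmax_loss_sketch(x, y, mask):
    # x: (N, T, V) scores, y: (N, T) ground-truth indices, mask: (N, T) bool
    N, T, V = x.shape
    x_flat = x.reshape(N * T, V)
    y_flat = y.reshape(N * T)
    mask_flat = mask.reshape(N * T)

    # numerically stable softmax
    probs = np.exp(x_flat - x_flat.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)

    loss = -np.sum(mask_flat * np.log(probs[np.arange(N * T), y_flat])) / N

    dx_flat = probs
    dx_flat[np.arange(N * T), y_flat] -= 1
    dx_flat *= mask_flat[:, None] / N  # masked positions contribute nothing
    return loss, dx_flat.reshape(N, T, V)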

RNN for image captioning

  • In cs231n/classifiers/rnn.py; for now we only need to handle the vanilla RNN case
  • Implement the forward and backward passes inside loss

IO

  • Inputs
    • image features, of shape (N,D)
    • captions: the ground truth, of shape (N,T), where every element is in the range 0 to V
  • Outputs
    • loss
    • grads

TODO

  • An affine transform computes the initial hidden state from the image features, output shape (N,H) -> uses W_proj, b_proj -> this step initializes h0, the starting state
  • Word embedding converts the input sentence's ints (positions in the vocabulary) into vectors, output shape (N,T,W)
  • A vanilla RNN (or, later, an LSTM) computes how the hidden state changes across the intermediate timesteps, output shape (N,T,H)
  • A temporal affine transform converts each step's result into scores over the vocabulary, (N,T,V)
  • A temporal softmax converts the scores into a loss, taking care to ignore the positions excluded by the mask

  • In the backward pass we compute the gradient of the loss with respect to all parameters and store them in the grads dict above

  • In practice you can simply chain together the functions written earlier!
def loss(self, features, captions):
    """
    Compute training-time loss for the RNN. We input image features and
    ground-truth captions for those images, and use an RNN (or LSTM) to compute
    loss and gradients on all parameters.

    Inputs:
    - features: Input image features, of shape (N, D)
    - captions: Ground-truth captions; an integer array of shape (N, T) where
      each element is in the range 0 <= y[i, t] < V

    Returns a tuple of:
    - loss: Scalar loss
    - grads: Dictionary of gradients parallel to self.params
    """
    # Cut captions into two pieces: captions_in has everything but the last word
    # and will be input to the RNN; captions_out has everything but the first
    # word and this is what we will expect the RNN to generate. These are offset
    # by one relative to each other because the RNN should produce word (t+1)
    # after receiving word t. The first element of captions_in will be the START
    # token, and the first element of captions_out will be the first word.
    captions_in = captions[:, :-1]
    captions_out = captions[:, 1:]

    # You'll need this
    mask = (captions_out != self._null)

    # Weight and bias for the affine transform from image features to initial
    # hidden state
    W_proj, b_proj = self.params['W_proj'], self.params['b_proj']

    # Word embedding matrix
    W_embed = self.params['W_embed']

    # Input-to-hidden, hidden-to-hidden, and biases for the RNN
    Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']

    # Weight and bias for the hidden-to-vocab transformation.
    W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']

    loss, grads = 0.0, {}
    ############################################################################
    # TODO: Implement the forward and backward passes for the CaptioningRNN.   #
    # In the forward pass you will need to do the following:                   #
    # (1) Use an affine transformation to compute the initial hidden state     #
    #     from the image features. This should produce an array of shape (N, H)#
    # (2) Use a word embedding layer to transform the words in captions_in     #
    #     from indices to vectors, giving an array of shape (N, T, W).         #
    # (3) Use either a vanilla RNN or LSTM (depending on self.cell_type) to    #
    #     process the sequence of input word vectors and produce hidden state  #
    #     vectors for all timesteps, producing an array of shape (N, T, H).    #
    # (4) Use a (temporal) affine transformation to compute scores over the    #
    #     vocabulary at every timestep using the hidden states, giving an      #
    #     array of shape (N, T, V).                                            #
    # (5) Use (temporal) softmax to compute loss using captions_out, ignoring  #
    #     the points where the output word is <NULL> using the mask above.     #
    #                                                                          #
    # In the backward pass you will need to compute the gradient of the loss   #
    # with respect to all model parameters. Use the loss and grads variables   #
    # defined above to store loss and gradients; grads[k] should give the      #
    # gradients for self.params[k].                                            #
    #                                                                          #
    # Note also that you are allowed to make use of functions from layers.py   #
    # in your implementation, if needed.                                       #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    # (1) affine: (N, D) x (D, H) -> (N, H) initial hidden state
    hidden_init, hidden_init_cache = affine_forward(features, W_proj, b_proj)
    # (2) word embedding: (N, T) indices -> (N, T, D) word vectors
    embedding, embedding_cache = word_embedding_forward(captions_in, W_embed)
    # (3) RNN over the whole sequence -> (N, T, H)
    if self.cell_type == 'rnn':
        hidden_state, hidden_cache = rnn_forward(
            embedding, hidden_init, Wx, Wh, b)

    # (4) temporal affine: (N, T, H) x (H, V) -> (N, T, V) vocabulary scores
    scores, score_cache = temporal_affine_forward(
        hidden_state, W_vocab, b_vocab)

    # (5) temporal softmax: (N, T, V) -> scalar loss, ignoring masked positions
    loss, dloss = temporal_softmax_loss(scores, captions_out, mask)

    # Backward pass: walk the graph in reverse, filling grads as we go.
    daffine_x, grads['W_vocab'], grads['b_vocab'] = temporal_affine_backward(
        dloss, score_cache)

    if self.cell_type == 'rnn':
        drnn, dh_init, grads['Wx'], grads['Wh'], grads['b'] = rnn_backward(
            daffine_x, hidden_cache)

    grads['W_embed'] = word_embedding_backward(drnn, embedding_cache)

    dfeatures, grads['W_proj'], grads['b_proj'] = affine_backward(
        dh_init, hidden_init_cache)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    return loss, grads

overfit small data

  • As before, a solver is provided that handles everything needed to train the model; optim.py contains several different update rules
  • It accepts training and validation data and labels, and reports training and validation accuracy. After training, the model keeps the parameters that performed best on the validation set
  • In this step we load 50 COCO training examples and train a model on them; the final loss should come out below 0.1
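The notebook drives this with the CaptioningSolver from cs231n/captioning_solver.py; a typical invocation looks roughly like the following (the hyperparameters here mirror my recollection of the notebook and should be treated as illustrative):

from cs231n.captioning_solver import CaptioningSolver
from cs231n.classifiers.rnn import CaptioningRNN
from cs231n.coco_utils import load_coco_data

small_data = load_coco_data(max_train=50)  # overfit 50 training examples

model = CaptioningRNN(
    cell_type='rnn',
    word_to_idx=small_data['word_to_idx'],
    input_dim=small_data['train_features'].shape[1],
    hidden_dim=512,
    wordvec_dim=256,
)

solver = CaptioningSolver(
    model, small_data,
    update_rule='adam',
    num_epochs=50,
    batch_size=25,
    optim_config={'learning_rate': 5e-3},
    lr_decay=0.95,
    verbose=True, print_every=10,
)
solver.train()  # final training loss should fall below 0.1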

test-time sampling

  • Unlike classification, an RNN behaves very differently at training time and at test time:
    • At training time, we feed the ground-truth words into the RNN
    • At test time, we sample a word from each timestep's distribution over the vocabulary and feed that sample into the next step

implement

At each step we embed the current word and feed it, together with the previous hidden state, into the RNN to get the next hidden state; from that we compute scores over the vocabulary, choose the most likely word, and use it to generate the next word.

  • Inputs:
    • features (N,D): image features that have not yet been projected
    • max_length: the maximum caption length
  • Outputs
    • captions (N,max_length): holding ints in the range 0-V; the first element should be the first sampled word, not the <START> token
  • TODO
    • Initialize the hidden state from features; the first input word should be <START> (the very beginning)
    • Then, at every step:
      • Embed the previous word using the already-learned parameters
      • Take an RNN step from the previous hidden state and the current embedding to get the next hidden state (call the per-step function, not the full-sequence one)
      • Transform the next hidden state into scores
      • Pick the most likely word from the scores and write out that word's index
      • For simplicity, you do not need to stop once <END> appears
  • Notes:
    • Use the plain affine transform (not the temporal one) to compute the scores: we only score the current step, so the result has shape (N,V) and the best entry is found per row
    • The argmax computed at each step should be written into that step's column of captions
def sample(self, features, max_length=30):
    """
    Run a test-time forward pass for the model, sampling captions for input
    feature vectors.

    At each timestep, we embed the current word, pass it and the previous hidden
    state to the RNN to get the next hidden state, use the hidden state to get
    scores for all vocab words, and choose the word with the highest score as
    the next word. The initial hidden state is computed by applying an affine
    transform to the input image features, and the initial word is the <START>
    token.

    For LSTMs you will also have to keep track of the cell state; in that case
    the initial cell state should be zero.

    Inputs:
    - features: Array of input image features of shape (N, D).
    - max_length: Maximum length T of generated captions.

    Returns:
    - captions: Array of shape (N, max_length) giving sampled captions,
      where each element is an integer in the range [0, V). The first element
      of captions should be the first sampled word, not the <START> token.
    """
    N = features.shape[0]
    captions = self._null * np.ones((N, max_length), dtype=np.int32)

    # Unpack parameters
    W_proj, b_proj = self.params['W_proj'], self.params['b_proj']
    W_embed = self.params['W_embed']
    Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']
    W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']

    ###########################################################################
    # TODO: Implement test-time sampling for the model. You will need to      #
    # initialize the hidden state of the RNN by applying the learned affine   #
    # transform to the input image features. The first word that you feed to  #
    # the RNN should be the <START> token; its value is stored in the         #
    # variable self._start. At each timestep you will need to do the          #
    # following:                                                              #
    # (1) Embed the previous word using the learned word embeddings           #
    # (2) Make an RNN step using the previous hidden state and the embedded   #
    #     current word to get the next hidden state.                          #
    # (3) Apply the learned affine transformation to the next hidden state to #
    #     get scores for all words in the vocabulary                          #
    # (4) Select the word with the highest score as the next word, writing it #
    #     (the word index) to the appropriate slot in the captions variable   #
    #                                                                         #
    # For simplicity, you do not need to stop generating after an <END> token #
    # is sampled, but you can if you want to.                                 #
    #                                                                         #
    # HINT: You will not be able to use the rnn_forward or lstm_forward       #
    # functions; you'll need to call rnn_step_forward or lstm_step_forward in #
    # a loop.                                                                 #
    #                                                                         #
    # NOTE: we are still working over minibatches in this function. Also if   #
    # you are using an LSTM, initialize the first cell state to zeros.        #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    # h0 from the image features: (N, D) x (D, H) -> (N, H)
    hidden_init, _ = affine_forward(features, W_proj, b_proj)
    # the first word fed in is <START>, one copy per example in the minibatch
    start_words = np.full(N, self._start)
    current_word, _ = word_embedding_forward(start_words, W_embed)

    next_state = hidden_init

    for step in range(max_length):
        prev_state = next_state
        if self.cell_type == 'rnn':
            next_state, _ = rnn_step_forward(
                current_word, prev_state, Wx, Wh, b)

        # plain (not temporal) affine: we only score this one step -> (N, V)
        step_scores, _ = affine_forward(next_state, W_vocab, b_vocab)

        # greedy decoding: pick the most likely word in each row
        captions[:, step] = np.argmax(step_scores, axis=1)

        # embed the chosen words as input for the next step
        current_word, _ = word_embedding_forward(captions[:, step], W_embed)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################
    return captions