CS231n Assignment 2: Dropout

Target

  • A regularization technique for neural networks
  • Works by randomly setting some features to zero during the forward pass

Geoffrey E. Hinton et al, “Improving neural networks by preventing co-adaptation of feature detectors”, arXiv 2012

Dropout forward + backward

in cs231n/layers.py

IO

  • Input
    • x: input data, of any shape
    • dropout_param, a dictionary with keys:
      • p: the probability of keeping each neuron output
      • mode: 'train' performs dropout; 'test' simply returns the input
      • seed: seed for the random number generator used by dropout
  • Output
    • out: array of the same shape as x
    • cache: tuple (dropout_param, mask). In training mode, mask is the dropout mask used to multiply the input
  • The vanilla version of dropout is not recommended here; implement inverted dropout instead
NOTE: Please implement **inverted** dropout, not the vanilla version of dropout.
See http://cs231n.github.io/neural-networks-2/#reg for more details.

NOTE 2: Keep in mind that p is the probability of **keep** a neuron
output; this might be contrary to some sources, where it is referred to
as the probability of dropping a neuron output.

Implementation

  • During training, a fraction of the units in each hidden layer is dropped; if desired, units in the input layer can be dropped as well
  • At prediction time nothing is dropped, but the outputs have to be scaled by the keep probability -> this is what makes the vanilla approach awkward
    • For example, with keep probability p, the expected activation after dropping is p·x
    • So at test time the activations must be scaled to match (x -> p·x)
  • Inverted dropout does the scaling at training time instead, leaving the test-time forward pass untouched, for example:

    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p  # note the division by p -- this is the "inverted" part
  • The backward pass also becomes easy: a dropped unit contributes nothing to the upstream dx, and a kept unit just scales the gradient by a constant (1/p). A side-by-side sketch of vanilla vs. inverted dropout follows this list.
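To make the contrast concrete, here is a minimal numpy sketch (layer sizes and p are arbitrary choices for illustration) of the same hidden layer under vanilla and inverted dropout; only the inverted version leaves the test-time pass untouched:

import numpy as np

p = 0.5                                   # keep probability
W1, b1 = np.random.randn(20, 10), np.zeros(20)
X = np.random.randn(10)
H1 = np.maximum(0, np.dot(W1, X) + b1)    # ReLU hidden layer

# vanilla dropout: drop at train time, rescale at test time
U1 = np.random.rand(*H1.shape) < p        # keep each unit with probability p
H1_train = H1 * U1
H1_test = H1 * p                          # test-time activations must be scaled by p

# inverted dropout: rescale at train time, test pass untouched
U1_inv = (np.random.rand(*H1.shape) < p) / p
H1_train_inv = H1 * U1_inv
H1_test_inv = H1                          # nothing to change at test time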

code

def dropout_forward(x, dropout_param):
    """
    Performs the forward pass for (inverted) dropout.

    Inputs:
    - x: Input data, of any shape
    - dropout_param: A dictionary with the following keys:
      - p: Dropout parameter. We keep each neuron output with probability p.
      - mode: 'test' or 'train'. If the mode is train, then perform dropout;
        if the mode is test, then just return the input.
      - seed: Seed for the random number generator. Passing seed makes this
        function deterministic, which is needed for gradient checking but not
        in real networks.

    Outputs:
    - out: Array of the same shape as x.
    - cache: tuple (dropout_param, mask). In training mode, mask is the dropout
      mask that was used to multiply the input; in test mode, mask is None.

    NOTE: Please implement **inverted** dropout, not the vanilla version of dropout.
    See http://cs231n.github.io/neural-networks-2/#reg for more details.

    NOTE 2: Keep in mind that p is the probability of **keep** a neuron
    output; this might be contrary to some sources, where it is referred to
    as the probability of dropping a neuron output.
    """
    p, mode = dropout_param['p'], dropout_param['mode']
    if 'seed' in dropout_param:
        np.random.seed(dropout_param['seed'])

    mask = None
    out = None

    if mode == 'train':
        #######################################################################
        # TODO: Implement training phase forward pass for inverted dropout.   #
        # Store the dropout mask in the mask variable.                        #
        #######################################################################
        # Keep each unit with probability p (uniform random draw) and rescale
        # by 1 / p so the expected activation is unchanged.
        mask = (np.random.rand(*x.shape) < p) / p
        out = x * mask
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################
    elif mode == 'test':
        #######################################################################
        # TODO: Implement the test phase forward pass for inverted dropout.   #
        #######################################################################
        out = x
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################

    cache = (dropout_param, mask)
    out = out.astype(x.dtype, copy=False)

    return out, cache


def dropout_backward(dout, cache):
    """
    Perform the backward pass for (inverted) dropout.

    Inputs:
    - dout: Upstream derivatives, of any shape
    - cache: (dropout_param, mask) from dropout_forward.
    """
    dropout_param, mask = cache
    mode = dropout_param['mode']

    dx = None
    if mode == 'train':
        #######################################################################
        # TODO: Implement training phase backward pass for inverted dropout   #
        #######################################################################
        # Kept units pass the gradient through scaled by 1 / p; dropped units
        # contribute nothing.
        dx = dout * mask
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################
    elif mode == 'test':
        dx = dout
    return dx
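
As a quick check of the implementation above (similar in spirit to the notebook's test; the exact numbers here are just placeholders), the train-mode output mean should stay close to the input mean thanks to the / p scaling, and test mode should return the input unchanged:

np.random.seed(231)
x = np.random.randn(500, 500) + 10

for p in [0.25, 0.4, 0.7]:
    out_train, _ = dropout_forward(x, {'mode': 'train', 'p': p})
    out_test, _ = dropout_forward(x, {'mode': 'test', 'p': p})
    print('p = %.2f: x mean %.3f, train mean %.3f, test mean %.3f, fraction dropped %.3f'
          % (p, x.mean(), out_train.mean(), out_test.mean(), (out_train == 0).mean()))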

Fully-connected net with dropout

  • Dropout should be added after the ReLU in each hidden layer
  • I added the new dropout part to the helper function I had defined earlier; stubbornly reusing that function led to a few small side issues (a toy sketch of the two Python idioms involved follows this list):
    • Optional parameters can simply be given default values in the function signature
    • A return value you don't need can be bound to _ at the call site
    • Note that in fc_net, if dropout = 1 the use_dropout flag is False, so dropout is effectively disabled
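
A toy illustration of the two idioms above (the function here is made up for the example, not the real helper):

def toy_forward(x, gamma=None, beta=None):
    # optional parameters get default values right in the signature
    if gamma is None:
        return x, None
    return gamma * x + beta, (gamma, beta)

out, _ = toy_forward(3.0)                     # cache not needed -> bind it to _
out2, cache = toy_forward(3.0, gamma=2.0, beta=1.0)
print(out, out2, cache)                       # 3.0 7.0 (2.0, 1.0)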
def affine_Normal_relu_dropout_forward(self, x, w, b, mode, gamma=None, beta=None, bn_params=None):
    # affine -> (batch/layer norm) -> relu -> (dropout), returning one combined cache
    Normal_cache = None
    dp_cache = None
    a, fc_cache = affine_forward(x, w, b)
    if mode == "batchnorm":
        mid, Normal_cache = batchnorm_forward(a, gamma, beta, bn_params)
    elif mode == "layernorm":
        mid, Normal_cache = layernorm_forward(a, gamma, beta, bn_params)
    else:
        mid = a

    dp, relu_cache = relu_forward(mid)
    if self.use_dropout:
        out, dp_cache = dropout_forward(dp, self.dropout_param)
    else:
        out = dp
    cache = (fc_cache, Normal_cache, relu_cache, dp_cache)

    return out, cache

def affine_Normal_relu_dropout_backward(self, dout, cache, mode):
    # backward through dropout -> relu -> (batch/layer norm) -> affine
    fc_cache, Normal_cache, relu_cache, dp_cache = cache
    dgamma = 0.0
    dbeta = 0.0
    if self.use_dropout:
        ddp = dropout_backward(dout, dp_cache)
    else:
        ddp = dout
    da = relu_backward(ddp, relu_cache)
    if mode == "batchnorm":
        dmid, dgamma, dbeta = batchnorm_backward_alt(da, Normal_cache)
    elif mode == "layernorm":
        dmid, dgamma, dbeta = layernorm_backward(da, Normal_cache)
    else:
        dmid = da
    dx, dw, db = affine_backward(dmid, fc_cache)

    return dx, dw, db, dgamma, dbeta

regularization experiment

  • Train two two-layer networks on 500 training examples: one without dropout, the other with a dropout keep probability of 0.25
  • The final results are visualized below

image1

Judging from the results, dropout seems to help more when the number of training epochs is relatively small. A rough sketch of how such a comparison can be set up follows.
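
Roughly how such a comparison can be set up with the Solver class; the hyperparameters below are placeholders (not necessarily the ones the notebook uses), and `data` is assumed to be the CIFAR-10 dictionary loaded earlier in the assignment:

np.random.seed(231)
num_train = 500
small_data = {
    'X_train': data['X_train'][:num_train],
    'y_train': data['y_train'][:num_train],
    'X_val': data['X_val'],
    'y_val': data['y_val'],
}

solvers = {}
for keep_prob in [1, 0.25]:                    # 1 means "no dropout"
    model = FullyConnectedNet([500], dropout=keep_prob)
    solver = Solver(model, small_data,
                    num_epochs=25, batch_size=100,
                    update_rule='adam',
                    optim_config={'learning_rate': 5e-4},
                    verbose=False)
    solver.train()
    solvers[keep_prob] = solver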

Full code of the fully-connected net with dropout and normalization

class FullyConnectedNet(object):
    """
    A fully-connected neural network with an arbitrary number of hidden layers,
    ReLU nonlinearities, and a softmax loss function. This will also implement
    dropout and batch/layer normalization as options. For a network with L layers,
    the architecture will be

    {affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax

    where batch/layer normalization and dropout are optional, and the {...} block is
    repeated L - 1 times.

    Similar to the TwoLayerNet above, learnable parameters are stored in the
    self.params dictionary and will be learned using the Solver class.
    """

    def __init__(self, hidden_dims, input_dim=3 * 32 * 32, num_classes=10,
                 dropout=1, normalization=None, reg=0.0,
                 weight_scale=1e-2, dtype=np.float32, seed=None):
        """
        Initialize a new FullyConnectedNet.

        Inputs:
        - hidden_dims: A list of integers giving the size of each hidden layer.
        - input_dim: An integer giving the size of the input.
        - num_classes: An integer giving the number of classes to classify.
        - dropout: Scalar between 0 and 1 giving dropout strength. If dropout=1 then
          the network should not use dropout at all.
        - normalization: What type of normalization the network should use. Valid values
          are "batchnorm", "layernorm", or None for no normalization (the default).
        - reg: Scalar giving L2 regularization strength.
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - dtype: A numpy datatype object; all computations will be performed using
          this datatype. float32 is faster but less accurate, so you should use
          float64 for numeric gradient checking.
        - seed: If not None, then pass this random seed to the dropout layers. This
          will make the dropout layers deterministic so we can gradient check the
          model.
        """
        self.normalization = normalization
        self.use_dropout = dropout != 1
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}

        ############################################################################
        # TODO: Initialize the parameters of the network, storing all values in    #
        # the self.params dictionary. Store weights and biases for the first layer #
        # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be #
        # initialized from a normal distribution centered at 0 with standard       #
        # deviation equal to weight_scale. Biases should be initialized to zero.   #
        #                                                                          #
        # When using batch normalization, store scale and shift parameters for the #
        # first layer in gamma1 and beta1; for the second layer use gamma2 and     #
        # beta2, etc. Scale parameters should be initialized to ones and shift     #
        # parameters should be initialized to zeros.                               #
        ############################################################################
        pr_num = input_dim

        # not using enumerate because the loop needs one more iteration than len(hidden_dims)
        for layer in range(self.num_layers):
            layer += 1
            weights = 'W' + str(layer)
            bias = 'b' + str(layer)

            # the last layer
            if layer == self.num_layers:
                self.params[weights] = np.random.randn(
                    hidden_dims[len(hidden_dims) - 1], num_classes) * weight_scale
                self.params[bias] = np.zeros(num_classes)

            # other layers
            else:
                hidd_num = hidden_dims[layer - 1]
                self.params[weights] = np.random.randn(
                    pr_num, hidd_num) * weight_scale
                self.params[bias] = np.zeros(hidd_num)
                pr_num = hidd_num

                if self.normalization in ["batchnorm", "layernorm"]:
                    self.params['gamma' + str(layer)] = np.ones(hidd_num)
                    self.params['beta' + str(layer)] = np.zeros(hidd_num)

        # print(len(self.params))
        # print(self.params)
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # When using dropout we need to pass a dropout_param dictionary to each
        # dropout layer so that the layer knows the dropout probability and the mode
        # (train / test). You can pass the same dropout_param to each dropout layer.
        self.dropout_param = {}
        if self.use_dropout:
            self.dropout_param = {'mode': 'train', 'p': dropout}
            if seed is not None:
                self.dropout_param['seed'] = seed

        # With batch normalization we need to keep track of running means and
        # variances, so we need to pass a special bn_param object to each batch
        # normalization layer. You should pass self.bn_params[0] to the forward pass
        # of the first batch normalization layer, self.bn_params[1] to the forward
        # pass of the second batch normalization layer, etc.
        self.bn_params = []
        if self.normalization == 'batchnorm':
            self.bn_params = [{'mode': 'train'}
                              for i in range(self.num_layers - 1)]
        if self.normalization == 'layernorm':
            self.bn_params = [{} for i in range(self.num_layers - 1)]

        # Cast all parameters to the correct datatype
        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)

    def loss(self, X, y=None):
        """
        Compute loss and gradient for the fully-connected net.

        Input / output: Same as TwoLayerNet above.
        """
        X = X.astype(self.dtype)
        mode = 'test' if y is None else 'train'

        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.use_dropout:
            self.dropout_param['mode'] = mode
        if self.normalization == 'batchnorm':
            for bn_param in self.bn_params:
                bn_param['mode'] = mode
        scores = None
        ############################################################################
        # TODO: Implement the forward pass for the fully-connected net, computing  #
        # the class scores for X and storing them in the scores variable.          #
        #                                                                          #
        # When using dropout, you'll need to pass self.dropout_param to each       #
        # dropout forward pass.                                                    #
        #                                                                          #
        # When using batch normalization, you'll need to pass self.bn_params[0] to #
        # the forward pass for the first batch normalization layer, pass           #
        # self.bn_params[1] to the forward pass for the second batch normalization #
        # layer, etc.                                                              #
        ############################################################################
        cache = {}
        temp_out = X
        for i in range(self.num_layers):
            w = self.params['W' + str(i + 1)]
            b = self.params['b' + str(i + 1)]
            if i == self.num_layers - 1:
                scores, cache['cache' + str(i + 1)] = affine_forward(temp_out, w, b)
            else:
                if self.normalization in ["batchnorm", "layernorm"]:
                    gamma = self.params['gamma' + str(i + 1)]
                    beta = self.params['beta' + str(i + 1)]
                    temp_out, cache['cache' + str(i + 1)] = self.affine_Normal_relu_dropout_forward(
                        temp_out, w, b, self.normalization, gamma, beta, self.bn_params[i])
                else:
                    # temp_out, cache['cache' + str(i + 1)] = affine_relu_forward(temp_out, w, b)
                    temp_out, cache['cache' + str(i + 1)] = self.affine_Normal_relu_dropout_forward(
                        temp_out, w, b, mode=self.normalization)

        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # If test mode return early
        if mode == 'test':
            return scores

        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the backward pass for the fully-connected net. Store the #
        # loss in the loss variable and gradients in the grads dictionary. Compute #
        # data loss using softmax, and make sure that grads[k] holds the gradients #
        # for self.params[k]. Don't forget to add L2 regularization!               #
        #                                                                          #
        # When using batch/layer normalization, you don't need to regularize the   #
        # scale and shift parameters.                                              #
        #                                                                          #
        # NOTE: To ensure that your implementation matches ours and you pass the   #
        # automated tests, make sure that your L2 regularization includes a factor #
        # of 0.5 to simplify the expression for the gradient.                      #
        ############################################################################
        loss, dscores = softmax_loss(scores, y)
        reg_loss = 0.0
        pre_dx = dscores

        for i in reversed(range(self.num_layers)):
            i = i + 1
            reg_loss = np.sum(np.square(self.params['W' + str(i)]))
            loss += reg_loss * 0.5 * self.reg

            # the last layer
            if i == self.num_layers:
                pre_dx, dw, db = affine_backward(
                    pre_dx, cache['cache' + str(i)])
            else:
                if self.normalization in ["batchnorm", "layernorm"]:
                    pre_dx, dw, db, dgamma, dbeta = self.affine_Normal_relu_dropout_backward(
                        pre_dx, cache['cache' + str(i)], self.normalization)
                    grads['gamma' + str(i)] = dgamma
                    grads['beta' + str(i)] = dbeta
                else:
                    pre_dx, dw, db, _, _ = self.affine_Normal_relu_dropout_backward(
                        pre_dx, cache['cache' + str(i)], self.normalization)

            # only the weights are L2-regularized; biases and gamma/beta are not
            dw += self.reg * self.params['W' + str(i)]
            grads['W' + str(i)] = dw
            grads['b' + str(i)] = db

        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return loss, grads

    def affine_Normal_relu_dropout_forward(self, x, w, b, mode, gamma=None, beta=None, bn_params=None):
        # affine -> (batch/layer norm) -> relu -> (dropout), returning one combined cache
        Normal_cache = None
        dp_cache = None
        a, fc_cache = affine_forward(x, w, b)
        if mode == "batchnorm":
            mid, Normal_cache = batchnorm_forward(a, gamma, beta, bn_params)
        elif mode == "layernorm":
            mid, Normal_cache = layernorm_forward(a, gamma, beta, bn_params)
        else:
            mid = a

        dp, relu_cache = relu_forward(mid)
        if self.use_dropout:
            out, dp_cache = dropout_forward(dp, self.dropout_param)
        else:
            out = dp
        cache = (fc_cache, Normal_cache, relu_cache, dp_cache)

        return out, cache

    def affine_Normal_relu_dropout_backward(self, dout, cache, mode):
        # backward through dropout -> relu -> (batch/layer norm) -> affine
        fc_cache, Normal_cache, relu_cache, dp_cache = cache
        dgamma = 0.0
        dbeta = 0.0
        if self.use_dropout:
            ddp = dropout_backward(dout, dp_cache)
        else:
            ddp = dout
        da = relu_backward(ddp, relu_cache)
        if mode == "batchnorm":
            dmid, dgamma, dbeta = batchnorm_backward_alt(da, Normal_cache)
        elif mode == "layernorm":
            dmid, dgamma, dbeta = layernorm_backward(da, Normal_cache)
        else:
            dmid = da
        dx, dw, db = affine_backward(dmid, fc_cache)

        return dx, dw, db, dgamma, dbeta
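
A minimal smoke test for the class above, assuming the layer functions from cs231n/layers.py are imported; the shapes and hyperparameters below are arbitrary:

np.random.seed(231)
N, D, H1, H2, C = 2, 15, 20, 30, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size=(N,))

model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,
                          dropout=0.75, normalization='batchnorm',
                          reg=0.1, dtype=np.float64, seed=123)
loss, grads = model.loss(X, y)
print('initial loss:', loss)               # roughly ln(10) plus the small reg term
for name in sorted(grads):
    assert grads[name].shape == model.params[name].shape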