CS231n Assignment 2: Dropout

Target

  • A regularization technique for neural networks
  • Works by randomly setting some features to zero during the forward pass

Geoffrey E. Hinton et al, “Improving neural networks by preventing co-adaptation of feature detectors”, arXiv 2012

Dropout forward + backward

in cs231n/layers.py

IO

  • Input
    • x: input data, of any shape
    • dropout_param, a dictionary with keys:
      • p: the probability of keeping each neuron output
      • mode: 'train' performs dropout; 'test' simply returns the input
      • seed: seed for the random number generator used by dropout
  • Output
    • out: array of the same shape as x
    • cache: tuple (dropout_param, mask). In training mode, mask is the dropout mask used to multiply the input
  • The vanilla version of dropout is not recommended here; implement inverted dropout instead
NOTE: Please implement **inverted** dropout, not the vanilla version of dropout.
See http://cs231n.github.io/neural-networks-2/#reg for more details.

NOTE 2: Keep in mind that p is the probability of **keep** a neuron
output; this might be contrary to some sources, where it is referred to
as the probability of dropping a neuron output.

Implementation

  • During training, a fraction of the units in each hidden layer is dropped; if desired, units in the input layer can be dropped as well
  • At prediction time nothing is dropped, but the outputs have to be scaled by the keep probability -> this is what makes the vanilla approach awkward
    • For example, with keep probability p, the expected activation after dropping is p·x
    • So at test time the activations must be scaled to match (x -> p·x)
  • Inverted dropout does the scaling at training time instead, leaving the test-time forward pass untouched, for example:

    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p  # note the division by p -- this is the "inverted" part
  • The backward pass also becomes easy: a dropped unit contributes nothing to the upstream dx, and a kept unit just scales the gradient by a constant (1/p). A side-by-side sketch of vanilla vs. inverted dropout follows this list.
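To make the contrast concrete, here is a minimal numpy sketch (layer sizes and p are arbitrary choices for illustration) of the same hidden layer under vanilla and inverted dropout; only the inverted version leaves the test-time pass untouched:

import numpy as np

p = 0.5                                   # keep probability
W1, b1 = np.random.randn(20, 10), np.zeros(20)
X = np.random.randn(10)
H1 = np.maximum(0, np.dot(W1, X) + b1)    # ReLU hidden layer

# vanilla dropout: drop at train time, rescale at test time
U1 = np.random.rand(*H1.shape) < p        # keep each unit with probability p
H1_train = H1 * U1
H1_test = H1 * p                          # test-time activations must be scaled by p

# inverted dropout: rescale at train time, test pass untouched
U1_inv = (np.random.rand(*H1.shape) < p) / p
H1_train_inv = H1 * U1_inv
H1_test_inv = H1                          # nothing to change at test time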

code

def dropout_forward(x, dropout_param):
    """
    Performs the forward pass for (inverted) dropout.

    Inputs:
    - x: Input data, of any shape
    - dropout_param: A dictionary with the following keys:
      - p: Dropout parameter. We keep each neuron output with probability p.
      - mode: 'test' or 'train'. If the mode is train, then perform dropout;
        if the mode is test, then just return the input.
      - seed: Seed for the random number generator. Passing seed makes this
        function deterministic, which is needed for gradient checking but not
        in real networks.

    Outputs:
    - out: Array of the same shape as x.
    - cache: tuple (dropout_param, mask). In training mode, mask is the dropout
      mask that was used to multiply the input; in test mode, mask is None.

    NOTE: Please implement **inverted** dropout, not the vanilla version of dropout.
    See http://cs231n.github.io/neural-networks-2/#reg for more details.

    NOTE 2: Keep in mind that p is the probability of **keep** a neuron
    output; this might be contrary to some sources, where it is referred to
    as the probability of dropping a neuron output.
    """
    p, mode = dropout_param['p'], dropout_param['mode']
    if 'seed' in dropout_param:
        np.random.seed(dropout_param['seed'])

    mask = None
    out = None

    if mode == 'train':
        #######################################################################
        # TODO: Implement training phase forward pass for inverted dropout.   #
        # Store the dropout mask in the mask variable.                        #
        #######################################################################
        # Keep each unit with probability p (uniform random draw) and rescale
        # by 1 / p so the expected activation is unchanged.
        mask = (np.random.rand(*x.shape) < p) / p
        out = x * mask
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################
    elif mode == 'test':
        #######################################################################
        # TODO: Implement the test phase forward pass for inverted dropout.   #
        #######################################################################
        out = x
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################

    cache = (dropout_param, mask)
    out = out.astype(x.dtype, copy=False)

    return out, cache


def dropout_backward(dout, cache):
    """
    Perform the backward pass for (inverted) dropout.

    Inputs:
    - dout: Upstream derivatives, of any shape
    - cache: (dropout_param, mask) from dropout_forward.
    """
    dropout_param, mask = cache
    mode = dropout_param['mode']

    dx = None
    if mode == 'train':
        #######################################################################
        # TODO: Implement training phase backward pass for inverted dropout   #
        #######################################################################
        # Kept units pass the gradient through scaled by 1 / p; dropped units
        # contribute nothing.
        dx = dout * mask
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################
    elif mode == 'test':
        dx = dout
    return dx
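
As a quick check of the implementation above (similar in spirit to the notebook's test; the exact numbers here are just placeholders), the train-mode output mean should stay close to the input mean thanks to the / p scaling, and test mode should return the input unchanged:

np.random.seed(231)
x = np.random.randn(500, 500) + 10

for p in [0.25, 0.4, 0.7]:
    out_train, _ = dropout_forward(x, {'mode': 'train', 'p': p})
    out_test, _ = dropout_forward(x, {'mode': 'test', 'p': p})
    print('p = %.2f: x mean %.3f, train mean %.3f, test mean %.3f, fraction dropped %.3f'
          % (p, x.mean(), out_train.mean(), out_test.mean(), (out_train == 0).mean()))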

Fully-connected net with dropout

  • Dropout should be added after the ReLU in each hidden layer
  • I added the new dropout part to the helper function I had defined earlier; stubbornly reusing that function led to a few small side issues (a toy sketch of the two Python idioms involved follows this list):
    • Optional parameters can simply be given default values in the function signature
    • A return value you don't need can be bound to _ at the call site
    • Note that in fc_net, if dropout = 1 the use_dropout flag is False, so dropout is effectively disabled
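
A toy illustration of the two idioms above (the function here is made up for the example, not the real helper):

def toy_forward(x, gamma=None, beta=None):
    # optional parameters get default values right in the signature
    if gamma is None:
        return x, None
    return gamma * x + beta, (gamma, beta)

out, _ = toy_forward(3.0)                     # cache not needed -> bind it to _
out2, cache = toy_forward(3.0, gamma=2.0, beta=1.0)
print(out, out2, cache)                       # 3.0 7.0 (2.0, 1.0)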
def affine_Normal_relu_dropout_forward(self, x, w, b, mode, gamma=None, beta=None, bn_params=None):
    # affine -> (batch/layer norm) -> relu -> (dropout), returning one combined cache
    Normal_cache = None
    dp_cache = None
    a, fc_cache = affine_forward(x, w, b)
    if mode == "batchnorm":
        mid, Normal_cache = batchnorm_forward(a, gamma, beta, bn_params)
    elif mode == "layernorm":
        mid, Normal_cache = layernorm_forward(a, gamma, beta, bn_params)
    else:
        mid = a

    dp, relu_cache = relu_forward(mid)
    if self.use_dropout:
        out, dp_cache = dropout_forward(dp, self.dropout_param)
    else:
        out = dp
    cache = (fc_cache, Normal_cache, relu_cache, dp_cache)

    return out, cache

def affine_Normal_relu_dropout_backward(self, dout, cache, mode):
    # backward through dropout -> relu -> (batch/layer norm) -> affine
    fc_cache, Normal_cache, relu_cache, dp_cache = cache
    dgamma = 0.0
    dbeta = 0.0
    if self.use_dropout:
        ddp = dropout_backward(dout, dp_cache)
    else:
        ddp = dout
    da = relu_backward(ddp, relu_cache)
    if mode == "batchnorm":
        dmid, dgamma, dbeta = batchnorm_backward_alt(da, Normal_cache)
    elif mode == "layernorm":
        dmid, dgamma, dbeta = layernorm_backward(da, Normal_cache)
    else:
        dmid = da
    dx, dw, db = affine_backward(dmid, fc_cache)

    return dx, dw, db, dgamma, dbeta

regularization experiment

  • Train two two-layer networks on 500 training examples: one without dropout, the other with a dropout keep probability of 0.25
  • The final results are visualized below

image1

Judging from the results, dropout seems to help more when the number of training epochs is relatively small. A rough sketch of how such a comparison can be set up follows.
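
Roughly how such a comparison can be set up with the Solver class; the hyperparameters below are placeholders (not necessarily the ones the notebook uses), and `data` is assumed to be the CIFAR-10 dictionary loaded earlier in the assignment:

np.random.seed(231)
num_train = 500
small_data = {
    'X_train': data['X_train'][:num_train],
    'y_train': data['y_train'][:num_train],
    'X_val': data['X_val'],
    'y_val': data['y_val'],
}

solvers = {}
for keep_prob in [1, 0.25]:                    # 1 means "no dropout"
    model = FullyConnectedNet([500], dropout=keep_prob)
    solver = Solver(model, small_data,
                    num_epochs=25, batch_size=100,
                    update_rule='adam',
                    optim_config={'learning_rate': 5e-4},
                    verbose=False)
    solver.train()
    solvers[keep_prob] = solver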

Full code of the fully-connected net with dropout and normalization

class FullyConnectedNet(object):
    """
    A fully-connected neural network with an arbitrary number of hidden layers,
    ReLU nonlinearities, and a softmax loss function. This will also implement
    dropout and batch/layer normalization as options. For a network with L layers,
    the architecture will be

    {affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax

    where batch/layer normalization and dropout are optional, and the {...} block is
    repeated L - 1 times.

    Similar to the TwoLayerNet above, learnable parameters are stored in the
    self.params dictionary and will be learned using the Solver class.
    """

    def __init__(self, hidden_dims, input_dim=3 * 32 * 32, num_classes=10,
                 dropout=1, normalization=None, reg=0.0,
                 weight_scale=1e-2, dtype=np.float32, seed=None):
        """
        Initialize a new FullyConnectedNet.

        Inputs:
        - hidden_dims: A list of integers giving the size of each hidden layer.
        - input_dim: An integer giving the size of the input.
        - num_classes: An integer giving the number of classes to classify.
        - dropout: Scalar between 0 and 1 giving dropout strength. If dropout=1 then
          the network should not use dropout at all.
        - normalization: What type of normalization the network should use. Valid values
          are "batchnorm", "layernorm", or None for no normalization (the default).
        - reg: Scalar giving L2 regularization strength.
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - dtype: A numpy datatype object; all computations will be performed using
          this datatype. float32 is faster but less accurate, so you should use
          float64 for numeric gradient checking.
        - seed: If not None, then pass this random seed to the dropout layers. This
          will make the dropout layers deterministic so we can gradient check the
          model.
        """
        self.normalization = normalization
        self.use_dropout = dropout != 1
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}

        ############################################################################
        # TODO: Initialize the parameters of the network, storing all values in    #
        # the self.params dictionary. Store weights and biases for the first layer #
        # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be #
        # initialized from a normal distribution centered at 0 with standard       #
        # deviation equal to weight_scale. Biases should be initialized to zero.   #
        #                                                                          #
        # When using batch normalization, store scale and shift parameters for the #
        # first layer in gamma1 and beta1; for the second layer use gamma2 and     #
        # beta2, etc. Scale parameters should be initialized to ones and shift     #
        # parameters should be initialized to zeros.                               #
        ############################################################################
        pr_num = input_dim

        # not using enumerate because the loop needs one more iteration than len(hidden_dims)
        for layer in range(self.num_layers):
            layer += 1
            weights = 'W' + str(layer)
            bias = 'b' + str(layer)

            # the last layer
            if layer == self.num_layers:
                self.params[weights] = np.random.randn(
                    hidden_dims[len(hidden_dims) - 1], num_classes) * weight_scale
                self.params[bias] = np.zeros(num_classes)

            # other layers
            else:
                hidd_num = hidden_dims[layer - 1]
                self.params[weights] = np.random.randn(
                    pr_num, hidd_num) * weight_scale
                self.params[bias] = np.zeros(hidd_num)
                pr_num = hidd_num

                if self.normalization in ["batchnorm", "layernorm"]:
                    self.params['gamma' + str(layer)] = np.ones(hidd_num)
                    self.params['beta' + str(layer)] = np.zeros(hidd_num)

        # print(len(self.params))
        # print(self.params)
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # When using dropout we need to pass a dropout_param dictionary to each
        # dropout layer so that the layer knows the dropout probability and the mode
        # (train / test). You can pass the same dropout_param to each dropout layer.
        self.dropout_param = {}
        if self.use_dropout:
            self.dropout_param = {'mode': 'train', 'p': dropout}
            if seed is not None:
                self.dropout_param['seed'] = seed

        # With batch normalization we need to keep track of running means and
        # variances, so we need to pass a special bn_param object to each batch
        # normalization layer. You should pass self.bn_params[0] to the forward pass
        # of the first batch normalization layer, self.bn_params[1] to the forward
        # pass of the second batch normalization layer, etc.
        self.bn_params = []
        if self.normalization == 'batchnorm':
            self.bn_params = [{'mode': 'train'}
                              for i in range(self.num_layers - 1)]
        if self.normalization == 'layernorm':
            self.bn_params = [{} for i in range(self.num_layers - 1)]

        # Cast all parameters to the correct datatype
        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)

    def loss(self, X, y=None):
        """
        Compute loss and gradient for the fully-connected net.

        Input / output: Same as TwoLayerNet above.
        """
        X = X.astype(self.dtype)
        mode = 'test' if y is None else 'train'

        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.use_dropout:
            self.dropout_param['mode'] = mode
        if self.normalization == 'batchnorm':
            for bn_param in self.bn_params:
                bn_param['mode'] = mode
        scores = None
        ############################################################################
        # TODO: Implement the forward pass for the fully-connected net, computing  #
        # the class scores for X and storing them in the scores variable.          #
        #                                                                          #
        # When using dropout, you'll need to pass self.dropout_param to each       #
        # dropout forward pass.                                                    #
        #                                                                          #
        # When using batch normalization, you'll need to pass self.bn_params[0] to #
        # the forward pass for the first batch normalization layer, pass           #
        # self.bn_params[1] to the forward pass for the second batch normalization #
        # layer, etc.                                                              #
        ############################################################################
        cache = {}
        temp_out = X
        for i in range(self.num_layers):
            w = self.params['W' + str(i + 1)]
            b = self.params['b' + str(i + 1)]
            if i == self.num_layers - 1:
                scores, cache['cache' + str(i + 1)] = affine_forward(temp_out, w, b)
            else:
                if self.normalization in ["batchnorm", "layernorm"]:
                    gamma = self.params['gamma' + str(i + 1)]
                    beta = self.params['beta' + str(i + 1)]
                    temp_out, cache['cache' + str(i + 1)] = self.affine_Normal_relu_dropout_forward(
                        temp_out, w, b, self.normalization, gamma, beta, self.bn_params[i])
                else:
                    # temp_out, cache['cache' + str(i + 1)] = affine_relu_forward(temp_out, w, b)
                    temp_out, cache['cache' + str(i + 1)] = self.affine_Normal_relu_dropout_forward(
                        temp_out, w, b, mode=self.normalization)

        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # If test mode return early
        if mode == 'test':
            return scores

        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the backward pass for the fully-connected net. Store the #
        # loss in the loss variable and gradients in the grads dictionary. Compute #
        # data loss using softmax, and make sure that grads[k] holds the gradients #
        # for self.params[k]. Don't forget to add L2 regularization!               #
        #                                                                          #
        # When using batch/layer normalization, you don't need to regularize the   #
        # scale and shift parameters.                                              #
        #                                                                          #
        # NOTE: To ensure that your implementation matches ours and you pass the   #
        # automated tests, make sure that your L2 regularization includes a factor #
        # of 0.5 to simplify the expression for the gradient.                      #
        ############################################################################
        loss, dscores = softmax_loss(scores, y)
        reg_loss = 0.0
        pre_dx = dscores

        for i in reversed(range(self.num_layers)):
            i = i + 1
            reg_loss = np.sum(np.square(self.params['W' + str(i)]))
            loss += reg_loss * 0.5 * self.reg

            # the last layer
            if i == self.num_layers:
                pre_dx, dw, db = affine_backward(
                    pre_dx, cache['cache' + str(i)])
            else:
                if self.normalization in ["batchnorm", "layernorm"]:
                    pre_dx, dw, db, dgamma, dbeta = self.affine_Normal_relu_dropout_backward(
                        pre_dx, cache['cache' + str(i)], self.normalization)
                    grads['gamma' + str(i)] = dgamma
                    grads['beta' + str(i)] = dbeta
                else:
                    pre_dx, dw, db, _, _ = self.affine_Normal_relu_dropout_backward(
                        pre_dx, cache['cache' + str(i)], self.normalization)

            # only the weights are L2-regularized; biases and gamma/beta are not
            dw += self.reg * self.params['W' + str(i)]
            grads['W' + str(i)] = dw
            grads['b' + str(i)] = db

        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return loss, grads

    def affine_Normal_relu_dropout_forward(self, x, w, b, mode, gamma=None, beta=None, bn_params=None):
        # affine -> (batch/layer norm) -> relu -> (dropout), returning one combined cache
        Normal_cache = None
        dp_cache = None
        a, fc_cache = affine_forward(x, w, b)
        if mode == "batchnorm":
            mid, Normal_cache = batchnorm_forward(a, gamma, beta, bn_params)
        elif mode == "layernorm":
            mid, Normal_cache = layernorm_forward(a, gamma, beta, bn_params)
        else:
            mid = a

        dp, relu_cache = relu_forward(mid)
        if self.use_dropout:
            out, dp_cache = dropout_forward(dp, self.dropout_param)
        else:
            out = dp
        cache = (fc_cache, Normal_cache, relu_cache, dp_cache)

        return out, cache

    def affine_Normal_relu_dropout_backward(self, dout, cache, mode):
        # backward through dropout -> relu -> (batch/layer norm) -> affine
        fc_cache, Normal_cache, relu_cache, dp_cache = cache
        dgamma = 0.0
        dbeta = 0.0
        if self.use_dropout:
            ddp = dropout_backward(dout, dp_cache)
        else:
            ddp = dout
        da = relu_backward(ddp, relu_cache)
        if mode == "batchnorm":
            dmid, dgamma, dbeta = batchnorm_backward_alt(da, Normal_cache)
        elif mode == "layernorm":
            dmid, dgamma, dbeta = layernorm_backward(da, Normal_cache)
        else:
            dmid = da
        dx, dw, db = affine_backward(dmid, fc_cache)

        return dx, dw, db, dgamma, dbeta
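
A minimal smoke test for the class above, assuming the layer functions from cs231n/layers.py are imported; the shapes and hyperparameters below are arbitrary:

np.random.seed(231)
N, D, H1, H2, C = 2, 15, 20, 30, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size=(N,))

model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,
                          dropout=0.75, normalization='batchnorm',
                          reg=0.1, dtype=np.float64, seed=123)
loss, grads = model.loss(X, y)
print('initial loss:', loss)               # roughly ln(10) plus the small reg term
for name in sorted(grads):
    assert grads[name].shape == model.params[name].shape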