PyTorch learning notes (6) - Model selection (K-fold cross-validation), underfitting, overfitting (weight decay = L2-norm regularization, dropout), forward propagation and backward propagation

The content below is basically based on the book "Dive into Deep Learning" (动手学深度学习), and the figures are also adopted from it.

First, distinguish the training error (the error the model exhibits on the training data set) from the generalization error (the expected error the model exhibits on an arbitrary test data sample).

Model selection

  Validation data set, also known as validation set: a small portion of data reserved outside the training set and test set, used for selecting the model.

  When training data is scarce, reserving a large validation set is a luxury. A commonly used method in this case is K-fold cross-validation. The principle: split the training set into k non-overlapping sub-datasets (SubDataset), then run model training and validation k times. In each run, one SubDataset serves as the validation set and the remaining k-1 SubDatasets form the training set. Finally, average the k training errors and the k validation errors.
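  A rough sketch of the splitting step (the function name get_k_fold_data and the shapes are my own illustration, not code from the book): the i-th fold is held out for validation and the other k-1 folds are concatenated with torch.cat for training.

import torch

def get_k_fold_data(k, i, X, y):
    # return the training and validation data for the i-th of k folds
    assert k > 1
    fold_size = X.shape[0] // k
    X_train, y_train = None, None
    for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size)
        X_part, y_part = X[idx, :], y[idx]
        if j == i:
            X_valid, y_valid = X_part, y_part   # fold i becomes the validation set
        elif X_train is None:
            X_train, y_train = X_part, y_part
        else:
            X_train = torch.cat((X_train, X_part), dim=0)   # the other k-1 folds form the train set
            y_train = torch.cat((y_train, y_part), dim=0)
    return X_train, y_train, X_valid, y_valid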

Underfitting: the model cannot achieve a low training error, i.e., it cannot reduce the training error further.

Overfitting: the model's training error is much smaller than its error on the test set.

There are two ways to deal with underfitting and overfitting. First, choose a model whose complexity is appropriate for the data (if model complexity is too high, overfitting is likely; if too low, underfitting is likely). Second, consider the size of the training data set (a training set that is too small easily leads to overfitting; with more training data, overfitting is less likely).

torch.pow(): raises a tensor to a power (pow is short for power). For example, to square a tensor a, use torch.pow(a, 2).

torch.cat((A, B), dim): cat is short for concatenate (splice together). The reference blog https://www.cnblogs.com/JeasonIsCoding/p/10162356.html explains it well, thanks to the blogger. To add a bit more: it connects tensors A and B, enlarging dimension dim; for two matrices, dim = 1 enlarges the columns, i.e., splices them sideways.
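A quick illustration of both functions (the tensors a and b below are made-up examples):

import torch

a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[5., 6.], [7., 8.]])
print(torch.pow(a, 2))           # element-wise square of a
print(torch.cat((a, b), dim=0))  # stack vertically, result shape (4, 2)
print(torch.cat((a, b), dim=1))  # splice sideways, result shape (2, 4)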

torch.utils.data.TensorDataset(x, y): roughly means integrating x and y so that they correspond, i.e., each row of x corresponds to the same row of y.

torch.utils.data.DataLoader(dataset=dataset, batch_size=batch_size, shuffle=True, num_workers=2): dataset (built with TensorDataset); batch_size (mini-batch size); shuffle (whether to shuffle the data); num_workers (number of worker threads).
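A minimal sketch of putting the two together (the toy tensors and the batch size are only for illustration):

import torch
import torch.utils.data as Data

x = torch.rand(10, 3)   # 10 samples, 3 features each
y = torch.rand(10, 1)   # 10 corresponding labels

dataset = Data.TensorDataset(x, y)   # pair each row of x with the corresponding row of y
data_iter = Data.DataLoader(dataset=dataset, batch_size=4, shuffle=True, num_workers=2)

for batch_x, batch_y in data_iter:
    print(batch_x.shape, batch_y.shape)   # batches of up to 4 (x, y) pairs in shuffled order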

Weight decay

  Weight decay is also known as L2-norm regularization: an L2-norm penalty term is added to the original loss function.
  The general norm formula is $ \left \| x \right \|_{p} = (\sum_{i=1}^{n} |x_{i}|^{p})^{1/p} $, and the L2 norm is $ \left \| x \right \|_{2} = (\sum_{i=1}^{n} |x_{i}|^{2})^{1/2} $.

  The new loss function with the L2-norm penalty term is $ \ell(w_{1}, w_{2}, b) + \frac{\lambda }{2n}\left \| \mathbf{w} \right \|^{2} $. torch.norm(input, p=) computes the p-norm of a tensor.
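  A sketch of adding the penalty by hand (the linear model, lambd and the random data are my own illustration; alternatively, the weight_decay argument of torch.optim.SGD applies the same L2 penalty for you):

import torch

def l2_penalty(w):
    return (w ** 2).sum() / 2   # the L2-norm penalty term (without the lambda/n factor)

w = torch.randn(3, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
X, y = torch.randn(5, 3), torch.randn(5, 1)
lambd = 0.01

loss = ((X.mm(w) + b - y) ** 2).mean() + lambd * l2_penalty(w)   # original loss plus the penalty
loss.backward()   # gradients now include the weight-decay term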

Dropout

  Hidden units are dropped with a certain probability. Dropout recomputes each new hidden unit using the following formula:

  $ h_{i}^{'} = \frac{\xi _{i}}{1-p}h_{i} $

  where the hidden unit is $ h_{i} = \phi (x_{1} w_{1i} + x_{2} w_{2i} + x_{3} w_{3i} + x_{4} w_{4i} + b_{i}) $, and the random variable $ \xi_{i} $ takes the value 0 with probability $p$ and 1 with probability $1-p$.

import torch

def dropout(X, drop_prob):
    X = X.float()
    assert 0 <= drop_prob <= 1   # drop_prob must lie between 0 and 1, same idea as an assertion in a database
    keep_prob = 1 - drop_prob
    # in this case all elements are dropped
    if keep_prob == 0:   # keep_prob = 0 is equivalent to 1-p = 0, i.e. the probability that $\xi_{i}$ equals 1 is 0
        return torch.zeros_like(X)
    mask = (torch.rand(X.shape) < keep_prob).float()  # torch.rand() samples a uniform distribution; the < comparison returns 1 where true, 0 otherwise
    return mask * X / keep_prob  # implements the formula for recomputing the new hidden units
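A quick check of the function above on a small made-up tensor:

X = torch.arange(16).view(2, 8)
print(dropout(X, 0))     # drop_prob = 0: every element is kept unchanged
print(dropout(X, 0.5))   # about half the elements become 0, survivors are divided by 0.5
print(dropout(X, 1.0))   # drop_prob = 1: all elements are dropped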

model.train(): enables BatchNormalization and Dropout

model.eval(): disables BatchNormalization and Dropout
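A small illustrative example (the network below is made up only to show the two modes):

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Dropout(0.5), nn.Linear(4, 1))

net.train()                    # training mode: Dropout randomly zeroes activations
train_out = net(torch.rand(2, 8))

net.eval()                     # evaluation mode: Dropout is disabled, output is deterministic
with torch.no_grad():
    eval_out = net(torch.rand(2, 8))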

Forward propagation and backward propagation

  When training a deep learning model, forward propagation and backward propagation depend on each other. If points 1 and 2 below are hard to follow, read sections 3.14.1 and 3.14.2 of "Dive into Deep Learning" first.

  1. The computation of forward propagation may depend on the current values of the model parameters, and these parameters are iterated by the optimization algorithm after backward propagation has computed the gradients.

    For example, the regularization term $ s = ({\lambda }/{2})(\left \| W^{(1)} \right \|_{F}^{2} + \left \| W^{(2)} \right \|_{F}^{2}) $ depends on the current values of the model parameters $W^{(1)}$ and $W^{(2)}$, and these current values were obtained by the optimization algorithm in its most recent iteration, after the gradients were computed by backward propagation.

  2. The gradient computation of backward propagation may depend on the current values of the variables, and these current values are obtained through forward propagation.

    For example, computing the parameter gradient $ \frac{\partial J}{\partial W^{(2)}} = \frac{\partial J}{\partial o}h^{T} + \lambda W^{(2)} $ depends on the current value of the hidden-layer variable $h$, and this current value is computed and stored by forward propagation from the input layer to the output layer.
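  A minimal sketch of this interdependence in PyTorch (the tiny network, data and hyperparameters are illustrative): the forward pass computes and stores h, backward() uses the stored h and the current parameter values, and the optimizer step produces the parameter values used by the next forward pass.

import torch

W1 = torch.randn(3, 4, requires_grad=True)
W2 = torch.randn(4, 1, requires_grad=True)
opt = torch.optim.SGD([W1, W2], lr=0.1, weight_decay=0.01)   # weight_decay adds the L2 penalty term

x, y = torch.randn(5, 3), torch.randn(5, 1)
for _ in range(3):
    h = torch.relu(x.mm(W1))              # forward propagation: h is computed and stored
    loss = ((h.mm(W2) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()                       # backward propagation uses the stored h and the current W2
    opt.step()                            # updates W1, W2 used by the next forward pass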


Origin www.cnblogs.com/JadenFK3326/p/12142974.html