In plain terms: a look at the various kinds of normalization, and an implementation of Layer Normalization with deeplearning4j

1. What is Normalization

    In a word, normalization is a way of squeezing a set of data toward a distribution with a mean of 0 and a variance of 1. Concretely, the mean of the set is subtracted from every element, and the result is divided by the standard deviation. The formula is as follows (ignore the parameter g for now; g is a strange story that I will come back to later):

    y_i = g · (x_i − μ) / σ,   where   μ = (1/H) Σ_{i=1}^{H} x_i   and   σ = sqrt( (1/H) Σ_{i=1}^{H} (x_i − μ)² )

    Put more plainly, the formula just translates and scales each value to get a new value. One reason this is harmless is that translation and scaling do not change the shape of the original distribution: the largest value is still the largest, and the smallest is still the smallest.
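    As a concrete illustration, here is a minimal plain-Java sketch (with made-up numbers) of exactly this shift-and-scale operation:

public class StandardizeExample {
	public static void main(String[] args) {
		double[] x = { 1.0, 2.0, 3.0, 4.0, 5.0 }; // made-up data

		// mean
		double mean = 0;
		for (double v : x) mean += v;
		mean /= x.length;

		// (biased) variance and standard deviation
		double var = 0;
		for (double v : x) var += (v - mean) * (v - mean);
		var /= x.length;
		double std = Math.sqrt(var);

		// shift and scale: every value keeps its relative order
		double[] y = new double[x.length];
		for (int i = 0; i < x.length; i++) y[i] = (x[i] - mean) / std;

		System.out.println(java.util.Arrays.toString(y)); // mean ≈ 0, variance ≈ 1
	}
}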

    There are many normalization methods in deep learning, such as BN, LN, IN and GN. The normalization formula is the same for all of them; only the axes over which the statistics are computed differ. BN computes its statistics along the minibatch direction, while LN computes them along the dimension of the hidden layer's output vector. For example, for a four-dimensional tensor [minibatch, depth, height, width], LN works along the depth direction, also reducing the height and width dimensions.
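    To make the difference in axes concrete, here is a small ND4J sketch; the dimension indices assume the [minibatch, depth, height, width] layout above, and GN's extra channel-grouping step is omitted:

import java.util.Arrays;

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class NormalizationAxes {
	public static void main(String[] args) {
		// random 4-D activations laid out as [minibatch, depth, height, width]
		INDArray x = Nd4j.rand(new int[] { 8, 16, 32, 32 });

		// BN: one mean per channel, computed across minibatch, height and width
		INDArray bnMean = x.mean(0, 2, 3); // shape [16]

		// LN: one mean per sample, computed across depth, height and width
		INDArray lnMean = x.mean(1, 2, 3); // shape [8]

		// IN: one mean per (sample, channel) pair, computed across height and width
		INDArray inMean = x.mean(2, 3); // shape [8, 16]

		System.out.println(Arrays.toString(bnMean.shape()));
		System.out.println(Arrays.toString(lnMean.shape()));
		System.out.println(Arrays.toString(inMean.shape()));
	}
}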

2. Layer Normalization

    Paper address: https://arxiv.org/pdf/1607.06450.pdf

    Layer Normalization works particularly well on time-series data. Below is an excerpt of the relevant equations from the paper, which apply Layer Normalization to an RNN:

    a^t = W_hh · h^{t-1} + W_xh · x^t
    μ^t = (1/H) Σ_{i=1}^{H} a_i^t        σ^t = sqrt( (1/H) Σ_{i=1}^{H} (a_i^t − μ^t)² )
    h^t = f[ (g / σ^t) ⊙ (a^t − μ^t) + b ]
        Briefly, on the paper's notation: a^t is the vector of summed inputs of the RNN at time step t (it has not yet passed through the activation function f), and h^t is the hidden-layer output of the RNN at time step t.

        So which dimension is normalized here? Assume the RNN input tensor has shape [minibatch, layerSize, timeSteps]; the normalization is then along the layerSize dimension. If that is still too abstract, see the figure below:

    [Figure: a [minibatch, layerSize, timeSteps] tensor, with a red arrow along the layerSize vector at each time step]

    Normalization is applied to the vector indicated by each red arrow: for one sample, at every time step a mean and a variance are computed over the layerSize values and the values are transformed, and the same is done for the next time step. Multiple time steps therefore yield multiple means and variances.
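    A minimal plain-Java sketch of this per-time-step normalization (shapes and values are made up, and the g and b parameters of the next section are left out):

public class PerTimestepLayerNorm {
	public static void main(String[] args) {
		int minibatch = 2, layerSize = 4, timeSteps = 3;
		double eps = 1e-5;

		// made-up activations with shape [minibatch, layerSize, timeSteps]
		double[][][] x = new double[minibatch][layerSize][timeSteps];
		for (int m = 0; m < minibatch; m++)
			for (int l = 0; l < layerSize; l++)
				for (int t = 0; t < timeSteps; t++)
					x[m][l][t] = m + l * 0.5 + t * 0.1;

		double[][][] y = new double[minibatch][layerSize][timeSteps];
		// one mean/variance per (sample, time step), computed over the layerSize dimension
		for (int m = 0; m < minibatch; m++) {
			for (int t = 0; t < timeSteps; t++) {
				double mean = 0;
				for (int l = 0; l < layerSize; l++) mean += x[m][l][t];
				mean /= layerSize;

				double var = 0;
				for (int l = 0; l < layerSize; l++) var += (x[m][l][t] - mean) * (x[m][l][t] - mean);
				var /= layerSize;

				double std = Math.sqrt(var + eps);
				for (int l = 0; l < layerSize; l++) y[m][l][t] = (x[m][l][t] - mean) / std;
			}
		}
		System.out.println(java.util.Arrays.deepToString(y));
	}
}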

3. The parameters g and b

    Here, g and b have the same dimension as h, i.e. the per-time-step vector in the figure above, whose length is layerSize. g and b are learned together with the other network parameters. At initialization, all elements of g are usually set to 1 and all elements of b to 0. As training progresses, g and b drift to arbitrary values, so the output of the layer is no longer strictly zero-mean and unit-variance: it is equivalent to multiplying the layerSize dimension by a learned vector (note that this is an element-wise product) and adding another, which amounts to injecting a learned perturbation. That raises a question of interpretation and looks a bit unreasonable, yet, mysteriously, it works.
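    A tiny sketch of how g and b are applied on top of the standardized values (names and numbers are made up):

public class GainAndBias {
	// y = g ⊙ standardize(x) + b, applied along the layerSize dimension
	static double[] apply(double[] xHat, double[] g, double[] b) {
		double[] y = new double[xHat.length];
		for (int i = 0; i < xHat.length; i++) {
			y[i] = g[i] * xHat[i] + b[i]; // element-wise product plus bias
		}
		return y;
	}

	public static void main(String[] args) {
		int layerSize = 4;
		double[] g = new double[layerSize];
		double[] b = new double[layerSize];
		java.util.Arrays.fill(g, 1.0); // initialized to 1
		java.util.Arrays.fill(b, 0.0); // initialized to 0

		double[] xHat = { -1.2, 0.3, 0.4, 0.5 }; // already-standardized values (made up)
		System.out.println(java.util.Arrays.toString(apply(xHat, g, b)));
	}
}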

4. Implementing Layer Normalization with deeplearning4j's automatic differentiation

import java.util.Map;

import org.deeplearning4j.nn.conf.inputs.InputType;
import org.deeplearning4j.nn.conf.layers.samediff.SDLayerParams;
import org.deeplearning4j.nn.conf.layers.samediff.SameDiffLayer;
import org.nd4j.autodiff.samediff.SDVariable;
import org.nd4j.autodiff.samediff.SameDiff;
import org.nd4j.linalg.api.ndarray.INDArray;

public class LayerNormalization extends SameDiffLayer {

	// gain * standardize(x) + bias

	private double eps = 1e-5;

	private static final String GAIN = "gain";
	private static final String BIAS = "bias";

	private int nOut;
	private int timeStep;

	public LayerNormalization(int nOut, int timeStep) {
		this.timeStep = timeStep;
		this.nOut = nOut;
	}

	protected LayerNormalization() {

	}

	@Override
	public InputType getOutputType(int layerIndex, InputType inputType) {
		return InputType.recurrent(nOut);
	}

	@Override
	public void defineParameters(SDLayerParams params) {
		params.addWeightParam(GAIN, 1, nOut, 1);
		params.addWeightParam(BIAS, 1, nOut, 1);
	}

	@Override
	public SDVariable defineLayer(SameDiff sd, SDVariable layerInput, Map<String, SDVariable> paramTable,
			SDVariable mask) {
		SDVariable gain = paramTable.get(GAIN); // g in the paper
		SDVariable bias = paramTable.get(BIAS); // b in the paper
		SDVariable mean = layerInput.mean("mean", true, 1); // mean over the layerSize dimension
		SDVariable variance = sd.math().square(layerInput.sub(mean)).sum(true, 1).div(layerInput.getShape()[1]); // variance over the layerSize dimension
		SDVariable standardDeviation = sd.math().sqrt("standardDeviation", variance.add(eps)); // standard deviation; eps keeps the denominator away from zero
		long[] maskShape = mask.getShape();
		return gain.mul(layerInput.sub(mean).div(standardDeviation)).add(bias)
				.mul(mask.reshape(maskShape[0], 1, timeStep)); // mask out the padded time steps

	}

	@Override
	public void initializeParameters(Map<String, INDArray> params) {
		params.get(GAIN).assign(1);
		params.get(BIAS).assign(0);
	}

	public int getNOut() {
		return nOut;
	}

	public void setNOut(int nOut) {
		this.nOut = nOut;
	}

	public int getTimeStep() {
		return timeStep;
	}

	public void setTimeStep(int timeStep) {
		this.timeStep = timeStep;
	}

}
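
    For completeness, here is a sketch of how such a custom SameDiff layer could be plugged into a network; the LSTM/RnnOutputLayer choice, the sizes, and the hyperparameters are assumptions for illustration, not from the original experiment:

import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.LSTM;
import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class LayerNormExample {
	public static void main(String[] args) {
		int nIn = 128, layerSize = 256, nClasses = 2, timeSteps = 50; // made-up sizes

		MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
				.list()
				.layer(new LSTM.Builder().nIn(nIn).nOut(layerSize)
						.activation(Activation.TANH).build())
				.layer(new LayerNormalization(layerSize, timeSteps)) // the custom layer above
				.layer(new RnnOutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
						.activation(Activation.SOFTMAX)
						.nIn(layerSize).nOut(nClasses).build())
				.build();

		MultiLayerNetwork net = new MultiLayerNetwork(conf);
		net.init();
		System.out.println(net.summary());
	}
}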

5. Results in practice

    For text classification with an RNN, adding LN made convergence more stable, but the accuracy dropped considerably.

    In deep learning, every method is proposed to solve a particular problem; whether it helps on your specific problem has to be judged case by case.

 

Happiness comes from sharing.

    This blog post is the author's original work; please credit the source when reposting.

 
