Neural Networks for Speech and Language Processing

       The title above captures one idea very well: "brute force produces miracles." When the number of neurons is large enough, what a machine can do becomes very complicated and hard to understand. Is this true? Does it explain why the success of ChatGPT is also due to its sheer size?

       A modern neural network is a network of small computational units, each of which takes a vector of input values and produces a single output value. The structure we introduce here is called a feedforward network because computation proceeds layer by layer, from units in one layer to units in the next. The use of modern neural networks is often referred to as deep learning because modern networks are usually deep (they have many layers). Neural networks share many of the same mathematical principles as logistic regression, but a neural network is a much more powerful classifier: in fact, a minimal neural network (technically, one with a single hidden layer) can in principle approximate any function.

       Neural network classifiers differ from logistic regression in another way. With logistic regression, we apply the classifier to many different tasks by hand-crafting rich feature templates based on domain knowledge. When using neural networks, it is more common to avoid most hand-derived features and instead build networks that take raw words as input and induce features as part of the process of learning to classify. Deep networks are particularly good at this kind of representation learning, which makes them well suited to large-scale problems, where there is enough data to learn features automatically. In other words, in the era of logistic regression, features had to be found by hand and separate weights trained for each of them; in the era of deep learning, the model finds the features itself, saving that manual labor.

1. Neurons

       The building block of a neural network is a single computational unit. A unit takes a set of real numbers as input, performs some computation on them, and produces an output. At its core, a neural unit computes a weighted sum of its inputs, plus one additional term called a bias term. Given a set of inputs x1...xn, a unit has a corresponding set of weights w1...wn and a bias b, so the weighted sum z can be expressed as:

z = w1·x1 + w2·x2 + ... + wn·xn + b

       It is often more convenient to express this weighted sum using vector notation; recall from linear algebra that a vector is essentially a list or array of numbers. We can therefore describe z in terms of a weight vector w, a scalar bias b, and an input vector x, replacing the sum with the convenient dot product:

z = w · x + b

       Finally, instead of using z, a linear function of x, directly as the output, neural units apply a nonlinear function f to z. We call the output of this function the activation value of the unit, written a. Since we are modeling only a single unit here, the activation of the node is in fact the final output of the network, which we usually call y. So the value of y is defined as:

y = a = f(z)

      We will discuss three popular nonlinear functions f below (the sigmoid, tanh, and the rectified linear unit, or ReLU), but it is convenient to start with the sigmoid function:

σ(z) = 1 / (1 + e^(−z))

[Figure: the curve of the sigmoid function]

 The above is the curve of the sigmoid function. It has very useful squashing behavior, mapping any real-valued input into the range (0, 1). Substituting sigmoid into the expression for the unit gives:

y = σ(w · x + b) = 1 / (1 + exp(−(w · x + b)))

In the figure below, a unit takes 3 input values x1, x2, and x3. To compute a weighted sum, each value is multiplied by a weight (w1, w2, and w3, respectively), the results are added together with a bias term b, and the resulting sum is passed through the sigmoid function to get a number between 0 and 1:

[Figure: a neural unit with inputs x1, x2, x3, weights w1, w2, w3, bias b, and sigmoid output y]
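As a minimal sketch of this computation in NumPy (the particular input, weight, and bias values here are illustrative assumptions, not values from the figure):

```python
import numpy as np

def sigmoid(z):
    """Squash a real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values for a unit with three inputs.
x = np.array([0.5, 0.6, 0.1])   # inputs x1, x2, x3
w = np.array([0.2, 0.3, 0.9])   # weights w1, w2, w3
b = 0.5                         # bias term

z = np.dot(w, x) + b            # weighted sum: z = w · x + b
y = sigmoid(z)                  # activation: y = sigma(z)

print(f"z = {z:.2f}, y = {y:.2f}")   # z = 0.87, y = 0.70
```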

In practice, sigmoid is not used very often as an activation function. A very similar but almost always better function is tanh, which ranges from −1 to +1:

tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))

 In addition, there is the ReLU (rectified linear unit), which is simply:

ReLU(z) = max(z, 0)
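To make the ranges concrete, here is a small sketch comparing the three activations on a few arbitrary sample values (again assuming NumPy):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # ReLU(z) = max(z, 0), applied element-wise
    return np.maximum(z, 0.0)

zs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("sigmoid:", np.round(sigmoid(zs), 3))   # values in (0, 1)
print("tanh:   ", np.round(np.tanh(zs), 3))   # values in (-1, 1)
print("relu:   ", np.round(relu(zs), 3))      # values in [0, inf)
```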

       Early in the history of neural networks, it was realized that the power of these networks, like that of the real neurons that inspired them, comes from combining units into larger networks.

 2. Neural network

       As is well known, a single neuron cannot compute the XOR function, because XOR is not linearly separable (if this is unfamiliar, it is worth looking up).

       While the XOR function cannot be computed by a single perceptron, it can be computed by a layered network of units. In particular, it is possible to compute XOR with two layers of ReLU-based units: a middle layer (call it h) with two units, and an output layer (call it y) with one unit. One set of weights and biases that correctly computes XOR (consistent with the walkthrough below) gives both hidden units input weights [1, 1], with hidden biases 0 and −1, and gives the output unit weights [1, −2] and bias 0.

       Let's see what happens when we input x = [0 0]. If we multiply each input value by the appropriate weights, sum, and add the bias vector b, we get the vector [0 −1]; applying the rectified linear transformation then makes the output of layer h equal to [0 0]. Next we multiply by the output weights, sum, and add the output bias (in this case 0), resulting in y = 0. By working through the remaining three possible input pairs, the reader can verify that the inputs [0 1] and [1 0] yield a y value of 1, while [0 0] and [1 1] yield a y value of 0.

       It is also instructive to look at the intermediate results (the outputs of the two hidden nodes h0 and h1). In the previous paragraph, we showed that the h vector for the input x = [0 0] is [0 0]. Notice that the hidden representations of the two input points x = [0 1] and x = [1 0] (both cases where the XOR output is 1) are merged into a single point h = [1 0]. This merging makes it easy to linearly separate the positive and negative cases of XOR. In other words, we can view the hidden layer of the network as forming a representation of the input, as the sketch below verifies.
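Here is a minimal NumPy sketch of this two-layer ReLU network, using the weights and biases given above, checking all four inputs:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Weights and biases for the two-layer ReLU solution to XOR
# described above.
W = np.array([[1.0, 1.0],      # weights into hidden unit h0
              [1.0, 1.0]])     # weights into hidden unit h1
b = np.array([0.0, -1.0])      # hidden biases
u = np.array([1.0, -2.0])      # output weights
c = 0.0                        # output bias

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    h = relu(W @ np.array(x, dtype=float) + b)   # hidden layer
    y = u @ h + c                                # output unit
    print(f"x = {x} -> h = {h} -> y = {y:.0f}")
```

Running this prints y = 0 for [0, 0] and [1, 1], and y = 1 for [0, 1] and [1, 0], with both of the latter sharing the hidden representation h = [1 0].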

 3. Feedforward neural network

       Now let's introduce a little formality for the simplest kind of neural network, the feedforward network. A feedforward network is a multi-layer network in which there are no recurrent connections among the units: the outputs of the units in each layer are passed to the units in the next layer, and no output is passed back to a lower layer. At the heart of such a network is a hidden layer made up of hidden units, each of which is a neural unit whose inputs are weighted, summed, and passed through a nonlinearity. In the standard architecture, each layer is fully connected, meaning that each unit takes as input the outputs of all units in the previous layer; there is a connection between every pair of units in adjacent layers. Each hidden unit therefore sums over all of the input units. A minimal sketch of one forward pass appears below.
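As a sketch of a full forward pass through one hidden layer (assuming NumPy, with randomly initialized weights, illustrative layer sizes, and a softmax output layer, none of which are specified above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 3, 2          # illustrative layer sizes
W = rng.normal(size=(n_hidden, n_in))    # input-to-hidden weights (fully connected)
b = rng.normal(size=n_hidden)            # hidden biases
U = rng.normal(size=(n_out, n_hidden))   # hidden-to-output weights

x = rng.normal(size=n_in)                # a random input vector
h = sigmoid(W @ x + b)                   # hidden layer: h = sigma(Wx + b)
y = softmax(U @ h)                       # output distribution over classes
print("h =", np.round(h, 3))
print("y =", np.round(y, 3), "(sums to 1)")
```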

 
