These three concepts are backpropagation, gradient descent, and the loss function.
The basic idea of training a neural network is to first "guess" a result, called the predicted result, compare it with the known true result, and then adjust the next guess according to the gap between the two, repeating the process until the prediction is close enough.
In neural network training, this "guessing" is called initialization. The initial value can be random, or it can be chosen based on previous experience. Even "guessing" is done scientifically.
These three concepts are closely linked: when you discuss one, you inevitably end up discussing the others. Since the loss function is the most involved of the three, we will introduce it in detail in the next chapter.
Here are a few examples to illustrate these three concepts intuitively.
A and B play a number-guessing game within an agreed range of numbers:
A: I'm going to guess 5
B: Too low
A: 50
B: That's a bit high
A: 30
B: Low
...
In this game:
- Purpose: B thinks of a number at random, and A needs to guess it.
- Initialization: A guesses 5;
- Forward calculation: A guesses the new number each time;
- Loss function: B compares the number guessed by A with the number in his mind and comes to the conclusion that A's guess is "higher" or "lower" than his own number;
- Backpropagation: B tells A "too high" or "too low";
- Gradient descent: A adjusts the next round's guess according to B's feedback.
What is the loss function here? It is just "too low" or "a bit high", which is very imprecise! Still, this "so-called" loss function gives two pieces of information:
- Direction: bigger or smaller
- Degree: "too", "a little bit", but very vague
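Even this rough game can be written as a tiny feedback loop. The sketch below is not code from this book; the secret number, the starting guess, and the halving step size are all made-up choices. It only illustrates that A keeps moving the guess in whatever direction B's feedback indicates, with smaller corrections as the answers get closer.

```python
# A sketch of the guessing game as a feedback loop (all numbers are made up).
def play(secret: int, first_guess: int = 5, step: float = 50.0) -> int:
    guess = first_guess                  # initialization: the first wild guess
    for _ in range(100):                 # safety cap on the number of rounds
        if guess == secret:              # B: "you got it"
            break
        if guess < secret:               # B: "too low"  -> move the guess up
            guess = round(guess + step)
        else:                            # B: "too high" -> move the guess down
            guess = round(guess - step)
        step = max(step / 2, 1)          # take smaller steps as we close in
    return guess

print(play(secret=37))                   # finds 37 after a handful of rounds
```

B's answers play the role of the loss function (a direction plus a vague degree), and the shrinking step is a crude stand-in for gradient descent.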
Suppose there is a black box, as shown in Figure 2-1.
Figure 2-1 Black box
We can only see the input and output values, not the inside of the black box. When the input is 1, the output is 2.334, and the black box displays a message: "I need the output value to be 4." We then try inputting 2, and the output becomes 5.332, which is much larger than 4. So our first loss value is $2.334 - 4 = -1.666$, and the second loss value is $5.332 - 4 = 1.332$.
Here, our loss function is a simple subtraction: subtract the target value from the actual value. Yet it tells you two pieces of information: 1) the direction, i.e. whether the output is larger or smaller than the target; 2) the difference, e.g. 0.1 or 1.1. Together these give us the basis for the next guess.
- Purpose: to guess an input value so that the output of the black box is 4;
- Initialization: input 1;
- Forward calculation: the mathematical logic inside the black box;
- Loss function: at the output, subtract four from the output value;
- Backpropagation: tell the guesser the difference and the direction of the difference;
- Gradient descent: At the input, according to the difference, determine the next guess value.
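A minimal sketch of this input-guessing loop, assuming a hypothetical `black_box` function as a stand-in for the hidden logic (chosen only so that it reproduces the two observations above: output 2.334 for input 1 and 5.332 for input 2); the learning rate of 0.1 is likewise an assumed value.

```python
# Drive an unknown black box until its output reaches the requested value of 4.
def black_box(x: float) -> float:
    # Hypothetical internals, hidden from the guesser in the story; chosen so that
    # black_box(1) = 2.334 and black_box(2) = 5.332, matching the text.
    return 2.998 * x - 0.664

target = 4.0
x = 1.0                                  # initialization: the first guess of the input
for step in range(50):
    y = black_box(x)                     # forward calculation
    loss = y - target                    # loss = actual value minus target value
    if abs(loss) < 0.001:                # close enough: stop guessing
        break
    x -= 0.1 * loss                      # adjust x against the sign and size of the loss
print(round(x, 4), round(black_box(x), 4))   # x settles near 1.556; output near 4
```

The loss carries both pieces of information described above: its sign says whether to push the input down or up, and its size says roughly by how much.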
Xiao Ming takes a rifle and shoots at a target 100 meters away. The rifle has no front sight, or the front sight is faulty, or Xiao Ming's eyesight is not good enough to see the target clearly, or it is foggy, or windy, or the lateral gravitational field is abnormal due to the influence of Jupiter... In any case, there are all kinds of potential interferences.
After the first test shot, Xiao Ming pulls the target back and sees that his shot veered to the left. So for the second test, Xiao Ming deliberately shifts the gun a few millimeters to the right, shoots, and checks the bullet hole on the target again. By repeating this procedure several times, Xiao Ming comes to know the temperament of this rifle. Figure 2-2 shows Xiao Ming's five test shots.
Figure 2-2 The bullet point record of shooting
In supervised learning, we need to measure the difference between the neural network's output and the expected output. This error function should reflect, in a quantified way, the degree of inconsistency between the current network output and the actual result: the larger the function value, the less accurate the model's prediction.
In this example, Xiao Ming's goal is to hit the center of the target. The outermost ring is worth 1 point; the closer to the center, the higher the score: 2, 3, 4 points and so on for each successive ring, and 10 points for hitting the bullseye.
- The gap between the bullet hole and the bullseye of each test shot is called the error, which can be represented by an error function, such as the absolute value of the gap, as shown in the red line in the figure.
- Xiao Ming shot a total of five times, which means there were five iterations of training.
- After each shot, the process of pulling the target back and looking at the impact point and then adjusting the shooting angle for the next shot is called backpropagation. Note that there is a fundamental difference between pulling the target back and looking at it versus running to the front to observe the target. The latter is life-threatening because there might be other shooters at the range. A rough analogy to mathematical concepts is that running in front of a target to look at it is called forward differentiation; pulling the target back to look at it is called reverse differentiation.
- The magnitude and direction of each adjustment is called the gradient. For example, adjust 1 mm to the right, or 2 mm toward the lower left, as shown by the green vector in the figure.
The figure above shows Xiao Ming firing only one bullet at a time, which means the number of samples per training step is one. In actual neural network training, we usually use multiple samples for batch training, to avoid the error introduced by any single sample. In this example, multiple samples can be pictured as a burst of shots: assume three bullets can be fired in succession each time, with a similar degree of dispersion, as shown in Figure 2-3.
Figure 2-3 Record of bullet bursts
- If three bullets are fired in succession, we take the sum of the differences between each bullet's impact point and the bullseye, divide it by three, and call the result the "loss", which can be represented by a loss function.
How far is each of Xiao Ming's shots from the target? In this example, measured by score, Xiao Ming fell short by 9 points, then 8 points, then 2, then 1, and finally 0 points. This expresses the gap between Xiao Ming's shooting and the target as a quantitative result, which is exactly the role of the error function. Because there is only one sample at a time, the term error function is used here; if there are multiple samples at a time, it is called a loss function.
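The distinction can be made concrete in a few lines. This is only an illustrative sketch: the target score of 10 and the sample scores below are made-up values.

```python
# Error for a single sample (one shot) versus loss for a batch (a burst of shots).
def error(score: float, target: float = 10.0) -> float:
    return abs(target - score)           # gap between one shot and the bullseye score

def loss(scores, target: float = 10.0) -> float:
    return sum(error(s, target) for s in scores) / len(scores)   # average over the batch

print(error(1.0))             # a single test shot landing in the 1st ring: error of 9
print(loss([7.0, 8.0, 6.0]))  # a burst of three shots: loss of 3.0
```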
Shooting is not always that simple, though. With a long-range sniper rifle, air resistance and wind speed must also be taken into account; in a neural network, these correspond to the concept of hidden layers.
In this example:
- Purpose: to hit the bullseye;
- Initialization: take a shot at will, as long as you can hit the target, but remember the posture of the rifle at that time;
- Forward calculation: let the bullet fly for a while and hit the target;
- Loss function: the ring number and the deviation angle;
- Backpropagation: pull the target back to observe;
- Gradient descent: adjust the shooting angle of the rifle according to the deviation this time.
The loss function is described as follows:
- 1st ring, 45 degrees to the upper left of bullseye;
- 6th ring, 15 degrees to the upper left of bullseye;
- 7th ring, left of bullseye;
- 8th ring, 15 degrees to the lower left of bullseye;
- bullseye.
The loss function here has two pieces of information:
- Distance;
- Direction.
So the gradient is a vector! It must tell us both the direction and the magnitude.
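As a small illustration, the aim correction can be written as a vector with both parts. The bullet-hole coordinates and the step size below are invented for the sketch and are not taken from Figure 2-2.

```python
import math

# Gradient-like feedback: where the bullet hole sits relative to the bullseye (in mm).
bullseye = (0.0, 0.0)
hit = (-3.0, 4.0)                                   # 3 mm left of and 4 mm above center

grad = (hit[0] - bullseye[0], hit[1] - bullseye[1])
magnitude = math.hypot(*grad)                       # the "value": 5.0 mm off target
direction = (grad[0] / magnitude, grad[1] / magnitude)   # the "direction": a unit vector

# Gradient descent: shift the aim one step against the gradient (assumed step size 0.5).
learning_rate = 0.5
aim_correction = (-learning_rate * grad[0], -learning_rate * grad[1])
print(magnitude, direction, aim_correction)         # 5.0, (-0.6, 0.8), (1.5, -2.0)
```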
The above three examples are relatively simple and easy to understand. Let us revisit the black box: its real purpose is not for us to guess which input makes the output equal to 4. Its practical significance is for us to crack its code! Therefore, we follow this decoding process:
- Record all input and output values, as shown in Table 2-1.
Table 2-1 Sample data table
| Sample ID | Input (feature value) | Output (label) |
|---|---|---|
| 1 | 1 | 2.21 |
| 2 | 1.1 | 2.431 |
| 3 | 1.2 | 2.652 |
| 4 | 2 | 4.42 |
- Build a neural network and give it an initial weight value. Let us first assume that the logic of this black box is $z = x + x^2$;
- Input 1; according to $z = x + x^2$ the output is 2, while the actual output value is 2.21, so the error value is $2 - 2.21 = -0.21$, meaning the prediction is too small;
- Adjust the weight value, to $z = 1.5x + x^2$ for instance; input 1.1 and obtain 2.86 as the output. The actual output is 2.431, so the error value is $2.86 - 2.431 = 0.429$, meaning the prediction is too large;
- Adjust the weight value again, say to $z = 1.2x + x^2$, then input 1.2...
- Adjust the weight value once more, and then input 2...
- Traverse all the samples once and calculate the average loss function value;
- Repeat the forward-calculate-and-adjust steps above until the loss function value is less than a chosen threshold, such as $0.001$. We can then consider the network trained and the black box "cracked", though it has really only been copied: the neural network cannot recover the true function inside the black box, only an approximate simulation of it.
From the above process, we can see that if the error value is positive, we should reduce the weight; if the error value is negative, we should increase the weight.
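The decoding loop can be sketched in a dozen lines of code. Two assumptions differ from the walkthrough above: the sketch makes both coefficients of $z = w_1 x + w_2 x^2$ trainable (the walkthrough fixes the $x^2$ term), so that the loss can actually fall below 0.001, and it uses the mean squared error over the samples as the per-epoch loss, with an assumed learning rate of 0.1. The sign rule just stated is exactly what the gradient provides: a positive gradient means the weight is too large, a negative one means it is too small.

```python
# A sketch of the decoding loop over the samples in Table 2-1.
samples = [(1.0, 2.21), (1.1, 2.431), (1.2, 2.652), (2.0, 4.42)]

w1, w2 = 1.0, 1.0                 # initialization: the first guess, z = x + x^2
lr = 0.1                          # assumed learning rate

for epoch in range(2000):
    # forward calculation over all samples, then the average (squared) loss
    errors = [(w1 * x + w2 * x * x) - y for x, y in samples]
    loss = sum(e * e for e in errors) / len(samples)
    if loss < 0.001:              # the stopping threshold mentioned in the text
        break
    # gradient of the loss w.r.t. each weight: positive means the weight is too
    # large, negative means it is too small (the rule stated above)
    g1 = 2 * sum(e * x for e, (x, _) in zip(errors, samples)) / len(samples)
    g2 = 2 * sum(e * x * x for e, (x, _) in zip(errors, samples)) / len(samples)
    w1 -= lr * g1                 # gradient descent step
    w2 -= lr * g2

print(epoch, round(loss, 5), round(w1, 3), round(w2, 3))
```

With a tighter threshold (and more epochs), the weights drift toward $w_1 = 2.21$, $w_2 = 0$, which reproduces Table 2-1 exactly: an approximate copy of the box, not its true internal formula.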
To briefly summarize the basic working principles of backpropagation and gradient descent:
- Initialization;
- Forward calculation;
- The loss function provides us with a method to calculate the loss;
- Gradient descent uses the loss function to move toward the point of minimum loss, thereby guiding the direction in which the network weights should be adjusted;
- Backpropagation passes the loss value back to each layer of the neural network so that each layer adjusts the weight according to the loss value in the reverse direction;
- Return to the forward calculation step and repeat, until the accuracy is sufficient (for example, until the loss function value is less than 0.001).
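To tie the summary together, here is a runnable skeleton of that loop on the smallest possible "network": a single weight $w$ and made-up data drawn from $y = 3x$. Every name and number in it (`forward`, `loss_fn`, `gradient`, the samples, the learning rate) is an illustrative assumption, not code from this book.

```python
# A runnable skeleton of the training loop summarized above.
samples = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0)]   # hypothetical training data

def forward(w, x):                # forward calculation
    return w * x

def loss_fn(pred, y):             # loss function: squared difference
    return (pred - y) ** 2

def gradient(w, x, y):            # backpropagation for this one-weight "network"
    return 2 * (forward(w, x) - y) * x

w, lr = 0.0, 0.05                 # initialization and an assumed learning rate
for epoch in range(10000):
    avg_loss = sum(loss_fn(forward(w, x), y) for x, y in samples) / len(samples)
    if avg_loss < 0.001:          # accurate enough: stop training
        break
    for x, y in samples:
        w -= lr * gradient(w, x, y)   # gradient descent, one sample at a time
print(epoch, round(w, 4))         # stops after a few epochs with w close to 3
```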