Artificial intelligence – deep learning – RNN's defects and the LSTM solution

1.1 RNN forward pass

1.2 RNN backward gradient computation (BPTT)

BPTT (Back-Propagation Through Time) is the commonly used method for training an RNN. It is in fact just the BP algorithm; because an RNN processes time-series data, the error is propagated backward through time as well.

The central idea of BPTT is the same as that of BP: keep moving the parameters in the negative gradient direction until a better point is found. In short, BPTT is essentially the BP algorithm, and BP is essentially gradient descent, so computing the gradient of each parameter is the core of the algorithm.

It can be observed that taking the partial derivative with respect to W or U at a given time step requires tracing back through all the information before that time step. And this is only a single time step; the derivatives must also be accumulated over all time steps, so the partial derivatives of the whole loss function with respect to W and U become very cumbersome. Even so, they can still be traced; according to the two formulas above, we can write the partial derivatives of L with respect to W and U at time t:

The overall partial-derivative formula is obtained by summing these terms over all time steps.

As mentioned earlier, the activation functions are nested. If we expand the activation functions and pull out the middle part:

Another way of writing it:

We will find that the chain rule brings in repeated multiplication by the derivative of the activation function, which causes "gradient vanishing" and "gradient explosion".
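
Since the original figures are not reproduced here, the following is a standard reconstruction of the BPTT gradient for a vanilla RNN with hidden state s_t = tanh(U x_t + W s_{t-1}) and output o_t (textbook notation, not the article's own symbols):

```latex
% Hedged reconstruction of the BPTT gradient in standard notation
\frac{\partial L}{\partial W}
  = \sum_{t} \frac{\partial L_t}{\partial W}
  = \sum_{t} \sum_{k=0}^{t}
      \frac{\partial L_t}{\partial o_t}\,
      \frac{\partial o_t}{\partial s_t}
      \left( \prod_{j=k+1}^{t} \frac{\partial s_j}{\partial s_{j-1}} \right)
      \frac{\partial s_k}{\partial W},
\qquad
\frac{\partial s_j}{\partial s_{j-1}} = \mathrm{diag}\!\left(\tanh'(z_j)\right) W
```

The product over the terms ∂s_j/∂s_{j-1} is exactly where the activation derivatives (and W) get multiplied together again and again: factors below 1 shrink the product toward 0, factors above 1 blow it up.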

1.3 Gradient explosion (a little more every day, and after N days you take off)

As long as the gradient factor at each step is greater than 1, after N rounds (e.g. N = 365) the final gradient becomes extremely large.

The remedy for gradient explosion is relatively simple:

At each gradient iteration, check the value of the gradient; if it exceeds a threshold, clip it to that threshold (gradient clipping).

Note: what is limited here is the value of the gradient at each iteration, not the value of the parameter W.
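
As an illustration of the clipping idea described above, here is a minimal sketch (not from the original article; all names are illustrative) that clips a list of gradient arrays by their global L2 norm:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale the gradients down if their combined (global) L2 norm exceeds max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > max_norm:
        scale = max_norm / (global_norm + 1e-12)
        grads = [g * scale for g in grads]
    return grads

# Example: a gradient that has "exploded"
grads = [np.array([300.0, -400.0])]            # norm = 500
clipped = clip_by_global_norm(grads, max_norm=5.0)
print(clipped[0])                              # roughly [3., -4.], norm ~ 5
```

PyTorch provides the same behaviour through torch.nn.utils.clip_grad_norm_, which is applied to a model's parameters after the backward pass.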

1.4 Gradient dispersion / vanishing (a little less every day, and after N days you are completely finished)

As long as the gradient factor at each step is less than 1, after N rounds (e.g. N = 365) the final gradient tends to 0.
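
The "365 days" metaphor in the two headings above can be checked with a two-line sketch (illustrative numbers, not from the original):

```python
# A per-step factor slightly above 1 explodes; slightly below 1 vanishes.
print(1.01 ** 365)   # ~37.8  -> "you will take off"
print(0.99 ** 365)   # ~0.026 -> "you will completely end"
```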

1.5 Gradient vanishing in the RNN network

This is a function diagram and derivative diagram of the Sigmoid function.

This is a function diagram and derivative diagram of the tanh function.

The two functions are similar: both compress the output into a bounded range. Their derivative curves are also very similar. We can observe that the derivative of the sigmoid function lies in (0, 0.25] and the derivative of the tanh function lies in (0, 1]; neither derivative exceeds 1.

This causes a problem. With sigmoid as the activation function, the chain rule inevitably multiplies many such derivatives together, and the more factors are multiplied, the smaller the product. As the time sequence deepens, these small numbers drive the gradient smaller and smaller, toward 0; this is the "gradient vanishing" phenomenon.
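
The derivative bounds quoted above can be verified numerically; this small sketch (illustrative, not from the original) evaluates both derivatives at their maximum, x = 0:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # maximum 0.25, reached at x = 0

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2  # maximum 1.0, reached at x = 0

print(sigmoid_grad(0.0), tanh_grad(0.0))   # 0.25 1.0
print(sigmoid_grad(0.0) ** 10)             # ~1e-6: ten chained sigmoid steps already crush the gradient
```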

In fact, an RNN unrolled over time is very similar to a deep feed-forward network. Using the sigmoid function as the activation in a relatively deep network causes the gradient to vanish during back-propagation. A vanished gradient means the parameters of that layer are no longer updated; the hidden layer then degenerates into a simple mapping layer and becomes meaningless. This is why, in deep networks, increasing the number of neurons per layer is sometimes better than increasing the depth.

You may object that an RNN is different from a deep feed-forward network: its parameters are shared, and the gradient at a given moment is accumulated over that moment and all earlier ones, so even if the gradient does not reach the deepest (earliest) steps, the shallow steps still provide one. That is true, but if we update parameters shared across many layers using the gradient from only a limited number of layers, there is bound to be a problem: we optimize with limited information instead of using all of it.

It was said above that the tanh function is also used as an activation function. The tanh derivative is at most 1, and it cannot take that maximum everywhere, so in the end it is still a pile of small numbers being multiplied together and "gradient vanishing" still occurs. Then why use it as the activation? The reason is that, compared with sigmoid, tanh has a larger gradient, so convergence is faster and the gradient vanishes more slowly.

Another reason is a shortcoming of the sigmoid function: its output is not zero-centered. The output of sigmoid is always greater than 0, so the outputs do not have zero mean (this is called offset), which means the neurons of the next layer receive non-zero-mean signals as input. With zero-centered input and zero-centered output, the network trains better.

The defining feature of the RNN was supposed to be its ability to "trace back" and exploit historical data. Being told now that the usable history is limited is very unsatisfying, so solving "gradient vanishing" is very necessary.

1.6 Main methods for solving "gradient vanishing":

(1) Choose a better activation function

Generally, the ReLU function is used as the activation function. The graph of the ReLU function is:

The derivative of ReLU is 0 to the left of the origin and 1 to the right, which avoids "gradient vanishing".

However, a constant derivative of 1 easily leads to "gradient explosion", though an appropriate clipping threshold solves this problem. Another point: the zero derivative on the left side may cause neurons to "die", but an appropriate step size (learning rate) can also effectively avoid this problem.
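
A minimal sketch of ReLU and its derivative, matching the description above (illustrative only, not from the original article):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 0 to the left of the origin, 1 to the right: on active paths the chained
    # factors are exactly 1, so the gradient neither shrinks nor grows.
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(x), relu_grad(x))
```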

(2) Change the propagation structure

The LSTM network alleviates gradient vanishing by changing the network structure.

1.7 Functional defects of the RNN network

(1) Unable to support long sequences

Because of gradient vanishing, the time span over which an RNN can learn is limited; therefore the RNN only has short-term memory and cannot support memory over long sequences.

This means that the RNN may only be able to process short sentences and cannot handle long text sequences.

Although, on the surface, the RNN network structure seems to support long text.

This calls for a new network structure that supports long-sequence input.

The so-called inability to remember "long sequences", retaining only "short" context, refers to the gradient vanishing over long sequences.

"Braised ribs" appears in the beginning of the text, when entering the last string is, the RNN network, may have forgotten the most important word "red burn ribs" of this sequence, ….. The practice is similar to the hot chicken, finally It is possible to predict "hot chicken".

(2) In the RNN, the effect of different words on the current output depends entirely on "time"

In an RNN, time steps form a serial chain: the farther an earlier hidden state is from the current one, the smaller its influence on the current hidden layer's output. The RNN cannot adjust the current output according to the intrinsic importance of different words.

(3) The RNN network treats all sequence inputs equally

The RNN extracts features from all inputs with equal weight: every word in a sentence has its features extracted and passed on as input to the next step. In other words, the RNN network remembers all information without distinguishing which information is useful, which is useless, and which is merely auxiliary.

In actual language, however, although different words all carry features, their contributions to the target differ. Some words are essentially meaningless modifiers, while the features of other words play a decisive role and need to be remembered across the scope of the whole article. In other words, the importance of different words depends not only on the physical distance between them but also on the meaning of the words themselves.

This requires a network that can discard or remember information according to the importance of the words.

Dropping invalid information and remembering effective information means:

  • The input sequence can grow long without gradient vanishing occurring easily.
  • Effective information can still exert a large influence even when it is far from the current input.
  • Long-term memory is realized.

A network that meets the above requirements is the upgraded version of the RNN: the LSTM network.

2.1 LSTM Overview

LSTM (Long Short-Term Memory).

A long short-term memory network is a variant of the RNN. Through an elaborate gating design, the LSTM network combines short-term memory with long-term memory and, to some extent, solves the problem of gradient vanishing.

LSTM was proposed by Hochreiter & Schmidhuber in 1997. Later, after being improved and popularized by Alex Graves, LSTM achieved considerable success on many problems and has since been widely used in scenarios with sequential input.

LSTM achieves long-term memory through a deliberately designed basic component: the cell.

2.2 RNN unit structure

In this structure there is only one memory state, h_{t-1}, which together with the current input produces the new memory state h_t.

The so-called memory here is the value of h_{t-1} and h_t; the weight matrix applied to h performs the extraction of that memory.

The current input is x_t; the weight matrix applied to x_t extracts the instantaneous memory.

Here:

  • h_{t-1} is the memory of all historical inputs; nothing is discarded and nothing is selectively remembered.
  • The farther a historical input is from the current time, the smaller its effect on the current output h_t; that is, long-term memory is forgotten.

h_t = tanh(x_t * W_x + h_{t-1} * W_h): historical memory plus instantaneous memory produces the new memory.
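
A minimal NumPy sketch of this single forward step (shapes and names are illustrative, not from the original article):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One vanilla-RNN step: h_t = tanh(x_t @ W_x + h_prev @ W_h + b)."""
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W_x = rng.normal(size=(input_size, hidden_size))
W_h = rng.normal(size=(hidden_size, hidden_size))
b   = np.zeros(hidden_size)

h = np.zeros(hidden_size)                     # initial memory
for x_t in rng.normal(size=(5, input_size)):  # a toy sequence of length 5
    h = rnn_step(x_t, h, W_x, W_h, b)
print(h)
```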

2.3 Basic Composition Unit of LSTM Network: Cell

Remembering long-term information is a natural, default ability of the LSTM network, not one bought at a large computational or memory cost. This is thanks to LSTM's unique cell structure, the basic building unit of the LSTM network.

An LSTM cell is a memory unit with a selective-memory function.

(1) Flow of information:

  • Time-step input of this round: x_t
  • Input merging: x_t is first concatenated with the current short-term memory h_{t-1}, giving (x_t, h_{t-1}).
  • Forget gate (sigmoid switch): through the W_f matrix and (x_t, h_{t-1}), decide whether, and how much of, the long-term memory C_{t-1} is needed for this round; the sigmoid output lies in [0, 1].
  • Generate new information: this round's new candidate information ~C_t is generated through the W_C matrix, (x_t, h_{t-1}), and the tanh function.
  • Input gate (sigmoid switch): through the W_i matrix and (x_t, h_{t-1}), decide whether, and how much of, the newly generated information ~C_t is superimposed onto the long-term memory C_{t-1}; the sigmoid output lies in [0, 1].
  • Generate the new long-term memory: the previous long-term memory C_{t-1} (after the forget gate) is superimposed with this round's gated new information ~C_t to obtain the new C_t.
  • Generate this round's output: the output information of this round is produced from the new long-term memory C_t through the tanh function.
  • Output gate (sigmoid switch): through the W_o matrix and (x_t, h_{t-1}), decide how much of that output is released as the short-term memory h_t.
  • Time-step outputs of this round: h_t and C_t
(2) Two states of LSTM – external interface

An RNN passes only one hidden state (memory information) between adjacent units.

An LSTM passes two pieces of state information (memory) between adjacent units:

  • the long-term memory C_t;
  • the short-term memory h_t of the previous step (instantaneous memory).

(3) Special internal structure

Inside the RNN unit there is only an addition operation and an activation function.

The LSTM internal structure includes:

  • Sigmoid activation functions: they act as analog switches, also called memory intensity; 0 means forget completely, 1 means remember completely, and values between 0 and 1 mean partial memory.

  • Three element-wise multiplications: together with the sigmoid functions they implement the three "gates". Multiplying by 0 closes the gate (forget); multiplying by 1 passes the input through unchanged (open, remember).
  • One addition: superimposes the gated long-term memory C_{t-1} and the gated new information derived from the current instantaneous input x_t and the short-term memory h_{t-1}.
  • One tanh activation function: produces the activation needed for the current step's output.

(4) The three gates of LSTM

  • Forget gate, the first sigmoid function: given the previous state and the current input, it decides how much of the previous long-term memory is needed for this iteration's output.
  • Input gate, the second sigmoid function: it decides how much of this iteration's new content needs to be saved into the long-term memory "box", i.e. what new content to write. Through the input gate, the current input information enters the network's memory.
  • Output gate, the third sigmoid function: it decides how the information in the long-term memory "box" is output.

2.4 Mathematical expressions of LSTM network composition unit

(1) The combination of x_t and h_{t-1}

[h_{t-1}, x_t] = Concatenate(h_{t-1}, x_t)

[h_{t-1}, x_t] is the input used by the subsequent control gates and by the generation of new information.

(2) Mathematical formula of the forget gate

The forget gate indicates what kind of information we want to keep and what kind of information is allowed to pass, so the activation function here is the sigmoid.
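
The standard form of this gate, reproduced here since the original figure is not shown (W_f and b_f denote the forget gate's weight matrix and bias):

```latex
f_t = \sigma\!\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)
```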

(3) Input gate

A sigmoid builds the input gate, which determines which values we will update; a tanh then builds a vector of candidate values, which is added to the state.

Besides the mathematical expression of the input gate, the figure above also contains the expression of this round's new information ~C_t.

~C_t passes through the input gate and is finally superimposed (added) with the gated long-term memory C_{t-1} to give the new long-term memory C_t, i.e. C_t = C_{t-1} * forget-gate value + ~C_t * input-gate value, that is:
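
In the standard notation (⊙ denotes element-wise multiplication; W_i, b_i, W_C, b_C are the input gate's and the candidate's weights and biases, names not taken from the original figures):

```latex
i_t = \sigma\!\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \qquad
\tilde{C}_t = \tanh\!\left(W_C \cdot [h_{t-1}, x_t] + b_C\right), \qquad
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
```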

(4) Output gate

Besides the mathematical expression of the output gate, the figure above also contains the final output h_t of this round.
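
In the same notation, the output gate and this round's output are:

```latex
o_t = \sigma\!\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad
h_t = o_t \odot \tanh(C_t)
```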

2.5 Mathematical Expression

(1) Two inputs

[h_{t-1}, x_t] = Concatenate(h_{t-1}, x_t)

(2) Two outputs

(3) Intermediate unit
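
Putting the pieces of Sections 2.3–2.5 together, here is a minimal NumPy sketch of one LSTM cell step: two inputs (h_{t-1}, C_{t-1}) plus x_t, two outputs (h_t, C_t). All shapes and names are illustrative, not taken from the original article.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM cell step in the standard formulation."""
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f = sigmoid(W_f @ z + b_f)             # forget gate
    i = sigmoid(W_i @ z + b_i)             # input gate
    C_tilde = np.tanh(W_C @ z + b_C)       # candidate (new information)
    C = f * C_prev + i * C_tilde           # new long-term memory
    o = sigmoid(W_o @ z + b_o)             # output gate
    h = o * np.tanh(C)                     # new short-term memory / output
    return h, C

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
shape = (hidden_size, hidden_size + input_size)
W_f, W_i, W_C, W_o = (rng.normal(size=shape) for _ in range(4))
b_f = b_i = b_C = b_o = np.zeros(hidden_size)

h = C = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):   # a toy sequence of length 5
    h, C = lstm_step(x_t, h, C, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o)
print(h, C)
```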

3.1 Chain structure of the RNN network

All RNNs have the form of a chain of repeating neural-network modules. In a standard RNN, this repeated module has a very simple structure, such as a single tanh layer.

3.2 Chain structure of the LSTM network

The LSTM has the same kind of chain structure, but the repeated module has a different internal structure.

Unlike the RNN's single neural-network layer, in the LSTM both the hidden state h and the cell state C flow across time, and the cell state C represents the long-term memory.
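
As an illustration of these two flowing states, here is a small sketch using PyTorch's built-in torch.nn.LSTM (hyperparameters are illustrative): it returns both the hidden state h and the cell state c alongside the per-step outputs.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True)
x = torch.randn(2, 5, 4)          # batch of 2 sequences, length 5, 4 features each

output, (h_n, c_n) = lstm(x)      # h_n: short-term state, c_n: long-term cell state
print(output.shape)               # torch.Size([2, 5, 3]) -> h_t at every time step
print(h_n.shape, c_n.shape)       # torch.Size([1, 2, 3]) each -> final h and C
```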

Copyright notice: this is an original article by the CSDN blogger "Silicon Base Workshop" and follows the CC 4.0 BY-SA copyright agreement; please attach the original link and this statement when reprinting.