GRU Recurrent Neural Networks - A Smart Way to Predict Sequences in Python

Cho et al. (2014) presented the Gated Recurrent Unit (GRU), a kind of recurrent neural network (RNN), as a simpler alternative to Long Short-Term Memory (LSTM) networks. Like an LSTM, a GRU can process sequential data such as audio, text, and time series.

The GRU's fundamental principle is to update the network's hidden state selectively at each time step by means of gating mechanisms. These gating mechanisms control the information that enters and leaves the unit. The reset gate and the update gate are the two gating mechanisms of the GRU.

The update gate determines how much of the previous hidden state is carried forward and how much is replaced by new information, while the reset gate determines how much of the prior hidden state to ignore when forming the candidate state. The updated hidden state then serves as the basis for computing the GRU's output.

The following formulas are used to determine a GRU's reset gate, update gate, and hidden state:
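
z_{t} = \sigma(W_{z} \cdot [x_{t}, h_{t-1}])
r_{t} = \sigma(W_{r} \cdot [x_{t}, h_{t-1}])
\overline{h}_{t} = \tanh(W \cdot [x_{t}, \; r_{t} \odot h_{t-1}])
h_{t} = z_{t} \odot h_{t-1} + (1 - z_{t}) \odot \overline{h}_{t}

Here x_{t} is the current input, h_{t-1} the previous hidden state, \sigma the sigmoid function, and \odot element-wise (Hadamard) multiplication. W_{z} \cdot [x_{t}, h_{t-1}] denotes applying the update gate's weight matrix to the concatenation of the current input and the previous hidden state, and similarly for the reset gate (W_{r}) and the current memory (candidate) state (W). Conventions differ between references as to whether z_{t} or 1 - z_{t} weights the previous hidden state; the form above is the one that matches the derivation later in this article.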

In summary, GRU networks are a kind of RNN that can model sequential data efficiently because they use gating mechanisms to selectively update the hidden state at each time step. They have proved effective in many natural language processing tasks, including speech recognition, machine translation, and language modelling.

Many modifications of the basic Recurrent Neural Network have been created to address the vanishing and exploding gradients problem it frequently suffers from. The Long Short-Term Memory (LSTM) network is one of the most well-known variants. A less well-known but no less effective variant is the Gated Recurrent Unit (GRU) network.

It has just two gates (the current-memory computation described below is sometimes counted as a third) and, unlike an LSTM, it does not keep a separate internal cell state. The information that an LSTM recurrent unit stores in its Internal Cell State is incorporated into the hidden state of the Gated Recurrent Unit, and this combined information is passed on to the next Gated Recurrent Unit. The gates of a GRU are described below:

  1. Update Gate (z): It determines how much of the past information should be carried forward into the future. Its role is comparable to the combined effect of the Input Gate and the Forget Gate in an LSTM recurrent unit.
  2. Reset Gate (r): It decides how much of the past information to forget when computing the candidate state; its role is loosely similar to that of an LSTM's Forget Gate.
  3. Current Memory Gate ( \overline{h}_{t} ): It is often overlooked in typical discussions of Gated Recurrent Unit networks. It introduces non-linearity into the input and makes the input zero-mean, and it is incorporated into the Reset Gate in much the same way that the Input Modulation Gate is a sub-part of the Input Gate in an LSTM. Making it a sub-component of the Reset Gate also reduces the effect that past information has on the current information being passed into the future.

When drawn out, the basic workflow of a Gated Recurrent Unit network is similar to that of a basic Recurrent Neural Network. The primary distinction between the two is the internal working of each recurrent unit: Gated Recurrent Unit networks consist of gates that modulate the current input and the previous hidden state.


Working of a Gated Recurrent Unit:

  • Take two vectors as input: the current input and the previous hidden state.
  • Calculate the values of the three different gates using the steps below:
    1. For each gate, obtain the parameterized current input and previous hidden state vectors by multiplying each vector by the gate's respective weights.
    2. Apply the appropriate activation function to each parameterized vector element-wise: the sigmoid function for the Update and Reset gates, and tanh for the Current Memory gate.
  • The Current Memory Gate is computed slightly differently. First, the Hadamard product of the Reset Gate and the previous hidden state vector is calculated. This product is then parameterized and added to the parameterized current input vector before the tanh activation is applied, as shown below.
    \overline{h}_{t} = \tanh(W \cdot [x_{t}, \; r_{t} \odot h_{t-1}])
  • To determine the current hidden state, first define a vector of ones (denoted 1) with the same dimensions as the hidden state. Next, compute the Hadamard product of the Update Gate and the previous hidden state vector. Then subtract the Update Gate from the ones vector and compute the Hadamard product of the result with the Current Memory Gate. Finally, add the two products to obtain the current hidden state.
h_{t} = z_{t} \odot h_{t-1} + (1 - z_{t}) \odot \overline{h}_{t}

The working described above can be written as:

z_{t} = \sigma(W_{z} \cdot [x_{t}, h_{t-1}])
r_{t} = \sigma(W_{r} \cdot [x_{t}, h_{t-1}])
\overline{h}_{t} = \tanh(W \cdot [x_{t}, \; r_{t} \odot h_{t-1}])
h_{t} = z_{t} \odot h_{t-1} + (1 - z_{t}) \odot \overline{h}_{t}

Note that \odot denotes element-wise (Hadamard) multiplication and \sigma denotes the sigmoid activation. For each gate, the weight matrix contains distinct weights for the previous hidden state and the current input vector.
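
As a concrete illustration, here is a minimal NumPy sketch of a single GRU cell's forward pass that follows the equations above. The variable and weight names (gru_cell_forward, W_z, W_r, W) are illustrative rather than from any library, and bias terms are omitted for brevity.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell_forward(x_t, h_prev, W_z, W_r, W):
    # Each weight matrix acts on the concatenation of the current input
    # and the (possibly reset-gated) previous hidden state.
    concat = np.concatenate([x_t, h_prev])             # [x_t, h_{t-1}]
    z = sigmoid(W_z @ concat)                          # update gate
    r = sigmoid(W_r @ concat)                          # reset gate
    concat_reset = np.concatenate([x_t, r * h_prev])   # [x_t, r ⊙ h_{t-1}]
    h_bar = np.tanh(W @ concat_reset)                  # current memory gate
    return z * h_prev + (1.0 - z) * h_bar              # new hidden state h_t

# Example usage: run a toy sequence of 5 steps through one GRU cell
# with random weights (input size 3, hidden size 4).
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W_z = rng.standard_normal((hidden_size, input_size + hidden_size))
W_r = rng.standard_normal((hidden_size, input_size + hidden_size))
W = rng.standard_normal((hidden_size, input_size + hidden_size))
h = np.zeros(hidden_size)
for x_t in rng.standard_normal((5, input_size)):
    h = gru_cell_forward(x_t, h, W_z, W_r, W)
print(h)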

A GRU network produces an output at each time step, just like a recurrent neural network, and uses gradient descent to train the network.
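
In Python this training loop does not have to be written by hand: a GRU layer can be trained end-to-end with gradient descent using a deep-learning framework. The following is a minimal sketch using Keras (it assumes TensorFlow is installed; the synthetic sine-wave data, the window length of 20, and the layer sizes are arbitrary illustrative choices).

import numpy as np
import tensorflow as tf

# Synthetic data: predict the next value of a noisy sine wave
# from a window of the previous 20 values.
t = np.arange(0, 200, 0.1)
series = np.sin(t) + 0.1 * np.random.randn(len(t))
window = 20
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X[..., np.newaxis]   # shape: (samples, timesteps, features)

model = tf.keras.Sequential([
    tf.keras.layers.GRU(32, input_shape=(window, 1)),   # GRU over the window
    tf.keras.layers.Dense(1),                           # next-value prediction
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

print(model.predict(X[:1]))   # predicted value following the first window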


It should be noted that, just as with the workflow, the training procedure for a GRU network is diagrammatically similar to that of a basic recurrent neural network; the two differ only in the internal working of each recurrent unit.

The main distinction between the Back-Propagation Through Time algorithms of a Gated Recurrent Unit network and a Long Short-Term Memory network lies in how the chain of derivatives is formed.

Let the actual output at each time step be denoted by y_{t} and the predicted output by \overline{y}_{t}. Then the error at each time step (taking the cross-entropy loss as an example) is given by:

E_{t} = - y_{t} \log(\overline{y}_{t})

Thus, the total error is given by the sum of the errors at all time steps:

E = \sum_{t} E_{t}

Similarly, the value \frac{\partial E}{\partial W} can be calculated as the summation of the gradients at each time step.

\frac{\partial E}{\partial W} = \sum_{t} \frac{\partial E_{t}}{\partial W}

Using the chain rule, together with the fact that \overline{y}_{t} is a function of h_{t}, which in turn is a function of \overline{h}_{t} and of all the earlier hidden states, the following expression arises:

\frac{\partial E_{t}}{\partial W} = \frac{\partial E_{t}}{\partial \overline{y}_{t}} \frac{\partial \overline{y}_{t}}{\partial h_{t}} \frac{\partial h_{t}}{\partial h_{t-1}} \frac{\partial h_{t-1}}{\partial h_{t-2}} \cdots \frac{\partial h_{0}}{\partial W}

Consequently, the total error gradient is given by:

\frac{\partial E}{\partial W} = \sum_{t} \frac{\partial E_{t}}{\partial \overline{y}_{t}} \frac{\partial \overline{y}_{t}}{\partial h_{t}} \frac{\partial h_{t}}{\partial h_{t-1}} \frac{\partial h_{t-1}}{\partial h_{t-2}} \cdots \frac{\partial h_{0}}{\partial W}

Note that although the chain of \frac{\partial h_{t}}{\partial h_{t-1}} terms in the gradient equation resembles that of a basic recurrent neural network, it behaves differently because of how the derivatives of h_{t} are internally structured.

How are vanishing gradients resolved by Gated Recurrent Units?

The chain of derivatives beginning at \frac{\partial h_{t}}{\partial h_{t-1}} controls the value of the gradients. Remember the following formula for h_{t}:

h_{t} = z_{t} \odot h_{t-1} + (1 - z_{t}) \odot \overline{h}_{t}

Using the above expression (and treating the gate activations z_{t} and r_{t} as constants with respect to h_{t-1}, as is done throughout this analysis), the value of \frac{\partial h_{t}}{\partial h_{t-1}} is:

\frac{\partial h_{t}}{\partial h_{t-1}} = z_{t} + (1 - z_{t}) \frac{\partial \overline{h}_{t}}{\partial h_{t-1}}

Remember the following formula for \overline{h}_{t}:

\overline{h}_{t} = \tanh(W \cdot [x_{t}, \; r_{t} \odot h_{t-1}])

Using the above expression, the value of \frac{\partial \overline{h}_{t}}{\partial h_{t-1}} is (writing W here for the block of weights that multiplies the previous hidden state):

\frac{\partial \overline{h}_{t}}{\partial h_{t-1}} = (1 - \overline{h}_{t}^{2}) \cdot W \cdot r_{t}

Since the sigmoid function serves as the activation for both the update and reset gates, their values lie between 0 and 1, and when the gates saturate they are driven very close to 0 or 1. The following cases consider these two extremes.

Case 1 (z = 1):

In this case, irrespective of the value of r, the term \frac{\partial h_{t}}{\partial h_{t-1}} is equal to z, which in turn is equal to 1.

Case 2A (z=0 and r=0):

In this case, \frac{\partial \overline{h}_{t}}{\partial h_{t-1}} is equal to 0, and therefore \frac{\partial h_{t}}{\partial h_{t-1}} is also 0.

Case 2B (z=0 and r=1):

In this instance, the expression \frac{\partial \overline{h}_{t}}{\partial h_{t-1}} evaluates to (1-\overline{h}_{t}^{2})(W), and since z = 0 this is also the value of \frac{\partial h_{t}}{\partial h_{t-1}}. This value is determined by the trainable weight matrix, and the network learns to adjust the weights so that the term stays close to 1.

As a result, the Back-Propagation Through Time algorithm adjusts the corresponding weights so that the value of the chain of derivatives stays as close to 1 as possible, which is what prevents the gradient from vanishing.
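
These three cases can also be checked numerically. The sketch below uses a scalar version of the GRU update with the gate values held fixed (the weights w_x and w_h and the helper functions are illustrative), and estimates \frac{\partial h_{t}}{\partial h_{t-1}} with finite differences for each case:

import numpy as np

def gru_step(h_prev, x, z, r, w_x=0.7, w_h=1.3):
    # Scalar GRU update with the gate values z and r treated as constants,
    # mirroring the saturated-gate analysis above.
    h_bar = np.tanh(w_x * x + w_h * (r * h_prev))   # current memory gate
    return z * h_prev + (1.0 - z) * h_bar           # h_t

def numerical_derivative(f, h_prev, eps=1e-6):
    # Central finite-difference estimate of d f / d h_prev.
    return (f(h_prev + eps) - f(h_prev - eps)) / (2 * eps)

h_prev, x = 0.5, 0.2
for z, r in [(1.0, 0.0), (1.0, 1.0), (0.0, 0.0), (0.0, 1.0)]:
    grad = numerical_derivative(lambda h: gru_step(h, x, z, r), h_prev)
    print(f"z={z}, r={r}: dh_t/dh_prev = {grad:.4f}")

# z = 1 gives a derivative of 1 regardless of r (Case 1), z = 0 with r = 0
# gives 0 (Case 2A), and z = 0 with r = 1 gives (1 - h_bar**2) * w_h (Case 2B).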