GRU And LSTM

Recall the structure of an RNN — its hidden layer is composed of neurons. GRU (Gated Recurrent Unit) and LSTM (Long Short-Term Memory) can be seen as upgraded versions of a neuron. Within the overall RNN architecture, GRU and LSTM play the same role as a neuron: they take as input the previous hidden state $h_{t-1}$ and the current input $x_t$, and produce the current hidden state $h_t$ as output.

How do the GRU and LSTM differ from a normal neuron?

In a vanilla RNN, we know that $h_t$ depends on both $h_{t-1}$ and $x_t$ by $h_t = \psi(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$. However, not all information from $h_{t-1}$ and $x_t$ is preserved in $h_t$, since the activation function may suppress or distort part of it. This is information loss: it is passive and not explicitly controlled. The model does not deliberately decide what to keep or forget.

In contrast, LSTMs and GRUs introduce gates that enable deliberate, selective control. These gates determine how much information is carried over from $h_{t-1}$ and how much comes from $x_t$. The intuition behind this control is that not every part of the hidden state and input is equally important, so the model should be able to selectively forget certain information.

A gate is just a vector, like the hidden state. In these notes we'll see how the gates in a GRU and an LSTM are computed, and how they control the flow of information.

Interestingly, the LSTM was invented earlier than the GRU, yet it is the more complex of the two. The GRU (2 gates) is a simplified version of the LSTM (3 gates) that merges some of its gates, but it achieves almost the same performance. For that reason we recommend the GRU.

Takeaways:

  1. What is the difference between a normal RNN neuron and GRU, LSTM?
  2. What gates are used in a GRU, and why are they named that way?
  3. What gates are used in an LSTM, and why are they named that way?
  4. How do we implement a GRU or LSTM in PyTorch?

--- GRU Architecture

A GRU is a recurrent neural network (RNN) building block that uses gates to control the flow of information. At each time step $t$, a GRU performs the following computations:

  1. First, the update gate $z_t$ and reset gate $r_t$ are calculated using the current input $x_t$ and the previous hidden state $h_{t-1}$: $$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$ $$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$ where $\sigma$ is the sigmoid function, and $W$, $U$, and $b$ are weight matrices and bias vectors, respectively.

  2. Next, the candidate hidden state $\tilde{h}_t$ is calculated as $$\tilde{h}_t = \tanh(W_h x_t + r_t \odot (U_h h_{t-1}) + b_h)$$ where $\odot$ denotes element-wise multiplication. Finally, the new hidden state $h_t$ is computed as a linear interpolation between the previous hidden state and the candidate hidden state $$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$ This formulation allows the GRU to capture long-term dependencies more effectively than traditional RNNs.

Reset gate $r_t$ directly acts on $h_{t-1}$, influencing the extent to which the previous hidden state participates in the computation of the candidate hidden state.

  • When $r_t \approx 0$: it means resetting the memory — the current computation mainly considers $x_t$ while ignoring the past.
  • When $r_t \approx 1$: it means preserving the memory — allowing the past state $h_{t-1}$ to fully influence $\tilde{h}_t$.

Update gate $z_t$ controls how much of the past memory $h_{t-1}$ is retained in the current hidden state $h_t$, and how much of the new candidate state $\tilde{h}_t$ is introduced.

  • When $z_t \approx 1$: more updating → the model relies more on the new candidate state $\tilde{h}_t$.
  • When $z_t \approx 0$: less updating → the model tends to preserve the old state $h_{t-1}$.
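To make these formulas concrete, here is a minimal sketch of a single GRU step in PyTorch, written directly from the equations above (the function name `gru_step` and the parameter names are only illustrative; in practice you would simply use `torch.nn.GRU` or `torch.nn.GRUCell`):

```python
import torch

def gru_step(x_t, h_prev, p):
    """One GRU time step, following the equations above.

    x_t:    (batch, input_size)   current input
    h_prev: (batch, hidden_size)  previous hidden state
    p:      dict with W_* (hidden, input), U_* (hidden, hidden), b_* (hidden,)
    """
    z_t = torch.sigmoid(x_t @ p["W_z"].T + h_prev @ p["U_z"].T + p["b_z"])           # update gate
    r_t = torch.sigmoid(x_t @ p["W_r"].T + h_prev @ p["U_r"].T + p["b_r"])           # reset gate
    h_tilde = torch.tanh(x_t @ p["W_h"].T + r_t * (h_prev @ p["U_h"].T) + p["b_h"])  # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde                                        # interpolate old vs. new

# Tiny smoke test with random parameters.
input_size, hidden_size, batch = 8, 16, 4
p = {}
for g in ("z", "r", "h"):
    p[f"W_{g}"] = 0.1 * torch.randn(hidden_size, input_size)
    p[f"U_{g}"] = 0.1 * torch.randn(hidden_size, hidden_size)
    p[f"b_{g}"] = torch.zeros(hidden_size)

h = torch.zeros(batch, hidden_size)
for t in range(5):                       # unroll over a toy sequence of length 5
    h = gru_step(torch.randn(batch, input_size), h, p)
print(h.shape)                           # torch.Size([4, 16])
```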

💡 Example for Intuition

Suppose we are processing the sentence:

John lives in Paris.

and we use a GRU for Named Entity Recognition (NER), e.g., to determine whether "Paris" is a location.

When the model has seen "John lives in", the hidden state $h_{t-1}$ has already encoded some background information.

Now we come to "Paris":

  • For a new entity word like "Paris", we might want to ignore the previous context and rely mainly on the current word to recognize it:
    • Hence, reset gate $r_t \approx 0$, discarding the old state.
  • But we may still want to preserve some context (for example, knowing it follows "lives in"):
    • Thus, update gate $z_t \approx 1$, introducing the new candidate state and updating the overall state.

In this way, the GRU can dynamically decide whether to rely on historical information or the current input, depending on the context of the word.

--- LSTM Architecture

An LSTM is an RNN building block designed to learn long-term dependencies effectively by using gates to regulate the flow of information. In addition to the hidden state, the LSTM also introduces a memory cell $C_t$. At each time step $t$, an LSTM unit performs the following computations:

  1. First, the forget gate $f_t$ is computed as $$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$ where $\sigma$ is the sigmoid function, $x_t$ is the input at time $t$, $h_{t-1}$ is the previous hidden state, and $W_f$, $U_f$, and $b_f$ are weights and biases.

  2. The input gate $i_t$ and the candidate cell state $\tilde{C}_t$ are calculated as $$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$ $$\tilde{C}_t = \tanh(W_C x_t + U_C h_{t-1} + b_C)$$ The new cell state $C_t$ is then updated using the forget gate and the input gate ($\odot$ denotes element-wise multiplication) $$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

  3. Finally, the output gate $o_t$ is computed as $$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$ and the new hidden state $h_t$ is given by $$h_t = o_t \odot \tanh(C_t)$$ These gating mechanisms allow the LSTM to regulate the flow of information and effectively capture long-term dependencies.

Input gate $i_t$ regulates how much of the new candidate memory $\tilde{C}_t$ is written into the current memory cell state $C_t$.

  • When $i_t \approx 0$: the new input is hardly written into the memory cell.
  • When $i_t \approx 1$: the new input is largely written into the memory cell.

Forget gate $f_t$ regulates how much of the previous memory cell state $C_{t-1}$ is retained in the current memory cell state $C_t$.

  • When $f_t \approx 0$: the past memory is almost completely forgotten.
  • When $f_t \approx 1$: the past memory is almost fully preserved.

Output gate $o_t$ regulates how much of the long-term memory $C_t$ is written into the current hidden state $h_t$.

  • When $o_t \approx 0$: almost nothing is output (the memory is “hidden”).
  • When $o_t \approx 1$: the current memory is fully output.
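Analogously, here is a minimal sketch of a single LSTM step, again written directly from the equations above (the function name `lstm_step` and the parameter dictionary are only illustrative; `torch.nn.LSTM` or `torch.nn.LSTMCell` is what you would use in practice):

```python
import torch

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step, following the equations above.

    x_t:    (batch, input_size)   current input
    h_prev: (batch, hidden_size)  previous hidden state
    c_prev: (batch, hidden_size)  previous memory cell state
    p:      dict with W_* (hidden, input), U_* (hidden, hidden), b_* (hidden,)
    """
    f_t = torch.sigmoid(x_t @ p["W_f"].T + h_prev @ p["U_f"].T + p["b_f"])   # forget gate
    i_t = torch.sigmoid(x_t @ p["W_i"].T + h_prev @ p["U_i"].T + p["b_i"])   # input gate
    c_tilde = torch.tanh(x_t @ p["W_C"].T + h_prev @ p["U_C"].T + p["b_C"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                                       # forget old memory, write new
    o_t = torch.sigmoid(x_t @ p["W_o"].T + h_prev @ p["U_o"].T + p["b_o"])   # output gate
    h_t = o_t * torch.tanh(c_t)                                              # expose part of the memory
    return h_t, c_t
```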

💡 Don't Take the GRU and LSTM Gates Too Seriously

Obviously, $h_t$ only depends on $h_{t-1}$ and $x_t$. So why not just define the hidden state as a weighted sum of the previous state and the input, e.g. $h_t = \alpha h_{t-1} + \beta x_t$? Why do we need to stack so many gates on top of this? After all, each gate is still just manipulating $h_{t-1}$ and $x_t$: one gate already acts on $h_{t-1}$, and then another gate comes along and acts on $h_{t-1}$ again 😅.

In fact, $h_t = \alpha h_{t-1} + \beta x_t$ (wrapped in a nonlinearity) is essentially what a vanilla RNN does, but it quickly runs into the vanishing/exploding gradient problem. During long-term backpropagation, the gradient either vanishes or explodes exponentially. The reason is that the gradient is repeatedly multiplied along the time chain by Jacobians (which contain the nonlinearity's derivative and the weight matrices), causing it to either shrink toward zero or grow without bound.
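As a rough illustration (a toy experiment, not part of the derivation), we can unroll a vanilla-RNN-style recurrence with small random recurrent weights and watch the gradient of the final state with respect to the initial state shrink toward zero; with large weights it would blow up instead:

```python
import torch

hidden_size, T = 16, 50
W_hh = 0.1 * torch.randn(hidden_size, hidden_size)   # small recurrent weights
h0 = torch.randn(1, hidden_size, requires_grad=True)

h = h0
for t in range(T):                                    # unroll the recurrence over T steps (zero input)
    h = torch.tanh(h @ W_hh.T)

h.sum().backward()
print(h0.grad.norm())   # a very small number: the gradient has vanished
```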

LSTMs and GRUs were never intended to be a "clean" mathematical design; they are a temporary, yet effective, hack. The gates open and close controlled paths for information flow, so that gradients can bypass the nonlinear squashing and propagate over long sequences. The key idea to remember is simple: the gates are designed to stabilize gradients as they flow through time, while the "meaning" of each gate is something imposed by humans afterwards, which gives the whole construction the feel of being held together with duct tape.

Before Transformers, LSTMs and GRUs ruled NLP; after Transformers, they’re mostly retired.

--- GRU & LSTM Implementation

In fact, GRUs and LSTMs are used in exactly the same way as the RNN module. We don't need to write a separate network for them: we can simply replace the RNN module in the code from our previous RNN notes with a GRU or LSTM, and it will work right away.

Here is a summary of the I/O of the vanilla RNN, GRU, LSTM.

torch.nn.LSTM

output, (h_n, c_n) = lstm(input, (h_0, c_0))
  • output: shape (batch, seq_len, hidden_size), assuming batch_first=True
    → All hidden states from the last layer, for all time steps.

  • h_n: shape (num_layers, batch, hidden_size)
    → Final hidden state (i.e., $h_t$) from each layer at the last time step.

  • c_n: shape (num_layers, batch, hidden_size)
    → Final memory cell state (i.e., $C_t$) from each layer at the last time step.
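For example, a minimal call matching this summary (the sizes are arbitrary, and the output shape above assumes batch_first=True):

```python
import torch
import torch.nn as nn

batch, seq_len, input_size, hidden_size, num_layers = 4, 10, 8, 16, 2
lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)

x = torch.randn(batch, seq_len, input_size)
h_0 = torch.zeros(num_layers, batch, hidden_size)
c_0 = torch.zeros(num_layers, batch, hidden_size)

output, (h_n, c_n) = lstm(x, (h_0, c_0))
print(output.shape)  # torch.Size([4, 10, 16])
print(h_n.shape)     # torch.Size([2, 4, 16])
print(c_n.shape)     # torch.Size([2, 4, 16])
```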

torch.nn.GRU and torch.nn.RNN

output, h_n = gru(input, h_0)
  • output: shape (batch, seq_len, hidden_size), assuming batch_first=True
    → All hidden states from the last layer, for all time steps.

  • h_n: shape (num_layers, batch, hidden_size)
    → Final hidden state from each layer at the last time step (no cell state for GRU).
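And the corresponding GRU call (again with batch_first=True and arbitrary sizes; torch.nn.RNN has exactly the same interface):

```python
import torch
import torch.nn as nn

batch, seq_len, input_size, hidden_size, num_layers = 4, 10, 8, 16, 2
gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)

x = torch.randn(batch, seq_len, input_size)
h_0 = torch.zeros(num_layers, batch, hidden_size)

output, h_n = gru(x, h_0)
print(output.shape)  # torch.Size([4, 10, 16])
print(h_n.shape)     # torch.Size([2, 4, 16])
```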