The family of RNNs
Temporal analysis and prediction are of significant importance and have motivated many statistical methods. Since the application of Artificial Intelligence, the field has stepped into a data-driven era, especially with deep learning models. A typical genre is based on recurrent neural networks (RNNs), which were later improved as long short-term memory networks (LSTMs) for longer sequences, simplified as gated recurrent units (GRUs), and extended as Bi-LSTMs for bidirectional data flow.
The following structure generally holds for these models and can serve as a reference when building one from scratch (a small sketch of the clipping and update steps follows the list):
- Initialize parameters
- Run the optimization loop:
  - Forward propagation to compute the loss function
  - Backward propagation to compute the gradients with respect to the loss function
  - Clip the gradients to avoid exploding gradients
  - Using the gradients, update the parameters with the gradient-descent update rule
- Return the learned parameters
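As a rough sketch of the clipping and update steps in that loop, the snippet below keeps parameters and gradients in plain dictionaries keyed by name (e.g. w_aa / dw_aa); the dictionary layout and the max-value of 5.0 are assumptions for illustration, and the forward/backward passes are omitted since they depend on the concrete model.

```python
import numpy as np

def clip_gradients(grads, max_value=5.0):
    """Clip every gradient in place to [-max_value, max_value] to avoid exploding gradients."""
    for g in grads.values():
        np.clip(g, -max_value, max_value, out=g)
    return grads

def update_parameters(params, grads, learning_rate=0.01):
    """Gradient-descent update rule: theta <- theta - learning_rate * d_theta."""
    for name in params:
        params[name] -= learning_rate * grads["d" + name]
    return params

# Toy check with a single parameter matrix.
params = {"w_aa": np.ones((2, 2))}
grads = {"dw_aa": np.full((2, 2), 10.0)}
clip_gradients(grads)             # gradients are capped at 5.0
update_parameters(params, grads)  # w_aa becomes 1 - 0.01 * 5 = 0.95
```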
Recurrent neural networks
Generally, RNNs can be used for different kinds of 1-D data, such as time series and natural language, but the underlying logic is identical. The basic rule is that at time t, the RNN receives information not only from the input $x_{t}$ but also from earlier time steps. For an unrolled RNN:
\[a_{t}=g_{1}(w_{ax}x_{t}+w_{aa}a_{t-1}+b_{a})\] \[y_{t}=g_{2}(w_{ya}a_{t}+b_{y})\]
where $g_{1}$ is commonly tanh or ReLU, which helps mitigate exploding and vanishing gradients, and $g_{2}$ could be sigmoid, etc.
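A minimal NumPy sketch of one forward step through this cell, using tanh for $g_{1}$ and softmax for $g_{2}$ (one of several possible output activations); the function name, toy shapes, and random initialization below are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a column vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_cell_forward(x_t, a_prev, w_ax, w_aa, w_ya, b_a, b_y):
    """One forward step of a basic RNN cell (g1 = tanh, g2 = softmax here)."""
    a_t = np.tanh(w_ax @ x_t + w_aa @ a_prev + b_a)   # hidden activation
    y_t = softmax(w_ya @ a_t + b_y)                   # output at time t
    return a_t, y_t

# Toy shapes: n_x input features, n_a hidden units, n_y output classes.
n_x, n_a, n_y = 3, 5, 2
rng = np.random.default_rng(0)
a_t, y_t = rnn_cell_forward(
    rng.standard_normal((n_x, 1)), np.zeros((n_a, 1)),
    rng.standard_normal((n_a, n_x)), rng.standard_normal((n_a, n_a)),
    rng.standard_normal((n_y, n_a)), np.zeros((n_a, 1)), np.zeros((n_y, 1)),
)
```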
Small tips for RNNs and other family members:
- Parameters $w_{aa}$, $w_{ax}$ and $w_{ya}$ are shared across all time steps t
- The dimension of $a$ can be chosen as needed, but the dimensions of $x$ and $y$ are fixed for a given case.
Different structures
- Many-to-many: the first structure actually denotes a one-to-one relation between each $x_{t}$ and $y_{t}$. Specifically, when $T_{x}$ and $T_{y}$ have different lengths, an encoder for $x_{1},\dots,x_{T_{x}}$ with no output and a decoder for $y_{1},\dots,y_{T_{y}}$ with no input can be used.
- Many-to-one: the second type is like feeding in a sentence of several words to produce one label; all time steps of a training example $x_{1},\dots,x_{T_{x}}$ are used to generate a single $y$.
- One-to-one: actually, it could be viewed as a special many-to-many structure, or simply as an ordinary neural network.
- One-to-many: the reverse of the second type.
Personal views on the structures:
- For music or language generation, these four structures are well defined. However, in certain cases such as climatic projection, there could theoretically be only two types: the first structure, as opposed to the second one.
- In terms of input data, in my view it comes in two patterns, lags and sliding windows, with a many-to-many and a many-to-one mapping in each RNN unit. This is not an architectural difference.
- The difference between sliding windows and batches is that the former generates overlapping input data while still admitting heterogeneity between time steps, whereas the latter treats multiple data points as a homogeneous block (see the sketch after this list).
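As a concrete illustration of the sliding-window pattern mentioned above, the snippet below builds overlapping (window, target) pairs from a 1-D series for a many-to-one setup; the function name and the one-step-ahead horizon are assumptions for illustration.

```python
import numpy as np

def sliding_windows(series, window, horizon=1):
    """Build overlapping (input window, target) pairs from a 1-D series.

    Each window of length `window` is paired with the value `horizon` steps
    after its end, matching the many-to-one usage described above.
    """
    X, y = [], []
    for start in range(len(series) - window - horizon + 1):
        X.append(series[start:start + window])
        y.append(series[start + window + horizon - 1])
    return np.array(X), np.array(y)

series = np.arange(10, dtype=float)
X, y = sliding_windows(series, window=4)
print(X.shape, y.shape)   # (6, 4) (6,)
```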
Sampling when randomly generating sequences
To avoid a fixed output at every time step, RNNs use np.random.choice() to sample $\hat{y_{t}}$ according to the predicted probabilities; the sample is then passed as the input of the next time step.
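A small sketch of that sampling step, assuming $\hat{y_{t}}$ is already a valid probability vector (e.g. a softmax output); the helper name is hypothetical.

```python
import numpy as np

def sample_next_index(y_hat):
    """Sample one index from the probability vector y_hat (e.g. a softmax output).

    Sampling by probability, rather than taking the argmax, avoids producing
    the same fixed output at every time step when generating a sequence.
    """
    y_hat = np.asarray(y_hat).ravel()
    return np.random.choice(len(y_hat), p=y_hat)

probs = np.array([0.1, 0.6, 0.3])
idx = sample_next_index(probs)   # index 1 is drawn most often
# the sampled index would then be one-hot encoded and fed back as x_{t+1}
```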
Gated recurrent unit
Similar to RNNs, the GRU also computes a candidate activation $\hat{a_{t}}$ from the last time step $a_{t-1}$, but whether it is passed on to the next time step is determined by $\Gamma_{update}$.
\[\hat{a_{t}}=tanh(w_{ax}x_{t}+w_{aa}a_{t-1}+b_{a})\] \[a_{t}=\Gamma_{update}*\hat{a_{t}}+(1-\Gamma_{update})*{a_{t-1}}\]
where $\Gamma_{update}=\sigma(w_{ux}x_{t}+w_{ua}a_{t-1}+b_{u})$ ranges from 0 to 1, $*$ denotes element-wise multiplication, and $\sigma$ is the sigmoid function.
Sometimes $\hat{a_{t}}$ is computed from $\Gamma_{r}*a_{t-1}$ instead of $a_{t-1}$, a mechanism that enhances the ability to memorize long-term dependencies, where $\Gamma_{r}=\sigma(w_{rx}x_{t}+w_{ra}a_{t-1}+b_{r})$ also ranges between 0 and 1.
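Putting the GRU equations above together, a minimal NumPy sketch of one step could look like the following; the parameter-dictionary layout is an assumption for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_forward(x_t, a_prev, p):
    """One GRU step following the equations above; p maps names to weights/biases."""
    gamma_u = sigmoid(p["w_ux"] @ x_t + p["w_ua"] @ a_prev + p["b_u"])   # update gate
    gamma_r = sigmoid(p["w_rx"] @ x_t + p["w_ra"] @ a_prev + p["b_r"])   # relevance gate
    a_hat = np.tanh(p["w_ax"] @ x_t + p["w_aa"] @ (gamma_r * a_prev) + p["b_a"])
    a_t = gamma_u * a_hat + (1.0 - gamma_u) * a_prev                     # element-wise blend
    return a_t
```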
Long short-term memory neural networks
Compared to the GRU, the LSTM is a slightly more complex but more capable model. Here, $\Gamma_{update}$ is accompanied by $\Gamma_{forget}$ and $\Gamma_{output}$.
To decide which information is stored at the current time t, $\Gamma_{forget}$ takes the place of $(1-\Gamma_{update})$.
\[a_{t}=\Gamma_{update}*\hat{a_{t}}+\Gamma_{forget}*{a_{t-1}}\]
And $\Gamma_{output}$ is applied to $a_{t}$ before the softmax activation that produces $\hat{y_{t}}$.
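A minimal sketch of one LSTM step. Note that in the standard formulation the memory kept by the update/forget gates is a separate cell state $c_{t}$, and the output gate produces the hidden activation as $\Gamma_{output}*tanh(c_{t})$; the stacked $[a_{t-1};x_{t}]$ input and the weight names below are conventions assumed here, not taken from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(x_t, a_prev, c_prev, p):
    """One LSTM step; p maps gate names to weights (w_*) and biases (b_*)."""
    concat = np.vstack([a_prev, x_t])                  # stack [a_{t-1}; x_t] for compactness
    gamma_u = sigmoid(p["w_u"] @ concat + p["b_u"])    # update gate
    gamma_f = sigmoid(p["w_f"] @ concat + p["b_f"])    # forget gate (replaces 1 - gamma_u)
    gamma_o = sigmoid(p["w_o"] @ concat + p["b_o"])    # output gate
    c_hat = np.tanh(p["w_c"] @ concat + p["b_c"])      # candidate memory
    c_t = gamma_u * c_hat + gamma_f * c_prev           # stored information at time t
    a_t = gamma_o * np.tanh(c_t)                       # activation fed to the output layer
    return a_t, c_t
```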
Bidirectional-LSTM
The recurrent unit in a Bi-LSTM layer could be a GRU, an LSTM, etc.; it only needs a mirrored model $b_{T}$ to learn information starting from the end of the sequence.
For a basic bidirectional RNN unit, $t+T$ equals the total number of time steps of a training example.
\[y_{t}=softmax(w_{ya}a_{t}+w_{yb}b_{T}+b_{y})\]
Stacked layers
All the models above are single-layer models, which may be insufficient for more complex tasks. To enhance their capacity, stacked layers are considered.
For a certain layer $l$, $a_{l,t}$ represents the information received at the $l$-th layer and time step t.
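A sketch of how one time step could flow through stacked layers, using basic RNN units for simplicity: the input to layer $l$ at time t is the activation of layer $l-1$ at the same time step, combined with that layer's own activation from t-1. The parameter layout is assumed for illustration.

```python
import numpy as np

def stacked_rnn_step(x_t, a_prev_layers, layer_params):
    """One time step through L stacked (basic) RNN layers.

    a_prev_layers[l] holds a_{l, t-1}; the input to layer l at time t is the
    activation of layer l-1 at the same time step (or x_t for the first layer).
    """
    layer_input = x_t
    a_t_layers = []
    for l, (w_in, w_rec, b) in enumerate(layer_params):
        a_lt = np.tanh(w_in @ layer_input + w_rec @ a_prev_layers[l] + b)
        a_t_layers.append(a_lt)
        layer_input = a_lt        # feeds the next layer up
    return a_t_layers
```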
Attribution
The above content should be attributed to the Sequence Models course taught by Andrew Ng on the Coursera platform.
Detailed structures can be found on GitHub.