The family of RNNs
Temporal analysis and prediction are of significant importance and have motivated many statistical methods. Since the application of Artificial Intelligence, the field has stepped into a data-driven era, especially with deep learning models. A typical genre is based on recurrent neural networks (RNNs), which were later improved as long short-term memory networks (LSTMs) for longer sequences, simplified as gated recurrent units (GRUs), and extended as Bi-LSTMs for bidirectional data flow.
The following structure generally holds for these models and can serve as a reference when building one from scratch (a small sketch of the clipping and update steps follows the list):
- Initialize parameters
- Run the optimization loop:
  - Forward propagation to compute the loss function
  - Backward propagation to compute the gradients with respect to the loss function
  - Clip the gradients to avoid exploding gradients
  - Using the gradients, update the parameters with the gradient-descent update rule
- Return the learned parameters
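As a rough sketch of the clipping and update steps in that loop, the snippet below keeps parameters and gradients in plain dictionaries keyed by name (e.g. w_aa / dw_aa); the dictionary layout and the max-value of 5.0 are assumptions for illustration, and the forward/backward passes are omitted since they depend on the concrete model.

```python
import numpy as np

def clip_gradients(grads, max_value=5.0):
    """Clip every gradient in place to [-max_value, max_value] to avoid exploding gradients."""
    for g in grads.values():
        np.clip(g, -max_value, max_value, out=g)
    return grads

def update_parameters(params, grads, learning_rate=0.01):
    """Gradient-descent update rule: theta <- theta - learning_rate * d_theta."""
    for name in params:
        params[name] -= learning_rate * grads["d" + name]
    return params

# Toy check with a single parameter matrix.
params = {"w_aa": np.ones((2, 2))}
grads = {"dw_aa": np.full((2, 2), 10.0)}
clip_gradients(grads)             # gradients are capped at 5.0
update_parameters(params, grads)  # w_aa becomes 1 - 0.01 * 5 = 0.95
```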
Recurrent neural networks
Generally, RNNs can be used for different kinds of 1-D data, such as time series and natural language, but the underlying logic is identical. The basic rule is that at time t, the RNN receives information not only from the input $x_{t}$ but also from earlier time steps. For an unrolled RNN:
\[a_{t}=g_{1}(w_{ax}x_{t}+w_{aa}a_{t-1}+b_{a})\] \[y_{t}=g_{2}(w_{ya}a_{t}+b_{y})\]
where $g_{1}$ is commonly tanh or ReLU, which helps mitigate exploding and vanishing gradients, and $g_{2}$ could be sigmoid, etc.
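A minimal NumPy sketch of one forward step through this cell, using tanh for $g_{1}$ and softmax for $g_{2}$ (one of several possible output activations); the function name, toy shapes, and random initialization below are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a column vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_cell_forward(x_t, a_prev, w_ax, w_aa, w_ya, b_a, b_y):
    """One forward step of a basic RNN cell (g1 = tanh, g2 = softmax here)."""
    a_t = np.tanh(w_ax @ x_t + w_aa @ a_prev + b_a)   # hidden activation
    y_t = softmax(w_ya @ a_t + b_y)                   # output at time t
    return a_t, y_t

# Toy shapes: n_x input features, n_a hidden units, n_y output classes.
n_x, n_a, n_y = 3, 5, 2
rng = np.random.default_rng(0)
a_t, y_t = rnn_cell_forward(
    rng.standard_normal((n_x, 1)), np.zeros((n_a, 1)),
    rng.standard_normal((n_a, n_x)), rng.standard_normal((n_a, n_a)),
    rng.standard_normal((n_y, n_a)), np.zeros((n_a, 1)), np.zeros((n_y, 1)),
)
```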
Small tips for RNNs and other family members:
- Parameters $w_{aa}$, $w_{ax}$ and $w_{ya}$ are shared across all time steps t
- The dimension of $a$ can be chosen as needed, but the dimensions of $x$ and $y$ are fixed for a given case.
Different structures
- Many-to-many: the first structure actually denotes a one-to-one relation between each $x_{t}$ and $y_{t}$. Specifically, when $T_{x}$ and $T_{y}$ have different lengths, an encoder for $x_{1},\dots,x_{T_{x}}$ with no output and a decoder for $y_{1},\dots,y_{T_{y}}$ with no input can be used.
- Many-to-one: the second type is like feeding in a sentence of several words to produce one label; all time steps of a training example $x_{1},\dots,x_{T_{x}}$ are used to generate a single $y$.
- One-to-one: actually, it could be viewed as a special many-to-many structure, or simply as an ordinary neural network.
- One-to-many: the reverse of the second type.
Personal views on the structures:
- For music or language generation, these four structures are well defined. However, in certain cases such as climatic projection, there could theoretically be only two types: the first structure, as opposed to the second one.
- In terms of input data, in my view it comes in two patterns, lags and sliding windows, with a many-to-many and a many-to-one mapping in each RNN unit. This is not an architectural difference.
- The difference between sliding windows and batches is that the former generates overlapping input data while still admitting heterogeneity between time steps, whereas the latter treats multiple data points as a homogeneous block (see the sketch after this list).
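As a concrete illustration of the sliding-window pattern mentioned above, the snippet below builds overlapping (window, target) pairs from a 1-D series for a many-to-one setup; the function name and the one-step-ahead horizon are assumptions for illustration.

```python
import numpy as np

def sliding_windows(series, window, horizon=1):
    """Build overlapping (input window, target) pairs from a 1-D series.

    Each window of length `window` is paired with the value `horizon` steps
    after its end, matching the many-to-one usage described above.
    """
    X, y = [], []
    for start in range(len(series) - window - horizon + 1):
        X.append(series[start:start + window])
        y.append(series[start + window + horizon - 1])
    return np.array(X), np.array(y)

series = np.arange(10, dtype=float)
X, y = sliding_windows(series, window=4)
print(X.shape, y.shape)   # (6, 4) (6,)
```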
Sampling when randomly generating sequences
To avoid a fixed output at every time step, RNNs use np.random.choice() to sample $\hat{y_{t}}$ according to the predicted probabilities; the sample is then passed as the input of the next time step.
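A small sketch of that sampling step, assuming $\hat{y_{t}}$ is already a valid probability vector (e.g. a softmax output); the helper name is hypothetical.

```python
import numpy as np

def sample_next_index(y_hat):
    """Sample one index from the probability vector y_hat (e.g. a softmax output).

    Sampling by probability, rather than taking the argmax, avoids producing
    the same fixed output at every time step when generating a sequence.
    """
    y_hat = np.asarray(y_hat).ravel()
    return np.random.choice(len(y_hat), p=y_hat)

probs = np.array([0.1, 0.6, 0.3])
idx = sample_next_index(probs)   # index 1 is drawn most often
# the sampled index would then be one-hot encoded and fed back as x_{t+1}
```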
Gated recurrent unit
Similar to RNNs, the GRU also computes a candidate activation $\hat{a_{t}}$ from the last time step $a_{t-1}$, but whether it is passed on to the next time step is determined by $\Gamma_{update}$.
\[\hat{a_{t}}=tanh(w_{ax}x_{t}+w_{aa}a_{t-1}+b_{a})\] \[a_{t}=\Gamma_{update}*\hat{a_{t}}+(1-\Gamma_{update})*{a_{t-1}}\]
where $\Gamma_{update}=\sigma(w_{ux}x_{t}+w_{ua}a_{t-1}+b_{u})$ ranges from 0 to 1, $*$ denotes element-wise multiplication, and $\sigma$ is the sigmoid function.
Sometimes $\hat{a_{t}}$ is computed from $\Gamma_{r}*a_{t-1}$ instead of $a_{t-1}$, a mechanism that enhances the ability to memorize long-term dependencies, where $\Gamma_{r}=\sigma(w_{rx}x_{t}+w_{ra}a_{t-1}+b_{r})$ also ranges between 0 and 1.
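Putting the GRU equations above together, a minimal NumPy sketch of one step could look like the following; the parameter-dictionary layout is an assumption for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_forward(x_t, a_prev, p):
    """One GRU step following the equations above; p maps names to weights/biases."""
    gamma_u = sigmoid(p["w_ux"] @ x_t + p["w_ua"] @ a_prev + p["b_u"])   # update gate
    gamma_r = sigmoid(p["w_rx"] @ x_t + p["w_ra"] @ a_prev + p["b_r"])   # relevance gate
    a_hat = np.tanh(p["w_ax"] @ x_t + p["w_aa"] @ (gamma_r * a_prev) + p["b_a"])
    a_t = gamma_u * a_hat + (1.0 - gamma_u) * a_prev                     # element-wise blend
    return a_t
```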
Long short-term memory neural networks
Compared to the GRU, the LSTM is a slightly more complex but more capable model. Here, $\Gamma_{update}$ is accompanied by $\Gamma_{forget}$ and $\Gamma_{output}$.
To decide which information is stored at the current time t, $\Gamma_{forget}$ takes the place of $(1-\Gamma_{update})$.
\[a_{t}=\Gamma_{update}*\hat{a_{t}}+\Gamma_{forget}*{a_{t-1}}\]
And $\Gamma_{output}$ is applied to $a_{t}$ before the softmax activation that produces $\hat{y_{t}}$.
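A minimal sketch of one LSTM step. Note that in the standard formulation the memory kept by the update/forget gates is a separate cell state $c_{t}$, and the output gate produces the hidden activation as $\Gamma_{output}*tanh(c_{t})$; the stacked $[a_{t-1};x_{t}]$ input and the weight names below are conventions assumed here, not taken from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(x_t, a_prev, c_prev, p):
    """One LSTM step; p maps gate names to weights (w_*) and biases (b_*)."""
    concat = np.vstack([a_prev, x_t])                  # stack [a_{t-1}; x_t] for compactness
    gamma_u = sigmoid(p["w_u"] @ concat + p["b_u"])    # update gate
    gamma_f = sigmoid(p["w_f"] @ concat + p["b_f"])    # forget gate (replaces 1 - gamma_u)
    gamma_o = sigmoid(p["w_o"] @ concat + p["b_o"])    # output gate
    c_hat = np.tanh(p["w_c"] @ concat + p["b_c"])      # candidate memory
    c_t = gamma_u * c_hat + gamma_f * c_prev           # stored information at time t
    a_t = gamma_o * np.tanh(c_t)                       # activation fed to the output layer
    return a_t, c_t
```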
Bidirectional-LSTM
The recurrent unit in a Bi-LSTM layer could be a GRU, an LSTM, etc.; it only needs a mirrored model $b_{T}$ to learn information starting from the end of the sequence.
For a basic bidirectional RNN unit, $t+T$ equals the total number of time steps of a training example.
\[y_{t}=softmax(w_{ya}a_{t}+w_{yb}b_{T}+b_{y})\]
Stacked layers
All the models above are single-layer models, which may be insufficient for more complex tasks. To enhance their capacity, stacked layers are considered.
For a certain layer $l$, $a_{l,t}$ represents the information received at the $l$-th layer and time step t.
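A sketch of how one time step could flow through stacked layers, using basic RNN units for simplicity: the input to layer $l$ at time t is the activation of layer $l-1$ at the same time step, combined with that layer's own activation from t-1. The parameter layout is assumed for illustration.

```python
import numpy as np

def stacked_rnn_step(x_t, a_prev_layers, layer_params):
    """One time step through L stacked (basic) RNN layers.

    a_prev_layers[l] holds a_{l, t-1}; the input to layer l at time t is the
    activation of layer l-1 at the same time step (or x_t for the first layer).
    """
    layer_input = x_t
    a_t_layers = []
    for l, (w_in, w_rec, b) in enumerate(layer_params):
        a_lt = np.tanh(w_in @ layer_input + w_rec @ a_prev_layers[l] + b)
        a_t_layers.append(a_lt)
        layer_input = a_lt        # feeds the next layer up
    return a_t_layers
```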
Attribution
The above content should be attributed to the Sequence Models course taught by Andrew Ng on the Coursera platform.
Detailed structures can be found on GitHub.