Neural Network properties
Feedforward NN (FFNN):
● An FFNN is a universal approximator: a feed-forward network with a single hidden layer containing a finite number of hidden neurons can approximate continuous functions on compact subsets of R^n, under mild assumptions on the activation function.
● Typical FFNNs have no inherent notion of order in time. They remember only what they learned during training.
Recurrent NN (RNN):
● RNNs are Turing-complete: they can compute anything that can be computed and have the capacity to simulate arbitrary procedures.
● RNNs possess a certain type of memory. They are much better suited to dealing with sequences, context modeling and time dependencies.
Another solution: Gated Recurrent Unit (GRU) GRU (Cho et al., 2014) is a bit simpler than LSTM (fewer weights).
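To make "fewer weights" concrete, here is a rough parameter count. It is a sketch under stated assumptions: a textbook LSTM has four affine maps from [input, hidden] to hidden (input gate, forget gate, output gate, cell candidate) and a textbook GRU has three (update gate, reset gate, candidate state). Peephole weights and library-specific extra biases are ignored, and the function names are illustrative only.

```python
def gated_rnn_params(input_size: int, hidden_size: int, n_blocks: int) -> int:
    """Count weights for n_blocks affine maps [input, hidden] -> hidden, with biases."""
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return n_blocks * per_block


def lstm_params(input_size: int, hidden_size: int) -> int:
    # Four blocks: input gate, forget gate, output gate, cell candidate.
    return gated_rnn_params(input_size, hidden_size, 4)


def gru_params(input_size: int, hidden_size: int) -> int:
    # Three blocks: update gate, reset gate, candidate state.
    return gated_rnn_params(input_size, hidden_size, 3)


if __name__ == "__main__":
    print("LSTM:", lstm_params(10, 20))
    print("GRU: ", gru_params(10, 20))
```

Under these assumptions a GRU layer has three quarters of the weights of an LSTM layer of the same size.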
Another useful thing: CTC Output Layer CTC (Connectionist Temporal Classification; Graves, Fernández, Gomez, Schmidhuber, 2006) was specifically designed for temporal classification tasks; that is, for sequence labelling problems where the alignment between the inputs and the target labels is unknown. CTC models all aspects of the sequence with a single neural network, and does not require the network to be combined with a hidden Markov model. It also does not require presegmented training data, or external post-processing to extract the label sequence from the network outputs. The CTC network predicts only the sequence of phonemes (typically as a series of spikes, separated by ‘blanks’, or null predictions), while the framewise network attempts to align them with the manual segmentation.
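The "spikes separated by blanks" output can be turned into a label sequence by merging repeated labels and then dropping blanks. The sketch below shows this with best-path (greedy) decoding, which is only one simple way to read out a CTC network. The function names and the choice of index 0 for the blank are assumptions made for illustration.

```python
def ctc_collapse(path, blank=0):
    """Merge consecutive repeats, then remove blanks (the CTC path-to-labelling map)."""
    out = []
    prev = None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out


def best_path_decode(frame_probs, blank=0):
    """Pick the most probable label in every frame, then collapse the path."""
    path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    return ctc_collapse(path, blank)


if __name__ == "__main__":
    # Two separate spikes of label 1 survive because a blank sits between them.
    print(ctc_collapse([0, 1, 1, 0, 1, 2, 2, 0]))  # [1, 1, 2]
```

Note that no frame-level alignment is needed at any point: only the order of the spikes matters.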
Example: CTC vs. Framewise classification
End of Intro From here on we will not distinguish between RNN/GRU/LSTM and will usually use the word RNN for any kind of internal block. Most RNNs in use now are actually LSTMs. A significant part of the presentation is based on the work of Alex Graves et al.
Some interesting generalizations of simple RNN architecture
#1 Directionality (BRNN/BLSTM)
Bidirectional RNN/LSTM There are many situations when you see the whole sequence at once (OCR, speech recognition, translation, caption generation, …). So you can scan the [1-d] sequence in both directions, forward and backward. Here comes BLSTM (Graves, Schmidhuber, 2005).
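A minimal sketch of the bidirectional idea, assuming a scalar tanh RNN instead of an LSTM to keep it short. The weights and function names are made up for illustration. One scan runs left to right, a second scan runs right to left, and each position receives both states.

```python
import math


def scan(xs, w_x, w_h):
    """Plain scalar RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h, out = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        out.append(h)
    return out


def bidirectional(xs, fwd=(0.5, 0.8), bwd=(0.5, 0.8)):
    """Return (forward_state, backward_state) for every position of xs."""
    forward = scan(xs, *fwd)
    # Scan the reversed sequence, then flip the result back into input order.
    backward = scan(xs[::-1], *bwd)[::-1]
    return list(zip(forward, backward))


if __name__ == "__main__":
    # Single impulse in the middle: position 0 sees it only via the backward
    # scan, position 2 only via the forward scan.
    for t, (f, b) in enumerate(bidirectional([0.0, 1.0, 0.0])):
        print(t, round(f, 3), round(b, 3))
```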
Typical result: BRNN>RNN, LSTM>RNN, BLSTM>BRNN
Example: BLSTM classifying the utterance “one oh five”
#2 Dimensionality (MDRNN/MDLSTM)
Multidimensional RNN/LSTM Standard RNNs are inherently one dimensional, and therefore poorly suited to multidimensional data (e.g. images). The basic idea of MDRNNs (Graves, Fernandez, Schmidhuber, 2007) is to replace the single recurrent connection found in standard RNNs with as many recurrent connections as there are dimensions in the data. It assumes some ordering on the multidimensional data.
Uni-directionality MDRNN assumes some ordering on the multidimensional data. And it’s not the only possible one.
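A sketch of a 2D MDRNN forward pass, assuming a scalar tanh unit and arbitrary illustrative weights (not the authors' implementation). The single recurrent connection of a 1D RNN becomes two: one from the neighbour above and one from the neighbour to the left. The fixed top-left to bottom-right ordering means each pixel only sees context from its upper-left region.

```python
import math


def mdrnn_2d(image, w_in=0.5, w_up=0.4, w_left=0.4):
    """Scan a 2D list row by row; each state depends on the up and left states."""
    rows, cols = len(image), len(image[0])
    h = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            up = h[i - 1][j] if i > 0 else 0.0    # recurrence along dimension 1
            left = h[i][j - 1] if j > 0 else 0.0  # recurrence along dimension 2
            h[i][j] = math.tanh(w_in * image[i][j] + w_up * up + w_left * left)
    return h


if __name__ == "__main__":
    corner = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    h = mdrnn_2d(corner)
    print(h[2][2] > 0.0)  # the top-left impulse reaches the bottom-right pixel
```

With the impulse placed in the bottom-right corner instead, every other state stays at zero, which is the uni-directionality limitation.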
#3 Directionality + Dimensionality (MDMDRNN?)
Multidirectional multidimensional RNN (MDMDRNN?) The previously mentioned ordering is not the only possible one. It might be OK for some tasks, but it is usually preferable for the network to have access to the surrounding context in all directions. This is particularly true for tasks where precise localisation is required, such as image segmentation. For one dimensional RNNs, the problem of multidirectional context was solved by the introduction of bidirectional recurrent neural networks (BRNNs). BRNNs contain two separate hidden layers that process the input sequence in the forward and reverse directions. BRNNs can be extended to n-dimensional data by using 2^n separate hidden layers, each of which processes the sequence using the ordering defined above, but with a different choice of axes.
Multi-directionality As before, the hidden layers are connected to a single output layer, which now has access to all surrounding context
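The 2^n scan directions can be enumerated as one forward/backward choice per axis. The sketch below does this and shows, for 2D, how flipping the image lets a single top-left to bottom-right scanner cover every direction. The helper names are illustrative assumptions.

```python
from itertools import product


def scan_directions(n_dims):
    """All 2**n_dims combinations of forward (+1) / backward (-1) per axis."""
    return list(product((+1, -1), repeat=n_dims))


def flip(image, direction):
    """Flip a 2D list so a top-left-to-bottom-right scan follows `direction`."""
    d_row, d_col = direction
    rows = image if d_row == 1 else image[::-1]
    return [row if d_col == 1 else row[::-1] for row in rows]


if __name__ == "__main__":
    image = [[1, 2], [3, 4]]
    for direction in scan_directions(2):
        print(direction, flip(image, direction))
```

Each flipped copy would be fed to its own hidden layer, and the outputs would be flipped back before reaching the shared output layer.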
MDMDRNN example: Air Freight database (2007) A ray-traced colour image sequence that comes with a ground truth segmentation into the different textures mapped onto the 3-d models. The sequence is 455 frames (160x120 px) long and contains 155 distinct textures.
MDMDRNN example: Air Freight database (2007) Network structure:
● Multidirectional 2D LSTM.
● 4 layers (not levels! just 4 directional layers on a single level), each consisting of 25 memory blocks, each block containing 1 cell, 2 forget gates, 1 input gate, 1 output gate and 5 peephole weights.
● The input and output activation function of the cells was tanh, and the activation function for the gates was the logistic sigmoid.
● The input layer was size 3 (RGB) and the output layer (softmax) was size 155 (one unit for each texture).
● The network contained 43,257 trainable weights in total.
● The final pixel classification error rate, after 330 training epochs, was 7.1% on the test set.
MDMDRNN example: MNIST (2007) Additional evaluation on the warped dataset (not used in training at all)
#4 Hierarchical subsampling (HSRNN)
Hierarchical Subsampling Networks (HSRNN) So-called hierarchical subsampling is commonly used in fields such as computer vision where the volume of data is too great to be processed by a ‘flat’ architecture. As well as reducing computational cost, it also reduces the effective dispersal of the data, since inputs that are widely separated at the bottom of the hierarchy are transformed to features that are close together at the top. A hierarchical subsampling recurrent neural network (HSRNN, Graves and Schmidhuber, 2009) consists of an input layer, an output layer and multiple levels of recurrently connected hidden layers. The output sequence of each level in the hierarchy is used as the input sequence for the next level up. All input sequences are subsampled using subsampling windows of predetermined width. The structure is similar to that used by convolutional networks, except with recurrent, rather than feedforward, hidden layers.
HSRNN For each layer in the hierarchy, the forward pass equations are identical to those for a standard RNN, except that the sum over input units is replaced by a sum of sums over the subsampling window. A good rule of thumb is to choose the layer sizes so that each level consumes roughly half the processing time of the level below.
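The "sum of sums" can be sketched for the input term of a single hidden unit, assuming non-overlapping windows and dropping trailing timesteps that do not fill a window (padding would be an alternative). The recurrent term is omitted, and the names and shapes are illustrative.

```python
def subsample_inputs(seq, weights):
    """Input pre-activation of one hidden unit, one value per subsampling window.

    seq:     list of input vectors, one per timestep
    weights: weights[p][i] for window position p and input unit i
    """
    width = len(weights)
    out = []
    for start in range(0, len(seq) - width + 1, width):
        # Outer sum over positions in the window, inner sum over input units.
        a = sum(
            sum(w * x for w, x in zip(weights[p], seq[start + p]))
            for p in range(width)
        )
        out.append(a)
    return out


if __name__ == "__main__":
    seq = [[1, 0], [0, 1], [1, 1], [2, 0]]
    weights = [[1, 2], [3, 4]]  # window width 2, two input units
    print(subsample_inputs(seq, weights))  # [5, 9]
```

A window of width 2 halves the sequence length seen by the next level, which is where the saving in processing time comes from.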
HSRNN HSRNNs can easily be extended to the multidimensional and multidirectional case. The problem is that each level of the hierarchy then requires 2^n hidden layers instead of one. Connecting every layer at one level to every layer at the next therefore requires O(2^{2n}) weights. One way to reduce the number of weights is to separate the levels with nonlinear feedforward layers, which reduces the number of weights between the levels to O(2^n), the same as standard MDRNNs. As a rule of thumb, giving each feedforward layer between half and one times as many units as the combined hidden layers in the level below appears to work well in practice.
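The saving can be seen by counting layer-to-layer weight matrices between two levels, which is a proxy for the number of weights when layer sizes are held fixed. This is a simplifying assumption, and the function names are illustrative.

```python
def direct_blocks(n_dims):
    """Every directional layer below feeds every directional layer above."""
    layers = 2 ** n_dims
    return layers * layers  # 2^(2n) weight matrices


def via_feedforward_blocks(n_dims):
    """All layers below feed one feedforward layer, which feeds all layers above."""
    layers = 2 ** n_dims
    return layers + layers  # 2 * 2^n weight matrices, i.e. O(2^n)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, direct_blocks(n), via_feedforward_blocks(n))
```

For 2D data this is 16 matrices against 8, and for 3D data 64 against 16.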
HSRNN example: Arabic handwriting recognition Network structure:
● The hierarchy contained three levels, multidirectional MDLSTM (so 4 hidden layers for 2D data).
● The three levels were separated by two feedforward layers with the tanh activation function.
● Subsampling windows were applied in three places: to the input sequence, to the output sequence of the first hidden level, and to the output sequence of the second hidden level.
Offline Arabic handwriting recognition (2009)
● 32,492 black-and-white images of individual handwritten Tunisian town and village names, of which we used 30,000 for training and 2,492 for validation.
● Each image was supplied with a manual transcription for the individual characters, and the postcode of the corresponding town. There were 120 distinct characters in total.
● The task was to identify the postcode, from a list of 937 town names and corresponding postcodes. Many of the town names had transcription variants, giving a total of 1,518 entries in the complete postcode lexicon.
● The test data (which is not published) was divided into sets ‘f’ and ‘s’. The main competition results were based on set ‘f’. Set ‘s’ contains data collected in the United Arab Emirates using the same forms; its purpose was to test the robustness of the recognisers to regional writing variations.
Convolutional LSTM (CLSTM) (2015) Actually an LSTM over the last layers of CNN. “Among various models, multi-dimensional recurrent neural network, specifically multi-dimensional long-short term memory (MD-LSTM) has shown promising results and can be naturally integrated and trained ‘end-to-end’ fashion. However, when we try to learn the structure with very low level representation such as input pixel level, the dependency structure can be too noisy or spatially long-term dependency information can be vanished while training. Therefore, we propose to use 2D-LSTM layer on top of convolutional layers by taking advantage of convolution layers to extract high level representation of the image and 2D-LSTM layer to learn global spatial dependencies. We call this network as convolutional LSTM (CLSTM)”
Convolutional LSTM (CLSTM) (2015) “Our CLSTM models are constructed by replacing the last two convolution layer of CNN with two 2D LSTM layers. Since we used multidirectional 2D LSTM, there are 2^2 directional nodes for each location of feature map.”
Convolutional RNN (C-RNN) (2015) “The C-RNN is trained in an end-to-end manner from raw pixel images. CNN layers are firstly processed to generate middle level features. RNN layer is then learned to encode spatial dependencies.” “In , MDLSTM was proposed to solve the handwriting recognition problem by using RNN. Different from this work, we utilize quad-directional 1D RNN instead of their 2D RNN, our RNN is simpler and it has fewer parameters, but it can already cover the context from all directions. Moreover, our C-RNN make both use of the discriminative representation power of CNN and contextual information modeling capability of RNN, which is more powerful for solving large scale image classification problem.” Funny: this is not an LSTM, just a simple RNN.
Convolutional RNN (C-RNN) (2015) “Our C-RNN had the same settings with Alex-net, except that it directly connects the output of the fifth convolutional layer to the sixth fully connected layer, while our C-RNN uses the RNN to connect the fifth convolutional layer and the fully connected layers”
Convolutional hierarchical RNN (C-HRNN) (2015) “In Hierarchical RNNs (HRNNs), each RNN layer focuses on modeling spatial dependencies among image regions from the same scale but different locations. While the cross RNN scale connections target on modeling scale dependencies among regions from the same location but different scales.” Finally with LSTM: “Specifically, we propose two recurrent neural network models: 1) hierarchical simple recurrent network (HSRN), which is fast and has low computational cost; and 2) hierarchical long-short term memory recurrent network (HLSTM), which performs better than HSRN with the price of more computational cost.”
Convolutional hierarchical RNN (C-HRNN) (2015) “Thus, inspired by , we generate “2D sequences” for images, and each element simultaneously receives spatial contextual references from its 2D neighborhood elements.”
ReNet (2015) “Our model relies on purely uni-dimensional RNNs coupled in a novel way, rather than on a multi-dimensional RNN. The basic idea behind the proposed ReNet architecture is to replace each convolutional layer (with convolution+pooling making up a layer) in the CNN with four RNNs that sweep over lower-layer features in different directions: (1) bottom to top, (2) top to bottom, (3) left to right and (4) right to left.” “The main difference between ReNet and the model of Graves and Schmidhuber  is that we use the usual sequence RNN, instead of the multidimensional RNN.”
ReNet (2015) “One important consequence of the proposed approach compared to the multidimensional RNN is that the number of RNNs at each layer scales now linearly with respect to the number of dimensions d of the input image (2d). A multidimensional RNN, on the other hand, requires the exponential number of RNNs at each layer (2^d). Furthermore, the proposed variant is more easily parallelizable, as each RNN is dependent only along a horizontal or vertical sequence of patches. This architectural distinction results in our model being much more amenable to distributed computing than that of Graves and Schmidhuber ”. … But for d = 2, 2d == 2^d.
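A quick check of the linear versus exponential claim, using only the two counts quoted above (2d sweeping RNNs per layer against 2^d multidimensional RNNs per layer). The function names are illustrative.

```python
def linear_sweeps(d):
    """RNNs per layer when each axis is swept in both directions: 2 * d."""
    return 2 * d


def multidimensional_rnns(d):
    """RNNs per layer for a multidirectional multidimensional RNN: 2 ** d."""
    return 2 ** d


if __name__ == "__main__":
    for d in range(1, 5):
        print(d, linear_sweeps(d), multidimensional_rnns(d))
    # d=1: 2 vs 2, d=2: 4 vs 4, d=3: 6 vs 8, d=4: 8 vs 16
```

For images (d = 2) the two counts coincide, so the difference only shows up for volumetric or higher dimensional data.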
Example #5: “The Empire Strikes Back” PyraMiD-LSTM (2015) [Marijn F. Stollenga, Wonmin Byeon, Marcus Liwicki, Juergen Schmidhuber] http://arxiv.org/abs/1506.07452
PyraMiD-LSTM (2015) “Multi-Dimensional Recurrent NNs (MD-RNNs) can perceive the entire spatio-temporal context of each pixel in a few sweeps through all pixels, especially when the RNN is a Long Short-Term Memory (LSTM). Despite these theoretical advantages, however, unlike CNNs, previous MD-LSTM variants were hard to parallelize on GPUs. Here we re-arrange the traditional cuboid order of computations in MD-LSTM in pyramidal fashion. The resulting PyraMiD-LSTM is easy to parallelize, especially for 3D data such as stacks of brain slice images.”
PyraMiD-LSTM (2015) “One of the striking differences between PyraMiD-LSTM and MD-LSTM is the shape of the scanned contexts. Each LSTM of an MD-LSTM scans rectangle-like contexts in 2D or cuboids in 3D. Each LSTM of a PyraMiD-LSTM scans triangles in 2D and pyramids in 3D. An MD-LSTM needs 8 LSTMs to scan a volume, while a PyraMiD-LSTM needs only 6, since it takes 8 cubes or 6 pyramids to fill a volume. Given dimension d, the number of LSTMs grows as 2^d for an MD-LSTM (exponentially) and 2 × d for a PyraMiD-LSTM (linearly).”
PyraMiD-LSTM (2015) On the MR brain dataset, training took around three days, and testing per volume took around 2 minutes. Networks contain three PyraMiD-LSTM layers:
1. 16 hidden units + fully-connected layer with 25 hidden units;
2. 32 hidden units + fully-connected layer with 45 hidden units;
3. 64 hidden units + fully-connected output layer whose size equals the number of classes.
“Previous MD-LSTM implementations, however, could not exploit the parallelism of modern GPU hardware. This has changed through our work presented here. Although our novel highly parallel PyraMiD-LSTM has already achieved state-of-the-art segmentation results in challenging benchmarks, we feel we have only scratched the surface of what will become possible with such PyraMiD-LSTM and other MD-RNNs.”
Grid LSTM (2016) “This paper introduces Grid Long Short-Term Memory, a network of LSTM cells arranged in a multidimensional grid that can be applied to vectors, sequences or higher dimensional data such as images. The network differs from existing deep LSTM architectures in that the cells are connected between network layers as well as along the spatiotemporal dimensions of the data. The network provides a unified way of using LSTM for both deep and sequential computation.”
Grid LSTM (2016) “Deep networks suffer from exactly the same problems as recurrent networks applied to long sequences: namely that information from past computations rapidly attenuates as it progresses through the chain – the vanishing gradient problem (Hochreiter, 1991) – and that each layer cannot dynamically select or ignore its inputs. It therefore seems attractive to generalise the advantages of LSTM to deep computation.” Can be N-dimensional. N-dimensional Grid LSTM is called N-LSTM for short.
Grid LSTM (2016) One-dimensional Grid LSTM corresponds to a feed-forward network that uses LSTM cells in place of transfer functions such as tanh and ReLU. These networks are related to Highway Networks (Srivastava et al., 2015) where a gated transfer function is used to successfully train feed-forward networks with up to 900 layers of depth. Grid LSTM with two dimensions is analogous to the Stacked LSTM, but it adds cells along the depth dimension too. Grid LSTM with three or more dimensions is analogous to Multidimensional LSTM, but differs from it not just by having the cells along the depth dimension, but also by using the proposed mechanism for modulating the N-way interaction that is not prone to the instability present in Multidimensional LSTM.
Grid LSTM (2016) The difference with the Multidimensional LSTM is that we apply multiple layers of depth to the image, use three-dimensional blocks and concatenate the top output vectors before classification. The difference with the ReNet architecture is that the 3-LSTM processes the image according to the two inherent spatial dimensions; instead of stacking hidden layers as in the ReNet, the block also modulates directly what information is passed along the depth dimension.
Time for Discussion: RNN vs. CNN for Computer Vision
Resources (more recent)
- ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks. Francesco Visin, Kyle Kastner, Kyunghyun Cho, Matteo Matteucci, Aaron Courville, Yoshua Bengio. http://arxiv.org/abs/1505.00393
- Parallel Multi-Dimensional LSTM, With Application to Fast Biomedical Volumetric Image Segmentation. Marijn F. Stollenga, Wonmin Byeon, Marcus Liwicki, Juergen Schmidhuber. http://arxiv.org/abs/1506.07452
- Grid Long Short-Term Memory. Nal Kalchbrenner, Ivo Danihelka, Alex Graves. http://arxiv.org/abs/1507.01526