This page reproduces the content of http://www.slideshare.net/yuhuang/deep-learning-for-image-denoising-superresolution-27435126.

by Yu Huang

Uploaded about 3 years ago (2013/10/21) in Technology

deep learning, MLP, Convolutional Network, Deep Belief Nets, Deep Boltzmann Machine, Stacked Denoising Auto-Encoder, Image Denoising, Image Superresolution

- Deep Learning for Image Denoising and Super-resolution

Yu Huang

Sunnyvale, California

yu.huang07@gmail.com - Outline

• Deep learning

• Why deep learning?

• State of Art deep learning

• Parallel Deep Learning at Google

• Sparse coding

• Dictionary learning

• Multiple Layer NN (MLP)

• Convolutional Neural Network

• Stacked Denoising Auto-Encoder

• Deep Belief Nets (DBN)

• Deep Boltzmann Machines (DBM)

• Generative model: MRF

• Deep Gated MRF

• Image Denoising

• Image Denoising by BM3D

• Image Denoising by K-SVD

• Image Denoising by CNN

• Image Denoising by MLPs

• Image Denoising by DBMs

• Image Denoising by Deep GMRF

• Image Restoration by CNN

• Image Super-resolution

• Example-based SR

• Sparse Coding for SR

• Frame Alignment-based SR

• Image Super-resolution by DBMs

• Image Super-resolution by DBNs

• Image SR by Cascaded SAE

• Image SR by Deep CNN

• References

• Appendix
- Appendix

• PCA, AP & Spectral Clustering

• NMF & pLSA

• ISOMAP

• LLE

• Laplacian Eigenmaps

• Gaussian Mixture & EM

• Hidden Markov Model (HMM)

• Discriminative model: CRF

• Product of Experts

• Back propagation

• Stochastic gradient descent

• MCMC sampling for optimization approx.

• Mean field for optimization approx.

• Contrastive divergence for RBMs

• “Wake-sleep” algorithm for DBNs

• Two-stage pre-training for DBMs

• Greedy layer-wise unsupervised pre-training - Gartner Emerging Tech Hype Cycle 2012
- Deep Learning

• Representation learning attempts to automatically learn good features or

representations;

• Deep learning algorithms attempt to learn multiple levels of representation

of increasing complexity/abstraction (intermediate and high level features);

• Become effective via unsupervised pre-training + supervised fine tuning;

• Deep networks trained with back propagation (without unsupervised pre-training) perform worse than shallow networks.

• Deal with the curse of dimensionality (smoothing & sparsity) and over-fitting (unsupervised, regularizer);

• Semi-supervised: structure of manifold assumption;

• labeled data is scarce and unlabeled data is abundant. - Deep Net Architectures

• Feed-Forward: multilayer neural nets, convolutional neural nets

• Feed-Back: stacked sparse coding, deconvolutional nets

• Bi-Directional: deep Boltzmann machines, stacked auto-encoders - Why Deep Learning?

• Supervised training of deep models (e.g. many-layered Nets) is too hard

(optimization problem);

• Learn prior from unlabeled data;

• Shallow models are not for learning high-level abstractions;

• Ensembles or forests do not learn features first;

• Graphical models could be deep net, but mostly not.

• Unsupervised learning could be “local-learning”;

• Resemble boosting with each layer being like a weak learner

• Learning is weak in directed graphical models with many hidden

variables;

• Sparsity and regularizer.

• Traditional unsupervised learning methods cannot easily learn multiple

levels of representation.

• Layer-wise unsupervised learning is the solution.

• Multi-task learning (transfer learning and self taught learning);

• Other issues: scalability & parallelism with the burden from big data. - The Mammalian Visual Cortex is Hierarchical
- State-of-Art Deep Learning R&D

• Deep Learning as the hottest topic in speech recognition

• Performance records broken with deep learning methods

• Microsoft, Google: DL-based speech recognition products

• Deep Learning is the hottest topic in Computer Vision

• The record holders on ImageNet are convolutional nets

• Deep Learning is becoming hot in NLP

• Deep Learning/Feature Learning in Applied Mathematics

• sparse coding

• non-convex optimization

• stochastic gradient algorithms

• Transfer learning: inductive transfer, storing knowledge gained while

solving one problem and applying it to a different but related problem

• Transfer the classification knowledge, adapt the model, or annotate less data.

• Self taught learning: generic unlabeled data improve the performance

on a supervised learning task.

• Relax the assumption about the unlabeled data;

• Use unlabeled data to learn the best representation (dictionary) with sparse

coding. - Convolutional Neural Network’s Progress

• Data and GPU, also networks deeper and more non-linear.

Convolutional Neural Net 2012

Convolutional Neural Net 1998

Convolutional Neural Net 1988 - Convolutional Neural Network’s Progress

• Fukushima 1980: designed network with same basic structure but

did not train by back propagation.

• LeCun from late 80s: figured out back propagation for CNN,

popularized and deployed CNN for OCR applications and others.

• Poggio from 1999: same basic structure but learning is restricted to

top layer (k-means at second stage)

• LeCun from 2006: unsupervised feature learning

• DiCarlo from 2008: large scale experiments, normalization layer

• LeCun from 2009: harsher non-linearities, normalization layer,

learning unsupervised and supervised.

• Mallat from 2011: provides a theory behind the architecture

• Hinton 2012: use bigger nets, GPUs, more data - DL Winner in Object Recognition

• Won the 2012 ImageNet LSVRC. 60 Million parameters, 832M MAC ops;

• Convolutional Nets [Krizhevsky et al., 2012] - Parallel Deep Learning at Google

• More features always improve performance unless data is scarce;

• Deep learning methods have higher capacity and have the potential

to model data better;

• However, big data needs deep learning to be scalable: lots of training

samples (>10M), classes (>10K) and input dimensions (>10K).

• Distributed Deep Nets (easy to be distributed).

Model parallelism

Model parallelism + data parallelism - Scaling Across Multiple GPUs

• Two variations: 1) Simulate the synchronous execution of SGD in one core; 2)

Approximation of SGD, not perfectly simulating but working better;

• Two parallelisms: 1) model parallelism: Across the model dimension, where

different workers train different parts of the model (amount of computation per

neuron activity is high); 2) data parallelism: Across the data dimension, where

different workers train on different data examples (amount of computation per

weight is high);
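The data-parallel case can be sketched as follows. This is a minimal illustration, assuming a toy one-weight linear model and synchronous gradient averaging; the function names and the two "workers" are hypothetical and not taken from the slides. Each worker computes the gradient on its own batch, and averaging equal-sized batch gradients reproduces the gradient of the combined batch.

```python
def shard_gradient(w, shard):
    """Mean gradient of 0.5 * (w*x - y)^2 over one worker's examples."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, shards, lr):
    """One synchronous SGD step: average the workers' gradients, then update."""
    grads = [shard_gradient(w, shard) for shard in shards]  # one gradient per worker
    return w - lr * sum(grads) / len(grads)


# Two hypothetical workers, each holding a different (x, y) batch with y = 2x.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w = 0.0
for _ in range(100):
    w = data_parallel_step(w, shards, lr=0.05)
print(round(w, 3))
```

Model parallelism would instead split the weights themselves across workers, which this sketch does not attempt.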

• Observations: data parallelism for convolutional layer and model parallelism for fully

connected layer;

• Convolutional layers cumulatively contain ~90-95% computation, ~5% of parameters;

• Fully-connected layers contain ~5-10% of the computation, ~95% of the parameters;

• Forward pass:

• Each of the K workers is given a different data batch of (let’s say) 128 examples;

• Each of the K workers computes all of the convolutional layer activities on its batch;

• To compute the fully-connected layer activities, the workers switch to model

parallelism;

• Parallelism: three schemes of parallelism. - Scaling Across Multiple GPUs

• Scheme I: each worker sends its last-stage convolutional layer activities to each

other worker; the workers then assemble a big batch of activities for 128K

examples and compute the fully-connected activities on this batch as usual;

• Scheme II: one of the workers sends its last-stage convolutional layer activities

to all other workers; the workers then compute the fully connected activities on

this batch of 128 examples and then begin to back propagate the gradients for

these 128 examples; in parallel with this computation, the next worker sends its

last-stage convolutional layer activities to all other workers; then the workers

compute the fully-connected activities on this second batch of 128 examples,

and so on;

• Scheme III: all of the workers send 128/K of their last stage convolutional layer

activities to all other workers. The workers then proceed as in scheme II;

• Backward pass is similar: the workers compute the gradients in the fully

connected layers in the usual way, then the next step depends on the schemes

in forward pass.

• Weight synchronization in the convolutional layers after backward pass;

• Variable batch size (128K in the convolutional layers and 128 in the fully-connected layers); - Model Parallelism: Partition model across machines

Data Parallelism: Asynchronous Distributed Stochastic Gradient Descent - Sparse Coding

• Sparse coding (Olshausen & Field, 1996).

• Originally developed to explain early visual processing in the

brain (edge detection).

• Objective: Given a set of input data vectors

learn a dictionary of bases such that:

Sparse: mostly zeros

• Each data vector is represented as a sparse linear

combination of bases. - Predictive Sparse Coding

• Recall the objective function for sparse coding:

• Modify by adding a penalty for prediction error:

• Approximate the sparse code with an encoder
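The equations for the three bullets above did not survive extraction. A commonly used form is sketched below; the notation is an assumption, not recovered from the slide: input x, dictionary D, sparse code z, sparsity weight λ, encoder f_e with parameters W, and prediction-penalty weight α.

```latex
\begin{align}
E_{\mathrm{SC}}(x, z; D) &= \lVert x - D z \rVert_2^2 + \lambda \lVert z \rVert_1 \\
E_{\mathrm{PSD}}(x, z; D, W) &= \lVert x - D z \rVert_2^2 + \lambda \lVert z \rVert_1
  + \alpha \lVert z - f_e(x; W) \rVert_2^2
\end{align}
```

After training, the encoder output f_e(x; W) serves as a fast approximation of the sparse code, so no optimization over z is needed at test time.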

• PSD for hierarchical feature training

• Phase 1: train the first layer;

• Phase 2: use encoder + absolute value as 1st feature extractor

• Phase 3: train the second layer;

• Phase 4: use encoder + absolute value as 2nd feature extractor

• Phase 5: train a supervised classifier on top layer;

• Phase 6: optionally train the whole network with supervised BP. - Methods of Solving Sparse Coding

• Greedy methods: projecting the residual on some atom;

• Matching pursuit, orthogonal matching pursuit;

• L1-norm: Least Absolute Shrinkage and Selection Operator (LASSO);

• The residual is updated iteratively in the direction of the atom;

• Gradient-based finding new search directions

• Projected Gradient Descent

• Coordinate Descent

• Homotopy: a set of solutions indexed by a parameter (regularization)

• LARS (Least Angle Regression)

• First order/proximal methods: Generalized gradient descent

• solving efficiently the proximal operator

• soft-thresholding for L1-norm

• Accelerated by the Nesterov optimal first-order method
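The proximal approach above can be sketched as plain (non-accelerated) ISTA. This is a minimal illustration, assuming the objective 0.5·||x − Dz||² + lam·||z||₁ and a step size no larger than the reciprocal of the largest eigenvalue of DᵀD; the names `soft_threshold` and `ista` are illustrative.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(D, x, lam, step, n_iter=200):
    """Minimise 0.5 * ||x - D z||^2 + lam * ||z||_1 by proximal gradient steps.

    D is a list of n rows with k entries each; x has length n; returns z (length k).
    """
    n, k = len(D), len(D[0])
    z = [0.0] * k
    for _ in range(n_iter):
        # Residual r = D z - x, then gradient of the quadratic term g = D^T r.
        r = [sum(D[i][j] * z[j] for j in range(k)) - x[i] for i in range(n)]
        g = [sum(D[i][j] * r[i] for i in range(n)) for j in range(k)]
        # Gradient step followed by soft-thresholding (the L1 proximal step).
        z = [soft_threshold(z[j] - step * g[j], step * lam) for j in range(k)]
    return z


# With an identity dictionary the solution is just the soft-thresholded input.
print(ista([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5], lam=1.0, step=1.0))
```

The Nesterov-accelerated variant adds a momentum term on top of the same two steps.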

• Iterative reweighting schemes

• L2-norm: Chartrand and Yin (2008)

• L1-norm: Candès et al. (2008) - Strategy of Dictionary Selection

• What D to use?

• A fixed overcomplete set of basis: no adaptivity.

• Steerable wavelet;

• Bandlet, curvelet, contourlet;

• DCT Basis;

• Gabor function;

• ….

• Data adaptive dictionary – learn from data;

• K-SVD: a generalized K-means clustering process for Vector

Quantization (VQ).

• An iterative algorithm to effectively optimize the sparse approximation

of signals in a learned dictionary.

• Other methods of dictionary learning:

• non-negative matrix decompositions.

• sparse PCA (sparse dictionaries).

• fused-lasso regularizations (piecewise constant dictionaries)

• Extending the models: Sparsity + Self-similarity=Group Sparsity - Multi Layer Neural Network

• A neural network = running several logistic regressions at the

same time;

• Neuron=logistic regression or…

• Calculate error derivatives (gradients) to refine: back propagate

the error derivative through model (the chain rule)

• Online learning: stochastic/incremental gradient descent
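As a minimal sketch of the bullets above, a single logistic neuron can be trained online, with the gradient obtained by the chain rule. The AND data, learning rate and epoch count are assumptions chosen for illustration; a multi-layer network would propagate the same error term back through further layers.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def train_neuron(data, lr=0.5, epochs=2000):
    """Online (stochastic) gradient descent for one two-input logistic neuron."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in data:  # one weight update per training example
            p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            # Chain rule for cross-entropy loss: dL/da = p - y and da/dw_i = x_i.
            delta = p - y
            w = [w[i] - lr * delta * x[i] for i in range(2)]
            b -= lr * delta
    return w, b


# Logical AND as a small, linearly separable toy problem.
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_neuron(and_data)
predictions = [round(sigmoid(w[0] * x[0] + w[1] * x[1] + b)) for x, _ in and_data]
print(predictions)
```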

• Batch learning: conjugate gradient descent - Problems in MLPs

• Multi Layer Perceptrons (MLPs), a type of feed-forward neural network,

were widely used for decades.

• Gradient is progressively getting more scattered

• Below the top few layers, the correction signal is minimal

• Gets stuck in local minima

• Especially when starting far from ‘good’ regions (i.e., random initialization)

• In usual settings, use only labeled data

• Almost all data is unlabeled!

• Instead the human brain can learn from unlabeled data. - Convolutional Neural Networks

• CNN is a special kind of multi-layer NNs applied to 2-d arrays (usually

images), based on spatially localized neural input;

• local receptive fields(shifted window), shared weights (weight averaging) across

the hidden units, and often, spatial or temporal sub-sampling;

• Related to generative MRF/discriminative CRF:

• CNN=Field of Experts MRF=ML inference in CRF;

• Generate ‘patterns of patterns’ for pattern recognition.

• Each layer combines (merge, smooth) patches from previous layers

• Pooling /Sampling (e.g., max or average) filter: compress and smooth the data.

• Convolution filters: (translation invariance) unsupervised;

• Local contrast normalization: increase sparsity, improve optimization/invariance.

C layers convolutions,

S layers pool/sample - Convolutional Neural Networks

• Convolutional Networks are trainable multistage architectures composed of multiple

stages;

• Input and output of each stage are sets of arrays called feature maps;

• At output, each feature map represents a particular feature extracted at all locations on

input;

• Each stage is composed of: a filter bank layer, a non-linearity layer, and a feature pooling

layer;

• A ConvNet is composed of 1, 2 or 3 such 3-layer stages, followed by a classification module;

• A fully connected layer: softmax transfer function for posterior distribution.

• Filter: A trainable filter (kernel) in filter bank connects input feature map to output feature

map;

* is discrete convolution operator

• Nonlinearity: a pointwise sigmoid tanh() or a rectified sigmoid abs(gi•tanh()) function;

• In rectified function, gi is a trainable gain parameter, might be followed by a contrast normalization N;

• Feature pooling: treats each feature map separately -> a reduced-resolution output feature

map;
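One stage (filter bank, non-linearity, feature pooling) can be sketched for a single input and output feature map as below. This is a simplified illustration: it assumes "valid" borders, one kernel, a scalar bias, and non-overlapping max pooling, and it omits contrast normalization; the function names are illustrative.

```python
import math


def conv2d_valid(x, k):
    """'Valid' 2-D discrete convolution (kernel flipped) of map x with kernel k."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[r + i][c + j] * k[kh - 1 - i][kw - 1 - j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)] for r in range(out_h)]


def rectified_tanh(fmap, gain=1.0):
    """Pointwise abs(g * tanh(.)); gain stands in for the trainable g_i."""
    return [[abs(gain * math.tanh(v)) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Non-overlapping max pooling, giving a reduced-resolution feature map."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]


def stage(x, k, bias=0.0):
    """Filter bank (one kernel plus bias), then non-linearity, then pooling."""
    filtered = [[v + bias for v in row] for row in conv2d_valid(x, k)]
    return max_pool(rectified_tanh(filtered))
```

For example, a 4x4 input with a 3x3 kernel gives a 2x2 filtered map, which 2x2 pooling reduces to a single value.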

• Supervised training is performed using a form of SGD to minimize the prediction error;

• Gradients are computed with the back-propagation method.

• Unsupervised pre-training: predictive sparse decomposition (PSD), then supervised fine-

tuning. - LeNet (LeNet-5)

• A layered model composed of convolution and subsampling operations

followed by a holistic representation and ultimately a classifier for

handwritten digits;

• Local receptive fields (5x5) with local connections;

• Output via a RBF function, one for each class, with 84 inputs each;

• Learning by Graph Transformer Networks (GTN); - AlexNet

• A layered model composed of convol., subsample.,

followed by a holistic representation and all-in-all a

landmark classifier;

• Consists of 5 convolutional layers, some of which

followed by max-pooling layers, 3 fully-connected layers

with a final 1000-way softmax;

• Fully-connected “FULL” layers: linear classifiers/matrix

multiplications;

• ReLU are rectified-linear nonlinearities on layer output,

can be trained several times faster;

• Local normalization scheme aids generalization;

• Overlapping pooling slightly less prone to overfitting;

• Data augmentation: artificially enlarge the dataset using

label-preserving transformations;

• Dropout: setting to zero output of each hidden neuron

with prob. 0.5;
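The dropout bullet can be sketched as below. This is a minimal illustration, assuming the common scheme in which every neuron is kept at test time and its output is multiplied by the keep probability; the function names and the fixed seed are illustrative.

```python
import random


def dropout_train(activations, rng, p_drop=0.5):
    """Training: zero each hidden output independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]


def dropout_test(activations, p_drop=0.5):
    """Test: keep every neuron and scale its output by the keep probability."""
    return [a * (1.0 - p_drop) for a in activations]


rng = random.Random(0)  # fixed seed, only to make the illustration repeatable
hidden = [1.0] * 10000
dropped = dropout_train(hidden, rng)
print(sum(dropped) / len(dropped))   # close to 0.5
print(dropout_test([2.0, 4.0]))      # [1.0, 2.0]
```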

• Trained by SGD with batch size 128, momentum 0.9, weight

decay 0.0005. - Use the two GPUs: One GPU runs the layer-parts at the top of the figure while the other

runs the layer-parts at the bottom; The GPUs communicate only at certain layers. The

network’s input is 150,528-dimensional, and the number of neurons in the network’s

remaining layers is given by 253,440–186,624–64,896–64,896–43,264-4096–4096–1000. - MattNet

• Matthew Zeiler from the startup company “Clarifai”, winner of ImageNet Classification in

2013;

• Preprocessing: subtracting a per-pixel mean;

• Data augmentation: downsampled to 256 pixels and a random 224 pixel crop is taken out

of the image and randomly flipped horizontally to provide more views of each example;

• SGD with mini-batch size 128, learning rate annealing, momentum 0.9 and dropout to

prevent overfitting;

• 65M parameters trained for 12 days on a single Nvidia GPU;

• Visualization by layered DeconvNets: project the feature activations back to the input

pixel space;

• Reveal input stimuli exciting individual feature maps at any layer;

• Observe evolution of features during training;

• Sensitivity analysis of the classifier output by occluding portions to reveal which parts of scenes

are important;

• DeconvNet attached to each of ConvNet layer, unpooling uses locations of maxima to

preserve structure;
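The unpooling idea can be sketched as follows. It is a simplified illustration, assuming non-overlapping pooling regions on a single feature map, with illustrative function names: pooling records the location of each maximum (the "switch"), and unpooling writes each pooled value back to that location, leaving zeros elsewhere.

```python
def max_pool_with_switches(fmap, size=2):
    """Non-overlapping max pooling that also records where each maximum was."""
    pooled, switches = [], []
    for r in range(0, len(fmap) - size + 1, size):
        pooled_row, switch_row = [], []
        for c in range(0, len(fmap[0]) - size + 1, size):
            value, position = max(
                ((fmap[r + i][c + j], (r + i, c + j))
                 for i in range(size) for j in range(size)),
                key=lambda item: item[0])
            pooled_row.append(value)
            switch_row.append(position)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def unpool(pooled, switches, height, width):
    """Place each pooled value back at its recorded location; zeros elsewhere."""
    out = [[0.0] * width for _ in range(height)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (r, c) in zip(pooled_row, switch_row):
            out[r][c] = value
    return out
```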

• Multiple such models were averaged together to further boost performance;

• Supervised pre-training with AlexNet, then modify it to get better performance (error rate

14.8%). - Architecture of an eight layer ConvNet model. Input: a 224 by 224 crop of an image (with 3 color planes). Layers 1-5 are convolutional; the first layer uses 96 filters of size 7x7 with a stride of 2 in both x and y. The resulting feature maps are (i) passed through a rectified linear function, (ii) 3x3 max pooled (stride 2), and (iii) contrast normalized, giving 55x55 feature maps. Layers 6-7 are fully connected, taking their input in vector form (6x6x256 = 9216 dimensions). The final layer is a C-way softmax function, where C is the number of classes. - Top: A deconvnet layer (left) attached to

a convnet layer (right). The deconvnet

will reconstruct approximate version of

convnet features from the layer

beneath.

Bottom: Unpooling operation in the

deconvnet, using switches which record

the location of the local max in each

pooling region (colored zones) during

pooling in the convnet. - Oxford VGG Net: Very Deep CNN

• Networks of increasing depth using an architecture with very small (3×3)

convolution filters;

• Spatial pooling is carried out by 5 max-pooling layers;

• A stack of convolutional layers followed by three Fully-Connected (FC) layers;

• All hidden layers are equipped with the rectification ReLU non-linearity;

• No Local Response Normalisation!

• Trained by optimising the multinomial logistic regression objective using SGD;

• Regularised by weight decay and dropout regularisation for the first two fully-

connected layers;

• The learning rate was initially set to 10^-2, and then decreased by a factor of 10 when the validation accuracy stopped improving;

• For random initialisation, sample the weights from a normal distribution;

• Derived from the publicly available C++ Caffe toolbox, allow training and

evaluation on multiple GPUs installed in a single system, and on full-size

(uncropped) images at multiple scales;

• Combine the outputs of several models by averaging their soft-max class

posteriors. - The depth of the configurations increases from the left (A) to

the right (E), as more layers are added (the added layers are

shown in bold). The convolutional layer parameters are denoted

as “conv<receptive field size> - <number of channels>”. The

ReLU activation function is not shown for brevity. - GoogleNet

• Deep convolutional neural network architecture codenamed Inception;

• Increase depth and width of network but keep computational budget constant;

• Drawbacks: Bigger size typically means a larger number of parameters, which makes the

enlarged network more prone to overfitting, and the dramatically increased use of

computational resources;

• Solution: Move from fully connected to sparsely connected architectures, and analyze the

correlation statistics of the activations of the last layer and clustering neurons with highly

correlated outputs.

• Based on the well known Hebbian principle: neurons that fire together wire together;

• GoogleNet: a framework with Inception architecture

• Finding out how an optimal local sparse structure in a convolutional vision network can

be approximated and covered by readily available dense components;

• Judiciously applying dimension reduction and projections;

• Increasing the number of units at each stage significantly without blow-up in computational

complexity;

• Aligning with the intuition that visual info. is processed at various scales and then

aggregated so that the next stage can abstract features from different scales

simultaneously.

• Trained using the DistBelief: A distributed machine learning system (cloud). - Inception module (with dimension reductions)
- Problems with training deep architectures?

[Figure: network in a network in a network, 9 Inception modules; legend: Convolution, Pooling, Softmax, Other] - RNN: Recurrent Neural Network

• A nonlinear dynamical system that maps sequences to sequences;

• Parameterized with three weight matrices and three bias vectors;

• RNNs are fundamentally difficult to train due to their nonlinear iterative nature;

• The derivative of the loss function can be exponentially large with respect to the

hidden activations;

• RNN suffers also from the vanishing gradient problem.
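The forward pass of such an RNN can be sketched as below. It assumes the three weight matrices are input-to-hidden, hidden-to-hidden and hidden-to-output, and the three bias vectors are the hidden bias, the output bias and the initial hidden state; these names and the tanh non-linearity are assumptions for illustration.

```python
import math


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rnn_forward(inputs, W_hx, W_hh, W_oh, b_h, b_o, h_init):
    """Map an input sequence to an output sequence with a tanh RNN."""
    h, outputs = h_init, []
    for x in inputs:
        # The hidden state depends on the current input and the previous state.
        pre = [a + b + c for a, b, c in zip(matvec(W_hx, x), matvec(W_hh, h), b_h)]
        h = [math.tanh(p) for p in pre]
        outputs.append([o + b for o, b in zip(matvec(W_oh, h), b_o)])
    return outputs
```

Because W_hh is applied once per time step, gradients propagated back through many steps are repeatedly multiplied by related factors, which is where the exploding and vanishing behaviour comes from.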

• Back Propagation Through Time (BPTT):

• “Unfold” the recurrent network in time, by stacking identical copies of the RNN, and

redirecting connections within the network to obtain connections between

subsequent copies;

• It is hard to use where online adaptation is required, as the entire time series must be used.

• Real-Time Recurrent Learning (RTRL) is a forward-pass only algorithm that

computes the derivatives of the RNN w.r.t. its parameters at each timestep;

• Unlike BPTT, RTRL maintains the exact derivative of the loss so far at each timestep of

the forward pass, without a backward pass and the need to store the past hidden

states;

• However, the computational cost of RTRL is prohibitive, and it also needs more memory than BPTT.

• Speech Recognition and Handwriting recognition. - LSTM: Long Short-Term Memory

• An RNN structure that elegantly addresses the vanishing gradients problem

using “memory units”;

• These linear units have a self-connection of strength 1 and a pair of auxiliary

“gating units” that control the flow of information to and from the unit;

• Let N be the number of memory units of the LSTM. At each time step t, the

LSTM maintains a set of vectors as below, whose evolution is governed by the

following equations:
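The equations themselves did not survive extraction. One formulation consistent with the description above (a self-connection of strength 1, plus an input gate and an output gate) is sketched here; the symbols are assumptions, with all vectors in R^N, σ the logistic function and ⊙ the elementwise product.

```latex
\begin{align}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) && \text{(candidate input)} \\
c_t &= c_{t-1} + i_t \odot g_t && \text{(memory units, self-connection of strength 1)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(exposed state)}
\end{align}
```

Many later variants add a forget gate f_t that multiplies c_{t-1} in the memory update.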

• Since the forward pass of the LSTM is relatively intricate, the equations for the

correct derivatives of the LSTM are highly complex, making them tedious to

implement;

• Note: Theano has LSTM module. - From Simple RNN to BPTT

Left: RNN with one fully connected hidden layer;

Right: LSTM with memory blocks in hidden layer. - Gated Recurrent Unit

• GRU is a variation of RNN, adaptively capturing dependencies of different time scales with

each recurrent unit;

• GRU uses gating units as well to modulate the flow of information inside the unit, but without separate memory cells.

• GRU doesn’t control degree to which its state is exposed, but exposes the whole state each

time;

• Different from LSTM:

• GRU exposes its full content without control;

• GRU controls the information flow from the previous activation when computing the new,

candidate activation, but does not independently control the amount of the candidate

activation being added (the control is tied via the update gate).
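A single GRU step can be sketched for one scalar unit as below. The weight names in `p` are hypothetical and biases are omitted; the update gate z ties together how much of the old state is kept and how much of the candidate is added, and the whole state is returned without an output gate.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gru_step(x, h_prev, p):
    """One step of a single-unit (scalar) GRU; p holds hypothetical weights."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev)  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev)  # reset gate
    # The reset gate controls how much of the previous activation feeds the candidate.
    h_cand = math.tanh(p["w"] * x + p["u"] * (r * h_prev))
    # Additive update: interpolate between the old state and the candidate.
    return (1.0 - z) * h_prev + z * h_cand


zero = {"wz": 0.0, "uz": 0.0, "wr": 0.0, "ur": 0.0, "w": 0.0, "u": 0.0}
print(gru_step(1.0, 2.0, zero))  # 1.0: z = 0.5, candidate = 0
```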

• Shared virtues with LSTM: the additive component of their update from t to t + 1;

• Easy for each unit to remember the existence of a specific feature in the input stream for a long series of steps;

• Effectively creates shortcut paths that bypass multiple temporal steps, which allow the error to be back-propagated easily without too quickly vanishing. - Generative Model: MRF

• Random Field: F={F1,F2,…FM} a family of random variables

on set S in which each Fi takes value fi in a label set L.

• Markov Random Field: F is said to be a MRF on S w.r.t. a

neighborhood N if and only if it satisfies Markov property.

• Generative model for joint probability p(x)

• allows no direct probabilistic interpretation

• define potential functions Ψ on maximal cliques A

• map joint assignment to non-negative real number

• requires normalization

• An MRF is an undirected graphical model. - Belief Nets

• A belief net is a directed acyclic graph composed of stochastic variables.

• Can observe some of the variables and solve two problems:

• inference: Infer the states of the unobserved variables.

• learning: Adjust the interactions between variables to more

likely generate the observed data.

• Use nets composed of layers of stochastic variables with weighted connections.

[Figure labels: stochastic hidden cause; visible effect] - Boltzmann Machines

• Energy-based models associate an energy with each configuration of the stochastic

variables of interest (for example, MRF, Nearest Neighbor);

• Learning means adjusting the energy function so that its shape has desirable properties (low energy for desired configurations);

• Boltzmann machine is a stochastic recurrent model with hidden variables;

• Markov chain Monte Carlo, i.e. MCMC sampling (appendix);

• Restricted Boltzmann machine is a special case:

• Only one layer of hidden units;

• factorization of each layer’s neurons/units (no connections in the same layer);

• Contrastive divergence: approximation of gradient (appendix).
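The slide's three labeled formulas (probability, energy function, learning rule) did not survive extraction. For a binary restricted Boltzmann machine, a standard form is sketched below; the symbols are assumptions: visible units v, hidden units h, weights W, biases b and c, and learning rate ε.

```latex
\begin{align}
E(v, h) &= -b^{\top} v - c^{\top} h - v^{\top} W h \\
p(v, h) &= \frac{e^{-E(v, h)}}{Z}, \qquad Z = \sum_{v', h'} e^{-E(v', h')} \\
\Delta W_{ij} &= \varepsilon \left( \langle v_i h_j \rangle_{\mathrm{data}}
  - \langle v_i h_j \rangle_{\mathrm{model}} \right)
\end{align}
```

Contrastive divergence approximates the model expectation with statistics gathered after a few Gibbs sampling steps started from the data.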

probability

Energy Function

Learning rule - Deep Belief Networks

• A hybrid model: can be trained as

generative or discriminative model;

• Deep architecture: multiple layers

(learn features layer by layer);

• Multi layer learning is difficult in

sigmoid belief networks.

• Top two layers are undirected

connections, RBM;

• Lower layers get top down directed

connections from layers above;

• Unsupervised or self-taught pre-

learning provides a good initialization;

• Greedy layer-wise unsupervised

training for RBM

• Supervised fine-tuning

• Generative: wake-sleep algorithm (Up-

down)

• Discriminative: back propagation

(bottom-up) - Deep Boltzmann Machine

• Learning internal representations that become increasingly complex;

• High-level representations built from a large supply of unlabeled inputs;

• Pre-training consists of learning a stack of modified RBMs, which are

composed to create a deep Boltzmann machine (undirected graph);

• Generative fine-tuning: different from DBN

• Positive and negative phase (appendix)

• Discriminative fine-tuning: the same as for DBN

• Back propagation. - Deep Gated MRF

• Conditional Distribution Over Input:

• P(x∣h)=N (mean(h) ,D);

• examples: PPCA, Factor Analysis, ICA, Gaussian RBM;

• model does not represent dependencies well, only mean intensity;

• P(x∣h)=N (0,Covariance(h));

• examples: PoT (product of Student’s t), covariance RBM;

• model does not represent mean intensity well, only dependencies;

• P(x∣h)=N (mean(h) , Covariance(h));

• mean cRBM, mean PoT;

• two sets of latent variables to modulate mean and covariance of the

conditional distribution over the input;

• Deep gated MRF: RBM layers + MRF with adaptive affinities (to gate

the effective interactions and to decide mean intensities);

• Learning: Gibbs sampling/HMC sampling, Fast persistent CD. - Deep Gated MRF
- Denoising Auto-Encoder

• Multilayer NNs with target output=input;

• Reconstruction=decoder(encoder(input));

• Perturbs the input x to a corrupted version;

• Randomly sets some of the coordinates of input to zeros.

• Recover x from encoded perturbed data.
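The corruption and reconstruction steps can be sketched as below. This is a forward pass only, assuming masking noise, sigmoid units, tied encoder and decoder weights, and a squared-error loss against the clean input; training would adjust the weights to reduce this loss, and all names are illustrative.

```python
import math
import random


def corrupt(x, p_zero, rng):
    """Masking noise: set each coordinate to zero with probability p_zero."""
    return [0.0 if rng.random() < p_zero else v for v in x]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def reconstruct(x_in, W, b_hidden, b_visible):
    """decoder(encoder(x_in)) with tied weights W (hidden rows, visible columns)."""
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x_in)) + b)
              for row, b in zip(W, b_hidden)]
    return [sigmoid(sum(W[j][i] * hidden[j] for j in range(len(W))) + b_visible[i])
            for i in range(len(x_in))]


def denoising_loss(x, W, b_hidden, b_visible, p_zero, rng):
    """Squared error between the clean x and the reconstruction of corrupted x."""
    x_hat = reconstruct(corrupt(x, p_zero, rng), W, b_hidden, b_visible)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))
```

The key point is that the loss compares the reconstruction with the clean input, not with the corrupted one.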

• Learns a vector field towards higher probability regions;

• Pre-trained with DBN or regularizer with perturbed training data;

• Minimizes variational lower bound on a generative model;

• corresponds to regularized score matching on an RBM;

• PCA=linear manifold=linear Auto Encoder;

• Auto-encoder learns the salient variation like a nonlinear PCA. - Stacked Denoising Auto-Encoder

• Stack many (possibly sparse) auto-encoders in succession and train

them using greedy layer-wise unsupervised learning

• Drop the decode layer each time

• Performs better than stacking RBMs;

• Supervised training on the last layer using final features;

• (option) Supervised training on the entire network to fine-tune

all weights of the neural net;

• Empirically not quite as accurate as DBNs. - Image Denoising

• Noise reduction: various assumptions of content internal

structures;

• Learning-based

• Field of experts (MRF), CRF, NN (MLP, CNN);

• Sparse coding: K-SVD, LSSC,….

• Self-similarity

• Gaussian, Median;

• Bilateral filter, anisotropic diffusion;

• Non-local means.

• Sparsity prior

• Wavelet shrinkage;

• Use of both Redundancy and Sparsity

• BM3D (block matching 3-d filter)-benchmark;

• Can ‘Deep Learning’ compete with BM3D? - Block Matching 3-D for Denoising

• For each patch, find similar patches;

• Group the similar patches into a 3-d stack;

• Perform a 3-D transform (2-d + 1-d) and coefficient

thresholding (sparsity);

• Apply inverse 3-D transform (1-d + 2-d);

• Also combine multiple patches in a collaborative way

(aggregation);

• Two stages: hard -> wiener (soft). - BM3D Outline
- Apply Sparse Coding for Denoising

• A cost function for: Y = Z + n

• Solve for: [equation not recovered; it includes a prior term]

• Break problem into smaller problems: [equation not recovered; its terms are labeled global proximity, sparsity of the representations, and proximity of selected patch]

• Aim at minimization at the patch level. - Image Data in K-SVD Denoising

• Extract overlapping patches from a single image;

• clean or corrupted, even reference (multiple frames)?

• for example, 100k of size 8x8 block patches;

• Applied the K-SVD, training a dictionary;

• Size of 64x256 (n=64, dictionary size k).

• Lagrange multiplier lambda = 30/sigma, where sigma is the noise standard deviation;

• The coefficients from OMP;

• the maximal iteration is 180 and noise gain C=1.15;

• the number of nonzero elements L=6 (sigma=5).

• Denoising by normalized weighted averaging: - Image Denoising by Conv. Nets

• Image denoising is posed as a learning problem: training a Conv. Net;

• Parameter estimation to minimize the reconstruction error.

• Online learning (rather than batch learning): stochastic gradient

• Gradient update from 6x6 patches sampled from 6 different training images

• Run like greedy layer-wise training for each layer. - Image Denoising by MLP

• Denoising as learning: map noisy patches to noise-free ones;

• Patch size 17x17;

• Training with different noise types and levels:

• Sigma=25; noise as Gaussian, stripe, salt-and-pepper, coding artifact;

• Feed-forward NN: MLP;

• input layer 289-d, four hidden layers (2047-d), output layer 289-d.

• input layer 169-d, four hidden layers (511-d), output layer 169-d.

• 40 million training images from LabelMe and Berkeley segmentation!

• 1000 testing images: McGill, Pascal VOC 2007;

• GPU: slower than BM3D, much faster than KSVD.

• Deep learning can help: unsupervised learning from unlabelled data. - Image Denoising with Deep Nets

• Combine sparse coding and deep network pre-trained by DAE;

• Reconstruct clean image from noisy image by training DAE;

• image denoising by choosing appropriate η in different situations.

• Deep network: stacked sparse DAE (denoising auto-encoder).

Hidden layer

• Pre-training

KL divergence with sparsity

• Fine-tuning by back propagation

• Patch-based. - Image Denoising by DBMs

• Combine Boltzmann machine and Denoising Auto-Encoder;

• 100,000 image patches of sizes 4×4, 8×8 and 16×16 from CIFAR-10

dataset to get 50,000 training samples;

• Three sets of testing images from USC, textures, aerials and

miscellaneous;

• Gaussian BMs+DAEs: one, two and four hidden layers;

• Deep Network training:

• A two-stage pre-training and PCD training for Gaussian DBMs;

• Stochastic BP for DAE training;

• Noise: Gaussian, salt-and-pepper;

• Patch-based as well;

• Comparison: when noise is heavy, DBM beats DAE; otherwise, vice versa. - Image Denoising by Deep Gated MRF

• Works as solving the following optimization problem

where F(x;θ) is the mPoT energy function

• Adapt the generic prior learned by mPoT:

• 1. Adapt the parameters to the denoised test image (mPoT+A), such

as sparse coding;

• 2. Add to the denoising loss an extra quadratic term pulling the

estimate close to the denoising result of the non-local means

algorithm (mPoT+A+NLM), such as adding the term as

Original noisy (22.1dB) mPoT(28.0dB) mPoT+A(29.2dB) mPoT+A+NLM(30.7dB) - Image Restoration by CNN

• Collect a dataset of clean/corrupted image pairs which are then used to train a

specialized form of convolutional neural network.

• Given a noisy image x, predict an image y close to the true clean image y*;

• the input kernels p1 = 16, the output kernel pL = 8.

• 2 hidden layers (i.e. L = 3), each with 512 units, the middle layer kernel p2 = 1.

• W1 512 kernels of size 16x16x3, W2 512 kernels of size 1x1x512, and W3 size 8x8x512.

• This learns how to map corrupted image patches to clean ones, implicitly

capturing the characteristic appearance of noise in natural images.

• Train the weights Wl and biases bl by minimizing the mean squared error

• Minimize with SGD

• Regarded as: first patchifying the input, applying a fully-connected neural network to each

patch, and averaging the resulting output patches. - Image Restoration by CNN

• Comparison. - Image Deconvolution with Deep CNN

• Establish the connection between traditional optimization-based schemes

and a CNN architecture;

• A separable structure is used as a reliable support for robust deconvolution

against artifacts;

• The deconvolution task can be approximated by a convolutional network by

nature, based on the kernel separability theorem;

• Kernel separability is achieved via SVD;

• An inverse kernel with length 100 is enough for plausible deconv. results;

• Image deconvolution convolutional neural network (DCNN);

• Two hidden layers: h1 is generated by 38 large-scale 1-d kernels of size 121×1, and h2 by

38 1×121 convolution kernels applied to each map in h1; the output is produced by a 1×1×38 kernel;

• Random-weight initialization or from the separable kernel inversion;

• Concatenation of deconvolution CNN module with denoising CNN;

• called “Outlier-rejection Deconvolution CNN (ODCNN)”;

• 2 million sharp patches together with their blurred versions in training. - Image Deconvolution with Deep CNN
- End-to-End Deep Learning for Deblur

Feature extraction module transforms the image to a learned gradient-like

representation suitable for kernel estimation. The kernel is estimated by

division in Fourier space, then similarly the latent image. The network stacks several stages, each consisting of

these three operations and operating on both the blurry image and the latent image. - End-to-End Deep Learning for Deblur

Intermediary outputs of a single-stage NN with 8x Conv, Tanh, 8x8, Tanh, 8x4. - Compression Artifacts Reduction by a Deep CNN

Reuse the features learned in a relatively easier task to initialize a

deeper or harder network, called “easy-hard transfer”: shallow to

deep model, high to low quality, standard to real use case.

Framework of Artifacts Reduction Convolutional Neural Network (AR-CNN). The

network consists of four convolutional layers, each of which is responsible for a

specific operation. Then it optimizes the four operations (i.e., feature extraction,

feature enhancement, mapping and reconstruction) jointly in an end-to-end

framework. Example feature maps shown in each step could well illustrate the

functionality of each operation. They are normalized for better visualization. - DehazeNet by CNN for Dehaze

DehazeNet conceptually consists of four sequential operations (feature

extraction, multi-scale mapping, local extremum and non-linear regression),

which is constructed by 3 convolution layers, a max-pooling, a Maxout unit

and a BReLU activation function. - Image Super-resolution

• Super-resolution (SR): how to find missing details/HF comp?

• Interpolation-based:

• Edge-directed;

• B-spline;

• Sub-pixel alignment;

• Reconstruction-based:

• Gradient prior;

• TV (Total Variation);

• MRF (Markov Random Field).

• Learning-based (hallucination).

• Example-based: texture synthesis, LR-HR mapping;

• Self learning: sparse coding, self similarity-based;

• ‘Deep Learning’ competes with shallow learning in image SR. - What is Example Based SR?

• Estimate missing HR detail that isn’t present in the original LR

image, and which we can’t make visible by simple sharpening;

• Image database with HR/LR image pairs;

• Algorithm uses a training set to learn the fine details of LR;

• It then uses learned relationships (MRF) to predict fine details. - SR from a Single Image

• Multi-frame-based SR (alignment);

• Example-based SR. - SR from a Single Image

• Combination of example-based and multi-frame-based SR.

[Figure labels: same scale; different scales; FindNN; Parent; Copy] - Example-based

Edge Statistics

Single Frame - Sparse Coding for SR [Yang et al.08]

• HR patches have a sparse representation w.r.t. an over-complete

dictionary of patches randomly sampled from similar images:

output HR patch $x \approx D_h \alpha$ for some $\alpha \in \mathbb{R}^K$ with $\|\alpha\|_0 \ll K$, where $D_h$ is the HR dictionary.

• Sample 3 x 3 LR overlapping patches y on a regular grid.

The input LR patch satisfies $y = L x \approx L D_h \alpha = D_l \alpha$, where $D_l$ is the dictionary of low-resolution patches and $L$ is the downsampling/blurring operator,

so y gives linear measurements of the sparse coefficient vector $\alpha$!

If we can recover the sparse solution $\alpha$ to the underdetermined

system of linear equations $y = D_l \alpha$, we can reconstruct x as $x = D_h \alpha$.

Convex relaxation: replace $\min \|\alpha\|_0$ with $\min \|\alpha\|_1$.

T, T’: select overlap between patches; F: 1st and 2nd derivatives from LR bicubic interpolation. - Sparse Coding for SR [Yang et al.08]

Two training sets:

Flower images – smooth area, sharp edge

Animal images -- HF textures

Randomly sample 100,000 HR-LR patch pairs

from each set of training images. - [Figure panels: Bicubic; MRF / BP [Freeman IJCV ‘00]; Sparse coding; Original] - Joint Dictionary Learning for SR

• Local sparse prior for detail recovery;

previous reconstruct on the overlap

extract overlap region

controls the tradeoff between matching the LR

input and finding a neighbor-compatible HR patch.

• Global constraints for artifact avoiding (L=SH);

Solved by back-projection: a gradient descent method

• Joint dictionary learning: - [Figure panels: Input LR; MRF / BP [Freeman IJCV ‘00]; Bicubic; Sparse coding] - Image SR by DBMs

• Sparsity prior pre-learned into the dictionary [Yang’08];

• Learn the dictionary (size=1024), encoded in the RBM;

• Trained by contrastive divergence.

• Use interpolation to initialize HR from LR, to accelerate inference;

• Training images: 10,000 HR/LR image patch (8x8) pairs.

The image patches are elements of the dictionaries to be

learned and collected from the normalized weights in RBM. - Results of images magnified by a factor of 2
- Super-resolution by DBNs

• SR is an image completion problem of missing data (HF);

• Training: HR image is divided into PxP patches transformed to DCT domain, and

DBNs trained by SGD, layer-by-layer;

• Restoring: LR image is interpolated first and divided into PxP patches as well,

transformed to DCT domain, fed into DBNs to infer missing HF, then reversed.

• Iteratively.

• Experiment setting: P=16, scaling =2, learning rate 0.01, hidden units 400 (1st layer) +

200 (2nd layer). - Super-resolution by DBNs

Connections among LF and HF

Restoration of HF after training

(Two hidden layers as example) - Super-resolution by DBNs

COMPARISON OF SUPER-RESOLUTION METHODS USING PSNR AND SSIM - Image Super-resolution by

Learning Deep CNN

• Learns an end-to-end mapping btw low/high-resolution images as a

deep CNN from the low-resolution image to the high-resolution one;

• Traditional sparse-coding-based SR can be viewed as a deep convolutional

network, but it handles each component separately, whereas the deep CNN jointly

optimizes all layers. - Image Super-resolution by

Learning Deep CNN

1. Convolution Layer:

W1 - size cxf1xf1xn1, B1 – n1-dim

2. Rectified Linear Unit: ReLU, max(0; x)

W2 - size n1x1x1xn2, B2 - n2-dim

3. Convolution Layer:

W3 - size n2xf3xf3xc, B3 - c-dim

4. Loss Function: MSE

Note: c=3, f1 = 9, f3 = 5, n1 = 64, n2 = 32. - Image Super-resolution by

Learning Deep CNN

• LR image upscaled to the desired size using bicubic interpolation as Y;

Then recover from Y an image F(Y) similar to ground truth HR image X.

• Learn a mapping F, consists of three operations:

• 1. Patch extraction and representation;

• 2. Non-linear mapping;

• 3. Reconstruction.
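As a worked example for the three operations, the sketch below counts the learnable parameters of the three-layer network using the sizes from the note on the layer slide (c=3, f1=9, f3=5, n1=64, n2=32). The helper name `conv_params` and the dictionary labels are assumptions made for illustration only.

```python
def conv_params(in_ch, k, out_ch):
    """Weights plus biases of a k x k convolution mapping in_ch to out_ch channels."""
    return in_ch * k * k * out_ch + out_ch

# Sizes taken from the note on the layer-definition slide.
c, f1, f3, n1, n2 = 3, 9, 5, 64, 32

layers = {
    "patch extraction (W1, B1)": conv_params(c, f1, n1),    # c x f1 x f1 x n1
    "non-linear mapping (W2, B2)": conv_params(n1, 1, n2),  # n1 x 1 x 1 x n2
    "reconstruction (W3, B3)": conv_params(n2, f3, c),      # n2 x f3 x f3 x c
}
total_params = sum(layers.values())

for name, count in layers.items():
    print(f"{name}: {count}")
print("total:", total_params)
```

With these sizes the first layer holds most of the parameters, which is consistent with a small network in which nearly all capacity sits in patch extraction.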

• Traditional Sparse coding method shown as - Image Super-resolution by

Learning Deep CNN

Results comparison. - Image Superresolution by a Cascade

of Stacked CLA

• In each layer of the cascade, non-local self-similarity search is first

performed to enhance high-frequency texture details of the partitioned

patches in the input image;

• The enhanced image patches are then input into a collaborative local

auto-encoder (CLA) to suppress the noises as well as collaborate the

compatibility of the overlapping patches;

• By closing the loop on non-local self-similarity search and CLA in a

cascade layer, refine the super-resolution fed into next layer until the

required image scale. - Image Superresolution by a Cascade

of Stacked CLA

• Experimental results compared with others.

Kim’s sparse regression exemplar-based Yang’s sparse coding Cascade of stacked CLA - Deeply-Recursive Convolutional

Network for Image Super-Resolution

• A deeply-recursive convolutional network (DRCN);

• A very deep recursive layer (up to 16 recursions);

• Training: recursive-supervision, skip-connection.

It consists of three parts: embedding network, inference network and

reconstruction network. Inference network has a recursive layer. - Deeply-Recursive Convolutional

Network for Image Super-Resolution

Unfolding inference network. Left: A recursive layer. Right: Unfolded structure.

The same filter W is applied to feature maps recursively. The unfolded model

can utilize very large context without adding new weight parameters. - Deeply-Recursive Convolutional

Network for Image Super-Resolution

(a) Model with recursive-supervision and skip-connection. (b) Applying deep-supervision.

(c) Example of expanded structure of (a) w/o parameter sharing (no recursion). - References

• Y. Bengio, Learning Deep Architectures for AI, Foundations and Trends in

ML, 2(1), pp.1-127, 2009.

• R Fergus, H. Lee, M Ranzato, R. Salakhutdinov, G. Taylor, K Yu, Deep

Learning Methods for Vision, CVPR 2012 Tutorial.

• Hinton, G., Osindero, S. and Teh, Y. A fast learning algorithm for deep

belief nets. Neural Computation, 18, 2006.

• Salakhutdinov, Ruslan, and Geoffrey E. Hinton. Deep Boltzmann machines.

Int. conf. on AI and statistics. 5(2), 2009.

• Vincent, Pascal, et al. Stacked denoising autoencoders: Learning useful

representations in a deep network with a local denoising criterion. J. of

Machine Learning Research 11 (2010): 3371-3408.

• Le, Ranzato, Monga, Devin, Corrado, Chen, Dean, Ng. Building High-Level

Features Using Large Scale Unsupervised Learning. ICML 2012.

• K Cho, T Raiko, A Ilin and J Karhunen, A Two-stage Pretraining Algorithm

for Deep Boltzmann Machines, NIPS workshop, 2012.

• V. Jain and H.S. Seung. Natural image denoising with convolutional

networks. Advances in Neural Information Processing Systems, 21:769–

776, 2008. - References

• Burger, Schuler, Harmeling, Image Denoising: Can Plain Neural Networks

Compete with BM3D?, CVPR, 2012.

• Xie, J., Xu, L., Chen, E. Image denoising and inpainting with deep neural

networks. Advances in Neural Information Processing Systems 25, 2012.

• K. Cho, Simple Sparsification Improves Sparse Denoising Autoencoders in

Image Denoising, ICML, 2013.

• J Gao, Y Guo, M Yin, Restricted Boltzmann Machine Approach to Couple

Dictionary Training for Image Superresolution, IEEE ICIP, 2013.

• T. Nakashika, T Takiguchi, Y Ariki, HF Restoration Using Deep Belief Nets for

SR, 2013.

• D Eigen, D Krishnan, R Fergus, Restoring An Image Taken Through a Window

Covered with Dirt or Rain. ICCV’13;

• M Ranzato, V Mnih, J M. Susskind, G E. Hinton, Modeling Natural Images

Using Gated MRFs, IEEE T-PAMI, 2013;

• Li Xu, Jimmy S. Ren, Ce Liu, Jiaya Jia, Deep convolutional neural network

for image deconvolution, NIPS, 2014.

• Cui, Chang, Shan, Zhong, Chen, Deep Network Cascade for Image Super-

resolution, ECCV’14;

• Dong, Loy, He, Tang, Learning a Deep Convolutional Network for Image

Super-Resolution, ECCV’14. - Appendix
- Graphical Models

• Graphical Models: Powerful framework for representing

dependency structure between random variables.

• The joint probability distribution over a set of random variables.

• The graph contains a set of nodes (vertices) that represent random

variables, and a set of links (edges) that represent dependencies

between those random variables.

• The joint distribution over all random variables decomposes

into a product of factors, where each factor depends on a subset

of the variables.

• Two types of graphical models:

• Directed (Bayesian networks)

• Undirected (Markov random fields, Boltzmann machines)

• Hybrid graphical models that combine directed and undirected models,

such as Deep Belief Networks, Hierarchical-Deep Models. - PCA, AP & Spectral Clustering

• Principal Component Analysis (PCA) uses orthogonal transformation to

convert a set of observations of possibly correlated variables into a set of

linearly uncorrelated variables called principal components.

• This transformation is defined in such a way that the first principal

component has the largest possible variance and each succeeding

component in turn has the highest variance possible under the constraint

that it be orthogonal to the preceding components.

• PCA is sensitive to the relative scaling of the original variables.

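• Depending on the field, PCA is also called the Karhunen–Loève transform (KLT) or the Hotelling transform;

it is usually computed by singular value decomposition (SVD) of the data matrix or eigenvalue

decomposition (EVD) of the covariance matrix, and is closely related to factor analysis and spectral decomposition;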

• Affinity Propagation (AP) is a clustering algorithm based on the concept

of "message passing" between data points. Unlike clustering algorithms

such as k-means or k-medoids, AP does not require the number of

clusters to be determined or estimated before running the algorithm;

• Spectral Clustering makes use of the spectrum (eigenvalues) of the data

similarity matrix to perform dimensionality reduction before clustering in

fewer dimensions.

• The similarity matrix consists of a quantitative assessment of the relative

similarity of each pair of points in the dataset. - PCA, AP & Spectral Clustering
- NMF & pLSA

• Non-negative matrix factorization (NMF): a matrix V is factorized into

(usually) two matrices W and H, with the property that all three matrices have no negative

elements.

• The different types arise from using different cost functions for measuring

the divergence between V and W*H and possibly by regularization of

the W and/or H matrices;

• squared error, Kullback-Leibler divergence or total variation (TV);

• NMF is an instance of a more general probabilistic model called

"multinomial PCA", as is pLSA (probabilistic latent semantic analysis);

• pLSA is a statistical technique for two-mode (extended naturally to higher

modes) analysis, modeling the probability of each co-occurrence as a

mixture of conditionally independent multinomial distributions;

• Their parameters are learned using EM algorithm;

• pLSA is based on a mixture decomposition derived from a latent class

model, not as downsizing the occurrence tables by SVD in LSA.

• Note: an extended model, LDA (Latent Dirichlet allocation) , adds

a Dirichlet prior on the per-document topic distribution. - NMF & pLSA

Note: d is the document index variable, c is a word's topic drawn from the document's

topic distribution, P(c|d), and w is a word drawn from the word distribution of this

word's topic, P(w|c). (d and w are observable variables, c is a latent variable.) - ISOMAP

• General idea:

• Approximate the geodesic distances by shortest graph distance.

• MDS (multi-dimensional scaling) using geodesic distances

• Algorithm:

• Construct a neighborhood graph

• Construct a distance matrix

• Find the shortest path between every i and j (e.g. using Floyd-Warshall) and construct a

new distance matrix such that Dij is the length of the shortest path between i and j.

• Apply MDS to matrix to find coordinates - LLE (Locally Linear Embedding)

• General idea: represent each point on the local linear subspace of the

manifold as a linear combination of its neighbors to characterize the local

neighborhood relations; then use the same linear coefficient for embedding

to preserve the neighborhood relations in the low dimensional space;

• Compute the coefficients w for each data point by solving a constrained least-squares (LS) problem;

• Algorithm:

• 1. Find weight matrix W of linear coefficients

• 2. Find low dimensional embedding Y that minimizes the reconstruction error $\Phi(Y) = \sum_i \| Y_i - \sum_j W_{ij} Y_j \|^2$

• 3. Solution: Eigen-decomposition of M=(I-W)’(I-W) - Laplacian Eigenmaps

• General idea: minimize the norm of Laplace-Beltrami operator on the

manifold

• measures how far apart maps nearby points.

• Avoid the trivial solution of f = const.

• The Laplace-Beltrami operator can be approximated by Laplacian of the neighborhood

graph with appropriate weights.

• Construct the Laplacian matrix L=D-W.

• can be approximated by its discrete equivalent

• Algorithm:

• Construct a neighborhood graph (e.g., epsilon-ball, k-nearest neighbors).

• Construct an adjacency matrix with the following weights

• Minimize

• The generalized eigen-decomposition of the graph Laplacian is

• Spectral embedding of the Laplacian manifold:

• The first eigenvector is trivial (the all-ones vector). - Gaussian Mixture Model & EM

• Mixture model is a probabilistic model for representing the presence

of subpopulations within an overall population;

• “Mixture models" are used to make statistical inferences about the properties

of the sub-populations given only observations on the pooled population;

• A Gaussian mixture model can be Bayesian or non-Bayesian;

• A variety of approaches focus on maximum likelihood estimation (MLE),

typically via expectation maximization (EM), or on maximum a posteriori (MAP) estimation;

• EM is used to determine the parameters of a mixture with an a priori given

number of components (some variants adapt the number during the iterations);

• Expectation step: "partial membership" of each data point in each constituent

distribution is computed by calculating expectation values for the membership

variables of each data point;

• Maximization step: plug-in estimates of the distribution parameters (mixing coefficients and component model

parameters) are re-computed from these memberships;

• Each successive EM iteration will not decrease the likelihood.
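The E-step and M-step above can be sketched for a one-dimensional Gaussian mixture. This is a minimal illustration, not the deck's implementation: the function names, the toy data, the initial values and the small variance floor are all assumptions. The returned `history` lets one check that the log-likelihood does not decrease from one iteration to the next.

```python
import math

def gauss(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(points, mus, variances, weights, n_iter=20):
    """EM for a 1-D Gaussian mixture with a fixed number of components."""
    k = len(mus)
    mus, variances, weights = list(mus), list(variances), list(weights)
    history = []
    for _ in range(n_iter):
        # E-step: "partial membership" (responsibility) of each point in each component.
        resp, loglik = [], 0.0
        for x in points:
            joint = [weights[j] * gauss(x, mus[j], variances[j]) for j in range(k)]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([p / total for p in joint])
        history.append(loglik)
        # M-step: re-estimate mixing coefficients, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(points)
            mus[j] = sum(r[j] * x for r, x in zip(resp, points)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, points)) / nj
            variances[j] = max(var, 1e-6)  # assumed floor to avoid a collapsed component
    return mus, variances, weights, history

# Toy data: two well-separated clusters (hypothetical values).
data = [-2.1, -1.9, -2.0, -1.8, -2.2, 2.0, 1.9, 2.1, 2.2, 1.8, 2.05]
mus, variances, weights, history = em_gmm_1d(data, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
print(mus, weights)
```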

• Alternatives of EM for mixture models:

• mixture model parameters can be deduced using posterior sampling as indicated

by Bayes' theorem, i.e. Gibbs sampling or Markov Chain Monte Carlo (MCMC);

• Spectral methods based on SVD;

• Graphical model: MRF or CRF. - Gaussian Mixture Model & EM
- Hidden Markov Model

• A hidden Markov model (HMM) is a statistical Markov model: the

modeled system is a Markov process with unobserved (hidden) states;

• In HMM, state is not visible, but output, dependent on state, is visible.

• Each state has a probability distribution over the possible output tokens;

• Sequence of tokens generated by an HMM gives some information about

the sequence of states.

• Note: the adjective 'hidden' refers to the state sequence through which

the model passes, not to the parameters of the model;

• An HMM can be considered a generalization of a mixture model where the

hidden variables are related through a Markov process;

• Inference: prob. of an observed sequence by Forward-Backward

Algorithm and the most likely state trajectory by Viterbi algorithm (DP);

• Learning: optimize state transition and output probabilities by Baum-

Welch algorithm (special case of EM). - • A flow network G(V, E) defined as a fully

connected directed graph where each edge

(u,v) in E has a non-negative capacity c(u,v) >= 0;

• The max-flow problem is to find the flow of

maximum value on a flow network G;

• A s-t cut or simply cut of a flow network G is a

partition of V into S and T = V-S, such that s in S

and t in T;

• A minimum cut of a flow network is a cut

whose capacity is the least over all the s-t cuts

of the network;

• Methods of max flow or min-cut:

• Ford-Fulkerson method;

• "Push-Relabel" method. - • Mostly labeling is solved as an energy minimization problem;

• Two common energy models:

• Potts Interaction Energy Model;

• Linear Interaction Energy Model.

• Graph G contain two kinds of vertices: p-vertices and i-vertices;

• all the edges in the neighborhood N, called n-links;

• edges between the p-vertices and the i-vertices called t-links.

• In the multiple labeling case, the multi-way cut should leave each p-vertex

connected to one i-vertex;

• The minimum cost multi-way cut will minimize the energy function where

the severed n-links would correspond to the boundaries of the labeled

vertices;

• The approximation algorithms to find this multi-way cut:

• "alpha-expansion" algorithm;

• "alpha-beta swap" algorithm. - A simplified Bayes Net: it propagates info. throughout a graphical

model via a series of messages between neighboring nodes

iteratively; likely to converge to a consensus that determines the

marginal prob. of all the variables;

messages estimate the cost (or energy) of a configuration of a

clique given all other cliques; then the messages are combined to

compute a belief (marginal or maximum probability);

• Two types of BP methods:

• max-product;

• sum-product.

• BP provides exact solution when there are no loops in graph!

• Equivalent to dynamic programming/Viterbi in these cases;
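To illustrate why sum-product BP is exact on a loop-free graph, the sketch below runs it on a small chain MRF and compares the resulting marginals with brute-force enumeration. The potentials, the chain length and the function names are assumptions chosen for illustration.

```python
from itertools import product

def chain_marginals(unary, pairwise):
    """Sum-product on a chain: exact marginals from forward and backward messages."""
    n, k = len(unary), len(unary[0])
    fwd = [[1.0] * k for _ in range(n)]  # message reaching node i from the left
    bwd = [[1.0] * k for _ in range(n)]  # message reaching node i from the right
    for i in range(1, n):
        for t in range(k):
            fwd[i][t] = sum(fwd[i - 1][s] * unary[i - 1][s] * pairwise[s][t]
                            for s in range(k))
    for i in range(n - 2, -1, -1):
        for s in range(k):
            bwd[i][s] = sum(pairwise[s][t] * unary[i + 1][t] * bwd[i + 1][t]
                            for t in range(k))
    marginals = []
    for i in range(n):
        belief = [fwd[i][s] * unary[i][s] * bwd[i][s] for s in range(k)]
        z = sum(belief)
        marginals.append([b / z for b in belief])
    return marginals

def brute_force_marginals(unary, pairwise):
    """Reference marginals obtained by enumerating every joint configuration."""
    n, k = len(unary), len(unary[0])
    marg = [[0.0] * k for _ in range(n)]
    for states in product(range(k), repeat=n):
        p = 1.0
        for i, s in enumerate(states):
            p *= unary[i][s]
        for i in range(n - 1):
            p *= pairwise[states[i]][states[i + 1]]
        for i, s in enumerate(states):
            marg[i][s] += p
    z = sum(marg[0])
    return [[v / z for v in row] for row in marg]

# Hypothetical potentials: three binary nodes with a smoothness-favouring pairwise term.
unary = [[0.9, 0.1], [0.5, 0.5], [0.3, 0.7]]
pairwise = [[2.0, 1.0], [1.0, 2.0]]
bp = chain_marginals(unary, pairwise)
exact = brute_force_marginals(unary, pairwise)
print(bp)
```

Replacing the sums by maxima in the two message loops would give the max-product variant, which on a chain corresponds to Viterbi decoding.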

• Loopy Belief Propagation: still provides approximate (but often

good) solution; - • Generalized BP for pairwise MRFs

• Hidden variables xi and xj are connected through a compatibility

function;

• Hidden variables xi are connected to observable variables yi by

the local “evidence” function;

• The joint probability of {x} is given by

• To improve inference by taking into account higher-order interactions

among the variables;

• An intuitive way is to define messages that propagate between groups of nodes

rather than just single nodes;

• This is the intuition in Generalized Belief Propagation (GBP). - Discriminative Model: CRF

• Conditional , not joint, probabilistic sequential models p(y|x)

• Allow arbitrary, non-independent features on the observation seq X

• Specify the probability of possible label seq given an observation seq

• Prob. of a transition between labels depend on past/future observ.

• Relax strong independence assumptions, no p(x) required

• CRF is MRF plus “external” variables, where “internal” variables Y of

MRF are un-observables and “external” variables X are observables

• Linear chain CRF: transition score depends on current observation

• Inference by DP like HMM, learning by forward-backward as HMM

• Optimization for learning CRF: discriminative model

• Conjugate gradient, stochastic gradient,… - Product of Experts (PoE)

• Model a probability distribution by combining the output from

several simpler distributions.

• Combine several probability distributions ("experts") by multiplying

their density functions— similar to “AND" operation.

• This allows each expert to make decisions on the basis of a few dimensions without

having to cover the full dimensionality.

• Related to (but quite different from) a mixture model, combining

several probability distributions via “OR" operation.

• Learning by CD: run N samplers in parallel, one for each data-case in

the (mini-)batch;

• Boosting: focusing on training data with high reconstruct. errors;

• Inference is easy: it does not suffer from “explaining away”. - Stochastic Gradient Descent (SGD)

• The general class of estimators that arise as minimizers of

sums are called M-estimators;

• Where are stationary points of the likelihood function (or zeroes of its

derivative, the score function)?

• Online gradient descent samples a subset of summand

functions at every step;

• The true gradient of the objective is approximated by the gradient at a single example;

• Shuffling of training set at each pass.

• There is a compromise between two forms, often called

"mini-batches", where the true gradient is approximated by a

sum over a small number of training examples.
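A minimal sketch of mini-batch SGD with per-pass shuffling, fitting a line by least squares. The data, learning rate, batch size and function name are assumptions for illustration; the batch gradient is averaged rather than summed, which differs from a sum only by a rescaling of the learning rate.

```python
import random

def minibatch_sgd(xs, ys, lr=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y = w * x + b by mini-batch SGD on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # reshuffle the training set at each pass
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Gradient estimated from the mini-batch only, not the full sum.
            gw = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            gb = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * gw
            b -= lr * gb
    return w, b

# Noise-free toy data generated from y = 3x + 1 (hypothetical).
xs = [i / 10 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
w_hat, b_hat = minibatch_sgd(xs, ys)
print(w_hat, b_hat)
```

Setting `batch_size=1` gives the single-example online form, and `batch_size=len(xs)` gives ordinary batch gradient descent.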

• With appropriately decreasing learning rates, SGD converges almost surely to a global minimum when the

objective function is convex or pseudo-convex, and otherwise

converges almost surely to a local minimum. - Back Propagation

E (f(x0,w),y0) = -log (f(x0,w)- y0). - Loss function

• Euclidean loss is used for regressing to real-valued labels in (-inf, inf);

• Sigmoid cross-entropy loss is used for predicting K independent probability

values in [0,1];

• Softmax (normalized exponential) loss is predicting a single class of K mutually

exclusive classes;

• Generalization of the logistic function that "squashes" a K-dimensional vector

of arbitrary real values z to a K-dimensional vector of real values σ(z) in the

range (0, 1) that sum to 1.

• The predicted probability for the j'th class given a sample vector x with class scores z is $P(y = j \mid x) = \sigma(z)_j = e^{z_j} / \sum_{k=1}^{K} e^{z_k}$.
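A short sketch of the softmax function and the corresponding cross-entropy loss for one sample. Subtracting the maximum score before exponentiating is a common numerical-stability step and is an addition here, not something stated on the slide; the function names are illustrative.

```python
import math

def softmax(z):
    """Map K real-valued scores to K probabilities that sum to 1."""
    m = max(z)  # shift by the maximum so exp() cannot overflow
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_cross_entropy(z, label):
    """Negative log-probability assigned to the true class index `label`."""
    return -math.log(softmax(z)[label])

probs = softmax([2.0, 1.0, 0.1])
loss = softmax_cross_entropy([2.0, 1.0, 0.1], label=0)
print(probs, loss)
```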

• Sigmoidal or Softmax normalization is a way of reducing the influence of

extreme values or outliers in the data without removing them from the

dataset. - Variable Learning Rate

• Too large learning rate

• cause oscillation in searching for the minimal point

• Too small learning rate

• too slow convergence to the minimal point

• Adaptive learning rate

• At the beginning, the learning rate can be large when the current

point is far from the optimal point;

• Gradually, the learning rate will decay as time goes by.

• Should not be too large or too small:

• annealing rate 𝛼(𝑡)=𝛼(0)/(1+𝑡/𝑇)

• 𝛼(𝑡) will eventually go to zero, but at the beginning it is almost a

constant. - Variable Momentum

- AdaGrad/AdaDelta
- Data Augmentation for Overfitting

• The easiest and most common method to reduce overfitting on

image data is to artificially enlarge the dataset using label-

preserving transformations;

• Perturbing an image I by transformations that leave the

underlying class unchanged (e.g. cropping and flipping) in order

to generate additional examples of the class;

• Two distinct forms of data augmentation:

• image translations and horizontal reflections;

• changing RGB intensities. - Dropout and Maxout for Overfitting

• Dropout: set the output of each hidden neuron to zero with probability

0.5.

• Motivation: Combining many different models that share parameters

succeeds in reducing test errors by approximately averaging together the

predictions, which resembles bagging.

• The units which are “dropped out” in this way do not contribute to the

forward pass and do not participate in back propagation.

• So every time an input is presented, the NN samples a different architecture,

but all these architectures share weights.

• This technique reduces complex co-adaptations of units, since a neuron

cannot rely on the presence of particular other units.

• It is, therefore, forced to learn more robust features that are useful in

conjunction with many different random subsets of the other units.

• Without dropout, the network exhibits substantial overfitting.

• Dropout roughly doubles the number of iterations required to converge.
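The dropout bullets can be sketched as a forward pass that zeroes units with probability 0.5 and a backward pass that blocks gradients through the dropped units. Scaling the outputs by the keep probability at test time is the usual companion step and is an assumption here, as are the function names and sample values.

```python
import random

def dropout_forward(activations, p_drop=0.5, train=True, rng=None):
    """Training: zero each unit with probability p_drop. Testing: scale by 1 - p_drop."""
    rng = rng or random.Random(0)
    if train:
        mask = [0.0 if rng.random() < p_drop else 1.0 for _ in activations]
    else:
        mask = [1.0 - p_drop] * len(activations)  # approximate averaging over sampled nets
    return [a * m for a, m in zip(activations, mask)], mask

def dropout_backward(grad_out, mask):
    """Dropped units receive no gradient, so they do not take part in back propagation."""
    return [g * m for g, m in zip(grad_out, mask)]

hidden = [0.2, 1.5, 0.7, 0.9, 1.1, 0.4]  # hypothetical hidden activations
train_out, train_mask = dropout_forward(hidden, train=True, rng=random.Random(42))
grad_in = dropout_backward([1.0] * len(hidden), train_mask)
test_out, _ = dropout_forward(hidden, train=False)
print(train_out, test_out)
```

Each call with `train=True` samples a different mask, which corresponds to sampling a different architecture that shares the same weights.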

• Maxout takes the maximum across multiple feature maps; - Weight Decay for Overfitting

• Weight decay or L2 regularization adds a penalty term to the error function,

called the regularization term; in the Bayesian justification it is the negative log prior.

• Weight decay rescales the weights in the learning rule, while the bias updates stay the

same;

• Small weights are preferred; large weights are allowed only if they improve the original cost

function enough;

• A compromise between finding small weights and minimizing the original cost

function;

• In a linear model, weight decay is equivalent to ridge (Tikhonov) regression;
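The "rescaling" view of weight decay can be checked numerically. Assuming the penalty is (λ/2)·‖w‖² (a common convention, not stated on the slide), a gradient step on the penalized loss equals first shrinking the weights by (1 − ηλ) and then taking the usual gradient step. The function names and numbers below are illustrative.

```python
def step_with_weight_decay(w, grad, lr, lam):
    """Shrink weights by (1 - lr * lam), then apply the plain gradient step."""
    return [(1.0 - lr * lam) * wi - lr * gi for wi, gi in zip(w, grad)]

def step_on_penalized_loss(w, grad, lr, lam):
    """Gradient step on E(w) + (lam / 2) * ||w||^2, whose gradient is grad + lam * w."""
    return [wi - lr * (gi + lam * wi) for wi, gi in zip(w, grad)]

w = [0.5, -1.2, 3.0]       # hypothetical weights
grad = [0.1, -0.4, 0.25]   # hypothetical gradient of the original cost E
decayed = step_with_weight_decay(w, grad, lr=0.1, lam=0.01)
penalized = step_on_penalized_loss(w, grad, lr=0.1, lam=0.01)
print(decayed, penalized)
```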

• L1 regularization: the weights not really useful shrink by a constant amount

toward zero;

• Act like a form of feature selection;

• Make the input filters cleaner and easier to interpret;

• L2 regularization penalizes large values strongly, while L1 regularization penalizes all nonzero weights at the same rate;

• Markov Chain Monte Carlo (MCMC): simulating a Markov chain whose

equilibrium distr. is the posterior distribution for weights & hyper-parameters;

• Hybrid Monte Carlo: gradient and sampling. - Early Stopping for Overfitting

• Steps in early stopping:

• Divide the available data into training and validation sets.

• Use a large number of hidden units.

• Use very small random initial values.

• Use a slow learning rate.

• Compute the validation error rate periodically during training.

• Stop training when the validation error rate "starts to go up".
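One simple way to decide when the validation error "starts to go up" is a patience rule: stop once the error has failed to improve on its best value for a fixed number of checks. The sketch below is one such rule; the `patience` parameter, the function name and the error values are assumptions, not something the slide prescribes.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of periodic validation errors."""
    best_err = float("inf")
    best_epoch = -1
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch = err, epoch  # new best: this checkpoint would be kept
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch           # no improvement for `patience` checks
    return best_epoch, len(val_errors) - 1     # never triggered: ran to the end

# Hypothetical validation curve with a brief bump before the true minimum.
val_errors = [0.50, 0.40, 0.35, 0.36, 0.34, 0.37, 0.38, 0.39, 0.33]
best_epoch, stop_epoch = early_stopping_epoch(val_errors, patience=3)
print(best_epoch, stop_epoch)
```

A small patience stops sooner but can be fooled by noise in the validation curve, which is one form of the practical issue raised on this slide.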

• Early stopping has several advantages:

• It is fast.

• It can be applied successfully to networks in which the number of weights

far exceeds the sample size.

• It requires only one major decision by the user: what proportion of

validation cases to use.

• Practical issues in early stopping:

• How many cases do you assign to the training and validation sets?

• Do you split the data into training and validation sets randomly or by

some systematic algorithm?

• How do you tell when the validation error rate "starts to go up"? - MCMC Sampling for Optimization

• Markov Chain: a stochastic process in which future states are

independent of past states given the present state.

• Markov chain will typically converge to a stable distribution.

• Markov Chain Monte Carlo: sampling using ‘local’ information

• Devise a Markov chain whose stationary distribution is the target.

• Ergodic MC must be aperiodic, irreducible, and positive recurrent.

• Monte Carlo Integration to get quantities of interest.

• Metropolis-Hastings method: sampling from a target distribution

• Create a Markov chain whose transition matrix does not depend on

the normalization term.

• Make sure the chain has a stationary distribution and it is equal to the

target distribution (accept ratio).

• After a sufficient number of iterations, the chain will converge to the

stationary distribution.
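A minimal random-walk Metropolis sketch of the points above: the acceptance test uses only a ratio of target values, so the normalization term never has to be computed. The Gaussian proposal, step size, burn-in length and target are assumptions chosen for illustration.

```python
import math
import random

def metropolis_hastings(log_unnorm, x0, n_samples, step=1.0, seed=0):
    """Sample from a density known only up to a constant, via a symmetric proposal."""
    rng = random.Random(seed)
    x, log_p = x0, log_unnorm(x0)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        log_p_new = log_unnorm(proposal)
        # Acceptance ratio p(new) / p(old): the unknown normalizer cancels.
        accept_prob = math.exp(min(0.0, log_p_new - log_p))
        if rng.random() < accept_prob:
            x, log_p = proposal, log_p_new
        samples.append(x)
    return samples

def log_target(x):
    """Unnormalized log-density of a Gaussian with mean 3 and unit variance."""
    return -0.5 * (x - 3.0) ** 2

samples = metropolis_hastings(log_target, x0=0.0, n_samples=20000)
kept = samples[2000:]  # discard an assumed burn-in period
sample_mean = sum(kept) / len(kept)
sample_var = sum((s - sample_mean) ** 2 for s in kept) / len(kept)
print(sample_mean, sample_var)
```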

• Gibbs sampling is a special case of M-H Sampling.

• The Hammersley-Clifford Theorem: get the joint distribution from the

complete conditional distribution.

• Hybrid Monte Carlo: gradient sub step for each Markov chain. - Mean Field for Optimization

• Variational approximation modifies the optimization problem to

be tractable, at the price of approximate solution;

• Mean Field replaces M with a (simple) subset M(F), on which A*

(μ) is a closed form (Note: F is disconnected graph);

• Density becomes factorized product distribution in this sub-family.

• Objective: K-L divergence.

• Mean field is a structured variational approximation approach:

• Coordinate ascent (deterministic);

• Compared with stochastic approximation (sampling):

• Faster, but maybe not exact. - Contrastive Divergence for RBMs

• Contrastive divergence (CD) is proposed for training PoE first, also

being a quicker way to learn RBMs;

• Contrastive divergence as the new objective;

• Taking gradients and ignoring a term which is usually very small.

• Steps:

• Start with a training vector on the visible units.

• Then alternate between updating all the hidden units in parallel and

updating all the visible units in parallel.

• Can be applied using any MCMC algorithm to simulate the model (not

limited to just Gibbs sampling);
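The steps above can be sketched as a single CD-1 update for a small binary RBM, using one step of alternating Gibbs sampling. The layer sizes, learning rate, toy patterns and function names are assumptions, and the use of hidden probabilities rather than samples in the statistics is a common practical choice, not something the slide specifies.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_hidden(v, W, b_h, rng):
    """Update all hidden units in parallel given the visible vector."""
    probs = [sigmoid(b_h[j] + sum(v[i] * W[i][j] for i in range(len(v))))
             for j in range(len(b_h))]
    return probs, [1.0 if rng.random() < p else 0.0 for p in probs]

def sample_visible(h, W, b_v, rng):
    """Update all visible units in parallel given the hidden vector."""
    probs = [sigmoid(b_v[i] + sum(h[j] * W[i][j] for j in range(len(h))))
             for i in range(len(b_v))]
    return probs, [1.0 if rng.random() < p else 0.0 for p in probs]

def cd1_update(v0, W, b_v, b_h, lr, rng):
    """One CD-1 step: data statistics minus one-step reconstruction statistics."""
    ph0, h0 = sample_hidden(v0, W, b_h, rng)   # start from a training vector
    pv1, v1 = sample_visible(h0, W, b_v, rng)  # reconstruct the visible units
    ph1, _ = sample_hidden(v1, W, b_h, rng)    # hidden units for the reconstruction
    for i in range(len(b_v)):
        for j in range(len(b_h)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        b_v[i] += lr * (v0[i] - v1[i])
    for j in range(len(b_h)):
        b_h[j] += lr * (ph0[j] - ph1[j])
    return pv1

rng = random.Random(0)
n_vis, n_hid = 4, 2
W = [[rng.gauss(0.0, 0.01) for _ in range(n_hid)] for _ in range(n_vis)]
b_v, b_h = [0.0] * n_vis, [0.0] * n_hid
patterns = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]  # hypothetical training vectors
for _ in range(200):
    for v in patterns:
        recon = cd1_update(v, W, b_v, b_h, lr=0.1, rng=rng)
print(recon)
```

Persistent CD would differ only in where the negative-phase chain starts: from the previous model sample instead of from the training vector.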

• CD learning is biased: its update is not the exact gradient of the log-likelihood, so it does not behave as true gradient descent

• Improved: Persistent CD explores more modes in the distribution

• Rather than from data samples, begin sampling from the model samples

obtained from the last gradient update.

• Still suffers from divergence of likelihood due to missing the modes.

• Score matching: the score function does not depend on the normalization

factor, so match it between the model and the empirical density. - “Wake-Sleep” Algorithm for DBN

• Pre-trained DBN is a generative model;

• Do a stochastic bottom-up pass (wake phase)

• Get samples from factorial distribution (visible first, then generate

hidden);

• Adjust the top-down weights to be good at reconstructing the feature

activities in the layer below.

• Do a few iterations of sampling in the top level RBM

• Adjust the weights in the top-level RBM.

• Do a stochastic top-down pass (sleep phase)

• Get visible and hidden samples generated by generative model using

data coming from nowhere!

• Adjust the bottom-up weights to be good at reconstructing the feature

activities in the layer above.
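
The two phases can be sketched on a simplified model. Note the swap: this is a one-hidden-layer Helmholtz machine with a factorial prior on the hidden units, not a full DBN, so the top-level RBM sampling step is left out. The class name `HelmholtzMachine`, the delta-rule updates and the toy data are assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bernoulli(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]


def affine_sigmoid(x, W, bias):
    # sigmoid(bias[k] + sum_i x[i] * W[i][k]) for each output unit k.
    return [sigmoid(bias[k] + sum(x[i] * W[i][k] for i in range(len(x))))
            for k in range(len(bias))]


class HelmholtzMachine:
    def __init__(self, n_vis, n_hid, rng):
        self.rng = rng
        self.R = [[0.0] * n_hid for _ in range(n_vis)]  # bottom-up (recognition) weights
        self.r_bias = [0.0] * n_hid
        self.G = [[0.0] * n_vis for _ in range(n_hid)]  # top-down (generative) weights
        self.g_bias = [0.0] * n_vis
        self.prior = [0.0] * n_hid                      # generative logits of hidden units

    def wake(self, v, lr):
        # Stochastic bottom-up pass on real data, then adjust the top-down
        # weights to be better at reconstructing the layer below.
        h = bernoulli(affine_sigmoid(v, self.R, self.r_bias), self.rng)
        pv = affine_sigmoid(h, self.G, self.g_bias)
        for j in range(len(h)):
            self.prior[j] += lr * (h[j] - sigmoid(self.prior[j]))
            for i in range(len(v)):
                self.G[j][i] += lr * h[j] * (v[i] - pv[i])
        for i in range(len(v)):
            self.g_bias[i] += lr * (v[i] - pv[i])

    def sleep(self, lr):
        # Stochastic top-down pass ("dream"), then adjust the bottom-up
        # weights to be better at recovering the layer above.
        h = bernoulli([sigmoid(p) for p in self.prior], self.rng)
        v = bernoulli(affine_sigmoid(h, self.G, self.g_bias), self.rng)
        ph = affine_sigmoid(v, self.R, self.r_bias)
        for i in range(len(v)):
            for j in range(len(h)):
                self.R[i][j] += lr * v[i] * (h[j] - ph[j])
        for j in range(len(h)):
            self.r_bias[j] += lr * (h[j] - ph[j])
        return v


def train(data, n_hid=3, epochs=200, lr=0.1, seed=0):
    model = HelmholtzMachine(len(data[0]), n_hid, random.Random(seed))
    for _ in range(epochs):
        for v in data:
            model.wake(v, lr)
            model.sleep(lr)
    return model


# Hypothetical toy data: two complementary binary patterns.
model = train([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]])
```

The wake phase only touches the generative parameters and the sleep phase only touches the recognition parameters, which is why neither phase alone guarantees an improvement of the overall objective.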

• Any guarantee for improvement? No!

• The “Wake-Sleep” algorithm tries to make the representation

economical to describe (in the sense of Shannon’s coding theory). - Greedy Layer-Wise Training

• Deep networks tend to have more local minima problems than

shallow networks during supervised training

• Train first layer using unlabeled data

• Supervised or semi-supervised: use more unlabeled data.

• Freeze the first layer parameters and train the second layer

• Repeat this for as many layers as desired

• Build more robust features

• Use the outputs of the final layer to train the last supervised

layer (leave early weights frozen)

• Fine tune the full network with a supervised approach;
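
The procedure above can be sketched with small RBMs as the layers. This is one possible instantiation under assumptions: the slides do not fix the layer type, and the layer sizes, toy data, CD-1 trainer and logistic output layer are illustrative. The final supervised fine-tuning of the full network is omitted to keep the sketch short.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def hidden_probs(v, W, c):
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]


def train_rbm(data, n_hidden, rng, epochs=200, lr=0.1):
    # Unsupervised CD-1 training of a single layer; returns its (W, c).
    # Inputs may be probabilities in [0, 1] when this is an upper layer.
    n_vis = len(data[0])
    W = [[rng.gauss(0.0, 0.1) for _ in range(n_hidden)] for _ in range(n_vis)]
    b, c = [0.0] * n_vis, [0.0] * n_hidden
    for _ in range(epochs):
        for v0 in data:
            ph0 = hidden_probs(v0, W, c)
            h0 = [1 if rng.random() < p else 0 for p in ph0]
            pv1 = [sigmoid(b[i] + sum(W[i][j] * h0[j] for j in range(n_hidden)))
                   for i in range(n_vis)]
            v1 = [1 if rng.random() < p else 0 for p in pv1]
            ph1 = hidden_probs(v1, W, c)
            for i in range(n_vis):
                b[i] += lr * (v0[i] - v1[i])
                for j in range(n_hidden):
                    W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
            for j in range(n_hidden):
                c[j] += lr * (ph0[j] - ph1[j])
    return W, c


def pretrain_stack(unlabeled, layer_sizes, rng):
    layers, inputs = [], unlabeled
    for n_hidden in layer_sizes:
        W, c = train_rbm(inputs, n_hidden, rng)           # train this layer only
        layers.append((W, c))                             # then freeze it
        inputs = [hidden_probs(x, W, c) for x in inputs]  # its outputs feed the next layer
    return layers


def features(x, layers):
    for W, c in layers:
        x = hidden_probs(x, W, c)
    return x


def predict(f, w, bias):
    return sigmoid(bias + sum(wi * fi for wi, fi in zip(w, f)))


def log_loss(feats, labels, w, bias):
    total = 0.0
    for f, y in zip(feats, labels):
        p = min(max(predict(f, w, bias), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(feats)


def train_output_layer(feats, labels, steps=500, lr=0.5):
    # Supervised logistic layer on top; the pretrained layers stay frozen.
    w, bias = [0.0] * len(feats[0]), 0.0
    for _ in range(steps):
        errs = [predict(f, w, bias) - y for f, y in zip(feats, labels)]
        bias -= lr * sum(errs) / len(errs)
        w = [wi - lr * sum(e * f[k] for e, f in zip(errs, feats)) / len(errs)
             for k, wi in enumerate(w)]
    return w, bias


def flip(x, k):
    y = list(x)
    y[k] = 1 - y[k]
    return y


# Hypothetical toy data: many unlabeled patterns, only four labeled ones.
A, B = [1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]
unlabeled = [A, B] * 4 + [flip(A, k) for k in range(6)] + [flip(B, k) for k in range(6)]
labeled, labels = [A, B, flip(A, 0), flip(B, 5)], [0, 1, 0, 1]

rng = random.Random(0)
layers = pretrain_stack(unlabeled, [4, 2], rng)
top = [features(x, layers) for x in labeled]
w, bias = train_output_layer(top, labels)
final_loss = log_loss(top, labels, w, bias)
```

A full pipeline would follow this with back propagation through all layers, starting from the pretrained weights rather than from a random initialization.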

• Avoids the problems of training a deep net in a purely supervised fashion.

• Each layer gets full learning

• Helps with ineffective early-layer learning

• Helps with deep network local minima - Why Greedy Layer-Wise Training Works?

• Take advantage of the unlabeled data;

• Regularization Hypothesis

• Pre-training “constrains” the parameters to a region

relevant to the unsupervised dataset;

• Better generalization (representations that better

describe unlabeled data are more discriminative for

labeled data) ;

• Optimization Hypothesis

• Unsupervised training initializes lower level parameters

near localities of better minima than random

initialization can.

• Only need fine tuning in the supervised learning stage. - Two-Stage Pre-training in DBMs

• Pre-training in one stage

• Positive phase: clamp observed, sample hidden, using variational

approximation (mean-field)

• Negative phase: sample both observed and hidden, using

persistent sampling (stochastic approximation: MCMC)

• Pre-training in two stages

• Approximate a posterior distribution over the states of the hidden

units (using a simpler directed deep model such as a DBN or stacked DAE);

• Train an RBM by updating parameters to maximize the lower bound

of the log-likelihood and the corresponding posterior of the hidden units.

• Options (CAST, contrastive divergence, stochastic approximation…). - A. Reference

• Dempster, A., Laird, N., Rubin, D. (1977). "Maximum Likelihood from Incomplete

Data via the EM Algorithm". J.of the Royal Statistical Society B, 39 (1): 1–38.

• L. R. Rabiner (Feb 1989). "A tutorial on Hidden Markov Models and selected

applications in speech recognition". Proc. of the IEEE. 77 (2): 257–286.

• Mitchell, T. (1997). Machine Learning, McGraw Hill.

• Jensen, Finn (1996). An introduction to Bayesian networks. Berlin: Springer.

• Frey, Brendan (1998). Graphical Models for Machine Learning and Digital

Communication. MIT Press

• Gamerman, D. (1997). Markov Chain Monte Carlo: Stochastic Simulation for Bayesian

Inference. Chapman and Hall: London.

• M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, L. K. Saul (1999). An Introduction to

Variational Methods for Graphical Models, Machine Learning, v.37 n.2, p.183-233.

• S. Roweis & L. Saul (Dec. 2000). “Nonlinear dimensionality reduction by locally linear

embedding”. Science, v.290, pp.2323-2326.

• Stan Z. Li. (2001). Markov Random Field Modeling in Image Analysis. Springer-Verlag.

• J. Lafferty, A. McCallum, and F. Pereira. (2001). Conditional random fields:

Probabilistic models for segmenting and labeling sequence data. Proc. 18th

International Conf. on Machine Learning.

• Lee, Honglak; Battle, Alexis; Raina, Rajat; Ng, Andrew Y. (2006). "Efficient sparse

coding algorithms". Advances in Neural Information Processing Systems.

• U. von Luxburg (2007). "A tutorial on spectral clustering". Stat. Comp., Vol. 17, Issue 4,

pp. 395-416.