This page reproduces the content of http://www.slideshare.net/yuhuang/machine-learning-based-contour-boundary-detection-from-images.

by Yu Huang

Uploaded about 3 years ago (2013/10/21) in Technology

machine learning, scene understanding, static segmentation, Gestalt Cues, superpixel, logistic regression, MRF, CRF, manifold learning, ensemble learning, k-means, SVM, Naive Bayes, sparse coding, K-SVD, orthogonal matching pursuit, deep learning, RBM, DBM, DBN, SAE.

- Edges, Contours and Boundaries

Finding Meaningful Contours

Static Segmentation (Regions)

Classical Gestalt Cues

Berkeley Segmentation Data Set

Learning for Scene Segmentation

Learn a Local Boundary Model

Image Figure/Ground Assignment

Learning Edges and Boundaries

Sparse Models for Edge Detection

Boundary Detection and Grouping

Sparse Coding for Contour Detection

Sketch Tokens for Contour Detection

Deep learning shape prior for segmentation

Deep neural prediction network for visual boundary

References

Appendix
- Edges: significant local changes in an image; they occur on the boundary between two different regions.

Contour: a representation of linked edges for a region boundary.

◦ Closed: corresponds to a region boundary; a filling algorithm determines the pixels in the region.

◦ Open: part of a region boundary; gaps form because of a high edge-detection threshold or weak contrast. Open contours also occur when line fragments are linked together, as in drawing or handwriting.

Contour Representation:

◦ Ordered list of edges (chain codes)

◦ Curve model for a contour (piecewise line segments or cubic splines)
- Local edge detection

◦ Problems - false targets, misses

One solution: use other cues (image segmentation)

◦ Texture: Sharp changes in orientation, scale of textures

◦ Motion: >=2 Frames

◦ Disparity: Stereo

[Figure: stereo pair (left eye, right eye); motion pair (frame 1, frame 2)]
- Regional Approaches (split-merge, watershed, mean shift, ...)

Use regional info, optimize labelling of regional tokens, e.g. clustering

Depending on uniformity in object region

Active Contour Models (snakes)

◦ Use regional (external) & boundary (internal) info, optimize deformation of

model

◦ Sensitivity to initialization, too smooth

Level Set (implicit active contour)

handle topological changes naturally

not robust to boundary gaps

Contour Grouping

Use boundary info (& regional info), optimize grouping of contour fragments

Learning-based: Boundary Detection. - How is grouping done in human vision?

Proximity

Similarity

◦ Brightness

◦ Contrast

Good continuation

◦ Parallelism

◦ Co-circularity - Two-class classification model

Over segmentation as preprocessing

Use classical Gestalt cues

◦ Contour, texture, brightness and continuation

A linear classifier is used for training (logistic regression)

◦ Superpixels are local and coherent, and preserve structure (contour, texture)

[Figure: superpixel map; reconstruction of a human segmentation from K=200 superpixels]
- [Diagram: image → boundary cues (brightness, color, texture) → cue combination → model]

Challenges: texture cue, cue combination

Goal: learn the posterior probability of a boundary, $P_b(x, y, \theta)$, from local information only
- Human subjects label ground truth figure/ground assignments in natural images.

“Shapemes” encode high-level knowledge in a generic

way, capturing local figure/ground cues.

A conditional random field (CRF) incorporates junction

cues and enforces global consistency. - Shapemes (clusters of local shapes)

Color image

Pb edge maps contour/junction

human-marked

boundaries - Boosted Edge Learning (BEL): Probabilistic Boosting

Tree (PBT) classification;

Features: gradient+Haarlet, over a large image patch.

Learn to detect edges from images with labeled

ground truth; - PBT Training:
- Sparseland model and dictionary learning by k-SVD;

Edge detection as the pixelwise classification problem;

◦ “patches centered on edge pixel or not”;

Contour training: class specific edge classifier

Shape training: shape-based object classifier

Classification: edge classifier then shape classifier

◦ Bike, Motorbike, Person or Car?

Person? - Learning-based boundary detection: SIFT-based, dim. reduction by

PCA, boosting (Adaboost, Gentleboost and Madaboost);

Boundary grouping: use a normalized saliency criterion and fractional-linear programming to find graph cycles with minimum cost.
- Sparse Code Gradients (SCG): by sparse coding (k-SVD);

Gradient, color, plus depth & surface normal(option);

Linear classifier (SVM) with contrast features (SCG);

Globalization by computing a spectral gradient (like gPb) optionally; - Definition: straight lines, t-junctions, y-junctions, corners, curves, parallel lines;

Learned (k-means clustering) from patches of human generated contours: a

number of classes in hundreds (150 in the paper), Daisy (MSR) descriptors used

for shift invariance;

Low-level image features: gradient, color, orientation, etc.;

Classifier: Random decision forest for sketch token labeling from image patches.

Sketch Tokens

Like “Shapeme”? - Use deep Boltzmann machine to learn the hierarchical architecture of

shape priors: low level local feature and high level global feature;

Apply the learned architecture to model shape variations of global and

local structures;

A data-driven variational method to perform object extraction based

on shape probabilistic representation. - original result learned shape result by sparse learned shape
- Integration from multiple scales and semantic levels via multi-

streams of interlinked, layered, non-linear “deep” processing;

◦ Deep belief net with a variant of the mean-and-covariance RBM;

Unsupervised feature learning;

◦ Supervised boundary prediction by a feed-forward NN.
- • Contour detection accuracy can be improved by using the deep features learned from CNNs.

• Customize the training strategy by partitioning the contour (positive) data into subclasses and fitting each subclass with different model parameters.

• A new loss function, named positive-sharing loss, in which each subclass shares the loss for the whole positive class to learn the parameters.

• It introduces an extra regularizer to emphasize the losses for the positive and negative classes, which helps to explore more discriminative features.

CNN structure: explicitly visualizing the dimensions of each network layer.
- • Run the Canny edge detector to get candidate contour points.

• Around each candidate point, extract patches at four different scales and

simultaneously run them through the five convolutional layers of the KNet.

• Connect these convolutional layers to two separately-trained network branches.

• The first branch is trained for classification, the second is trained as a regressor.

• Outputs from these two sub-networks are averaged to produce the final score. - An input patch, centered around the candidate point, goes through five conv.

layers of the KNet. To extract high-level features, at each conv. layer extract a

small sub-volume of the feature map around the center point, and perform

max, average, and center pooling on this sub-volume. The pooled values feed

a bifurcated sub-network. The scalar outputs computed from the branches of a

bifurcated sub-network are averaged to produce a final contour prediction.
- X. Ren and J. Malik, "Learning a Classification Model for Segmentation", ICCV 2003.

D. Martin, C. Fowlkes, and J. Malik, "Learning to Detect Natural Image Boundaries Using Local Brightness, Color, and Texture Cues", IEEE T-PAMI, 2004.

P. Dollár, Z. Tu, and S. Belongie, "Supervised Learning of Edges and Object Boundaries", CVPR, 2005.

X. Ren, C. Fowlkes, and J. Malik, "Figure/Ground Assignment in Natural Images", ECCV 2006.

J. Mairal, M. Leordeanu, F. Bach, M. Hebert, and J. Ponce, "Discriminative Sparse Image Models for Class-Specific Edge Detection and Image Interpretation", ECCV 2008.

I. Kokkinos, "Highly Accurate Boundary Detection and Grouping", CVPR 2010.

X. Ren and L. Bo, "Discriminatively Trained Sparse Code Gradients for Contour Detection", NIPS 2012.

J. Lim, C. L. Zitnick, and P. Dollár, "Sketch Tokens: A Learned Mid-level Representation for Contour and Object Detection", CVPR 2013.

Chen, Yu, Hu, and Zeng, "Deep Learning Shape Priors for Object Segmentation", CVPR 2013.

Kivinen, Williams, and Heess, "Visual Boundary Prediction: A Deep Neural Prediction Network and Quality Dissection", AISTATS 2014.
- “Machine Learning is programming computers to optimize a

performance criterion using example data or past experience”

◦ Supervised/Unsupervised model: labeled/unlabeled data;

◦ Semi-supervised model: both labeled and unlabeled data;

◦ Online learning: incremental update;

◦ Ensemble classifiers: bagging, stacking, boosting, random forest,…

◦ Reinforcement Learning: learn by interacting with an environment.

Types of ML algorithms

◦ Prediction: predicting a variable from data

◦ Classification: assigning records to predefined groups

◦ Clustering: splitting records into groups based on similarity

◦ Association learning: seeing what often appears together with what

Relationship with others

◦ Artificial intelligence: emulate how the brain works with programs; ML is a branch of AI;

◦ Data mining: building models in order to detect the patterns;

◦ Statistical analysis: probabilistic models, on which to infer with data;

◦ Information retrieval: retrieval of information from a collection of data.
- Unsupervised learning is the problem of trying to find hidden structure in unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution.

It is closely related to the problem of density estimation in statistics, but it also encompasses many other techniques that seek to summarize and explain key features of the data.

Approaches to unsupervised learning include:

◦ Clustering;

◦ Hidden Markov models;

◦ Blind signal separation (PCA, ICA, NMF, SVD…);

Unsupervised methods in NN:

◦ Self Organizing Map: topographic organization in which nearby locations in

the map represent inputs with similar properties;

◦ Adaptive Resonance Theory: allows the number of clusters to vary with

problem size and lets the user control the degree of similarity between

members of the same clusters by means of the vigilance parameter. - Supervised learning is the task of inferring a function from labeled

training data. The training data consist of a set of training

examples.

Each example is a pair consisting of an input object (typically a

vector) and a desired output value (also called the supervisory

signal).

A supervised learning algorithm analyzes the training data and

produces an inferred function, used for mapping new examples.

There are four major issues to consider in supervised learning:

◦ tradeoff between bias and variance;

◦ amount of training data relative to the complexity of the "true" function;

◦ dimensionality of the input space: curse of dimensionality;

◦ degree of noise in the desired output values: over-fitting.

The standard supervised learning problem can be generalized in several ways:

◦ Semi-supervised learning: the desired output values are provided only for a

subset of the training data. The remaining data is unlabeled.

◦ Active learning: Instead of assuming that all of the training examples are

given at the start, interactively collect new examples, typically by making

queries to a human user. - Training/testing data (70%/30%)

Data unbalanced (one class’ data more than others)

◦ Sampling, learning algorithm modification (cost-sensitive), ensemble,…

Feature extraction

◦ Sparse coding, vector quantization,…

Curse of Dimensionality: Sensitivity to “noise”

◦ Dimension reduction, manifold learning/distance metric learning

Linear or non-linear model

◦ Local/Global minimum (convex/concave obj. function): Learning rate

◦ Regularization: L-1/L-2 norm

◦ Kernel trick: mapping nonlinear feature space to high dim. linear space

Discriminative or generative model

◦ Bottom up (conditional distribution) /Top down (joint distribution)

Over-fitting: Learn the “noise”

◦ Cross validation with grid search

Performance evaluation

◦ Precision/Recall, confusion matrix, ROC (receiver operating characteristic)
- Principal Component Analysis (PCA) uses orthogonal transformation to convert a

set of observations of possibly correlated variables into a set of linearly

uncorrelated variables called principal components.

This transformation is defined in such a way that the first principal component

has the largest possible variance and each succeeding component in turn has the

highest variance possible under the constraint that it be orthogonal to the

preceding components.
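As a minimal sketch of the definition above, the following pure-Python snippet estimates the first principal component of a small 2-D data set by power iteration on the sample covariance matrix. The function name `first_principal_component`, the data values, and the iteration count are illustrative assumptions and do not come from the slides.

```python
import math

def first_principal_component(points, iterations=100):
    """Return (unit direction, variance along it) for a list of 2-D points.

    Assumption: power iteration on the 2x2 sample covariance matrix,
    which converges to the eigenvector with the largest eigenvalue.
    """
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(p[0] - mean_x, p[1] - mean_y) for p in points]

    # Entries of the sample covariance matrix (divide by n - 1).
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)

    v = (1.0, 0.0)  # arbitrary non-zero starting direction
    for _ in range(iterations):
        wx = cxx * v[0] + cxy * v[1]
        wy = cxy * v[0] + cyy * v[1]
        norm = math.hypot(wx, wy)
        v = (wx / norm, wy / norm)

    # Rayleigh quotient: variance of the data projected on v.
    variance = (v[0] * (cxx * v[0] + cxy * v[1])
                + v[1] * (cxy * v[0] + cyy * v[1]))
    return v, variance

# Hypothetical data lying close to the line y = x.
data = [(-2.0, -2.1), (-1.0, -0.9), (0.0, 0.1), (1.0, 0.9), (2.0, 2.1)]
direction, variance = first_principal_component(data)
print(direction, variance)
```

For this data the recovered direction is close to the diagonal, and the variance along it exceeds the variance of either original coordinate, which is the property the definition asks of the first component.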

PCA is sensitive to the relative scaling of the original variables.

Also known as the Karhunen–Loève transform (KLT) or Hotelling transform; closely related to singular value decomposition (SVD), factor analysis, eigenvalue decomposition (EVD), and spectral decomposition;

Affinity Propagation (AP) is a clustering algorithm based on the concept of

"message passing" between data points. Unlike clustering algorithms such as k-

means or k-medoids, AP does not require the number of clusters to be

determined or estimated before running the algorithm;

Spectral Clustering makes use of the spectrum (eigenvalues) of the data similarity

matrix to perform dimensionality reduction before clustering in fewer

dimensions. - Independent component analysis (ICA) is for separating

a multivariate signal into additive subcomponents by assuming that the

subcomponents are non-Gaussian signals and all statistically

independent from each other.

◦ ICA is a special case of blind source separation.

Assumptions: the source signals are independent of each other;

distribution of the values in each source signals are non-Gaussian.

Three effects of mixing signals:

◦ Independence: the source signals are independent, but their mixtures may not be;

◦ Normality: a mixture is closer to Gaussian than any of the original sources;

◦ Complexity: Greater than that of its constituent source signal.
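The "normality" effect can be checked numerically. The sketch below is an illustration under stated assumptions: the two uniform sources, the equal mixing weights, and the helper `excess_kurtosis` are hypothetical choices, not part of the slides. It compares the excess kurtosis (zero for a Gaussian) of a source with that of a mixture.

```python
def excess_kurtosis(values):
    """Excess kurtosis: 0 for a Gaussian, about -1.2 for a uniform signal."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0

# Hypothetical sources: two independent uniform signals, enumerated on a
# full grid so that every (s1, s2) pair occurs exactly once.
levels = [i / 50.0 - 1.0 for i in range(101)]  # 101 values in [-1, 1]
source_1 = [a for a in levels for _ in levels]
source_2 = [b for _ in levels for b in levels]
mixture = [0.5 * a + 0.5 * b for a, b in zip(source_1, source_2)]

print(excess_kurtosis(source_1))  # close to -1.2
print(excess_kurtosis(mixture))   # closer to 0 than either source
```

ICA exploits this in reverse: it searches for projections of the mixtures whose non-Gaussianity is as large as possible.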

Preprocessing: centering, whitening and dimension reduction;

ICA finds the independent components (latent variables) by maximizing

the statistical independence of the estimated components;

Definitions of independence for ICA:

◦ Minimization of mutual information (KL divergence or entropy);

◦ Maximization of non-Gaussianity (kurtosis and negative entropy).
- [Figure: initial signal, mixed signal, whitening, ICA]
- Mixture model is a probabilistic model for representing the presence

of subpopulations within an overall population;

“Mixture models" are used to make statistical inferences about the

properties of the sub-populations given only observations on the pooled

population;

A Gaussian mixture model can be Bayesian or non-Bayesian;

A variety of approaches estimate the parameters by maximum likelihood (MLE), typically with expectation maximization (EM), or by maximum a posteriori (MAP) estimation;

EM is used to determine the parameters of a mixture with an a priori given number of components (a variant can adapt the number during the iterations);

◦ Expectation step: "partial membership" of each data point in each

constituent distribution is computed by calculating expectation values for

the membership variables of each data point;

◦ Maximization step: plug-in estimates, mixing coefficients and component

model parameters, are re-computed for the distribution parameters;

◦ Each successive EM iteration will not decrease the likelihood.
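The two steps can be sketched for a one-dimensional Gaussian mixture as follows. This is a minimal illustration, assuming a fixed number of components; the function names, the toy data, and the small variance floor are assumptions added for the example.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, iterations=20):
    """EM for a 1-D Gaussian mixture with a fixed number of components.

    Returns the fitted parameters and the log-likelihood measured at the
    start of every iteration.
    """
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    history = []
    for _ in range(iterations):
        # E-step: "partial membership" (responsibility) of each point.
        resp = []
        log_lik = 0.0
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint)
            log_lik += math.log(total)
            resp.append([p / total for p in joint])
        history.append(log_lik)

        # M-step: re-compute mixing coefficients, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # assumption: floor to avoid collapse
    return means, variances, weights, history

# Hypothetical data: two well-separated clusters around -2 and +2.
data = [-2.2, -2.1, -2.0, -1.9, -1.8, 1.8, 1.9, 2.0, 2.1, 2.2]
means, variances, weights, history = em_gmm_1d(
    data, means=[-1.0, 1.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
print(means, variances, weights)
```

The recorded `history` makes the last bullet observable: the log-likelihood does not decrease from one iteration to the next.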

Alternatives of EM for mixture models:

◦ mixture model parameters can be deduced using posterior sampling as

indicated by Bayes' theorem, i.e. Gibbs sampling or Markov Chain Monte

Carlo (MCMC);

◦ Spectral methods based on SVD;

◦ Graphical model: MRF or CRF.
- Non-negative matrix factorization (NMF): a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements.

The different types arise from using different cost functions for

measuring the divergence between V and W*H and possibly

by regularization of the W and/or H matrices;

◦ squared error, Kullback-Leibler divergence or total variation (TV);
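For the squared-error cost, one common choice is the multiplicative update rule of Lee and Seung. The sketch below is an assumption about which algorithm to illustrate (the slides do not name one), and the helper names, the toy matrix, and the small `eps` in the denominators are hypothetical.

```python
import random

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def squared_error(v, w, h):
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, iterations=100, seed=0, eps=1e-9):
    """Multiplicative updates for V ~ W*H under the squared-error cost."""
    rng = random.Random(seed)
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in v]
    h = [[rng.random() + 0.1 for _ in v[0]] for _ in range(rank)]
    errors = [squared_error(v, w, h)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps)
              for j in range(len(h[0]))] for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps)
              for j in range(rank)] for i in range(len(w))]
        errors.append(squared_error(v, w, h))
    return w, h, errors

# Hypothetical non-negative input matrix.
V = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]]
W, H, errors = nmf(V, rank=2)
print(errors[0], errors[-1])
```

Because every update multiplies by a ratio of non-negative quantities, W and H stay non-negative without any explicit projection step.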

NMF is an instance of a more general probabilistic model called "multinomial PCA", as is pLSA (probabilistic latent semantic analysis);

pLSA is a statistical technique for two-mode (extended naturally to

higher modes) analysis, modeling the probability of each co-

occurrence as a mixture of conditionally independent multinomial

distributions;

◦ Their parameters are learned using EM algorithm;

pLSA is based on a mixture decomposition derived from a latent

class model, not as downsizing the occurrence tables by SVD in LSA.

Note: an extended model, LDA (Latent Dirichlet allocation) , adds

a Dirichlet prior on the per-document topic distribution. - Note: d is the document index variable, c is a word's topic drawn from the document's

topic distribution, P(c|d), and w is a word drawn from the word distribution of this

word's topic, P(w|c). (d and w are observable variables, c is a latent variable.) - A hidden Markov model (HMM) is a statistical Markov model: the modeled

system is a Markov process with unobserved (hidden) states;

In HMM, state is not visible, but output, dependent on state, is visible.

◦ Each state has a probability distribution over the possible output tokens;

◦ Sequence of tokens generated by an HMM gives some information about the sequence of

states.

Note: the adjective 'hidden' refers to the state sequence through which the

model passes, not to the parameters of the model;

A HMM can be considered a generalization of a mixture model where the

hidden variables are related through a Markov process;

Inference: prob. of an observed sequence by Forward-Backward Algorithm

and the most likely state trajectory by Viterbi algorithm (DP);
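The Viterbi step can be sketched as a short dynamic program. The two-state model below (states, observations, and all probabilities) is a hypothetical toy example chosen only to exercise the code.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence by dynamic programming."""
    # best[t][s]: probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, prob = max(
                ((p, best[t - 1][p] * trans_p[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = prob * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]

# Hypothetical toy model.
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

path, prob = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
print(path, prob)
```

Replacing `max` by a sum over predecessor states turns the same recursion into the forward pass used to score an observed sequence.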

Learning: optimize state transition and output probabilities by Baum-Welch

algorithm (special case of EM). - logistic regression is a probabilistic statistical

classification model;

The prob. of the possible outcomes of a single trial are

modeled as a function of explanatory variables by a logistic

function;

Training: maximizes the conditional likelihood P(y|x) directly;
- Convex optimization (logistic function): $w^* = \arg\max_w P(Y \mid X, w)$;

◦ Adding a regularization term as well to counter overfitting;

◦ Iterative solution: a gradient descent method.
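A minimal sketch of the iterative solution, assuming a one-dimensional input, an L2 penalty, and a fixed step size (all illustrative choices). It performs gradient ascent on the regularized conditional log-likelihood, which is the same as gradient descent on its negative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, steps=2000, l2=0.01):
    """Fit P(y=1|x) = sigmoid(w*x + b) by gradient ascent on the
    L2-regularized conditional log-likelihood (1-D input for brevity)."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = -l2 * w  # gradient of the penalty -(l2/2) * w^2
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = y - sigmoid(w * x + b)  # per-sample log-likelihood gradient
            grad_w += error * x
            grad_b += error
        w += learning_rate * grad_w
        b += learning_rate * grad_b
    return w, b

# Hypothetical training set: negative inputs labeled 0, positive inputs labeled 1.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
print(w, b, sigmoid(w * 0.5 + b))
```

Without the penalty term the weight would grow without bound on this separable data, which is one practical reason for the regularization bullet above.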

In a Bayesian statistics context, prior distributions are normally

placed on the regression coefficients, usually as Gaussian

distributions;

◦ Apply Metropolis–Hastings algorithm (a more general MCMC method than

Gibbs sampling): based on proposal distribution or jumping distribution.

The proposal distribution Q proposes the next point that the random walk might move to.
- The Naive Bayes classifier is designed for use when features are independent of one another within each class, but it appears to work well in practice even when that independence assumption is not valid. It classifies data in two steps:

◦ Training step: Using the training samples, the method estimates the parameters of a

probability distribution, assuming features are conditionally independent given the class.

◦ Prediction step: For any unseen test sample, the method computes the posterior

probability of that sample belonging to each class. The method then classifies the test

sample according to the largest posterior probability.
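The two steps can be sketched for Gaussian class-conditional densities. The function names, the tiny data set, and the variance floor are assumptions made for this illustration.

```python
import math
from collections import defaultdict

def train_gaussian_nb(samples, labels):
    """Training step: per class, estimate a prior and, for every feature
    independently, a 1-D Gaussian (mean, variance)."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model = {}
    for y, rows in by_class.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))  # assumption: small floor for stability
        model[y] = (len(rows) / len(samples), stats)
    return model

def predict_gaussian_nb(model, x):
    """Prediction step: pick the class with the largest (unnormalized)
    log-posterior, i.e. log prior plus summed per-feature log densities."""
    def log_posterior(y):
        prior, stats = model[y]
        total = math.log(prior)
        for value, (mean, var) in zip(x, stats):
            total += (-0.5 * math.log(2.0 * math.pi * var)
                      - (value - mean) ** 2 / (2.0 * var))
        return total
    return max(model, key=log_posterior)

# Hypothetical two-feature training data.
samples = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (3.0, 0.5), (3.2, 0.7), (2.8, 0.3)]
labels = ["a", "a", "a", "b", "b", "b"]
model = train_gaussian_nb(samples, labels)
print(predict_gaussian_nb(model, (1.1, 2.1)), predict_gaussian_nb(model, (3.1, 0.4)))
```

Note that training touches each feature column separately, so the number of parameters grows linearly with the number of features.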

The class-conditional independence assumption greatly simplifies the

training step since you can estimate the one-dimensional class-

conditional density for each feature individually;

◦ While the class-conditional independence between features is not true in general,

research shows that this optimistic assumption works well in practice;

◦ This assumption of class independence allows the Naive Bayes classifier to better

estimate the parameters required for accurate classification while using less training data

than many other classifiers;

◦ This makes it particularly effective for datasets containing many predictors or features. - Supported distributions in NB classif.

◦ Naive Bayes is based on

estimating P(x|y), the probability or

probability density of features x given

class y.

◦ Support for normal (Gaussian), kernel,

multinomial, and multivariate

multinomial distributions.

Normal (Gaussian) Distribution: features have

normal distributions in each class;

Kernel: computes a separate kernel density

estimate for each class based on the training

data for that class;

Multinomial Distribution ("bag of words" model):

each feature for the count of one word;

classification is based on the relative frequencies

of the words.

Multivariate Multinomial Distribution: feature

categories, differ from the class levels for the

response variable. - Separable Data

◦ An SVM classifies data by finding the best hyperplane that

separates all data points of one class from those of the other

class.

◦ “Margin” means the maximal width of the slab parallel to the

hyperplane that has no interior data points.

◦ The support vectors are the data points that are closest to the

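separating hyperplane.
- Mathematical Formulation: Primal.

$$\min_{f,\,\xi_i} \; \tfrac{1}{2}\|f\|_K^2 + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i f(x_i) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; \text{for all } i$$

Mathematical Formulation: Dual.

$$\min_{\alpha} \; \tfrac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{l} \alpha_i \quad \text{subject to} \quad \sum_{i=1}^{l} \alpha_i y_i = 0, \;\; 0 \le \alpha_i \le C, \;\; \text{for all } i$$

Variables $\xi_i$ are slack variables measuring the error made at point $(x_i, y_i)$.
- Non-separable Data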

◦ Your data might not allow for a separating hyperplane. In that case, SVM can use

a soft margin, meaning a hyperplane that separates many, but not all data points.

Nonlinear Transformation with Kernels

◦ Some binary classification problems do not have a simple hyperplane as a useful

separating criterion;

◦ Theory of reproducing kernels: Polynomials, Radial basis or sigmoid function;

◦ Nonlinear kernels can use identical calculations and solution algorithms, and

obtain classifiers that are nonlinear. - Random Field: F={F1,F2,…FM} a family of random

variables on set S in which each Fi takes value fi in a

label set L.

Markov Random Field: F is said to be a MRF on S

w.r.t. a neighborhood N if and only if it satisfies

Markov property.

◦ Generative model for joint probability p(x)

◦ allows no direct probabilistic interpretation

◦ define potential functions Ψ on maximal cliques A

map joint assignment to non-negative real number

requires normalization

An MRF is an undirected graphical model.
- Conditional, not joint, probabilistic sequential models p(y|x)

Allow arbitrary, non-independent features on the observation seq X

Specify the probability of possible label seq given an observation seq

Prob. of a transition between labels depend on past/future observ.

Relax strong independence assumptions, no p(x) required

CRF is MRF plus “external” variables, where “internal” variables Y of MRF

are un-observables and “external” variables X are observables

Linear chain CRF: transition score depends on current observation

◦ Inference by DP like HMM, learning by forward-backward as HMM

Optimization for learning CRF: discriminative model

◦ Conjugate gradient, stochastic gradient,… - Ensemble methods use multiple models to obtain better predictive

performance than could be obtained from any of the constituent models;

Ensembles combine many weak learners to produce a strong learner;

◦ Term ensemble is for methods that generate multiple hypotheses using the

same base learner;

Ensembles can be shown to have more flexibility in the functions they

can represent. This flexibility can, in theory, enable them to over-fit the

training data more than a single model would; but in practice, some

ensemble techniques (especially bagging) tend to reduce problems

related to over-fitting of the training data;

Empirically, ensembles tend to yield better results when there is a

significant diversity among the models;

Popular types:

◦ Bagging, boosting, stacking, stochastic discrimination, random subspace, …

◦ Random forest, derived from the random subspace method, constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes output by the individual trees.
- Bagging (“bootstrap aggregating”): sample the training set to generate random independent bootstrap replicates, construct a classifier on each replicate, and aggregate the classifiers by a majority vote in the final decision rule;
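A minimal sketch of this procedure, assuming one-dimensional inputs and decision stumps as the base classifier (both are illustrative choices; the helper names and the data are hypothetical).

```python
import random

def train_stump(samples):
    """Fit a 1-D threshold classifier (decision stump) on (x, label) pairs."""
    xs = sorted({x for x, _ in samples})
    thresholds = [(a + b) / 2.0 for a, b in zip(xs, xs[1:])] or [xs[0]]
    best = None
    for t in thresholds:
        for left_label in (0, 1):
            errors = sum((left_label if x < t else 1 - left_label) != y
                         for x, y in samples)
            if best is None or errors < best[0]:
                best = (errors, t, left_label)
    _, t, left_label = best
    return lambda x: left_label if x < t else 1 - left_label

def bagging(samples, n_models=25, seed=0):
    """Train one stump per bootstrap replicate; predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap replicate: sample with replacement, same size as the data.
        replicate = [rng.choice(samples) for _ in samples]
        models.append(train_stump(replicate))

    def predict(x):
        votes = sum(model(x) for model in models)
        return 1 if 2 * votes > len(models) else 0

    return predict

# Hypothetical data: negative inputs are class 0, positive inputs are class 1.
samples = [(x, 0) for x in range(-10, 0)] + [(x, 1) for x in range(1, 11)]
predict = bagging(samples)
print(predict(-10), predict(10))
```

Each stump sees a slightly different replicate, so the individual thresholds vary; the vote averages out that variation.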

Bootstrapping is based on random sampling with replacement;

Therefore, a bootstrap replicate (random selection with replacement) of the training set sometimes contains fewer misleading training objects, or none at all;

Consequently, a classifier constructed on such a training set may

have a better performance.
- Boosting: at each step, the training data are re-weighted so that incorrectly classified objects get larger weights in a new, modified training set, which in effect maximizes the margins between objects;

Classifiers are constructed on weighted versions of the training set, which depend on the previous classification results;

Boosting learning originated from the Probably Approximately

Correct (PAC) learning theory;

AdaBoost is the first algorithm that could adapt to the weak

learners;
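The re-weighting loop of discrete AdaBoost can be sketched with decision stumps as weak learners. The helper names and the ten-point data set are assumptions chosen for illustration; no single stump classifies this data correctly.

```python
import math

def best_stump(xs, ys, weights):
    """Weak learner: the threshold/polarity pair with the lowest weighted error."""
    points = sorted(set(xs))
    thresholds = [(a + b) / 2.0 for a, b in zip(points, points[1:])]
    best = None
    for t in thresholds:
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if (polarity if x < t else -polarity) != y)
            if best is None or error < best[0]:
                best = (error, t, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Discrete AdaBoost with decision stumps; labels must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        error, t, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1.0 - error) / error)
        ensemble.append((alpha, t, polarity))
        # Misclassified points get larger weights; then renormalize.
        weights = [w * math.exp(-alpha * y * (polarity if x < t else -polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * (polarity if x < t else -polarity)
                for alpha, t, polarity in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical 1-D data with a +, -, +, - label pattern.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

On this data three rounds are enough for the weighted vote to fit every training point, although each individual stump misclassifies at least three of them.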

Variant of Adaboost (Adaptive boosting):

◦ LogitBoost:

◦ GentleBoost: Update is fm(x) = P(y=1|x) – P(y=0|x) instead of - In SVM, one performs global optimization in order to maximize

the minimal margin, while in Boosting one maximizes the

margin locally for each training object;

SVM uses the L-2 norm for both the hypothesis and weight vectors, while Boosting uses the L-∞ norm for the hypothesis vector and the L-1 norm for the weight vector;

It is shown that if the number of relevant weak hypothesis k is a

small fraction of the total number of weak hypotheses, then the

margin associated with Boosting will be much larger than the

one associated with SVM;

SVM corresponds to quadratic programming while Boosting only

to linear programming;

Through the method of kernels, SVM allows one to perform low-dimensional calculations that are mathematically equivalent to inner products in a high-dimensional ‘virtual’ space; Boosting instead employs greedy search: the re-weighting of the examples

changes the distribution with respect to which the correlation is

measured, thus guiding the weak learner to find different

correlated coordinates. - With discrete stochastic processes, arbitrary numbers of very

weak models are generated and combined to separate the

points in multi-dimensional spaces.

◦ Can be regarded as a method of dimensionality reduction;

◦ “Uniformity”: two points from the same class are equally likely to be captured by a weak model of a given size;

◦ “Enrichment”: weak models do not have the same chance of capturing points from different classes.

Stochastic discrimination (SD) has the property that the very complex and accurate

classifiers produced in this way retain the ability,

characteristic of their weak component pieces, to generalize

to new data;

It is in combining these weak models that the discriminative

power is developed.

SD simply transforms the multi-d feature vectors to points

coming from two uni-variate normal distributions;

These two uni-variate normal distributions are separated further as the number of weak models increases, which intuitively is similar to how people acquire knowledge.
- Classifiers are constructed in random subspaces of the data feature

space, usually combined by simple majority voting in the final

decision rule;

It also relies on a stochastic process that randomly selects a number

of components of the given feature vector in constructing each

classifier;

Geometrically this is equivalent to projecting all the points to the

selected subspace;

The random subspace method effectively takes advantage of high dimensionality.
- • Defined as a set {D, X, Y} such that $Y = DX$
- • Given a D and $y_i$, how to find $x_i$?

• Constraint : xi is sufficiently sparse;

• Finding exact solution is difficult;

• Approximate a solution good

enough? - Greedy methods: projecting the residual on some atom;

◦ Matching pursuit, orthogonal matching pursuit;
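The greedy idea can be sketched with plain matching pursuit. This is a minimal illustration, assuming unit-norm atoms; the dictionary, the signal, and the function name are hypothetical. Orthogonal matching pursuit differs in that it re-fits the coefficients of all selected atoms by least squares after each selection, which this sketch does not do.

```python
import math

def matching_pursuit(dictionary, signal, max_atoms=10, tol=1e-10):
    """Greedy sparse coding: repeatedly project the residual on the
    best-matching atom. Atoms are assumed to have unit L2 norm."""
    residual = list(signal)
    coefficients = [0.0] * len(dictionary)
    for _ in range(max_atoms):
        if math.sqrt(sum(r * r for r in residual)) < tol:
            break
        # Atom with the largest absolute inner product with the residual.
        products = [sum(a * r for a, r in zip(atom, residual))
                    for atom in dictionary]
        k = max(range(len(dictionary)), key=lambda i: abs(products[i]))
        coefficients[k] += products[k]
        residual = [r - products[k] * a for r, a in zip(residual, dictionary[k])]
    return coefficients, residual

# Hypothetical overcomplete dictionary in R^3: three axes plus one diagonal atom.
s = 1.0 / math.sqrt(2.0)
dictionary = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (s, s, 0.0)]
coefficients, residual = matching_pursuit(dictionary, (1.0, 0.0, 3.0))
print(coefficients, residual)
```

Here the signal is an exact combination of two atoms, so the loop stops after two selections with a zero residual.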

L1-norm: Least Absolute Shrinkage and Selection Operator

(LASSO);

◦ The residual is updated iteratively in the direction of the atom;

Gradient-based finding new search directions

◦ Projected Gradient Descent

◦ Coordinate Descent

Homotopy: a set of solutions indexed by a parameter

(regularization)

◦ LARS (Least Angle Regression)

First order/proximal methods: Generalized gradient descent

◦ solving efficiently the proximal operator

◦ soft-thresholding for L1-norm

◦ Accelerated by the Nesterov optimal first-order method

Iterative reweighting schemes

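◦ L2-norm: Chartrand and Yin (2008)

◦ L1-norm: Candès et al. (2008)
- Pursuit flowchart (inputs: D, y; output: x)

◦ Select the atom $d_k$ with the maximum projection on the residue

◦ Solve $x_k = \arg\min_{x_k} \|y - D_k x_k\|$

◦ Update the residue: $r = y - D_k x_k$

◦ Check the terminating condition; repeat from the selection step until it is met
- What D to use?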

A fixed overcomplete set of basis: no adaptivity.

Steerable wavelet;

Bandlet, curvelet, contourlet;

DCT Basis;

Gabor function;

….

Data adaptive dictionary – learn from data;

K-SVD: a generalized K-means clustering process for

Vector Quantization (VQ).

◦ An iterative algorithm to effectively optimize the sparse

approximation of signals in a learned dictionary.

Other methods of dictionary learning:

◦ non-negative matrix decompositions.

◦ sparse PCA (sparse dictionaries).

◦ fused-lasso regularizations (piecewise constant dictionaries)

Extending the models: Sparsity + Self-similarity = Group Sparsity
- Initialize Dictionary

• Select atoms from input;

• Atoms can be image patches;

• Patches are overlapping.

Sparse Coding (OMP)

• Use OMP or any pursuit method;

• Output sparse code for all signals;

• Minimize representation error.

Update Dictionary

• One atom at a time
- Representation learning attempts to automatically learn good

features or representations;

Deep learning algorithms attempt to learn multiple levels of

representation of increasing complexity/abstraction (intermediate

and high level features);

Become effective via unsupervised pre-training + supervised fine

tuning;

◦ Deep networks trained with back propagation (without unsupervised pre-

training) perform worse than shallow networks.

Deal with the curse of dimensionality (smoothing & sparsity) and

over-fitting (unsupervised, regularizer);

Semi-supervised: structure of manifold assumption;

◦ labeled data is scarce and unlabeled data is abundant.
- • Supervised training of deep models (e.g. many-layered nets) is too hard (optimization problem);

Learn prior from unlabeled data;

• Shallow models are not for learning high-level abstractions;

Ensembles or forests do not learn features first;

Graphical models could be deep nets, but mostly are not.

• Unsupervised learning could be “local-learning”;

Resembles boosting, with each layer being like a weak learner.

• Learning is weak in directed graphical models with many hidden variables;

Sparsity and regularizer.

• Traditional unsupervised learning methods do not easily learn multiple levels of representation.

Layer-wise unsupervised learning is the solution.

• Multi-task learning (transfer learning and self-taught learning);

• Other issues: scalability & parallelism with the burden from big data.
- A neural network = running several logistic regressions at the

same time;

◦ Neuron=logistic regression or…

Calculate error derivatives (gradients) to refine: back

propagate the error derivative through model (the chain rule)

◦ Online learning: stochastic/incremental gradient descent;

◦ Batch learning: conjugate gradient descent.
- CNN is a special kind of multi-layer NN applied to 2-d arrays

(usually images), based on spatially localized neural input;

◦ local receptive fields(shifted window), shared weights (weight averaging)

across the hidden units, and often, spatial or temporal sub-sampling;

◦ Related to generative MRF/discriminative CRF:

CNN=Field of Experts MRF=ML inference in CRF;

◦ Generate ‘patterns of patterns’ for pattern recognition.

Each layer combines (merge, smooth) patches from previous layers

◦ Pooling /Sampling (e.g., max or average) filter: compress and smooth the

data.

◦ Convolution filters: (translation invariance) unsupervised;

◦ Local contrast normalization: increase sparsity, improve

optimization/invariance.
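The convolution and pooling operations can be sketched in plain Python as follows. This is only an illustration under simplifying assumptions: a single channel, no padding, stride 1, non-overlapping pooling, and (as in most CNN code) the kernel is applied without flipping. The function names and the toy image are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image`; the same weights are shared by every position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: compresses the map and adds small shift tolerance."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]

# Toy 5x5 image with a vertical step edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
edge_kernel = [[-1, 1]]          # responds to a left-to-right intensity increase
c_layer = conv2d_valid(image, edge_kernel)   # "C layer" output
s_layer = max_pool(c_layer, size=2)          # "S layer" output
```

Here the C layer fires only at the edge column, and the S layer keeps that response at half the resolution.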

C layers: convolutions; S layers: pool/sample - Convolutional Networks are trainable multistage architectures composed of multiple

stages;

Input and output of each stage are sets of arrays called feature maps;

At output, each feature map represents a particular feature extracted at all locations on

input;

Each stage is composed of: a filter bank layer, a non-linearity layer, and a feature pooling

layer;

A ConvNet is composed of 1, 2 or 3 such 3-layer stages, followed by a classification

module;

◦ A fully connected layer: softmax transfer function for posterior distribution.

Filter: A trainable filter (kernel) in filter bank connects input feature map to output

feature map;

* is discrete convolution operator

Nonlinearity: a pointwise sigmoid tanh() or a rectified sigmoid abs(gi•tanh()) function;

◦ In the rectified function, gi is a trainable gain parameter; it might be followed by a contrast normalization N;

Feature pooling: treats each feature map separately -> a reduced-resolution output

feature map;

Supervised training is performed using a form of SGD to minimize the prediction error;

◦ Gradients are computed with the back-propagation method.

Unsupervised pre-training: predictive sparse decomposition (PSD), then supervised fine-

tuning. - A layered model composed of convolution and subsampling

operations followed by a holistic representation and

ultimately a classifier for handwritten digits;

Local receptive fields (5x5) with local connections;

Output via a RBF function, one for each class, with 84 inputs

each;

Learning by Graph Transformer Networks (GTN); - A layered model composed of convolution and subsampling operations,

followed by a holistic representation and all-in-all a

landmark classifier;

Consists of 5 convolutional layers, some of which are

followed by max-pooling layers, and 3 fully-connected

layers with a final 1000-way softmax;

Fully-connected “FULL” layers: linear

classifiers/matrix multiplications;

ReLU are rectified-linear nonlinearities on layer

output, can be trained several times faster;

Local normalization scheme aids generalization;

Overlapping pooling slightly less prone to overfitting;

Data augmentation: artificially enlarge the dataset

using label-preserving transformations;

Dropout: setting to zero output of each hidden

neuron with prob. 0.5;

Trained by SGD with batch size 128, momentum 0.9,

weight decay 0.0005. - The network’s input is 150,528-dimensional, and the number of neurons in the network’s

remaining layers is given by 253,440–186,624–64,896–64,896–43,264-4096–4096–1000. - Matthew Zeiler from the startup company “Clarifai”, winner of ImageNet

Classification in 2013;

Preprocessing: subtracting a per-pixel mean;

Data augmentation: downsampled to 256 pixels and a random 224 pixel crop

is taken out of the image and randomly flipped horizontally to provide more

views of each example;

SGD with mini-batch size 128, learning rate annealing, momentum 0.9 and

dropout to prevent overfitting;

65M parameters trained for 12 days on a single Nvidia GPU;

Visualization by layered DeconvNets: project the feature activations back to

the input pixel space;

◦ Reveal input stimuli exciting individual feature maps at any layer;

◦ Observe evolution of features during training;

◦ Sensitivity analysis of the classifier output by occluding portions to reveal which parts

of scenes are important;

A DeconvNet is attached to each ConvNet layer; unpooling uses the locations of

maxima to preserve structure;
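The unpooling idea can be sketched as below: pooling records a "switch" (the position of each local maximum), and unpooling puts each pooled value back at that position, leaving zeros elsewhere. This is a simplified illustration assuming non-overlapping 2x2 regions; the function names and the toy feature map are hypothetical.

```python
def max_pool_with_switches(fmap, size=2):
    """Non-overlapping max pooling that also records where each maximum was found."""
    pooled, switches = [], []
    for r in range(0, len(fmap) - size + 1, size):
        pooled_row, switch_row = [], []
        for c in range(0, len(fmap[0]) - size + 1, size):
            value, position = max(
                ((fmap[r + i][c + j], (r + i, c + j))
                 for i in range(size) for j in range(size)),
                key=lambda pair: pair[0])
            pooled_row.append(value)
            switch_row.append(position)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches

def unpool(pooled, switches, height, width):
    """Place each pooled value back at its recorded location; all other entries are 0."""
    out = [[0] * width for _ in range(height)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (r, c) in zip(pooled_row, switch_row):
            out[r][c] = value
    return out

fmap = [[1, 3, 0, 0],
        [2, 0, 0, 5],
        [0, 0, 7, 0],
        [4, 0, 0, 0]]
pooled, switches = max_pool_with_switches(fmap)
restored = unpool(pooled, switches, 4, 4)
```

The restored map is only approximate: the non-maximal values are lost, but the spatial structure of the strongest responses is preserved.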

Multiple such models were averaged together to further boost performance;

Supervised pre-training with AlexNet, then modify it to get better

performance (error rate 14.8%). - Architecture of an eight layer ConvNet model. Input: 224 by 224 crop of an image (with 3

color planes). Layer 1: convolution with 96 filters of size 7x7, stride of 2 in both x and y. The resulting feature

maps are (i) passed through a rectified linear function, (ii) 3x3 max pooled (stride 2), (iii) contrast normalized, giving

55x55 feature maps. Layers 2-5 repeat similar operations. Layers 6-7: fully connected, input in vector form (6x6x256 = 9216

dimensions). The final layer: a C-way softmax function, where C is the number of classes. - Top: A deconvnet layer (left)

attached to a convnet layer

(right). The deconvnet will

reconstruct approximate version

of convnet features from the

layer beneath.

Bottom: Unpooling operation in

the deconvnet, using switches

which record the location of the

local max in each pooling region

(colored zones) during pooling in

the convnet. - •

A hybrid model: can be trained as

generative or discriminative model;

•

Deep architecture: multiple layers (learn

features layer by layer);

• Multi layer learning is difficult in sigmoid

belief networks.

• Top two layers are undirected

connections, Restricted Boltzmann

Machine (RBM);

• Lower layers get top down directed

connections from layers above;

•

Unsupervised or self-taught pre-learning

provides a good initialization;

• Greedy layer-wise unsupervised training

for RBM;

Belief net is a directed acyclic graph composed of stochastic variables.

•

Supervised fine-tuning

• Generative: wake-sleep algorithm (Up-

down);

• Discriminative: back propagation

(bottom-up); - •

Boltzmann machine is a stochastic recurrent model, and RBM is its special case

(one hidden layer);

•

Learning internal representations that become increasingly complex;

•

High-level representations built from a large supply of unlabeled inputs;

•

Pre-training: learning a stack of modified RBMs, which are composed to create a

deep Boltzmann machine (undirected graph);

•

Generative fine-tuning: different from DBN

• Positive and negative phase

•

Discriminative fine-tuning: the same as DBN

• Back propagation. - •

Denoising Auto-Encoder: Multilayer NNs with target output=input;

•

Auto-encoder learns the salient variation like a nonlinear PCA;

•

Stack many (may be sparse) auto-encoders in succession and train

them using greedy layer-wise unsupervised learning

• Drop the decode layer each time

• Performs better than stacking RBMs;

•

Supervised training on the last layer using final features;

•

(option) Supervised training on the entire network to fine- tune all

weights of the neural net;

•

Empirically not quite as accurate as DBNs. - Stochastic Gradient Descent (SGD)

• The general class of estimators that arise as minimizers of

sums are called M-estimators;

• Where are stationary points of the likelihood function (or zeroes of its

derivative, the score function)?

• Online gradient descent samples a subset of summand

functions at every step;

• The true gradient of the objective is approximated by the gradient at a single example;

• Shuffling of training set at each pass.

• There is a compromise between two forms, often called

"mini-batches", where the true gradient is approximated by a

sum over a small number of training examples.
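A minimal sketch of mini-batch SGD with per-pass shuffling, assuming a one-parameter least-squares problem chosen only for illustration. The function name `minibatch_sgd`, the data and all hyperparameter values are assumptions, not part of the slides.

```python
import random

def minibatch_sgd(data, lr=0.1, batch_size=4, epochs=200, seed=0):
    """Fit y ~ w * x by mini-batch SGD on the squared error."""
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)                      # reshuffle the training set at each pass
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean of 0.5 * (w*x - y)^2 over the mini-batch only.
            grad = sum((w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

# Noiseless toy data generated from y = 2x.
points = [(k / 10.0, 2.0 * k / 10.0) for k in range(1, 21)]
w_hat = minibatch_sgd(points)
```

Setting `batch_size=1` gives the single-example (online) variant, and `batch_size=len(data)` gives ordinary batch gradient descent.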

• SGD converges almost surely to a global minimum when the

objective function is convex or pseudo-convex (given suitably decreasing learning rates), and

otherwise converges almost surely to a local minimum. - Back Propagation

• Back propagation is a multi-layer network training

method

• We find parameters W, to minimize an error E (f(x0,w),y0) = -log (f(x0,w)- y0).

• For this we will do iterative gradient descent:

$w(t) = w(t-1) - \lambda \, \frac{\partial E(t)}{\partial w}$

• Error propagation

• Forward propagation of a training pattern's input through the

multilayer network to generate the output activations;

• Backward propagation of the output activations (logistic or soft-max)

through the multilayer network using the pattern target to generate

deltas of all output and hidden units (the chain rule);

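$\frac{\partial E}{\partial y_{l-1}} = \frac{\partial E}{\partial y_l} \times \frac{\partial y_l(w, y_{l-1})}{\partial y_{l-1}}$

$\frac{\partial E}{\partial w_l} = \frac{\partial E}{\partial y_l} \times \frac{\partial y_l(w, y_{l-1})}{\partial w_l}$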
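The chain-rule recursion for error propagation can be checked on a tiny two-layer scalar network. This is a sketch under stated assumptions: a tanh hidden unit, a linear output and a squared-error loss are chosen for simplicity (the slide's own error function is not used here), and all names are hypothetical.

```python
import math

def forward(x, w1, w2):
    y1 = math.tanh(w1 * x)   # layer 1 output
    y2 = w2 * y1             # layer 2 (output layer)
    return y1, y2

def loss(x, t, w1, w2):
    _, y2 = forward(x, w1, w2)
    return 0.5 * (y2 - t) ** 2

def backward(x, t, w1, w2):
    """Gradients of the loss with respect to w1 and w2, obtained by the chain rule."""
    y1, y2 = forward(x, w1, w2)
    dE_dy2 = y2 - t                          # delta at the output
    dE_dw2 = dE_dy2 * y1                     # dE/dw_l = dE/dy_l * dy_l/dw_l
    dE_dy1 = dE_dy2 * w2                     # dE/dy_{l-1} = dE/dy_l * dy_l/dy_{l-1}
    dE_dw1 = dE_dy1 * (1.0 - y1 ** 2) * x    # tanh'(a) = 1 - tanh(a)^2
    return dE_dw1, dE_dw2

def numeric_grad(x, t, w1, w2, eps=1e-6):
    """Central finite differences, used only to sanity-check `backward`."""
    g1 = (loss(x, t, w1 + eps, w2) - loss(x, t, w1 - eps, w2)) / (2 * eps)
    g2 = (loss(x, t, w1, w2 + eps) - loss(x, t, w1, w2 - eps)) / (2 * eps)
    return g1, g2

g_analytic = backward(0.7, 0.3, 0.5, -1.2)
g_numeric = numeric_grad(0.7, 0.3, 0.5, -1.2)
```

Comparing the analytic gradient with finite differences in this way is a common check when implementing back propagation by hand.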

• Weight update

• Multiply its output delta and input activation to get the weight

gradient;

• Subtract a ratio (i.e. the learning rate) of the gradient from the weight. - Euclidean loss is used for regressing to real-valued labels in

(-inf, inf);

Sigmoid cross-entropy loss is used for predicting K independent

probability values in [0,1];

Softmax (normalized exponential) loss is predicting a single class

of K mutually exclusive classes;

◦ Generalization of the logistic function that "squashes" a K-dimensional vector

of arbitrary real values z to a K-dimensional vector of real values σ(z) in the

range (0, 1) that sum to 1.

◦ The predicted probability for the j'th class given a sample vector x is $\sigma(z)_j = e^{z_j} / \sum_{k=1}^{K} e^{z_k}$, where z is the vector of class scores computed from x.
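A small sketch of the softmax function and the corresponding loss for a single sample. Subtracting the maximum score before exponentiating is a standard numerical-stability trick and an addition here, not something stated on the slide; the function names are hypothetical.

```python
import math

def softmax(z):
    """Map K real-valued scores to K probabilities in (0, 1) that sum to 1."""
    m = max(z)                               # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_loss(z, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(z)[target_index])

probs = softmax([2.0, 1.0, 0.1])
stable = softmax([1000.0, 1000.0])           # would overflow without the shift
loss_uniform = softmax_loss([0.0, 0.0], 0)   # equals log(2) for two equal scores
```

For K = 2 the softmax reduces to the logistic function applied to the difference of the two scores.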

Sigmoidal or Softmax normalization is a way of reducing the

influence of extreme values or outliers in the data without

removing them from the dataset. - Too large learning rate

◦ cause oscillation in searching for the minimal point

Too small learning rate

◦ too slow convergence to the minimal point

Adaptive learning rate

◦ At the beginning, the learning rate can be large when the

current point is far from the optimal point;

◦ Gradually, the learning rate will decay as time goes by.
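One decay schedule of this kind is the annealing rate α(t) = α(0)/(1 + t/T) given in this slide. The sketch below evaluates it; the function name and the values of `alpha0` and `T` are arbitrary illustrations.

```python
def annealed_rate(alpha0, t, T):
    """Annealing schedule alpha(t) = alpha(0) / (1 + t / T)."""
    return alpha0 / (1.0 + t / T)

# Nearly constant while t << T, halved at t = T, and decaying toward zero afterwards.
rates = [annealed_rate(0.1, t, T=100) for t in (0, 10, 100, 10000)]
```

The constant T controls how long the rate stays close to its initial value before the 1/t decay takes over.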

Should not be too large or too small:

◦ annealing rate 𝛼(𝑡)=𝛼(0)/(1+𝑡/𝑇)

◦ 𝛼(𝑡) will eventually go to zero, but at the beginning it is

almost a constant. - Classical Momentum (CM) is a technique for accelerating gradient descent

that accumulates a velocity vector in directions of persistent reduction in the

objective across iterations: given the objective function f(θ),

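$V_{t+1} = \mu V_t - \varepsilon \nabla f(\theta_t), \quad \theta_{t+1} = \theta_t + V_{t+1},$

with $\varepsilon > 0$ as learning rate, $\mu \in [0,1]$ as momentum coefficient and $\nabla f(\theta_t)$ as

gradient at $\theta_t$;

Nesterov’s Accelerated Gradient (NAG) is also a 1st order optimization

method with better convergence rate guarantee than gradient descent;

$V_{t+1} = \mu V_t - \varepsilon \nabla f(\theta_t + \mu V_t), \quad \theta_{t+1} = \theta_t + V_{t+1},$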
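The two update rules differ only in where the gradient is evaluated, which the following sketch makes explicit. The toy objective f(θ) = 0.5·θ² and the values of ε and µ are assumptions made for illustration.

```python
def grad(theta):
    """Gradient of the toy objective f(theta) = 0.5 * theta**2."""
    return theta

def classical_momentum(theta, steps, eps=0.1, mu=0.9):
    v = 0.0
    for _ in range(steps):
        v = mu * v - eps * grad(theta)             # gradient at the current point
        theta = theta + v
    return theta

def nesterov(theta, steps, eps=0.1, mu=0.9):
    v = 0.0
    for _ in range(steps):
        v = mu * v - eps * grad(theta + mu * v)    # gradient at the look-ahead point
        theta = theta + v
    return theta

theta_cm = classical_momentum(5.0, steps=500)
theta_nag = nesterov(5.0, steps=500)
```

Both runs approach the minimizer θ = 0 on this toy problem; the look-ahead gradient lets NAG correct the velocity before an overshoot rather than after it.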

For convex objectives, momentum-based methods outperform SGD in the

early or transient stages of optimization, but are only equally effective in the

final stage;

Hessian-free (HF) methods and truncated Newton methods work by

optimizing a local quadratic model of the objective via the linear conjugate

gradient (CG) algorithms;

◦ If CG terminated after just one step, HF becomes equivalent to NAG; - AdaGrad: asymptotically sublinear regret, adapt learning rate for each weight

based on historical info.:

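$\Delta W_{ij}(t+1) = -\frac{\gamma}{\sqrt{\sum_{\tau=1}^{t+1} \left(\frac{\partial E}{\partial w_{ij}}(\tau)\right)^2}} \cdot \frac{\partial E}{\partial w_{ij}}(t+1)$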

◦ Normalizes each coordinate of gradient by the historical (previous iterations) magnitude

of that coordinate;

◦ Frequently occurring features in the gradients get small learning rates and infrequent

features get higher ones;

◦ Sensitive to initial conditions, continual decay of learning rate.
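A minimal sketch of the AdaGrad update, assuming a toy quadratic whose two coordinates have very different gradient scales. The small constant `eps` in the denominator is an added numerical safeguard, and the function names and values are illustrative assumptions.

```python
import math

def adagrad(grad_fn, w, steps, gamma=0.5, eps=1e-8):
    """Per-coordinate step: gamma / sqrt(sum of past squared gradients) times the gradient."""
    w = list(w)
    hist = [0.0] * len(w)                    # running sum of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        for i in range(len(w)):
            hist[i] += g[i] ** 2
            w[i] -= gamma / (math.sqrt(hist[i]) + eps) * g[i]
    return w

def toy_grad(w):
    """Gradient of 0.5 * (w0**2 + 100 * w1**2); coordinate 1 is 100x steeper."""
    return [w[0], 100.0 * w[1]]

after_one_step = adagrad(toy_grad, [1.0, 1.0], steps=1)
after_many_steps = adagrad(toy_grad, [1.0, 1.0], steps=100)
```

After the first step both coordinates have moved by the same amount even though their gradients differ by a factor of 100, which is the per-coordinate normalization at work.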

AdaDelta: accumulate the denominator over last k gradients (a sliding window):

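$\alpha(t+1) = \sum_{\tau=t-k+1}^{t+1} \left(\frac{\partial E}{\partial w}(\tau)\right)^2$

$\Delta W(t+1) = -\frac{\gamma}{\sqrt{\alpha(t+1)}} \cdot \frac{\partial E}{\partial w}(t+1).$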

◦ This requires keeping the last k gradients; instead, it uses a simpler formula:

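$\beta(t+1) = \rho \, \beta(t) + (1-\rho) \left(\frac{\partial E}{\partial w}(t+1)\right)^2$

$\Delta W(t+1) = -\frac{\gamma}{\sqrt{\beta(t+1) + \epsilon}} \cdot \frac{\partial E}{\partial w}(t+1).$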

◦ Avoid AdaGrad’s weakness. - The easiest and most common method to

reduce overfitting on image data is to artificially

enlarge the dataset using label-preserving

transformations;

Perturbing an image I by transformations that

leave the underlying class unchanged (e.g.

cropping and flipping) in order to generate

additional examples of the class;
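Cropping and flipping can be sketched on an image stored as a nested list. This is an illustration only: the 6x6 "image", the 4x4 crop size and the function names are assumptions, and real pipelines operate on multi-channel arrays.

```python
import random

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position (a random translation)."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def augment(image, crop_h, crop_w, rng):
    """Random crop followed by a horizontal flip with probability 0.5."""
    patch = random_crop(image, crop_h, crop_w, rng)
    if rng.random() < 0.5:
        patch = horizontal_flip(patch)
    return patch

rng = random.Random(0)
image = [[10 * r + c for c in range(6)] for r in range(6)]
views = [augment(image, 4, 4, rng) for _ in range(5)]
```

Each call produces a different view of the same example while its label stays unchanged.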

Two distinct forms of data augmentation:

◦ image translations and horizontal reflections

◦ changing RGB intensities - Weight decay or L2 regularization adds a penalty term to the error function, a

term called the regularization term: the negative log prior in Bayesian

justification,

◦ Weight decay works by rescaling the weights in the learning rule, while bias learning stays the

same;

◦ Prefers small weights; large weights are allowed only if they improve the original cost

function;

◦ A way of compromising between finding small weights and minimizing the original cost

function;

In a linear model, weight decay is equivalent to ridge (Tikhonov) regression;

L1 regularization: the weights not really useful shrink by a constant amount

toward zero;

◦ Act like a form of feature selection;

◦ Make the input filters cleaner and easier to interpret;
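The difference between the two penalties shows up directly in the single-weight update rule. The sketch below assumes penalties of the form (λ/2)·w² for L2 and λ·|w| for L1, with a plain gradient step; the function names and numeric values are illustrative.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def l2_step(w, grad, lr, lam):
    """Step on E + (lam/2) * w**2: the weight is rescaled by (1 - lr*lam)."""
    return (1.0 - lr * lam) * w - lr * grad

def l1_step(w, grad, lr, lam):
    """(Sub)gradient step on E + lam * |w|: the weight shrinks by the constant lr*lam."""
    return w - lr * lam * sign(w) - lr * grad

# With a zero data gradient, only the regularizer acts on the weight.
l2_large, l2_small = l2_step(2.0, 0.0, 0.1, 0.5), l2_step(0.2, 0.0, 0.1, 0.5)
l1_large, l1_small = l1_step(2.0, 0.0, 0.1, 0.5), l1_step(0.2, 0.0, 0.1, 0.5)
```

L2 removes a fixed fraction of each weight (so large weights shrink the most), while L1 removes the same absolute amount from every nonzero weight, which is what pushes unhelpful weights toward zero.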

L2 regularization penalizes large values strongly, while L1 regularization penalizes weights in proportion to their absolute value;

Markov Chain Monte Carlo (MCMC): simulating a Markov chain whose

equilibrium distr. is the posterior distribution for weights & hyper-parameters;

Hybrid Monte Carlo: gradient and sampling. - Steps in early stopping:

◦ Divide the available data into training and validation sets.

◦ Use a large number of hidden units.

◦ Use very small random initial values.

◦ Use a slow learning rate.

◦ Compute the validation error rate periodically during training.

◦ Stop training when the validation error rate "starts to go up".
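One possible way to make "starts to go up" concrete is a patience rule: stop once the validation error has failed to improve for a fixed number of consecutive checks, and keep the weights from the best check. The rule, the function name and the error values below are assumptions for illustration, not a method prescribed by the slides.

```python
def early_stopping_index(val_errors, patience=3):
    """Index of the best validation error found before `patience`
    consecutive checks fail to improve on it."""
    best_err = float("inf")
    best_idx = -1
    bad_checks = 0
    for i, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_idx, bad_checks = err, i, 0
        else:
            bad_checks += 1
            if bad_checks >= patience:
                break                        # stop training here
    return best_idx

# Hypothetical validation error rates recorded periodically during training.
errors = [0.9, 0.7, 0.6, 0.62, 0.61, 0.65, 0.5]
stop_short = early_stopping_index(errors, patience=3)
stop_long = early_stopping_index(errors, patience=5)
```

The example shows why the choice matters: a short patience stops at the first plateau, while a longer one finds a later and lower minimum.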

Early stopping has several advantages:

◦ It is fast.

◦ It can be applied successfully to networks in which the number of

weights far exceeds the sample size.

◦ It requires only one major decision by the user: what proportion of

validation cases to use.

Practical issues in early stopping:

◦ How many cases do you assign to the training and validation sets?

◦ Do you split the data into training and validation sets randomly or by

some systematic algorithm?

◦ How do you tell when the validation error rate "starts to go up"? - Dropout: set the output of each hidden neuron to zero with probability

0.5.

◦ Motivation: Combining many different models that share parameters

succeeds in reducing test errors by approximately averaging together the

predictions, which resembles bagging.

◦ The units which are “dropped out” in this way do not contribute to the

forward pass and do not participate in back propagation.

◦ So every time an input is presented, the NN samples a different

architecture, but all these architectures share weights.

◦ This technique reduces complex co-adaptations of units, since a neuron

cannot rely on the presence of particular other units.

◦ It is, therefore, forced to learn more robust features that are useful in

conjunction with many different random subsets of the other units.

◦ Without dropout, the network exhibits substantial overfitting.

◦ Dropout roughly doubles the number of iterations required to converge.
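A minimal sketch of dropout on a list of hidden activations. At test time all units are kept and their outputs are scaled by the keep probability, which is one common way to approximate averaging over the sampled architectures; the function names and values are assumptions.

```python
import random

def dropout_train(activations, rng, p_drop=0.5):
    """Zero each activation independently with probability p_drop."""
    mask = [0.0 if rng.random() < p_drop else 1.0 for _ in activations]
    return [a * m for a, m in zip(activations, mask)], mask

def dropout_test(activations, p_drop=0.5):
    """Keep every unit but scale by the keep probability (1 - p_drop)."""
    return [a * (1.0 - p_drop) for a in activations]

rng = random.Random(0)
train_out, mask = dropout_train([1.0] * 1000, rng)
test_out = dropout_test([2.0, 4.0])
```

In the backward pass the same mask would be applied to the gradients, so dropped units receive no update for that input.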

Maxout takes the maximum across multiple feature maps; - Markov Chain: a stochastic process in which future states are

independent of past states given the present state.

◦ Markov chain will typically converge to a stable distribution.

Markov Chain Monte Carlo (MCMC): sampling using ‘local’ information

◦ Devise a Markov chain whose stationary distribution is the target.

Ergodic MC must be aperiodic, irreducible, and positive recurrent.

◦ Monte Carlo Integration to get quantities of interest.

Metropolis-Hastings method: sampling from a target distribution

◦ Create a Markov chain whose transition matrix does not depend on the

normalization term.

◦ Make sure the chain has a stationary distribution and it is equal to the

target distribution (acceptance ratio).

◦ After a sufficient number of iterations, the chain will converge to the stationary

distribution.
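A sketch of the Metropolis special case (symmetric Gaussian proposal), where the acceptance ratio needs only the unnormalized target density. The target (a standard normal known up to its constant), the step size and the sample count are illustrative assumptions.

```python
import math
import random

def metropolis(log_unnorm, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler; `log_unnorm` is the log of the unnormalized target."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)          # symmetric proposal
        # The normalization term cancels in the difference of log densities.
        log_ratio = log_unnorm(proposal) - log_unnorm(x)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal                             # accept; otherwise keep the current state
        samples.append(x)
    return samples

def log_target(x):
    """Log of exp(-x**2 / 2), a standard normal without its normalizing constant."""
    return -0.5 * x * x

samples = metropolis(log_target, 0.0, 20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The sample mean and variance are Monte Carlo estimates of the target's moments; in practice an initial burn-in portion of the chain is often discarded.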

Gibbs sampling is a special case of M-H Sampling.

◦ The Hammersley-Clifford Theorem: get the joint distribution from the

complete conditional distribution.

Hybrid Monte Carlo: gradient sub step for each Markov chain. - Variational approximation modifies the optimization problem

to be tractable, at the price of approximate solution;

Mean Field replaces M with a (simple) subset M(F), on which A*

(μ) is a closed form (Note: F is disconnected graph);

◦ Density becomes factorized product distribution in this sub-family.

◦ Objective: K-L divergence.

Mean field is a structured variational approximation approach:

◦ Coordinate ascent (deterministic);

Compared with stochastic approximation (sampling):

◦ Faster, but maybe not exact. - Contrastive divergence (CD) is proposed for training PoE first, also

being a quicker way to learn RBMs;

◦ Contrastive divergence as the new objective;

◦ Taking gradients and ignoring a term which is usually very small.

Steps:

◦ Start with a training vector on the visible units.

◦ Then alternate between updating all the hidden units in parallel and

updating all the visible units in parallel.
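The steps above can be sketched as a single CD-1 update for a small binary RBM. This is a simplified illustration: using hidden probabilities rather than samples in the statistics, the learning rate, the layer sizes and the toy data are assumptions, and the helper names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample_bernoulli(probs, rng):
    return [1.0 if rng.random() < p else 0.0 for p in probs]

def hidden_probs(v, W, b_h):
    """P(h_j = 1 | v) for all hidden units in parallel."""
    return [sigmoid(b_h[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b_h))]

def visible_probs(h, W, b_v):
    """P(v_i = 1 | h) for all visible units in parallel."""
    return [sigmoid(b_v[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b_v))]

def cd1_update(v0, W, b_v, b_h, rng, lr=0.1):
    """One CD-1 step; W, b_v and b_h are updated in place."""
    ph0 = hidden_probs(v0, W, b_h)          # start from a training vector
    h0 = sample_bernoulli(ph0, rng)
    v1 = sample_bernoulli(visible_probs(h0, W, b_v), rng)   # one reconstruction
    ph1 = hidden_probs(v1, W, b_h)
    for i in range(len(v0)):
        for j in range(len(b_h)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        b_v[i] += lr * (v0[i] - v1[i])
    for j in range(len(b_h)):
        b_h[j] += lr * (ph0[j] - ph1[j])
    return v1

rng = random.Random(0)
W = [[0.0, 0.0] for _ in range(4)]          # 4 visible units, 2 hidden units
b_v, b_h = [0.0] * 4, [0.0] * 2
patterns = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
for step in range(100):
    reconstruction = cd1_update(patterns[step % 2], W, b_v, b_h, rng)
```

Running more alternating updates before collecting the negative statistics gives CD-k, at a proportionally higher cost per update.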

Can be applied using any MCMC algorithm to simulate the model

(not limited to just Gibbs sampling);

CD learning is biased: it does not work as exact gradient descent

Improved: Persistent CD explores more modes in the distribution

◦ Rather than from data samples, begin sampling from the model samples

obtained from the last gradient update.

◦ Still suffers from divergence of likelihood due to missing the modes.

Score matching: the score function does not depend on the normalization

factor, so match it between the model and the empirical density. - Pre-trained DBN is a generative model;

Do a stochastic bottom-up pass (wake phase)

◦ Get samples from factorial distribution (visible first, then generate hidden);

◦ Adjust the top-down weights to be good at reconstructing the feature

activities in the layer below.

Do a few iterations of sampling in the top level RBM

◦ Adjust the weights in the top-level RBM.

Do a stochastic top-down pass (sleep phase)

◦ Get visible and hidden samples generated by generative model using data

coming from nowhere!

◦ Adjust the bottom-up weights to be good at reconstructing the feature

activities in the layer above.

◦ Any guarantee for improvement? No!

The “Wake-Sleep” algorithm tries to make

the representation economical to describe (Shannon’s coding

theory). - Deep networks tend to have more local minima problems

than shallow networks during supervised training

Train first layer using unlabeled data

◦ Supervised or semi-supervised: use more unlabeled data.

Freeze the first layer parameters and train the second layer

Repeat this for as many layers as desired

◦ Build more robust features

Use the outputs of the final layer to train the last supervised

layer (leave early weights frozen)

Fine tune the full network with a supervised approach;

Avoids the problems of training a deep net in a purely supervised fashion.

◦ Each layer gets full learning

◦ Help with ineffective early layer learning

◦ Help with deep network local minima - Take advantage of the unlabeled data;

Regularization Hypothesis

◦ Pre-training is “constraining” parameters in a region

relevant to unsupervised dataset;

◦ Better generalization (representations that better

describe unlabeled data are more discriminative for

labeled data) ;

Optimization Hypothesis

◦ Unsupervised training initializes lower level parameters

near localities of better minima than random initialization

can.

Only need fine tuning in the supervised learning stage. - Pre-training in one stage

◦ Positive phase: clamp observed, sample hidden, using

variational approximation (mean-field)

◦ Negative phase: sample both observed and hidden, using

persistent sampling (stochastic approximation: MCMC)

Pre-training in two stages

◦ Approximating a posterior distribution over the states of hidden

units (a simpler directed deep model as DBNs or stacked DAE);

◦ Train an RBM by updating parameters to maximize the lower

bound of the log-likelihood and the corresponding posterior of hidden

units.

Options (CAST, contrastive divergence, stochastic approximation…).