This page reproduces the content of https://speakerdeck.com/ogrisel/trends-in-machine-learning-2.


Uploaded over 3 years ago (2013/06/27), in Technology

SciPy 2013, Austin, TX

Video of the presentation here: http://www.youtube.com/watch?v=S6IbD86Dbvc

- Trends in Machine Learning and the SciPy Community

SciPy - Austin, TX - June 2013

Tuesday, June 25, 13

- Outline

• Black Box Models with scikit-learn

• Probabilistic Programming with PyMC

• Deep Learning with PyLearn2 & Theano

- Machine Learning == Executable Data Summarization

- Blackbox Machine Learning with scikit-learn

Data → Predictions

- Supervised Machine Learning
- Supervised ML with sklearn
- Spam Classification

          word 1  word 2  word 3  word 4  word 5  word 6   Spam?
email 1     0       2       1       0       0       1        0
email 2     0       0       0       1       1       0        1
email 3     1       1       0       0       0       1        1
email 4     1       0       1       1       2       3        0
email 5     0       0       1       1       0       0        0

(the word-count columns form X; the Spam? column is y)

- Topic Classification

          word 1  word 2  word 3  word 4  word 5  word 6   Sports?  Business?  Tech.?
news 1      0       2       1       0       0       1         0        0         1
news 2      0       0       0       1       1       0         1        1         0
news 3      1       1       0       0       0       1         1        0         0
news 4      1       0       1       1       2       3         0        1         1
news 5      0       0       1       1       0       0         0        0         0

(the word-count columns form X; the three topic columns form y)

- Sentiment Analysis

          word 1  word 2  word 3  word 4  word 5  word 6   Positive?
review 1    0       2       1       0       0       1         0
review 2    0       0       0       1       1       0         1
review 3    1       1       0       0       0       1         1
review 4    1       0       1       1       2       3         0
review 5    0       0       1       1       0       0         0

(the word-count columns form X; the Positive? column is y)

- Vegetation Cover Type

[table, partly unrecoverable: X = one row per location with numeric features (latitude, altitude, rain, slope, slope orientation, distance to / altitude of the closest river); y = one-hot vegetation cover type (grassland, forest, arid / ice)]

- Object Classification in Images

          SIFT    SIFT    SIFT    SIFT    SIFT    SIFT
          word 1  word 2  word 3  word 4  word 5  word 6   Car?  Cat?  Pedestrian?
image 1     0       2       1       0       0       1       0     0        1
image 2     0       0       0       1       1       0       1     1        0
image 3     1       1       0       0       0       1       1     0        0
image 4     1       0       1       1       2       3       0     1        1
image 5     0       0       1       1       0       0       0     0        0

(the SIFT word-count columns form X; the object columns form y)

- Many more applications...

• Product Recommendations

Given past purchase history of all users

• Ad-Placement / bidding in Web Pages

Given user browsing history / keywords

• Fraud detection

Given features derived from behavior - Unsupervised ML
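Looking back at the X / y tables above, the supervised workflow in scikit-learn is just fit and predict. A minimal sketch with toy word counts in the spirit of the spam table (the numbers and the choice of MultinomialNB are mine, not from the slides):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# X: one row per email, one column per word (toy counts); y: 1 = spam
X = np.array([[0, 2, 1, 0, 0, 1],
              [0, 0, 0, 1, 1, 0],
              [1, 1, 0, 0, 0, 1],
              [1, 0, 1, 1, 2, 3],
              [0, 0, 1, 1, 0, 0]])
y = np.array([0, 1, 1, 0, 0])

clf = MultinomialNB().fit(X, y)   # learn per-class word frequencies
print(clf.predict(X))             # one predicted label per row
```

The same two calls work unchanged for the topic, sentiment, cover-type, and image tables: only X and y change.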
- Limitations of Blackbox Machine Learning
- Problem #1: Not So Blackbox

• Feature Extraction: highly domain specific

• + Feature Normalization / Transformation

• Unmet Statistical Assumptions

• Linear Separability of the target classes

• Correlations between features

• A natural metric for the features

- scikit-learn in practice, by Andreas Mueller
- Problem #2: Lack of Explainability

Blackbox models can rarely explain what they learned.

Expert knowledge required to understand the model behavior and gain deeper insight on the data: this is model specific.

- Possible Solutions

• Problem #1: Costly Feature Engineering

• Unsupervised feature extraction with Deep Learning

• Problem #2: Lack of Explainability

• Probabilistic Programming with generic inference engines

- Probabilistic Programming

Openbox Models + Blackbox Inference Engine

Data → Predictions

- What is Prob. Programming?

• Model unknown causes of a phenomenon with random variables

• Write a programmatic story to derive observables from unknown variables

• Plug data into observed variables

• Use the engine to invert the story and assign prob. distributions to unknown params.

- Inverting the Story w/ Bayesian Inference

p(H|D) = p(D|H) · p(H) / p(D)

D: data, H: hypothesis (e.g. parameters)

p(D|H): likelihood

p(H): prior

p(H|D): posterior

p(D): evidence

- Generic Inference with MCMC

• Markov chain Monte Carlo (MCMC)

• Start from a Random Point

• Move variable values randomly

• Reject the new sample randomly depending on a likelihood test

• Accumulate the non-rejected samples and call it the trace

- Alternatives to MCMC
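The recipe above (random start, random moves, a likelihood-based accept/reject, accumulate the trace) fits in a dozen lines of NumPy. A toy Metropolis sketch estimating a coin's bias from 10 flips; this is illustrative, not PyMC's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([1] * 7 + [0] * 3)       # toy observations: 7 heads, 3 tails

def log_post(p):
    # log-likelihood plus a flat prior on (0, 1); -inf outside the support
    if not 0.0 < p < 1.0:
        return -np.inf
    return np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

p, trace = 0.5, []                        # start from an arbitrary point
for _ in range(5000):
    prop = p + rng.normal(0, 0.1)         # move the value randomly
    # accept or reject the move depending on a likelihood test
    if np.log(rng.uniform()) < log_post(prop) - log_post(p):
        p = prop
    trace.append(p)                       # accumulate samples: the trace

print(np.mean(trace[1000:]))              # posterior mean, near (7+1)/(10+2)
```

Discarding the first samples (the "burn-in") before summarizing the trace is standard practice, since the chain starts at an arbitrary point.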

• Closed-Form Solutions

• Belief Propagation

• Deterministic Approximations:

‣ Mean Field Approximation

‣ Variational Bayes and VMP

Only VMP seems as generic as MCMC for Prob. Programming

- Implementations

• Prob. Programming with MCMC

• Stan: in C++ with R bindings

• PyMC: in Python / NumPy / Theano

• Prob. Programming with VMP

• Infer.NET (C#, F#, ...; academic use only)

• Infer.py (pythonnet bindings, very alpha)

- Why is Probabilistic Programming so hot?

• Open Model that tells a Generative Story

• Story Telling is good for Human Understanding and Persuasion

• Grounded in Quantitative Analysis and the sound theory of Bayesian Inference

• Black Box Inference Engine (e.g. MCMC):

‣ can be treated as Compiler Optimization

- Why Bayesian Inference?

• Makes it possible to explicitly model uncertainty caused by lack of data using priors

- Prob. Programming not so hot (yet)?

• Scalability? Accuracy?

• Highly nonlinear dependencies lead to highly multi-modal posteriors

• Hard to mix between posterior modes: slow convergence

• How to best build models? How to choose priors?

- Old idea but recent developments

• No-U-Turn Sampler (2011): breakthrough for scalability of MCMC for some model classes (in Stan and PyMC3 with Theano)

• VMP (orig. paper 2005, generalized in 2011) in Infer.NET

• New DARPA Program (2013-2017) to fund research on Prob. Programming

- Learning Prob. Programming

• Probabilistic Programming and Bayesian Methods for Hackers

• Creative Commons Book on GitHub

• Uses PyMC & IPython notebook

• Doing Bayesian Data Analysis

• Book with examples in R and BUGS

- Deep Learning: The end of feature engineering?
- A bit of history

It all started with connectionist models in the late 50s / early 60s.

Timeline (60s-10s): 1957: Perceptron by Rosenblatt

- The Perceptron
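Rosenblatt's learning rule is tiny: predict with a threshold unit, and on a mistake nudge the weights by (target - prediction) * x. A sketch on a linearly separable toy task (logical AND; the task and numbers are mine, not from the slides):

```python
import numpy as np

# Toy linearly separable task: logical AND; column 0 is a constant bias input
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(3)
for _ in range(20):                      # a few passes suffice here
    for xi, yi in zip(X, y):
        pred = int(w @ xi > 0)           # threshold unit
        w += (yi - pred) * xi            # Rosenblatt update, only on mistakes

print([int(w @ xi > 0) for xi in X])     # -> [0, 0, 0, 1]
```

On separable data this loop provably converges in a finite number of updates; the next slide shows where it breaks down.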
- But the perceptron cannot solve non-linear classification tasks...
- The 2D XOR problem
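One hidden layer fixes XOR: it is "at least one input on" and not "both inputs on". A sketch with threshold units whose weights are set by hand (not learned; the construction is a standard illustration, not taken from the slides):

```python
def step(z):                    # threshold unit
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)    # fires if at least one input is 1 (OR)
    h2 = step(x1 + x2 - 1.5)    # fires only if both inputs are 1 (AND)
    return step(h1 - h2 - 0.5)  # OR and-not AND == XOR

print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # -> [0, 1, 1, 0]
```

No single threshold unit can produce that output column, which is exactly the perceptron's limitation.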
- Neural Nets progress in the 80s with early successes (e.g. OCR)

Timeline (60s-10s): 1957: Perceptron by Rosenblatt; 1986: Backprop by Rumelhart, Hinton & Williams

- 2 Major Problems

• In practice Backpropagation stops working with more than 1 or 2 hidden layers

• Overfitting: forces early stopping

- Overfitting with Neural Networks

[figure: average number of misclassified examples vs. number of passes over the training set (epochs); the error on the test set rises while the error on the training set keeps falling]

- One exception

• For Computer Vision: Convolutional Networks can learn deep hierarchies

• Shared weights in the convolution kernel reduce the total number of parameters and hence limit the over-fitting problem of nets

• Only works if the task is translation invariant in the original feature space

Timeline (60s-10s): 1957: Perceptron by Rosenblatt; 1986: Backprop by Rumelhart, Hinton & Williams; 1998: ConvNet by LeCun

- Theoretical Break!
- What is depth in ML?

• Architectural depth, not decision tree depth

• Number of non-linearities between the unobserved “True”, “Real-World” factors of variation (causes) and the observed data (e.g. pixels in a robot’s camera)

• A decision tree prediction function can be factored as a sum of products: depth = 1

- Common ML Architectures by Depth

Depth 0: Perceptron, Linear SVM, Logistic Regression, Multinomial Naive Bayes

Depth 1: NN with 1 hidden layer, Non-linear SVM, Decision Trees

Depth 2: NN with 2 hidden layers, Ensembles of Trees

- Depth 0: Linearly Separable Data
- Depth 1: the 2D XOR problem
- Generalizing the XOR problem to N dim

• The Parity Function: given N boolean variables, return 1 if the number of positive values is even, 0 otherwise

• Depth 1 models can learn the parity function but:

• Need ~ 2^N hidden nodes / SVs

• Require 1 example per local variation

- Depth 2+ models can be more compact
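The parity function described above is trivial to write down; what is expensive for depth-1 models is representing it. A sketch (following the slides' convention: 1 when the count of positive values is even):

```python
from itertools import product

def parity(bits):
    # 1 if the number of positive values is even, 0 otherwise
    return 1 if sum(bits) % 2 == 0 else 0

# every one of the 2^N corners flips the label relative to its neighbors:
# each corner is a local variation of its own
for bits in product([0, 1], repeat=3):
    print(bits, parity(bits))
```

Flipping any single bit flips the output, which is why a depth-1 learner needs on the order of one hidden unit (or support vector) per corner.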

• The parity function can be learned by a depth-2 NN with a number of hidden units that grows linearly with the dimensionality of the problem

• Similar results for the Checker Board learning task

- Enough theory! and back to our chronology of Deep Learning
- So in the 90s and early 00s

• ML community moved away from NN

• SVM with kernel: less hyper-parameters

• Random Forests / Boosted Trees often beat all other models when enough labeled data and CPU time

• The majority of kaggle winners use ensembles of trees (up until recently...)

- But in 2006...

• Breakthrough by Geoff Hinton at the U. of Toronto

• Unsupervised Pre-training of Deep Architectures (Deep Belief Networks)

• Can be unfolded into a traditional NN for fine tuning

Timeline (60s-10s): 1957: Perceptron by Rosenblatt; 1986: Backprop by Rumelhart, Hinton & Williams; 1998: ConvNet by LeCun; 2006: Unsupervised Pre-training by Hinton

[diagram: greedy layer-wise stacking]
Input data → RBM → Hidden Representation #1
Hidden Representation #1 → RBM → Hidden Representation #2
Hidden Representation #2 → RBM → Hidden Representation #3
Hidden Representation #3 → clf → Labels

- Soon replicated and extended...

• Bengio et al. at U. of Montreal

• Ng et al. at Stanford

• Replaced RBMs with various other models such as Auto-Encoders in a denoising setting or with a sparsity penalty

• Started to reach state of the art in speech recognition

- Example: Convolutional DBN

"Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations", by Honglak Lee, Roger Grosse, Rajesh Ranganath, Andrew Y. Ng

- 2012 results by Stanford / Google
- The YouTube Neuron
Timeline (60s-10s): 1957: Perceptron by Rosenblatt; 1986: Backprop by Rumelhart, Hinton & Williams; 1998: ConvNet by LeCun; 2006: Unsupervised Pre-training by Hinton; 2012: Dropout by Hinton

- Dropout

• New way to train deep supervised neural networks with much less overfitting and without unsupervised pre-training

• Allows NN to beat state of the art approaches on ImageNet (object classification in images)

- Dropout: the end of overfitting?
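Dropout's training-time masking in code, as an "inverted dropout" sketch (the common modern formulation; the original 2012 recipe instead kept activations unscaled during training and scaled the weights at test time):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5, train=True):
    if not train:
        return h                                  # test time: identity
    mask = rng.uniform(size=h.shape) >= p_drop    # keep each unit with prob 1 - p_drop
    return h * mask / (1.0 - p_drop)              # rescale survivors so E[h] is unchanged

h = np.ones(8)
print(dropout(h))    # each entry is 0.0 (dropped) or 2.0 (kept and rescaled)
```

Randomly silencing units prevents co-adaptation: no feature can rely on another specific feature being present, which acts as a strong regularizer.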

[figure: average number of misclassified examples vs. number of passes over the training set (epochs); the test error without dropout stays above the test error with dropout]

- Even more recently

• Maxout networks:

• New non-linearity optimized for Dropout

• Easier / faster to train

• Implementation in Python / Theano

• http://deeplearning.net/software/pylearn2/

Timeline (60s-10s): 1957: Perceptron by Rosenblatt; 1986: Backprop by Rumelhart, Hinton & Williams; 1998: ConvNet by LeCun; 2006: Unsupervised Pre-training by Hinton; 2012: Dropout by Hinton; 2013: Maxout, Fast Dropout, DropConnect, ...

- Why is Deep Learning so hot?

• Can automatically extract high level, invariant, discriminative features from raw data (pixels, sound frequencies...)

• Starting to reach or beat State of the Art in several Speech Understanding and Computer Vision tasks

• Stacked Abstractions and Composability might be a path to build a real AI

- Why is Deep Learning not so practical (yet)?

• Requires lots of (labeled) training data

• Typically requires running a GPU for days to fit a model + many hyperparameters

• Not yet that useful with high level abstract input (e.g. text data): shallow models can already do very well for text classification

- Deep Learning: very hot research area

• Big Industry Players (Google, Microsoft, IBM...) investing in DL for speech understanding and computer vision

• Many top ML researchers are starting to look at DL & some on the theory side

- In Production for Speech Recognition

Google and Microsoft use Deep Auto Encoders for extracting features for Speech Recognition in Chrome, Android and Windows Phone

- In Production for Computer Vision
- SOLVED!
- Concluding remarks

Advice to the SciPy crowd

- Learn Prob. Programming

• If you want to do data analysis with a priori knowledge on the data generation process from hidden causes

• If you want to model the uncertainty of hidden causes using probability distributions

• But don’t expect high predictive accuracy

• PyMC is a good place to start in Python

- Learn Deep Learning

• If you have many labeled samples

• If you are a researcher in Speech Recognition or Computer Vision (or NLP)

• If you are ready to invest time in learning the latest tricks

• If you are ready to mess with GPUs

• http://deeplearning.net

- Otherwise stick with scikit-learn for now

• K-Means, Regularized Linear Models and Ensembles of Trees can get you pretty far

• Fewer parameters to tune

• Faster to train on CPUs

• http://scikit-learn.org http://kaggle.com https://www.coursera.org/course/ml

- Thanks

• https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

• http://radar.oreilly.com/2013/04/probabilistic-programming.html

• http://deeplearning.net

• NIPS 2012, ICLR 2013, ICML 2013