This page reproduces the slides from https://speakerdeck.com/ogrisel/trends-in-machine-learning, uploaded on 2013/05/20.


with a focus on the Python ecosystem. Talk given at Paris DataGeeks 2013. This is a preliminary version of the talk I will give at SciPy 2013. Feedback appreciated.

- Trends in Machine Learning and the SciPy Community

Paris Datageeks - May 2013 (Tuesday, 21 May 2013)

- About me

• Regular contributor to scikit-learn

• Interested in NLP, Computer Vision,

Predictive Modeling & ML in general

• Interested in Cloud Tech and Scaling Stuff

• Starting an ML consultancy, writing a book

http://ogrisel.com - Outline

• Black Box Models with scikit-learn

• Probabilistic Programming with PyMC

• Deep Learning with PyLearn2 & Theano

- Machine Learning == Executable Data Summarization

- Blackbox Machine Learning with scikit-learn

Data → Predictions

- Supervised Machine Learning
- Supervised ML with sklearn
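The supervised workflow with scikit-learn boils down to fit on (X, y), then predict on new data. A minimal sketch (the word-count matrix and labels below are made-up toy values):

```python
# Minimal sketch of the scikit-learn supervised API.
# Rows of X are documents, columns are word counts; y marks spam (toy data).
from sklearn.linear_model import LogisticRegression

X = [[0, 2, 1, 0, 0, 1],   # email 1
     [0, 0, 0, 1, 1, 0],   # email 2
     [1, 1, 0, 0, 0, 1],   # email 3
     [1, 0, 1, 1, 2, 3],   # email 4
     [0, 0, 1, 1, 0, 0]]   # email 5
y = [0, 1, 1, 0, 0]        # spam labels

clf = LogisticRegression()
clf.fit(X, y)                             # summarize the data into parameters
pred = clf.predict([[0, 0, 0, 2, 1, 0]])  # predict for a new, unseen email
```

Every estimator in scikit-learn follows this same fit/predict contract, which is what makes the models swappable.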
- Spam Classification

[Table: word-count features per email plus a spam label]

          word 1  word 2  word 3  word 4  word 5  word 6 | Spam?
email 1     0       2       1       0       0       1    |   0
email 2     0       0       0       1       1       0    |   1
email 3     1       1       0       0       0       1    |   1
email 4     1       0       1       1       2       3    |   0
email 5     0       0       1       1       0       0    |   0

X = the matrix of word counts, y = the vector of spam labels

- Topic Classification

[Table: word-count features per news article plus one column per topic]

         word 1  word 2  word 3  word 4  word 5  word 6 | Sports  Business  Tech.
news 1     0       2       1       0       0       1    |   0        0        1
news 2     0       0       0       1       1       0    |   1        1        0
news 3     1       1       0       0       0       1    |   1        0        0
news 4     1       0       1       1       2       3    |   0        1        1
news 5     0       0       1       1       0       0    |   0        0        0

X = the matrix of word counts, y = the topic labels

- Sentiment Analysis

[Table: word-count features per review plus a sentiment label]

           word 1  word 2  word 3  word 4  word 5  word 6 | Positive?
review 1     0       2       1       0       0       1    |    0
review 2     0       0       0       1       1       0    |    1
review 3     1       1       0       0       0       1    |    1
review 4     1       0       1       1       2       3    |    0
review 5     0       0       1       1       0       0    |    0

X = the matrix of word counts, y = the sentiment labels

- Vegetation Cover Type

[Table: per-location features (latitude, altitude, distance to closest river, altitude of closest river, slope, slope orientation) and one-hot vegetation cover classes (rain forest, grassland, arid/ice)]

location 1:   46.   200.     1     0    0.0   N   |  0  1  0
location 2:  -30.   150.     2.  149    0.1   S   |  1  0  0
location 3:   87.    50   1000    10    0.1   W   |  0  0  1
location 4:   45.    10     10.    1    0.4   NW  |  0  1  0
location 5:    5.     2.    67.    1.   0.2   E   |  1  0  0

X = the feature matrix, y = the cover type labels

- Object Classification in Images

[Table: SIFT visual-word counts per image plus one column per object class]

          SIFT word 1..6      | Car  Cat  Pedestrian
image 1    0  2  1  0  0  1   |  0    0    1
image 2    0  0  0  1  1  0   |  1    1    0
image 3    1  1  0  0  0  1   |  1    0    0
image 4    1  0  1  1  2  3   |  0    1    1
image 5    0  0  1  1  0  0   |  0    0    0

X = the matrix of SIFT word counts, y = the object labels

- Many more applications...

• Product Recommendations

Given past purchase history of all users

• Ad-Placement / bidding in Web Pages

Given user browsing history / keywords

• Fraud detection

Given features derived from behavior - Unsupervised ML
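In the unsupervised setting there is no y: the model summarizes the structure of X alone. A minimal sketch with scikit-learn's KMeans (the 2-D toy data below is made up):

```python
# Unsupervised learning: summarize X without any labels y.
# k-means groups samples into clusters; here two obvious blobs.
from sklearn.cluster import KMeans

X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],   # blob around (0, 0)
     [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]   # blob around (5, 5)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_   # cluster assignment discovered for each sample
```

The same fit-based API applies to other unsupervised estimators (PCA, NMF, clustering, density estimation).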
- Limitations of Blackbox

Machine Learning - Problem #1:

Not So Blackbox

• Feature Extraction: highly domain specific

• + Feature Normalization / Transformation

• Unmet Statistical Assumptions

• Linear Separability of the target classes

• Correlations between features

• Natural metric for the features - scikit-learn in practice

by Andreas Mueller - Problem #2:

Lack of Explainability

Blackbox models can rarely

explain what they learned.

Expert knowledge required to understand the model

behavior and gain deeper insight on the data:

this is model specific. - Possible Solutions

• Problem #1: Costly Feature Engineering

• Unsupervised feature extraction with

Deep Learning

• Problem #2: Lack of Explainability

• Probabilistic Programming with

generic inference engines - Probabilistic

Programming

Openbox Models

Blackbox Inference Engine

Data

Predictions - What is Prob.

Programming?

• Model unknown causes of a

phenomenon with random variables

• Write a programmatic story to derive

observables from unknown variables

• Plug data into observed variables

• Use engine to invert the story and assign

prob. distributions to unknown params. - Inverting the Story

w/ Bayesian Inference

p(H|D) = p(D|H) · p(H) / p(D)

D: data, H: hypothesis (e.g. parameters)

p(D|H): likelihood

p(H): prior

p(H|D): posterior

p(D): evidence - Generic Inference with MCMC
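The recipe listed below can be illustrated with a hand-rolled random-walk Metropolis sampler in plain NumPy, inferring a coin's bias from 9 tosses. This is a didactic sketch only, not how PyMC or Stan actually implement MCMC:

```python
import numpy as np

# Infer the bias p of a coin (6 heads out of 9 tosses, uniform prior)
# with a random-walk Metropolis sampler.
rng = np.random.default_rng(0)
heads, tosses = 6, 9

def log_posterior(p):
    if not 0.0 < p < 1.0:
        return -np.inf          # outside the support of the uniform prior
    return heads * np.log(p) + (tosses - heads) * np.log(1.0 - p)

p = 0.5                         # start from an arbitrary point
trace = []
for _ in range(20000):
    proposal = p + rng.normal(scale=0.1)      # move the parameter randomly
    # accept or reject the new sample depending on a likelihood-ratio test
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(p):
        p = proposal
    trace.append(p)             # accumulate samples: this is the trace

posterior_mean = np.mean(trace[5000:])        # discard burn-in
```

The analytic posterior here is Beta(7, 4) with mean 7/11 ≈ 0.64; the sampler's estimate should land close to that.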

• Markov chain Monte Carlo

• Start from a Random Point

• Move parameter values randomly

• Reject the new sample randomly depending on a likelihood test

• Accumulate non-rejected samples and call it the trace

- Alternatives to MCMC

• Closed-Form Solutions

• Belief Propagation

• Deterministic Approximations:

‣ Mean Field Approximation

‣ Variational Bayes and VMP

Only VMP seems as generic as MCMC for Prob. Programming - Implementations

• Prob. Programming with MCMC

• Stan: in C++ with R bindings

• PyMC: in Python / NumPy / Theano

• Prob. Programming with VMP

• Infer.NET (C#, F#.. , academic use only)

• Infer.py (pythonnet bindings, very alpha) - Why is Probabilistic

Programming so cool?

• Open Model that tells a Generative Story

• Story Telling is good for Human

Understanding and Persuasion

• Grounded in Quantitative Analysis and

the sound theory of Bayesian Inference

• Black Box inference Engine (e.g. MCMC):

‣ can be treated as Compiler Optimization - Why is Bayesian

Inference so cool?

• Makes it possible to explicitly inject

uncertainty caused by lack of data using

priors - Prob. Programming not

so cool (yet)?

• Scalability? Improving but still...

• Highly nonlinear dependencies lead to highly multi-modal posteriors

• Hard to mix between posterior modes:

slow convergence

• How to best build models? How to choose

priors? - Old idea but recent

developments

• No-U-Turn Sampler (2011): breakthrough

for scalability of MCMC for some model

classes (in Stan and PyMC3)

• VMP (orig. paper 2005, generalized in 2011)

in Infer.NET

• New DARPA Program (2013-2017) to fund

research on Prob. Programming. - Learning Prob.

Programming

• Probabilistic Programming and Bayesian

Methods for Hackers

• Creative Commons Book on Github

• Uses PyMC & IPython notebook

• Doing Bayesian Data Analysis

• Book with example in R and BUGS - Deep Learning

The end of feature engineering? - What is depth in ML?

• Architectural depth, not decision tree depth

• Number of non-linearities between the

unobserved “True”, “Real-World” factors

of variations (causes) and the observed

data (e.g. pixels in a robot’s camera)

• But what is non-linearly separable data?

- Depth 0: Linearly Separable Data

- Depth 1: the 2D XOR problem

- Common ML Architectures by Depth
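The XOR problem just mentioned is the canonical example: no depth-0 model (a single linear threshold unit) can classify all four points, which a small brute-force search over weights illustrates:

```python
import itertools

# The 2-D XOR problem: check by brute force that no linear threshold unit
# w1*x1 + w2*x2 + b > 0 classifies all four points correctly.
points = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
grid = [x / 4.0 for x in range(-8, 9)]   # weights and bias in [-2, 2]

best = 0
for w1, w2, b in itertools.product(grid, repeat=3):
    correct = sum(int(w1 * x1 + w2 * x2 + b > 0) == label
                  for (x1, x2), label in points)
    best = max(best, correct)

# best == 3: a linear model gets at most 3 of the 4 XOR points right
```

The grid search is only illustrative (the impossibility is a one-line geometric argument), but it makes the "depth 1 needed" claim concrete.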

Depth 0:

Perceptron, Linear SVM, Logistic Regression,

Multinomial Naive Bayes

Depth 1:

NN with 1 hidden Layer, RBF SVM, Decision

Trees

Depth 2:

NN with 2 hidden Layers, Ensembles of Trees - Generalizing the XOR

problem to N dim

• The Parity Function: given N boolean

variables, return 1 if number of positive

values is even, 0 otherwise

• Depth 1 models can learn the parity

function but:

• Need ~ 2^N hidden nodes / SVs

• Require 1 example per local variation - Deeper models can be

more compact
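The parity function in question is simple to state in code, using the slide's convention that an even count of positive values maps to 1:

```python
# Parity function as defined on the previous slide: return 1 if the
# number of positive values among N booleans is even, 0 otherwise.
def parity(bits):
    return 1 if sum(bits) % 2 == 0 else 0

# For N = 2 this is the complement of XOR:
table = {bits: parity(bits) for bits in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

Despite being this simple to write down, shallow models need exponentially many units to represent it, which is the point of the slide.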

• The parity function can be learned by a depth-2 NN with a number of hidden units that grows linearly with the dimensionality of the problem

• Similar results for the Checker Board

learning task - Common ML

Architectures by Depth

Depth 0:

Perceptron == NN with 0 hidden layers

Depth 1:

NN with 1 hidden layer

Depth 2:

NN with 2 hidden layers - A bit of history

• Neural Nets progressed in the 80s with

early successes (e.g. Neural Nets for OCR)

• 2 Major Problems:

• Backprop does not work with more than

1 or 2 hidden layers

• Overfitting: forces early stopping - Overfitting with Neural

Networks

[Plot: number of misclassified examples vs. number of passes over the training set (epochs); the error on the training set keeps decreasing while the error on the testing set eventually rises again]

- So in the 90s and early 00s

• ML community moved away from NN

• SVM with kernel: fewer hyperparameters

• Random Forests / Boosted Trees often

beat all other models when enough

labeled data and CPU time - But in 2006..

• Breakthrough by Geoff Hinton at the U. of

Toronto

• Unsupervised Pre-training of Deep

Architectures (Deep Belief Networks)

• Can be unfolded into a traditional NN for fine tuning

[Diagram, built up over four slides: Input data → RBM → Hidden Representation #1 → RBM → Hidden Representation #2 → RBM → Hidden Representation #3 → clf → Labels]

- Soon replicated and extended...

• Bengio et al. in U. of Montreal

• Ng et al. in Stanford

• Replaced RBM with various other models

such as Autoencoders in a denoising setting

or with a sparsity penalty

• Started to reach state of the art in speech

recognition, image recognition.. - Example:

Convolutional DBN

Convolutional Deep

Belief Networks for

Scalable Unsupervised

Learning of Hierarchical

Representations

Honglak Lee

Roger Grosse

Rajesh Ranganath

Andrew Y. Ng - More recently

• Second breakthrough in 2012 by Hinton

again: Dropout networks

• New way to train deep feed forward neural

networks with much less overfitting and

without unsupervised pretraining

• Allows NN to beat state of the art

approaches on ImageNet (object

recognition in images) - Even more recently

• Maxout networks:

• New non linearity optimized for Dropout

• Easier / faster to train

• Implementation in Python / Theano

• http://deeplearning.net/software/

pylearn2/ - Why is Deep Learning

so cool?

• Can automatically extract high level,

invariant, discriminative features from raw

data (pixels, sound frequencies. .)

• Starting to reach or beat State of the Art in

some Speech Understanding and Computer

Vision tasks

• Stacked Abstractions and Composability

might be a path to build a real AI - Why is Deep Learning

not so cool (yet)?

• Requires lots of training data

• Typically requires running a GPU for days

to fit a model + many hyperparameters

• Under-fitting issues for large models

• Not yet that useful with high level abstract

input (e.g. text data): shallow models can

already do very well for text classification - DL: very hot research

area

• Big Industry Players (Google, Microsoft.. )

investing in DL for speech understanding

and computer vision

• Many top ML researchers are starting to

look at DL & some on the theory side

- 2012 results by Stanford / Google

- The YouTube Neuron