This page reproduces the content of http://www.slideshare.net/0xdata/h2o-distributed-deep-learning-by-arno-candel-071614.



Deep Learning R Vignette Documentation: https://github.com/0xdata/h2o/tree/master/docs/deeplearning/

Deep Learning has been dominating recent machine learning competitions with better predictions. Unlike the neural networks of the past, modern Deep Learning methods have cracked the code for training stability and generalization. Deep Learning is not only the leader in image and speech recognition tasks, but is also emerging as the algorithm of choice in traditional business analytics.

This talk introduces Deep Learning and implementation concepts in the open-source H2O in-memory prediction engine. Designed for the solution of enterprise-scale problems on distributed compute clusters, it offers advanced features such as adaptive learning rate, dropout regularization and optimization for class imbalance. World record performance on the classic MNIST dataset, best-in-class accuracy for eBay text classification and others showcase the power of this game changing technology. A whole new ecosystem of Intelligent Applications is emerging with Deep Learning at its core.

About the Speaker: Arno Candel

Prior to joining 0xdata as Physicist & Hacker, Arno was a founding Senior MTS at Skytree where he designed and implemented high-performance machine learning algorithms. He has over a decade of experience in HPC with C++/MPI and had access to the world’s largest supercomputers as a Staff Scientist at SLAC National Accelerator Laboratory where he participated in US DOE scientific computing initiatives. While at SLAC, he authored the first curvilinear finite-element simulation code for space-charge dominated relativistic free electrons and scaled it to thousands of compute nodes.

He also led a collaboration with CERN to model the electromagnetic performance of CLIC, a ginormous e+e- collider and potential successor to the LHC. Arno has authored dozens of scientific papers and is a sought-after academic conference speaker. He holds a PhD and Masters summa cum laude in Physics from ETH Zurich.

Deep Learning with H2O

Arno Candel, 0xdata, H2O.ai

Scalable In-Memory Machine Learning

Hadoop User Group, Chicago, 7/16/14

Who am I?

@ArnoCandel

PhD in Computational Physics, 2005, from ETH Zurich, Switzerland

6 years at SLAC - Accelerator Physics Modeling
2 years at Skytree, Inc - Machine Learning
7 months at 0xdata/H2O - Machine Learning

15 years in HPC, C++, MPI, Supercomputing

Outline

Intro & Live Demo (5 mins)
Methods & Implementation (20 mins)
Results & Live Demos (25 mins)
  - MNIST handwritten digits
  - text classification
  - weather prediction
Q & A (10 mins)

H2O: Open Source In-Memory Prediction Engine for Big Data

Distributed in-memory math platform
➔ GLM, GBM, RF, K-Means, PCA, Deep Learning

Easy to use SDK / API
➔ Java, R, Scala, Python, JSON, browser-based GUI

Businesses can use ALL of their data (with or without Hadoop)
➔ Modeling without Sampling

Big Data + Better Algorithms ➔ Better Predictions

About H2O (aka 0xdata)

Pure Java, Apache v2 Open Source

Join the community: www.h2o.ai/community

+1 Cyprien Noel for prior work

Customer Demands for Practical Machine Learning

Requirement   ➔ Value
In-Memory     ➔ Fast (Interactive)
Distributed   ➔ Big Data (No Sampling)
Open Source   ➔ Ownership of Methods
API / SDK     ➔ Extensibility

H2O was developed by 0xdata to meet these requirements.

H2O Integration

APIs: Java, R, Scala, Python, JSON

[Diagram: H2O runs Standalone, over YARN, or on Hadoop MRv1, reading data from HDFS in each case]

H2O Architecture

Prediction Engine (nano fast): R Engine, Scoring Engine

Distributed In-Memory K-V store: columnar compression, memory manager

Machine Learning Algorithms: MapReduce, e.g. Deep Learning

H2O - The Killer App on Spark

http://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.html

H2O R CRAN package

John Chambers (creator of the S language, R-core member) names the H2O R API among the top three promising R projects.

H2O + R = Happy Data Scientist

Machine Learning on Big Data with R: the data resides on the H2O cluster!

H2O Deep Learning in Action

MNIST = database of digitized handwritten digits (Yann LeCun)

Yann LeCun: "Yet another advice: don't get fooled by people who claim to have a solution to Artificial General Intelligence. Ask them what error rate they get on MNIST or ImageNet."

Data: 28x28 = 784 pixels with (gray-scale) values in 0…255
Train: 60,000 rows, 784 integer columns, 10 classes
Test: 10,000 rows, 784 integer columns, 10 classes

Live Demo: build an H2O Deep Learning model on the MNIST train/test data

What is Deep Learning?

Wikipedia: Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations.

Example: input data (image) ➔ prediction (who is it?)

Facebook's DeepFace (Yann LeCun) recognises faces as well as humans.

Deep Learning is Trending

Google Trends: searches for "deep learning" rising through 2011-2013

Businesses are using Deep Learning techniques!

Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)
FBI FACE: $1 billion face recognition project
Chinese search giant Baidu hires the man behind the "Google Brain" (Andrew Ng)

Deep Learning History

slides by Yann LeCun (now at Facebook)

Deep Learning wins competitions AND makes humans, businesses and machines (cyborgs!?) smarter

What is NOT Deep

Linear models are not deep (by definition)

Neural nets with one hidden layer are not deep (no feature hierarchy)

SVMs and kernel methods are not deep (2 layers: kernel + linear)

Classification trees are not deep (operate on original input space)

Deep Learning in H2O

1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation)

+ distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data)

+ multi-threaded speedup (H2O Fork/Join worker threads update the model asynchronously)

+ smart algorithms for accuracy (weight initialization, adaptive learning, momentum, dropout, regularization)

= Top-notch prediction engine!

Example Neural Network

"fully connected" directed graph of neurons

[Diagram: information flows from the input neurons (age, income, employment) through two hidden layers to the output neurons (married, single)]

             Input layer   Hidden layer 1   Hidden layer 2   Output layer
#neurons     3             4                3                2
#connections               3x4              4x3              3x2

Prediction: Forward Propagation

"neurons activate each other via weighted sums"

yj = tanh(sumi(xi*uij) + bj)
zk = tanh(sumj(yj*vjk) + ck)
pl = softmax(sumk(zk*wkl) + dl)

softmax(xk) = exp(xk) / sumk(exp(xk))

bj, ck, dl: bias values (independent of inputs)
activation function: tanh; alternative: x -> max(0,x) "rectifier"

pl is a non-linear function of the inputs xi: with enough layers it can approximate ANY function. The outputs are per-class probabilities: sum(pl) = 1.
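The forward pass above can be sketched in plain Python (a minimal illustration of the slide's equations, not H2O's Java implementation):

```python
import math

def tanh_layer(x, w, b):
    # one fully connected layer: y_j = tanh(sum_i(x_i * w[i][j]) + b_j)
    return [math.tanh(sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j])
            for j in range(len(b))]

def softmax(z):
    # softmax(z_k) = exp(z_k) / sum_k(exp(z_k)); subtract max for numerical stability
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(x, hidden_layers, w_out, d):
    # hidden layers use tanh; the output layer turns weighted sums into probabilities
    for w, b in hidden_layers:
        x = tanh_layer(x, w, b)
    logits = [sum(xi * w_out[i][l] for i, xi in enumerate(x)) + d[l]
              for l in range(len(d))]
    return softmax(logits)
```

For the slide's 3-4-3-2 network, `forward` takes the 3 inputs through two tanh layers and returns 2 class probabilities that sum to 1.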

Data Preparation & Initialization

Neural Networks are sensitive to numerical noise and operate best in the linear regime (not saturated).

Automatic standardization of data: xi: mean = 0, stddev = 1

Horizontalize categorical variables, e.g.
{full-time, part-time, none, self-employed} -> {0,1,0} = part-time, {0,0,0} = self-employed

Automatic initialization of weights:
Poor man's initialization: random weights wkl
Default (better): Uniform distribution in +/- sqrt(6/(#units + #units_previous_layer))
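These preparation steps can be sketched as follows (an illustration only; H2O performs them automatically):

```python
import math
import random

def standardize(col):
    # rescale a numeric column to mean 0, stddev 1
    n = len(col)
    mean = sum(col) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0
    return [(v - mean) / std for v in col]

def init_weights(n_in, n_out, rng=random):
    # default initialization: uniform in +/- sqrt(6 / (#units + #units_previous_layer))
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_out)]
            for _ in range(n_in)]

def horizontalize(value, levels):
    # expand a categorical into indicator columns; the last level is the all-zeros baseline
    return [1 if value == lv else 0 for lv in levels[:-1]]
```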

Training: Update Weights & Biases

For each training row, we make a prediction and compare with the actual label (supervised learning):

           predicted   actual
married    0.8         1
single     0.2         0

Objective: minimize prediction error (MSE or cross-entropy)

Mean Square Error = (0.2^2 + 0.2^2)/2   "penalize differences per-class"
Cross-entropy = -log(0.8)   "strongly penalize non-1-ness"

Stochastic Gradient Descent: update weights and biases via the gradient of the error (via back-propagation):

w <- w - rate * ∂E/∂w
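A minimal Python sketch of the two loss functions and the SGD update (illustrative only, not H2O's implementation):

```python
import math

def mse(pred, actual):
    # Mean Square Error: average squared per-class difference
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred)

def cross_entropy(pred, actual):
    # -log(predicted probability of the true class), for one-hot actuals
    return -sum(a * math.log(p) for p, a in zip(pred, actual) if a)

def sgd_step(w, grad, rate):
    # w <- w - rate * dE/dw, applied element-wise
    return [wi - rate * gi for wi, gi in zip(w, grad)]
```

On the slide's example, `mse([0.8, 0.2], [1, 0])` gives (0.2^2 + 0.2^2)/2 = 0.04 and `cross_entropy([0.8, 0.2], [1, 0])` gives -log(0.8).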

Backward Propagation

How to compute ∂E/∂wi for wi <- wi - rate * ∂E/∂wi ?

Naive: for every i, evaluate E twice at (w1,…,wi±∆,…,wN)… slow!

Backprop: compute ∂E/∂wi via the chain rule, going backwards:

net = sumi(wi*xi) + b
y = activation(net)
E = error(y)

∂E/∂wi = ∂E/∂y * ∂y/∂net * ∂net/∂wi
       = ∂(error(y))/∂y * ∂(activation(net))/∂net * xi
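For a single tanh neuron with squared error, the chain rule can be checked against the slide's naive finite-difference alternative (a sketch, not H2O code):

```python
import math

def backprop_grad(w, x, b, target):
    # single neuron: net = sum(wi*xi) + b, y = tanh(net), E = (y - target)^2
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    y = math.tanh(net)
    dE_dy = 2.0 * (y - target)          # d(error)/dy
    dy_dnet = 1.0 - y * y               # d(tanh(net))/dnet
    return [dE_dy * dy_dnet * xi for xi in x]   # dnet/dwi = xi

def numeric_grad(w, x, b, target, delta=1e-6):
    # naive approach from the slide: evaluate E twice per weight
    def E(weights):
        y = math.tanh(sum(wi * xi for wi, xi in zip(weights, x)) + b)
        return (y - target) ** 2
    grads = []
    for i in range(len(w)):
        wp = list(w); wp[i] += delta
        wm = list(w); wm[i] -= delta
        grads.append((E(wp) - E(wm)) / (2 * delta))
    return grads
```

Both routes give the same gradient; backprop just gets it in one backward sweep instead of 2N forward evaluations.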

H2O Deep Learning Architecture

initial model: weights and biases w

map: the model is distributed via the H2O atomic in-memory K-V store; each node trains a copy of the weights and biases with (some* or all of) its local data, using asynchronous Fork/Join worker threads (nodes/JVMs: sync, threads: async communication).

reduce: model averaging: average the weights and biases from all nodes, e.g. w* = (w1 + w2 + w3 + w4)/4; speedup is at least #nodes/log(#rows) (arxiv:1209.4129v3).

Query & display the updated model w* via JSON / WWW through the HTTPD.

Keep iterating over the data ("epochs"), score from time to time.

*user can specify the number of total rows per MapReduce iteration
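One map/reduce round can be sketched as a toy illustration (the `grad_fn` callback and sequential node loop are assumptions for clarity; H2O actually trains the copies concurrently with Fork/Join threads):

```python
def average_models(models):
    # reduce: element-wise model averaging, w* = (w1 + w2 + ... + wn) / n
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]

def train_round(w, node_data, rate, grad_fn):
    # map: every node starts from the same model and runs SGD on its local rows
    local = []
    for rows in node_data:
        wi = list(w)
        for row in rows:
            g = grad_fn(wi, row)
            wi = [a - rate * b for a, b in zip(wi, g)]
        local.append(wi)
    # reduce: average the per-node models into the next global model
    return average_models(local)
```

Iterating `train_round` over the data corresponds to the deck's "epochs"; scoring can happen between rounds.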

“Secret” Sauce to Higher Accuracy

Adaptive learning rate - ADADELTA (Google): automatically set the learning rate for each neuron based on its training history

Regularization:
L1: penalizes non-zero weights
L2: penalizes large weights
Dropout: randomly ignore certain inputs

Grid Search and Checkpointing: run a grid search to scan many hyper-parameters, then continue training the most promising model(s)

Detail: Adaptive Learning Rate

Compute the moving average of ∆wi² at time t for window length rho:

E[∆wi²]t = rho * E[∆wi²]t-1 + (1-rho) * ∆wi²

Compute the RMS of ∆wi at time t with smoothing epsilon:

RMS[∆wi]t = sqrt( E[∆wi²]t + epsilon )

Do the same for ∂E/∂wi, then obtain the per-weight learning rate:

rate(wi, t) = RMS[∆wi]t-1 / RMS[∂E/∂wi]t

Adaptive acceleration / momentum: accumulate previous weight updates, but over a window of time.

Adaptive annealing / progress: gradient-dependent learning rate; the moving window prevents “freezing” (unlike ADAGRAD: no window).

cf. the ADADELTA paper
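The recurrences above can be sketched per weight in Python (a hedged illustration of the ADADELTA update; the defaults rho = 0.95 and epsilon = 1e-6 are assumptions, not H2O's):

```python
import math

class AdaDelta:
    # tracks E[g^2] and E[dw^2] per weight, following the recurrences above
    def __init__(self, n, rho=0.95, epsilon=1e-6):
        self.rho, self.eps = rho, epsilon
        self.eg2 = [0.0] * n    # moving average of gradient^2
        self.edw2 = [0.0] * n   # moving average of update^2

    def step(self, w, grad):
        out = []
        for i, (wi, g) in enumerate(zip(w, grad)):
            self.eg2[i] = self.rho * self.eg2[i] + (1 - self.rho) * g * g
            # rate(wi, t) = RMS[dw]_{t-1} / RMS[g]_t
            rate = math.sqrt(self.edw2[i] + self.eps) / math.sqrt(self.eg2[i] + self.eps)
            dw = -rate * g
            self.edw2[i] = self.rho * self.edw2[i] + (1 - self.rho) * dw * dw
            out.append(wi + dw)
        return out
```

Note there is no global learning rate: each weight's step size emerges from its own gradient and update history.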

Detail: Dropout Regularization

Training: for each hidden neuron, for each training sample, for each iteration, ignore (zero out) a different random fraction p of input activations.

Testing: use all activations, but reduce them by a factor p (to “simulate” the missing activations during training).

cf. Geoff Hinton's paper
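A minimal sketch of the two modes (assuming p is the dropped fraction, so test-time activations are scaled by the keep fraction 1 - p to match the training-time expectation):

```python
import random

def dropout_train(activations, p, rng=random):
    # training: zero out a random fraction p of the activations
    return [0.0 if rng.random() < p else a for a in activations]

def dropout_test(activations, p):
    # testing: keep all activations, scaled down to match the training-time expectation
    return [a * (1.0 - p) for a in activations]
```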

MNIST: digits classification

Time to check in on the demo!

Standing world record: without distortions or convolutions, the best-ever published error rate on the test set is 0.83% (Microsoft).

Let’s see how H2O did in the past 20 minutes!

H2O Deep Learning on MNIST: 0.87% test set error (so far)

World-class results!
test set error: 1.5% after 10 mins, 1.0% after 1.5 hours, 0.87% after 4 hours

No pre-training, no distortions, no convolutions, no unsupervised training

Frequent errors: confuses 2/7 and 4/9

Running on 4 nodes with 16 cores each

Weather Dataset

Predict “RainTomorrow” from Temperature, Humidity, Wind, Pressure, etc.

Live Demo: Weather Prediction

5-fold cross validation; 3 hidden Rectifier layers, Dropout, L1-penalty

Interactive ROC curve with real-time updates

The 12.7% 5-fold cross-validation error is at least as good as GBM/RF/GLM models

Live Demo: Grid Search

How did I find those parameters? Grid Search! (works for multiple hyper-parameters at once)

Then continue training the best model

Use Case: Text Classification

Goal: predict the item from the seller’s text description

“Vintage 18KT gold Rolex 2 Tone in great condition”

Data: binary word vector, e.g. 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0 with ones in the columns for “vintage”, “gold” and “condition”

Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes

Let’s see how H2O does on the eBay dataset!
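Building such a binary word vector can be sketched as follows (the vocabulary and whitespace tokenization here are illustrative assumptions):

```python
def binary_word_vector(description, vocab):
    # one column per vocabulary word: 1 if the word occurs in the description
    words = set(description.lower().split())
    return [1 if w in words else 0 for w in vocab]
```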

Use Case: Text Classification

Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes

Out-of-the-box: 11.6% test set error after 10 epochs! Predicts the correct class (out of 143) 88.4% of the time!

Note 1: the H2O columnar-compressed in-memory store needs only 60 MB to store 5 billion values (a dense CSV needs 18 GB)

Note 2: no tuning was done (results are for illustration only)

Parallel Scalability

(for 64 epochs on MNIST, with “0.87%” parameters)

[Plots: training time in minutes and speedup vs. 1, 2, 4, 8, 16, 32, 63 H2O nodes; training time drops to 2.7 mins on 63 nodes]

(4 cores per node, 1 epoch per node per MapReduce)

Tips for H2O Deep Learning

General:
More layers for more complex functions (exp. more non-linearity).
More neurons per layer to detect finer structure in data (“memorizing”).
Add some regularization for less overfitting (smaller validation error).
Do a grid search to get a feel for convergence, then continue training.
Try Tanh first, then Rectifier; try max_w2 = 50 and/or L1 = 1e-5.
Try Dropout (input: 20%, hidden: 50%) with a test/validation set after finding good parameters for convergence on the training set.

Distributed: more training samples per iteration: faster, but less accurate?

With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99.
Without ADADELTA: try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-8, momentum_start = 0.5, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing.

Try balance_classes = true for imbalanced classes.
Use force_load_balance and replicate_training_data for small datasets.

H2O brings Deep Learning to R

All parameters are available from R… and more docs (a draft R vignette) are coming soon!

POJO Model Export for Production Scoring

Plain old Java code is auto-generated to take your H2O Deep Learning models into production!

Deep Learning Auto-Encoders for Anomaly Detection

Toy example: find the anomaly in ECG heart beat data. First, train a model on what’s “normal”: 20 time-series samples of 210 data points each.

Deep Auto-Encoder: learn the low-dimensional non-linear “structure” of the data that allows reconstruction of the original data.

Also works for categorical data!

Deep Learning Auto-Encoders for Anomaly Detection

Model of what’s “normal” + test set with anomaly => the test set prediction (reconstruction) looks “normal”; where the reconstruction error is large: found the anomaly!
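Scoring by reconstruction error can be sketched as follows (a minimal illustration; in H2O the reconstructions come from the trained auto-encoder, and the threshold is a user choice):

```python
def reconstruction_error(sample, reconstruction):
    # mean squared error between a row and its auto-encoder reconstruction
    return sum((s - r) ** 2 for s, r in zip(sample, reconstruction)) / len(sample)

def find_anomalies(samples, reconstructions, threshold):
    # rows whose reconstruction error exceeds the threshold are flagged as anomalies
    return [i for i, (s, r) in enumerate(zip(samples, reconstructions))
            if reconstruction_error(s, r) > threshold]
```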

H2O Steam: Scoring Platform

H2O Steam: More Coming Soon!

Key Take-Aways

H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.

H2O Deep Learning is ready to take your advanced analytics to the next level - try it on your data!

Join our Community and Meetups!
git clone https://github.com/0xdata/h2o
http://docs.0xdata.com
www.h2o.ai/community
@hexadata