このページは http://www.slideshare.net/jpatanooga/hadoop-summit-eu-2013-parallel-linear-regression-iterativereduce-and-yarn の内容を掲載しています。

掲載を希望されないスライド著者の方は、こちらよりご連絡下さい。

3年以上前 (2013/03/24)にアップロードinテクノロジー

Josh Patterson’s Hadoop Summit EU 2013 talk on parallel linear linear regression on IterativeRedu...

Josh Patterson’s Hadoop Summit EU 2013 talk on parallel linear linear regression on IterativeReduce and YARN.

- Linear Regression and Metronome

Modern Big Data Analytics - Josh Patterson

Email:

Past

Published in IAAI-09:

josh@floe.tv

“TinyTermite: A Secure Routing Algorithm”

Twitter:

Grad work in Meta-heuristics, Ant-algorithms

Tennessee Valley Authority (TVA)

@jpatanooga

Hadoop and the Smartgrid

Cloudera

Github:

Principal Solution Architect

Today

https://github.com/

jpatanooga

Independent Consultant - Sections

1. Modern Data Analytics

2. Parallel Linear Regression

3. Performance and Results - Modern Data Analytics

The Evolution of Data Tools - The World as Optimization

Data tells us about our model/engine/product

We take this data and evolve our product towards a state

of minimal market error

WSJ Special Section, Monday March 11, 2013

Zynga changing games based off player behavior

UPS cut fuel consumption by 8.4MM gallons

Ford used sentiment analysis to look at how new car

features would be received - The Modern Data Landscape

Apps are coming but they need

Platforms

Components

Workflows

Lots of investment in Hadoop in this space

Lots of ETL pipelines

Lots of descriptive Statistics

Growing interest in Machine Learning - Hadoop as The Linux of Data

Hadoop has won the Cycle

“Hadoop is the

kernel of a

Gartner: Hadoop will be in

distributed operating

2/3s of advanced analytics

system, and all the

products by 2015 [1]

other components

around the kernel are

now arriving on this

stage”

---Doug Cutting - Today’s Hadoop ML Pipeline

Data cleansing / ETL performed with Hive or Pig

Data In Place Processed

Mahout

R

Custom MapReduce Algorithm

Or Externally Processed

SAS

SPSS

KXEN

Weka - As Focus Shifts to Applications

Data rates have been climbing fast

Speed at Scale becomes the new Killer App

Companies will want to leverage the Big Data infrastructure

they’ve already been working with

Hadoop

HDFS as main storage system

A drive to validate big data investments with results

Emergence of applications which create “data products” - Patterson’s Law

“As the percent of your total data held

in a storage system approaches 100%

the amount of in-system processing

and analytics also approaches 100%” - Tools Will Move onto Hadoop

Already seeing this with Vendors

Who hasn’t announced a SQL engine on Hadoop

lately?

Trend will continue with machine learning tools

Mahout was the beginning

More are following

But what about parallel iterative algorithms? - Distributed Systems Are Hard

Lots of moving parts

Especially as these applications become more complicated

Machine learning can be a non-trivial operation

We need great building blocks that work well together

I agree with Jimmy Lin [3]: “keep it simple”

“make sure costs don’t outweigh benefits”

Minimize “Yet Another Tool To Learn” (YATTL) as much

as we can! - To Summarize

Data moving into Hadoop everywhere

Patterson’s Law

Focus on hadoop, build around next-gen “linux of data”

Need simple components to build next-gen data base apps

They should work cleanly with the cluster that the fortune 500 has:

Hadoop

Also should be easy to integrate into Hadoop and with the hadoop-

tool ecosystem

Minimize YATTL - Multivariate Linear Regression on Hadoop

Parallel Linear Regression - Linear Regression

In linear regression, data is modeled

using linear predictor functions

unknown model parameters are

estimated from the data.

We use optimization techniques

like Stochastic Gradient Descent to

find the coeffcients in the model

Y = (1*x0) + (c1*x1) + … + (cN*xN) - 16

Machine Learning and Optimization

Algorithms

(Convergent) Iterative Methods

Newton’s Method

Quasi-Newton

Gradient Descent

Heuristics

AntNet

PSO

Genetic Algorithms - 17

Stochastic Gradient Descent

Hypothesis about data

Cost function

Update function

Andrew Ng’s Tutorial:

https://class.coursera.org/ml/lecture/preview_vie

w/11 - 18

Stochastic Gradient Descent

Training Data

Training

Simple gradient descent procedure

Loss functions needs to be convex (with

exceptions)

Linear Regression

SGD

Loss Function: squared error of

prediction

Prediction: linear combination of

coefficients and input variables

Model - 19

Mahout’s SGD

Currently Single Process

Multi-threaded parallel, but not cluster parallel

Runs locally, not deployed to the cluster

Tied to logistic regression implementation - 20

Current Limitations

Sequential algorithms on a single node only goes so far

The “Data Deluge”

Presents algorithmic challenges when combined with

large data sets

need to design algorithms that are able to perform in a

distributed fashion

MapReduce only fits certain types of algorithms - 21

Distributed Learning Strategies

McDonald, 2010

Distributed Training Strategies for the Structured Perceptron

Langford, 2007

Vowpal Wabbit

Jeff Dean’s Work on Parallel SGD

DownPour SGD

Sandblaster - 22

MapReduce vs. Parallel Iterative

Input

Processor

Pro

r cessor

Pro

r cesso

s r

Map

Map

Map

Su

S per

pe ste

st p

e 1

Pro

r cesso

s r

Pro

r cesso

s r

Pro

r cesso

s r

Reduce

Reduce

Su

S per

pe ste

st p

e 2

Output

. . . - 23

YARN

Node

Yet Another Resource Negotiator

Manager

Container

App Mstr

Framework for scheduling

Client

distributed applications

Resource

Node

Manager

Manager

Client

Allows for any type of parallel

App Mstr

Container

application to run natively on hadoop

MRv2 is now a distributed applicationMapReduce Status

Node

Manager

Job Submission

Node Status

Resource Request

Container

Container - 24

IterativeReduce

Designed specifically for parallel iterative

algorithms on Hadoop

Implemented directly on top of YARN

Intrinsic Parallelism

Easier to focus on problem

Not focusing on the distributed application part - 25

IterativeReduce API

ComputableMaster

Work

r er

Worker

Worker

Setup()

Ma

M st

a e

st r

Compute()

Complete()

Worker

Worker

Worker

ComputableWorker

Ma

M st

a e

st r

Setup()

Compute()

. . . - 26

SGD Master

Collects all parameter vectors at each pass /

superstep

Produces new global parameter vector

By averaging workers’ vectors

Sends update to all workers

Workers replace local parameter vector with new global

parameter vector - 27

SGD Worker

Each given a split of the total dataset

Similar to a map task

Performs local SGD pass

Local parameter vector sent to master at

superstep

Stays active/resident between iterations - 28

SGD: Serial vs Parallel

Split 1

Split 2

Split 3

Training Data

Worker 1

Worker 2

Worker N

…

Partial

Partial Model

Partial

Model

Model

Master

Model

Global Model - Parallel Linear Regression with IterativeReduce

Based directly on work we did with Knitting Boar

Parallel logistic regression

Scales linearly with input size

Can produce a linear regression model off large amounts of

data

Packaged in a new suite of parallel iterative algorithms called

Metronome

100% Java, ASF 2.0 Licensed, on github - Unit Testing and IRUnit

Simulates the IterativeReduce parallel framework

Uses the same app.properties file that YARN

applications do

Examples

https://github.com/jpatanooga/Metronome/blob/mast

er/src/test/java/tv/floe/metronome/linearregressi

on/iterativereduce/TestSimulateLinearRegressionIt

erativeReduce.java

https

://github.com/jpatanooga/KnittingBoar/blob/master/

src/test/java/com/cloudera/knittingboar/sgd/iterat

ivereduce/TestKnittingBoar_IRUnitSim.java - Metronome: Parallel Linear Regression

Performance and Results - Running the Job via YARN

Build with Maven

Copy Jar to host with cluster access

Copy dataset to HDFS

Run job

Yarn jar iterativereduce-0.1-SNAPSNOT.jar app.properties - Results

Linear Regression - Parallel vs Serial

Parallel Runs

Serial Runs - Lessons Learned

Linear scale continues to be achieved with parameter

averaging variations

Tuning is critical

Need to be good at selecting a learning rate

YARN still experimental, has caveats

Container allocation is still slow

Metronome continues to be experimental - Special Thanks

Michael Katzenellenbollen

Dr. James Scott

University of Texas at Austin

Dr. Jason Baldridge

University of Texas at Austin - Future Directions

More testing, stability

Cache vectors in memory for speed

Metronome

Take on properties of LibLinear

Plugable optimization, general linear models

YARN-centric first class Hadoop citizen

Focus on being a complement to Mahout

K-means, PageRank implementations - Github

IterativeReduce

https://github.com/emsixteeen/IterativeReduce

Metronome

https://github.com/jpatanooga/Metronome

Knitting Boar

https://github.com/jpatanooga/KnittingBoar - References

1. http://www.infoworld.com/d/business-intelligence/gartner

-hadoop-will-be-in-two-thirds-of-advanced-analytics-prod

ucts-2015-

211475

2. https://cwiki.apache.org/MAHOUT/logistic-

regression.html

3. MapReduce is Good Enough? If All You Have is a Hammer,

Throw Away Everything That’s Not a Nail!

•

http://arxiv.org/pdf/1209.2191.pdf