This page reproduces the content of http://www.slideshare.net/teofili/machine-learning-with-apache-hama.


Machine Learning with Apache Hama

Tommaso Teofili

tommaso [at] apache [dot] org

1 - About me

ASF member having fun with:

Lucene / Solr

Hama

UIMA

Stanbol

… some others

SW engineer @ Adobe R&D

2 - Agenda

Apache Hama and BSP

Why machine learning on BSP

Some examples

Benchmarks

3 - Apache Hama

Bulk Synchronous Parallel computing framework on top of HDFS for massive scientific computations

TLP since May 2012

0.6.0 release out soon

Growing community

4 - BSP supersteps

A BSP algorithm is composed of a sequence of “supersteps”

5 - BSP supersteps

Each task

Superstep 1

Do some computation

Communicate with other tasks

Synchronize

Superstep 2

Do some computation

Communicate with other tasks

Synchronize

…

…

…

Superstep N

Do some computation

Communicate with other tasks

Synchronize
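
The compute / communicate / synchronize cycle above can be illustrated without Hama at all. The following is a minimal plain-Java sketch, assuming a ring of threads where each task adds the message it received to its own value, sends the result to its neighbour, and waits on a barrier. The class name `SuperstepSketch`, the ring topology and the starting values are illustrative assumptions, not part of the slides or of the Hama API.

```java
import java.util.Arrays;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Plain-Java simulation of BSP supersteps. It does not use the Hama API.
final class SuperstepSketch {

    /** Runs `tasks` threads for `supersteps` rounds and returns each task's final value. */
    static long[] run(int tasks, int supersteps) {
        long[] value = new long[tasks];
        long[] inbox = new long[tasks];   // messages readable in the current superstep
        long[] outbox = new long[tasks];  // messages sent during the current superstep

        // The barrier action runs once per superstep, after every task has arrived.
        // It delivers the sent messages so they become visible in the next superstep.
        CyclicBarrier barrier =
            new CyclicBarrier(tasks, () -> System.arraycopy(outbox, 0, inbox, 0, tasks));

        Thread[] threads = new Thread[tasks];
        for (int t = 0; t < tasks; t++) {
            final int id = t;
            value[id] = id + 1;  // arbitrary starting value for the illustration
            threads[t] = new Thread(() -> {
                try {
                    for (int s = 0; s < supersteps; s++) {
                        value[id] += inbox[id];                // do some computation
                        outbox[(id + 1) % tasks] = value[id];  // communicate with another task
                        barrier.await();                       // synchronize
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
            });
            threads[t].start();
        }
        try {
            for (Thread thread : threads) {
                thread.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(3, 2)));
    }
}
```

Messages written during one superstep are only read after the barrier, which is the property that makes the superstep semantics easy to reason about.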

6 - Why BSP

Simple programming model

Superstep semantics are easy

Preserve data locality

Improve performance

Well suited for iterative algorithms

7 - Apache Hama architecture

BSP Program execution flow

8 - Apache Hama architecture

9 - Apache Hama

Features

BSP API

M/R like I/O API

Graph API

Job management / monitoring

Checkpoint recovery

Local & (Pseudo) Distributed run modes

Pluggable message transfer architecture

YARN supported

Running in Apache Whirr

10 - Apache Hama BSP API

public abstract class BSP<K1, V1, K2, V2, M extends Writable> …

K1, V1 are the key and value types for inputs

K2, V2 are the key and value types for outputs

M is the type of messages used for task communication

11 - Apache Hama BSP API

public void bsp(BSPPeer<K1, V1, K2, V2, M> peer) throws ..

public void setup(BSPPeer<K1, V1, K2, V2, M> peer) throws ..

public void cleanup(BSPPeer<K1, V1, K2, V2, M> peer) throws ..

12 - Machine learning on BSP

Lots (most?) of ML algorithms are inherently iterative

Hama ML module currently includes

Collaborative filtering

Clustering

Gradient descent

13 - Benchmarking architecture

[Diagram: benchmarking architecture with nodes running Hama, Solr, DBMS, Lucene, Mahout and HDFS]

14 - Collaborative filtering

Given user preferences on movies

We want to find users “near” to some specific user

So that that user can “follow” them

And/or see what they like (which he/she could like too)

15 - Collaborative filtering BSP

Given a specific user

Iteratively (for each task)

Superstep 1*i

Read a new user preference row

Find how near that user is to the current user

That is, find how near their preferences are

Since they are given as vectors we may use vector distance measures such as Euclidean or cosine distance

Broadcast the measure output to other peers

Superstep 2*i

Aggregate measure outputs

Update most relevant users

Still to be committed (HAMA-612)

16 - Collaborative filtering BSP

Given user ratings about movies

"john" -> 0, 0, 0, 9.5, 4.5, 9.5, 8

"paula" -> 7, 3, 8, 2, 8.5, 0, 0

"jim" -> 4, 5, 0, 5, 8, 0, 1.5

"tom" -> 9, 4, 9, 1, 5, 0, 8

"timothy" -> 7, 3, 5.5, 0, 9.5, 6.5, 0

We ask for the 2 nearest users to “paula” and we get “timothy” and “tom”

User recommendation

We can extract movies highly rated by “timothy” and “tom” that “paula” didn’t see

Item recommendation
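
The example above can be reproduced with a few lines of plain Java. This is a minimal single-process sketch, assuming Euclidean distance as the measure and assuming that a rating of 0 means "not seen"; neither assumption is confirmed by the slides, and the class and method names are illustrative. It is not the HAMA-612 implementation and does not use the BSP API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

final class NearestUsersSketch {

    /** The ratings shown on the slide, one vector per user. */
    static Map<String, double[]> slideRatings() {
        Map<String, double[]> ratings = new LinkedHashMap<>();
        ratings.put("john", new double[] {0, 0, 0, 9.5, 4.5, 9.5, 8});
        ratings.put("paula", new double[] {7, 3, 8, 2, 8.5, 0, 0});
        ratings.put("jim", new double[] {4, 5, 0, 5, 8, 0, 1.5});
        ratings.put("tom", new double[] {9, 4, 9, 1, 5, 0, 8});
        ratings.put("timothy", new double[] {7, 3, 5.5, 0, 9.5, 6.5, 0});
        return ratings;
    }

    /** Euclidean distance between two rating vectors of equal length. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    /** Returns the k users whose rating vectors are closest to the given user's. */
    static List<String> nearest(Map<String, double[]> ratings, String user, int k) {
        double[] target = ratings.get(user);
        List<String> others = new ArrayList<>(ratings.keySet());
        others.remove(user);
        others.sort(Comparator.comparingDouble((String u) -> euclidean(target, ratings.get(u))));
        return new ArrayList<>(others.subList(0, Math.min(k, others.size())));
    }

    /**
     * Assumption: a rating of 0 means "not seen". Returns the indexes of items the user
     * has not seen and that at least one neighbour rated minRating or higher.
     */
    static SortedSet<Integer> unseenHighlyRated(Map<String, double[]> ratings, String user,
                                                List<String> neighbours, double minRating) {
        double[] own = ratings.get(user);
        SortedSet<Integer> items = new TreeSet<>();
        for (String neighbour : neighbours) {
            double[] theirs = ratings.get(neighbour);
            for (int i = 0; i < own.length; i++) {
                if (own[i] == 0 && theirs[i] >= minRating) {
                    items.add(i);
                }
            }
        }
        return items;
    }

    public static void main(String[] args) {
        Map<String, double[]> ratings = slideRatings();
        List<String> neighbours = nearest(ratings, "paula", 2);
        System.out.println(neighbours);
        System.out.println(unseenHighlyRated(ratings, "paula", neighbours, 6.0));
    }
}
```

In the BSP version each task would compute distances for its own rows and broadcast them, with the sorting step done after the synchronization.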

17 - Benchmarks

Fairly simple algorithm

Highly iterative

Compared to Apache Mahout

Behaves better than ALS-WR

Behaves similarly to RecommenderJob and ItemSimilarityJob

18 - K-Means clustering

We have a bunch of data (e.g. documents)

We want to group those docs into k homogeneous clusters

Iteratively for each cluster

Calculate new cluster center

Add docs nearest to the new center to the cluster

19 - K-Means clustering

20 - K-Means clustering BSP

Iteratively

Superstep 1*i

Assignment phase

Read vector splits

Sum up temporary centers with assigned vectors

Broadcast sum and ingested vectors count

Superstep 2*i

Update phase

Calculate the total sum over all received messages and average

Replace old centers with new centers and check for convergence
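
The two phases above can be sketched in plain Java by treating each input split as one task. This is a minimal single-process sketch under stated assumptions: the "broadcast" is simulated with a list of per-task partial results, the last column of each partial row carries the vector count, and all names (`KMeansBspSketch`, `assign`, `update`, `cluster`) are illustrative rather than Hama's own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class KMeansBspSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return sum;
    }

    /** Assignment phase for one task: per-center vector sums, with the count in the last column. */
    static double[][] assign(double[][] split, double[][] centers) {
        int k = centers.length;
        int d = centers[0].length;
        double[][] partial = new double[k][d + 1];
        for (double[] vector : split) {
            int best = 0;
            for (int c = 1; c < k; c++) {
                if (squaredDistance(vector, centers[c]) < squaredDistance(vector, centers[best])) {
                    best = c;
                }
            }
            for (int j = 0; j < d; j++) {
                partial[best][j] += vector[j];
            }
            partial[best][d] += 1;
        }
        return partial;
    }

    /** Update phase: total the received partial sums and counts, then average. */
    static double[][] update(List<double[][]> messages, double[][] oldCenters) {
        int k = oldCenters.length;
        int d = oldCenters[0].length;
        double[][] total = new double[k][d + 1];
        for (double[][] message : messages) {
            for (int c = 0; c < k; c++) {
                for (int j = 0; j <= d; j++) {
                    total[c][j] += message[c][j];
                }
            }
        }
        double[][] centers = new double[k][d];
        for (int c = 0; c < k; c++) {
            for (int j = 0; j < d; j++) {
                // Keep the old center when no vector was assigned to it.
                centers[c][j] = total[c][d] == 0 ? oldCenters[c][j] : total[c][j] / total[c][d];
            }
        }
        return centers;
    }

    /** Repeats both phases until no center moves by more than the tolerance (squared distance). */
    static double[][] cluster(double[][][] splits, double[][] initialCenters,
                              int maxIterations, double tolerance) {
        double[][] centers = initialCenters;
        for (int i = 0; i < maxIterations; i++) {
            List<double[][]> messages = new ArrayList<>();
            for (double[][] split : splits) {
                messages.add(assign(split, centers));  // one simulated task per split
            }
            double[][] next = update(messages, centers);
            boolean converged = true;
            for (int c = 0; c < centers.length; c++) {
                if (squaredDistance(next[c], centers[c]) > tolerance) {
                    converged = false;
                }
            }
            centers = next;
            if (converged) {
                break;
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][][] splits = {{{0, 0}, {0, 2}}, {{10, 10}, {10, 12}}};
        double[][] centers = cluster(splits, new double[][] {{0, 0}, {10, 10}}, 10, 1e-9);
        System.out.println(Arrays.deepToString(centers));
    }
}
```

Only sums and counts cross task boundaries, so the amount of communication depends on k and the vector dimension rather than on the size of the dataset.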

21 - Benchmarks

One-rack cluster (16 nodes, 256 cores)

10G network

On average faster than Mahout’s implementation

22 - Gradient descent

Optimization algorithm

Find a (local) minimum of some function

Used for

solving linear systems

solving nonlinear systems

in machine learning tasks

linear regression

logistic regression

neural network backpropagation

…

23 - Gradient descent

Minimize a given (cost) function

Give the function a starting point (set of parameters)

Iteratively change parameters in order to minimize the function

Stop at the (local) minimum

There’s some math but intuitively:

evaluate derivatives at a given point in order to choose where to “go” next
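
The intuition above fits in a few lines for a function of one variable. This is a minimal sketch, assuming a fixed learning rate and a stop condition based on step size; the function f(x) = (x - 3)^2 and all names are hypothetical examples chosen for illustration.

```java
import java.util.function.DoubleUnaryOperator;

final class GradientDescentSketch {

    /**
     * Starts at `start` and repeatedly moves against the derivative.
     * Stops when the step becomes smaller than `tolerance` or after `maxIterations`.
     */
    static double minimize(DoubleUnaryOperator derivative, double start,
                           double learningRate, int maxIterations, double tolerance) {
        double x = start;
        for (int i = 0; i < maxIterations; i++) {
            // The derivative at the current point tells us where to "go" next.
            double step = learningRate * derivative.applyAsDouble(x);
            x -= step;
            if (Math.abs(step) < tolerance) {
                break;
            }
        }
        return x;
    }

    public static void main(String[] args) {
        // Hypothetical cost function f(x) = (x - 3)^2, whose derivative is 2 * (x - 3).
        double minimum = minimize(x -> 2 * (x - 3), 0.0, 0.1, 1000, 1e-9);
        System.out.println(minimum);
    }
}
```

With a learning rate that is too large the iterates can overshoot and diverge, which is why the cost is usually monitored during the run.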

24 - Gradient descent BSP

Iteratively

Superstep 1*i

each task calculates and broadcasts portions of the cost function with the current parameters

Superstep 2*i

aggregate and update cost function

check the aggregated cost and iterations count

cost should always decrease

Superstep 3*i

each task calculates and broadcasts portions of (partial) derivatives

Superstep 4*i

aggregate and update parameters
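
The four supersteps above can be sketched for a one-feature linear model h(x) = theta0 + theta1 * x. This is a minimal single-process sketch, assuming a squared-error cost and treating each split as one task whose partial results are summed in place of a broadcast. The class name, the toy data in `main` and the choice to stop once the cost no longer decreases are assumptions made for illustration, not details taken from the Hama module.

```java
import java.util.Arrays;

final class LinearRegressionBspSketch {

    /** Superstep 1 for one task: sum of squared errors over its split. Rows are {x, y}. */
    static double partialCost(double[][] split, double[] theta) {
        double sum = 0;
        for (double[] row : split) {
            double error = theta[0] + theta[1] * row[0] - row[1];
            sum += error * error;
        }
        return sum;
    }

    /** Superstep 3 for one task: its portion of the partial derivatives. */
    static double[] partialGradient(double[][] split, double[] theta) {
        double[] gradient = new double[2];
        for (double[] row : split) {
            double error = theta[0] + theta[1] * row[0] - row[1];
            gradient[0] += error;
            gradient[1] += error * row[0];
        }
        return gradient;
    }

    static double[] fit(double[][][] splits, double learningRate, int maxIterations) {
        int m = 0;
        for (double[][] split : splits) {
            m += split.length;
        }
        double[] theta = new double[2];
        double previousCost = Double.MAX_VALUE;
        for (int i = 0; i < maxIterations; i++) {
            // Supersteps 1 and 2: aggregate the cost and check that it still decreases.
            double cost = 0;
            for (double[][] split : splits) {
                cost += partialCost(split, theta);
            }
            cost /= 2.0 * m;
            if (cost >= previousCost) {
                break;
            }
            previousCost = cost;

            // Supersteps 3 and 4: aggregate the derivatives and update the parameters.
            double[] gradient = new double[2];
            for (double[][] split : splits) {
                double[] part = partialGradient(split, theta);
                gradient[0] += part[0];
                gradient[1] += part[1];
            }
            theta[0] -= learningRate * gradient[0] / m;
            theta[1] -= learningRate * gradient[1] / m;
        }
        return theta;
    }

    public static void main(String[] args) {
        // Toy data generated from y = 1 + 2x, split across two simulated tasks.
        double[][][] splits = {{{0, 1}, {1, 3}}, {{2, 5}, {3, 7}}};
        System.out.println(Arrays.toString(fit(splits, 0.1, 10000)));
    }
}
```

Both the cost and the derivatives are plain sums over the data, so each task only has to send a few numbers per superstep regardless of how many vectors it holds.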

25 - Gradient descent BSP

Simplistic example

Linear regression

Given real estate market dataset

Estimate new house prices given known houses’ sizes, geographic regions and prices

Expected output: actual parameters for the (linear) prediction function

26 - Gradient descent BSP

Generate a different model for each region

House item vectors

price -> size

150k -> 80

2 dimensional space

~1.3M vectors dataset

27 - Gradient descent BSP

Dataset and model fit

28 - Gradient descent BSP

Cost checking

29 - Gradient descent BSP

Classification

Logistic regression with gradient descent

Real estate market dataset

We want to find which estate listings belong to agencies

To avoid buying from them

Same algorithm

With different cost function and features

Existing items are tagged or not as “belonging to agency”

Create vectors from items’ text

Sample vector

1 -> 1 3 0 0 5 3 4 1
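
To make "same algorithm with different cost function" concrete, the sketch below swaps the squared-error cost for the usual logistic (cross-entropy) cost and keeps the same update loop. It is a minimal single-process sketch, assuming the sample vector format is a 0/1 label followed by numeric features; the class name and the tiny dataset in `main` are hypothetical and unrelated to the real estate data.

```java
import java.util.Arrays;

final class LogisticRegressionSketch {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Predicted probability of label 1; theta[0] is the bias term. */
    static double predict(double[] theta, double[] features) {
        double z = theta[0];
        for (int j = 0; j < features.length; j++) {
            z += theta[j + 1] * features[j];
        }
        return sigmoid(z);
    }

    /** Average cross-entropy cost over the dataset. */
    static double cost(double[] theta, double[][] xs, int[] ys) {
        double sum = 0;
        for (int i = 0; i < xs.length; i++) {
            double h = predict(theta, xs[i]);
            sum += ys[i] == 1 ? -Math.log(h) : -Math.log(1 - h);
        }
        return sum / xs.length;
    }

    /** Plain (batch) gradient descent with a fixed learning rate. */
    static double[] fit(double[][] xs, int[] ys, double learningRate, int iterations) {
        int m = xs.length;
        int n = xs[0].length;
        double[] theta = new double[n + 1];
        for (int it = 0; it < iterations; it++) {
            double[] gradient = new double[n + 1];
            for (int i = 0; i < m; i++) {
                double error = predict(theta, xs[i]) - ys[i];
                gradient[0] += error;
                for (int j = 0; j < n; j++) {
                    gradient[j + 1] += error * xs[i][j];
                }
            }
            for (int j = 0; j <= n; j++) {
                theta[j] -= learningRate * gradient[j] / m;
            }
        }
        return theta;
    }

    public static void main(String[] args) {
        // Hypothetical one-feature dataset: small values are label 0, large values label 1.
        double[][] xs = {{0}, {1}, {3}, {4}};
        int[] ys = {0, 0, 1, 1};
        double[] theta = fit(xs, ys, 0.1, 5000);
        System.out.println(Arrays.toString(theta));
        System.out.println(predict(theta, new double[] {4}));
    }
}
```

The gradient has the same shape as in the linear case (prediction error times feature), which is why only the prediction and cost functions need to change.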

30 - Gradient descent BSP

Classification

31 - Benchmarks

Not directly comparable to Mahout’s regression algorithms

Both SGD and CGD are inherently better than plain GD

But Hama GD had on average the same performance as Mahout’s SGD / CGD

Next step is implementing SGD / CGD on top of Hama

32 - Wrap up

Even if

ML module is still “young” / work in progress

and tools like Apache Mahout have better “coverage”

Apache Hama can be particularly useful in certain “highly iterative” use cases

Interesting benchmarks

33 - Thanks!
