This page reproduces the content of http://www.slideshare.net/dbtsai/m-llib-sf-machine-learning.


by DB Tsai

Uploaded over 2 years ago (2014/05/09) in Technology


Spark is a new cluster computing engine that is rapidly gaining popularity — with over 150 contributors in the past year, it is one of the most active open source projects in big data, surpassing even Hadoop MapReduce. Spark was designed to both make traditional MapReduce programming easier and to support new types of applications, with one of the earliest focus areas being machine learning. In this talk, we’ll introduce Spark and show how to use it to build fast, end-to-end machine learning workflows. Using Spark’s high-level API, we can process raw data with familiar libraries in Java, Scala or Python (e.g. NumPy) to extract the features for machine learning. Then, using MLlib, its built-in machine learning library, we can run scalable versions of popular algorithms. We’ll also cover upcoming development work including new built-in algorithms and R bindings.

Bio:

Xiangrui Meng is a software engineer at Databricks. He has been actively involved in the development of Spark MLlib since he joined. Before Databricks, he worked as an applied research engineer at LinkedIn, where he was the main developer of an offline machine learning framework in Hadoop MapReduce. His thesis work at Stanford is on randomized algorithms for large-scale linear regression.

- MLlib: Scalable Machine Learning on Spark

Xiangrui Meng - What is MLlib?

2 - What is MLlib?

MLlib is a Spark subproject providing machine learning primitives:

• initial contribution from AMPLab, UC Berkeley

• shipped with Spark since version 0.8

• 35 contributors

3 - What is MLlib?

Algorithms:

• classification: logistic regression, linear support vector machine (SVM), naive Bayes, classification tree

• regression: generalized linear models (GLMs), regression tree

• collaborative filtering: alternating least squares (ALS)

• clustering: k-means

• decomposition: singular value decomposition (SVD), principal component analysis (PCA)

4 - Why MLlib?

5 - scikit-learn?

Algorithms:

• classification: SVM, nearest neighbors, random forest, …

• regression: support vector regression (SVR), ridge regression, Lasso, logistic regression, …

• clustering: k-means, spectral clustering, …

• decomposition: PCA, non-negative matrix factorization (NMF), independent component analysis (ICA), …

6 - Mahout?

Algorithms:

• classification: logistic regression, naive Bayes, random forest, …

• collaborative filtering: ALS, …

• clustering: k-means, fuzzy k-means, …

• decomposition: SVD, randomized SVD, …

7 - Mahout?

LIBLINEAR?

Vowpal Wabbit?

H2O?

MATLAB?

R?

scikit-learn?

Weka?

GraphLab?

8 - Why MLlib?

9 - Why MLlib?

• It is built on Apache Spark, a fast and general engine for large-scale data processing.

• Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

• Write applications quickly in Java, Scala, or Python.

10 - Spark philosophy

Make life easy and productive for data scientists:

• Well documented, expressive APIs

• Powerful domain specific libraries

• Easy integration with storage systems

• … and caching to avoid data movement - Word count (Scala)

val counts = sc.textFile("hdfs://...")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1L))
  .reduceByKey(_ + _) - Word count (SparkR)

lines <- textFile(sc, "hdfs://...")
words <- flatMap(lines, function(line) {
  unlist(strsplit(line, " "))
})
wordCount <- lapply(words, function(word) {
  list(word, 1L)
})
counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts) - Gradient descent

w ← w − α · ∑_{i=1}^{n} g(w; x_i, y_i)

val points = spark.textFile(...).map(parsePoint).cache()
var w = Vector.zeros(d)
for (i <- 1 to numIterations) {
  val gradient = points.map { p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  }.reduce(_ + _)
  w -= alpha * gradient
}
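The same update can be checked locally with NumPy. This is a hedged sketch, not MLlib's implementation; the toy data, learning rate, and iteration count are invented for the example:

```python
import numpy as np

def logistic_gradient_descent(X, y, num_iterations=100, alpha=0.1):
    """Batch gradient descent for logistic regression with labels in
    {-1, +1}; the per-point gradient matches the Spark snippet above:
    (1 / (1 + exp(-y * w.x)) - 1) * y * x."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iterations):
        coeffs = 1.0 / (1.0 + np.exp(-y * (X @ w))) - 1.0  # one per point
        w -= alpha * (coeffs * y) @ X                       # summed gradient
    return w

# Toy, linearly separable data.
X = np.array([[1.0, 0.0], [2.0, 1.0], [0.0, 1.0], [1.0, 3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = logistic_gradient_descent(X, y)
predictions = np.sign(X @ w)
```

On separable data such as this, the learned w classifies every training point correctly.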

14 - k-means (scala)

// Load and parse the data.
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(_.split(' ').map(_.toDouble)).cache()

// Cluster the data into two classes using KMeans.
val clusters = KMeans.train(parsedData, 2, numIterations = 20)

// Compute the sum of squared errors.
val cost = clusters.computeCost(parsedData)
println("Sum of squared errors = " + cost)

15 - k-means (python)

# Load and parse the data
data = sc.textFile("kmeans_data.txt")
parsedData = data.map(lambda line:
    array([float(x) for x in line.split(' ')])).cache()

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations = 10,
                        runs = 1, initialization_mode = "kmeans||")

# Evaluate clustering by computing the sum of squared errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sum([x**2 for x in (point - center)])

cost = parsedData.map(lambda point: error(point)) \
                 .reduce(lambda x, y: x + y)
print("Sum of squared errors = " + str(cost))
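For experimenting without a cluster, the same train/evaluate flow can be sketched in plain NumPy with Lloyd's algorithm (a simplified stand-in for KMeans.train; the naive initialization and toy data are invented for the example):

```python
import numpy as np

def kmeans(points, k, max_iterations=10):
    """Lloyd's algorithm: assign each point to its closest center,
    then move each center to the mean of its assigned points."""
    centers = points[:k].copy()  # naive init (MLlib also offers k-means||)
    for _ in range(max_iterations):
        # squared distance from every point to every center
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels

data = np.array([[0.0, 0.0], [0.1, 0.1], [9.0, 8.0], [8.0, 9.0]])
centers, labels = kmeans(data, 2)
# sum of squared errors, as computed above
cost = sum(((p - centers[l]) ** 2).sum() for p, l in zip(data, labels))
```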

16 - Dimension reduction + k-means

// compute principal components
val points: RDD[Vector] = ...
val mat = new RowMatrix(points)
val pc = mat.computePrincipalComponents(20)

// project points to a low-dimensional space
val projected = mat.multiply(pc).rows

// train a k-means model on the projected data
val model = KMeans.train(projected, 10) - Collaborative filtering

// Load and parse the data
val data = sc.textFile("mllib/data/als/test.data")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) =>
    Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS
val numIterations = 20
val model = ALS.train(ratings, 1, numIterations, 0.01)

// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate) =>
  (user, product)
}
val predictions = model.predict(usersProducts)
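The alternating least squares scheme behind ALS.train can be sketched locally in NumPy: fix V and solve a small ridge problem per row of U, then swap. This is a toy illustration, not MLlib's distributed implementation; the rank-1 matrix, λ = 0.01, and iteration count are invented:

```python
import numpy as np

def als(M, mask, k, num_iterations=20, lam=0.01):
    """Alternate: with V fixed, each row of U has a closed-form
    ridge-regression solution over its observed entries; then the
    same with U fixed for the rows of V."""
    m, n = M.shape
    rng = np.random.default_rng(0)
    U = rng.standard_normal((m, k))
    V = rng.standard_normal((n, k))
    for _ in range(num_iterations):
        for i in range(m):
            Vo = V[mask[i]]                      # factors of observed items
            A = Vo.T @ Vo + lam * np.eye(k)
            U[i] = np.linalg.solve(A, Vo.T @ M[i, mask[i]])
        for j in range(n):
            Uo = U[mask[:, j]]                   # factors of observing users
            A = Uo.T @ Uo + lam * np.eye(k)
            V[j] = np.linalg.solve(A, Uo.T @ M[mask[:, j], j])
    return U, V

# Fully observed rank-1 ratings matrix.
M = np.outer([1.0, 2.0, 3.0], [1.0, 2.0])
mask = np.ones_like(M, dtype=bool)
U, V = als(M, mask, k=1)
rmse = np.sqrt(((U @ V.T - M)[mask] ** 2).mean())
```

Because each row solve is independent, this is the structure that makes ALS easy to parallelize across users and products.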

18 - Why MLlib?

• It ships with Spark as a standard component.

19 - Spark community

One of the largest open source projects in big data:

• 170+ developers contributing

• 30+ companies contributing

• 400+ discussions per month on the mailing list - 30-day commit activity

[Charts: 30-day commit activity — patches, lines added, and lines removed — for MapReduce, Storm, Yarn, and Spark.] - Out for dinner?!

• Search for a restaurant and make a reservation.

• Start navigation.

• Food looks good? Take a photo and share.

22 - Why smartphone?

Out for dinner?!

• Search for a restaurant and make a reservation. (Yellow Pages?)

• Start navigation. (GPS?)

• Food looks good? Take a photo and share. (Camera?)

23 - Why MLlib?

A special-purpose device may be better at one aspect than a general-purpose device. But the cost of context switching is high:

• different languages or APIs

• different data formats

• different tuning tricks

24 - Spark SQL + MLlib

// Data can easily be extracted from existing sources,

// such as Apache Hive.

val trainingTable = sql("""
  SELECT e.action, u.age, u.latitude, u.longitude
  FROM Users u
  JOIN Events e ON u.userId = e.userId""")

// Since `sql` returns an RDD, the results of the above
// query can be easily used in MLlib.
val training = trainingTable.map { row =>
  val features = Vectors.dense(row(1), row(2), row(3))
  LabeledPoint(row(0), features)
}

val model = SVMWithSGD.train(training) - Streaming + MLlib

// collect tweets using streaming

// train a k-means model
val model: KMeansModel = ...

// apply model to filter tweets
val tweets = TwitterUtils.createStream(ssc, Some(authorizations(0)))
val statuses = tweets.map(_.getText)
val filteredTweets =
  statuses.filter(t => model.predict(featurize(t)) == clusterNumber)

// print tweets within this particular cluster
filteredTweets.print() - GraphX + MLlib

// assemble link graph
val graph = Graph(pages, links)
val pageRank: RDD[(Long, Double)] = graph.staticPageRank(10).vertices

// load page labels (spam or not) and content features
val labelAndFeatures: RDD[(Long, (Double, Seq[(Int, Double)]))] = ...
val training: RDD[LabeledPoint] =
  labelAndFeatures.join(pageRank).map {
    case (id, ((label, features), pageRank)) =>
      LabeledPoint(label, Vectors.sparse(1001, features :+ ((1000, pageRank))))
  }

// train a spam detector using logistic regression
val model = LogisticRegressionWithSGD.train(training) - Why MLlib?

• Built on Spark’s lightweight yet powerful APIs.

• Spark’s open source community.

• Seamless integration with Spark’s other components.

• Comparable to or even better than other libraries specialized in large-scale machine learning.

28 - Why MLlib?

• Scalability

• Performance

• User-friendly APIs

• Integration with Spark and its other components

29 - Logistic regression

30 - Logistic regression - weak scaling

[Fig. 5: Walltime for weak scaling for logistic regression. Fig. 6: Weak scaling for logistic regression. Systems compared: MLbase/MLlib, VW, Matlab, and Ideal; n from 6K to 200K, d = 160K.]

• Full dataset: 200K images, 160K dense features.

• Similar weak scaling.

• MLlib within a factor of 2 of VW's wall-clock time.

31 - Logistic regression - strong scaling

[Fig. 7: Walltime for strong scaling for logistic regression. Fig. 8: Strong scaling for logistic regression. Systems compared: MLbase/MLlib, VW, Matlab, and Ideal; 1 to 32 machines.]

• Fixed dataset: 50K images, 160K dense features.

• MLlib exhibits better scaling properties.

• MLlib is faster than VW with 16 and 32 machines.

32 - Collaborative filtering

33 - Collaborative filtering

• Recover a rating matrix from a subset of its entries.

[Diagram: partially observed rating matrix with "?" entries.]

34 - Alternating least squares (ALS)

35 - ALS - wall-clock time

System     Wall-clock time (seconds)
Matlab     15443
Mahout     4206
GraphLab   291
MLlib      481

• Dataset: scaled version of Netflix data (9X in size).

• Cluster: 9 machines.

• MLlib is an order of magnitude faster than Mahout.

• MLlib is within a factor of 2 of GraphLab.

36 - Implementation
- Implementation of k-means

Initialization:

• random

• k-means++

• k-means|| - Implementation of k-means

Iterations:

• For each point, find its closest center:

  l_i = argmin_j ||x_i − c_j||²

• Update cluster centers:

  c_j = ( ∑_{i : l_i = j} x_i ) / ( ∑_{i : l_i = j} 1 ) - Implementation of k-means

The points are usually sparse, but the centers are most likely to be dense. Computing the distance takes O(d) time, so the time complexity is O(n d k) per iteration; we don't take any advantage of sparsity in the running time. However, we have

||x − c||² = ||x||² + ||c||² − 2⟨x, c⟩

Computing the inner product only needs the non-zero elements, so we can cache the norms of the points and of the centers and then only need the inner products to obtain the distances. This reduces the running time to O(nnz·k + d·k) per iteration.
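The identity can be checked numerically. In this hedged sketch a sparse point is represented as a dict from index to value (the representation and the toy vectors are invented for the example):

```python
def fast_sq_dist(x, x_norm_sq, c, c_norm_sq):
    """||x - c||^2 = ||x||^2 + ||c||^2 - 2<x, c>; with the squared
    norms cached, only the non-zeros of x are touched, so the cost
    per distance is O(nnz(x)) instead of O(d)."""
    inner = sum(v * c[i] for i, v in x.items())
    return x_norm_sq + c_norm_sq - 2.0 * inner

# A sparse point over d = 6 dimensions, and a dense center.
x = {0: 1.0, 3: 2.0}                          # non-zero entries only
c = [0.5, 0.0, 1.0, 1.5, 0.0, 2.0]
x_norm_sq = sum(v * v for v in x.values())    # cached once per point
c_norm_sq = sum(v * v for v in c)             # cached once per center
fast = fast_sq_dist(x, x_norm_sq, c, c_norm_sq)

# Dense check.
dense_x = [x.get(i, 0.0) for i in range(len(c))]
direct = sum((a - b) ** 2 for a, b in zip(dense_x, c))
```

The catch behind the slide's closing question: when x and c are nearly equal, the two large norm terms cancel against the inner product and the computed distance loses floating-point precision, so the expanded form trades some accuracy for speed.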


However, is it accurate? - Implementation of ALS

• broadcast everything

• data parallel

• fully parallel

41 - Broadcast everything

[Diagram: master holds Ratings, Movie Factors, and User Factors; workers receive broadcasts.]

• Master loads (small) data file and initializes models.

• Master broadcasts data and initial models.

• At each iteration, updated models are broadcast again.

• Works OK for small data.

• Lots of communication overhead - doesn't scale well.

• Ships with Spark Examples.

42 - Data parallel

[Diagram: workers hold partitions of Ratings; master broadcasts Movie Factors and User Factors.]

• Workers load data.

• Master broadcasts initial models.

• At each iteration, updated models are broadcast again.

• Much better scaling.

• Works on large datasets.

• Works well for smaller models (low k).

43 - Fully parallel

[Diagram: each worker holds a partition of Ratings together with its Movie Factors and User Factors.]

• Workers load data.

• Models are instantiated at workers.

• At each iteration, models are shared via join between workers.

• Much better scalability.

• Works on large datasets.

44 - Implementation of ALS

• broadcast everything

• data parallel

• fully parallel

• block-wise parallel

• Users/products are partitioned into blocks and the join is based on blocks instead of individual users/products.

45 - New features for v1.0

• Sparse data support

• Classification and regression tree (CART)

• Tall-and-skinny SVD and PCA

• L-BFGS

• Model evaluation

46 - MLlib v1.1?

• Model selection

• training multiple models in parallel

• separating problem/algorithm/parameters/model

• Learning algorithms

• Latent Dirichlet allocation (LDA)

And?

• Random forests

• Online updates with Spark Streaming

• Optimization algorithms

• Alternating direction method of multipliers (ADMM)

• Accelerated gradient descent - Contributors

Ameet Talwalkar, Andrew Tulloch, Chen Chao, Nan Zhu, DB Tsai, Evan Sparks, Frank Dai, Ginger Smith, Henry Saputra, Holden Karau, Hossein Falaki, Jey Kottalam, Cheng Lian, Marek Kolodziej, Mark Hamstra, Martin Jaggi, Martin Weindel, Matei Zaharia, Nick Pentreath, Patrick Wendell, Prashant Sharma, Reynold Xin, Reza Zadeh, Sandy Ryza, Sean Owen, Shivaram Venkataraman, Tor Myklebust, Xiangrui Meng, Xinghao Pan, Xusen Yin, Jerry Shao, Sandeep Singh, Ryan LeCompte

48