This page reproduces the content of http://www.slideshare.net/sscdotopen/next-directions-in-mahouts-recommenders.

Uploaded about three years ago (2013/08/29) in Technology

Slides from my talk "Next directions in Mahout’s recommenders" given at the Bay Area Mahout Meetup

Next Directions in Mahout’s Recommenders

Sebastian Schelter, Apache Software Foundation
Bay Area Mahout Meetup

About me

- PhD student at the Database Systems and Information Management Group of Technische Universität Berlin
- Member of the Apache Software Foundation, committer on Mahout and Giraph
- currently interning at IBM Research Almaden

Next Directions?

- Mahout in Action is the prime source of information for using Mahout in practice.
- As it is more than two years old (and only covers Mahout 0.5), it is missing a lot of recent developments.
- This talk describes what has been added to the recommenders of Mahout since then and gives suggestions on directions for future versions of Mahout.

Collaborative Filtering 101

Collaborative Filtering

- Problem: given a user’s interactions with items, guess which other items would be highly preferred
- Collaborative Filtering: infer recommendations from patterns found in the historical user-item interactions
- data can be explicit feedback (ratings) or implicit feedback (clicks, pageviews), represented in the interaction matrix A

            item1  ···  item3  ···
    user1     3    ···    4    ···
    user2     −    ···    4    ···
    user3     5    ···    1    ···
     ···     ···   ···   ···   ···

Neighborhood Methods

User-based:
- for each user, compute a "jury" of users with similar taste
- pick the recommendations from the "jury’s" items

Item-based:
- for each item, compute a set of items with a similar interaction pattern
- pick the recommendations from those similar items

Neighborhood Methods

The item-based variant is the most popular:

- simple and intuitively understandable
- additionally gives non-personalized, per-item recommendations (people who like X might also like Y)
- recommendations for new users without model retraining
- comprehensible explanations (we recommend Y because you liked X)
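As a toy illustration (not Mahout code), the item-based scheme can be sketched in a few lines of Python: score each unseen item by the similarity-weighted sum of the user's ratings for its similar items. The data and function name here are invented for the example; in Mahout this logic lives inside the item-based Recommender implementations.

```python
def recommend_item_based(user_ratings, similarity, top_n=2):
    """Score items the user has not seen yet by the similarity-weighted
    sum of the user's known ratings (the item-based scheme)."""
    scores = {}
    for rated_item, rating in user_ratings.items():
        for (a, b), sim in similarity.items():
            if a == rated_item and b not in user_ratings:
                scores[b] = scores.get(b, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# toy data: one user's ratings and a few precomputed item-item similarities
user = {"matrix": 5.0, "alien": 4.0}
sims = {("matrix", "terminator"): 0.9, ("alien", "terminator"): 0.8,
        ("matrix", "titanic"): 0.1}
print(recommend_item_based(user, sims))  # -> ['terminator', 'titanic']
```

Because "terminator" is strongly similar to two highly-rated items, it outranks "titanic", which matches only weakly.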

Latent factor models

- Idea: interactions are deeply influenced by a set of factors that are very specific to the domain (e.g. the amount of action or the complexity of characters in movies)
- these factors are in general not obvious and need to be inferred from the interaction data
- both users and items can be described in terms of these factors

Matrix factorization

- Computing a latent factor model: approximately factor A into the product of two rank-k feature matrices U and M such that A ≈ UM
- U models the latent features of the users, M models the latent features of the items
- the dot product uᵢᵀmⱼ in the latent feature space predicts the strength of interaction between user i and item j

    A (u × i)  ≈  U (u × k)  ×  M (k × i)
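The factorization idea can be illustrated with numpy on a tiny, fully observed toy matrix, using a truncated SVD to obtain a rank-k approximation. This is only a sketch of the concept: Mahout's factorizers learn U and M from the observed entries rather than computing a full SVD.

```python
import numpy as np

# toy interaction matrix A (users x items), fully observed for simplicity
A = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0]])

k = 2  # number of latent factors
P, s, Qt = np.linalg.svd(A, full_matrices=False)
U = P[:, :k] * s[:k]   # user feature matrix (u x k)
M = Qt[:k, :]          # item feature matrix (k x i)

# the dot product of a user row and an item column predicts the
# strength of interaction between that user and item
prediction = U @ M
print(np.round(prediction, 1))  # entry-wise close to A
```

Even with only two factors the reconstruction stays close to A, which is the whole point: a small number of latent dimensions captures most of the interaction structure.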

Single machine recommenders

Taste

- based on Sean Owen’s Taste framework (started in 2005)
- mature and stable codebase
- Recommender implementations encapsulate recommender algorithms
- DataModel implementations handle interaction data in memory, files, databases, key-value stores
- but the focus was mostly on neighborhood methods:
  - lack of implementations for latent factor models
  - little support for scientific usecases (e.g. recommender contests)

Collaboration

- MyMediaLite, a scientific library of recommender system algorithms: http://www.mymedialite.net/
- Mahout now features a couple of popular latent factor models, mostly ported by Zeno Gantner.

Lots of different Factorizers for our SVDRecommender

- RatingSGDFactorizer: biased matrix factorization
  Koren et al.: Matrix Factorization Techniques for Recommender Systems, IEEE Computer ’09
- SVDPlusPlusFactorizer: SVD++
  Koren: Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model, KDD ’08
- ALSWRFactorizer: matrix factorization using Alternating Least Squares
  Zhou et al.: Large-Scale Parallel Collaborative Filtering for the Netflix Prize, AAIM ’08
  Hu et al.: Collaborative Filtering for Implicit Feedback Datasets, ICDM ’08
- ParallelSGDFactorizer: parallel version of biased matrix factorization (contributed by Peng Cheng)
  Takács et al.: Scalable Collaborative Filtering Approaches for Large Recommender Systems, JMLR ’09
  Niu et al.: Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, NIPS ’11
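To make the biased-matrix-factorization idea behind RatingSGDFactorizer concrete, here is a minimal plain-Python sketch trained with stochastic gradient descent. It is not Mahout's implementation; the toy dataset and hyperparameters are invented for the example.

```python
import random

# observed (user, item, rating) triples -- tiny toy dataset
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0), (2, 2, 5.0)]
n_users, n_items, k = 3, 3, 2
lr, reg = 0.01, 0.02

random.seed(42)
mu = sum(r for _, _, r in ratings) / len(ratings)   # global average rating
bu = [0.0] * n_users                                # user biases
bi = [0.0] * n_items                                # item biases
U = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
M = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]

def predict(u, i):
    return mu + bu[u] + bi[i] + sum(U[u][f] * M[i][f] for f in range(k))

for epoch in range(500):                 # SGD over the observed ratings
    for u, i, r in ratings:
        err = r - predict(u, i)
        bu[u] += lr * (err - reg * bu[u])
        bi[i] += lr * (err - reg * bi[i])
        for f in range(k):
            uf, mf = U[u][f], M[i][f]
            U[u][f] += lr * (err * mf - reg * uf)
            M[i][f] += lr * (err * uf - reg * mf)

print(round(predict(0, 0), 1))  # should be close to the observed 5.0
```

ParallelSGDFactorizer applies the same updates from multiple threads in the lock-free Hogwild! style referenced above.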

Next directions

- better tooling for cross-validation and hold-out tests (e.g. time-based splits of interactions)
- memory-efficient DataModel implementations tailored to specific usecases (e.g. matrix factorization with SGD)
- better support for computing recommendations for "anonymous" users
- online recommenders

Usage

- researchers at TU Berlin and CWI Amsterdam regularly use Mahout for their recommender research published at international conferences
- "Bayerischer Rundfunk", one of Germany’s largest public TV broadcasters, uses Mahout to help users discover TV content in its online media library
- the Berlin-based company plista runs a live contest for the best news recommender algorithm and provides Mahout-based "skeleton code" to participants
- the Dutch Institute of Sound and Vision runs a web platform that uses Mahout for recommending content from its archive of Dutch audio-visual heritage collections of the 20th century

Parallel processing

Distribution

- a difficult environment:
  - data is partitioned and stored in a distributed filesystem
  - algorithms must be expressed in MapReduce
- our distributed implementations focus on two popular methods:
  - item-based collaborative filtering
  - matrix factorization with Alternating Least Squares

Scalable neighborhood methods

Cooccurrences

- start with a simplified view: imagine the interaction matrix A was binary → we look at cooccurrences only
- item similarity computation becomes matrix multiplication: S = AᵀA
- scale-out of the item-based approach reduces to finding an efficient way to compute this item similarity matrix
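In numpy terms, the simplified cooccurrence view looks like this (a tiny in-memory illustration, not the distributed code):

```python
import numpy as np

# binary interaction matrix A: rows = users, columns = items
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])

# item-item cooccurrence matrix: S[f, j] = number of users who
# interacted with both item f and item j
S = A.T @ A
print(S)
```

The diagonal of S counts how many users interacted with each item, and the off-diagonal entries are exactly the pairwise cooccurrence counts the item-based approach needs.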

Parallelizing S = AᵀA

The standard approach of computing item cooccurrences requires random access to both users and items:

    foreach item f do
      foreach user i who interacted with f do
        foreach item j that i also interacted with do
          S[f][j] = S[f][j] + 1

→ not efficiently parallelizable on partitioned data

The row outer-product formulation of matrix multiplication is efficiently parallelizable on a row-partitioned A:

    S = AᵀA = Σᵢ aᵢᵀ aᵢ    (sum of the outer products of the rows aᵢ of A)

Mappers compute the outer products of the rows of A and emit the results row-wise; reducers sum these up to form S.
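The row outer-product formulation can be sketched as a single-process map/reduce simulation in Python; in Mahout the "map" and "reduce" phases run as Hadoop tasks over a row-partitioned A.

```python
import numpy as np
from collections import defaultdict

A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])

# "map": each row a_i of A yields its outer product, emitted row-wise;
# in the real job, the rows of A live on different machines
emitted = []
for a_i in A:
    outer = np.outer(a_i, a_i)
    for row_index, partial_row in enumerate(outer):
        emitted.append((row_index, partial_row))

# "reduce": sum all partial rows with the same index to form S = A^T A
rows = defaultdict(lambda: np.zeros(A.shape[1]))
for row_index, partial_row in emitted:
    rows[row_index] = rows[row_index] + partial_row
S = np.vstack([rows[i] for i in range(A.shape[1])])

assert (S == A.T @ A).all()   # same result as the direct multiplication
print(S)
```

Each mapper only ever touches one row of A at a time, which is why this formulation parallelizes cleanly on partitioned data while the triple loop above does not.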

Parallel similarity computation

Many more details in the implementation:

- support for various similarity measures
- various optimizations (e.g. for symmetric similarity measures)
- downsampling of skewed interaction data

In-depth description available in:
Sebastian Schelter, Christoph Boden, Volker Markl: Scalable Similarity-Based Neighborhood Methods with MapReduce, ACM RecSys 2012

Implementation in Mahout

- o.a.m.math.hadoop.similarity.cooccurrence.RowSimilarityJob
  computes the top-k pairwise similarities for each row of a matrix using some similarity measure
- o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob
  computes the top-k similar items per item using RowSimilarityJob
- o.a.m.cf.taste.hadoop.item.RecommenderJob
  computes recommendations and similar items using RowSimilarityJob

Scalable Neighborhood Methods: Experiments

Setup:
- 6 machines running Java 7 and Hadoop 1.0.4
- two 4-core Opteron CPUs, 32 GB memory and four 1 TB disk drives per machine

Results:
- Yahoo Songs dataset (700M datapoints, 1.8M users, 136K items): similarity computation takes less than 100 minutes

Scalable matrix factorization

Alternating Least Squares

- ALS rotates between fixing U and M: when U is fixed, the system recomputes M by solving a least-squares problem per item, and vice versa
- easy to parallelize, as all users (and, vice versa, all items) can be recomputed independently
- additionally, ALS can be applied to usecases with implicit data (pageviews, clicks)

    A (u × i)  ≈  U (u × k)  ×  M (k × i)
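The alternation can be sketched with numpy on a tiny dense toy matrix. This is only an illustration of the scheme: the real implementation solves the least-squares problems over observed interactions only, and in parallel.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[5.0, 4.0, 1.0],        # toy dense interaction matrix
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0]])
n_users, n_items = A.shape
k, lam = 2, 0.05                      # rank and regularization

U = rng.standard_normal((n_users, k))
M = rng.standard_normal((n_items, k))

for _ in range(20):
    # fix M, recompute each user's features by solving a least-squares problem
    for u in range(n_users):
        U[u] = np.linalg.solve(M.T @ M + lam * np.eye(k), M.T @ A[u])
    # fix U, recompute each item's features the same way (the "alternation")
    for i in range(n_items):
        M[i] = np.linalg.solve(U.T @ U + lam * np.eye(k), U.T @ A[:, i])

print(np.round(U @ M.T, 1))           # approximately reconstructs A
```

Note that inside each half-step the per-user (or per-item) solves are completely independent, which is exactly the property the distributed implementation exploits.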

Scalable Matrix Factorization: Implementation

Recompute the user feature matrix U using a broadcast-join:

1. run a map-only job using multithreaded mappers
2. load the item-feature matrix M into memory from HDFS to share it among the individual mappers
3. mappers read the interaction histories of the users
4. multithreaded: solve a least-squares problem per user to recompute its feature vector

[Diagram: M is broadcast to every machine; on each machine, a local map task hash-joins its partition of the user histories A with M, recomputes the user feature vectors, and forwards them to form U.]

Implementation in Mahout

- o.a.m.cf.taste.hadoop.als.ParallelALSFactorizationJob
  different solvers for explicit and implicit data
  Zhou et al.: Large-Scale Parallel Collaborative Filtering for the Netflix Prize, AAIM ’08
  Hu et al.: Collaborative Filtering for Implicit Feedback Datasets, ICDM ’08
- o.a.m.cf.taste.hadoop.als.RecommenderJob
  computes recommendations from a factorization

In-depth description available in:
Sebastian Schelter, Christoph Boden, Martin Schenck, Alexander Alexandrov, Volker Markl: Distributed Matrix Factorization with MapReduce using a series of Broadcast-Joins, to appear at ACM RecSys 2013

Scalable Matrix Factorization: Experiments

- Cluster: 26 machines, two 4-core Opteron CPUs, 32 GB memory and four 1 TB disk drives each
- Hadoop configuration: reuse JVMs, use JBlas as solver, run multithreaded mappers
- Datasets: Netflix (0.5M users, 100M datapoints), Yahoo Songs (1.8M users, 700M datapoints), Bigflix (25M users, 5B datapoints)

[Plots: average duration per job in seconds when recomputing U and M, for Netflix and Yahoo Songs as the number of features r grows (10, 20, 50, 100), and for Bigflix as the number of machines grows (5 to 25).]

Next directions

- better tooling for cross-validation and hold-out tests (e.g. to find parameters for ALS)
- integration of more efficient solver libraries like JBlas
- the MapReduce code should be easier to modify and adjust

A selection of users

- Mendeley, a data platform for researchers (2.5M users, 50M research articles): Mendeley Suggest for discovering relevant research publications
- ResearchGate, the world’s largest social network for researchers (3M users)
- a German online retailer with several million customers across Europe
- German online marketplaces for real estate and pre-owned cars with millions of users

Deployment

"Small data, low load"

- use GenericItembasedRecommender or GenericUserbasedRecommender and feed it with interaction data stored in a file, database or key-value store
- have it load the interaction data into memory and compute recommendations on request
- collect new interactions in your files or database and periodically refresh the recommender

In order to improve performance, try to:

- have your recommender look at fewer interactions by using SamplingCandidateItemsStrategy
- cache computed similarities with a CachingItemSimilarity

"Medium data, high load"

Assumption: the interaction data still fits into main memory

- use a recommender that is able to leverage a precomputed model, e.g. GenericItembasedRecommender or SVDRecommender
- load the interaction data and the model into memory and compute recommendations on request
- collect new interactions in your files or database and periodically recompute the model and refresh the recommender
- use BatchItemSimilarities or ParallelSGDFactorizer to precompute the model using multiple threads on a single machine

"Lots of data, high load"

Assumption: the interaction data does not fit into main memory

- use a recommender that is able to leverage a precomputed model, e.g. GenericItembasedRecommender or SVDRecommender
- keep the interaction data in a (potentially partitioned) database or in a key-value store
- load the model into memory; the recommender will only issue one (cacheable) query per recommendation request to retrieve the user’s interaction history
- collect new interactions in your files or database and periodically recompute the model offline
- use ItemSimilarityJob or ParallelALSFactorizationJob to precompute the model with Hadoop

"Precompute everything"

- use RecommenderJob to precompute recommendations for all users with Hadoop
- directly serve those recommendations
- successfully employed by Mendeley for their research paper recommender "Suggest"
- allowed them to run their recommender infrastructure serving 2 million users for less than $100 per month on AWS

Next directions

"Search engine based recommender infrastructure" (work in progress, driven by Pat Ferrel)

- use RowSimilarityJob to find anomalously co-occurring items using Hadoop
- index those item pairs with a distributed search engine such as Apache Solr
- query based on a user’s interaction history and the search engine will answer with recommendations
- gives us an easy-to-use, scalable serving layer for free (Apache Solr)
- allows complex recommendation queries containing filters, geo-location, etc.
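The querying idea can be illustrated with a plain-Python stand-in for the search engine, where a dict plays the role of the Solr index; the item names and indicator lists below are invented for the example.

```python
from collections import Counter

# stand-in for precomputed RowSimilarityJob output: for each item, the
# items that anomalously co-occur with it (the indexed "indicators")
index = {
    "matrix":  ["terminator", "alien"],
    "alien":   ["terminator", "matrix"],
    "titanic": ["notebook"],
}

def recommend(history, top_n=2):
    """Stand-in for a search-engine query: match the user's interaction
    history against the indexed indicator lists and rank candidate items
    by how many history items they match."""
    hits = Counter()
    for item in history:
        for indicator in index.get(item, []):
            if indicator not in history:
                hits[indicator] += 1
    return [item for item, _ in hits.most_common(top_n)]

print(recommend(["matrix", "alien"]))  # -> ['terminator']
```

A real deployment would replace the dict with Solr documents and the loop with a single search query, which is where the filters and geo-location support come from.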

The shape of things to come

- MapReduce is not well suited for certain ML usecases, e.g. when the algorithms to apply are iterative and the dataset fits into the aggregate main memory of the cluster
- Mahout has always stated that it is not tied to Hadoop; however, there were no production-quality alternatives in the past
- with the advent of YARN and the maturing of alternative systems, this situation is changing and we should embrace this change
- personally, I would love to see an experimental port of our distributed recommenders to another Apache-supported system such as Spark or Giraph

Thanks for listening!

Follow me on twitter at http://twitter.com/sscdotopen

Join Mahout’s mailing lists at http://s.apache.org/mahout-lists

picture on slide 3 by Tim Abott, http://www.flickr.com/photos/theabbott/

picture on slide 21 by Crimson Diabolics, http://crimsondiabolics.deviantart.com/