This page reproduces the content of http://www.slideshare.net/hamukazu/recommendation-system-theory-and-practice.


Uploaded on 2015/02/18 in the Technology category.

Survey on recommendation systems presented at IMI Colloquium, Kyushu University, Feb 18, 2015.


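An overview of recent research trends in recommendation systems, presented at the IMI Colloquium at Kyushu University on February 18, 2015. The slides are in English, but the talk was given in Japanese.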

- Recommendation System

— Theory and Practice

IMI Colloquium @ Kyushu Univ.

February 18, 2015

Kimikazu Kato

Silver Egg Technology

1 / 27 - About myself

Kimikazu Kato

Ph.D in computer science, Master's degree in mathematics

Experience in numerical computation, especially ...

Geometric computation, computer graphics

Partial differential equation, parallel computation, GPGPU

Now specializing in

Machine learning, especially recommendation systems

2 - About our Company

Silver Egg Technology

Established: 1998

CEO: Tom Foley

Main Service: Recommendation System, Online Advertisement

Major Clients: QVC, Senshukai (Bell Maison), Tsutaya

We provide a recommendation system to Japan's leading web sites.

3 - Today's Story

Introduction to recommendation systems

Rating prediction

Shopping behavior prediction

Practical viewpoint

Conclusion

4 - Recommendation System

Recommender systems or recommendation systems (sometimes

replacing "system" with a synonym such as platform or engine) are a

subclass of information filtering system that seek to predict the

'rating' or 'preference' that user would give to an item. — Wikipedia

In this talk, we focus on collaborative filtering methods, which use only users' behavior, activity, and preferences.

Other methods include:

Content-based methods

Methods using demographic data

Hybrid methods

5 - Our Service and Mechanism

ASP service named "Aigent Recommender"

Works as an add-on to the existing web site.

6 - Netflix Prize

The Netflix Prize was an open competition for the best collaborative

filtering algorithm to predict user ratings for films, based on previous

ratings without any other information about the users or films, i.e.

without the users or the films being identified except by numbers

assigned for the contest. — Wikipedia

In short, an open competition for preference prediction.

Closed in 2009.

7 - Description of the Problem

user\movie: W, X, Y, Z (blank cells are unrated and omitted)

A: 5, 4, 1, 4

B: 4

C: 2, 3

D: 1, 4, ?

Given rating information for some user/movie pairs,

is it possible to predict a rating for an unknown user/movie pair?

8 - Notations

Number of users: n

Set of users: U = {1, 2, … , n}

Number of items (movies): m

Set of items (movies): I = {1, 2, … , m}

Input matrix: A (an n × m matrix)

9 - Matrix Factorization

Based on the assumption that each item is described by a small number of

latent factors

Each rating is expressed as a linear combination of the latent factors

Achieved good performance in the Netflix Prize

$A \approx X^\top Y$

Find matrices $X \in \mathrm{Mat}(f, n)$ and $Y \in \mathrm{Mat}(f, m)$, where $f \ll n, m$

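10 -

$$p(A \mid X, Y, \sigma) = \prod_{A_{ui} \neq 0} \mathcal{N}(A_{ui} \mid X_u^\top Y_i, \sigma)$$

$$p(X \mid \sigma_X) = \prod_{u} \mathcal{N}(X_u \mid 0, \sigma_X I)$$

$$p(Y \mid \sigma_Y) = \prod_{i} \mathcal{N}(Y_i \mid 0, \sigma_Y I)$$

Here $X_u$ is the u-th column of X and $Y_i$ is the i-th column of Y.

Find X and Y that maximize $p(X, Y \mid A, \sigma)$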

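11 - According to Bayes' theorem,

$$p(X, Y \mid A, \sigma) = p(A \mid X, Y, \sigma)\, p(X \mid \sigma_X)\, p(Y \mid \sigma_Y) \times \mathrm{const.}$$

Thus, up to a positive scale factor,

$$-\log p(X, Y \mid A, \sigma, \sigma_X, \sigma_Y) = \sum_{A_{ui} \neq 0} (A_{ui} - X_u^\top Y_i)^2 + \lambda_X \|X\|_{\mathrm{Fro}}^2 + \lambda_Y \|Y\|_{\mathrm{Fro}}^2 + \mathrm{const.}$$

where $\|\cdot\|_{\mathrm{Fro}}$ denotes the Frobenius norm. Maximizing the posterior is therefore equivalent to minimizing a regularized squared error.

How can this be computed? Use MCMC. See [Salakhutdinov et al., 2008].

Once X and Y are determined, let $\tilde{A} := X^\top Y$; the prediction for $A_{ui}$ is given by $\tilde{A}_{ui}$.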
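
As a rough illustration of the idea that the factors X and Y are fitted by minimizing a regularized squared error over the known ratings, here is a minimal sketch. It deliberately swaps in plain full-batch gradient descent for the MCMC procedure the slide refers to, so it yields only a point estimate. The function name `factorize`, the hyperparameter values, and the 4x4 rating matrix are all assumptions made for illustration; the matrix is not the exact table from the slides.

```python
import numpy as np

def factorize(A, f=2, lam=0.1, lr=0.01, epochs=5000, seed=0):
    """Fit A ≈ X^T Y on the nonzero entries of A by gradient descent.

    Minimizes sum_{A_ui != 0} (A_ui - X_u^T Y_i)^2 + lam*||X||^2 + lam*||Y||^2.
    """
    rng = np.random.default_rng(seed)
    n, m = A.shape
    X = 0.1 * rng.standard_normal((f, n))  # column X[:, u]: factors of user u
    Y = 0.1 * rng.standard_normal((f, m))  # column Y[:, i]: factors of item i
    mask = (A != 0)                        # zero is treated as "unknown"
    for _ in range(epochs):
        E = mask * (A - X.T @ Y)           # residuals on known entries only
        grad_X = -2 * Y @ E.T + 2 * lam * X
        grad_Y = -2 * X @ E + 2 * lam * Y
        X -= lr * grad_X
        Y -= lr * grad_Y
    return X, Y

# Hypothetical rating matrix (rows: users, columns: movies, 0 = unknown).
A = np.array([[5, 4, 1, 4],
              [0, 4, 0, 0],
              [0, 2, 3, 0],
              [1, 0, 4, 0]], dtype=float)

X, Y = factorize(A)
A_tilde = X.T @ Y                          # predictions for every user/movie pair
observed = A != 0
train_rmse = np.sqrt(np.mean((A - A_tilde)[observed] ** 2))

print(np.round(A_tilde, 2))
print("RMSE on known entries:", round(float(train_rmse), 3))
```

The entries of `A_tilde` at positions where `A` is zero are the predicted ratings. The step size and iteration count were picked for this toy matrix and would need tuning on real data.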

12 - Difference between Rating and Shopping

Rating

user\movie: W, X, Y, Z (blank cells are omitted)

A: 5, 4, 1, 4

B: 4

C: 2, 3

D: 1, 4, ?

Includes negative feedback: "1" means "boring"

Zero means "unknown"

Shopping (Browsing)

user\item: W, X, Y, Z (blank cells are omitted)

A: 1, 1, 1, 1

B: 1

C: 1

D: 1, 1, ?

Includes no negative feedback

Zero means "unknown" or "negative"

More degrees of freedom

Consequently, an algorithm that is effective for the rating matrix is not necessarily effective for the shopping matrix.

13 - Evaluation Metrics for Recommendation Systems

Rating prediction

Root Mean Squared Error (RMSE)

The square root of the mean of the squared errors

Shopping prediction

Precision

(# of Recommended and Purchased)/(# of Recommended)

Recall

(# of Recommended and Purchased)/(# of Purchased)

The criteria are different. This is another reason different algorithms should

be applied.
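
The three metrics can be written down directly from their definitions. The following is a small sketch; the function names and the sample inputs are hypothetical and only serve to show the arithmetic.

```python
import math

def rmse(actual, predicted):
    """Square root of the mean of the squared errors."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def precision_recall(recommended, purchased):
    """Precision = hits / #recommended, recall = hits / #purchased."""
    rec, bought = set(recommended), set(purchased)
    hits = len(rec & bought)               # recommended AND purchased
    precision = hits / len(rec) if rec else 0.0
    recall = hits / len(bought) if bought else 0.0
    return precision, recall

# Hypothetical rating predictions for three user/movie pairs.
error = rmse([5, 4, 1], [4, 4, 3])

# Hypothetical shopping case: three items recommended, two items bought.
precision, recall = precision_recall(["W", "X", "Y"], ["X", "Z"])

print(error, precision, recall)
```

In the shopping case only item X is both recommended and purchased, so precision is 1/3 and recall is 1/2. RMSE needs a numeric target for every evaluated pair, whereas precision and recall only need sets of items, which is why the two settings are evaluated differently.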

14 - Solutions

Adding a constraint to the optimization problem

Changing the objective function itself

15 - Adding a Constraint

The problem is that there are too many degrees of freedom

A desirable characteristic is that many elements of the product are zero.

Assume that a certain ratio of the zero elements of the input matrix remains zero after the optimization [Sindhwani et al., 2010]

Experimentally outperforms the "zero-as-negative" method

16 - [Sindhwani et al., 2010]

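Introduces variables $p_{ui}$ to relax the problem.

Minimize

$$\sum_{A_{ui} \neq 0} (A_{ui} - X_u^\top Y_i)^2 + \lambda_X \|X\|_{\mathrm{Fro}}^2 + \lambda_Y \|Y\|_{\mathrm{Fro}}^2$$

$$+ \sum_{A_{ui} = 0} \left[ p_{ui} (0 - X_u^\top Y_i)^2 - (1 - p_{ui})(1 - X_u^\top Y_i)^2 \right]$$

$$+ T \sum_{A_{ui} = 0} \left[ -p_{ui} \log p_{ui} - (1 - p_{ui}) \log(1 - p_{ui}) \right]$$

subject to

$$\frac{1}{|\{A_{ui} \mid A_{ui} = 0\}|} \sum_{A_{ui} = 0} p_{ui} = r$$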

17 - Ranking prediction

Another strategy for shopping prediction

"Learn from the order" approach

Predict whether X is more likely to be bought than Y, rather than the

probability for X or Y.

18 - Bayesian Personalized Ranking

[Rendle et al., 2009]

Consider a matrix factorization model, but update the elements according to the observed "orders"

The parameters are the same as in usual matrix factorization, but the objective function is different

Consider a total order $>_u$ for each $u \in U$. Suppose that $i >_u j$ ($i, j \in I$) means "the user u is more likely to buy i than j."

The objective is to calculate $p(i >_u j)$ for pairs i, j such that $A_{ui} = 0$ and $A_{uj} = 0$ (which means i and j are not bought by u).

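19 - Let

$$D_A = \{(u, i, j) \in U \times I \times I \mid A_{ui} = 1,\ A_{uj} = 0\},$$

and define

$$\prod_{u \in U} p(>_u \mid X, Y) := \prod_{(u,i,j) \in D_A} p(i >_u j \mid X, Y)$$

where we assume

$$p(i >_u j \mid X, Y) = \sigma(X_u^\top Y_i - X_u^\top Y_j), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$

According to Bayes' theorem, the function to be optimized becomes:

$$\prod_{u \in U} p(X, Y \mid >_u) = \prod_{u \in U} p(>_u \mid X, Y) \times p(X)\, p(Y) \times \mathrm{const.}$$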

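20 - Taking the log of this,

$$L := \log\Big[\prod_{u \in U} p(>_u \mid X, Y) \times p(X)\, p(Y)\Big]$$

$$= \log \prod_{(u,i,j) \in D_A} p(i >_u j \mid X, Y) - \lambda_X \|X\|_{\mathrm{Fro}}^2 - \lambda_Y \|Y\|_{\mathrm{Fro}}^2$$

$$= \sum_{(u,i,j) \in D_A} \log \sigma(X_u^\top Y_i - X_u^\top Y_j) - \lambda_X \|X\|_{\mathrm{Fro}}^2 - \lambda_Y \|Y\|_{\mathrm{Fro}}^2$$

Now consider the following problem:

$$\max_{X,Y} \Big[ \sum_{(u,i,j) \in D_A} \log \sigma(X_u^\top Y_i - X_u^\top Y_j) - \lambda_X \|X\|_{\mathrm{Fro}}^2 - \lambda_Y \|Y\|_{\mathrm{Fro}}^2 \Big]$$

This means "find a pair of matrices X, Y which preserve the order of the elements of the input matrix for each u."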

21 - Computation

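The function we want to optimize:

$$\sum_{(u,i,j) \in D_A} \log \sigma(X_u^\top Y_i - X_u^\top Y_j) - \lambda_X \|X\|_{\mathrm{Fro}}^2 - \lambda_Y \|Y\|_{\mathrm{Fro}}^2$$

U × I × I is huge, so in practice, a stochastic method is necessary.

Let the parameters be Θ = (X, Y).

The algorithm is the following:

Repeat the following

Choose $(u, i, j) \in D_A$ randomly

Update Θ with

$$\Theta \leftarrow \Theta + \alpha \frac{\partial}{\partial \Theta} \Big( \log \sigma(X_u^\top Y_i - X_u^\top Y_j) - \lambda_X \|X\|_{\mathrm{Fro}}^2 - \lambda_Y \|Y\|_{\mathrm{Fro}}^2 \Big)$$

This method is called Stochastic Gradient Descent (SGD). Because the objective is maximized, each step moves along the gradient, which is the same as gradient descent on the negated objective.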
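
The sampling loop can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the reference implementation of [Rendle et al., 2009]: the function name `bpr_sgd`, the hyperparameters, and the toy purchase matrix are hypothetical, a single λ is used for both X and Y, and the regularization term is applied only to the three vectors touched by the sampled triple, which is a common simplification.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_sgd(A, f=2, lam=0.01, alpha=0.05, steps=20000, seed=0):
    """Maximize sum log sigma(X_u^T Y_i - X_u^T Y_j) - lam*(||X||^2 + ||Y||^2)."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    X = 0.1 * rng.standard_normal((f, n))
    Y = 0.1 * rng.standard_normal((f, m))
    # D_A: every (u, i, j) with A[u, i] = 1 (bought) and A[u, j] = 0 (not bought).
    triples = [(u, i, j) for u in range(n) for i in range(m) for j in range(m)
               if A[u, i] == 1 and A[u, j] == 0]
    for _ in range(steps):
        u, i, j = triples[rng.integers(len(triples))]
        xu, yi, yj = X[:, u].copy(), Y[:, i].copy(), Y[:, j].copy()
        g = 1.0 - sigmoid(xu @ (yi - yj))  # d/dx log sigma(x) = 1 - sigma(x)
        # Step along the gradient, since the objective is maximized.
        X[:, u] += alpha * (g * (yi - yj) - 2 * lam * xu)
        Y[:, i] += alpha * (g * xu - 2 * lam * yi)
        Y[:, j] += alpha * (-g * xu - 2 * lam * yj)
    return X, Y, triples

# Hypothetical purchase matrix (1 = bought, 0 = unknown or negative).
A_shop = np.array([[1, 1, 0, 0],
                   [1, 0, 0, 0],
                   [0, 0, 1, 1],
                   [0, 0, 0, 1]])

X, Y, triples = bpr_sgd(A_shop)
scores = X.T @ Y
# Fraction of training triples whose order is reproduced by the scores.
auc = np.mean([scores[u, i] > scores[u, j] for u, i, j in triples])
print("fraction of correctly ordered triples:", auc)
```

Enumerating all triples is only feasible for a toy matrix. For realistic sizes one would sample u, then a bought item i and an unbought item j, without materializing the list.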

22 - Practical Aspect of Recommendation

Problem

Computational time

Memory consumption

How many services can be integrated in a server rack?

Super high accuracy with a super computer is useless for real business

23 - Sparsification

Representing a big matrix as a sparse matrix saves computational time and memory consumption at the same time

It is advantageous to employ a model whose parameters become sparse
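
To make the saving concrete, the sketch below stores only the nonzero entries of a matrix in coordinate form (row index, column index, value) and multiplies it by a vector without touching the zeros. The matrix size, the number of events, and the helper name `coo_matvec` are assumptions chosen for illustration; a production system would more likely rely on a dedicated sparse-matrix library.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 1000

# Hypothetical purchase log: 5000 random (user, item) events in a 1000 x 1000 matrix.
dense = np.zeros((n, m))
dense[rng.integers(0, n, size=5000), rng.integers(0, m, size=5000)] = 1.0

# Coordinate representation: keep only the nonzero entries.
r, c = np.nonzero(dense)
vals = dense[r, c]

def coo_matvec(r, c, vals, x, n_rows):
    """Compute y = M @ x using only the stored nonzero entries of M."""
    y = np.zeros(n_rows)
    np.add.at(y, r, vals * x[c])           # accumulate contributions row by row
    return y

x = rng.standard_normal(m)
y_sparse = coo_matvec(r, c, vals, x, n)
y_dense = dense @ x

dense_bytes = dense.nbytes
sparse_bytes = r.nbytes + c.nbytes + vals.nbytes
print("dense bytes:", dense_bytes, "sparse bytes:", sparse_bytes)
```

Both the memory footprint and the work per matrix-vector product scale with the number of nonzero entries rather than with n × m, which is the reason sparse model parameters are attractive in practice.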

24 - Example of sparse model: Elastic Net

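In the regression model, adding an L1 term makes the solution sparse:

$$\min_w \left[ \frac{1}{2n} \|Xw - y\|_2^2 + \frac{\lambda(1 - \rho)}{2} \|w\|_2^2 + \lambda\rho\, |w|_1 \right]$$

A similar idea is used for matrix factorization [Ning et al., 2011]:

Minimize

$$\|A - AW\|_{\mathrm{Fro}}^2 + \frac{\lambda(1 - \rho)}{2} \|W\|_{\mathrm{Fro}}^2 + \lambda\rho\, |W|_1$$

subject to

$$\operatorname{diag} W = 0$$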
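
The sparsifying effect of the L1 term can be seen on the regression form of the objective. The sketch below minimizes it by cyclic coordinate descent with soft-thresholding, a standard solver for this kind of objective; it is not taken from the slides, and the function names, the synthetic data, and the values of λ and ρ are assumptions for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; returns exactly 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def elastic_net(X, y, lam=0.5, rho=0.9, sweeps=100):
    """Minimize 1/(2n)*||Xw - y||^2 + lam*(1-rho)/2*||w||^2 + lam*rho*|w|_1."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n      # (1/n) * ||x_j||^2 for each column
    for _ in range(sweeps):
        for j in range(d):
            r_j = y - X @ w + X[:, j] * w[j]   # residual with feature j removed
            z = X[:, j] @ r_j / n
            w[j] = soft_threshold(z, lam * rho) / (col_sq[j] + lam * (1 - rho))
    return w

# Synthetic data: only the first two of ten features actually matter.
rng = np.random.default_rng(0)
X_data = rng.standard_normal((200, 10))
w_true = np.zeros(10)
w_true[0], w_true[1] = 3.0, -2.0
y_data = X_data @ w_true + 0.1 * rng.standard_normal(200)

w_hat = elastic_net(X_data, y_data)
print(np.round(w_hat, 3))
```

The coefficients of the eight irrelevant features come out as exact zeros, because soft-thresholding sets a coordinate to zero whenever its correlation with the residual stays below λρ. The two relevant coefficients are shrunk toward zero but remain nonzero.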

25 - Conclusion: What is Important for Good

Prediction?

Theory

Machine learning

Mathematical optimization

Implementation

Algorithms

Computer architecture

Mathematics

Human factors!

Hand tuning of parameters

Domain specific knowledge

26 - References

Salakhutdinov, Ruslan, and Andriy Mnih. "Bayesian probabilistic matrix factorization using Markov chain Monte Carlo." Proceedings of the 25th international conference on Machine learning. ACM, 2008.

Sindhwani, Vikas, et al. "One-class matrix completion with low-density factorizations." Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 2010.

Rendle, Steffen, et al. "BPR: Bayesian personalized ranking from implicit feedback." Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2009.

Zou, Hui, and Trevor Hastie. "Regularization and variable selection via the elastic net." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.2 (2005): 301-320.

Ning, Xia, and George Karypis. "SLIM: Sparse linear methods for top-n recommender systems." Data Mining (ICDM), 2011 IEEE 11th International Conference on. IEEE, 2011.
