This page reproduces the content of http://www.slideshare.net/glouppe/understanding-random-forests-from-theory-to-practice.


Uploaded on 2014/10/10 in Technology.

Slides of my PhD defense, held on October 9.

- Understanding Random Forests

From Theory to Practice

Gilles Louppe

Université de Liège, Belgium

October 9, 2014

1 / 39 - Motivation

2 - Objective

From a set of measurements,

learn a model

to predict and understand a phenomenon.

3 - Running example

From physicochemical

properties (alcohol, acidity,

sulphates, ...),

learn a model

to predict wine taste

preferences (from 0 to 10).

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, Modeling wine

preferences by data mining from physicochemical properties, 2009.

4 - Outline

1 Motivation

2 Growing decision trees and random forests

Review of state-of-the-art, minor contributions

3 Interpreting random forests

Major contributions (Theory)

4 Implementing and accelerating random forests

Major contributions (Practice)

5 Conclusions

5 - Supervised learning

• The inputs are random variables X = X1, ..., Xp ;

• The output is a random variable Y .

• Data comes as a finite learning set

$$L = \{(x_i, y_i) \mid i = 0, \dots, N - 1\},$$

where $x_i \in X = X_1 \times \dots \times X_p$ and $y_i \in Y$ are randomly drawn from $P_{X,Y}$.

E.g., (x_i, y_i) = ((color = red, alcohol = 12, ...), score = 6)

• The goal is to find a model $\varphi_L : X \to Y$ minimizing

$$Err(\varphi_L) = E_{X,Y}\{L(Y, \varphi_L(X))\}.$$

6 - Performance evaluation

Classification

• Symbolic output (e.g., Y = {yes, no})

• Zero-one loss

$$L(Y, \varphi_L(X)) = 1(Y \neq \varphi_L(X))$$

Regression

• Numerical output (e.g., Y = R)

• Squared error loss

$$L(Y, \varphi_L(X)) = (Y - \varphi_L(X))^2$$
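The two losses above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the toy targets and predictions are made up, and each function returns the loss averaged over a small sample rather than the expectation under P(X, Y).

```python
def zero_one_loss(y_true, y_pred):
    """Average zero-one loss: fraction of predictions that differ from the target."""
    return sum(yt != yp for yt, yp in zip(y_true, y_pred)) / len(y_true)


def squared_error_loss(y_true, y_pred):
    """Average squared error between numerical targets and predictions."""
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)


# Toy data, invented for illustration only.
clf_err = zero_one_loss(["yes", "no", "yes", "no"], ["yes", "yes", "yes", "no"])
reg_err = squared_error_loss([6.0, 5.0, 7.0], [5.0, 5.0, 9.0])
print(clf_err, reg_err)
```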

7 - Divide and conquer

[Figure: the input space (X1, X2) is partitioned step by step, first with a split at 0.7 on the X1 axis, then with a split at 0.5 on the X2 axis.]

8 - Decision trees

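[Figure: a decision tree shown next to the partition of the (X1, X2) plane it induces. An input x enters at the root t1; split nodes test X1 ≤ 0.7 and X2 ≤ 0.5, with branches labelled ≤ and >; the nodes are labelled t1 to t5, and each leaf node outputs p(Y = c | X = x).]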

t ∈ ϕ : nodes of the tree ϕ

$X_t$ : split variable at t

$v_t \in \mathbb{R}$ : split threshold at t

$$\varphi(x) = \arg\max_{c \in Y} p(Y = c \mid X = x)$$
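The prediction rule above amounts to walking from the root to a leaf and returning the most probable class stored there. The sketch below assumes a nested-dictionary tree whose two split tests mirror the slide (X1 ≤ 0.7 and X2 ≤ 0.5). The dictionary layout, the placement of the second split under the right branch, the class names and the leaf probabilities are all hypothetical.

```python
# Hypothetical tree: the layout and the leaf probabilities are invented.
tree = {
    "feature": 0, "threshold": 0.7,                  # split node: X1 <= 0.7
    "left": {"proba": {"blue": 0.9, "red": 0.1}},    # leaf node
    "right": {
        "feature": 1, "threshold": 0.5,              # split node: X2 <= 0.5
        "left": {"proba": {"blue": 0.2, "red": 0.8}},
        "right": {"proba": {"blue": 0.6, "red": 0.4}},
    },
}


def predict_proba(node, x):
    """Follow the split tests from the root down to a leaf, return p(Y = c | X = x)."""
    while "proba" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["proba"]


def predict(node, x):
    """Return the arg max over classes of the leaf distribution."""
    proba = predict_proba(node, x)
    return max(proba, key=proba.get)


print(predict(tree, (0.3, 0.9)), predict(tree, (0.9, 0.2)))
```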

9 - Learning from data (CART)

function BuildDecisionTree(L)
    Create node t from the learning sample L_t = L
    if the stopping criterion is met for t then
        y_t = some constant value
    else
        Find the split on L_t that maximizes impurity decrease
            s* = arg max_{s ∈ Q} Δi(s, t)
        Partition L_t into L_tL ∪ L_tR according to s*
        t_L = BuildDecisionTree(L_tL)
        t_R = BuildDecisionTree(L_tR)
    end if
    return t
end function
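The pseudocode above can be turned into a small runnable sketch for regression. The choices below are assumptions made for illustration, not the author's implementation: variance is used as the impurity measure, candidate thresholds are midpoints between consecutive distinct values, the stopping criterion is "too few samples or no split with positive impurity decrease", and the leaf constant is the mean of the targets.

```python
def variance(ys):
    """Impurity of a node for regression: variance of its target values."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)


def best_split(X, y):
    """Exhaustively search for the (decrease, feature, threshold) maximizing impurity decrease."""
    n, best = len(y), None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            v = (lo + hi) / 2.0
            left = [y[i] for i in range(n) if X[i][j] <= v]
            right = [y[i] for i in range(n) if X[i][j] > v]
            if not left or not right:
                continue
            decrease = (variance(y)
                        - len(left) / n * variance(left)
                        - len(right) / n * variance(right))
            if best is None or decrease > best[0]:
                best = (decrease, j, v)
    return best


def build_decision_tree(X, y, min_samples_split=2):
    """Recursive construction following the BuildDecisionTree pseudocode."""
    split = best_split(X, y) if len(y) >= min_samples_split else None
    if split is None or split[0] <= 0.0:           # stopping criterion
        return {"value": sum(y) / len(y)}          # leaf: constant prediction
    _, j, v = split
    left = [i for i in range(len(y)) if X[i][j] <= v]
    right = [i for i in range(len(y)) if X[i][j] > v]
    return {
        "feature": j, "threshold": v,
        "left": build_decision_tree([X[i] for i in left], [y[i] for i in left], min_samples_split),
        "right": build_decision_tree([X[i] for i in right], [y[i] for i in right], min_samples_split),
    }


def predict(node, x):
    """Route x down the tree and return the constant stored in the leaf."""
    while "value" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["value"]


# Toy learning set, invented for illustration.
X_toy = [[1.0], [2.0], [3.0], [4.0]]
y_toy = [5.0, 5.0, 7.0, 7.0]
toy_tree = build_decision_tree(X_toy, y_toy)
print(toy_tree["threshold"], predict(toy_tree, [1.2]), predict(toy_tree, [3.7]))
```

The exhaustive search is quadratic in the number of samples per feature, which is fine for a sketch but far from the sorted, incremental evaluation a production implementation would use.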

10 - Back to our example

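[Figure: regression tree learned on the wine data. Split tests shown: alcohol <= 10.625, vol. acidity <= 0.237, alcohol <= 11.741, vol. acidity <= 0.442. Leaf predictions shown: y = 5.915, y = 5.382, y = 6.516, y = 6.131, y = 5.557.]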

11 - Bias-variance decomposition

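Theorem. For the squared error loss, the bias-variance decomposition of the expected generalization error at X = x is

$$E_L\{Err(\varphi_L(x))\} = \mathrm{noise}(x) + \mathrm{bias}^2(x) + \mathrm{var}(x)$$

where

$$\mathrm{noise}(x) = Err(\varphi_B(x)),$$

$$\mathrm{bias}^2(x) = (\varphi_B(x) - E_L\{\varphi_L(x)\})^2,$$

$$\mathrm{var}(x) = E_L\{(E_L\{\varphi_L(x)\} - \varphi_L(x))^2\}.$$

[Figure: distribution P of the predictions around y at X = x, with ϕB(x), E_L{ϕL(x)}, noise(x), bias²(x) and var(x) marked.]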

12 - Diagnosing the generalization error of a decision tree

• (Residual error : Lowest achievable error, independent of ϕL.)

• Bias : Decision trees usually have low bias.

• Variance : They often suffer from high variance.

• Solution : Combine the predictions of several randomized trees

into a single model.

13 - Random forests

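[Figure: an input x is passed to M randomized trees ϕ1, ..., ϕM; the per-tree estimates p_ϕm(Y = c | X = x) are summed (∑) into the ensemble estimate p_ψ(Y = c | X = x).]

Randomization

• Random Forests: bootstrap samples + random selection of K ≤ p split variables

• Extra-Trees: random selection of K ≤ p split variables + random selection of the threshold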

14 - Bias-variance decomposition (cont.)

Theorem. For the squared error loss, the bias-variance decomposition of the expected generalization error $E_L\{Err(\psi_{L,\theta_1,\dots,\theta_M}(x))\}$ at X = x of an ensemble of M randomized models $\varphi_{L,\theta_m}$ is

$$E_L\{Err(\psi_{L,\theta_1,\dots,\theta_M}(x))\} = \mathrm{noise}(x) + \mathrm{bias}^2(x) + \mathrm{var}(x),$$

where

$$\mathrm{noise}(x) = Err(\varphi_B(x)),$$

$$\mathrm{bias}^2(x) = (\varphi_B(x) - E_{L,\theta}\{\varphi_{L,\theta}(x)\})^2,$$

$$\mathrm{var}(x) = \rho(x)\,\sigma^2_{L,\theta}(x) + \frac{1 - \rho(x)}{M}\,\sigma^2_{L,\theta}(x),$$

and where ρ(x) is the Pearson correlation coefficient between the predictions of two randomized trees built on the same learning set.

15 - Diagnosing the generalization error of random forests

• Bias : Identical to the bias of a single randomized tree.

• Variance : $\mathrm{var}(x) = \rho(x)\,\sigma^2_{L,\theta}(x) + \frac{1 - \rho(x)}{M}\,\sigma^2_{L,\theta}(x)$

As M → ∞, $\mathrm{var}(x) \to \rho(x)\,\sigma^2_{L,\theta}(x)$

The stronger the randomization, ρ(x) → 0, var(x) → 0.

The weaker the randomization, ρ(x) → 1, $\mathrm{var}(x) \to \sigma^2_{L,\theta}(x)$

Bias-variance trade-off. Randomization increases bias but makes

it possible to reduce the variance of the corresponding ensemble

model. The crux of the problem is to find the right trade-off.
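The variance term of the ensemble can be evaluated directly to see how M and ρ(x) interact. The helper below is a plain transcription of var(x) = ρ(x)σ²(x) + (1 − ρ(x))σ²(x)/M; the function name and the numerical values of ρ and σ² are arbitrary choices for illustration.

```python
def ensemble_variance(rho, sigma2, M):
    """Variance of an average of M randomized trees with pairwise correlation rho
    and single-tree variance sigma2."""
    return rho * sigma2 + (1.0 - rho) / M * sigma2


sigma2 = 2.0  # arbitrary single-tree variance
for rho in (0.9, 0.3, 0.0):
    row = [round(ensemble_variance(rho, sigma2, M), 4) for M in (1, 10, 100, 10_000)]
    print(rho, row)
```

With M = 1 the expression reduces to the variance of a single tree, and for large M it approaches ρ(x)σ²(x), so the benefit of adding trees is bounded by how decorrelated they are.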

16 - Back to our example

Method          Trees   MSE
CART            1       1.055
Random Forest   50      0.517
Extra-Trees     50      0.507

Combining several randomized trees indeed works better !
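A comparison of this kind can be reproduced in spirit with scikit-learn. The sketch below does not use the wine data: it assumes a synthetic regression problem (`make_friedman1`) so that it runs without downloading anything, and the sample size, noise level and random seeds are arbitrary, so the numbers will not match the table above.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the wine data (an assumption of this sketch).
X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "CART": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Extra-Trees": ExtraTreesRegressor(n_estimators=50, random_state=0),
}

# Fit each model and record its mean squared error on the held-out set.
mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse[name] = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: {mse[name]:.3f}")
```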

17 - Variable importances

• Interpretability can be recovered through variable importances

• Two main importance measures :

The mean decrease of impurity (MDI) : summing total

impurity reductions at all tree nodes where the variable

appears (Breiman et al., 1984) ;

The mean decrease of accuracy (MDA) : measuring

accuracy reduction on out-of-bag samples when the values of

the variable are randomly permuted (Breiman, 2001).

• We focus here on MDI because :

It is faster to compute ;

It does not require bootstrap sampling ;

In practice, it correlates well with the MDA measure.

19 - Mean decrease of impurity

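[Figure: an ensemble of M trees ϕ1, ϕ2, ..., ϕM.]

Importance of variable Xj for an ensemble of M trees ϕm is :

$$\mathrm{Imp}(X_j) = \frac{1}{M} \sum_{m=1}^{M} \sum_{t \in \varphi_m} 1(j_t = j)\,\big[\,p(t)\,\Delta i(t)\,\big],$$

where $j_t$ denotes the variable used at node t, $p(t) = N_t / N$ and $\Delta i(t)$ is the impurity reduction at node t :

$$\Delta i(t) = i(t) - \frac{N_{t_L}}{N_t}\, i(t_L) - \frac{N_{t_R}}{N_t}\, i(t_R)$$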
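The formula above can be checked against scikit-learn by walking the fitted tree arrays. The sketch below relies on the `tree_` attributes `children_left`, `children_right`, `feature`, `impurity` and `weighted_n_node_samples`; the helper name `tree_mdi`, the synthetic dataset and the forest size are assumptions made for illustration. Note that scikit-learn normalizes each tree's importances to sum to one, whereas the formula above does not.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def tree_mdi(estimator, n_features):
    """Sum p(t) * delta_i(t) over the split nodes of one fitted tree, per variable."""
    t = estimator.tree_
    w = t.weighted_n_node_samples
    imp = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split, no contribution
            continue
        # p(t) * delta_i(t) = (N_t i(t) - N_tL i(t_L) - N_tR i(t_R)) / N
        imp[t.feature[node]] += (
            w[node] * t.impurity[node]
            - w[left] * t.impurity[left]
            - w[right] * t.impurity[right]
        ) / w[0]
    return imp


# Synthetic data, chosen only so that the example is self-contained.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

per_tree = np.array([tree_mdi(est, X.shape[1]) for est in forest.estimators_])
mdi = per_tree.mean(axis=0)  # Imp(X_j): average over the M trees
print(np.round(mdi, 3))
```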

20 - Back to our example

MDI scores as computed from a forest of 1000 fully developed

trees on the Wine dataset (Random Forest, default parameters).

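[Figure: horizontal bar chart of MDI scores, on an axis from 0.00 to 0.30. Variables in the order shown on the chart, from top to bottom: alcohol, volatile acidity, free sulfur dioxide, sulphates, total sulfur dioxide, residual sugar, pH, chlorides, density, citric acid, fixed acidity, color.]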

21 - What does it mean ?

• MDI works well, but it is not well understood theoretically ;

• We would like to better characterize it and derive its main

properties from this characterization.

• Working assumptions :

All variables are discrete ;

Multi-way splits à la C4.5 (i.e., one branch per value) ;

Shannon entropy as impurity measure :

$$i(t) = -\sum_{c} \frac{N_{t,c}}{N_t} \log \frac{N_{t,c}}{N_t}$$

Totally randomized trees (RF with K = 1) ;

Asymptotic conditions : N → ∞, M → ∞.

22 - Result 1 : Three-level decomposition (Louppe et al., 2013)

Theorem. Variable importances provide a three-level

decomposition of the information jointly provided by all the input

variables about the output, accounting for all interaction terms in

a fair and exhaustive way.

$$I(X_1, \dots, X_p; Y) = \sum_{j=1}^{p} \mathrm{Imp}(X_j)$$

Left-hand side: information jointly provided by all input variables about the output.

i) Right-hand side: decomposition in terms of the MDI importance of each input variable.

$$\mathrm{Imp}(X_j) = \sum_{k=0}^{p-1} \frac{1}{C_p^k} \frac{1}{p - k} \sum_{B \in \mathcal{P}_k(V^{-j})} I(X_j; Y \mid B)$$

ii) Outer sum: decomposition along the degrees k of interaction with the other variables.

iii) Inner sum: decomposition along all interaction terms B of a given degree k.

E.g. : p = 3, $\mathrm{Imp}(X_1) = \frac{1}{3} I(X_1; Y) + \frac{1}{6} \big( I(X_1; Y \mid X_2) + I(X_1; Y \mid X_3) \big) + \frac{1}{3} I(X_1; Y \mid X_2, X_3)$

23 - Illustration : 7-segment display (Breiman et al., 1984)

y   x1  x2  x3  x4  x5  x6  x7
0   1   1   1   0   1   1   1
1   0   0   1   0   0   1   0
2   1   0   1   1   1   0   1
3   1   0   1   1   0   1   1
4   0   1   1   1   0   1   0
5   1   1   0   1   0   1   1
6   1   1   0   1   1   1   1
7   1   0   1   0   0   1   0
8   1   1   1   1   1   1   1
9   1   1   1   1   0   1   1

24 - Illustration : 7-segment display (Breiman et al., 1984)

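$$\mathrm{Imp}(X_j) = \sum_{k=0}^{p-1} \frac{1}{C_p^k} \frac{1}{p - k} \sum_{B \in \mathcal{P}_k(V^{-j})} I(X_j; Y \mid B)$$

Var   Imp
X1    0.412
X2    0.581
X3    0.531
X4    0.542
X5    0.656
X6    0.225
X7    0.372
Sum   3.321

[Figure: for each variable X1 to X7, decomposition of its importance along the degrees of interaction k = 0, ..., 6.]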
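The importance formula can be evaluated exactly on the 7-segment table, since the joint distribution is fully known. The sketch below assumes the ten digits are equiprobable and uses base-2 logarithms; the function names and the data layout are choices made for illustration. Conditional mutual information is computed from joint entropies as I(Xj; Y | B) = H(Xj, B) + H(Y, B) − H(Xj, Y, B) − H(B).

```python
from collections import Counter
from itertools import combinations
from math import comb, log2

# Rows of the 7-segment table: digit y -> (x1, ..., x7).
SEGMENTS = {
    0: (1, 1, 1, 0, 1, 1, 1),
    1: (0, 0, 1, 0, 0, 1, 0),
    2: (1, 0, 1, 1, 1, 0, 1),
    3: (1, 0, 1, 1, 0, 1, 1),
    4: (0, 1, 1, 1, 0, 1, 0),
    5: (1, 1, 0, 1, 0, 1, 1),
    6: (1, 1, 0, 1, 1, 1, 1),
    7: (1, 0, 1, 0, 0, 1, 0),
    8: (1, 1, 1, 1, 1, 1, 1),
    9: (1, 1, 1, 1, 0, 1, 1),
}
P = 7                                      # number of input variables
Y = P                                      # column index of the output in each row
ROWS = [x + (y,) for y, x in SEGMENTS.items()]


def entropy(cols):
    """Joint Shannon entropy (in bits) of the given columns, digits being equiprobable."""
    counts = Counter(tuple(row[c] for c in cols) for row in ROWS)
    return -sum(n / len(ROWS) * log2(n / len(ROWS)) for n in counts.values())


def cond_mutual_info(j, B):
    """I(X_j; Y | B) expressed with joint entropies."""
    B = tuple(B)
    return entropy((j,) + B) + entropy((Y,) + B) - entropy((j, Y) + B) - entropy(B)


def importance(j):
    """Imp(X_j) as a weighted sum over interaction degrees k and conditioning sets B."""
    others = [v for v in range(P) if v != j]
    total = 0.0
    for k in range(P):
        weight = 1.0 / (comb(P, k) * (P - k))
        total += weight * sum(cond_mutual_info(j, B) for B in combinations(others, k))
    return total


imp = [importance(j) for j in range(P)]
for j, value in enumerate(imp, start=1):
    print(f"X{j}: {value:.3f}")
print(f"Sum: {sum(imp):.3f}")
```

Because the seven segments identify the digit uniquely, the importances should add up to H(Y) = log2(10), which is the first level of the decomposition, and the individual values should come out close to the table above.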

24 - Result 2 : Irrelevant variables (Louppe et al., 2013)

Theorem. Variable importances depend only on the relevant

variables.

Theorem. A variable Xj is irrelevant if and only if Imp(Xj ) = 0.

⇒ The importance of a relevant variable is insensitive to the

addition or the removal of irrelevant variables.

Definition (Kohavi & John, 1997). A variable X is irrelevant (to Y with respect to V )

if, for all B ⊆ V , I (X ; Y |B) = 0. A variable is relevant if it is not irrelevant.

25 - Relaxing assumptions

When trees are not totally random...

• There can be relevant variables with zero importances (due to

masking effects).

• The importance of relevant variables can be influenced by the

number of irrelevant variables.

When the learning set is finite...

• Importances are biased towards variables of high cardinality.

• This effect can be minimized by collecting impurity terms measured from large enough samples only.

When splits are not multiway...

• i (t) does not actually measure the mutual information.

26 - Back to our example

MDI scores as computed from a forest of 1000 fixed-depth trees on

the Wine dataset (Extra-Trees, K = 1, max depth = 5).

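[Figure: horizontal bar chart of MDI scores, on an axis from 0.00 to 0.30. Variables in the order shown on the chart, from top to bottom: alcohol, volatile acidity, color, density, total sulfur dioxide, chlorides, citric acid, free sulfur dioxide, sulphates, fixed acidity, residual sugar, pH.]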

Taking into account (some of) the biases

results in quite a different story !

27 - Implementation (Buitinck et al., 2013)

Scikit-Learn

• Open source machine learning library for Python

• Classical and well-established algorithms

• Emphasis on code quality and usability

A long team effort

Time for building a Random Forest (relative to version 0.10)

Version   Relative time
0.10      1
0.11      0.99
0.12      0.98
0.13      0.33
0.14      0.11
0.15      0.04

29 - Implementation overview

• Modular implementation, designed with a strict separation of

concerns

Builders : for building and connecting nodes into a tree

Splitters : for finding a split

Criteria : for evaluating the goodness of a split

Tree : dedicated data structure

• Efficient algorithmic formulation [See Louppe, 2014]

Dedicated sorting procedure

Efficient evaluation of consecutive splits

• Close-to-the-metal, carefully coded implementation

2300+ lines of Python, 3000+ lines of Cython, 1700+ lines of tests

# But we kept it stupid simple for users!
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

30 - A winning strategy

Scikit-Learn implementation proves to be one of the fastest

among all libraries and programming languages.

[Figure: bar chart of fit time (s), on an axis from 0 to 14000, for Scikit-Learn-RF, Scikit-Learn-ETs, OpenCV-RF, OpenCV-ETs, OK3-RF, OK3-ETs, Weka-RF, R-RF and Orange-RF. Libraries and languages: Scikit-Learn (Python, Cython), OpenCV (C++), OK3 (C), Weka (Java), randomForest (R, Fortran), Orange (Python). Fit times labelled on the bars, in increasing order: 203.01, 211.53, 1027.91, 1518.14, 1711.94, 3342.83, 4464.65, 10941.72, 13427.06.]

31 - Computational complexity (Louppe, 2014)

Average time complexity

CART            $\Theta(p N \log^2 N)$
Random Forest   $\Theta(M K \tilde{N} \log^2 \tilde{N})$
Extra-Trees     $\Theta(M K N \log N)$

• N : number of samples in L

• p : number of input variables

• K : the number of variables randomly drawn at each node

• $\tilde{N} = 0.632 N$.
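The factor 0.632 is the expected fraction of distinct samples in a bootstrap replicate: drawing N times with replacement leaves each sample out with probability (1 − 1/N)^N, which tends to 1/e. The short simulation below checks this numerically; the sample size, the number of repetitions and the seed are arbitrary.

```python
import random
from math import exp


def unique_fraction(n, rng):
    """Fraction of distinct indices in one bootstrap draw of size n."""
    return len({rng.randrange(n) for _ in range(n)}) / n


rng = random.Random(0)
N = 10_000
estimate = sum(unique_fraction(N, rng) for _ in range(20)) / 20
limit = 1.0 - exp(-1.0)  # 1 - 1/e, about 0.632
print(round(estimate, 3), round(limit, 3))
```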

32 - Improving scalability through randomization

Motivation

• Randomization and averaging improve accuracy by reducing variance.

• As a nice side-effect, the resulting algorithms are fast and

embarrassingly parallel.

• Why not purposely exploit randomization to make the

algorithm even more scalable (and at least as accurate) ?

Problem

• Let us assume a supervised learning problem of Ns samples defined over Nf features. Let us also assume T computing nodes, each with a memory capacity limited to Mmax, with Mmax ≪ Ns × Nf.

• How to best exploit the memory constraint to obtain the most

accurate model, as quickly as possible ?

33 - A straightforward solution : Random Patches (Louppe et al., 2012)

1. Draw a subsample r of ps Ns

random examples, with pf Nf

random features.

2. Build a base estimator on r .

3. Repeat 1-2 for a number T of

estimators.

4. Aggregate the predictions by

voting.

ps and pf are two meta-parameters that should be selected

• such that ps Ns × pf Nf ≤ Mmax

• to optimize accuracy
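The four steps above map closely onto scikit-learn's `BaggingClassifier` when both samples and features are drawn without replacement. The sketch below is one possible realization, not the code used for the experiments: the synthetic dataset, the values of `p_s` and `p_f` and the number of estimators are arbitrary, and scikit-learn aggregates by averaging predicted class probabilities when the base estimator provides them, rather than by counting hard votes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to make the sketch self-contained.
X, y = make_classification(n_samples=2000, n_features=40, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

p_s, p_f = 0.25, 0.5  # fractions of samples and of features per patch (arbitrary)

random_patches = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),  # base estimator built on each patch
    n_estimators=50,
    max_samples=p_s,           # step 1: p_s * N_s random examples ...
    max_features=p_f,          # ... restricted to p_f * N_f random features
    bootstrap=False,           # draw examples without replacement
    bootstrap_features=False,  # draw features without replacement
    random_state=0,
).fit(X_train, y_train)

accuracy = random_patches.score(X_test, y_test)
print(round(accuracy, 3))
```

Each base estimator only ever sees a patch of roughly p_s N_s × p_f N_f values, which is what makes the memory constraint easy to satisfy.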

34 - Impact of memory constraint

[Figure: accuracy (axis from 0.80 to 0.94) as a function of the memory constraint (axis from 0.1 to 0.5) for RP-ET, RP-DT, ET and RF.]

35 - Lessons learned from subsampling

• Training each estimator on the whole data is (often) useless.

The size of the random patches can be reduced without

(significant) loss in accuracy.

• As a result, both memory consumption and training time can

be reduced, at low cost.

• With strong memory constraints, RP can exploit data better

than the other methods.

• Sampling features is critical to improve accuracy. Sampling

the examples only is often ineffective.

36 - Opening the black box

• Random forests constitute one of the most robust and

effective machine learning algorithms for many problems.

• While simple in design and easy to use, random forests remain

however

hard to analyze theoretically,

non-trivial to interpret,

difficult to implement properly.

• Through an in-depth re-assessment of the method, this

dissertation has proposed original contributions on these

issues.

38 - Future works

Variable importances

• Theoretical characterization of variable importances in a finite

setting.

• (Re-analysis of) empirical studies based on variable

importances, in light of the results and conclusions of the

thesis.

• Study of variable importances in boosting.

Subsampling

• Finer study of subsampling statistical mechanisms.

• Smart sampling.

39 - Questions ?

40 - Backup slides

41 - Condorcet’s jury theorem

Let us consider a group of M voters.

If each voter has an independent probability p > 1/2 of voting for the correct decision, then adding more voters increases the probability of the majority decision to be correct.

When M → ∞, the probability that the

decision taken by the group is correct

approaches 1.
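The theorem can be illustrated by computing the probability of a correct majority exactly from the binomial distribution. The sketch assumes an odd number of voters so that ties cannot occur; the function name and the value p = 0.6 are arbitrary.

```python
from math import comb


def majority_correct(M, p):
    """Probability that more than half of M independent voters are correct (M odd)."""
    return sum(comb(M, k) * p**k * (1 - p) ** (M - k) for k in range(M // 2 + 1, M + 1))


probs = [majority_correct(M, 0.6) for M in (1, 11, 101)]
print([round(q, 3) for q in probs])
```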

42 - Interpretation of ρ(x) (Louppe, 2014)

Theorem.

$$\rho(x) = \frac{V_L\{E_{\theta|L}\{\varphi_{L,\theta}(x)\}\}}{V_L\{E_{\theta|L}\{\varphi_{L,\theta}(x)\}\} + E_L\{V_{\theta|L}\{\varphi_{L,\theta}(x)\}\}}$$

In other words, it is the ratio between

• the variance due to the learning set and

• the total variance, accounting for random effects due to both the learning set and the random perturbations.

ρ(x) → 1 when variance is mostly due to the learning set ;

ρ(x) → 0 when variance is mostly due to the random perturbations ;

ρ(x) ≥ 0.
