
Cross-validation for detecting and preventing overfitting

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Copyright © Andrew W. Moore

Slide 1 - A Regression Problem

y = f(x) + noise

Can we learn f from this data?

(Scatter plot of the data: y against x.)

Let’s consider three methods…

Slide 2 - Linear Regression

(The same data with a straight-line fit: y against x.)

Slide 3 - Linear Regression

Univariate linear regression with a constant term:

    X | Y
    3 | 7
    1 | 3
    : | :

X = (3, 1, …)ᵀ, y = (7, 3, …)ᵀ, i.e. x_1 = (3), y_1 = 7, …

(Originally discussed in the previous Andrew lecture: “Neural Nets”.)

Slide 4 - Linear Regression

Univariate linear regression with a constant term:

    X | Y
    3 | 7
    1 | 3
    : | :

X = (3, 1, …)ᵀ, y = (7, 3, …)ᵀ, i.e. x_1 = (3), y_1 = 7, …

Prepend a constant column to form Z:

    Z = [ 1 3 ]      y = [ 7 ]
        [ 1 1 ]          [ 3 ]
        [ : : ]          [ : ]

z_1 = (1, 3), y_1 = 7, … ; in general z_k = (1, x_k).

Slide 5 - Linear Regression

Univariate linear regression with a constant term:

    X | Y
    3 | 7
    1 | 3
    : | :

X = (3, 1, …)ᵀ, y = (7, 3, …)ᵀ, i.e. x_1 = (3), y_1 = 7, …

    Z = [ 1 3 ]      y = [ 7 ]
        [ 1 1 ]          [ 3 ]
        [ : : ]          [ : ]

z_k = (1, x_k)

β = (ZᵀZ)⁻¹(Zᵀy)

y_est = β_0 + β_1 x
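Not part of the slides: a minimal NumPy sketch of the normal-equations fit above. The toy data values are made up for illustration; the formula is the one on the slide.

```python
import numpy as np

# Made-up toy data for illustration
x = np.array([3.0, 1.0, 2.0, 5.0, 4.0])
y = np.array([7.0, 3.0, 5.0, 11.0, 9.0])

# Z has a constant column, so beta = (Z'Z)^-1 (Z'y) gives (beta_0, beta_1)
Z = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(Z.T @ Z, Z.T @ y)

y_est = Z @ beta          # y_est = beta_0 + beta_1 * x
```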

Slide 6 - Quadratic Regression

(The same data with a quadratic fit: y against x.)

Slide 7 - Quadratic Regression

(Much more about this in a future Andrew lecture: “Favorite Regression Algorithms”.)

    X | Y
    3 | 7
    1 | 3
    : | :

X = (3, 1, …)ᵀ, y = (7, 3, …)ᵀ, i.e. x_1 = (3), y_1 = 7, …

    Z = [ 1 3 9 ]      y = [ 7 ]
        [ 1 1 1 ]          [ 3 ]
        [ :     ]          [ : ]

z = (1, x, x²)

β = (ZᵀZ)⁻¹(Zᵀy)

y_est = β_0 + β_1 x + β_2 x²
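Not part of the slides: the matching NumPy sketch for the quadratic case (same made-up toy data as before); only the design matrix changes.

```python
import numpy as np

x = np.array([3.0, 1.0, 2.0, 5.0, 4.0])   # made-up toy data
y = np.array([7.0, 3.0, 5.0, 11.0, 9.0])

# z = (1, x, x^2): quadratic regression is linear regression on an expanded basis
Z = np.column_stack([np.ones_like(x), x, x**2])
beta = np.linalg.solve(Z.T @ Z, Z.T @ y)

y_est = Z @ beta          # y_est = beta_0 + beta_1*x + beta_2*x^2
```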

Slide 8 - Join-the-dots

Also known as piecewise linear nonparametric regression, if that makes you feel better.

(The same data with neighbouring points joined by line segments: y against x.)

Slide 9 - Which is best?

(Two of the fitted curves shown side by side: y against x.)

Why not choose the method with the best fit to the data?

Slide 10 - What do we really want?

(The fitted curves shown side by side: y against x.)

Why not choose the method with the best fit to the data?

“How well are you going to predict future data drawn from the same distribution?”

Slide 11 - The test set method

1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.

Slide 12 - The test set method

1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.
3. Perform your regression on the training set.

(Linear regression example.)

Slide 13 - The test set method

1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.
3. Perform your regression on the training set.
4. Estimate your future performance with the test set.

(Linear regression example.) Mean Squared Error = 2.4

Slide 14 - The test set method

1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.
3. Perform your regression on the training set.
4. Estimate your future performance with the test set.

(Quadratic regression example.) Mean Squared Error = 0.9

Slide 15 - The test set method

1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.
3. Perform your regression on the training set.
4. Estimate your future performance with the test set.

(Join-the-dots example.) Mean Squared Error = 2.2

Slide 16 - The test set method

Good news:
•Very very simple
•Can then simply choose the method with the best test-set score

Bad news:
•What’s the downside?

Slide 17 - The test set method

Good news:
•Very very simple
•Can then simply choose the method with the best test-set score

Bad news:
•Wastes data: we get an estimate of the best method to apply to 30% less data
•If we don’t have much data, our test-set might just be lucky or unlucky

(We say the “test-set estimator of performance has high variance”.)
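Not part of the slides: a minimal sketch of the test-set method, assuming a 30% split and a fit(x, y) function that returns a callable model (here, the linear fit from the earlier sketch).

```python
import numpy as np

def test_set_mse(x, y, fit, test_frac=0.3, seed=0):
    """Randomly hold out test_frac of the data, fit on the rest,
    and return the Mean Squared Error on the held-out points."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_test = int(round(test_frac * len(x)))
    test, train = idx[:n_test], idx[n_test:]
    model = fit(x[train], y[train])              # fit returns a callable model
    return np.mean((y[test] - model(x[test])) ** 2)

# One candidate method: linear regression via the normal equations
def fit_linear(x, y):
    Z = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.solve(Z.T @ Z, Z.T @ y)
    return lambda xq: beta[0] + beta[1] * xq
```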

Slide 18 - LOOCV (Leave-one-out Cross Validation)

For k = 1 to R:
1. Let (x_k, y_k) be the kth record

Slide 19 - LOOCV (Leave-one-out Cross Validation)

For k = 1 to R:
1. Let (x_k, y_k) be the kth record
2. Temporarily remove (x_k, y_k) from the dataset

Slide 20 - LOOCV (Leave-one-out Cross Validation)

For k = 1 to R:
1. Let (x_k, y_k) be the kth record
2. Temporarily remove (x_k, y_k) from the dataset
3. Train on the remaining R−1 datapoints

Slide 21 - LOOCV (Leave-one-out Cross Validation)

For k = 1 to R:
1. Let (x_k, y_k) be the kth record
2. Temporarily remove (x_k, y_k) from the dataset
3. Train on the remaining R−1 datapoints
4. Note your error on (x_k, y_k)

Slide 22 - LOOCV (Leave-one-out Cross Validation)

For k = 1 to R:
1. Let (x_k, y_k) be the kth record
2. Temporarily remove (x_k, y_k) from the dataset
3. Train on the remaining R−1 datapoints
4. Note your error on (x_k, y_k)

When you’ve done all points, report the mean error.
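Not part of the slides: a LOOCV sketch following the loop above. It assumes the same fit(x, y) -> model convention as the earlier test-set sketch (e.g. fit_linear).

```python
import numpy as np

def loocv_mse(x, y, fit):
    """Leave-one-out cross-validation: for each record k, train on the
    other R-1 points, note the squared error on (x_k, y_k), report the mean."""
    R = len(x)
    errors = np.empty(R)
    for k in range(R):
        mask = np.ones(R, dtype=bool)
        mask[k] = False                      # temporarily remove record k
        model = fit(x[mask], y[mask])        # train on the remaining R-1 points
        errors[k] = (y[k] - model(x[k])) ** 2
    return errors.mean()
```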

Slide 23 - LOOCV (Leave-one-out Cross Validation)

For k = 1 to R:
1. Let (x_k, y_k) be the kth record
2. Temporarily remove (x_k, y_k) from the dataset
3. Train on the remaining R−1 datapoints
4. Note your error on (x_k, y_k)

When you’ve done all points, report the mean error.

(Nine panels: the linear fit retrained with each datapoint left out in turn; y against x.)

MSE_LOOCV = 2.12

Slide 24 - LOOCV for Quadratic Regression

For k = 1 to R:
1. Let (x_k, y_k) be the kth record
2. Temporarily remove (x_k, y_k) from the dataset
3. Train on the remaining R−1 datapoints
4. Note your error on (x_k, y_k)

When you’ve done all points, report the mean error.

(Nine panels: the quadratic fit retrained with each datapoint left out in turn; y against x.)

MSE_LOOCV = 0.962

Slide 25 - LOOCV for Join The Dots

For k = 1 to R:
1. Let (x_k, y_k) be the kth record
2. Temporarily remove (x_k, y_k) from the dataset
3. Train on the remaining R−1 datapoints
4. Note your error on (x_k, y_k)

When you’ve done all points, report the mean error.

(Nine panels: the join-the-dots fit retrained with each datapoint left out in turn; y against x.)

MSE_LOOCV = 3.33

Slide 26 - Which kind of Cross Validation?

                 Downside                              Upside
Test-set         Variance: unreliable estimate of      Cheap
                 future performance
Leave-one-out    Expensive.                            Doesn’t waste data
                 Has some weird behavior

…can we get the best of both worlds?

Slide 27 - k-fold Cross Validation

Randomly break the dataset into k partitions (in our example we’ll have k=3 partitions colored Red, Green and Blue).

(Plot: y against x, points colored by partition.)

Slide 28 - k-fold Cross Validation

Randomly break the dataset into k partitions (in our example we’ll have k=3 partitions colored Red, Green and Blue).

For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points.

Slide 29 - k-fold Cross Validation

Randomly break the dataset into k partitions (in our example we’ll have k=3 partitions colored Red, Green and Blue).

For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points.

For the green partition: Train on all the points not in the green partition. Find the test-set sum of errors on the green points.

Slide 30 - k-fold Cross Validation

Randomly break the dataset into k partitions (in our example we’ll have k=3 partitions colored Red, Green and Blue).

For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points.

For the green partition: Train on all the points not in the green partition. Find the test-set sum of errors on the green points.

For the blue partition: Train on all the points not in the blue partition. Find the test-set sum of errors on the blue points.

Slide 31 - k-fold Cross Validation

Randomly break the dataset into k partitions (in our example we’ll have k=3 partitions colored Red, Green and Blue).

For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points.

For the green partition: Train on all the points not in the green partition. Find the test-set sum of errors on the green points.

For the blue partition: Train on all the points not in the blue partition. Find the test-set sum of errors on the blue points.

Then report the mean error.

Linear Regression: MSE_3FOLD = 2.05

Slide 32 - k-fold Cross Validation

Randomly break the dataset into k partitions (in our example we’ll have k=3 partitions colored Red, Green and Blue).

For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points.

For the green partition: Train on all the points not in the green partition. Find the test-set sum of errors on the green points.

For the blue partition: Train on all the points not in the blue partition. Find the test-set sum of errors on the blue points.

Then report the mean error.

Quadratic Regression: MSE_3FOLD = 1.11

Slide 33 - k-fold Cross Validation

Randomly break the dataset into k partitions (in our example we’ll have k=3 partitions colored Red, Green and Blue).

For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points.

For the green partition: Train on all the points not in the green partition. Find the test-set sum of errors on the green points.

For the blue partition: Train on all the points not in the blue partition. Find the test-set sum of errors on the blue points.

Then report the mean error.

Join-the-dots: MSE_3FOLD = 2.93
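Not part of the slides: a k-fold sketch matching the procedure above, assuming the same fit(x, y) -> model convention as the earlier sketches; the partition count and random assignment scheme are illustrative choices.

```python
import numpy as np

def kfold_mse(x, y, fit, k=3, seed=0):
    """Randomly break the data into k partitions; for each partition, train on
    everything else and sum the squared errors on the held-out points.
    Report the mean error over all points."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(x)) % k      # random, roughly balanced partition labels
    total_sq_err = 0.0
    for fold in range(k):
        held_out = folds == fold
        model = fit(x[~held_out], y[~held_out])
        total_sq_err += np.sum((y[held_out] - model(x[held_out])) ** 2)
    return total_sq_err / len(x)
```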

Slide 34 - Which kind of Cross Validation?

                 Downside                                   Upside
Test-set         Variance: unreliable estimate of           Cheap
                 future performance
Leave-one-out    Expensive.                                 Doesn’t waste data
                 Has some weird behavior
10-fold          Wastes 10% of the data.                    Only wastes 10%. Only 10 times
                 10 times more expensive than test-set      more expensive instead of R times.
3-fold           Wastier than 10-fold.                      Slightly better than test-set
                 Expensivier than test-set
R-fold           Identical to Leave-one-out

Slide 35 - Which kind of Cross Validation?

                 Downside                                   Upside
Test-set         Variance: unreliable estimate of           Cheap
                 future performance
Leave-one-out    Expensive.                                 Doesn’t waste data
                 Has some weird behavior
10-fold          Wastes 10% of the data.                    Only wastes 10%. Only 10 times
                 10 times more expensive than test-set      more expensive instead of R times.
3-fold           Wastier than 10-fold.                      Slightly better than test-set
                 Expensivier than test-set
R-fold           Identical to Leave-one-out

(But note: one of Andrew’s joys in life is algorithmic tricks for making these cheap.)

Slide 36 - CV-based Model Selection

• We’re trying to decide which algorithm to use.
• We train each machine and make a table…

    i | f_i | TRAINERR | 10-FOLD-CV-ERR | Choice
    1 | f1  |          |                |
    2 | f2  |          |                |
    3 | f3  |          |                |   ⌦
    4 | f4  |          |                |
    5 | f5  |          |                |
    6 | f6  |          |                |

Slide 37 - CV-based Model Selection

• Example: Choosing number of hidden units in a one-hidden-layer neural net.
• Step 1: Compute 10-fold CV error for six different model classes:

    Algorithm      | TRAINERR | 10-FOLD-CV-ERR | Choice
    0 hidden units |          |                |
    1 hidden unit  |          |                |
    2 hidden units |          |                |   ⌦
    3 hidden units |          |                |
    4 hidden units |          |                |
    5 hidden units |          |                |

• Step 2: Whichever model class gave best CV score: train it with all the data, and that’s the predictive model you’ll use.
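Not part of the slides: a generic sketch of this two-step recipe. The candidates dictionary and the kfold_mse helper from the earlier sketch are assumptions for illustration.

```python
# Step 1: score each candidate model class by k-fold CV error.
# Step 2: retrain the winner on all the data.
def select_by_cv(x, y, candidates, k=10):
    """candidates: dict mapping a name to a fit(x, y) -> model function."""
    scores = {name: kfold_mse(x, y, fit, k=k) for name, fit in candidates.items()}
    best = min(scores, key=scores.get)           # smallest CV error wins
    final_model = candidates[best](x, y)         # train the winner on ALL the data
    return best, final_model, scores
```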

Slide 38 - CV-based Model Selection

• Example: Choosing “k” for a k-nearest-neighbor regression.
• Step 1: Compute LOOCV error for six different model classes:

    Algorithm | TRAINERR | LOOCV-ERR | Choice
    K=1       |          |           |
    K=2       |          |           |
    K=3       |          |           |
    K=4       |          |           |   ⌦
    K=5       |          |           |
    K=6       |          |           |

• Step 2: Whichever model class gave best CV score: train it with all the data, and that’s the predictive model you’ll use.

Slide 39 - CV-based Model Selection

• Example: Choosing “k” for a k-nearest-neighbor regression.
• Step 1: Compute LOOCV error for six different model classes:

    Algorithm | TRAINERR | LOOCV-ERR | Choice
    K=1       |          |           |
    K=2       |          |           |
    K=3       |          |           |
    K=4       |          |           |   ⌦
    K=5       |          |           |
    K=6       |          |           |

Q: Why did we use 10-fold-CV for neural nets and LOOCV for k-nearest neighbor?
A: The reason is computational. For k-NN (and all other nonparametric methods) LOOCV happens to be as cheap as regular predictions.

Q: And why stop at K=6?
A: No good reason, except it looked like things were getting worse as K was increasing.

Q: Are we guaranteed that a local optimum of K vs LOOCV will be the global optimum?
A: Sadly, no. And in fact, the relationship can be very bumpy.

Q: What should we do if we are depressed at the expense of doing LOOCV for K = 1 through 1000?
A: Idea One: K=1, K=2, K=4, K=8, K=16, K=32, K=64 … K=1024. Idea Two: Hillclimbing from an initial guess at K.

• Step 2: Whichever model class gave best CV score: train it with all the data, and that’s the predictive model you’ll use.

Slide 40 - CV-based Model Selection

• Can you think of other decisions we can ask Cross Validation to make for us, based on other machine learning algorithms in the class so far?

Slide 41 - CV-based Model Selection

• Can you think of other decisions we can ask Cross Validation to make for us, based on other machine learning algorithms in the class so far?
  • Degree of polynomial in polynomial regression
  • Whether to use full, diagonal or spherical Gaussians in a Gaussian Bayes Classifier
  • The Kernel Width in Kernel Regression
  • The Kernel Width in Locally Weighted Regression
  • The Bayesian Prior in Bayesian Regression

These involve choosing the value of a real-valued parameter. What should we do?

Slide 42 - CV-based Model Selection

• Can you think of other decisions we can ask Cross Validation to make for us, based on other machine learning algorithms in the class so far?
  • Degree of polynomial in polynomial regression
  • Whether to use full, diagonal or spherical Gaussians in a Gaussian Bayes Classifier
  • The Kernel Width in Kernel Regression
  • The Kernel Width in Locally Weighted Regression
  • The Bayesian Prior in Bayesian Regression

These involve choosing the value of a real-valued parameter. What should we do?

Idea One: Consider a discrete set of values (often best to consider a set of values with exponentially increasing gaps, as in the K-NN example).

Idea Two: Compute ∂LOOCV / ∂Parameter and then do gradient descent.
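Not part of the slides: a tiny sketch of Idea One for a real-valued parameter such as a kernel width. The candidate grid and the make_fit(width) factory are illustrative assumptions; loocv_mse is the helper from the earlier sketch.

```python
# Exponentially spaced candidate values, as suggested for the K-NN example
kernel_widths = [0.01 * 2**i for i in range(10)]   # 0.01, 0.02, 0.04, ...

def pick_kernel_width(x, y, make_fit, widths=kernel_widths):
    """make_fit(width) returns a fit(x, y) -> model function for that width."""
    scores = {w: loocv_mse(x, y, make_fit(w)) for w in widths}
    return min(scores, key=scores.get), scores
```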

Slide 43 - CV-based Model Selection

• Can you think of other decisions we can ask Cross Validation to make for us, based on other machine learning algorithms in the class so far?
  • Degree of polynomial in polynomial regression
  • Whether to use full, diagonal or spherical Gaussians in a Gaussian Bayes Classifier
  • The Kernel Width in Kernel Regression
  • The Kernel Width in Locally Weighted Regression
  • The Bayesian Prior in Bayesian Regression
  • Also: the scale factors of a non-parametric distance metric

These involve choosing the value of a real-valued parameter. What should we do?

Idea One: Consider a discrete set of values (often best to consider a set of values with exponentially increasing gaps, as in the K-NN example).

Idea Two: Compute ∂LOOCV / ∂Parameter and then do gradient descent.

Slide 44 - CV-based Algorithm Choice

• Example: Choosing which regression algorithm to use
• Step 1: Compute 10-fold-CV error for six different model classes:

    Algorithm    | TRAINERR | 10-fold-CV-ERR | Choice
    1-NN         |          |                |
    10-NN        |          |                |
    Linear Reg’n |          |                |
    Quad Reg’n   |          |                |   ⌦
    LWR, KW=0.1  |          |                |
    LWR, KW=0.5  |          |                |

• Step 2: Whichever algorithm gave best CV score: train it with all the data, and that’s the predictive model you’ll use.

Slide 45 - Alternatives to CV-based model selection

• Model selection methods:
  1. Cross-validation
  2. AIC (Akaike Information Criterion)
  3. BIC (Bayesian Information Criterion)
  4. VC-dimension (Vapnik-Chervonenkis Dimension)

(VC-dimension is only directly applicable to choosing classifiers. Described in a future lecture.)

Slide 46 - Which model selection method is best?

1. (CV) Cross-validation
2. AIC (Akaike Information Criterion)
3. BIC (Bayesian Information Criterion)
4. (SRMVC) Structural Risk Minimization with VC-dimension

• AIC, BIC and SRMVC advantage: you only need the training error.
• CV error might have more variance
• SRMVC is wildly conservative
• Asymptotically AIC and Leave-one-out CV should be the same
• Asymptotically BIC and a carefully chosen k-fold should be the same
• You want BIC if you want the best structure instead of the best predictor (e.g. for clustering or Bayes Net structure finding)
• Many alternatives, including proper Bayesian approaches.
• It’s an emotional issue.

Slide 47 - Other Cross-validation issues

• Can do “leave-all-pairs-out” or “leave-all-n-tuples-out” if feeling resourceful.
• Some folks do k-folds in which each fold is an independently-chosen subset of the data.
• Do you know what AIC and BIC are?
  If so…
  • LOOCV behaves like AIC asymptotically.
  • k-fold behaves like BIC if you choose k carefully.
  If not…
  • Nyardely nyardely nyoo nyoo

Slide 48 - Cross-Validation for regression

• Choosing the number of hidden units in a neural net
• Feature selection (see later)
• Choosing a polynomial degree
• Choosing which regressor to use

Slide 49 - Supervising Gradient Descent

• This is a weird but common use of Test-set validation
• Suppose you have a neural net with too many hidden units. It will overfit.
• As gradient descent progresses, maintain a graph of MSE-testset-error vs. Iteration

(Plot: Mean Squared Error vs. Iteration of Gradient Descent, with a Training Set curve and a Test Set curve; an arrow marks “Use the weights you found on this iteration” where the test-set error is lowest.)

Slide 50 - Supervising Gradient Descent

• This is a weird but common use of Test-set validation
• Suppose you have a neural net with too many hidden units. It will overfit.
• As gradient descent progresses, maintain a graph of MSE-testset-error vs. Iteration

Relies on an intuition that a not-fully-minimized set of weights is somewhat like having fewer parameters.

Works pretty well in practice, apparently.

(Plot: Mean Squared Error vs. Iteration of Gradient Descent, with a Training Set curve and a Test Set curve; an arrow marks “Use the weights you found on this iteration” where the test-set error is lowest.)
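Not part of the slides: a hedged sketch of this early-stopping idea. The model interface (gradient_step, mse, get_weights, set_weights) is a hypothetical placeholder, not a real library API.

```python
def early_stopping(model, train_data, test_data, n_iters=1000):
    """Run gradient descent while tracking test-set MSE; keep the weights
    from the iteration where the test-set error was lowest."""
    best_err, best_weights = float("inf"), model.get_weights()
    for it in range(n_iters):
        model.gradient_step(train_data)          # one iteration of gradient descent
        err = model.mse(test_data)               # MSE-testset-error at this iteration
        if err < best_err:
            best_err, best_weights = err, model.get_weights()
    model.set_weights(best_weights)              # use the weights found on that iteration
    return model, best_err
```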

Slide 51 - Cross-validation for classification

• Instead of computing the sum squared errors on a test set, you should compute…

Slide 52 - Cross-validation for classification

• Instead of computing the sum squared errors on a test set, you should compute…

The total number of misclassifications on a test set.

Slide 53 - Cross-validation for classification

• Instead of computing the sum squared errors on a test set, you should compute…

The total number of misclassifications on a test set.

• What’s LOOCV of 1-NN?
• What’s LOOCV of 3-NN?
• What’s LOOCV of 22-NN?

Slide 54 - Cross-validation for classification

• Instead of computing the sum squared errors on a test set, you should compute…

The total number of misclassifications on a test set.

• But there’s a more sensitive alternative: Compute

  log P(all test outputs | all test inputs, your model)
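Not part of the slides: a small sketch computing both classification scores for one test set. The predict and predict_proba method names are illustrative assumptions about the classifier's interface, and the log-likelihood assumes independent test points with integer class labels.

```python
import numpy as np

def classification_scores(model, X_test, y_test):
    """Return (number of misclassifications, test log-likelihood)."""
    pred = model.predict(X_test)                       # hard class predictions
    n_misclassified = int(np.sum(pred != y_test))

    # log P(all test outputs | all test inputs, model); y_test are class indices
    probs = model.predict_proba(X_test)                # shape (n_points, n_classes)
    log_lik = float(np.sum(np.log(probs[np.arange(len(y_test)), y_test])))
    return n_misclassified, log_lik
```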

Slide 55 - Cross-Validation for classification

• Choosing the pruning parameter for decision trees
• Feature selection (see later)
• What kind of Gaussian to use in a Gaussian-based Bayes Classifier
• Choosing which classifier to use

Slide 56 - Cross-Validation for density estimation

• Compute the sum of log-likelihoods of test points

Example uses:
• Choosing what kind of Gaussian assumption to use
• Choose the density estimator
• NOT Feature selection (test-set density will almost always look better with fewer features)

Slide 57 - Feature Selection

• Suppose you have a learning algorithm LA and a set of input attributes { X_1, X_2, …, X_m }
• You expect that LA will only find some subset of the attributes useful.
• Question: How can we use cross-validation to find a useful subset?
• Four ideas:
  • Forward selection (see the sketch after this list)
  • Backward elimination
  • Hill Climbing
  • Stochastic search (Simulated Annealing or GAs)

(Another fun area in which Andrew has spent a lot of his wild youth.)
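Not part of the slides: a hedged sketch of greedy forward selection driven by cross-validation. X is assumed to be a NumPy array of attribute columns, and cv_score(X_subset, y) is an assumed stand-in for any of the CV estimators above (test-set, LOOCV, or k-fold).

```python
def forward_selection(X, y, cv_score):
    """Greedily add the attribute that most improves the CV score,
    stopping when no single addition helps."""
    remaining = set(range(X.shape[1]))
    chosen, best_score = [], float("inf")
    while remaining:
        trials = {j: cv_score(X[:, chosen + [j]], y) for j in remaining}
        j_best = min(trials, key=trials.get)
        if trials[j_best] >= best_score:
            break                                 # no attribute improves the score
        chosen.append(j_best)
        remaining.remove(j_best)
        best_score = trials[j_best]
    return chosen, best_score
```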

Slide 58 - Very serious warning

• Intensive use of cross validation can overfit.

• How?

• What can be done about it?


Slide 59 - Very serious warning

• Intensive use of cross validation can overfit.
• How?
  • Imagine a dataset with 50 records and 1000 attributes.
  • You try 1000 linear regression models, each one using one of the attributes.
• What can be done about it?

Slide 60 - Very serious warning

• Intensive use of cross validation can overfit.
• How?
  • Imagine a dataset with 50 records and 1000 attributes.
  • You try 1000 linear regression models, each one using one of the attributes.
  • The best of those 1000 looks good!
• What can be done about it?

Slide 61 - Very serious warning

• Intensive use of cross validation can overfit.
• How?
  • Imagine a dataset with 50 records and 1000 attributes.
  • You try 1000 linear regression models, each one using one of the attributes.
  • The best of those 1000 looks good!
  • But you realize it would have looked good even if the output had been purely random!
• What can be done about it?
  • Hold out an additional test set before doing any model selection. Check that the best model performs well even on the additional test set.
  • Or: Randomization Testing
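Not part of the slides: a hedged sketch of the additional-holdout safeguard. The split fraction and the select_by_cv helper from the earlier sketch are illustrative assumptions; the point is simply that the final holdout is set aside before any model selection happens.

```python
import numpy as np

def select_then_verify(X, y, candidates, holdout_frac=0.2, seed=0):
    """Hold out a final test set BEFORE any model selection, run CV-based
    selection on the rest, then check the winner on the untouched holdout."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_hold = int(round(holdout_frac * len(y)))
    hold, work = idx[:n_hold], idx[n_hold:]

    name, model, cv_scores = select_by_cv(X[work], y[work], candidates)
    holdout_mse = np.mean((y[hold] - model(X[hold])) ** 2)   # honest final check
    return name, holdout_mse
```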

Slide 62 - What you should know

• Why you can’t use “training-set-error” to estimate the quality of your learning algorithm on your data.
• Why you can’t use “training set error” to choose the learning algorithm
• Test-set cross-validation
• Leave-one-out cross-validation
• k-fold cross-validation
• Feature selection methods
• CV for classification, regression & densities