Given by Yunting Sun at the NYC open data meetup, 2/13/2014. More information at www.nycopendata.com, or join us at www.meetup.com/nyc-open-data.

Ridge Regression, LASSO and Elastic Net

A talk given at NYC open data meetup, find more at www.nycopendata.com

Yunting Sun

Google Inc

1 - Overview

· Linear Regression

· Ordinary Least Square

· Ridge Regression

· LASSO

· Elastic Net

· Examples

· Exercises

Note: make sure that you have installed the MASS and elasticnet packages

library(MASS)
library(elasticnet)


2 - Linear Regression

n observations, each with one response variable and p predictors

Y = (y_1, ..., y_n)^T,  n × 1

X = (X_1, ..., X_p),  n × p

· We want to find a linear combination of the predictors x = (x_1, ..., x_p) to

- describe the actual relationship between y and x_1, ..., x_p

- predict y with ŷ = x^T β̂

· Examples

- find relationship between pressure and water boiling point

- use GDP to predict interest rate (the accuracy of the prediction is important but the

actual relationship may not matter)


3 - Quality of an estimator

Suppose β_0 is the true value and

y = x^T β_0 + ε,  ε ~ N(0, σ^2)

· Prediction error at x_0: the expected squared difference between the actual response and the model prediction

EPE(x_0) = E[(y - x_0^T β̂)^2 | x = x_0]

EPE(x_0) = σ^2 + E(x_0^T β_0 - x_0^T β̂)^2

EPE(x_0) = σ^2 + [Bias^2(x_0^T β̂) + Var(x_0^T β̂)]

where Bias(x_0^T β̂) = E(x_0^T β̂) - x_0^T β_0.

· The second and third terms make up the mean squared error of x_0^T β̂ in estimating x_0^T β_0.

· How to estimate prediction error?


4 - K-fold Cross Validation

· Split the dataset into K groups

- leave one group out as the test set

- use the remaining K-1 groups as the training set to train the model

- estimate the prediction error of the model on the test set
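The steps above can be sketched in base R. This is an illustrative sketch on simulated data; the fold assignment via sample() and the toy linear model are our assumptions, not from the slides:

```r
# Minimal K-fold cross validation sketch (simulated data; illustrative only).
set.seed(1)
n = 100
K = 5
d = data.frame(x = rnorm(n))
d$y = 2 * d$x + rnorm(n)
# randomly assign each observation to one of the K groups
fold = sample(rep(1:K, length.out = n))
errors = numeric(K)
for (k in 1:K) {
    train = d[fold != k, ]  # K-1 groups train the model
    test = d[fold == k, ]   # the held-out group estimates the error
    model = lm(y ~ x, data = train)
    errors[k] = mean((test$y - predict(model, newdata = test))^2)
}
mean(errors)  # average prediction error across the K folds
```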


5 - K-fold Cross Validation

Let E_i be the prediction error for the ith test group; the average prediction error is

E = (1/K) Σ_{i=1}^K E_i


6 - Quality of an estimator

· Mean squared error of the estimator β̂

MSE(β̂) = E[(β̂ - β_0)^2]

MSE(β̂) = Bias^2(β̂) + Var(β̂)

· A biased estimator may achieve smaller MSE than an unbiased estimator

· useful when our goal is to understand the relationship rather than prediction
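A small simulation makes the bias-variance trade concrete. The setup below (shrinking the sample mean toward zero) is our own toy example, not from the slides:

```r
# Toy illustration: a shrunken (biased) estimator can have smaller MSE than
# the unbiased one. This particular setup is an assumption of ours.
set.seed(1)
beta0 = 0.5    # true value to estimate
n = 10
n.sim = 10000
mse.unbiased = numeric(n.sim)
mse.shrunk = numeric(n.sim)
for (i in seq_len(n.sim)) {
    y = rnorm(n, mean = beta0, sd = 1)
    est = mean(y)        # unbiased, MSE = 1/n
    shrunk = 0.8 * est   # biased: Bias^2 + Var = 0.04 * beta0^2 + 0.64/n
    mse.unbiased[i] = (est - beta0)^2
    mse.shrunk[i] = (shrunk - beta0)^2
}
mean(mse.shrunk) < mean(mse.unbiased)  # TRUE: bias traded for variance
```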


7 - Least Squares Estimator (LSE)

Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1}

y_i = Σ_{j=1}^p x_{ij} β_j + ε_i,  i = 1, ..., n

ε_i ~ i.i.d. N(0, σ^2)

Minimize the residual sum of squares (RSS)

β̂ = arg min (Y - Xβ)^T (Y - Xβ) = (X^T X)^{-1} X^T Y

The solution is uniquely well defined when n > p and X^T X is invertible
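The closed form can be checked directly against lm(). A quick sketch on simulated data; the dimensions and coefficients here are arbitrary choices of ours:

```r
# Verify beta.hat = (X^T X)^{-1} X^T Y against lm() (illustrative sketch).
set.seed(1)
n = 50
p = 3
X = matrix(rnorm(n * p), n, p)
Y = X %*% c(1, -2, 0.5) + rnorm(n)
beta.hat = solve(t(X) %*% X, t(X) %*% Y)  # closed-form least squares
fit = lm(Y ~ X - 1)                       # no intercept, as in the slides
max(abs(beta.hat - coef(fit)))            # agrees up to numerical error
```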


8 - Pros

E(β̂) = β,  unbiased

· LSE has the minimum MSE among unbiased linear estimators, though a biased estimator may have smaller MSE than LSE

· explicit form

· computation is O(np^2)

· confidence intervals, significance tests for coefficients


9 - Cons

Var(β̂) = (X^T X)^{-1} σ^2

· Multicollinearity leads to high variance of the estimator

- exact or approximate linear relationships among predictors

- (X^T X)^{-1} tends to have large entries

· Requires n > p, i.e., more observations than predictors

E_{x_0} EPE(x_0) ≈ σ^2 (p/n) + σ^2  (expected prediction error)

· Prediction error increases linearly as a function of p

· Hard to interpret when the number of predictors is large; we need a smaller subset that exhibits the strongest effects


10 - Example: Leukemia classification

· Leukemia Data, Golub et al. Science 1999

· There are 38 training samples and 34 test samples with total p = 7129 genes (p >> n)

· Xij is the gene expression value for sample i and gene j

· Sample i either has tumor type AML or ALL

· We want to select genes relevant to tumor type

- eliminate the trivial genes

- grouped selection as many genes are highly correlated

· LSE does not work here!


11 - Solution: regularization

· instead of minimizing RSS, minimize (RSS + λ × penalty on the parameters)

· Trade bias for smaller variance; the estimator is biased when λ ≠ 0

· Continuous variable selection (unlike AIC, BIC, subset selection)

· λ can be chosen by cross validation


12 - Ridge Regression

β̂_ridge = arg min{ ||Y - Xβ||_2^2 + λ||β||_2^2 }

β̂_ridge = (X^T X + λI)^{-1} X^T Y

Pros:

· works when p >> n

· handles multicollinearity

· biased but smaller variance and smaller MSE (mean squared error)

· explicit solution

Cons:

· shrinks coefficients toward zero but cannot produce a parsimonious model
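The explicit solution is easy to verify numerically; the sketch below, on a simulated example of our own, also shows the coefficient norm shrinking as λ grows:

```r
# Ridge closed form (X^T X + lambda I)^{-1} X^T Y; larger lambda, more shrinkage.
set.seed(1)
n = 50
p = 3
X = matrix(rnorm(n * p), n, p)
Y = X %*% c(1, -2, 0.5) + rnorm(n)
ridge = function(lambda) solve(t(X) %*% X + lambda * diag(p), t(X) %*% Y)
ols = ridge(0)                        # lambda = 0 recovers OLS
sum(ridge(10)^2) < sum(ols^2)         # TRUE: coefficients shrink toward zero
sum(ridge(100)^2) < sum(ridge(10)^2)  # TRUE: shrinkage is monotone in lambda
```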


13 - Grouped Selection

· if two predictors are highly correlated, their estimated coefficients will be similar

· if some variables are exactly identical, they will have the same coefficients

Ridge is good for grouped selection but not good for eliminating trivial genes
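The identical-variables case can be checked directly with the ridge closed form; the tiny example below is our own:

```r
# Two identical predictors receive equal ridge coefficients (sketch).
set.seed(1)
n = 100
x1 = rnorm(n)
x2 = x1                    # exact duplicate of x1
y = 2 * x1 + rnorm(n)
X = cbind(x1, x2)
# ridge with lambda = 1; OLS would fail here since X^T X is singular
b = solve(t(X) %*% X + 1 * diag(2), t(X) %*% y)
b  # the two entries are equal; together they carry the shared effect
```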


14 - Example: Ridge Regression (Collinearity)

· multicollinearity: x3 = x1 + x2

· show that ridge regression beats OLS in the multicollinear case

library(MASS)
n = 500
z = rnorm(n, 0, 1)
y = z + 0.2 * rnorm(n, 0, 1)
x1 = z + rnorm(n, 0, 1)
x2 = z + rnorm(n, 0, 1)
x3 = x1 + x2
d = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)


15 - OLS

# OLS fails to compute a coefficient for x3
ols.model = lm(y ~ . - 1, d)
coef(ols.model)

##     x1     x2     x3
## 0.3053 0.3187     NA


16 - Ridge Regression

# choose the tuning parameter by generalized cross validation (GCV)
ridge.model = lm.ridge(y ~ . - 1, d, lambda = seq(0, 10, 0.1))
lambda.opt = ridge.model$lambda[which.min(ridge.model$GCV)]
# ridge regression (shrinks the coefficients)
coef(lm.ridge(y ~ . - 1, d, lambda = lambda.opt))

##     x1     x2     x3
## 0.1771 0.1902 0.1258


17 - Approximately multicollinear

· show that ridge regression corrects the coefficient signs and reduces the test prediction error

x3 = x1 + x2 + 0.05 * rnorm(n, 0, 1)
d = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)
d.train = d[1:400, ]
d.test = d[401:500, ]


18 - OLS

ols.train = lm(y ~ . - 1, d.train)
coef(ols.train)

##      x1      x2      x3
## -0.3764 -0.3522  0.6839

# prediction error on the test set
sum((d.test$y - predict(ols.train, newdata = d.test))^2)

## [1] 37.53


19 - Ridge Regression

# choose the tuning parameter for ridge regression
ridge.train = lm.ridge(y ~ . - 1, d.train, lambda = seq(0, 10, 0.1))
lambda.opt = ridge.train$lambda[which.min(ridge.train$GCV)]
ridge.model = lm.ridge(y ~ . - 1, d.train, lambda = lambda.opt)
coef(ridge.model)  # correct signs

##     x1     x2     x3
## 0.1713 0.1936 0.1340

coefs = coef(ridge.model)
sum((d.test$y - as.matrix(d.test[, -1]) %*% matrix(coefs, 3, 1))^2)

## [1] 36.87


20 - LASSO

β̂_lasso = arg min{ ||Y - Xβ||_2^2 + λ||β||_1 }

Or equivalently

min ||Y - Xβ||_2^2  s.t.  ||β||_1 = Σ_{j=1}^p |β_j| ≤ t

Pros

· allows p >> n

· enforces sparsity in the parameters

· quadratic programming problem; the LARS solution requires O(np^2)

· as λ goes to 0 (t goes to ∞), we recover the OLS solution

· as λ goes to ∞ (t goes to 0), β̂ = 0
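For a single standardized predictor the LASSO estimate reduces to soft thresholding of the OLS estimate, which makes both the shrinkage and the exact zeros visible. This is a standard special case, sketched here with threshold values of our own choosing:

```r
# Soft thresholding: sign(b) * max(|b| - t, 0), the one-dimensional LASSO rule.
soft = function(b, t) sign(b) * pmax(abs(b) - t, 0)
soft(1.5, 0.5)   # 1.0: shrunk toward zero
soft(0.25, 0.5)  # 0: set exactly to zero, producing sparsity
soft(-1.5, 0.5)  # -1.0: symmetric in sign
```

Coefficients below the threshold are zeroed outright, which is why LASSO, unlike ridge, yields a parsimonious model.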


21 - Cons

· if a group of predictors are highly correlated among themselves, LASSO tends to pick only one of them and shrink the others to zero

· cannot do grouped selection; it tends to select a single variable

LASSO is good for eliminating trivial genes but not good for grouped selection


22 - LARS algorithm of Efron et al. (2004)

· stepwise variable selection (least angle regression and shrinkage)

· a less greedy version of traditional forward selection methods

· solves the entire LASSO solution path efficiently

· same order of computational effort as a single OLS fit, O(np^2)


23 - LARS Path

min ||Y - Xβ||_2^2  s.t.  ||β||_1 ≤ s ||β̂_OLS||_1,  s ∈ [0, 1]


24 - Parsimonious model

library(MASS)
n = 20
# beta is sparse
beta = matrix(c(3, 1.5, 0, 0, 2, 0, 0, 0), 8, 1)
p = length(beta)
rho = 0.3
corr = matrix(0, p, p)
for (i in seq(p)) {
    for (j in seq(p)) {
        corr[i, j] = rho^abs(i - j)
    }
}
X = mvrnorm(n, mu = rep(0, p), Sigma = corr)
y = X %*% beta + 3 * rnorm(n, 0, 1)
d = as.data.frame(cbind(y, X))
colnames(d) = c("y", paste0("x", seq(p)))


25 - OLS

n.sim = 100
mse = rep(0, n.sim)
for (i in seq(n.sim)) {
    X = mvrnorm(n, mu = rep(0, p), Sigma = corr)
    y = X %*% beta + 3 * rnorm(n, 0, 1)
    d = as.data.frame(cbind(y, X))
    colnames(d) = c("y", paste0("x", seq(p)))
    # fit OLS without intercept
    ols.model = lm(y ~ . - 1, d)
    mse[i] = sum((coef(ols.model) - beta)^2)
}
median(mse)

## [1] 6.32


26 - Ridge Regression

n.sim = 100
mse = rep(0, n.sim)
for (i in seq(n.sim)) {
    X = mvrnorm(n, mu = rep(0, p), Sigma = corr)
    y = X %*% beta + 3 * rnorm(n, 0, 1)
    d = as.data.frame(cbind(y, X))
    colnames(d) = c("y", paste0("x", seq(p)))
    ridge.cv = lm.ridge(y ~ . - 1, d, lambda = seq(0, 10, 0.1))
    lambda.opt = ridge.cv$lambda[which.min(ridge.cv$GCV)]
    # fit ridge regression without intercept
    ridge.model = lm.ridge(y ~ . - 1, d, lambda = lambda.opt)
    mse[i] = sum((coef(ridge.model) - beta)^2)
}
median(mse)

## [1] 4.074


27 - LASSO

library(elasticnet)
n.sim = 100
mse = rep(0, n.sim)
for (i in seq(n.sim)) {
    X = mvrnorm(n, mu = rep(0, p), Sigma = corr)
    y = X %*% beta + 3 * rnorm(n, 0, 1)
    # lambda = 0 makes enet equivalent to the LASSO
    obj.cv = cv.enet(X, y, lambda = 0, s = seq(0.1, 1, length = 100), plot.it = FALSE,
        mode = "fraction", trace = FALSE, max.steps = 80)
    s.opt = obj.cv$s[which.min(obj.cv$cv)]
    lasso.model = enet(X, y, lambda = 0, intercept = FALSE)
    coefs = predict(lasso.model, s = s.opt, type = "coefficients", mode = "fraction")
    mse[i] = sum((coefs$coefficients - beta)^2)
}
median(mse)

## [1] 3.393


28 - Elastic Net

β̂_enet = arg min{ (Y - Xβ)^T (Y - Xβ) + λ_1 ||β||_1 + λ_2 ||β||_2^2 }

Pros

· enforces sparsity

· no limitation on the number of selected variables

· encourages a grouping effect in the presence of highly correlated predictors

Cons

· the naive elastic net suffers from double shrinkage

Correction (rescale the naive solution)

β̂_enet = (1 + λ_2) β̂_naive


29 - LASSO vs Elastic Net

Construct a data set with grouped effects to show that Elastic Net outperforms LASSO in grouped selection

· response y

· 6 predictors falling into two groups: x1, x2, x3 as dominant factors; x4, x5, x6 as minor factors we would like to shrink to zero

Two independent "hidden" factors z1 and z2

y = z1 + 0.1 z2 + N(0, 1)

Correlated grouped covariates (signs as in the simulation code on the next slide)

x1 = z1 + ε_1,  x2 = -z1 + ε_2,  x3 = z1 + ε_3

x4 = z2 + ε_4,  x5 = -z2 + ε_5,  x6 = z2 + ε_6

X = (x1, x2, ..., x6)


30 - Simulated data

N = 100
z1 = runif(N, min = 0, max = 20)
z2 = runif(N, min = 0, max = 20)
y = z1 + 0.1 * z2 + rnorm(N)
X = cbind(z1 %*% matrix(c(1, -1, 1), 1, 3), z2 %*% matrix(c(1, -1, 1), 1, 3))
X = X + matrix(rnorm(N * 6), N, 6)


31 - LASSO path

library(elasticnet)
obj.lasso = enet(X, y, lambda = 0)
plot(obj.lasso, use.color = TRUE)


32 - Elastic Net

library(elasticnet)
obj.enet = enet(X, y, lambda = 0.5)
plot(obj.enet, use.color = TRUE)


33 - How to choose tuning parameter

For each λ in a sequence, find the s that minimizes the CV prediction error; then choose the λ whose best s gives the smallest CV prediction error overall

library(elasticnet)
obj.cv = cv.enet(X, y, lambda = 0.5, s = seq(0, 1, length = 100), mode = "fraction",
    trace = FALSE, max.steps = 80)
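The two-level search can be written out as a loop over a small λ grid. This is a sketch on simulated data; the grids and the data are our assumptions, and only cv.enet and its arguments come from the slides:

```r
# Grid search: best s per lambda by CV, then the best lambda overall (sketch).
library(elasticnet)
set.seed(1)
n = 100
p = 5
X = matrix(rnorm(n * p), n, p)
y = drop(X %*% c(2, 1, 0, 0, 0) + rnorm(n))
lambdas = c(0, 0.1, 1)
cv.err = numeric(length(lambdas))
s.opt = numeric(length(lambdas))
for (i in seq_along(lambdas)) {
    obj.cv = cv.enet(X, y, lambda = lambdas[i], s = seq(0.1, 1, length = 20),
        mode = "fraction", plot.it = FALSE, trace = FALSE, max.steps = 80)
    cv.err[i] = min(obj.cv$cv)          # best CV error for this lambda
    s.opt[i] = obj.cv$s[which.min(obj.cv$cv)]  # and the s that achieves it
}
lambdas[which.min(cv.err)]  # the lambda whose best s minimizes the CV error
```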


34 - Prostate Cancer Example

· Predictors are eight clinical measures

· Training set with 67 observations

· Test set with 30 observations

· Model fitting and tuning parameter selection by tenfold CV on the training set

· Compare model performance by the prediction mean-squared error on the test data


35 - Compare models

· medium correlation among predictors; the highest correlation is 0.76

· elastic net beats LASSO, and ridge regression beats OLS


36 - Summary

· Ridge Regression:

- good for multicollinearity and grouped selection

- not good for variable selection

· LASSO

- good for variable selection

- not good for grouped selection with strongly correlated predictors

· Elastic Net

- combines the strengths of Ridge Regression and LASSO

· Regularization

- trades bias for variance reduction

- better prediction accuracy


37 - Reference

Most of the material covered in these slides is adapted from

· Paper: Regularization and variable selection via the elastic net (Zou and Hastie, 2005)

· Slides: http://www.stanford.edu/~hastie/TALKS/enet_talk.pdf

· The Elements of Statistical Learning


38 - Exercise 1: simulated data

beta = matrix(c(rep(3, 15), rep(0, 25)), 40, 1)
sigma = 15
n = 500
z1 = matrix(rnorm(n, 0, 1), n, 1)
z2 = matrix(rnorm(n, 0, 1), n, 1)
z3 = matrix(rnorm(n, 0, 1), n, 1)
X1 = z1 %*% matrix(rep(1, 5), 1, 5) + 0.01 * matrix(rnorm(n * 5), n, 5)
X2 = z2 %*% matrix(rep(1, 5), 1, 5) + 0.01 * matrix(rnorm(n * 5), n, 5)
X3 = z3 %*% matrix(rep(1, 5), 1, 5) + 0.01 * matrix(rnorm(n * 5), n, 5)
X4 = matrix(rnorm(n * 25, 0, 1), n, 25)
X = cbind(X1, X2, X3, X4)
Y = X %*% beta + sigma * rnorm(n, 0, 1)
Y.train = Y[1:400]
X.train = X[1:400, ]
# use 401:500 so the test set does not overlap the training set
Y.test = Y[401:500]
X.test = X[401:500, ]


39 - Questions:

· Fit OLS, LASSO, ridge regression and elastic net to the training data and compute the prediction error on the test data

· Simulate the data set 100 times and compare the median mean-squared errors of the models


40 - Exercise 2: Diabetes

· x a matrix with 10 columns

· y a numeric vector (442 rows)

· x2 a matrix with 64 columns

library(elasticnet)
data(diabetes)
colnames(diabetes)

## [1] "x"  "y"  "x2"


41 - Questions

· Fit LASSO and Elastic Net to the data, with the optimal tuning parameters chosen by cross validation

· Compare the solution paths of the two methods
