- PRUDENTIAL LIFE INSURANCE RISK MODEL

A Kaggle competition for GA – PT Data Science ’15-’16

Patrick Kennedy – 2.15.16

patrick@structuredmotivation.com

- What is the problem?

• Prudential life insurance: a 30-day process to establish risk

• What if we could … make life insurance selection on-demand?

• Let’s build a model to predict levels of risk as measured by application status

- Leaderboard

• Show Kaggle leaderboard with scores (as measured by QWK)

• Goal? 30k

- The Data

Anonymized:

– Train: [59381, 128], Test: [19765, 127]

– 13 continuous

– 65 categorical

– 4 discrete

– 48 other

– 1 Id, 1 Response

– Contains no a priori intuition

The real trick is that there are 8 classes of output… I chose to build models against a continuous target and then use a function to provide cut points before submitting final predictions (…it seemed a little easier than building 8 separate models)

- Initial Exploration
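A minimal sketch of that cut-point mapping (the function name and thresholds here are illustrative, not the competition code): train against the continuous target, then bin the continuous predictions back into the 8 ordinal classes.

```python
import numpy as np

def to_classes(cont_preds, cuts=None):
    """Map continuous predictions onto the 8 ordinal risk classes.

    With cuts=None this is the naive baseline: round and clip to [1, 8].
    Passing explicit cut points (7 thresholds between the 8 classes)
    gives the tunable version described above.
    """
    if cuts is None:
        return np.clip(np.round(cont_preds), 1, 8).astype(int)
    # np.digitize returns bin indices 0..7 for 7 cut points; shift to 1..8
    return np.digitize(cont_preds, cuts) + 1

preds = np.array([0.2, 1.6, 4.4, 7.9, 9.3])
print(to_classes(preds))  # → [1 2 4 8 8]
print(to_classes(preds, cuts=[1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5]))
```

The payoff is that the 7 thresholds become tunable parameters: shifting them trades errors between adjacent classes, which matters under QWK.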

…

- Roadmap

1. Find a model

2. Build a network of models

3. Tune

4. Results?

- Baseline model (1/2)

Rank: 138 / 1970 – Top 10%**

• XGBoost – score of 0.669

• XGBoost stands for eXtreme Gradient Boosting

• Parallelized tree boosting – FAST

• Has Python wrappers for ease of use

- Baseline model (2/2)

• Process:

1) train model

2) train offsets

3) apply offsets to predicted test set

Tools: fmin_powell, quadratic weighted kappa (QWK)

• fmin_powell is an optimization method – it minimizes along each search direction in turn, updating the estimate iteratively

• QWK is an inter-rater agreement measure, except it takes into account how wrong predictions are and penalizes greater disagreement

- Offset example

Actual   Predicted   New Predictions
8        7.35        12.48
6        6.72        5.99
7        7.11        11.22
3        1.32        2.56
6        5.49        5.56
5        5.12        5.11
5        5.03        5.03
4        3.19        3.78
1        1.01        0.03
2        2.47        2.48
4        4.11        3.76
2        2.54        1.98
8        8.32        23.09
3        3.00        3.24

Pipeline: start from per-class offset guesses → optimize -QWK (applied per class, sequentially) → round → np.clip(data, 1, 8)
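A sketch of that per-class offset pipeline, reconstructed from the slide (the helper names are mine and the real kernel differs in details): a numpy QWK, fmin_powell tuning one offset per class sequentially against -QWK, then round and np.clip(…, 1, 8). The toy data reuses the Actual/Predicted columns from the table; in the real pipeline the continuous predictions come from the trained XGBoost model.

```python
import numpy as np
from scipy.optimize import fmin_powell

def qwk(actual, predicted, n_classes=8):
    """Quadratic weighted kappa for integer labels 1..n_classes."""
    actual = np.asarray(actual) - 1
    predicted = np.asarray(predicted) - 1
    observed = np.zeros((n_classes, n_classes))
    for a, p in zip(actual, predicted):
        observed[a, p] += 1
    # quadratic penalty: disagreement grows with squared class distance
    weights = (np.arange(n_classes)[:, None] - np.arange(n_classes)) ** 2 / (n_classes - 1) ** 2
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

def apply_offsets(cont_preds, offsets):
    """Shift each prediction by its (rounded) class's offset, then round and clip."""
    cls = np.clip(np.round(cont_preds), 1, 8).astype(int)
    return np.clip(np.round(cont_preds + offsets[cls - 1]), 1, 8).astype(int)

def tune_offsets(cont_preds, actual):
    """Tune one offset per class sequentially, minimizing -QWK with fmin_powell."""
    offsets = np.zeros(8)
    for k in range(8):
        def neg_qwk(x, k=k):
            trial = offsets.copy()
            trial[k] = np.ravel(x)[0]
            return -qwk(actual, apply_offsets(cont_preds, trial))
        offsets[k] = np.ravel(fmin_powell(neg_qwk, offsets[k], disp=False))[0]
    return offsets

# Actual / Predicted columns from the table above
actual = np.array([8, 6, 7, 3, 6, 5, 5, 4, 1, 2, 4, 2, 8, 3])
preds = np.array([7.35, 6.72, 7.11, 1.32, 5.49, 5.12, 5.03,
                  3.19, 1.01, 2.47, 4.11, 2.54, 8.32, 3.00])
base = qwk(actual, apply_offsets(preds, np.zeros(8)))
tuned = qwk(actual, apply_offsets(preds, tune_offsets(preds, actual)))
```

Each one-dimensional Powell pass can only keep or improve the score at its starting offset, which is why the initial guesses and the order of classes matter (a point revisited later in the deck).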

- MOAR models

• When one is good, how about 29?

[Diagram: 4-level stack – Level 1: Model 1 … Model 27 (XGBoost, AdaBoost, …) → Levels 2/3: train / apply offset → Level 4: weighted predictions]

Stacking: “[…] stacked generalization is a means of non-linearly combining generalizers to make a new generalizer, to try to optimally integrate what each of the original generalizers has to say about the learning set. The more each generalizer has to say (which isn’t duplicated in what the other generalizers have to say), the better the resultant stacked generalization.” – Wolpert (1992), Stacked Generalization

Blending: A word introduced by the Netflix winners. It is very close to stacked generalization, but a bit simpler and with less risk of an information leak. Some researchers use “stacked ensembling” and “blending” interchangeably. With blending, instead of creating out-of-fold predictions for the train set, you create a small holdout set of, say, 10% of the train set. The stacker model then trains on this holdout set only. (http://mlwave.com/kaggle-ensembling-guide/)

- Do this for each classifier…

1. Train model on TRAIN

2. Predict CV

3. Predict TEST

4. CV predictions become the new train set; averaged test predictions become the new test set

5. Iterate
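The five steps above, sketched for one base model (the model choice, fold count, and toy data are illustrative, not from the competition code):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge

def out_of_fold(model, X_train, y_train, X_test, n_folds=5, seed=0):
    """Steps 1-4: train on each fold, predict the held-out part and the test set.

    Returns one out-of-fold column for the new train set and the averaged
    test predictions for the new test set; stack columns from several base
    models side by side and repeat at the next level (step 5).
    """
    oof = np.zeros(len(X_train))
    test_preds = np.zeros((n_folds, len(X_test)))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for i, (tr, cv) in enumerate(kf.split(X_train)):
        model.fit(X_train[tr], y_train[tr])    # 1. train model
        oof[cv] = model.predict(X_train[cv])   # 2. predict CV
        test_preds[i] = model.predict(X_test)  # 3. predict test
    return oof, test_preds.mean(axis=0)        # 4. new train / new test columns

# toy usage
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
X_te = rng.normal(size=(20, 5))
new_train_col, new_test_col = out_of_fold(Ridge(), X, y, X_te)
```

The key design point is that every train-set prediction is out-of-fold, so the next-level model never sees a prediction made by a model that trained on that row – that is what keeps the stack from leaking.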

Or you can use [stacked_generalization] @ https://github.com/dustinstansbury/stacked_generalization and do this automatically – and a lot faster!

- Stay tuned

• Grid search, Random search

• hyperopt & BayesOpt

(others: MOE, Spearmint – require a MongoDB instance)

• Note: hyperopt can also select preprocessing steps and classifiers … pretty cool
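No library calls appear in the slides, so as a minimal, library-free illustration of the loop that RandomizedSearchCV, hyperopt, and BayesOpt all automate, here is a random search over two XGBoost-style hyperparameters with a stand-in objective (the real objective would be the cross-validated QWK of the model; the search space and toy surface are assumptions):

```python
import random

# hypothetical search space over two common XGBoost parameters
SPACE = {"max_depth": list(range(3, 11)), "eta": [0.01, 0.05, 0.1, 0.3]}

def objective(params):
    """Stand-in for the real objective (cross-validated model score);
    here a toy surface peaking at max_depth=6, eta=0.05."""
    return -((params["max_depth"] - 6) ** 2) - 10 * abs(params["eta"] - 0.05)

def random_search(n_trials=50, seed=1):
    """Sample parameters, score them, keep the best. Smarter libraries
    (TPE in hyperopt, Gaussian processes in BayesOpt) bias the sampling
    toward promising regions instead of drawing uniformly."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(choices) for name, choices in SPACE.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = random_search()
```

The score/time table below is exactly this trade-off: each method spends its budget on the same loop, differing only in how cleverly it picks the next trial.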

Method               Score   Time
GridSearchCV         n/a     too long
RandomizedSearchCV   0.473   24.4 hours
Hyperopt             0.613   13 hours
BayesOpt             0.663   62 minutes

(scores for a single XGBRegressor model)

- Back to my models…

• Trying new params with the network of models (but fewer of them)… using an ensemble based on the optimizations

• What are the results? (score and time)

• What is the level system like?

[Diagram: same 4-level stack as before – Level 1: Model 1 … Model 27 (XGBoost, AdaBoost, …) → train / apply offset → weighted predictions]

- Auto-sklearn

• Eh?

- Final-ish Results

Model                    Best Score   Time
Single XGBoost           0.669*       15 minutes
4-level stack            0.665        ~12 hours
Tuned single XGBoost     0.663        75 minutes
Auto-sklearn + XGBoost   0.667        60 minutes

In the meantime my position has gone from 138/1970 to 660/2695 – ~24th percentile

* Lucky seed

- Last ditch effort

• If model optimization is a dead end, what other aspects can be optimized?

• Offsets!

– 1a) Initial offset guesses (fmin is sensitive to these)

– 1b) Order in which the offsets are applied (fmin is sensitive to this too)

– 2) Binning predictions instead of applying offsets?

• Are there really no intuitions about the data?

- Final Results

Model                      Best Score   Time
Single XGBoost             0.669        15 minutes
4-level stack              0.665        ~12 hours
Tuned single XGBoost       0.663        75 minutes
Auto-sklearn + XGBoost     0.667        60 minutes
Optimize XGBoost offsets   0.667        15 minutes + ~12 hrs for optimizations
Optimize XGBoost bins      0.664        15 minutes + ~4 hrs for optimizations

- Roadmap

1. Find a model

2. Build a network of models

3. Tune

4. Results?

- Next steps…

• 5 days left to…

– Explore potential structural intuitions

• (count / sum / interaction effects)

– Explore additional models like Neural Networks...

• Down the road...

– Beef up stacking and blending skills (optimize time) – or build my own

– Win a GD competition

• A note about insurance and risk...