This page reproduces the content of https://speakerdeck.com/smly/techniques-tricks-for-data-mining-competitions.


Techniques (Tricks) for Data Mining Competitions

Kohei Ozaki

2015-10-15 @ Kyoto University

- Kohei Ozaki

(Screenshot of https://www.kaggle.com/confirm)

Kaggle Enthusiast

Work Experience:

• Insurance Fraud Detection

• Predictive Modeling for Online Advertising

• Recommendation System for SNS

• etc.

2 - Agenda

Data Mining Competitions

4 - 12

Techniques (Tricks) for competitions

13 - 28

Learning from Winning Solutions

29 - 72

Trend on Kaggle

73 - 78

3 - Agenda

Data Mining Competitions

4 - 12

Techniques (Tricks) for competitions

13 - 28

Learning from Winning Solutions

29 - 72

Trend on Kaggle

73 - 78

4 - Data Mining Competitions

Participants compete on the score of their predictive models.

A competition normally runs for 2 or 3 months.

Many kinds of tasks/datasets from the real world:

Insurance, Credit Scoring, Loan default, Medical, EEG, MEG,

Image Classification, HealthCare, High-energy physics,

Social Good, Marketing, Advertising, Trajectory, Telematics,

etc…

5 - Step1: Get the Data

Download the datasets and understand the competition task

6 - Step2: Make a Submission

Create your model and make a submission

7 - Step3: Check Your Rank

After you make a submission, your model is evaluated immediately and ranked on the Public Leaderboard.

8 - Huge Amount of Prize Pool

Netflix Prize 2009 ($1M): Recommend movies

Heritage Health Prize 2011 ($3M): Predict days in hospital

GE Flight Quest Challenge:
Part 1: Predict gate/arrival time, 2012 ($250k)
Part 2: Optimize flight plan, 2014 ($220k)

(↑ the predictive-modeling world; ↓ yet another world)

DARPA Grand Challenge ($2M): Autonomous vehicle

Google Lunar XPRIZE ($30M): Autonomous robotic spacecraft

9 - Who Hosts Competitions?

Kaggle is a platform for data prediction competitions.

(a crowdsourcing community of 360k+ data scientists)

In addition to prize money, many data scientists use Kaggle

to learn and collaborate with experts.

10 - Great Place to Try out Your Ideas (1/2)

Many researchers/developers also use Kaggle.

(XGBoost, LibFM, LibFFM, Lasagne, Keras, cxxnet, etc…)

Rie Johnson (RJ Research Consulting),
a prize winner in the Heritage Health Prize:

"Our original motivation for entering the contest was to try out our new tree ensemble, regularized greedy forest (RGF), in a competitive setting."

(Quote from http://www.heritagecaliforniaaco.com/?p=hpn-today&article=45)

11 - Great Place to Try out Your Ideas (2/2)

Many researchers/developers also use Kaggle.

(XGBoost, LibFM, LibFFM, Lasagne, Keras, cxxnet, etc…)

Ming Liang (Tsinghua University),
a prize winner in Grasp-and-Lift EEG Detection:

"My intention of participating in this competition is to evaluate the performance of recurrent convolutional neural network (RCNN) in processing time series data."

(Quote from https://www.kaggle.com/c/grasp-and-lift-eeg-detection/forums/t/16617/team-daheimao-solution)

12 - Agenda

Data Mining Competitions

4 - 12

Techniques (Tricks) for competitions

13 - 28

Learning from Winning Solutions

29 - 72

Trend on Kaggle

73 - 78

13 - Two Main Factors

The quality of the individual model & the ensemble idea.

Without sophisticated individual models, there is no victory. Both the individual models and the ensemble idea are key.

(Fig.: individual models feeding an ensemble model)

14 - Hyper Parameter Tuning & Feature Engineering

Read Owen Zhang’s slide (textbook) carefully :-)

http://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions
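Not taken from Owen Zhang's slides, but as a minimal illustration of the tuning loop they describe, here is a cross-validated grid search over a few common GBDT hyperparameters using scikit-learn; the parameter values and the synthetic dataset are illustrative.

```python
# Hedged sketch: cross-validated search over common GBDT knobs,
# scored by 5-fold AUC on a synthetic binary-classification dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

param_grid = {
    "learning_rate": [0.1, 0.05],
    "max_depth": [3, 5, 7],
    "subsample": [0.8, 1.0],
}
search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=200, random_state=0),
    param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```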

15 - Greedy Forward Selection (GFS)

Greedy Forward Selection (GFS) is simple and works well for feature selection and for model selection in ensembles.

1: Initialize the feature set F_0 = ∅ at k = 0.
2: Iterate:
3:   Find the best feature j ∉ F_k to add to F_k, i.e., the one with the most significant cost reduction.
4:   Set k = k + 1 and F_k = F_{k−1} ∪ {j}.
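A minimal Python sketch of this loop; `cv_score(features)` is a hypothetical user-supplied callable returning a cross-validated score to maximize (e.g., 5-fold AUC of a GBDT trained on that feature subset).

```python
# Minimal sketch of Greedy Forward Selection (GFS).
def greedy_forward_selection(all_features, cv_score, tol=1e-4):
    selected, best = [], float("-inf")
    while True:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            return selected, best
        # Score every one-feature extension of the current set.
        gains = {f: cv_score(selected + [f]) for f in candidates}
        f_best, score = max(gains.items(), key=lambda kv: kv[1])
        if score <= best + tol:   # no significant cost reduction: stop
            return selected, best
        selected.append(f_best)
        best = score
```

The same loop performs model selection for an ensemble when `all_features` holds the candidate models' out-of-fold predictions instead of raw features.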

16 - GBDT: RGF-L2 and XGBoost have L2-Regularization for Leaf Coefficients

(Fig.: a regression tree (CART) with leaf weights such as +8.0 and -0.2)

ŷ_i = f_1(x_i) + f_2(x_i) + ··· + f_K(x_i)

Parameters: Θ = {f_1, f_2, ···, f_K}

Objective: Obj(Θ) = L(Θ) + Ω(Θ), where L(Θ) is the loss term and Ω(Θ) is the regularization term (heuristics including L0 (# of leaves) and L2).

L2 regularization works well on noisy datasets and in ensemble models.
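As one concrete instance, XGBoost exposes this L2 penalty on leaf weights through the `lambda` parameter (alias `reg_lambda`). A minimal sketch with illustrative values and a synthetic dataset:

```python
# Hedged sketch: the Omega term's L2 part in XGBoost is the `lambda`
# parameter; larger values shrink leaf weights, which tends to help
# on noisy data and when the model feeds an ensemble.
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0,
                       random_state=0)
dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "reg:squarederror",
    "max_depth": 4,
    "eta": 0.1,
    "lambda": 10.0,   # L2 regularization on leaf weights
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```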

17 - Reminder: 5-Fold Cross Validation

Use K-1 parts for training and 1 part for testing.

18 - Ensemble Techniques: Stacking (1/2)

Stacking uses different methods' predictions as "meta-features".

To obtain the meta-features for training the ensemble model, use K-1 parts for training and 1 part for making the meta-feature.
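A minimal sketch of this out-of-fold procedure, assuming scikit-learn and a synthetic dataset; the resulting 1-D meta-feature then feeds the stage-2 (ensemble) model.

```python
# Stage-1 stacking sketch: out-of-fold predictions of one base model
# become a 1-D meta-feature, one value per training row.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=1000, random_state=0)
meta = np.zeros(len(y))
for tr_idx, te_idx in KFold(n_splits=5, shuffle=True,
                            random_state=0).split(X):
    base = RandomForestClassifier(n_estimators=100, random_state=0)
    base.fit(X[tr_idx], y[tr_idx])      # train on K-1 parts
    meta[te_idx] = base.predict_proba(X[te_idx])[:, 1]  # held-out part
# Stack `meta` with other models' meta-features as stage-2 input.
```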

(Fig.: 1D meta-feature)

19 - Ensemble Techniques: Stacking (2/2)

You can stack more stages :-)

20 - Netflix Blending (Quiz Blending)

[1] Andreas Töscher and Michael Jahrer, "The BigChaos Solution to the Netflix Grand Prize".

Assume that the task is regression

and the prediction is evaluated by RMSE.

What can we do to improve our score?

(Fig.: the individual models' prediction vectors, side by side)

21 - zoom-in

22 - Actual setting:

We have feedback on the quiz data (30% of the test data)!

23 - Utilize Quiz feedback for blending 1/4

Our goal is to find the linear combination of predicted results

that best predicts y (target variable).

Let X be the N-by-p matrix whose columns are the predictions of the p individual models:

X = ( prediction_1, prediction_2, ..., prediction_p )

and let y be the unobserved vector of true target values.

24 - Utilize Quiz feedback for blending 2/4

X: the N-by-p matrix of predictions from the p individual models.
y: the unobserved vector of true target values.

If y were known, the best estimation by linear combination would be:

β = (XᵀX)⁻¹ Xᵀy

25 - Utilize Quiz feedback for blending 3/4

If y were known, the best estimation by linear combination would be β = (XᵀX)⁻¹ Xᵀy.

The j-th element of Xᵀy can be rewritten as

(Xᵀy)_j = x_jᵀ y = ½ ( x_jᵀ x_j + yᵀy − ‖x_j − y‖² ), where:

• x_jᵀ x_j can be computed exactly.
• yᵀy can be approximated using quiz feedback (the all-zero case: N times the MSE of an all-zero submission).
• ‖x_j − y‖² can be approximated using quiz feedback (N times the MSE of prediction j).

26 - Utilize Quiz feedback for blending 4/4

Our goal is to find the linear combination of predicted results

that best predicts y (target variable).

Linear combination using quiz feedback:

blended prediction = X β = ( prediction_1, ..., prediction_p ) · β

with β = (XᵀX)⁻¹ Xᵀy, where X is the N-by-p matrix of predictions from the p individual models, β is the (p × 1) vector of weight parameters, and Xᵀy is estimated from quiz feedback as above.
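A numpy sketch of the whole blend, assuming we have the p models' quiz-set predictions, each model's quiz RMSE from the leaderboard, and yᵀy recovered from an all-zero submission; all names are illustrative.

```python
import numpy as np

def quiz_blend_weights(X, quiz_rmse, yty):
    """Estimate beta = (X^T X)^{-1} X^T y from quiz feedback only.

    X         : (N, p) predictions of the p models on the quiz set
    quiz_rmse : length-p array of quiz RMSEs, one per model
    yty       : y^T y, recovered as N * MSE of an all-zero submission
    """
    N = X.shape[0]
    # ||x_j - y||^2 = N * MSE_j, so x_j^T y = (x_j^T x_j + y^T y - N * MSE_j) / 2
    xty = 0.5 * ((X * X).sum(axis=0) + yty - N * np.asarray(quiz_rmse) ** 2)
    return np.linalg.solve(X.T @ X, xty)   # X^T X is known exactly

# Usage sketch: blended = X_test @ quiz_blend_weights(X_quiz, rmses, yty)
```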

27 - OT: Amazon’s AWS for Modeling

c4.8xlarge (36 CPU cores with 64 GB RAM / $0.3 per hour)

My bagging GBDT model for the KDD Cup takes 6 hrs (= $1.8)

≈ 280 yen (= $2.3)

* The above price is for a spot instance in us-west-1c as of Oct 2015; the price changes dynamically.

28 - Agenda

Data Mining Competitions

4 - 12

Techniques (Tricks) for competitions

13 - 28

Learning from Winning Solutions

29 - 72

Trend on Kaggle

73 - 78

29 - Learn from Winning Solutions

Today's talk covers the following competitions:

Competition Name | Description
KDD Cup 2015 | Binary classification, access logs
GE Flight Quest 2 | Optimization
Grasp-and-Lift EEG Detection | Multi-class classification, BCI, EEG recordings

30 - About KDD Cup 2015

The annual and most prestigious competition in data mining.
821 teams joined.

Task:
Predict the probability that a student will drop out of a course within 10 days. The dataset is provided by XuetangX, one of the largest MOOC platforms in China.

(Fig.: # of access records over time, labeled "drop out of course or not")

31 - Winner: InterContinental Ensemble

Team members: Jeong, Mert, Andreas, Michael, Peng, Xiaocong, Song, Tam, Kohei

32 - Dataset (1 of 3)

(1) Enrollment data

(2) Access logs

(3) Object attributes

Pair of <username, course_id> for each enrollment_id.

33 - Dataset (2 of 3)

(1) Enrollment data

(2) Access logs

(3) Object attributes

Application logs. Source, Event and Object ID are provided.

34 - Dataset (3 of 3)

(1) Enrollment data

(2) Access logs

(3) Object attributes

Detailed information for each Object ID.

35 - Analyze User Activities

Users who do not access the course many times drop out of the course.

(Fig.: histogram of # of enrollment_ids vs. # of access logs per enrollment_id)

36 - Analyze Last Access

Obviously, users who recently accessed the course continue the course.

37 - Initial Analysis

Base Features

User activity and last access make a big impact on the AUC score.

Features | Model | 5-Fold CV (AUC)
One Hot Encoding (course_id) | GBDT | 0.6118
+ num_records (User Activity) | GBDT | 0.8485
+ num_unique_object | GBDT | 0.8507
+ num_unique_active_days | GBDT | 0.8595
+ num_unique_active_hours | GBDT | 0.8601
+ num_unique_problem_event | GBDT | 0.8621
+ first and last timestamp (Last Access) | GBDT | 0.8821

38 - Feature Engineering (MC)

Multiple Courses Features

Concept: some users enrolled in multiple courses.

(Fig.: # of access records (by course) over time)

Features | Model | 5-Fold CV (AUC)
Base | GBDT | 0.8821
+ (MC) first and last timestamp for each user | GBDT | 0.8936
+ (MC) num_unique_active_days for each user | GBDT | 0.8946
+ (MC) num_enrollment_courses for each user | GBDT | 0.8953

39 - Feature Engineering (EP)

Evaluation Period Features (a bit leaky)

Concept: the activities after the end date of the course.

(Fig.: # of access records (by course) over time)

Features | Model | 5-Fold CV (AUC)
Base + MC | GBDT | 0.8953
Base + MC + EP | GBDT | 0.9027

40 - Feature Engineering (PXJ)

Features from Teammates (Peng, Xiaocong and Jeong):

• Max absent days
• Min days from first visit to next course begin
• Min days from 10 days after last visit to next course begin
• Min days from last visit to next course end
• Min days from next course to last visit
• Min days from 10 days after course end to next course begin
• Min days from 10 days after course end to next course end
• Min days from course end to next visit
• Active days from last visit to course end
• Active days in 10 days from course end
• Average hour per day
• Course drop rate
• Time span

Features | Model | 5-Fold CV (AUC)
Base + MC + EP | GBDT | 0.9027
Base + MC + EP + PXJ | GBDT | 0.9052

41 - Last 48 hours: we're in 3rd place for a long time.

42 - Feature Engineering (LD)

Label Dependent Features (a bit leaky)

Count the number of dropped-out courses for each day in the evaluation period, using the target variables in the training set.

Features | Model | 5-Fold CV (AUC)
Base + MC + EP + PXJ | GBDT | 0.9052
Base + MC + EP + PXJ + LD | GBDT | 0.9062
Base + MC + EP + PXJ + LD | Bagging GBDT | 0.9067

43 - Last 27 hours: add the LD features into the ensemble model.

44 - Feature Engineering (TAM)

Sliding window & various aggregations + GFS (Tam's work)

Use a sliding window to generate many features automatically (see the sketch after the table below).

(Fig.: sliding window & various aggregations (by objects, events, etc.) applied to # of access records (by course) over time)

Features | Model | 5-Fold CV (AUC)
Base + MC + EP + PXJ + LD | GBDT | 0.9062
Base + MC + EP + PXJ + LD + TAM | GBDT | 0.9067
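A rough pandas sketch of the idea, assuming a per-day count frame with hypothetical columns enrollment_id, date (datetime64), and n_access; the windows and aggregations are illustrative, not Tam's exact recipe.

```python
import pandas as pd

def window_features(daily: pd.DataFrame, windows=(3, 7, 14, 30)):
    """Aggregate recent activity over several trailing windows."""
    out = {}
    end = daily["date"].max()
    for w in windows:
        # Keep only the last `w` days of activity.
        recent = daily[daily["date"] > end - pd.Timedelta(days=w)]
        g = recent.groupby("enrollment_id")["n_access"]
        for agg in ("sum", "mean", "max"):
            out[f"{agg}_last_{w}d"] = g.agg(agg)
    return pd.DataFrame(out).fillna(0)
```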

45 - Last 8 hours: add TAM's model into the single best model.
Last 4 hours: add TAM's model into the ensemble model.

46 - Three-Stage Ensemble

64 single + 15 ensemble + 2 ensemble + 1 blending

Models | 5-Fold CV (AUC)
Single Best | 0.9067
Final model (Three-Stage Ensemble) | 0.9082

47 - To Avoid Over-fitting

Comparing the LB and local CV scores is important to avoid over-fitting.

(Fig.: LB vs. local CV scores; warning: over-fitting)

48 - Team Framework/Guideline

(1) We first shared the index file for the 5-fold CV (a sketch follows below).

(2) Using it, we uploaded the CV predictions and the predicted results for the test data to Dropbox.

(3) We updated the wiki to describe the CV score and LB score.

Then we could all contribute to the ensemble/blending part.

(If we had not used the same 5-fold CV index, our ensemble model would over-fit.)
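A minimal sketch of step (1): freezing one shared 5-fold index file that every teammate reuses, so all out-of-fold predictions line up row for row (row count and file name are hypothetical).

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 120_000                    # hypothetical training-set size
fold_of_row = np.zeros(n_rows, dtype=int)
kf = KFold(n_splits=5, shuffle=True, random_state=777)
for fold_id, (_, test_idx) in enumerate(kf.split(np.arange(n_rows))):
    fold_of_row[test_idx] = fold_id
# Share this file (e.g., via Dropbox) so everyone uses identical folds.
np.savetxt("cv_index_5fold.csv", fold_of_row, fmt="%d")
```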

49 - Summary

Feature Engineering is one of the key points for winning.

(Don't give up a chance to improve your feature set.)

People can work together internationally.

(A well-designed guideline is important for working as a team.)

50 - Grasp-and-Lift EEG Detection

Task:

Identify hand motions (multi-class) from

time-series EEG records.

(Pic. is from https://www.kaggle.com/c/grasp-and-lift-eeg-detection/data)

51 - Dataset: EEG records

32-channel EEG data

6 events to detect (HandStart, FirstDigitTouch, LiftOff, …)

(Fig. is from https://www.kaggle.com/c/grasp-and-lift-eeg-detection/data and https://www.kaggle.com/acshock/grasp-and-lift-eeg-detection/how-noisy-are-these-eegs)

52 - Winners' Approaches

1st place: Alexandre Barachant & Rafał Cycoń
(experts in EEG & signal processing)
• Feature Extraction: Filter bank, Neural Oscillation, ERP
• Single Models: LR, LDA, RNN, CNN

2nd place: Ming Liang (expert in image processing)
• Feature Extraction: Nothing
• Single Models: CNN, Recurrent CNN
• Model Selection: Greedy Forward Selection

It seems the single best model in this contest is the Recurrent CNN. CNNs can perform as well as the traditional paradigm.

53 - Classifying EEG signals with a Convolutional Neural Network

An input sample is treated as a height-1 image.

The input sample at time t is composed of the n-dimensional data at times t - n + 1, t - n + 2, ..., t (a sketch follows below).

(Fig.: a window of n-dimensional data ending at time t)
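A small numpy sketch of this windowing, assuming `eeg` is a (T, channels) recording; the output follows the height-1-image convention, though the winning code's exact layout may differ.

```python
import numpy as np

def make_samples(eeg: np.ndarray, n: int) -> np.ndarray:
    """Return (T - n + 1, channels, 1, n): the sample at time t holds
    the data at times t - n + 1, ..., t as a 1-pixel-high image."""
    T, _ = eeg.shape
    windows = np.stack([eeg[t - n + 1:t + 1].T for t in range(n - 1, T)])
    return windows[:, :, np.newaxis, :]   # add the height-1 axis
```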

54 - Recurrent CNN

[4] Ming Liang and Xiaolin Hu, “Recurrent Convolutional

Neural Network for Object Recognition”, CVPR’15.

A current state-of-the-art algorithm for the image classification task.

RCL (Recurrent Convolution Layer) is a natural integration of

RNN and CNN. The feed-forward (blue line) and recurrent

computation (red line) both take the form of convolution.

(Fig. is from http://blog.kaggle.com/2015/09/29/grasp-and-lift-eeg-detection-winners-interview-2nd-place-daheimao/)

55 - Summary

Convolutional Neural Networks work well on time-series signal records (EEG).

Don't fear the experts!

• A non-expert ML researcher might beat an expert researcher.

• Google Scholar is your friend.

56 - GE FQ2: Flight Route Optimization

Objective: produce a flight plan for each flight to make the average cost of planes as low as possible.

(Pic. is from http://www.gequest.com/)

57 - Format of Flight Plan

List of 4D (Latitude, Longitude, Altitude and Speed) points for

each flight plan.

(Fig.: a flight plan as a sequence of 4D points: 1: Latitude, 2: Longitude, 3: Altitude, 4: Speed)

2013-10-02 12:00:00 (cut-off time)

(Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, “A Dynamic Programming Approach for 4D Flight Route Optimization”)

58 - Evaluation Metric (1 of 2)

Objective: produce a flight plan for each flight to make the average cost of planes as low as possible.

Ctotal = Cfuel + Cdelay + Coscillation + Cturbulence

Coscillation: a penalty for changing altitude.
Cturbulence: a linear function of the elapsed time in turbulent zones.

59 - Evaluation Metric (2 of 2)

Evaluated by a flight simulator. A flight can take 3 kinds of steps: "ascending", "descending" and "cruising".

Ctotal = Cfuel + Cdelay + Coscillation + Cturbulence

Fuel consumption depends on the flight instruction.

* Airspeed (IAS): the speed of an aircraft relative to the air.
* Ground speed (GS): the speed of an aircraft relative to the ground.

60 - Dataset (1 of 3)

Flight Information

List of test flights to optimize.

• Arrival Airport

• Current Location

• Parameters of Cost Model

61 - Dataset (2 of 3)

Airport Locations

Produce a flight plan for each flight to make the average cost of planes as low as possible.

62 - Dataset (3 of 3)

Restricted Zones: airspace which is reserved for special use (restricted from civilian aircraft).

Turbulent Zones: airspace where flights experience turbulence (accrue a USD cost for the time spent within these zones).

Weather (Wind data): vectors on a 4-axis representation (time, altitude, easting, northing).

(Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, “A Dynamic Programming Approach for 4D Flight Route Optimization”)

63 - Analyze Cost Model

The burned fuel is a function of the airspeed, but the ground speed is the sum of the velocity relative to the air and the wind vector.

→ Taking advantage of the wind can significantly reduce the fuel cost and the delay cost.

Ctotal = Cfuel + Cdelay + Coscillation + Cturbulence

* Airspeed (IAS): the speed of an aircraft relative to the air.
* Ground speed (GS): the speed of an aircraft relative to the ground.

64 - Example of Wind-Optimal Path

The blue line is the wind-optimal path (it reduces the total cost by 15% compared with the red line).

(Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, “A Dynamic Programming Approach for 4D Flight Route Optimization”)

65 - 5th Solution (1 of 5)

[3] Christian Kiss-Toth, Gabor Takacs, “A Dynamic

Programming Approach for 4D Flight Route Optimization”

Procedure:

(1) Create an initial route

(2) 2D optimization process (latitude and longitude)

(3) Set the altitudes and the airspeed of the flight

Optimize 4D parameters separately.

* The winning solution was not published for this competition.

66 - 5th Solution (2 of 5)

Procedure:

(1) Create an initial route

(2) 2D optimization process (latitude and longitude)

(3) Set the altitudes and the airspeed of the flight

Solve the shortest-path problem with Dijkstra's algorithm (a sketch follows below).

Vertices:

the current position

the destination airport

the vertices of the restricted zones

(Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, “A Dynamic Programming Approach for 4D Flight Route Optimization”)
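An illustrative Dijkstra over such a visibility graph; `neighbors(v)` is a hypothetical callable assumed to yield (vertex, distance) pairs only for edges that avoid the restricted zones.

```python
import heapq

def dijkstra(start, goal, neighbors):
    """Shortest path between two vertices; vertices must be hashable
    and comparable (e.g., (lat, lon) tuples)."""
    dist, prev = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, v = heapq.heappop(heap)
        if v == goal:
            break
        if d > dist.get(v, float("inf")):
            continue                     # stale heap entry
        for u, w in neighbors(v):
            if d + w < dist.get(u, float("inf")):
                dist[u], prev[u] = d + w, v
                heapq.heappush(heap, (d + w, u))
    # Walk back from the goal to recover the route.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]
```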

67 - 5th Solution (3 of 5)

Procedure:

(1) Create an initial route

(2) 2D optimization process (latitude and longitude)

(3) Set the altitudes and the airspeed of the flight

How to find the wind-optimal path?

(Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, “A Dynamic Programming Approach for 4D Flight Route Optimization”)

68 - 5th Solution (4 of 5)

(2) 2D optimization process (latitude and longitude)

Create a grid in the airspace and divide the initial path into N parts.

→ solved by dynamic programming (a sketch follows below)

Perform it recursively.

(Fig. is from [3] Christian Kiss-Toth, Gabor Takacs, “A Dynamic Programming Approach for 4D Flight Route Optimization”)
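A toy version of this DP step, assuming the corridor around the initial path has been discretized into stages of candidate points, with a hypothetical `cost(a, b)` giving the wind-aware cost of flying between two points; the real solution refines the grid recursively.

```python
def dp_best_path(points, cost):
    """points[i] is the list of candidate points at stage i
    (stage 0 and the final stage each hold a single fixed point);
    points must be hashable, e.g., (lat, lon) tuples."""
    best = {p: 0.0 for p in points[0]}
    back = {}
    for stage in range(1, len(points)):
        nxt = {}
        for b in points[stage]:
            # Cheapest way to reach b from any point of the previous stage.
            a_best = min(best, key=lambda a: best[a] + cost(a, b))
            nxt[b] = best[a_best] + cost(a_best, b)
            back[(stage, b)] = a_best
        best = nxt
    # Backtrack the minimum-cost sequence of points.
    end = min(best, key=best.get)
    path = [end]
    for stage in range(len(points) - 1, 0, -1):
        path.append(back[(stage, path[-1])])
    return path[::-1]
```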

69 - 5th Solution (5 of 5)

Procedure:

(1) Create an initial route

(2) 2D optimization process (latitude and longitude)

(3) Set the altitudes and the airspeed of the flight

Optimize two variables: the descending distance and cruise speed.

For this 1D optimization, the solution used an exhaustive search.

70 - Report from GE

(Quote from http://www.gereports.com/post/93139010005/underdog-scientist-cracks-code-to-reduce-flight/)

71 - Summary

A deep understanding of the objective and the evaluation metric is important for solving the problem.

(i.e., taking advantage of the wind is the key point.)

Basic knowledge of computer science (DP algorithms) and engineering effort are also helpful in this kind of competition.

72 - Agenda

Data Mining Competitions

4 - 12

Techniques (Tricks) for competitions

13 - 28

Learning from Winning Solutions

29 - 72

Trend on Kaggle

73 - 78

73 - Improved Kaggle Rankings (1/3)

Kaggle users receive points for their performance in competitions.

In May 2015, Kaggle rolled out an updated version of the ranking system.

(Fig.: the old and the new ranking formulas, each annotated with three factors: a penalty for being part of a team, the popularity of the contest, and a decay term.)

74 - Improved Kaggle Rankings (2/3)

The new formula imposes a smaller penalty on being part of a

team.

(Fig.: the new vs. the old penalty term for being part of a team.)

75 - Improved Kaggle Rankings (3/3)

The new point system counts your achievements in past contests.

(Fig.: the new vs. the old decay term.)

76 - Forming a Team Seems Active

In the CAT competition, ranks #1 to #7 are all teams, with no solo players.

Teaming up is common when ensemble models work well.

77 - Take-away Messages

Join Kaggle competitions for fun and learn techniques from expert data scientists around the world!

RGF and XGBoost have L2 regularization, and it can work well for noisy datasets (and ensemble models).

Ensemble/blending techniques are tricky; some techniques are impractical in real-world settings.

I don't think Deep Learning is hardly ever used. We do use it.

78 - References

[1] Andreas Töscher and Michael Jahrer, "The BigChaos Solution to the Netflix Grand Prize".

[2] Rie Johnson and Tong Zhang, "Learning Nonlinear Functions Using Regularized Greedy Forest", TPAMI'14.

[3] Christian Kiss-Toth and Gábor Takács, "A Dynamic Programming Approach for 4D Flight Route Optimization", Big Data'14.

[4] Ming Liang and Xiaolin Hu, "Recurrent Convolutional Neural Network for Object Recognition", CVPR'15.

79