This page reproduces the content of http://www.slideshare.net/HJvanVeen/kaggle-presentation.

Uploaded about 1 month ago (2016/09/18) in Technology

Some tips and tricks for winning Kaggle competitive data science competitions

- Winning Kaggle Competitions

Hendrik Jacob van Veen - Nubank Brasil

- About Kaggle

Biggest platform for competitive data science in the world

Currently 500k+ competitors

Great platform to learn about the latest techniques and how to avoid overfitting

Great platform to share and meet up with other data freaks

- Approach

Get a good score as fast as possible

Using versatile libraries

Model ensembling

- Get a good score as fast as possible

Get the raw data into a universal format like SVMlight or NumPy arrays.

Failing fast and failing often / Agile sprint / Iteration

Sub-linear debugging:

“output enough intermediate information as a calculation is progressing to determine before it finishes whether you've injected a major defect or a significant improvement.” Paul Mineiro

- Using versatile libraries
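One possible reading of the sub-linear debugging quote above, as a minimal sketch: an online learner scores each example before learning from it and prints the running average loss at steps 1, 2, 4, 8, ..., so a defect shows up long before the run ends. The function name `train_with_progress`, the logistic model and the toy data are assumptions for illustration, not something taken from the slides.

```python
import math
import random

def train_with_progress(examples, lr=0.1):
    """Online logistic regression that reports at steps 1, 2, 4, 8, ..."""
    w = [0.0] * len(examples[0][0])
    total_loss = 0.0
    next_report = 1
    history = []  # (examples seen, average progressive loss)
    for t, (x, y) in enumerate(examples, start=1):
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))
        p = min(max(p, 1e-12), 1 - 1e-12)  # keep log() finite
        # Score the example before learning from it (progressive validation).
        total_loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for i, xi in enumerate(x):
            w[i] -= lr * (p - y) * xi
        if t == next_report:
            history.append((t, total_loss / t))
            print(f"seen={t:6d}  avg_logloss={total_loss / t:.4f}")
            next_report *= 2
    return w, history

# Toy data (assumption): a bias term plus one feature, label is 1 when x1 > 0.
random.seed(0)
data = []
for _ in range(1000):
    x1 = random.uniform(-1, 1)
    data.append(([1.0, x1], 1 if x1 > 0 else 0))

weights, history = train_with_progress(data)
```

Because the reports are exponentially spaced, the log stays short even on very long runs, while the first few lines already show whether the loss is moving in the right direction.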

Scikit-learn

Vowpal Wabbit

XGBoost

Keras

Other tools get Scikit-learn API wrappers

- Model Ensembling

Voting

Averaging

Bagging

Boosting

Binning

Blending

Stacking

- General Strategy

Try to create “machine learning”-learning algorithms with optimized pipelines that are:

Data agnostic (Sparse, dense, missing values, larger than memory)

Problem agnostic (Classification, regression, clustering)

Solution agnostic (Production-ready, PoC, latency)

Automated (Turn on and go to bed)

Memory-friendly (Don’t want to pay for AWS)

Robust (Good generalization, concept drift, consistent)

- First Overview I

Classification? Regression?

Evaluation Metric

Description

Benchmark code

“Predict human activities based on their smartphone usage. Predict if a user is sitting, walking etc.” - Smartphone User Activity Prediction

Given the HTML of ~337k websites served to users of StumbleUpon, identify the paid content disguised as real content. - Dato Truly Native?

- First Overview II

Counts

Images

Text

Categorical

Floats

Dates

0.28309984, -0.025501173, … , -0.11118051, 0.37447712

<!Doctype html><html><head><meta charset=utf-8> … </html>

- First Overview III

Data size?

Dimensionality?

Number of train samples & test samples?

Online or offline learning?

Linear problem or non-linear problem?

Previous competitions that were similar?

- Branch

If: Issues with the data -> Tedious clean-up

Join JSON tables, Impute missing values, Curse Kaggle and join another competition

Else: Get data into NumPy arrays, we want: X_train, y, X_test

- Local Evaluation

Set up local evaluation according to competition metric

Create a simple benchmark (useful for exploration and discarding models)

5-fold stratified cross-validation usually does the trick

Very important step for fast iteration and saving submissions, yet it is easy to be lazy and use the leaderboard instead.

Area Under the Curve, Multi-Class Classification Accuracy

- Data Exploration
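The local evaluation setup from the Local Evaluation slide above can be sketched in plain Python as follows: stratified folds, a competition metric (accuracy here) and a majority-class benchmark to beat. All function names are hypothetical, and the sketch skips the shuffling one would normally do before assigning folds.

```python
from collections import Counter, defaultdict

def stratified_folds(y, n_folds=5):
    """Assign each row to a fold so class proportions stay similar per fold."""
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    fold_of = [0] * len(y)
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_folds
    return fold_of

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_validate(fit, X, y, n_folds=5, metric=accuracy):
    """fit(X_train, y_train) must return a function mapping a row to a prediction."""
    fold_of = stratified_folds(y, n_folds)
    scores = []
    for fold in range(n_folds):
        train = [i for i in range(len(y)) if fold_of[i] != fold]
        valid = [i for i in range(len(y)) if fold_of[i] == fold]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        preds = [predict(X[i]) for i in valid]
        scores.append(metric([y[i] for i in valid], preds))
    return scores

def majority_class_benchmark(X_train, y_train):
    """Simple benchmark: always predict the most frequent training label."""
    most_common = Counter(y_train).most_common(1)[0][0]
    return lambda row: most_common

# Toy data (assumption): 15 rows of class 0 and 5 rows of class 1.
X = [[i] for i in range(20)]
y = [0] * 15 + [1] * 5
scores = cross_validate(majority_class_benchmark, X, y)
print(scores)
```

Any real model is plugged in through the same `fit` argument, so every candidate is scored on identical folds and can be compared against the benchmark without spending a leaderboard submission.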

Min, Max, Mean, Percentiles, Std, Plotting

Can detect: leakage, golden features, feature engineering tricks, data health issues.

Caveat: At least one top 50 Kaggler used to not look at the data at all:

“It’s called machine learning for a reason.”

- Feature Engineering I
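For the Data Exploration slide above, a minimal sketch of the per-column summary (min, max, mean, percentiles, std) using only the standard library. The helper name `describe` and the convention that `None` marks a missing value are assumptions; plotting is left out.

```python
import statistics

def describe(column):
    """Summarise one numeric column; None marks a missing value."""
    present = [v for v in column if v is not None]
    q1, median, q3 = statistics.quantiles(present, n=4)
    return {
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "std": statistics.stdev(present),
        "q1": q1,
        "median": median,
        "q3": q3,
        "n_missing": len(column) - len(present),
    }

summary = describe([1, 2, 3, 4, 5, 6, 7, 8, 9, None])
for name, value in summary.items():
    print(f"{name:10s} {value}")
```

Running this over every column of train and test and comparing the two summaries is one cheap way to spot the data health issues the slide mentions.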

Log-transform count features, tf-idf transform text features

Unsupervised transforms / dimensionality reduction

Manual inspection of data

Dates -> day of month, is_holiday, season, etc.

Create histograms and cluster similar features

Using VW-varinfo or XGBfi to check 2-3-way interactions

Row stats: mean, max, min, number of NAs.

- Feature Engineering II
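Three of the transforms from the Feature Engineering I slide above, sketched with the standard library: a log transform for counts, date parts, and row statistics. The helper names, the northern-hemisphere season mapping and the caller-supplied `holidays` set are assumptions made for this illustration.

```python
import math
from datetime import date

def log_transform(count):
    """log(1 + x), so that a count of zero maps to zero."""
    return math.log1p(count)

def date_features(d, holidays=frozenset()):
    """Expand a date into simple parts; `holidays` is a set of dates given by the caller."""
    seasons = ["winter", "spring", "summer", "autumn"]  # northern hemisphere (assumption)
    return {
        "day_of_month": d.day,
        "day_of_week": d.weekday(),  # Monday is 0, Sunday is 6
        "month": d.month,
        "season": seasons[(d.month % 12) // 3],
        "is_holiday": d in holidays,
    }

def row_stats(row):
    """Mean, max, min and number of missing values (None) in one row."""
    present = [v for v in row if v is not None]
    return {
        "mean": sum(present) / len(present),
        "max": max(present),
        "min": min(present),
        "n_missing": len(row) - len(present),
    }

features = date_features(date(2016, 9, 18))
stats = row_stats([1.0, None, 3.0])
print(log_transform(9), features, stats)
```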

Bin numerical features to categorical features

Bayesian encoding of categorical features to likelihood

Genetic programming

Random-swap feature elimination

Time binning (customer bought in last week, last month, last year …)

Expand data (Coates & Ng, Random Bit Regression)

Automate all of this

- Feature Engineering III

Categorical features need some special treatment

Onehot-encode for linear models (sparsity)

Colhot-encode for tree-based models (density)

Counthot-encode for large cardinality features

Likelihood-encode for experts…

- Algorithms I
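A rough sketch of three of the categorical encodings named on the Feature Engineering III slide above. It reads "counthot" as replacing each category by its frequency and "likelihood" as a smoothed per-category mean of the target; both readings, the function names and the `prior_weight` smoothing parameter are assumptions. "Colhot" is not defined in the slides and is left out.

```python
from collections import Counter, defaultdict

def one_hot(values):
    """One binary column per category (sparse-friendly for linear models)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return rows, categories

def count_encode(values):
    """Replace each category by how often it occurs."""
    counts = Counter(values)
    return [counts[v] for v in values]

def likelihood_encode(values, targets, prior_weight=1.0):
    """Smoothed mean of the target per category, shrunk towards the global mean."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = Counter(values)
    for v, t in zip(values, targets):
        sums[v] += t
    return [
        (sums[v] + prior_weight * global_mean) / (counts[v] + prior_weight)
        for v in values
    ]

colours = ["red", "blue", "red", "green"]
clicked = [1, 0, 1, 0]
rows, categories = one_hot(colours)
print(categories, rows)
print(count_encode(colours))
print(likelihood_encode(colours, clicked))
```

Likelihood encoding looks at the target, so in practice it would be computed out-of-fold; encoding a row with its own label is exactly the kind of leak the later slides warn about.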

A bias-variance trade-off between simple and complex models

- Algorithms II

There is No Free Lunch in statistical inference

We show that all algorithms that search for an extremum of a cost function perform exactly the same, when averaged over all possible cost functions. - Wolpert & Macready, No free lunch theorems for search

Practical Solution for low-bias low-variance models:

Use prior knowledge / experience to limit search (Let algos play to their known strengths for particular problems)

Remove or avoid their weaknesses

Combine/Bag their predictions

- Random Forests I

A Random Forest is an ensemble of decision trees.

"Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. […] More robust to noise" - "Random Forests", Breiman

- Random Forests II

Strengths:

Fast

Easy to tune

Easy to inspect

Easy to explore data with

Good Benchmark

Very wide applicability

Can introduce randomness / Diversity

Weaknesses:

Memory Hungry

Popular

Slower for test time

- GBM I

A GBM trains weak models on samples that previous models got wrong

"A method is described for converting a weak learning algorithm [the learner can produce an hypothesis that performs only slightly better than random guessing] into one that achieves arbitrarily high accuracy." - "The Strength of Weak Learnability", Schapire

- GBM II

Strengths:

Can achieve very good results

Can model complex problems

Works on wide variety of problems

Use custom loss functions

No need to scale data

Popular

Weaknesses:

Slower to train

Easier to overfit than RF

Weak learner assumption is broken along the way

Tricky to tune

- SVM I

Classification and Regression using Support Vectors

"Nothing is more practical than a good theory."

‘The Nature of Statistical Learning Theory’, Vapnik

- SVM II

Strengths:

Strong theoretical guarantees

Tuning regularization parameter helps prevent overfit

Kernel Trick: Use custom kernels, turn linear kernel into non-linear kernel

Achieve state-of-the-art on select problems

Weaknesses:

Slower to train

Memory heavy

Requires a tedious grid-search for best performance

Will probably time-out on large datasets

- Nearest Neighbours I

Look at the distance to other samples

"The nearest neighbor decision rule assigns to an unclassified sample point the classification of the nearest of a set of previously classified points." ‘Nearest neighbor pattern classification’, Cover et al.

- Nearest Neighbours II

Strengths:

Simple

Unpopular

Non-linear

Easy to tune

Detect near-duplicates

Weaknesses:

Simple

Does not work well on average

Depending on data size: Slow

- Perceptron I

Update weights when wrong prediction, else do nothing

The embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence. ‘New York Times’, Rosenblatt

- Perceptron II
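The perceptron rule stated on the Perceptron I slide above ("update weights when wrong prediction, else do nothing") fits in a few lines. This is a minimal sketch with labels in {-1, +1}; the function names and the toy AND data are assumptions for illustration.

```python
def train_perceptron(X, y, epochs=20):
    """Classic perceptron: change the weights only after a wrong prediction."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in zip(X, y):  # target is -1 or +1
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            predicted = 1 if activation > 0 else -1
            if predicted != target:
                # Wrong prediction: move the weights towards the target.
                w = [wi + target * xi for wi, xi in zip(w, x)]
                b += target
                mistakes += 1
            # Correct prediction: do nothing.
        if mistakes == 0:
            break  # every training example is classified correctly
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy data (assumption): logical AND, which is linearly separable.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, -1, -1, 1]
w, b = train_perceptron(X, y)
print(w, b, [predict(w, b, x) for x in X])
```

Only the features present in a misclassified example are touched, which is why the updates stay cheap on sparse data such as text.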

Strengths:

Cool / Street Cred

Extremely Simple

Fast / Sparse updates

Online Learning

Works well with text

Weaknesses:

Other linear algos usually beat it

Does not work well on average

No regularization

- Neural Networks I

Inspired by biological systems (Connected neurons firing when threshold is reached)

Because of the "all-or-none" character of nervous activity, neural events and the relations among them can be treated by means of propositional logic. […] for any logical expression satisfying certain conditions, one can find a net behaving in the fashion it describes. ‘A Logical Calculus of the Ideas Immanent in Nervous Activity’, McCulloch & Pitts

- Neural Networks II

Strengths:

The best for images

Can model any function

End-to-end Training

Amortizes feature representation

Weaknesses:

Can be difficult to set up

Not very interpretable

Requires specialized hardware

Underfit / Overfit

- Vowpal Wabbit I

Online learning while optimizing a loss function

We present a system and a set of techniques for learning linear predictors with convex losses on terascale datasets, with trillions of features, billions of training examples and millions of parameters in an hour using a cluster of 1000 machines. ‘A Reliable Effective Terascale Linear Learning System’, Agarwal et al.

- Vowpal Wabbit II

Strengths:

Fixed memory constraint

Extremely fast

Feature expansion

Difficult to overfit

Versatile

Weaknesses:

Different API

Manual feature engineering

Loses against boosting

Requires practice

Hashing can obscure

- Others

Factorization Machines

Genetic Algorithms

PCA

Bayesian

t-SNE

Logistic Regression

SVD / LSA

Quantile Regression

Ridge Regression

AdaBoosting

GLMNet

SGD

- Ensembles I

Combine models in a way that outperforms individual models.

“That’s how almost all ML competitions are won” - ‘Dark Knowledge’ Hinton et al.

Ensembles reduce the chance of overfit.

Bagging / Averaging -> Lower variance, slightly lower bias

Blending / Stacking -> Remove biases of base models

- Ensembles II

Practical tips:

Use diverse models

Use diverse feature sets

Use many models

Do not leak any information

- Stacked Generalization I

Train one model on the predictions of another model

A scheme for minimizing the generalization error rate of one or more generalizers. Stacked generalization works by deducing the biases of the generalizer(s) with respect to a provided learning set. This deduction proceeds by generalizing in a second space whose inputs are (for example) the guesses of the original generalizers when taught with part of the learning set and trying to guess the rest of it, and whose output is (for example) the correct guess. - ‘Stacked Generalization’, Wolpert

- Stacked Generalization II

Train one model on the predictions of another model

- Stacked Generalization III
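"Train one model on the predictions of another model" can be sketched as follows, under simplifying assumptions: a 1-nearest-neighbour regressor as the base model, a least-squares line as the second-stage model, and 1-D toy data. The key point is that the second stage is fitted on out-of-fold predictions, so the base model never predicts a row it was trained on. All names are hypothetical.

```python
def nearest_neighbour_regressor(X_train, y_train):
    """Base model: predict the target of the closest training point (1-D inputs)."""
    def predict(x):
        best = min(range(len(X_train)), key=lambda i: abs(X_train[i] - x))
        return y_train[best]
    return predict

def out_of_fold_predictions(fit, X, y, n_folds=5):
    """Predict every row with a model that was trained without that row's fold."""
    oof = [None] * len(X)
    for fold in range(n_folds):
        train = [i for i in range(len(X)) if i % n_folds != fold]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        for i in range(fold, len(X), n_folds):
            oof[i] = predict(X[i])
    return oof

def fit_linear_stacker(base_preds, y):
    """Second-stage model: least-squares line y ~ a * base_pred + b."""
    n = len(y)
    mean_p = sum(base_preds) / n
    mean_y = sum(y) / n
    cov = sum((p - mean_p) * (t - mean_y) for p, t in zip(base_preds, y))
    var = sum((p - mean_p) ** 2 for p in base_preds)
    a = cov / var
    b = mean_y - a * mean_p
    return lambda p: a * p + b

# Toy data (assumption): y = 2x + 1 on the integers 0..19.
X = [float(i) for i in range(20)]
y = [2.0 * x + 1.0 for x in X]

oof = out_of_fold_predictions(nearest_neighbour_regressor, X, y)
stacker = fit_linear_stacker(oof, y)

# At test time: refit the base model on all training rows, then apply the stacker.
base_model = nearest_neighbour_regressor(X, y)
stacked_prediction = stacker(base_model(7.2))
print(stacked_prediction)
```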

Using weak base models vs. using strong base models

Using average of out-of-fold predictors vs. One model for testing

One can also stack features when these are not available in test set.

Can share train set predictions based on different folds

- StackNet

We need to go deeper:

Splitting node: x1 > 5? 1 else 0

Decision tree: x1 > 5 AND x2 < 12?

Random forest: avg ( x1 > 5 AND x2 < 12?, x3 > 2? )

Stacking-1: avg ( RF1_pred > 0.9?, RF2_pred > 0.92? )

Stacking-2: avg ( S1_pred > 0.93?, S2_pred < 0.77? )

Stacking-3: avg ( SS1_pred > 0.98?, SS2_pred > 0.97? )

- Bagging Predictors I

Averaging submissions to reduce variance

"Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor." - "Bagging Predictors", Breiman

- Bagging Predictors II

Train models with:

Different data sets

Different algorithms

Different feature subsets

Different sample subsets

Then average / vote aggregate these

- Bagging Predictors III

One can average with:

Plain average

Geometric mean

Rank mean

Harmonic mean

KazAnova’s brute-force weighted averaging

Caruana’s forward greedy model selection

- Brute-Force Weighted Average
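The first four averaging schemes listed on the Bagging Predictors III slide above, as a small sketch. The rank mean shown here ranks each model's predictions, scales the ranks to [0, 1] and averages them; that normalisation, the tie handling and the function names are assumptions. Geometric and harmonic means require strictly positive predictions.

```python
import math

def plain_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)

def to_ranks(preds):
    """Scale ranks to [0, 1]: lowest prediction is 0, highest is 1 (ties broken by position)."""
    order = sorted(range(len(preds)), key=lambda i: preds[i])
    ranks = [0.0] * len(preds)
    for rank, i in enumerate(order):
        ranks[i] = rank / (len(preds) - 1)
    return ranks

def blend(model_preds, combine):
    """model_preds holds one list of predictions per model, all for the same rows."""
    return [combine(row) for row in zip(*model_preds)]

def rank_mean(model_preds):
    return blend([to_ranks(p) for p in model_preds], plain_mean)

model_a = [0.9, 0.1, 0.5]
model_b = [0.8, 0.2, 0.3]
for name, combine in [("plain", plain_mean), ("geometric", geometric_mean), ("harmonic", harmonic_mean)]:
    print(name, blend([model_a, model_b], combine))
print("rank", rank_mean([model_a, model_b]))
```

Rank averaging ignores how well calibrated each model is and keeps only the ordering, which suits ranking metrics such as AUC.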

Create out-of-fold predictions for train set for n models

Pick a stepsize s, and set n weights

Try every possible weight with stepsize s

Look which set of n weights improves the train set score the most

Can do in cross-validation-style manner for extra robustness.

- Greedy forward model selection (Caruana)
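The brute-force weighted average steps above can be sketched as a grid search over weight combinations. The slide does not say whether the weights must sum to 1; this sketch assumes they do, uses mean squared error as a stand-in for the competition metric, and all names are hypothetical.

```python
from itertools import product

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def brute_force_weights(oof_preds, y, step=0.1, loss=mean_squared_error):
    """Try every weight vector on a grid of size `step` whose weights sum to 1."""
    n_models = len(oof_preds)
    steps = round(1 / step)
    best_weights, best_loss = None, float("inf")
    for combo in product(range(steps + 1), repeat=n_models):
        if sum(combo) != steps:
            continue  # keep only combinations whose weights sum to 1
        weights = [c / steps for c in combo]
        blended = [
            sum(w * p for w, p in zip(weights, row))
            for row in zip(*oof_preds)
        ]
        current = loss(y, blended)
        if current < best_loss:
            best_weights, best_loss = weights, current
    return best_weights, best_loss

# Toy out-of-fold predictions (assumption): one perfect model, one inverted model.
y = [1.0, 0.0, 1.0, 0.0]
good_model = [1.0, 0.0, 1.0, 0.0]
bad_model = [0.0, 1.0, 0.0, 1.0]
weights, loss = brute_force_weights([good_model, bad_model], y)
print(weights, loss)
```

The number of combinations grows quickly with the number of models and with smaller step sizes, so this only stays practical for a handful of models.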

Create out-of-fold predictions for the train set

Start with a base ensemble of 3 best models

Loop: Add every model from the library to the ensemble in turn and pick the 4 models that give the best train score performance

Using selection with replacement, models can be picked multiple times (weighting them)

Using random subset selection from library in loop avoids overfitting to single best model.

- Automated Stack ’n Bag I
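A simplified sketch of greedy forward selection with replacement, in the spirit of the Caruana procedure on the slide above. It starts from the `n_init` best models, then repeatedly adds whichever library model lowers the ensemble loss the most and stops when nothing helps. The stopping rule, the use of mean squared error and all names are assumptions; the random-subset trick from the slide is not included.

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def ensemble_prediction(members, library):
    """Plain average of the members' predictions; repeated members count more."""
    columns = zip(*(library[name] for name in members))
    return [sum(col) / len(members) for col in columns]

def greedy_forward_selection(library, y, n_init=3, max_size=10, loss=mse):
    """library maps a model name to its out-of-fold predictions on the train set."""
    ranked = sorted(library, key=lambda name: loss(y, library[name]))
    ensemble = ranked[:n_init]
    best = loss(y, ensemble_prediction(ensemble, library))
    while len(ensemble) < max_size:
        # Score the ensemble with each library model added (with replacement).
        scores = {
            name: loss(y, ensemble_prediction(ensemble + [name], library))
            for name in library
        }
        candidate = min(scores, key=scores.get)
        if scores[candidate] >= best:
            break  # no model improves the ensemble any further
        ensemble.append(candidate)
        best = scores[candidate]
    return ensemble, best

# Toy library (assumption): two models with opposite biases and one inverted model.
y = [1.0, 0.0, 1.0, 0.0]
library = {
    "high": [1.2, 0.2, 1.2, 0.2],
    "low": [0.8, -0.2, 0.8, -0.2],
    "noisy": [0.0, 1.0, 0.0, 1.0],
}
ensemble, best = greedy_forward_selection(library, y, n_init=1)
print(ensemble, best)
```

In the toy run the two biased models cancel each other out, which is the effect the slides describe when they recommend diverse models.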

Automatically train 1000s of models and 100s of stackers, then average everything.

“Hodor!” - Hodor

- Automated Stack ’n Bag II

Generalization

Train random models, random parameters, random data set transforms, random feature sets, random sample sets.

Stacking

Train random models, random parameters, random base models, with and without original features, random feature sets, random sample sets.

Bagging

Average random selection of Stackers and Generalizers. Either pick best model, or create more random bags and keep averaging, until no increase.

- Automated Stack ’n Bag III

Strengths:

Wins Kaggle competitions

Best generalization

No tuning

No selection

No human bias

Weaknesses:

Extremely slow

Redundant

Inelegant

Very complex

Bad for environment

- Leakage I

“The introduction of information about the data mining target, which should not be legitimately available to mine from.” - ‘Leakage in Data Mining: Formulation, Detection, and Avoidance’, Kaufman et al.

“one of the top ten data mining mistakes” - ‘Handbook of Statistical Analysis and Data Mining Applications’, Nisbet et al.

- Leakage II

Exploiting Leakage:

In predictive modeling competitions: Allowed and beneficial for results

In Science and Business: A very big NO NO!

In both: Accidental (Complex algos find leakage automatically, or KNN finds duplicates)

- Leakage III

Dato Truly Native?

This task suffered from data collection leakage:

Dates and certain keywords (Trump) were indicative, and generalized to the private LB (but would not generalize to future data).

Smartphone activity prediction

This task did not have enough randomization (order of samples in train and test set was indicative)

Could manually change predictions, because classes were clustered.

- Winning Dato Truly Native? I

Invented StackNet

“Data science is a team sport”: it helps to join up with #1 Kaggler :)

We used basic NLP: Cleaning, lowercasing, stemming, ngrams, chargrams, tf-idf, SVD.

Trained a lot of different models on different datasets.

Started ensembling in the last 2 weeks.

Doing research and fun stuff, while waiting for models to complete.

XGBoost the big winner (somewhat rare to use boosting for sparse text)

- Winning Dato Truly Native? II

- Winning Smartphone Activity Prediction I

Prototyped Automated Stack ’n Bag (Kaggle Killer).

Let computer run for two days

Automatically inferred feature types

Did not look at the data

Beat very stiff competition

- Winning Smartphone Activity Prediction I

- General strategy

Being #1 during competition sucks.

Team up

Go crazy with ensembling

Do not worry so much about replication that it freezes progress

Check previous competitions

Be patient and persistent (don't run out of steam)

Automate a lot

Stay up-to-date with state-of-the-art algorithms and tools

- Complexity vs. Practicality I

Most Kaggle winner models are useless for production. It’s about hyper-optimization. Top 10% probably good enough for business.

But what if we could use some Top 1% principles from Kaggle models for business?

1-5% increase in accuracy can matter a lot!

Batch jobs allow us to overcome latency constraints

Ensembles are technically brittle, but give good generalization.

Leave no model behind!

- Complexity vs. Practicality II
- Future

Use re-usable holdout set

Use contextual bandits for training the ensemble

Find more models to add to library

Ensemble pruning / compression

Interpretable black box models