
Learning for Big Data

Hsuan-Tien Lin (林軒田)

htlin@csie.ntu.edu.tw

Department of Computer Science

& Information Engineering

National Taiwan University

(國立台灣大學資訊工程系)

slightly modified from my keynote talk in IEEE BigData 2015 Taipei Satellite Session

Hsuan-Tien Lin (NTU CSIE)

Learning for Big Data

Introduction

About the Title

• “Learning for Big Data”

—my wife: you have made a typo

• do you mean “Learning from Big Data”?

—no, not a shameless sales campaign for my co-authored best-selling book (http://amlbook.com)

as machine learning researcher: machine learning for big data —easy?!

as machine learning educator: human learning for big data —hard!!

will focus on human learning for big data


Human Learning for Big Data

Todo

• some FAQs that I have encountered as

• educator (NTU and NTU@Coursera)

• team mentor (KDDCups, TSMC Big Data competition, etc.)

• researcher (CLLab@NTU)

• consultant of a real-time advertisement bidding startup

• my imperfect yet honest answers that hint what shall be learned

First Honest Claims

• must-learn for big data ≈ must-learn for small data in ML, but the former with bigger seriousness

• system design/architecture very important, but somewhat beyond my pay grade

I wish I had an answer to that because I’m tired of answering that question.—Yogi Berra (Athlete)

Asking Questions

Big Data FAQs (1/4)

how to ask good questions from

(my precious big) data?

My Polite Answer
• good start already, any more thoughts that you have in mind?

My Honest Answer
• I don’t know.
—or a slightly longer answer: if you don’t know, I don’t know.


A Similar Scenario

how to ask good questions from

(my precious big) data?

how to find a research topic for my thesis?

My Polite Answer
• good start already, any more thoughts that you have in mind?

My Honest Answer
• I don’t know.
—or a slightly longer answer: I don’t know, but perhaps you can start by thinking about motivation and feasibility.


Finding (Big) Data Questions

≈ Finding Research Topics

• motivation: what are you interested in?

• feasibility: what can or cannot be done?

motivation
• something publishable? (oh, possibly just for people in academia)
• something that improves xyz performance
• something that inspires deeper study
—helps generate questions

feasibility
• modeling
• computational
• budget
• timeline
• . . .
—helps filter questions

brainstorm from motivation; rationalize from feasibility


Finding Big Data Questions

generate questions from motivation
• variety: dream more in big data age
• velocity: evolving data, evolving questions

filter questions from feasibility
• volume: computational bottleneck
• veracity: modeling with non-textbook data

almost never find the right question in your first try
—good questions come interactively


Interactive Question-Asking from Big Data:

Our KDDCup 2011 Experience (1/2)

Recommender System

• data: how users have rated movies

• goal: predict how a user would rate an unrated movie

A Hot Problem

• competition held by Netflix in 2006

• 100,480,507 ratings that 480,189 users gave to 17,770 movies

• 10% improvement = 1 million dollar prize

• similar competition (movies → songs) held by Yahoo! in KDDCup 2011, the most prestigious data mining competition

• 252,800,275 ratings that 1,000,990 users gave to 624,961 songs

National Taiwan University won two world championships in KDDCup 2011—with Profs. Chih-Jen Lin, Shou-De Lin, and many students.


Interactive Question-Asking from Big Data:

Our KDDCup 2011 Experience (2/2)

Q1 (pre-defined): can we improve rating prediction of (user, song)?

Q1.1 after data analysis: two types of users, lazy 7% (same rating always) & normal
—if a user gives 60, 60, . . . during training, how’d she rate the next item? same (80%) or different (20%)
Q1.1.1: can we distinguish the 80% using other features?
—failed (something you normally wouldn’t see in a paper)

Q1.2 after considering domain knowledge: test data are newer logs
—shall we emphasize newer logs in training data?
Q1.2.1: can we just give each log a different weight? (but how?)
Q1.2.2: can we tune optimization to effectively emphasize newer logs? (yes, this worked)

our KDDCup experience: interactive (good or bad) question-asking kept us going!
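Q1.2.1’s “give each log a different weight” can be sketched as weighted least squares with an exponential recency decay. Everything below (toy data, linear model, decay rate) is an illustrative assumption, not the team’s actual competition solution:

```python
import numpy as np

# Toy rating logs, oldest first; emphasize newer logs by decaying the
# weight of older ones exponentially (the decay rate is itself a tuning
# knob that would be chosen by validation).
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=n)

decay = 0.99
weights = decay ** np.arange(n - 1, -1, -1)  # newest log gets weight 1.0

# Weighted least squares: solve (X' W X) w = X' W y
XtW = X.T * weights                        # broadcasting avoids building diag(W)
w_hat = np.linalg.solve(XtW @ X, XtW @ y)
```

The same weights could instead multiply per-example gradients inside a stochastic optimizer, which is closer in spirit to “tuning optimization” (Q1.2.2).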


Learning to Ask Questions from Big Data

Must-learn Items

• true interest for motivation

—big data don’t generate questions, big interests do

• capability of machines (when to use ML?) for feasibility

Taught in ML Foundations on NTU@Coursera

1. exists underlying pattern to be learned
2. no easy/programmable definition of pattern
3. having data related to pattern

—ML isn’t cure-all

• research cycle for systematic steps

—a Ph.D. or serious research during M.S./undergraduate study

Computers are useless. They can only give you answers.—Pablo Picasso (Artist)

Simple Model

Big Data FAQs (2/4)

what is the best machine learning model for

(my precious big) data?

My Polite Answer
• the best model is data-dependent, let’s chat about your data first

My Honest Answer
• I don’t know.
—or a slightly longer answer: I don’t know about best, but perhaps you can start by thinking about simple models.


Sophisticated Model for Big Data

what is the best machine learning model for

(my precious big) data?

what is the most sophisticated machine

learning model for (my precious big) data?

• myth: my big data work best with most sophisticated model

• partially true: deep learning for image recognition @ Google

—10 million images on 1 billion internal weights

(Le et al., Building High-level Features Using Large Scale Unsupervised Learning, ICML 2012)

Science must begin with myths, and with the criticism of myths.—Karl Popper (Philosopher)


Criticism of Sophisticated Model

myth: my big data work best with the most sophisticated model

Sophisticated Model
• time-consuming to train and predict —often mismatched to big data
• difficult to tune or modify —often exhausting to use
• point of no return —often cannot “simplify” nor “analyze”

sophisticated model shouldn’t be the first choice for big data


Linear First (1/2)

what is the first machine learning model for

(my precious big) data?

Taught in ML Foundations on NTU@Coursera

linear model (or simpler) first:
• efficient to train and predict, e.g. (Lin et al., Large-scale logistic regression and linear support vector machines using Spark. IEEE BigData 2014)
—my favorite in a real-time ad. bidding startup
• easy to tune or modify
—key of our KDDCup winning solutions in 2010 (educational data mining) and 2012 (online ads)
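To make “efficient to train and predict” concrete, here is a minimal mini-batch SGD logistic regression in plain numpy — a from-scratch sketch of the linear-first idea, not the Spark implementation cited above:

```python
import numpy as np

# Mini-batch SGD logistic regression from scratch: one pass costs
# O(#examples * #features), which is why linear models scale to big data.
def train_logreg(X, y, lr=0.1, epochs=50, batch=32, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(y))
        for start in range(0, len(y), batch):
            b = idx[start:start + batch]
            p = 1.0 / (1.0 + np.exp(-(X[b] @ w)))   # predicted P(y = 1)
            w -= lr * X[b].T @ (p - y[b]) / len(b)  # average gradient step
    return w

# Linearly separable toy data
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w = train_logreg(X, y)
acc = float(np.mean(((X @ w) > 0) == (y > 0.5)))
```

Both training and prediction touch each example a bounded number of times, so the same loop survives data that only fits on disk.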


Linear First (2/2)

what is the first machine learning model for

(my precious big) data?

Taught in ML Foundations on NTU@Coursera

linear model (or simpler) first:
• somewhat “analyzable”
—my students’ winning choice in TSMC Big Data Competition (just old-fashioned linear regression!)
• little risk
• if linear good enough, live happily thereafter
• otherwise, try something more complicated, with theoretically nothing lost except “wasted” computation

My KISS Principle: Keep It Simple, ~~Stupid~~ Safe


Learning to Start Modeling for Big Data

Must-learn Items

• linear models, especially
• how to tune them
• how to interpret their outcomes
• simple models with frequency-based probability estimates, such as Naïve Bayes
• decision tree (or perhaps even better, Random Forest) as a KISS non-linear model
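The “frequency-based probability estimates” above really are just smoothed counts; a self-contained Naïve Bayes sketch on toy categorical data (the weather example is illustrative, not from the talk):

```python
import math
from collections import Counter, defaultdict

# Frequency-based Naive Bayes with Laplace smoothing: the "probabilities"
# are nothing but smoothed counts, which keeps the model interpretable.
def fit_nb(X, y, alpha=1.0):
    priors = Counter(y)               # class -> count
    counts = defaultdict(Counter)     # (class, feature index) -> value counts
    values = defaultdict(set)         # feature index -> seen values
    for xs, c in zip(X, y):
        for i, v in enumerate(xs):
            counts[(c, i)][v] += 1
            values[i].add(v)
    return priors, counts, values, alpha

def predict_nb(model, xs):
    priors, counts, values, alpha = model
    n = sum(priors.values())
    best, best_score = None, float("-inf")
    for c, pc in priors.items():
        score = math.log(pc / n)                      # log prior
        for i, v in enumerate(xs):
            num = counts[(c, i)][v] + alpha           # smoothed count
            den = pc + alpha * len(values[i])
            score += math.log(num / den)
        if score > best_score:
            best, best_score = c, score
    return best

X = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
y = ["no", "no", "yes", "yes"]
model = fit_nb(X, y)
```

On this toy set, `predict_nb(model, ("rain", "mild"))` returns `"yes"`: every number in the decision can be traced back to a count.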

An explanation of the data should be made as simple as possible, but no simpler.—[?] Albert Einstein (Scientist)

Feature Construction

Big Data FAQs (3/4)

how should I improve ML performance with

(my precious big) data?

My Polite Answer
• do we have domain knowledge about your problem?

My Honest Answer
• I don’t know.
—or a slightly longer answer: I don’t know for sure, but perhaps you can start by encoding your human intelligence/knowledge.


A Similar Scenario

how should I improve ML performance with

(my precious big) data?

how should I improve the performance of my classroom students?

instructor teaching ≡ student learning
• teach more concretely −→ better performance
• teach more professionally −→ better performance
• teach more key points/aspects −→ better performance

to improve learning performance, you should perhaps teach better


Teaching Your Machine Better with Big Data

• concrete: good research questions, as discussed
• professional: embed domain knowledge during data construction
• key: facilitate your learner using proper data pruning/cleaning/hinting

IMHO, data construction is more important for big data than machine learning is


Your Big Data Need Further Construction

Big Data Characteristics

many fields, and many abstract ones

Our KDDCup 2010 Experience

educational data mining (Yu et al., Feature Engineering and Classifier Ensemble for KDD Cup 2010)

• “Because all feature meanings are available, we are able to manually identify some useful pairs of features ...”:
• domain knowledge: “student s does step i of problem j in unit k”
• hierarchical encoding: [has student s tried unit k] more meaningful than [has student s tried step i]
• “Correct First-Attempt Rate” cj of each problem j:
• domain knowledge: cj ≈ hardness
• condensed encoding: cj physically more meaningful than j

feature engineering: make your (feature) data concrete by embedding domain knowledge
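The condensed encoding above fits in a few lines: replace each raw problem id j with its empirical correct-first-attempt rate cj. The field layout below is a toy assumption, not the actual KDDCup 2010 schema:

```python
from collections import defaultdict

# Condensed-encoding sketch: replace each raw problem id j with its
# empirical "Correct First-Attempt Rate" c_j, a proxy for hardness.
logs = [  # (student, problem_id, correct_on_first_attempt) -- toy records
    ("s1", "p1", 1), ("s2", "p1", 1), ("s3", "p1", 0),
    ("s1", "p2", 0), ("s2", "p2", 0), ("s3", "p2", 1),
]

totals = defaultdict(int)
corrects = defaultdict(int)
for _, j, ok in logs:
    totals[j] += 1
    corrects[j] += ok

c = {j: corrects[j] / totals[j] for j in totals}   # c_j per problem
features = [c[j] for _, j, _ in logs]              # encoded column: c_j, not j
```

The hierarchical encoding works the same way: aggregate by the coarser key (unit) instead of the finer one (step) before joining back onto each log.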


Learning to Construct Features for Big Data

Must-learn Items

• domain knowledge
• if available, great!
• if not, start by analyzing data first, not by learning from data
—correlations, co-occurrences, informative parts, frequent items, etc.
• common feature construction techniques
• encoding
• combination
• importance estimation: linear models and Random Forests especially useful (simple models, remember?)

one secret in winning KDDCups: asking interactive questions (remember?) allows encoding human intelligence into feature construction
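Importance estimation with a linear model can be sketched directly: standardize the features so their scales are comparable, then rank by absolute coefficient (toy data with one strong, one weak, and one irrelevant feature; an illustrative assumption, not a recipe from the talk):

```python
import numpy as np

# Importance-estimation sketch with a linear model: standardize features
# so their scales are comparable, then rank by |coefficient|.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

Xs = (X - X.mean(axis=0)) / X.std(axis=0)          # comparable scales
w, *_ = np.linalg.lstsq(Xs, y - y.mean(), rcond=None)
ranking = np.argsort(-np.abs(w))                   # most important first
```

A Random Forest’s impurity-based importances give a non-linear second opinion; when both rankings agree, the signal is usually real.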

Trap Escaping

Big Data FAQs (4/4)

how should I escape from the unsatisfactory

test performance on (my precious big) data?

My Step by Step Diagnosis

if (training performance okay) [> 90% of the time]
• combat overfitting
• correct training/testing mismatch
• check for misuse
else
• construct better features by asking more questions, remember?
• now you can try more sophisticated models

will focus on the first part


Combat Overfitting (1/2)

myth: my big data is so big that overfitting is impossible

Overfitting Hazard
• no, big data usually high-dimensional
• no, big data usually heterogeneous
• no, big data usually redundant
• no, big data usually noisy

[figure: overfit level as a function of noise level σ² and number of data points N (Learning from Data book)]

data-size-to-noise ratio is what matters!

big data still require careful treatment of overfitting


Combat Overfitting (2/2)

Driving Analogy of Overfitting

learning — driving
• overfit — commit a car accident
• sophisticated model — “drive too fast”
• noise — bumpy road
• limited data size — limited observations about road condition
—big data only cross out the last line

Regularization
• regularization ≈ put brake
—important to know where the brake is

Validation
• validation ≈ monitor dashboard
—important to ensure correctness

Overfitting is real, and here to stay.—Learning from Data (Book)
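The brake-and-dashboard pairing can be sketched in a few lines: ridge regularization is the brake, a held-out validation set is the dashboard that tells us how hard to press (toy data and candidate penalties are illustrative assumptions):

```python
import numpy as np

# Brake + dashboard sketch: ridge regularization with a held-out
# validation set to pick the penalty strength.
rng = np.random.default_rng(0)
n, d = 50, 30                        # few examples, many features: overfit risk
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.0, 0.5]        # only 3 features actually matter
y = X @ w_true + rng.normal(scale=1.0, size=n)

X_tr, y_tr, X_va, y_va = X[:35], y[:35], X[35:], y[35:]

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lams = [0.0, 0.1, 1.0, 10.0]
errs = [float(np.mean((X_va @ ridge(X_tr, y_tr, lam) - y_va) ** 2))
        for lam in lams]
best_lam = lams[int(np.argmin(errs))]   # dashboard picks the brake setting
```

With 35 training examples and 30 features, the unregularized fit (λ = 0) memorizes noise; the validation error exposes it.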


Correct Training/Testing Mismatch

A True Personal Story
• Netflix competition for movie recommender system: 10% improvement = 1M US dollars
• on my own validation data, first shot, showed 13% improvement
• why am I still teaching in NTU?

validation: random examples within data; test: “last” user records “after” data

Technical Solutions
practical rule of thumb: match test scenario as much as possible
• training: emphasize later examples (KDDCup 2011)
• validation: use “late” user records

If the data is sampled in a biased way, learning will produce a similarly biased outcome.—Learning from Data (Book)
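The “use late records for validation” rule of thumb is a one-liner once the data is sorted by time; a minimal sketch with toy time-stamped logs (the 80/20 cut is an assumption):

```python
# Match-the-test-scenario sketch: when the real test set is the "last"
# records in time, validate on the last slice rather than a random sample.
records = [(t, f"action_{t}") for t in range(100)]   # (timestamp, event), toy

records.sort(key=lambda r: r[0])                 # chronological order
cut = int(len(records) * 0.8)
train, valid = records[:cut], records[cut:]      # "late" records -> validation
```

A random split here would have leaked future behavior into training, exactly the mismatch the Netflix story above illustrates.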


Biggest Misuse in Machine Learning: Data Snooping

• 8 years of currency trading data
• first 6 years for training, last 2 years for testing
• feature = previous 20 days, label = 21st day
• snooping versus no snooping: superior profit possible

[figure: cumulative profit (%) over 500 test days — the “snooping” curve climbs toward 30%, the “no snooping” curve hovers near 0]

• snooping: shift-scale all values by training + testing
• no snooping: shift-scale all values by training only
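The no-snooping version of shift-scaling is simply: estimate the statistics on the training period only, then freeze them (toy series; the snooping version would compute mu and sigma over train + test):

```python
import numpy as np

# No-snooping normalization: estimate shift/scale on the training years
# only, then apply the frozen statistics to the test years.
rng = np.random.default_rng(0)
series = rng.normal(loc=30.0, scale=5.0, size=2000)  # toy exchange-rate values
train, test = series[:1500], series[1500:]

mu, sigma = train.mean(), train.std()   # training data only
train_z = (train - mu) / sigma
test_z = (test - mu) / sigma            # test never influences mu, sigma
```

Computing `mu, sigma` from `series` instead would leak a summary of every test value into every training value, which is exactly the snooping in the plot above.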


Data Snooping by Data Reusing

Data Snooping by Data Reusing: Research Scenario

with my precious data

• paper 1: propose algorithm 1 that works well on data
• paper 2: find room for improvement, propose algorithm 2
—and publish only if better than algorithm 1 on data
• paper 3: find room for improvement, propose algorithm 3
—and publish only if better than algorithm 2 on data
• . . .
• if all papers from the same author were merged into one big paper: as if using a super-sophisticated model that includes algorithms 1, 2, 3, . . .
• step-wise: later authors snooped data by reading earlier papers, bad generalization worsened by “publish only if better”

If you torture the data long enough, it will

confess.—Folklore in ML/DM


Avoid Big Data Snooping

data snooping =⇒ human overfitting

Honesty Matters
• very hard to avoid data snooping, unless being extremely honest
• extremely honest: lock your test data in a safe
• less honest: reserve validation data and use cautiously

Guidelines
• be blind: avoid making modeling decisions by data
• be suspicious: interpret findings (including your own) with a proper feeling of contamination—keep your data fresh if possible

one last secret to winning KDDCups: the “art” of carefully balancing between data-driven modeling (snooping) & validation (no-snooping)


Learning to Escape Traps for Big Data

Must-learn Items

• combat overfitting: regularization and validation
• correct training/testing mismatch: philosophy and perhaps some heuristics
• avoid data snooping: philosophy and research cycle (remember?)

happy big data learning!


Summary

• human must-learn ML topics for big data:
• procedure: research cycle
• tools: simple model, feature construction, overfitting elimination
• sense: philosophy behind machine learning
• foundations even more important in big data age

—now a shameless sales campaign for my co-authored book and online course (to be re-run on September 8, 2015)
—special thanks to Prof. Yuh-Jye Lee and Mr. Yi-Hung Huang for suggesting materials

Thank you!


Appendix: ML Foundations on NTU@Coursera

https://www.coursera.org/course/ntumlone

When can machines learn?
• L1: the learning problem (✓)
• L2: learning to answer yes/no (✓)
• L3: types of learning (✓)
• L4: feasibility of learning

Why can machines learn?
• L5: training versus testing
• L6: theory of generalization
• L7: the VC dimension (✓)
• L8: noise and error

How can machines learn?
• L9: linear regression (✓)
• L10: logistic regression (✓)
• L11: linear models for classification (✓)
• L12: nonlinear transformation (✓)

How can machines learn better?
• L13: hazard of overfitting (✓)
• L14: regularization (✓)
• L15: validation (✓)
• L16: three learning principles (✓)

(✓) ≈ must-learn
