This page reproduces the content of http://www.slideshare.net/tw_dsconf/ss-56071386.


by 台灣資料科學年會 (Taiwan Data Science Conference)

Uploaded on 2015/12/12, in the Learning category

機器學習速遊 (Quick Tour of Machine Learning)


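Machine learning aims to let computers improve themselves through experience accumulated from data. In recent years it has been widely applied to data mining, computer vision, natural language processing, biometric recognition, search engines, medical diagnosis, credit card fraud detection, securities market analysis, DNA sequencing, speech and handwriting recognition, strategy games, and robotics. It has become one of the foundational disciplines of data science and an essential tool for any data scientist.

In this course, Professor Hsuan-Tien Lin (林軒田) of the Department of Computer Science and Information Engineering at National Taiwan University uses just six hours to take the audience on a quick exploration of the foundations of machine learning, introducing the core models and some popular techniques. The aim is to help participants understand the field efficiently and solidly, so that they can make proper use of the various machine learning tools. The course suits every data analyst who wants to start working with data, and is recommended to all data science enthusiasts interested in data analysis.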


Quick Tour of Machine Learning (機器學習速遊)

Hsuan-Tien Lin (林軒田)

htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

資料科學愛好者年會系列活動 (Data Science Enthusiasts Conference event series), 2015/12/12

Hsuan-Tien Lin (NTU CSIE)

Quick Tour of Machine Learning

0/128 - Learning from Data

Disclaimer

• just a super-condensed and shuffled version of

• my co-authored textbook “Learning from Data: A Short Course”

• my two NTU-Coursera Mandarin-teaching ML Massive Open Online Courses

• “Machine Learning Foundations”: www.coursera.org/course/ntumlone

• “Machine Learning Techniques”: www.coursera.org/course/ntumltwo

impossible to be complete, with most math details removed

• live interaction is important

goal: help you begin your journey with ML


Learning from Data ::

What is Machine Learning


From Learning to Machine Learning

learning: acquiring skill with experience accumulated from observations

observations → learning → skill

machine learning: acquiring skill with experience accumulated/computed from data

data → ML → skill

What is skill?


A More Concrete Definition

skill ⇔ improve some performance measure (e.g. prediction accuracy)

machine learning: improving some performance measure with experience computed from data

data → ML → improved performance measure

An Application in Computational Finance

stock data → ML → more investment gain

Why use machine learning?


Yet Another Application: Tree Recognition

• ‘define’ trees and hand-program: difficult

• learn from data (observations) and recognize: a 3-year-old can do so

• an ‘ML-based tree recognition system’ can be easier to build than a hand-programmed system

ML: an alternative route to build complicated systems


The Machine Learning Route

ML: an alternative route to build complicated systems

Some Use Scenarios

• when humans cannot program the system manually (e.g. navigating on Mars)

• when humans cannot ‘define the solution’ easily (e.g. speech/visual recognition)

• when rapid decisions are needed that humans cannot make (e.g. high-frequency trading)

• when the system must be user-oriented on a massive scale (e.g. consumer-targeted marketing)

Give a computer a fish, you feed it for a day;

teach it how to fish, you feed it for a lifetime. :-)


Machine Learning and Artificial Intelligence

Machine Learning: use data to compute something that improves performance

Artificial Intelligence: compute something that shows intelligent behavior

• improving performance is something that shows intelligent behavior, so ML can realize AI, among other routes

• e.g. chess playing

• traditional AI: game tree

• ML for AI: ‘learning from board data’

ML is one possible and popular route to realize AI


Learning from Data ::

Components of Machine Learning


Components of Learning: Metaphor Using Credit Approval

Applicant Information

age: 23 years

gender: female

annual salary: NTD 1,000,000

year in residence: 1 year

year in job: 0.5 year

current debt: 200,000

what to learn? (for improving performance): ‘approve credit card good for bank?’


Formalize the Learning Problem

Basic Notations

• input: x ∈ X (customer application)

• output: y ∈ Y (good/bad after approving credit card)

• unknown underlying pattern to be learned ⇔ target function:

f : X → Y (ideal credit approval formula)

• data ⇔ training examples: D = {(x1, y1), (x2, y2), · · · , (xN, yN)}

(historical records in bank)

• hypothesis ⇔ skill with hopefully good performance: g : X → Y (‘learned’ formula to be used), i.e. approve if

• h1: annual salary > NTD 800,000

• h2: debt > NTD 100,000 (really?)

• h3: year in job ≤ 2 (really?)

all candidate formulas being considered: hypothesis set H

procedure to learn the best formula: algorithm A

{(xn, yn)} from f → ML (A, H) → g
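The notation can be made concrete with a minimal Python sketch. The three rules mirror h1 to h3; the four training examples, the field names, and the "pick the rule with the fewest mistakes" algorithm are illustrative assumptions, not a procedure stated on the slides.

```python
# Hypothetical training examples D: (applicant features x, label y),
# with y = +1 (good customer) or -1 (bad customer).
D = [
    ({"salary": 1_000_000, "debt": 200_000, "years_in_job": 0.5}, +1),
    ({"salary": 500_000, "debt": 300_000, "years_in_job": 1.0}, -1),
    ({"salary": 900_000, "debt": 50_000, "years_in_job": 5.0}, +1),
    ({"salary": 600_000, "debt": 150_000, "years_in_job": 3.0}, -1),
]

# Hypothesis set H: the three candidate "approve if ..." rules.
H = {
    "h1": lambda x: +1 if x["salary"] > 800_000 else -1,
    "h2": lambda x: +1 if x["debt"] > 100_000 else -1,
    "h3": lambda x: +1 if x["years_in_job"] <= 2 else -1,
}


def mistakes(h, data):
    """Number of training examples on which hypothesis h disagrees with y."""
    return sum(1 for x, y in data if h(x) != y)


def algorithm_A(hypotheses, data):
    """Toy learning algorithm: name of the rule with the fewest mistakes on D."""
    return min(hypotheses, key=lambda name: mistakes(hypotheses[name], data))


g_name = algorithm_A(H, D)
g = H[g_name]  # the final hypothesis g
```

On this made-up data the salary rule h1 makes no mistakes, so it is returned as g; a real hypothesis set is usually far larger than three rules.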


Practical Definition of Machine Learning

unknown target function f : X → Y (ideal credit approval formula)

↓

training examples D : (x1, y1), · · · , (xN, yN) (historical records in bank) → learning algorithm A → final hypothesis g ≈ f (‘learned’ formula to be used)

↑

hypothesis set H (set of candidate formulas)

machine learning (A and H): use data to compute hypothesis g that approximates target f


Key Essence of Machine Learning

machine learning: use data to compute hypothesis g that approximates target f

data → ML → improved performance measure

1. there exists some ‘underlying pattern’ to be learned, so the ‘performance measure’ can be improved

2. but there is no programmable (easy) definition, so ‘ML’ is needed

3. somehow there is data about the pattern, so ML has some ‘inputs’ to learn from

key essence: help decide whether to use ML


Learning from Data ::

Types of Machine Learning


Visualizing Credit Card Problem

• customer features x: points on the plane (or points in $\mathbb{R}^d$)

• labels y: ◦ (+1), × (-1), called binary classification

• hypothesis h: lines here, but possibly other curves

• different curves classify customers differently

binary classification algorithm: find good decision boundary curve g


More Binary Classification Problems

• credit approve/disapprove

• email spam/non-spam

• patient sick/not sick

• ad profitable/not profitable

core and important problem, with many tools serving as building blocks of other tools


Binary Classification for Education

data → ML → skill

• data: students’ records on quizzes on a Math tutoring system

• skill: predict whether a student can give a correct answer to another quiz question

A Possible ML Solution

answer correctly ≈ recent strength of student > difficulty of question

• give ML 9 million records from 3000 students

• ML determines (reverse-engineers) strength and difficulty automatically

key part of the world-champion system from National Taiwan Univ. in KDDCup 2010


Multiclass Classification: Coin Recognition Problem

• classify US coins (1c, 5c, 10c, 25c) by (size, mass)

• Y = {1c, 5c, 10c, 25c}, or Y = {1, 2, · · · , K} (abstractly)

• binary classification: special case with K = 2

[Figure: scatter plot of coins, Size (horizontal axis) against Mass (vertical axis), with groups labelled 1, 5, 10 and 25]

Other Multiclass Classification Problems

• written digits ⇒ 0, 1, · · · , 9

• pictures ⇒ apple, orange, strawberry

• emails ⇒ spam, primary, social, promotion, update (Google)

many applications in practice, especially for ‘recognition’


Regression: Patient Recovery Prediction Problem

• binary classification: patient features ⇒ sick or not

• multiclass classification: patient features ⇒ which type of cancer

• regression: patient features ⇒ how many days before recovery

• Y = R or Y = [lower, upper] ⊂ R (bounded regression)

—deeply studied in statistics

Other Regression Problems

• company data ⇒ stock price

• climate data ⇒ temperature

also core and important, with many ‘statistical’ tools serving as building blocks of other tools


Regression for Recommender System (1/2)

data → ML → skill

• data: how users have rated some movies

• skill: predict how a user would rate an unrated movie

A Hot Problem

• competition held by Netflix in 2006

• 100,480,507 ratings that 480,189 users gave to 17,770 movies

• 10% improvement = 1 million dollar prize

• similar competition (movies → songs) held by Yahoo! in KDDCup 2011

• 252,800,275 ratings that 1,000,990 users gave to 624,961 songs

How can machines learn our preferences?


Regression for Recommender System (2/2)

A Possible ML Solution

• pattern: rating ← viewer/movie factors

• learning: known rating → learned factors → unknown rating prediction

[Figure: viewer factors (likes comedy? likes action? prefers blockbusters? likes Tom Cruise?) are matched with movie factors (comedy content, action content, blockbuster?, Tom Cruise in it?); the contributions from each factor are added to give the predicted rating]

key part of the world-champion (again!) system from National Taiwan Univ. in KDDCup 2011
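The "match factors and add contributions" idea can be sketched as a plain inner product. This is only an illustration under assumptions: the factor order and all numbers are invented, and learning the factors from known ratings (the hard part) is not shown.

```python
def predict_rating(viewer_factors, movie_factors):
    """Match viewer and movie factors and add up the contribution of each factor."""
    return sum(v * m for v, m in zip(viewer_factors, movie_factors))


# Hypothetical factor order: (comedy, action, blockbuster, Tom Cruise).
viewer = [1.0, 0.5, 0.0, 2.0]  # how much this viewer cares about each factor
movie = [0.0, 2.0, 1.0, 1.0]   # how strongly the movie exhibits each factor

rating = predict_rating(viewer, movie)
```

Here the action factor contributes 1.0 and the Tom Cruise factor 2.0, for a predicted rating of 3.0.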


Supervised versus Unsupervised

[Figure: two Size-against-Mass scatter plots of coins. Left: coin recognition with yn (groups labelled 1, 5, 10, 25), i.e. supervised multiclass classification. Right: coin recognition without yn, i.e. unsupervised multiclass classification ⇐⇒ ‘clustering’]

Other Clustering Problems

• articles ⇒ topics

• consumer profiles ⇒ consumer groups

clustering: a challenging but useful problem


Semi-supervised: Coin Recognition with Some yn

[Figure: three Size-against-Mass scatter plots of coins: supervised (all points labelled), semi-supervised (a few points labelled), unsupervised (clustering, no labels)]

Other Semi-supervised Learning Problems

• face images with a few labeled ⇒ face identifier (Facebook)

• medicine data with a few labeled ⇒ medicine effect predictor

semi-supervised learning: leverage unlabeled data to avoid ‘expensive’ labeling


Reinforcement Learning

a ‘very different’ but natural way of learning

Teach Your Dog: Say ‘Sit Down’

The dog pees on the ground.

BAD DOG. THAT’S A VERY WRONG ACTION.

• cannot easily show the dog that yn = sit when xn = ‘sit down’

• but can ‘punish’ to say ỹn = pee is wrong


The dog sits down.

Good Dog. Let me give you some cookies.

• still cannot show yn = sit when xn = ‘sit down’

• but can ‘reward’ to say ỹn = sit is good

Other Reinforcement Learning Problems Using (x, ỹ, goodness)

• (customer, ad choice, ad click earning) ⇒ ad system

• (cards, strategy, winning amount) ⇒ black jack agent

reinforcement: learn with ‘partial/implicit information’ (often sequentially)


Learning from Data ::

Step-by-step Machine Learning


Step-by-step Machine Learning

unknown target function f : X → Y (ideal credit approval formula)

↓

training examples D : (x1, y1), · · · , (xN, yN) (historical records in bank) → learning algorithm A → final hypothesis g ≈ f (‘learned’ formula to be used)

↑

hypothesis set H (set of candidate formulas)

1. choose error measure: how g(x) ≈ f(x)

2. decide hypothesis set H

3. optimize error and more on D as A

4. pray for generalization: whether g(x) ≈ f(x) for unseen x


Choose Error Measure

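g ≈ f can often be evaluated by the averaged err(g(x), f(x)), called the pointwise error measure

in-sample (within data): $E_{\text{in}}(g) = \frac{1}{N} \sum_{n=1}^{N} \text{err}(g(x_n), \underbrace{f(x_n)}_{y_n})$

out-of-sample (future data): $E_{\text{out}}(g) = \mathop{\mathbb{E}}_{\text{future } x} \text{err}(g(x), f(x))$

will start from the 0/1 error $\text{err}(\tilde{y}, y) = [\![ \tilde{y} \neq y ]\!]$ for classification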
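A minimal sketch of the in-sample 0/1 error, assuming a made-up one-dimensional data set and a hypothetical threshold rule g; the out-of-sample error cannot be computed this way because future x are not available.

```python
def zero_one_err(y_pred, y_true):
    """Pointwise 0/1 error: 1 if the prediction is wrong, 0 otherwise."""
    return 1 if y_pred != y_true else 0


def in_sample_error(g, data):
    """E_in(g): average pointwise error of g over the examples (x_n, y_n)."""
    return sum(zero_one_err(g(x), y) for x, y in data) / len(data)


# Hypothetical 1-D examples (x_n, y_n) and a hypothetical threshold rule g.
data = [(0.2, -1), (0.4, -1), (0.6, +1), (0.9, +1), (0.5, +1)]
g = lambda x: +1 if x > 0.55 else -1

e_in = in_sample_error(g, data)  # g is wrong only on (0.5, +1)
```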


Choose Hypothesis Set (for Credit Approval)

age: 23 years

annual salary: NTD 1,000,000

year in job: 0.5 year

current debt: 200,000

• For x = (x1, x2, · · · , xd) ‘features of customer’, compute a weighted ‘score’ and

approve credit if $\sum_{i=1}^{d} w_i x_i > \text{threshold}$

deny credit if $\sum_{i=1}^{d} w_i x_i < \text{threshold}$

• Y: {+1 (good), −1 (bad)}, 0 ignored; the linear formulas h ∈ H are

$h(x) = \text{sign}\left( \left( \sum_{i=1}^{d} w_i x_i \right) - \text{threshold} \right)$

linear (binary) classifier, called ‘perceptron’ historically
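The perceptron hypothesis can be written in a few lines of Python. The weights, the threshold, and the rescaling of money to units of NTD 1,000,000 are assumptions chosen only to make the arithmetic readable.

```python
def perceptron(w, threshold):
    """Return h with h(x) = sign(sum_i w_i * x_i - threshold)."""
    def h(x):
        score = sum(wi * xi for wi, xi in zip(w, x))
        if score > threshold:
            return +1  # approve credit
        if score < threshold:
            return -1  # deny credit
        return 0       # exactly on the boundary (the ignored case)
    return h


# Hypothetical weights for (age, annual salary, year in job, current debt),
# with money expressed in units of NTD 1,000,000.
h = perceptron(w=[0.0, 1.0, 0.5, -2.0], threshold=0.5)
decision = h([23, 1.0, 0.5, 0.2])  # the 23-year-old applicant from the slide
```

With these made-up weights the score is 0.85, above the threshold, so the applicant is approved; different weights give a different h from the same hypothesis set.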


Optimize Error (and More) on Data

H = all possible perceptrons, g =?

• want: g ≈ f (hard when f unknown)

• almost necessary: g ≈ f on D, ideally g(xn) = f(xn) = yn

• difficult: H is of infinite size

• idea: start from some g0, and ‘correct’ its mistakes on D

let’s visualize without math

Seeing is Believing

[Figure: animation of the perceptron learning algorithm (PLA) on a 2-D data set. Frames: initially; updates 1 to 9, each moving w(t) to w(t+1) to correct one misclassified point (x1, x9, x14, x3, x9, x14, x9, x14, x9 in turn); finally, the separating line wPLA]

worked like a charm with < 20 lines!!

A fault confessed is half redressed. :-)
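The "start from some g0 and correct its mistakes" idea can be sketched in Python as follows. The update w(t+1) = w(t) + yn xn is the standard PLA rule, which the extracted slide text does not spell out, and the four data points are hypothetical.

```python
def sign(s):
    return 1 if s > 0 else -1  # a score of exactly 0 is treated as -1 here


def pla(data, max_updates=10_000):
    """Perceptron learning: correct one mistake per update until none remain.

    data: list of (x, y), where x starts with the constant 1 (so w[0] plays
    the role of -threshold) and y is +1 or -1.
    """
    w = [0.0] * len(data[0][0])
    for _ in range(max_updates):
        mistake = next(
            ((x, y) for x, y in data
             if sign(sum(wi * xi for wi, xi in zip(w, x))) != y),
            None,
        )
        if mistake is None:
            return w  # no mistakes left on D: this is w_PLA
        x, y = mistake
        w = [wi + y * xi for wi, xi in zip(w, x)]  # w(t+1) = w(t) + y_n x_n
    raise RuntimeError("no separating line found within max_updates")


# Hypothetical linearly separable data: (1, x1, x2) -> label.
data = [((1, 2.0, 2.0), +1), ((1, 3.0, 1.0), +1),
        ((1, 0.0, 0.5), -1), ((1, 1.0, 0.0), -1)]
w_pla = pla(data)
```

The loop is guaranteed to stop only when the data are linearly separable, which is why the sketch carries a `max_updates` cap.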

Pray for Generalization

(pictures from Google Image Search)

Human learning: Parent → (picture, label) pairs → Kid’s brain → good hypothesis, with alternatives feeding the brain

Machine learning: Target f(x) + noise → examples (picture xn, label yn) → learning algorithm → good hypothesis g(x) ≈ f(x), with hypothesis set H feeding the algorithm

challenge: see only {(xn, yn)} without knowing f nor noise

=⇒ (?) generalize to unseen (x, y) w.r.t. f(x)


Generalization Is Non-trivial

Bob impresses Alice by memorizing every given (movie, rank); but he is too nervous about a new movie and guesses randomly

(pictures from Google Image Search)

memorize ≠ generalize

perfect from Bob’s view ≠ good for Alice

perfect during training ≠ good when testing

take-home message: if H is simple (like lines), generalization is usually possible


Mini-Summary

Learning from Data

What is Machine Learning

use data to approximate target

Components of Machine Learning

algorithm A takes data D and hypotheses H to get hypothesis g

Types of Machine Learning

variety of problems almost everywhere

Step-by-step Machine Learning

error, hypotheses, optimize, generalize


Fundamental Machine Learning Models ::

Linear Regression


Credit Limit Problem

age: 23 years

gender: female

annual salary: NTD 1,000,000

year in residence: 1 year

year in job: 0.5 year

current debt: 200,000

credit limit? 100,000

unknown target function f : X → Y (ideal credit limit formula)

↓

training examples D : (x1, y1), · · · , (xN, yN) (historical records in bank) → learning algorithm A → final hypothesis g ≈ f (‘learned’ formula to be used)

↑

hypothesis set H (set of candidate formulas)

Y = R: regression


Linear Regression Hypothesis

age: 23 years

annual salary: NTD 1,000,000

year in job: 0.5 year

current debt: 200,000

• For x = (x0, x1, x2, · · · , xd) ‘features of customer’, approximate the desired credit limit with a weighted sum:

$y \approx \sum_{i=0}^{d} w_i x_i$

• linear regression hypothesis: $h(x) = w^T x$

h(x): like perceptron, but without the sign


Illustration of Linear Regression

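[Figure: two plots. Left: x = (x) ∈ R, points and a fitted line, y against x. Right: x = (x1, x2) ∈ R², points and a fitted plane, y against x1 and x2]

linear regression: find lines/hyperplanes with small residuals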


The Error Measure

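popular/historical error measure: squared error $\text{err}(\hat{y}, y) = (\hat{y} - y)^2$

in-sample: $E_{\text{in}}(h_w) = \frac{1}{N} \sum_{n=1}^{N} (\underbrace{h(x_n)}_{w^T x_n} - y_n)^2$

out-of-sample: $E_{\text{out}}(w) = \mathop{\mathbb{E}}_{(x, y) \sim P} (w^T x - y)^2$

next: how to minimize Ein(w)?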


Minimize Ein

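$\min_{w} E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} (w^T x_n - y_n)^2$

• Ein(w): continuous, differentiable, convex

• necessary condition of ‘best’ w:

$\nabla E_{\text{in}}(w) \equiv \begin{bmatrix} \frac{\partial E_{\text{in}}}{\partial w_0}(w) \\ \frac{\partial E_{\text{in}}}{\partial w_1}(w) \\ \vdots \\ \frac{\partial E_{\text{in}}}{\partial w_d}(w) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$

(not possible to ‘roll down’ any further)

[Figure: Ein plotted as a bowl-shaped function of w]

task: find wLIN such that ∇Ein(wLIN) = 0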


Linear Regression Algorithm

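1. from D, construct input matrix X and output vector y by

$X = \begin{bmatrix} -\, x_1^T \,- \\ -\, x_2^T \,- \\ \cdots \\ -\, x_N^T \,- \end{bmatrix}$ (N × (d+1)), $y = \begin{bmatrix} y_1 \\ y_2 \\ \cdots \\ y_N \end{bmatrix}$ (N × 1)

2. calculate pseudo-inverse $X^\dagger$ ((d+1) × N)

3. return $w_{\text{LIN}} = X^\dagger y$ ((d+1) × 1)

simple and efficient with good † routine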
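A dependency-free sketch of the three steps, under the assumption that XᵀX is invertible, in which case X†y equals the solution of (XᵀX) w = Xᵀy. The helper names and the tiny data set are hypothetical; in practice a library pseudo-inverse routine would be the natural choice.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def linear_regression(xs, ys):
    """Return w_LIN by solving (X^T X) w = X^T y, with x_0 = 1 prepended."""
    X = [[1.0] + list(x) for x in xs]  # step 1: N x (d+1) input matrix
    Xt = transpose(X)
    XtX = matmul(Xt, X)                # (d+1) x (d+1)
    Xty = [sum(a * b for a, b in zip(row, ys)) for row in Xt]
    return solve(XtX, Xty)             # steps 2 and 3 combined


# Hypothetical data generated exactly by y = 1 + 2 x.
w_lin = linear_regression([(0.0,), (1.0,), (2.0,), (3.0,)], [1.0, 3.0, 5.0, 7.0])
```

Because the made-up data lie exactly on a line, the recovered weights are (1, 2) up to rounding.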


Is Linear Regression a ‘Learning Algorithm’?

wLIN = X†y

No!

• analytic (closed-form) solution, ‘instantaneous’

• not improving Ein nor Eout iteratively

Yes!

• good Ein? yes, optimal!

• good Eout? yes, ‘simple’ like perceptrons

• improving iteratively? somewhat, within an iterative pseudo-inverse routine

if Eout(wLIN) is good, learning ‘happened’!


Fundamental Machine Learning Models ::

Logistic Regression


Heart Attack Prediction Problem (1/2)

age: 40 years

gender: male

blood pressure: 130/85

cholesterol level: 240

weight: 70

heart disease? yes

unknown target distribution P(y|x) containing f(x) + noise

↓

training examples D : (x1, y1), · · · , (xN, yN) → learning algorithm A → final hypothesis g ≈ f

↑

hypothesis set H and error measure err

binary classification: ideal $f(x) = \text{sign}\left( P(+1 \mid x) - \frac{1}{2} \right) \in \{-1, +1\}$ because of classification err


Heart Attack Prediction Problem (2/2)

same patient features (age 40 years, male, blood pressure 130/85, cholesterol level 240, weight 70) and the same learning flow as in (1/2), but the desired output is now: heart attack? 80% risk

‘soft’ binary classification: f(x) = P(+1|x) ∈ [0, 1]


Soft Binary Classification

target function f(x) = P(+1|x) ∈ [0, 1]

ideal (noiseless) data: (x1, y1 = 0.9 = P(+1|x1)), (x2, y2 = 0.2 = P(+1|x2)), · · · , (xN, yN = 0.6 = P(+1|xN))

actual (noisy) data: (x1, y1 = ◦ ∼ P(y|x1)), (x2, y2 = × ∼ P(y|x2)), · · · , (xN, yN = × ∼ P(y|xN))

same data as hard binary classification, different target function

viewed as 0/1 outcomes, the actual (noisy) data read: (x1, y1 = 1 = ⟦◦ ?∼ P(y|x1)⟧), (x2, y2 = 0 = ⟦◦ ?∼ P(y|x2)⟧), · · · , (xN, yN = 0 = ⟦◦ ?∼ P(y|xN)⟧)


Logistic Hypothesis

age: 40 years

gender: male

blood pressure: 130/85

cholesterol level: 240

• For x = (x0, x1, x2, · · · , xd) ‘features of patient’, calculate a weighted ‘risk score’:

$s = \sum_{i=0}^{d} w_i x_i$

• convert the score to estimated probability by logistic function θ(s)

[Figure: the logistic function θ(s), an S-shaped curve rising from 0 to 1 as s increases]

logistic hypothesis: $h(x) = \theta(w^T x) = \frac{1}{1 + \exp(-w^T x)}$
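A short sketch of the logistic hypothesis in Python. The weights and the rescaled patient features are invented for illustration and carry no medical meaning.

```python
import math


def theta(s):
    """Logistic function: maps a real score s to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-s))


def logistic_hypothesis(w):
    """Return h with h(x) = theta(w^T x); x must start with x_0 = 1."""
    return lambda x: theta(sum(wi * xi for wi, xi in zip(w, x)))


# Hypothetical weights for the features (1, age / 100, cholesterol / 1000).
h = logistic_hypothesis([-1.0, 2.0, 3.0])
risk = h([1, 0.40, 0.24])  # estimated P(+1 | x) for this hypothetical patient
```

A score of 0 maps to probability 0.5, and θ(s) + θ(−s) = 1, so positive scores lean towards +1 and negative scores towards −1.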


Minimizing Ein(w)

a popular error: $E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left( 1 + \exp(-y_n w^T x_n) \right)$, called cross-entropy, derived from maximum likelihood

• Ein(w): continuous, differentiable, twice-differentiable, convex

• how to minimize? locate the valley: want ∇Ein(w) = 0

[Figure: Ein plotted as a valley-shaped function of w]

most basic algorithm: gradient descent (roll downhill)


Gradient Descent

For t = 0, 1, . . .

$w_{t+1} \leftarrow w_t + \eta v$

when stop, return last w as g

• PLA: v comes from mistake correction

• smooth Ein(w) for logistic regression: choose v to get the ball to roll ‘downhill’?

• direction v: (assumed) of unit length

• step size η: (assumed) positive

[Figure: In-sample Error, Ein (vertical axis) against Weights, w (horizontal axis), with a ball rolling downhill]

gradient descent: $v \propto -\nabla E_{\text{in}}(w_t)$


Putting Everything Together

Logistic Regression Algorithm

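initialize w0

For t = 0, 1, · · ·

1. compute $\nabla E_{\text{in}}(w_t) = \frac{1}{N} \sum_{n=1}^{N} \theta\left( -y_n w_t^T x_n \right) \left( -y_n x_n \right)$

2. update by $w_{t+1} \leftarrow w_t - \eta \nabla E_{\text{in}}(w_t)$

...until ∇Ein(wt+1) ≈ 0 or enough iterations

return last wt+1 as g

can use more sophisticated tools to speed up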
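The algorithm can be sketched in plain Python as below. The step size eta = 0.1, the fixed count of 1000 iterations (standing in for "enough iterations"), and the six data points are assumptions made for illustration.

```python
import math


def theta(s):
    return 1.0 / (1.0 + math.exp(-s))


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def gradient(w, data):
    """(1/N) * sum_n theta(-y_n w^T x_n) * (-y_n x_n)."""
    g = [0.0] * len(w)
    for x, y in data:
        coeff = theta(-y * dot(w, x))
        g = [gi + coeff * (-y * xi) for gi, xi in zip(g, x)]
    return [gi / len(data) for gi in g]


def cross_entropy(w, data):
    """E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n w^T x_n))."""
    return sum(math.log(1.0 + math.exp(-y * dot(w, x))) for x, y in data) / len(data)


def logistic_regression(data, eta=0.1, iterations=1000):
    w = [0.0] * len(data[0][0])  # initialize w_0
    for _ in range(iterations):
        g = gradient(w, data)
        w = [wi - eta * gi for wi, gi in zip(w, g)]  # w_{t+1} = w_t - eta * gradient
    return w


# Hypothetical 1-D data with x_0 = 1 prepended; the labels overlap a little (noise).
data = [((1, -2.0), -1), ((1, -1.0), -1), ((1, -0.5), +1),
        ((1, 0.5), -1), ((1, 1.0), +1), ((1, 2.0), +1)]
w = logistic_regression(data)
```

Comparing `cross_entropy(w, data)` with `cross_entropy([0.0, 0.0], data)` shows how far the descent has lowered Ein from its starting value of ln 2.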


Linear Models Summarized

linear scoring function: $s = w^T x$

linear classification: h(x) = sign(s); plausible err = 0/1; discrete Ein(w): solvable in special case

linear regression: h(x) = s; friendly err = squared; quadratic convex Ein(w): closed-form solution

logistic regression: h(x) = θ(s); plausible err = cross-entropy; smooth convex Ein(w): gradient descent

[Figure: for each of the three models, inputs x0, x1, x2, · · · , xd are combined into the score s, which is then mapped to h(x)]

my ‘secret’: linear first!!
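The three models differ only in what they do with the shared score s, which a few lines of Python make explicit. The weights and the input are hypothetical.

```python
import math


def score(w, x):
    """Linear scoring function s = w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def linear_classification(w, x):
    return 1 if score(w, x) > 0 else -1           # h(x) = sign(s)


def linear_regression(w, x):
    return score(w, x)                            # h(x) = s


def logistic_regression(w, x):
    return 1.0 / (1.0 + math.exp(-score(w, x)))   # h(x) = theta(s)


w = [-1.0, 2.0]  # hypothetical weights (w_0, w_1)
x = [1.0, 1.5]   # x_0 = 1 and x_1 = 1.5, so s = 2.0
outputs = (linear_classification(w, x), linear_regression(w, x), logistic_regression(w, x))
```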


Fundamental Machine Learning Models ::

Nonlinear Transform


Linear Hypotheses

up to now: linear hypotheses

• visually: ‘line’-like boundary

• mathematically: linear scores $s = w^T x$

but limited . . .

• theoretically: complexity under control :-)

• practically: on some D, large Ein for every line :-(

[Figure: a 2-D data set on [−1, 1] × [−1, 1] that no line separates well]

how to break the limit of linear hypotheses


Circular Separable

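[Figure: two plots of the same data on [−1, 1] × [−1, 1], without and with a circular decision boundary]

• D not linear separable

• but circular separable by a circle of radius $\sqrt{0.6}$ centered at origin:

$h_{\text{SEP}}(x) = \text{sign}\left( -x_1^2 - x_2^2 + 0.6 \right)$

re-derive Circular-PLA, Circular-Regression, blahblah . . . all over again? :-)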


Circular Separable and Linear Separable

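$h(x) = \text{sign}\Big( \underbrace{0.6}_{\tilde{w}_0} \cdot \underbrace{1}_{z_0} + \underbrace{(-1)}_{\tilde{w}_1} \cdot \underbrace{x_1^2}_{z_1} + \underbrace{(-1)}_{\tilde{w}_2} \cdot \underbrace{x_2^2}_{z_2} \Big) = \text{sign}\left( \tilde{w}^T z \right)$

• {(xn, yn)} circular separable =⇒ {(zn, yn)} linear separable

• $x \in X \xrightarrow{\Phi} z \in Z$: (nonlinear) feature transform Φ

[Figure: left, the data in X-space (x1, x2) on [−1, 1] × [−1, 1] with a circular boundary; right, the same data in Z-space (z1, z2) on [0, 1] × [0, 1] with a straight-line boundary]

circular separable in X =⇒ linear separable in Z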


General Quadratic Hypothesis Set

a ‘bigger’ Z-space with $\Phi_2(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2)$

perceptrons in Z-space ⇐⇒ quadratic hypotheses in X-space

$H_{\Phi_2} = \left\{ h(x) : h(x) = \tilde{h}(\Phi_2(x)) \text{ for some linear } \tilde{h} \text{ on } Z \right\}$

• can implement all possible quadratic curve boundaries: circle, ellipse, rotated ellipse, hyperbola, parabola, . . .

ellipse $2(x_1 + x_2 - 3)^2 + (x_1 - x_2 - 4)^2 = 1$ ⇐= $\tilde{w}^T = [33, -20, -4, 3, 2, 3]$

include lines and constants as degenerate cases

Hsuan-Tien Lin (NTU CSIE)

Quick Tour of Machine Learning
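A small numerical check of the ellipse example may help here. The sketch below is not from the slides: it assumes the feature order (1, x1, x2, x1², x1x2, x2²) and uses made-up helper names (`phi2`, `score`, `ellipse_lhs`, `h`) to confirm that the quoted weight vector reproduces the ellipse equation.

```python
def phi2(x1, x2):
    """Full 2nd-order polynomial transform (1, x1, x2, x1^2, x1*x2, x2^2)."""
    return (1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2)

# Weights quoted on the slide for the ellipse example.
W_ELLIPSE = (33.0, -20.0, -4.0, 3.0, 2.0, 3.0)

def score(w, z):
    """Linear score w^T z in Z-space."""
    return sum(wi * zi for wi, zi in zip(w, z))

def ellipse_lhs(x1, x2):
    """2(x1 + x2 - 3)^2 + (x1 - x2 - 4)^2 - 1, which is zero on the ellipse."""
    return 2 * (x1 + x2 - 3) ** 2 + (x1 - x2 - 4) ** 2 - 1

def h(x1, x2):
    """Quadratic hypothesis in X-space, i.e. a perceptron in Z-space."""
    return 1 if score(W_ELLIPSE, phi2(x1, x2)) > 0 else -1
```

With this sign convention, points outside the ellipse get +1 and points inside get −1; negating the weights would flip the labels.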

56 - Fundamental Machine Learning Models

Nonlinear Transform

Good Quadratic Hypothesis

Z-space ⟺ X-space

perceptrons ⟺ quadratic hypotheses

good perceptron ⟺ good quadratic hypothesis

separating perceptron ⟺ separating quadratic hypothesis

[figure: the data in Z-space (axes z1, z2, from 0 to 1) ⟺ the data in X-space (axes x1, x2, from −1 to 1)]

• want: get good perceptron in Z-space

• known: get good perceptron in X-space with data $\{(\mathbf{x}_n, y_n)\}$

solution: get good perceptron in Z-space with data $\{(\mathbf{z}_n = \Phi_2(\mathbf{x}_n), y_n)\}$

57 - Fundamental Machine Learning Models

Nonlinear Transform

The Nonlinear Transform Steps

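[figure: data in X-space mapped by Φ to Z-space; algorithm A learns a line in Z-space; the line maps back through Φ⁻¹ to a curve in X-space]

1. transform original data $\{(\mathbf{x}_n, y_n)\}$ to $\{(\mathbf{z}_n = \Phi(\mathbf{x}_n), y_n)\}$ by $\Phi$

2. get a good perceptron $\tilde{\mathbf{w}}$ using $\{(\mathbf{z}_n, y_n)\}$ and your favorite linear algorithm $\mathcal{A}$

3. return $g(\mathbf{x}) = \text{sign}\left(\tilde{\mathbf{w}}^T \Phi(\mathbf{x})\right)$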
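The three steps can be sketched in a few lines. This is only an illustration under stated assumptions: the transform is the circular one, (1, x1², x2²), the linear algorithm is plain PLA, and the toy data set (with a small gap left around the circle so PLA halts quickly) is hypothetical. None of the function names come from the slides.

```python
import random

def phi(x):
    """Feature transform (1, x1^2, x2^2) from the circular example."""
    return (1.0, x[0] ** 2, x[1] ** 2)

def sign(s):
    return 1 if s > 0 else -1

def dot(w, z):
    return sum(wi * zi for wi, zi in zip(w, z))

def pla(Z, y, max_updates=100000):
    """Perceptron learning algorithm, run on already-transformed data."""
    w = [0.0] * len(Z[0])
    for _ in range(max_updates):
        wrong = next(((z, yn) for z, yn in zip(Z, y) if sign(dot(w, z)) != yn), None)
        if wrong is None:          # no mistakes left: a separating perceptron
            break
        z, yn = wrong
        w = [wi + yn * zi for wi, zi in zip(w, z)]
    return w

def learn_with_transform(X, y, transform, algorithm):
    Z = [transform(x) for x in X]                  # step 1: transform the data
    w = algorithm(Z, y)                            # step 2: linear algorithm in Z-space
    return lambda x: sign(dot(w, transform(x)))    # step 3: g(x) = sign(w^T Phi(x))

# Hypothetical toy data: +1 inside the circle x1^2 + x2^2 = 0.6, -1 outside.
rng = random.Random(0)
X, y = [], []
while len(X) < 100:
    x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
    r2 = x[0] ** 2 + x[1] ** 2
    if abs(r2 - 0.6) > 0.1:
        X.append(x)
        y.append(1 if r2 < 0.6 else -1)

g = learn_with_transform(X, y, phi, pla)
```

Because the transformed data is linear separable, PLA stops with zero training error; swapping in another transform or another linear algorithm only changes the two arguments.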

58 - Fundamental Machine Learning Models

Nonlinear Transform

Nonlinear Model via Nonlinear Φ + Linear Models

[figure: data in X-space mapped by Φ to Z-space; algorithm A learns a line in Z-space; the line maps back through Φ⁻¹ to a curve in X-space]

two choices:

• feature transform Φ

• linear model A, not just binary classification

Pandora’s box :-): can now freely do quadratic PLA, quadratic regression, cubic regression, . . ., polynomial regression

59 - Fundamental Machine Learning Models

Nonlinear Transform

Feature Transform Φ

[figure: Φ maps raw digit images to the (Average Intensity, Symmetry) plane; algorithm A learns a line separating ‘1’ from ‘not 1’; Φ⁻¹ maps the result back]

more generally, not just polynomial:

raw (pixels) → concrete (intensity, symmetry), via domain knowledge

the force, too good to be true? :-)

60 - Fundamental Machine Learning Models

Nonlinear Transform

Computation/Storage Price

Q-th order polynomial transform: $\Phi_Q(\mathbf{x}) = \left(1,\ x_1, x_2, \ldots, x_d,\ x_1^2, x_1 x_2, \ldots, x_d^2,\ \ldots,\ x_1^Q, x_1^{Q-1} x_2, \ldots, x_d^Q\right)$

$\underbrace{1}_{\tilde{w}_0} + \underbrace{\tilde{d}}_{\text{others}}$ dimensions

= # ways of ≤ Q-combination from d kinds with repetitions

= $\binom{Q+d}{Q} = \binom{Q+d}{d} = O\left(Q^d\right)$

= efforts needed for computing/storing $\mathbf{z} = \Phi_Q(\mathbf{x})$ and $\tilde{\mathbf{w}}$

Q large ⟹ difficult to compute/store AND curve too complicated
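To get a feel for how fast the dimension count grows, the following sketch enumerates the monomials of a Q-th order transform and compares the count with the binomial formula. The function names `poly_dim` and `poly_transform` are illustrative, and the ordering of monomials is an arbitrary choice of this sketch.

```python
from itertools import combinations_with_replacement
from math import comb

def poly_dim(Q, d):
    """Number of dimensions 1 + d~ of the Q-th order transform on d inputs."""
    return comb(Q + d, d)

def poly_transform(x, Q):
    """All monomials of degree 0..Q in the coordinates of x."""
    z = []
    for degree in range(Q + 1):
        # each multiset of `degree` coordinate indices gives one monomial
        for idx in combinations_with_replacement(range(len(x)), degree):
            term = 1.0
            for i in idx:
                term *= x[i]
            z.append(term)
    return z
```

For instance, `poly_dim(10, 10)` is already 184756, which is the storage needed for a single transformed example.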

61 - Fundamental Machine Learning Models

Nonlinear Transform

Generalization Issue

which one do you prefer? :-)

• Φ1 ‘visually’ preferred

• Φ4: Ein(g) = 0 but overkill

Φ1 (original x)

Φ4

how to pick Q?

model selection (to be discussed) important


62 - Fundamental Machine Learning Models

Decision Tree

Fundamental Machine Learning Models ::

Decision Tree


63 - Fundamental Machine Learning Models

Decision Tree

Decision Tree for Watching MOOC Lectures

$G(\mathbf{x}) = \sum_{t=1}^{T} q_t(\mathbf{x}) \cdot g_t(\mathbf{x})$

• base hypothesis $g_t(\mathbf{x})$: leaf at end of path $t$, a constant here

• condition $q_t(\mathbf{x})$: is $\mathbf{x}$ on path $t$?

• usually with simple internal nodes

[figure: decision tree with internal nodes ‘quitting time?’ (< 18:30 / between / > 21:30), ‘has a date?’ (true / false), and ‘deadline?’ (> 2 days / between / < −2 days), ending in leaves Y / N]

decision tree: arguably one of the most human-mimicking models

64 - Fundamental Machine Learning Models

Decision Tree

Recursive View of Decision Tree

Path View: $G(\mathbf{x}) = \sum_{t=1}^{T} [\![\mathbf{x} \text{ on path } t]\!] \cdot \text{leaf}_t(\mathbf{x})$

Recursive View: $G(\mathbf{x}) = \sum_{c=1}^{C} [\![b(\mathbf{x}) = c]\!] \cdot G_c(\mathbf{x})$

• $G(\mathbf{x})$: full-tree hypothesis

• $b(\mathbf{x})$: branching criteria

• $G_c(\mathbf{x})$: sub-tree hypothesis at the $c$-th branch

[figure: decision tree with internal nodes ‘quitting time?’ (< 18:30 / between / > 21:30), ‘has a date?’ (true / false), and ‘deadline?’ (> 2 days / between / < −2 days), ending in leaves Y / N]

tree = (root, sub-trees), just like what your data structure instructor would say :-)

65 - Fundamental Machine Learning Models

Decision Tree

A Basic Decision Tree Algorithm

$G(\mathbf{x}) = \sum_{c=1}^{C} [\![b(\mathbf{x}) = c]\!]\, G_c(\mathbf{x})$

function DecisionTree(data $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$)
  if termination criteria met
    return base hypothesis $g_t(\mathbf{x})$
  else
    1. learn branching criteria $b(\mathbf{x})$
    2. split $\mathcal{D}$ to $C$ parts $\mathcal{D}_c = \{(\mathbf{x}_n, y_n) : b(\mathbf{x}_n) = c\}$
    3. build sub-tree $G_c \leftarrow$ DecisionTree($\mathcal{D}_c$)
    4. return $G(\mathbf{x}) = \sum_{c=1}^{C} [\![b(\mathbf{x}) = c]\!]\, G_c(\mathbf{x})$

four choices: number of branches, branching criteria, termination criteria, & base hypothesis

66 - Fundamental Machine Learning Models

Decision Tree

Classification and Regression Tree (C&RT)

function DecisionTree(data $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$)
  if termination criteria met
    return base hypothesis $g_t(\mathbf{x})$
  else ...
    2. split $\mathcal{D}$ to $C$ parts $\mathcal{D}_c = \{(\mathbf{x}_n, y_n) : b(\mathbf{x}_n) = c\}$

choices

• C = 2 (binary tree)

• $g_t(\mathbf{x})$ = $E_{\text{in}}$-optimal constant
  • binary/multiclass classification (0/1 error): majority of $\{y_n\}$
  • regression (squared error): average of $\{y_n\}$

• branching: threshold some selected dimension

• termination: fully-grown, or better pruned

disclaimer: C&RT here is based on selected components of CART™ of California Statistical Software
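The choices listed above fit into a short recursive sketch for classification. One detail is an assumption of this sketch rather than something the slide states: the dimension and threshold are picked by size-weighted Gini impurity, as standard CART does. The tree is fully grown (no pruning), and the function names are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of labels (assumed branching measure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Return (dimension, threshold) with the lowest size-weighted impurity, or None."""
    best, best_cost = None, None
    for i in range(len(X[0])):
        values = sorted(set(x[i] for x in X))
        for lo, hi in zip(values, values[1:]):
            theta = (lo + hi) / 2
            left = [yn for x, yn in zip(X, y) if x[i] < theta]
            right = [yn for x, yn in zip(X, y) if x[i] >= theta]
            cost = len(left) * gini(left) + len(right) * gini(right)
            if best_cost is None or cost < best_cost:
                best, best_cost = (i, theta), cost
    return best

def decision_tree(X, y):
    """Fully-grown binary tree; returns the hypothesis G as a function of x."""
    split = best_split(X, y) if len(set(y)) > 1 else None
    if split is None:  # termination: pure labels, or no threshold can split the inputs
        constant = Counter(y).most_common(1)[0][0]  # majority label (0/1 error)
        return lambda x: constant
    i, theta = split
    left = [(x, yn) for x, yn in zip(X, y) if x[i] < theta]
    right = [(x, yn) for x, yn in zip(X, y) if x[i] >= theta]
    g_left = decision_tree([x for x, _ in left], [yn for _, yn in left])
    g_right = decision_tree([x for x, _ in right], [yn for _, yn in right])
    return lambda x: g_left(x) if x[i] < theta else g_right(x)
```

For regression, the leaf would return the average of the labels and the impurity would be replaced by squared error.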

67 - Fundamental Machine Learning Models

Decision Tree

A Simple Data Set

[figure: C&RT on a simple data set]

C&RT: ‘divide-and-conquer’

68 - Fundamental Machine Learning Models

Decision Tree

Practical Specialties of C&RT

• human-explainable

• multiclass easily

• categorical features easily

• missing features easily

• efficient non-linear training (and testing)

—almost no other learning model share all such specialties,

except for other decision trees

another popular decision tree algorithm:

C4.5, with different choices of heuristics


69 - Fundamental Machine Learning Models

Decision Tree

Mini-Summary

Fundamental Machine Learning Models

Linear Regression

analytic solution by pseudo inverse

Logistic Regression

minimize cross-entropy error with gradient descent

Nonlinear Transform

the secret ‘force’ to enrich your model

Decision Tree

human-like explainable model learned recursively


70 - Hazard of Overfitting

Roadmap

Hazard of Overfitting

Overfitting

Data Manipulation and Regularization

Validation

Principles of Learning


71 - Hazard of Overfitting

Overfitting

Hazard of Overfitting ::

Overfitting


72 - Hazard of Overfitting

Overfitting

Theoretical Foundation of Statistical Learning

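if training and testing from same distribution, with a high probability,

$\underbrace{E_{\text{out}}(g)}_{\text{test error}} \le \underbrace{E_{\text{in}}(g)}_{\text{training error}} + \underbrace{\sqrt{\frac{8}{N} \ln\left(\frac{4 (2N)^{d_{\text{VC}}(\mathcal{H})}}{\delta}\right)}}_{\Omega:\ \text{price of using } \mathcal{H}}$

• $d_{\text{VC}}(\mathcal{H})$: complexity of $\mathcal{H}$, ≈ # of parameters to describe $\mathcal{H}$

• $d_{\text{VC}}$ ↑: $E_{\text{in}}$ ↓ but Ω ↑

• $d_{\text{VC}}$ ↓: Ω ↓ but $E_{\text{in}}$ ↑

• best $d^*_{\text{VC}}$ in the middle

[figure: Error versus VC dimension, dvc, with curves for in-sample error, model complexity, and out-of-sample error; d∗vc is marked on the horizontal axis]

powerful H not always good!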
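The penalty term Ω can be evaluated directly to see how it reacts to N and to the VC dimension. In the sketch below the logarithm is expanded as ln 4 + dVC · ln(2N) − ln δ to avoid huge intermediate powers; the default δ = 0.1 is an assumed confidence level, not a value from the slides.

```python
import math

def vc_penalty(N, d_vc, delta=0.1):
    """Omega = sqrt((8 / N) * ln(4 * (2N)^d_vc / delta)), logarithm expanded."""
    log_term = math.log(4.0) + d_vc * math.log(2.0 * N) - math.log(delta)
    return math.sqrt(8.0 / N * log_term)
```

For a fixed N the penalty grows with `d_vc`, and for a fixed `d_vc` it shrinks as N grows, which is the trade-off the bullets describe.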

73 - Hazard of Overfitting

Overfitting

Bad Generalization

• regression for $x \in \mathbb{R}$ with N = 5 examples

• target f(x) = 2nd order polynomial

• label $y_n = f(x_n)$ + very small noise

• linear regression in Z-space + Φ = 4th order polynomial

• unique solution passing all examples ⟹ $E_{\text{in}}(g) = 0$

• $E_{\text{out}}(g)$ huge

[figure: Data, Target, and Fit plotted as y versus x]

bad generalization: low Ein, high Eout
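A tiny experiment along these lines can be run without any library. The sketch below makes several assumptions that are not in the slides: the target is f(x) = x², the five inputs and the noise values are hand-picked, the unique 4th order polynomial through the points is built in Lagrange form rather than by linear regression in Z-space (both give the same polynomial), and Eout is only estimated on a grid over [−2, 2].

```python
def interpolate(xs, ys):
    """Unique polynomial of degree len(xs) - 1 through all points (Lagrange form)."""
    def g(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return g

def target(x):
    """Assumed 2nd order target."""
    return x * x

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
noise = [0.01, -0.01, 0.01, -0.01, 0.01]          # hypothetical 'very small noise'
ys = [target(x) + e for x, e in zip(xs, noise)]

g = interpolate(xs, ys)
E_in = sum((g(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

grid = [-2.0 + 0.01 * k for k in range(401)]       # assumed test region
E_out_estimate = sum((g(x) - target(x)) ** 2 for x in grid) / len(grid)
```

Printing `E_in` and `E_out_estimate` side by side shows the gap between fitting every example and matching the target away from the examples.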

74 - Hazard of Overfitting

Overfitting

Bad Generalization and Overfitting

• take $d_{\text{VC}} = 1126$ for learning: bad generalization, $(E_{\text{out}} - E_{\text{in}})$ large

• switch from $d_{\text{VC}} = d^*_{\text{VC}}$ to $d_{\text{VC}} = 1126$: overfitting, $E_{\text{in}}$ ↓, $E_{\text{out}}$ ↑

• switch from $d_{\text{VC}} = d^*_{\text{VC}}$ to $d_{\text{VC}} = 1$: underfitting, $E_{\text{in}}$ ↑, $E_{\text{out}}$ ↑

[figure: Error versus VC dimension, dvc, with curves for in-sample error, model complexity, and out-of-sample error; d∗vc is marked on the horizontal axis]

bad generalization: low Ein, high Eout; overfitting: lower Ein, higher Eout

75 - Hazard of Overfitting

Overfitting

Cause of Overfitting: A Driving Analogy

[figure: Data, Target, and Fit plotted as y versus x: ‘good fit’ ⟹ overfit]

learning | driving
overfit | commit a car accident
use excessive dVC | ‘drive too fast’
noise | bumpy road
limited data size N | limited observations about road condition

let’s ‘visualize’ overfitting

76 - Hazard of Overfitting

Overfitting

Impact of Noise and Data Size

impact of $\sigma^2$ versus N: stochastic noise

[figure: overfit level plotted over Number of Data Points, N (horizontal axis) and Noise Level, σ² (vertical axis)]

reasons of serious overfitting:

• data size N ↓: overfit ↑

• stochastic noise ↑: overfit ↑

overfitting ‘easily’ happens (more on ML Foundations, Lecture 13)

77 - Hazard of Overfitting

Overfitting

Linear Model First

[figure: Error versus VC dimension, dvc, with curves for in-sample error, model complexity, and out-of-sample error; d∗vc is marked on the horizontal axis]

• tempting sin: use H1126, low Ein(g1126) to fool your boss

—really? :-( a dangerous path of no return

• safe route: H1 first

• if Ein(g1) good enough, live happily thereafter :-)

• otherwise, move right of the curve

with nothing lost except ‘wasted’ computation

linear model first:

simple, efficient, safe, and workable!


78 - Hazard of Overfitting

Overfitting

Driving Analogy Revisited

learning | driving
overfit | commit a car accident
use excessive dVC | ‘drive too fast’
noise | bumpy road
limited data size N | limited observations about road condition
start from simple model | drive slowly
data cleaning/pruning | use more accurate road information
data hinting | exploit more road information
regularization | put the brakes
validation | monitor the dashboard

all very practical techniques to combat overfitting

79 - Hazard of Overfitting

Data Manipulation and Regularization

Hazard of Overfitting ::

Data Manipulation and Regularization


80 - Hazard of Overfitting

Data Manipulation and Regularization

Data Cleaning/Pruning

• if ‘detect’ the outlier 5 at the top by

• too close to other ◦, or too far from other ×

• wrong by current classifier

• . . .

• possible action 1: correct the label (data cleaning)

• possible action 2: remove the example (data pruning)

possibly helps, but effect varies


81 - Hazard of Overfitting

Data Manipulation and Regularization

Data Hinting

• slightly shifted/rotated digits carry the same meaning

• possible action: add virtual examples by shifting/rotating the

given digits (data hinting)

possibly helps, but watch out

not to steal the thunder


82 - Hazard of Overfitting

Data Manipulation and Regularization

Regularization: The Magic

[figure: Data, Target, and Fit plotted as y versus x: ‘regularized fit’ ⟸ overfit]

• idea: ‘step back’ from 10-th order polynomials to 2-nd order ones

$\mathcal{H}_0 \subset \mathcal{H}_1 \subset \mathcal{H}_2 \subset \mathcal{H}_3 \subset \cdots$

• name history: function approximation for ill-posed problems

how to step back?

83 - Hazard of Overfitting

Data Manipulation and Regularization

Step Back by Minimizing the Augmented Error

Augmented Error: $E_{\text{aug}}(\mathbf{w}) = E_{\text{in}}(\mathbf{w}) + \frac{\lambda}{N} \mathbf{w}^T \mathbf{w}$

VC Bound: $E_{\text{out}}(\mathbf{w}) \le E_{\text{in}}(\mathbf{w}) + \Omega(\mathcal{H})$

• regularizer $\mathbf{w}^T \mathbf{w} = \Omega(\mathbf{w})$: complexity of a single hypothesis

• generalization price $\Omega(\mathcal{H})$: complexity of a hypothesis set

• if $\frac{\lambda}{N} \Omega(\mathbf{w})$ ‘represents’ $\Omega(\mathcal{H})$ well, $E_{\text{aug}}$ is a better proxy of $E_{\text{out}}$ than $E_{\text{in}}$

minimizing Eaug: (heuristically) operating with the better proxy; (technically) enjoying flexibility of whole H
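For linear regression the augmented error has a closed-form minimizer, which makes the effect of λ easy to try out. The sketch below assumes squared error for Ein, so that setting the gradient of Eaug to zero gives (ZᵀZ + λI) w = Zᵀy; the solver and the function names are illustrative, not part of the slides.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def regularized_regression(Z, y, lam):
    """Minimize (1/N) sum (w^T z_n - y_n)^2 + (lam/N) w^T w."""
    d = len(Z[0])
    A = [[sum(z[i] * z[j] for z in Z) + (lam if i == j else 0.0) for j in range(d)]
         for i in range(d)]
    b = [sum(z[i] * yn for z, yn in zip(Z, y)) for i in range(d)]
    return solve(A, b)
```

With `lam = 0` this is ordinary least squares; increasing `lam` shrinks wᵀw, which is the ‘step back’ toward simpler hypotheses.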

84 - Hazard of Overfitting

Data Manipulation and Regularization

The Optimal λ

[figure: stochastic noise: Expected Eout versus Regularization Parameter, λ, with curves for σ² = 0, σ² = 0.25, and σ² = 0.5]

• more noise ⟺ more regularization needed (more bumpy road ⟺ putting brakes more)

• noise unknown: important to make proper choices

how to choose? validation!

85 - Hazard of Overfitting

Validation

Hazard of Overfitting ::

Validation


86 - Hazard of Overfitting

Validation

Model Selection Problem

which one do you prefer? :-)

[figure: fits obtained with H1 and with H2]

• given: M models $\mathcal{H}_1, \mathcal{H}_2, \ldots, \mathcal{H}_M$, each with corresponding algorithm $\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_M$

• goal: select $\mathcal{H}_{m^*}$ such that $g_{m^*} = \mathcal{A}_{m^*}(\mathcal{D})$ is of low $E_{\text{out}}(g_{m^*})$

• unknown Eout, as always :-)

• arguably the most important practical problem of ML

how to select? visually? no, can you really visualize $\mathbb{R}^{1126}$? :-)

87 - Hazard of Overfitting

Validation

Validation Set Dval

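$\underbrace{\mathcal{D}}_{\text{size } N} \rightarrow \underbrace{\mathcal{D}_{\text{train}}}_{\text{size } N-K} \cup \underbrace{\mathcal{D}_{\text{val}}}_{\text{size } K}$

• $E_{\text{in}}(h)$ is measured on $\mathcal{D}$; $E_{\text{val}}(h)$ is measured on $\mathcal{D}_{\text{val}}$

• $g_m = \mathcal{A}_m(\mathcal{D})$; $g_m^- = \mathcal{A}_m(\mathcal{D}_{\text{train}})$

• $\mathcal{D}_{\text{val}} \subset \mathcal{D}$: called validation set, an ‘on-hand’ simulation of test set

• to connect Eval with Eout: select K examples from D at random

• to make sure Dval ‘clean’: feed only Dtrain to Am for model selection

$E_{\text{out}}(g_m^-) \le E_{\text{val}}(g_m^-)$ + ‘small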

88 - Hazard of Overfitting

Validation

Model Selection by Best Eval

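$m^* = \underset{1 \le m \le M}{\text{argmin}} \left(E_m = E_{\text{val}}(\mathcal{A}_m(\mathcal{D}_{\text{train}}))\right)$

• generalization guarantee for all m: $E_{\text{out}}(g_m^-) \le E_{\text{val}}(g_m^-)$ + ‘small

• heuristic gain from N − K to N: $E_{\text{out}}\big(\underbrace{g_{m^*}}_{\mathcal{A}_{m^*}(\mathcal{D})}\big) \le E_{\text{out}}\big(\underbrace{g_{m^*}^-}_{\mathcal{A}_{m^*}(\mathcal{D}_{\text{train}})}\big)$

[figure: H1, H2, ..., HM are trained on Dtrain to give g1, g2, ..., gM; these are evaluated on Dval to give E1, E2, ..., EM; pick the best (Hm∗, Em∗); then train on all of D to give gm∗]

$E_{\text{out}}(g_{m^*}) \le E_{\text{out}}(g_{m^*}^-) \le E_{\text{val}}(g_{m^*}^-)$ + ‘small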

89 - Hazard of Overfitting

Validation

V -fold Cross Validation

making validation more stable

• V-fold cross-validation: random-partition of D to V equal parts, take V − 1 for training and 1 for validation orderly

[figure: D split into D1, D2, ..., D10; one part is used to validate and the others to train]

$E_{\text{cv}}(\mathcal{H}, \mathcal{A}) = \frac{1}{V} \sum_{v=1}^{V} E_{\text{val}}^{(v)}(g_v^-)$

• selection by Ecv: $m^* = \underset{1 \le m \le M}{\text{argmin}} \left(E_m = E_{\text{cv}}(\mathcal{H}_m, \mathcal{A}_m)\right)$

practical rule of thumb: V = 10
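A generic V-fold routine is short enough to sketch. Everything named here is illustrative: a ‘learner’ is assumed to be a function that takes training data and returns a hypothesis, the two example learners (`learn_mean`, `learn_slope`) are hypothetical stand-ins for (Hm, Am), and the winning model is re-trained on all of D as in the single-validation slide.

```python
import random

def cross_validation_error(X, y, learn, err, V=10, seed=0):
    """E_cv: average validation error over V folds of a random partition."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[v::V] for v in range(V)]
    total = 0.0
    for v in range(V):
        train = [i for u in range(V) if u != v for i in folds[u]]
        g = learn([X[i] for i in train], [y[i] for i in train])
        total += sum(err(g(X[i]), y[i]) for i in folds[v]) / len(folds[v])
    return total / V

def select_model(X, y, learners, err, V=10):
    """Pick m* by E_cv, then re-train the chosen learner on the full data."""
    scores = [cross_validation_error(X, y, learn, err, V) for learn in learners]
    m_star = min(range(len(learners)), key=lambda m: scores[m])
    return m_star, learners[m_star](X, y)

def squared_error(prediction, label):
    return (prediction - label) ** 2

def learn_mean(X, y):
    """Hypothetical model 0: predict the average label."""
    mean = sum(y) / len(y)
    return lambda x: mean

def learn_slope(X, y):
    """Hypothetical model 1: one-variable regression through the origin."""
    w = sum(a * b for a, b in zip(X, y)) / sum(a * a for a in X)
    return lambda x: w * x
```

Usage: `select_model(X, y, [learn_mean, learn_slope], squared_error)` returns the index of the selected model and the final hypothesis.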

90 - Hazard of Overfitting

Validation

Final Words on Validation

‘Selecting’ Validation Tool

• V -Fold generally preferred over single validation if computation

allows

• 5-Fold or 10-Fold generally works well

Nature of Validation

• all training models: select among hypotheses

• all validation schemes: select among finalists

• all testing methods: just evaluate

validation still more optimistic than testing

do not fool yourself and others :-),

report test result, not best validation result


91 - Hazard of Overfitting

Principles of Learning

Hazard of Overfitting ::

Principles of Learning


92 - Hazard of Overfitting

Principles of Learning

Occam’s Razor for Learning

The simplest model that fits the data is also the most

plausible.

which one do you prefer? :-)

My KISS Principle: Keep It Simple, Safe (with ‘Stupid’ crossed out)

93 - Hazard of Overfitting

Principles of Learning

Sampling Bias

If the data is sampled in a biased way, learning will produce a similarly biased outcome.

philosophical explanation:

study Math hard but test English: no strong test guarantee

A True Personal Story

• Netflix competition for movie recommender system:

10% improvement = 1M US dollars

• on my own validation data, first shot, showed 13% improvement

• why am I still teaching in NTU? :-)

validation: random examples within data;

test: “last” user records “after” data

practical rule of thumb: match test scenario

as much as possible


94 - Hazard of Overfitting

Principles of Learning

Visual Data Snooping

If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.

Visualize $\mathcal{X} = \mathbb{R}^2$

[figure: the data set plotted with both axes from −1 to 1]

• full $\Phi_2$: $\mathbf{z} = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2)$, $d_{\text{VC}} = 6$

• or $\mathbf{z} = (1, x_1^2, x_2^2)$, $d_{\text{VC}} = 3$, after visualizing?

• or better $\mathbf{z} = (1, x_1^2 + x_2^2)$, $d_{\text{VC}} = 2$?

• or even better $\mathbf{z} = \text{sign}(0.6 - x_1^2 - x_2^2)$?

careful about your brain’s ‘model complexity’

if you torture the data long enough, it will confess :-)

95 - Hazard of Overfitting

Principles of Learning

Dealing with Data Snooping

• truth—very hard to avoid, unless being extremely honest

• extremely honest: lock your test data in safe

• less honest: reserve validation and use cautiously

• be blind: avoid making modeling decision by data

• be suspicious: interpret research results (including your own) by

proper feeling of contamination

one secret to winning KDDCups:

careful balance between

data-driven modeling (snooping) and

validation (no-snooping)


96 - Hazard of Overfitting

Principles of Learning

Mini-Summary

Hazard of Overfitting

Overfitting

the ‘accident’ that is everywhere in learning

Data Manipulation and Regularization

clean data, synthetic data, or augmented error

Validation

honestly simulate testing procedure for proper model selection

Principles of Learning

simple model, matching test scenario, and no snooping


97 - Modern Machine Learning Models

Support Vector Machine

Modern Machine Learning Models ::

Support Vector Machine


99 - Modern Machine Learning Models

Support Vector Machine

Motivation: Large-Margin Separating Hyperplane

$\max_{\mathbf{w}} \ \text{fatness}(\mathbf{w})$

subject to $\mathbf{w}$ classifies every $(\mathbf{x}_n, y_n)$ correctly

$\text{fatness}(\mathbf{w}) = \min_{n=1,\ldots,N} \text{distance}(\mathbf{x}_n, \mathbf{w})$

• fatness: formally called margin

• correctness: $y_n = \text{sign}(\mathbf{w}^T \mathbf{x}_n)$

initial goal: find largest-margin separating hyperplane

100 - Modern Machine Learning Models

Support Vector Machine

Motivation: Large-Margin Separating Hyperplane

$\max_{\mathbf{w}} \ \text{margin}(\mathbf{w})$

subject to every $y_n \mathbf{w}^T \mathbf{x}_n > 0$

$\text{margin}(\mathbf{w}) = \min_{n=1,\ldots,N} \text{distance}(\mathbf{x}_n, \mathbf{w})$

• fatness: formally called margin

• correctness: $y_n = \text{sign}(\mathbf{w}^T \mathbf{x}_n)$

initial goal: find largest-margin separating hyperplane

100 - Modern Machine Learning Models

Support Vector Machine

Soft-Margin Support Vector Machine

initial goal: find largest-margin separating hyperplane

• soft-margin (practical) SVM: not insisting on separating:

• minimize large-margin regularizer + C· separation error,

• just like regularization with augmented error

$\min \ E_{\text{aug}}(\mathbf{w}) = E_{\text{in}}(\mathbf{w}) + \frac{\lambda}{N} \mathbf{w}^T \mathbf{w}$

• two forms:

• finding hyperplane in original space (linear first!!)

LIBLINEAR www.csie.ntu.edu.tw/~cjlin/liblinear

• or in mysterious transformed space hidden in ‘kernels’

LIBSVM www.csie.ntu.edu.tw/~cjlin/libsvm

linear: ‘best’ linear classification model;

non-linear: ‘leading’ non-linear classification model for mid-sized data


101 - Modern Machine Learning Models

Support Vector Machine

Hypothesis of Gaussian SVM

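Gaussian kernel $K(\mathbf{x}, \mathbf{x}') = \exp\left(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2\right)$

$g_{\text{SVM}}(\mathbf{x}) = \text{sign}\left(\sum_{\text{SV}} \alpha_n y_n K(\mathbf{x}_n, \mathbf{x}) + b\right) = \text{sign}\left(\sum_{\text{SV}} \alpha_n y_n \exp\left(-\gamma \|\mathbf{x} - \mathbf{x}_n\|^2\right) + b\right)$

• linear combination of Gaussians centered at SVs $\mathbf{x}_n$

• also called Radial Basis Function (RBF) kernel

Gaussian SVM: find $\alpha_n$ to combine Gaussians centered at $\mathbf{x}_n$ & achieve large margin in infinite-dim. space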
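Once the support vectors, their αn, and b are known, evaluating the Gaussian SVM hypothesis is a direct transcription of the formula. The sketch below covers only the prediction side; it does not solve the SVM optimization problem, and the representation of support vectors as `(alpha_n, y_n, x_n)` triples is an assumption made for illustration.

```python
import math

def gaussian_kernel(x, x_prime, gamma):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, x_prime)))

def g_svm(x, support_vectors, gamma, b):
    """sign(sum over SVs of alpha_n * y_n * K(x_n, x) + b).

    support_vectors: list of (alpha_n, y_n, x_n) triples (assumed layout).
    """
    s = sum(alpha * yn * gaussian_kernel(xn, x, gamma)
            for alpha, yn, xn in support_vectors) + b
    return 1 if s > 0 else -1
```

In practice the triples and `b` would come from a solver such as LIBSVM; here they must be supplied by the caller.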

102 - Modern Machine Learning Models

Support Vector Machine

Support Vector Mechanism

large-margin hyperplanes + higher-order transforms with kernel trick + noise tolerance of soft-margin

• #: not many

• boundary: sophisticated

• transformed vector $\mathbf{z} = \Phi(\mathbf{x})$ ⟹ efficient kernel $K(\mathbf{x}, \mathbf{x}')$

• store optimal $\mathbf{w}$ ⟹ store a few SVs and $\alpha_n$

new possibility by Gaussian SVM: infinite-dimensional linear classification, with generalization ‘guarded by’ large-margin :-)

103 - Modern Machine Learning Models

Support Vector Machine

Practical Need: Model Selection

• large γ ⟹ sharp Gaussians ⟹ ‘overfit’?

• complicated even for (C, γ) of Gaussian SVM

• more combinations if including other kernels or parameters

how to select? validation :-)

104 - Modern Machine Learning Models

Support Vector Machine

Step-by-step Use of SVM

strongly recommended: ‘A Practical Guide to Support Vector Classification’

http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

1. scale each feature of your data to a suitable range (say, [−1, 1])

2. use a Gaussian RBF kernel

3. use cross validation and grid search to choose good (γ, C)

4. use the best (γ, C) on your data

5. do testing with the learned SVM classifier

all included in easy.py of LIBSVM

105 - Modern Machine Learning Models

Random Forest

Modern Machine Learning Models ::

Random Forest


106 - Modern Machine Learning Models

Random Forest

Random Forest (RF)

random forest (RF) = bagging (random sampling) + fully-grown C&RT decision tree

function RandomForest($\mathcal{D}$)
  For t = 1, 2, . . . , T
    1. request size-$N'$ data $\tilde{\mathcal{D}}_t$ by bootstrapping with $\mathcal{D}$
    2. obtain tree $g_t$ by DTree($\tilde{\mathcal{D}}_t$)
  return $G = \text{Uniform}(\{g_t\})$

function DTree($\mathcal{D}$)
  if termination return base $g_t$
  else
    1. learn $b(\mathbf{x})$ and split $\mathcal{D}$ to $\mathcal{D}_c$ by $b(\mathbf{x})$
    2. build $G_c \leftarrow$ DTree($\mathcal{D}_c$)
    3. return $G(\mathbf{x}) = \sum_{c=1}^{C} [\![b(\mathbf{x}) = c]\!]\, G_c(\mathbf{x})$

• highly parallel/efficient to learn

• inherit pros of C&RT

• eliminate cons of fully-grown tree
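The bagging half of the recipe can be sketched independently of the tree learner. To keep the example short, a decision stump (`stump`, a hypothetical one-split stand-in) is swapped in for the fully-grown C&RT tree, and each bootstrap sample has the same size as D; any function that maps training data to a hypothesis can be passed as `base_learner` instead.

```python
import random
from collections import Counter

def bootstrap(X, y, rng):
    """Sample len(X) examples from (X, y) with replacement."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]

def bagging(X, y, base_learner, T=25, seed=0):
    """Train T hypotheses on bootstrap samples and combine them by uniform voting."""
    rng = random.Random(seed)
    trees = [base_learner(*bootstrap(X, y, rng)) for _ in range(T)]
    def G(x):
        return Counter(g(x) for g in trees).most_common(1)[0][0]
    return G

def stump(X, y):
    """Stand-in base learner for +1/-1 labels: best single-threshold rule."""
    best = None
    for i in range(len(X[0])):
        values = sorted(set(x[i] for x in X))
        # values[0] - 1.0 yields a constant rule; midpoints yield real splits
        thetas = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
        for theta in thetas:
            for s in (1, -1):
                errors = sum(1 for x, yn in zip(X, y) if (s if x[i] > theta else -s) != yn)
                if best is None or errors < best[0]:
                    best = (errors, i, theta, s)
    _, i, theta, s = best
    return lambda x: s if x[i] > theta else -s
```

Because each of the T hypotheses is trained independently, the loop parallelizes trivially, which is the ‘highly parallel’ point made above.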

107 - Modern Machine Learning Models

Random Forest

Feature Selection

for $\mathbf{x} = (x_1, x_2, \ldots, x_d)$, want to remove

• redundant features: like keeping one of ‘age’ and ‘full birthday’

• irrelevant features: like insurance type for cancer prediction

and only ‘learn’ subset-transform $\Phi(\mathbf{x}) = (x_{i_1}, x_{i_2}, \ldots, x_{i_{d'}})$ with $d' < d$ for $g(\Phi(\mathbf{x}))$

advantages:

• efficiency: simpler hypothesis and shorter prediction time

• generalization: ‘feature noise’ removed

• interpretability

disadvantages:

• computation: ‘combinatorial’ optimization in training

• overfit: ‘combinatorial’ selection

• mis-interpretability

decision tree: a rare model with built-in feature selection

108 - Modern Machine Learning Models

Random Forest

Feature Selection by Importance

idea: if possible to calculate importance(i) for i = 1, 2, . . . , d

then can select $i_1, i_2, \ldots, i_{d'}$ of top-$d'$ importance

importance by linear model

$\text{score} = \mathbf{w}^T \mathbf{x} = \sum_{i=1}^{d} w_i x_i$

• intuitive estimate: importance(i) = $|w_i|$ with some ‘good’ $\mathbf{w}$

• getting ‘good’ $\mathbf{w}$: learned from data

• non-linear models? often much harder

but ‘easy’ feature selection in RF

109 - Modern Machine Learning Models

Random Forest

Feature Importance by Permutation Test

idea: random test: if feature i is needed, ‘random’ values of $x_{n,i}$ degrade performance

permutation test:

$\text{importance}(i) = \text{performance}(\mathcal{D}) - \text{performance}(\mathcal{D}^{(p)})$

where $\mathcal{D}^{(p)}$ is $\mathcal{D}$ with $\{x_{n,i}\}$ replaced by permuted $\{x_{n,i}\}_{n=1}^{N}$

permutation test: a general statistical tool that can be easily coupled with RF
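The permutation test is easy to express for any fixed hypothesis. The sketch below makes a simplifying assumption: `performance` is evaluated by applying an already-trained hypothesis `g` to D and to the permuted copy, rather than by re-training or by the out-of-bag procedure a random forest would normally use. Inputs are assumed to be tuples, and all names are illustrative.

```python
import random

def accuracy(g, X, y):
    """Fraction of examples that g labels correctly (one possible performance measure)."""
    return sum(1 for x, yn in zip(X, y) if g(x) == yn) / len(X)

def permutation_importance(X, y, g, performance, i, seed=0):
    """importance(i) = performance(D) - performance(D with feature i permuted)."""
    rng = random.Random(seed)
    column = [x[i] for x in X]
    rng.shuffle(column)                       # permute feature i across examples
    X_p = [x[:i] + (v,) + x[i + 1:] for x, v in zip(X, column)]
    return performance(g, X, y) - performance(g, X_p, y)
```

A feature the hypothesis never looks at gets importance exactly 0, since permuting it cannot change any prediction.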

110 - Modern Machine Learning Models

Random Forest

A Complicated Data Set

[figure: $g_t$ ($N' = N/2$) and $G$ with first $t$ trees]

‘easy yet robust’ nonlinear model

111 - Modern Machine Learning Models

Adaptive (or Gradient) Boosting

Modern Machine Learning Models ::

Adaptive (or Gradient) Boosting


112 - Modern Machine Learning Models

Adaptive (or Gradient) Boosting

Apple Recognition Problem

• is this a picture of an apple?

• say, want to teach a class of 6 year olds

• gather photos under CC-BY-2.0 license on Flickr (thanks to the authors below!)

(APAL stands for Apple and Pear Australia Ltd)

• Dan Foy: https://flic.kr/p/jNQ55
• APAL: https://flic.kr/p/jzP1VB
• adrianbartel: https://flic.kr/p/bdy2hZ
• ANdrzej cH.: https://flic.kr/p/51DKA8
• Stuart Webster: https://flic.kr/p/9C3Ybd
• nachans: https://flic.kr/p/9XD7Ag
• APAL: https://flic.kr/p/jzRe4u
• Jo Jakeman: https://flic.kr/p/7jwtGp
• APAL: https://flic.kr/p/jzPYNr
• APAL: https://flic.kr/p/jzScif

113 - Modern Machine Learning Models

Adaptive (or Gradient) Boosting

Apple Recognition Problem

• is this a picture of an apple?

• say, want to teach a class of 6 year olds

• gather photos under CC-BY-2.0 license on Flickr (thanks to the authors below!)

• Mr. Roboto.: https://flic.kr/p/i5BN85
• Richard North: https://flic.kr/p/bHhPkB
• Richard North: https://flic.kr/p/d8tGou
• Emilian Robert Vicol: https://flic.kr/p/bpmGXW
• Nathaniel McQueen: https://flic.kr/p/pZv1Mf
• Crystal: https://flic.kr/p/kaPYp
• jfh686: https://flic.kr/p/6vjRFH
• skyseeker: https://flic.kr/p/2MynV
• Janet Hudson: https://flic.kr/p/7QDBbm
• Rennett Stowe: https://flic.kr/p/agmnrk

113 - Modern Machine Learning Models

Adaptive (or Gradient) Boosting

Our Fruit Class Begins

• Teacher: Please look at the pictures of apples and non-apples

below. Based on those pictures, how would you describe an

apple? Michael?

• Michael: I think apples are circular.

(Class): Apples are circular.


114 - Modern Machine Learning Models

Adaptive (or Gradient) Boosting

Our Fruit Class Continues

• Teacher: Being circular is a good feature for the apples. However,

if you only say circular, you could make several mistakes. What

else can we say for an apple? Tina?

• Tina: It looks like apples are red.

(Class): Apples are somewhat circular and

somewhat red.


115 - Modern Machine Learning Models

Adaptive (or Gradient) Boosting

Our Fruit Class Continues More

• Teacher: Yes. Many apples are red. However, you could still make

mistakes based on circular and red. Do you have any other

suggestions, Joey?

• Joey: Apples could also be green.

(Class): Apples are somewhat circular and

somewhat red and possibly green.


116 - Modern Machine Learning Models

Adaptive (or Gradient) Boosting

Our Fruit Class Ends

• Teacher: Yes. It seems that apples might be circular, red, green.

But you may confuse them with tomatoes or peaches, right? Any

more suggestions, Jessica?

• Jessica: Apples have stems at the top.

(Class): Apples are somewhat circular, somewhat red, possibly green,

and may have stems at the top.


117 - Modern Machine Learning Models

Adaptive (or Gradient) Boosting

Motivation

• students: simple hypotheses gt (like vertical/horizontal lines)

• (Class): sophisticated hypothesis G (like black curve)

• Teacher: a tactic learning algorithm that directs the students to

focus on key examples

next: demo of such an algorithm


118 - Modern Machine Learning Models

Adaptive (or Gradient) Boosting

A Simple Data Set

‘Teacher’-like algorithm works!

119 - Modern Machine Learning Models

Adaptive (or Gradient) Boosting

Putting Everything Together

Gradient Boosted Decision Tree (GBDT)

s_1 = s_2 = . . . = s_N = 0

for t = 1, 2, . . . , T

  1. obtain g_t by A({(x_n, y_n − s_n)}), where A is a (squared-error) regression algorithm (such as ‘weak’ C&RT?)

  2. compute α_t = OneVarLinearRegression({(g_t(x_n), y_n − s_n)})

  3. update s_n ← s_n + α_t g_t(x_n)

return G(x) = Σ_{t=1}^{T} α_t g_t(x)

GBDT: ‘regression sibling’ of AdaBoost +

decision tree

—very popular in practice
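A minimal sketch of the GBDT loop above, under stated assumptions: the regression algorithm A is replaced by a one-split regression stump (a stand-in for a ‘weak’ C&RT, not a full tree), and OneVarLinearRegression is taken to be least squares through the origin. The function names are illustrative and not from the slides.

```python
def fit_regression_stump(X, r):
    """Stand-in for A: one split on one feature, predicting the mean residual per side."""
    mean_all = sum(r) / len(r)
    best = (sum((v - mean_all) ** 2 for v in r), None, None, mean_all, mean_all)
    for dim in range(len(X[0])):
        values = sorted(set(x[dim] for x in X))
        for a, b in zip(values, values[1:]):
            theta = (a + b) / 2
            left = [v for x, v in zip(X, r) if x[dim] <= theta]
            right = [v for x, v in zip(X, r) if x[dim] > theta]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if sse < best[0]:
                best = (sse, dim, theta, ml, mr)
    _, dim, theta, ml, mr = best
    if dim is None:                        # no split possible: constant prediction
        return lambda x: mean_all
    return lambda x: ml if x[dim] <= theta else mr

def one_var_linear_regression(g_vals, residuals):
    """alpha minimizing sum_n (residual_n - alpha * g_n)^2 (no intercept; an assumption)."""
    denom = sum(g * g for g in g_vals)
    return sum(g * r for g, r in zip(g_vals, residuals)) / denom if denom > 0 else 0.0

def gbdt(X, y, T):
    s = [0.0] * len(X)                     # s_1 = ... = s_N = 0
    model = []                             # list of (alpha_t, g_t)
    for _ in range(T):
        residuals = [yn - sn for yn, sn in zip(y, s)]
        g = fit_regression_stump(X, residuals)               # step 1
        g_vals = [g(x) for x in X]
        alpha = one_var_linear_regression(g_vals, residuals)  # step 2
        s = [sn + alpha * gv for sn, gv in zip(s, g_vals)]    # step 3
        model.append((alpha, g))
    return lambda x: sum(alpha * g(x) for alpha, g in model)  # G(x)

def mean_squared_error(G, X, y):
    return sum((G(x) - yn) ** 2 for x, yn in zip(X, y)) / len(X)
```

Each round fits the current residuals y_n − s_n, so the training squared error cannot increase as T grows; choosing T is left to validation in practice.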

120 - Modern Machine Learning Models

Deep Learning

Modern Machine Learning Models ::

Deep Learning

121 - Modern Machine Learning Models

Deep Learning

Physical Interpretation of Neural Network

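[Figure: feed-forward neural network with inputs x_0 = 1, x_1, x_2, . . . , x_d, constant +1 units, tanh neurons, and layer weights w^(1)_ij, w^(2)_jk, w^(3)_kq; one neuron is labeled with score s^(2)_3 and tanh output x^(2)_3]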

• each layer: pattern feature extracted from data, remember? :-)

• how many neurons? how many layers?

—more generally, what structure?

• subjectively, your design!

• objectively, validation, maybe?

structural decisions:

key issue for applying NNet
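A minimal sketch of the forward pass through such a network, to make the ‘each layer extracts features’ view concrete. It assumes tanh at every layer (as the figure suggests) and a weight layout `W[i][j]` in which row 0 belongs to the constant +1 unit; both the layout and the name `forward` are assumptions for illustration.

```python
import math

def forward(x, weights):
    """Return the outputs of every layer for input x (x excludes the constant x_0 = 1).

    weights[l][i][j] is the weight from unit i of layer l (i = 0 is the +1 unit)
    to unit j of layer l + 1.
    """
    layer = list(x)
    outputs = [layer]
    for W in weights:
        with_bias = [1.0] + layer                      # prepend the constant unit
        scores = [sum(with_bias[i] * W[i][j] for i in range(len(with_bias)))
                  for j in range(len(W[0]))]           # s_j = sum_i w_ij x_i
        layer = [math.tanh(s) for s in scores]         # x_j = tanh(s_j)
        outputs.append(layer)
    return outputs
```

The structural decisions on the slide correspond to the shape of `weights`: its length is the number of layers, and the width of each matrix is the number of neurons in that layer.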

122 - Modern Machine Learning Models

Deep Learning

Shallow versus Deep Neural Networks

shallow: few (hidden) layers; deep: many layers

Shallow NNet

• more efficient to train (✓)

• simpler structural decisions (✓)

• theoretically powerful enough (✓)

Deep NNet

• challenging to train (×)

• sophisticated structural decisions (×)

• ‘arbitrarily’ powerful (✓)

• more ‘meaningful’? (see next slide)

deep NNet (deep learning)

gaining attention in recent years

123 - Modern Machine Learning Models

Deep Learning

Meaningfulness of Deep Learning

[Figure: features φ_1, φ_2, φ_3, φ_4, φ_5, φ_6 connected by positive and negative weights to the outputs z_1 (is it a ‘1’?) and z_5 (is it a ‘5’?)]

• ‘less burden’ for each layer: simple to complex features

• natural for difficult learning task with raw features, like vision

deep NNet: currently popular in

vision/speech/. . .

124 - Modern Machine Learning Models

Deep Learning

Challenges and Key Techniques for Deep Learning

• difficult structural decisions:
    • subjective with domain knowledge: like convolutional NNet for images

• high model complexity:
    • no big worries if big enough data
    • regularization towards noise tolerance: like
        • dropout (tolerant when network corrupted)
        • denoising (tolerant when input corrupted)

• hard optimization problem:
    • careful initialization to avoid bad local minima: called pre-training

• huge computational complexity (worsens with big data):
    • novel hardware/architecture: like mini-batch with GPU

IMHO, careful regularization and

initialization are key techniques
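A minimal sketch of the two noise-tolerance ideas named above: corrupting the network (dropout) and corrupting the input (denoising). The function names are hypothetical, and the rescaling of surviving units by 1/(1 − p) is one common variant, assumed here and not taken from the slides.

```python
import random

def dropout(activations, drop_prob, rng):
    """Corrupt the network: zero each unit with probability drop_prob during training.

    Survivors are rescaled by 1 / (1 - drop_prob) so the expected value is unchanged
    (an assumed variant, often called inverted dropout).
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep = 1.0 - drop_prob
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def corrupt_input(x, noise_prob, rng):
    """Corrupt the input: zero each feature with probability noise_prob.

    For denoising, the network is trained to map this corrupted copy
    back to the clean x.
    """
    return [0.0 if rng.random() < noise_prob else v for v in x]

rng = random.Random(0)                      # fixed seed only for repeatability
hidden = [0.2, -0.7, 0.9, 0.1]
noisy_hidden = dropout(hidden, 0.5, rng)
noisy_input = corrupt_input([1.0, 2.0, 3.0], 0.3, rng)
```

Both corruptions are applied only while training; at prediction time the clean network and clean input are used.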

125 - Modern Machine Learning Models

Deep Learning

A Two-Step Deep Learning Framework

Simple Deep Learning

1. for ℓ = 1, . . . , L, pre-train w^(ℓ)_ij assuming w^(1)_∗, . . . , w^(ℓ−1)_∗ fixed

[Figure: panels (a), (b), (c), (d)]

2. train with backprop on the pre-trained NNet to fine-tune all w^(ℓ)_ij

different deep learning models deal with the

steps somewhat differently
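A sketch of the control flow only, since the slides leave the actual pre-training and fine-tuning procedures to each model. The callables `pretrain_layer` and `fine_tune` are hypothetical placeholders that the caller supplies; the toy versions below just record the order of the steps.

```python
def two_step_deep_learning(num_layers, pretrain_layer, fine_tune):
    """Step 1: pre-train layer by layer; step 2: fine-tune all layers together."""
    weights = []
    for ell in range(1, num_layers + 1):
        # layers 1 .. ell-1 are passed in as already fixed
        weights.append(pretrain_layer(ell, tuple(weights)))
    return fine_tune(weights)               # e.g. backprop starting from these weights

# Toy placeholders (not real training) that only log what happens and when.
log = []

def toy_pretrain(ell, fixed):
    log.append(("pretrain", ell, len(fixed)))
    return f"w({ell})*"

def toy_fine_tune(weights):
    log.append(("backprop", len(weights)))
    return [w.rstrip("*") for w in weights]

result = two_step_deep_learning(3, toy_pretrain, toy_fine_tune)
```

Passing the trainers in as arguments mirrors the remark that different deep learning models handle the two steps differently.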

126 - Modern Machine Learning Models

Deep Learning

Mini-Summary

Modern Machine Learning Models

Support Vector Machine

large-margin boundary ranging from linear to non-linear

Random Forest

uniform blending of many many decision trees

Adaptive (or Gradient) Boosting

keep adding simple hypotheses to gang

Deep Learning

neural network with deep architecture and careful design

127