This page reproduces the content of http://www.slideshare.net/ksankar/data-science-folk-knowledge (uploaded 2014/03/30, in Technology).

Data Science Insights, Model Evaluation, ROC Curves et al.

Snippets from my PyCon 2014 tutorial.

- Data Science “folk knowledge” (1 of A)

o "If you torture the data long enough, it will confess to anything." – Hal Varian, Computer

Mediated Transactions

o Learning = Representation + Evaluation + Optimization

o It’s Generalization that counts

• The fundamental goal of machine learning is to generalize beyond the

examples in the training set

o Data alone is not enough

• Induction not deduction - Every learner should embody some knowledge

or assumptions beyond the data it is given in order to generalize beyond

it

o Machine Learning is not magic – one cannot get something from nothing

• In order to infer, one needs the knobs & the dials

• One also needs a rich expressive dataset

A few useful things to know about machine learning - by Pedro Domingos

http://dl.acm.org/citation.cfm?id=2347755

- Data Science “folk knowledge” (2 of A)

o Overfitting has many faces

• Bias – Model not strong enough, so the learner has the tendency to learn the same wrong things

• Variance – Learning too much from one dataset; the model will fall apart (i.e. be much less accurate) on a different dataset

• Sampling Bias

o Intuition Fails in High Dimensions – Bellman

• Blessing of non-uniformity & lower effective dimension; in many applications the examples are not spread uniformly but concentrated near a lower-dimensional manifold, e.g. the space of digits is much smaller than the space of images

o Theoretical Guarantees Are Not What They Seem

• One of the major developments of recent decades has been the realization that we can have guarantees on the results of induction, particularly if we are willing to settle for probabilistic guarantees

o Feature Engineering is the Key

A few useful things to know about machine learning - by Pedro Domingos

http://dl.acm.org/citation.cfm?id=2347755

- Data Science “folk knowledge” (3 of A)

o More Data Beats a Cleverer Algorithm

• Or, conversely, select algorithms that improve with data

• Don't optimize prematurely without getting more data

o Learn Many Models, Not Just One

• Ensembles! – Change the hypothesis space

• Netflix prize

• E.g. Bagging, Boosting, Stacking

o Simplicity Does Not Necessarily Imply Accuracy

o Representable Does Not Imply Learnable

• Just because a function can be represented does not mean it can be learned

o Correlation Does Not Imply Causation

o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/

o A few useful things to know about machine learning - by Pedro Domingos

§ http://dl.acm.org/citation.cfm?id=2347755

- Data Science “folk knowledge” (4 of A)

o The simplest hypothesis that fits the data is also the most plausible

• Occam's Razor

• Don't go for a 4-layer Neural Network unless you have data that complex

• But that doesn't mean that one should always choose the simplest hypothesis

• Match the impedance of the domain, data & the algorithms

o Think of overfitting as memorizing, as opposed to learning

o Data leakage has many forms

o Sometimes the Absence of Something is Everything

o [Corollary] Absence of Evidence is not the Evidence of Absence

[Figure: learning curves]
§ Simple Model – high error line that cannot be compensated with more data points; gets to a lower error rate with fewer data points
§ Complex Model – lower error line, but needs more data points to reach a decent error

New to Machine Learning? Avoid these three mistakes, James Faghmous

Ref: Andrew Ng/Stanford, Yaser S./CalTech; https://medium.com/about-data/73258b3848a4

- Check your assumptions

o The decisions a model makes are directly related to its assumptions about the statistical distribution of the underlying data

o For example, for regression one should check that:

① Variables are normally distributed

• Test for normality via visual inspection, skew & kurtosis, outlier inspection via plots, z-scores et al.

② There is a linear relationship between the dependent & independent variables

• Inspect residual plots, try quadratic relationships, try log plots et al.

③ Variables are measured without error

④ Assumption of Homoscedasticity

§ Homoscedasticity assumes constant or near-constant error variance

§ Check the standard residual plots and look for heteroscedasticity

§ For example, in the figure the left box has the errors scattered randomly around zero, while the right two diagrams have the errors unevenly distributed
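The homoscedasticity check above can also be sketched numerically. This is a minimal illustration, not from the slides: split the residuals at the median of the predictor and compare the error variance in the two halves; `breusch_pagan_like` is a hypothetical helper name, and a ratio far from 1 hints at heteroscedasticity.

```python
import random
import statistics

def breusch_pagan_like(x, residuals):
    """Crude heteroscedasticity check (hypothetical helper): split the
    residuals by the median of x and compare error variance in the two
    halves. Near-constant variance gives a ratio close to 1."""
    med = statistics.median(x)
    low = [r for xi, r in zip(x, residuals) if xi <= med]
    high = [r for xi, r in zip(x, residuals) if xi > med]
    return statistics.pvariance(high) / statistics.pvariance(low)

random.seed(0)
x = [i / 100 for i in range(1, 201)]
homo = [random.gauss(0, 1) for _ in x]              # constant error variance
hetero = [random.gauss(0, 1 + 3 * xi) for xi in x]  # variance grows with x

print(round(breusch_pagan_like(x, homo), 2))    # close to 1
print(round(breusch_pagan_like(x, hetero), 2))  # well above 1
```

In practice one would plot the residuals as the slide says; the ratio is just a quick numeric smell test.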

Jason W. Osborne and Elaine Waters, Four assumptions of multiple regression that researchers should always test, http://pareonline.net/getvn.asp?v=8&n=2

- Data Science “folk knowledge” (5 of A)

Donald Rumsfeld is an armchair Data Scientist!

The Known–Unknown matrix (You vs. The World):

o Known Knowns
• There are things we know that we know
• What we do

o Known Unknowns
• That is to say, there are things that we now know we don't know
• Potential facts, outcomes we are aware of, but not with certainty
• Stochastic processes, probabilities

o Unknown Knowns
• Others know, you don't

o Unknown Unknowns
• There are things we do not know we don't know
• Facts, outcomes or scenarios we have not encountered, nor considered
• "Black swans", outliers, long tails of probability distributions
• Lack of experience, imagination

http://smartorg.com/2013/07/valuepoint19/

- Data Science “folk knowledge” (6 of A) - Pipeline

Collect → Store → Transform → Reason → Model → Deploy

Collect
o Volume, Velocity
o Streaming Data
o Big Data
o Structured vs. Multi-structured
o Access to multiple sources of data
o Think Hybrid – Big Data Apps, Appliances & Infrastructure

Store
o Metadata
o Data catalog
o Data Fabric across the organization
o Engineered, purpose-built appliances (soft/hard)
o Data Management

Transform
o Canonical form
o Flexible & Selectable
§ Data Subsets
§ Attribute sets

Reason / Model
o Refine model with
§ Extended Data subsets
§ Engineered Attribute sets
o Validation run across a larger data set

Deploy
o Scalable Model Deployment
o Monitor counters & Metrics
o Automation
o Manage SLAs & response times

Data Science: Visualize – Recommend – Predict – Explore
Bytes to Business, a.k.a. build the full stack

Explore
o Dynamic Data Sets
o 2-way key-value tagging of datasets
o Extended attribute sets
¤ Find Relevant Data For Business
¤ Connect the Dots

Performance
o Scalability
o Refresh Latency
o In-memory Analytics

Visualize
o Advanced Visualization
o Interactive Dashboards
o Map Overlay
o Infographics
o Advanced Analytics

- Data Science “folk knowledge” (7 of A)

o Three Amigos: Interface, Intelligence, Inference

o Interface = Cognition

o Intelligence = Compute (CPU) & Computational (GPU)

o Inference = Infer Significance & Causality

[Diagram labels: Volume, Velocity, Variety, Context, Connectedness]

"Data of unusual size" – data that can't be brute-forced

- Data Science “folk knowledge” (8 of A)

Jeremy's Axioms

o Iteratively explore data

o Tools
• Excel Format, Perl, Perl Book

o Get your head around data
• Pivot Table

o Don't over-complicate

o If people give you data, don't assume that you need to use all of it

o Look at pictures!

o History of your submissions – keep a tab

o Don't be afraid to submit simple solutions
• We will do this during this workshop

Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/

- Data Science “folk knowledge” (9 of A)

① Common Sense (some features make more sense than others)

② Carefully read the forums to get a peek at other people's mindsets

③ Visualizations

④ Train a classifier (e.g. logistic regression) and look at the feature weights

⑤ Train a decision tree and visualize it

⑥ Cluster the data and look at what clusters you get out

⑦ Just look at the raw data

⑧ Train a simple classifier, see what mistakes it makes

⑨ Write a classifier using handwritten rules

⑩ Pick a fancy method that you want to apply (Deep Learning/NNet)

-- Maarten Bosma
-- http://www.kaggle.com/c/stumbleupon/forums/t/5761/methods-for-getting-a-first-overview-over-the-data

- Data Science “folk knowledge” (A of A)

Lessons from Kaggle Winners

① Don’t over-fit

② All predictors are not needed

• All data rows are not needed, either

③ Tuning the algorithms will give different results

④ Reduce the dataset (Average, select transition data,…)

⑤ Test set & training set can differ

⑥ Iteratively explore & get your head around data

⑦ Don’t be afraid to submit simple solutions

⑧ Keep a tab & a history of your submissions

- The curious case of the Data Scientist

o Data Scientist is multi-faceted & contextual

o Data Scientist should be building Data Products

o Data Scientist should tell a story

"Data Scientist (noun): Person who is better at statistics than any software engineer & better at software engineering than any statistician" – Josh Wills (Cloudera)

"Data Scientist (noun): Person who is worse at statistics than any statistician & worse at software engineering than any software engineer" – Will Cukierski (Kaggle)

"Large is hard; Infinite is much easier!" – Titus Brown

http://doubleclix.wordpress.com/2014/01/25/the-curious-case-of-the-data-scientist-profession/

- Essential Reading List

o A few useful things to know about machine learning - by Pedro Domingos
• http://dl.acm.org/citation.cfm?id=2347755

o The Lack of A Priori Distinctions Between Learning Algorithms, by David H. Wolpert
• http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf

o http://www.no-free-lunch.org/

o Controlling the false discovery rate: a practical and powerful approach to multiple testing - Benjamini, Y. and Hochberg, Y.
• http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf

o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
• http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/

o Avoid these three mistakes, James Faghmous
• https://medium.com/about-data/73258b3848a4

o Leakage in Data Mining: Formulation, Detection, and Avoidance
• http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf

- For your reading & viewing pleasure … An ordered List

① An Introduction to Statistical Learning
• http://www-bcf.usc.edu/~gareth/ISL/

② ISL Class - Stanford/Hastie/Tibshirani at their best - Statistical Learning
• http://online.stanford.edu/course/statistical-learning-winter-2014

③ Prof. Pedro Domingos
• https://class.coursera.org/machlearning-001/lecture/preview

④ Prof. Andrew Ng
• https://class.coursera.org/ml-003/lecture/preview

⑤ Prof. Abu-Mostafa, CaltechX: CS1156x: Learning From Data
• https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120

⑥ Mathematicalmonk @ YouTube
• https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA

⑦ The Elements Of Statistical Learning
• http://statweb.stanford.edu/~tibs/ElemStatLearn/

http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/

- Of Models, Performance, Evaluation & Interpretation

- What does it mean? Let us ponder…

o We have a training data set representing a domain
• We reason over the dataset & develop a model to predict outcomes

o How good is our prediction when it comes to real-life scenarios?

o The assumption is that the dataset is taken at random
• Or is it? Is there a sampling bias?
• i.i.d.? Independent? Identically distributed?
• What about homoscedasticity? Do they have the same finite variance?

o Can we assure that another dataset (from the same domain) will give us the same result?

o Will our model & its parameters remain the same if we get another data set?

o How can we evaluate our model?

o How can we select the right parameters for a selected model?

- Bias/Variance (1 of 2)

o Model Complexity
• A complex model increases the training-data fit
• But then it overfits & doesn't perform as well with real data

o Bias vs. Variance – the classical diagram (Prediction Error vs. Training Error learning curves), from ELSII, by Hastie, Tibshirani & Friedman

o Bias – Model learns the wrong things; not complex enough; small error gap; more data by itself won't help

o Variance – A different dataset will give a different error rate; overfitted model; larger error gap; more data could help

Ref: Andrew Ng/Stanford, Yaser S./CalTech

- Bias/Variance (2 of 2)

o High Bias
• Due to underfitting
• Need more features or a more complex model to improve
• Add more features
• More sophisticated model: quadratic terms, complex equations, …
• Decrease regularization

o High Variance
• Due to overfitting
• Need more data to improve
• Use fewer features
• Use more training samples
• Increase regularization

[Learning curves: Prediction Error vs. Training Error]

Ref: Strata 2013 Tutorial by Olivier Grisel

- Data Partition & Cross-Validation

Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)

Partition Data!
• Training (60%) – Train
• Validation (20%) – Validate
• "Vault" Test (20%) – Test data sets

k-fold Cross-Validation (k = 5)
• Split data into k equal parts
• Fit the model to k-1 parts & calculate the prediction error on the kth part
• Non-overlapping datasets; each round holds out a different fold:
  [#1 #2 #3 #4 | #5], [#1 #2 #3 #5 | #4], [#1 #2 #4 #5 | #3], [#1 #3 #4 #5 | #2], [#2 #3 #4 #5 | #1]
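The k-fold procedure above can be sketched in a few lines of pure Python. This is an illustration only (in practice one would use scikit-learn's `KFold`); `kfold_indices` and `cross_val_mse` are hypothetical helper names, and the "model" is deliberately trivial – it predicts the mean of the training targets.

```python
# Minimal sketch of k-fold cross-validation: split into k non-overlapping
# folds, train on k-1 of them, score on the held-out fold, average.

def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous, non-overlapping folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_mse(y, k=5):
    """Trivial mean-of-training-targets model; mean squared error
    on each held-out fold, averaged over the k folds."""
    folds = kfold_indices(len(y), k)
    errors = []
    for held_out in folds:
        held = set(held_out)
        train_y = [y[i] for i in range(len(y)) if i not in held]
        pred = sum(train_y) / len(train_y)       # "model" = training mean
        mse = sum((y[i] - pred) ** 2 for i in held_out) / len(held_out)
        errors.append(mse)
    return sum(errors) / k

y = [float(v) for v in range(20)]
print(cross_val_mse(y, k=5))  # 51.25
```

Note the contiguous folds make the toy error large here; shuffling before splitting is the usual practice.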

- Bootstrap & Bagging

Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)

Bootstrap
• Draw datasets (with replacement) and fit a model for each dataset
• Remember: Data Partitioning (#1) & Cross-Validation (#2) are without replacement

Bagging (Bootstrap aggregation)
◦ Average the prediction over a collection of bootstrapped samples, thus reducing variance
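A minimal sketch of the bootstrap-then-aggregate idea above, in pure Python (illustrative only; `bootstrap_predict` and `bagged_predict` are hypothetical helper names, and the base "model" is just a median estimate):

```python
import random
import statistics

random.seed(1)

# Toy data: noisy observations centered around 10.
data = [random.gauss(10, 2) for _ in range(200)]

def bootstrap_predict(sample):
    """A simple base estimator fitted to one bootstrap resample."""
    return statistics.median(sample)

def bagged_predict(data, n_models=50):
    """Bagging: fit the base estimator on many resamples drawn WITH
    replacement, then aggregate by averaging to reduce variance."""
    preds = []
    for _ in range(n_models):
        resample = random.choices(data, k=len(data))  # with replacement
        preds.append(bootstrap_predict(resample))
    return sum(preds) / len(preds)

print(round(bagged_predict(data), 1))  # close to the true center, 10
```

Contrast with the partitioning above: cross-validation folds never repeat a row, while each bootstrap resample repeats some rows and omits others.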

- Boosting

Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)

◦ Combines the output of weak classifiers into a powerful committee
◦ Final prediction = weighted majority vote
◦ Later classifiers get the misclassified points with higher weight, so they are forced to concentrate on them
◦ AdaBoost (Adaptive Boosting)
◦ Boosting vs. Bagging
§ Bagging – independent trees
§ Boosting – successively weighted
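The re-weighting step described above can be sketched as follows. The formulas are the standard AdaBoost updates (assumed here, not spelled out on the slide), and `reweight` is a hypothetical helper name:

```python
import math

def reweight(weights, misclassified):
    """One AdaBoost re-weighting step: raise the weight of misclassified
    points so the next weak classifier concentrates on them."""
    err = sum(w for w, m in zip(weights, misclassified) if m)
    alpha = 0.5 * math.log((1 - err) / err)   # this classifier's vote weight
    new = [w * math.exp(alpha if m else -alpha)
           for w, m in zip(weights, misclassified)]
    total = sum(new)
    return [w / total for w in new], alpha

weights = [0.25, 0.25, 0.25, 0.25]        # start uniform
miss = [True, False, False, False]        # weak learner errs on point 0
weights, alpha = reweight(weights, miss)
print([round(w, 3) for w in weights])     # [0.5, 0.167, 0.167, 0.167]
```

The misclassified point now carries half the total weight, which is exactly the "forced to concentrate on them" effect the slide describes; the `alpha` values double as the weights in the final majority vote.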

- Random Forests+

Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)

◦ Builds a large collection of de-correlated trees & averages them
◦ Improves Bagging by selecting i.i.d.* random variables for splitting
◦ Simpler to train & tune
◦ "Do remarkably well, with very little tuning required" – ESLII
◦ Less susceptible to overfitting (than boosting)
◦ Many RF implementations
§ Original version - Fortran-77! By Breiman/Cutler
§ Python, R, Mahout, Weka, Milk (ML toolkit for py), matlab

* i.i.d. – independent, identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

- Random Forests

o While Boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables

o Simpler because it requires only two parameters – the no. of predictors (typically √k) & the no. of trees (500 for a large dataset, 150 for a smaller one)

o Error prediction
• For each iteration, predict for the data that is not in the sample (OOB data)
• Aggregate the OOB predictions
• Calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate
• Can use this to search for the optimal # of predictors
• We will see how close this is to the actual error in the Heritage Health Prize

o Assumes equal cost for mis-prediction; can add a cost function

o Proximity matrix & applications like adding missing data, dropping outliers

Ref: R News Vol 2/3, Dec 2002; Statistical Learning from a Regression Perspective: Berk; A Brief Overview of RF by Dan Steinberg

- Ensemble Methods

Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)

◦ Two steps
§ Develop a set of learners
§ Combine the results to develop a composite predictor
◦ Ensemble methods can take the form of:
§ Using different algorithms,
§ Using the same algorithm with different settings,
§ Assigning different parts of the dataset to different classifiers
◦ Bagging & Random Forests are examples of ensemble methods
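The two-step idea above – develop a set of learners, then combine them – can be sketched with a majority vote over a few deliberately imperfect toy rules (all names here are hypothetical, for illustration only):

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Composite predictor: each learner votes, most common label wins."""
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three toy "learners" for labeling a number pos/neg, each with a
# different threshold, i.e. the same algorithm with different settings.
clfs = [
    lambda x: "pos" if x > 0 else "neg",
    lambda x: "pos" if x > -1 else "neg",   # biased toward "pos"
    lambda x: "pos" if x > 1 else "neg",    # biased toward "neg"
]

print(majority_vote(clfs, 0.5))   # two of three say "pos" -> pos
print(majority_vote(clfs, -0.5))  # two of three say "neg" -> neg
```

Bagging and Random Forests follow the same pattern, with bootstrap resamples (and random feature subsets) supplying the diversity among learners.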

Ref: Machine Learning In Action

2:30

Algorithms! The most massively useful thing an Amateur Data Scientist can have …

Algorithms for the Amateur Data Scientist

"A towel is about the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with."
- From The Hitchhiker's Guide to the Galaxy, by Douglas Adams.

- Data Scientists apply different techniques

• Support Vector Machine

• Genetic Algorithms

• adaBoost

• Monte Carlo Methods

• Bayesian Networks

• Principal Component Analysis

• Decision Trees

• Kalman Filter

• Ensemble Methods

• Evolutionary Fuzzy Modelling

• Random Forest

• Neural Networks

• Logistic Regression

Quora
• http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms

Ref: Anthony's Kaggle Presentation

- Algorithm spectrum

A spectrum from Machine Learning through "Cute Math" to Artificial Intelligence:

o Machine Learning: Regression, Logit, CART, Ensemble: Random Forest
o Cute Math: Clustering, KNN, Genetic Alg, Simulated Annealing
o Artificial Intelligence: Collab Filtering, NNet, Boltzman Machine, SVM, Kernels, SVD, Feature Learning

- Classifying Classifiers

o Statistical
• Regression: Logistic Regression¹
• Naïve Bayes
• Bayes Networks
• SVM, Boosting

o Structural
• Rule-based: Production Rules, Decision Trees
• Ensemble: Random Forests
• Distance-based
§ Functional: Linear, Spectral, Wavelet
§ Nearest Neighbor: kNN, Learning Vector Quantization
• Neural Networks: Multi-layer Perceptron

¹ Max Entropy Classifier

Ref: Algorithms of the Intelligent Web, Marmanis & Babenko

[Diagram] Continuous Variables → Regression; Categorical Variables → Classifiers
Recurring themes: Bias, Variance, Model Complexity, Over-fitting
Classifiers: Decision Trees, k-NN (Nearest Neighbors), Bagging, Boosting, CART

- Model Evaluation & Interpretation (2:50)

Relevant Digression

- Cross Validation (3:10)

Refer to iPython notebook <2-Model-Evaluation> at https://github.com/xsankar/freezing-bear

o Reference:
• https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience
• Chris Clark's blog: http://blog.kaggle.com/2012/07/02/up-and-running-with-python-my-first-kaggle-entry/
• Predictive Modelling in py with scikit-learn, Olivier Grisel, Strata 2013
• titanic from pycon2014/parallelmaster/An Introduction to Predictive Modeling in Python

- Model Evaluation - Accuracy

            Predicted=1    Predicted=0
Actual=1    True+ (tp)     False- (fn)
Actual=0    False+ (fp)    True- (tn)

o Accuracy = (tp + tn) / (tp + fp + fn + tn)

o For cases where tn is large compared to tp, a degenerate return(false) will be very accurate!
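The degenerate case above is easy to see in a short sketch (illustrative only; the counts are made up):

```python
def accuracy(tp, fp, fn, tn):
    """Accuracy = correct predictions over all predictions."""
    return (tp + tn) / (tp + fp + fn + tn)

# Imbalanced data: 990 negatives, 10 positives. A model that always
# returns "false" scores tn=990, fn=10 and nothing else.
print(accuracy(tp=0, fp=0, fn=10, tn=990))  # 0.99, yet it finds no positives
```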

o Hence the F-measure is a better reflection of the model strength

- Model Evaluation – Precision & Recall

            Predicted=1       Predicted=0
Actual=1    True +ve (tp)     False -ve (fn)
Actual=0    False +ve (fp)    True -ve (tn)

o Recall = tp / (tp + fn)
• a.k.a. True +ve Rate, Coverage, Sensitivity, Hit Rate
• Recall = how many of the relevant items did we identify?

o Precision = tp / (tp + fp)
• a.k.a. Accuracy, Relevancy
• Precision = how many of the items we identified are relevant?

o False +ve Rate = fp / (fp + tn)
• a.k.a. False Alarm Rate, Type 1 Error Rate
• Specificity = 1 – fp rate
• Type 1 Error = fp; Type 2 Error = fn

o Inverse relationship between precision & recall – the tradeoff depends on the situation
• Legal – coverage is more important than correctness
• Search – accuracy is more important
• Fraud – support cost (high fp) vs. wrath of the credit card co. (high fn)
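The two definitions above in code (illustrative sketch; the counts are made up):

```python
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)  # of the items we flagged, how many are relevant
    recall = tp / (tp + fn)     # of the relevant items, how many did we flag
    return precision, recall

p, r = precision_recall(tp=8, fp=2, fn=4)
print(p, r)  # 0.8 0.6666666666666666
```

Pushing the decision threshold to flag more items raises recall at the cost of precision, which is the tradeoff the slide describes.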

http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html

- Confusion Matrix

             Predicted
             C1   C2   C3   C4
Actual C1    10    5    9    3
       C2     4   20    3    7
       C3     6    4   13    3
       C4     2    1    4   15

The correct ones are on the diagonal (cii)

Precision(Ck) = ckk / Σi cik   (sum over the column)
Recall(Ck)    = ckk / Σj ckj   (sum over the row)
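The per-class formulas can be applied to the 4×4 matrix above directly (sketch; `per_class_metrics` is a hypothetical helper name):

```python
# Confusion matrix from the slide (rows = actual, columns = predicted).
matrix = [
    [10, 5, 9, 3],
    [4, 20, 3, 7],
    [6, 4, 13, 3],
    [2, 1, 4, 15],
]

def per_class_metrics(m, k):
    """Precision and recall for class k: diagonal entry over the
    column sum and over the row sum, respectively."""
    col_sum = sum(row[k] for row in m)   # everything predicted as class k
    row_sum = sum(m[k])                  # everything actually in class k
    return m[k][k] / col_sum, m[k][k] / row_sum

p, r = per_class_metrics(matrix, 1)      # class C2
print(round(p, 3), round(r, 3))          # 0.667 0.588
```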

- Model Evaluation: F-Measure

            Predicted=1    Predicted=0
Actual=1    True+ (tp)     False- (fn)
Actual=0    False+ (fp)    True- (tn)

Precision P = tp / (tp + fp) ; Recall R = tp / (tp + fn)

F-Measure: a balanced, combined, weighted harmonic mean of P and R; measures effectiveness

1/F = α (1/P) + (1 – α) (1/R), equivalently F = (β² + 1) P R / (β² P + R)
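The weighted form above in code (sketch; `f_measure` is a hypothetical helper name):

```python
def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r.
    beta > 1 weighs recall more, beta < 1 weighs precision more."""
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

print(f_measure(0.5, 0.5))            # 0.5 - balanced F1 of equal P and R
print(round(f_measure(0.8, 0.4), 3))  # 0.533 - pulled toward the lower value
```

Note how the harmonic mean punishes the lower of the two: unlike the arithmetic mean (0.6 here), F1 of (0.8, 0.4) is only 0.533.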

Common form (balanced F1): β = 1 (α = ½); F1 = 2PR / (P + R)

- Hands-on Walkthru - Model Evaluation

891 rows → Train: 712 (80%), Test: 179

Refer to iPython notebook <2-Model-Evaluation> at https://github.com/xsankar/freezing-bear

- ROC Analysis

o "How good is my model?"

o Good reference: http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf

o "A receiver operating characteristics (ROC) graph is a technique for visualizing, organizing and selecting classifiers based on their performance"

o Much better than evaluating a model based on simple classification accuracy

o Plots tp rate vs. fp rate

o After understanding the ROC graph, we will draw a few for our models in the iPython notebook <2-Model-Evaluation> at https://github.com/xsankar/freezing-bear

- ROC Graph - Discussion

o E = Conservative, everything NO
o H = Liberal, everything YES
• (Am not making any political statement!)
o F = Ideal
o G = Worst
o The diagonal is chance
o The North-West corner is good
o The South-East is bad
• For example, E
• Believe it or not – I have actually seen a graph with the curve in this region!

- ROC Graph – Clinical Example

IFCC: Measures of diagnostic accuracy: basic definitions

- ROC Graph Walk-thru

o iPython notebook <2-Model-Evaluation> at https://github.com/xsankar/freezing-bear
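The tp-rate vs. fp-rate points that make up a ROC graph can be computed with a simple threshold sweep. This is an illustrative sketch (the notebook above uses real model scores; `roc_points` and the toy scores here are made up):

```python
def roc_points(scores, labels):
    """Sweep a threshold over classifier scores, from most to least
    confident, and record (fp rate, tp rate) at each distinct cut."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]  # classifier confidence per item
labels = [1,   1,   0,   1,   0,   0]    # ground truth
print(roc_points(scores, labels))
```

A curve hugging the north-west corner (low fp rate, high tp rate) is the "good" region of the earlier discussion; the last point is always (1.0, 1.0), where everything is labeled YES.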