This page reproduces the content of https://speakerdeck.com/jakevdp/statistics-for-hackers.

Uploaded on 2016/05/31 in Technology

(Presented at PyCon 2016. Early version presented at StitchFix, Sept 2015)

The field of statistics has a reputation for being difficult to crack: it revolves around a seemingly endless jargon of distributions, test statistics, confidence intervals, p-values, and more, with each concept subject to its own subtle assumptions. But it doesn't have to be this way: today we have access to computers that Neyman and Pearson could only dream of, and many of the conceptual challenges in the field can be overcome through judicious use of these CPU cycles. In this talk I'll discuss how you can use your coding skills to "hack statistics" – to replace some of the theory and jargon with intuitive computational approaches such as sampling, shuffling, cross-validation, and Bayesian methods – and show that with a grasp of just a few fundamental concepts, if you can write a for-loop you can do statistical analysis.

- Jake VanderPlas

PyCon 2016

- < About Me >

- Astronomer by training

- Statistician by accident

- Active in Python science & open source

- Data Scientist at UW eScience Institute

- @jakevdp on Twitter & Github

- Hacker (n.)

1. A person who is trying to steal your grandma’s bank password.

2. A person whose natural approach to problem-solving involves writing code.

- Statistics is Hard.

Using programming skills, it can be easy.

- My thesis today:

If you can write a for-loop, you can do statistics.

- Statistics is fundamentally about Asking the Right Question.

- – Dr. Seuss (attr)
- Warm-up: Coin Toss

You toss a coin 30 times and see 22 heads. Is it a fair coin?

- A fair coin should show 15 heads in 30 tosses. This coin is biased.

Even a fair coin could show 22 heads in 30 tosses. It might be just chance.

- Classic Method:

Assume the Skeptic is correct: test the Null Hypothesis.

What is the probability of a fair coin showing 22 heads simply by chance?

- Classic Method:

Start computing probabilities . . .

- Classic Method:

Number of arrangements (binomial coefficient)

Probability of N_H heads

Probability of N_T tails

- Classic Method:

Probability of 0.8% (i.e. p = 0.008) of observations given a fair coin.

→ reject fair coin hypothesis at p < 0.05
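
The 0.8% figure can be reproduced exactly by summing the binomial probabilities for 22 or more heads. This is a minimal sketch using only the standard library; the function name `binomial_tail` is chosen here for illustration and does not come from the talk.

```python
from math import comb

def binomial_tail(n, k_min, p=0.5):
    """Probability of at least k_min successes in n independent trials."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

# 22 or more heads in 30 tosses of a fair coin
p_value = binomial_tail(30, 22)
print(f"{p_value:.4f}")  # 0.0081
```

Each term is the number of arrangements times the probability of that many heads and the remaining tails, which is the structure the slide annotations describe.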

- Could there be an easier way?

- Easier Method:

Just simulate it!

from numpy.random import randint

M = 0
for i in range(10000):
    # 30 tosses of a fair coin: 0 = tails, 1 = heads
    trials = randint(2, size=30)
    if trials.sum() >= 22:
        M += 1
p = M / 10000  # 0.008149
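
The same simulation can be wrapped in a seeded, reusable function. The sketch below uses the standard library's `random` module rather than NumPy; the function name and default arguments are illustrative assumptions. Each call to `getrandbits(n_tosses)` yields that many fair coin flips, and counting the 1-bits gives the number of heads.

```python
import random

def simulate_p_value(n_tosses=30, n_heads=22, n_sims=50000, seed=0):
    """Fraction of simulated fair-coin experiments with at least n_heads heads."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # each bit of the random integer is one fair coin toss
        heads = bin(rng.getrandbits(n_tosses)).count("1")
        if heads >= n_heads:
            hits += 1
    return hits / n_sims

p = simulate_p_value()
print(p)  # close to 0.008; the exact value depends on the seed
```

Fixing the seed makes the estimate repeatable, which helps when comparing runs with different numbers of simulations.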

→ reject fair coin at p = 0.008

- In general . . .

Computing the Sampling Distribution is Hard.

Simulating the Sampling Distribution is Easy.

- Four Recipes for Hacking Statistics:

1. Direct Simulation

2. Shuffling

3. Bootstrapping

4. Cross Validation

- Sneetches:

Stars and Intelligence

Now, the Star-Belly Sneetches had bellies with stars.
The Plain-Belly Sneetches had none upon thars . . .

*inspired by John Rauser’s Statistics Without All The Agonizing Pain

- Sneetches: Stars and Intelligence

Test Scores

★: 84 72 57 46 63 76 99 91

❌: 81 69 74 61 56 87 69 65 66 44 62 69

★ mean: 73.5

❌ mean: 66.9

difference: 6.6

- Is this difference of 6.6 statistically significant?

★ mean: 73.5

❌ mean: 66.9

difference: 6.6

- Classic Method (Welch’s t-test)

- Classic Method (Student’s t distribution)

Degree of Freedom: “The number of independent ways by which a dynamic system can move, without violating any constraint imposed on it.” -Wikipedia

- Classic Method

1.7959

- “The difference of 6.6 is not significant at the p=0.05 level”

- The biggest problem:

We’ve entirely lost track of what question we’re answering!

- < One popular alternative . . . >

“Why don’t you just . . .”

from statsmodels.stats.weightstats import ttest_ind

t, p, dof = ttest_ind(group1, group2,
                      alternative='larger',
                      usevar='unequal')
print(p)  # 0.186
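
To see what the library call is doing, the Welch statistic can be computed by hand. This is a sketch under one assumption: the scores are split into the two groups in the way that reproduces the quoted means of 73.5 and 66.9. The helper name `welch_t` is illustrative.

```python
from math import sqrt
from statistics import mean, variance

# Scores as read from the slide; the split into groups is an assumption
# that reproduces the quoted means (73.5 and 66.9).
stars = [84, 72, 57, 46, 63, 76, 99, 91]
plain = [81, 69, 74, 61, 56, 87, 69, 65, 66, 44, 62, 69]

def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)  # squared standard errors
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof

t, dof = welch_t(stars, plain)
print(f"t = {t:.3f}, dof = {dof:.1f}")  # t = 0.932, dof = 10.7
```

A t statistic of about 0.93 is below 1.7959, the one-sided 5% critical value of Student's t distribution with 11 degrees of freedom, which agrees with the conclusion that the difference is not significant at p = 0.05.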

. . . But what question is this answering?

- Stepping Back...

The deep meaning lies in the sampling distribution:

Same principle as the coin example:

0.8 %

- Let’s use a sampling method instead

- The Problem:

Unlike coin flipping, we don’t have a generative model . . .

Solution: Shuffling

- Idea: Simulate the distribution by shuffling the labels repeatedly and computing the desired statistic.

Motivation: if the labels really don’t matter, then switching them shouldn’t change the result!

- 1. Shuffle Labels

2. Rearrange

3. Compute means

Example shuffles:

★ mean: 72.4, ❌ mean: 67.6, difference: 4.8

★ mean: 62.6, ❌ mean: 74.1, difference: -11.6

★ mean: 75.9, ❌ mean: 65.3, difference: 10.6

- [Histogram: number of shuffles vs. score difference; 16 % marked]

- “A difference of 6.6 is not significant at p = 0.05.”
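
The shuffling recipe can be sketched in a few lines of standard-library Python. The group assignment below is an assumption chosen to reproduce the quoted means (73.5 and 66.9), and `shuffle_test` is an illustrative name rather than code from the talk.

```python
import random

# Scores as read from the slides; the grouping is an assumption that
# reproduces the quoted means.
stars = [84, 72, 57, 46, 63, 76, 99, 91]
plain = [81, 69, 74, 61, 56, 87, 69, 65, 66, 44, 62, 69]

def mean(values):
    return sum(values) / len(values)

def shuffle_test(a, b, n_shuffles=10000, seed=0):
    """One-sided shuffle (permutation) test for mean(a) - mean(b)."""
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign the labels at random
        diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
        if diff >= observed:
            count += 1
    return observed, count / n_shuffles

observed, p = shuffle_test(stars, plain)
print(f"observed difference = {observed:.1f}, p = {p:.2f}")
```

The returned fraction is the share of shuffles whose difference is at least as large as the observed one; it should land near the 16 % shown on the slide, with some run-to-run variation.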

That day, all the Sneetches forgot about stars
And whether they had one, or not, upon thars.

- Notes on Shuffling:

- Works when the Null Hypothesis assumes two groups are equivalent

- Like all methods, it will only work if your samples are representative – always be careful about selection biases!

- Needs care for non-independent trials. Good discussion in Simon’s Resampling: The New Statistics

- Four Recipes for Hacking Statistics:

1. Direct Simulation

2. Shuffling

3. Bootstrapping

4. Cross Validation

- Yertle’s Turtle Tower

On the far-away island of Sala-ma-Sond,
Yertle the Turtle was king of the pond. . .

- How High can Yertle stack his turtles?

Observe 20 of Yertle’s turtle towers . . .

# of turtles: 48 24 32 61 51 12 32 18 19 24 21 41 29 21 25 23 42 18 23 13

- What is the mean of the number of turtles in Yertle’s stack?

- What is the uncertainty on this estimate?

- Classic Method:

Sample Mean:

Standard Error of the Mean:

- What assumptions go into these formulae?

Can we use sampling instead?

- Problem:

As before, we don’t have a generating model . . .

Solution: Bootstrap Resampling

- Bootstrap Resampling:

Idea: Simulate the distribution by drawing samples with replacement.

Motivation: The data estimates its own distribution – we draw random samples from this distribution.

48 24 51 12
21 41 25 23
32 61 19 24
29 21 23 13
32 18 42 18

Bootstrap sample: 21 19 25 24 23 19 41 23 41 18 61 12 42 42 42 19 18 61 29 41 → 31.05

- Repeat this several thousand times . . .

- Recovers The Analytic Estimate!

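from numpy import array, mean, std, zeros
from numpy.random import randint

N = array([48, 24, 32, 61, 51, 12, 32, 18, 19, 24,
           21, 41, 29, 21, 25, 23, 42, 18, 23, 13])
xbar = zeros(10000)
for i in range(10000):
    sample = N[randint(20, size=20)]  # 20 draws with replacement
    xbar[i] = mean(sample)
mean(xbar), std(xbar)
# (28.9, 2.9)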
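
For comparison, here is a standard-library sketch that runs the same bootstrap and also evaluates the classic formulas (the sample mean, and the sample standard deviation divided by the square root of N). The function name `bootstrap_means` and the seed are illustrative assumptions.

```python
import random
from math import sqrt
from statistics import mean, pstdev, stdev

heights = [48, 24, 32, 61, 51, 12, 32, 18, 19, 24,
           21, 41, 29, 21, 25, 23, 42, 18, 23, 13]

def bootstrap_means(data, n_resamples=10000, seed=0):
    """Means of n_resamples resamples drawn with replacement from data."""
    rng = random.Random(seed)
    n = len(data)
    return [sum(rng.choices(data, k=n)) / n for _ in range(n_resamples)]

xbar = bootstrap_means(heights)
boot_mean, boot_err = mean(xbar), pstdev(xbar)

# Classic formulas: sample mean and standard error of the mean
classic_mean = mean(heights)
classic_err = stdev(heights) / sqrt(len(heights))

print(f"bootstrap: {boot_mean:.2f} +/- {boot_err:.2f}")
print(f"classic:   {classic_mean:.2f} +/- {classic_err:.2f}")
```

The two estimates should agree to within a few percent; the bootstrap spread comes out slightly smaller because resampling effectively divides by N rather than N - 1.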

Height = 29 ± 3 turtles

- Bootstrap sampling can be applied even to more involved statistics

- Bootstrap on Linear Regression:

What is the relationship between speed of wind and the height of Yertle’s turtle tower?

- Bootstrap on Linear Regression:

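from numpy.random import randint

for i in range(10000):
    # resample (x, y) pairs with replacement; a separate name keeps the loop index i intact
    idx = randint(20, size=20)
    slope, intercept = fit(x[idx], y[idx])
    results[i] = (slope, intercept)

- Notes on Bootstrapping: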

- Bootstrap resampling is well-studied and rests on solid theoretical grounds.

- Bootstrapping often doesn’t work well for rank-based statistics (e.g. maximum value)

- Works poorly with very few samples (N > 20 is a good rule of thumb)

- As always, be careful about selection biases & non-independent data!

- Four Recipes for Hacking Statistics:

1. Direct Simulation

2. Shuffling

3. Bootstrapping

4. Cross Validation

- Onceler Industries:

Sales of Thneeds

I'm being quite useful!
This thing is a Thneed.
A Thneed's a Fine-Something-That-All-People-Need!

- Thneed sales seem to show a trend with temperature . . .

- But which model is a better fit?

y = a + bx

y = a + bx + cx²

- Can we judge by root-mean-square error?

y = a + bx: RMS error = 63.0

y = a + bx + cx²: RMS error = 51.5

- In general, more flexible models will always have a lower RMS error.

y = a + bx

y = a + bx + cx²

y = a + bx + cx² + dx³

y = a + bx + cx² + dx³ + ex⁴

y = a + ⋯

- RMS error does not tell the whole story.

y = a + bx + cx² + dx³ + ex⁴ + fx⁵ + ⋯ + nx¹⁴

- Not to worry:

Statistics has figured this out.

- Classic Method

Difference in Mean Squared Error follows chi-square distribution:

Can estimate degrees of freedom easily because the models are nested . . .

Plug in our numbers . . .

Wait… what question were we trying to answer again?

- Another Approach: Cross Validation

- Cross-Validation

1. Randomly Split data

2. Find the best model for each subset

3. Compare models across subsets

4. Compute RMS error for each

RMS = 55.1

RMS = 48.9

RMS estimate = 52.1

Repeat for as long as you have patience . . .

5. Compare cross-validated RMS for models:

Best model minimizes the cross-validated error.

- . . . I biggered the loads of the thneeds I shipped out!
I was shipping them forth, to the South, to the East
to the West, to the North!

- Notes on Cross-Validation:

- This was “2-fold” cross-validation; other CV schemes exist & may perform better for your data (see e.g. scikit-learn docs)

- Cross-validation is the go-to method for model evaluation in machine learning, as statistics of the models are often not known in the classical sense.

- Again: caveats about selection bias and independence in data.
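
The 2-fold recipe can be sketched without any third-party library. Everything below is an illustration under stated assumptions: the Thneed data are not in the slides, so a synthetic quadratic trend with noise stands in for them, and `polyfit`, `rms_error`, and `two_fold_cv_rms` are hypothetical helpers written for this example. The two fold errors are combined as a root mean square, which is consistent with the slide's 55.1 and 48.9 giving 52.1.

```python
import random

def polyfit(xs, ys, degree):
    """Least-squares coefficients [a, b, c, ...] via the normal equations."""
    n = degree + 1
    # augmented matrix [X^T X | X^T y]
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(n)]
    for col in range(n):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

def rms_error(coeffs, xs, ys):
    residuals = [sum(c * x ** k for k, c in enumerate(coeffs)) - y
                 for x, y in zip(xs, ys)]
    return (sum(r * r for r in residuals) / len(residuals)) ** 0.5

def two_fold_cv_rms(xs, ys, degree, seed=0):
    """Fit on each half, score on the other half, combine the two RMS errors."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    a, b = idx[:half], idx[half:]
    errors = []
    for train, test in ((a, b), (b, a)):
        coeffs = polyfit([xs[i] for i in train], [ys[i] for i in train], degree)
        errors.append(rms_error(coeffs, [xs[i] for i in test], [ys[i] for i in test]))
    return (sum(e * e for e in errors) / 2) ** 0.5

# Synthetic stand-in for the Thneed data (hypothetical): quadratic trend plus noise
rng = random.Random(42)
xs = [-1 + 2 * i / 39 for i in range(40)]
ys = [1 + 2 * x + 3 * x * x + rng.gauss(0, 0.3) for x in xs]

train_rms = {d: rms_error(polyfit(xs, ys, d), xs, ys) for d in range(1, 6)}
cv_rms = {d: two_fold_cv_rms(xs, ys, d) for d in range(1, 6)}
for d in range(1, 6):
    print(f"degree {d}: training RMS = {train_rms[d]:.3f}, "
          f"cross-validated RMS = {cv_rms[d]:.3f}")
best_degree = min(cv_rms, key=cv_rms.get)
print("degree with lowest cross-validated RMS:", best_degree)
```

The training RMS can only go down as the degree grows, while the cross-validated RMS stops improving once the extra terms start fitting noise. Using the same seed for every degree keeps the split identical, so the models are compared on the same folds.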

- Four Recipes for Hacking Statistics:

1. Direct Simulation

2. Shuffling

3. Bootstrapping

4. Cross Validation

- Sampling Methods allow you to use intuitive computational approaches in place of often non-intuitive statistical rules.

If you can write a for-loop you can do statistical analysis.

- Things I didn’t have time for:

- Bayesian Methods: very intuitive & powerful approaches to more sophisticated modeling. (see e.g. Bayesian Methods for Hackers by Cam Davidson-Pilon)

- Selection Bias: if you get data selection wrong, you’ll have a bad time. (See Chris Fonnesbeck’s Scipy 2015 talk, Statistical Thinking for Data Science)

- Detailed considerations on use of sampling, shuffling, and bootstrapping. (I recommend Statistics Is Easy by Shasha & Wilson and Resampling: The New Statistics by Julian Simon)

- – Dr. Seuss (attr)
- ~ Thank You! ~

Email: jakevdp@uw.edu

Twitter: @jakevdp

Github: jakevdp

Web: http://vanderplas.com/

Blog: http://jakevdp.github.io/

Slides available at http://speakerdeck.com/jakevdp/statistics-for-hackers/