This page reproduces the content of http://www.slideshare.net/seanjtaylor/putting-the-magic-in-data-science

- Putting the Magic in Data Science

11/04/2014

Sean J. Taylor

QCon SF - http://en.wikipedia.org/wiki/Pasteur's_quadrant

“The mission of CDS is to provide research and innovation that

fundamentally increase the magnitude of Facebook’s success.” - It’s not a trick, it’s an illusion.
- Any sufficiently advanced

technology is indistinguishable

from magic.

— Arthur C. Clarke - 1. create technology:

people who are not experts can

use it easily with little difficulty

and trust the output

2. make it “sufficiently advanced” - The Data Science Venn Diagram

Drew Conway - Basic Research: Maybe someday, someone can use this.

Applied Research: I might be able to use this.

Working Prototype: I can use this (sometimes).

Quality Code: Software engineers can use this.

Tool or Service: People can use this. - People can use it → People want to use it

Data Science Impact =

Value * (Num People) * (Frequency of Use)
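As a worked instance of the formula above, here is a minimal Python sketch. The function name `impact` and every number in it are hypothetical, chosen only to show how the three factors multiply.

```python
def impact(value_per_use, num_people, uses_per_person):
    """Data Science Impact = Value * (Num People) * (Frequency of Use)."""
    return value_per_use * num_people * uses_per_person


# Hypothetical inputs: a high-value specialist tool vs. a widely adopted one.
niche_tool = impact(value_per_use=50.0, num_people=5, uses_per_person=2)
broad_tool = impact(value_per_use=1.0, num_people=2000, uses_per_person=10)

print(niche_tool)  # 500.0
print(broad_tool)  # 20000.0
```

Under these made-up inputs, the tool with the lower value per use has the larger impact because far more people use it, and more often.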

Very difficult to demand that people use new tech — must

make a compelling value proposition for people and

educate them. - What can data do?

Data can’t do anything.

People do things with data.

(usually they make decisions) - The Last Mile Problem

It works for you. Can you get people to use it?

Without considering this last step, all preceding steps are

useless. - Counterfactuals and Causal Inference

Morgan and Winship

Existence of

Data

Creation of

Decision

Technologies

Existence of

Quality

Capabilities - Magical + Effective Data Science Tools

• Planout: language for expressing / deploying experimental

designs

• Deltoid: analyzing the results of experiments

• ClustR: generic document clustering

• Prophet: completely automatic forecasting procedure

• Crystal Ball: large scale, interpretable regression models

• Hive / Presto / Scuba: SQL engines for different problems - Outline

1. Sources of Magic

2. Solving the Last Mile - Tricks:

Sources of

Data Science

Magic - Of Data

Trick 1: invest in data collection

Novel sources of data are magic. - Making your own quality data is

better than being a data alchemist.

- Let’s say you have a billion users…

and you want to listen to them all - Trick 2: Dimensionality Reduction
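A minimal sketch of dimensionality reduction, assuming principal component analysis (PCA) computed by power iteration as the technique. The slides do not name a specific method, and the function names and synthetic data here are hypothetical. Three correlated features are compressed to one score per observation.

```python
import math
import random


def first_principal_component(rows, iterations=200):
    """Return (direction, mean) of the leading principal component via power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize to unit length.
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v, mean


def project(row, direction, mean):
    """One-dimensional representation of a row: its coordinate along `direction`."""
    return sum((x - m) * u for x, m, u in zip(row, mean, direction))


# Synthetic data (an assumption): three features driven by one hidden factor plus noise.
rng = random.Random(0)
rows = []
for _ in range(200):
    t = rng.gauss(0, 1)
    rows.append([t + rng.gauss(0, 0.05),
                 2 * t + rng.gauss(0, 0.05),
                 -t + rng.gauss(0, 0.05)])

direction, mean = first_principal_component(rows)
scores = [project(r, direction, mean) for r in rows]  # 3 numbers per row become 1
print([round(x, 2) for x in direction])
```

Each observation is now summarized by a single score that tracks the hidden factor. Real text, image, or audio data would need far more components and a proper linear algebra library.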

Increasingly, individual observations can be very

high-dimensional: text documents, images, audio.

Clustering and classification techniques can find/extract a

lower-dimensional representation that retains meaning. - Deep Learning is just (very)

fancy dimensionality reduction - Problem:

Estimate the probability

of rare events or events

pertaining to new

objects.

E.g. click, like, comment,

share - Trick 3: Be a (Practical) Bayesian
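A minimal sketch of one practical Bayesian approach to the rare-event problem above. It shrinks each object's observed rate toward a pooled rate using a Beta prior, in the spirit of the weighted averages this trick describes. This is a beta-binomial smoother, not a James-Stein estimator. The item counts and the `prior_strength` value are hypothetical.

```python
def shrunk_rate(clicks, views, prior_rate, prior_strength):
    """Posterior mean of a click rate under a Beta prior.

    The prior behaves like `prior_strength` pseudo-views at `prior_rate`,
    so objects with little data stay close to the prior.
    """
    return (clicks + prior_rate * prior_strength) / (views + prior_strength)


# Hypothetical (clicks, views) counts for three objects.
items = {
    "old_post": (480, 10_000),
    "new_post": (1, 3),
    "unseen_post": (0, 0),
}

# Use the pooled rate over all related objects as the prior.
total_clicks = sum(c for c, _ in items.values())
total_views = sum(v for _, v in items.values())
pooled_rate = total_clicks / total_views

for name, (clicks, views) in items.items():
    raw = clicks / views if views else float("nan")
    est = shrunk_rate(clicks, views, pooled_rate, prior_strength=100)
    print(f"{name}: raw={raw:.3f} shrunk={est:.3f}")
```

The raw rate for `new_post` is 1/3 from only three views, and `unseen_post` has no raw rate at all. The shrunk estimates for both stay near the pooled rate until more data arrives.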

• If you have rare or new things you’d like to learn about, it’s

often hard to say much.

• But it’s sometimes easy to think of cases which are similar

to the one you are trying to predict.

• James-Stein estimators demonstrate that weighted

averages including related observations will help improve

predictions. - [Figure: scale marked 0, 14, and 14 billion, labeled “Philadelphia Eagles Wins” and “Facebook Revenue”] - Trick 4: Bootstrap all the statistics

• The bootstrap allows you to get a sampling distribution

over almost any statistic you can compute from your data.

• Embarrassingly parallelizable / computable online.

confidence intervals - Bootstrapping in Practice
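A minimal sketch of the bootstrap in Python, assuming resampling with replacement at the original sample size and simple percentile confidence intervals. The function names, the synthetic data, and the choice of the median as the statistic are hypothetical.

```python
import random
import statistics


def bootstrap(data, statistic, n_resamples=500, seed=0):
    """Compute `statistic` on n_resamples resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]


def summarize(replicates, level=0.95):
    """Mean, standard error, and percentile interval of the bootstrap replicates."""
    ordered = sorted(replicates)
    tail = (1 - level) / 2
    lo = ordered[int(tail * (len(ordered) - 1))]
    hi = ordered[int((1 - tail) * (len(ordered) - 1))]
    return {
        "mean": statistics.mean(ordered),
        "se": statistics.stdev(ordered),  # standard deviation of the replicates
        "ci": (lo, hi),
    }


# Synthetic data (an assumption): 300 draws from a normal distribution.
rng = random.Random(42)
data = [rng.gauss(10, 2) for _ in range(300)]

result = summarize(bootstrap(data, statistics.median))
print(result)
```

Each resample is independent of the others, so the list comprehension in `bootstrap` could be split across workers without changing the result's distribution.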

[Figure: All Your Data → Generate random sub-samples (R1, R2, …, R500) → Compute statistics or estimate model parameters (s1, s2, …, s500) → Get a distribution over statistic of interest (usually the prediction). Histogram axes: Statistic (x), Count (y).]

From the distribution: take mean; CIs == 95% quantiles; SEs == standard deviation.

- Grab bag of tricks

• Everything is linear if you use enough features.

• Matrix factorizations: NMF, SVD.

• Probabilistic data structures: LSH, min-hash.

• Exploit distributed, online algorithms as much as possible.

• “A little bit of ridge never hurts.” — Trevor Hastie

• Label propagation: use data about network neighbors.

• Data reduction: create bins & analyze weighted bin stats. - Last Mile of

Data Science

Magic - Principle 1: Reliability

“60% of the time, it works every time” - Test-driven data science
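A minimal sketch of what test-driven data science could look like, assuming a toy system that estimates per-group conversion rates. The function names, the fixture, and the tolerance are hypothetical. It covers a simulated fixture with a known answer, an automated test against that fixture, and an input check.

```python
import random


def validate_input(rows):
    """Check input assumptions: non-empty, and every outcome is 0 or 1."""
    if not rows:
        raise ValueError("no data")
    for group, outcome in rows:
        if outcome not in (0, 1):
            raise ValueError(f"outcome must be 0 or 1, got {outcome!r}")


def conversion_rates(rows):
    """System under test: per-group conversion rate from (group, outcome) rows."""
    validate_input(rows)
    totals, hits = {}, {}
    for group, outcome in rows:
        totals[group] = totals.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + outcome
    return {g: hits[g] / totals[g] for g in totals}


def simulated_fixture(true_rates, n_per_group=5000, seed=0):
    """Fixture: simulated data whose true answer is known by construction."""
    rng = random.Random(seed)
    return [(g, 1 if rng.random() < p else 0)
            for g, p in true_rates.items() for _ in range(n_per_group)]


def test_recovers_known_rates():
    truth = {"control": 0.10, "treatment": 0.15}
    estimates = conversion_rates(simulated_fixture(truth))
    for group, p in truth.items():
        assert abs(estimates[group] - p) < 0.03


def test_rejects_bad_input():
    try:
        conversion_rates([("control", 2)])
    except ValueError:
        return
    raise AssertionError("expected ValueError for a non-binary outcome")


test_recovers_known_rates()
test_rejects_bad_input()
```

When the system fails on real data, the failing case can be saved as a new fixture with its own test so the same bug does not return.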

Learn how to build reliable data science systems from

software engineers.

1. Write test fixtures with simulated or case-study data

sets.

2. Write automated tests that check that your system

works on fixtures, and add new ones when it doesn’t.

3. (Bonus) Test input data to ensure it meets all

assumptions. - Principle 2: Latency + Interactivity

“how many hypotheses per second

are you testing/generating?” - Answer more questions

People have good intuitions and tend to search effectively

given understandable tools.

First order effect of speed: more answers per second.

Second order effect of speed: more questions asked.

Deltoid: effortless experimentation

Scuba: in-memory, distributed, sampled database.

Presto: aggressive caching, distributed SQL query engine - Principle 3: Simplicity + Modularity
- Choose one thing to do very well

• It makes it easier to optimize your technology.

• It makes it easier for people to understand what it does.

• It makes people more likely to build around it. - Principle 4: Unexpectedness
- Show people the most interesting things
- Tricks Explained

• Planout: simplicity + modularity

• Deltoid: effortless experiment analysis + bootstrap

• ClustR: dimensionality reduction + interactivity

• Prophet: everything’s linear + basis expansion + new data

• Crystal Ball: everything’s linear + regularization + speed

• Hive / Presto / Scuba: reliability/latency tradeoffs - 1. Learn as many tricks as you can

2. Combine them in novel ways

3. Consider the last mile - sjt@fb.com

http://seanjtaylor.com

(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.