Any sufficiently advanced technology is indistinguishable from magic. — Arthur C. Clarke
1. Create technology that non-experts can use with little difficulty and whose output they can trust.
2. Make it “sufficiently advanced”.
The Data Science Venn Diagram (Drew Conway)
Basic Research: Maybe someday, someone can use this.
Applied Research: I might be able to use this.
Working Prototype: I can use this (sometimes).
Quality Code: Software engineers can use this.
Tool or Service: People can use this.
People can use it → People want to use it
Data Science Impact = Value * (Num People) * (Frequency of Use)
It is very difficult to demand that people use new tech. You must make a compelling value proposition and educate them.
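The impact formula above can be worked through with a small sketch. The function name `impact` and all of the numbers are hypothetical, chosen only to show how reach and frequency of use can outweigh per-use value.

```python
def impact(value_per_use, num_people, uses_per_person):
    """Data Science Impact = Value * (Num People) * (Frequency of Use)."""
    return value_per_use * num_people * uses_per_person


# Hypothetical numbers, for illustration only.
# A high-value prototype that only its author can run:
prototype_impact = impact(value_per_use=10.0, num_people=1, uses_per_person=5)

# A lower-value tool that many people use equally often:
tool_impact = impact(value_per_use=1.0, num_people=200, uses_per_person=5)

print(prototype_impact, tool_impact)
```

Under these assumed numbers, the widely used tool has far more impact than the prototype even though each individual use is worth less.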
What can data do? Data can’t do anything. People do things with data (usually, they make decisions).
The Last Mile Problem
It works for you. Can you get people to use it? Without considering this last step, all preceding steps are useless.
Counterfactuals and Causal Inference (Morgan and Winship)
[Diagram. Labels as extracted, grouping uncertain: Existence of Data Creation of Decision Technologies Existence of Quality Capabilities]
Magical + Effective Data Science Tools
• Planout: language for expressing / deploying experimental designs
• Deltoid: analyzing the results of experiments
• ClustR: generic document clustering
• Prophet: completely automatic forecasting procedure
• Crystal Ball: large scale, interpretable regression models
• Hive / Presto / Scuba: SQL engines for different problems
Outline
1. Sources of Magic
2. Solving the Last Mile
Trick 1: Invest in data collection
Novel sources of data are magic.
Making your own quality data is better than being a data alchemist.
Let’s say you have a billion users… and you want to listen to them all
Trick 2: Dimensionality Reduction
Increasingly, individual observations are very high-dimensional: text documents, images, audio.
Clustering and classification techniques can extract a lower-dimensional representation that retains meaning.
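As one concrete instance of dimensionality reduction, here is a minimal sketch that reduces 2-D points to a single coordinate. It uses principal component analysis computed by power iteration, which is a technique chosen here for illustration and not one named on the slide. The function names, the starting vector, and the synthetic data are all assumptions.

```python
import math
import random


def first_principal_component(rows, iterations=100):
    """Return (column means, unit vector along the direction of greatest variance)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    direction = [1.0] * d  # arbitrary starting vector (an assumption)
    for _ in range(iterations):
        # One power-iteration step on X^T X without forming the matrix:
        # project the rows onto the current direction, then map back.
        scores = [sum(c[j] * direction[j] for j in range(d)) for c in centered]
        updated = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(u * u for u in updated))
        direction = [u / norm for u in updated]
    return means, direction


def project(row, means, direction):
    """Reduce one observation to a single coordinate along the component."""
    return sum((row[j] - means[j]) * direction[j] for j in range(len(row)))


# Synthetic data: points that lie close to the line y = 2x.
rng = random.Random(0)
points = [(i / 10, 2.0 * (i / 10) + rng.gauss(0.0, 0.05)) for i in range(50)]
means, direction = first_principal_component(points)
scores = [project(p, means, direction) for p in points]
print(direction)
```

Each 2-D point is replaced by one score along the recovered direction, which in this synthetic case keeps nearly all of the variation in the data.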
Deep Learning is just (very) fancy dimensionality reduction
Problem: Estimate the probability of rare events or events pertaining to new objects. E.g. click, like, comment, share
Trick 3: Be a (Practical) Bayesian
• If you have rare or new things you’d like to learn about, it’s often hard to say much.
• But it’s sometimes easy to think of cases that are similar to the one you are trying to predict.
• James-Stein estimators demonstrate that weighted averages that include related observations can improve predictions.
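The idea of borrowing strength from related cases can be sketched with Beta-Binomial smoothing of a click rate. This is a simple stand-in for the shrinkage idea and not the James-Stein estimator itself. The function name, the `prior_strength` pseudo-count, and all counts below are hypothetical.

```python
def shrunk_rate(successes, trials, prior_rate, prior_strength):
    """Posterior mean of a rate under a Beta prior centered on prior_rate.

    prior_strength acts like a number of pseudo-observations: larger values
    pull the estimate harder toward prior_rate.
    """
    alpha = prior_rate * prior_strength
    beta = (1.0 - prior_rate) * prior_strength
    return (successes + alpha) / (trials + alpha + beta)


# Hypothetical (clicks, views) for established items similar to the new one.
related = [(30, 1000), (12, 400), (55, 2000)]
pooled_rate = sum(c for c, _ in related) / sum(v for _, v in related)

# A new item with very little data: 1 click in 5 views.
raw_rate = 1 / 5
estimate = shrunk_rate(1, 5, prior_rate=pooled_rate, prior_strength=100)
print(raw_rate, pooled_rate, estimate)
```

The estimate is a weighted average of the item's own rate and the pooled rate of related items. With only five views it stays close to the pooled rate, and it moves toward the item's own rate as views accumulate.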
[Figure: scale marked 0, 14, and 14 billion; labels: Philadelphia Eagles Wins, Facebook Revenue]
Trick 4: Bootstrap all the statistics
• The bootstrap allows you to get a sampling distribution over almost any statistic you can compute from your data.
• It is embarrassingly parallelizable and can be computed online.
[Figure: confidence intervals]
Bootstrapping in Practice
1. Start with all your data.
2. Generate random sub-samples: R1, R2, …, R500.
3. Compute statistics or estimate model parameters on each sub-sample: s1, s2, …, s500.
4. Get a distribution over the statistic of interest (usually the prediction):
- take mean
- CIs == 95% quantiles
- SEs == standard deviation
[Figure: histogram of the statistic; x-axis: Statistic, y-axis: Count]
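The procedure above can be sketched in a few lines of standard-library Python. This sketch assumes the usual nonparametric bootstrap, where each sub-sample is drawn with replacement and has the same size as the data. The function name `bootstrap`, its parameters, and the synthetic sample are illustrative.

```python
import random
import statistics


def bootstrap(data, statistic, n_resamples=500, seed=0):
    """Return the mean, standard error, and 95% interval of a statistic."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_resamples):
        # Draw n observations with replacement, then recompute the statistic.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    # With n=40, the first and last cut points are the 2.5% and 97.5% quantiles.
    cuts = statistics.quantiles(replicates, n=40)
    return {
        "mean": statistics.fmean(replicates),
        "se": statistics.stdev(replicates),
        "ci": (cuts[0], cuts[-1]),
    }


# Synthetic data, for illustration only.
rng = random.Random(42)
sample = [rng.gauss(10.0, 2.0) for _ in range(200)]
result = bootstrap(sample, statistics.median)
print(result)
```

Any function of a list of observations can be passed as `statistic`, including one that fits a model and returns a prediction. Because each replicate is independent of the others, the loop could be split across workers.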
Grab bag of tricks
• Everything is linear if you use enough features.
• Matrix factorizations: NMF, SVD.
• Probabilistic data structures: LSH, min-hash.
• Exploit distributed, online algorithms as much as possible.
• “A little bit of ridge never hurts.” (Trevor Hastie)
• Label propagation: use data about network neighbors.
• Data reduction: create bins & analyze weighted bin stats.
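To make one item from this list concrete, here is a minimal min-hash sketch that estimates the Jaccard similarity of two token sets from short signatures. The function names, the use of salted BLAKE2b as the hash family, and the example sets are assumptions made for illustration.

```python
import hashlib


def _hash(token, seed):
    """One member of a family of hash functions, selected by seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_hashes=256):
    """Keep the minimum hash value of a non-empty token set under each hash function."""
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates |A ∩ B| / |A ∪ B|."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Two hypothetical documents that share 50 of 150 distinct tokens.
doc_a = {f"word{i}" for i in range(0, 100)}
doc_b = {f"word{i}" for i in range(50, 150)}
sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)
print(estimate_jaccard(sig_a, sig_b))
```

The signatures have a fixed length regardless of document size, so similarity between large documents can be approximated by comparing a few hundred integers.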
Last Mile of Data Science Magic
Principle 1: Reliability
“60% of the time, it works every time”
Test-driven data science
Learn how to build reliable data science systems from software engineers.
1. Write test fixtures with simulated or case-study data sets.
2. Write automated tests that check that your system works on fixtures, and add new ones when it doesn’t.
3. (Bonus) Test input data to ensure it meets all assumptions.
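The three steps above might look like the following sketch, which uses a least-squares line fit as a stand-in for the system under test. Every name here (`simulate_linear_fixture`, `fit_line`, `validate_inputs`, the test function) and every tolerance is hypothetical.

```python
import math
import random


def simulate_linear_fixture(slope, intercept, n=200, noise=0.1, seed=0):
    """Step 1: a simulated data set whose true parameters are known."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def validate_inputs(xs, ys):
    """Step 3: fail loudly when the input data violates an assumption."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    if any(math.isnan(v) or math.isinf(v) for v in xs + ys):
        raise ValueError("inputs must be finite numbers")
    if len(set(xs)) < 2:
        raise ValueError("xs must contain at least two distinct values")


def fit_line(xs, ys):
    """The system under test: ordinary least squares for one feature."""
    validate_inputs(xs, ys)
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def test_fit_recovers_known_parameters():
    """Step 2: an automated check against the fixture."""
    xs, ys = simulate_linear_fixture(slope=2.0, intercept=-1.0)
    slope, intercept = fit_line(xs, ys)
    assert abs(slope - 2.0) < 0.05
    assert abs(intercept + 1.0) < 0.1


test_fit_recovers_known_parameters()
```

When the system fails on real data, the failing case can be turned into a new fixture and a new test, so the same failure is caught automatically next time.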
Principle 2: Latency + Interactivity
“How many hypotheses per second are you testing/generating?”
Answer more questions
People have good intuitions and tend to search effectively given understandable tools.
First-order effect of speed: more answers per second.
Second-order effect of speed: more questions asked.
• Deltoid: effortless experimentation
• Scuba: in-memory, distributed, sampled database
• Presto: aggressive caching, distributed SQL query engine
Principle 3: Simplicity + Modularity
Choose one thing to do very well
• It makes it easier to optimize your technology.
• It makes it easier for people to understand what it does.
• It makes people more likely to build around it.