This page reproduces the content of https://speakerdeck.com/alyssafrazee/am-i-a-data-scientist.

Uploaded about a year ago (2015/08/11), in Technology.

Slides from my talk at JSM 2015, in the session: "The Statistics Identity Crisis: Are We Really Data Scientists?" https://www.amstat.org/meetings/jsm/2015/onlineprogram/ActivityDetails.cfm?SessionID=211266

- © 2009 Lisa Slavid
- © 2009 Lisa Slavid

statistician

a data scientist

- Where I’m coming from

Math undergrad

Biostatistics PhD

“Machine Learning Engineer”

Recurse Center (née Hacker School)

2010

today

- Am I a data scientist?

What do I really mean by this question?

Could I get a job offer with a title of “data scientist?”

- Am I a data scientist?

What do I really mean by this question?

Am I preparing my students to be able to get job offers with a title of “data scientist?”

- Am I a data scientist?

What do I really mean by this question?

Could I get a job offer with a title of “data scientist?”

→ sometimes implicitly industry

→ and sometimes specifically tech

- What’s “data science”?
- data skills spectrum

- theoretical statistics

software engineering

- theoretical statistics

data science

software engineering

- understanding quantitative data

data science

building a product

- output: numerical results

data science

output: usable software

- Am I a statistician?

points for:

● Am in a grad program called [bio]statistics

● Know things about martingales and the delta method

● Can explain what a p-value is and interpret linear regression coefficients

points against:

● Haven’t proved a theorem since 2011

● Spend more time writing bash scripts than inventing estimators

● No publications in statistics journals

- Or am I a data scientist?

points for:

● Can program in more than one language

● Actively use git & GitHub

● Have written R packages and reproducible reports

● Once made a web app and also a D3.js graph

points against:

● Not working in industry

● Have never written a SQL query more complicated than select * from table

● Understanding of Hadoop, Spark, and AWS is vague at best

● Have never written production code

- Idea! I will listen to what experts in our field say!

Camp #1: Data science is just a rebranding of applied statistics.

Camp #2: Statistics and data science are overlapping. Neither is a subset of the other.

Camp #3: Statistics is irrelevant to data science.

- First: do I want to be a data scientist?

- Second: Does it matter?

- Am I on the job market?

Am I hiring?

- If you decide it matters:

some distinguishing features

- Intentionality about programming

Spending time thinking primarily about:

● code efficiency

● version control

● code quality (cleanliness, modularity)

● documentation / usability

● unit testing

● systematic debugging

● giving and receiving code review

● and other principles of software engineering

- Interest in schleppy-but-practical projects

● figuring out how to get the data you need

● combining existing tools/methods in new ways

● finding the simplest solution that works in practice

- Focus on concrete decision-making

less about inference and parameter estimation, more about what action should be taken

- Camp #1: Data science is just a rebranding of applied statistics.

Camp #2: Statistics and data science are overlapping. Neither is a subset of the other.

Camp #3: Statistics is irrelevant to data science.

- Perspective from the other side

Camp #1: Data science is just a rebranding of applied statistics.

Intentionality about programming

- Perspective from the other side

Camp #1: Data science is just a rebranding of applied statistics.

The day-to-day work is different!

- Perspective from the other side

Last month I:

● wrote Ruby, Scala, CoffeeScript, and Python

● fought with Maven

● backfilled some busted tables in our databases

● investigated the mystery of why some of our cluster boxes are overworked

● learned how to be on call (so I can fix some of Stripe if it breaks at 3am)

● helped teach a SQL class

● and did some statistics

- Perspective from the other side

Camp #3: Statistics is irrelevant to data science.

- Perspective from the other side

Statistics and data science are overlapping. Neither is a subset of the other.

- About that identity crisis:

Program intentionally and be a data scientist, if you want!

- About that identity crisis:

Or don’t! Statistics is hugely important and relevant in its own right!

- Further reading:

● http://andrewgelman.com/2013/11/14/statistics-least-important-part-data-science/

● http://bulletin.imstat.org/2014/09/data-science-how-is-it-different-to-statistics%E2%80%89/

● https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/

● http://datascopeanalytics.com/blog/what-is-a-data-scientist/