This page reproduces the content of http://www.slideshare.net/BoozAllen/booz-allen-field-guide-to-data-science (uploaded 2016/02/11).

Booz Allen Hamilton created the Field Guide to Data Science to help organizations and missions understand how to make use of data as a resource. The Second Edition of the Field Guide, updated with new features and content, delivers our latest insights in a fast-changing field. http://bit.ly/1O78U42

THE FIELD GUIDE to DATA SCIENCE
SECOND EDITION

© COPYRIGHT 2015 BOOZ ALLEN HAMILTON INC. ALL RIGHTS RESERVED.

›› FOREWORD

Data Science touches every aspect of our lives on a daily basis. When we visit the doctor, drive our cars, get on an airplane, or shop for services, Data Science is changing the way we interact with and explore our world.

Our world is now measured, mapped, and recorded in digital bits. Entire lives, from birth to death, are now catalogued in the digital realm. These data, originating from such diverse sources as connected vehicles, underwater microscopic cameras, and photos we post to social media, have propelled us into the greatest age of discovery humanity has ever known. It is through Data Science that we are unlocking the secrets hidden within these data. We are making discoveries that will forever change how we live and interact with the world around us.

The impact of these changes is having a profound effect on humanity. We have propelled ourselves into this age of discovery through our incremental technological improvements. Data Science has become the catalyzing force behind our next evolutionary leap. Our own evolution is now inextricably linked to that of computers. The way we live our lives and the skills that are important to our very existence are directly dependent upon the functions Data Science can achieve on our behalf.

As we move into this new future, it is clearer than ever that businesses must adjust to these changes or risk being left behind. From influencing retail markets, to setting public health and safety policies, to addressing social unrest, organizations of all types are generating value through Data Science. Data is our new currency and Data Science is the mechanism by which we tap into it.

Data Science is an auspicious and profound way of applying our curiosity and technical tradecraft to solve humanity's toughest challenges. The growing power, importance, and responsibility of applying Data Science methodologies to these challenges is unimaginable. Our own biases and assumptions can have profound outcomes on business, national security, and our daily lives. A new class of practitioners and leaders is needed to navigate this new future. Data Scientists are our guides on this journey as they are creating radical new ways of thinking about data and the world around us.

We want to share our passion for Data Science and start a conversation with you. This is a journey worth taking.

›› THE STORY of THE FIELD GUIDE

Several years ago we created The Field Guide to Data Science because we wanted to help organizations of all types and sizes. There were countless industry and academic publications describing what Data Science is and why we should care, but very little information was available to explain how to make use of data as a resource. We find that situation to be just as true today as we did two years ago, when we created the first edition of the field guide.

At Booz Allen Hamilton, we built an industry-leading team of Data Scientists. Over the course of hundreds of analytic challenges for countless clients, we've unraveled the DNA of Data Science. Many people have put forth their thoughts on single aspects of Data Science. We believe we can offer a broad perspective on the conceptual models, tradecraft, processes and culture of Data Science – the what, the why, the who and the how. Companies with strong Data Science teams often focus on a single class of problems – graph algorithms for social network analysis and recommender models for online shopping are two notable examples. Booz Allen is different. In our role as consultants, we support a diverse set of government and commercial clients across a variety of domains. This allows us to uniquely understand the DNA of Data Science.

Our goal in creating The Field Guide to Data Science was to capture what we have learned and to share it broadly. The field of Data Science has continued to advance since we first released the field guide. As a result, we decided to release this second edition, incorporating a few new and important concepts. We also added technical depth and richness that we believe practitioners will find useful.

We want this effort to continue driving forward the science and art of Data Science.

This field guide came from the passion our team feels for its work. It is not a textbook, nor is it a superficial treatment. Senior leaders will walk away with a deeper understanding of the concepts at the heart of Data Science. Practitioners will add to their toolbox. We hope everyone will enjoy the journey.

›› WE ARE ALL AUTHORS of THIS STORY

We recognize that Data Science is a team sport. The Field Guide to Data Science provides Booz Allen Hamilton's perspective on the complex and sometimes mysterious field of Data Science. We cannot capture all that is Data Science. Nor can we keep up – the pace at which this field progresses outdates work as fast as it is produced. As a result, we opened this field guide to the world as a living document to bend and grow with technology, expertise, and evolving techniques.

Thank you to all the people who have emailed us your ideas, as well as the 100+ people who have watched, starred, or forked our GitHub repository. We truly value the input of the community as we work together to advance the science and art of Data Science. This is why we have included authors from outside Booz Allen Hamilton on this second edition of The Field Guide to Data Science.

If you find the guide to be useful, neat, or even lacking, then we encourage you to add your expertise, including:

›› Case studies from which you have learned
›› Citations from journal articles or papers that inspire you
›› Algorithms and techniques that you love
›› Your thoughts and comments on other people's additions

Email us your ideas and perspectives at data_science@bah.com or submit them via a pull request on the GitHub repository.

Join our conversation and take the journey with us. Tell us and the world what you know. Become an author of this story.

›› ACKNOWLEDGEMENTS

We would like to express our sincerest gratitude to all those who have made The Field Guide to Data Science such a success.

Thank you to the nearly 15,000 people who have downloaded the digital copy from our website and the 100+ people who have connected with The Field Guide on our GitHub page. We have been overwhelmed by the popularity of the work within the Data Science community.

Thank you to the educators and academics who have incorporated The Field Guide into your course work. We appreciate your trusting this guide as a way to introduce your students to Data Science. It is an honor to know that we are shaping the next generation of Data Scientists.

Thank you to all of the practitioners who are using The Field Guide as a resource. We are excited to know that the work has had such a strong influence, from shaping technical approaches to serving as the foundation for the very definition and role of Data Science within major government and commercial organizations.

Thank you to the organizational leaders who have shared your feedback, encouragement, and success stories. We are thrilled to know that The Field Guide has helped so many organizations, from energy, to life sciences, to retail, to begin their Data Science journeys.

We hope you will all continue to find value from The Field Guide to Data Science and to share in our excitement around the release of this second edition. Please continue to be part of the conversation and take this journey with us.

›› THE OUTLINE of OUR STORY

›› Meet Your Guides
›› The Short Version – The Core Concepts of Data Science
›› Start Here for the Basics – An Introduction to Data Science
  What Do We Mean by Data Science?
  How Does Data Science Actually Work?
  What Does It Take to Create a Data Science Capability?
›› Take off the Training Wheels – The Practitioner's Guide to Data Science
  Guiding Principles
  The Importance of Reason
  Component Parts of Data Science
  Fractal Analytic Model
  The Analytic Selection Process
  Guide to Analytic Selection
  Detailed Table of Analytics
›› Life in the Trenches – Navigating Neck Deep in Data
  Going Deep into Machine Learning
  Feature Engineering
  Feature Selection
  Ensemble Models
  Data Veracity
  Application of Domain Knowledge
  The Curse of Dimensionality
  Model Validation
›› Putting it all Together – Our Case Studies
  Streamlining Medication Review
  Reducing Flight Delays
  Making Vaccines Safer
  Forecasting the Relative Risk for the Onset of Mass Killings to Help Prevent Future Atrocities
  Predicting Customer Response
›› Closing Time
  The Future of Data Science
  Parting Thoughts
  References
  About Booz Allen Hamilton

›› MEET YOUR GUIDES

Fred Blackburn (@boozallen)
"Data Science is a field that is evolving at a very rapid pace…be part of the journey."

Josh Sullivan (@joshdsullivan)
"Leading our Data Science team shows me every day the incredible power of discovery and human curiosity. Don't be afraid to blend art and science to advance your own view of data analytics – it can be a powerful mixture."

Peter Guerra (@petrguerra)
"Data Science is the most fascinating blend of art and math and code and sweat and tears. It can take you to the highest heights and the lowest depths in an instant, but it is the only way we will be able to understand and describe the why."

Angela Zutavern (@angelazutavern)
"Data Science is about asking bigger questions, seeing future possibilities, and creating outcomes you desire."

Steve Escaravage (@sescarav)
"Invest your time and energy in data that is difficult to assemble. If it doesn't exist, find a way to make it exist."

Ezmeralda Khalil (@ezmeraldakhalil)
"The power of data science lies in the execution."

Steven Mills (@stevndmills)
"Data Science truly can change the world."

Alex Cosmas (@boozallen)
"Data scientists should be truth-seekers, not fact-seekers."

Brian Keller (@boozallen)
"Grit will get you farther than talent."

Stephanie Beben (@boozallen)
"Begin every new data challenge with deep curiosity along with a healthy dose of skepticism."

Kirk Borne (@KirkDBorne)
"Focus on value, not volume."

Drew Farris (@drewfarris)
"Don't forget to play. Play with tools, play with data, and play with algorithms. You just might discover something that will help you solve that next nagging problem."

Paul Yacci (@paulyacci)
"In the jungle of data, don't miss the forest for the trees, or the trees for the forest."

Charles Glover (@MindAfterMath)
"The beauty of data science lies in satisfying curiosities about important problems by playing with data and algorithms."

Michael Kim (@boozallen)
"Data science is both an art and science."

Stephanie Rivera (@boozallen)
"I treat Data Science like I do rock climbing: awesome dedication leads to incremental improvement. Persistence leads to the top."

Aaron Sander (@ajsander)
"Data science is changing corporate culture to be more like the open source environment. More open, more collaborative, and faster paced."

We would like to thank the following people for their contributions and edits: Tim Andrews, Mike Delurey, Greg Dupier, Jason Escaravage, Christine Fantaskey, Juergen Klenk, Dan Liebermann, Mark Rockley and Katie Wilks.

›› COMMUNITY CONTRIBUTORS

Will Cukierski, kaggle (@kaggle)
"Two roads diverged in a wood, and I—
I took the one in the direction of the negative gradient,
And that has made all the difference."

Mark Herman (@cloudEBITDA)
"End every analysis with… 'and therefore.'"

Ed Kohlwey (@ekohlwey)
"Data Science is about formally analyzing everything around you and becoming data driven."

Armen Kherlopian (@akherlopian)
"A Data Scientist must continuously seek truth in spite of ambiguity; therein rests the basis of rigor and insight."

›› THE SHORT VERSION

› Data Science is the art of turning data into actions.
It's all about the tradecraft. Tradecraft is the process, tools and technologies for humans and computers to work together to transform data into insights.

› Data Science tradecraft creates data products.
Data products provide actionable information without exposing decision makers to the underlying data or analytics (e.g., buy/sell strategies for financial instruments, a set of actions to improve product yield, or steps to improve product marketing).

› Data Science supports and encourages shifting between deductive (hypothesis-based) and inductive (pattern-based) reasoning.
This is a fundamental change from traditional analysis approaches. Inductive reasoning and exploratory data analysis provide a means to form or refine hypotheses and discover new analytic paths. Models of reality no longer need to be static. They are constantly tested, updated and improved until better models are found.

› Data Science is necessary for companies to stay with the pack and compete in the future.
Organizations are constantly making decisions based on gut instinct, loudest voice and best argument – sometimes they are even informed by real information. The winners and the losers in the emerging data economy are going to be determined by their Data Science teams.

› Data Science capabilities can be built over time.
Organizations mature through a series of stages – Collect, Describe, Discover, Predict, Advise – as they move from data deluge to full Data Science maturity. At each stage, they can tackle increasingly complex analytic goals with a wider breadth of analytic capabilities. However, organizations need not reach maximum Data Science maturity to achieve success. Significant gains can be found in every stage.

› Data Science is a different kind of team sport.
Data Science teams need a broad view of the organization. Leaders must be key advocates who meet with stakeholders to ferret out the hardest challenges, locate the data, connect disparate parts of the business, and gain widespread buy-in.

›› START HERE for THE BASICS
AN INTRODUCTION TO DATA SCIENCE

If you haven't heard of Data Science, you're behind the times. Just renaming your Business Intelligence group the Data Science group is not the solution.

What Do We Mean by Data Science?

Describing Data Science is like trying to describe a sunset – it should be easy, but somehow capturing the words is impossible.

Data Science Defined

Data Science is the art of turning data into actions. This is accomplished through the creation of data products, which provide actionable information without exposing decision makers to the underlying data or analytics (e.g., buy/sell strategies for financial instruments, a set of actions to improve product yield, or steps to improve product marketing).

Performing Data Science requires the extraction of timely, actionable information from diverse data sources to drive data products. Examples of data products include answers to questions such as: "Which of my products should I advertise more heavily to increase profit? How can I improve my compliance program, while reducing costs? What manufacturing process change will allow me to build a better product?" The key to answering these questions is: understand the data you have and what the data inductively tells you.

» Data Product
A data product provides actionable information without exposing decision makers to the underlying data or analytics. Examples include:
• Movie Recommendations
• Weather Forecasts
• Stock Market Predictions
• Production Process Improvements
• Health Diagnosis
• Flu Trend Predictions
• Targeted Advertising

Read this for additional background:

The term Data Science appeared in the computer science literature throughout the 1960s-1980s. It was not until the late 1990s, however, that the field as we describe it here began to emerge from the statistics and data mining communities (e.g., [2] and [3]). Data Science was first introduced as an independent discipline in 2001.[4] Since that time, there have been countless articles advancing the discipline, culminating with Data Scientist being declared the sexiest job of the 21st century.[5]

We established our first Data Science team at Booz Allen in 2010. It began as a natural extension of our Business Intelligence and cloud infrastructure development work. We saw the need for a new approach to distill value from our clients' data. We approached the problem with a multidisciplinary team of computer scientists, mathematicians and domain experts. They immediately produced new insights and analysis paths, solidifying the validity of the approach. Since that time, our Data Science team has grown to 250 staff supporting dozens of clients across a variety of domains. This breadth of experience provides a unique perspective on the conceptual models, tradecraft, processes and culture of Data Science.

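The idea of a data product described above can be made concrete with a small sketch. Everything in it is invented for illustration: the product names, the margin-based scoring rule, and the `product_to_advertise` helper. The point is only that the decision maker receives an action, never the underlying data or analytics.

```python
from dataclasses import dataclass

@dataclass
class SaleRecord:
    product: str
    units: int
    margin_per_unit: float

def product_to_advertise(sales: list[SaleRecord]) -> str:
    """Return an actionable answer without exposing the data behind it."""
    # Score each product by total margin contribution; only the resulting
    # recommendation is ever shown to the decision maker.
    totals: dict[str, float] = {}
    for rec in sales:
        totals[rec.product] = totals.get(rec.product, 0.0) + rec.units * rec.margin_per_unit
    best = max(totals, key=lambda product: totals[product])
    return f"Advertise '{best}' more heavily"

sales = [
    SaleRecord("widget", 120, 2.5),  # 300.0 of margin
    SaleRecord("gadget", 70, 9.0),   # 630.0 of margin
    SaleRecord("widget", 80, 2.5),   # 200.0 of margin
]
print(product_to_advertise(sales))  # → Advertise 'gadget' more heavily
```

A real data product would sit behind the same kind of interface, with the analytics (recommender models, forecasts, optimizations) hidden inside the function boundary.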

What makes Data Science Different?

Data Science supports and encourages shifting between deductive (hypothesis-based) and inductive (pattern-based) reasoning. This is a fundamental change from traditional analytic approaches. Inductive reasoning and exploratory data analysis provide a means to form or refine hypotheses and discover new analytic paths. In fact, to do the discovery of significant insights that are the hallmark of Data Science, you must have the tradecraft and the interplay between inductive and deductive reasoning. By actively combining the ability to reason deductively and inductively, Data Science creates an environment where models of reality no longer need to be static and empirically based. Instead, they are constantly tested, updated and improved until better models are found. These concepts are summarized in the figure, The Types of Reason and Their Role in Data Science Tradecraft.

THE TYPES OF REASON…

DEDUCTIVE REASONING:
› Commonly associated with "formal logic."
› Involves reasoning from known premises, or premises presumed to be true, to a certain conclusion.
› The conclusions reached are certain, inevitable, inescapable.

INDUCTIVE REASONING:
› Commonly known as "informal logic," or "everyday argument."
› Involves drawing uncertain inferences, based on probabilistic reasoning.
› The conclusions reached are probable, reasonable, plausible, believable.

…AND THEIR ROLE IN DATA SCIENCE TRADECRAFT.

DEDUCTIVE REASONING:
› Formulate hypotheses about relationships and underlying models.
› Carry out experiments with the data to test hypotheses and models.

INDUCTIVE REASONING:
› Exploratory data analysis to discover or refine hypotheses.
› Discover new relationships, insights and analytic paths from the data.

Source: Booz Allen Hamilton
The Types of Reason and Their Role in Data Science Tradecraft
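The interplay summarized in the figure can be sketched in a few lines of code. The dataset below is synthetic and the variable names are invented; the snippet only illustrates the loop: an exploratory, inductive step surfaces a candidate relationship, and a deductive step then formalizes and tests it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic observations (illustrative only): ad spend weakly drives sales.
ad_spend = rng.uniform(0, 100, size=200)
sales = 50 + 0.4 * ad_spend + rng.normal(0, 10, size=200)

# Inductive step: exploratory analysis surfaces a candidate relationship.
observed_r = np.corrcoef(ad_spend, sales)[0, 1]

# Deductive step: formalize the hypothesis ("spend and sales are related")
# and test it. A permutation test asks: how often would a correlation this
# strong arise if the pairing between spend and sales were random?
perm_r = np.array([
    np.corrcoef(rng.permutation(ad_spend), sales)[0, 1]
    for _ in range(1_000)
])
p_value = np.mean(np.abs(perm_r) >= abs(observed_r))

print(f"observed r = {observed_r:.2f}, permutation p-value = {p_value:.3f}")
# A small p-value supports keeping the model; a large one sends us back
# to exploration to refine the hypothesis.
```

In practice the cycle repeats: each tested model suggests new exploratory questions, which in turn yield new hypotheses to test.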

The differences between Data Science and traditional analytic approaches do not end at seamless shifting between deductive and inductive reasoning. Data Science offers a distinctly different perspective than capabilities such as Business Intelligence. Data Science should not replace Business Intelligence functions within an organization, however. The two capabilities are additive and complementary, each offering a necessary view of business operations and the operating environment. The figure, Business Intelligence and Data Science – A Comparison, highlights the differences between the two capabilities. Key contrasts include:

› Discovery vs. Pre-canned Questions: Data Science actually works on discovering the question to ask as opposed to just asking it.

› Power of Many vs. Ability of One: An entire team provides a common forum for pulling together computer science, mathematics and domain expertise.

› Prospective vs. Retrospective: Data Science is focused on obtaining actionable information from data as opposed to reporting historical facts.

LOOKING BACKWARD AND FORWARD

FIRST THERE WAS BUSINESS INTELLIGENCE → NOW WE'VE ADDED DATA SCIENCE
› Deductive Reasoning → Inductive and Deductive Reasoning
› Backward Looking → Forward Looking
› Slice and Dice Data → Interact with Data
› Warehoused and Siloed Data → Distributed, Real Time Data
› Analyze the Past, Guess the Future → Predict and Advise
› Creates Reports → Creates Data Products
› Analytic Output → Actionable Answer; Answer Questions and Create New Ones

Source: Booz Allen Hamilton
Business Intelligence and Data Science – A Comparison (adapted in part from [6])


What is the Impact of Data Science?

As we move into the data economy, Data Science is the competitive advantage for organizations interested in winning – in whatever way winning is defined. The manner in which the advantage is gained is through improved decision-making. A former colleague liked to describe data-informed decision making like this: If you have perfect information or zero information, then your task is easy – it is in between those two extremes that the trouble begins. What he was highlighting is the stark reality that whether or not information is available, decisions must be made.

The way organizations make decisions has been evolving for half a century. Before the introduction of Business Intelligence, the only options were gut instinct, loudest voice, and best argument. Sadly, this method still exists today, and in some pockets it is the predominant means by which the organization acts. Take our advice and never, ever work for such a company!

Fortunately for our economy, most organizations began to inform their decisions with real information through the application of simple statistics. Those that did it well were rewarded; those that did not failed. We are outgrowing the ability of simple stats to keep pace with market demands, however. The rapid expansion of available data and the tools to access and make use of the data at scale are enabling fundamental changes to the way organizations make decisions.

Data Science is required to maintain competitiveness in the increasingly data-rich environment. Much like the application of simple statistics, organizations that embrace Data Science will be rewarded while those that do not will be challenged to keep pace. As more complex, disparate datasets become available, the chasm between these groups will only continue to widen. The figure, The Business Impacts of Data Science, highlights the value awaiting organizations that embrace Data Science.

DATA SCIENCE IS NECESSARY…

› 17-49% increase in productivity when organizations increase data usability by 10%
› 11-42% return on assets (ROA) when organizations increase data access by 10%
› 241% increase in ROI when organizations use big data to improve competitiveness
› 1000% increase in ROI when deploying analytics across most of the organization, aligning daily operations with senior management's goals, and incorporating big data
› 5-6% performance improvement for organizations making data-driven decisions

…TO COMPETE IN THE FUTURE

Source: Booz Allen Hamilton
The Business Impacts of Data Science (adapted from [7], [8] and [9])


What is Different Now?

For 20 years IT systems were built the same way. We separated the people who ran the business from the people who managed the infrastructure (and therefore saw data as simply another thing they had to manage). With the advent of new technologies and analytic techniques, this artificial – and highly ineffective – separation of critical skills is no longer necessary. For the first time, organizations can directly connect business decision makers to the data. This simple step transforms data from being 'something to be managed' into 'something to be valued.'

In the wake of the transformation, organizations face a stark choice: you can continue to build data silos and piece together disparate information, or you can consolidate your data and distill answers. From the Data Science perspective, this is a false choice: the siloed approach is untenable when you consider (a) the opportunity cost of not making maximum use of all available data to help an organization succeed, and (b) the resource and time costs of continuing down the same path with outdated processes. The tangible benefits of data products include:

› Opportunity Costs: Because Data Science is an emerging field, opportunity costs arise when a competitor implements and generates value from data before you. Failure to learn and account for changing customer demands will inevitably drive customers away from your current offerings. When competitors are able to successfully leverage Data Science to gain insights, they can drive differentiated customer value propositions and lead their industries as a result.

› Enhanced Processes: As a result of the increasingly interconnected world, huge amounts of data are being generated and stored every instant. Data Science can be used to transform data into insights that help improve existing processes. Operating costs can be driven down dramatically by effectively incorporating the complex interrelationships in data like never before. This results in better quality assurance, higher product yield and more effective operations.


How does Data Science Actually Work?

It's not rocket science… it's something better – Data Science.

Let's not kid ourselves – Data Science is a complex field. It is difficult, intellectually taxing work, which requires the sophisticated integration of talent, tools and techniques. But as a field guide, we need to cut through the complexity and provide a clear, yet effective way to understand this new world.

To do this, we will transform the field of Data Science into a set of simplified activities as shown in the figure, The Four Key Activities of a Data Science Endeavor. Data Science purists will likely disagree with this approach, but then again, they probably don't need a field guide, sitting as they do in their ivory towers! In the real world, we need clear and simple operating models to help drive us forward.

THE FIELD GUIDE to D A T A S C I E N C E - High

Degree

of

Effort

Setup

Try

Do

Evaluate

Evaluate

1

2

3

4

Low

Acquire

Prepare

Analyze

Act

Data Science Activities

Activity 1: Acquire

Activity 2: Prepare

Activity 3: Analyze

Activity 4: Act

This activity focuses

Great outcomes

This is the activity

Every effective

on obtaining the

don’t just happen

that consumes the

Data Science team

data you need.

by themselves.

lion’s share of the

analyzes its data

Given the nature of

A lot depends on

team’s attention.

with a purpose

data, the details of

preparation, and

It is also the most

– that is, to turn

this activity depend

in Data Science,

challenging and

data into actions.

heavily on who you

that means

exciting (you will

Actionable and

are and what you

manipulating the

see a lot of ‘aha

impactful insights

do. As a result, we

data to fit your

moments’ occur in

are the holy grail

will not spend a

analytic needs.

this space). As the

of Data Science.

lot of time on this

This stage can

most challenging

Converting insights

activity other than

consume a great

and vexing of the

into action can be a

to emphasize its

deal of time, but

four activities,

politically charged

importance and

it is an excellent

this field guide

activity, however.

to encourage an

investment. The

focuses on helping

This activity

expansive view on

benefits are

you do this better

depends heavily

which data can and

immediate and

and faster.

on the culture and

should be used.

long term.

character of your

organization, so

we will leave you

to figure out those

details for yourself.

Source: Booz Allen Hamilton

The Four Key Activities of a Data Science Endeavor
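The four activities can be sketched as a simple pipeline in which each stage hands its output to the next. The data, the stage bodies, and the final recommendation below are all placeholders invented for illustration; real Acquire and Analyze stages are, of course, far richer.

```python
from typing import Any

def acquire() -> list[dict[str, Any]]:
    """Activity 1: obtain the data you need (stubbed with inline records)."""
    return [{"region": "east", "revenue": 120.0},
            {"region": "west", "revenue": None},  # a bad record
            {"region": "east", "revenue": 95.0}]

def prepare(raw: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Activity 2: manipulate the data to fit your analytic needs."""
    return [rec for rec in raw if rec["revenue"] is not None]

def analyze(rows: list[dict[str, Any]]) -> dict[str, float]:
    """Activity 3: build the analytic that creates value from the data."""
    totals: dict[str, float] = {}
    for rec in rows:
        totals[rec["region"]] = totals.get(rec["region"], 0.0) + rec["revenue"]
    return totals

def act(totals: dict[str, float]) -> str:
    """Activity 4: turn data into an action."""
    top = max(totals, key=lambda region: totals[region])
    return f"Invest further in the '{top}' region"

print(act(analyze(prepare(acquire()))))  # → Invest further in the 'east' region
```

The value of the framing is the hand-off: each stage has a clear input and output, so effort (and tooling) can be invested where the figure shows it is needed most.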


Acquire

All analysis starts with access to data, and for the Data Scientist this axiom holds true. But there are some significant differences – particularly with respect to the question of who stores, maintains and owns the data in an organization.

But before we go there, let's look at what is changing. Traditionally, rigid data silos artificially define the data to be acquired. Stated another way, the silos create a filter that lets in a very small amount of data and ignores the rest. These filtered processes give us an artificial view of the world based on the 'surviving data,' rather than one that shows full reality and meaning. Without a broad and expansive dataset, we can never immerse ourselves in the diversity of the data. We instead make decisions based on limited and constrained information.

Eliminating the need for silos gives us access to all the data at once – including data from multiple outside sources. It embraces the reality that diversity is good and complexity is okay. This mindset creates a completely different way of thinking about data in an organization by giving it a new and differentiated role. Data represents a significant new profit and mission-enhancement opportunity for organizations. But as mentioned earlier, this first activity is heavily dependent upon the situation and circumstances. We can't leave you with anything more than general guidance to help ensure maximum value:

› Look inside first: What data do you have current access to that you are not using? This is in large part the data being left behind by the filtering process, and may be incredibly valuable.

› Remove the format constraints: Stop limiting your data acquisition mindset to the realm of structured databases. Instead, think about unstructured and semi-structured data as viable sources.

› Figure out what's missing: Ask yourself what data would make a big difference to your processes if you had access to it. Then go find it!

› Embrace diversity: Try to engage and connect to publicly available sources of data that may have relevance to your domain area.

» Not All Data Is Created Equal
As you begin to aggregate data, remember that not all data is created equally. Organizations have a tendency to collect any data that is available. Data that is nearby (readily accessible and easily obtained) may be cheap to collect, but there is no guarantee it is the right data to collect. Focus on the data with the highest ROI for your organization. Your Data Science team can help identify that data. Also remember that you need to strike a balance between the data that you need and the data that you have. Collecting huge volumes of data is useless and costly if it is not the data that you need.
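"Remove the format constraints" can be seen in a small sketch: structured (CSV), semi-structured (JSON) and unstructured (free text) sources all feed one uniform record list. The sources are inlined strings here so the sketch is self-contained, and the field names are invented; real acquisition would read from files, APIs or feeds.

```python
import csv
import io
import json

# Illustrative sources of three different shapes.
csv_source = "customer,amount\nacme,100\nglobex,250\n"
json_source = '[{"customer": "initech", "amount": 75}]'
text_source = "Support ticket: globex reports a billing issue."

records: list[dict] = []

# Structured data: parse against a known schema.
records.extend(csv.DictReader(io.StringIO(csv_source)))

# Semi-structured data: parse now, normalize field types later.
records.extend(json.loads(json_source))

# Unstructured data: keep the raw text; extraction happens in Prepare.
records.append({"customer": None, "raw_text": text_source})

print(len(records))  # → 4
```

Nothing is discarded at acquisition time; deciding what each record means is deferred to the Prepare and Analyze activities.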

THE FIELD GUIDE to D A T A S C I E N C E - Prepare

Once you have the data, you need to prepare it for analysis.

Organizations often make decisions based on inexact data. Data

stovepipes mean that organizations may have blind spots. They are

not able to see the whole picture and fail to look at their data and

challenges holistically. The end result is that valuable information is

withheld from decision makers. Research has shown almost 33% of

decisions are made without good data or information. [10]

When Data Scientists are able to explore and analyze all the data, new

opportunities arise for analysis and data-driven decision making. The

insights gained from these new opportunities will significantly change

the course of action and decisions within an organization. Gaining

access to an organization’s complete repository of data, however,

requires preparation.

Our experience shows time and time again that the best tool for

Data Scientists to prepare for analysis is a lake – specifically, the Data

Lake.[11] This is a new approach to collecting, storing and integrating

data that helps organizations maximize the utility of their data.

Instead of storing information in discrete data structures, the Data

Lake consolidates an organization’s complete repository of data in

a single, large view. It eliminates the expensive and cumbersome

data-preparation process, known as Extract/Transform/Load (ETL),

necessary with data silos. The entire body of information in the Data

Lake is available for every inquiry – and all at once.
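The Data Lake’s “schema-on-read” idea – keep raw records in their original form and apply structure only at query time, rather than forcing everything through an upfront ETL schema – can be illustrated with a minimal sketch. Everything here is hypothetical (toy records, field names, and the `read_record` helper); a real Data Lake runs on distributed storage, not a Python list:

```python
import json

# Toy "data lake": heterogeneous raw records kept in their original form.
lake = [
    json.dumps({"type": "sale", "zip": "20170", "amount": 25.0}),
    json.dumps({"type": "weather", "zip": "20170", "temp_f": 88}),
    "20171,sale,12.5",  # a CSV-style record from another source
]

def read_record(raw):
    """Schema-on-read: interpret each raw record at query time."""
    if raw.startswith("{"):
        return json.loads(raw)
    zip_code, rec_type, amount = raw.split(",")
    return {"type": rec_type, "zip": zip_code, "amount": float(amount)}

# One query over the entire body of data, with no prior transformation step.
sales_total = sum(
    r["amount"] for r in map(read_record, lake) if r["type"] == "sale"
)  # 25.0 + 12.5 = 37.5
```

The design point is that the weather record needed no cleaning or schema mapping to coexist with the sales records; it is simply ignored by queries that do not need it.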

Start Here for the Basics

Analyze

We have acquired the data… we have prepared it… now it is time to

analyze it.

The Analyze activity requires the greatest effort of all the activities

in a Data Science endeavor. The Data Scientist actually builds the

analytics that create value from data. Analytics in this context is

an iterative application of specialized and scalable computational

resources and tools to provide relevant insights from exponentially

growing data. This type of analysis enables real-time understanding

of risks and opportunities by evaluating situational, operational and

behavioral data.

With the totality of data fully accessible in the Data Lake,

organizations can use analytics to find the kinds of connections and

patterns that point to promising opportunities. This high-speed

analytic connection is done within the Data Lake, as opposed to

older style sampling methods that could only make use of a narrow

slice of the data. In order to understand what was in the lake, you had

to bring the data out and study it. Now you can dive into the lake,

bringing your analytics to the data. The figure, Analytic Connection in

the Data Lake, highlights the concept of diving into the Data Lake to

discover new connections and patterns.

Source: Booz Allen Hamilton

Analytic Connection in the Data Lake

Data Scientists work across the spectrum of analytic goals – Describe,

Discover, Predict and Advise. The maturity of an analytic capability

determines the analytic goals encompassed. Many variables play key

roles in determining the difficulty and suitability of each goal for an

organization. Some of these variables are the size and budget of an

organization and the type of data products needed by the decision

makers. A detailed discussion on analytic maturity can be found in

Data Science Maturity within an Organization.

In addition to consuming the greatest effort, the Analyze activity

is by far the most complex. The tradecraft of Data Science is an

art. While we cannot teach you how to be an artist, we can share

foundational tools and techniques that can help you be successful.

The entirety of Take Off the Training Wheels is dedicated to sharing

insights we have learned over time while serving countless clients.

This includes descriptions of a Data Science product lifecycle and

the Fractal Analytic Model (FAM). The Analytic Selection Process and

accompanying Guide to Analytic Selection provide key insights into one

of the most challenging tasks in all of Data Science – selecting the

right technique for the job.

Act

Now that we have analyzed the data, it’s time to take action.

The ability to make use of the analysis is critical. It is also very

situational. Like the Acquire activity, the best we can hope for is to

provide some guiding principles to help you frame the output for

maximum impact. Here are some key points to keep in mind when

presenting your results:

1. The finding must make sense with relatively little up-front

training or preparation on the part of the decision maker.

2. The finding must make the most meaningful patterns, trends

and exceptions easy to see and interpret.

3. Every effort must be made to encode quantitative data

accurately so the decision maker can accurately interpret and

compare the data.

4. The logic used to arrive at the finding must be clear and

compelling as well as traceable back through the data.

5. The findings must answer real business questions.

Data Science Maturity within an Organization

The four activities discussed thus far provide a simplified view of Data

Science. Organizations will repeat these activities with each new Data

Science endeavor. Over time, however, the level of effort necessary

for each activity will change. As more data is Acquired and Prepared

in the Data Lake, for example, significantly less effort will need to

be expended on these activities. This is indicative of a maturing Data

Science capability.

Assessing the maturity of your Data Science capability calls for a

slightly different view. We use The Data Science Maturity Model as

a common framework for describing the maturity progression and

components that make up a Data Science capability. This framework

can be applied to an organization’s Data Science capability or even

to the maturity of a specific solution, namely a data product. At each

stage of maturity, powerful insight can be gained.

[Figure: The Data Science Maturity Model – proportion of effort devoted to each activity across the stages of maturity (Data Silos, Collect, Describe, Discover, Predict, Advise). Source: Booz Allen Hamilton]

When organizations start out, they have Data Silos. At this stage,

they have not carried out any broad Aggregate activities. They may

not have a sense of all the data they have or the data they need. The

decision to create a Data Science capability signals the transition into

the Collect stage.

All of your initial effort will be focused on identifying and aggregating

data. Over time, you will have the data you need and a smaller

proportion of your effort can focus on Collect. You can now begin to

Describe your data. Note, however, that while the proportion of time

spent on Collect goes down dramatically, it never goes away entirely.

This is indicative of the four activities outlined earlier – you will

continue to Aggregate and Prepare data as new analytic questions

arise, additional data is needed and new data sources become available.

Organizations continue to advance in maturity as they move through

the stages from Describe to Advise. At each stage they can tackle

increasingly complex analytic goals with a wider breadth of analytic

capabilities. As described for Collect, each stage never goes away

entirely. Instead, the proportion of time spent focused on it goes

down and new, more mature activities begin. A brief description

of each stage of maturity is shown in the table The Stages of Data

Science Maturity.

The Stages of Data Science Maturity

Collect: Focuses on collecting internal or external datasets. Example: gathering sales records and corresponding weather data.

Describe: Seeks to enhance or refine raw data as well as leverage basic analytic functions such as counts. Example: How are my customers distributed with respect to location, namely zip code?

Discover: Identifies hidden relationships or patterns. Example: Are there groups within my regular customers that purchase similarly?

Predict: Utilizes past observations to predict future observations. Example: Can we predict which products certain customer groups are more likely to purchase?

Advise: Defines your possible decisions, optimizes over those decisions, and advises to use the decision that gives the best outcome. Example: Your advice is to target advertising at specific groups for certain products to maximize revenue.

Source: Booz Allen Hamilton
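The Describe-stage question – how are my customers distributed with respect to zip code? – reduces to a simple count. A minimal sketch with hypothetical customer records:

```python
from collections import Counter

# Hypothetical customer records; a Describe-stage analytic simply
# counts how customers are distributed across zip codes.
customers = [
    {"id": 1, "zip": "20170"},
    {"id": 2, "zip": "20170"},
    {"id": 3, "zip": "22102"},
]

by_zip = Counter(c["zip"] for c in customers)
# by_zip is Counter({"20170": 2, "22102": 1})
```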

The maturity model provides a powerful tool for understanding and appreciating the maturity of a Data Science capability. Organizations need not reach maximum maturity to achieve success. Significant gains can be found in every stage. We believe strongly that one does not engage in a Data Science effort, however, unless it is intended to produce an output – that is, you have the intent to Advise. This means simply that each step forward in maturity drives you to the right in the model diagram. Moving to the right requires the correct processes, people, culture and operating model – a robust Data Science capability. What Does it Take to Create a Data Science Capability? addresses this topic.

We have observed very few organizations actually operating at the highest levels of maturity, the Predict and Advise stages. The tradecraft of Discover is only now maturing to the point that organizations can focus on advanced Predict and Advise activities. This is the new frontier of Data Science. This is the space in which we will begin to understand how to close the cognitive gap between humans and computers. Organizations that reach Advise will be met with true insights and real competitive advantage.

» Where does your organization fall in analytic maturity? Take the quiz!

1. How many data sources do you collect?
   a. Why do we need a bunch of data? – 0 points, end here.
   b. I don’t know the exact number. – 5 points
   c. We identified the required data and collect it. – 10 points

2. Do you know what questions your Data Science team is trying to answer?
   a. Why do we need questions? – 0 points
   b. No, they figure it out for themselves. – 5 points
   c. Yes, we evaluated the questions that will have the largest impact to the business. – 10 points

3. Do you know the important factors driving your business?
   a. I have no idea. – 0 points
   b. Our quants help me figure it out. – 5 points
   c. We have a data product for that. – 10 points

4. Do you have an understanding of future conditions?
   a. I look at the current conditions and read the tea leaves. – 0 points
   b. We have a data product for that. – 5 points

5. Do you know the best course of action to take for your key decisions?
   a. I look at the projections and plan a course. – 0 points
   b. We have a data product for that. – 5 points

Check your score:
0 – Data Silos, 5-10 – Collect, 10-20 – Describe, 20-30 – Discover, 30-35 – Predict, 35-40 – Advise

Source: Booz Allen Hamilton
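The quiz scoring can be sketched as a small function. Note that the printed ranges overlap at their boundaries (a score of 10 appears under both Collect and Describe); this hypothetical implementation assigns a boundary score to the higher stage:

```python
def maturity_stage(score):
    """Map a total quiz score (0-40) to a Data Science maturity stage.

    Boundary scores (10, 20, 30, 35) are assigned to the higher stage,
    since the printed ranges overlap at those points.
    """
    if score < 5:
        return "Data Silos"
    if score < 10:
        return "Collect"
    if score < 20:
        return "Describe"
    if score < 30:
        return "Discover"
    if score < 35:
        return "Predict"
    return "Advise"
```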

What Does it Take to Create a Data Science Capability?

Data Science is all about building teams and culture.

Many organizations (both commercial and government) see the

potential in capitalizing on data to unlock operational efficiencies,

to create new services and experiences, and to propel innovation.

Unfortunately, too many business leaders invest in one-off technical

solutions— with a big price tag and mixed results— instead of

investing in building a strategic Data Science capability. A Data

Science capability embeds and operationalizes Data Science across

an enterprise such that it can deliver the next level of organizational

performance and return on investment. A Data Science capability

moves an organization beyond performing pockets of analytics to an

enterprise approach that uses analytical insights as part of the normal

course of business. When building a capability, it is important for an

organization to first identify its analytic goals (i.e., what it is trying

to achieve through analytics) and then assess its readiness to achieve

those goals – examining both technical readiness and organizational

readiness. An organization can then make strategic choices on how to

address gaps and begin to build their capability.

Building Your Data Science Team

A critical component to any Data Science capability is having the

right team. Data Science depends on a diverse set of skills as shown

in The Data Science Venn Diagram. Computers provide the

environment in which data-driven hypotheses are tested, and as such,

computer science is necessary for data manipulation and processing.

Mathematics provides the theoretical structure in which Data Science

problems are examined. A rich background in statistics, geometry,

linear algebra, and calculus is important to understand the basis

for many algorithms and tools. Finally, domain expertise contributes

to an understanding of what problems actually need to be solved,

what kind of data exists in the domain, and how the problem space

may be instrumented and measured.

[Figure: The Data Science Venn Diagram (inspired by [12]) – Domain Expertise provides understanding of the reality in which a problem space exists; Computer Science provides the environment in which data products are created; Mathematics provides the theoretical structure in which Data Science problems are examined. Source: Booz Allen Hamilton]

Remember that Data Science is a team sport. Most of the time, you

will not be able to find the rare “unicorns” - people with expertise

across all three of the skill areas. Therefore, it is important to build a

blended team that covers all three elements of the Data Science

Venn Diagram.

BALANCING THE DATA SCIENCE TEAM EQUATION

Balancing the composition of a Data Science team

is much like balancing the reactants and products in

a chemical reaction. Each side of the equation must

represent the same quantity of any particular element.

In the case of Data Science, these elements are the

foundational technical skills Computer Science (CS),

Mathematics (M) and Domain Expertise (DE). The

reactants, your Data Scientists, each have their own

unique skills compositions. You must balance the staff

mix to meet the skill requirements of the Data Science

team, the product in the reaction. If you don’t correctly

balance the equation, your Data Science team will not

have the desired impact on the organization.

2 CS M2 + 2 CS + M DE → CS4 M5 DE

In the example above, your project requires four parts

computer science, five parts mathematics and one part

domain expertise. Given the skills mix of the staff, five

people are needed to balance the equation. Throughout

your Data Science project, the skills requirements of

the team will change. You will need to re-balance the

equation to ensure the reactants balance with

the products.

Source: Booz Allen Hamilton
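The balancing exercise is easy to check mechanically. A minimal sketch that verifies the example equation by totaling element counts, with the skill profiles taken directly from the reactants above:

```python
from collections import Counter

# Skill composition of each Data Scientist (the "reactants") and the team
# requirement (the "product"), from the example: 2 CS M2 + 2 CS + M DE -> CS4 M5 DE.
reactants = (
    2 * [Counter({"CS": 1, "M": 2})]   # two people with CS + double-M profiles
    + 2 * [Counter({"CS": 1})]         # two pure computer scientists
    + [Counter({"M": 1, "DE": 1})]     # one mathematician with domain expertise
)
required = Counter({"CS": 4, "M": 5, "DE": 1})

total = sum(reactants, Counter())  # element-wise sum across all five people
balanced = total == required       # True: the staffing mix meets the requirement
```

Re-running the same check as the project’s skill requirements change is exactly the re-balancing the sidebar describes.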

Understanding What Makes a Data Scientist

Data Science often requires a significant investment of time across

a variety of tasks. Hypotheses must be generated and data must be

acquired, prepared, analyzed, and acted upon. Multiple techniques

are often applied before one yields interesting results. If that seems

daunting, it is because it is. Data Science is difficult, intellectually

taxing work, which requires lots of talent: both tangible technical

skills as well as the intangible “x-factors.”

There are four independent yet comprehensive foundational Data Science competency clusters that, when considered together, convey the essence of what it means to be a successful Data Scientist. There are also reach back competencies that complement the foundational clusters but do not define the core tradecraft or attributes of the Data Science team.

» The Triple Threat Unicorn: Individuals who are great at all three of the Data Science foundational technical skills are like unicorns – very rare, and if you’re ever lucky enough to find one they should be treated carefully. When you manage these people:

› Encourage them to lead your team, but not manage it. Don’t bog them down with responsibilities of management that could be done by other staff.

› Put extra effort into managing their careers and interests within your organization. Build opportunities for promotion into your organization that allow them to focus on mentoring other Data Scientists and progressing the state of the art while also advancing their careers.

› Make sure that they have the opportunity to present and spread their ideas in many different forums, but also be sensitive to their time.

Data Science Competency Framework (see [13] for complete framework)

Technical: “Knows How and What to do” – Competencies: Advanced Mathematics; Computer Science; Data Mining and Integration; Database Science; Research Design; Statistical Modeling; Machine Learning; Operations Research; Programming and Scripting. The technical competency cluster depicts the foundational technical and specialty knowledge and skills needed for successful performance in each job or role.

Data Science Consulting: “Can Do in a Client and Customer Environment” – Competencies: Collaboration and Teamwork; Communications; Data Science Consulting; Ethics and Integrity. The characteristics in the consulting competency cluster can help Data Scientists easily integrate into various market or domain contexts and partner with business units to understand the environment and solve complex problems.

Cognitive: “Able to Do or Learn to Do” – Competencies: Critical Thinking; Inductive and Deductive Reasoning; Problem Solving. The cognitive competency cluster represents the type of critical thinking and reasoning abilities (both inductive and deductive) a Data Scientist should have to perform their job.

Personality: “Willing or Motivated to Do” – Competencies: Adaptability/Flexibility; Ambiguity Tolerance; Detail Orientation; Innovation and Creativity; Inquisitiveness; Perseverance; Resilience and Hardiness; Self-Confidence; Work Ethic. The personality competency cluster describes the personality traits that drive behaviors that are beneficial to Data Scientists, such as inquisitiveness, creativity, and perseverance.

Reach Back Competencies for Data Science Teams: Business Acumen; Data Visualization; Domain Expertise; Program Management

Source: Booz Allen Hamilton

The most important qualities of Data Scientists tend to be the

intangible aspects of their personalities. Data Scientists are by nature

curious, creative, focused, and detail-oriented.

› Curiosity is necessary to peel apart a problem and examine the interrelationships between data that may appear superficially unrelated.

› Creativity is required to invent and try new approaches to solving a problem, which oftentimes have never been applied in such a context before.

› Focus is required to design and test a technique over days and weeks, find it doesn’t work, learn from the failure, and try again.

› Attention to Detail is needed to maintain rigor, and to detect and avoid over-reliance on intuition when examining data.

We have found the single most important attribute is flexibility in overcoming setbacks – the willingness to abandon one idea and try a new approach. Often, Data Science is a series of dead ends before, at last, the way forward is identified. It requires a unique set of personality attributes to succeed in such an environment. Technical skills can be developed over time: the ability to be flexible – and patient, and persistent – cannot.

» Don’t judge a book by its cover, or a Data Scientist by his or her degree in this case. Amazing Data Scientists can be found anywhere. Just look at the diverse and surprising sampling of degrees held by Our Experts:

› Bioinformatics
› Biomedical Engineering
› Biophysics
› Business
› Computer Graphics
› Computer Science
› English
› Forest Management
› History
› Industrial Engineering
› Information Technology
› Mathematics
› National Security Studies
› Operations Research
› Physics
› Wildlife & Fisheries Management

Finding the Athletes for Your Team

Building a Data Science team is complex. Organizations must simultaneously engage existing internal staff to create an “anchor” who can be used to recruit and grow the team, while at the same time undergo organizational change and transformation to meaningfully incorporate this new class of employee. Building a team starts with identifying existing staff within an organization who have a high aptitude for Data Science. Good candidates will have a formal background in any of the three foundational technical skills we mentioned, and will most importantly have the personality traits necessary for Data Science. They may often have advanced (masters or higher) degrees, but not always. The very first staff you identify should also have good leadership traits and a sense of purpose for the organization, as they will lead subsequent staffing and recruiting efforts. Don’t discount anyone – you will find Data Scientists in the strangest places with the oddest combinations of backgrounds.

Shaping the Culture

It is no surprise—building a culture is hard and there is just as

much art to it as there is science. It is about deliberately creating the

conditions for Data Science to flourish (for both Data Scientists and

the average employee). You can then step back to empower collective

ownership of an organic transformation.

Data Scientists are fundamentally curious and imaginative. We have

a saying on our team, “We’re not nosy, we’re Data Scientists.” These

qualities are fundamental to the success of the project and to gaining new dimensions on challenges and questions. Often Data Science projects are hampered by the lack of the ability to imagine something new and different. Fundamentally, organizations must foster trust and transparent communication across all levels, instead of deference to authority, in order to establish a strong Data Science team. Managers should be prepared to invite participation more frequently, and offer explanation or apology less frequently.

» “I'm not nosey, I'm a Data Scientist”: Always remember that unrelenting curiosity and imagination should be the hallmarks of Data Science. They are fundamental to the success of every Data Science project.

It is important to provide a path into the Data Science “club” and

to empower the average employee to feel comfortable and conversant

with Data Science. For something to be part of organizational

culture, it must be part of the fabric of the employee behavior.

That means employees must interact with and use data products

in their daily routines. Another key ingredient to shaping the

right culture is that all employees need a baseline of Data Science

knowledge, starting with a common lexicon, to facilitate productive

collaboration and instill confidence. While not everyone will be

Data Scientists, employees need to identify with Data Science and

be equipped with the knowledge, skills, and abilities to work with

Data Scientists to drive smarter decisions and deliver exponential

organizational performance.

Selecting Your Operating Model

Depending on the size, complexity, and the business drivers,

organizations should consider one of three Data Science operating

models: Centralized, Deployed, or Diffused. These three models are

shown in the figure, Data Science Operating Models.

[Figure: Data Science Operating Models (see [13] for complete descriptions) – Centralized: business units bring their problems to a centralized Data Science team. Deployed: small Data Science teams are forward deployed to business units. Diffused: Data Scientists are fully embedded within the business units. Source: Booz Allen Hamilton]

Centralized Data Science teams serve the organization across all business

units. The team is centralized under a Chief Data Scientist and they all

co-locate together. The domain experts come to this organization for

brief rotational stints to solve challenges around the business. This model

provides greater efficiency with limited Data Science resources but can also

create the perceived need to compete with other business units for Data

Science talent. To address this challenge, it is important to place emphasis

on portfolio management and creating transparency on how organizations

will identify and select Data Science projects.

Deployed Data Science teams go to the business unit and reside there for

short- or long-term assignments. They are their own entity and they work

with the domain experts within the group to solve hard problems. In

the deployed model, Data Science teams collectively develop knowledge

across business units, with central leadership as a bridging mechanism for

addressing organization-wide issues. However, Data Science teams are

accountable to business unit leadership and their centralized leadership,

which could cause confusion and conflict. In this model, it is important

to emphasize conflict management to avoid competing priorities.

The Diffused Data Science team is one that is fully embedded with each

group and becomes part of the long-term organization. These teams work

best when the nature of the domain or business unit is already one focused

on analytics. In the Diffused Model, teams can quickly react to high-

priority business unit needs. However, the lack of central management can

result in duplicate software and tools. Additionally, business units with the

most money will often have full access to analytics while other units have

none—this may not translate to the greatest organizational impact. In this

model, it is important to establish cross-functional groups that promote

organization-wide governance and peer collaboration.

Full descriptions of each operating model can be found in Booz Allen’s Tips for

Building a Data Science Capability [13].

How to Generate Momentum

A Data Science effort can start at the grass roots level by a few folks

tackling hard problems, or as directed by the Chief Executive Officer,

Chief Data Officer, or Chief Analytics Officer. Regardless of how an

effort starts, political headwinds often present more of a challenge

than any technical hurdles. To help battle the headwinds, it is

important to generate momentum and prove the value a Data Science

team can provide. The best way to achieve this is usually through

a Data Science prototype or proof of concept. Proofs of concepts

can generate the critical momentum needed to jump start any Data

Science Capability. Four qualities, in particular, are essential for every

Data Science prototype:

1. Organizational Buy-in: A prototype will only succeed if the

individuals involved believe in it and are willing to do what

they can to make it successful. A good way to gauge interest

is to meet with the middle managers; their views are usually

indicative of the larger group.

2. Clear ROI: Before choosing a prototype problem, ensure that

the ROI of the analytic output can be clearly and convincingly

demonstrated for both the project and the organization as a

whole. This outcome typically requires first reaching consensus

on how the ROI will be determined and measured, so that the

benefit can be quantified.

3. Necessary Data: Before selecting a prototype, you must first

determine exactly what data is needed, whether it will actually

be available, and what it will cost in terms of time and expense.

It is important to note that organizations do not need all the

possible data – they can still create successful analytics even

with some gaps.

4. Limited Complexity and Duration: The problem addressed

by the prototype should achieve a balance between being too

complex and too easy. Organizations new to Data Science often

try to show its value with highly complex projects. However,

the greater the complexity, the greater the risk of failure. At the

same time, if the problem is too easy to solve, senior leaders

and others in the organization may not see the need for Data

Science. Look for efforts that could benefit from large datasets,

or bringing together disparate datasets that have never been

combined before, as opposed to those that require complex

analytic approaches. In these cases, there is often low-hanging

fruit that can lead to significant value for the organization.

TAKE OFF the TRAINING WHEELS

THE PRACTITIONER’S GUIDE

TO DATA SCIENCE

Read this section to get beyond the hype and

learn the secrets of being a Data Scientist. - Guiding Principles

Failing is good; failing quickly is even better.

The set of guiding principles that govern how we conduct the

tradecraft of Data Science are based loosely on the central tenets

of innovation, as the two areas are highly connected. These principles

are not hard and fast rules to strictly follow, but rather key tenets

that have emerged in our collective consciousness. You should use

these to guide your decisions, from problem decomposition

through implementation.

› Be willing to fail. At the core of Data Science is the idea of

experimentation. Truly innovative solutions only emerge when

you experiment with new ideas and applications. Failure is an acceptable byproduct of experimentation. Failures locate regions that no longer need to be considered as you search for a solution.

» Tips From the Pros: It can be easier to rule out a solution than confirm its correctness. As a result, focus on exploring obvious shortcomings that can quickly disqualify an approach. This will allow you to focus your time on exploring truly viable approaches as opposed to dead ends.

› Fail often and learn quickly. In addition to a willingness to fail, be ready to fail repeatedly. There are times when a dozen approaches must be explored in order to find the one that works. While you shouldn’t be concerned with failing, you should strive to learn from the attempt quickly. The only way you can explore a large number of solutions is to do so quickly.

› Keep the goal in mind. You can often get lost in the details and

challenges of an implementation. When this happens, you lose

sight of your goal and begin to drift off the path from data to

analytic action. Periodically step back, contemplate your goal, and

evaluate whether your current approach can really lead you where

you want to go.

› Dedication and focus lead to success. You must often explore

many approaches before finding the one that works. It’s easy to

become discouraged. You must remain dedicated to your analytic

goal. Focus on the details and the insights revealed by the data.

Sometimes seemingly small observations lead to big successes.

» Tips From the Pros: If the first thing you try to do is to create the ultimate solution, you will fail, but only after banging your head against a wall for several weeks.

› Complicated does not equal better. As technical practitioners, we have a tendency to explore highly complex, advanced approaches. While there are times where this is necessary, a simpler approach can often provide the same insight. Simpler means easier and faster to prototype, implement and verify.

THE FIELD GUIDE to DATA SCIENCE

The Importance of Reason

Beware: in the world of Data Science, if it walks like a duck

and quacks like a duck, it might just be a moose.

Data Science supports and encourages shifting between deductive

(hypothesis-based) and inductive (pattern-based) reasoning.

Inductive reasoning and exploratory data analysis provide a means

to form or refine hypotheses and discover new analytic paths.

Models of reality no longer need to be static. They are constantly

tested, updated and improved until better models are found.

The analysis of big data has brought inductive reasoning to the

forefront. Massive amounts of data are analyzed to identify

correlations. However, a common pitfall to this approach is confusing
correlation with causation. Correlation implies but does not prove
causation. Conclusions cannot be drawn from correlations until the
underlying mechanisms that relate the data elements are understood.
Without a suitable model relating the data, a correlation may simply
be a coincidence.

» Correlation without Causation

A common example of this phenomenon is the high correlation
between ice cream consumption and the murder rate during the
summer months. Does this mean ice cream consumption causes
murder or, conversely, murder causes ice cream consumption? Most
likely not, but you can see the danger in mistaking correlation for
causation. Our job as Data Scientists is making sure we understand
the difference.
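The ice cream example is easy to reproduce with synthetic data. In this sketch (all names and numbers are invented for illustration), a hidden confounder, summer temperature, drives two otherwise unrelated series, which then correlate strongly:

```python
import random

random.seed(7)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx ** 0.5 * vy ** 0.5)

# A hidden confounder (daily temperature) drives both series.
temperature = [random.uniform(0, 35) for _ in range(365)]
ice_cream_sales = [2.0 * t + random.gauss(0, 5) for t in temperature]
crime_reports = [0.5 * t + random.gauss(0, 3) for t in temperature]

r = pearson(ice_cream_sales, crime_reports)
print(f"correlation: {r:.2f}")  # strong, yet neither series causes the other
```

The high coefficient reflects only the shared driver; condition on temperature and the relationship between the two series disappears.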

Take off the Training Wheels

›› The Dangers of Rejection

In the era of big data, one piece of analysis that is frequently
overlooked is the problem of finding patterns when there are actually
no apparent patterns. In statistics this is referred to as Type I error.

Paul Yacci

As scientists, we are always on the lookout for a new or interesting
breakthrough that could explain a phenomenon. We hope to see a
pattern in our data that explains something or that can give us an
answer. The primary goal of hypothesis testing is to limit Type I
error. This is accomplished by using small α values. For example, an
α value of 0.05 states that there is a 1 in 20 chance that the test will
show that there is something significant when in actuality there isn’t.
This problem compounds when testing multiple hypotheses. When
running multiple hypothesis tests, we are likely to encounter Type I
error. As more data becomes available for analysis, Type I error
needs to be controlled.

One of my projects required testing the difference between the means
of two microarray data samples. Microarray data contains thousands
of measurements but is limited in the number of observations. A
common analysis approach is to measure the same genes under
different conditions. If there is a significant enough difference in the
amount of gene expression between the two samples, we can say that
the gene is correlated with a particular phenotype. One way to do
this is to take the mean of each phenotype for a particular gene and
formulate a hypothesis to test whether there is a significant difference
between the means. Given that we were running thousands of these
tests at α = 0.05, we found several differences that were significant.
The problem was that some of these could be caused by
random chance.

Many corrections exist to control for false indications of significance.
The Bonferroni correction is one of the most conservative. This
calculation lowers the level below which you will reject the null
hypothesis (your p value). The formula is α/n, where n equals the
number of hypothesis tests that you are running. Thus, if you were
to run 1,000 tests of significance at α = 0.05, your p value should be
less than 0.00005 (0.05/1,000) to reject the null hypothesis. This is
obviously a much more stringent value. A large number of the
previously significant values were no longer significant, revealing the
true relationships within the data.

The corrected significance gave us confidence that the observed
expression levels were due to differences in the cellular gene
expression rather than noise. We were able to use this information
to begin investigating what proteins and pathways were active in the
genes expressing the phenotype of interest. By solidifying our
understanding of the causal relationships, we focused our research
on the areas that could lead to new discoveries about gene function
and, ultimately, to improved medical treatments.
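The Bonferroni arithmetic is simple to apply in code. A minimal sketch, using hypothetical gene names and p-values rather than the study’s actual data:

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under the Bonferroni correction: alpha/n."""
    return alpha / n_tests

# Hypothetical p-values from a batch of 1,000 tests; names and values
# are illustrative, not real results.
p_values = {"gene_A": 0.04, "gene_B": 0.00001, "gene_C": 0.0002}
n_tests = 1000
threshold = bonferroni_threshold(0.05, n_tests)

# Only results below the corrected threshold survive.
significant = [gene for gene, p in p_values.items() if p < threshold]
print(threshold)    # 5e-05
print(significant)  # ['gene_B']
```

With the uncorrected α = 0.05, all three hypothetical genes would have looked significant; the correction discards two of them.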

Reason and common sense are foundational to Data Science. Without
these, data is simply a collection of bits. Context, inferences and
models are created by humans and carry with them biases and
assumptions. Blindly trusting your analyses is a dangerous thing that
can lead to erroneous conclusions. When you approach an analytic
challenge, you should always pause to ask yourself the following
questions:

› What problem are we trying to solve? Articulate the answer as a
sentence, especially when communicating with the end-user. Make
sure that it sounds like an answer. For example, “Given a fixed
amount of human capital, deploying people with these priorities
will generate the best return on their time.”

› Does the approach make sense? Write out your analytic plan.
Embrace the discipline of writing, as it brings structure to your
thinking. Back of the envelope calculations are an existence proof
of your approach. Without this kind of preparation, computers are
power tools that can produce lots of bad answers really fast.

› Does the answer make sense? Can you explain the answer?
Computers, unlike children, do what they are told. Make sure you
spoke to it clearly by validating that the instructions you provided
are the ones you intended. Document your assumptions and make
sure they have not introduced bias in your work.

› Is it a finding or a mistake? Be skeptical of surprise findings.
Experience says that if it seems wrong, it probably is wrong. Before
you accept that conclusion, however, make sure you understand and
can clearly explain why it is wrong.

› Does the analysis address the original intent? Make sure that
you are not aligning the answer with the expectations of the client.
Always speak the truth, but remember that answers of “your baby
is ugly” require more, not less, analysis.

› Is the story complete? The goal of your analysis is to tell an
actionable story. You cannot rely on the audience to stitch the
pieces together. Identify potential holes in your story and fill them
to avoid surprises. Grammar, spelling and graphics matter; your
audience will lose confidence in your analysis if your results
look sloppy.

› Where would we head next? No analysis is ever finished, you just
run out of resources. Understand and explain what additional
measures could be taken if more resources are found.

» Tips From the Pros

Better a short pencil than a long memory. End every day by
documenting where you are; you may learn something along the way.
Document what you learned and why you changed your plan.

» Tips From the Pros

Test your answers with a friendly audience to make sure your
findings hold water.

Component Parts of

Data Science

There is a web of components that interact to create your

solution space. Understanding how they are connected

is critical to your ability to engineer solutions to Data

Science problems.

The components involved in any Data Science project fall into a

number of different categories including the data types analyzed, the

analytic classes used, the learning models employed and the execution

models used to run the analytics. The interconnection across these

components, shown in the figure, Interconnection Among the Component

Parts of Data Science, speaks to the complexity of engineering Data

Science solutions. A choice made for one component exerts influence

over choices made for other categories. For example, data types

lead the choices in analytic class and learning models, while latency,

timeliness and algorithmic parallelization strategy inform the

execution model. As we dive deeper into the technical aspects of

Data Science, we will begin with an exploration of these components

and touch on examples of each.

Read this to get the quick and dirty:

When engineering a Data Science solution, work from an
understanding of the components that define the solution space.
Regardless of your analytic goal, you must consider the data types
with which you will be working, the classes of analytics you will use
to generate your data product, how the learning models embodied
will operate and evolve, and the execution models that will govern
how the analytic will be run. You will be able to articulate a complete
Data Science solution only after considering each of these aspects.

[Figure: a web linking the component parts of Data Science: data
types (structured data, unstructured data, streaming data, batch
data), analytic classes (transforming, learning, predictive), learning
models (supervised learning, unsupervised learning, online learning,
offline learning) and execution models (serial, parallel, streaming
execution, batch execution).]

Source: Booz Allen Hamilton

Interconnection Among the Component Parts of Data Science

Data Types

Data types and analytic goals go hand-in-hand much like the chicken

and the egg; it is not always clear which comes first. Analytic goals are

derived from business objectives, but the data type also influences the

goals. For example, the business objective of understanding consumer

product perception drives the analytic goal of sentiment analysis.

Similarly, the goal of sentiment analysis drives the selection of a

text-like data type such as social media content. Data type also

drives many other choices when engineering your solutions.

There are a number of ways to classify data. It is common to

characterize data as structured or unstructured. Structured data exists

when information is clearly broken out into fields that have an

explicit meaning and are highly categorical, ordinal or numeric.

A related category, semi-structured, is sometimes used to describe

structured data that does not conform to the formal structure of

data models associated with relational databases or other forms

of data tables, but nonetheless contains tags or other markers.

Unstructured data, such as natural language text, has less clearly

delineated meaning. Still images, video and audio often fall under

the category of unstructured data. Data in this form requires

preprocessing to identify and extract relevant ‘features.’ The features
are structured information that can be used for indexing and retrieval,
or for training classification or clustering models.
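As an illustration of turning unstructured text into structured features, here is a minimal bag-of-words sketch in plain Python; the documents are invented, and real pipelines would add tokenization, stop-word handling, and weighting:

```python
from collections import Counter

def bag_of_words(document):
    """Extract a simple term-frequency feature vector from raw text."""
    tokens = document.lower().split()
    return Counter(tokens)

# Illustrative documents; in practice these might be social media posts.
docs = ["the product is great", "great service and great product"]
features = [bag_of_words(d) for d in docs]

print(features[1]["great"])  # 2
```

The resulting counts are structured data that can feed indexing, classification, or clustering downstream.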

Data may also be classified by the rate at which it is generated,

collected or processed. The distinction is drawn between streaming

data that arrives constantly like a torrent of water from a fire

hose, and batch data, which arrives in buckets. While there is

rarely a connection between data type and data rate, data rate has

significant influence over the execution model chosen for analytic

implementation and may also inform a decision of analytic class or

learning model.

Classes of Analytic Techniques

As a means for helping conceptualize the universe of possible analytic

techniques, we grouped them into nine basic classes. Note that

techniques from a given class may be applied in multiple ways to

achieve various analytic goals. Membership in a class simply indicates

a similar analytic function. The nine analytic classes are shown in the

figure, Classes of Analytic Techniques.

ANALYTIC CLASSES

TRANSFORMING: Aggregation, Enrichment, Processing
LEARNING: Regression, Clustering, Classification, Recommend
PREDICTIVE: Simulation, Optimization

Source: Booz Allen Hamilton
Classes of Analytic Techniques

»› Transforming Analytics

› Aggregation: Techniques to summarize the data. These include
basic statistics (e.g., mean, standard deviation), distribution fitting,
and graphical plotting.

› Enrichment: Techniques for adding additional information to the
data, such as source information or other labels.

› Processing: Techniques that address data cleaning, preparation,
and separation. This group also includes common algorithm
pre-processing activities such as transformations and
feature extraction.

»› Learning Analytics

› Regression: Techniques for estimating relationships among
variables, including understanding which variables are important
in predicting future values.

› Clustering: Techniques to segment the data into naturally
similar groups.

› Classification: Techniques to identify data element
group membership.

› Recommendation: Techniques to predict the rating or preference
for a new entity, based on historic preference or behavior.

»› Predictive Analytics

› Simulation: Techniques to imitate the operation of a real-world
process or system. These are useful for predicting behavior under
new conditions.

› Optimization: Operations Research techniques focused on
selecting the best element from a set of available alternatives to
maximize a utility function.

Learning Models

Analytic classes that perform predictions, such as regression,

clustering, classification and recommendation employ learning

models. These models characterize how the analytic is trained to

perform judgments on new data based on historic observation.

Aspects of learning models describe both the types of judgments

performed and how the models evolve over time, as shown in the

figure, Analytic Learning Models.

LEARNING STYLE: Unsupervised, Semi-Supervised, Supervised
TRAINING STYLE: Offline, Online, Reinforcement

Source: Booz Allen Hamilton
Analytic Learning Models

Learning models are typically described as belonging to the categories
of unsupervised or supervised learning. Supervised learning takes
place when a model is trained using a labeled dataset that has a known
class or category associated with each data element. The model relates
features found in training instances with labels so that predictions
can be made for unlabeled instances. Unsupervised learning involves
no a priori knowledge about the classes into which data can be
placed. Unsupervised learning uses the features in the dataset to
form groupings based on feature similarity. Semi-supervised learning
is a hybrid between these two approaches, using a small amount of
labeled data in conjunction with a large amount of unlabeled data.
This is done to improve learning accuracy in cases where only a
small number of labeled observations are available for learning.

» Reinforcement Learning in Action

The possibilities of Reinforcement Learning captured significant
attention with the publication of a study in the journal Nature in
which a computer agent learned to play 49 different video games
with accuracy rivaling a professional game tester [14]. The agent
was able to achieve these results using only the raw screen pixels
and game score as input. This approach represents the first artificial
intelligence agent that is capable of learning complex tasks while
bridging between high-dimensional sensory inputs and actions.

There are a variety of ways to train learning models. A useful
distinction is between those that are trained in a single pass, which are
known as offline models, and those that are trained incrementally over
time, known as online models. Many learning approaches have online
or offline variants. The decision to use one or another is based on the
analytic goals and execution models chosen.

Generating an offline model requires taking a pass over the entire
training dataset. Improving the model requires making separate

passes over the data. These models are static in that once trained, their

predictions will not change until a new model is created through a

subsequent training stage. Offline model performance is easier to

evaluate due to this deterministic behavior. Deployment of the model

into a production environment involves swapping out the old model

for the new.

Online models dynamically evolve over time, meaning they only

require a single deployment into a production setting. The fact that

these models do not have the entire dataset available when being
trained is a challenge. They must make assumptions about the data
based on the examples observed; these assumptions may be
sub-optimal. The impact of sub-optimal predictions can be mitigated
in cases where feedback on the model’s predictions is available.
Online models can rapidly incorporate feedback to improve
performance.

One such training style is known as Reinforcement Learning. Under
this approach, an algorithm takes action in an environment and
incrementally learns how to achieve goals based on the response to
a function used to determine the quality of its results. Reinforcement
learning is generally applicable to complex, real-world tasks that
involve optimization, such as navigation or trading. Due to the
publication of many promising results from Reinforcement Learning
algorithms, the popularity of this technique has risen dramatically in
recent years along with Deep Learning.
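The offline/online distinction can be made concrete with the simplest possible model, a mean estimator. This sketch (an illustration, not a method from the text) contrasts a single full pass over the data with incremental updates as observations arrive:

```python
def offline_mean(data):
    """Offline training: requires a full pass over the entire dataset."""
    return sum(data) / len(data)

class OnlineMean:
    """Online training: the estimate evolves as each observation arrives."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        # Incremental update; no need to retain past observations.
        self.n += 1
        self.mean += (x - self.mean) / self.n

data = [2.0, 4.0, 6.0, 8.0]

model = OnlineMean()
for x in data:          # observations arrive one at a time
    model.update(x)

print(offline_mean(data), model.mean)  # both print 5.0
```

The offline version must be rerun from scratch to incorporate new data, while the online version is deployed once and keeps learning, at the cost of holding state and making assumptions about data it has not yet seen.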

Execution Models

Execution models describe how data is manipulated to perform
an analytic function. They may be categorized across a number
of dimensions. Execution models are embodied by an execution
framework, which orchestrates the sequencing of analytic
computation. In this sense, a framework might be as simple as a
programming language runtime, such as the Python interpreter, or
a distributed computing framework that provides a specific API for
one or more programming languages, such as Hadoop MapReduce
or Spark. Grouping execution models based on how they handle data
is common, classifying them as either batch or streaming execution
models. The categories of execution model are shown in the figure,
Analytic Execution Models.

SCHEDULING: Batch, Streaming
SEQUENCING: Serial, Parallel

Source: Booz Allen Hamilton

Analytic Execution Models

A batch execution model implies that data is analyzed in large

segments, that the analytic has a state where it is running and a state

where it is not running and that little state is maintained in memory

between executions. Batch execution may also imply that the analytic

produces results with a frequency on the order of several minutes or

more. Batch workloads tend to be fairly easy to conceptualize because

they represent discrete units of work. As such, it is easy to identify

a specific series of execution steps as well as the proper execution

frequency and time bounds based on the rate at which data arrives.

Depending on the algorithm choice, batch execution models are

easily scalable through parallelism. There are a number of frameworks

that support parallel batch analytic execution. Most famously,

Hadoop provides a distributed batch execution model in its

MapReduce framework.

Conversely, a streaming model analyzes data as it arrives. Streaming
execution models imply that under normal operation, the analytic
is always executing. The analytic can hold state in memory and
constantly deliver results as new data arrives, on the order of seconds
or less. Many of the concepts in streaming are inherent in the
Unix-pipeline design philosophy; processes are chained together by
linking the output of one process to the input of the next. As a result,
many developers are already familiar with the basic concepts of
streaming. A number of frameworks are available that support the
parallel execution of streaming analytics such as Storm, S4
and Samza.

» Tips From the Pros

In order to understand system capacity in the context of streaming
analytic execution, collect metrics including: the amount of data
consumed, data emitted, and latency. This will help you understand
when scale limits are reached.

The choice between batch and streaming execution models often
hinges on analytic latency and timeliness requirements. Latency refers

to the amount of time required to analyze a piece of data once it

arrives at the system, while timeliness refers to the average age of an

answer or result generated by the analytic system. For many analytic

goals, a latency of hours and timeliness of days is acceptable and

thus lend themselves to the implementation enabled by the batch

approach. Some analytic goals have up-to-the-second requirements

where a result that is minutes old has little worth. The streaming

execution model better supports such goals.
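The Unix-pipeline analogy maps naturally onto Python generators. In this toy streaming sketch (event records and stage names are invented), each stage consumes the output of the previous one as data arrives, holding no more state than it needs:

```python
def source(events):
    """Emit events one at a time, like a process writing to stdout."""
    for event in events:
        yield event

def parse(stream):
    """Chained stage: the output of one process feeds the next."""
    for line in stream:
        yield line.split(",")

def alert(stream, threshold):
    """Final stage: deliver results constantly as new data arrives."""
    for name, value in stream:
        if float(value) > threshold:
            yield f"ALERT {name}"

# Hypothetical metric records; the pipeline processes them as they arrive.
events = ["cpu,0.2", "cpu,0.95", "disk,0.5"]
alerts = list(alert(parse(source(events)), threshold=0.9))
print(alerts)  # ['ALERT cpu']
```

A batch version of the same logic would instead collect all events, run once over the full set, and exit; frameworks such as Storm generalize this chaining pattern across machines.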

Batch and streaming execution models are not the only dimensions

within which to categorize analytic execution methods. Another

distinction is drawn when thinking about scalability. In many cases,

scale can be achieved by spreading computation over a number of

computers. In this context, certain algorithms require a large shared

memory state, while others are easily parallelizable in a context

where no shared state exists between machines. This distinction has

significant impacts on both software and hardware selection when

building out a parallel analytic execution environment.

Fractal Analytic Model

Data Science analytics are a lot like broccoli.

Fractals are mathematical sets that display self-similar patterns. As

you zoom in on a fractal, the same patterns reappear. Imagine a stalk

of broccoli. Rip off a piece of broccoli and the piece looks much like

the original stalk. Progressively smaller pieces of broccoli still look like

the original stalk.

Data Science analytics are a lot like broccoli – fractal in nature in

both time and construction. Early versions of an analytic follow the

same development process as later versions. At any given iteration, the

analytic itself is a collection of smaller analytics that often decompose

into yet smaller analytics.

Iterative by Nature

Good Data Science is fractal in time — an iterative process. Getting

an imperfect solution out the door quickly will gain more interest

from stakeholders than a perfect solution that is never completed. The

figure, The Data Science Product Lifecycle, summarizes the lifecycle of

the Data Science product.

Setup › Try › Evaluate › Do › Evaluate

Source: Booz Allen Hamilton
The Data Science Product Lifecycle

Set up the infrastructure, aggregate and prepare the data, and

incorporate domain expert knowledge. Try different analytic

techniques and models on subsets of the data. Evaluate the models,

refine, evaluate again, and select a model. Do something with your

models and results – deploy the models to inform, inspire action, and

act. Evaluate the business results to improve the overall product.

Smaller Pieces of Broccoli: A Data

Science Product

Components inside and outside of the Data Science product will

change with each iteration. Let’s take a look under the hood of a

Data Science product and examine the components during one

such iteration.

In order to achieve a greater analytic goal, you need to first decompose

the problem into sub-components to divide and conquer. The figure,

The Fractal Analytic Model, shows a decomposition of the Data Science

product into four component pieces.

GOAL: Describe, Discover, Predict, Advise
DATA: Text, Imagery, Waveform, Geo, Time Series
COMPUTATION (CLASSES OF ANALYTICS): Aggregation, Enrichment,
Clustering, Classification
ACTION: Productization, Data Monetization, Insights & Relationships

Source: Booz Allen Hamilton

The Fractal Analytic Model

GOAL

You must first have some idea of your analytic goal and the end state

of the analysis. Is it to Discover, Describe, Predict, or Advise? It is

probably a combination of several of those. Be sure that before you

start, you define the business value of the data and how you plan to

use the insights to drive decisions, or risk ending up with interesting

but non-actionable trivia.

DATA

Data dictates the potential insights that analytics can provide. Data

Science is about finding patterns in variable data and comparing those

patterns. If the data is not representative of the universe of events you

wish to analyze, you will want to collect that data through carefully

planned variations in events or processes through A/B testing or

design of experiments. Datasets are never perfect so don’t wait for

perfect data to get started. A good Data Scientist is adept at handling

messy data with missing or erroneous values. Just make sure to spend

the time upfront to clean the data or risk generating garbage results.
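As a small illustration of handling messy data, the sketch below (with invented sensor readings and a simple mean-imputation policy, one of many possible choices) drops erroneous values and fills in missing ones before analysis begins:

```python
def clean(records):
    """Drop erroneous values, then impute missing ones with the mean."""
    # Assumption for this example: negative readings are sensor errors.
    valid = [r for r in records if r is None or r >= 0]
    present = [r for r in valid if r is not None]
    mean = sum(present) / len(present)
    # Replace missing (None) values with the mean of the observed data.
    return [mean if r is None else r for r in valid]

# Illustrative readings: None is missing, -1.0 is an erroneous value.
readings = [10.0, None, 14.0, -1.0, 12.0]
print(clean(readings))  # [10.0, 12.0, 14.0, 12.0]
```

The right cleaning policy is problem-specific; the point is that the decisions (what counts as an error, how to impute) are made explicitly and upfront rather than left to corrupt downstream results.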

COMPUTATION

Computation aligns the data to goals through the process of creating

insights. Through divide and conquer, computation decomposes

into several smaller analytic capabilities with their own goals, data,

computation and resulting actions, just like a smaller piece of broccoli

maintains the structure of the original stalk. In this way, computation

itself is fractal. Capability building blocks may utilize different

types of execution models such as batch computation or streaming,

that individually accomplish small tasks. When properly combined

together, the small tasks produce complex, actionable results.

ACTION

How should engineers change the manufacturing process to generate

higher product yield? How should an insurance company choose

which policies to offer to whom and at what price? The output of

computation should enable actions that align to the goals of the data

product. Results that do not support or inspire action are nothing but

interesting trivia.

Given the fractal nature of Data Science analytics in time and

construction, there are many opportunities to choose fantastic or

shoddy analytic building blocks. The Analytic Selection Process offers

some guidance.

The Analytic

Selection Process

If you focus only on the science aspect of Data Science you will

never become a data artist.

A critical step in Data Science is to identify an analytic technique that

will produce the desired action. Sometimes it is clear; a characteristic

of the problem (e.g., data type) points to the technique you should

implement. Other times, however, it can be difficult to know where

to begin. The universe of possible analytic techniques is large. Finding

your way through this universe is an art that must be practiced. We

are going to guide you on the next portion of your journey - becoming

a data artist.

Decomposing the Problem

Decomposing the problem into manageable pieces is the first step

in the analytic selection process. Achieving a desired analytic action

often requires combining multiple analytic techniques into a holistic,

end-to-end solution. Engineering the complete solution requires that

the problem be decomposed into progressively smaller sub-problems.

The Fractal Analytic Model embodies this approach. At any given

stage, the analytic itself is a collection of smaller computations that

decompose into yet smaller computations. When the problem is

decomposed far enough, only a single analytic technique is needed

to achieve the analytic goal. Problem decomposition creates multiple

sub-problems, each with their own goals, data, computations, and

actions. The concept behind problem decomposition is shown in the

figure, Problem Decomposition Using the Fractal Analytic Model.

[Figure: the Fractal Analytic Model (GOAL, DATA, COMPUTATION,
ACTION, with the classes of analytics feeding computation) decomposed
into nested sub-problems, each with its own GOAL, DATA and ACTION.]

Source: Booz Allen Hamilton

Problem Decomposition Using the Fractal Analytic Model

On the surface, problem decomposition appears to be a mechanical,

repeatable process. While this may be true conceptually, it is

really the performance of an art as opposed to the solving of an

engineering problem. There may be many valid ways to decompose

the problem, each leading to a different solution. There may be

hidden dependencies or constraints that only emerge after you begin

developing a solution. This is where art meets science. Although the

art behind problem decomposition cannot be taught, we have distilled

some helpful hints to help guide you. When you begin to think about

decomposing your problem, look for:

› Compound analytic goals that create natural segmentation.

For example, many problems focused on predicting future

conditions include both Discover and Predict goals.

» Tips From the Pros

One of your first steps should be to explore available data sources
that have not been previously combined. Emerging relationships
between data sources often allow you to pick low hanging fruit.

› Natural orderings of analytic goals. For example, when extracting
features you must first identify candidate features and then select
the feature set with the highest information value. These two
activities form distinct analytic goals.

› Data types that dictate processing activities. For example, text or
imagery both require feature extraction.

› Requirements for human-in-the-loop feedback. For example,

when developing alerting thresholds, you might need to solicit

analyst feedback and update the threshold based on their

assessment.

› The need to combine multiple data sources. For example, you may

need to correlate two datasets to achieve your broader goal.

Often this indicates the presence of a Discover goal.

In addition to problem decomposition providing a tractable approach

to analytic selection, it has the added benefit of simplifying a highly

complex problem. Rather than being faced with understanding the

entire end-to-end solution, the computations are discrete segments

that can be explored. Note, however, that while this technique helps

the Data Scientist approach the problem, it is the complete end-to-

end solution that must be evaluated.

›› Identifying Spoofed Domains


Identifying spoofed domains is important for an organization

to preserve their brand image and to avoid eroded customer

confidence. Spoofed domains occur when a malicious actor

creates a website, URL or email address that users believe is

associated with a valid organization. When users click the link,

visit the website or receive emails, they are subjected to some

type of nefarious activity.

Stephanie Rivera

[Figure: decomposition of the spoofed domain problem. The top-level
goal (discover spoofed domains and alert on them to minimize damage
to brand image and consumer confidence) breaks into sub-problems:
collect and store recently registered company domains; simulate and
generate candidate spoofed domains; describe the closeness of each
candidate to valid domains by calculating a quantitative metric of
feature information value; and set a threshold that balances the false
positive and false negative rates for automated result ranking, with
test and evaluation at each stage.]

Source: Booz Allen Hamilton

Spoofed Domain Problem Decomposition

Our team was faced with the problem of identifying spoofed domains for a
commercial company. On the surface, the problem sounded easy: take a recently
registered domain, check to see if it is similar to the company's domain, and
alert when the similarity is sufficiently high. Upon decomposing the problem,
however, the main computation quickly became complicated.

We needed a computation that determined similarity between two domains. As we
decomposed the similarity computation, complexity and speed became a concern.
As with many security-related problems, fast alert speeds are vital. Result
speed created an implementation constraint that forced us to re-evaluate how
we decomposed the problem.

Revisiting the decomposition process led us to a completely new approach. In
the end, we derived a list of domains similar to those registered by the
company. We then compared that list against a list of recently registered
domains. The figure, Spoofed Domain Problem Decomposition, illustrates our
approach. Upon testing and initial deployment, our analytic discovered a
spoofed domain within 48 hours.
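The case study does not spell out its similarity computation, but a minimal sketch of the naive first approach (score a candidate domain against a valid domain, alert above a tuned threshold) might use normalized edit distance; the domain strings and threshold below are hypothetical:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(candidate, valid):
    """1.0 means identical strings; values near 1.0 suggest a possible spoof."""
    return 1.0 - levenshtein(candidate, valid) / max(len(candidate), len(valid))

VALID = 'example.com'   # hypothetical company domain
THRESHOLD = 0.8         # would be tuned to balance false positives and negatives
alerts = [d for d in ['examp1e.com', 'totally-unrelated.org']
          if similarity(d, VALID) >= THRESHOLD]
```

As the case study notes, running such a pairwise computation fast enough at scale is exactly where the implementation constraint appears.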

Take off the Training Wheels

Implementation Constraints

In the spoofed domains case study, the emergence of an

implementation constraint caused the team to revisit its approach.

This demonstrates that analytic selection does not simply mean

choosing an analytic technique to achieve a desired outcome. It also

means ensuring that the solution is feasible to implement.

The Data Scientist may encounter a wide variety of implementation

constraints. They can be conceptualized, however, in the context of

five dimensions that compete for your attention: analytic complexity,

speed, accuracy & precision, data size, and data complexity. Balancing

these dimensions is a zero-sum game: an analytic solution cannot

simultaneously exhibit all five dimensions, but instead must make

trades between them. The figure, Balancing the Five Analytic

Dimensions, illustrates this relationship.

SPEED: The speed at which an analytic outcome must be produced (e.g., near
real-time, hourly, daily) or the time it takes to develop and implement the
analytic solution

ANALYTIC COMPLEXITY: Algorithmic complexity (e.g., complexity class and
execution resources)

ACCURACY & PRECISION: The ability to produce exact versus approximate
solutions as well as the ability to provide a measure of confidence

DATA SIZE: The size of the dataset (e.g., number of rows)

DATA COMPLEXITY: The data type, formal complexity measures including measures
of overlap and linear separability, number of dimensions/columns, and linkages
between datasets

Source: Booz Allen Hamilton
Balancing the Five Analytic Dimensions

Implementation constraints occur when an aspect of the problem dictates the
value for one or more of these dimensions. As soon as one dimension is fixed,
the Data Scientist is forced to make trades among the others. For example, if
the analytic problem requires actions to be produced in near real-time, the
speed dimension is fixed and trades must be made among the other four
dimensions. Understanding which trades will achieve the right balance among
the five dimensions is an art that must be learned over time.

As we compiled this section, we talked extensively about ways to group and
classify implementation constraints. After much discussion we settled on these
five dimensions. We present this model in hopes that others weigh in and offer
their own perspectives.

Some common examples of implementation constraints include:

• Computation frequency. The solution may need to run on a regular basis
(e.g., hourly), requiring that computations be completed within a specified
window of time. The best analytic is useless if it cannot solve the problem
within the required time.

» Tips From the Pros: When possible, consider approaches that make use of
previously computed results. Your algorithm will run much faster if you can
avoid re-computing values across the full time horizon of data.

• Solution timeliness. Some applications require near real-time results,
pointing toward streaming approaches. While some algorithms can be implemented
within streaming frameworks, many others cannot.

• Implementation speed. A project may require that you rapidly develop and
implement a solution to quickly produce analytic insights. In these cases, you
may need to focus on less complex techniques that can be quickly implemented
and verified.

• Computational resource limitations. Although you may be able to store and
analyze your data, data size may be sufficiently large that algorithms
requiring multiple computations across the full dataset are too resource
intensive. This may point toward approaches that only require a single pass
over the data (e.g., canopy clustering as opposed to k-means clustering).

» Tips From the Pros: Our Data Science Product Lifecycle has evolved to
produce results quickly and then incrementally improve the solution.

• Data storage limitations. There are times when big data becomes so big it
cannot be stored, or only a short time horizon can be stored. Analytic
approaches that require long time horizons may not be possible.
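The first tip above, reusing previously computed results rather than re-scanning the full time horizon, can be sketched with streaming statistics. Welford's algorithm, shown here on made-up readings, updates the mean and variance in O(1) per new observation:

```python
class RunningStats:
    """Streaming mean/variance via Welford's algorithm: each new value updates
    previously computed results instead of re-scanning the full history."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of everything seen so far."""
        return self.m2 / self.n if self.n else 0.0

stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)  # O(1) work per observation
```

The same pattern generalizes to any statistic that admits an incremental update, which is what makes it attractive under a computation-frequency constraint.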

Organizational policies and regulatory requirements are a major source of
implicit constraints that merit a brief discussion. Policies are often
established around specific classes of data such as Personally Identifiable
Information (PII) or Personal Health Information (PHI). While the technologies
available today can safely house information with a variety of security
controls in a single system, these policies force special data handling
considerations including limited retention periods and data access. Data
restrictions impact the data size and complexity dimensions outlined earlier,
creating yet another layer of constraints that must be considered.

» Tips From the Pros: Streaming approaches may be useful for overcoming
storage limitations.

›› Guide to Analytic Selection

Your senses are incapable of perceiving the entire universe, so we drew
you a map.

The universe of analytic techniques is vast and hard to comprehend. We created
this diagram to aid you in finding your way from data and goal to analytic
action. Begin at the center of the universe (Data Science) and answer
questions about your analytic goals and problem characteristics. The answers
to your questions will guide you through the diagram to the appropriate class
of analytic techniques and provide recommendations for a few techniques
to consider.

[Figure: Guide to Analytic Selection. A decision map radiates out from Data
Science through the DESCRIBE, DISCOVER, PREDICT, and ADVISE goals; the legible
branches are reconstructed below.]

How do I clean and separate my data? → PROCESSING
› If you suspect duplicate data elements, start with: Deduplication
› If your data is stored in a binary format, start with: Format conversion
› If you are operating in frequency space, start with: Fast Fourier Transform (FFT), Discrete wavelet transform
› If you are operating in Euclidean space, start with: Coordinate transform

How do I identify data based on its absolute or relative values? → FILTERING
› If you want to add or remove data based on its value, start with: Relational algebra projection and selection
› If early results are uninformative and duplicative, start with: Outlier removal, Gaussian filter, Exponential smoothing, Median filter

How do I fill in missing values in my data? → IMPUTATION
› If you want to generate values from other observations in your dataset, start with: Random sampling, Markov Chain Monte Carlo (MCMC)
› If you want to generate values without using other observations in your dataset, start with: Mean, Statistical distributions

How do I reduce the number of dimensions in my data? → DIMENSIONALITY REDUCTION
› If you need to determine whether there is multi-dimensional correlation, start with: PCA and other factor analysis
› If you can represent individual observations by membership in a group, start with: K-means clustering, Canopy clustering

TIP: There are several situations where dimensionality reduction may be
needed: models fail to converge; models produce results equivalent to random
chance; you do not know which aspects of the data are the most important.

How do I reconcile duplication representations in the data? → NORMALIZATION & TRANSFORMATION
› If you want your data to fall within a specified range, start with: Normalization
› If you have unstructured text data, start with: Term Frequency/Inverse Document Frequency (TF-IDF)

How do I determine which variables may be important? → FEATURE EXTRACTION
› If your data has unknown structure, start with: Tree-based methods
› If statistical measures of importance are needed, start with: Generalized linear models
› If statistical measures of importance are not needed, start with: Regression with shrinkage (e.g., LASSO, Elastic net), Stepwise regression
› If you have a variable number of features but your algorithm requires a fixed number, start with: Feature hashing
› If you are not sure which features are the most important, start with: Wrapper methods, Sensitivity analysis
› If you need to facilitate understanding of the probability distribution of the space, start with: Self-organizing maps

How do I collect and summarize my data? → AGGREGATION
› If you are unfamiliar with the dataset, start with basic statistics: Count, Mean, Standard deviation, Range, Box plots, Scatter plots
› If your approach assumes the data follows a distribution, start with: Distribution fitting
› If you want to understand all the information available on an entity, start with: "Baseball card" aggregation

How do I add new information to my data? → ENRICHMENT
› If you need to keep track of source information or other user-defined parameters, start with: Annotation
› If you often process certain data fields together or use one field to compute the value of another, start with: Relational algebra rename, Feature addition (e.g., geography, technology, weather)

How do I segment the data to find natural groupings? → CLUSTERING
› If you want an ordered set of clusters with variable precision, start with: Hierarchical clustering
› If you have a known number of clusters, start with: K-means
› If you have an unknown number of clusters, start with: X-means, Canopy, Apriori
› If you have text data, start with: Topic modeling
› If you have non-elliptical clusters, start with: Fractal, DBSCAN
› If you want soft membership in the clusters, start with: Gaussian mixture models

How do I predict group membership? → CLASSIFICATION
› If you are unsure of feature importance, start with: Neural nets, Random forests, Deep learning
› If you require a highly transparent model, start with: Regression models, Decision trees
› If you have <20 data dimensions, start with: K-nearest neighbors
› If you have a large dataset with an unknown classification signal, start with: Naive Bayes
› If you want to try multiple models, start with: Ensemble learning
› If you don't know where else to begin, start with: Support vector machines (SVM), Random forests

How do I sort through different evidence? → LOGICAL REASONING
› If you have expert knowledge to capture, start with: Expert systems
› If you have known dependent relationships between variables, start with: Bayesian network
› If you're looking for basic facts, start with: Logical reasoning
› If you have imprecise categories, start with: Fuzzy logic

How do I predict a future value, and how do I characterize the key relationships in the data? → REGRESSION
› If you require a highly transparent model, start with: Generalized linear models
› If you have <20 data dimensions, start with: K-nearest neighbors
› If your data has unknown structure, start with: Tree-based methods

What are the likely future outcomes?
› If there are a discrete set of possible states, start with: Markov models
› If you want to estimate an unobservable state based on observable variables, start with: Hidden Markov model

How do I characterize a system that does not have a closed-form representation? → SIMULATION
› If you must model discrete entities, start with: Discrete event simulation (DES)
› If there are actions and interactions among autonomous entities, start with: Agent-based simulation
› If you do not need to model discrete entities, start with: Monte Carlo simulation
› If you are modeling a complex system with feedback mechanisms between actions, start with: Systems dynamics
› If you require continuous tracking of system behavior, start with: Activity-based simulation
› If you already have an understanding of what factors govern the system, start with: ODEs, PDEs

How do I predict relevant conditions? → RECOMMENDATION
› If you only have knowledge of how people interact with items, start with: Collaborative filtering
› If you have a feature vector of item characteristics, start with: Content-based methods
› If you only have knowledge of how items are connected to one another, start with: Graph-based methods

How do I identify the best course of action when my objective can be expressed as a utility function? → OPTIMIZATION
› If your problem is represented by a deterministic utility function, start with: Linear programming, Integer programming, Non-linear programming
› If your problem is represented by a non-deterministic utility function, start with: Stochastic search
› If approximate solutions are acceptable, start with: Genetic algorithms, Simulated annealing, Gradient search
› If you have limited resources to search with, start with: Active learning

How do I test ideas? → HYPOTHESIS TESTING
› If you want to compare two groups, start with: T-test
› If you want to compare multiple groups, start with: ANOVA

Source: Booz Allen Hamilton
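As one concrete instance of the map's recommendations, the TF-IDF starting point for unstructured text can be sketched in a few lines (toy documents and simple log weighting; production implementations such as scikit-learn's `TfidfVectorizer` add smoothing and normalization):

```python
from collections import Counter
from math import log

# Toy corpus: each document is a list of tokens.
docs = [['spoofed', 'domain', 'alert'],
        ['domain', 'registration', 'list'],
        ['quarterly', 'sales', 'report']]

def tf_idf(docs):
    """Weight each term by term frequency times inverse document frequency,
    so terms shared across many documents score lower than distinctive ones."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: (c / len(doc)) * log(n / df[t]) for t, c in tf.items()})
    return weighted

scores = tf_idf(docs)
# 'alert' (unique to doc 0) outscores 'domain' (shared with doc 1).
```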

TIP: Canopy clustering is good when you only want to make a single pass over
the data.

TIP: Use canopy or hierarchical clustering to estimate the number of clusters
you should generate.

Source: Booz Allen Hamilton
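The second tip can be sketched with a minimal single-pass canopy implementation whose canopy count serves as an estimate of k for a subsequent k-means run (the points and the loose/tight thresholds `t1`/`t2` below are illustrative assumptions):

```python
import math

def canopy(points, t1, t2):
    """Single-pass canopy clustering. t1 is the loose threshold that defines
    canopy membership; t2 < t1 is the tight threshold that removes points
    from further consideration. The canopy count cheaply estimates k."""
    assert t2 < t1
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center] + [p for p in remaining if math.dist(center, p) < t1]
        canopies.append((center, members))
        # Points tightly bound to this center never start a new canopy.
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
    return canopies

# Three loose groups of illustrative 2-D points.
points = [(0, 0), (0.2, 0.1), (5, 5), (5.1, 4.9), (10, 0), (9.8, 0.3)]
canopies = canopy(points, t1=2.0, t2=1.0)
k_estimate = len(canopies)  # feed this into k-means as the cluster count
```

Each point is visited once per surviving center, which is what makes canopy attractive when a full iterative k-means is too expensive to run blind.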

Source: Booz Allen Hamilton - If you want to add or remove data based on its value, start with:

If you have expert knowledge to capture

› Relational algebra projection and selection

› Expert systems

FILTERING

If early results are uninformative and duplicative, start with:

How do I identify

› Outlier removal

› Gaussian filter

If you have known dependent relationships between variables

If you're looking for basic facts

data based on

› Exponential smoothing

› Median filter

› Bayesian network

› Logical reasoning

GUIDE to ANALYTIC SELECTION

DESCRIBE

AGGREGATION: How do I collect and summarize my data?
› If you are unfamiliar with the dataset, start with basic statistics: Count; Mean; Standard deviation; Range; Box plots; Scatter plots
› If you want to understand all the information available on an entity, start with: “Baseball card” aggregation

ENRICHMENT: How do I add new information to my data?
› If you need to keep track of source information or other user-defined parameters, start with: Annotation
› If you often process certain data fields together or use one field to compute the value of another, start with: Relational algebra rename; Feature addition (e.g., Geography, Technology, Weather)

PROCESSING: How do I clean and separate my data?
› If you suspect duplicate data elements, start with: Deduplication

FILTERING: How do I identify data based on its absolute or relative values?
› If you want to add or remove data based on its value, start with: Relational algebra projection and selection
› If early results are uninformative and duplicative, start with: Outlier removal; Gaussian filter; Exponential smoothing; Median filter

IMPUTATION: How do I fill in missing values in my data?
› If you want to generate values from other observations in your dataset, start with: Random sampling; Markov Chain Monte Carlo (MCMC)
› If you want to generate values without using other observations in your dataset, start with: Mean; Regression models; Statistical distributions

DIMENSIONALITY REDUCTION: How do I reduce the number of dimensions in my data?
› If you need to determine whether there is multi-dimensional correlation, start with: PCA and other factor analysis
› If you have text data, start with: Topic modeling
› If you can represent individual observations by membership in a group, start with: K-means clustering; Canopy clustering

NORMALIZATION & TRANSFORMATION: How do I reconcile duplicate representations in the data?
› If you want your data to fall within a specified range, start with: Normalization
› If your data is stored in a binary format, start with: Format conversion
› If you are operating in frequency space, start with: Fast Fourier Transform (FFT); Discrete wavelet transform
› If you are operating in Euclidean space, start with: Coordinate transform

FEATURE EXTRACTION: How do I determine which variables may be important?
› If you are not sure which features are the most important, start with: Wrapper methods; Sensitivity analysis
› If you have a variable number of features but your algorithm requires a fixed number, start with: Feature hashing
› If statistical measures of importance are needed, start with: Generalized linear models
› If statistical measures of importance are not needed, start with: Regression with shrinkage (e.g., LASSO, Elastic net); Stepwise regression
› If the data structure is unknown, start with: Tree-based methods

DISCOVER

CLUSTERING: How do I segment the data to find natural groupings?
› If you don't know where else to begin, or have a known number of clusters, start with: K-means
› If you have an unknown number of clusters, start with: X-means; Canopy
› If you want soft membership in the clusters, start with: Gaussian mixture models
› If you want an ordered set of clusters with variable precision, start with: Hierarchical
› If you have non-elliptical clusters, start with: Fractal; DBSCAN
› If you need to facilitate understanding of the probability distribution of the space, start with: Self-organizing maps

ASSOCIATION RULE MINING: What are the key relationships in the data?
› Start with: Apriori

HYPOTHESIS TESTING: How do I test ideas?
› If you want to compare two groups, start with: T-test
› If you want to compare multiple groups, start with: ANOVA
› If your approach assumes the data follows a distribution, start with: Distribution fitting

PREDICT

CLASSIFICATION: How do I predict group membership?
› If you are unsure of feature importance, start with: Neural nets; Random forests; Deep learning
› If you require a highly transparent model, start with: Decision trees
› If you have <20 data dimensions, start with: K-nearest neighbors
› If you have a large dataset with an unknown classification signal, start with: Naive Bayes
› If you have unstructured text data, start with: Term Frequency/Inverse Document Frequency (TF-IDF)
› If the data structure is unknown, start with: Tree-based methods
› If you want to try multiple models, start with: Ensemble learning
TIP: It can be difficult to predict which classifier will work best on your dataset. Always try multiple classifiers. Pick the one or two that work the best to refine and explore further.
TIP: Support vector machines (SVM) and Random forests are our favorite, go-to classification algorithms.

REGRESSION: How do I predict a future value? What are the likely future outcomes?
› If you require a highly transparent model, start with: Generalized linear models
› If you have <20 data dimensions, start with: K-nearest neighbors
› If there are a discrete set of possible states, start with: Markov models
› If you want to estimate an unobservable state based on observable variables, start with: Hidden Markov model

RECOMMENDATION: How do I predict relevant conditions?
› If you only have knowledge of how people interact with items, start with: Collaborative filtering
› If you have a feature vector of item characteristics, start with: Content-based methods
› If you only have knowledge of how items are connected to one another, start with: Graph-based methods
TIP: Be careful of the “recommendation bubble,” the tendency of recommenders to recommend only what has been seen in the past. You must ensure you add diversity to avoid this phenomenon.
TIP: SVD and PCA are good tools for creating better features for recommenders.

SIMULATION: How do I characterize a system that does not have a closed-form representation?
› If you must model discrete entities, start with: Discrete event simulation (DES)
› If there are actions and interactions among autonomous entities, start with: Agent-based simulation
› If you do not need to model discrete entities, start with: Monte Carlo simulation
› If you are modeling a complex system with feedback mechanisms between actions, start with: Systems dynamics
› If you require continuous tracking of system behavior, start with: Activity-based simulation
› If you already have an understanding of what factors govern the system, start with: ODEs; PDEs

ADVISE: What course of action should I take?

OPTIMIZATION: How do I identify the best course of action when my objective can be expressed as a utility function?
› If your problem is represented by a deterministic utility function, start with: Linear programming; Integer programming; Non-linear programming
› If your problem is represented by a non-deterministic utility function, start with: Stochastic search
› If approximate solutions are acceptable, start with: Genetic algorithms; Simulated annealing; Gradient search
› If you have limited resources to search with, start with: Active learning

LOGICAL REASONING: How do I sort through different evidence?
› If you have expert knowledge to capture, start with: Expert systems
› If you have known dependent relationships between variables, start with: Bayesian network
› If you're looking for basic facts, start with: Logical reasoning
› If you have imprecise categories, start with: Fuzzy logic

Source: Booz Allen Hamilton
Detailed Table of Analytics

Getting you to the right starting point isn't enough. We also provide a translator so you understand what you've been told. Identifying several analytic techniques that can be applied to your problem is useful, but their names alone will not be much help. The Detailed Table of Analytics translates the names into something more meaningful. Once you've identified a technique in the Guide to Analytic Selection, find the corresponding row in the table. There you will find a brief description of the technique, tips we've learned, and a few references we've found helpful.

Each entry below lists the Technique, a Description, Tips From the Pros, and References we love to read.

Active Learning
Description: Intelligent sample selection to improve performance of a model. Samples are selected to provide the greatest information to a learning model.
Tips From the Pros: Can be paired with a human-in-the-loop to help capture domain knowledge.
References we love to read: Settles, Burr. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool, 2012. Print.

Agent-Based Simulation
Description: Simulates the actions and interactions of autonomous agents.
Tips From the Pros: In many systems, complex behavior results from surprisingly simple rules. Keep the logic of your agents simple and gradually build in sophistication.
References we love to read: Macal, Charles, and Michael North. “Agent-based Modeling and Simulation.” Winter Simulation Conference. Austin, TX. 2009. Conference Presentation.

ANOVA
Description: Hypothesis testing for differences between more than two groups.
Tips From the Pros: Check model assumptions before utilizing, and watch out for Family-Wise error when running multiple tests.
References we love to read: Bhattacharyya, Gouri K., and Richard A. Johnson. Statistical Concepts and Models. Wiley, 1977. Print.
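As a quick illustration of the entry above, a one-way ANOVA on three toy groups (synthetic data invented here; assumes SciPy is available) might look like:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Three toy treatment groups; group "c" has a shifted mean.
a = rng.normal(0.0, 1.0, 40)
b = rng.normal(0.0, 1.0, 40)
c = rng.normal(1.5, 1.0, 40)

# One-way ANOVA tests the null hypothesis that all group means are equal.
f_stat, p_value = stats.f_oneway(a, b, c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

A small p-value only says that at least one mean differs; follow up with pairwise comparisons, and correct for Family-Wise error as the tip warns.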

Association Rule Mining (Apriori)
Description: Data mining technique to identify the common co-occurrences of items.
Tips From the Pros: Utilize when you have a need to understand potential relationships between elements.
References we love to read: Agrawal, Rakesh, and Ramakrishnan Srikant. “Fast Algorithms for Mining Association Rules.” Proc. of 20th Intl. Conf. on VLDB. 1994. Conference Presentation.

Bayesian Network
Description: Models conditional probabilities amongst elements, visualized as a Directed Acyclic Graph.
Tips From the Pros: Calculate by hand before using larger models to ensure understanding.
References we love to read: Russell, Stuart, and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2009. Print.

Compiled by: Booz Allen Hamilton
THE FIELD GUIDE to DATA SCIENCE

Collaborative Filtering
Description: Also known as 'Recommendation,' suggests or eliminates items from a set by comparing a history of actions against items performed by users. Finds similar items based on who used them, or similar users based on the items they use.
Tips From the Pros: Use Singular Value Decomposition-based recommendation for cases where there are latent factors in your domain, e.g., genres in movies.
References we love to read: Owen, Sean, Robin Anil, Ted Dunning, and Ellen Friedman. Mahout in Action. New Jersey: Manning, 2012. Print.

Coordinate Transformation
Description: Provides a different perspective on data.
Tips From the Pros: Changing the coordinate system for data, for example, using polar or cylindrical coordinates, may more readily highlight key structure in the data. A key step in coordinate transformations is to appreciate multidimensionality and to systematically analyze subspaces of the data.
References we love to read: Abbott, Edwin A. Flatland: A Romance of Many Dimensions. United Kingdom: Seely & Co., 1884. Print.

Deep Learning
Description: Method that learns features that lead to higher concept learning. Usually very deep neural network architectures.
Tips From the Pros: Utilize a GPU to efficiently train complex models.
References we love to read: Bengio, Yoshua, and Yann LeCun. “Scaling Learning Algorithms towards AI.” Large-Scale Kernel Machines. New York: MIT Press, 2007. Print.

Design of Experiments
Description: Applies controlled experiments to quantify effects on system output caused by changes to inputs.
Tips From the Pros: Fractional factorial designs can significantly reduce the number of different types of experiments you must conduct.
References we love to read: Montgomery, Douglas. Design and Analysis of Experiments. New Jersey: John Wiley & Sons, 2012. Print.

Differential Equations
Description: Used to express relationships between functions and their derivatives, for example, change over time.
Tips From the Pros: Differential equations can be used to formalize models and make predictions. The equations themselves can be solved numerically and tested with different initial conditions to study system trajectories.
References we love to read: Zill, Dennis, Warren Wright, and Michael Cullen. Differential Equations with Boundary-Value Problems. Connecticut: Cengage Learning, 2012. Print.

Discrete Event Simulation
Description: Simulates a discrete sequence of events where each event occurs at a particular instant in time. The model updates its state only at points in time when events occur.
Tips From the Pros: Discrete event simulation is useful when analyzing event-based processes such as production lines and service centers to determine how system-level behavior changes as different process parameters change. Optimization can integrate with simulation to gain efficiencies in a process.
References we love to read: Burrus, C. Sidney, Ramesh A. Gopinath, Haitao Guo, Jan E. Odegard, and Ivan W. Selesnick. Introduction to Wavelets and Wavelet Transforms: A Primer. New Jersey: Prentice Hall, 1998. Print.

Discrete Wavelet Transform
Description: Transforms time series data into the frequency domain, preserving locality information.
Tips From the Pros: Offers very good time and frequency localization. The advantage over Fourier transforms is that it preserves both frequency and locality.
References we love to read: Burrus, C. Sidney, Ramesh A. Gopinath, Haitao Guo, Jan E. Odegard, and Ivan W. Selesnick. Introduction to Wavelets and Wavelet Transforms: A Primer. New Jersey: Prentice Hall, 1998. Print.

Ensemble Learning
Description: Learning multiple models and combining output to achieve better performance.
Tips From the Pros: Be careful not to overfit data by having too many model parameters and overtraining.
References we love to read: Dietterich, Thomas G. “Ensemble Methods in Machine Learning.” Lecture Notes in Computer Science. Springer, 2000. Print.

Expert Systems
Description: Systems that use symbolic logic to reason about facts. Emulates human reasoning.
Tips From the Pros: Useful to have a human-readable explanation of why a system came to a conclusion.
References we love to read: Shortliffe, Edward H., and Bruce G. Buchanan. “A Model of Inexact Reasoning in Medicine.” Mathematical Biosciences. Elsevier B.V., 1975. Print.

Exponential Smoothing
Description: Used to remove artifacts expected from collection error or outliers.
Tips From the Pros: In comparison to a moving average, where past observations are weighted equally, exponential smoothing assigns exponentially decreasing weights over time.
References we love to read: Chatfield, Chris, Anne B. Koehler, J. Keith Ord, and Ralph D. Snyder. “A New Look at Models for Exponential Smoothing.” Journal of the Royal Statistical Society: Series D (The Statistician). Royal Statistical Society, 2001. Print.
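The recurrence behind simple exponential smoothing is short enough to write out directly; here is a minimal sketch on an invented series with one spike:

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s_t = alpha*x_t + (1-alpha)*s_{t-1}.
    Larger alpha weights recent observations more heavily."""
    smoothed = [series[0]]  # seed with the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

noisy = [10, 12, 9, 11, 30, 10, 11, 9, 12, 10]  # one spike at index 4
smoothed = exponential_smoothing(noisy, alpha=0.3)
print([round(s, 2) for s in smoothed])
```

The spike of 30 is damped in the smoothed output because its weight decays geometrically in later terms.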

Take off the Training Wheels 79

Factor Analysis
Description: Describes variability among correlated variables with the goal of lowering the number of unobserved variables, namely, the factors.
Tips From the Pros: If you suspect there are immeasurable influences on your data, then you may want to try factor analysis.
References we love to read: Child, Dennis. The Essentials of Factor Analysis. United Kingdom: Cassell Educational, 1990. Print.

Fast Fourier Transform
Description: Transforms time series from time to frequency domain efficiently. Can also be used for image improvement by spatial transforms.
Tips From the Pros: Filtering a time-varying signal can be done more effectively in the frequency domain. Also, noise can often be identified in such signals by observing power at aberrant frequencies.
References we love to read: Mitra, Partha P., and Hemant Bokil. Observed Brain Dynamics. United Kingdom: Oxford University Press, 2008. Print.
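A minimal sketch of the idea (synthetic signal invented here; assumes NumPy is available): a noisy sine wave's dominant frequency stands out as the peak of its spectrum.

```python
import numpy as np

# Toy signal: a 5 Hz sine sampled at 100 Hz, plus white noise.
fs = 100
t = np.arange(0, 2, 1 / fs)
rng = np.random.default_rng(1)
signal = np.sin(2 * np.pi * 5 * t) + 0.2 * rng.standard_normal(t.size)

# The FFT moves the signal into the frequency domain; the dominant
# frequency shows up as a peak in the magnitude spectrum.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(t.size, d=1 / fs)
peak_hz = freqs[np.argmax(np.abs(spectrum))]
print(f"Dominant frequency: {peak_hz:.1f} Hz")
```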

Format Conversion
Description: Creates a standard representation of data regardless of source format. For example, extracting raw UTF-8 encoded text from binary file formats such as Microsoft Word or PDFs.
Tips From the Pros: There are a number of open source software packages that support format conversion and can interpret a wide variety of formats. One notable package is Apache Tika.
References we love to read: Ingersoll, Grant S., Thomas S. Morton, and Andrew L. Farris. Taming Text: How to Find, Organize, and Manipulate It. New Jersey: Manning, 2013. Print.

Fuzzy Logic
Description: Logical reasoning that allows for degrees of truth for a statement.
Tips From the Pros: Utilize when categories are not clearly defined. Concepts such as "warm," "cold," and "hot" can mean different things at different temperatures and domains.
References we love to read: Zadeh, L.A. “Fuzzy Sets.” Information and Control. California: University of California, Berkeley, 1965. Print.

Gaussian Filtering
Description: Acts to remove noise or blur data.
Tips From the Pros: Can be used to remove speckle noise from images.
References we love to read: Parker, James R. Algorithms for Image Processing and Computer Vision. New Jersey: John Wiley & Sons, 2010. Print.

Generalized Linear Models
Description: Expands ordinary linear regression to allow for error distribution that is not normal.
Tips From the Pros: Use if the observed error in your system does not follow the normal distribution.
References we love to read: McCullagh, P., and John A. Nelder. Generalized Linear Models. Florida: CRC Press, 1989. Print.

Genetic Algorithms
Description: Evolves candidate models over generations by evolutionary-inspired operators of mutation and crossover of parameters.
Tips From the Pros: Increasing the generation size adds diversity in considering parameter combinations, but requires more objective function evaluation. Calculating individuals within a generation is strongly parallelizable. Representation of candidate solutions can impact performance.
References we love to read: De Jong, Kenneth A. Evolutionary Computation - A Unified Approach. Massachusetts: MIT Press, 2002. Print.

Grid Search
Description: Systematic search across discrete parameter values for parameter exploration problems.
Tips From the Pros: A grid across the parameters is used to visualize the parameter landscape and assess whether multiple minima are present.
References we love to read: Kolda, Tamara G., Robert M. Lewis, and Virginia Torczon. “Optimization by Direct Search: New Perspectives on Some Classical and Modern Methods.” SIAM Review. Society for Industrial and Applied Mathematics, 2003. Print.

Hidden Markov Models
Description: Models sequential data by determining the discrete latent variables, but the observables may be continuous or discrete.
Tips From the Pros: One of the most powerful properties of Hidden Markov Models is their ability to exhibit some degree of invariance to local warping (compression and stretching) of the time axis. However, a significant weakness of the Hidden Markov Model is the way in which it represents the distribution of times for which the system remains in a given state.
References we love to read: Bishop, Christopher M. Pattern Recognition and Machine Learning. New York: Springer, 2006. Print.

Hierarchical Clustering
Description: Connectivity-based clustering approach that sequentially builds bigger (agglomerative) or smaller (divisive) clusters in the data.
Tips From the Pros: Provides views of clusters at multiple resolutions of closeness. Algorithms begin to slow for larger datasets due to most implementations exhibiting O(N³) or O(N²) complexity.
References we love to read: Xu, Rui, and Don Wunsch. Clustering. New Jersey: Wiley-IEEE Press, 2008. Print.

K-means and X-means Clustering
Description: Centroid-based clustering algorithms, where with K-means the number of clusters is set and with X-means the number of clusters is unknown.
Tips From the Pros: When applying clustering techniques, make sure to understand the shape of your data. Clustering techniques will return poor results if your data is not circular or ellipsoidal shaped.
References we love to read: Xu, Rui, and Don Wunsch. Clustering. New Jersey: Wiley-IEEE Press, 2008. Print.
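To make the centroid idea concrete, here is a minimal hand-rolled Lloyd's algorithm (toy blobs invented here; assumes NumPy is available) run on the favorable case the tip describes: two well-separated, roughly circular groups.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate assigning points to the nearest
    centroid and moving each centroid to the mean of its points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Distance from every point to every centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Keep a centroid in place if its cluster happens to be empty.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

rng = np.random.default_rng(2)
# Two well-separated circular blobs: K-means' favorable case.
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
```

On elongated or non-elliptical groups this same loop will happily return a poor split, which is exactly the caveat in the tip above.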


Linear, Non-linear, and Integer Programming
Description: Set of techniques for minimizing or maximizing a function over a constrained set of input parameters.
Tips From the Pros: Start with linear programs because algorithms for integer and non-linear variables can take much longer to run.
References we love to read: Winston, Wayne L. Operations Research: Applications and Algorithms. Connecticut: Cengage Learning, 2003. Print.
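A minimal linear program (toy objective and constraints invented here; assumes SciPy is available) shows the shape of these problems:

```python
from scipy.optimize import linprog

# Maximize 3x + 2y subject to x + y <= 4, x <= 2, x >= 0, y >= 0.
# linprog minimizes, so we negate the objective coefficients.
result = linprog(c=[-3, -2],
                 A_ub=[[1, 1], [1, 0]],
                 b_ub=[4, 2],
                 bounds=[(0, None), (0, None)])
x, y = result.x
# The optimum sits at a vertex of the feasible region: x = 2, y = 2.
print(f"x = {x:.1f}, y = {y:.1f}, objective = {3*x + 2*y:.1f}")
```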

Markov Chain Monte Carlo (MCMC)
Description: A method of sampling typically used in Bayesian models to estimate the joint distribution of parameters given the data.
Tips From the Pros: Problems that are intractable using analytic approaches can become tractable using MCMC, even when considering high-dimensional problems. The tractability is a result of using statistics on the underlying distributions of interest, namely, sampling with Monte Carlo and considering the stochastic sequential process of Markov Chains.
References we love to read: Andrieu, Christophe, Nando de Freitas, Arnaud Doucet, and Michael I. Jordan. “An Introduction to MCMC for Machine Learning.” Machine Learning. Kluwer Academic Publishers, 2003. Print.

Monte Carlo Methods
Description: Set of computational techniques to generate random numbers.
Tips From the Pros: Particularly useful for numerical integration, solutions of differential equations, computing Bayesian posteriors, and high-dimensional multivariate sampling.
References we love to read: Fishman, George S. Monte Carlo: Concepts, Algorithms, and Applications. New York: Springer, 2003. Print.
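The classic toy example of Monte Carlo integration (assumes NumPy is available): the fraction of random points in the unit square that land inside the quarter circle approximates π/4.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
pts = rng.random((n, 2))                     # uniform points in [0, 1)^2
inside = (pts ** 2).sum(axis=1) <= 1.0       # inside the quarter circle?
pi_estimate = 4 * inside.mean()
print(f"pi is approximately {pi_estimate:.3f}")
```

The error shrinks like 1/sqrt(n), independent of dimension, which is why the approach scales to high-dimensional integrals where grids fail.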

Naïve Bayes
Description: Predicts classes following Bayes' Theorem, which states that the probability of an outcome given a set of features is based on the probability of features given an outcome.
Tips From the Pros: Assumes that all variables are independent, so it can have issues learning in the context of highly interdependent variables. The model can be learned on a single pass of data using simple counts and therefore is useful in determining whether exploitable patterns exist in large datasets with minimal development time.
References we love to read: Ingersoll, Grant S., Thomas S. Morton, and Andrew L. Farris. Taming Text: How to Find, Organize, and Manipulate It. New Jersey: Manning, 2013. Print.
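The "single pass of simple counts" point is easy to see in code. This is a minimal word-count Naïve Bayes on a tiny invented corpus (all strings and labels here are hypothetical):

```python
from collections import Counter, defaultdict
import math

# Hypothetical labeled corpus: one pass collects all the counts we need.
train = [("free money offer", "spam"),
         ("limited offer free prize", "spam"),
         ("meeting agenda attached", "ham"),
         ("lunch meeting tomorrow", "ham")]

word_counts = defaultdict(Counter)
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

def predict(text):
    """Score each class by log P(class) + sum of log P(word|class),
    with add-one (Laplace) smoothing for unseen words."""
    vocab = {w for c in word_counts.values() for w in c}
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("free prize offer"))    # -> spam
print(predict("agenda for meeting"))  # -> ham
```

The independence assumption is what lets the per-word probabilities simply multiply (add, in log space).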

Neural Networks
Description: Learns salient features in data by adjusting weights between nodes through a learning rule.
Tips From the Pros: Training a neural network takes substantially longer than evaluating new data with an already-trained network. Sparser network connectivity can help to segment the input space and improve performance on classification tasks.
References we love to read: Haykin, Simon O. Neural Networks and Learning Machines. New Jersey: Prentice Hall, 2008. Print.

Outlier Removal
Description: Method for identifying and removing noise or artifacts from data.
Tips From the Pros: Be cautious when removing outliers. Sometimes the most interesting behavior of a system is at times when there are aberrant data points.
References we love to read: Maimon, Oded, and Lior Rokach. Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers. The Netherlands: Kluwer Academic Publishers, 2005. Print.
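One common sketch of the idea (toy data invented here; assumes NumPy is available) uses a modified z-score built from the median and MAD, which, unlike the mean and standard deviation, are not themselves distorted by the outliers being hunted:

```python
import numpy as np

def remove_outliers(x, z_thresh=3.5):
    """Drop points with a large modified z-score, computed from the
    median and MAD so the outliers don't inflate the scale estimate."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    z = 0.6745 * np.abs(x - med) / mad
    return x[z < z_thresh]

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 55.0])  # one artifact
clean = remove_outliers(data)
print(clean)
```

Per the tip above, inspect what gets dropped before discarding it; the aberrant points are sometimes the story.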

Principal Components Analysis
Description: Enables dimensionality reduction by identifying highly correlated dimensions.
Tips From the Pros: Many large datasets contain correlations between dimensions; therefore part of the dataset is redundant. When analyzing the resulting principal components, rank order them by variance as this is the highest information view of your data. Use scree plots to infer the optimal number of components.
References we love to read: Wallisch, Pascal, Michael E. Lusignan, Marc D. Benayoun, Tanya I. Baker, Adam Seth Dickey, and Nicholas G. Hatsopoulos. Matlab for Neuroscientists. New Jersey: Prentice Hall, 2009. Print.
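A minimal sketch of PCA via the SVD (toy redundant data invented here; assumes NumPy is available): a 3-D dataset whose third column nearly copies the first has essentially only two dimensions of real information.

```python
import numpy as np

rng = np.random.default_rng(3)
# 200 samples in 3-D; the third column is a near-copy of the first,
# i.e., a redundant (highly correlated) dimension.
a = rng.standard_normal(200)
b = rng.standard_normal(200)
X = np.column_stack([a, b, a + 0.01 * rng.standard_normal(200)])

# PCA via SVD of the centered data; squared singular values give
# the variance captured by each principal component.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / (s**2).sum()
print(np.round(explained, 3))
```

The printed variance ratios are the numbers a scree plot visualizes; the third component carries almost nothing, so two components suffice.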

Random Search
Description: Randomly adjusts parameters to find a better solution than currently found.
Tips From the Pros: Use as a benchmark for how well a search algorithm is performing. Be careful to use a good random number generator and new seed.
References we love to read: Bergstra, J., and Y. Bengio. “Random Search for Hyper-Parameter Optimization.” Journal of Machine Learning Research 13, 2012.

Regression with Shrinkage (Lasso)
Description: A method of variable selection and prediction combined into a possibly biased linear model.
Tips From the Pros: There are different methods to select the lambda parameter. A typical choice is cross-validation with MSE as the metric.
References we love to read: Tibshirani, Robert. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B (Methodological). Toronto: Royal Statistical Society, 1996. Print.

Sensitivity Analysis
Description: Involves testing individual parameters in an analytic or model and observing the magnitude of the effect.
Tips From the Pros: Insensitive model parameters during an optimization are candidates for being set to constants. This reduces the dimensionality of optimization problems and provides an opportunity for speed up.
References we love to read: Saltelli, A., Marco Ratto, Terry Andres, Francesca Campolongo, Jessica Cariboni, Debora Gatelli, Michaela Saisana, and Stefano Tarantola. Global Sensitivity Analysis: The Primer. New Jersey: John Wiley & Sons, 2008. Print.


Simulated Annealing
Description: Named after a controlled cooling process in metallurgy, and by analogy using a changing temperature or annealing schedule to vary algorithmic convergence.
Tips From the Pros: The standard annealing function allows for initial wide exploration of the parameter space followed by a narrower search. Depending on the search priority, the annealing function can be modified to allow for longer explorative search at a high temperature.
References we love to read: Bertsimas, Dimitris, and John Tsitsiklis. “Simulated Annealing.” Statistical Science. 1993. Print.
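A minimal sketch of the accept/cool loop (toy objective, schedule, and constants all invented here): worse moves are accepted with probability exp(-delta/temp), and the decaying temperature turns wide early exploration into a narrow late search.

```python
import math
import random

def anneal(f, x0, temp=10.0, cooling=0.995, steps=5000, seed=0):
    """Minimize f by random perturbation; worse moves are accepted
    with probability exp(-delta/temp), and temp decays each step."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    for _ in range(steps):
        cand = x + rng.gauss(0, 1)          # random neighbor proposal
        fc = f(cand)
        if fc < fx or rng.random() < math.exp(-(fc - fx) / temp):
            x, fx = cand, fc                # accept (maybe uphill) move
            if fx < fbest:
                best, fbest = x, fx         # track the best seen so far
        temp *= cooling                     # cool the schedule
    return best, fbest

# Toy objective with many local minima; the global minimum is at x = 0.
f = lambda x: x * x + 10 * math.sin(x) ** 2
x, fx = anneal(f, x0=8.0)
```

The early high-temperature phase is what lets the search hop out of the local wells created by the sine term.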

Stepwise Regression
Description: A method of variable selection and prediction. Akaike's information criterion (AIC) is used as the metric for selection. The resulting predictive model is based upon ordinary least squares, or a general linear model with parameter estimation via maximum likelihood.
Tips From the Pros: Caution must be used when considering Stepwise Regression, as overfitting often occurs. To mitigate overfitting, try to limit the number of free variables used.
References we love to read: Hocking, R.R. “The Analysis and Selection of Variables in Linear Regression.” Biometrics. 1976. Print.

Stochastic Gradient Descent
Description: General-purpose optimization for learning of neural networks, support vector machines, and logistic regression models.
Tips From the Pros: Applied in cases where the objective function is not completely differentiable, when using sub-gradients.
References we love to read: Witten, Ian H., Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Massachusetts: Morgan Kaufmann, 2011. Print.

Support Vector Machines
Description: Projection of feature vectors using a kernel function into a space where classes are more separable.
Tips From the Pros: Try multiple kernels and use k-fold cross-validation to validate the choice of the best one.
References we love to read: Hsu, Chih-Wei, Chih-Chung Chang, and Chih-Jen Lin. “A Practical Guide to Support Vector Classification.” National Taiwan University Press, 2003. Print.

Term Frequency Inverse Document Frequency (TF-IDF)
Description: A statistic that measures the relative importance of a term from a corpus.
Tips From the Pros: Typically used in text mining. Assuming a corpus of news articles, a term that is very frequent such as “the” will likely appear many times in many documents, having a low value. A term that is infrequent, such as a person's last name that appears in a single article, will have a higher TF-IDF score.
References we love to read: Ingersoll, Grant S., Thomas S. Morton, and Andrew L. Farris. Taming Text: How to Find, Organize, and Manipulate It. New Jersey: Manning, 2013. Print.

Topic Modeling

Blei, David M., Andrew Y. Ng,

(Latent

Identifies latent topics

Employ part-of-speech tagging to eliminate

and Michael I. Jordan. “Latent

Dirichlet

in text by examining

words other than nouns and verbs. Use raw

Dirichlet Allocation.” Journal

word co-occurrence.

term counts instead of TF/IDF weighted terms.

of Machine Learning Research.

Allocation)

2003. Print.

James, G., D. Witten, T. Hastie,

Tree Based

Models structured as graph

Can be used to systematize a process or act as

and R. Tibshirani. “Tree Based

Methods

trees where branches

Methods.” An Introduction to

indicate decisions.

a classifier.

Statistical Learning. New York:

Springer, 2013. Print.

Hypothesis test used to

Make sure you meet the tests assumptions and

Bhattacharyya, Gouri K., and

T-Test

test for differences between

watch out for Family Wise error when running

Richard A. Johnson. Statistical

two groups.

multiple tests.

Concepts and Models. Wiley,

1977. Print.

Feature set reduction

John, George H., Ron

method that utilizes

Kohavi, and Karl Pfleger.

performance of a set of

“Irrelevant Features and the

Wrapper

features on a model, as

Utilize k-fold cross validation

Subset Selection Problem.”

Methods

a measure of the feature

Proceedings of ICML-94, 11th

set’s performance. Can

to control over fitting.

International Converence

help identify combinations

on Machine Learning. New

of features in models that

Brunswick, New Jersey. 1994.

achieve high performance.

Conference Presentation.

Compiled by: Booz Allen Hamilton
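The Term Frequency Inverse Document Frequency entry above can be made concrete in a few lines; the three-document corpus below is a made-up example, not drawn from the guide.

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """Term frequency in one document times the log inverse of the share of
    documents containing the term (+1 smoothing keeps the log finite)."""
    tf = Counter(doc)[term] / len(doc)
    n_containing = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / (1 + n_containing))

# A made-up three-document corpus.
corpus = [
    "the senator spoke to the press".split(),
    "the game ended in the final minute".split(),
    "smith filed the quarterly report".split(),
]

common = tf_idf("the", corpus[0], corpus)   # appears in every document
rare = tf_idf("smith", corpus[2], corpus)   # appears in only one document
```

With the +1 smoothing above, a term present in every document can even score negative, which still ranks it last, while the infrequent surname scores highest.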

THE FIELD GUIDE to DATA SCIENCE - LIFE in THE TRENCHES

NAVIGATING NECK DEEP IN DATA

Our Data Science experts have learned

and developed new solutions over the years

from properly framing or reframing analytic

questions. In this section, we list a few

important topics to Data Science coupled

with firsthand experience from our experts. - Going Deep into

Machine Learning

Machines are getting better at learning by mimicking the

human brain.

Think about where you were 10 years ago. Could computers understand

and take action based upon your spoken word? Recently, speech-to-text

quality has improved dramatically to nearly perfect accuracy, much to

the delight of many mobile phone users. In other complex tasks, similar

magical capabilities have emerged. The world-record high scores in 29

video games are now held by a machine learning algorithm with no

specific knowledge of Atari or computer games in general.

These impressive feats were made possible by deep learning, a different

way of approaching machine learning problems. Most approaches to

machine learning require humans to encode logic and rules to create

features, which are then fed into machine learning models. In some

domains, such as audio, text, image, and signal processing, effective

feature engineering requires considerable human expertise to achieve

decent model performance. Deep learning avoids the necessity

of human-encoded features and instead incorporates the feature

engineering, feature selection, and model fitting into one step.

Deep learning is an extension of an idea originating in the

1950s called neural networks, which are loosely inspired by our

understanding of how neurons in our brains operate. Recent

hardware developments, originally designed for faster renderings

of graphics, birthed a renaissance in neural networks. The latest

graphical processing units, or GPUs, have more than 3,000 processing

cores that are well suited for parallel processing of complex matrix

manipulations required for rendering graphics – and for executing

computations on neural networks.
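The layered structure described here can be sketched in a few lines of Python: one forward pass through a tiny fully-connected network with a single hidden layer. The weights below are arbitrary illustrative values, not a trained model.

```python
import math

def sigmoid(z):
    """Squashing function applied at each unit."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """One forward pass: input layer -> hidden layer -> output layer.
    Each hidden unit is a weighted sum of the inputs passed through a sigmoid."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))

# Arbitrary illustrative weights: 2 inputs -> 3 hidden units -> 1 output.
w_hidden = [[0.5, -0.4], [0.9, 0.1], [-0.3, 0.8]]
w_out = [0.7, -1.2, 0.6]

score = forward([1.0, 0.5], w_hidden, w_out)  # a value strictly between 0 and 1
```

Training consists of adjusting those weight matrices to reduce prediction error; the matrix arithmetic involved is exactly the kind of work GPUs parallelize well.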

In the late 2000s, the combination of GPUs, advances in algorithms,

and collections of big data reignited interest in neural networks. GPUs

enabled computers to process much larger networks in much less time,

and clever advances in algorithms made the model fitting process more

efficient. Large collections of image, video, and text data provided

content for the deep learning algorithms to learn. The ability to train

larger networks with more data drove the exploration of new neural

network architectures featuring additional hidden layers and widening

the breadth of the networks.

Presently, deep learning has moved beyond academic applications and is

finding its way into our daily lives. Deep learning powers speech-to-text

on our mobile phones and smart devices, image search provided by major

tech companies, language translation services for text and spoken word,

and even drug discovery within advanced pharmaceutical companies.

›› National Data Science Bowl

The first-ever National Data Science Bowl offered Data Scientists

a platform through which individuals could harness their passion,

unleash their curiosity and amplify their impact to affect change

on a global scale. The competition presented participants with

more than 100,000 underwater images provided by the Hatfield

Marine Science Center. Participants were challenged to develop a
classification algorithm that would enable researchers to monitor
ocean health at a speed and scale never before possible.

Aaron Sander

More than 1,000 teams submitted a total of approximately 15,000
solutions over the 90 days of the competition. A large proportion
of the participants' implemented solutions used deep learning-based
approaches, specifically Convolutional Neural Nets (CNNs). The
competition forum exploded with competitors collectively sharing
knowledge and collaborating to advance the state-of-the-art in
computer vision. Participants tested new techniques for developing
CNNs and contributed to the development of open source software
for creating CNN models.

The top three competitors, Team Deep Sea, Happy Lantern Festival,
and Poisson Process, all used CNNs in their solutions. Their results
increased algorithm accuracy by 10% over the state of the art.
Without their algorithms, it would have taken marine researchers
more than two lifetimes to manually complete the classification
process. The work submitted by all the participants represents major
advances for both the marine research and Data Science communities.

» Visit www.DataScienceBowl.com to learn more about the first-ever
National Data Science Bowl

[Figure: a network diagram with an input layer, hidden layer, and output layer]

Source: Booz Allen Hamilton
A Representation of Deep Learning

Life in the Trenches 87 - Feature Engineering

Feature engineering is a lot like oxygen. You can’t do without

it, but you rarely give it much thought.

Feature engineering is the process by which one establishes the

representation of data in the context of an analytic approach. It is the

foundational skill in Data Science. Without feature engineering, it

would not be possible to understand and represent the world through

a mathematical model. Feature engineering is a challenging art. Like

other arts, it is a creative process that manifests uniquely in each
Data Scientist. It will be influenced substantially by the scientist's
experiences, tastes, and understanding of the field.

When faced with the problem of feature engineering, there are
several paths that one may initially take. Generally speaking, better
features can be developed with more knowledge of the domain. One
approach to feature engineering is to begin by describing smaller
elements of the domain and continuously constructing more intricate
features as the model becomes more complex. These more complicated
features can be defined by considering other attributes of the domain,
aggregating features together into different groups, or using advanced
statistical and mathematical techniques to devise new features. The
ultimate judge in this process is the performance of the machine
learning algorithm, which makes decisions based on this feature vector.

Consider the example of email spam classification. Because the
domain is a set of emails, one possible choice of an initial feature
vector is the integer array that counts the number of times a
given word appears in the email. This is called the "bag of words"
assumption, where the order in which words appear in a text is ignored.
If an algorithm with this feature vector does not adequately
distinguish between spam and non-spam emails, a feature could
be added that counts the number of misspelled words in the text.
This new feature uses the spam recognition domain knowledge that
many spam emails deliberately misspell words to evade filters that
automatically label an email as spam when certain words appear. If
this new feature is not enough, there are still additional features to
be considered, such as whether or not a first or last name is used
in the email.

» What differentiates a great model from a bad one?
In most cases, the inputs to the model matter even more than the
choice of the algorithm. The traditional approach has a two-step
process where heuristics and subject-matter expertise are used to
find a good set of features, and then algorithms optimize model
parameters. Deep learning combines these steps into one. Feature
engineering, feature selection, and model parameter estimation are
accomplished simultaneously. This reduces the need for highly-
specialized domain knowledge and often results in better models.
For example, in the context of images from the natural world, deep
networks may learn low-level features such as lines at various angles
and curved lines. Middle layers may combine the lower-level features
into more complex geometric shapes and patterns. Higher-level layers
combine the mid-level features into more complicated features that
begin to resemble faces and shapes of animals. Applied to other
types of data, such as audio and text, deep neural networks learn
increasingly sophisticated features at each layer in a similar manner.

Feature engineering represents a complex, but crucial aspect of Data

Science. The Learning Optimal Features sidebar goes into detail about

feature learning – an automated approach to feature engineering that

applies machine learning techniques.
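The spam example can be sketched as code. The vocabulary, the "known words" dictionary, and the sample emails below are all invented for illustration.

```python
from collections import Counter

# Toy vocabulary and dictionary, invented for this sketch.
VOCAB = ["free", "winner", "click", "meeting", "schedule"]
KNOWN_WORDS = set(VOCAB) | {"you", "are", "a", "the", "for", "tomorrow"}

def featurize(email_text):
    """Bag-of-words counts over a fixed vocabulary (word order ignored),
    plus one engineered feature: the count of out-of-dictionary
    ("misspelled") words, using the domain knowledge described above."""
    words = email_text.lower().split()
    counts = Counter(words)
    bag = [counts[w] for w in VOCAB]
    misspelled = sum(1 for w in words if w not in KNOWN_WORDS)
    return bag + [misspelled]

spam_vec = featurize("you are a winnner click for free free stuff")
ham_vec = featurize("meeting schedule for tomorrow")
```

The final element of each vector is the engineered misspelling count; a classifier would consume these vectors without ever seeing the raw text.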

›› Chemoinformatic Search

On one assignment, my team was confronted with the challenge of
developing a search engine over chemical compounds. The goal of
chemoinformatic search is to predict the properties that a molecule
will exhibit as well as to provide indices over those predicted
properties to facilitate data discovery in chemistry-based research.
These properties may either be discrete (e.g., "a molecule treats
disease x well") or continuous (e.g., "a molecule may be dissolved
up to 100.21 g/ml").

Ed Kohlwey

Molecules are complex 3D structures, which are typically represented
as a list of atoms joined by chemical bonds of differing lengths with
varying electron domain and molecular geometries. The structures are
specified by the 3-space coordinates and the electrostatic potential
surface of the atoms in the molecule. Searching this data is a
daunting task when one considers that naïve approaches to the problem
bear significant semblance to the Graph Isomorphism Problem.[15]

The solution we developed was based on previous work in molecular
fingerprinting (sometimes also called hashing or locality sensitive
hashing). Fingerprinting is a dimensionality reduction technique that
dramatically reduces the problem space by summarizing many features,
often with relatively little regard to the importance of the feature.
When an exact solution is likely to be infeasible, we often turn to
heuristic approaches such as fingerprinting.

Our approach used a training set where all the measured properties
of the molecules were available. We created a model of how molecular
structural similarities might affect their properties. We began by
finding all the sub-graphs of each molecule with length n, resulting
in a representation similar to the bag-of-words approach from
natural language processing. We summarized each molecule fragment
in a type of fingerprint called a "Counting Bloom Filter."

Next, we used several exemplars from the set to create new features.
We found the distance from each member of the full training set to
each of the exemplars. We fed these features into a non-linear
regression algorithm to yield a model that could be used on data
that was not in the original training set. This approach can be
conceptualized as a "hidden manifold," whereby a hidden surface or
shape defines how a molecule will exhibit a property. We approximate
this shape using a non-linear regression and a set of data with known
properties. Once we have the approximate shape, we can use it to
predict the properties of new molecules.

Our approach was multi-staged and complex – we generated sub-graphs,
created bloom filters, calculated distance metrics and fit a linear-
regression model. This example provides an illustration of how many
stages may be involved in producing a sophisticated feature
representation. By creatively combining and building "features on
features," we were able to create new representations of data that
were richer and more descriptive, yet were able to execute faster and
produce better results.

Feature Selection

Models are like honored guests; you should only feed them the

good parts.

Feature selection is the process of determining the set of features with

the highest information value to the model. Two main approaches are

filtering and wrapper methods. Filtering methods analyze features

using a test statistic and eliminate redundant or non-informative

features. As an example, a filtering method could eliminate features

that have little correlation to the class labels. Wrapper methods

utilize a classification model as part of feature selection. A model is

trained on a set of features and the classification accuracy is used to

measure the information value of the feature set. One example is that

of training a neural network with a set of features and evaluating the

accuracy of the model. If the model scores highly on the test set, then

the features have high information value. All possible combinations of

features are tested to find the best feature set.

There are tradeoffs between these techniques. Filtering methods

are faster to compute since each feature only needs to be compared

against its class label. Wrapper methods, on the other hand, evaluate

feature sets by constructing models and measuring performance.

This requires a large number of models to be trained and evaluated

(a quantity that grows exponentially in the number of features).

Why would anyone use a wrapper method? Feature sets may perform

better than individual features.[16] With filter methods, a feature

with weak correlation to its class labels is eliminated. Some of

these eliminated features, however, may have performed well when

combined with other features.
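The contrast can be demonstrated on a toy dataset. The 1-nearest-neighbor "model" and the XOR-style data below are stand-ins chosen to make the point, not a prescribed method: each feature looks useless to a filter in isolation, yet the wrapper finds the pair is perfect.

```python
# Toy dataset: 4 binary features per sample; the label is the XOR of
# features 0 and 1, so neither is informative alone but the pair is perfect.
X = [[0, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 1], [1, 1, 0, 0],
     [0, 0, 0, 1], [0, 1, 1, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
y = [0, 1, 1, 0, 0, 1, 1, 0]

def filter_score(X, y, j):
    """Filter method: score feature j alone by how well it tracks the label
    (symmetric, so perfect anti-correlation also scores 1.0)."""
    agree = sum(row[j] == label for row, label in zip(X, y))
    return max(agree, len(y) - agree) / len(y)

def wrapper_score(X, y, subset):
    """Wrapper method: leave-one-out accuracy of a 1-nearest-neighbor
    'model' trained on just the chosen feature subset (Hamming distance)."""
    correct = 0
    for i in range(len(X)):
        a = [X[i][j] for j in subset]
        nn = min((k for k in range(len(X)) if k != i),
                 key=lambda k: sum(a[t] != X[k][j] for t, j in enumerate(subset)))
        correct += y[nn] == y[i]
    return correct / len(X)

solo = [filter_score(X, y, j) for j in (0, 1)]   # each alone looks like noise
pair = wrapper_score(X, y, (0, 1))               # together they classify perfectly
```

The wrapper's cost is also visible here: evaluating every subset means training a model per combination, which is what makes the exponential growth in the number of features bite.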

›› Cancer Cell Classification

On one project, our team was challenged to classify cancer cell
profiles. The overarching goal was to classify different types of
Leukemia, based on Microarray profiles from 72 samples[17] using a
small set of features. We utilized a hybrid Artificial Neural
Network (ANN)[18] and Genetic Algorithm[19] to identify subsets of
10 features selected from thousands.[20] We trained the ANN and
tested performance using cross-fold validation.

Paul Yacci

The performance measure was used as feedback into the Genetic
Algorithm. When a set of features contained no useful information,
the model performed poorly and a different feature set would be
explored. Over time, this method selected a set of features that
performed with high accuracy. The down-selected feature set
increased speed and performance as well as allowed for better
insight into the factors that may govern the system. This allowed
our team to design a diagnostic test for only a few genetic markers
instead of thousands, substantially reducing diagnostic test
complexity and cost.

Ensemble Models

None of us is as smart as all of us, but some are smarter

than others.

In 1906, Sir Francis Galton attended a fair at which there was a

contest to guess the weight of an ox. Galton had the idea to collect the

guesses of the 787 entrants and compute the mean. To his surprise,

the mean was only one pound off from the ox’s real weight of 1,198

pounds. Together, the estimates made by many amateurs formed a

prediction that was more accurate than that of individual experts.

Galton’s “wisdom of crowds” extends to Data Science in the form of

ensemble learning, which is colloquially and somewhat equivalently

called ensembling, blending, or stacking. An ensemble takes the

predictions of many individual models and combines them to

make a single prediction. Like the people guessing the ox’s weight,

Data Science models have unique strengths and weaknesses (i.e.,

determined by their design), and are influenced by varied perspectives

based on past experience (i.e., the data they have observed).

Ensembles overcome individual weaknesses to make predictions with

more accuracy than their constituent models. These models need not

stem from different methodologies; an ensemble might employ the

same method with different parameters or weights (e.g., boosting),

feature subsets (e.g., random forests), or sub-samples of data (e.g.,

bagging). The ensembling methodology may be as simple as averaging

two outputs, or as complex as using a “meta model” to learn an

optimal combination.

An ensemble’s ability to reduce individual errors arises from the

diversity of its members. If one model overfits the data, it is balanced
by a different model that underfits the data. If one subset is skewed

by outlier values, another subset is included without them. If one

method is unstable to noisy inputs, it is bolstered by another method

that is more robust.

In practice, ensembling typically improves a model by a few percent.

The price of this accuracy is paid in complexity. The accuracy vs.

complexity tradeoff can make it difficult to know when ensembling

is justified. On one hand, ensembles appear to be a fit for high-stakes

problems—think detecting cancer in MRI images vs. detecting

unripe blueberries on a conveyor belt. On the other hand, high-stakes

problems mandate higher standards for auditing model functionality.

The Data Scientist must manage a balance between ensemble

interpretability and black-box complexity. If this seems easy, it isn’t!

Put yourself in the driver’s seat of the machine learning code for a

self-driving car. If a well-behaved regression model makes a right

decision 99.5% of the time, but a complex, less-explainable ensemble

is right 99.8% of the time, which would you pick?
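A minimal simulation of the averaging idea, with invented biases and noise levels standing in for real constituent models:

```python
import random
import statistics

random.seed(7)
TRUE_WEIGHT = 1198  # the ox at Galton's fair

# Five hypothetical models: each has its own systematic bias plus random
# noise, standing in for constituent models with different designs and data.
BIASES = [-60, -25, 10, 30, 55]

def trial():
    """One round: does the simple average beat the typical individual model?"""
    preds = [TRUE_WEIGHT + b + random.gauss(0, 40) for b in BIASES]
    ensemble_err = abs(statistics.mean(preds) - TRUE_WEIGHT)
    mean_individual_err = statistics.mean(abs(p - TRUE_WEIGHT) for p in preds)
    return ensemble_err < mean_individual_err

# The members' partially independent errors cancel, so the average wins
# in the large majority of rounds.
win_rate = sum(trial() for _ in range(2000)) / 2000
```

Averaging is the simplest possible ensembling methodology; boosting, bagging, and meta-model stacking replace the plain mean with weighted or learned combinations.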

›› The Value of

Ensemble Models

Several years ago, the Kaggle Photo Quality Prediction

competition posed the question “Given anonymized information

on thousands of photo albums, predict whether a human

evaluator would mark them as 'good'.” Participants were

supplied a large collection of user-generated photos. The goal

was to create an algorithm that could automatically pick out

particularly enjoyable or impressive photos from the collection.

Will Cukierski

Over the course of the competition, 207 people submitted entries. The log likelihood metric was

used to evaluate the accuracy of the entries. Scores for the top 50 teams ranged from 0.18434 to

0.19884, where lower is better. Kaggle data scientist Ben Hamner used the results to illustrate the

value of ensembling by means of averaging the top 50 scores. The figure below shows the results.

[Chart: log likelihood (lower is better) versus final team rank for the top 50 teams on the private leaderboard, showing the individual scores and the ensembled score, with Rank=1 marked]

Value of Ensembling for the Kaggle Photo Quality Prediction Competition
(results courtesy of Ben Hamner, Kaggle CTO)

The blue line shows the individual scores for each of the top 50
teams. The orange line shows the ensembled score for the top n teams,
where n ranges from 1 to the value on the axis. For example, the
ensemble point for Final Team Rank 5 is an ensemble of the entries
for teams 1 through 5. As shown in the graph, the ensembled score is
lower than any single individual score. The diversity of models
included within the ensemble causes the respective errors to cancel
out, resulting in an overall lower score. This holds true for all
points across the top 50 teams. However, after we increase the number
of models in the ensemble beyond 15, we begin to see the ensembled
score increase. This occurs because we are introducing less accurate
(i.e., potentially overfit) models into the ensemble. The results of
this simple experiment quantify the value of creating an ensemble
model, while reinforcing the idea that we must be thoughtful when
selecting the individual models contained within the ensemble.

Data Veracity

We’re Data Scientists, not data alchemists. We can’t make

analytic gold from the lead of data.

While most people associate data volume, velocity, and variety with

big data, there is an equally important yet often overlooked dimension

– data veracity. Data veracity refers to the overall quality and

correctness of the data. You must assess the truthfulness and accuracy

of the data as well as identify missing or incomplete information. As

the saying goes, “Garbage in, garbage out.” If your data is inaccurate or

missing information, you can’t hope to make analytic gold.

Assessing data truthfulness is often subjective. You must rely on your

experience and an understanding of the data origins and context.

Domain expertise is particularly critical for the latter. Although data

accuracy assessment may also be subjective, there are times when

quantitative methods may be used. You may be able to re-sample from

the population and conduct a statistical comparison against the stored

values, thereby providing measures of accuracy.

The most common issues you will encounter are missing or

incomplete information. There are two basic strategies for dealing

with missing values – deletion and imputation. In the former, entire

observations are excluded from analysis, reducing sample size and

potentially introducing bias. Imputation, or replacement of missing

or erroneous values, uses a variety of techniques such as random

sampling (hot deck imputation) or replacement using the mean,
statistical distributions, or models.

» Tips From the Pros
Find an approach that works, implement it, and move on. You can
worry about optimization and tuning your approaches later during
incremental improvement.
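The deletion and imputation strategies described above can be sketched as follows; the sensor readings and the missing-value marker are hypothetical.

```python
import random
import statistics

random.seed(42)

# Hypothetical readings; None marks missing values.
readings = [4.1, None, 3.8, 4.4, None, 4.0, 3.9]
observed = [x for x in readings if x is not None]

# Strategy 1 -- deletion: drop incomplete observations. The sample shrinks,
# and bias creeps in if the values are not missing at random.
deleted = observed

# Strategy 2 -- mean imputation: fill each gap with the observed mean.
mean_value = statistics.mean(observed)
mean_imputed = [mean_value if x is None else x for x in readings]

# Strategy 3 -- hot deck imputation: fill each gap with a value drawn
# at random from the observed data.
hot_deck = [random.choice(observed) if x is None else x for x in readings]
```

Mean imputation preserves the overall average but understates variance; hot deck imputation keeps the distribution's shape at the cost of added randomness.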

›› Time Series Modeling

On one of our projects, the team was faced with correlating the time
series for various parameters. Our initial analysis revealed that the
correlations were almost non-existent. We examined the data and
quickly discovered data veracity issues. There were missing and null
values, as well as negative-value observations, an impossibility
given the context of the measurements (see the figure, Time Series
Data Prior to Cleansing). Garbage data meant garbage results.

Brian Keller

Source: Booz Allen Hamilton
Time Series Data Prior to Cleansing

Because sample size was already small, deleting observations was
undesirable. The volatile nature of the time series meant that
imputation through sampling could not be trusted to produce values
in which the team would be confident. As a result, we quickly
realized that the best strategy was an approach that could filter
and correct the noise in the data.

We initially tried a simplistic approach in which we replaced each
observation with a moving average. While this corrected some noise,
including the outlier values in our moving-average computation
shifted the time series. This caused undesirable distortion in the
underlying signal, and we quickly abandoned the approach.

One of our team members who had experience in signal processing
suggested a median filter. The median filter is a windowing technique
that moves through the data point-by-point, and replaces it with the
median value calculated for the current window. We experimented with
various window sizes to achieve an acceptable tradeoff between
smoothing noise and smoothing away signal. The figure, Time Series
Data After Cleansing, shows the same two time series after median
filter imputation.

Source: Booz Allen Hamilton
Time Series Data After Cleansing

The application of the median filter approach was hugely successful.
Visual inspection of the time series plots reveals smoothing of the
outliers without dampening the naturally occurring peaks and troughs
(no signal loss). Prior to smoothing, we saw no correlation in our
data, but afterwards, Spearman's Rho was ~0.5 for almost all
parameters.

By addressing our data veracity issues, we were able to create
analytic gold. While other approaches may also have been effective,
implementation speed constraints prevented us from doing any further
analysis. We achieved the success we were after and moved on to
address other aspects of the problem.
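A median filter like the one the team used can be written in a few lines; the window size and the toy series below are illustrative, not the project's data.

```python
def median_filter(series, window=3):
    """Slide a window over the series point-by-point and replace each value
    with the median of its window -- this knocks out isolated spikes without
    shifting the signal the way a moving average does."""
    half = window // 2
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        out.append(sorted(series[lo:hi])[(hi - lo) // 2])
    return out

# A slow ramp with one impossible negative spike, like the veracity
# issues described above.
raw = [10, 11, 12, -999, 13, 14, 15]
cleaned = median_filter(raw)  # the spike is gone; the ramp is preserved
```

Larger windows smooth more aggressively, which is exactly the noise-versus-signal tradeoff the team tuned by experiment.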

Application of

Domain Knowledge

We are all special in our own way. Don’t discount what

you know.

Knowledge of the domain in which a problem lies is immensely

valuable and irreplaceable. It provides an in-depth understanding

of your data and the factors influencing your analytic goal. Many

times domain knowledge is a key differentiator to a Data Science

team’s success. Domain knowledge influences how we engineer and

select features, impute data, choose an algorithm, and determine

success. One person cannot possibly be a domain expert in every

field, however. We rely on our team, other analysts and domain

experts as well as consult research papers and publications to build

an understanding of the domain.

›› Motor Vehicle Theft

On one project, our team explored how Data Science could be applied
to improve public safety. According to the FBI, approximately
$8 Billion is lost annually due to automobile theft. Recovery of the
one million vehicles stolen every year in the U.S. is less than 60%.
Dealing with these crimes represents a significant investment of law
enforcement resources. We wanted to see if we could identify how to
reduce auto theft while efficiently using law enforcement resources.

Armen Kherlopian

Our team began by parsing and verifying San Francisco crime data.
We enriched stolen car reporting with general city data. After
conducting several data experiments across both space and time,
three geospatial and one temporal hotspot emerged (see figure,
Geospatial and Temporal Car Theft Hotspots). The domain expert on
the team was able to discern that the primary geospatial hotspot
corresponded to an area surrounded by parks. The parks created an
urban mountain with a number of over-foot access points that were
conducive to car theft.

GEOSPATIAL HOTSPOTS · TEMPORAL HOTSPOT

[Chart: a map of the geospatial hotspots alongside a day-of-week by hour-of-day grid showing the temporal hotspot]

Source: Booz Allen Hamilton

Geospatial and Temporal Car Theft Hotspots

Our team used the temporal hotspot information in tandem with the insights

from the domain expert to develop a Monte Carlo model to predict the likelihood

of a motor vehicle theft at particular city intersections. By prioritizing the

intersections identified by the model, local governments would have the

information necessary to efficiently deploy their patrols. Motor vehicle thefts

could be reduced and law enforcement resources could be more efficiently

deployed. The analysis, enabled by domain expertise, yielded actionable insights

that could make the streets safer.
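A Monte Carlo model of this kind can be sketched as repeated simulation; the intersection names and hourly theft rates below are entirely invented for illustration, not the project's estimates.

```python
import random

random.seed(3)

# Hypothetical per-hour theft probabilities for a few intersections.
HOURLY_RATE = {"5th & Market": 0.004, "Park & Page": 0.009, "3rd & King": 0.002}

def monthly_theft_risk(rate, hours=24 * 30, trials=2000):
    """Monte Carlo estimate of the chance of at least one theft in a month:
    simulate many months hour-by-hour and count those containing a theft."""
    hits = sum(
        any(random.random() < rate for _ in range(hours)) for _ in range(trials)
    )
    return hits / trials

risk = {name: monthly_theft_risk(r) for name, r in HOURLY_RATE.items()}
ranked = sorted(risk, key=risk.get, reverse=True)  # patrol priority order
```

The ranked list is the actionable output: patrols go first to the intersections with the highest simulated likelihood of theft.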

The Curse of

Dimensionality

There is no magical potion to cure the curse, but there is PCA.

The “curse of dimensionality” is one of the most important results

in machine learning. Most texts on machine learning mention this

phenomenon in the first chapter or two, but it often takes many years

of practice to understand its true implications.

Classification methods, like most machine learning methods, are

subject to the implications of the curse of dimensionality. The basic

intuition in this case is that as the number of data dimensions

increases, it becomes more difficult to create generalizable

classification models (models that apply well over phenomena not

observed in the training set). This difficulty is usually impossible to

overcome in real world settings. There are some exceptions in domains

where things happen to work out, but usually you must work to

minimize the number of dimensions. This requires a combination

of clever feature engineering and use of dimensionality reduction

techniques (see Feature Engineering and Feature Selection Life in the

Trenches). In our practical experience, the maximum

number of dimensions seems to be ~10 for linear model-based

approaches. The limit seems to be in the tens of thousands for more

sophisticated methods such as support vector machines, but the limit

still exists nonetheless.
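Principal Component Analysis, named in the epigraph, is the workhorse of the dimensionality reduction techniques alluded to here. A minimal sketch using only NumPy follows; the data shape and component count are illustrative, not taken from the text:

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components via the SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]

# 100 samples in 50 dimensions, but with only ~3 real degrees of freedom.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 3))
X = latent @ rng.normal(size=(3, 50)) + 0.01 * rng.normal(size=(100, 50))

reduced, explained = pca(X, n_components=3)
print(reduced.shape)    # (100, 3)
print(explained.sum())  # close to 1.0: three components capture the data
```

When the explained-variance ratio of the first few components is near 1.0, as here, the remaining dimensions can usually be discarded without hurting a downstream classifier.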

A counterintuitive consequence of the curse of dimensionality is

that it limits the amount of data needed to train a classification

model. There are roughly two reasons for this phenomenon. In one

case, the dimensionality is small enough that the model can be

trained on a single machine. In the other case, the exponentially

expanding complexity of a high-dimensionality problem makes it

(practically) computationally impossible to train a model. In our

experience, it is quite rare for a problem to fall in a “sweet spot”

between these two extremes.

Rather than trying to create super-scalable algorithm

implementations, focus your attention on solving your immediate

problems with basic methods. Wait until you encounter a problem

where an algorithm fails to converge or provides poor cross-validated

results, and then seek new approaches. Only when you find that

alternate approaches don’t already exist, should you begin building

new implementations. The expected cost of this work pattern is lower

than over-engineering right out of the gate.

Put otherwise, “Keep it simple, stupid”.

THE FIELD GUIDE to DATA SCIENCE

Baking the Cake

››

I was once given a time series set of roughly 1,600 predictor variables
and 16 target variables and asked to implement a number of modeling
techniques to predict the target variable values. The client was
challenged to handle the complexity associated with the large number
of variables and needed help. Not only did I have a case of the curse,
but the predictor variables were also quite diverse. At first glance, it
looked like trying to bake a cake with everything in the cupboard.
That is not a good way to bake or to make predictions!

– Stephanie Rivera

The data diversity could be partially explained by the fact that the
time series predictors did not all have the same periodicity. The target
time series were all daily values whereas the predictors were daily,
weekly, quarterly, and monthly. This was tricky to sort out, given that
imputing zeros isn’t likely to produce good results. For this specific
reason, I chose to use neural networks for evaluating the weekly
variable contributions.

Using this approach, I was able to condition upon week, without
greatly increasing the dimensionality. For the other predictors, I used
a variety of techniques, including projection and correlation, to make
heads or tails of the predictors. My approach successfully reduced
the number of variables, accomplishing the client’s goal of making the
problem space tractable. As a result, the cake turned out just fine.

Model Validation

Repeating what you just heard does not mean that you

learned anything.

Model validation is central to construction of any model. This answers

the question “How well did my hypothesis fit the observed data?”

If we do not have enough data, our models cannot connect the dots.
On the other hand, if a model fits the training data too closely, it
cannot think outside of the box. The model learns specific details
about the training data that do not generalize to the population. This
is the problem of model overfitting.

Many techniques exist to combat model overfitting. The simplest

method is to split your dataset into training, testing and validation

sets. The training data is used to construct the model. The model

constructed with the training data is then evaluated with the testing

data. The performance of the model against the testing set is used to

further reduce model error. This indirectly includes the testing data

within model construction, helping to reduce model overfitting. Finally,

the model is evaluated on the validation data to assess how well the

model generalizes.
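The three-way split just described can be sketched as follows; the 60/20/20 proportions are an illustrative choice, not one prescribed by the text:

```python
import numpy as np

def train_test_validation_split(n_samples, seed=0):
    """Randomly partition sample indices into 60% training,
    20% testing, and 20% validation sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train, n_test = int(0.6 * n_samples), int(0.2 * n_samples)
    return (idx[:n_train],                  # fit the model here
            idx[n_train:n_train + n_test],  # tune / reduce error here
            idx[n_train + n_test:])         # final generalization check

train, test, val = train_test_validation_split(1000)
print(len(train), len(test), len(val))  # 600 200 200
```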

A few methods in which the data is split into training and testing sets

include: k-fold cross-validation, Leave-One-Out cross-validation,

bootstrap methods, and resampling methods. Leave-One-Out cross-

validation can be used to get a sense of ideal model performance

over the training set. A sample is selected from the data to act as the

testing sample and the model is trained on the rest of the data. The

error on the test sample is calculated and saved, and the sample is

returned to the dataset. A different sample is then selected and the

process is repeated. This continues until every sample has served as
the test sample. The average error over the held-out samples gives a

measure of the model’s error.
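The Leave-One-Out procedure just described can be sketched directly; the toy dataset and ordinary least-squares model below are illustrative stand-ins for whatever model is being validated:

```python
import numpy as np

def loocv_error(X, y, fit, predict):
    """Leave-One-Out cross-validation: hold out each sample in turn,
    train on the rest, and average the squared prediction errors."""
    errors = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i           # every sample except i
        model = fit(X[mask], y[mask])           # train on the rest
        errors.append((predict(model, X[i:i+1])[0] - y[i]) ** 2)
    return float(np.mean(errors))

# Toy example: ordinary least squares on a noisy line.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(20, 1))
y = 3 * X[:, 0] + rng.normal(0, 0.5, size=20)

fit = lambda X, y: np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)[0]
predict = lambda w, X: np.c_[np.ones(len(X)), X] @ w

mse = loocv_error(X, y, fit, predict)
print(mse)  # average held-out squared error
```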

There are other approaches for testing how well your hypothesis

reflects the data. Statistical methods such as calculating the coefficient

of determination, commonly called the R-squared value are used to

identify how much variation in the data your model explains. Note

that as the dimensionality of your feature space grows, the R-squared

value also grows. An adjusted R-squared value compensates for this
phenomenon by including a penalty for model complexity. When
testing the significance of the regression as a whole, the F-test
compares the explained variance to unexplained variance. A regression
result with a high F-statistic and an adjusted R-squared over 0.7 is
almost surely significant.

» Do we really need a case study to know that you should
check your work?
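These statistics can be computed directly from a model's predictions. A minimal sketch follows; the toy values are illustrative:

```python
import numpy as np

def regression_diagnostics(y, y_hat, p):
    """R-squared, adjusted R-squared, and the overall F-statistic
    for a fitted regression with p predictors (excluding intercept)."""
    n = len(y)
    ss_res = np.sum((y - y_hat) ** 2)       # unexplained variance
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # total variance
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # penalize complexity
    f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))   # explained vs. unexplained
    return r2, adj_r2, f_stat

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = np.array([1.1, 1.9, 3.0, 4.2, 4.8])
r2, adj_r2, f_stat = regression_diagnostics(y, y_hat, p=1)
print(r2, adj_r2, f_stat)
```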

PUTTING it ALL TOGETHER
- Streamlining Medication Review

›› Analytic Challenge

The U.S. Food and Drug Administration (FDA) is responsible for advancing public
health by supporting the delivery of new treatments to patients; assessing the safety,
efficacy and quality of regulated products; and conducting research to drive medical
innovation. Although the FDA houses one of the world’s largest repositories of
regulatory and scientific data, reviewers are not able to easily leverage data-driven
approaches and analytics methods to extract information, detect signals and uncover
trends to enhance regulatory decision-making and protect public health. In addition,
a rapid increase in the volume, velocity and variety of data that must be analyzed
to address and respond to regulatory challenges, combined with variances in data
standards, formats, and quality, severely limit the ability of FDA Center for Drug
Evaluation and Research (CDER) regulatory scientists to conduct cross-study,
cross-product, retrospective, and meta-analysis during product reviews.

Booz Allen Hamilton was engaged to research, develop, and evaluate emerging
informatics tools, methods, and techniques to determine their ability to address
regulatory challenges faced by CDER. The main goal was to enable the CDER
community to fully utilize the agency’s expansive data resources for efficient and
effective drug review through the design and development of informatics capabilities
based on Natural Language Processing (NLP), data integration, and data
visualization methodologies.

» Our Case Studies

Hey, we have given you a lot of really good technical content. We know that this
section has the look and feel of marketing material, but there is still a really good
story here. Remember, storytelling comes in many forms and styles, one of which is
the marketing version. You should read this chapter for what it is – great information
told with a marketing voice.

Our Approach

To support transformational change at CDER, we designed and developed a set
of informatics prototypes for the analysis and modeling of complex structured,
unstructured, and fragmented datasets. We developed multiple prototypes to enable
the evaluation of emerging informatics tools, methods, and techniques and their
ability to enable a critical-value driver – e.g., co-locate complex, heterogeneous data
to identify patterns and foster development of strategies to protect public health. For
example, we implemented NLP algorithms to compare adverse events across
datasets, and geographic visualization capabilities to support the inspection of
pharmaceutical manufacturing facilities.

Product Safety Analytics. The review, surveillance, and analysis of adverse events
throughout the product lifecycle require significant resources. The ability to identify
actionable insights that lead to informed decision-making requires significant
investment of effort, including the active and passive surveillance of adverse events.
To address these challenges, we developed a Product Safety Dashboard that
compares adverse events listed in the product label (i.e.,

package inserts) with data from the FDA Adverse Event Reporting System (FAERS).
Using NLP, we extracted adverse events from the product label to create a structured
table of label data out of unstructured text. This dashboard allows safety evaluators
to view whether or not a reported adverse event is already known, without having to
access an external data source and read through product labels.

Product Quality Analytics. To support CDER’s mission of reviewing and managing
product quality, novel methodologies and tools are needed to improve the efficiency
and efficacy of the product quality-review process. Integration of disparate data
sources is the first step in building a comprehensive profile of manufacturers,
facilities, and the products associated with individual facilities. To address these
challenges, we developed a Facility Inventory Report to show the geographic
location of facilities and their associated metadata. This geovisualization tool
processes and transforms raw data into a user-friendly visual interface with mapping
features to enhance the surveillance capabilities of CDER and provide reviewers
with the ability to establish connections between facility data and product quality.
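The label-to-table extraction step can be illustrated with a simple keyword matcher. This is a hypothetical stand-in for the NLP pipeline described; the term list and record layout are illustrative, and a production system would match against a medical ontology rather than a hand-made vocabulary:

```python
import re

# Hypothetical adverse-event vocabulary (illustrative only).
AE_TERMS = ["headache", "nausea", "dizziness", "rash", "fatigue"]

def extract_adverse_events(label_text):
    """Turn unstructured product-label text into structured rows of
    (term, character offset) for each adverse-event mention."""
    rows = []
    for term in AE_TERMS:
        for match in re.finditer(r"\b%s\b" % term, label_text, re.IGNORECASE):
            rows.append({"term": term, "offset": match.start()})
    return rows

label = "Common reactions include headache, mild nausea, and skin rash."
rows = extract_adverse_events(label)
print(rows)
```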

Our Impact

Since the FDA is responsible for regulating 25 cents of every dollar that Americans

spend, the agency’s ability to fully use regulatory datasets and meaningfully integrate

previously incompatible data to rapidly detect product quality and safety issues is

critical for safeguarding public health. NLP approaches provide CDER with the ability

to more efficiently search a broader range of textual data and enhance its ability to

gain insight from additional data forms that may seem unrelated. Data integration

and visualization directly increase the efficiency of researchers by reducing their time

spent on searching for frequently-performed aggregate or granular calculations,

and by proactively presenting the most frequently desired data to the reviewer

through thoughtful and contextual dashboards designed to reveal patterns and

trends in disparate data sources. These new capabilities position the FDA to enhance

regulatory decision-making, drive advances in personalized medicine, and enable

earlier detection of safety signals in the general population.

Reducing Flight Delays

›› Analytic Challenge

Domestic airline departure delays are estimated to cost the U.S. economy $32.9 billion

annually. The Federal Aviation Administration’s (FAA’s) Traffic Flow Management

System (TFMS) is used to strategically manage flights and includes a flight departure-

delay prediction engine which applies simple heuristics to predict flight delays.

However, the limited predictive power of these heuristics constrains the FAA’s ability

to act in accordance with its existing departure-delay management plan. In response,

the FAA’s NextGen Advanced Concepts and Technology Development Group wanted to

create a predictive probabilistic model to improve aircraft departure time predictions.

This new model would help the FAA understand the causes of departure delays and

develop policies and actions to improve the reliability of departure time predictions for

real-time air traffic flow management.

Our Approach

The commercial aviation industry is rich in flight operations data, much of which
is publicly available through government websites and a few subscription vendors.
Booz Allen Hamilton leveraged these sources to gather over 4 TB of data detailing
tarmac and airspace congestion, weather conditions, network effects, Traffic
Management Initiatives, and airline and aircraft-specific attributes for every
commercial flight departing from U.S. airports between 2008 and 2012. This data
included over 50 million flights and around 100 variables for each flight. The data
included composite variables (e.g. incoming flight delay) that were constructed from
the raw data to capture relevant dynamics of flight operations. Data acquisition,
processing, quality control, and accuracy between disparate datasets were important
steps during this process.

The team applied supervised learning algorithms to develop Bayesian Belief
Network (BBN) models to predict flight departure deviation. The most critical steps
in model development were the selection of optimal algorithms to discretize model
variables, and the selection of appropriate machine learning techniques to learn the
model from the data. The team followed information theory principles to discretize
model variables to maximize the model’s predictive power, and to represent the data
as closely as possible with the least amount of network complexity. Booz Allen
segmented the model variables into three different categories based on the time to
flight departure: 24 hours, 11 hours, and one hour. Certain flight variables could
only be known for specific pre-departure times. For example, the tarmac and
airspace congestion variables for a flight are only known just before the flight, and
hence those variables feature only in the one hour category. Departure delays were
predicted for each of the three time horizons.
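The information-theoretic discretization idea can be illustrated in miniature: choose the cut point for a continuous predictor that maximizes the mutual information between the binned predictor and the delay outcome. This is a simplified sketch of the principle, not the team's actual algorithm, and the data is synthetic:

```python
import numpy as np

def mutual_information(x_binned, y):
    """I(X;Y) in nats for two discrete arrays."""
    mi = 0.0
    for xv in np.unique(x_binned):
        for yv in np.unique(y):
            p_xy = np.mean((x_binned == xv) & (y == yv))
            p_x, p_y = np.mean(x_binned == xv), np.mean(y == yv)
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

def best_two_bin_split(x, y, candidates):
    """Pick the cut point that makes the binarized predictor
    most informative about the binary delay label y."""
    scores = {c: mutual_information((x > c).astype(int), y) for c in candidates}
    return max(scores, key=scores.get)

# Toy data: delays become likely once congestion exceeds 5.
rng = np.random.default_rng(1)
congestion = rng.uniform(0, 10, 500)
delayed = (congestion + rng.normal(0, 1, 500) > 5).astype(int)
best = best_two_bin_split(congestion, delayed, candidates=[2, 3, 4, 5, 6, 7, 8])
print(best)  # the most informative cut point, near the true threshold of 5
```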

Our Impact

For a typical airport, the model delivers a delay prediction improvement of between

100,000 and 500,000 minutes annually over previous FAA predictions. The model can

be used by a range of aviation stakeholders, such as airlines, to better understand

and predict network flight delays. This can improve the airlines’ operational decisions

to include more proactive schedule adjustments during times of disruption (e.g.

weather or sector load). The improved reliability of departure prediction will improve

FAA's predictions for airports, sectors, and other resources, and has the potential to

enable improved real-time traffic flow management, which can significantly reduce

airline departure delays and the associated economic costs. This means a more

efficient and effective air transportation network.

Making Vaccines Safer

›› Analytic Challenge

The U.S. Food and Drug Administration (FDA) Center for Biologics Evaluation and

Research (CBER) is responsible for protecting public health by assuring the safety

and efficacy of biologics, including vaccines, blood and blood products. CBER’s

current surveillance process, which requires resource-intensive manual review by

expert Medical Officers, does not scale well to short-term workload variation and

limits long-term improvements in review cycle-time. In addition, the large volume of

Adverse Event (AE) reports received by the Agency makes it difficult for reviewers to

compare safety issues across products and patient populations.

CBER engaged Booz Allen Hamilton to develop advanced analytics approaches for

the triage and analysis of AE reports. The main goal was to leverage (1) Natural

Language Processing (NLP) to alleviate resource pressures by semi-automating

some of the manual review steps through techniques, such as text classification

and entity extraction, and (2) network visualizations to offer alternative interactions

with datasets and support AE pattern recognition. By integrating NLP and network

analysis capabilities into the Medical Officer’s review process, Booz Allen successfully

provided decision-makers with important information concerning product risks and

possible mitigations that can reduce risk.

Our Approach

We designed and developed a set of prototypes for the analysis and visualization
of complex structured and unstructured AE datasets. We developed tools that
leverage NLP and network analysis to extend and enhance CBER’s efforts to
monitor the safety of biologics and manage safety throughout the product lifecycle.

Adverse Event Text Mining Analytics. Reviewing AE reports involves sorting based
on the likelihood of a relationship between a product and reported adverse events.
Since much of an AE report is unstructured text, and most reports are not directly
related to the use of the implicated biologic, manual review is time consuming
and inefficient. To address these challenges, we enhanced and integrated tools for
text mining and NLP of AE reports using open source tools, including Python and
R. By extracting diagnosis, cause of death, and time to onset, and presenting relevant
information for review, the text-mining tool streamlines and enhances the post
market surveillance process.

Adverse Event Network Analytics. Visualizing relationships between vaccines
and AEs can reveal new patterns and trends in the data, leading reviewers to
uncover safety issues. To assist CBER Medical Officers and researchers in
identifying instances where certain vaccines or combinations of vaccines might
have harmful effects on patients, we developed an AE network analysis tool using
open source tools. This network analyzer allows users to select partitions of the
FDA Vaccine Adverse Event Reporting System (VAERS) database, generate a
co-occurrence matrix, view networks and network metrics (e.g., betweenness,

closeness, degree, strength), and interact with network nodes to gain insights into
product safety issues.

Other Analytics Solutions. In addition, Booz Allen refactored, modularized, and
expanded the capabilities of CBER’s computer simulation model of the Bovine
Spongiform Encephalopathy (BSE) agent to improve estimates of variant
Creutzfeldt-Jakob disease (vCJD) risk for blood products, developed code to handle
large amounts of data generated by a Markov Chain Monte Carlo analysis of the
spread of influenza, developed a large database analysis strategy involving the
application of classification algorithms to simulated genomic data, and implemented
a Statistical Analysis Software (SAS) macro that automatically compares the relative
potency of a given lot of vaccine using a matched set of dose response curves.
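The co-occurrence step can be sketched in miniature; the toy reports and field names below are illustrative and not the VAERS schema:

```python
from collections import Counter
from itertools import combinations

# Toy AE reports: each lists the vaccines mentioned together.
reports = [
    {"vaccines": ["FLU", "MMR"]},
    {"vaccines": ["FLU"]},
    {"vaccines": ["FLU", "MMR", "HPV"]},
]

# Count how often each pair of vaccines appears in the same report.
cooccur = Counter()
for report in reports:
    for a, b in combinations(sorted(report["vaccines"]), 2):
        cooccur[(a, b)] += 1

# Degree of each node in the resulting co-occurrence network.
degree = Counter()
for (a, b), _ in cooccur.items():
    degree[a] += 1
    degree[b] += 1

print(cooccur)  # pair counts, the entries of the co-occurrence matrix
print(degree)   # a simple per-node network metric
```

From the same pair counts one can go on to compute the richer metrics named in the text (betweenness, closeness, strength) with a graph library.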

Our Impact

New methods for post market surveillance of biologics are critical for FDA reviewers

who must determine whether reported adverse events are actually a result of a

biologic product. With more than 10 million vaccines administered each year to

children less than one year old, CBER reviewers are under pressure to quickly

evaluate potential safety signals through manual evaluation of AE reports, review

of scientific literature, and analysis of cumulative data using frequency calculations

or statistical algorithms. Booz Allen’s support resulted in the development

of innovative and data-driven approaches for the analysis of structured and

unstructured AE reports. We increased the speed of existing text mining tools by

two thousand times, allowing CBER reviewers to run a text mining algorithm to

extract information contained in VAERS reports in seconds, instead of hours. We

also increased the productivity of Medical Officers through the implementation

of text mining and network analysis tools. These new capabilities allow CBER to

streamline the post market review process, extract knowledge from scientific data,

and address public concerns regarding vaccine safety more quickly and efficiently.

Forecasting the Relative Risk for the Onset of Mass Killings
to Help Prevent Future Atrocities

›› Analytic Challenge

Mass atrocities are rare yet devastating crimes. They are also preventable. Studies

of past atrocities show that we can detect early warning signs of atrocities and that

if policy makers act on those warnings and develop preventive strategies, we can

save lives. Yet despite this awareness, all too often we see warning signs missed and

action taken too late, if at all, in response to threats of mass atrocities.

The Early Warning Project, an initiative of the United States Holocaust Memorial

Museum (Holocaust Museum), aims to assess a country’s level of risk for the onset

of future mass killings. Over time, the hope is to learn which models and which

indicators are the best at helping anticipate future atrocities to aid in the design and

implementation of more targeted and effective preventive strategies. By seeking to

understand why and how each country’s relative level of risk rises and falls over

time, the system will deepen understanding of where new policies and resources can

help make a difference in averting atrocities and what strategies are most effective.

This will arm governments, advocacy groups, and at-risk societies with earlier and

more reliable warning, and thus more opportunity to take action, well before mass

killings occur.

The project’s statistical risk assessment seeks to build statistical and machine

learning algorithms to predict the onset of a mass killing in the succeeding 12 months

for each country with a population larger than 500,000. The publicly available system

aggregates and provides access to open source datasets as well as democratizes the

source code for analytic approaches developed by the Holocaust Museum staff and

consultants, the research community, and the general public. The Holocaust Museum

engaged Booz Allen to validate existing approaches as well as explore new and

innovative approaches for the statistical risk assessment.

Our Approach

Taking into account the power of crowdsourcing, Booz Allen put out a call to
employees to participate in a hack-a-thon—just the start of the team’s support as
the Museum refined and implemented the recommendations. More than 80 Booz
Allen Hamilton software engineers, data analysts, and social scientists devoted a
Saturday to participate. Interdisciplinary teams spent 12 hours identifying new
datasets, building new machine learning models, and creating frameworks for
ensemble modeling and interactive results visualization. Following the hack-a-thon,
Booz Allen Data Scientists worked with Holocaust Museum staff to create a data
management framework to automate the download, aggregation, and transformation
of the open

source datasets used by the statistical assessment. This extensible framework allows
integration of new datasets with minimal effort, thereby supporting greater
engagement by the Data Science community.

Our Impact

Publicly launched in the fall of 2015, the Early Warning Project can now leverage

advanced quantitative and qualitative analyses to provide governments, advocacy

groups and at-risk societies with assessments regarding the potential for mass

atrocities around the world. Integration of the project’s statistical risk assessment

models and expert opinion pool created a publicly available source of invaluable

information and positioned Data Science at the center of global diplomacy.

The machine learning models developed during the hack-a-thon achieved

performance on par with state of the art approaches as well as demonstrated the

efficacy of predictions 2-5 years into the future. Teams also identified approaches for

constructing test/validation sets that support more robust model evaluation. These

risk assessments are an important technological achievement in and of themselves,

but what this initiative means for the Data Science community’s position in global

diplomatic dialogue marks an entirely new era for those on the frontiers of Big Data.

The data management framework developed from the lessons learned of the hack-

a-thon represents a great leap forward for the Holocaust Museum. The periodicity

of aggregating and transforming data was reduced from twice per year to once

per week. In addition to providing the community with more up-to-date data, the

reduced burden on researchers enables them to spend more time analyzing data and

identifying new and emergent trends. The extensible framework will also allow the

Holocaust Museum to seamlessly integrate new datasets as they become available or

are identified by the community as holding analytic value for the problem at hand.

Through this project, the Holocaust Museum was able to shift the dynamic from

monitoring ongoing violence to determining where it is likely to occur 12 to 24 months

into the future by integrating advanced quantitative and qualitative analyses to assess

the potential for mass atrocities around the world. The Early Warning Project is an

invaluable predictive resource supporting the global diplomatic dialogue. While the

focus of this effort was on the machine learning and data management technologies

behind the initiative, it demonstrates the growing role the Data Science community is

playing at the center of global diplomatic discussions.

Predicting Customer Response

›› Analytic Challenge

It is very challenging to know how a customer will respond to a given promotional

campaign. Together with the InterContinental Hotels Group (IHG), Booz Allen

Hamilton explored approaches to predict customer-by-customer response to a state-

of-the-art promotional campaign in order to better understand and increase return on

investment (ROI).

In the past, conventional statistics have worked well for analyzing the impact of

direct marketing promotions on purchase behavior. Today, modern multi-channel

promotions often result in datasets that are highly dimensional and sometimes

sparse, which strains the power of conventional statistical methods to accurately

estimate the effect of a promotion on individual purchase decisions. Because of

the growing frequency of multi-channel promotions, IHG was driven to investigate

new approaches. In particular, IHG and Booz Allen studied one recent promotional

campaign using hotel, stay, and guest data for a group of loyalty program customers.

Our Approach

Working closely with IHG experts, Booz Allen investigated three key elements related

to different stages of analytic maturity:

Describe: Using initial data mining, what insights or tendencies in guest behaviors
can be identified after joining multiple, disparate datasets?

Discover: Can we determine which control group members would be likely to
register for a promotion if offered? If so, can we also quantify their registration?

Predict: How would a hotel guest that received the promotion have responded if
they were not offered the promotion? How would a hotel guest that did not receive
the promotion have responded if they were offered the promotion?

For the promotion that was the focus of this case study, not everything about
customers could be controlled as required by traditional statistics. However, because
a probabilistic Bayesian Belief Network (BBN) can learn the pairwise relationships
between all individual customer attributes and their impact on promotional return,
Booz Allen investigated how this technique could be used to model each treated
customer without an exact controlled look-alike.

Specifically, Booz Allen developed a BBN to predict customer-by-customer impacts
driven by promotional campaign offers, subsequently estimating the aggregated ROI
of individual campaigns. We used six machine learning techniques (support vector
machine, random forest, decision tree, neural network, linear model, and AdaBoost)
in unison with the BBN to predict how each customer would be influenced by a
promotional offer.
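The core idea, learning how the probability of response depends on customer attributes and treatment, can be sketched with a toy conditional probability table of the kind a BBN node encodes. The attribute names and records are illustrative, not IHG data:

```python
from collections import defaultdict

# Toy customer records: (loyalty tier, received offer?, responded?)
customers = [
    ("gold", True, True), ("gold", True, True), ("gold", False, False),
    ("basic", True, False), ("basic", True, True), ("basic", False, False),
]

# Conditional probability table P(responded | tier, offered),
# the kind of pairwise relationship a BBN node would encode.
counts = defaultdict(lambda: [0, 0])  # (tier, offered) -> [responded, total]
for tier, offered, responded in customers:
    counts[(tier, offered)][0] += int(responded)
    counts[(tier, offered)][1] += 1

cpt = {k: r / n for k, (r, n) in counts.items()}
print(cpt[("gold", True)])   # P(respond | gold tier, offered)
print(cpt[("basic", True)])  # P(respond | basic tier, offered)
```

Comparing P(respond | attributes, offered) with P(respond | attributes, not offered) is what lets the model estimate a treatment effect without an exact look-alike in a control group.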

Our Impact

The probabilistic model was capable of predicting customer response to the

promotion, without relying on a costly control group. This finding represents millions

of dollars of savings per promotional campaign. This analysis is an industry-

first for the travel and hospitality sector, where demand for more data-driven

approaches to optimize marketing ROI at an individual level is rapidly growing.

Because Booz Allen and IHG’s approach enabled estimation of ROI for each

hypothetical customer, even when no exact look-alikes exist, there are a number

of valuable future applications. One such application is optimal campaign design—

the ability to estimate the promotional attributes for an individual customer

that are likely to drive the greatest incremental spend. Another application is

efficient audience selection - which would reduce the risk of marketing “spam”

that prompts costly unsubscriptions and can negatively impact a hotel’s brand.

CLOSING TIME
›› THE FUTURE of DATA SCIENCE

Data Science is rapidly evolving as it touches every aspect of our

lives on a daily basis. As Data Science changes the way we interact

with, and explore our world, the algorithms and applications of

Data Science continue to advance. We expect this trend to continue

as Data Science has an increasingly profound effect on humanity.

We describe here some of the trends and developments we

anticipate emerging in the field of Data Science over the coming years.

Kirk Borne

The advancements in some Data Science algorithms will deliberately

track the evolution of data structures and data models that Data

Scientists are using to represent their domains of study. One of the

clearest examples of this linkage is in the development of massive-

scale graph analytics algorithms, which are deployed on graph

databases (including network data and semantically linked databases).

It is sometimes said “all the world is a graph,” and consequently the

most natural data structure is not a table with rows and columns, but

a network graph with nodes and edges. Graph analytics encompasses

traditional methods of machine learning, but with a graph-data twist.

Another growth area in Data Science algorithms is in the domain

of geospatial temporal predictive analytics, which can be applied

to any dataset that involves geospatial location and time – that

describes just about everything in our lives! We expect increasingly

sophisticated deployments of this methodology in the areas of law

enforcement, climate change, disaster management, population

health, sociopolitical change, and more.

It is obvious that bigger, faster, and more complex datasets will require faster (hyperfast!) analytics. We anticipate advanced Data Science algorithms that take advantage of technological advancements in quantum machine learning, in-memory data operations, and machine learning on specialized devices (e.g., the GPU, the Raspberry Pi, or the next-generation mobile handheld "supercomputer"). In such commodity devices, we expect to see the development of more embedded machine learning (specifically, deep learning) algorithms that perform time-critical data-to-insights transformations at the point of data collection. Such use cases will be in great abundance within the emerging Internet of Things (IoT), including the industrial IoT and the Internet of Everything.

Advances in cognitive machine learning are on the horizon, including open source and configurable algorithms that exploit the full content, context, and semantic meaning of streaming real-time data. The ability to use this 360-degree view of a situation will enable the delivery of the right action, at the right time, at the right place, in the right context – this is the essence of cognitive analytics. Another way to view cognitive analytics is that, given all of the data and the context for a given object or population, the algorithm identifies the right question that you should be asking of your data (which might not be the question that you traditionally asked).

Another area of Data Science evolution that tracks with the growth in a particular data type is that of unstructured data, specifically text. The growth of such unstructured data is phenomenal, and it demands richer algorithms than those used on numerical data, since there are many more shades of meaning in natural language than in tables of numbers. The new Data Science algorithms for unstructured data will be applied in multiple directions. Natural Language Generation will be used to convert data points into text, which can be used to generate the data's story automatically. Structured Database Generation will transform text documents or other unstructured data into data points (i.e., converting qualitative data into machine-computable quantitative data).
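These two directions are inverses of one another, which a minimal sketch makes clear (the record, template, and pattern below are all invented for illustration; real systems use far richer language models than a template and a regular expression):

```python
import re

# Natural Language Generation, minimally: render a data point as prose.
record = {"region": "Midwest", "metric": "sales", "change_pct": 12.5}

def narrate(r):
    """Turn one structured record into a sentence of the data's story."""
    direction = "rose" if r["change_pct"] >= 0 else "fell"
    return (f'{r["metric"].capitalize()} in the {r["region"]} '
            f'{direction} {abs(r["change_pct"])}% last quarter.')

sentence = narrate(record)
print(sentence)  # Sales in the Midwest rose 12.5% last quarter.

# Structured Database Generation, minimally: recover machine-computable
# fields from the text, converting qualitative data back to quantitative.
m = re.match(r"(\w+) in the (\w+) (rose|fell) ([\d.]+)%", sentence)
extracted = {
    "metric": m.group(1).lower(),
    "region": m.group(2),
    "change_pct": float(m.group(4)) * (1 if m.group(3) == "rose" else -1),
}
print(extracted)
```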

All of these technical advancements, plus others that we cannot yet imagine, will be brought to bear on new domains. Some of the hottest, most critical domains in which Data Science will be applied in the coming years include:

› Cybersecurity, including advanced detection, modeling, prediction, and prescriptive analytics

› Healthcare, including genomics, precision medicine, population health, healthcare delivery, health data sharing and integration, health record mining, and wearable device analytics

› IoT, including sensor analytics, smart data, and emergent discovery alerting and response

› Customer Engagement and Experience, including 360-degree view, gamification, and just-in-time personalization

› Smart X, where X = cities, highways, cars, delivery systems, supply chain, and more

› Precision Y, where Y = medicine, farming, harvesting, manufacturing, pricing, and more

› Personalized Z, where Z = marketing, advertising, healthcare, learning, and more

› Human capital (talent) and organizational analytics

› Societal good

THE FUTURE OF DATA SCIENCE

Algorithms:

› Massive-scale Graph

› Geospatial Temporal Predictive Analytics

› Hyperfast Analytics

› Embedded Deep Learning

› Cognitive Machine Learning

› Natural Language Generation

› Structured Database Generation

Applications:

› Cybersecurity

› Healthcare

› Internet of Things

› Customer Engagement & Experience

› Smart Everything

› Human Capital

› Data for Societal Good

›› PARTING THOUGHTS

Data Science capabilities are creating data analytics that touch every aspect of our lives on a daily basis. From visiting the doctor, to driving our cars, to shopping for services, Data Science is quietly changing the way we interact with and explore our world. We hope we have helped you truly understand the potential of your data and how to become extraordinary thinkers by asking the right questions. We hope we have helped continue to drive forward the science and art of Data Science. Most importantly, we hope you are leaving with a newfound passion and excitement for Data Science.

Thank you for taking this journey with us. Please join our conversation and let your voice be heard. Email us your ideas and perspectives at data_science@bah.com or submit them via a pull request on the GitHub repository.

Tell us and the world what you know. Join us. Become an author of this story.


About

›› BOOZ ALLEN HAMILTON

Booz Allen Hamilton has been at the forefront of strategy and technology for more than 100 years. Today, the firm provides management and technology consulting and engineering services to leading Fortune 500 corporations, governments, and not-for-profits across the globe. Booz Allen partners with public and private sector clients to solve their most difficult challenges through a combination of consulting, analytics, mission operations, technology, systems delivery, cybersecurity, engineering, and innovation expertise.

With international headquarters in McLean, Virginia, the firm employs more than 22,500 people globally and had revenue of $5.27 billion for the 12 months ended March 31, 2015. To learn more, visit www.boozallen.com. (NYSE: BAH)
