This page reproduces the content of: http://www.slideshare.net/RevolutionAnalytics/performance-and-scale-options-for-r-with-hadoop-a-comparison-of-potential-architectures

Uploaded more than a year ago (2015/02/25) in Education

R and Hadoop go together. In fact, they go together so well that the number of options available can be confusing to IT and data science teams seeking solutions under varying performance and operational requirements.

Which configuration is faster for big files? Which is faster for sharing data and servers among groups? Which eliminates data movement? Which is easiest to manage? Which works best with iterative and multistep algorithms? What are the hardware requirements of each alternative?

This webinar is intended to help new users of R with Hadoop select the best architecture for integrating the two, by explaining the benefits of several popular configurations: their performance potential, workload handling, programming model and administrative characteristics.

Presenters from Revolution Analytics will describe the options for using Revolution R Open and Revolution R Enterprise with Hadoop, including servers, edge nodes, rHadoop and ScaleR. We’ll then compare each configuration on performance, programming model, administration, data movement, ease of scaling, mixed workload handling, and performance for large individual analyses vs. mixed workloads.

- R and Hadoop: Architectural Options

Bill Jacobs

VP Product Marketing & Field CTO, Revolution Analytics

@bill_jacobs

- Polling Question #1:

Who Are You? (choose one)

– Statistician or modeler who uses R

– Other R developer

– Hadoop Expert

– Application builder

– Data guru

– Business user

– Systems vendor or reseller

– Something else…

- Agenda

• Challenges

• Options

• Considerations

• How to Choose

- Boundless Opportunities

• Marketing: Clickstream & Campaign Analyses
• P&C Insurance: Risk Analysis
• Consumer Products: Warranty Optimization
• Digital Media: Recommendation Engines
• Operations: Supply Chain Optimization
• Retail: Social Sentiment Analysis
• Econometrics: Market Prediction
• Insurance: Fraud Waste and Abuse
• Marketing: Mix and Price Optimization
• Healthcare Delivery: Outcome Prediction
• Life Sciences: Pharmacogenetics
• Manufacturing: Quality Optimization
• Transportation: Asset Utilization

- Polling Question #2:

What Industry Do You Represent?

– Financial Services

– Insurance

– Healthcare, Life Sciences or Pharma

– Manufacturing

– Energy

– Retail

– Logistics and Transportation

– Education

– Government

– Marketing & Advertising

– Technology

– Other

- In A Perfect World…

Analytical Capability

Security

Compute

Ease

Data Scale

Price

Users

- Hadoop Analytics - Many Alternatives

R Based Alternatives

Legacy tools updated – SAS HPA, etc.

Big Data Databases

Other Languages – Scala, Java, Julia, various GUIs

Today’s Topic:

R-Based Alternatives

– “Beside Architectures”

– “Inside Architectures”

– Open Source and Commercial

- Reality: Tradeoffs.

Traditional Statistics vs. Machine Learning

In-Memory vs. Shared Infrastructure

CRAN vs. Parallelization

Desktop vs. Remote

Explicit vs. Automatic Distribution

Locality vs. Movement

Real-Time vs. MapReduce

Memory Limits

- No Magic Bullet.

- Corporate Overview & Quick Facts

Revolution R Enterprise is the leading commercial analytics platform based on the open source R statistical computing language

Founded: 2008 (as REvolution Computing)
Number of customers: 200+
Office Locations: Palo Alto (HQ), Seattle (Engineering), Singapore, London
Investors:
• Northbridge Venture Partners
• Intel Capital
• Platform Vendor
Web site: www.revolutionanalytics.com
CEO: David Rich

- Revolution Analytics

Our Vision: R becomes the de facto standard for enterprise predictive analytics

Our Mission: Drive enterprise adoption of R by providing enhanced R products tailored to meet enterprise challenges

- Revolution Analytics Builds & Delivers:

Software Products:
• Stable Distributions
• Broad Platform Support
• Big Data Analytics in R
• Application Integration
• Deployment Platforms
• Agile Development Tooling
• Future Platform Support

Support & Services:
• Commercial Support Programs
• Training Programs
• Professional Services

Community Programs:
• Academic Support Programs
• Contributions to Open Source R
• Open Source Extensions
• Sponsorship of R User Groups

- Revolution Analytics Technical Innovations

• R Options from Open Source to Enterprise
• Parallelized Analytical Computation
• In-Database & In-Hadoop Analytics
• Big Data Scalability
• Remote Execution
• Production Deployment Support
• Multi-Platform Deployment
• Legacy Data Format Support
• Multiple IDE Options
• PMML Model Export

- The Revolution R Product Suite

Revolution R Open

• Free and open source R distribution

• Enhanced and distributed by Revolution Analytics

Revolution R Plus

• Open-source distribution of R, packages, and other components

• Enhanced, supported and indemnified by Revolution Analytics

Revolution R Enterprise

• Secure, Scalable and Supported Distribution of R

• With proprietary components created by Revolution Analytics

- Polling Question #3:

State of Play: In your company you are…

– Building Our “Data Lake”

– Running R + Hadoop Data Today

– Running R inside Hadoop using Open source

– Running RRE inside Hadoop

– Deploying Business Apps Using Analytics from Hadoop Data

– Looking at Next Steps, e.g. Spark, etc.

- Revolution Analytics: Eight Alternatives for Integrating R & Hadoop

Open Source

1. Open Source R

2. Revolution R Open

3. Open Source Parallelization on Workstations & Servers

4. rHadoop: Open Source Parallelization with rHadoop

Commercial

5. Revolution R Enterprise on Servers & Workstations

6. Revolution R Enterprise on Edge Nodes

7. Revolution R Enterprise Inside Hadoop

8. Combined Edge Node & Inside Hadoop

- 1. Open Source R Integrated With Hadoop

Traditional Open Source R “Beside” Architecture:

• Traditional Open Source
• Memory-Limited
• Data Moves

[Diagram labels: CRAN Algorithms; rODBC, rHDFS, rHbase, rHive]

- 2. Revolution R Open On Workstations & Servers

Replace Open Source R “Beside” Architecture with Revolution R Open

As with Open Source R:
• Still Free.
• Still Memory Based.
• Data Still Moves.

Improvements:
• Accelerates Math with Intel MKL
• Improves R-based packages

Limitations:
• No Effect for non-R Code

[Diagram labels: CRAN Algorithms; rODBC, rHDFS, rHbase, rHive]

- Accelerate R Math with Intel Math Kernel Libraries

Source: http://blog.revolutionanalytics.com/2014/10/revolution-r-open-mkl.html

- 3. Write Parallel Algorithms for PC, Server or Clusters

Write R Code to Explicitly Parallelize – Deploy Across Several Systems

Example Uses:
• Bootstrapping
• Simulation
• HPC

As with Previous:
• Still Free.
• Still Memory Based.
• Data Still Moves.
• Intel MKL with RRO

Improvements:
• Parallelized Execution
• Can Include CRAN Algorithms “Carefully”

Tooling:
• foreach & iterators
• doParallel (PC, server)
• doMPI (cluster)
• RRE rxExec

Limitations:
• Parallelization Difficulty
• Data Movement
• Platform Specific

[Diagram labels: rODBC, rHDFS, rHbase, rHive]
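Option 3 leaves the split-and-combine work to the analyst. A minimal sketch of that pattern for the bootstrapping use case, written in Python because this transcript contains no code of its own. `one_replicate`, `bootstrap_means` and the `map_fn` parameter are illustrative names, not part of foreach, doParallel or any Revolution product. The pluggable `map_fn` loosely mirrors how foreach lets you register a different backend without rewriting the loop.

```python
import random
import statistics
from typing import Callable, Iterable, List, Sequence, Tuple

# One unit of independent work: the data plus the seed for this replicate.
Task = Tuple[Sequence[float], int]


def one_replicate(task: Task) -> float:
    """Resample the data with replacement and return the resample mean."""
    data, seed = task
    rng = random.Random(seed)  # per-task generator keeps replicates reproducible
    resample = [rng.choice(data) for _ in range(len(data))]
    return statistics.fmean(resample)


def bootstrap_means(
    data: Sequence[float],
    n_reps: int,
    map_fn: Callable[..., Iterable[float]] = map,
    base_seed: int = 0,
) -> List[float]:
    """Run n_reps independent replicates through whatever map_fn is supplied."""
    tasks = [(data, base_seed + i) for i in range(n_reps)]
    return list(map_fn(one_replicate, tasks))


if __name__ == "__main__":
    sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3]
    means = bootstrap_means(sample, n_reps=200)
    print(f"bootstrap standard error of the mean: {statistics.stdev(means):.3f}")
```

Passing `map_fn=pool.map` from a `concurrent.futures.ProcessPoolExecutor` spreads the replicates across cores (on platforms that spawn workers, keep that call under an `if __name__ == "__main__":` guard). Each task carries a copy of `data` to the worker, which is the “Data Movement” limitation of this option in miniature.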

- 4. rHadoop: Custom Parallel Execution for Hadoop

Execute R Code & CRAN Algorithms Inside Hadoop

As With Previous:
• Still Free.
• Optional Intel MKL in RRO

Improvements:
• Runs R in Hadoop
• No Data Movement

Example Uses:
• Scoring
• Transformation
• Easily Parallelized Algorithms

Can Include CRAN Algorithms “Carefully”

Limitations:
• Manual Parallelization
• Hadoop Specific

[Diagram labels: Remote Desktop; R Code; rHbase; rHDFS; rMapReduce; Streaming MapReduce]
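Manual parallelization in the MapReduce style means expressing the job as a map step over records and a reduce step over grouped keys. A small in-process sketch in Python, under assumptions: the linear model coefficients, the record fields and the function names are made up for illustration, and nothing here uses the rHadoop packages or Hadoop Streaming. It scores each record (the map) and averages the scores per segment (the reduce).

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical linear model used for scoring; not taken from the slides.
COEF = {"intercept": 1.0, "age": 0.5, "visits": 2.0}


def mapper(record: dict) -> Iterator[Tuple[str, float]]:
    """Map step: score one record and emit (segment, score)."""
    score = (
        COEF["intercept"]
        + COEF["age"] * record["age"]
        + COEF["visits"] * record["visits"]
    )
    yield record["segment"], score


def shuffle(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Group emitted values by key, as the framework would between map and reduce."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key: str, values: List[float]) -> Tuple[str, float]:
    """Reduce step: average the scores for one segment."""
    return key, sum(values) / len(values)


def run_job(records: Iterable[dict]) -> Dict[str, float]:
    """Run map, shuffle and reduce in a single process."""
    pairs = (pair for record in records for pair in mapper(record))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    rows = [
        {"segment": "a", "age": 10, "visits": 1},
        {"segment": "a", "age": 20, "visits": 0},
        {"segment": "b", "age": 0, "visits": 3},
    ]
    print(run_job(rows))
```

Scoring and transformation fit this shape well because each record is handled independently in the map step. Algorithms that need many passes over the data have to be re-expressed as a chain of such jobs by hand.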

- 5. Revolution R Enterprise (RRE) PEMAs inside Hadoop

Traditional “Beside” Architecture with Optimized Algorithms

Available for Windows, Linux

As With Previous:
• Includes Intel MKL in RRO

Advantages:
• Speed: PEMAs Parallelize Across Threads, Cores & Sockets
• Scale: PEMAs “Chunk” - no Memory Limits
• All of CRAN Available
• Portability
• Fully Supported

Limitations:
• Data Movement
• Single Machine

[Diagram labels: Revolution R Enterprise: ScaleR PEMA Algorithms plus All of CRAN (subject to memory limits); rODBC, rHDFS, rHbase, rHive]

- Revolution R Enterprise is…

the only big data big analytics platform based on open source R

High Performance, Scalable Analytics

Portable Across Enterprise Platforms

Easier to Build & Deploy Analytics

- ScaleR

Refactor Algorithms for Dramatic Performance and Capacity Improvement

- ScaleR

High Performance Algorithms for the Most Common Uses

Data Step
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)

Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations

Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test

Sampling
• Subsample (observations & variables)
• Random Sampling

Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions.
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models

Variable Selection
• Stepwise Regression

Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation

Cluster Analysis
• K-Means

Classification (slide badge: “New in 7.3”)
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees

Combination
• PEMA-R API
• rxDataStep
• rxExec

- What’s a PEMA?

Parallel External Memory Algorithms

• Not Limited to Available Memory
• Unlimited Data Scale
• Ingests Data One Chunk At A Time
• Adjustable Memory Footprint
• Multi-Thread Execution Performance
• Highly-Optimized Algorithms
• Algorithm Math Fully Refactored for Parallelism
• Delivered as ScaleR Library in Revolution R Enterprise

[Diagram labels: ScaleR PEMA; Script Calls ScaleR Algorithm; Master Algorithm Process; Start & Manage Processing; Load Block At A Time; Analyze Each Block; Data; Combine Individual Results; Scripts can call CRAN Open Source Algorithms]
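The load, analyze, combine loop described above can be sketched for one simple statistic. The example below is an illustration in Python under stated assumptions, not ScaleR code: it computes mean and variance one block at a time and merges per-block results with the standard pairwise update for means and sums of squared deviations. `Moments`, `analyze_block`, `combine`, `read_blocks` and `chunked_moments` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Moments:
    """Small fixed-size summary kept per block and for the running total."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean


def analyze_block(block: List[float]) -> Moments:
    """Analyze one in-memory block on its own."""
    n = len(block)
    if n == 0:
        return Moments()
    mean = sum(block) / n
    m2 = sum((x - mean) ** 2 for x in block)
    return Moments(n, mean, m2)


def combine(a: Moments, b: Moments) -> Moments:
    """Merge two summaries without revisiting the underlying data."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Moments(n, mean, m2)


def read_blocks(values: Iterable[float], block_size: int) -> Iterator[List[float]]:
    """Yield the input one block at a time; only one block is held at once."""
    block: List[float] = []
    for value in values:
        block.append(value)
        if len(block) == block_size:
            yield block
            block = []
    if block:
        yield block


def chunked_moments(values: Iterable[float], block_size: int) -> Moments:
    """Load a block, analyze it, fold its summary into the running total."""
    total = Moments()
    for block in read_blocks(values, block_size):
        total = combine(total, analyze_block(block))
    return total


def sample_variance(m: Moments) -> float:
    return m.m2 / (m.n - 1)


if __name__ == "__main__":
    result = chunked_moments((float(x) for x in range(1, 101)), block_size=7)
    print(result.n, result.mean, sample_variance(result))
```

Only one block plus a three-number summary is held in memory at a time, so `block_size` sets the memory footprint. Because `combine` gives the same answer in any merge order (up to floating-point rounding), blocks could also be analyzed by separate threads or nodes and merged afterwards.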

- 6. Run Revolution R Enterprise on Hadoop Edge Node(s)

Fast Single-Server Alternative for Modest Data Scale

As With Previous:
• Single Machine Execution
• PEMA Scale & Speed (Single Machine)
• Use ScaleR + CRAN
• Accelerate R with Intel MKL

Improvements:
• Easily Shared
• No Data Movement
• Develop on Desktop, Run on Edge Node

Limitations:
• “Shorter Trip” for Data

[Diagram labels: Thin Client or Remote Desktop; Edge Node; ScaleR + CRAN Algorithms; Local File System; (opt.) rODBC, rHDFS, rHbase, rHive]

- 7. Fast, Transparent Parallel Computation Inside Hadoop YARN/MapReduce

Fast Parallelized Analytics on Large Data Sets In Hadoop

As With Previous:
• Speed and Scale of ScaleR PEMA Algorithms
• Use CRAN Where Appropriate
• Accelerate R Math with MKL
• Custom Parallelized Algorithms

Advantages:
• Parallel Computation
• No Data Movement
• ScaleR PEMA Parallelization
• Can Parallelize CRAN “Carefully”
• Portable Coding

Limitations:
• Hadoop Workload Profiles

[Diagram labels: Desktop & Server Tools and Applications; Remote Execution; DeployR Web Services; jobtracker; ScaleR Algorithms]

- One Client’s Experience with RRE on Hadoop

Test Cluster - 9 Nodes

[Diagram labels: 9 Task Nodes; 2 Admin Nodes; 1 Edge Node; 128GB, 24 cores each (shown twice); 64GB, 24 cores each]

Task: Processing Time

Importing and Filtering Datasets from HDFS
• 14 Million Observations: 82 sec.
• 227 Million Observations: 310 sec.

Modeling and Estimation
• 1.2 M Correlations: 2771 sec.
• Simple Linear Regression, 227 M Observations: 61 sec.
• Multiple Linear Regression, Three Variables, 227 M Observations: 58 sec.
• Multiple Linear Regression, Four Variables, 227 M Observations: 58 sec.
• Random Forest, 10 Predictor Variables, 227 M Observations, 10 Trees with Max Depth of 10 Splits: 2 hr. 3 min.
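The timings above are easier to compare as throughput. A short Python sketch of that arithmetic, using only the observation counts and times shown on the slide. The task labels and function names are illustrative, and the correlation row is left out because its unit is correlations rather than observations.

```python
# Timings transcribed from the slide: (task, observations, seconds).
TIMINGS = [
    ("Import and filter, 14 M observations", 14_000_000, 82),
    ("Import and filter, 227 M observations", 227_000_000, 310),
    ("Simple linear regression", 227_000_000, 61),
    ("Multiple linear regression, 3 variables", 227_000_000, 58),
    ("Multiple linear regression, 4 variables", 227_000_000, 58),
    ("Random forest, 10 trees", 227_000_000, 2 * 3600 + 3 * 60),
]


def throughput(observations: int, seconds: float) -> float:
    """Observations processed per second of wall-clock time."""
    return observations / seconds


def summarize(timings):
    """Pair each task label with its derived throughput."""
    return [(task, throughput(n, s)) for task, n, s in timings]


if __name__ == "__main__":
    for task, rate in summarize(TIMINGS):
        print(f"{task}: {rate:,.0f} observations/sec")
```

These are derived figures, not measurements: the slide does not say what each wall-clock time includes, so differences between rows may partly reflect fixed job startup cost rather than per-observation work.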

- 8. Combined Edge Node & In-Hadoop

Maximized Flexibility, Performance & Workload Handling

As With Previous:
• Speed and Scale of ScaleR PEMA Algorithms
• Use CRAN Where Appropriate
• Accelerate R Math with MKL
• Custom Parallelized Algorithms

Advantages:
• Flexibility for Blended Workloads
• Little or No Data Movement
• Maximize CRAN Capabilities by Sharing Large RAM Edge Nodes

[Diagram labels: Thin Client Development; Remote Execution; ScaleR Algorithms; Desktop & Server Tools and Applications; RStudio; DeployR Web Services]

- Occasionally Conflicting Criteria

Data Science Criteria:
• Performance
• Self Service
• Flexibility
• Collaboration
• Sharing
• Capability

Infrastructure Criteria:
• Big Data Platform
• Vendor Choice
• Data Ingest
• Data Security
• Data Governance

- Key Questions:

Where are the bulk of your skills? SAS? R? Java? Python? SQL?

Where do you build models today?

Do you have the skills to parallelize algorithms?

Can models be built on a big shared server?

How will you run models?

Do you have the budget to purchase commercial solutions?

How will your needs change over time?

What is your future architecture plan?

How risk averse is your management team regarding new platforms and open source?

- Key Questions (cont.)

What Workloads Do You Anticipate?
– How Many Users?
– What Workloads?

What Use Cases Will You Encounter?
– Traditional statistical exploration, modeling?
– Behavior Prediction?
– Outlier Detection?
– Simulation and HPC?
– Massively wide data?
– Real-Time scoring?
– Internet of Things?

Workload Realities:
– Many small tasks do not run well in MapReduce
– Large data movements / duplications are costly

- Eight Steps to Fast, Scalable R Analytics with Hadoop

Open Source Options
1. Open Source R
2. Revolution R Open
3. Open Source Parallelization…
4. rHadoop…

Commercial Options
5. RRE on Servers & Workstations
6. RRE on Edge Nodes
7. RRE Inside Hadoop
8. RRE on Edge Node & Inside Hadoop

No Clear Winner:
• Budget & use case determine optimal path
• Compelling options in both open source & commercial
• RRE ScaleR uniquely provides automatic parallelization
• Current Hadoop platforms are fast for large scale analytics
• Combined in-server & in-Hadoop fits majority of cases

- 2015 Challenges & Opportunities

• Evolving Hadoop Architectures
• In-Memory Analytics – Spark, YARN Containers, Caching
• Additional Algorithm Parallelization
• Cluster Management
• Cloud and Hybrid Cloud Clusters
• SQL on Hadoop “Battle-Royale”
• Addressing the Resource Reality
• Integration, Deployment Both Drain on Expensive Resources
• Leverage other skills
• Design efficient collaboration
• “Analytics for the Rest of Us”
• New Consumption Targets – Mobile
• New Participants in Design – Business Users

- Recommended Resources

Revolution Analytics Products

– http://www.revolutionanalytics.com/products

– http://www.revolutionanalytics.com/big-analytics-hadoop-and-edws

Whitepaper: “Delivering Value from Big Data with Revolution R Enterprise and Hadoop”

– http://www.revolutionanalytics.com/whitepaper/delivering-value-big-data-revolution-r-enterprise-and-hadoop

Revolution Analytics on Social Media:

– http://blog.revolutionanalytics.com/

– @revolutionr on Twitter

– @bill_jacobs on Twitter