This page reproduces the content of http://www.slideshare.net/nitishupreti/blinkdb (uploaded 2014/12/01, in Technology).

Introduction to BlinkDB : Queries with Bounded Errors and Bounded Response Times on Very Large Data

- Goal: Solve Big Data!

How to achieve the best performance?

2 - 100 TB on 1000 machines
- Hard Disks: ½ - 1 hour
- Memory: 1 - 5 minutes
- ?: 1 second

3 - Better and Faster Frameworks? (… evolved to …; evolves to … ?)

4 - If we cannot do better than in-memory, then what?

5 - Can we use Approximate Computing?

6 - Can you tolerate errors? Well, it depends on the scenario, right…

7 - Overview of the Big Data Space

8 - Massive log batch processing

9 - Can we use Approximate Computing? Answer: YES / NO

10 - Streaming data processing

11 - Can we use Approximate Computing? Answer: MAYBE

12 - Exploratory Data Analysis

13 - Exploratory / Interactive Data Processing
- Getting a sense of data (Data Scientists)
- Debugging? (SREs / DevOps)

14 - Can we use Approximate Computing? Answer: YES!

15 - 1) BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data.
2) Blink and It's Done: Interactive Queries on Very Large Data.
3) A General Bootstrap Performance Diagnostic.
4) Knowing When You're Wrong: Building Fast and Reliable Approximate Query Processing Systems.

Sameer Agarwal, Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica

16 - Our Goal
Support interactive SQL-like aggregate queries over massive sets of data

17 - Our Goal
Support interactive SQL-like aggregate queries over massive sets of data

blinkdb> SELECT AVG(jobtime)
         FROM very_big_log

AVG, COUNT, SUM, STDEV, PERCENTILE etc.

18 - Our Goal
Support interactive SQL-like aggregate queries over massive sets of data

blinkdb> SELECT AVG(jobtime)
         FROM very_big_log
         WHERE src = 'hadoop'

FILTERS, GROUP BY clauses

19 - Our Goal
Support interactive SQL-like aggregate queries over massive sets of data

blinkdb> SELECT AVG(jobtime)
         FROM very_big_log
         WHERE src = 'hadoop'
         LEFT OUTER JOIN logs2
         ON very_big_log.id = logs.id

JOINS, Nested Queries etc.

20 - Our Goal
Support interactive SQL-like aggregate queries over massive sets of data

blinkdb> SELECT my_function(jobtime)
         FROM very_big_log
         WHERE src = 'hadoop'
         LEFT OUTER JOIN logs2
         ON very_big_log.id = logs.id

ML Primitives, User Defined Functions

21 - Our Goal
Support interactive SQL-like aggregate queries over massive sets of data

blinkdb> SELECT my_function(jobtime)
         FROM very_big_log
         WHERE src = 'hadoop'
         ERROR WITHIN 10% AT CONFIDENCE 95%

22 - Our Goal
Support interactive SQL-like aggregate queries over massive sets of data

blinkdb> SELECT my_function(jobtime)
         FROM very_big_log
         WHERE src = 'hadoop'
         WITHIN 5 SECONDS
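The two bounded clauses above can be read as stopping rules on sample size. A minimal sketch of the error-bounded case in plain Python (not BlinkDB's implementation; `approx_avg` and its parameters are illustrative):

```python
import random
import statistics

def approx_avg(rows, rel_error=0.10, z=1.96, batch=100):
    """Grow a uniform sample until the 95% CI half-width falls
    within rel_error of the running mean (illustration only)."""
    shuffled = list(rows)
    random.shuffle(shuffled)            # a random prefix is a uniform sample
    taken = []
    for i in range(0, len(shuffled), batch):
        taken.extend(shuffled[i:i + batch])
        mean = statistics.fmean(taken)
        if len(taken) > 1:
            half = z * statistics.stdev(taken) / len(taken) ** 0.5
            if half <= rel_error * abs(mean):
                return mean, half       # error bound met on a small sample
    return statistics.fmean(taken), 0.0  # fell through to the exact answer
```

A time bound works the same way, with the loop cut off by a deadline instead of a confidence-interval width.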

23 - Query Execution on Samples (Exploration Query)

What is the average buffering ratio in the table?

ID | City     | Buff Ratio
 1 | NYC      | 0.78
 2 | NYC      | 0.13
 3 | Berkeley | 0.25
 4 | NYC      | 0.19
 5 | NYC      | 0.11
 6 | Berkeley | 0.09
 7 | NYC      | 0.18
 8 | NYC      | 0.15
 9 | Berkeley | 0.13
10 | Berkeley | 0.49
11 | NYC      | 0.19
12 | Berkeley | 0.10

Answer: 0.2325 (Precise)

24 - Query Execution on Samples

What is the average buffering ratio in the table? (same table as above)

Uniform sample:

ID | City     | Buff Ratio | Sampling Rate
 2 | NYC      | 0.13       | 1/4
 6 | Berkeley | 0.25       | 1/4
 8 | NYC      | 0.19       | 1/4
11 | NYC      | 0.19       | 1/4

Sample answer: 0.19 vs. 0.2325 (Precise)

25 - Query Execution on Samples

What is the average buffering ratio in the table? (same table as above)

Uniform sample:

ID | City     | Buff Ratio | Sampling Rate
 2 | NYC      | 0.13       | 1/4
 6 | Berkeley | 0.25       | 1/4
 8 | NYC      | 0.19       | 1/4
11 | NYC      | 0.19       | 1/4

Sample answer: 0.19 +/- 0.05 vs. 0.2325 (Precise)

26 - Query Execution on Samples

What is the average buffering ratio in the table? (same table as above)

Uniform sample:

ID | City     | Buff Ratio | Sampling Rate
 2 | NYC      | 0.13       | 1/2
 3 | Berkeley | 0.25       | 1/2
 5 | NYC      | 0.19       | 1/2
 6 | Berkeley | 0.09       | 1/2
 8 | NYC      | 0.18       | 1/2
12 | Berkeley | 0.49       | 1/2

Sample answer: 0.22 +/- 0.02 vs. 0.2325 (Precise)
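The estimates on these slides can be reproduced with a plain CLT error bar over the sampled rows (a sketch that ignores finite-population corrections; the value lists are copied from the slides):

```python
import statistics

# Buffering ratios of the sampled rows, copied from the slides
QUARTER_SAMPLE = [0.13, 0.25, 0.19, 0.19]           # sampling rate 1/4
HALF_SAMPLE = [0.13, 0.25, 0.19, 0.09, 0.18, 0.49]  # sampling rate 1/2

def estimate(values, z=1.96):
    """Sample mean with a CLT-style 95% error bar (half-width)."""
    mean = statistics.fmean(values)
    half = z * statistics.stdev(values) / len(values) ** 0.5
    return mean, half
```

estimate(QUARTER_SAMPLE) gives 0.19 with a bar of roughly +/- 0.05, matching the 1/4-sample slide; the 1/2 sample gives the 0.22 mean, though the slide's +/- 0.02 is tighter than this naive bar, so the deck presumably used a sharper estimator there.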

27 - Speed/Accuracy Trade-off

[Figure: Error vs. Execution Time (Sample Size). Interactive queries answer in ~2 sec on a sample; the time to execute on the entire dataset is ~30 mins.]

28 - Speed/Accuracy Trade-off

[Same figure, with a "Pre-Existing Noise" level marked.]

29 - Where do you want to be on the curve?

30 - Sampling Vs No Sampling on 100 Machines

[Figure: Query response time in seconds vs. fraction of the full data (10 TB): 1020 s on the full data, then 102, 18, 13, 10 and 8 s at fractions 10^-1 through 10^-5. The ~10x step appears because response time is dominated by I/O.]

31 - Sampling Vs No Sampling

[Figure: Same plot with error bars, ranging from 0.02% error on the full data to 0.07%, 1.1%, 3.4% and 11% on progressively smaller fractions (response times 1020, 103, 18, 13, 10, 8 s).]

32 - Okay, so you can tolerate errors…

What are some of the fundamental challenges?

What types of samples to create? (cannot sample everything)

This boils down to: what is our assumption about the nature of the future query workload?

33 - Usual Assumption: Future queries are SIMILAR to past queries.

What is similarity?
(Choosing the wrong notion carries a heavy penalty: under-/over-fitting)

34 - Workload Taxonomy

35 - Predictable QCS

• Fits well with the model of exploratory queries. (Queries are usually distinct, but most will use the same columns.)
• What kinds of videos are popular for a region?
  - Requires looking at data from thousands of videos and hundreds of geographical regions. However, the column sets are fixed: "video titles" (for grouping) and "viewer location" (for filtering).
• Backed by empirical evidence from Conviva & Facebook. A key reason for BlinkDB's efficiency. (Lots of work in database theory.)

37 - BlinkDB Overview

38 - What is BlinkDB?

A framework built on Shark and Spark that…
- creates and maintains a variety of offline samples from underlying data.
- returns fast, approximate answers with error bars by executing queries on samples of data (runtime Error-Latency Profile for sample selection).
- verifies the correctness of the error bars that it returns at runtime.

39 - 1) Sample Creation

40 - Building Samples for Queries

• Uniform sampling vs. stratified sampling.
• Uniform sampling is, however, inefficient for queries that compute aggregates per group:
  - We could simply miss an under-represented group.
  - We care about the error of each query equally, yet uniform sampling assigns more samples to the groups that are better represented.
• Solution: make the sample size assignment deterministic rather than random. This can be achieved with stratified sampling.

41 - Some Terminology…
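The under-representation problem can be seen in a few lines (a sketch; the skewed toy data and the 1% rate are invented for illustration):

```python
import random
from collections import Counter

random.seed(1)
# Skewed toy data (invented): one huge group, one rare group
rows = [("NYC", random.random()) for _ in range(9900)] \
     + [("Berkeley", random.random()) for _ in range(100)]

# Uniform 1% sample: the rare group is barely represented
uniform = random.sample(rows, 100)
print(Counter(city for city, _ in uniform))

# Stratified sample: a deterministic cap of K rows per group,
# so the rare group cannot be missed
K = 50
by_group = {}
for city, value in rows:
    by_group.setdefault(city, []).append(value)
stratified = {c: random.sample(vs, min(K, len(vs))) for c, vs in by_group.items()}
print({c: len(vs) for c, vs in stratified.items()})
```

The uniform sample yields only a handful of Berkeley rows, while the stratified sample deterministically keeps K from each group.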

42 - QCS to Sample On

43 - What QCS to sample on?

• Formulated as an optimization problem, where the three major factors to consider are the "sparsity" of the data, the "workload characteristics" and the "storage cost of samples".
• Sparsity: define a sparsity function as the number of groups whose size in 'T' is less than some number 'M'.

44 - QCS to sample (Contd)…

• Workload: a query with QCS 'q_j' has some unknown probability 'p_j'. The best estimate of p_j is the past frequency of queries with QCS q_j.
• Storage Cost: assume a simple formulation where K is the same for each group. For a set of columns ϕ, the storage cost is |S(ϕ,K)|.

45 - Goal: Maximize the weighted sum of coverage,

where 'coverage' for a query 'q_i' given a sample is defined as the probability that a given value 'x' for the columns is also present among the rows of S(ϕ_i, K).

46 - Optimization Problem

Optimize the following MILP, where 'm' ranges over all possible QCSs, 'j' indexes over all queries, and 'i' over all column sets:

[MILP formula shown as an image on the slide]

47 - How to sample?

48 - Given a known QCS…

• Compute the sample count for a group:
  - K = min( n' / D(ϕ), |T_x| )
• Take samples as:
  - For each group, sample K rows uniformly at random without replacement, forming sample S_x.
• The entire sample S(ϕ,K) is the disjoint union of the multiple S_x:
  - If |T_x| > K, we answer based on K random tuples; otherwise we can provide an exact answer.
• For the aggregate functions AVG, SUM, COUNT and QUANTILE, K directly determines the error.

49 - Sharing QCS

• Multiple queries with different ‘t’ and

‘n’ will share the same QCS. We need

to select a subset from our sample

dependency.

• We need an appropriate storage

technique to allow such subsets to be

identified at runtime.
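The per-group cap K = min(n'/D(ϕ), |T_x|) from the previous slides can be sketched as follows (hypothetical names: `n_prime` is the total row budget, `groups` maps each group to its rows T_x):

```python
import random

def stratified_sample(groups, n_prime, seed=0):
    """groups: dict mapping a group key to its rows (T_x).
    Applies K = min(n' / D(phi), |T_x|): an equal per-group cap,
    so small groups are kept whole and large ones are subsampled."""
    rng = random.Random(seed)
    cap = n_prime // len(groups)            # n' / D(phi)
    sample = {}
    for key, rows in groups.items():
        k = min(cap, len(rows))             # exact answer when |T_x| <= cap
        sample[key] = rng.sample(rows, k)   # uniform, without replacement
    return sample

# Hypothetical sizes: a 200-row budget over 2 groups gives a cap of 100;
# the 20-row group is kept whole, the 1000-row group is subsampled.
sample = stratified_sample({"NYC": list(range(1000)), "Berkeley": list(range(20))}, 200)
```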

50 - Storage Technique

• The rows of a stratified sample S(ϕ,K) are stored sequentially according to the order of columns in ϕ.
• When S_x is spread over multiple HDFS blocks, each block contains a random subset from S_x.
• It is then enough to read any subset of the blocks comprising S_x, as long as these blocks contain the minimum needed records.

51 - B_ij = Data Block (HDFS)

52 - Storage Requirement

For a table with 1 billion (10^9) tuples and a column set with a Zipf distribution (heavy-tailed) with an exponent of 1.5, it turns out that the storage required by sample S(ϕ,K) is only 2.4% of the original table for K = 10^4, 5.2% for K = 10^5, and 11.4% for K = 10^6. This is also consistent with real-world data from Conviva & Facebook.

53 - What is BlinkDB?

A framework built on Shark and Spark that…
- creates and maintains a variety of samples from underlying data.
- returns fast, approximate answers by executing queries on samples of data, with error bars.
- verifies the correctness of the error bars that it returns at runtime.

54 - 2) BlinkDB Runtime

55 - Selecting a Sample

• If BlinkDB finds one or more stratified samples on a set of columns 'ϕ_i' such that our query 'q' ⊆ ϕ_i, we pick the ϕ_i with the smallest number of columns.
• If no such QCS samples exist, run 'q' on in-memory subsets of all samples maintained by the system. Out of these, we select those with high selectivity.
  - Selectivity = number of rows selected by q / number of rows read by q

56 - Selecting the right Sample Size

• Construct an ELP (Error Latency Profile) that characterizes the rate at which the error decreases (and the time increases) with increasing sample size, by running the query on smaller samples.
• The scaling rate depends on query structure (JOINs, GROUP BYs), physical data placement and the underlying data distribution.

57 - Error Profile

• Given Q's error constraints: the idea is to predict the size of the smallest sample that satisfies the constraints.
• Variance and closed-form aggregate functions are estimated using standard closed-form formulas.
• BlinkDB also estimates query selectivity and the input data distribution by running the query on smaller subsamples.
• The number of rows is thus calculated using statistical error estimates.

59 - Latency Profile

• Given Q’s time constraints : Idea is to

predict the maximum size sample that we

should run query on within the constraints.

• Value of ‘n’ depends on input data, physical

placement of disk, query structure and

available resources. So as a simplification :

BlinkDB simply predicts ‘n’ by assuming

latency scales linearly in input size.

• For very small in-memory samples : BlinkDB

runs a few smaller samples until performance

seems to grow linearly and then estimate the

linear scaling constants.
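The linear-scaling simplification amounts to a one-parameter fit from a few trial runs (a sketch; the timing numbers in the usage note are invented):

```python
def fit_linear_latency(trials):
    """trials: list of (rows_read, seconds) from small trial runs.
    Least-squares fit of latency ~= a * rows, through the origin."""
    num = sum(rows * secs for rows, secs in trials)
    den = sum(rows * rows for rows, _ in trials)
    return num / den                        # seconds per row

def max_rows_within(time_budget_s, sec_per_row):
    """Largest sample size expected to finish within the time bound."""
    return round(time_budget_s / sec_per_row)
```

With trials of (100k rows, 0.5 s) and (200k rows, 1.0 s), the fit is 5 µs/row, so a 5-second bound allows roughly a million rows.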

60 - Correcting Bias

• Running a query on a non-uniform sample introduces a certain amount of statistical bias, and with it subtle inaccuracies.
• Solution: BlinkDB periodically replaces the samples in use, via a low-priority background task that periodically (daily) samples from the original data; the new samples are then used by the system.

61 - Error Estimation

Closed-Form Aggregate Functions
- Central Limit Theorem
- Applicable to AVG, COUNT, SUM, VARIANCE and STDEV

62 - Error Estimation

Closed-Form Aggregate Functions
- Central Limit Theorem
- Applicable to AVG, COUNT, SUM, VARIANCE and STDEV

Generalized Aggregate Functions
- Statistical Bootstrap
- Applicable to complex and nested queries, UDFs, joins etc.
- Very computationally expensive.
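The bootstrap route can be sketched for an arbitrary aggregate (plain Python; the resample count of 100 is illustrative):

```python
import random
import statistics

def bootstrap_error(sample, agg, n_resamples=100, seed=0):
    """Estimate the spread of agg(sample) by re-running the aggregate
    on resamples drawn with replacement (the classic bootstrap)."""
    rng = random.Random(seed)
    n = len(sample)
    replicates = [agg([sample[rng.randrange(n)] for _ in range(n)])
                  for _ in range(n_resamples)]
    return statistics.stdev(replicates)     # bootstrap standard error
```

This works for UDF-style aggregates with no closed form, and it shows exactly why it is expensive: every error estimate costs ~100 extra executions of the aggregate.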

63 - But we are not done yet…

• Statistical tools like the CLT and the bootstrap operate under a set of assumptions on the query / data.
• We need to have some correctness verifiers!

64 - What is BlinkDB?

A framework built on Shark and Spark that…
- creates and maintains a variety of samples from underlying data
- returns fast, approximate answers with error bars by executing queries on samples of data
- verifies the correctness of the error bars that it returns at runtime

65 - Kleiner's Diagnostics [KDD 13]

More data → higher accuracy; 300 data points → 97% accuracy.

[Figure: Error vs. Sample Size]

66 - 300 Data Points ≈ 30K Queries for the Bootstrap!

67 - So, in an Approximate QP:

• One query that estimates the answer.
• Hundreds of queries on resamples of the data that compute the error.
• Tens of thousands of queries to verify that this error is correct.
• BAD PERFORMANCE!
• Solution: a single-pass execution framework.

68 - What is BlinkDB?

A framework built on Shark and Spark that…
- creates and maintains a variety of samples from underlying data
- returns fast, approximate answers with error bars by executing queries on samples of data
- verifies the correctness of the error bars that it returns at runtime

69 - BlinkDB Implementation

70 - BlinkDB Architecture

[Diagram: Command-line Shell and Thrift/JDBC clients feed a Driver (Metastore, SQL Query, UDFs, Parser, Optimizer, Physical Plan, Execution) running on Hadoop/Spark/Presto, over storage (e.g., HDFS, HBase, Presto).]

72 - Implementation Changes

• Additions to the query-language parser.
• The parser can trigger a sample creation and maintenance module.
• A sample selection module that rewrites the query and assigns it an appropriately sized sample.
• An uncertainty module that modifies all pre-existing aggregation functions to return error bars and confidence intervals.
• A module that periodically samples from the original data, creating new samples which are then used by the system. (Correlation + workload changes)

73 - BlinkDB Evaluation

74 - BlinkDB Vs. No Sampling

[Figure: 2.5 TB from cache, 7.5 TB from disk; log scale]

75 - Scaling BlinkDB

Each query operates on 100N GB of data.

76 - Response Time and Error Bounds…

20 Conviva queries averaged over 10 runs

77 - Take Away…

• The only way, for now, to escape the memory performance barrier is to use Approximate Computing.
• It has a huge role to play in exploratory data analysis.
• BlinkDB provides a framework for AQP + error bars + verifying them.
• Great performance on real-world workloads.

79 - Personal Takeaway: Take a STATISTICS class!

80 - Questions?

THANK YOU!