This page reproduces the content of http://www.slideshare.net/plinux/performance-of-fractal-tree-databases.

- Performance of Fractal-Tree Databases

Michael A. Bender

- The Problem

Problem: maintain a dynamic dictionary on disk.

Motivation: file systems, databases, etc.

State of the art (algorithmic perspective):

• B-tree [Bayer, McCreight 72]

• cache-oblivious B-tree [Bender, Demaine, Farach-Colton 00]

• buffer tree [Arge 95]

• buffered-repository tree [Buchsbaum, Goldwasser, Venkatasubramanian, Westbrook 00]

• B^ε-tree [Brodal, Fagerberg 03]

• log-structured merge tree [O'Neil, Cheng, Gawlick, O'Neil 96]

• string B-tree [Ferragina, Grossi 99]

• etc, etc!

State of the practice:

• B-trees + industrial-strength features/optimizations

Michael Bender -- Performance of Fractal-Tree Databases

- B-trees are Fast at Sequential Inserts

Sequential inserts in B-trees have near-optimal data locality.

[Figure: B-tree; labels: "These B-tree nodes reside in memory", "Insertions are into this leaf node".]

• One disk I/O per leaf (which contains many inserts).

• Sequential disk I/O.

• Performance is disk-bandwidth limited.

- B-Trees Are Slow at Ad Hoc Inserts

High-entropy inserts (e.g., random) in B-trees have poor data locality.

[Figure: B-tree; label: "These B-tree nodes reside in memory".]

• Most nodes are not in main memory.

• Most insertions require a random disk I/O.

• Performance is disk-seek limited.

• ≤ 100 inserts/sec/disk (≤ 0.05% of disk bandwidth).
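The bound in the last bullet is simple arithmetic. Here is a minimal sketch, assuming a 10 ms average seek, 128-byte rows, and 100 MB/s of sequential bandwidth. These values are illustrative assumptions, not measurements from the talk.

```python
def seek_limited_inserts(seek_ms: float, row_bytes: int, bandwidth_mb_s: float):
    """Return (inserts per second, fraction of disk bandwidth used)
    when every insert costs one random disk seek."""
    inserts_per_sec = 1000.0 / seek_ms
    bytes_per_sec = inserts_per_sec * row_bytes
    fraction = bytes_per_sec / (bandwidth_mb_s * 1_000_000)
    return inserts_per_sec, fraction


# Assumed hardware numbers (hypothetical, for illustration only).
rate, fraction = seek_limited_inserts(seek_ms=10.0, row_bytes=128, bandwidth_mb_s=100.0)
print(f"{rate:.0f} inserts/sec, {fraction:.4%} of disk bandwidth")
```

With these assumptions the disk moves only about 12.8 KB/s of useful data, well under the 0.05% ceiling quoted on the slide.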

- B-trees Have a Similar Story for Range Queries

[Figure: leaf nodes are scattered across disk in an aged B-tree.]

Range queries in newly built B-trees have good locality.

Range queries in aged B-trees have poor locality:

• Leaf blocks are scattered across disk.

• For page-sized nodes, as low as 1% disk bandwidth.

- Results

Cache-Oblivious Streaming B-tree [Bender, Farach-Colton, Fineman, Fogel, Kuszmaul, Nelson 07]

• Replacement for Traditional B-tree

• High entropy inserts/deletes run up to 100x faster

• No aging --> always fast range queries

• Streaming B-tree is cache-oblivious

‣ Good data locality without memory-specific parameterization.

- Results (cont)

Fractal Tree™ database

• TokuDB is a storage engine for MySQL

‣ A storage engine is a structure that stores on-disk data.

‣ Traditionally a storage engine is a B-tree.

• MySQL is an open-source database

‣ Most installations of any database

• Built in context of our startup Tokutek.

Performance

• 10x-100x faster index inserts

• No aging

• Faster queries in important cases

[Figure: software stack, top to bottom: Application Layer; Database (SQL Processing, Query Optimization…); TokuDB for MySQL; File System.]

- Creative Fundraising for Startup

- Algorithmic Performance Model

Minimize # of block transfers per operation

Disk-Access Machine (DAM) [Aggarwal, Vitter 88]

• Two levels of memory.

• Two parameters: block-size B, memory-size M.

Cache-Oblivious Model (CO) [Frigo, Leiserson, Prokop, Ramachandran 99]

• Parameters B and M are unknown to the algorithm or coder.

• (Of course, used in proofs.)

[Figure: Memory and Disk, with blocks of size B transferred between them.]

- Fractal Tree Inserts (and Deletes)

Insert cost:

• B-tree: $O(\log_B N) = O\left(\frac{\log N}{\log B}\right)$

• Streaming B-tree: $O\left(\frac{\log N}{B}\right)$

Example: N = 1 billion, B = 4096

• 1 billion 128-byte rows (128 gigabytes)

‣ $\log_2(1\text{ billion}) \approx 30$

• Half-megabyte blocks that hold 4096 rows each

‣ $\log_2(4096) = 12$

• B-trees require $\frac{\log N}{\log B} = 30/12 \approx 3$ disk seeks (modulo caching, insertion pattern)

• Streaming B-trees require $\frac{\log N}{B} = 30/4096 \approx 0.007$ disk seeks
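The example numbers can be reproduced directly. This is a minimal sketch that ignores the constant factors hidden in the O-notation. The function names are illustrative, not from the talk.

```python
import math


def btree_seeks(n: int, b: int) -> float:
    """Approximate B-tree insert cost: log_B N = log N / log B."""
    return math.log2(n) / math.log2(b)


def streaming_btree_seeks(n: int, b: int) -> float:
    """Approximate amortized streaming B-tree insert cost: (log N) / B."""
    return math.log2(n) / b


N = 1_000_000_000  # rows
B = 4096           # rows per block

print(round(btree_seeks(N, B), 2))            # a few seeks per insert
print(round(streaming_btree_seeks(N, B), 4))  # a small fraction of a seek
```

The ratio between the two costs is B / log B, which is why the gap grows as blocks get larger.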

- Inserts into Prototype Fractal Tree

Random inserts into Fractal Tree (“streaming B-tree”) and B-tree (Berkeley DB)

[Figure: plot with curves labeled Fractal Tree and B-tree.]

- Searches in Prototype Fractal Tree

Point searches ~3.5x slower ($N = 2^{30}$)

• (Searches/sec improves as more of the data structure fits in cache.)

[Figure: plot with curves labeled Fractal Tree and B-tree.]

- Asymmetry Between Inserts and Key Searches

Small specification changes affect complexity

E.g., duplicate keys

• Slow: Return an error when a duplicate key is inserted

‣ Hidden search

• Fast: Overwrite duplicates or maintain all versions

‣ No hidden search

E.g. deletes

• Slow: Return number of elements deleted

‣ Hidden search

• Fast: Delete without feedback

‣ No hidden search
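The hidden search can be made visible with a toy model. The sketch below is not TokuDB's design. `BufferedDict` and its method names are hypothetical, and a Python dict stands in for on-disk data. Writes are queued as messages and cost no search, while any operation that must report on existing data forces one.

```python
class BufferedDict:
    """Toy write-optimized dictionary: writes append messages to a buffer;
    anything that needs the current state must perform a search."""

    def __init__(self):
        self.store = {}    # stands in for on-disk data
        self.buffer = []   # pending (op, key, value) messages
        self.searches = 0  # number of searches performed

    def _flush(self):
        for op, key, value in self.buffer:
            if op == "put":
                self.store[key] = value
            else:
                self.store.pop(key, None)
        self.buffer.clear()

    def _search(self, key):
        self.searches += 1
        self._flush()
        return self.store.get(key)

    def upsert(self, key, value):
        """Fast: overwrite duplicates, no search needed."""
        self.buffer.append(("put", key, value))

    def insert_unique(self, key, value):
        """Slow: must search to detect a duplicate key."""
        if self._search(key) is not None:
            raise KeyError(f"duplicate key: {key!r}")
        self.buffer.append(("put", key, value))

    def delete_blind(self, key):
        """Fast: delete without feedback."""
        self.buffer.append(("del", key, None))

    def delete_counted(self, key):
        """Slow: must search to report how many elements were deleted."""
        found = self._search(key) is not None
        self.buffer.append(("del", key, None))
        return 1 if found else 0


d = BufferedDict()
for k in range(1000):
    d.upsert(k, k * k)
    d.delete_blind(k)
blind_searches = d.searches

for k in range(1000):
    d.insert_unique(k, k * k)
    d.delete_counted(k)
checked_searches = d.searches
print(blind_searches, checked_searches)
```

The first loop never touches the stored data, whereas the second performs one search per operation, even though both loops leave the dictionary in the same final state.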

Next slide: extra difficulty of key searches

- Extra Difficulty of Key Searches

- Asymmetry Between Inserts and Key Searches

Insert/point-query asymmetry has an impact on:

• System design. How to redesign standard mechanisms (e.g., concurrency-control mechanism).

• System use. How to take advantage of faster inserts (e.g., to enable faster queries).

- Overview of Talk

- Overview

External-memory dictionaries

Performance limitations of B-trees

Fractal-Tree data structure (Streaming B-tree)

Insert/point-query asymmetry

Impact of insert/point-query asymmetry on database use

How to build a streaming B-tree

Impact of insert/point-query asymmetry on system design

Scaling into the future

- Insert/point-query asymmetry affecting database use

- How B-trees Are Used in Databases

Data maintained in rows and stored in B-trees.

• Select via index: select d where 270 ≤ a ≤ 538

• Select via table scan: select d where 270 ≤ e ≤ 538

[Figure: the same table shown twice, with key a and value columns b, c, d, e.]

- How B-trees Are Used in Databases (Cont.)

Selecting via an index can be slow if it is coupled with point queries.

select d where 270 ≤ b ≤ 538

[Figure: main table (key a; value b, c, d, e) and two indexes (key b, value a; key c, value a).]

- How B-trees Are Used in Databases (Cont.)

Covering index can speed up selects

• Key contains all columns necessary to answer query.

select d where 270 ≤ b ≤ 538

[Figure: main table (key a; value b, c, d, e), covering index (key bd, value a), and index (key c, value a).]
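The difference between a plain index and a covering index can be observed with any SQL engine that exposes its query plan. The sketch below uses SQLite through Python's `sqlite3` module purely for illustration. The talk concerns MySQL storage engines, and the table and index names here are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER PRIMARY KEY, b INT, c INT, d INT, e INT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
    [(i, (i * 7919) % 1000, i % 10, i * 2, i % 3) for i in range(1000)],
)

QUERY = "SELECT d FROM t WHERE b BETWEEN 270 AND 538"


def plan(sql: str) -> str:
    """Return SQLite's query-plan description for sql as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return "; ".join(row[-1] for row in rows)


# Plain secondary index on b: each match still needs a lookup in the
# main table to fetch d (the "point queries" of the previous slide).
conn.execute("CREATE INDEX idx_b ON t (b)")
plain_plan = plan(QUERY)
plain_result = sorted(conn.execute(QUERY).fetchall())

# Covering index on (b, d): the query is answered from the index alone.
conn.execute("DROP INDEX idx_b")
conn.execute("CREATE INDEX idx_bd ON t (b, d)")
covering_plan = plan(QUERY)
covering_result = sorted(conn.execute(QUERY).fetchall())

print(plain_plan)
print(covering_plan)
```

Both plans return the same rows. The covering index trades extra insert work (a wider index to maintain) for queries that never touch the main table.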

- Insertion Pain Can Masquerade as Query Pain

People often don’t use these indexes. They use simplistic schemas.

• Sequential inserts via autoincrement key

• Few indexes, few covering indexes

[Figure: table with autoincrement key t (effectively a timestamp) and value columns a, b, c, d, e.]

Then insertions are fast but queries are slow.

Adding sophisticated indexes helps queries.

• B-trees cannot afford to maintain them.

• Fractal Trees can.

- How to Build a Fractal Tree and How it Performs

- Simplified (Cache-Oblivious) Fractal Tree

[Figure: sorted arrays of sizes $2^0$, $2^1$, $2^2$, $2^3$, ...]

$O((\log N)/B)$ insert cost & $O(\log^2 N)$ search cost

• Sorted arrays of exponentially increasing size.

• Arrays are completely full or completely empty (depends on the bit representation of # of elements).

• Insert into the smallest array. Merge arrays to make room.
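The structure described in these bullets can be sketched in a few lines. This is an in-memory illustration of the simplified structure only (keys without values, no deletes, no disk blocks). The class name is hypothetical.

```python
import bisect
from heapq import merge


class SimplifiedFractalTree:
    """Sorted arrays of sizes 1, 2, 4, ...; level k is either empty or
    holds exactly 2**k keys, mirroring the binary representation of the
    number of stored keys. Duplicates are kept."""

    def __init__(self):
        self.levels = []  # levels[k] is [] or a sorted list of 2**k keys

    def insert(self, key):
        carry = [key]
        for k in range(len(self.levels)):
            if not self.levels[k]:
                self.levels[k] = carry
                return
            # Level k is full: merge it into the carry and empty it,
            # like a carry in binary addition.
            carry = list(merge(self.levels[k], carry))
            self.levels[k] = []
        self.levels.append(carry)

    def contains(self, key) -> bool:
        # One binary search per non-empty level.
        for level in self.levels:
            i = bisect.bisect_left(level, key)
            if i < len(level) and level[i] == key:
                return True
        return False


tree = SimplifiedFractalTree()
for key in [5, 3, 8, 1, 9, 2, 7]:
    tree.insert(key)
print([len(level) for level in tree.levels])  # 7 = 0b111, so [1, 2, 4]
```

Each merge is a sequential pass over sorted data, which is the source of the good insert locality. The price is one binary search per level on lookups.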

- Simplified (Cache-Oblivious) Fractal Tree (Cont.)

- Analysis of Simplified Fractal Tree

Insert Cost:

• cost to flush buffer of size X = $O(X/B)$

• cost per element to flush buffer = $O(1/B)$

• max # of times each element is flushed = $\log N$

• insert cost = $O((\log N)/B)$ amortized memory transfers

Search Cost:

• Binary search at each level

• $\log(N/B) + (\log(N/B) - 1) + (\log(N/B) - 2) + \dots + 2 + 1 = O(\log^2(N/B))$
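Written out, with constants ignored and assuming, as the slide does, that a binary search over level $i$ costs about $i$ block transfers:

```latex
\begin{align*}
\text{insert cost}
  &= \underbrace{O\!\left(\tfrac{1}{B}\right)}_{\text{per element, per flush}}
     \times \underbrace{\log N}_{\text{max flushes per element}}
   = O\!\left(\frac{\log N}{B}\right) \text{ amortized},\\[4pt]
\text{search cost}
  &= \sum_{i=1}^{\log(N/B)} i
   = \frac{\log(N/B)\,\bigl(\log(N/B)+1\bigr)}{2}
   = O\!\left(\log^2(N/B)\right).
\end{align*}
```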

- Idea of Faster Key Searches in Fractal Tree

O(log (N/B)) search cost

• Some redundancy of elements between levels

• Arrays can be partially full

• Horizontal and vertical pointers to redundant elements

• (Fractional Cascading)

- Why The Previous Data Structure is a Simplification

• Need concurrency-control mechanisms

• Need crash safety

• Need transactions, logging+recovery

• Need better search cost

• Need to store variable-size elements

• Need better amortization

• Need to be good for random and sequential inserts

• Need to support multithreading.

• Need compression

- iiBench Insertion Benchmark

[Figure: iiBench - 1B Row Insert Test. x-axis: Rows Inserted (0 to 1,000,000,000); y-axis: Rows/Second (0 to 50,000); series: InnoDB, TokuDB.]

Fractal Trees scale with disk bandwidth, not seek time.

• In fact, now we are compute bound, so cannot yet take full advantage of more cores or disks. (This will change.)

- iiBench Deletions

[Figure: iiBench - 500M Row Insert/Delete Test. x-axis: Rows Inserted (0 to 500,000,000); y-axis: Rows/Second (0 to 40,000); series: TokuDB, InnoDB; annotations: "Insertions only here", "Insertions + deletions here".]

- Insert/point-query asymmetry when building Fractal-Tree Database

- Building TokuDB Storage Engine for MySQL

Engineering to do list

• Need concurrency-control mechanisms

• Need crash safety

• Need transactions, logging+recovery

• Need better search cost

• Need to store variable-size elements

• Need better amortization

• Need to be good for random and sequential inserts

• Need to support multithreading.

• Need compression

- Concurrency Control for Transactions

[Figure: transactions labeled A, B, C, D, E.]

Transactions

• Sequence of durable operations.

• Happen atomically.

Atomicity in TokuDB via pessimistic locking

• Reader lock: A and B can both read row x of the database.

• Writer lock: if A writes to row x, B cannot read x until A completes.
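The two lock rules can be illustrated with a toy lock table. This is a sketch of generic readers-writer semantics, not TokuDB's implementation. The class and method names are hypothetical, and the rule that a writer must wait for other readers is an assumption that goes beyond what the slide states.

```python
class RowLockTable:
    """Toy pessimistic lock table: many readers or one writer per row."""

    def __init__(self):
        self.readers = {}  # row -> set of transaction ids holding a read lock
        self.writer = {}   # row -> transaction id holding the write lock

    def try_read(self, txn, row) -> bool:
        holder = self.writer.get(row)
        if holder is not None and holder != txn:
            return False  # another transaction is writing this row
        self.readers.setdefault(row, set()).add(txn)
        return True

    def try_write(self, txn, row) -> bool:
        holder = self.writer.get(row)
        if holder is not None and holder != txn:
            return False  # another writer holds the row
        if self.readers.get(row, set()) - {txn}:
            return False  # assumed rule: other readers block a writer
        self.writer[row] = txn
        return True

    def release_all(self, txn):
        """Release every lock held by txn (called when txn completes)."""
        for holders in self.readers.values():
            holders.discard(txn)
        for row in [r for r, t in self.writer.items() if t == txn]:
            del self.writer[row]


locks = RowLockTable()
both_read = locks.try_read("A", "x") and locks.try_read("B", "x")
locks.release_all("A")
locks.release_all("B")

a_writes = locks.try_write("A", "x")
b_blocked = not locks.try_read("B", "x")
locks.release_all("A")
b_reads_after = locks.try_read("B", "x")
print(both_read, a_writes, b_blocked, b_reads_after)
```

A real engine would block or queue the waiting transaction rather than return `False`. The boolean result just keeps the example deterministic.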

- Concurrency Control for Transactions (cont)

B-tree implementation: maintain locks in leaves

• Insert row t

• Search for row u

• Search for row v and put a cursor

• Increment cursor. Now cursor points to row w.

[Figure: leaf holding rows t, u, v, w; legend: writer lock, reader lock, reader range lock.]

Doesn’t work for Fractal Trees: maintaining locks involves implicit searches on writes.

- Scaling Fractal Trees into the Future

- iiBench on SSD

[Figure: iiBench on SSD. x-axis: Cumulative Insertions (0 to 1.5e+08); y-axis: Insertion Rate (0 to 35000); series: TokuDB and InnoDB, each on FusionIO, X25-E, and RAID10.]

B-trees are slow on SSDs, probably because they waste bandwidth.

• When inserting one row, a whole block (much larger) is written.

- B-tree Inserts Are Slow on SSDs

Inserting an element of size x into a B-tree dirties a leaf block of size B.

[Figure: an element of size x inside a block of size B.]

We can write keys of size x into a B-tree using at most an O(x/B) fraction of disk bandwidth.

Fractal trees do efficient inserts on SSDs because they transform random I/O into sequential I/O.
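As a rough illustration of the x/B bound, the sketch below assumes 128-byte rows and 16 KiB leaf blocks. Both sizes are assumptions chosen for the example, not figures from the talk.

```python
def useful_bandwidth_fraction(element_bytes: int, block_bytes: int) -> float:
    """Fraction of written bytes that are new data when each insert
    dirties, and later writes back, one whole leaf block."""
    return element_bytes / block_bytes


# Assumed sizes: 128-byte rows, 16 KiB leaf blocks.
fraction = useful_bandwidth_fraction(128, 16 * 1024)
print(f"{fraction:.4f} of write bandwidth carries new data")
```

Under these assumptions less than 1% of the bytes written are new data, which is the waste the slide attributes to B-trees on SSDs.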

- Disk Hardware Trends

Disk capacity will continue to grow quickly, but seek times will change slowly.

Year    Capacity    Bandwidth
2008    2 TB        100 MB/s
2012    4.5 TB      150 MB/s
2017    67 TB       500 MB/s

• Bandwidth scales as square root of capacity.

Source: http://blocksandfiles.com/article/4501
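The square-root rule can be checked against the listed figures. This sketch takes the 2008 row as the baseline (an arbitrary choice) and compares the rule's prediction with the bandwidth listed on the slide.

```python
import math

# (year, capacity in TB, bandwidth in MB/s), as listed on the slide.
trend = [(2008, 2.0, 100.0), (2012, 4.5, 150.0), (2017, 67.0, 500.0)]

base_year, base_cap, base_bw = trend[0]
predicted = {
    year: base_bw * math.sqrt(cap / base_cap) for year, cap, _ in trend
}
for year, cap, bw in trend:
    print(year, f"listed {bw:.0f} MB/s, sqrt rule {predicted[year]:.0f} MB/s")
```

The 2012 row matches the rule exactly, and the 2017 projection is within roughly 15% of it.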

- Fractal Trees Enable Compact Systems

B-trees require capacity, bandwidth, and random I/O

• B-tree based systems achieve large random I/O rates by using more spindles and lower capacity disks.

Fractal Trees require only capacity & bandwidth

• Fractal Trees enable the use of high-capacity disks.

- Fractal Trees Enable Big Disks

B-trees require capacity, bandwidth, and seeks.

Fractal trees require only capacity and bandwidth.

Today, for a 50TB database:

• Fractal tree with 25 2TB disks gives 500K ins/s.

• B-tree with 25 2TB disks gives 2.5K ins/s.

• B-tree with 500 100GB disks gives 50K ins/s but costs $, racks, and power.

In 2017, for a 1500TB database:

• Fractal tree with 25 67TB disks gives 2500K ins/s.

• B-tree with 25 67TB disks gives 2.5K ins/s.

B-trees need spindles, and spindle density increases slowly.
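The B-tree figures follow from the seek-bound rate of at most 100 inserts/sec/disk quoted earlier in the talk. This is a minimal sketch of that arithmetic, with a hypothetical function name.

```python
def btree_insert_rate(num_disks: int, seeks_per_sec_per_disk: int = 100) -> int:
    """Seek-bound model: one random I/O per insert, spread across spindles."""
    return num_disks * seeks_per_sec_per_disk


few_big_disks = btree_insert_rate(25)     # 25 x 2TB disks
many_small_disks = btree_insert_rate(500)  # 500 x 100GB disks
print(few_big_disks, many_small_disks)

# The slide's fractal-tree figures, expressed per disk:
per_disk_today = 500_000 // 25
per_disk_2017 = 2_500_000 // 25
print(per_disk_today, per_disk_2017)
```

The per-disk fractal-tree rate grows by a factor of 5 between the two scenarios, the same factor as the 100 MB/s to 500 MB/s bandwidth growth on the hardware-trends slide. The B-tree rate stays flat because it depends only on the number of spindles.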

- Power Management in High-Density Data Centers

Michael Bender

- Using Big Disks Also Saves Energy

Power consumption of disks

• Enterprise 80 to 160 GB disk runs at 4W (idle power).

• Enterprise 1-2 TB disk runs at 8W (idle power).

Data centers/server farms use 80-160 GB disks

• Use many small-capacity disks, not large ones.

Using large disks may save a factor of >10 in storage costs

• Other considerations modify this factor

‣ e.g., CPUs necessary to drive disks, scale-out infrastructure, cooling, etc.

‣ Metric: e.g., Watts/MB versus Inserts/Joule
