This page reproduces the content of https://speakerdeck.com/timonk/philippe-flajolets-contribution-to-streaming-algorithms.



Jérémie Lumbroso's talk at the AK Data Science Summit on Streaming and Sketching in Big Data and Analytics on 06/20/2013 at 111 Minna.

For more information: http://blog.aggregateknowledge.com/ak-data-science-summit-june-20-2013

Philippe Flajolet’s contribution to streaming algorithms

Jérémie Lumbroso

Université de Caen

Data Science Summit

June 20th, 2013

1/22 - Philippe Flajolet (1948 - 2011)

analysis of algorithms

worst-case analysis

1970: Knuth, average case analysis

1980: Rabin introduces randomness into computations

a prolific scientific output

two books with Robert Sedgewick

200+ publications

founder of the field of “analytic combinatorics”

published the first sketching/streaming algorithms

2 - 0. DATA STREAMING ALGORITHMS

Stream: a (very large) sequence S over an (also very large) domain D

S = s1 s2 s3 · · · s_ℓ,  sj ∈ D

consider S as a multiset

M = m1^f1 m2^f2 · · · mn^fn  (element mi appears with multiplicity fi)

Interested in estimating the following quantitative statistics:

— A. Length := ℓ

— B. Cardinality := card(mi) ≡ n (distinct values) ← this talk

— C. Frequency moments := Σ_{v ∈ D} f_v^p, for p ∈ R

Constraints:

very little processing memory

on the fly (single pass + simple main loop)

no statistical hypothesis

accuracy within a few percent

3 - Historical context

1970: average-case → deterministic algorithms on random input

1976–78: first randomized algorithms (primality testing, matrix multiplication verification, nearest neighbors)

1979: Munro and Paterson find the median in one pass with Θ(√n) space, with high probability

⇒ (almost) the first streaming algorithm

In 1983, Probabilistic Counting by Flajolet and Martin is (more or less)

the first streaming algorithm (one pass + constant/logarithmic memory).

Combining both versions: cited about 750 times, the second most cited item in Philippe’s bibliography, after only Analytic Combinatorics.

4 - Databases, IBM, California...

In the 70s, IBM researches relational databases (first PRTV in the UK, then System R in the US) with high-level query languages: the user should not have to know about the structure of the data.

⇒ query optimization; requires cardinality (estimates)

SELECT name FROM participants

WHERE

sex = "M" AND

nationality = "France"

To minimize comparisons: should we filter first on sex or on nationality?

G. Nigel N. Martin (IBM UK) invents the first version of “probabilistic counting” and goes to IBM San Jose in 1979 to share it with the System R researchers. Philippe discovers the algorithm in 1981 at IBM San Jose.

5 - 1. HASHING: reproducible randomness

1950s: hash functions as tools for hash tables

1969: Bloom filters → first time in an approximate context

1977/79: Carter & Wegman, Universal Hashing: hash functions considered for the first time as probabilistic objects + proof that uniformity is achievable in practice

hash functions transform data into i.i.d. uniform random variables, or into infinite strings of random bits:

h : D → {0, 1}^∞

that is, if h(x) = b1 b2 · · · , then P[b1 = 1] = P[b2 = 1] = · · · = 1/2

Philippe’s approach was experimental

(figure: the same data set, unhashed vs. hashed)

later theoretically validated in 2010: Mitzenmacher & Vadhan

proved hash functions “work” because they exploit the entropy of the

hashed data
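To make this concrete (an illustration added here, not from the talk): a minimal Python stand-in for such an h, reading a strong digest as a prefix of the infinite bit string. The helper names hash_bits and rho are ours, and are reused in the sketches below.

import hashlib

def hash_bits(x, width=32):
    # Stand-in for h : D -> {0,1}^infinity: hash x to `width` near-uniform bits.
    digest = hashlib.blake2b(str(x).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") >> (64 - width)

def rho(bits, width=32):
    # Position of the first 0-bit from the left: rho(1^k 0 ...) = k + 1.
    k = 0
    while k < width and (bits >> (width - 1 - k)) & 1:
        k += 1
    return k + 1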


7 - 2. PROBABILISTIC COUNTING (1983)

(with G. Nigel N. Martin)

For each element of the stream, we hash it and look at the result:

S = s1 s2 s3 · · ·  ⇒  h(s1) h(s2) h(s3) · · ·

h(v) transforms v into a string of random bits (each 0 or 1 with prob. 1/2).

So you expect to see:

0xxxx ... → P = 1/2

10xxx ... → P = 1/4

110xx... → P = 1/8

Indeed,

P[1 1 0 x x · · · ] = P[b1 = 1] · P[b2 = 1] · P[b3 = 0] = 1/8

Intuition: because the strings are uniform, the prefix pattern 1^k 0 · · · appears with probability 1/2^(k+1)

⇒ seeing the prefix 1^k 0 · · · means there are likely n ≈ 2^(k+1) different strings

Idea:

keep track of the prefixes 1^k 0 · · · that have appeared

estimate the cardinality with 2^p, where p = size of the largest prefix

7 - Bias correction: how analysis is FULLY INVOLVED in design

The described idea works, but presents a small bias (i.e. E[2^p] ≠ n).

Without analysis (original algorithm): the three bits immediately after the first 0 are sampled and, depending on whether they are 000, 111, etc., a small ±1 correction is applied to p = ρ(bitmap).

With analysis (Philippe): Philippe determines that

E[2^p] ≈ φn,  where φ ≈ 0.77351 . . . is defined by

φ = (e^γ √2 / 3) · ∏_{p=1}^{∞} [ (4p + 1)(4p + 2) / ((4p)(4p + 3)) ]^{(−1)^{ν(p)}}

(ν(p) = number of 1-bits in the binary expansion of p), so that a simple correction yields an unbiased estimator:

Z := 2^p / φ,  E[Z] = n

8 - Analysis close-up: “Mellin transforms”

transformation of a function to the complex plane

f*(s) = ∫_0^∞ f(x) x^(s−1) dx

factorizes linear superpositions of a base function at different scales

links the singularities of the transform in the complex plane to the asymptotics of the original function

gives a precise analysis (better than the “Master Theorem”) of all divide-and-conquer type algorithms (QuickSort, etc.) with recurrences such as

f_n = f_⌈n/2⌉ + f_⌊n/2⌋ + t_n
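A textbook instance of this dictionary (added for illustration, not on the slide) is the Mellin pair

\[ f(x) = e^{-x} \quad\longleftrightarrow\quad f^{*}(s) = \int_0^{\infty} e^{-x}\, x^{s-1}\, dx = \Gamma(s), \]

where the poles of \(\Gamma(s)\) at \(s = 0, -1, -2, \dots\), with residues \((-1)^n/n!\), translate pole by pole into the expansion \(e^{-x} = \sum_{n \ge 0} (-1)^n x^n / n!\) as \(x \to 0\).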

9 - (graphic: M. Golin)

10 - The basic algorithm

h(x) = hash function, transforms data x into a uniform {0, 1}^∞ string

ρ(s) = position of the first bit equal to 0, i.e. ρ(1^k 0 · · · ) = k + 1

procedure ProbabilisticCounting(S : stream)
    bitmap := [0, 0, . . . , 0]
    for all x ∈ S do
        bitmap[ρ(h(x))] := 1
    end for
    P := ρ(bitmap)        ▷ one past the longest initial run of 1s
    return (1/φ) · 2^(P−1)
end procedure

Ex.: if bitmap = 1111000100 · · · then P = 5, and n ≈ 2^(P−1)/φ = 20.68 . . .

Typically estimates are one binary order of magnitude off the exact result:

too inaccurate for practical applications.
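The pseudocode translates directly into Python; a minimal sketch, reusing the hypothetical hash_bits/rho helpers from the hashing section (the 0-indexed bitmap makes the code agree with the slide’s numeric example):

PHI = 0.77351  # bias-correction constant phi from the analysis above

def probabilistic_counting(stream, width=32):
    bitmap = [0] * (width + 1)
    for x in stream:
        # a hashed prefix 1^k 0... has rho = k + 1; record it at index k
        bitmap[rho(hash_bits(x, width), width) - 1] = 1
    p = 0
    while p < len(bitmap) and bitmap[p]:  # p = length of the initial run of 1s
        p += 1
    return 2 ** p / PHI  # bitmap 1111000100... gives p = 4, i.e. 2^4/phi = 20.68...

# Usage: a single bitmap is only accurate to about a binary order of magnitude.
print(probabilistic_counting(str(i) for i in range(10_000)))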

11 - Stochastic Averaging

To improve the accuracy of the algorithm by a factor of 1/√m, the elementary idea is to use m different hash functions (and a different bitmap for each function) and take the average.

⇒ very costly (hashes m times more values)!

Split the elements into m substreams at random, using the first few bits of the hash

h(v) = b1 b2 b3 b4 b5 b6 · · ·

which are then discarded (only b3 b4 b5 · · · is used as the hash value).

For instance, for m = 4:

h(x) = 00 b3 b4 · · ·  →  bitmap_00[ρ(b3 b4 · · ·)] := 1
h(x) = 01 b3 b4 · · ·  →  bitmap_01[ρ(b3 b4 · · ·)] := 1
h(x) = 10 b3 b4 · · ·  →  bitmap_10[ρ(b3 b4 · · ·)] := 1
h(x) = 11 b3 b4 · · ·  →  bitmap_11[ρ(b3 b4 · · ·)] := 1
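A sketch of the same routing in Python, continuing the earlier helpers (m is assumed to be a power of two; the combination rule (m/φ) · 2^(average of the P_j) is the stochastic-averaging estimator of the FM85 paper):

def pc_stochastic_averaging(stream, m=64, width=32):
    b = m.bit_length() - 1                     # number of routing bits, log2(m)
    bitmaps = [[0] * (width + 1) for _ in range(m)]
    for x in stream:
        h = hash_bits(x, width + b)
        bucket = h >> width                    # leading b bits select a substream...
        rest = h & ((1 << width) - 1)          # ...and are discarded from the hash value
        bitmaps[bucket][rho(rest, width) - 1] = 1
    total = 0
    for bm in bitmaps:                         # P_j = first position still at 0
        p = 0
        while p < len(bm) and bm[p]:
            p += 1
        total += p
    return (m / PHI) * 2 ** (total / m)

A single hash function thus does the work of m of them: each element is hashed once, and only the substream it lands in is updated.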

12 - Theorem [FM85]. The estimator Z of Probabilistic Counting is an

asymptotically unbiased estimator of cardinality, in the sense that

En[Z] ∼ n

and its accuracy using m bitmaps is

σn[Z]/n = 0.78/√m

Concretely, one needs O(m log n) memory (instead of O(n) for an exact count).

Example: cardinalities up to n = 10^9 can be counted with error ±6%, using only 4096 bytes = 4 kB.

13 - 3. from Prob. Count. to LogLog (2003)

(with Marianne Durand)

PC: bitmaps require k bits to count cardinalities up to n = 2^k

Reasoning backwards (from observations), it is reasonable, when estimating a cardinality n = 2^3, to observe a bitmap 11100 · · · ; remember:

b1 = 1 means n ≳ 2

b2 = 1 means n ≳ 4

b3 = 1 means n ≳ 8

WHAT IF instead of keeping track of all the 1s we set

in the bitmap, we only kept track of the position of the

largest? It only requires log log n bits!

In the algorithm, replace

bitmap_i[ρ(h(x))] := 1

by

bitmap_i := max {ρ(h(x)), bitmap_i}

For example, the compared evolution of the “bitmap”:

Prob. Count.: 00000 · · · → 00100 · · · → 10100 · · · → 11100 · · · → 11110 · · ·

LogLog: 1 → 4 → 4 → 4 → 5
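In code, the change is one line: registers replace bitmaps, and max replaces the bit-set. A sketch reusing the earlier helpers; the constant 0.39701 is the asymptotic bias correction quoted from the Durand–Flajolet analysis.

ALPHA_LOGLOG = 0.39701  # asymptotic bias-correction constant of LogLog

def loglog(stream, m=64, width=32):
    b = m.bit_length() - 1
    registers = [0] * m                        # each register fits in log log n bits
    for x in stream:
        h = hash_bits(x, width + b)
        j = h >> width                         # bucket index, as before
        registers[j] = max(registers[j], rho(h & ((1 << width) - 1), width))
    return ALPHA_LOGLOG * m * 2 ** (sum(registers) / m)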

14 - loss of precision in LogLog?

Probabilistic Counting and LogLog often find the same estimate:

bitmap = 1 1 1 1 0 0 0 0 · · ·  →  Probabilistic Counting: 5, LogLog: 5

But they sometimes differ:

bitmap = 1 1 1 1 0 0 1 0 · · ·  →  Probabilistic Counting: 5, LogLog: 8

Another way of looking at it: the distribution of the rank (= max of n geometric variables with p = 1/2) used by LogLog has a long tail:

(histogram omitted: ranks 10–25 on the x-axis, frequencies up to 250 on the y-axis, with a long right tail)

(still, there is concentration: hence the idea of compressing the sketches, e.g. the optimal algorithm of Kane et al. 2010)

15 - SuperLogLog (same paper)

The accuracy (we want it as small as possible):

Probabilistic Counting: 0.78/√m for m registers of 32 bits

LogLog: 1.36/√m for m small registers of 5 bits

In LogLog, the loss of accuracy is due to some (rare but real) registers that are too big, far beyond the expected value.

SuperLogLog is LogLog in which we remove the largest registers before estimating, keeping only the smallest δ = 70% (see the sketch below).

involves a two-step estimation

the analysis is much more complicated

but the accuracy is much better: 1.05/√m
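The truncation step itself is tiny; a sketch of just that rule (the recalibrated bias-correction constant that the paper derives for the truncated mean is omitted here):

def truncated_mean(registers, delta=0.70):
    # SuperLogLog's rule: sort the registers and average only the smallest
    # fraction delta, discarding the rare oversized ones before estimating.
    kept = sorted(registers)[: max(1, int(len(registers) * delta))]
    return sum(kept) / len(kept)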

16 - from SuperLogLog to HyperLogLog... DuperLogLog?!

17 - 4. “HyperLogLog:

the analysis of a near-optimal cardinality estimation algorithm” (2007)

(with Eric Fusy, Frédéric Meunier & Olivier Gandouet)

2005: Giroire (a PhD student of Philippe’s) publishes his thesis, with a cardinality estimator based on order statistics

2006: Chassaing and Gerin, using statistical tools, find the best estimator based on order statistics in an information-theoretic sense

Their note suggests using a harmonic mean: initially dismissed as a merely theoretical improvement, it turns out to perform very well in simulations. Why?

18 - Harmonic means ignore too large values

X1, X2, . . ., Xm are estimates of a stream’s cardinality

Arithmetic mean:  A := (X1 + X2 + . . . + Xm) / m

Harmonic mean:  H := m / (1/X1 + 1/X2 + . . . + 1/Xm)

Plot of A and H for X1 = . . . = X31 = 20 000 and X32 varying between 5 000 and 80 000 (two binary orders of magnitude): how A and H vary when only one term differs from the rest.

(plot omitted; vertical axis from 18 500 to 21 500)
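The slide’s experiment is easy to replay; a quick check using its own numbers:

xs = [20_000] * 31
for x32 in (5_000, 20_000, 80_000):
    sample = xs + [x32]
    a = sum(sample) / len(sample)                  # arithmetic mean A
    h = len(sample) / sum(1 / x for x in sample)   # harmonic mean H
    print(f"X32 = {x32:>6}: A = {a:8.1f}, H = {h:8.1f}")

When the outlier quadruples to 80 000, A moves by about +9% while H moves by only about +2%: exactly the robustness against oversized registers that SuperLogLog had to obtain by truncation.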

19 - The end of an adventure. HyperLogLog = essentially the same precision as SuperLogLog, but replaces algorithmic cleverness with mathematical elegance.

Accuracy is 1.03/√m with m small loglog registers (≈ 4 bits each).

The whole of Shakespeare, summarized:

ghfffghfghgghggggghghheehfhfhhgghghghhfgffffhhhiigfhhffgfiihfhhh

igigighfgihfffghigihghigfhhgeegeghgghhhgghhfhidiigihighihehhhfgg

hfgighigffghdieghhhggghhfghhfiiheffghghihifgggffihgihfggighgiiif

fjgfgjhhjiifhjgehgghfhhfhjhiggghghihigghhihihgiighgfhlgjfgjjjmfl

Estimate ñ ≈ 30 897 against n = 28 239. The error is ±9.4% for 128 bytes.

Pranav Kashyap: word-level encrypted texts, classification by language.
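A sketch of the raw HyperLogLog estimator, reusing the earlier helpers (the slide does not give the formula, so this follows the 2007 paper’s harmonic-mean rule, without its small- and large-range corrections, and uses the usual large-m approximation for the constant alpha_m):

def hyperloglog(stream, m=64, width=32):
    b = m.bit_length() - 1
    registers = [0] * m
    for x in stream:                           # register maintenance as in LogLog
        h = hash_bits(x, width + b)
        j = h >> width
        registers[j] = max(registers[j], rho(h & ((1 << width) - 1), width))
    alpha = 0.7213 / (1 + 1.079 / m)           # bias correction (approximate for small m)
    return alpha * m * m / sum(2.0 ** -r for r in registers)

The harmonic mean of the per-register estimates 2^{M_j} is what tames the long tail seen earlier: one oversized register barely moves the sum of the 2^{-M_j}.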


21 - Left out of discussion:

Philippe’s discovery and analysis of Approximate Counting, 1982: how to count up to n with only log log n memory (see the sketch after this list)

a beautiful algorithm (with Wegman), Adaptive Sampling, 1989, which was ahead of its time and went grossly unappreciated... until it was rediscovered around 2000: how do you count the number of elements that appear only once in a stream, using constant-size memory?
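For the first item, a minimal sketch of the Morris-style counter being referred to (the increment rule is the classical one; since E[2^c] = n + 1 after n events, 2^c − 1 is an unbiased estimate):

import random

def approximate_count(n_events):
    c = 0                                      # only the exponent is stored:
    for _ in range(n_events):                  # about log log n bits
        if random.random() < 2.0 ** -c:        # increment with probability 2^-c
            c += 1
    return 2 ** c - 1                          # unbiased estimate of n_events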


22 - A. adaptive/DISTINCT sampling

Let S be a stream of size ℓ (with n distinct elements)

S = x1 x2 x3 · · · x_ℓ

a straight sample [Vitter 85..] of size m (each xi taken with prob. ≈ m/ℓ)

a x x x x b b x c d d d b h x x ...

allows us to deduce that ‘a’ is repeated ≈ ℓ/m times in S, but it is impossible to say anything about rare elements, hidden in the mass: the needle-in-a-haystack problem

a distinct sample (with counters)

(a, 9) (x, 134) (b, 25) (c, 12) (d, 30) (g, 1) (h, 11) (z, 1)

takes each distinct element with probability 1/n, i.e., independently of its frequency of occurrence

Textbook example: sample 1 element of the stream (1, 1, 1, 1, 2, 1, 1, . . . , 1) of length ℓ = 1000; with straight sampling, prob. 999/1000 of taking 1 and 1/1000 of taking 2; with distinct sampling, prob. 1/2 of taking 1 and 1/2 of taking 2.
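A sketch of the adaptive (distinct) sampling loop itself, reusing the earlier hash_bits helper (the capacity parameter and the returned tuple are our choices for illustration):

def adaptive_sample(stream, capacity=64, width=32):
    depth = 0
    sample = {}                                   # element -> exact count
    for x in stream:
        h = hash_bits(x, width)
        if h >> (width - depth) == 0:             # first `depth` hash bits all 0
            sample[x] = sample.get(x, 0) + 1
            while len(sample) > capacity:         # overflow: tighten the filter,
                depth += 1                        # halving the sample on average
                sample = {k: v for k, v in sample.items()
                          if hash_bits(k, width) >> (width - depth) == 0}
    return sample, 2 ** depth * len(sample)       # distinct sample + cardinality estimate

Because the filter only ever tightens, the counts of retained elements are exact; the number of elements appearing exactly once (the slide’s question) is then estimated as 2^depth times the number of sampled elements with count 1.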
