This page reproduces the content of http://www.slideshare.net/tbyasu/kdd2015readingtabei.


- KDD 2015 Reading Group @ Kyoto University, Saturday, August 29, 2015

Monitoring Least Squares Models of Distributed Streams

M. Gabel*, D. Keren+, A. Schuster*

*Israel Institute of Technology, +Haifa University

Presenter: Yasuo Tabei (田部井 靖生; JST / Tokyo Institute of Technology)

- Why I chose this paper

• The problem setting is new (?)

• The method is simple (the derivation is complicated)

- Overview of the problem setting

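• Monitor changes in a model over data that arrive as a time series

– There are k nodes

– Data arrive at each node over time

– A single model is monitored

• Updating the model is costly.

Cost = collecting the data & updating β

• We want to update the model only when it has changed substantially.

- Problem overview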

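[Figure: nodes V_1, V_2, V_3, ..., V_k each receive a stream of data (X, y), e.g. (X_11, y_11), (X_12, y_12) at V_1; (X_21, y_21) at V_2; (X_31, y_31), (X_32, y_32) at V_3; (X_k1, y_k1) at V_k. All nodes feed a single model.]

- Method overview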

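• Naïve approaches

– Update β every round (incurs communication cost)

– Update β every T time steps

(it is hard to balance the interval against the error in β)

• Update β only when β may have changed substantially at some node.

• Open question

How can a change in the global β be predicted from the local changes at each node?

- Method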

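• Basic idea

– Define a region C of small data changes (the convex safe zone)

– If the data change at every node stays inside the convex safe zone, then $\|\beta - \beta_0\| \le \epsilon$ holds

– When the data change at some node leaves the convex safe zone, update β

(geometric monitoring)

• To update β, the data of all nodes are collected in one place

- Method (details)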

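• In $\beta = A^{-1} c$, let $A = \sum_i x_i x_i^T$ and $c = \sum_i x_i y_i$

• Then A and c can be written as sums of the per-node $A^j$ and $c^j$, i.e., $A = \sum_{j=1}^{k} A^j$ and $c = \sum_{j=1}^{k} c^j$

• Hence $\beta = \left(\sum_j A^j\right)^{-1} \left(\sum_j c^j\right)$

• That is, a change in β is determined by the pairs of changes $(\Delta^j, \delta^j)$ in $A^j$ and $c^j$

- Method (details)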
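A minimal NumPy sketch of this decomposition, assuming the usual OLS normal equations ($A = \sum_i x_i x_i^T$, $c = \sum_i x_i y_i$). The node count, sample sizes, and variable names are illustrative and not taken from the slides; the snippet only checks that per-node sums, per-node averages, and the pooled data yield the same β.

```python
import numpy as np

rng = np.random.default_rng(0)
k, m, n = 4, 3, 50            # nodes, dimensions, samples per node (illustrative)
beta_true = rng.normal(size=m)

# Each node holds its own samples (X_j, y_j).
Xs = [rng.normal(size=(n, m)) for _ in range(k)]
ys = [X @ beta_true + 0.1 * rng.normal(size=n) for X in Xs]

# Local statistics: A_j = sum_i x_i x_i^T and c_j = sum_i x_i y_i.
A_local = [X.T @ X for X in Xs]
c_local = [X.T @ y for X, y in zip(Xs, ys)]

# Global model from the sums of the local statistics.
beta_sum = np.linalg.solve(sum(A_local), sum(c_local))

# Same model from the averages: the 1/k factors cancel.
A_hat = sum(A_local) / k
c_hat = sum(c_local) / k
beta_avg = np.linalg.solve(A_hat, c_hat)

# Reference: ordinary least squares on the pooled data.
beta_pooled, *_ = np.linalg.lstsq(np.vstack(Xs), np.concatenate(ys), rcond=None)
```

Because only $A^j$ and $c^j$ enter the model, a node can describe everything it has seen with one m×m matrix and one length-m vector, which is what makes local monitoring of the drift possible.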

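• Since $\beta = \left(\frac{1}{k}\sum_j A^j\right)^{-1}\left(\frac{1}{k}\sum_j c^j\right)$, β can be written in terms of the averages of $A^j$ and $c^j$

• Define a convex subset C of the space of pairs (Δ, δ) that satisfies the following, where $\hat{A}_0$ and $\hat{c}_0$ are the averages of $A^j$ and $c^j$ at the last sync:

$$(\Delta, \delta) \in C \implies \left\| (\hat{A}_0 + \Delta)^{-1} (\hat{c}_0 + \delta) - \hat{A}_0^{-1} \hat{c}_0 \right\| \le \epsilon$$

• [Lemma 1] The following holds for C:

If $(\Delta^j, \delta^j) \in C$ for all j, then $\|\beta - \beta_0\| \le \epsilon$

- Sliding window and infinite window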

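• Sliding window: β is computed from the data in a window of size W, and β0 from the data in the window of size W before the last sync

➢ Condition that guarantees $\|\beta - \beta_0\| \le \epsilon$: $\epsilon \|\hat{A}_0^{-1} \Delta\| + \|\hat{A}_0^{-1} \delta\| + \|\hat{A}_0^{-1} \Delta \beta_0\| \le \epsilon$

• Infinite window: β is computed from all data seen so far, and β0 from all data seen up to the last sync

➢ Condition that guarantees $\|\beta - \beta_0\| \le \epsilon$: [formula image not preserved]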
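The two window models differ only in which samples enter the local statistics. The sketch below is an illustration rather than the authors' implementation: the hypothetical `WindowStats` class keeps running sums $A = \sum x x^T$ and $c = \sum x y$, evicts the oldest sample once a sliding window of size W is full, and never evicts when W is `None` (infinite window). The drift since the last sync is then the difference from a snapshot.

```python
from collections import deque
import numpy as np

class WindowStats:
    """Running A = sum x x^T and c = sum x y over the last W samples.

    W=None means an infinite window (nothing is ever evicted).
    """

    def __init__(self, m, W=None):
        self.A = np.zeros((m, m))
        self.c = np.zeros(m)
        self.W = W
        self.buf = deque()

    def push(self, x, y):
        # Add the new observation to the running sums.
        self.A += np.outer(x, x)
        self.c += x * y
        if self.W is not None:
            self.buf.append((x, y))
            if len(self.buf) > self.W:
                # Subtract the observation that just left the window.
                x_old, y_old = self.buf.popleft()
                self.A -= np.outer(x_old, x_old)
                self.c -= x_old * y_old

    def snapshot(self):
        return self.A.copy(), self.c.copy()

rng = np.random.default_rng(1)
m, W = 3, 50
samples = [(rng.normal(size=m), rng.normal()) for _ in range(120)]

node = WindowStats(m, W)
for x, y in samples[:80]:
    node.push(x, y)
A0, c0 = node.snapshot()                  # local statistics at sync time
for x, y in samples[80:]:
    node.push(x, y)
Delta, delta = node.A - A0, node.c - c0   # local drift since the sync
```

Keeping the raw samples in a buffer costs O(W) memory per node; that is the price of being able to subtract observations when they leave the window.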

… than summed) shall be denoted with $\hat{\cdot}$. Hence initial values

$$\hat{A}_0 = \frac{1}{k}\sum_{j=1}^{k} A^j_0, \quad \hat{c}_0 = \frac{1}{k}\sum_{j=1}^{k} c^j_0, \quad \hat{\beta}_0 = \hat{A}_0^{-1}\hat{c}_0,$$

and current values

$$\hat{A} = \frac{1}{k}\sum_{j=1}^{k} A^j, \quad \hat{c} = \frac{1}{k}\sum_{j=1}^{k} c^j, \quad \hat{\beta} = \hat{A}^{-1}\hat{c}.$$

Note $\left(\frac{1}{k}A\right)^{-1} = kA^{-1}$, thus $\hat{\beta} = \hat{A}^{-1}\hat{c} = A^{-1}c = \beta$ and likewise $\hat{\beta}_0 = \beta_0$. In other words, we can compute the OLS model from the averages of local $A^j$, $c^j$ rather than the sums:

$$\beta = \left(\frac{1}{k}\sum_j A^j\right)^{-1}\left(\frac{1}{k}\sum_j c^j\right) = \hat{A}^{-1}\hat{c} \tag{3}$$

[Figure 3: Sliding and infinite window models (time axis with "sync" and "now" marks, windows of size W, regions labeled new, old, and common). When $A^j$ overlaps $A^j_0$, $\Delta^j = A^j - A^j_0 = \sum_{\text{new}} x_i x_i^T - \sum_{\text{old}} x_i x_i^T$.]

4.2 Convex Safe Zones

We propose to solve the monitoring problem by means of "good" convex subsets, called safe zones, of the data space. Each node monitors its own drift: as long as current values at local nodes $(A^j, c^j)$ are sufficiently similar to their values at sync time $(A^j_0, c^j_0)$, $\beta_0$ is guaranteed to be close to $\beta$.

Formally, we define a convex subset C in the space of matrix-vector pairs, such that $(0_{m \times m}, 0_m) \in C$ and

$$(\Delta, \delta) \in C \implies \left\|(\hat{A}_0 + \Delta)^{-1}(\hat{c}_0 + \delta) - \hat{A}_0^{-1}\hat{c}_0\right\| \le \epsilon, \tag{4}$$

for any drift $(\Delta, \delta)$, where $0_{m \times m}$ and $0_m$ are the m × m zero matrix and length m zero vector. Ideally, C should be "big": as local data slowly drifts over time, it is desirable that drifts remain in C (otherwise communication is needed). Convexity plays a key role in our paradigm: if all drifts are in C, then their average is also in C.

Given such a subset C, the basic monitoring paradigm is simple. As long as $(A^j - A^j_0, c^j - c^j_0) \in C$, node j can remain silent. If all nodes are silent, then $\|\hat{\beta} - \hat{\beta}_0\| = \|\beta_0 - \beta\| \le \epsilon$. If a violation of the local condition does occur at any node j, some form of violation recovery must take place, for example recomputing the global model and restarting monitoring.

We now prove the correctness of the paradigm.

Lemma 1. Let C be a convex subset that satisfies Eq. (4). If $(\Delta^j, \delta^j) \in C$ for all j, then $\|\beta - \beta_0\| \le \epsilon$.

Proof. Express $\hat{A}$, $\hat{c}$ using the average of local deviations:

$$(\hat{A}, \hat{c}) = \frac{1}{k}\sum_j (A^j, c^j) = (\hat{A}_0, \hat{c}_0) + \frac{1}{k}\sum_j (A^j - A^j_0, c^j - c^j_0) = (\hat{A}_0, \hat{c}_0) + \frac{1}{k}\sum_j (\Delta^j, \delta^j) \tag{5}$$

And from C's convexity,

$$\forall j\ (\Delta^j, \delta^j) \in C \implies \frac{1}{k}\sum_j (\Delta^j, \delta^j) \in C \tag{6}$$

Denote $(\hat{\Delta}, \hat{\delta}) = \frac{1}{k}\sum_j (\Delta^j, \delta^j)$ and rewrite Eq. (5) and (6):

$$(\hat{A}, \hat{c}) = (\hat{A}_0, \hat{c}_0) + (\hat{\Delta}, \hat{\delta})$$

$$\forall j\ (\Delta^j, \delta^j) \in C \implies (\hat{\Delta}, \hat{\delta}) \in C$$

Substitute in Eq. (4) to finally obtain:

$$\forall j\ (\Delta^j, \delta^j) \in C \implies \left\|(\hat{A}_0 + \hat{\Delta})^{-1}(\hat{c}_0 + \hat{\delta}) - \hat{A}_0^{-1}\hat{c}_0\right\| = \|\hat{\beta} - \hat{\beta}_0\| = \|\beta - \beta_0\| \le \epsilon$$

which completes the proof.

4.3 Infinite and Sliding Window

We differentiate between two different variations for computing the global model: sliding window and infinite window. In the sliding window model, $\beta$ is computed from the last W samples seen at each node, and similarly $\beta_0$ is computed from the last W samples before sync. Conversely, in the infinite window model $\beta$ is computed over all observations seen thus far, while $\beta_0$ is computed from all observations seen until last sync. Figure 3 illustrates these two models. Though the sliding window is clearly more practical, the infinite window model may be useful in some settings and so we discuss both.

Sliding Window. In the sliding window model each node computes $A^j$ from the W samples seen at node j, while $A^j_0$ (and hence $\hat{A}_0$) is built from the last W samples before sync. Computing $\Delta^j$ and $\delta^j$, however, requires subtracting observations that left the sliding window. If $A^j_0, c^j_0$ and $A^j, c^j$ do not overlap (Figure 3, top), then clearly $\Delta^j = A^j - A^j_0$ and $\delta^j = c^j - c^j_0$. It is also possible, however, that the current window overlaps the window used to build $\beta_0$. Figure 3 (middle) illustrates this case: $\Delta^j$, $\delta^j$ become the sum of new samples from $A^j, c^j$ minus the sum of old (non-overlapping) samples from $A^j_0, c^j_0$.

The convex constraint C on $(\Delta, \delta)$ for this model is:

$$\epsilon\|\hat{A}_0^{-1}\Delta\| + \|\hat{A}_0^{-1}\delta\| + \|\hat{A}_0^{-1}\Delta\beta_0\| \le \epsilon, \tag{7}$$

where $\|A\|$ is the $L_2$ operator norm of the matrix A. The derivation of the convex constraint C is quite technical, and the details are available in Appendix A.

Alg. 1 shows the resulting monitoring algorithm each node runs. Note monitoring does not require any matrix inversions. Each node applies the local constraint from Eq. (7) to its own data. When a violation occurs at any node, it is reported to a coordinator node. The coordinator (Alg. 2) polls all nodes for their local data, computes an updated global model $\beta_0$ and distributes it to all nodes, along with updated $\hat{A}_0^{-1}$ used in the constraint. Monitoring then resumes. This is the simplest violation resolution protocol. We briefly discuss …

- Ridge regression and GLS

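• Ridge regression:

β can be written in closed form: $\beta = (X^T X + \lambda I)^{-1} X^T y$

• Generalized Least Squares (GLS):

β can be written in closed form: $\beta = (X^T \Omega^{-1} X)^{-1} X^T \Omega^{-1} y$, where $\Omega$ is the covariance matrix of the errors

➢ β can be monitored with the same technique

- Experiments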

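• The Distributed Least Squares monitor (DILSQ) is compared with PER(T), which updates the model every T rounds

• DILSQ uses the sliding window

• Evaluation measures: model error and normalized messages

– Normalized messages: the average number of messages sent by each node

• Datasets: synthetic data, Traffic Monitoring, and Gas Sensor Time Series

- Experiments on synthetic data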

• In each round, $y = x^T \beta_{\text{true}} + n$, where the elements of x are i.i.d. $N(0, 1)$ and $n \sim N(0, \sigma^2)$

• The DILSQ error never exceeds the threshold ε = 1.35

• DILSQ updates the model in response to changes in β

[Figure 4: DILSQ model error (black) and syncs (bottom vertical lines) per round, compared to PER(100) error (green), for k = 10 simulated nodes with m = 10 dimensions, and threshold ε = 1.35. Both algorithms reduce communication to 1%, but DILSQ only syncs when $\beta$ changes (bottom purple line shows $\|\beta\|$). PER(100) syncs every 100 rounds, but is unable to maintain error below the threshold (dashed horizontal line).]

… guarantees maximum model error below the user-selected threshold ε, but PER does not. Hence, when comparing the two, we find a posteriori the maximum period T (hence minimum communication) for which the maximum error of PER(T) is equal or below that of DILSQ. Note this gives PER an unrealistic advantage. First, in a realistic setting we cannot know a priori the optimal period T. Second, model changes in realistic settings are not necessarily stationary: the rate of model change may evolve, which DILSQ will handle gracefully while PER cannot.

5.1 Synthetic Datasets

We use two types of synthetic dataset. In the fixed dataset, the true model $\beta_{\text{true}} \in \mathbb{R}^m$ is fixed, with elements drawn i.i.d. from $N[0, 1]$ (footnote 2). We generate R rounds with k nodes, each receiving at each round a new data vector x of size m and scalar y. x is drawn i.i.d. from $N(0, 1)$, and $y = x^T \beta_{\text{true}} + n$ where $n \sim N(0, \sigma^2)$ is Gaussian white noise of strength σ. In the drift dataset the coefficients of $\beta_{\text{true}}$ change rapidly during 25% of one epoch, and are fixed during the rest of the epoch. We generate observations for E epochs using the same procedure. For each experiment we generate new data.

Default parameter values are k = 10 nodes, m = 10 dimensions, noise magnitude σ = 10 (to generate interesting results given the large window), window size W = 1300 and maximum error threshold ε = 0.5, which is quite strict (footnote 3). We generate R = 16900 rounds for the fixed dataset, or E = 5 epochs of 3900 rounds each for drift dataset.

Figure 4 shows the behavior of the monitoring algorithm over such a simulation on the drift dataset with ε = 1.35 and 3 epochs. For this configuration, DILSQ achieves communication of 0.01 messages per node per round, and the model error is always below the threshold. Conversely, the equivalent PER(100) algorithm is unable to maintain the error below the threshold, which would require a higher update frequency. When model changes in $\beta$ are large and frequent DILSQ performs more synchronizations, resulting in updated $\beta_0 = \beta$ that decreases the error. When $\beta$ is stable (it is never truly constant due to noise), synchronizations are much rarer. The periodic algorithm, on the other hand, synchronizes every 100 iterations even during the periods where $\beta$ changes very little.

Footnote 2: therefore $\|\beta\|^2 \sim \chi^2_m$.

Footnote 3: Given that elements of both $\beta_0$ and $\beta$ are i.i.d. $N(0, 1)$, then $\|\beta_0 - \beta\| \sim \sqrt{2}\,\chi_m$. The probability that a random $e = \beta_0 - \beta$ will overwhelm ε is $P = 1 - \mathrm{CDF}_{\chi_m}\!\left(\epsilon / \sqrt{2}\right) > 1 - 10^{-8}$.

[Figure 5: Communication for DILSQ (black) and periodic algorithm tuned to achieve same max error (green) at different threshold values; panels (a) Fixed dataset and (b) Drift dataset. DILSQ communication on fixed model drops to zero for more permissive ε (not shown on logarithmic scale).]

5.1.1 Effect of Threshold

Figure 5 shows the communication required for different threshold levels for the DILSQ algorithm, and the minimal communication required to match DILSQ using the PER algorithm with optimal period, as discussed above. For the fixed model dataset (Figure 5a) neither algorithm needs to sync very often to provide an accurate estimate. Had there been no noise, a single initial synchronization would have been sufficient, regardless of threshold. Note that for more permissive threshold values (or smaller noise magnitude σ) DILSQ achieved zero communication (beyond initial sync) for the fixed dataset (not shown in this log-scale figure).

Performance on the drift dataset (Figure 5b) is more interesting. When ε is very strict, both algorithms perform roughly the same, with normalized messages of 0.25–0.75. As ε grows DILSQ develops an increasing advantage over PER with optimal period. The optimal period must be low enough to match the quickly changing model, and is wasteful on the intervals where $\beta$ is quiescent. For our dataset, $\beta_{\text{true}}$ is constant during roughly 75% of each epoch. For datasets with larger quiescent periods (or smaller window), the advantage of DILSQ will be even larger.

- Effect of the threshold ε on the model update cost

• (a) The true model is fixed; (b) the true model changes

• The period T of PER(T) is set so that its maximum error equals that of DILSQ

[Figure 5: Communication for DILSQ (black) and periodic algorithm tuned to achieve same max error (green) at different threshold values; panels (a) Fixed dataset and (b) Drift dataset.]

- Results when varying the parameters
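To experiment with different parameter values, a synthetic stream in the spirit of the paper's Section 5.1 could be generated as follows. This is a sketch under assumptions: the function name `make_stream` and its arguments are hypothetical, and because the excerpt does not say how the coefficients move during the drifting 25% of an epoch, the code linearly interpolates toward a freshly drawn model.

```python
import numpy as np

def make_stream(rounds, k=10, m=10, sigma=10.0, epoch_len=None,
                drift_frac=0.25, seed=0):
    """Yield (beta_true, X, y) for each round.

    X has shape (k, m): one new sample per node. y has shape (k,).
    epoch_len=None gives the fixed dataset; otherwise the true model
    drifts during the first drift_frac of every epoch (assumed linear).
    """
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=m)                 # true model, elements ~ N(0, 1)
    start, target = beta, beta
    for r in range(rounds):
        if epoch_len is not None:
            pos = r % epoch_len
            drift_rounds = max(1, int(drift_frac * epoch_len))
            if pos == 0:                      # new epoch: draw a new target model
                start, target = beta, rng.normal(size=m)
            if pos < drift_rounds:            # drifting part of the epoch
                t = (pos + 1) / drift_rounds
                beta = (1 - t) * start + t * target
        X = rng.normal(size=(k, m))           # x ~ N(0, 1) i.i.d.
        y = X @ beta + sigma * rng.normal(size=k)   # y = x^T beta + noise
        yield beta.copy(), X, y

fixed = list(make_stream(rounds=20))
drift = list(make_stream(rounds=40, epoch_len=20))
```

Varying `k`, `m`, `sigma`, or the window size of the monitor fed by this stream then mirrors the kind of parameter sweep this slide refers to.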
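- Traffic monitoring

• Problem: impute vehicle speed from the per-minute average vehicle speeds reported by multiple sensors

• DILSQ (black) imputes the speed about as well as the exact LSM (purple)

- Effect of the threshold ε on the model update cost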

[Figure 8: Communication for DILSQ (black) and periodic algorithm (green) on the traffic dataset at different ε values; panels (a) Window size W = 60 and (b) Window size W = 30.]

5.3 GLS on Gas Sensor Time Series

Data in this experiment consists of measurements collected by an array of 16 chemical sensors recorded at a sampling rate of 25Hz for 5 minutes, resulting in 7500 data points for each sensor. This dataset is described in [42], and is publicly available [19]. The original goal in [42] is to identify certain gas classes given high-level frequency features. Since the original target variable is nominal and fixed throughout the run in each experiment, we defined a different regression problem. We divided the 16 sensors to k = 4 "nodes", where in each node three sensors serve as the data x while the remaining sensor serves as the response y. We also added a constant variable 1 to x, to allow intercept in the model, hence m = 4. The regression task is therefore to predict the value of the 4th sensor in each node using the first three.

Note that in this setting measurement errors cannot be assumed to be independent, so an OLS model is ill-suited here. Instead, we assume errors are an AR(1) process and monitor the generalized least squares model [9]. We used an AR(1) parameter value of 0.95 for the autocorrelation matrix [32]. Average $\|\beta\|$ is 0.3, so we use ε = 0.1, resulting in 0.17 normalized messages for DILSQ. We note that using an OLS model with the same ε resulted in 1.15 normalized messages; the OLS model had to be updated very frequently as it was unstable.

We repeated the experiment for various ε values in the range [0.01, 1] (figure omitted for lack of space). For ε < 0.1 DILSQ is clearly superior: PER must communicate every round (T = 1) in order to match DILSQ, which achieves communication between 0.2 and 1 (for ε = 0.01). When ε is more permissive, however, PER is superior and can obtain the same maximum error with less communication: with an extremely permissive ε = 1, DILSQ requires 0.04 normalized messages while PER requires 0.015 for the same maximum error (though, of course, optimal T must be known a priori to achieve this performance).

6. CONCLUSIONS

DILSQ is the first communication-efficient monitoring algorithm for least-squares regression models that limits the error in model coefficients. By monitoring the deviation of the existing model from the true model, our approach is able to avoid costly communication and model recomputations, while guaranteeing bounded model error. Each round, each node checks a simple local constraint on its own local data; if it is satisfied, communication is avoided. If not, violation is resolved by collecting data from all nodes and computing a new global model. Evaluation on real-world datasets shows a communication reduction of up to two orders of magnitude. Simulations on synthetic datasets show our algorithm scales well with the number of nodes.

We emphasize that correctness of the local constraint is independent of network topology and the algorithm used to compute the model $\beta_0$. Hence it is straightforward to adapt our method to other settings. First, the role of the coordinator can easily be replaced with convergecasting [4, 39], yielding a peer-to-peer monitor. Alternatively, our distributed monitoring approach can easily be combined with an efficient distributed computation technique, enjoying the best of both worlds: the current model can be computed during sync using any of several existing algorithms, be they exact, iterative, or distributed [7, 21]. Similarly, our method is compatible with recent communication reduction techniques from the field of distributed streams, such as reference point prediction [6], individualized constraints or slack [14, 5], and local violation resolution [15]. We leave such extensions for future work.

7. ACKNOWLEDGMENTS

The research leading to these results has received funding from the European Union's Seventh Framework Programme FP7-ICT-2013-11 under grant agreement No 619491 and No 619435.

8. REFERENCES

[1] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Sig. Proc., 2006.

[2] Z. D. Bai and Y. Q. Yin. Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. Ann. Prob., 1993.

[3] K. Bhaduri, K. Das, and C. Giannella. Distributed monitoring of the $R^2$ statistic for linear regression. In Proc. SDM, 2011.

[4] K. Bhaduri and H. Kargupta. An efficient local algorithm for distributed multivariate regression in peer-to-peer networks. In Proc. SDM, 2008.

[5] M. Gabel, D. Keren, and A. Schuster. Communication-efficient distributed variance monitoring and outlier detection for multivariate time series. In Proc. IPDPS, 2014.

[6] N. Giatrakos, A. Deligiannakis, M. Garofalakis, I. Sharfman, and A. Schuster. Prediction-based geometric monitoring over distributed data streams. In Proc. SIGMOD. ACM, 2012.

[7] C. Guestrin, P. Bodík, R. Thibaux, M. A. Paskin, and S. Madden. Distributed regression: an efficient framework for modeling sensor network data. In Proc. IPSN, 2004.

[8] R. Gupta, K. Ramamritham, and M. K. Mohania. Ratio threshold queries over distributed data sources. PVLDB, 2013.

[9] F. Hayashi. Econometrics. Princeton University Press, 2000.

[10] L. Huang, X. Nguyen, M. N. Garofalakis, J. M. Hellerstein, M. I. Jordan, A. D. Joseph, and N. Taft. Communication-efficient online detection of network-wide anomalies. In INFOCOM, 2007.

[11] M. Jelasity, A. Montresor, and O. Babaoglu. Gossip-based aggregation in large dynamic networks. ACM TOCS, 2005.

[12] M. Kamp, M. Boley, D. Keren, A. Schuster, and I. Sharfman. Communication-efficient distributed online prediction by dynamic model synchronization. In Proc. ECML PKDD, 2014.

- Summary

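• A method for monitoring model changes over distributed stream data

• The model is updated only when data that affect the model arrive at some node

– An efficient condition for updating the model is derived

➢ Model changes can be tracked with little communication overhead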
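Putting the pieces together, the update rule summarized above can be sketched as a small simulation. This is an illustrative reading of the protocol, not the authors' DILSQ code: it assumes the sliding-window constraint $\epsilon\|\hat{A}_0^{-1}\Delta\| + \|\hat{A}_0^{-1}\delta\| + \|\hat{A}_0^{-1}\Delta\beta_0\| \le \epsilon$, recomputes the window statistics from scratch each round for clarity, and uses made-up parameter values and helper names (`node_stats`, `synchronize`, `violates`, `run_monitor`).

```python
from collections import deque
import numpy as np

def node_stats(window):
    """Local statistics A_j = sum x x^T and c_j = sum x y over one node's window."""
    X = np.array([x for x, _ in window])
    y = np.array([t for _, t in window])
    return X.T @ X, X.T @ y

def synchronize(windows):
    """Coordinator step: collect local statistics and rebuild the reference model."""
    local = [node_stats(w) for w in windows]
    A0 = sum(A for A, _ in local) / len(windows)
    c0 = sum(c for _, c in local) / len(windows)
    A0_inv = np.linalg.inv(A0)
    return local, A0_inv, A0_inv @ c0

def violates(A0_inv, beta0, Delta, delta, eps):
    """Local test for one node's drift (Delta, delta) under the assumed constraint."""
    M = A0_inv @ Delta
    lhs = (eps * np.linalg.norm(M, 2)             # L2 operator (spectral) norm
           + np.linalg.norm(A0_inv @ delta)
           + np.linalg.norm(M @ beta0))
    return lhs > eps

def run_monitor(rounds=400, k=5, m=3, W=100, eps=0.5, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    beta_true = rng.normal(size=m)
    windows = [deque(maxlen=W) for _ in range(k)]
    local0 = A0_inv = beta0 = None
    syncs, max_err = 0, 0.0
    for r in range(rounds):
        if r == rounds // 2:
            beta_true = beta_true + 2.0           # abrupt change of the true model
        for w in windows:                         # one new sample per node per round
            x = rng.normal(size=m)
            w.append((x, x @ beta_true + sigma * rng.normal()))
        if len(windows[0]) < W:
            continue                              # wait until the windows are full
        current = [node_stats(w) for w in windows]
        drifted = beta0 is not None and any(
            violates(A0_inv, beta0, A - A0j, c - c0j, eps)
            for (A, c), (A0j, c0j) in zip(current, local0))
        if beta0 is None or drifted:              # initial sync or violation recovery
            local0, A0_inv, beta0 = synchronize(windows)
            syncs += 1
        # Evaluation only: the exact global model on the current windows.
        A_hat = sum(A for A, _ in current) / k
        c_hat = sum(c for _, c in current) / k
        beta = np.linalg.solve(A_hat, c_hat)
        max_err = max(max_err, float(np.linalg.norm(beta - beta0)))
    return syncs, max_err

syncs, max_err = run_monitor()
```

`run_monitor()` returns the number of synchronizations and the largest observed $\|\beta - \beta_0\|$; under the assumed constraint the latter should not exceed ε. A periodic baseline such as PER(T) would instead call `synchronize` every T rounds regardless of the drift.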