This page reproduces the content of http://www.slideshare.net/kisa12012/icml2013-24126156.


- Large-Scale Learning with Less RAM via Randomization
[Golovin+ ICML13]

ICML reading group, 2013/07/09

Hidekazu Oiwa (@kisa12012) - The paper we read

• Large-Scale Learning with Less RAM via Randomization (ICML13)
• D. Golovin, D. Sculley, H. B. McMahan, M. Young (Google)
• http://arxiv.org/abs/1303.4664 (the paper)
• First presented at a NIPS 2012 workshop
• Figures and tables below are quoted from the paper

3 - One-slide summary

• Reduce the memory footprint of the weight vector
  • the number of bits is adjusted automatically so the model fits on a GPU / in the L1 cache
  • SGD-based algorithms are proposed

  β = (1.0, ..., 1.52)   float (32 bits)
  β̂ = (1.0, ..., 1.50)   e.g. Q2.2 (5 bits)

• Memory usage is reduced with almost no loss of accuracy
  • at training time: 50%; at prediction time: 95%
• Theoretical guarantees are given in the form of regret bounds

4 - Introduction
- Background: big data!!

• Memory capacity becomes a critical constraint
  • the full dataset does not fit in memory
  • can the data be processed on a GPU / in the L1 cache?
• Memory matters not only at training time but also at prediction time
  • it affects the latency of search ads and mail filters
• We want to reduce the memory consumed by the weight vector

6 - float (32 bits) is overkill in practice

Histogram of the weight-vector values of a linear classifier (y-axis: number of distinct features). Do we really need to keep each value at 32-bit precision?

Figure 1 (from the paper): Histogram of coefficients in a typical large-scale linear model trained from real data. Values are tightly grouped near zero; a large dynamic range is superfluous.

7 - (paper excerpt shown on the slide)

Contributions  This paper gives the following theoretical and empirical results:

1. Using a pre-determined fixed-point representation of coefficient values reduces cost from 32 to 16 bits per value, at the cost of a small linear regret term.
2. The cost of a per-coordinate learning rate schedule can be reduced from 32 to 8 bits per coordinate using a randomized counting scheme.
3. Using an adaptive per-coordinate coarse representation of coefficient values reduces memory cost further and yields a no-regret algorithm.
4. Variable-width encoding at prediction time allows coefficients to be encoded even more compactly (less than 2 bits per value in experiments) with negligible added loss.

Approaches 1 and 2 are particularly attractive, as they require only small code changes and use negligible additional CPU time. Approaches 3 and 4 require more sophisticated data structures.

2. Related Work

In addition to the sources already referenced, related work has been done in several areas.

Smaller Models  A classic approach to reducing memory usage is to encourage sparsity, for example via the Lasso (Tibshirani, 1996) variant of least-squares regression, and the more general application of L1 regularizers (Duchi et al., 2008; Langford et al., 2009; Xiao, 2009; McMahan, 2011). A more recent trend has been to reduce memory cost via the use of feature hashing (Weinberger et al., 2009). Both families of approaches are effective. The coarse encoding schemes reported here may be used in conjunction with these methods to give further reductions in memory usage.

Randomized Rounding  Randomized rounding schemes have been widely used in numerical computing and algorithm design (Raghavan & Tompson, 1987). Recently, the related technique of randomized counting has enabled compact language models (Van Durme & Lall, 2009). To our knowledge, this paper gives the first algorithms and analysis for online learning with randomized rounding and counting.

Per-Coordinate Learning Rates  Duchi et al. (2010) and McMahan & Streeter (2010) demonstrated that per-coordinate adaptive regularization (i.e., adaptive learning rates) can greatly boost prediction accuracy. The intuition is to let the learning rate for common features decrease quickly, while keeping the learning rate high for rare features. This adaptivity increases RAM cost by requiring an additional statistic to be stored for each coordinate, most often as an additional 32-bit integer. Our approach reduces this cost by using an 8-bit randomized counter instead, using a variant of Morris's algorithm (Morris, 1978).

3. Learning with Randomized Rounding and Probabilistic Counting

For concreteness, we focus on logistic regression with binary feature vectors x ∈ {0, 1}^d and labels y ∈ {0, 1}. The model has coefficients β ∈ R^d, and gives predictions p_β(x) ≡ σ(β · x), where σ(z) ≡ 1/(1 + e^{−z}) is the logistic function. Logistic regression finds the model that minimizes the logistic loss L. Given a labeled example (x, y) the logistic loss is

  L(x, y; β) ≡ −y log(p_β(x)) − (1 − y) log(1 − p_β(x)),

where we take 0 log 0 = 0. Here, we take log to be the natural logarithm. We define ‖x‖_p as the ℓp norm of a vector x; when the subscript p is omitted, the ℓ2 norm is implied. We use the compressed summation notation g_{1:t} ≡ Σ_{s=1}^t g_s for scalars, and similarly f_{1:t}(x) ≡ Σ_{s=1}^t f_s(x) for functions.

The basic algorithm we propose and analyze is a variant of online gradient descent (OGD) that stores coefficients β in a limited-precision format using a discrete set (εZ)^d. For each OGD update, we compute each new coefficient value in 64-bit floating point representation and then use randomized rounding to project the updated value back to the coarser representation.

A useful representation for the discrete set (εZ)^d is the Qn.m fixed-point representation. This uses n bits for the integral part of the value, and m bits for the fractional part. Adding in a sign bit results in a total of K = n + m + 1 bits per value. The value m may be fixed in advance, or set adaptively as described below. We use the method RandomRound from Algorithm 1 to project values onto this encoding.

The added CPU cost of fixed-point encoding and randomized rounding is low. Typically K is chosen to

correspond to a machine integer (say K = 8 or 16), …

- Reducing the bit count: how?

• Would a fixed bit length be enough?
  • If the optimal weight vector β* cannot be represented with the fixed bit length, the iterates never converge.
• Idea:
  • at training time, change the representation according to the step size.
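To see why a fixed grid can block convergence, here is a toy sketch (the objective, step size, and grid are my own choices for illustration, not from the slides or the paper): minimizing f(β) = (β − 0.3)² on a grid of resolution ε = 0.25, deterministic rounding gets stuck on a grid point, while randomized rounding stays unbiased, so the long-run average of the iterates still sits near the unrepresentable optimum.

```python
import math
import random

def grad(beta):
    # gradient of f(beta) = (beta - 0.3)^2; the optimum 0.3 is NOT a grid point
    return 2.0 * (beta - 0.3)

def det_round(beta, eps):
    # deterministic rounding to the nearest point of the grid (eps * Z)
    return eps * round(beta / eps)

def rand_round(beta, eps):
    # unbiased randomized rounding: E[result] = beta
    a = eps * math.floor(beta / eps)
    return a + eps if random.random() < (beta - a) / eps else a

random.seed(0)
eps, eta, steps = 0.25, 0.05, 20_000

det_beta = 0.5
for _ in range(steps):
    det_beta = det_round(det_beta - eta * grad(det_beta), eps)
# each update (0.02) is smaller than eps/2, so det_beta never leaves 0.5

rand_beta, total = 0.5, 0.0
for _ in range(steps):
    rand_beta = rand_round(rand_beta - eta * grad(rand_beta), eps)
    total += rand_beta
rand_avg = total / steps  # hops between 0.25 and 0.5, averaging near 0.3

print(det_beta, round(rand_avg, 3))
```

The deck's fix is sharper still: instead of averaging over a fixed grid, shrink the grid ε together with the step size, which is exactly the εt ≤ γηt schedule introduced below.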

8 - Algorithm
- Definition of the bit-length notation

• Qn.m: notation for a fixed-bit-length representation
  • n: number of bits for the integer part
  • m: number of bits for the fractional part (e.g., with m = 1 the grid consists of multiples of 2^{−1}, so 1.5 = 3 × 2^{−1} is representable)
• Qn.m uses (n + m + 1) bits in total
  • the extra 1 bit is the sign
• ε: the gap between representable points (ε = 2^{−m})
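A minimal reading of the Qn.m notation in code (the storage layout here is illustrative; all the paper requires is that values live on the grid εZ with ε = 2^{−m}):

```python
def qnm_encode(beta, n, m):
    """Quantize beta onto the Qn.m grid: n integer bits, m fractional bits,
    plus a sign bit -> K = n + m + 1 bits, resolution eps = 2**-m."""
    eps = 2.0 ** -m
    q = round(beta / eps)          # stored as a small signed integer
    q_max = 2 ** (n + m)           # magnitude limit from the n + m bits
    return max(-q_max, min(q, q_max))

def qnm_decode(q, m):
    # back to float: a single integer-float multiplication by eps = 2**-m
    return q * (2.0 ** -m)

# Q2.2 (5 bits): grid step 0.25, so 1.52 can only be stored as ~1.5,
# matching the beta vs. beta-hat example on the summary slide
q = qnm_encode(1.52, 2, 2)
print(q, qnm_decode(q, 2))
```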

10 - Algorithm (the Algorithm 1 page of the paper, with the speaker's annotations)

Algorithm 1 OGD-Rand-1d
  input: feasible set F = [−R, R], learning rate schedule ηt, resolution schedule εt
  define fun Project(β) = max(−R, min(β, R))
  Initialize β̂1 = 0
  for t = 1, ..., T do
    Play the point β̂t, observe gt
    βt+1 = Project(β̂t − ηt gt)        ← an ordinary SGD step
    β̂t+1 ← RandomRound(βt+1, εt)      ← drop the precision down to Qn.m

  function RandomRound(β, ε)
    a ← ε⌊β/ε⌋;  b ← ε⌈β/ε⌉
    return b with probability (β − a)/ε, and a otherwise

… so converting back to a floating point representation requires a single integer-float multiplication (by ε = 2^{−m}). Randomized rounding requires a call to a pseudo-random number generator, which may be done in 18-20 flops. Overall, the added CPU overhead is negligible, especially as many large-scale learning methods are I/O bound reading from disk or network rather than CPU bound.

3.1. Regret Bounds for Randomized Rounding

We now prove theoretical guarantees (in the form of upper bounds on regret) for a variant of OGD that uses randomized rounding on an adaptive grid as well as per-coordinate learning rates. (These bounds can also be applied to a fixed grid.) We use the standard definition

  Regret ≡ Σ_{t=1}^T f_t(β̂_t) − min_{β* ∈ F} Σ_{t=1}^T f_t(β*),

given a sequence of convex loss functions f_t. Here the β̂_t our algorithm plays are random variables, and since we allow the adversary to adapt based on the previously observed β̂_t, the f_t and the post-hoc optimal β* are also random variables. We prove bounds on expected regret, where the expectation is with respect to the randomization used by our algorithms (high-probability bounds are also possible). We consider regret with respect to the best model in the non-discretized comparison class F = [−R, R]^d.

We follow the usual reduction from convex to linear functions introduced by Zinkevich (2003); see also Shalev-Shwartz (2012, Sec. 2.4). Further, since we consider the hyper-rectangle feasible set F = [−R, R]^d, the linear problem decomposes into d independent one-dimensional problems.¹ In this setting, we consider OGD with randomized rounding to an adaptive grid of resolution εt on round t, and an adaptive learning rate ηt. We then run one copy of this algorithm for each coordinate of the original convex problem, implying that we can choose the ηt and εt schedules appropriately for each coordinate. For simplicity, we assume the εt resolutions are chosen so that −R and +R are always gridpoints. Algorithm 1 gives the one-dimensional version, which is run independently on each coordinate (with a different learning rate and discretization schedule) in Algorithm 2. The core result is a regret bound for Algorithm 1 (omitted proofs can be found in the Appendix):

Theorem 3.1. Consider running Algorithm 1 with an adaptive non-increasing learning-rate schedule ηt, and a discretization schedule εt such that εt ≤ γηt for a constant γ > 0. Then, against any sequence of gradients g1, ..., gT (possibly selected by an adaptive adversary) with |gt| ≤ G, and against any comparator point β* ∈ [−R, R], we have

  E[Regret(β*)] ≤ (2R)² / (2ηT) + (1/2)(G² + γ²) η_{1:T} + γR√T.

By choosing γ sufficiently small, we obtain an expected regret bound that is indistinguishable from the non-rounded version (which is obtained by taking γ = 0). In practice, we find simply choosing γ = 1 yields excellent results. With some care in the choice of norms used, it is straightforward to extend the above result to d dimensions. Applying the above algorithm on a per-coordinate basis yields the following guarantee:

Corollary 3.2. Consider running Algorithm 2 on the feasible set F = [−R, R]^d, which in turn runs Algorithm 1 on each coordinate. We use per-coordinate learning rates η_{t,i} = α/√τ_{t,i} with α = √2 R/√(G² + γ²), where τ_{t,i} ≤ t is the number of non-zero g_{s,i} seen on coordinate i on rounds s = 1, ..., t. Then, against convex loss functions f_t, with g_t a subgradient of f_t at β̂_t, such that ∀t, ‖g_t‖_∞ ≤ G, we have

  E[Regret] ≤ Σ_{i=1}^d ( 2R √(2 τ_{T,i} (G² + γ²)) + γR √τ_{T,i} ).

The proof follows by summing the bound from Theorem 3.1 over each coordinate, considering only the rounds when g_{t,i} ≠ 0, and then using the inequality Σ_{t=1}^T 1/√t ≤ 2√T to handle the sum of learning rates on each coordinate.

The core intuition behind this algorithm is that for features where we have little data (that is, τi is small, for example rare words in a bag-of-words representation, identified by a binary feature), using a fine-precision coefficient is unnecessary, as we can't estimate the correct coefficient with much confidence. This is in fact the same reason using a larger learning rate is appropriate, so it is no coincidence the theory suggests choosing εt and ηt to be of the same magnitude.

¹ Extension to arbitrary feasible sets is possible, but choosing the hyper-rectangle simplifies the analysis; in practice, projection onto the feasible set rarely helps performance.

11 - RandomRound illustrated

β lies between the grid points a and b, splitting the gap in the ratio (β − a) : (b − β). It is rounded up to b with probability (β − a)/ε, and down to a otherwise.

Example: if β splits the gap a : b in the ratio 1 : 4, it is rounded to a with probability 80% and to b with probability 20%.
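RandomRound as pseudocoded above is a few lines of Python; the Monte-Carlo check below reproduces the slide's 1 : 4 example (a point one fifth of the way up the gap rounds up about 20% of the time) and the unbiasedness E[β̂] = β:

```python
import math
import random

def random_round(beta, eps):
    """Round beta to the grid (eps * Z) so that E[result] = beta."""
    a = eps * math.floor(beta / eps)   # grid point just below beta
    b = a + eps                        # grid point just above beta
    return b if random.random() < (beta - a) / eps else a

random.seed(0)
eps, beta = 0.25, 0.05                 # beta splits the gap [0, 0.25] as 1:4
draws = [random_round(beta, eps) for _ in range(100_000)]

up_frac = sum(d == eps for d in draws) / len(draws)   # fraction rounded up
mean = sum(draws) / len(draws)                        # unbiased around beta
print(round(up_frac, 3), round(mean, 4))
```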

12 - How to choose the gap ε (the same Algorithm 1 excerpt, annotated)

• From the SGD step size ηt and an arbitrary constant γ > 0,
• set the bit length so that εt ≤ γηt is satisfied.
• The expected regret is then bounded in O(√T) (Thm. 3.1):

  E[Regret(β*)] ≤ (2R)² / (2ηT) + (1/2)(G² + γ²) η_{1:T} + γR√T

• As γ → 0, this recovers the regret bound of the (non-rounded) float version.

13 - Per-coordinate learning rates (a.k.a. AdaGrad)

[Duchi+ COLT10]

• Vary the step size per feature:
  • for frequently occurring features, shrink the step size quickly;
  • for rarely occurring features, shrink it only a little.
• A method for converging to the optimum very quickly.
• The occurrence count of every feature must be stored:
  • a 32-bit int per feature,
  • approximated by Morris' algorithm [Morris 78] -> 8 bits.
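The per-coordinate schedule is just one count per feature (the toy word stream and α below are mine, for illustration):

```python
import math

alpha = 0.5
tau = {}   # tau[i]: number of rounds feature i appeared with a nonzero gradient

def eta(i):
    # eta_{t,i} = alpha / sqrt(tau_i): frequent features cool down quickly,
    # rare features keep a large step size
    return alpha / math.sqrt(tau[i])

for word in ["the", "the", "the", "the", "the", "the", "the", "rare"]:
    tau[word] = tau.get(word, 0) + 1

print(eta("the"), eta("rare"))  # the rare feature keeps the larger rate
```

It is this `tau` table, one 32-bit integer per feature, that the next slide replaces with an 8-bit randomized counter.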

14 - Morris algorithm (shown on the Algorithm 2 page of the paper)

• Turn the frequency counter into a random variable.
• Every time the feature occurs, perform the following operation:
  • with probability p(C) = b^{−C}, increment C by 1.
• Return τ̃(C) = (b^C − b)/(b − 1) as the frequency.
• τ̃ is an unbiased estimator of the true count.

15
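The counter from the slide, sketched in Python (the base b = 1.1 is my choice here; it keeps the estimator's variance modest while the counter C itself stays well inside 8 bits):

```python
import random

B = 1.1  # base b; a larger base gives a smaller counter but a noisier estimate

def increment(c):
    # with probability b**-c, bump the counter
    return c + 1 if random.random() < B ** (-c) else c

def estimate(c):
    # unbiased estimate of the true count: (b**c - b) / (b - 1)
    return (B ** c - B) / (B - 1)

random.seed(1)
true_count, trials = 1000, 300
estimates, final_c = [], 1
for _ in range(trials):
    c = 1                          # the counter starts at C = 1 (estimate 0)
    for _ in range(true_count):
        c = increment(c)
    estimates.append(estimate(c))
    final_c = c

mean_est = sum(estimates) / trials
print(round(mean_est, 1), final_c)   # mean near 1000, while C itself is ~50
```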

- The per-coordinate version of the algorithm (the Algorithm 2 page of the paper, with the speaker's annotations)

Algorithm 2 OGD-Rand
  input: feasible set F = [−R, R]^d, parameters α, γ > 0
  Initialize β̂1 = 0 ∈ R^d; ∀i, τi = 0
  for t = 1, ..., T do
    Play the point β̂t, observe loss function ft
    for i = 1, ..., d do
      let g_{t,i} = ∇ft(xt)_i
      if g_{t,i} = 0 then continue
      τi ← τi + 1                                 ← frequency counting; Morris' algorithm shrinks this to 8 bits
      let η_{t,i} = α/√τi and ε_{t,i} = η_{t,i}   ← the frequency information sets the step size (and the grid)
      β_{t+1,i} ← Project(β̂_{t,i} − η_{t,i} g_{t,i})
      β̂_{t+1,i} ← RandomRound(β_{t+1,i}, ε_{t,i})

3.2. Approximate Feature Counts

Online convex optimization methods typically use a learning rate that decreases over time, e.g., setting ηt proportional to 1/√t. Per-coordinate learning rates require storing a unique count τi for each coordinate, where τi is the number of times coordinate i has appeared with a non-zero gradient so far. Significant space is saved by using an 8-bit randomized counting scheme rather than a 32-bit (or 64-bit) integer to store the d total counts. We use a variant of Morris' probabilistic counting algorithm (1978), analyzed by Flajolet (1985). Specifically, we initialize a counter C = 1, and on each increment operation, we increment C with probability p(C) = b^{−C}, where the base b is a parameter. We estimate the count as τ̃(C) = (b^C − b)/(b − 1), which is an unbiased estimator of the true count. We then use learning rates η_{t,i} = α/√(τ̃_{t,i} + 1), which ensures that even when τ̃_{t,i} = 0 we don't divide by zero.

We compute high-probability bounds on this counter in Lemma A.1. Using these bounds for η_{t,i} in conjunction with Theorem 3.1, we obtain the following result (proof deferred to the appendix).

Theorem 3.3. Consider running the algorithm of Corollary 3.2 under the assumptions specified there, but using approximate counts τ̃i in place of the exact counts τi. The approximate counts are computed using the randomized counter described above with any base b > 1. Thus, τ̃_{t,i} is the estimated number of times g_{s,i} ≠ 0 on rounds s = 1, ..., t, and the per-coordinate learning rates are η_{t,i} = α/√(τ̃_{t,i} + 1). With an appropriate choice of α we have

  E[Regret(g)] = o( R √(G² + γ²) T^{0.5+κ} )  for all κ > 0,

where the o-notation hides a small constant factor and the dependence on the base b.³

Fixed Discretization  Rather than implementing an adaptive discretization schedule, it is more straightforward and more efficient to choose a fixed grid resolution; for example, a 16-bit Qn.m representation is sufficient for many applications.² In this case, one can apply the above theory, but simply stop decreasing the learning rate once it reaches, say, ε (= 2^{−m}). Then the η_{1:T} term in the regret bound yields a linear term like O(εT); this is unavoidable when using a fixed resolution ε. One could let the learning rate continue to decrease like 1/√t, but this would provide no benefit; in fact, lower-bounding the learning rate is known to allow online gradient descent to provide regret bounds against a moving comparator (Zinkevich, 2003).

Data Structures  There are several viable approaches to storing models with variable-sized coefficients. One can store all keys at a fixed (low) precision, then maintain a sequence of maps (e.g., as hashtables), each containing a mapping from keys to coefficients of increasing precision. Alternately, a simple linear-probing hash table for variable-length keys is efficient for a wide variety of distributions on key lengths, as demonstrated by Thorup (2009). With this data structure, keys and coefficient values can be treated as strings over 4-bit or 8-bit bytes, for example. Blandford & Blelloch (2008) provide yet another data structure: a compact dictionary for variable-length keys. Finally, for a fixed model, one can write out the string s of all coefficients (without end-of-string delimiters), store a second binary string of length s with ones at the coefficient boundaries, and use any of a number of rank/select data structures to index into it, e.g., the one of Patrascu (2008).

4. Encoding During Prediction Time

Many real-world problems require large-scale prediction. Achieving scale may require that a trained model be replicated to multiple machines (Buciluǎ et al., 2006). Saving RAM via rounding is especially attractive here, because unlike in training accumulated …

² If we scale x → 2x then we must take β → β/2 to make the same predictions, and so appropriate choices of n and m must be data-dependent.
³ Eq. (5) in the appendix provides a non-asymptotic (but more cumbersome) regret bound.

16 - Further approximation is possible at prediction time

• At prediction time, as long as the impact on predictions stays small, the number of bits can be cut quite aggressively

• Lemmas 4.1 and 4.2, Theorem 4.3: analysis of the error that can arise with logistic loss, as a function of how coarse the approximation is

• On top of that, compression can shrink memory all the way down to the information-theoretic lower bound
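As a concrete illustration of the unbiased rounding behind this error analysis, here is a minimal Python sketch (the function name `random_round` and its parameters are my own, not from the paper): each coefficient is rounded to a neighboring point of the ε-grid with probabilities chosen so the expectation is exact, which keeps every coordinate within ε and hence |β·x − β̂·x| ≤ ε‖x‖₀ for binary x.

```python
import random

def random_round(v: float, eps: float) -> float:
    """Round v to the grid eps*Z: round up with probability proportional to
    how close v is to the upper grid point, so E[random_round(v)] == v."""
    lower = (v // eps) * eps           # largest grid point <= v
    p_up = (v - lower) / eps           # in [0, 1)
    return lower + eps if random.random() < p_up else lower

eps = 2 ** -2                          # a Q2.2-style grid: step 0.25
beta = [1.0, -0.37, 1.52]
beta_hat = [random_round(b, eps) for b in beta]

# Pathwise, every coordinate moves by less than eps (Lemma 4.1's bound),
# and averaging many independent roundings recovers the original value.
assert all(abs(b - bh) < eps for b, bh in zip(beta, beta_hat))
mean = sum(random_round(0.37, eps) for _ in range(200_000)) / 200_000
assert abs(mean - 0.37) < 0.01
```

The same projection is what training applies after every OGD update; at prediction time it is applied once, to the fully trained model, with a much coarser grid.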

• However, that lower bound is not all that small

Large-Scale Learning with Less RAM via Randomization

Figure 1. Histogram of coefficients in a typical large-scale linear model trained from real data. Values are tightly grouped near zero; a large dynamic range is superfluous.

Contributions — This paper gives the following theoretical and empirical results:

1. Using a pre-determined fixed-point representation of coefficient values reduces cost from 32 to 16 bits per value, at the cost of a small linear regret term.

2. The cost of a per-coordinate learning rate schedule can be reduced from 32 to 8 bits per coordinate using a randomized counting scheme.

3. Using an adaptive per-coordinate coarse representation of coefficient values reduces memory cost further and yields a no-regret algorithm.

4. Variable-width encoding at prediction time allows coefficients to be encoded even more compactly (less than 2 bits per value in experiments) with negligible added loss.

Approaches 1 and 2 are particularly attractive, as they require only small code changes and use negligible additional CPU time. Approaches 3 and 4 require more sophisticated data structures.

2. Related Work

In addition to the sources already referenced, related work has been done in several areas.

Smaller Models — A classic approach to reducing memory usage is to encourage sparsity, for example via the Lasso (Tibshirani, 1996) variant of least-squares regression, and the more general application of L1 regularizers (Duchi et al., 2008; Langford et al., 2009; Xiao, 2009; McMahan, 2011). A more recent trend has been to reduce memory cost via the use of feature hashing (Weinberger et al., 2009). Both families of approaches are effective. The coarse encoding schemes reported here may be used in conjunction with these methods to give further reductions in memory usage.

Randomized Rounding — Randomized rounding schemes have been widely used in numerical computing and algorithm design (Raghavan & Tompson, 1987). Recently, the related technique of randomized counting has enabled compact language models (Van Durme & Lall, 2009). To our knowledge, this paper gives the first algorithms and analysis for online learning with randomized rounding and counting.

Per-Coordinate Learning Rates — Duchi et al. (2010) and McMahan & Streeter (2010) demonstrated that per-coordinate adaptive regularization (i.e., adaptive learning rates) can greatly boost prediction accuracy. The intuition is to let the learning rate for common features decrease quickly, while keeping the learning rate high for rare features. This adaptivity increases RAM cost by requiring an additional statistic to be stored for each coordinate, most often as an additional 32-bit integer. Our approach reduces this cost by using an 8-bit randomized counter instead, using a variant of Morris's algorithm (Morris, 1978).

3. Learning with Randomized Rounding and Probabilistic Counting

For concreteness, we focus on logistic regression with binary feature vectors x ∈ {0, 1}ᵈ and labels y ∈ {0, 1}. The model has coefficients β ∈ ℝᵈ, and gives predictions p_β(x) ≡ σ(β · x), where σ(z) ≡ 1/(1 + e⁻ᶻ) is the logistic function. Logistic regression finds the model that minimizes the logistic loss L. Given a labeled example (x, y), the logistic loss is

  L(x, y; β) ≡ −y log(p_β(x)) − (1 − y) log(1 − p_β(x)),

where we take 0 log 0 = 0. Here, we take log to be the natural logarithm. We define ‖x‖_p as the ℓ_p norm of a vector x; when the subscript p is omitted, the ℓ₂ norm is implied. We use the compressed summation notation g_{1:t} ≡ Σ_{s=1}^{t} g_s for scalars, and similarly f_{1:t}(x) ≡ Σ_{s=1}^{t} f_s(x) for functions.

The basic algorithm we propose and analyze is a variant of online gradient descent (OGD) that stores coefficients β in a limited-precision format using a discrete set (εℤ)ᵈ. For each OGD update, we compute each new coefficient value in 64-bit floating-point representation and then use randomized rounding to project the updated value back to the coarser representation. A useful representation for the discrete set (εℤ)ᵈ is the Qn.m fixed-point representation. This uses n bits for the integral part of the value, and m bits for the fractional part. Adding in a sign bit results in a total of K = n + m + 1 bits per value. The value m may be fixed in advance, or set adaptively as described below. We use the method RandomRound from Algorithm 1 to project values onto this encoding. The added CPU cost of fixed-point encoding and randomized rounding is low. Typically K is chosen to correspond to a machine integer (say K = 8 or 16), …

17

- Experiments
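Before the experiments, it may help to make the q2.m encodings concrete. A Qn.m value is a sign bit, n integer bits, and m fractional bits (K = n + m + 1 bits total, so Q2.2 is the 5-bit example from the earlier slide). The sketch below is my own illustration and uses deterministic nearest rounding rather than the paper's RandomRound:

```python
def qnm_encode(v: float, n: int = 2, m: int = 13) -> int:
    """Pack v into sign + n integer bits + m fractional bits (K = n+m+1).
    v is clamped to the representable range; the magnitude is stored in
    units of 2**-m (deterministic nearest rounding, for illustration)."""
    limit = (1 << n) - 2 ** -m          # largest representable magnitude
    v = max(-limit, min(limit, v))
    sign = 1 if v < 0 else 0
    mag = int(round(abs(v) * (1 << m)))
    return (sign << (n + m)) | mag

def qnm_decode(bits: int, n: int = 2, m: int = 13) -> float:
    sign = -1.0 if bits >> (n + m) else 1.0
    return sign * (bits & ((1 << (n + m)) - 1)) / (1 << m)

# Q2.2 (5 bits, grid step 0.25): 1.52 is stored as 1.5, matching the
# beta = (1.0, ..., 1.52) -> (1.0, ..., 1.50) example on the slides.
code = qnm_encode(1.52, n=2, m=2)
assert qnm_decode(code, n=2, m=2) == 1.5
```

The q2.13 encoding used during training (16 bits per value including the sign) is the same layout with m = 13.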
- RCV1 Dataset

            Train     Test      Features
  RCV1      20,242    677,399   47,236

19
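The 8-bit randomized counts that appear in the training results below are a variant of Morris's probabilistic counter: store only a small exponent and increment it with geometrically decreasing probability. This sketch (class and parameter names are mine; the experiments use base b = 1.1) shows the counter and the per-coordinate learning rate η = α/√(τ̃ + 1) built from it:

```python
import random

class MorrisCounter:
    """Morris-style counter: keep a small exponent c and estimate the true
    count as (b**c - 1) / (b - 1). With b = 1.1, an 8-bit c reaches counts
    beyond 10**10, versus 32 bits for an exact counter."""
    def __init__(self, base: float = 1.1):
        self.base = base
        self.c = 0                       # small enough to fit in 8 bits

    def increment(self) -> None:
        # Bump the exponent with probability b**-c.
        if random.random() < self.base ** -self.c:
            self.c += 1

    def estimate(self) -> float:
        return (self.base ** self.c - 1) / (self.base - 1)

def learning_rate(counter: MorrisCounter, alpha: float = 0.1) -> float:
    # Per-coordinate rate eta = alpha / sqrt(tau_estimate + 1).
    return alpha / (counter.estimate() + 1) ** 0.5

cnt = MorrisCounter()
for _ in range(10_000):
    cnt.increment()
# The estimator is unbiased; a single run typically lands within a few
# tens of percent of the true count 10_000.
assert 500 < cnt.estimate() < 200_000
assert 0 < learning_rate(cnt) < 0.1
```

The wide assertion bounds reflect the counter's variance, which is the price paid for the 4x memory saving; Theorem 3.3 above quantifies the resulting regret cost.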

Figure 2. Rounding at Training Time. The fixed q2.13 encoding is 50% smaller than control with no loss. Per-coordinate

learning rates significantly improve predictions but use 64 bits per value. Randomized counting reduces this to 40 bits.

Using adaptive or fixed precision reduces memory use further, to 24 total bits per value or less. The benefit of adaptive

precision is seen more on the larger CTR data.

roundoff error is no longer an issue. This allows even more aggressive rounding to be used safely.

Consider rounding a trained model β to some β̂. We can bound both the additive and relative effect on the logistic loss L(·) in terms of the quantity |β · x − β̂ · x|:

Lemma 4.1 (Additive Error). Fix β, β̂ and (x, y). Let Δ = |β · x − β̂ · x|. Then the logistic loss satisfies

  L(x, y; β̂) − L(x, y; β) ≤ Δ.

Proof. It is well known that |∂L(x, y; β)/∂β_i| ≤ 1 for all x, y, β and i, which implies the result.

Lemma 4.2 (Relative Error). Fix β, β̂ and (x, y) ∈ {0, 1}ᵈ × {0, 1}. Let Δ = |β · x − β̂ · x|. Then

  (L(x, y; β̂) − L(x, y; β)) / L(x, y; β) ≤ e^Δ − 1.

See the appendix for a proof. Now, suppose we are using fixed-precision numbers to store our model coefficients, such as the Qn.m encoding described earlier, with a precision of ε. This induces a grid of feasible model coefficient vectors. If we randomly round each coefficient β_i (where |β_i| ≤ 2ⁿ) independently up or down to the nearest feasible value β̂_i, such that E[β̂_i] = β_i, then for any x ∈ {0, 1}ᵈ our predicted log-odds ratio β̂ · x is distributed as a sum of independent random variables {β̂_i | x_i = 1}.

Let k = ‖x‖₀. In this situation, note that |β · x − β̂ · x| ≤ ε‖x‖₁ = εk, since |β_i − β̂_i| ≤ ε for all i. Thus Lemma 4.1 implies L(x, y; β̂) − L(x, y; β) ≤ ε‖x‖₁. Similarly, Lemma 4.2 immediately provides an upper bound of e^{εk} − 1 on the relative logistic error; this bound is relatively tight for small k, and holds with probability one, but it does not exploit the fact that the randomness is unbiased and that errors should cancel out when k is large. The following theorem gives a bound on expected relative error that is much tighter for large k:

Theorem 4.3. Let β̂ be a model obtained from β using unbiased randomized rounding to a precision-ε grid as described above. Then, the expected logistic-loss relative error of β̂ on any input x is at most 2√(2πk) exp(ε²k/2) ε, where k = ‖x‖₀.

Additional Compression — Figure 1 reveals that coefficient values are not uniformly distributed. Storing these values in a fixed-point representation means that individual values will occur many times. Basic information theory shows that the more common val- …

- Large-Scale Learning with Less RAM via Randomization

CTR Dataset

Private search-ad click-log data (not public)

            Examples   Features
  CTR       30M        20M
  CTR       billions   billions

The paper says the results are almost the same even on the billions-scale data.

20

- Approximation quality of the prediction model

Table 1. Rounding at Prediction Time for CTR Data. Fixed-point encodings are compared to a 32-bit floating-point control model. Added loss is negligible even when using only 1.5 bits per value with optimal encoding.

  Encoding   AucLoss   Opt. Bits/Val
  q2.3       +5.72%    0.1
  q2.5       +0.44%    0.5
  q2.7       +0.03%    1.5
  q2.9       +0.00%    3.3

(Opt. Bits/Val = the size per value when compressed down to the information-theoretic lower bound.)

…ues may be encoded with fewer bits. The theoretical bound for a whole model with d coefficients is (1/d) Σ_{i=1}^{d} −log p(β_i) bits per value, where p(v) is the probability of occurrence of v in β across all dimensions d. Variable-length encoding schemes may approach this limit and achieve further RAM savings.

5. Experimental Results

We evaluated on both public and private large data sets. We used the public RCV1 text classification data set, specifically from Chang & Lin (2011). In keeping with common practice on this data set, the smaller "train" split of 20,242 examples was used for parameter tuning and the larger "test" split of 677,399 examples was used for the full online learning experiments. We also report results from a private CTR data set of roughly 30M examples and 20M features, sampled from real ad click data from a major search engine. Even larger experiments were run on data sets of billions of examples and billions of dimensions, with similar results as those reported here.

The evaluation metrics for predictions are error rate for the RCV1 data, and AucLoss (or 1 − AUC) relative to a control model for the CTR data. Lower values are better. Metrics are computed using progressive validation (Blum et al., 1999), as is standard for online learning: on each round a prediction is made for a given example and recorded for evaluation, and only after that is the model allowed to train on the example. We also report the number of bits per coordinate used.

Rounding During Training — Our main results are given in Figure 2. The comparison baseline is online logistic regression using a single global learning rate and 32-bit floats to store coefficients. We also test the effect of per-coordinate learning rates with both 32-bit integers for exact counts and with 8-bit randomized counts. We test the range of tradeoffs available for fixed-precision rounding with randomized counts, varying the number of precision bits m in the q2.m encoding to plot the tradeoff curve (cyan). We also test the range of tradeoffs available for adaptive-precision rounding with randomized counts, varying the precision scalar to plot the tradeoff curve (dark red). For all randomized counts a base of 1.1 was used. Other than these differences, the algorithms tested are identical.

Using a single global learning rate, a fixed q2.13 encoding saves 50% of the RAM at no added loss compared to the baseline. The addition of per-coordinate learning rates gives significant improvement in predictive performance, but at the price of added memory consumption, increasing from 32 bits per coordinate to 64 bits per coordinate in the baselines. Using randomized counts reduces this down to 40 bits per coordinate. However, both the fixed-precision and the adaptive-precision methods give far better results, achieving the same excellent predictive performance as the 64-bit method with 24 bits per coefficient or less. This saves 62.5% of the RAM cost compared to the 64-bit method, and is still smaller than using 32-bit floats with a global learning rate.

The benefit of adaptive precision is only apparent on the larger CTR data set, which has a "long tail" distribution of support across features. However, it is useful to note that the simpler fixed-precision method also gives great benefit. For example, using q2.13 encoding for coefficient values and 8-bit randomized counters allows full-byte alignment in naive data structures.

Rounding at Prediction Time — We tested the effect of performing coarser randomized rounding of a fully-trained model on the CTR data, and compared to the loss incurred using a 32-bit floating-point representation. These results, given in Table 1, clearly support the theoretical analysis that suggests more aggressive rounding is possible at prediction time. Surprisingly coarse levels of precision give excellent results, with little or no loss in predictive performance. The memory savings achievable in this scheme are considerable, down to less than two bits per value for q2.7 with theoretically optimal encoding of the discrete values.

6. Conclusions

Randomized storage of coefficient values provides an efficient method for achieving significant RAM savings both during training and at prediction time. While in this work we focus on OGD, similar randomized rounding schemes may be applied to other learning algorithms. The extension to algorithms that efficiently handle L1 regularization, like RDA (Xiao, 2009) and FTRL-Proximal (McMahan, 2011), is relatively straightforward.

21

- Summary
2009) and FTRL-Proximal (McMahan, 2011), is rela- - まとめ

• Reducing the memory footprint of the weight vector

• Randomized Rounding

  β = (1.0, …, 1.52)   float (32 bits)

  β̂ = (1.0, …, 1.50)   Q2.2 (ex.: 5 bits)

• Training and prediction in a form that fits in GPU memory or L1 cache

• Extending to FOBOS and similar methods is also straightforward

  • …so the authors write, with a proof sketch in a footnote

  • it seems each reader will need to check for themselves whether it really holds

22
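The "information-theoretic lower bound" mentioned above (Table 1's Opt. Bits/Val column) is just the empirical entropy of the quantized coefficients, (1/d) Σᵢ −log₂ p(βᵢ). A quick sketch on made-up data (the distribution below is hypothetical, chosen to mimic Figure 1's concentration near zero):

```python
from collections import Counter
from math import log2

def optimal_bits_per_value(coeffs) -> float:
    """Average code length of an optimal variable-length code for the
    empirical distribution: (1/d) * sum_i -log2 p(coeff_i)."""
    d = len(coeffs)
    freq = Counter(coeffs)
    return sum(-log2(freq[c] / d) for c in coeffs) / d

# Hypothetical quantized model: 90% of the weights are exactly zero.
coeffs = [0.0] * 900 + [0.25] * 60 + [-0.25] * 30 + [0.5] * 10
print(f"{optimal_bits_per_value(coeffs):.2f} bits/value")  # prints "0.60 bits/value"
```

This is why a coarse grid plus entropy coding can get below 2 bits per value: quantization makes values collide, and the resulting value distribution is highly skewed.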