This page reproduces the content of http://www.slideshare.net/HAL9801/20141204journal-club .


- Journal Club paper introduction

K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation

Aharon, M. et al., IEEE Trans. Signal Processing 54(11), 4311–4321, Nov. 2006

2014/12/04
shouno@uec.ac.jp

- Goals / what was done

• Understand Sparse Coding [Olshausen & Field 94] in the context of the K-means method

• Find the dictionary D and the coefficients x

(figure: a signal encoded with a few nonzero coefficients, e.g., 0.1, 0.3, 0.2)

• Make as many of the coefficients as possible exactly zero

- A neural-network interpretation of Sparse Coding

• When most elements are zero, the transmission energy is small

(figure: a network transmitting a few nonzero coefficients, e.g., 0.1, 0.3, 0.2)

• Make as many of the coefficients as possible exactly zero

- Sparse Coding [Olshausen & Field 94]

• A wavelet-like representation

• Feature extraction resembling V1 simple cells

(figure: an input image Y taken from natural images, and the learned elements of the dictionary D)

- Formulation
(The slide reproduces p. 4316 of the paper, IEEE Trans. Signal Processing, vol. 54, no. 11, Nov. 2006; cleaned-up excerpt:)

"... the representation MSE per example y_i is defined as e_i^2 = ‖y_i − C x_i‖_2^2, and the overall MSE is

E = Σ_{i=1}^N e_i^2 = ‖Y − C X‖_F^2.   (15)

The VQ training problem is to find a codebook C that minimizes the error E, subject to the limited structure of X, whose columns must be taken from the trivial basis:

min_{C,X} ‖Y − C X‖_F^2  subject to  x_i = e_j for some j.   (16)

The K-means algorithm is an iterative method used for designing the optimal codebook for VQ [39]. In each iteration there are two stages: one for sparse coding that essentially evaluates X, and one for updating the codebook. Fig. 1 gives a more detailed description of these steps. The sparse coding stage assumes a known codebook C and computes a feasible X that minimizes the value of (16). Similarly, the dictionary update stage fixes X as known and seeks an update of C so as to minimize (16). Clearly, at each iteration, either a reduction or no change in the MSE is ensured. Furthermore, at each such stage, the minimization step is optimal under the assumptions. As the MSE is bounded from below by zero, and the algorithm ensures a monotonic decrease of the MSE, convergence to at least a local minimum solution is guaranteed. Note that we have deliberately chosen not to discuss stopping rules for the above-described algorithm, since those vary a lot but are quite easy to handle [39].

B. K-SVD—Generalizing the K-Means

The sparse representation problem can be viewed as a generalization of the VQ objective (16), in which we allow each input signal to be represented by a linear combination of codewords, which we now call dictionary elements. Therefore the coefficients vector is now allowed more than one nonzero entry, and these can have arbitrary values. For this case, the minimization corresponding to (16) is that of searching the best possible dictionary for the sparse representation of the example set Y:

min_{D,X} ‖Y − D X‖_F^2  subject to  ∀i, ‖x_i‖_0 ≤ T_0.   (17)

A similar objective could alternatively be met by considering

min_{D,X} Σ_i ‖x_i‖_0  subject to  ‖Y − D X‖_F^2 ≤ ε   (18)

for a fixed value ε. In this paper, we mainly discuss the first problem (17), although the treatment is very similar. In our algorithm, we minimize the expression in (17) iteratively. First, we fix D and aim to find the best coefficient matrix X that can be found. As finding the truly optimal X is impossible, we use an approximation pursuit method. Any such algorithm can be used for the calculation of the coefficients, as long as it can supply a solution with a fixed and predetermined number of nonzero entries T_0. Once the sparse coding task is done, a second stage is performed to search for a better dictionary. This process updates one column at a time, fixing all columns in D except one, d_k, and finding a new column d_k and new values for its coefficients that best reduce the MSE. This is markedly different from all the K-means generalizations that were described in Section III. All those methods freeze X while finding a better D. Our approach is different, as we change the columns of D sequentially and allow changing the relevant coefficients. In a sense, this approach is a more direct generalization of the K-means algorithm because it updates each column separately, as done in K-means. One may argue that in K-means the nonzero entries in X are fixed during the improvement of D, but as we shall see next, this is true because in the K-means (and the gain-shape VQ), the column update problems are decoupled, whereas in the more general setting, this should not be the case.

The process of updating only one column of D at a time is a problem having a straightforward solution based on the singular value decomposition (SVD). Furthermore, allowing a change in the coefficient values while updating the dictionary columns accelerates convergence, since the subsequent column updates will be based on more relevant coefficients. The overall effect is very much in line with the leap from gradient descent to Gauss–Seidel methods in optimization. Here one might be tempted to suggest skipping the step of sparse coding and using only updates of columns in D, along with their coefficients, applied in a cyclic fashion, again and again. This, however, will not work well, as the support of the representations will never be changed, and such an algorithm will necessarily fall into a local minimum trap.

C. K-SVD—Detailed Description

We shall now discuss the K-SVD in detail. Recall that our objective function is

min_{D,X} ‖Y − D X‖_F^2  subject to  ∀i, ‖x_i‖_0 ≤ T_0.   (19)"

- Sparse Coding Formulation

min_{D,X} ‖Y − D X‖_F^2  subject to  ∀i, ‖x_i‖_0 ≤ T_0

This is an NP-hard problem, so it is genuinely difficult.

(figure: inputs Y = {y_i}_{i=1..N}, dictionary D with atoms d_k, coefficient matrix X)
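Since the constrained problem in the figure is NP-hard, pursuit algorithms are used to approximate it. Below is a minimal Orthogonal Matching Pursuit sketch in NumPy — an illustration only, not the paper's implementation; the name `n_nonzero` stands in for the sparsity limit T_0:

```python
import numpy as np

def omp(D, y, n_nonzero):
    """Greedy approximation of  min_x ||y - D x||_2  s.t.  ||x||_0 <= n_nonzero.
    Columns of D are assumed to have unit l2-norm."""
    residual = y.copy()
    support = []
    x = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        # pick the atom most correlated with the current residual
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k not in support:
            support.append(k)
        # re-fit all selected coefficients jointly by least squares
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        x[:] = 0.0
        x[support] = coef
        residual = y - D @ x
    return x

# toy check: y is an exact 2-sparse combination of atoms 0 and 3
rng = np.random.default_rng(0)
D = rng.normal(size=(100, 50))
D /= np.linalg.norm(D, axis=0)
y = 0.7 * D[:, 0] - 1.2 * D[:, 3]
x_hat = omp(D, y, n_nonzero=2)
```

With a well-conditioned dictionary the greedy selection finds the true support, and the final least-squares refit drives the residual to numerical zero.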

- MOD: related to [Olshausen & Field 94]

1. Fix the dictionary D (sparse coding phase): estimate the coefficients X with OMP/FOCUSS
2. Fix the coefficients X and update the dictionary (dictionary update)

(The slide reproduces p. 4314 of the paper, IEEE Trans. Signal Processing, vol. 54, no. 11, Nov. 2006; cleaned-up excerpt:)

"[40], [41]. This method follows more closely the K-means outline, with a sparse coding stage that uses either OMP or FOCUSS followed by an update of the dictionary. The main contribution of the MOD method is its simple way of updating the dictionary. Assuming that the sparse coding for each example is known, we define the errors e_i = y_i − D x_i. The overall representation mean square error is given by

‖E‖_F^2 = ‖[e_1, e_2, …, e_N]‖_F^2 = ‖Y − D X‖_F^2.   (10)

Here we have concatenated all the examples y_i as columns of the matrix Y and similarly gathered the representations coefficient vectors x_i to build the matrix X. The notation ‖·‖_F stands for the Frobenius norm. Assuming that X is fixed, we can seek an update to D such that the above error is minimized. Taking the derivative of (10) with respect to D, we obtain the relation (Y − DX)Xᵀ = 0, leading to

D = Y Xᵀ (X Xᵀ)⁻¹.   (11)

MOD is closely related to the work by Olshausen and Field, with improvements both in the sparse coding and the dictionary update stages. Whereas the work in [23], [24], and [22] applies a steepest descent to evaluate the coefficients, those are evaluated much more efficiently with either OMP or FOCUSS. Similarly, in updating the dictionary, the update relation given in (11) is the best that can be achieved for fixed X. The iterative steepest descent update in (9) is far slower. Interestingly, in both stages of the algorithm, the difference is in deploying a second-order (Newtonian) update instead of a first-order one. Looking closely at the update relation in (9), it could be written as

D^(n+1) = D^(n) + η E Xᵀ = D^(n) (I − η X Xᵀ) + η Y Xᵀ.

Using infinitely many iterations of this sort, and using small enough η, this leads to a steady-state outcome that is exactly the MOD update matrix (11). Thus, while the MOD method assumes known coefficients at each iteration and derives the best possible dictionary, the ML method by Olshausen and Field only gets closer to this best current solution and then turns to calculate the coefficients. Note, however, that in both methods a normalization of the dictionary columns is required and done.

D. Maximum A-Posteriori Probability Approach

The same researchers that conceived the MOD method also suggested a MAP probability setting for the training of dictionaries, attempting to merge the efficiency of the MOD with a natural way to take into account preferences in the recovered dictionary. In [37], [41], [42], [44], a probabilistic point of view is adopted, very similar to the ML methods discussed above. However, rather than working with the likelihood function, the posterior is used. Using Bayes' rule, we can use the likelihood expression as before and add a prior on the dictionary as a new ingredient. These works considered several priors and proposed corresponding formulas for the dictionary. The efficiency of the MOD in these methods is manifested in the efficient sparse coding, which is carried out with FOCUSS. The proposed algorithms in this family deliberately avoid a direct minimization with respect to D as in MOD, due to the prohibitive matrix inversion required. Instead, iterative gradient descent is used. When no prior is chosen, the update formula is the very one used by Olshausen and Field, as in (9). A prior that constrains D to have a unit Frobenius norm leads to the update formula (13). As can be seen, the first two terms are the same as in (9). The last term compensates for deviations from the constraint. This case allows different columns in D to have different norm values. As a consequence, columns with small norm values tend to be underused, as the coefficients they need are larger and as such more penalized. This led to the second prior choice, constraining the columns of D to have a unit l2-norm. The new update equation formed is given by (14), where d_k is the kth column in the matrix D. Compared to the MOD, this line of work provides slower training algorithms. Simulations reported in [37], [41], [42], [44] on synthetic and real image data seem to provide encouraging results.

E. Unions of Orthonormal Bases

The very recent work reported in [45] considers a dictionary composed as a union of orthonormal bases

D = [D_1, D_2, …, D_L]   (12)

where D_j, j = 1, …, L, are orthonormal matrices. Such a dictionary structure is quite restrictive, but its updating may potentially be made more efficient. The coefficients of the sparse representations X can be decomposed to L pieces, each referring to a different orthonormal basis. One of the major advantages of the union of orthonormal bases is the relative simplicity of the pursuit algorithm needed for the sparse coding stage. The coefficients are found using the block coordinate relaxation algorithm [46]. This is an appealing way to solve the sparse coding stage as a sequence of simple shrinkage steps, such that at each stage the piece of X referring to one basis is computed while keeping all the other pieces fixed."

- K-means method: Clustering Algorithm

(Fig. 1 of the paper: the K-means algorithm — each iteration alternates two stages)

• Sparse Coding: assume a known codebook C and compute a feasible X that minimizes the value of (16)
• Codebook Update: fix X as known and seek an update of C so as to minimize (16)

At each iteration either a reduction or no change in the MSE is ensured, so convergence to at least a local minimum is guaranteed.

- K-means method: Formulation

min_{C,X} ‖Y − C X‖_F^2  subject to  x_i = e_j for some j   (16)

(the columns of X must be taken from the trivial basis)

(figure: input Y = {y_i}_{i=1..N}, codebook C with codewords d_k, coefficients X)

Doesn't this look like sparse coding?

- K-SVD method: strategy

1. Sparse coding stage
   Fix the dictionary D and choose the representation vectors
   – OMP: an approximate solver for the l0-norm minimization

2. Codebook update stage
   Update the dictionary
   – The dictionary D and the coefficients X are updated together
   – The update is performed via an SVD decomposition
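The codebook-update sweep of this strategy can be sketched in NumPy. This is a simplified illustration, not the authors' code: the coefficient matrix X is assumed to have been produced already by a pursuit algorithm such as OMP, and each atom update follows the restricted-SVD rule detailed on the next slides.

```python
import numpy as np

def ksvd_dictionary_update(Y, D, X):
    """One codebook-update sweep: for each atom d_k, form the restricted
    error matrix E_k^R and replace (d_k, x_R^k) by its best rank-1 fit."""
    D, X = D.copy(), X.copy()
    for k in range(D.shape[1]):
        omega = np.nonzero(X[k, :])[0]                 # examples that use atom k
        if omega.size == 0:
            continue                                   # unused atom: leave it as is
        E_k = Y - D @ X + np.outer(D[:, k], X[k, :])   # error with atom k removed
        U, s, Vt = np.linalg.svd(E_k[:, omega], full_matrices=False)
        D[:, k] = U[:, 0]                              # new atom (unit norm)
        X[k, omega] = s[0] * Vt[0, :]                  # new coefficients, same support
    return D, X

# toy check: one sweep never increases the representation error
rng = np.random.default_rng(1)
Y = rng.normal(size=(8, 40))
D0 = rng.normal(size=(8, 16))
D0 /= np.linalg.norm(D0, axis=0)
X0 = np.zeros((16, 40))
for i in range(40):                                    # crude 2-sparse coding
    idx = np.argsort(-np.abs(D0.T @ Y[:, i]))[:2]
    coef, *_ = np.linalg.lstsq(D0[:, idx], Y[:, i], rcond=None)
    X0[idx, i] = coef
D1, X1 = ksvd_dictionary_update(Y, D0, X0)
err0 = np.linalg.norm(Y - D0 @ X0)
err1 = np.linalg.norm(Y - D1 @ X1)
```

Each per-atom step is an exact minimization over its rank-1 term, so the sweep is monotone: the Frobenius error never goes up.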

(The slide reproduces p. 4317 — AHARON et al.: K-SVD: An Algorithm for Designing Overcomplete Dictionaries; cleaned-up excerpt:)

"Let us first consider the sparse coding stage, where we assume that D is fixed, and consider the above optimization problem as a search for sparse representations with coefficients summarized in the matrix X. The penalty term can be rewritten as ‖Y − DX‖_F^2 = Σ_{i=1}^N ‖y_i − D x_i‖_2^2. Therefore the problem posed in (19) can be decoupled to N distinct problems of the form

min_{x_i} ‖y_i − D x_i‖_2^2  subject to  ‖x_i‖_0 ≤ T_0,  for i = 1, 2, …, N.   (20)

This problem is adequately addressed by the pursuit algorithms discussed in Section II, and we have seen that if T_0 is small enough, their solution is a good approximation to the ideal one that is numerically infeasible to compute.

We now turn to the second, and slightly more involved, process of updating the dictionary together with the nonzero coefficients. Assume that both D and X are fixed and we put in question only one column in the dictionary, d_k, and the coefficients that correspond to it, the kth row in X, denoted as x_T^k (this is not the vector x_k which is the kth column in X). Returning to the objective function (19), the penalty term can be rewritten as

‖Y − D X‖_F^2 = ‖Y − Σ_{j=1}^K d_j x_T^j‖_F^2 = ‖(Y − Σ_{j≠k} d_j x_T^j) − d_k x_T^k‖_F^2 = ‖E_k − d_k x_T^k‖_F^2.   (21)

We have decomposed the multiplication DX to the sum of K rank-1 matrices. Among those, K − 1 terms are assumed fixed, and one—the kth—remains in question. The matrix E_k stands for the error for all the examples when the kth atom is removed. Note the resemblance between this error and the one defined in [45]. Here, it would be tempting to suggest the use of the SVD to find alternative d_k and x_T^k. The SVD finds the closest rank-1 matrix (in Frobenius norm) that approximates E_k, and this will effectively minimize the error as defined in (21). However, such a step will be a mistake, because the new vector x_T^k is very likely to be filled, since in such an update of d_k we do not enforce the sparsity constraint.

A remedy to the above problem, however, is simple and also quite intuitive. Define ω_k as the group of indices pointing to examples {y_i} that use the atom d_k, i.e., those where x_T^k(i) is nonzero:

ω_k = { i | 1 ≤ i ≤ N, x_T^k(i) ≠ 0 }.   (22)

Define Ω_k as a matrix of size N × |ω_k|, with ones on the (ω_k(i), i)th entries and zeros elsewhere. When multiplying x_R^k = x_T^k Ω_k, this shrinks the row vector x_T^k by discarding the zero entries, resulting with the row vector x_R^k of length |ω_k|. Similarly, the multiplication Y Ω_k creates a matrix that includes a subset of the examples that are currently using the d_k atom. The same effect happens with E_k^R = E_k Ω_k, implying a selection of error columns that correspond to examples that use the atom d_k. With this notation, we may now return to (21) and suggest minimization with respect to both d_k and x_T^k, but this time force the solution of x_T^k to have the same support as the original. This is equivalent to the minimization of

‖E_k Ω_k − d_k x_T^k Ω_k‖_F^2 = ‖E_k^R − d_k x_R^k‖_F^2   (23)

and this time it can be done directly via SVD. Taking the restricted matrix E_k^R, SVD decomposes it to E_k^R = U Δ Vᵀ. We define the solution for d_k as the first column of U, and the coefficient vector x_R^k as the first column of V multiplied by Δ(1,1). Note that, in this solution, we necessarily have that i) the columns of D remain normalized and ii) the support of all representations either stays the same or gets smaller by possible nulling of terms.

We shall call this algorithm "K-SVD" to parallel the name K-means. While K-means applies K computations of means to update the codebook, K-SVD obtains the updated dictionary by K SVD computations, each determining one column. A full description of the algorithm is given in Fig. 2 (Fig. 2. The K-SVD algorithm). In the K-SVD algorithm, we sweep through the columns and use always the most updated coefficients as they emerge from preceding SVD steps. Parallel versions of this algorithm can also be considered, where all updates of the previous dictionary are done based on the same error matrices. Experiments show that while this version also converges, it yields an inferior solution and typically requires more than four times the number of iterations."

- Understanding the K-SVD method

(figure: inputs Y = {y_i}_{i=1..N}, dictionary D, coefficients X; the atom d_k pairs with the kth coefficient row x_T^k)

How much impact does pulling out the d_k component have?

‖Y − D X‖_F^2 = ‖Y − Σ_{j=1}^K d_j x_T^j‖_F^2 = ‖(Y − Σ_{j≠k} d_j x_T^j) − d_k x_T^k‖_F^2 = ‖E_k − d_k x_T^k‖_F^2

Use the singular value decomposition (SVD) to find a rank-1 approximation of E_k, and update d_k and x_T^k.
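The identity above can be checked numerically: E_k below is the error matrix with the kth rank-1 term removed (a small NumPy check with arbitrary random sizes; names are mine):

```python
import numpy as np

# Numerical check of the identity on this slide:
# ||Y - D X||_F^2 = ||E_k - d_k x_T^k||_F^2  with  E_k = Y - sum_{j != k} d_j x_T^j
rng = np.random.default_rng(2)
Y = rng.normal(size=(6, 10))
D = rng.normal(size=(6, 4))
X = rng.normal(size=(4, 10))
k = 2
E_k = Y - sum(np.outer(D[:, j], X[j, :]) for j in range(D.shape[1]) if j != k)
lhs = np.linalg.norm(Y - D @ X, 'fro') ** 2
rhs = np.linalg.norm(E_k - np.outer(D[:, k], X[k, :]), 'fro') ** 2
```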

- Singular Value Decomposition (SVD) of a Matrix

p Any m × n matrix A can be decomposed as

A = U Σ Vᵀ

– where
  • U: an m × m orthogonal matrix
  • V: an n × n orthogonal matrix
  • Σ: an m × n diagonal matrix, Σ = diag(σ1, σ2, …, σn)
    (with σ1 ≥ σ2 ≥ … ≥ σr > σr+1 = … = σn = 0)

p Using these, A can be expressed as a sum of rank-1 terms:

A = σ1 u1 v1ᵀ + σ2 u2 v2ᵀ + … + σr ur vrᵀ
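A quick NumPy illustration of both statements — the factorization A = U Σ Vᵀ and the expansion of A into rank-1 terms σ_i u_i v_iᵀ:

```python
import numpy as np

# SVD of an arbitrary m x n matrix and its expansion into rank-1 terms
rng = np.random.default_rng(3)
A = rng.normal(size=(5, 3))
U, s, Vt = np.linalg.svd(A)               # U: 5x5 orthogonal, Vt: 3x3 orthogonal
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)                # singular values on the diagonal, descending
A_rebuilt = U @ Sigma @ Vt
A_rank1_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(3))
```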

- Preprocessing for the SVD

• Extract only the nonzero entries of the coefficient row:

x_R^k = x_T^k Ω_k,   E_k^R = E_k Ω_k

– sizes: x_T^k is 1 × N, Ω_k is N × ω_k, x_R^k is 1 × ω_k, E_k is n × N, E_k^R is n × ω_k

• The objective restricted to the support is then

‖E_k − d_k x_T^k‖_F^2  →  ‖E_k^R − d_k x_R^k‖_F^2
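The selection matrix Ω_k can be built explicitly; the sketch below (variable names mine) confirms that multiplying by Ω_k just picks out the support entries and columns:

```python
import numpy as np

# Building the selection matrix Omega_k explicitly:
# multiplying by Omega_k keeps exactly the support entries / columns.
rng = np.random.default_rng(4)
N, n = 10, 6
x_T = np.zeros(N)                             # k-th coefficient row, 1 x N
x_T[[1, 4, 7]] = rng.normal(size=3)           # nonzero only on the support
E_k = rng.normal(size=(n, N))                 # error matrix without atom k
support = np.nonzero(x_T)[0]                  # omega_k = {1, 4, 7}
Omega = np.zeros((N, support.size))           # N x omega_k selection matrix
Omega[support, np.arange(support.size)] = 1.0
x_R = x_T @ Omega                             # 1 x omega_k, zeros discarded
E_kR = E_k @ Omega                            # n x omega_k, matching columns
```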

- How the SVD is used in the K-SVD method

Minimize ‖E_k^R − d_k x_R^k‖_F^2 by taking the SVD of the restricted error matrix:

E_k^R = σ1 u1 v1ᵀ + σ2 u2 v2ᵀ + … + σn un vnᵀ

and keeping only the leading rank-1 term:

d_k ← u1,   x_R^k ← σ1 v1ᵀ
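A small check of this update rule: by the Eckart–Young theorem the pair (u1, σ1 v1ᵀ) is the best rank-1 fit to E_k^R, so any other unit-norm atom paired with its own least-squares coefficients does no better (a sketch; names are mine):

```python
import numpy as np

# Eckart-Young check: (u1, sigma1 * v1^T) is the best rank-1 fit to E_k^R.
rng = np.random.default_rng(5)
E_kR = rng.normal(size=(6, 4))
U, s, Vt = np.linalg.svd(E_kR, full_matrices=False)
d_k = U[:, 0]                      # updated atom, unit norm by construction
x_R = s[0] * Vt[0, :]              # updated coefficients on the support
best = np.linalg.norm(E_kR - np.outer(d_k, x_R), 'fro')

d_alt = rng.normal(size=6)
d_alt /= np.linalg.norm(d_alt)     # arbitrary competing unit-norm atom
x_alt = d_alt @ E_kR               # its optimal (least-squares) coefficients
alt = np.linalg.norm(E_kR - np.outer(d_alt, x_alt), 'fro')
```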

- Computational Experiments

- Reconstruction evaluation on synthetic data

• Reconstruction experiment
  – Dictionary size: 20 × 50
  – Input: superpositions of 3 dictionary atoms plus additive noise; 1500 samples of dimension 20
  – Horizontal axis: additive noise level
  – Vertical axis: number of successful recoveries over 50 random trials

(excerpt from the paper, p. 4319, cleaned up:)

"... and initializing in the same way. We executed the MOD algorithm for a total number of 80 iterations. We also executed the MAP-based algorithm of Kreutz-Delgado et al. [37].² This algorithm was executed as is, therefore using FOCUSS as its decomposition method. Here, again, a maximum of 80 iterations were allowed.

D. Results

The computed dictionary was compared against the known generating dictionary. This comparison was done by sweeping through the columns of the generating dictionary and finding the closest column (in l2 distance) in the computed dictionary, measuring the distance via

1 − |d_iᵀ d̃_i|   (25)

where d_i is a generating dictionary atom and d̃_i is its corresponding element in the recovered dictionary. A distance less than 0.01 was considered a success. All trials were repeated 50 times, and the number of successes in each trial was computed. Fig. 3 displays the results for the three algorithms for noise levels of 10, 20, and 30 dB and for the noiseless case. We should note that for different dictionary size (e.g., 20 × 30) and with more executed iterations, the MAP-based algorithm improves and gets closer to the K-SVD detection rates.

Fig. 3. Synthetic results: for each of the tested algorithms and for each noise level, 50 trials were conducted and their results sorted. The graph labels represent the mean number of detected atoms (out of 50) over the ordered tests in groups of ten experiments.

VI. APPLICATIONS TO IMAGE PROCESSING—PRELIMINARY RESULTS

We carried out several experiments on natural image data, trying to show the practicality of the proposed algorithm and the general sparse coding theme. We should emphasize that our tests here come only to prove the concept of using such dictionaries with sparse representations. Further work is required to fully deploy the proposed techniques in large-scale image-processing applications.

1) Training Data: The training data were constructed as a set of 11 000 examples of block patches of size 8 × 8 pixels, taken from a database of face images (in various locations). A random collection of 500 such blocks, sorted by their variance, is presented in Fig. 4.

2) Removal of the DC: Working with real image data, we preferred that all dictionary elements except one have a zero mean. The same measure was practiced in previous work [23]. For this purpose, the first dictionary element, denoted as the DC, was set to include a constant value in all its entries and was not changed afterwards. The DC takes part in all representations, and as a result, all other dictionary elements remain with zero mean during all iterations.

3) Running the K-SVD: We applied the K-SVD, training a dictionary of size 64 × 441. The choice came from our attempt to compare the outcome to the overcomplete Haar dictionary of the same size (see the following section). The coefficients were computed using OMP with a fixed number of coef- […]

Fig. 4. A collection of 500 random blocks that were used for training, sorted by their variance.

Fig. 5. (a) The learned dictionary. Its elements are sorted in an ascending order of their variance and stretched to maximal range for display purposes. (b) The overcomplete separable Haar dictionary and (c) the overcomplete DCT dictionary are used for comparison.

Fig. 6. The root mean square error for 594 new blocks with missing pixels using the learned dictionary, overcomplete Haar dictionary, and overcomplete DCT dictionary.

² The authors of [37] have generously shared their software with us."

- K-SVD dictionary from natural images

• Dictionary learning from natural images
– Dictionary size: (8x8) x 441
– Input: 11,000 8x8 patches from natural images

AHARON et al.: K-SVD: AN ALGORITHM FOR DESIGNING OVERCOMPLETE DICTIONARIES 4319

– 80 iterations
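The training recipe summarized above (sparse-code every patch, then update the dictionary atoms one at a time) can be sketched as a single K-SVD iteration. This is a minimal numpy illustration, not the authors' implementation; `omp` here is a simplified orthogonal matching pursuit with a fixed number of nonzero coefficients.

```python
import numpy as np

def omp(D, y, t0):
    """Greedy pursuit: pick t0 atoms, least-squares re-fit after each pick."""
    idx, r = [], y.copy()
    coef = np.zeros(0)
    for _ in range(t0):
        idx.append(int(np.argmax(np.abs(D.T @ r))))  # most correlated atom
        sub = D[:, idx]
        coef, *_ = np.linalg.lstsq(sub, y, rcond=None)
        r = y - sub @ coef                           # orthogonalized residual
    x = np.zeros(D.shape[1])
    x[idx] = coef
    return x

def ksvd_step(Y, D, t0):
    """One K-SVD iteration: sparse-code all examples, then replace each
    used atom (and its coefficients) by the rank-1 SVD of its residual."""
    X = np.column_stack([omp(D, y, t0) for y in Y.T])
    for k in range(D.shape[1]):
        users = np.nonzero(X[k])[0]          # examples that use atom k
        if users.size == 0:
            continue                          # unused atom: leave unchanged
        D[:, k] = 0
        E = Y[:, users] - D @ X[:, users]    # residual with atom k removed
        U, s, Vt = np.linalg.svd(E, full_matrices=False)
        D[:, k] = U[:, 0]                    # new atom: first left singular vector
        X[k, users] = s[0] * Vt[0]           # matching coefficient row
    return D, X
```

In the paper this step is repeated (here, 80 times), with OMP as the pursuit stage and the DC atom held fixed; this sketch omits those details.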

Example input images:
Fig. 4. A collection of 500 random blocks that were used for training, sorted by their variance.

Fig. 3. Synthetic results: for each of the tested algorithms and for each noise level, 50 trials were conducted and their results sorted. The graph labels represent the mean number of detected atoms (out of 50) over the ordered tests in groups of ten experiments.

Fig. 5. (a) The learned dictionary. Its elements are sorted in an ascending order of their variance and stretched to maximal range for display purposes. (b) The overcomplete separable Haar dictionary and (c) the overcomplete DCT dictionary are used for comparison.

and initializing in the same way. We executed the MOD algorithm for a total number of 80 iterations. We also executed the MAP-based algorithm of Kreutz-Delgado et al. [37].2 This algorithm was executed as is, therefore using FOCUSS as its decomposition method. Here, again, a maximum of 80 iterations were allowed.
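The success criterion used in these synthetic comparisons (a generating atom counts as detected when 1 - |d_i^T d~_j| falls below 0.01 for some learned atom, Eq. (25)) can be sketched as follows; a minimal numpy sketch assuming both dictionaries have unit-norm columns.

```python
import numpy as np

def detected_atoms(D_true, D_hat, thresh=0.01):
    """Count generating atoms matched by some learned atom within
    distance 1 - |d_i^T dhat_j| < thresh (sign-invariant by the |.|)."""
    hits = 0
    for i in range(D_true.shape[1]):
        dist = 1.0 - np.abs(D_true[:, i] @ D_hat)  # distance to every learned atom
        if dist.min() < thresh:
            hits += 1
    return hits
```

Because the test takes the closest learned atom and an absolute inner product, it is insensitive to the column permutation and sign ambiguity inherent in dictionary learning.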


(This slide repeats p. 4319 of the paper: Fig. 4, Fig. 3, and Fig. 5 are shown again, with annotations pointing at the panels: "bases from natural images", "dictionary learned by K-SVD", and "Overcomplete DCT". The underlying page text duplicates the previous slide.)

- 4320

Application: Image inpainting

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 54, NO. 11, NOVEMBER 2006

Fig. 7. The corrupted image (left) with the missing pixels marked as points and the reconstructed results by the learned dictionary, the overcomplete Haar dictionary, and the overcomplete DCT dictionary, respectively. The different rows are for 50% and 70% of missing pixels.
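The reconstructions in Fig. 7 are produced block by block from only the surviving pixels: restrict the dictionary rows to the known pixel positions, renormalize those restricted atoms, run a greedy pursuit there, and then synthesize the full block to fill the holes. The following is a rough numpy illustration of that idea under simplified assumptions (`inpaint_block` and its error tolerance are this sketch's names, not the paper's):

```python
import numpy as np

def inpaint_block(y, mask, D, err_tol):
    """Fill the missing pixels of one block.

    y: block with missing entries set to 0; mask: True where y is known;
    D: dictionary (n x K); err_tol: residual norm target for the pursuit."""
    Dm = D[mask]                              # keep only rows of known pixels
    norms = np.linalg.norm(Dm, axis=0)
    Dm = Dm / norms                           # unit norm on the known indexes
    idx, r = [], y[mask].copy()
    coef = np.zeros(0)
    while np.linalg.norm(r) > err_tol and len(idx) < Dm.shape[0]:
        idx.append(int(np.argmax(np.abs(Dm.T @ r))))
        sub = Dm[:, idx]
        coef, *_ = np.linalg.lstsq(sub, y[mask], rcond=None)
        r = y[mask] - sub @ coef
    x = np.zeros(D.shape[1])
    x[idx] = coef / norms[idx]                # undo the row renormalization
    y_hat = D @ x                             # full block, holes synthesized
    out = y.copy()
    out[~mask] = y_hat[~mask]                 # keep known pixels untouched
    return out
```

The renormalization step mirrors the paper's remark that dictionary elements were normalized so the noncorrupted indexes have a unit norm; the pursuit itself is a simplified error-bounded OMP.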

The coefficients were computed using OMP with a fixed number of coefficients, where the maximal number of coefficients is ten. Note that better performance can be obtained by switching to FOCUSS. We concentrated on OMP because of its simplicity and fast execution. The trained dictionary is presented in Fig. 5(a).

4) Comparison Dictionaries: The trained dictionary was compared with the overcomplete Haar dictionary, which includes separable basis functions, having steps of various sizes and in all locations (total of 441 elements). In addition, we build an overcomplete separable version of the DCT dictionary by sampling the cosine wave in different frequencies to result in a total of 441 elements. The overcomplete Haar dictionary and the overcomplete DCT dictionary are presented in Fig. 5(b) and (c), respectively.

5) Applications: We used the K-SVD results, denoted here as the learned dictionary, for two different applications on images. All tests were performed on one face image which was not included in the training set. The first application is filling in missing pixels: we deleted random pixels in the image and filled their values using the various dictionaries' decomposition. We then tested the compression potential of the learned dictionary decomposition and derived a rate-distortion graph. We hereafter describe those experiments in more detail.

A. Filling In Missing Pixels

We chose one random full face image, which consists of 594 nonoverlapping blocks (none of which were used for training). For each block y, the following procedure was conducted for r in the range {0.2, ..., 0.9}.
1) A fraction r of the pixels in each block, in random locations, were deleted (set to zero).
2) The coefficients of the corrupted block under the learned dictionary, the overcomplete Haar dictionary, and the overcomplete DCT dictionary were found using OMP with an error bound of 0.02 ||1||, where 1 is a vector of all ones3 (allowing an error of 5 gray-values in 8-bit images). All projections in the OMP algorithm included only the noncorrupted pixels, and for this purpose, the dictionary elements were normalized so that the noncorrupted indexes in each dictionary element have a unit norm. The resulting coefficient vector of the block is denoted x.
3) The reconstructed block was chosen as Dx.
4) The reconstruction error was set to sqrt(||y - Dx||^2 / 64) (where 64 is the number of pixels in each block).

3The input image is scaled to the dynamic range [0,1].

The mean reconstruction errors (for all blocks and all corruption rates) were computed and are displayed in Fig. 6. Two corrupted images and their reconstructions are shown in Fig. 7. As can be seen, higher quality recovery is obtained using the learned dictionary.

B. Compression

A compression comparison was conducted between the overcomplete learned dictionary, the overcomplete Haar dictionary, and the overcomplete DCT dictionary (as explained before), all of size 64 x 441. In addition, we compared to the regular (unitary) DCT dictionary (used by the JPEG algorithm). The resulting rate-distortion graph is presented in Fig. 8. In this compression test, the face image was partitioned (again) into 594 disjoint 8 x 8 blocks. All blocks were coded in various rates (bits-per-pixel values), and the peak SNR (PSNR) was measured. Let y be the original image and y~ be the coded image, combined by all the coded blocks. We denote e^2 as the mean squared error between y and y~, and

    PSNR = 10 log10(1 / e^2)    (26)

In each test we set an error goal and fixed the number of bits per coefficient. For each such pair of parameters, all blocks were coded in order to achieve the desired error goal, and the coefficients were quantized to the desired number of bits (uniform quantization, using upper and lower bounds for each coefficient

in each dictionary based on the training set coefficients). For the ... - Summary

• Proposed K-SVD, a dictionary-learning method for overcomplete bases
• K-SVD is a generalization of K-means (a somewhat forced analogy)
• Bases learned with K-SVD are useful for solving problems such as inpainting