This page reproduces the content of http://www.slideshare.net/miyamamoto/ss-42246142.


- Recent Trends in Information Retrieval Evaluation Metrics and a New Proposal

Mitsuho Yamamoto, Denso IT Laboratory

- Contents

(1) Trends in retrieval evaluation metrics in recent IR research

Only the metrics that are in common use or currently attracting attention are covered.

(2) Introduction of session ERR, an evaluation metric for spoken-dialogue search.

・Source code for the metrics introduced today is available at:

https://github.com/DensoITLab/evaluation_measures

新技術研究会 (New Technology Study Group)

- Background: what is a "good search result"?

! What does a "good search result" mean in information retrieval?

(figure: the user asks the search system "What about Sapporo Ichiban?" and it answers "instant ramen")

! Is a document that is merely related to the query a good search result?

- Background: what is a "good search result"?

! What does a "good search result" mean in information retrieval?

(figure: the user wants to learn about the history of ramen; behind the query "Who is Momofuku Ando?" sits that search intent, and the system answers "instant ramen")

! [Answer] What matters is how well the information (the documents) matches the search intent.

→ This is the definition of "relevance".

- Background: search intent and user models

! How do we return highly "relevant" information?

User-side search intents (e.g., wanting to learn about the history of ramen):

・want exactly one definitive answer
・want broad, exhaustive coverage
・want no wrong answers mixed in
・want the answer with as few interactions as possible

Search-system-side strengths (user models):

・good at finding the single definitive answer
・good at presenting correct answers exhaustively
・good at excluding content unsuitable for children

! Whether the system can return information that matches the search intent under a given user model

→ This is what IR research studies.

- Background: is a single evaluation method enough?

(figure: three systems with different strengths — finding the single definitive answer, presenting correct answers exhaustively, and excluding content unsuitable for children — each produce result lists with graded relevance scores, evaluated here with, e.g., average precision@4)

! We need evaluation metrics that reflect the user model each search system targets.

→ As search systems evolve, evaluation methodology improves alongside them.

- IR evaluation metrics covered today

! Mean Reciprocal Rank (MRR / RR)

! E. M. Voorhees (1999). "The TREC-8 Question Answering Track Report". In Proceedings of the 8th Text REtrieval Conference (TREC-8), pp. 77–82.

! Average Precision (AP)

! ??

! nDCG

! Kalervo Järvelin, Jaana Kekäläinen: Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems 20(4), 422–446 (2002).

! Rank-Biased Precision (RBP)

! Alistair Moffat (Univ. of Melbourne), Justin Zobel (RMIT Univ.): Rank-biased precision for measurement of retrieval effectiveness. ACM Transactions on Information Systems, 2009.

! Expected Reciprocal Rank (ERR)

! Olivier Chapelle, Donald Metzler, Ya Zhang, and Pierre Grinspan. 2009. Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09).

! Session DCG

! K. Järvelin, S. L. Price, L. M. L. Delcambre, and M. L. Nielsen. Discounted cumulated gain based evaluation of multiple-query IR sessions. In ECIR, pages 4–15, 2008.

! Session ERR

! To appear in a paper currently in preparation.

- Metrics overview and their uses

                          Binary relevance           Graded relevance
                          (correct / incorrect)      (1, 2, 3, 4, 5)

One correct answer        Success,                   Weighted reciprocal rank,
                          Reciprocal Rank (RR)       Expected Reciprocal Rank (ERR)

Many correct answers      Recall / Precision,        Normalized Discounted Cumulative
                          11-point average           Gain (nDCG),
                          precision,                 Session nDCG / Session ERR,
                          Average Precision (AP),    risk-sensitive rank
                          Precision at rank r

- Reciprocal Rank (RR)

! Use case

! When finding a single piece of target information is enough

! If the first correct answer appears at rank r, the reciprocal rank is

RR = 1/r

(example: task 1 has its first correct answer at rank 2, so RR = 1/2; task 2 has it at rank 1, so RR = 1)

! The system is evaluated with the Mean Reciprocal Rank (MRR), the average over all K tasks:

MRR = (1/K) Σ_{i=1}^{K} 1/r_i

(example: MRR = (1/2 + 1)/2 = 3/4)

! Very high variance

! Requires many tasks
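As a concrete illustration (a minimal sketch, not the code from the DensoITLab repository linked earlier), MRR can be computed from ranked lists and per-task relevance sets:

```python
def reciprocal_rank(relevant, ranking):
    """RR = 1/r for the highest-ranked relevant item; 0 if none is retrieved."""
    for r, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / r
    return 0.0

def mean_reciprocal_rank(tasks):
    """tasks: list of (relevant_set, ranked_list) pairs, one per task."""
    return sum(reciprocal_rank(rel, rank) for rel, rank in tasks) / len(tasks)

# The two tasks from the slide: first answer at rank 2, then at rank 1.
tasks = [({"d2"}, ["d1", "d2", "d3"]), ({"d1"}, ["d1", "d2", "d3"])]
print(mean_reciprocal_rank(tasks))  # 0.75
```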

- Average Precision (AP)

! Effective for evaluations that put comparatively more weight on recall

! Let P(r) be the precision over the top r results. Then

AP = (1/R) Σ_{r=1}^{L} I(r) P(r)

! where I(r) is 1 if the document at rank r is correct (0 otherwise), R is the total number of correct answers, and L is the number of results returned by the system.

(example task: correct answers at ranks 2, 5, 7, and 9 give precisions 1/2, 2/5, 3/7, and 4/9; if the total number of correct answers is 10, then AP = (1/2 + 2/5 + 3/7 + 4/9)/10)

! The system is evaluated with MAP, the average over all tasks (widely used)

! e.g., in competitions such as TREC
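The slide's example translates directly to code (again a sketch; document IDs are made up for illustration):

```python
def average_precision(relevant, ranking, num_relevant):
    """AP = (1/R) * sum of precision@r over the ranks r holding a correct answer."""
    hits, total = 0, 0.0
    for r, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / r  # precision at rank r
    return total / num_relevant

# Slide example: correct answers at ranks 2, 5, 7, 9; R = 10 correct answers in total.
ranking = ["n1", "c1", "n2", "n3", "c2", "n4", "c3", "n5", "c4"]
relevant = {"c1", "c2", "c3", "c4"}
print(average_precision(relevant, ranking, num_relevant=10))  # ≈ 0.177
```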

- Problems with MAP

! Known to be somewhat unstable (reportedly)

(figure: evaluation results when the pooling depth is set to 10 versus 100; from Alistair Moffat and Justin Zobel, ACM Transactions on Information Systems, 2009)

- Normalized Discounted Cumulative Gain (nDCG)

Kalervo Järvelin, Jaana Kekäläinen: Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems 20(4), 422–446 (2002)

! Extremely popular

! Cumulative Gain (CG)

! The cumulative gain over the top L results (see the figure on the right):

CG(L) = Σ_{r=1}^{L} g(r)

! Discounted CG

! A correct answer at rank 1 is worth more than one at rank 2:

DCG(L) = Σ_{r=1}^{L} g(r) / log_b(r + 1)

(reference: "For Building Better Retrieval Systems: Trends in Information Retrieval Evaluation based on Graded Relevance", Tetsuya Sakai (Toshiba Corp.))

- Normalization of nDCG

! DCG values vary widely from task to task

! easy vs. difficult tasks

! Normalization = nDCG

! Normalize so that the DCG of the ideal ranking equals 1:

nDCG(L) = DCG(L) / DCG_ideal(L)

(figure: the gains of an actual result list compared against the ideal, relevance-sorted result list)
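A minimal sketch of the two formulas (for simplicity, the ideal ranking here is taken to be the same gain list sorted in decreasing order, rather than all relevant documents in the collection):

```python
import math

def dcg(gains, b=2.0):
    """DCG(L) = sum over ranks r of g(r) / log_b(r + 1)."""
    return sum(g / math.log(r + 1, b) for r, g in enumerate(gains, start=1))

def ndcg(gains, b=2.0):
    """nDCG = DCG of the ranking divided by DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(gains, reverse=True), b)
    return dcg(gains, b) / ideal if ideal > 0 else 0.0

print(ndcg([2, 4, 3]))  # below 1.0: the ranking is not ideally ordered
print(ndcg([4, 3, 2]))  # 1.0: already ideally ordered
```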

- Problems with nDCG

! If web search returned the two result lists below, which would you intuitively say is better?

(figure: result list 1 contains a single highly relevant answer worth 20 points at the top; result list 2 contains many mildly relevant answers worth 2 points each; the slide notes that nDCG gives the higher score to result list 2)

! In web search, when good results appear near the top, users stop their search there (cascade-based models).

- Expected Reciprocal Rank for graded relevance (ERR)

Olivier Chapelle, Donald Metzler, Ya Zhang, and Pierre Grinspan. 2009. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09).

! A model that accounts for the possibility that the user stops browsing once a highly relevant document appears near the top.

(figure from Chapelle et al., CIKM '09: for the query "black powder ammunition" the user examines results from rank 1 downward; at each position they judge "Relevant?" — highly / somewhat / no — and either stop or view the next item)

- Definition of ERR

[Chapelle et al., CIKM '09]

ERR := Σ_{r=1}^{n} (1/r) · P(user stops at position r)

・The utility of rank r is φ(r) = 1/r: the "utility of finding the perfect document at rank r".

・The probability that the user stops browsing at position r is

P(user stops at position r) = Π_{i=1}^{r-1} (1 − R_i) · R_r

i.e., the probability that the user is not satisfied by the first r − 1 results and is satisfied by the r-th one.

・Here R_r is the relevance probability of the document at rank r, obtained by mapping its editorial grade g_r via R_r := R(g_r), with

R(g) := (2^g − 1) / 2^{g_max},   g ∈ {0, ..., g_max}

so a non-relevant document (g = 0) has probability 0, while an extremely relevant one (g = 4 on a 5-point scale) has probability near 1.

・A naïve computation of ERR requires O(n²) operations, but it can be computed in O(n) time (Algorithm 2 of the paper).

- ERR: worked example

(grades use g_max = 4, so R(g) = (2^g − 1)/16)

rank 1: document with R_1 = 3/16 (grade 2)
rank 2: document with R_2 = 15/16 (grade 4)

ERR@2 = (1/1) · 3/16 + (1/2) · (13/16) · (15/16) = 291/512

The probability that the user is still unsatisfied after rank 2 (and keeps browsing) is (13/16) · (1 − 15/16) = 13/256.

Linear-time computation (Algorithm 2 of Chapelle et al.):

p ← 1, ERR ← 0
for r = 1 to n do
  R ← R(g_r)
  ERR ← ERR + p · R / r
  p ← p · (1 − R)
end for
return ERR

Notes from the paper: under binary relevance (all R_i ∈ {0, 1}), ERR reduces exactly to the reciprocal rank (RR); the "effective" discount of the document at rank r is (1/r) · Π_{i=1}^{r-1}(1 − R_i), so the more relevant the preceding documents are, the more later documents are discounted, which reflects real user behavior; and when the R_i are all very small, ERR approaches the additive, DCG-like quantity Σ_r R_r / r. Whereas position-based metrics such as DCG and RBP discount only by position, ERR is built on the cascade user model, which correlates better with user satisfaction.
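ERR can be computed in linear time with the cascade recursion (Algorithm 2 of Chapelle et al.); a sketch in Python, using the grade-to-probability mapping R(g) = (2^g − 1)/2^{g_max} from the slide:

```python
def err(grades, g_max=4):
    """Expected Reciprocal Rank via the O(n) cascade recursion."""
    p, score = 1.0, 0.0  # p = probability the user reaches the current rank
    for r, g in enumerate(grades, start=1):
        R = (2 ** g - 1) / 2 ** g_max   # probability the user is satisfied here
        score += p * R / r              # P(stop at r) = p * R, utility 1/r
        p *= 1.0 - R                    # probability the user keeps browsing
    return score

# Worked example from the slide: grades 2 then 4, with g_max = 4.
print(err([2, 4]))  # 291/512 = 0.568359375
```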

- Recent trends in search technology

! Expansion of query-reformulation techniques

! query suggestion / query correction / query expansion

! Dialogue-based search (e.g., Siri, Watson)

! retrieving results through a dialogue with the system

→ Session-based search is becoming widespread

- Why existing measures struggle with sessions

(figure: search system 1 spreads its correct answers across three queries, xx, yy, and zz; search system 2 returns all of its correct answers for the first query)

! Which search system is better?

! Naturally, search system 2, which finds the correct answers in the first query of the session, performs better.

! With nDCG, however, there is no difference between systems 1 and 2.

- Session DCG

! A DCG that takes the number of reformulations in the session into account

[Järvelin et al., ECIR 2008: K. Järvelin, S. L. Price, L. M. L. Delcambre, and M. L. Nielsen. Discounted cumulated gain based evaluation of multiple-query IR sessions.]

Concatenate the top k results from each of the m queries in the session. For rank i in the concatenated list, the discounted gain is

DG@i = (2^{rel(i)} − 1) / log_b(i + (b − 1))

where b is the rank-discount log base, typically 2. An additional discount is then applied to documents retrieved by later reformulations. With j the reformulation number of the result at rank i (j = 1 for ranks 1..k, j = 2 for ranks k+1..2k, and so on):

sDG@i = DG@i / log_bq(j + (bq − 1))

where bq is the session-discount log base (e.g., bq = 4), so the results of the first query receive no session discount. Session DCG is the sum over the concatenated list:

sDCG@k = Σ_{i=1}^{mk} (2^{rel(i)} − 1) / (log_bq(j + (bq − 1)) · log_b(i + (b − 1)))

discount on the session (reformulation) number × discount on the rank

As with standard DCG, an "ideal" score based on an optimal, relevance-sorted ranking is used to normalize sDCG, yielding nsDCG@k.
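Session DCG can be transcribed directly into code (a sketch, assuming the per-query top-k result lists of the session are passed in order, with j = 1 for the first query so that it is not session-discounted):

```python
import math

def sdcg(session, k=10, b=2.0, bq=4.0):
    """Session DCG: rank discount within the concatenated list, plus an extra
    discount on the reformulation number j."""
    score = 0.0
    for j, results in enumerate(session, start=1):      # j-th query of the session
        for pos, rel in enumerate(results[:k], start=1):
            i = (j - 1) * k + pos                       # rank in the concatenated list
            gain = 2 ** rel - 1
            rank_disc = math.log(i + (b - 1), b)        # standard DCG rank discount
            sess_disc = math.log(j + (bq - 1), bq)      # = 1 for the first query
            score += gain / (rank_disc * sess_disc)
    return score

# Two-query session with graded relevance 0-4 per result (hypothetical data):
# finding the good answers in the first query scores higher than in the second.
print(sdcg([[3, 1, 0], [2, 0, 1]], k=3))
print(sdcg([[2, 0, 1], [3, 1, 0]], k=3))
```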

- Session ERR (our original method)

! How results are presented in spoken-dialogue search:

! small screen

! high cognitive load → the user cannot operate the screen

! results are read aloud

! Users tend to view only the top-ranked results

! and increasingly prefer to keep interactions as short as possible.

→ For spoken-dialogue search, ERR is closer to the actual user model than nDCG.

→ We therefore propose session ERR and use it as one of our evaluation metrics.

! Since both session nDCG and ERR are in wide use, the proposal should be readily understood.

- Session ERR (our original method)

! The approach: introduce a discount function over the number of reformulations in the session into the ERR formula.

! Defining equation of sERR

(figure: the sERR formula — ERR with an additional discount on the session count)

- Very recent trends in evaluation metrics

! Intent-Aware Expected Reciprocal Rank

! L. Wang, P. N. Bennett and K. Collins-Thompson, Robust Ranking Models via Risk-Sensitive Optimization. In Proc. of SIGIR 2012. See also the TREC 2013 Web Track.

! When judging a document's relevance, additionally considers whether it matches the search intent (topic).

! Used to evaluate risk-sensitive tasks (e.g., adult filtering).

! Time-based calibration of effectiveness measures

! Mark D. Smucker (Department of Management Sciences, University of Waterloo, Canada) and Charles L. A. Clarke (School of Computer Science), SIGIR 2012 Best Paper.

! Calibrates retrieval-effectiveness measurement by the time the evaluation takes.

! Can also handle cases where the system offers query suggestions, or even search results, after the user has typed only the first character of the query.

- Summary

(1) Trends in retrieval evaluation metrics driven by recent changes in IR research