This page reproduces the content of http://www.slideshare.net/shima__shima/absolute-and-relative-clustering

Uploaded on 2013/08/11 in Technology

Absolute and Relative Clustering

4th MultiClust Workshop on Multiple Clusterings, Multi-view Data, and Multi-source Knowledge-driven Clustering (Multiclust 2013)

Aug. 11, 2013 @ Chicago, U.S.A, in conjunction with KDD2013

Article @ Official Site: http://dx.doi.org/10.1145/2501006.2501013

Article @ Personal Site: http://www.kamishima.net/archive/2013-ws-kdd-print.pdf

Handnote: http://www.kamishima.net/archive/2013-ws-kdd-HN.pdf

Workshop Homepage: http://cs.au.dk/research/research-areas/data-intensive-systems/projects/multiclust2013/

Abstract:

Research into (semi-)supervised clustering has been increasing. Supervised clustering aims to group similar data while being partially guided by the user’s supervision. In supervised clustering, there are many choices for formalization. For example, as a type of supervision, one can adopt labels of data points, must/cannot links, and so on. Given a real clustering task, such as grouping documents or image segmentation, users must confront the question “How should we mathematically formalize our task?” To help answer this question, we propose the classification of real clusterings into absolute and relative clusterings, which are defined based on the relationship between the resultant partition and the data set to be clustered. This categorization can be exploited to choose a type of task formalization.

- Absolute and Relative Clustering

Toshihiro Kamishima and Shotaro Akaho

National Institute of Advanced Industrial Science and Technology (AIST), Japan

4th MultiClust Workshop on Multiple Clusterings, Multi-view Data, and Multi-source Knowledge-driven Clustering

In conjunction with the KDD 2013 @ Chicago, U.S.A., Aug. 11, 2013

1 - Overview

Supervised Clustering

Clustering a data set under supervision that indicates the clusters desired by a user

Absolute and Relative Clustering

Properties of real tasks that should be considered when formalizing these tasks as mathematical problems

These properties are useful for determining these design issues:

formats of input examples & the goal of learning

the types of supervision

information provided by features

2 - An Intuitive Definition of Absolute and Relative Clustering

3 - Real Tasks and Math Problems

Tasks in the real world: what we want to perform

Problems in the math world: what is solved in computers

Ex. document clustering

Real task: a set of documents → document clusters

Math problem: document vectors of bags of words, $x_1, x_2, \ldots, x_N$ → (algorithm, criterion) → clusters $C_1, C_2, \ldots, C_K$

Formalization maps the real task to the math problem

Absolute and relative clustering are properties of real tasks, not of mathematical problems

4 - Absolute and Relative Clustering

In the user’s target task, consider the determination of whether two objects, A and B, are grouped together or separated

If the determination is NOT influenced by the other objects, X, the task is absolute clustering

If the determination CAN be influenced by the other objects, X, the task is relative clustering

5 - Reference Matching

The reference matching task is an example of absolute clustering

The goal of reference matching is to group reference strings into clusters so that the strings in each cluster refer to the same real-world entity

Ex. These strings refer to the same entity in the real world

The appearances of the strings are different: “Knowledge Discovery and Data Mining” & “KDD” are grouped

The order of the words is permuted: “Author → Title → Journal → Year” and “Author → Year → Title → Journal” are grouped

6 - Reference Matching

Strings 1 and 2 in a document set currently refer to the same entity

[Figure: a document set of strings 1 to 5; strings 1 and 2 refer to the same entity]

The entity referred to by strings 1 and 2 never changes

[Figure: a new string is added to the document set; strings 1 and 2 still refer to the same entity]

The determination of whether a pair of strings is clustered together is NOT influenced by the other strings in the document set

The reference matching task is absolute clustering
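The independence described on this slide can be illustrated with a minimal sketch. It assumes a toy normalisation and a hypothetical abbreviation table (`ABBREVIATIONS`, `canonical_key`, `same_entity`, and `cluster` are illustrative names, not part of the original slides). Because each string is mapped to a canonical key on its own, the decision for a pair of strings cannot depend on the other strings in the set.

```python
import re

# Hypothetical abbreviation table; a real system would curate or learn this.
ABBREVIATIONS = {"kdd": "knowledge discovery and data mining"}


def canonical_key(reference: str) -> str:
    """Map one reference string to a canonical form (toy normalisation)."""
    text = re.sub(r"[^a-z0-9 ]", " ", reference.lower())
    text = " ".join(text.split())
    return ABBREVIATIONS.get(text, text)


def same_entity(a: str, b: str) -> bool:
    """Pairwise decision that looks only at the two strings themselves."""
    return canonical_key(a) == canonical_key(b)


def cluster(references):
    """Group strings that share a canonical key, keeping input order."""
    groups = {}
    for ref in references:
        groups.setdefault(canonical_key(ref), []).append(ref)
    return list(groups.values())


strings = ["KDD", "Knowledge Discovery and Data Mining", "ICML"]
print(cluster(strings))
# Adding a new string leaves the decision for the first two unchanged.
print(cluster(strings + ["NIPS"]))
```

In both calls the first two strings end up in one cluster, which is the behaviour expected of an absolute clustering task.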

7 - Noun Coreference

The noun coreference task is an example of relative clustering

The goal of noun coreference is to group noun phrases in a document into clusters of phrases corresponding to the same entity or concept

Ex. If one determines that these phrases represent the same person in a news article, they are clustered together

Mr. Abe, who is the prime minister of Japan, visited Kyoto. And he met the mayor of Kyoto city.

8 - Noun Coreference

A: There is a parent turtle.

B: On this turtle, there is a child turtle.

C: On this turtle, there is a grandchild turtle.

Currently, the phrase “a parent turtle” in sentence A and the phrase “this turtle” in sentence C are separated into different clusters.

Now sentence B is deleted

9 - Noun Coreference

A: There is a parent turtle.

C: On this turtle, there is a grandchild turtle.

The phrase “a parent turtle” in sentence A and the phrase “this turtle” in sentence C are clustered together

The determination of whether a pair of phrases is clustered together is influenced by the other phrases

The noun coreference task is relative clustering
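The turtle example can be replayed with a small sketch. It assumes a deliberately naive heuristic that is not from the slides: “this turtle” corefers with the most recently introduced turtle, and every other phrase introduces a new entity. The names `resolve` and `together` and the mention identifiers are hypothetical.

```python
def resolve(sentences):
    """Toy coreference resolver (hypothetical heuristic, for illustration only).

    `sentences` is a list of sentences, each a list of (mention_id, phrase) pairs.
    "this turtle" joins the cluster of the most recently introduced turtle;
    every other phrase starts a new cluster.
    """
    clusters = []
    latest = None  # index of the cluster of the most recently introduced turtle
    for mentions in sentences:
        for mention_id, phrase in mentions:
            if phrase == "this turtle" and latest is not None:
                clusters[latest].append(mention_id)
            else:
                clusters.append([mention_id])
                latest = len(clusters) - 1
    return clusters


def together(clusters, a, b):
    """True if mentions a and b fall in the same cluster."""
    return any(a in c and b in c for c in clusters)


A = [("A1", "a parent turtle")]
B = [("B1", "this turtle"), ("B2", "a child turtle")]
C = [("C1", "this turtle"), ("C2", "a grandchild turtle")]

with_b = resolve([A, B, C])
without_b = resolve([A, C])
print(together(with_b, "A1", "C1"))     # False: separated while B is present
print(together(without_b, "A1", "C1"))  # True: grouped once B is deleted
```

The same pair of phrases receives different decisions depending on whether sentence B is present, which is what makes the task relative.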

10 - A Formal Definition of Absolute and Relative Clustering

11 - Clustering Function

$\mathcal{X}$: a universal object set, a domain of all possible objects

$X = \{x_1, x_2, \ldots, x_N\} \subset \mathcal{X}$: an object set

$C_X = \{c_1, c_2, \ldots, c_K\}$: a partition, and $c_1, c_2, \ldots, c_K$: clusters

$\delta(\{x_i, x_j\}, C_X) = \begin{cases} 1 & \text{if } x_i \text{ and } x_j \text{ are in the same cluster} \\ 0 & \text{otherwise} \end{cases}$ (the indicator is denoted here by $\delta$)

Clustering Function

$\pi(X)$: maps a given object set, $X$, into a partition, $C_X$

True Clustering Function

$\pi^*(X)$: maps the formal representation, $X$, of a set of entities in the real task to the partition, $C_X$, that corresponds to the appropriate partition fitting the goal of the real task

12 - Absolute and Relative Clustering

Intuitive Definition

If the determination of whether two objects are grouped together or separated is not influenced by the other objects, it is an absolute clustering task; otherwise, it is a relative clustering task

Formal Definition

If a true clustering function, $\pi^*(X)$, for the target task satisfies the following condition, the task is absolute clustering; otherwise, it is relative clustering

$\delta(\{x_i, x_j\}, \pi^*(X)) = \delta(\{x_i, x_j\}, \pi^*(X')), \quad \forall x_i, x_j \in X \cap X',\ x_i \neq x_j,\ \forall X, X' \subseteq \mathcal{X}$

Here $\delta(\{x_i, x_j\}, C)$ is 1 if $x_i$ and $x_j$ are in the same cluster of the partition $C$, and 0 otherwise
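For a small universal object set, the formal condition can be checked by brute force. The sketch below is only an illustration under stated assumptions: `delta` plays the role of the same-cluster indicator, and the two clustering functions, `by_parity` and `by_median`, are hypothetical toy examples that do not appear in the slides.

```python
from itertools import combinations


def delta(pair, partition):
    """1 if both objects of the pair lie in one cluster of the partition, else 0."""
    a, b = pair
    return int(any(a in c and b in c for c in partition))


def subsets(universe, min_size=2):
    """Yield every subset of the universe with at least min_size objects."""
    items = sorted(universe)
    for r in range(min_size, len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_absolute(pi, universe):
    """Check the formal condition over all pairs of subsets of a small universe."""
    all_sets = list(subsets(universe))
    for X in all_sets:
        for X2 in all_sets:
            shared = sorted(X & X2)
            for pair in combinations(shared, 2):
                if delta(pair, pi(X)) != delta(pair, pi(X2)):
                    return False
    return True


def by_parity(X):
    """Absolute toy example: group integers by parity, whatever else is in X."""
    groups = {}
    for x in X:
        groups.setdefault(x % 2, set()).add(x)
    return list(groups.values())


def by_median(X):
    """Relative toy example: split at the median of the current set."""
    items = sorted(X)
    mid = len(items) // 2
    return [set(items[:mid]), set(items[mid:])]


universe = {1, 2, 3, 4}
print(is_absolute(by_parity, universe))  # True
print(is_absolute(by_median, universe))  # False
```

`by_median` fails because 1 and 2 are separated when clustering {1, 2} but grouped when clustering {1, 2, 3, 4}. The check enumerates all pairs of subsets, so it is only practical for very small universes.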

13 - Property of Absolute Clustering

Existence of an Absolute Partition

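An absolute partition, $C = \pi^*(\mathcal{X})$, exists iff a true clustering function corresponds to an absolute clustering task

All assignments of objects are consistent with this absolute partition even if the clustered object sets are changed

$\delta(\{x_i, x_j\}, \pi^*(X)) = \delta(\{x_i, x_j\}, C), \quad \forall x_i, x_j \in X,\ x_i \neq x_j,\ \forall X \subseteq \mathcal{X}$ ($\delta$: same-cluster indicator)

[Figure: object sets $X_1$ and $X_2$ drawn from the universal object set $\mathcal{X}$, containing objects $x_1, \ldots, x_6$, together with the absolute partition $C$]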
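The equivalence stated on this slide follows from the formal definition in two short steps. The derivation below is a sketch that writes $\delta$ for the same-cluster indicator (a notational assumption) and $\mathcal{X}$ for the universal object set, and it assumes that $\pi^*$ is defined on $\mathcal{X}$ itself.

```latex
% (=>) Suppose the task is absolute. Put X' = \mathcal{X} in the formal
% condition and define C = \pi^*(\mathcal{X}). Since X \cap \mathcal{X} = X,
\delta(\{x_i, x_j\}, \pi^*(X))
  = \delta(\{x_i, x_j\}, \pi^*(\mathcal{X}))
  = \delta(\{x_i, x_j\}, C),
\quad \forall x_i, x_j \in X,\ x_i \neq x_j,\ \forall X \subseteq \mathcal{X}.

% (<=) Suppose such a partition C exists. For any X, X' \subseteq \mathcal{X}
% and any x_i, x_j \in X \cap X' with x_i \neq x_j, both sides agree with C:
\delta(\{x_i, x_j\}, \pi^*(X))
  = \delta(\{x_i, x_j\}, C)
  = \delta(\{x_i, x_j\}, \pi^*(X')).
```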

14 - Property of Absolute Clustering

Transitivity across Different Object Sets

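For an absolute clustering task, the following transitivity is satisfied, because there is an absolute partition:

For $x_1, x_2 \in X_1$ and $x_1, x_3 \in X_2$, suppose that $x_1$ and $x_2$ are in the same cluster, and that $x_1$ and $x_3$ are also in the same cluster.

In this case, when the two object sets are merged, $x_2$ and $x_3$ fall in the same cluster

[Figure: $x_1$ and $x_2$ are in the same cluster in $X_1$, $x_1$ and $x_3$ are in the same cluster in $X_2$, and hence $x_2$ and $x_3$ are in the same cluster in $X_1 \cup X_2$]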
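For an absolute clustering task, this transitivity means that partitions of overlapping object sets can be merged by taking a transitive closure. The following is a minimal sketch using a union-find structure; the function name `merge_partitions` is hypothetical. The merge is justified only when the task is absolute; for a relative task the same closure could produce wrong groupings.

```python
def merge_partitions(*partitions):
    """Merge partitions of overlapping object sets by transitive closure."""
    parent = {}

    def find(x):
        # Register unseen objects as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for partition in partitions:
        for cluster in partition:
            members = list(cluster)
            for m in members:
                find(m)  # make sure singletons are registered too
            for m in members[1:]:
                union(members[0], m)

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


C1 = [{"x1", "x2"}, {"x4"}]  # partition of X1 = {x1, x2, x4}
C2 = [{"x1", "x3"}]          # partition of X2 = {x1, x3}
merged = merge_partitions(C1, C2)
print(sorted(sorted(g) for g in merged))
```

Here `x2` and `x3` never appear in the same input set, yet they end up in one cluster because both are grouped with `x1`.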

15 - Three Types of Supervised Clustering Problems

16 - Three Types of Supervised Clustering Problems

Math Problems of Supervised Clustering

format of input examples & goal of the algorithm

Transductive Clustering: A single object set with supervision information is given, and the goal of learning is to obtain a partition of the set

Applicable to both absolute and relative clustering tasks

Semi-Supervised Clustering: A clustering function is learned from a single object set with supervision information

Fit for performing absolute clustering tasks

Fully Supervised Clustering: A clustering function is learned from multiple object sets with supervision information

Relative clustering tasks must be formulated as this type of problem

17 - Transductive Clustering

A single object set with supervision information is given, and the goal of learning is to obtain a partition of the set

[Figure: input example (object set $X$, supervision information $Y$) → learning algorithm → $\hat{C}_X$, a partition of $X$]

The distinction between absolute and relative clustering becomes apparent when the contents of an object set change

In transductive clustering, there is no need to differentiate between absolute and relative clustering, because the object set is invariant

18 - Semi-Supervised Clustering

A clustering function is learned from a single object set with supervision information, and the function is used to cluster a test object set

[Figure: input example (object set $X$, supervision information $Y$) → learning algorithm → clustering function $\hat{\pi}(X)$; test object set $X_t$ → $\hat{\pi}$ → $\hat{C}_{X_t}$, a partition of $X_t$]

When formulating absolute clustering tasks, the transitivity property can be exploited efficiently

To learn a clustering function for an absolute clustering task, the task should be formulated as a semi-supervised clustering problem

19 - Fully Supervised Clustering

A clustering function is learned from multiple object sets with supervision information, and the function is used to cluster a test object set

[Figure: input examples $(X_1, Y_1), (X_2, Y_2), \ldots, (X_N, Y_N)$ → learning algorithm → clustering function $\hat{\pi}(X)$; test object set $X_t$ → $\hat{\pi}$ → $\hat{C}_{X_t}$, a partition of $X_t$]

In a relative clustering task, the supervision information, $Y_i$, is valid only for its own object set, $X_i$

To learn a clustering function for a relative clustering task, the task must be formulated as a fully supervised clustering problem

20 - Conclusions

We propose a notion of absolute and relative clustering

If the determination of whether a pair of objects is clustered together is not influenced by the other objects, the task is absolute clustering; otherwise, it is relative clustering

Two properties of absolute clustering tasks

Existence of an absolute partition

Transitivity across different object sets

Three types of supervised clustering problems

Transductive clustering, semi-supervised clustering, and fully supervised clustering

Absolute clustering tasks should be formulated as a semi-supervised problem, and relative clustering tasks must be formulated as a fully supervised problem.
