This page reproduces the content of http://www.slideshare.net/mksaad/class-outlier-mining-presentation.

- International Journal of Intelligent Technology, Vol. 2, No. 1, pp 55-68, 2007

Class Outlier Mining:

Distance‐Based Approach

By

Nabil M. Hewahi and Motaz K. Saad

Presented by

Motaz K. Saad

msaad@iugaza.edu

Jan. 2008 - Abstract

• In large datasets, identifying exceptions or rare cases (unusual patterns) with respect to a group of similar cases is a very significant problem.

• The traditional problem (Outlier Mining) is to find exceptions or rare cases in a dataset irrespective of the class labels of these cases; they are considered rare events with respect to the whole dataset.

2 - Abstract (Cont.)

• Present an overview of Class Outlier.

• Introduce a novel definition of a class outlier

and propose COF factor.

• Propose a new algorithm for class outlier

mining.

• Present experimental results.

• Perform a comparison study.

3 - Outlier Definition

• An Outlier is a data object that does not

comply with the general behavior of the

data (unusual pattern)

• It can be considered as noise or

exception but is quite useful in fraud

detection and rare events analysis.

4 - Outlier Mining

• It is the problem of detecting rare

events, deviant objects, and

exceptions.

• It is an important data mining issue in knowledge discovery and has attracted increasing interest in recent years.

5 - Outliers

Outliers

6 - Outlier Mining: Business Applications

• Medical

• Education

• Fraud detection

• Credit approving

• Stock market analysis

• Identifying computer network intrusions

• Data cleaning

• Surveillance and auditing

• Health monitoring systems

• Insurance, banking, money laundering, telecommunication, etc.

7 - Outlier Detection Methods

• Statistical based (Distribution based)

• Clustering

• Distance‐based

– K Nearest Neighbors (KNN)

– Density‐Based

• Model‐Based (Neural Network): Replicator

Neural Network RNN

8 - Statistical (Distribution) based

Outlier Detection Method

9 - Variation of Distance‐Based Approach

for detecting Outlier

10 - NN Model for Detection Outlier

A schematic view of a fully connected Replicator Neural Network.

11 - What is Class Outlier?

• All the previous definitions of outliers do not consider the class labels of the data set.

• This means all the previous outlier mining methods are applied to the overall data set without looking closely at each class label separately.

12 - Class Outlier vs. Outlier

[Figure: illustration contrasting class outliers with an ordinary outlier; labels: "Class Outliers", "Outlier", "X Class", "Class".]

13 - Class Outlier Example in heart‐statlog dataset

Class Outlier

         Att.#
Inst#    1  2  3    4    5  6  7    8  9   10  11  12  13  Class
69      47  1  3  108  243  0  0  152  0    0   1   0   3  present
62      44  1  3  120  226  0  0  169  0    0   1   0   3  absent
150     41  1  3  112  250  0  0  179  0    0   1   0   3  absent
179     50  1  3  129  196  0  0  163  0    0   1   0   3  absent
38      42  1  3  130  180  0  0  150  0    0   1   0   3  absent
253     51  1  3  110  175  0  0  123  0  0.6   1   0   3  absent
23      47  1  3  112  204  0  0  143  0  0.1   1   0   3  absent

Class Outlier Mining takes into consideration the class labels of the dataset.

14 - Class Outlier Example in house‐vote‐84 dataset

Class Outlier

        Att.#
Inst#   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16  Class
407     n  n  n  y  y  y  n  n  n  n  y  y  y  y  n  n  democrat
306     n  n  n  y  y  y  n  n  n  n  n  y  y  y  n  n  republican
83      n  n  n  y  y  y  n  n  n  n  n  y  y  y  n  n  republican
87      n  n  n  y  y  y  n  n  n  n  n  y  y  y  n  n  republican
303     n  n  n  y  y  y  n  n  n  n  n  y  y  y  n  n  republican
119     n  n  n  y  y  y  n  n  n  n  n  y  y  y  n  n  republican
339     y  n  n  y  y  y  n  n  n  n  y  y  y  y  n  n  republican

Class Outlier Mining takes into consideration the class labels of the dataset.

15 - Party Behavior of house‐vote‐84 dataset

[Figure: bar chart of vote counts (noVote, n, y) for Democrats and Republicans on Issues 3, 4, 5, 8, and 12; y-axis: Votes, x-axis: Issue #.]

16 - Class Outlier Definitions

• There are three definitions that handle the following problem: “given a set of observations with class labels, find those that arouse suspicions, taking into account the class labels”.

– Semantic Outlier [He, et al. WAIM’02, 2002]

– Cross Outlier [Papadimitriou and Faloutsos, SSTD’03, 2003]

– Generalization of def. 1 & 2 [He, et al. ESWA'04, 2004]

17 - Semantic Outlier

• Semantic Outlier: a data point that behaves differently from other data points in the same class, while looking normal with respect to data points in another class [He, et al. 2002]

18 - Cross Outlier

• Cross‐outlier: Given two sets (or classes) of

objects, find those which deviate with respect to

the other set [Papadimitriou and Faloutsos 2003]

19 - Definition (3)

• Generalization of Definition 1 & 2: The

generalization does not consider only outliers

that deviate with respect to their own class,

but also outliers that deviate with respect to

other classes [He, et al. 2004].

20 - The proposed definitions

• Distance (Similarity) Function

• K Nearest Neighbors

• PCL(T, K)

• Deviation

• K‐Distance

• Class Outlier

• Class Outlier Factor (COF)

21 - Distance (Similarity) Function

• Given a data set D = {t1, t2, t3, ..., tn} of tuples, where each tuple ti = <ti1, ti2, ti3, ..., tim, Ci> contains m attributes and the class label Ci, the similarity function based on the Euclidean Distance between two data tuples X = <x1, x2, x3, ..., xm> and Y = <y1, y2, y3, ..., ym> (excluding the class labels) is

$$d_2(X, Y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$
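As a minimal sketch of the distance function above, assuming purely numeric attributes (the slides do not say how nominal attributes are encoded), the Euclidean distance can be written as follows. The function name `euclidean_distance` is illustrative and is not taken from the authors' implementation.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two tuples of numeric attributes.

    The class label is assumed to have been stripped from both tuples.
    """
    if len(x) != len(y):
        raise ValueError("tuples must have the same number of attributes")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```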

22 - K Nearest Neighbours

• For any positive integer

K, the K-Nearest

Neighbours of a tuple ti

are the K closest tuples

in the data set.
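A brute-force sketch of the K nearest neighbours lookup is shown below. It assumes numeric attributes and, following the PCL range [1/K, 1] given later in the deck, assumes that the query tuple itself counts as one of its own K neighbours. The name `k_nearest_neighbours` is hypothetical.

```python
import math


def k_nearest_neighbours(data, index, k):
    """Return the indices of the k tuples closest to data[index], nearest first.

    Assumption: the query tuple itself (distance 0) is included in the result.
    """
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and the dataset size")
    target = data[index]

    def distance_to_target(j):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(target, data[j])))

    # Sort by distance; on ties, the query tuple itself comes first.
    ranked = sorted(range(len(data)),
                    key=lambda j: (distance_to_target(j), j != index))
    return ranked[:k]


points = [(0, 0), (1, 0), (5, 5), (0, 2)]
print(k_nearest_neighbours(points, 0, 3))  # [0, 1, 3]
```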

23 - PCL (T, K)

• PCL(T, K) is the probability of the class label of the instance T with respect to the class labels of its K Nearest Neighbours.

• The instance T has the class label y, so the PCL of the instance T is 2/7.
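The 2/7 example can be reproduced with a small sketch. It assumes the list of neighbour labels already includes the label of T itself, which is how a PCL of 2/7 arises when only one other neighbour shares T's class. The function name `pcl` and the sample labels are illustrative.

```python
def pcl(label, neighbour_labels):
    """Fraction of the K nearest neighbours whose class label equals `label`.

    Assumption: neighbour_labels covers all K neighbours, T itself included.
    """
    if not neighbour_labels:
        raise ValueError("at least one neighbour label is required")
    matches = sum(1 for other in neighbour_labels if other == label)
    return matches / len(neighbour_labels)


# T has label "y"; among its 7 nearest neighbours (T included) two are "y".
print(pcl("y", ["y", "y", "n", "n", "n", "n", "n"]))
```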

24 - Deviation (T)

• Given a subset DCL = {t1, t2, t3, ..., th} of a data set D = {t1, t2, t3, ..., tn}, where h is the number of instances in DCL and n is the number of instances in D.

• Given the instance T, DCL contains all the instances that have the same class label as the instance T.

• The Deviation of T is how much the instance T deviates from the DCL subset.

25 - Deviation (T) (Cont.)

• The Deviation is computed by summing the distances between the instance T and every instance in DCL.

$$\mathrm{Deviation}(T) = \sum_{i=1}^{h} d(T, t_i), \quad \text{where } t_i \in DCL.$$
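The Deviation sum can be sketched as below, again assuming numeric attributes and the Euclidean distance defined earlier. The name `deviation` and the sample points are illustrative only.

```python
import math


def deviation(t, same_class_instances):
    """Sum of Euclidean distances from t to every instance sharing its class label."""
    return sum(
        math.sqrt(sum((a - b) ** 2 for a, b in zip(t, other)))
        for other in same_class_instances
    )


# T itself may be part of DCL; it contributes a distance of 0.
print(deviation((0, 0), [(0, 0), (3, 4), (6, 8)]))  # 15.0
```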

26 - Deviation (T) (Cont.)

27 - K‐Distance (The Density Factor)

• KDist is the distance between the instance T and its K nearest neighbours, i.e. how close the K nearest neighbours are to the instance T.

$$\mathrm{KDist}(T) = \sum_{i=1}^{K} d(T, t_i)$$

where t_i are the K nearest neighbours of T.
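A brute-force sketch of KDist follows. It assumes numeric attributes and that T counts as its own nearest neighbour (contributing distance 0), consistent with the PCL range stated in the deck. The name `k_distance` is hypothetical.

```python
import math


def k_distance(data, index, k):
    """Sum of distances from data[index] to its k nearest neighbours.

    Assumption: the instance itself is one of the k neighbours (distance 0).
    """
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and the dataset size")
    target = data[index]
    distances = sorted(
        math.sqrt(sum((a - b) ** 2 for a, b in zip(target, row)))
        for row in data
    )
    return sum(distances[:k])


points = [(0, 0), (1, 0), (0, 2), (5, 5)]
print(k_distance(points, 0, 3))  # 3.0  (0 + 1 + 2)
```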

28 - Class Outlier:

The proposed Definition

• Class Outliers are the top N instances which satisfy the following:

– The K‐Distance to its K nearest neighbours is the least.

– Its Deviation is the greatest.

– It has a different class label from that of its K nearest neighbours.

29 - Class Outlier Factor (COF)

• COF: The Class Outlier Factor of the instance T is the degree of being a Class Outlier. It is defined as:

$$\mathrm{COF}(T) = K \cdot \mathrm{PCL}(T) + \alpha \cdot \frac{1}{\mathrm{Deviation}(T)} + \beta \cdot \mathrm{KDist}(T)$$

• PCL(T) is scaled from the range [1/K, 1] to [1, K] by multiplying it by K. The α and β factors control the importance and the effects of Deviation and K-Distance, where 0 ≤ α ≤ M and 0 ≤ β ≤ 1. M is a changeable value based on the application domain and the initial experimental results.
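The COF formula can be checked against the figures reported later in the deck for the top-ranked house‐vote‐84 instance (Inst. # 407: PCL = 1/7, Dev = 896.24, KDist = 6.0, with K = 7, α = 100, β = 0.1). The sketch below is illustrative; the function name `class_outlier_factor` is not taken from the authors' code.

```python
def class_outlier_factor(pcl, deviation, k_dist, k, alpha, beta):
    """COF = K * PCL + alpha * (1 / Deviation) + beta * KDist."""
    if deviation <= 0:
        raise ValueError("Deviation must be positive for COF to be defined")
    return k * pcl + alpha / deviation + beta * k_dist


# Values reported in the deck for instance 407 of house-vote-84.
cof_407 = class_outlier_factor(pcl=1 / 7, deviation=896.24, k_dist=6.0,
                               k=7, alpha=100, beta=0.1)
print(round(cof_407, 5))  # 1.71158, matching the reported COF
```

A lower COF means a stronger class outlier, so a small PCL, a large Deviation, and a small KDist all push an instance up the ranking.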

30 - Guidelines for choosing K, α and β

• If the Deviation is in the hundreds, for example, the best value for α is 100; if the Deviation is in the tens, the best value for α is 10, and so on.

• The optimal value of K is determined by trial and error.

– Many factors affect the optimal value; for example, dataset size and number of classes are very important factors in choosing the value of K.
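One way to read the α guideline is "match α to the order of magnitude of the Deviation values". The sketch below encodes that reading; using the median Deviation as the typical value is an assumption of this example, and `suggest_alpha` is a hypothetical helper, not part of CODB.

```python
import math


def suggest_alpha(deviations):
    """Pick alpha as the power of ten matching the typical Deviation magnitude.

    Assumption: the (upper) median Deviation is used as the typical value.
    """
    if not deviations:
        raise ValueError("at least one Deviation value is required")
    typical = sorted(deviations)[len(deviations) // 2]
    if typical <= 0:
        raise ValueError("Deviation values must be positive")
    return 10 ** math.floor(math.log10(typical))


print(suggest_alpha([896.24, 519.94, 832.95]))  # 100 (Deviation in the hundreds)
print(suggest_alpha([12.0, 45.5, 80.1]))        # 10  (Deviation in the tens)
```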

31 - Optimal value of K

• High value of K might result in wrong

estimation of PCL.

• Low value of K means KNN is not well

utilized.

• Odd values of K would make more sense for

PCL value.

32 - CODB Algorithm Basic Steps

• Rank each instance in the dataset D.

– This is done by calling the Rank procedure after

providing the CODB with all the necessary data such as

the value of α, β and K.

– The Rank Procedure finds out the rank of each

instance using the formula in slide 28 and gives back

the rank to CODB

• CODB maintains a list of only the instances of the

top N class outliers.

– The lower the COF value of an instance, the higher its priority to be a class outlier.

33 - CODB features

• Direct method: no need for clustering.

• Handles numeric (continuous), nominal, and mixed datasets.

• Works on datasets with more than two

classes.

• More specific in data object ranking than

other related methods.
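The CODB basic steps (rank every instance by COF, then keep the N instances with the lowest COF) can be sketched as follows. This is a brute-force illustration under stated assumptions, not the authors' Weka implementation: attributes are assumed numeric, each instance is counted among its own K nearest neighbours, and the names `codb_top_n` and `_dist` are hypothetical.

```python
import heapq
import math


def _dist(x, y):
    """Euclidean distance between two numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def codb_top_n(rows, labels, k, alpha, beta, top_n):
    """Return [(cof, index), ...] for the top_n instances with the lowest COF."""
    if not 1 <= k <= len(rows):
        raise ValueError("k must be between 1 and the dataset size")
    scored = []
    for i, (t, label) in enumerate(zip(rows, labels)):
        # Distances to every instance, T itself included (distance 0).
        nearest = sorted((_dist(t, other), j) for j, other in enumerate(rows))[:k]
        pcl = sum(1 for _, j in nearest if labels[j] == label) / k
        k_dist = sum(d for d, _ in nearest)
        deviation = sum(_dist(t, rows[j])
                        for j in range(len(rows)) if labels[j] == label)
        if deviation == 0:
            raise ValueError("Deviation is zero; COF is undefined for this instance")
        scored.append((k * pcl + alpha / deviation + beta * k_dist, i))
    return heapq.nsmallest(top_n, scored)


rows = [(0, 0), (0, 1), (1, 0), (1, 1),          # class "a" cluster
        (10, 10), (10, 11), (11, 10), (11, 11),  # class "b" cluster
        (0.5, 0.5)]                              # labelled "b" but sits among "a"
labels = ["a"] * 4 + ["b"] * 4 + ["b"]
print(codb_top_n(rows, labels, k=3, alpha=10, beta=0.1, top_n=1)[0][1])  # 8
```

The toy data places one "b" instance inside the "a" cluster; it gets the lowest COF because none of its other nearest neighbours share its label and it lies far from the rest of its own class.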

34 - Experimental results

• The CODB algorithm has been applied to five different real-world datasets.

• All the datasets are publicly available at the UCI machine learning repository.

• The datasets are chosen from various domains, with single or mixed data types and with two or more class labels. This variety is used to show the capabilities of the proposed algorithm.

35 - We performed experiments on the

following datasets:

• Votes dataset: Nominal, 2 class labels.

• Hepatitis dataset: Mixed, 2 class labels.

• Heart‐statlog dataset: Mixed, 2 class labels.

• Credit approval (credit‐a) dataset: good mix of attributes (continuous, nominal with small numbers of values, and nominal with larger numbers of values), 2 class labels.

• Vehicle dataset: Continuous, 4 class labels.

36 - Votes dataset experimental results

• 1984 United States Congressional Voting

Records Database.

• Includes votes for each of the U.S. House of

Representatives Congressmen on the 16 key

votes.

• 16 Boolean attributes + class name = 17

• 435 instances, 2 classes (61.38% Democrats,

38.62% Republicans).

37 - THE TOP 20 CLASS OUTLIERS OF HOUSE‐VOTE‐84

DATASET

• K = 7

• Top N COF = 20

• Distance type: Euclidean Distance

• α = 100

• β = 0.1

• Remove Instance With Missing Values: false

• Replace Missing Values: false

38 - The top 20 class outliers of house‐vote‐84 dataset (Cont.)

#    Inst. #  PCL      Dev   KDist       COF
1        407    1   896.24     6.0   1.71158
2        375    1   881.78    8.07   1.92051
3        388    1   857.35    8.07   1.92375
4        161    1   819.35    8.49   1.97058
5        267    1   523.66    8.49   2.03949
6         71    1   535.52    9.44   2.13061
7         77    1   799.64   10.39   2.16429
8        325    2   846.64    8.49   2.96664
9        160    2   829.17    8.49   2.96913
10       382    2   851.76    9.02   3.01986
11       176    2   519.94    8.49   3.04086
12       384    2   832.95    9.34    3.0543
13       365    2   836.65    9.44    3.0634
14         6    2   849.33    9.76    3.0934
15       355    2   524.28    9.12   3.10283
16       164    2   845.19   10.07   3.12576
17       402    2   480.79   10.93   3.30081
18       151    2   879.84    12.0   3.31366
19       173    3    839.0    8.07    3.9263
20        75    3   841.28    9.44   4.06275

The PCL column is shown scaled by K (K × PCL), so a value of 1 corresponds to PCL = 1/7.

39 - Comparison Study

• We performed a comparison study with He’s

method (2002, 2004).

• Semantic Outlier Factor vs. Class Outlier

Factor (SOF vs. COF).

40 - SOF vs. COF

41 - THE TOP 20 SEMANTIC OUTLIERS FOR

HOUSE‐VOTE‐84 DATASET

#    Inst. #      SOF
1        176   0.3036
2         71   0.3394
3        355   0.3645
4        267   0.3659
5        183   0.8726
6         97   0.9892
7         88   1.0724
8        402   1.1690
9        407   1.3309
10       248   1.3487
11       375   1.4520
12       151   1.4927
13       372   1.4950
14       388   1.6365
15         2   1.6489
16       382   1.6727
17       215   1.7010
18       164   1.7168
19         6   1.7236
20       325   1.7259

42 - Comparison Study (Cont.)

• Instance # 176 of votes dataset

– Rank 1 (the top) using SOF.

– Rank 11 using COF.

– PCL(176) = 2/7 → there is another instance of the

same class within the seven nearest neighbours.

43 - Comparison Study (Cont.)

• Instance # 407 of votes dataset

– Rank 9 using SOF.

– Rank 1 (the top) using COF.

– PCL(407) = 1/7.

– The Deviation is the greatest, which implies a kind of uniqueness in the behaviour of the instance (object).

– The K-Distance of the instance is very small (high density of instances of the other class).

– SOF ranks instance 407 only 9th, which indicates SOF's inability to recognize such important cases.

44 - Conclusions

• In this research we proposed and introduced:

– A novel approach for Class Outlier mining based on the K nearest neighbours, using a distance‐based similarity function to determine the nearest neighbours.

– The motivation for Class Outliers and their significance as exceptional cases.

– A ranking score, the Class Outlier Factor (COF), to measure the degree of being a Class Outlier for an object.

45 - Conclusions (Cont.)

– An efficient algorithm for mining and detecting Class Outliers.

– An implementation developed using the Weka framework.

• We presented:

– Experimental results of the algorithm applied to datasets from various domains (medical, business, and other domains) and of different types (continuous, nominal with small numbers of values, nominal with larger numbers of values, and mixed).

46 - Conclusions (Cont.)

– A comparison study with other methods; the results show that our proposed algorithm gives more plausible and reasonable results than the others. In addition, it handles mixed data types and more than two class labels.

47 - Future work

• Proposing Class Outlier Detection Model.

• Taking advantage of the output of this work to find a scheme to induce Censored Production Rules (CPRs) from large datasets.

• Developing a weighted distance similarity

function, where feature weight determination

might be based on the information gain.

48 - Thank You! Q&A