This page reproduces the content of http://www.slideshare.net/HideoHirose/bump-hunting.

Uploaded on 2015/10/20 in the Learning category.

In difficult classification problems, where z-dimensional points must be divided into two groups with 0-1 responses and the data structure is messy, it is more favorable to search for the denser regions of points assigned response 1 than to find boundaries that separate the two groups. For such problems, often seen in customer databases, we have developed a bump hunting method using probabilistic and statistical methods. By specifying a pureness rate in advance, a maximum capture rate is obtained; a trade-off curve between the pureness rate and the capture rate can then be constructed. To find the maximum capture rate, we use the decision tree method combined with a genetic algorithm. We first give a brief introduction to our research: what bump hunting is, the trade-off curve between the pureness rate and the capture rate, bump hunting using the tree genetic algorithm, and the upper bounds for the trade-off curve using extreme-value statistics. Then, the accuracy of the trade-off curve is assessed from the viewpoint of the genetic algorithm procedure. Using the newly proposed genetic algorithm procedure, we can obtain an accurate upper bound for the trade-off curve, and thus may expect to know the actually attainable upper bound. The bootstrapped hold-out method, as well as the cross-validation method, is used in assessing the accuracy of the trade-off curve.

- Bump Hunting and Its Application to Customer Data, by H. Hirose et al. 0

Bump Hunting and Its Application to Customer Data

H. Hirose

Department of Systems Design and Informatics

Faculty of Computer Science and Systems Engineering

Kyushu Institute of Technology

Fukuoka, 820-8502 Japan

[Figure: capture rate (0 to 1) plotted against pureness rate p (0 to 1)]

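Symposium: Statistical Theory and Methodology for Advanced Information Extraction, and Their Applications

Kyushu University Library, Audio-Visual Hall, 11/20-11/22, 2008 - 1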

background and objectives

[Figure: data table for a two-class problem with sample size N; each record has a response (1 or 0) and feature variables 1, 2, 3, ..., m-1, m] - 2

In the cases where the information on customer preferences is abundant, it is rather easy to classify the favorable customers: it is easy to find boundaries that classify the feature points clearly.

classification: linear discriminant analysis, nearest neighbors, logistic regression, neural networks, SVM

many classification problems - 3

a real messy customer database (real data)

[Figure: some of the 0/1 responses projected onto a 2-dimensional feature variable space; red: response 1, blue: response 0; axis: explanation variable A] - 4

finding denser regions instead of classification

In the cases where the information on customer preferences is abundant, it is rather easy to classify the favorable customers, because it is easy to find boundaries that classify the feature points clearly. Classification methods apply: linear discriminant analysis, nearest neighbors, logistic regression, neural networks, SVM.

In our case, there are fewer chances to collect customer preferences, so the amount of information is not so large, and it does not seem easy to classify the favorable customers: it is difficult to draw boundaries that discriminate response 1 points from response 0 points. We find denser regions of response 1 instead (bump hunting). - 5


use of the decision tree in finding denser regions (bumps)

We use the decision tree because rectangular box regions parallel to the axes correspond directly to the if-then rules described by a tree, and such rules are easy to apply to future actions.

use of the decision tree: if-then rule - 7

trade-off between pureness rate & capture rate

\[ \text{pureness-rate} = \frac{\#(\text{response 1 in target regions})}{\#(\text{response 1 and 0 in target regions})} \]

\[ \text{capture-rate} = \frac{\#(\text{response 1 in target regions})}{\#(\text{response 1 in total regions})} \]

Example:

\[ \text{pureness-rate} = \frac{7}{10} = 0.7, \qquad \text{mean pureness-rate} = \frac{12}{12 + 15} = 0.44, \qquad \text{capture-rate} = \frac{7}{12} = 0.58 \]

objectives

1. Under the condition that the pureness-rate of response 1 is pre-specified, find the bump where the capture-rate of response 1 becomes maximum.

2. Obtain the trade-off curve between the pureness-rate and the maximum capture-rate.

[Figure: a trade-off curve between the pureness-rate and the capture-rate; horizontal axis: pureness rate of response 1 (0 to 1), vertical axis: capture rate (0 to 1); the larger the pureness-rate, the smaller the capture-rate] - 8
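The two rates defined on this slide can be checked with a few lines of Python. This is a minimal sketch, not the author's code: the function names are illustrative, and the counts are the ones from the slide's example (a bump holding 10 points, 7 of them response 1, in a data set with 12 response 1 and 15 response 0 points).

```python
def pureness_rate(tp: int, fp: int) -> float:
    """Share of response 1 points among all points inside the target region."""
    return tp / (tp + fp)


def capture_rate(tp: int, fn: int) -> float:
    """Share of all response 1 points that fall inside the target region."""
    return tp / (tp + fn)


# Counts from the slide's example.
inside_1, inside_0 = 7, 3      # response 1 / response 0 points inside the bump
total_1, total_0 = 12, 15      # response 1 / response 0 points in the whole data

p = pureness_rate(inside_1, inside_0)
c = capture_rate(inside_1, total_1 - inside_1)
mean_p = total_1 / (total_1 + total_0)   # pureness of the whole data set

print(round(p, 2), round(c, 2), round(mean_p, 2))
```

Raising the required pureness shrinks the region that qualifies, which is why the capture rate falls as the pureness rate rises.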

rules to be applied

Capture the nodes in which the ratio of response 1 is greater than 40% (density larger than 40%, that is, 0.40 ≤ pureness rate).

50 + 25 + 10 = 85, so 85% of the response 1 points are captured:

\[ \text{capture rate} = \frac{85}{100} = 0.85 \]

[Figure: decision tree; each node shows the number of points for response 1 and the pureness of response 1] - 9

splitting result by the conventional decision tree algorithm

The root node holds 200 items for response 1 and 800 items for response 0 (density of response 1 = 20%). Capturing the nodes in which the density of response 1 is larger than 45% gives 45 + 8 = 53 points (53/200 = 27%).

[Figure: decision tree; each node shows the number of items for response 1, the number of items for response 0, and the explanation variable id used for the split]

example A - 10

The conventional tree is not globally optimal

A random selection of the explanation variables at each node yields a better result than the conventional automatic decision tree algorithm. From the same root node (200 items for response 1, 800 items for response 0; density of response 1 = 20%), capturing the nodes in which the density of response 1 is larger than 45% gives 58 + 22 + 16 = 96 points (96/200 = 48%, almost twice as many as in the automatic case).

[Figure: decision tree; each node shows the number of items for response 1, the number of items for response 0, and the explanation variable id used for the split]

example A - 11

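search for rules which show the maximum capture rate

[Figure: capture rate (0% to 100%) against pureness rate; each point is a classifier, and the skyline of the points is drawn; c_m is marked on the capture-rate axis and p_0 on the pureness-rate axis] - 12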

pureness-rate, capture-rate and TP, FP, TN, FN

confusion matrix (the bump region is the predicted-positive region):

              predicted P   predicted N
  actual P    TP            FN
  actual N    FP            TN

\[ \text{pureness rate} = \frac{\#TP}{\#TP + \#FP} \quad (\text{precision; counts within the bump region}) \]

\[ \text{capture rate} = \frac{\#TP}{\#TP + \#FN} \quad (\text{recall; relative to the total region}) \]

[Figure: bump region drawn over points with responses 0 and 1; the TP and FP points lie inside the bump region, the FN and TN points outside]

[Figure: capture rate (0% to 100%) against pureness rate; each point is a classifier, and the skyline of the points is drawn; c_m and p_0 are marked]

Related curves: recall/precision curve, receiver operating characteristic (ROC).

The recall/precision curve corresponds to one classifier of a tree, whereas the trade-off curve seeks the supremum over all the classifiers under the pre-specified pureness-rate. - 13
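The difference between one classifier's recall/precision curve and the trade-off curve can be illustrated with a small sketch. Here each candidate rule is reduced to a (pureness, capture) pair, and the trade-off value at p0 is the best capture rate among rules whose pureness is at least p0. The function names and the five example pairs are hypothetical, chosen only for illustration.

```python
from typing import List, Optional, Tuple

Rule = Tuple[float, float]  # (pureness rate, capture rate) of one classifier


def max_capture(rules: List[Rule], p0: float) -> Optional[float]:
    """Largest capture rate among rules whose pureness rate is at least p0."""
    feasible = [capture for pureness, capture in rules if pureness >= p0]
    return max(feasible) if feasible else None


def trade_off_curve(rules: List[Rule], grid: List[float]):
    """Skyline of the rule cloud, evaluated at each pre-specified pureness rate."""
    return [(p0, max_capture(rules, p0)) for p0 in grid]


# Hypothetical rules, not taken from the slides.
rules = [(0.30, 0.90), (0.42, 0.70), (0.47, 0.55), (0.52, 0.60), (0.75, 0.20)]
curve = trade_off_curve(rules, [0.4, 0.45, 0.5, 0.6, 0.7])
for p0, c_max in curve:
    print(p0, c_max)
```

The resulting curve is non-increasing in p0 by construction, because raising p0 can only shrink the set of feasible rules.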

why we should use random search method

Pattern 1

  pureness   CART   99.8 percentile point
  40%        117    200.607
  45%        47     138.796
  50%        47     133.947
  60%        47     119.059
  70%        47     69.8525

Pattern 2

  pureness   CART   99.8 percentile point
  40%        160    193.597
  45%        160    192.621
  50%        160    184.553
  60%        89     177.723
  70%        89     158.385

Pattern 3

  pureness   CART   99.8 percentile point
  40%        189    208.247
  45%        184    195.142
  50%        184    198.508
  60%        184    192.003
  70%        175    191.931

[Figure, one panel per pattern: values from 0 to 200 against pureness (35 to 75); series: CART, random search, 99.8 percentile point] - 14

Can Gini's improvement find the bump boundary?

Why Gini's index? - 15

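use of the Gini's index in splitting

Impurity:

\[ i(t) = \sum_{j=1}^{C} p(j \mid t)\{1 - p(j \mid t)\} = 1 - \sum_{j=1}^{C} \{p(j \mid t)\}^2 \]

Impurity after splitting node t into t_L and t_R, and improvement:

\[ \bar{i}(t) = p_L\, i(t_L) + p_R\, i(t_R), \qquad \Delta i(t) = i(t) - \bar{i}(t) \]

For a two-class node holding x_1 + x_2 points of one response and y_1 + y_2 points of the other, split at x into a left node (x_1, y_1) and a right node (x_2, y_2):

\[ \Delta i(t) = i(t) - \{p_L\, i(t_L) + p_R\, i(t_R)\} = \frac{2 (x_2 y_1 - x_1 y_2)^2}{(x_1 + y_1)(x_2 + y_2)(x_1 + x_2 + y_1 + y_2)^2} \]

Each term of the impurity is the variance of a class indicator. With

\[ \delta_k(j \mid t) = \begin{cases} 1, & k \in \text{class } j \\ 0, & k \notin \text{class } j \end{cases} \qquad p(j \mid t) = \frac{1}{n_t} \sum_{k=1}^{n_t} \delta_k(j \mid t) = \bar{\delta}(j \mid t) \]

and using \(\{\delta_k(j \mid t)\}^2 = \delta_k(j \mid t)\),

\[ V_j = \frac{1}{n_t} \sum_{k=1}^{n_t} \{\delta_k(j \mid t) - \bar{\delta}(j \mid t)\}^2 = \frac{1}{n_t} \sum_{k=1}^{n_t} \left[ \{\delta_k(j \mid t)\}^2 - 2\,\delta_k(j \mid t)\,\bar{\delta}(j \mid t) + \{\bar{\delta}(j \mid t)\}^2 \right] \]

\[ = \frac{1}{n_t} \sum_{k=1}^{n_t} \left[ \delta_k(j \mid t) - \{\bar{\delta}(j \mid t)\}^2 \right] = p(j \mid t)\{1 - p(j \mid t)\} \] - 16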
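As a numerical check on the two-class case, the following sketch computes the Gini improvement of a split directly from the definitions and compares it with the closed form 2(x2·y1 − x1·y2)² / (n_L · n_R · n²), where n_L = x1 + y1, n_R = x2 + y2 and n = n_L + n_R. The function names are illustrative; the closed form is the one that follows from the definitions of i(t), p_L and p_R for two classes.

```python
def gini(ones: int, zeros: int) -> float:
    """Two-class Gini index: 1 - p^2 - (1 - p)^2 = 2 p (1 - p)."""
    p = ones / (ones + zeros)
    return 2.0 * p * (1.0 - p)


def improvement(x1: int, y1: int, x2: int, y2: int) -> float:
    """Delta i(t) = i(t) - {p_L i(t_L) + p_R i(t_R)}, straight from the definitions."""
    n_l, n_r = x1 + y1, x2 + y2
    n = n_l + n_r
    parent = gini(x1 + x2, y1 + y2)
    children = (n_l / n) * gini(x1, y1) + (n_r / n) * gini(x2, y2)
    return parent - children


def improvement_closed_form(x1: int, y1: int, x2: int, y2: int) -> float:
    """Closed form of the same quantity for two classes."""
    n_l, n_r = x1 + y1, x2 + y2
    n = n_l + n_r
    return 2.0 * (x2 * y1 - x1 * y2) ** 2 / (n_l * n_r * n ** 2)


# A perfectly separating split removes all of the parent impurity (0.5).
print(improvement(1, 0, 0, 1), improvement_closed_form(1, 0, 0, 1))
# A partially separating split.
print(improvement(2, 1, 1, 2), improvement_closed_form(2, 1, 1, 2))
```

The improvement vanishes exactly when x2·y1 = x1·y2, that is, when both child nodes have the same class proportions as the parent.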

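decision tree finds the boundaries of the bump

boundaries of the bump found by Gini's index

[Figure: densities of response 1 and response 0 along x, with a bump; a split at x divides the node (x_1 + x_2, y_1 + y_2) into a left node (x_1, y_1) and a right node (x_2, y_2)]

Gini's index:

\[ \Delta i(t) = i(t) - \{p_L\, i(t_L) + p_R\, i(t_R)\} = \frac{2 (x_2 y_1 - x_1 y_2)^2}{(x_1 + y_1)(x_2 + y_2)(x_1 + x_2 + y_1 + y_2)^2} \] - 17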

The boundary of the bump can be found by Gini's improvement

[Figure: Gini index and Gini improvement along x from -5 to 5, for a bump f(x) ~ N(0,1) on a uniform base g(x) = c]

Splitting points may differ according to the volume of the base, but the amounts of fluctuation of the splitting points are small.

Other criteria provide the same results: Chi-square, Gini - 18

2-dimensional bump hunting simulation

In the 2-dimensional case of 0/1 responses:

response 1: 1000 uniform points in -5<x<5, -5<y<5, with 200 normal points from an N(0,1) bump

response 0: 5000 uniform points in -5<x<5, -5<y<5

bump region: -1.86<x<2.15, -1.55<y<2.43 - 19
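The simulation setting on this slide can be reproduced roughly with the standard library. This is a sketch under stated assumptions: the 200 normal points are taken to be in addition to the 1000 uniform response 1 points, the bump is centred at the origin with independent N(0,1) coordinates, and the seed and function names are arbitrary. It evaluates the box quoted on the slide rather than searching for it.

```python
import random


def simulate(seed: int = 0):
    """Generate response 1 and response 0 points as described on the slide."""
    rng = random.Random(seed)

    def uniform_point():
        return (rng.uniform(-5, 5), rng.uniform(-5, 5))

    ones = [uniform_point() for _ in range(1000)]
    ones += [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]  # the bump
    zeros = [uniform_point() for _ in range(5000)]
    return ones, zeros


def in_box(point, box) -> bool:
    (x_lo, x_hi), (y_lo, y_hi) = box
    return x_lo < point[0] < x_hi and y_lo < point[1] < y_hi


def rates(ones, zeros, box):
    """Pureness rate and capture rate of an axis-parallel box."""
    tp = sum(in_box(p, box) for p in ones)
    fp = sum(in_box(p, box) for p in zeros)
    pureness = tp / (tp + fp) if tp + fp else 0.0
    return pureness, tp / len(ones)


BOX = ((-1.86, 2.15), (-1.55, 2.43))  # bump region quoted on the slide
ones, zeros = simulate()
pureness, capture = rates(ones, zeros, BOX)
overall = len(ones) / (len(ones) + len(zeros))
print(f"pureness={pureness:.3f} capture={capture:.3f} overall density={overall:.3f}")
```

Inside the box the share of response 1 points should be clearly higher than the overall density of about 0.19, which is what makes the region a bump.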

3 kinds of simulation for bump hunting

[Figure: three settings (1), (2), (3) for response 0 and response 1, built from Gaussian and uniform components; proportions shown: 0.8 and 0.2 for (1); 0.1, 0.1 and 0.8 for (2); 0.2, 0.3 and 0.5 for (3)]

dimensions: 1, 2, 3, 4, 8, 16, 32, 64 - 20

theoretical and simulated trade-off (1)

[Figure: capture rate (0 to 1) against pureness rate (0.2 to 1), 1-dim. case; proportions 0.8 and 0.2] - 21

theoretical and simulated trade-off (1)

[Figure: capture rate (0 to 1) against pureness rate (0.2 to 1) for the 1-, 2-, 3-, 4-, 8- and 16-dim. cases; proportions 0.8 and 0.2] - 22

theoretical and simulated trade-off (2)

[Figure: capture rate (0 to 1) against pureness rate (0.2 to 1) for the 1-, 2-, 3-, 4-, 8- and 16-dim. cases; proportions 0.1, 0.1 and 0.8] - 23

theoretical and simulated trade-off (3)

[Figure: capture rate (0 to 1) against pureness rate (0.5 to 1) for the 1-, 2-, 3-, 4-, 8- and 16-dim. cases; proportions 0.2, 0.3 and 0.5] - 24

simulated bump regions

[Figure: 2-dim. case with proportions 0.2, 0.3 and 0.5; capture rate (0 to 1) against pureness rate (0.5 to 1)] - 25

Cases in which we cannot find any bumps using Gini's improvement

exceptional cases - 26

random search in trees

Genetic algorithm? - 27

generate the tree by the probabilistic method

[Figure: binary tree with nodes t1; t2, t3; t4, t5, t6, t7]

The conventional decision tree finds the optimal feature variable and the optimal splitting point from the top node downward by using Gini's index or entropy. But it will not necessarily capture the largest number of response 1 points.

We explore the optimal tree by generating trees with a greedy method in which, at each node t_i, the explanation variable is selected at random and the optimal splitting point is found by using Gini's index.

=> probabilistic method

=> genetic algorithm - 28
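The probabilistic tree generation described above can be sketched as follows: at each node the split variable is drawn at random, and the threshold on that variable is the one with the largest Gini improvement. This is a minimal illustration, not the author's implementation; the depth limit, the dictionary representation of a tree, the toy data and all names are assumptions.

```python
import random


def gini(labels):
    """Two-class Gini index 2p(1-p) of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_threshold(rows, labels, var):
    """Threshold on variable `var` with the largest Gini improvement (None if no gain)."""
    n, parent = len(rows), gini(labels)
    best_gain, best_thr = 0.0, None
    values = sorted({r[var] for r in rows})
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [y for r, y in zip(rows, labels) if r[var] <= thr]
        right = [y for r, y in zip(rows, labels) if r[var] > thr]
        gain = parent - len(left) / n * gini(left) - len(right) / n * gini(right)
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr


def grow(rows, labels, rng, depth=3):
    """Grow a tree: the split variable is random, the threshold is Gini-optimal."""
    leaf = {"n1": sum(labels), "n0": len(labels) - sum(labels)}
    if depth == 0 or leaf["n1"] == 0 or leaf["n0"] == 0:
        return leaf
    var = rng.randrange(len(rows[0]))
    thr = best_threshold(rows, labels, var)
    if thr is None:
        return leaf
    left = [(r, y) for r, y in zip(rows, labels) if r[var] <= thr]
    right = [(r, y) for r, y in zip(rows, labels) if r[var] > thr]
    return {
        "var": var,
        "thr": thr,
        "left": grow([r for r, _ in left], [y for _, y in left], rng, depth - 1),
        "right": grow([r for r, _ in right], [y for _, y in right], rng, depth - 1),
    }


def leaves(tree):
    if "var" not in tree:
        return [tree]
    return leaves(tree["left"]) + leaves(tree["right"])


def captured(tree, p0):
    """Response 1 points in leaves whose pureness rate is at least p0."""
    return sum(leaf["n1"] for leaf in leaves(tree)
               if leaf["n1"] / (leaf["n1"] + leaf["n0"]) >= p0)


# Hypothetical toy data: response 1 is dense inside a box in the first two variables.
data_rng = random.Random(1)
rows = [(data_rng.random(), data_rng.random(), data_rng.random()) for _ in range(150)]
labels = [int(data_rng.random() < (0.8 if 0.3 < r[0] < 0.7 and 0.3 < r[1] < 0.7 else 0.1))
          for r in rows]

trees = [grow(rows, labels, random.Random(seed)) for seed in range(10)]
best = max(captured(t, 0.45) for t in trees)
print("best number of captured response 1 points:", best)
```

Different seeds give different trees and different capture counts, which is the variation the genetic algorithm then exploits.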

cross-over in the tree

The genetic algorithm applied to the tree structure is different from the conventional one, where the structure is a one-dimensional line like genes. The splitting point at the root node of a tree has a definitely important meaning, so the branches with the root node are used as they are. Crossover is supposed to preserve the good inheritance.

[Figure: parent A with branches A and a; parent B with branches B and b; child Ab]

This is an example of a child Ab, consisting of the left-hand-side branch from parent A and the upper-side tree from parent B. - 29

cross-over in the tree

[Figure: parent A,a and parent B,b produce child Ab, child bA, child Ba and child aB]

In this manner, we can have 4 children from parents A and B. - 30
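One way to read the four children Ab, bA, Ba and aB is sketched below. The slides do not spell out the exact recombination, so the following are assumptions made only for illustration: upper-case letters denote a parent's left branch and lower-case letters its right branch, each branch stays on its own side, and the child takes the root split of the parent named first. The `Tree` type and the string placeholders are hypothetical.

```python
from typing import Any, Dict, NamedTuple


class Tree(NamedTuple):
    root: Any    # root split (variable and splitting point)
    left: Any    # left branch, written with an upper-case letter
    right: Any   # right branch, written with a lower-case letter


def crossover(parent_a: Tree, parent_b: Tree) -> Dict[str, Tree]:
    """Combine one branch from each parent; assumed root donor is the first-named branch."""
    return {
        "Ab": Tree(parent_a.root, parent_a.left, parent_b.right),
        "bA": Tree(parent_b.root, parent_a.left, parent_b.right),
        "Ba": Tree(parent_b.root, parent_b.left, parent_a.right),
        "aB": Tree(parent_a.root, parent_b.left, parent_a.right),
    }


# Placeholder parents: strings stand in for real splits and subtrees.
parent_a = Tree(root="root of A", left="A", right="a")
parent_b = Tree(root="root of B", left="B", right="b")
children = crossover(parent_a, parent_b)
for name, child in children.items():
    print(name, child)
```

Under these assumptions every child mixes branches from both parents while keeping one parent's root split intact, and the four children are all distinct.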

genetic algorithm for the tree evolution

evolution algorithm to the tree structure

1. Generate 30 initial trees at random and sort them from the larger capture-rates.

2. Take the top 10 trees and cross them over in pairs; combining the branches of two different parents produces 4 children:

- next generation #1,2,3,4: branches from top 1 and branches from top 10

- next generation #5,6,7,8: branches from top 2 and branches from top 9

- …

- next generation #17,18,19,20: branches from top 5 and branches from top 6

3. The evolution procedure is continued to 20 generations; the top 1 tree gives the maximum capture-rate for this set of initial seeds (local maximum case 1).

[Figure: capture rate (80 to 140) against evolution generation (0 to 25); the best tree from one set of initial seeds] - 31
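The pairing of the top 10 trees can be written down compactly. The slide shows only the first, second and last pairs, (top 1, top 10), (top 2, top 9) and (top 5, top 6); the sketch below assumes that the elided pairs follow the same pattern, rank i with rank 11 − i. The function names are illustrative.

```python
from typing import Dict, List, Tuple


def pairings(top: int = 10) -> List[Tuple[int, int]]:
    """Pair rank i with rank top + 1 - i (assumed pattern for the elided pairs)."""
    return [(i, top + 1 - i) for i in range(1, top // 2 + 1)]


def child_slots(pairs: List[Tuple[int, int]], per_pair: int = 4) -> Dict[Tuple[int, int], List[int]]:
    """Numbers of the next-generation trees produced by each pair of parents."""
    slots = {}
    for k, pair in enumerate(pairs):
        first = k * per_pair + 1
        slots[pair] = list(range(first, first + per_pair))
    return slots


pairs = pairings()
slots = child_slots(pairs)
for pair, numbers in slots.items():
    print(f"top {pair[0]} x top {pair[1]} -> next generation #{numbers}")
```

With 5 pairs and 4 children per pair, each generation holds 20 trees, which matches the numbering #1 to #20 on the slide.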

genetic algorithm for the tree evolution (2)

[Figure: the same procedure (30 random initial trees, selection of the top trees, evolution over 20 generations, top 1 capture-rate maximum) repeated as local maximum case 1, case 2, case 3, ..., case 20]

Why?

20 cases with different initial seeds are dealt with similarly. - 32

simulated densities of the feature variables

simulated data mimicking a real customer database, for simplicity

[Figure: marginal densities in one feature variable (histograms over -0.2 to 1 for var01 to var08 and var11 to var18), for response 0 (800 samples) and response 1 (200 samples)]

[Figure: marginal densities in two feature variables, with the bump region marked]

simulated data - 33

local convergence in the GA and estimated return period

From many initial sets of seeds in the genetic algorithm for the decision tree, different capture-rates are obtained; each final point is a local maximum.

[Figure: number of captured points for response 1 (80 to 140) against the iteration number of the evolution procedure (0 to 25), for several sets of initial seeds; simulated data, p0 = 0.45]

[Figure: histogram for the 20 observed local maxima (number of captured points for response 1, 112.5 to 137.5) with the fitted density function]

The fitted distribution function is

\[ F(x) = \exp\left[ -\exp\left( -\frac{x - \gamma}{\eta} \right) \right] \]

The return period and its CI are obtained.

[Figure: bootstrap result for the return period, 500 cases] - 34
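The extreme-value step can be illustrated by fitting the Gumbel distribution F(x) = exp[−exp(−(x − γ)/η)] to a set of local maxima and reading off a high quantile. This sketch uses the method of moments, which is a simplification chosen here for brevity and is not claimed to be the estimation method used in the slides. The 20 values are hypothetical, and 0.998 is used as the quantile level because the slides refer to a 99.8 percentile point.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(xs):
    """Method-of-moments estimates: sd = pi*eta/sqrt(6), mean = gamma + 0.5772*eta."""
    eta = statistics.stdev(xs) * math.sqrt(6.0) / math.pi
    gamma = statistics.mean(xs) - EULER_GAMMA * eta
    return gamma, eta


def gumbel_cdf(x, gamma, eta):
    return math.exp(-math.exp(-(x - gamma) / eta))


def gumbel_quantile(q, gamma, eta):
    """Inverse of the distribution function: x_q = gamma - eta * ln(-ln q)."""
    return gamma - eta * math.log(-math.log(q))


# Hypothetical local maxima from 20 GA runs (numbers of captured response 1 points).
maxima = [113, 115, 116, 118, 118, 119, 120, 121, 121, 122,
          123, 123, 124, 125, 126, 127, 129, 131, 133, 137]

gamma, eta = fit_gumbel_moments(maxima)
upper = gumbel_quantile(0.998, gamma, eta)
print(f"gamma={gamma:.2f} eta={eta:.2f} 99.8 percentile point={upper:.1f}")
```

The quantile lies above the largest observed maximum, which is the sense in which it serves as an estimated upper bound. A confidence interval could be obtained by repeating the fit on bootstrap resamples of the maxima.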

trade-off curve and its upper bound

[Figure: capture rate (0 to 1) against pureness of response 1 (0 to 1), with p0 specified; usable rules; many local maxima are obtained by the GA; upper bound capture-rates (the return period and its CI) are estimated by using extreme-value statistics]

That's it? No. These curves could be optimistic, because we are using only the training data. - 35

accuracy evaluation by the simulated test data

[Figure: original data -> training data -> induced rule (the GA rule obtained by the training data) -> accuracy assessment with test data generated by the underlying distribution; accurate evaluation of the generalization error]

To find the accurate capture rate, we have to apply the test data to the optimized rule obtained by using the training data. - 36

test data analysis by naïve method

naïve method: if local pureness < p, discard the points in this node.

The best GA rule is obtained by the training data (20 reproductions, each from 30 initial values) and applied to 10 simulated test data cases generated by the underlying distribution.

[Figure: number of captures (0 to 200) against pre-specified pureness rate of response 1 (0 to 80); simulated data, p = 0.45; observed maximum value from 20 GA: 138; results for the 10 test data cases] - 37

test data analysis by naïve and relaxation methods

naïve method: if local pureness < p, discard the points in this node.

relaxation method: even if local pureness < p, collect the points in this node.

In both methods, the best GA rule obtained by the training data is applied to 10 simulated test data cases generated by the underlying distribution.

[Figure, naïve method: number of captures (0 to 200) against pre-specified pureness rate of response 1 (0 to 80); p = 0.45; observed maximum value from 20 GA: 138; the gap to the test data results is the bias]

[Figure, relaxation method: number of captures (0 to 200) against observed pureness rate of response 1 (0 to 0.8); p = 0.45; observed maximum value from 20 GA: 138] - 38
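The two ways of scoring a rule on test data can be contrasted with a short sketch. Each node selected by the training rule is described by its test-data counts of response 1 and response 0 points; the naïve method drops nodes whose local pureness on the test data falls below p, while the relaxation method keeps every node and reports the observed pureness. The function name and the three example nodes are hypothetical.

```python
from typing import List, Tuple


def evaluate(nodes: List[Tuple[int, int]], p: float, relax: bool = False):
    """Return (captured response 1 points, observed pureness) on the test data.

    nodes: (ones, zeros) test-data counts for each node selected by the training rule.
    """
    kept = [(o, z) for o, z in nodes
            if relax or (o + z > 0 and o / (o + z) >= p)]
    captured = sum(o for o, _ in kept)
    total = sum(o + z for o, z in kept)
    pureness = captured / total if total else 0.0
    return captured, pureness


# Hypothetical test-data counts in three nodes chosen by the training rule.
test_nodes = [(30, 20), (12, 18), (25, 25)]

naive = evaluate(test_nodes, 0.45)
relaxed = evaluate(test_nodes, 0.45, relax=True)
print("naive:     ", naive)
print("relaxation:", relaxed)
```

In this example the second node has a local pureness of 0.40 on the test data, so the naïve method discards its 12 response 1 points, while the relaxation method keeps them at the cost of a lower observed pureness.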

(naïve) bootstrap

classification accuracy evaluation method 2

[Figure: real data; the rule is induced by the GA maximum from the original data 1, 2, ..., n; the bootstrap samples (1_1*, 2_1*, ..., n_1*), (1_2*, 2_2*, ..., n_2*), ..., (1_b*, 2_b*, ..., n_b*) serve as test data for the evaluations; the accuracy is the mean of the evaluations] - 39

(naïve) bootstrap result

[Figure: capture rate (0% to 25%) against pureness of response 1 (0% to 80%), using the test data; specified p0 = 0.45; bootstrap 10 cases; top 1 GA maximum; real data, 1/10 model]

optimistic estimator (63.2% samples are different from each other) - 40
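The 63.2% figure can be checked numerically, reading it as the expected share of distinct original records in a bootstrap sample of size n: 1 − (1 − 1/n)^n, which tends to 1 − 1/e ≈ 0.632. The sketch below compares the formula with one simulated resample; the function names and the seed are arbitrary.

```python
import math
import random


def expected_distinct_fraction(n: int) -> float:
    """Probability that a given record appears at least once in n draws with replacement."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulated_distinct_fraction(n: int, rng: random.Random) -> float:
    """Share of distinct original records in one bootstrap sample of size n."""
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n


limit = 1.0 - math.exp(-1.0)
print(round(expected_distinct_fraction(1000), 4), round(limit, 4))
print(round(simulated_distinct_fraction(10000, random.Random(0)), 4))
```

Because a bootstrap sample shares most of its records with the data used to induce the rule, evaluating the rule on it tends to give an optimistic estimate.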

10 fold cross-validation

classification accuracy evaluation method 3

[Figure: real data; the original data are divided into parts 1, 2, ..., 10; in each fold, 9 parts form the training data, from which the rule is induced by the GA, and the remaining part forms the test data for the evaluation; the accuracy is the mean of the 10 evaluations] - 41

the bias obtained by the 10 fold cross validation

[Figure: capture rate (0% to 25%) against pureness of response 1 (0% to 60%), using the training data and using the test data; 10 fold CV, top 1 GA maximum, each 10 points, with the mean; specified p0 = 0.45; the gap between the two panels is the bias; real data, 1/10 model]

computing time is extremely long (80 hours) -> smaller model - 42

the bias obtained by the 10 fold cross validation

[Figure: capture rate (0% to 35%) against pureness of response 1 (0% to 60%), using the training data and using the test data; 10 fold CV, top 1 GA maximum, each 10 points, with the mean; specified p0 = 0.45; the gap between the two panels is the bias; real data, 1/100 model]

Even the relaxation method behaves badly (almost collapsed). - 43

bootstrapped hold-out

classification accuracy evaluation method 3

[Figure: real data; from the original data 1, 2, ..., n (size n), b pairs of training data (1_1*, 2_1*, ...; size n/2) and test data (1_1**, 2_1**, ...; size n/2) are drawn; rule-1, rule-2, ..., rule-b are induced by the GA from the training data and evaluated on the paired test data; the bias values are averaged (mean)] - 44
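The bootstrapped hold-out scheme in the diagram can be sketched as repeated half-and-half splits. The slides do not specify how each pair of samples is drawn, so this sketch assumes that every replicate shuffles the n record indices and takes the first n/2 as training data and the next n/2 as test data, without overlap. The function names, the seed and the example rates are hypothetical.

```python
import random
from typing import List, Tuple


def bootstrapped_hold_out(n: int, b: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return b (training indices, test indices) pairs, each of size n // 2."""
    rng = random.Random(seed)
    half = n // 2
    splits = []
    for _ in range(b):
        idx = list(range(n))
        rng.shuffle(idx)
        splits.append((idx[:half], idx[half:2 * half]))
    return splits


def mean_bias(train_rates: List[float], test_rates: List[float]) -> float:
    """Average gap between the capture rate on training data and on test data."""
    gaps = [tr - te for tr, te in zip(train_rates, test_rates)]
    return sum(gaps) / len(gaps)


splits = bootstrapped_hold_out(n=1000, b=10)
print(len(splits), len(splits[0][0]), len(splits[0][1]))

# Hypothetical capture rates of two induced rules on their training and test halves.
print(round(mean_bias([0.20, 0.22], [0.15, 0.17]), 3))
```

Each replicate needs only one GA run on half of the data, compared with ten runs on nine tenths of the data for 10 fold cross-validation, which is consistent with the shorter computing time reported for this method.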

the bias obtained by the bootstrapped hold-out

[Figure: capture rate (0% to 25%) against pureness of response 1 (0% to 60%), using the training data and using the test data; BHO 10 cases, top 1 GA maximum, each 10 points, with the mean; specified p0 = 0.45; the gap between the two panels is the bias; real data, 1/10 model]

The computing time is shortened and the accuracy is preserved, but the result is a little bit optimistic. - 45

tree genetic algorithm

old (BHO): we have been using only the training data in the GA tree procedure; at each generation, the top 10 trees are selected by applying the training data.

new: we divide the data into 3 parts (training data, evaluation data and test data); at each generation, the top 10 trees are selected by applying the evaluation data.

At each evolution generation stage, we produce the trees using the training data, and select the best trees using the evaluation data. Then, we can expect that the final-stage results could be the local maxima for the evaluation data, and we may apply extreme-value statistics to these final results. Then, we apply the final rule to the test data for the accuracy assessment. - 46

evolution in the tree GA and the return period

BHO, real data

[Figure: capture rate against evolution generation, using the evaluation data; case 1 to case 20, with the top trees selected by the evaluation data]

The capture-rate converges to a final value within 10 generations, both in the training data and in the evaluation data.

[Figure: extreme-value density using the estimated parameters, fitted using the 20 final best capture rates; horizontal axis from 100 to 250] - 47

similarity between the evaluation and the test

BHO, real data; 200 initial cases; pre-specified pureness rate = 45%

Gumbel distribution for maxima:

\[ f(x) = \frac{1}{\eta} \exp\left( \frac{\gamma - x}{\eta} \right) \exp\left( -\exp\left( \frac{\gamma - x}{\eta} \right) \right) \]

[Figure: histograms of the observed evaluation data results and of the observed test data results (capture rates from 0.04 to 0.16), each with a Gumbel distribution fit]

[Figure: relation between the evaluation data result and the test data result, both from 0.04 to 0.16]

We may estimate the upper bound of the trade-off curve by using the test data results. - 48

accurate trade-off curve using the test data

[Figure: capture rate (0 to 1) against pureness of response 1 (0 to 1), with p0 specified; curves: rules using only the training data; maximum capture-rates estimated by using extreme-value statistics with the training data; rules using the training data; maximum capture-rates estimated by using extreme-value statistics with the test data] - 49

actual trade-off curve and its upper bound

[Figure: capture rate (0 to 0.4) against pureness of response 1 (0.2 to 0.8), real data, at pureness 40%, 45%, 50%, 60% and 70%; 10 cases of the best from 20 local maxima by the new tree-GA with the test data, with their mean; 10 cases of the 99.8% return period, with their mean]

The upper bound for the trade-off curve using extreme-value statistics can be estimated by using the new tree-GA with the test data. - 50

1. To find the denser regions of response 1 points in data with a large number of feature variables, we have proposed the use of the bump hunting method.

2. To evaluate the bump hunting method, we have shown that the trade-off curve is useful.

3. To construct the trade-off curve, we have used the decision tree, the genetic algorithm, and extreme-value statistics.

4. We have shown that the trade-off curve using the training data could be optimistic.

5. To use the test data at less computing cost, we have proposed the bootstrapped hold-out method instead of cross-validation.

6. To estimate the accurate upper bound of the trade-off curve, we have developed the new GA tree, which uses three sets of sampled data: training, evaluation and test data.

7. The evaluation data results follow extreme-value statistics, and using the similarity between the evaluation data results and the test data results, we can estimate the accurate trade-off curve.

conclusions - 51


thank you