This page reproduces the content of http://www.slideshare.net/paul_kyeong/topological-data-analysis-with-examples.

Uploaded on 2014/05/30 (category: Learning).

Topological data analysis methodologies will be introduced with example studies.

- Topological Data Analysis

-Methods and Examples-

Sunghyon Kyeong

Severance Biomedical Science Institute,

Yonsei University College of Medicine

- Contents

• Brief overview of topological data analysis

• About Ayasdi

Startup company providing solutions for data analytics

• Topological data analysis - Methods

• Applications to medical science data

• Applications to social science data

Sunghyon Kyeong (Yonsei University) | sunghyon.kyeong@gmail.com | Topological Data Analysis: Methods and Examples | p 2

- Brief Overview of Topological Data Analysis

- Machine Learning

• Supervised Machine Learning

→ Classification of new input data

(LDA, Bayesian classifiers, support vector machines, neural networks, and so on)

• Unsupervised Machine Learning

→ Clustering of a given dataset / community detection

(k-means clustering, modularity optimisation, ICA, PCA, and so on)

• Topological Data Analysis

→ Partial clustering that allows overlaps among clusters

- World Interest in TDA

Heat map of the viewers of my TDA slides on SlideShare (2,500 viewers during 2015.2.14. - 2014.11.31.)

- Data has Shape
- An example

Raw Data (diabetes-related data)

- Shape has Meaning
- An example

[Figure labels: Type II diabetes, Normal, Type I diabetes]

- Meaning drives “Values”
- When to use TDA?

• To study complex high-dimensional data: feature selection is not required in TDA

• To extract shapes (patterns) of data

• When qualitative insight is needed

• When summaries are more valuable than individual parameter choices

- Algebraic Topology and Geometric Topology

Algebraic topology: mathematically defined “holes” in data

Betti0: clusters

Betti1: holes

Betti2: voids

[Figure: two example shapes, one with Betti0: 10, Betti1: 5 and one with Betti0: 8, Betti1: 5]

Geometric topology

- Algebraic Topology

Quantitative Information

Persistent homology is a special type of homology that is useful for data analysis.

Betti numbers, which come from computing homology, reflect the topological properties of an object.

Betti0: the number of connected components

Betti1: the number of holes

[Figure: example shapes B, O, ㅂ, q, b]
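As a small illustration of Betti0 and Betti1, the sketch below computes both for a graph, where Betti0 is the number of connected components and Betti1 = (edges) - (vertices) + Betti0 counts the independent loops. The function name and the example graphs are hypothetical and do not come from the slides; real point clouds need a persistent homology library rather than this toy.

```python
def betti_numbers_of_graph(num_vertices, edges):
    """Return (betti0, betti1) of an undirected graph given as an edge list."""
    parent = list(range(num_vertices))

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    components = num_vertices
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            components -= 1

    betti0 = components
    # Cycle rank of a graph: E - V + number of connected components.
    betti1 = len(edges) - num_vertices + betti0
    return betti0, betti1


# A single closed loop, like the letter "O": one component, one hole.
loop = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(betti_numbers_of_graph(4, loop))                # (1, 1)

# Adding a chord gives two loops, like the letter "B".
print(betti_numbers_of_graph(4, loop + [(0, 2)]))     # (1, 2)
```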

- Homology

[Figure: one pair of homeomorphic shapes with Betti2 = 1, and one pair of homeomorphic shapes with Betti2 = 0]

Ref) Xiaojin Zhu, IJCAI 2013 presentation slide

- Geometric Topology

Extracting Shapes of Data

[Figure: point cloud data → ? → topology]

Ref) Figures are obtained from P. Y. Lum et al. (2013) Scientific Reports | 3: 1236

- Topological Data Analysis using Mapper

Two input functions

- filter: maps each high-dimensional data point to a single value

- distance: a measure of the distance between data points

Resolution Parameters

- Intervals, overlap, magic fudge

- Filter: divide the point cloud into filter bins

Filter Range: 0.0~2.0

Intervals: 5

Overlap: 50%

Interval Length: 0.4

[Figure: histogram of the number of nodes in each filter bin (y-axis) against the filter metric (x-axis, 0 to 2.0); one bin is annotated "single cluster" and another "one or two cluster(s)?"]
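The binning step can be sketched as follows. The overlap convention is an assumption, since the slides do not spell it out: the filter range is tiled by base intervals of length (range / intervals), and each interval is widened symmetrically so that neighbours share `overlap` times the base length. The names `build_cover` and `assign_to_bins` are hypothetical.

```python
def build_cover(lo, hi, num_intervals, overlap):
    """Return overlapping (start, end) intervals covering [lo, hi].

    Assumed convention: base length = (hi - lo) / num_intervals, and each
    interval is padded by overlap * base / 2 on both sides.
    """
    base = (hi - lo) / num_intervals
    pad = base * overlap / 2.0
    return [(lo + i * base - pad, lo + (i + 1) * base + pad)
            for i in range(num_intervals)]


def assign_to_bins(filter_values, intervals):
    """Return, for each interval, the indices of the points that fall in it."""
    bins = [[] for _ in intervals]
    for index, value in enumerate(filter_values):
        for b, (start, end) in enumerate(intervals):
            if start <= value <= end:
                bins[b].append(index)  # a point may land in two bins
    return bins


# Parameters from the slide: range 0.0~2.0, 5 intervals, 50% overlap.
intervals = build_cover(0.0, 2.0, 5, 0.5)
print(assign_to_bins([0.05, 0.45, 1.0], intervals))
```

Because the bins overlap, the point with filter value 0.45 is placed in both the first and the second bin; this shared membership is what later produces edges between nodes.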

- Filter Function

- A filter function is not necessarily a linear projection of the data matrix.

- People often use functions that depend only on the distance function itself, such as a measure of centrality.

- Some filter functions may not produce any interesting shapes.

Ref) Figures are obtained from P. Y. Lum et al. (2013) Scientific Reports | 3: 1236

- Distance Function

- The distance is computed between all pairs of data points.

- Either Euclidean or geodesic distances can be used.

- Distance & Clustering

• A single-linkage dendrogram is used for clustering the point cloud based on the distance between two points.

• Magic Fudge is the number of bins in the distribution of the distances obtained from the single-linkage dendrogram.

• The number of clusters is estimated from the number of consecutive bins having zero elements.

[Figure: two histograms of single-linkage distances, one labelled "1 Cluster" and one labelled "2 Clusters"]
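One plausible reading of this heuristic is sketched below. It is an assumption rather than the exact procedure behind the slides: the single-linkage merge heights are taken from a minimum spanning tree, histogrammed into `num_bins` bins (the "magic fudge"), and the dendrogram is cut at the first empty bin. Both function names are hypothetical.

```python
def single_linkage_heights(dist):
    """Sorted merge heights of the single-linkage dendrogram.

    These equal the edge lengths of a minimum spanning tree (Prim's algorithm).
    """
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n
    best[0] = 0.0
    heights = []
    for step in range(n):
        v = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[v] = True
        if step > 0:
            heights.append(best[v])
        for u in range(n):
            if not in_tree[u] and dist[v][u] < best[u]:
                best[u] = dist[v][u]
    return sorted(heights)


def estimate_num_clusters(dist, num_bins):
    """Cut the dendrogram at the first empty histogram bin (assumed rule)."""
    heights = single_linkage_heights(dist)
    if not heights or heights[0] == heights[-1]:
        return 1
    lo, hi = heights[0], heights[-1]
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for h in heights:
        counts[min(int((h - lo) / width), num_bins - 1)] += 1
    for b, count in enumerate(counts):
        if count == 0:
            cutoff = lo + b * width
            # Every merge above the cutoff is left undone.
            return 1 + sum(1 for h in heights if h > cutoff)
    return 1


# Two well-separated groups on a line (hypothetical data).
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
print(estimate_num_clusters(dist, 5))  # 2
```

The result depends strongly on the number of bins, which is why the slides treat it as a tuning parameter.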

- Nodes, Edges, Colors

Indices of the point cloud: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14

[Figure: network whose nodes are labelled with the indices of the points they contain: (11,13), (13), (8,11), (1,2,3,), (0,1,2,7), (3,4,8,12), (4,5), (12), (12,14), (14), (10), (5,6,9)]

Nodes are groups of similar objects

Edges connect similar nodes

Colors let you see values of interest
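In Mapper as described by Singh et al. (2007), two nodes are joined by an edge when they share at least one data point. The sketch below builds the edge list from node memberships; the node names and memberships are hypothetical and are not the ones in the figure.

```python
from itertools import combinations


def mapper_edges(nodes):
    """Return the edges between nodes that share at least one point.

    nodes: dict mapping a node id to the set of point indices it contains.
    """
    edges = []
    for a, b in combinations(sorted(nodes), 2):
        if nodes[a] & nodes[b]:
            edges.append((a, b))
    return edges


# Hypothetical partial clusters; point 2 and point 3 each sit in two nodes.
nodes = {"A": {0, 1, 2}, "B": {2, 3}, "C": {3, 4, 5}, "D": {6}}
print(mapper_edges(nodes))                               # [('A', 'B'), ('B', 'C')]
print({name: len(members) for name, members in nodes.items()})  # node sizes
```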

- Topology extraction

Filter Range: 0 ~ 2

Intervals: 5

Overlap: 50%

Magic Fudge: 5

[Figure: Filter → Partial Clustering]

The size of a node represents the number of data points in that node.

- Topology of a Y-shaped point cloud

[Figure, panels A) to C): the y-axis position of each point is extracted as the filter; the point cloud is divided into filter bins F1 to F10; histograms are shown for the 1st and 9th filter bins; the resulting network has nodes labelled F1 to F10]

[Figure: grid of nine Mapper outputs, one for each combination of interval = 5, 10, 15 and overlap = 20%, 50%, 80%]

- Mathematical details can be found at

• Gurjeet Singh et al. (2007), Topological methods for the analysis of high dimensional data sets and 3D object recognition.

• Gunnar Carlsson (2009), Topology and data, Bull. Amer. Math. Soc. 46, 255-308.

Gurjeet Singh and Gunnar Carlsson are co-founders of Ayasdi.

- US Patent by Ayasdi

- Applications to Neurobiological/Clinical Data

International Neuroimaging Data-sharing Initiative (INDI)

- URL: http://fcon_1000.projects.nitrc.org/index.html

- Healthy control, ADHD, and ASD data sets are available.

- Resting-state fMRI and diffusion tensor imaging data are available.

- Phenotype information such as intelligence scale and ADHD symptom severity is available.

- Dataset: ADHD Symptoms & IQ

Data set:

L2-distance (or Euclidean distance) for all pairs of subjects:

Distance Matrix:

       1     2     3     4     5
1    0.0  61.5  77.3  51.6  77.0
2   61.5   0.0  62.2  55.9  69.0
3   77.3  62.2   0.0  47.2  11.9
4   51.6  55.9  47.2   0.0  52.4
5   77.0  69.0  11.9  52.4   0.0

Filter Function: L-infinity eccentricity

$f(x) = \max_{y \in X} d(x, y)$
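The L-infinity eccentricity of each subject is the largest entry in that subject's row of the distance matrix. A minimal sketch using the five-subject matrix from the slide (the function name is hypothetical):

```python
# Pairwise L2-distances between five subjects, as listed on the slide.
DISTANCES = [
    [0.0, 61.5, 77.3, 51.6, 77.0],
    [61.5, 0.0, 62.2, 55.9, 69.0],
    [77.3, 62.2, 0.0, 47.2, 11.9],
    [51.6, 55.9, 47.2, 0.0, 52.4],
    [77.0, 69.0, 11.9, 52.4, 0.0],
]


def linf_eccentricity(dist):
    """f(x) = max over y of d(x, y), evaluated for every row x."""
    return [max(row) for row in dist]


print(linf_eccentricity(DISTANCES))  # [77.3, 69.0, 77.3, 55.9, 77.0]
```

Subject 4 has the smallest value and is therefore the most central of the five, while subjects 1, 3, and 5 lie towards the periphery.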

- Phenotypic subgroups of ADHD

N of subjects = 204

Low resolution

[Figure: network with subgroups labelled LIQ-ADHD (Low IQ & Ordinary Sx), AIQ-ADHD (Average IQ & Borderline Sx), HIQ-TDC (High IQ), TDC (Low IQ), and AIQ-TDC (Average IQ); colour scale from Low to High]

Distance function: L2-distance

Filter function: L-infinity eccentricity

Subgroup       Description          Subjects
HIQ-TDC        High IQ              13 (13 / 0)
AIQ-TDC        Average IQ           21 (17 / 4)
LIQ-ADHD       Ordinary Sx          12 (1 / 11)
AIQ-ADHD       Borderline Sx        19 (2 / 17)
All subjects   Avg. Sx & Avg. IQ    204 (90 / 114)

[Figure: plots of the ADHD, Inattentive, and Hyper/Impulsivity scores (axis 30 to 90; values 59, 60, 59 marked) and of FSIQ, VIQ, and PIQ (axis 70 to 130; values 108, 109, 105 marked) for each subgroup]

- Functional Modular Architectures

[Figure: functional modules shown for four subgroups: HIQ-TDC (TDC with High IQ), AIQ-TDC (TDC with Average IQ), LIQ-ADHD (ADHD with Ordinary Sx.), and AIQ-ADHD (ADHD with Borderline Sx.)]

Modules: DMN module, Central module, BG/THL module, Occipital module, Parietofrontal module, Temporal/Limbic module

※ Only positive weights of the functional connectivity are used for module analysis and visualised.

- Applications to Medical science data

- Diabetes Subtypes

based on six quantities:

- age

- relative weight

- fasting plasma glucose

- area under the plasma glucose curve for the three hour glucose tolerance test (OGTT)

- area under the plasma insulin curve for the OGTT

- steady state plasma glucose (SSPG) response

[Figure: scatter plot with axes SSPG, Insulin area, and Glucose area]

Ref) Eurographics Symposium on Point-Based Graphics, Singh G et al. (2007)

- Subtypes of diabetes

Low-resolution: Interval: 3, Overlap: 50%

High-resolution: Interval: 4, Overlap: 80%

[Figure: both networks show two flares labelled Type I diabetes and Type II diabetes]

Type I: adult onset; Type II: juvenile onset

Distance function: L2-distance

Filter function: density kernel with ε = 130,000

- Application to Biological Data, Subtypes of Breast Cancer

using a breast cancer microarray gene expression data set

- disease specific genomic analysis (DSGA) transformed data

PNAS, Monica Nicolau et al. (2011)

- Breast Cancer Subtype

ER: Estrogen Receptor

PNAS, Monica Nicolau et al. (2011)

- Diabetes Mellitus (DM)

• Commonly referred to as diabetes, a group of metabolic diseases.

• If left untreated, diabetes can cause many complications: cardiovascular disease, stroke, chronic kidney failure, foot ulcers, and damage to the eyes.

• Diabetes is due to either the pancreas not producing enough insulin or the cells of the body not responding properly to the insulin produced.

• Type 1 DM: results from the pancreas’s failure to produce enough insulin.

• Type 2 DM: begins with insulin resistance, a condition in which cells fail to respond to insulin properly. Excess body weight and insufficient exercise are causes of T2D.

• Gestational diabetes is the third main form and occurs when pregnant women without a previous history of diabetes develop a high blood-sugar level.

- Why subgroups of type 2 DM?

• Risk factors of Type 2 DM are obesity, family history of diabetes, physical inactivity, ethnicity, and advanced age.

• Type 2 DM is a heterogeneous, complex disease affecting more than 29 million people in America (9.3%) and 2 million in Korea.

• There is an increasing need for early prevention and clinical management of Type 2 DM.

- Methods

• High-dimensional EMRs and genotype data from 11,210 individuals from the outpatient population of Mount Sinai Medical Center (MSMC) (46% Hispanic, 32% African American, 20% European white, and 2% others).

• Type 2 DM and non-Type 2 DM were defined by an electronic phenotyping algorithm (eMERGE network) based on ICD-9-CM diagnosis codes, laboratory tests, prescribed medications (RxNorm), physician notes (natural language processing), and family history.

• The preprocessed data matrix has the form n patients by p medical variables.

Medical variables included 505 clinical variables and 7097 unique ICD-9-CM codes (1 to 218 per patient). On average, there were 64 clinical variables per patient. To avoid overfitting, only variables recorded for at least 50% of the patients were selected, resulting in 73 (of 505) variables used for the analysis.
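The variable-selection rule can be sketched as follows. This is only an illustration under assumed conventions: each patient is a dictionary of recorded variables, a variable counts as missing when it is absent or `None`, and the variable names are made up.

```python
def select_common_variables(records, min_fraction=0.5):
    """Keep the variables recorded for at least min_fraction of the patients.

    records: list of dicts mapping a variable name to its value;
    a variable that is absent or None counts as missing.
    """
    counts = {}
    for record in records:
        for name, value in record.items():
            if value is not None:
                counts[name] = counts.get(name, 0) + 1
    threshold = min_fraction * len(records)
    return sorted(name for name, count in counts.items() if count >= threshold)


# Hypothetical records for four patients.
patients = [
    {"hba1c": 7.1, "ldl": 130.0},
    {"hba1c": 6.4, "ldl": None},
    {"hba1c": 8.0, "rare_test": 1.2},
    {"hba1c": 6.9, "ldl": 110.0},
]
print(select_common_variables(patients))  # ['hba1c', 'ldl']
```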

- TDA pipeline

• Distance metric: the cosine distance was used to assess the similarity of the data points.

• Filter metric: L-infinity centrality and principal metric singular value decomposition (SVD1).

L-infinity centrality is defined for each data point y as the maximum distance from y to any other data point in the data set. Large values of this function correspond to points that are far from the centre of the data set.
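A minimal sketch of these two ingredients, assuming the usual definition of cosine distance as one minus the cosine similarity and non-zero input vectors. The function names and the three example vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def linf_centrality(points, distance):
    """For each point y, the maximum distance from y to any other point."""
    return [max(distance(p, q) for q in points) for p in points]


# Three hypothetical patient vectors.
patients = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(linf_centrality(patients, cosine_distance))
```

The third vector lies between the other two in direction, so it receives the smallest centrality value, matching the statement that large values mark points far from the centre.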

[Figure labels: 3889 patients; 7321 patients]

- TDA with only 2551 Type II DM

Subtype 1: 762

Subtype 2: 617

Subtype 3: 1096

Gender is not an organising factor in the topology.

A reproducibility test (2/3 training, 1/3 test data set, 10 times) revealed that the overall accuracy was 96% for subtype classification.

- Applications to Social Science Data

- Classification of (basketball) player types

- Partial clustering of personality using temperamental traits

- Relationship among welfare, civil construction, and suicide rate

- Classification of player types

based on their in-game performance, such as rates (per minute played) of rebounds, assists, turnovers, steals, blocked shots, personal fouls, and points scored (7 performance measures)

- Map of Players

low resolution map at 20 intervals

high resolution map at 30 intervals

Traditionally, basketball players are categorised into guards, forwards, and centers.

Colour: Points Per Game (Low to High)

Distance function: variance normalised L2-distance

Filter function: principal and secondary SVD values
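A sketch of a variance normalised L2-distance, under the assumption that "variance normalised" means dividing each performance measure by its standard deviation before taking Euclidean distances. The population standard deviation is used here, which is also an assumption, and the player rows are made up.

```python
import math
import statistics


def variance_normalised_distances(rows):
    """Pairwise Euclidean distances after scaling each column to unit variance.

    Assumes every column has non-zero spread.
    """
    columns = list(zip(*rows))
    stds = [statistics.pstdev(column) for column in columns]
    scaled = [[x / s for x, s in zip(row, stds)] for row in rows]
    return [[math.dist(a, b) for b in scaled] for a in scaled]


# Hypothetical per-minute rates: (rebounds, assists, points) for three players.
players = [
    [0.30, 0.05, 0.40],
    [0.10, 0.25, 0.55],
    [0.20, 0.10, 0.35],
]
for row in variance_normalised_distances(players):
    print([round(d, 2) for d in row])
```

Scaling keeps a measure with a large numeric range, such as points scored, from dominating the distance.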

- TDA versus K-means clustering

Grouping subjects into personality types

- Novelty seeking

Reward dependence

Harm avoidance

Persistence

Temperament and character inventory (TCI) scores from 40 normal subjects

- Subject Grouping

The goal of k-means clustering is to minimise the within-cluster sum of squares:

$V = \sum_{i=1}^{2} \sum_{x_j \in S_i} \| x_j - \mu_i \|^2, \quad S = \{S_1, S_2\}$

where $\mu_i$ is the mean of the points in $S_i$.

[Figure: A. Input Dataset and B. k-means Clustering. Both panels plot Harm Avoidance against Novelty Seeking. Panel B marks the centroids (×) of two clusters: Introverts (high HA & low NS) and Extraverts (low HA & high NS)]
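The objective can be made concrete with a small sketch of Lloyd's algorithm, the standard iterative procedure for k-means. The slides do not say which implementation was used, so this is an assumption, and the (Novelty Seeking, Harm Avoidance) scores below are made up.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Put each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm starting from the given initial centroids."""
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        converged = new_centroids == centroids
        centroids = new_centroids
        if converged:
            break
    return centroids, assign(points, centroids)


def within_cluster_sum_of_squares(centroids, clusters):
    """V = sum over clusters S_i and points x_j in S_i of ||x_j - mu_i||^2."""
    return sum(squared_distance(p, mu)
               for mu, cluster in zip(centroids, clusters) for p in cluster)


# Hypothetical (Novelty Seeking, Harm Avoidance) scores.
scores = [(30, 60), (32, 58), (28, 62), (60, 25), (62, 22), (58, 28)]
centroids, clusters = kmeans(scores, [(30, 60), (60, 25)])
print(centroids)                                        # [(30.0, 60.0), (60.0, 25.0)]
print(within_cluster_sum_of_squares(centroids, clusters))  # 42.0
```

Unlike Mapper, every subject ends up in exactly one cluster, which is the contrast the following slides draw.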

- TDA to extract personality groups

[Figure: Mapper networks at high resolution (5 intervals, 70% overlap) and low resolution (5 intervals, 50% overlap), with Group 1 (n=7) and Group 2 (n=4) marked; colour scale from Low to High. A bar chart shows the NS, HA, RD, and P scores, annotated HA,NS,RD,P=(32,65,58,49)]

Distance function: L2-distance

Filter function: L-infinity eccentricity

- Welfare / civil engineering / suicide ratio

Data Download: http://newstapa.com/news/201411935

- Nation’s public data analysis

Input data: ratio of welfare to civil engineering (2009), ratio of welfare to civil engineering (2012), suicide rate (2012)

[Figure: network with four groups (Group 1 to Group 4); colour scale from Low to High. Districts labelled on the figure: Busan (Nam-gu); Incheon (Yeonsu-gu, Namdong-gu, Seo-gu); Gwangju (Dong-gu); Gwangju (Buk-gu); Gangwon (provincial office); Jeonbuk (provincial office); Daegu (Buk-gu); Gangwon (Hongcheon, Yangyang); Chungbuk (Danyang); Jeonbuk (Jangsu); Jeonnam (Hampyeong); Gyeongnam (Hamyang); Seoul (Nowon-gu); Daegu (Dalseo-gu); Daejeon (Seo-gu)]

Distance function: L2-distance

Filter function: L-infinity eccentricity

- Welfare | Civil Eng. | Suicide ratio

[Figure: three bar charts comparing Group1 to Group4: Ratio of Welfare to Civil Eng. (2009), Ratio of Welfare to Civil Eng. (2012), and Suicide ratio (2012). Districts listed beside the charts: Gwangju (Buk-gu); Seoul (Nowon-gu); Jeonbuk (provincial office); Daegu (Dalseo-gu); Daegu (Buk-gu); Daejeon (Seo-gu); Busan (Nam-gu); Incheon (Yeonsu-gu, Namdong-gu, Seo-gu); Gangwon (Hongcheon, Yangyang); Gwangju (Dong-gu); Chungbuk (Danyang); Gangwon (provincial office); Jeonbuk (Jangsu); Jeonnam (Hampyeong); Gyeongnam (Hamyang)]

Blog posting: http://skyeong.tistory.com/136/

- Topological Data Analysis, a new weapon for discovering new insight from data.

- Conclusion

• TDA can be applied to various datasets and is coordinate-free.

• It is useful for analysing non-linear datasets.

• Selecting and optimising the distance and filter metrics is an important issue.

• TDA will be a powerful weapon for those who want to find new insight in data.

• It is possible to build a supervised machine learning system using TDA.

- References

1. Gurjeet Singh et al., Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition, Eurographics Symposium on Point-Based Graphics, 2007.

2. Gunnar Carlsson, Topology and Data, Bull. Amer. Math. Soc. 46(2):255-308, 2009.

3. Monica Nicolau et al., Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival, PNAS 108(17):7265-7270, 2011.

4. P. Y. Lum et al., Extracting insights from the shapes of complex data using topology, Scientific Reports 3:1236, 2013.

5. Li Li et al., Identification of type 2 diabetes subgroups through topological analysis of patient similarity, Science Translational Medicine 7(311):311ra174, 2015.

6. Jessica L. Nielson et al., Topological data analysis for discovery in preclinical spinal cord injury and traumatic brain injury, Nature Communications 6:8581, 2015.

7. AYASDI, a commercial software for TDA, http://www.ayasdi.com/