This page reproduces the content of http://www.slideshare.net/yubintx/healthcare-data-analytics-with-extreme-tree-models.


Uploaded 2016/04/11 in Technology


Healthcare data is messy. Tree-based models provide robust first-cut solutions to such data. I introduce various kinds of trees and how they are different from each other. After understanding these trees, you can build better custom models of your own.

- Introduction to

Healthcare Data Analytics

with Extreme Tree Models

Yubin Park, PhD

Chief Technology Officer

1 - Who am I

• Co-founder and Chief Technology Officer of Accordion Health, Inc.

• PhD from the University of Texas at Austin

• Advisor: Professor Joydeep Ghosh

• Studied Machine Learning and Data Mining, with a special focus on healthcare data

• Involved in various industry data mining projects

• USAA: Life-time modeling of customers

• SK Telecom: Smartphone purchase prediction, usage pattern analysis

• LinkedIn Corp.: Related search keywords recommendation

• Whole Foods Market: Price elasticity modeling

• …

2 - Accordion Health

• Healthcare Data Analytics Company

• Founded in 2014 by

• Sriram Vishwanath, PhD

• Yubin Park, PhD

• Joyce Ho, PhD

• A team of data scientists and medical professionals

• Helps healthcare organizations lower costs and improve quality

From Health Datapalooza 2014

3 - Types of Problems We Solve

• Which patient is likely to be readmitted?

• Which patient is likely to develop type 2 diabetes?

• Which patient is likely to adhere to his medication?

• How much will this patient cost this year?

• How many inpatient admissions will this patient have this year?

• Which physician is likely to follow our care guideline?

• What star rating will our organization receive this year?

• …

4 - Healthcare Data is Messy

• Data structure

• Unstructured data such as EHR

• Structured data such as claims

• Location

• Doctors’ offices, insurance companies, governments,

etc.

• Data definition

• Different definitions for different communities

• Data format

• Various industry formats

• Data complexity

• Patients going in and out of systems

• Incomplete data

• Regulations & requirements

• Source: Health Catalyst

5 - My Usual Work Flow

I start my data project by checking summary statistics, distributions, data errors, and applying simple models. Extreme Tree Models* serve as a check point before further developing customized models.

Summary Statistics → Data Cleansing & Feature Engineering (1) → Baseline Models → Data Cleansing & Feature Engineering (2) → Extreme Tree Models → Visual Inspection → Data Cleansing & Feature Engineering (3) → Extreme Tree Models → Fully Customized Models

*Extreme Tree Models refer to a class of models that use a tree as a base classifier.

6 - Why Tree-based Models

“Of all the well-known methods, decision trees come closest to meeting the requirements for serving as an off-the-shelf procedure for data mining.”

• T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning

7 - How to Grow a Tree

1. Start with a dataset

2. Pick a splitting feature

3. Pick a splitting cut-point

4. Split the dataset into two sets based on the splitting feature and

cut-point

5. Repeat from Step 2 with the partitioned datasets
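The five steps above can be sketched as a toy recursive splitter (my own illustration, not production code; the CART-style variance-reduction split score and the max_depth stopping rule are choices I am adding for concreteness):

```python
import numpy as np

def grow_tree(X, y, depth=0, max_depth=3):
    """Steps 1-5: recursively split a regression dataset."""
    if depth >= max_depth or len(y) < 2 or np.var(y) == 0.0:
        return float(np.mean(y))                    # leaf: predict the mean
    best = None
    for j in range(X.shape[1]):                     # Step 2: candidate feature
        for cut in np.unique(X[:, j])[:-1]:         # Step 3: candidate cut-point
            left = X[:, j] <= cut
            # score = variance reduction of this split (CART-style)
            score = np.var(y) - (left.mean() * np.var(y[left])
                                 + (1 - left.mean()) * np.var(y[~left]))
            if best is None or score > best[0]:
                best = (score, j, cut, left)
    if best is None:
        return float(np.mean(y))
    _, j, cut, left = best
    return {"feature": j, "cut": cut,               # Step 4: split into two sets
            "left": grow_tree(X[left], y[left], depth + 1, max_depth),
            "right": grow_tree(X[~left], y[~left], depth + 1, max_depth)}  # Step 5

def predict(node, x):
    """Walk the tree until a leaf value is reached."""
    while isinstance(node, dict):
        node = node["left"] if x[node["feature"]] <= node["cut"] else node["right"]
    return node
```

The different tree families on the next slides vary exactly two things in this sketch: how Steps 2 and 3 are scored, and how much randomness is injected into them.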

8 - Various Kinds of Trees – C4.5, CART

1. Start with a dataset

2. Pick a splitting feature

Information Gain → C4.5

Gini Impurity, Variance Reduction → CART

3. Pick a splitting cut-point

4. Split the dataset into two sets based on the splitting feature and

cut-point

5. Repeat from Step 2 with the partitioned datasets

- Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.

- Breiman, L.; Friedman, J. H.; Olshen, R. A.; Stone, C. J. (1984). Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software.

9 - Tree → Forest

• Randomization Methods

• Random data sampling

• Random feature sampling

• Random cut-point sampling

10 - Various Kinds of Forests – Bagged Trees

1. Start with a dataset

Sample with replacement, and many trees
→ Bagged Trees

2. Pick a splitting feature

3. Pick a splitting cut-point

4. Split the dataset into two sets based on the splitting feature and

cut-point

5. Repeat from Step 2 with the partitioned datasets

- Breiman, L. (1996b). Bagging predictors. Machine Learning, 24:2, 123–140.
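The recipe can be hand-rolled in a few lines of scikit-learn (the function names `bagged_trees` and `bagged_predict` are mine, not a library API; `resample` is the same helper the deck imports in its later code snippet):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample

def bagged_trees(X, y, n_trees=25, seed=0):
    """Fit one full-depth tree per bootstrap sample (sampling with replacement)."""
    rng = np.random.RandomState(seed)
    trees = []
    for _ in range(n_trees):
        Xb, yb = resample(X, y, random_state=rng)   # Step 1, repeated many times
        trees.append(DecisionTreeRegressor().fit(Xb, yb))
    return trees

def bagged_predict(trees, X):
    """Average the trees' predictions."""
    return np.mean([t.predict(X) for t in trees], axis=0)
```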

11 - Various Kinds of Forests – Random Subspace

1. Start with a dataset

2. Pick a splitting feature

Select a random subset of features

Then find the best feature/cut-point

3. Pick a splitting cut-point

4. Split the dataset into two sets based on the splitting feature and

cut-point

5. Repeat from Step 2 with the partitioned datasets

- Ho, T. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:8, 832–844.
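One way to approximate Ho's method in scikit-learn (my sketch, not from the slides) is a `BaggingRegressor` that keeps every row (`bootstrap=False`) but trains each tree on a random half of the features. Note that Ho draws the subspace once per tree, whereas Random Forests redraw the feature subset at every split:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Random subspace: all rows, but each tree sees only 2 of the 4 features.
rs = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                      bootstrap=False, max_features=0.5, random_state=0)

rng = np.random.RandomState(0)
X = rng.rand(200, 4)                 # synthetic data, for illustration only
y = X[:, 0] + 0.5 * X[:, 1]          # only the first two features matter
rs.fit(X, y)
```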

12 - Various Kinds of Forests – Random Forests

1. Start with a dataset

Sample with replacement

2. Pick a splitting feature

Select a random subset of features

Then find the best feature/cut-point

3. Pick a splitting cut-point

4. Split the dataset into two sets based on the splitting feature and

cut-point

5. Repeat from Step 2 with the partitioned datasets

- Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
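scikit-learn ships this combination as `RandomForestRegressor`, where `max_features` controls the size of the per-split feature subset (the data below is synthetic, for illustration only):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Bootstrap rows per tree + a fresh random feature subset at every split.
rf = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0)

rng = np.random.RandomState(0)
X = rng.rand(300, 6)
y = 2.0 * X[:, 0] - X[:, 1] + 0.1 * rng.randn(300)
rf.fit(X, y)
```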

13 - Various Kinds of Trees – ExtraTrees

1. Start with a dataset

2. Pick a splitting feature

Select a random subset of (feature, cut-point) pairs

Then find the best (feature, cut-point) pair

3. Pick a splitting cut-point

4. Split the dataset into two sets based on the splitting feature and

cut-point

5. Repeat from Step 2 with the partitioned datasets

- Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63:1, 3–42.
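scikit-learn provides this as `ExtraTreesRegressor` (the same class the deck's later code snippet uses); unlike Random Forests it trains each tree on the whole dataset by default (no bootstrap) and draws cut-points at random:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

etr = ExtraTreesRegressor(n_estimators=100, random_state=0)

rng = np.random.RandomState(1)
X = rng.rand(300, 5)                                 # synthetic, for illustration
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.randn(300)
etr.fit(X, y)
```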

14 - Again, Bias vs Variance

• Bias: error from the model's assumptions

• Variance: error from sensitivity to the training data

• Recursive partitioning → fewer samples as the tree grows

• Split features/cut-points are therefore sensitive to the training samples

• Randomization decreases variance

• Image Source: Scott Fortmann-Roe

15 - Evolution of Bias vs. Variance

- Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63:1, 3–42.

16 - Bias Variance Trade-off

• Randomization Methods reduce variance

• However, for some problems, reducing the bias of a model may be more critical for improving its accuracy

• e.g., a very complex dataset with many variables and samples

Image Source: Scott Fortmann-Roe

17 - Are Tree Models High-Variance Models?

• It depends…

• Number of data samples

• Number of features

• Data complexity

• Randomization Methods

• Decrease Variance

• But increase Bias

There is another way of decreasing the expected error, which:

- Decreases Bias

- May increase Variance

18 - Boosting: Learn from Errors

Y ≈ f0(X), where E1 = |Y − f0(X)|²

E1 ≈ f1(X), where E2 = |E1 − f1(X)|²

E2 ≈ f2(X), where E3 = |E2 − f2(X)|²

and so on...
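This recursion can be written as a short loop that fits each new tree to the previous stage's residual (a minimal sketch; the shallow depth-2 trees and the 10-stage count are arbitrary choices of mine, not from the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 1)                                 # synthetic, for illustration
y = np.sin(3 * X[:, 0]) + 0.1 * rng.randn(200)

stages, residual = [], y.copy()
for _ in range(10):
    f = DecisionTreeRegressor(max_depth=2).fit(X, residual)  # fit the current error
    stages.append(f)
    residual = residual - f.predict(X)               # E_{k+1} = E_k - f_k(X)

boosted = sum(f.predict(X) for f in stages)          # final model: sum of stages
```

Each stage only has to explain what the previous stages got wrong, which is why a stack of weak, shallow trees can end up strong.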

19 - Additive Model Framework

• Additive Model Framework generalizes boosting, stacking, and other variants

• Source: T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning (ESL)

20 - Gradient Boosting Machine

• Additive Models can be numerically optimized via Gradient Descent

• Source: Wikipedia and ESL

- Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29:5, 1189–1232.

21 - Extreme Gradient Boosting (XGBoost)

Various Data Mining Competitions on Kaggle

One thing they have in common:

- They all used XGBoost

22 - What’s so Special about XGBoost

• XGBoost implements the basic idea of GBM with some tweaks, such as:

• Regularization of base trees

• Approximate split finding

• Weighted quantile sketch

• Sparsity-aware split finding

• Cache-aware block structure for out-of-core computation

• “XGBoost scales beyond billions of examples using far fewer resources than existing systems.” – T. Chen and C. Guestrin
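The "regularization of base trees" tweak refers to the penalized objective from the Chen and Guestrin paper, which adds a complexity term for every tree in the ensemble (T is the number of leaves, w the vector of leaf weights):

```latex
\mathcal{L} = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2
```

The γ term penalizes growing extra leaves and the λ term shrinks leaf weights, which lets XGBoost grow deep trees without overfitting as badly as an unpenalized GBM.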

23 - Going Further Extreme

• XGBoost of XGBoost

• Bagging of XGBoost

• Bagging of XGBoost of XGBoost of …

• Stacking, Bagging, Sampling, etc.

• Source: Kaggle

24 - Real-world Example: Predict MedAdh Scores

• The Centers for Medicare and Medicaid Services (CMS) measures the performance of Medicare Advantage (MA) Plans via the Star Rating System

• Medication Adherence (MedAdh) is one of the most important quality measures in the Star Rating System

• MA Plans want to know how much their MedAdh scores will change in the next two years

25 - Predict MedAdh Scores

• Where can I find the data?

• Download from the CMS Part C and D Performance Data webpage

• Constructing datasets

• MedAdh Data from 2012, 2013 → Training Features, Xtrain

• MedAdh Data from 2015 → Training Label, Ytrain

• MedAdh Data from 2013, 2014 → Test Features, Xtest

• MedAdh Data from 2016 → Test Label, Ytest

26 - Lots of Missing Data

• Not all MA plans are measured in a given year → Mean Imputation

X1,X2,X3,X4,X5,X6,X7,X8,X9,Y

...

71.2,72.7,69.9,75.2,75.9,71.0,1.8

-999,-999,-999,75.8,72.5,68.8,-4.8

61.8,59.4,57.7,57.3,59.3,58.3,16.7

...

-999,-999,-999,82.8,80.0,69.8,-11.8

73.8,73.2,71.8,74.5,76.1,72.9,4.5
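With current scikit-learn, this sentinel-coded mean imputation is a single transformer call (`SimpleImputer` postdates the talk; at the time the equivalent class was `sklearn.preprocessing.Imputer`). The three rows below echo the CSV sample above:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# -999 marks plans that were not measured in a given year.
X = np.array([[71.2, 72.7, 69.9],
              [-999, -999, -999],
              [61.8, 59.4, 57.7]])
imp = SimpleImputer(missing_values=-999, strategy="mean")
X_imp = imp.fit_transform(X)   # each -999 becomes its column's mean
```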

27 - Try Various Models

• From simple models like Linear Regression and Decision Tree to extreme-tree models such as ExtraTrees and Gradient Boosting

from sklearn import linear_model

from sklearn import tree

from sklearn.utils import resample

from sklearn.metrics import mean_squared_error

from sklearn.ensemble import ExtraTreesRegressor

from sklearn.ensemble import GradientBoostingRegressor

28 - Try Various Models – code snippet

• From simple models like Linear Regression and Decision Tree to extreme-tree models such as ExtraTrees and Gradient Boosting

lm = linear_model.LinearRegression()

dt = tree.DecisionTreeRegressor()

etr = ExtraTreesRegressor(n_estimators=100, max_depth=10)

gbr = GradientBoostingRegressor(n_estimators=500, learning_rate=0.25, max_depth=8)

29 - Try Various Models – results

$ python test.py

…

RMSE Results

lm: 2.7125536923

dt: 3.10460672029

etr: 2.18597303421

gbr: 2.02698129388

30 - Try Various Models – results

Extreme Tree Models exhibit significant improvements in accuracy compared to simple models.

One can build more sophisticated models based on the error characteristics of these models.

31 - Contact

• yubin [at] accordionhealth [dot] com
