This page reproduces the content of http://www.slideshare.net/mark_landry/modern-classification-techniques (uploaded 2015/01/21).

Slides to support Austin Machine Learning Meetup, 1/19/2015.

Overview of techniques of recent Kaggle code to perform online logistic regression with FTRL-proximal (SGD, L1/L2 regularization) and hash trick.

- Modern Classification Techniques

Mark Landry

Austin Machine Learning Meetup

1/19/2015

- Overview

• Problem & Data

– Click-through rate prediction for online auctions

– 40 million rows

– Sparse: gather characteristics

– Down-sampled

• Methods

– Logistic regression

– Sparse feature handling

– Hash trick

– Online learning

– Online gradient descent

– Adaptive learning rate

– Regularization (L1 & L2)

• Solution characteristics

– Fast: 20 minutes

– Efficient: ~4GB RAM

– Robust: Easy to extend

– Accurate: competitive with factorization machines, particularly when extended to key interactions

- Two Data Sets

• Primary use case: click logs

– 40 million rows

– 20 columns

– Values appear in dense fashion, but a sparse feature space

• For highly informative feature types (URL/site), 70% of features have 3 or fewer instances

– Note: negatives have been down-sampled

• Extended to separate use case: clinical + genomic

– 4k rows

– 1300 columns

– Mix of dense and sparse features

- Methods and objectives

• Logistic regression: accuracy/base algorithm

• Stochastic gradient descent: optimization

• Adaptive learning rate: accuracy, speed

• Regularization (L1 & L2): generalized solution

• Online learning: speed

• Sparse feature handling: memory efficiency

• Hash trick: memory efficiency, robustness

- Implementation Infrastructure

• From scratch: no machine learning libraries

• Maintain vectors for

– Features (1/0)

– Weights

– Feature Counts

• Each vector will use the same index scheme

• The hash trick means we can immediately find the index of any feature, and it bounds the vector size (more later)

- Logistic Regression
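The parallel vectors sharing one index scheme might be sketched like this (a minimal illustration; `D` and the feature names are hypothetical, not the author's code):

```python
# Parallel arrays sharing one index scheme, sized by a bound D
# (a hypothetical choice here; the hash trick slide uses 2 ** 20).
D = 2 ** 20

weights = [0.0] * D  # model weights
counts = [0] * D     # how often each feature has been seen

def feature_index(name, value):
    # The hash trick: any feature maps straight to a bounded index, so the
    # 1/0 feature vector never needs to be stored explicitly -- a row is
    # just the set of indices that are "on".
    return abs(hash(name + "_" + str(value))) % D

row = {"site": "example.com", "device": "mobile"}  # illustrative row
on_indices = [feature_index(k, v) for k, v in row.items()]
```

All three vectors are addressed by the same hashed index, which is what makes the lookup immediate.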

• Natural fit for probability problems (0/1)

– 1 / (1 + exp(-sum(weight*feature)))

– Solves based on log odds

– Better calibrated than many other algorithms (particularly decision trees), which is useful for the Real Time Bid problem

- Sparse Features
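A minimal sketch of the logistic function in Python (the clipping guard mirrors a common trick in the referenced Kaggle code):

```python
from math import exp

def sigmoid(z):
    # The logistic function: z is the log-odds, i.e. sum(weight * feature).
    # Clipping z keeps exp() from overflowing on extreme inputs.
    z = max(min(z, 35.0), -35.0)
    return 1.0 / (1.0 + exp(-z))

print(sigmoid(0.0))  # → 0.5 (log-odds of 0 means a 50/50 prediction)
```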

• Every distinct value receives its own column, which encodes its absence/presence (0/1)

• So 1 / (1 + exp(-sum(weight*feature))) resolves to 1 / (1 + exp(-sum(weight))) for only the features present in each instance

- Hash Trick
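With binary features, a prediction therefore only needs the weights at the present indices; a minimal sketch (the weights and indices are illustrative):

```python
from math import exp

def predict(weights, indices):
    # With binary 1/0 features, sum(weight * feature) reduces to summing
    # the weights at the indices of the features present in this instance.
    z = sum(weights[i] for i in indices)
    z = max(min(z, 35.0), -35.0)  # overflow guard
    return 1.0 / (1.0 + exp(-z))

# toy model: 8 slots, two informative features (illustrative values)
weights = [0.0] * 8
weights[2], weights[5] = 1.2, -0.4
p = predict(weights, [2, 5])  # log-odds = 1.2 - 0.4 = 0.8
```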

• The hash trick allows for quick access into parallel arrays that hold key information for your model

• Example: use native Python hash(‘string’) to map a string to a large integer

• Bound the parameter space by using modulo

– E.g. abs(hash(‘string’)) % (2 ** 20)

– The size of that integer is a parameter, and it allows you to set it as large as your system can handle

– Why set it larger? Fewer hash collisions

– Keep features separate: abs(hash(feature-name + ‘string’)) % (2 ** 20)
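A hedged sketch of that indexing in Python (note that the native `hash` of strings is salted per process in modern Python, so indices are only stable within a run unless `PYTHONHASHSEED` is fixed):

```python
D = 2 ** 20  # bound on the parameter space (the modulus from the slide)

def hashed_index(feature_name, value):
    # Prefixing with the feature name keeps, e.g., site "foo" and
    # device "foo" in separate buckets, as the slide suggests.
    return abs(hash(feature_name + "_" + str(value))) % D

i = hashed_index("site", "example.com")  # always in [0, D)
```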

• Any hash function can have a collision. The particular function used is fast, but much more likely to encounter a collision than a murmur hash or something more elaborate.

• So a speed/accuracy tradeoff dictates what function to use. The more bits, the fewer hash collisions.

- Online Learning

• Learn one record at a time

– A prediction is always available at any point, and it is the best possible given the data the algorithm has seen

– Do not have to retrain to take in more data

• Though you may still want to

• Depending on the learning rate used, you may want to iterate through the data set more than once

• Fast: VW approaches the speed of the network interface

- OGD/SGD: online gradient descent
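A minimal online logistic regression loop, combining the sparse prediction, hash-bounded arrays, and the per-feature adaptive rate described on the adaptive learning rate slide (illustrative, not the competition code; the `alpha` and `D` values are assumptions):

```python
from math import exp, sqrt

D = 2 ** 20    # hashed feature space size (per the hash-trick slide)
alpha = 0.1    # base learning rate (a hypothetical choice)
w = [0.0] * D  # weights
n = [0] * D    # per-feature counts, driving the adaptive rate

def predict(indices):
    z = sum(w[i] for i in indices)
    return 1.0 / (1.0 + exp(-max(min(z, 35.0), -35.0)))

def update(indices, y):
    # One online gradient step on the logistic loss; for a present binary
    # feature the gradient is simply (p - y).
    p = predict(indices)
    for i in indices:
        w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.0)
        n[i] += 1
    return p

# learn one record at a time (a toy stream of hashed-index rows and labels)
for indices, y in [([1, 2], 1), ([1, 3], 0), ([1, 2], 1)]:
    update(indices, y)
```

Because each record is consumed once and discarded, a prediction is available after every row, and more data can be folded in without retraining from scratch.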

Gradient descent

Optimization algorithms are required to minimize the loss in logistic regression.

Gradient descent, and its many variants, are a popular choice, especially with large-scale data.

Visualization (in R):

library(animation)

## default demo surface with the default step size
par(mar = c(4, 4, 2, 0.1))
grad.desc()

## a bumpier surface; gamma (the step size) = 0.3
ani.options(nmax = 50)
par(mar = c(4, 4, 2, 0.1))
f2 = function(x, y) sin(1/2 * x^2 - 1/4 * y^2 + 3) * cos(2 * x + 1 - exp(y))
grad.desc(f2, c(-2, -2, 2, 2), c(-1, 0.5), gamma = 0.3, tol = 1e-04)

## the same surface with a smaller step size, gamma = 0.1
ani.options(nmax = 70)
par(mar = c(4, 4, 2, 0.1))
f2 = function(x, y) sin(1/2 * x^2 - 1/4 * y^2 + 3) * cos(2 * x + 1 - exp(y))
grad.desc(f2, c(-2, -2, 2, 2), c(-1, 0.5), gamma = 0.1, tol = 1e-04)

# interesting comparison: https://imgur.com/a/Hqolp

- Other common optimization algorithms

ADAGRAD

Momentum

Still slightly sensitive to the choice of η (the learning rate)

Newton’s Method

ADADELTA

Quasi-Newton

- Adaptive learning rate

• Difficulty using SGD is finding a good learning rate

• An adaptive learning rate eases that difficulty

– ADAGRAD is an adaptive method

• Simple learning rate in example code

– alpha / (sqrt(n) + 1)

• Where n is the number of times a specific feature has been encountered

– w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)

• The full weight update shrinks the change by the learning rate of the specific feature

- Regularization (L1 & L2)
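One hedged way to fold L1/L2 penalties into such a per-feature update is a plain subgradient step (a sketch only; the referenced Kaggle code uses the FTRL-proximal closed form instead, and the hyperparameter values here are hypothetical):

```python
from math import sqrt

alpha, L1, L2 = 0.1, 1.0, 1.0  # hypothetical hyperparameters

def regularized_step(w_i, g, n_i):
    # g is the loss gradient, (p - y), for a present binary feature.
    # L2 adds a pull toward 0 proportional to the weight; L1 adds a
    # constant push toward 0 (a plain subgradient, not FTRL-proximal).
    g += L2 * w_i
    g += L1 * (1.0 if w_i > 0 else -1.0 if w_i < 0 else 0.0)
    return w_i - g * alpha / (sqrt(n_i) + 1.0)
```

Rarely seen features accumulate few gradient updates, so the penalty terms dominate and keep their weights near zero.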

• Regularization attempts to ensure the robustness of a solution

• Enforces a penalty term on the coefficients of a model, guiding toward a simpler solution

• L1: guides parameter values to be 0

• L2: guides parameters to be close to 0, but not 0

• In practice, these ensure large coefficients are not applied to rare features

- Related Tools

• Vowpal Wabbit

– Implements all of these features, plus far more

– Command line tool

– SVMlight-like data format

– Source code available on Github with fairly open license

• Straight Python implementation (see code references slide)

• glmnet, for R: L1/L2 regression, sparse

• Scikit-learn, Python ML library: ridge, elastic net (L1+L2), SGD (can specify logistic regression)

• H2O, Java tool; many techniques used, particularly in deep learning

• Many of these techniques are used in neural networks, particularly deep learning

- Code References

• Introductory version: online logistic regression, hash trick, adaptive learning rate

– Kaggle forum post

• Data set is available on that competition’s data page

• But you can easily adapt the code to work for your data set by changing the train and test file names (lines 25-26) and the names of the id and output columns (104-107, 129-130)

– Direct link to python code from forum post

– Github version of the same python code

• Latest version: adds FTRL-proximal (including SGD, L1/L2 regularization), epochs, and automatic interaction handling

– Kaggle forum post

– Direct link to python code from forum post (version 3)

– Github version of the same python code

- Additional References

• Overall process

– Google paper, FTRL proximal and practical observations

– Facebook paper; includes logistic regression and trees, feature handling, down-sampling

• Follow The Regularized Leader Proximal (Google)

• Optimization

– Stochastic gradient descent: examples and guidance (Microsoft)

– ADADELTA and discussion of additional optimization algorithms (Google/NYU intern)

– Comparison Visualization

• Hash trick:

– The Wikipedia page offers a decent introduction

– general description and list of references, from VW author