This page reproduces the content of http://www.slideshare.net/indicods/general-sequence-learning-with-recurrent-neural-networks-for-next-ml.


Uploaded 2015/02/17.


indico’s Head of Research, Alec Radford, led a workshop on general sequence learning using recurrent neural networks at Next.ML in San Francisco. Here are his presentation and workshop resources, available for free.

---

Recurrent Neural Networks hold great promise as general sequence learning algorithms and, as such, are a very promising tool for text analysis. However, outside of specific use cases like handwriting recognition and, recently, machine translation, they have not seen widespread use. Why has this been the case?

In this workshop, Alec introduces RNNs as a concept, sketches how to implement them, and covers the tricks necessary to make them work well. With the basics covered, the workshop investigates using RNNs as general text classification and regression models, examining where they succeed and where they fail compared to more traditional text analysis models.

Finally, a simple Python and Theano library for training RNNs with a scikit-learn style interface will be introduced, and you’ll see how to use it through several hands-on tutorials on real-world text datasets.

---

Next.ML was created to help you use the latest machine learning techniques the minute you leave the workshop. Learn from industry-leading data scientists at Next.ML on April 27th, 2015 at the Microsoft NERD Center -- for more info, visit http://Next.ML

---

Resources

Passage - https://github.com/IndicoDataSolutions/Passage

Presentation Video - https://www.youtube.com/watch?v=VINCQghQRuM

---

How ML

-0.15, 0.2, 0, 1.5 → Numerical, great!

A, B, C, D → Categorical, great!

“The cat sat on the mat.” → Uhhh…….

---

How text is dealt with (ML perspective)

Text → Features (bow, TFIDF, LSA, etc...) → Linear Model (SVM, softmax)

---

Structure is important!

● Certain tasks, structure is essential:
○ Humor
○ Sarcasm
● Certain tasks, ngrams can get you a long way:
○ Sentiment Analysis
○ Topic detection
● Specific words can be strong indicators:
○ useless, fantastic (sentiment)
○ hoop, green tea, NASDAQ (topic)

[Figure: “The cat sat on the mat.” scattered into an unordered bag of words]

---

Structure is hard

Ngrams are the typical way of preserving some structure:

the, cat, sat, on, mat, the cat, cat sat, sat on, on the, the mat

Beyond bi- or tri-grams, occurrences become very rare and dimensionality becomes huge (1-10 million+ features).
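A bag-of-ngrams featurizer like the one described above fits in a few lines; a minimal Python sketch (the tokenization is deliberately naive):

```python
from collections import Counter

def ngram_counts(text, n_max=2):
    """Count all 1..n_max grams of a crudely whitespace-tokenized text."""
    tokens = text.lower().replace(".", "").split()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts("The cat sat on the mat.")
# unigrams: "the" twice, cat/sat/on/mat once each; each bigram occurs once
```

On a real corpus, even the n_max=2 vocabulary explodes, which is exactly the dimensionality problem described above.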

---

How text is dealt with (ML perspective)

Text → Features (bow, TFIDF, LSA, etc...) → Linear Model (SVM, softmax)

---

How text should be dealt with?

Text → RNN → Linear Model (SVM, softmax)

---

How an RNN works

[Figure series: the words “the cat sat on the mat” are fed in one at a time. Each input token is projected “input to hidden”, and the hidden state at each step is projected “hidden to hidden” into the next step. Activities are vectors of values; projections are activities x weights. The hidden state is a learned representation of the sequence, and a “hidden to output” projection predicts the next token (“cat” after “the”).]

---

From text to RNN input

String input: “The cat sat on the mat.”

Tokenize: the, cat, sat, on, the, mat, .

Assign index: 0, 1, 2, 3, 0, 4, 5

Embedding lookup in a learned matrix, one row per index:

0: 2.5 0.3 -1.2
1: 0.2 -3.3 0.7
2: -4.1 1.6 2.8
3: 1.1 5.7 -0.2
4: 1.4 0.6 -3.9
5: -3.8 1.5 0.1

The input sequence becomes rows 0, 1, 2, 3, 0, 4, 5 of the matrix (row 0, “the”, appears twice).
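The tokenize → assign index → embedding lookup pipeline can be sketched with numpy; the vocabulary and the random 3-dim embedding values here are illustrative stand-ins for a learned matrix:

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, ".": 5}

rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), 3))  # one row per vocabulary index

def text_to_vectors(text):
    tokens = text.lower().replace(".", " .").split()  # crude tokenizer
    indices = [vocab[t] for t in tokens]              # assign index
    return embedding[indices]                         # embedding lookup

seq = text_to_vectors("The cat sat on the mat.")
# seq has one 3-dim row per token; rows 0 and 4 are identical ("the" twice)
```

Fancy indexing (`embedding[indices]`) is the whole lookup: the same word always maps to the same learned vector.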

---

You can stack them too

[Figure: the same diagram with a second recurrent layer stacked on top; the first layer’s hidden states feed the second, which produces the “hidden to output” prediction (“cat”).]

---

But aren’t RNNs unstable?

Simple RNNs trained with SGD are unstable/difficult to learn.

But modern RNNs with various tricks blow up much less often!

● Gating Units
● Gradient Clipping
● Steeper gates
● Better initialization
● Better optimizers
● Bigger datasets

---

Simple Recurrent Unit

[Figure: hidden states h_{t-1}, h_t, h_{t+1} chained together, with inputs x_t, x_{t+1} entering through an activation function and element-wise addition (+). The arrows are routes information can propagate along; the operations are involved in modifying information flow and values.]
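As a concrete sketch of the simple recurrent unit above (sizes and random weights are illustrative, not from the workshop):

```python
import numpy as np

def simple_rnn(inputs, w_in, w_hid, b):
    """Elman-style unit: h_t = tanh(x_t @ w_in + h_{t-1} @ w_hid + b)."""
    h = np.zeros(w_hid.shape[0])
    states = []
    for x in inputs:  # one step per input token
        h = np.tanh(x @ w_in + h @ w_hid + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
xs = rng.normal(size=(6, 3))           # six 3-dim token embeddings
w_in = 0.1 * rng.normal(size=(3, 4))   # input to hidden
w_hid = 0.1 * rng.normal(size=(4, 4))  # hidden to hidden
states = simple_rnn(xs, w_in, w_hid, np.zeros(4))
# states: one 4-dim hidden activity vector per time step
```

The repeated multiplication by `w_hid` is what makes this unit hard to train: depending on that matrix's spectrum, the hidden state can shrink or blow up over long sequences.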

---

Gated Recurrent Unit - GRU

[Figure: GRU cells unrolled over time. A reset gate r and an update gate z modulate the flow: the candidate state h̃_t is combined with h_{t-1} via element-wise multiplication (⊙) by z and 1-z, then merged by element-wise addition (+). The arrows are routes information can propagate along; the gates are involved in modifying information flow and values.]
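A minimal numpy sketch of one GRU step, following one common formulation (papers differ on which of z and 1-z multiplies the old state; sizes and random weights are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU step: reset gate r and update gate z control information flow."""
    z = sigmoid(x @ p["Wz"] + h_prev @ p["Uz"])              # update gate
    r = sigmoid(x @ p["Wr"] + h_prev @ p["Ur"])              # reset gate
    h_tilde = np.tanh(x @ p["Wh"] + (r * h_prev) @ p["Uh"])  # candidate state
    return (1 - z) * h_prev + z * h_tilde                    # element-wise blend

rng = np.random.default_rng(0)
dim_in, dim_hid = 3, 4
p = {name: 0.1 * rng.normal(size=(dim_in if name[0] == "W" else dim_hid, dim_hid))
     for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}

h = np.zeros(dim_hid)
for x in rng.normal(size=(6, dim_in)):  # run over six time steps
    h = gru_step(x, h, p)
```

When z is near 0, the old state is copied forward unchanged; this near-additive path is why gated units keep gradients alive over long sequences.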

---

Gating is important

For sentiment analysis of longer sequences of text (a paragraph or so), a simple RNN has difficulty learning at all, while a gated RNN does so easily.

---

Which One?

There are two types of gated RNNs:

● Gated Recurrent Units (GRU) by K. Cho, recently introduced and used for machine translation and speech recognition tasks.

● Long short term memory (LSTM) by S. Hochreiter and J. Schmidhuber, which has been around since 1997 and has been used far more. Various modifications to it exist.

---

Which One?

GRU is simpler, faster, and optimizes quicker (at least on sentiment). Because it only has two gates (compared to four), it is approximately 1.5-1.75x faster in a Theano implementation.

If you have a huge dataset and don’t mind waiting, LSTM may be better in the long run due to its greater complexity - especially if you add peephole connections.

---

Exploding Gradients?

Exploding gradients are a major problem for traditional RNNs trained with SGD, and one of the sources of the reputation of RNNs as being hard to train.

In 2012, R. Pascanu and T. Mikolov proposed clipping the norm of the gradient to alleviate this.

Modern optimizers don’t seem to have this problem - at least for classification text analysis.
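Gradient norm clipping is simple to implement; a numpy sketch (the max_norm threshold is an illustrative hyperparameter):

```python
import numpy as np

def clip_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        return [g * (max_norm / total) for g in grads]
    return grads

exploding = [np.full(10, 10.0)]        # global norm ~31.6
clipped = clip_global_norm(exploding)  # rescaled so the global norm is 5.0
```

Rescaling the whole gradient (rather than clipping each element) keeps the update direction intact while bounding the step size.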

---

Better Gating Functions

Interesting paper at a NIPS workshop (Q. Lyu, J. Zhu): make the gates “steeper” so they change more rapidly from “off” to “on”, so the model learns to use them quicker.
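A “steeper” gate just scales the sigmoid’s input; the slope value below is an illustrative assumption, not the one from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def steeper_sigmoid(x, slope=3.0):
    """Sigmoid with a larger slope: switches from 'off' to 'on' more sharply."""
    return sigmoid(slope * x)

# at the same input magnitude, the steeper gate is closer to fully open/closed
standard, steep = sigmoid(1.0), steeper_sigmoid(1.0)
```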

---

Better Initialization

Andrew Saxe last year showed that initializing weight matrices with random orthogonal matrices works better than random gaussian (or uniform) matrices.

In addition, Richard Socher (and more recently Quoc Le) have used identity initialization schemes, which work great as well.
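Orthogonal initialization is easy to implement with a QR decomposition; a numpy sketch:

```python
import numpy as np

def orthogonal_init(n, rng=None):
    """Random n x n orthogonal matrix via QR of a gaussian matrix (Saxe-style)."""
    rng = rng or np.random.default_rng(0)
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))  # sign fix so the distribution is uniform

W = orthogonal_init(4)
# W.T @ W is (numerically) the identity: multiplying by W preserves norms,
# so hidden states neither shrink nor blow up through the recurrence
```

Identity initialization of the hidden-to-hidden matrix is even simpler (`np.eye(n)`) and has the same norm-preserving motivation.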

---

Understanding Optimizers

[Figure: optimizer trajectories on the 2D moons dataset, courtesy of scikit-learn]

---

Comparing Optimizers

Adam (D. Kingma) combines the early optimization speed of Adagrad (J. Duchi) with the better later convergence of various other methods like Adadelta (M. Zeiler) and RMSprop (T. Tieleman).

Warning: generalization performance of Adam seems slightly worse for smaller datasets.
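For intuition, a single Adam step in numpy (a sketch, not Kingma’s reference implementation; hyperparameters follow the paper’s defaults except a larger step size for this toy problem):

```python
import numpy as np

def adam_update(w, grad, state, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step: momentum on the gradient plus per-parameter scaling."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad       # first moment
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2  # second moment
    m_hat = state["m"] / (1 - b1 ** state["t"])          # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

# minimize f(w) = w^2 (gradient 2w), starting from w = 3
w = np.array([3.0])
state = {"t": 0, "m": np.zeros(1), "v": np.zeros(1)}
for _ in range(500):
    w = adam_update(w, 2 * w, state)
# w ends up near the minimum at 0
```

The first moment gives Adagrad-like early speed; the decaying second moment avoids Adagrad’s ever-shrinking step sizes, which is the RMSprop/Adadelta ingredient.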

---

It adds up

Up to 10x more efficient training once you add all the tricks together compared to a naive implementation - and much more stable: it rarely diverges.

Around 7.5x faster overall, since the various tricks add a bit of computation time.

---

Too much? - Overfitting

RNNs can overfit very well, as we will see. As they continue to fit the training dataset, their performance on test data will plateau or even worsen.

Keep track of it using a validation set: save the model at each iteration over the training data and pick the one with the earliest, best validation performance.
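The early-stopping bookkeeping described above is simple; a pure-Python sketch with made-up validation scores:

```python
def best_checkpoint(val_scores):
    """Return (epoch, score) of the earliest epoch with the best validation score."""
    best = max(val_scores)
    return val_scores.index(best), best

# hypothetical per-epoch validation accuracy: improves, plateaus, then worsens
scores = [0.71, 0.80, 0.84, 0.84, 0.82]
epoch, score = best_checkpoint(scores)
# picks epoch 2 (0-indexed): the earliest occurrence of the best score, 0.84
```

Saving a checkpoint each epoch and restoring the one this function selects is the whole trick: training longer is fine, you just never deploy the overfit later epochs.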

---

The Showdown

Model #1 vs Model #2:

● a linear model, using bigrams and grid search on min_df for the vectorizer and on the regularization coefficient for the model;

● an RNN (512 dim embedding, 512 dim hidden state, then output), using whatever I tried that worked :) Adam, GRU, steeper sigmoid gates, and ortho/identity init are good defaults.

---

Sentiment & Helpfulness
[Figure: results on the sentiment and helpfulness datasets]

---

Effect of Dataset Size

● RNNs have poor generalization properties on small datasets.
○ 1K labeled examples: 25-50% worse than linear model…
● RNNs have better generalization properties on large datasets.
○ 1M labeled examples: 0-30% better than linear model.
● Crossovers between 10K and 1M examples.
○ Depends on dataset.

---

The Thing we don’t talk about

For 1 million paragraph-sized text examples to converge:

● Linear model takes 30 minutes on a single CPU core.
● RNN takes 90 minutes on a Titan X.
● RNN takes five days on a single CPU core.

The RNN is about 250x slower on CPU than the linear model… This is why we use GPUs.

---

Visualizing representations of words learned via sentiment

[Figure: TSNE (L.J.P. van der Maaten) plot of individual words colored by average sentiment, separating into negative and positive regions]

The model learns to separate negative and positive words - not too surprising.

---

[Figure: the same plot annotated with clusters of qualifiers, quantities of time, product nouns, and punctuation]

Much cooler: the model also begins to learn components of language from only binary sentiment labels.

---

The library - Passage

● Tiny RNN library built on top of Theano
● https://github.com/IndicoDataSolutions/Passage
● Still alpha - we’re working on it!
● Supports simple, LSTM, and GRU recurrent layers
● Supports multiple recurrent layers
● Supports deep input to and deep output from hidden layers
○ no deep transitions currently
● Supports embedding and onehot input representations
● Can be used for both regression and classification problems
○ Regression needs preprocessing for stability - working on it
● Much more in the pipeline

---

An example

Sentiment analysis of movie reviews - 25K labeled examples.

The code walkthrough builds up the pipeline step by step:

● RNN imports
● preprocessing
● load training data
● tokenize data
● configure model
● make and train model
● load test data
● predict on test data

---

The results

Top 10! - barely :)

---

Summary

● RNNs look to be a competitive tool in certain situations for text analysis.
● Especially if you have a large 1M+ example dataset.
○ A GPU or great patience is essential.
● Otherwise it can be difficult to justify over linear models:
○ Speed
○ Complexity
○ Poor generalization with small datasets

---

Contact

alec@indico.io