
Uploaded 2016/01/16


Available with notes:

http://www.slideshare.net/ChristopherMoody3/word2vec-lda-and-introducing-a-new-hybrid-algorithm-lda2vec

(Data Day 2016)

Standard natural language processing (NLP) is a messy and difficult affair. It requires teaching a computer about English-specific word ambiguities as well as the hierarchical, sparse nature of words in sentences. At Stitch Fix, word vectors help computers learn from the raw text in customer notes. Our systems need to identify a medical professional when she writes that she ‘used to wear scrubs to work’, and distill ‘taking a trip’ into a Fix for vacation clothing. Applied appropriately, word vectors are dramatically more meaningful and more flexible than current techniques and let computers peer into text in a fundamentally new way. I’ll try to convince you that word vectors give us a simple and flexible platform for understanding text while covering word2vec and LDA, and introducing our new hybrid algorithm, lda2vec.

- A word is worth a thousand vectors

(word2vec, lda, and introducing lda2vec)

Christopher Moody

@ Stitch Fix - About

Gaussian Processes

t-SNE

Tensor Decomposition

chainer

@chrisemoody

deep learning

Caltech Physics

PhD. in astrostats supercomputing

sklearn t-SNE contributor

Data Labs at Stitch Fix

github.com/cemoody - Credit

Large swathes of this talk are from

previous presentations by:

• Tomas Mikolov

• David Blei

• Christopher Olah

• Radim Rehurek

• Omer Levy & Yoav Goldberg

• Richard Socher

• Xin Rong

• Tim Hopper

- 1. word2vec 2. lda 3. lda2vec

- word2vec

1. king - man + woman = queen

2. Huge splash in NLP world

3. Learns from raw text

4. Pretty simple algorithm

5. Comes pretrained - word2vec

1.

Set up an objective function

2.

Randomly initialize vectors

3.

Do gradient descent - word2vec

word2vec: learn word vector vin

from its surrounding context

vin - word2vec

“The fox jumped over the lazy dog”

Maximize the likelihood of seeing the words given the word over.

P(the|over)

P(fox|over)

P(jumped|over)

P(the|over)

P(lazy|over)

P(dog|over)

…instead of maximizing the likelihood of co-occurrence counts. - word2vec
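The window construction above can be sketched in plain Python; the sentence and window size are the slide's toy example, not production settings:

```python
# Build (input, output) training pairs for skip-gram from one sentence.
sentence = "The fox jumped over the lazy dog".lower().split()
window = 5  # how many words to consider on each side of the input word

pairs = []
for i, w_in in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((w_in, sentence[j]))

# For the input word 'over', these are exactly the
# P(the|over), P(fox|over), P(jumped|over), ... terms on the slide.
over_pairs = [p for p in pairs if p[0] == "over"]
```

With a window of 5, every other word in this 7-token sentence pairs with 'over', giving the six probability terms listed above.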

What should this be?

P(fox|over) - word2vec

Should depend on the word vectors.

P(fox|over)

P(vfox|vover) - word2vec

Twist: we have two vectors for every word.

Should depend on whether it’s the input or the output.

Also a context window around every input word.

P(vOUT|vIN)

“The fox jumped over the lazy dog” - word2vec

Twist: we have two vectors for every word.

Should depend on whether it’s the input or the output.

Also a context window around every input word.

P(vOUT|vIN)

“The fox jumped over the lazy dog”

vIN - word2vec

(The slide repeats, stepping through the sentence and pairing vIN with each surrounding word as vOUT in turn.)

- objective

How should we deﬁne P(vOUT|vIN)?

Measure loss between

vIN and vOUT?

vin . vout - word2vec

objective

vin

vout

vin . vout ~ 1 - word2vec

objective

vin

vin . vout ~ 0

vout - word2vec

objective

vin

vin . vout ~ -1

vout - word2vec

objective

vin . vout ∈ [-1,1] - word2vec

objective

But we’d like to measure a probability.

vin . vout ∈ [-1,1] - word2vec

objective

But we’d like to measure a probability.

softmax(vin . vout) maps [-1,1] into [0,1] - word2vec

objective

But we’d like to measure a probability.

softmax(vin . vout), where vin . vout ∈ [-1,1]

Probability of choosing 1 of N discrete items.

Mapping from vector space to a multinomial over words. - word2vec

objective

But we’d like to measure a probability.

softmax ~ exp(vin . vout) ∈ [0,1] - word2vec

objective

But we’d like to measure a probability.

softmax = exp(vin . vout) / Σ_{k ∈ V} exp(vin . vk)

Normalization term over all words - word2vec

objective

But we’d like to measure a probability.

softmax = exp(vin . vout) / Σ_{k ∈ V} exp(vin . vk) = P(vout|vin) - word2vec

objective

Learn by gradient descent on the softmax prob.

For every example we see update vin

vin := vin + P(vout|vin)

vout := vout + P(vout|vin) - word2vec
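The slide's update rule is schematic; one gradient-ascent step on log P(vout|vin) under the softmax objective looks roughly like this numpy sketch (vocabulary size, dimension, indices, and learning rate are all toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 8, 4                       # toy vocabulary size and vector dimension
v_in = rng.normal(size=(V, D))    # input vectors, randomly initialized
v_out = rng.normal(size=(V, D))   # output vectors, randomly initialized

def p_out_given_in(i):
    """softmax: exp(vin . vout), normalized over all words k in V."""
    scores = v_out @ v_in[i]
    e = np.exp(scores - scores.max())   # subtract max for numerical stability
    return e / e.sum()

# One gradient-ascent step on log P(out|in) for an observed (in, out) pair.
i, o, lr = 3, 5, 0.05
p_before = p_out_given_in(i)
# d log P / d v_in = v_out[o] - sum_k P(k|in) * v_out[k]
v_in[i] += lr * (v_out[o] - p_before @ v_out)
```

After the step, the observed pair becomes more likely, which is all "do gradient descent" on this objective means.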
- word2vec

ITEM_3469 + ‘Pregnant’ = ITEM_701333

= ITEM_901004

= ITEM_800456

- what about LDA?
- LDA

on Client Item

Descriptions - LDA

on Item

Descriptions

(with Jay)

- Pairwise gamma correlation

Latent style vectors from text

from style ratings

Diversity from ratings

Diversity from text - lda vs word2vec
- “I love ﬁnding new designer brands for jeans”

word2vec is local:

one word predicts a nearby word - “I love ﬁnding new designer brands for jeans”

But text is usually organized. - “I love finding new designer brands for jeans”

doc 7681

In LDA, documents globally predict words.

- 100D word2vec vector vs. 100D LDA document vector

word2vec: [ -0.75, -1.25, -0.55, -0.27, -0.94, 0.44, 0.05, 0.31 … -0.12, +2.2]

LDA: [ 0%, 0%, 0%, 0%, 0%, … 0%, 9%, 78%, 11%]

word2vec: dense, all real values, dimensions relative, similar in 100D ways (very flexible)

LDA: sparse, all sum to 100%, dimensions are absolute, similar in fewer ways (more interpretable)

+mixture

+sparse

- can we do both? lda2vec
- The goal:

@chrisemoody

Use all of this context to learn

interpretable topics.

this document is

80% high fashion

this document is

60% style

word2vec

LDA

P(vOUT |vDOC) - The goal:

@chrisemoody

Use all of this context to learn

interpretable topics.

this zip code is

80% hot climate

this zip code is

60% outdoors wear

word2vec

LDA - The goal:

@chrisemoody

Use all of this context to learn

interpretable topics.

this client is

80% sporty

this client is

60% casual wear

word2vec

LDA - lda2vec

vIN

vOUT

“PS! Thank you for such an awesome top”

word2vec predicts locally:

one word predicts a nearby word

P(vOUT |vIN) - lda2vec

vDOC

vOUT

doc_id=1846 “PS! Thank you for such an awesome top”

LDA predicts a word from a global context

P(vOUT |vDOC) - lda2vec

vDOC

vIN

vOUT

doc_id=1846 “PS! Thank you for such an awesome top”

can we predict a word both locally and globally ? - lda2vec

vDOC

vIN

vOUT

doc_id=1846 “PS! Thank you for such an awesome top”

can we predict a word both locally and globally ?

P(vOUT |vIN+ vDOC) - lda2vec

vDOC

vIN

vOUT

doc_id=1846 “PS! Thank you for such an awesome top”

can we predict a word both locally and globally ?

P(vOUT |vIN+ vDOC)

*very similar to the Paragraph Vectors / doc2vec - lda2vec

This works! 😀 But vDOC isn’t as interpretable as the LDA topic vectors. 😔

We’re missing mixtures & sparsity. - lda2vec

Let’s make vDOC into a mixture… - lda2vec

Let’s make vDOC into a mixture…

vDOC = a vtopic1 + b vtopic2 +…

(up to k topics) - lda2vec
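The mixture can be sketched in numpy; the number of topics, the dimension, and the weights below are illustrative toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
k, D = 3, 5                        # toy: 3 topics in a 5-dimensional word space
topics = rng.normal(size=(k, D))   # v_topic1 ... v_topick live in word space

weights = np.array([0.6, 0.3, 0.1])  # a, b, c ... lie on the simplex
v_doc = weights @ topics             # vDOC = a*v_topic1 + b*v_topic2 + ...
```

Because vDOC is a weighted sum of a few topic vectors rather than a free 100D point, each document can be read off as percentages of topics.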


Let’s make vDOC into a mixture…

topic 1 = “religion”

topic 2 = “politics”

Trinitarian

Milosevic

baptismal

vDOC = a vtopic1 + b vtopic2 +…

absentee

Pentecostals

Indonesia

Bede

Lebanese

schismatics

Israelis

excommunication

Karadzic - lda2vec

Let’s make vDOC into a mixture…

topic 1 = “religion”

topic 2 = “politics”

Trinitarian

Milosevic

baptismal

vDOC = 10% religion + 89% politics +…

absentee

Pentecostals

Indonesia

Bede

Lebanese

schismatics

Israelis

excommunication

Karadzic - lda2vec

Let’s make vDOC sparse

vDOC = a vreligion + b vpolitics +…

[ -0.75, -1.25, …] - lda2vec
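Drawing the mixture weights from a Dirichlet with a small concentration parameter is what makes vDOC sparse; a numpy sketch (the alpha values and topic count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 20  # number of topics

# alpha < 1 concentrates mass on a few topics; alpha > 1 spreads it out.
sparse_weights = rng.dirichlet(np.full(k, 0.1))
dense_weights = rng.dirichlet(np.full(k, 10.0))
```

With alpha well below 1, most of the simplex mass lands on a handful of topics, which is the "10% religion + 89% politics" behavior we want.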


{a, b, c…} ~ dirichlet(alpha) - The goal:

@chrisemoody

Use all of this context to learn

interpretable topics.

this document is

80% high fashion

this document is

60% style

word2vec

LDA

lda2vec

P(vOUT |vIN + vDOC) - The goal:

@chrisemoody

Use all of this context to learn

interpretable topics.

word2vec

LDA

lda2vec

P(vOUT |vIN+ vDOC + vZIP) - The goal:

@chrisemoody

Use all of this context to learn

interpretable topics.

this zip code is

80% hot climate

this zip code is

60% outdoors wear

word2vec

LDA

lda2vec

P(vOUT |vIN+ vDOC + vZIP) - The goal:

@chrisemoody

Use all of this context to learn

interpretable topics.

this client is

80% sporty

this client is

60% casual wear

word2vec

LDA

lda2vec

P(vOUT |vIN+ vDOC + vZIP +vCLIENTS) - The goal:

@chrisemoody

Use all of this context to learn

interpretable topics.

Can also make the topics

supervised so that they predict

an outcome.

word2vec

P(vOUT |vIN+ vDOC + vZIP +vCLIENTS)

LDA

P(sold | vCLIENTS)

lda2vec - @chrisemoody

uses pyldavis

API Ref docs (no narrative docs)

GPU

github.com/cemoody/lda2vec

Decent test coverage - @chrisemoody

Can we model topics to sentences?

lda2lstm

doc_id=1846 “PS! Thank you for such an awesome idea” - @chrisemoody

doc_id=1846 “PS! Thank you for such an awesome idea”

Can we represent the internal LSTM

states as a dirichlet mixture? - @chrisemoody


Can we model topics to images?

lda2ae

TJ Torres - Bonus slides
- Paragraph Vectors

(Just extend the context window)

Content dependency

(Change the window grammatically)

Crazy Approaches

Social word2vec (deepwalk)

(Sentence is a walk on the graph)

Spotify

(Sentence is a playlist of song_ids)

Stitch Fix

(Sentence is a shipment of five items)

- SkipGram vs. CBOW

“The fox jumped over the lazy dog”

SkipGram: guess the context given the word. (this is the one we went over)

CBOW: guess the word given the context. Better at syntax. ~20x faster. (this is the alternative.)

- LDA Results
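The CBOW side of the slide above averages the context vectors and predicts the center word from the result; a numpy sketch with toy dimensions (not the talk's actual model):

```python
import numpy as np

rng = np.random.default_rng(0)
words = "the fox jumped over the lazy dog".split()
vocab = {w: i for i, w in enumerate(sorted(set(words)))}
D = 4
v_in = rng.normal(size=(len(vocab), D))   # input vectors
v_out = rng.normal(size=(len(vocab), D))  # output vectors

# CBOW: average the context vectors around 'over' ...
context = ["the", "fox", "jumped", "the", "lazy", "dog"]
v_context = np.mean([v_in[vocab[w]] for w in context], axis=0)

# ... then a softmax over the vocabulary gives P(word | context).
scores = v_out @ v_context
p = np.exp(scores - scores.max())
p /= p.sum()
```

The single averaged context is why CBOW takes roughly one update per window instead of one per (input, output) pair, hence the speedup.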

Great Stylist

Perfect

I loved every choice in this fix!! Great job!

- LDA Results

Body Fit

My measurements are 36-28-32. If that helps.

I like wearing some clothing that is fitted.

Very hard for me to find pants that fit right.

- LDA Results

Sizing

Excited for next

Really enjoyed the experience and the

pieces, sizing for tops was too big.

Looking forward to my next box!

- LDA Results

Almost Bought

Perfect

It was a great fix. Loved the two items I

kept and the three I sent back were close!

- What I didn’t mention

A lot of text (only if you have a specialized vocabulary)

Cleaning the text

Memory & performance

Traditional databases aren’t well-suited

False positives - and now for something completely crazy
- All of the following ideas will change what

‘words’ and ‘context’ represent. - paragraph vector

What about summarizing documents?

On the day he took office, President Obama reached out to America’s enemies,

offering in his first inaugural address to extend a hand if you are willing to unclench

your fist. More than six years later, he has arrived at a moment of truth in testing that - paragraph vector

On the day he took office, President Obama reached out to America’s enemies, offering in his first inaugural address to extend a hand if you are willing to unclench your fist. More than six years later, he has arrived at a moment of truth in testing that

The framework nuclear agreement he reached with Iran on Thursday did not provide the definitive answer to whether Mr. Obama’s audacious gamble will pay off. The fist Iran has shaken at the so-called Great Satan since 1979 has not completely relaxed.

Normal skipgram extends C words before, and C words after. - IN

paragraph vector

doc_1347

(the same passage as above, with the document id doc_1347 added to the context and OUT labels on the predicted words)

A document vector simply extends the context to the whole document. - from gensim.models import Doc2Vec

fn = "item_document_vectors"

model = Doc2Vec.load(fn)

matches = model.most_similar('pregnant')

matches = list(filter(lambda x: 'SENT_' in x[0], matches))

# ['...I am currently 23 weeks pregnant...',

# '...I'm now 10 weeks pregnant...',

# '...not showing too much yet...',

# '...15 weeks now. Baby bump...',

# '...6 weeks post partum!...',

# '...12 weeks postpartum and am nursing...',

# '...I have my baby shower that...',

# '...am still breastfeeding...',

# '...I would love an outfit for a baby shower...']

- English

translation

English → Spanish

(using just a rotation matrix)

Mikolov 2013 - context

Australian scientist discovers star with telescope

dependent

context +/- 2 words

Levy & Goldberg 2014 - context

Australian scientist discovers star with telescope

dependent context

Levy & Goldberg 2014 - BoW DEPS

context dependent

topically-similar vs ‘functionally’ similar

Levy & Goldberg 2014 - Also show that SGNS is simply factorizing:

context dependent

w · c = PMI(w, c) - log k

This is completely amazing!

Intuition: positive associations (canada, snow) are stronger in humans than negative associations (what is the opposite of Canada?)

Levy & Goldberg 2014 - word2vec
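Levy & Goldberg's result says skip-gram with negative sampling implicitly factorizes the shifted PMI matrix. A numpy sketch of building that matrix from raw co-occurrence counts (the tiny count matrix is illustrative; a real one is huge and sparse):

```python
import numpy as np

# Toy co-occurrence counts n(word, context).
counts = np.array([[10.0, 2.0],
                   [1.0, 8.0]])
k = 5  # number of negative samples in SGNS

p_wc = counts / counts.sum()            # joint P(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)   # marginal P(w)
p_c = p_wc.sum(axis=0, keepdims=True)   # marginal P(c)

pmi = np.log(p_wc / (p_w * p_c))
# SGNS implicitly finds vectors with w . c = PMI(w, c) - log k:
target = pmi - np.log(k)
```

Positive PMI cells are pairs seen more often than chance (canada, snow); SGNS mostly cares about reconstructing those.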

deepwalk

word2vec learns word vectors from sentences; in deepwalk, ‘words’ are graph vertices and ‘sentences’ are random walks on the graph

“The fox jumped over the lazy dog”

Perozzi et al 2014

- Playlists at Spotify

‘words’ are songs, ‘sentences’ are playlists
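Both deepwalk and the playlist model reuse word2vec on non-text 'sentences'. Deepwalk's random-walk sentences can be sketched in plain Python over a hypothetical toy graph:

```python
import random

# Hypothetical toy graph: vertex -> neighbors.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

def random_walk(start, length, rng):
    """One deepwalk 'sentence': a random walk over vertex names."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

rng = random.Random(0)
sentences = [random_walk("a", 5, rng) for _ in range(3)]
# These token lists feed into word2vec exactly like text sentences.
```

Swap vertices for song ids and walks for playlists and you have the Spotify variant; swap in styles and shipments and you have the Stitch Fix one.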

context

sequence

learning - Great performance on ‘related artists’

Playlists at

Spotify

Erik Bernhardsson

- Fixes at Stitch Fix

Let’s try: ‘words’ are styles, ‘sentences’ are fixes

- Learn similarity between styles

Fixes at Stitch Fix

because they co-occur

Learn ‘coherent’ styles

- Fixes at

Stitch Fix?

Got lots of structure!

- Fixes at

Stitch Fix?

- Fixes at

Stitch Fix?

Nearby regions are consistent ‘closets’

- A specific lda2vec model

Our text blob is a comment that comes from a region_id and a style_id
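For that Stitch Fix model, the total context is the word vector plus the two document-style mixtures; a hypothetical numpy sketch (sizes, names, and weights are all illustrative, not the talk's actual parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_topics = 8, 4
topics = rng.normal(size=(n_topics, D))

def mixture_vector(weights):
    """A document-style vector: simplex weights times topic vectors."""
    return np.asarray(weights) @ topics

v_in = rng.normal(size=D)                         # the pivot word's vector
v_region = mixture_vector([0.7, 0.1, 0.1, 0.1])   # region_id topic mixture
v_style = mixture_vector([0.05, 0.05, 0.2, 0.7])  # style_id topic mixture

# lda2vec context: predict the output word from the summed vector,
# P(vOUT | vIN + vREGION + vSTYLE).
v_context = v_in + v_region + v_style
```

Each extra id (zip code, client, style) just adds another mixture term to the sum, which is how the talk extends P(vOUT | vIN + vDOC) to zip codes and clients.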