This page reproduces the content of http://www.slideshare.net/alembert2000/ssrae-sa.


Uploaded on 2013/01/17 (nearly four years before this page was captured), in Technology


Deep Learning Japan @ 東大 (University of Tokyo)

http://www.facebook.com/DeepLearning

https://sites.google.com/site/deeplearning2013/

- Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions
  Socher, Pennington, Huang, Ng, Manning (Stanford)
  Presented by Danushka Bollegala
- Task
  • Predict sentence-level sentiment
    • "white blood cells destroying an infection": +
    • "an infection destroying white blood cells": -
  • Predict the distribution of sentiment
    1. Sorry, Hugs: user offers condolences to the author.
    2. You Rock: indicates approval, congratulations.
    3. Teehee: user found the anecdote amusing.
    4. I Understand: show of empathy.
    5. Wow, Just Wow: expression of surprise, shock.
- Approach
  • Learn a representation for the entire sentence using autoencoders (unsupervised)
    • A tree structure (nodes and edges) is learnt
  • Learn a sentiment classifier using the representation learnt in the previous step (supervised)
  • Together, the approach becomes a semi-supervised learning method.
- Neural Word Representations
  • Randomly initialized word representations
    • For a vector x representing a word, sample it from a zero-mean Gaussian: x ∈ R^n, x ~ N(0, σ²)
    • Works well when the task is supervised, because given training data we can later tune the weights in the representation.
  • Pre-trained word representations
    • Bengio et al. 2003
    • Given a context vector c, remove a word x that co-occurs in c, and use the remaining features in c to predict the occurrence of x.
    • Similar to the suggestion of Ando et al. 2003 for transfer learning via Alternating Structure Optimization (ASO).
    • Can take co-occurrence information into account.
    • Rich syntactic and semantic representations for words!
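The zero-mean Gaussian initialization described above can be sketched in a few lines of NumPy. The dimensionality `n`, vocabulary size, and standard deviation `sigma` here are illustrative choices, not values from the slides:

```python
import numpy as np

n = 50          # embedding dimensionality (illustrative choice)
vocab = 10000   # vocabulary size (illustrative choice)
sigma = 0.1     # standard deviation of the Gaussian (hypothetical value)

rng = np.random.default_rng(0)
# Each column of L is one word vector x ~ N(0, sigma^2 I)
L = rng.normal(loc=0.0, scale=sigma, size=(n, vocab))
```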
- Representing the Input
  • L: word-representation matrix
    • Each word in the vocabulary is represented by an n-dimensional vector
    • These vectors are stacked as columns to create the matrix L
  • A word k is represented as a 1-of-k binary vector b_k. The vector representing word k is then given by x = L b_k.
  • This continuous representation is better than the original binary representation because the internal sigmoid activations are continuous.
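The lookup x = L b_k amounts to selecting one column of L, which the following toy sketch makes concrete (the sizes and the index `k` are made up for illustration):

```python
import numpy as np

n, V = 4, 6                            # toy dimensionality and vocabulary size
rng = np.random.default_rng(0)
L = rng.normal(0.0, 0.1, size=(n, V))  # word-representation matrix (one column per word)

k = 2                                  # vocabulary index of the word
b_k = np.zeros(V)
b_k[k] = 1.0                           # 1-of-V binary indicator vector

x = L @ b_k                            # x = L b_k: the continuous word representation
```

Multiplying by the indicator vector and slicing `L[:, k]` give the same result, which is why implementations store only the column index.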
- Autoencoders (Tree Given)
  • Binary trees are assumed (each parent has two children).
  • Given child-node representations, iteratively compute the representation of their parent node.
  • Concatenate the two child vectors, apply the weight matrix W and bias b, followed by a sigmoid function f.
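The parent computation p = f(W [c1; c2] + b) described in these bullets can be sketched as follows; the sizes and random weights are illustrative only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 4                                      # toy representation size
rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, size=(n, 2 * n))  # encoder weight matrix
b = np.zeros(n)                            # bias term

c1 = rng.normal(size=n)   # left-child representation
c2 = rng.normal(size=n)   # right-child representation

# Parent representation: concatenate the children, apply W and b, then f
p = sigmoid(W @ np.concatenate([c1, c2]) + b)
```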
- Structure Prediction
  • Build the tree as well as the node representations
  • Concept:
    • Given a sentence (a sequence of words), generate all possible trees over those words.
    • For each tree, learn autoencoders at each non-leaf node and compute the reconstruction error.
    • The total reconstruction error of a tree is the sum of the reconstruction errors at each node in the tree.
    • Select the tree with the minimum total reconstruction error.
  • Too expensive (slow) in practice!
- Greedy Unsupervised RAE
  • For each pair of consecutive words, compute their parent.
  • Repeat this process until we are left with a single root. Given n child nodes at some step, there are n-1 candidate parents, so the number of nodes to process shrinks as we go up the tree.
  • At each level, select the parent with the minimum reconstruction error.
  • Weighted reconstruction
    • The number of descendants of a parent must be considered when computing its reconstruction error.
  • Parent vectors are L2-normalized (p/||p||) to prevent the hidden layer (W) from shrinking the representations and thereby trivially reducing the reconstruction error.
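The greedy procedure above can be sketched as follows. This is a simplified illustration, not the paper's implementation: the helper name `greedy_rae_tree`, the decoder weights `W_dec`/`b_dec`, and the unweighted reconstruction error (the slides call for weighting by descendant counts) are all assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def greedy_rae_tree(leaves, W_enc, b_enc, W_dec, b_dec):
    """Greedily merge adjacent node pairs until a single root remains."""
    nodes = [v / np.linalg.norm(v) for v in leaves]
    while len(nodes) > 1:
        best_i, best_err, best_p = None, np.inf, None
        # Try every adjacent pair; keep the merge with the lowest reconstruction error
        for i in range(len(nodes) - 1):
            c = np.concatenate([nodes[i], nodes[i + 1]])
            p = sigmoid(W_enc @ c + b_enc)
            p = p / np.linalg.norm(p)            # L2-normalize the parent vector
            c_hat = sigmoid(W_dec @ p + b_dec)   # decoder reconstructs the children
            err = np.sum((c - c_hat) ** 2)       # (unweighted) reconstruction error
            if err < best_err:
                best_i, best_err, best_p = i, err, p
        nodes[best_i:best_i + 2] = [best_p]      # replace the chosen pair with its parent
    return nodes[0]

# Tiny usage example with random weights and a three-word "sentence"
n = 4
rng = np.random.default_rng(0)
W_enc, b_enc = rng.normal(0, 0.1, (n, 2 * n)), np.zeros(n)
W_dec, b_dec = rng.normal(0, 0.1, (2 * n, n)), np.zeros(2 * n)
leaves = [rng.normal(size=n) for _ in range(3)]
root = greedy_rae_tree(leaves, W_enc, b_enc, W_dec, b_dec)
```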
- Semi-Supervised RAE
  • Each parent node is assigned a 2n-dimensional feature vector
  • We learn a weight vector (2n-dimensional) on top of those parent vectors
  • The logistic sigmoid function is used as the output
  • The sum of cross-entropy losses is minimized over all parent nodes in the tree
  • Similar to logistic regression
- Equations (shown as images in the original slide)
  • Sigmoid output at a parent node p
  • Cross-entropy loss
  • Aggregate the loss over all (sentence x, label t) pairs in the corpus
  • The total loss of a tree is the sum of the losses at all parent nodes
  • The loss at a parent node comes from two sources: reconstruction error + cross-entropy loss
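A per-parent loss combining the two sources can be sketched as below. The names `W_label` and `alpha` are hypothetical (a weight matrix for the label output and a trade-off hyperparameter), and the normalization of the sigmoid output into a distribution is an assumption of this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def parent_loss(p, t, W_label, recon_err, alpha=0.5):
    """Loss at one parent node: reconstruction error + cross-entropy.

    `W_label` and `alpha` are hypothetical names; `t` is the target
    sentiment distribution for the sentence.
    """
    d = sigmoid(W_label @ p)        # sigmoid output at parent node p
    d = d / d.sum()                 # normalize into a distribution over labels
    ce = -np.sum(t * np.log(d))     # cross-entropy between target t and prediction d
    return alpha * recon_err + (1.0 - alpha) * ce

# Example: a 4-dim parent vector scored against 5 sentiment categories
rng = np.random.default_rng(0)
p = rng.normal(size=4)
t = np.array([0.1, 0.2, 0.3, 0.2, 0.2])   # target distribution over the 5 labels
W_label = rng.normal(0, 0.1, (5, 4))
loss = parent_loss(p, t, W_label, recon_err=0.5)
```

Summing this quantity over all parent nodes of a tree, and then over all (sentence, label) pairs, gives the corpus-level objective the slide describes.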
- Learning
  • Compute the gradient w.r.t. the parameters (Ws, b) and use L-BFGS
  • The objective function is not convex.
  • Only local optima can be reached with this procedure.
  • Works well in practice.
- EP Dataset
  • Experience Project (EP) dataset
    • People write confessions and others tag them.
    • Five categories
- Baselines
  • Random (20%)
  • Most frequent (38.1%)
    • "I understand"
  • Binary bag of words (46.4%)
    • Represent each sentence as a binary bag of words; do not use pre-trained representations.
  • Features (47.0%)
    • Use sentiment lexicons and training data (supervised sentiment classification with SVMs)
  • Word vectors (45.5%)
    • Ignore the tree structure learnt by the RAE. Aggregate the pre-trained representations of the words in the sentence and train an SVM.
  • Proposed method (50.1%)
    • Learn the tree structure with the RAE using unlabeled data (sentences only).
    • A softmax layer is trained on each parent node using labeled data.
- Predicting the Distribution (results figure in the original slide)
- Random Vectors!!!
  • Randomly initializing the word vectors does very well!
  • Because supervision occurs later, it is OK to initialize randomly.
  • Randomly initialized RAEs have shown similarly good performance in other tasks
    • "On Random Weights and Unsupervised Feature Learning", ICML 2011
- Summary
  • Learn a sentiment classifier using recursive autoencoders
  • The tree structure is also learnt, using unlabeled data
    • Greedy algorithm for tree construction
  • Semi-supervision at the parent level
  • The model is general enough for other sentence-level classification tasks
  • Random word representations do very well!