Parsing Natural Scenes and Natural Language with Recursive Neural Networks Richard Socher Cliff Chiung-Yu Lin Andrew Y. Ng Christopher D. Manning Slides & Speech: Rui Zhang
Outline
- Motivation & Contribution
- Recursive Neural Network
- Scene Segmentation using RNN
- Learning and Optimization
- Language Parsing using RNN
- Experiments
Motivation
- Data naturally contains recursive structures
  - Image: scenes split into objects, and objects split into parts
  - Language: a noun phrase can contain a clause which contains noun phrases of its own
Motivation
- The recursive structure helps to
  - Identify the components of the data
  - Understand how the components interact to form the whole
Contribution
- First deep learning method to achieve state-of-the-art performance on scene segmentation and annotation
- Learned deep features outperform hand-crafted ones (e.g., Gist)
- Generalizes to other tasks, e.g., language parsing
Recursive Neural Network
- Similar to a one-layer fully-connected network
- Models the transformation from child nodes to their parent node
- Applied recursively over a tree structure
  - The parent at one layer becomes a child at the layer above
  - Parameters are shared across layers
[Figure: child nodes c1, c2 (and c3) combined through W_recur into parent feature h]
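To make the composition step concrete, here is a minimal sketch in NumPy. The shapes, the names `compose`, `W_recur`, `b`, and the choice of tanh as the nonlinearity are illustrative assumptions, not the paper's exact setup:

```python
# One recursive composition step: two n-dim child features are mapped to
# one n-dim parent feature through a shared (n x 2n) weight matrix.
import numpy as np

def compose(c1, c2, W_recur, b):
    """Map two child feature vectors to one parent feature vector h."""
    children = np.concatenate([c1, c2])     # stack children: shape (2n,)
    return np.tanh(W_recur @ children + b)  # parent feature: shape (n,)
```

Because the same (W_recur, b) pair is reused at every merge, the network handles trees of arbitrary size with a fixed number of parameters.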
Recursive vs. Recurrent NN
- Two different models share the abbreviation RNN: Recursive and Recurrent
- Similarity
  - Both apply shared parameters in a recursive fashion
- Difference
  - Recursive NNs apply to trees, while Recurrent NNs apply to sequences
  - A Recurrent NN can be viewed as a Recursive NN over a degenerate, chain-shaped tree
Scene Segmentation Pipeline
- Over-segment the image into superpixels
- Extract features from the superpixels
- Map the features onto a semantic space
- Compute a score for each candidate merge with the RNN
  - Enumerate the possible merges
  - Merge the pair of nodes with the highest score
- Repeat until only one node is left
Input Data Representation
- Image
  - Over-segmented into superpixels
  - Hand-crafted features extracted per superpixel
  - Features mapped onto a semantic space by one fully-connected layer to obtain a feature vector
  - Each superpixel has a class label
Tree Construction
- Scene parse trees are constructed bottom-up
- Leaf nodes are the over-segmented superpixels
  - Hand-crafted features are extracted and mapped onto a semantic space by one fully-connected layer
  - Each leaf therefore has a feature vector
- An adjacency matrix A records the neighboring relations:
  A_ij = 1 if superpixels i and j are neighbors, and 0 otherwise
[Figure: adjacency matrix]
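A minimal sketch of building this matrix, assuming `neighbor_pairs` is a hypothetical list of (i, j) index pairs of adjacent superpixels (how adjacency is detected is outside the scope of this sketch):

```python
# Build the symmetric adjacency matrix over superpixels.
import numpy as np

def build_adjacency(num_segments, neighbor_pairs):
    A = np.zeros((num_segments, num_segments), dtype=int)
    for i, j in neighbor_pairs:
        A[i, j] = A[j, i] = 1  # A_ij = 1 iff i and j are neighbors
    return A
```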
Greedy Merging
- Nodes are merged greedily; in each iteration (see the sketch below)
  - Enumerate all possible merges (pairs of adjacent nodes)
  - Compute a score for each candidate merge
    - A fully-connected transformation on top of h
  - Merge the pair with the highest score
    - c1 and c2 are replaced by a new node c12
    - h12 becomes the feature of c12
    - The union of the neighbors of c1 and c2 becomes the neighbors of c12
- Repeat until only one node is left
[Figure: c1, c2 composed via W_recur into h12, which W_score maps to a merge score]
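A sketch of this loop under illustrative assumptions: `features` maps node id to feature vector, `neighbors` maps node id to the set of adjacent node ids, and `compose`/`score` are the RNN composition and scoring functions (e.g., score(h) = W_score @ h). None of these names come from the paper, and the actual system caches scores rather than recomputing them:

```python
def greedy_parse(features, neighbors, compose, score):
    """Greedily build a parse tree; returns {root_id: root_feature}."""
    next_id = max(features) + 1
    while len(features) > 1:
        # Enumerate all candidate merges: pairs of adjacent live nodes.
        candidates = [(i, j) for i in features for j in neighbors[i] if i < j]
        # Score the composed parent of every candidate and keep the best.
        i, j = max(candidates,
                   key=lambda p: score(compose(features[p[0]], features[p[1]])))
        # Replace c_i and c_j by the new node c_ij with feature h_ij.
        features[next_id] = compose(features[i], features[j])
        # The union of the two nodes' neighbors becomes the new node's neighbors.
        neighbors[next_id] = (neighbors[i] | neighbors[j]) - {i, j}
        for k in neighbors[next_id]:
            neighbors[k] = (neighbors[k] - {i, j}) | {next_id}
        del features[i], features[j], neighbors[i], neighbors[j]
        next_id += 1
    return features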
Training (1)
- Max-margin estimation
- Structured margin loss Δ
  - Penalizes merging a segment with a segment of a different label before it has merged with all of its same-label neighbors
  - Proportional to the number of subtrees not appearing in any correct tree
- Tree score s
  - Sum of the merge scores over all non-leaf nodes
- Class label
  - Softmax on top of each node's feature vector
- Correct trees
  - Adjacent nodes with the same label are merged first
  - One image may have more than one correct tree
Training (2)
- Intuition: the score of the highest-scoring correct tree should exceed that of any other tree by a margin Δ
- Formulation: the margin loss r_i(θ) is minimized for each training image:
  r_i(θ) = max_{y ∈ T(x_i)} [ s(x_i, y) + Δ(x_i, l_i, y) ] − max_{y ∈ Y(x_i, l_i)} s(x_i, y)
- Notation
  - d is a node in the parse tree; N(·) is the set of nodes
  - θ is the set of all model parameters
  - i indexes the training images; x_i is training image i and l_i its labels
  - Y(x_i, l_i) is the set of correct trees of x_i; T(x_i) is the set of all possible trees of x_i
  - s(·) is the tree score function
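A sketch of the structured margin Δ under illustrative assumptions: each subtree is identified by the frozenset of leaf (superpixel) ids it covers, and `kappa` is the per-subtree penalty hyperparameter (both representations are this sketch's choice, not the paper's):

```python
def margin_loss(proposed_subtrees, correct_subtrees, kappa):
    """Delta(x, l, y): kappa times the number of subtrees of the proposed
    tree that do not occur in any correct tree for this image."""
    return kappa * sum(1 for d in proposed_subtrees
                       if d not in correct_subtrees)
```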
Training (3)
- The label of a node is predicted by a softmax layer
- The margin Δ is not differentiable, so only a subgradient is computed
- ∂s/∂θ is obtained by back-propagation
- The gradient of the label prediction is also obtained by back-propagation
[Figure: c1, c2 composed via W_recur into h12; W_score produces the merge score and W_label the node label]
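A minimal sketch of the per-node label prediction: a softmax over W_label applied to a node feature. The names and shapes are illustrative and the bias is omitted:

```python
# h is an n-dim node feature; W_label is a (num_classes x n) matrix.
import numpy as np

def predict_label(h, W_label):
    logits = W_label @ h
    probs = np.exp(logits - logits.max())  # shift by max for numerical stability
    return probs / probs.sum()             # class distribution for the node
```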
Language Parsing
- Language parsing is similar to scene parsing
- Differences
  - The input is a natural-language sentence
  - Adjacency is strictly the left and right neighbors in the word sequence
  - Class labels are syntactic categories
    - Word level, phrase level, clause level
  - Each sentence has only one correct tree
Experiments Overview
- Image
  - Scene Segmentation and Annotation
  - Scene Classification
  - Nearest Neighbor Scene Subtrees
- Language
  - Supervised Language Parsing
  - Nearest Neighbor Phrases
Scene Segmentation and Annotation
- Dataset: Stanford Background Dataset
- Task: segment and label foreground and different types of background, pixelwise
- Result
  - 78.1% pixelwise accuracy
  - 0.6% above the previous state of the art
Scene Classification
- Dataset: Stanford Background Dataset
- Task: three classes (city, countryside, sea-side)
- Method
  - Feature: average of all node features, or the top node feature only
  - Classifier: linear SVM
- Result
  - 88.1% accuracy with the averaged feature
    - 4.1% above Gist, the state-of-the-art feature
  - 71.0% accuracy with the top feature only
- Discussion
  - The learned RNN feature better captures the semantic content of a scene
  - The top feature alone loses some lower-level information
Nearest Neighbor Scene Subtrees
- Dataset: Stanford Background Dataset
- Task: retrieve similar segments from all images
  - A subtree whose nodes all have the same label corresponds to a segment
- Method (sketched below)
  - Feature: top node feature of the subtree
  - Metric: Euclidean distance
- Result: similar segments are retrieved
- Discussion: the RNN feature captures segment-level characteristics
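A sketch of the retrieval step, assuming `index` is a hypothetical list of (segment_name, feature_vector) pairs gathered from all images and `query` is a top-node feature:

```python
# Rank candidate segments by Euclidean distance in the learned feature space.
import numpy as np

def nearest_segments(query, index, k=5):
    ranked = sorted(index, key=lambda item: np.linalg.norm(query - item[1]))
    return [name for name, _ in ranked[:k]]
```

The same retrieval applies to the nearest-neighbor phrase experiment below, with sentence top-node features in place of segment features.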
Supervised Language Parsing
- Dataset: Penn Treebank, Wall Street Journal section
- Task: generate parse trees with labeled nodes
- Result: unlabeled bracketing F-measure of 90.29%, comparable to the 91.63% of the Berkeley Parser
Nearest Neighbor Phrases
- Dataset: Penn Treebank, Wall Street Journal section
- Task: retrieve the nearest neighbors of a given sentence
- Method
  - Feature: top node feature
  - Metric: Euclidean distance
- Result: similar sentences are retrieved
Discussion
- Understanding the semantic structure of data is essential for applications like fine-grained search or captioning
- The Recursive NN predicts the tree structure along with node labels in an elegant way
- The Recursive NN could be incorporated with a CNN: if the Recursive NN were learned jointly with a CNN feature extractor, the hand-crafted input features could be replaced by learned ones