
1 SAIL The Stanford Artificial Intelligence Laboratory

2 Distributed representations of language are back
Christopher Manning, Stanford University. Artificial intelligence requires computers to have the knowledge of people, and the primary source of that knowledge is text. Human beings evolved to have language, and that set off their general ascendancy; but it was much more recently that humans developed writing, allowing knowledge to be communicated across space and time. In just a few thousand years, this took us from the bronze age to the smartphones and tablets of today. The question is how to get computers to be able to understand the information in text. [Image: Trinity College Library, Dublin. The Long Room.] 45-minute slot = 40 slides, I guess.

3 1980s Natural Language Processing
VP → { V (NP: (↑ OBJ)=↓) (NP: (↑ OBJ2)=↓) (XP: (↑ XCOMP)=↓) VP VP }
salmon N (↑ PRED)='SALMON' (↑ PERSON)=3 { (↑ NUM)=SG | (↑ NUM)=PL }
What didn't work: trying to teach a computer all the rules of grammar. Why? Because language is too vast, ambiguous, and context-dependent.

4 Learning language
How does a project get to be a year late? … One day at a time.
WRB VBZ DT NN VB TO VB DT NN JJ : CD NN IN DT NN
P(late | a, year) = 0.0087     P(NN | DT, a, project) = 0.9
The computer learns about language by reading a lot of text, perhaps annotated with labels/supervision by humans. Statistical NLP gave soft, usage-based models of language, and led to the technologies we use today for speech recognition, spelling correction, question answering, and machine translation. But although these models have continuous variables in the probabilities, they are built over the same categorical, symbolic representations of words, part-of-speech tags, and other categories as rule-based descriptions.
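Probabilities like these are maximum-likelihood estimates from corpus counts: P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2). A minimal sketch of that estimate; the toy corpus and helper name below are invented for illustration:

```python
from collections import Counter

def trigram_probs(tokens):
    """Maximum-likelihood estimates of P(w3 | w1, w2) from a list of tokens."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    return {(w1, w2, w3): c / bi[(w1, w2)] for (w1, w2, w3), c in tri.items()}

tokens = "how does a project get to be a year late one day at a time".split()
probs = trigram_probs(tokens)
print(probs[("a", "year", "late")])  # 1.0 on this tiny corpus; ~0.0087 on a large one
```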

5 The Rumelhart-McClelland 1986 past-tense model
{ #he, hel, ell, llo, lo# } — the character trigrams of "hello", with # marking the word boundaries.
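For illustration, this is how a word decomposes into boundary-marked character trigrams like the set above; a minimal sketch (the helper below is hypothetical, not the 1986 model's code):

```python
def char_trigrams(word):
    """Decompose a word into boundary-marked character trigrams."""
    padded = "#" + word + "#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

print(char_trigrams("hello"))  # {'#he', 'hel', 'ell', 'llo', 'lo#'}
```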

6 The traditional word representation
motel = [0 0 0 … 0 1 0 … 0]  (a one-hot, 1-of-V vector)
Dimensionality: 50K (small domain – speech/PTB) to 13M (web – Google 1T)
motel [0 … 0 1 0 … 0] AND hotel [0 … 0 1 0 … 0] = 0  (the two vectors share no non-zero dimension)
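A minimal sketch of why such one-hot representations carry no similarity information; the tiny vocabulary below is invented for illustration (real vocabularies run from 50K to 13M entries):

```python
import numpy as np

# Hypothetical 5-word vocabulary for illustration only.
vocab = {"the": 0, "a": 1, "motel": 2, "hotel": 3, "walk": 4}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

# Every pair of distinct words is orthogonal, so "motel" and "hotel" share nothing.
print(one_hot("motel") @ one_hot("hotel"))  # 0.0
```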

7 Word distributions → distributed word representations
… Through corpus linguistics, large chunks …
… the study of language and linguistics …
… The field of linguistics is concerned …
… Written like a linguistics text book …
… Phonology is the branch of linguistics that …
linguistics = [0.286 0.792 −0.177 −0.107 0.109 −0.542 0.349 0.271 0.487]
Word distributions are used to collect lexical statistics and to build super-useful word representations. Vector space semantics has just been mega-useful: up to now, better lexical semantics has clearly given us more operational value than trying to model first-order logic in NLP. Meaning representations are at the other end of the spectrum from signal processing, but they share with signals that human beings have at best fairly weak ideas about how to represent the useful information in them. [Bengio et al. 2003, Mnih & Hinton 2008, Collobert & Weston 2008, Turian 2010, Mikolov 2013, etc.]
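Count-based word vectors start from exactly these context windows; a minimal sketch of collecting co-occurrence counts (the toy corpus and window size are assumptions for illustration; a dense vector like the one above would then be learned from, or factorized out of, such counts):

```python
from collections import Counter, defaultdict

def cooccurrence_vectors(tokens, window=4):
    """Count how often each word co-occurs with context words within a window."""
    vectors = defaultdict(Counter)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                vectors[w][tokens[j]] += 1
    return vectors

corpus = ("phonology is the branch of linguistics that studies sound "
          "the field of linguistics is concerned with language").split()
print(cooccurrence_vectors(corpus)["linguistics"].most_common(5))
```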

8 Distributed word representations
A foundational component of deep networks in NLP
Base case for meaning composition
Vector space model is widely used for semantic similarity

9 Matrix-based methods for learning word representations
LSA (SVD), HAL (Lund & Burgess), COALS (Rohde et al.), Hellinger-PCA (Lebret & Collobert)
Fast training
Efficient usage of statistics
Primarily used to capture word similarity
Disproportionate importance given to small counts

10 “Neural” methods for learning word representations
NNLM, HLBL, RNN, Skip-gram/CBOW, (i)vLBL (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton; Mikolov et al.; Mnih & Kavukcuoglu)
Scales with corpus size
Inefficient usage of statistics
Can capture complex patterns beyond word similarity
Generate improved performance on other tasks

11 Matrix-based methods for learning word representations
Matrix-based: LSA (SVD), HAL (Lund & Burgess), COALS (Rohde et al.), Hellinger-PCA (Lebret & Collobert). Fast training; efficient usage of statistics; primarily used to capture word similarity; disproportionate importance given to small counts.
Neural: NNLM, HLBL, RNN, Skip-gram/CBOW, (i)vLBL (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton; Mikolov et al.; Mnih & Kavukcuoglu). Scales with corpus size; inefficient usage of statistics; can capture complex patterns beyond word similarity; generate improved performance on other tasks.
New, scalable log-bilinear model for word representations.

12 Word Analogies: a:b :: c:?
Test for linear relationships, examined by Mikolov et al. man:woman :: king:?  →  vec(king) − vec(man) + vec(woman) ≈ vec(queen)
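A minimal sketch of answering a:b :: c:? by vector arithmetic and cosine similarity; the `vectors` dictionary is assumed to hold pre-trained embeddings (e.g. word2vec or GloVe) loaded elsewhere:

```python
import numpy as np

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest (cosine) to vec(b) - vec(a) + vec(c)."""
    target = vectors[b] - vectors[a] + vectors[c]
    best, best_sim = None, -1.0
    for w, v in vectors.items():
        if w in (a, b, c):
            continue
        sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# With trained vectors loaded into `vectors`, analogy("man", "woman", "king", vectors)
# should return "queen".
```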

13 COALS model [Rohde, Gonnerman & Plaut, ms., 2005]

14 Encoding meaning in vector differences [Pennington, Socher, and Manning, EMNLP 2014]
Crucial insight: ratios of co-occurrence probabilities can encode meaning components.
                           x = solid   x = gas   x = water   x = random
P(x | ice)                 large       small     large       small
P(x | steam)               small       large     large       small
P(x | ice) / P(x | steam)  large       small     ~1          ~1

15 Encoding meaning in vector differences [Pennington et al., EMNLP 2014]
Crucial insight: ratios of co-occurrence probabilities can encode meaning components.
                           x = solid    x = gas      x = water    x = fashion
P(x | ice)                 1.9 × 10⁻⁴   6.6 × 10⁻⁵   3.0 × 10⁻³   1.7 × 10⁻⁵
P(x | steam)               2.2 × 10⁻⁵   7.8 × 10⁻⁴   2.2 × 10⁻³   1.8 × 10⁻⁵
P(x | ice) / P(x | steam)  8.9          8.5 × 10⁻²   1.36         0.96

16 GloVe: A new model for learning word representations [Pennington et al., EMNLP 2014]
Q: How can we capture this in a word vector space? A: A log-bilinear model. Fast training; scalable to huge corpora; good performance even with a small corpus and small vectors.
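For reference, the objective in the EMNLP 2014 paper is a weighted least-squares fit of w_i·w̃_j + b_i + b̃_j to log X_ij over the co-occurrence matrix X. A minimal sketch of that loss (the dense-matrix layout and function name are illustrative, not the released implementation; x_max = 100 and α = 0.75 follow the paper's defaults):

```python
import numpy as np

def glove_loss(W, W_tilde, b, b_tilde, X, x_max=100, alpha=0.75):
    """Weighted least-squares GloVe objective over a dense co-occurrence matrix X."""
    i, j = np.nonzero(X)                              # only non-zero co-occurrence counts
    x = X[i, j]
    weight = np.minimum((x / x_max) ** alpha, 1.0)    # down-weight rare co-occurrences
    diff = (W[i] * W_tilde[j]).sum(axis=1) + b[i] + b_tilde[j] - np.log(x)
    return (weight * diff ** 2).sum()
```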

17 Word similarities
Nearest words to frog: 1. frogs 2. toad 3. litoria 4. leptodactylidae 5. rana 6. lizard 7. eleutherodactylus
(Images on the slide: litoria, leptodactylidae, rana, eleutherodactylus)

18 Word analogy task [Mikolov, Yih & Zweig 2013a]
Model                          Dimensions   Corpus size   Performance (Syn + Sem)
CBOW (Mikolov et al. 2013b)    300          1.6 billion   36.1
GloVe (this work)              300          1.6 billion   70.3
CBOW (M et al. 2013b, by us)   300          6 billion     65.7
GloVe (this work)              300          6 billion     71.7
CBOW (M et al. 2013b, by us)   1000         6 billion     63.7
GloVe (this work)              300          42 billion    75.0
Capturing attributes of meaning through a word analogy task.

19 Named Entity Recognition Performance
Model               CoNLL 2003 dev   CoNLL 2003 test   ACE 2   MUC 7
Categorical CRF     91.0             85.4              77.4    73.4
SVD (log tf)        90.5             84.8              73.6    71.5
HPCA                92.6             88.7              81.7    80.7
HSMN (Huang)        –                85.7              78.7    74.7
C&W                 92.2             87.4              –       80.2
CBOW                93.1             88.2              82.2    81.1
GloVe (this work)   93.2             88.3              82.9    –
F1 score of a CRF trained on CoNLL 2003 English with 50-dimensional word vectors.

20 The GloVe Model
A new global-statistics, unsupervised model for learning word vectors. The design translates relationships between word–word co-occurrence probabilities that encode meaning into linear relations in a word vector space.

21 Sentence structure: Dependency parsing

22 Universal (Stanford) Dependencies [de Marneffe et al., LREC 2014]
A common dependency representation and label set applicable across languages.

23 Sentence structure: Dependency parsing

24 Deep Learning Dependency Parser [Chen & Manning, EMNLP 2014]
An accurate and fast neural-network-based dependency parser! Parsing to Stanford Dependencies, evaluated with the standard metrics: unlabeled attachment score (UAS) = head; labeled attachment score (LAS) = head and label.
Parser        UAS    LAS    sent/s
MaltParser    89.8   87.2   469
MSTParser     91.4   88.1   10
TurboParser   92.3   89.6   8
Our Parser    92.0   89.7   654
In this work, we propose a neural-network-based dependency parser. Here is a picture of what we are doing. We parse to Stanford Dependencies, use the standard evaluation metrics (unlabeled and labeled attachment scores), and compare against several widely used dependency parsers. MaltParser gets a UAS score of 89.8, and it can parse 400 sentences per second. MSTParser and TurboParser are two representative graph-based parsers; they have significantly higher scores, but turn out to be slow at parsing. And here is our parser: we get comparable accuracy while running even faster than MaltParser!

25 Shift-reduce (transition-based) dependency parser feature representation
Configuration → binary, sparse feature vector; dim = 10^6 ~ 10^7. Feature templates: usually a combination of 1 ~ 3 elements from the configuration. Indicator features.
Let's examine the traditional feature representation first. The traditional approaches usually extract a set of indicator features based on feature templates, which usually results in millions of sparse, binary features. Feature templates are usually a combination of 1 ~ 3 elements from the configuration. Here are some typical indicator features: we can extract the word and POS tag of a stack element; we can concatenate two or three words together; and we can also include the dependency label in the feature (say, when you reduce, its dependency label).
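A hedged sketch of what such template-based indicator features look like in code; the Configuration fields and helper names are invented for illustration, not the parser's actual code:

```python
def indicator_features(conf):
    """Turn a parser configuration into template-based indicator feature strings."""
    feats = []
    s1, s2 = conf.stack[-1], conf.stack[-2]            # top two stack elements
    b1 = conf.buffer[0]                                # first buffer element
    feats.append(f"s1.w={s1.word}")                    # single-element template
    feats.append(f"s1.w={s1.word}&s1.t={s1.pos}")      # word + POS tag conjunction
    feats.append(f"s1.w={s1.word}&s2.w={s2.word}&b1.w={b1.word}")  # three-word template
    feats.append(f"lc(s1).l={conf.left_child(s1).label}")          # dependency label
    return feats  # each string becomes one dimension of a huge sparse binary vector
```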

26 Problems with indicator features
#1 Sparse: lexicalized interaction terms are important but sparse.
#2 Incomplete.
#3 Slow: 95% of parsing time is consumed by feature computation.
If we encode the configuration with a distributed representation and the model captures interaction terms, all the problems are solved!

27 Sentence structure: Dependency parsing

28 “Marginal prepositions”
“There is a continual change going on by which certain participles or adjectives acquire the character of prepositions or adverbs, no longer needing the prop of a noun to cling to” – Fowler (1926) They moved slowly, toward the main gate, following the wall Repeat the instructions following the asterisk This continued most of the week following that ill-starred trip to church Following a telephone call, a little earlier, Winter had said … He bled profusely following circumcision

29 Deep Learning Dependency Parser [Chen & Manning, EMNLP 2014]
A transition-based (that is, shift-reduce) dependency parser.

30 Cube Activation Function
It is worth noting that we use a cube function for the non-linearity, because it contains product terms of three elements from different tokens, and we believe this better captures the interaction terms. Better captures the interaction terms!
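A minimal sketch of the cube non-linearity in the feed-forward scorer, following the description in Chen & Manning (2014); the variable names and shapes are assumptions, and x stands for the concatenated word/POS/label embeddings of the configuration:

```python
import numpy as np

def hidden_layer(x, W1, b1):
    """Cube non-linearity applied element-wise: h = (W1 x + b1)^3."""
    return (W1 @ x + b1) ** 3

def transition_scores(x, W1, b1, W2):
    """Score parser transitions from the embedded configuration x."""
    return W2 @ hidden_layer(x, W1, b1)
```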

31 Parsing Speed-up + Pre-computation trick:
If we have seen (s1, good) many times in the training set, we can pre-compute the matrix multiplication before parsing, reducing multiplications to additions. 8 ~ 10 times faster. As in [Devlin et al. 2014].
[Figure: configuration elements s1, s2, b1, lc(s1), rc(s1), lc(s2), rc(s2), each contributing word, POS, and dependency-label embeddings; example tokens: He (PRP) has (VBZ) good (JJ) control (NN), nsubj.]
In the parsing process, we need to extract tokens from each configuration and compute the hidden layer via matrix multiplications. We find that a simple pre-computation trick helps speed up parsing a lot: for example, if we have seen the first stack element s1 equal to "good" many times in the training set, we can pre-compute the matrix multiplication for (s1, good), and the multiplications are reduced to additions at parsing time. In practice, this makes our parser 8 ~ 10 times faster.
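A hedged sketch of the pre-computation idea; the matrix slicing, names, and frequent-pair list are assumptions for illustration, not the parser's actual code:

```python
import numpy as np

def precompute(W1, embeddings, frequent_pairs, d_emb):
    """Cache each frequent (position, word) pair's slice of the hidden-layer product."""
    cache = {}
    for pos, tok in frequent_pairs:                      # e.g. (0, "good") for s1 = good
        W_slice = W1[:, pos * d_emb:(pos + 1) * d_emb]   # columns for this input position
        cache[(pos, tok)] = W_slice @ embeddings[tok]    # matrix-vector product, done once
    return cache

def hidden_preactivation(tokens, cache, b1):
    # At parse time (assuming all pairs are cached): additions only, then the cube.
    return sum(cache[(pos, tok)] for pos, tok in enumerate(tokens)) + b1
```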

32 Deep Learning Dependency Parser [Chen & Manning, EMNLP 2014]
Parser type        Parser                   LAS (Label & Attach)   Sentences / sec
Transition-based   MaltParser (stackproj)   86.9                   469
Transition-based   Our parser               89.6                   654
Graph-based        MSTParser                87.6                   10
Graph-based        TurboParser (full)       89.7                   8
Embedding size 50, hidden size 200, mini-batch AdaGrad with α = 0.01, 0.5 dropout on the hidden layer, pre-trained C&W word vectors.

33 Cube Activation Function

34 Pre-trained Word Vectors
Random initialization: uniform(−0.01, 0.01). Pre-trained: PTB (C&W): %; CTB (word2vec): +1.7%. (C&W: Collobert et al. 2011; word2vec: Mikolov et al. 2013)

35 POS / Dependency Embeddings
POS embeddings help a lot: PTB: +1.7%; CTB: %. Slight gain from dependency embeddings. (Slide 41 is the last slide!)

36 Envoi
A new understanding of good word vectors. An accurate (and fast) neural network dependency parser, coming soon to Stanford CoreNLP…. These are key tools for building intelligent systems that can recognize and exploit the compositional semantic structure of language. Thank you!

37 SAIL The Stanford Artificial Intelligence Laboratory

38 Grounding language meaning with images [Socher, Karpathy, Le, Manning & Ng, TACL 2014]
Idea: Map sentences and images into a joint space
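The paper learns such a joint space with a max-margin ranking objective over sentence and image vectors; here is a minimal sketch of one direction of such a loss (the matrix layout and margin value are assumptions, not the paper's exact formulation):

```python
import numpy as np

def ranking_loss(sent_vecs, img_vecs, margin=1.0):
    """Max-margin loss pushing each sentence closer to its own image than to others."""
    S = sent_vecs @ img_vecs.T        # similarity of every sentence-image pair
    correct = np.diag(S)[:, None]     # matched pairs sit on the diagonal
    loss = np.maximum(0.0, margin + S - correct)
    np.fill_diagonal(loss, 0.0)       # do not penalize the correct pair itself
    return loss.sum()
```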

39 Example dependency tree and image

40 Recursive computation of dependency tree
(Slide 41 is the conclusion.)

41 Image credits and permissions
Slides 2 and 27: Trinity College Dublin library. Free use picture from …
Slide 3: from a paper of the author (Manning 1992, "Romance is so complex").
Slide 4: PR2 robot reading, CC BY-SA 3.0 by Troy Straszheim, from …

42 Image credits and permissions
Slide 10: images from Wikipedia. (i) Public domain, http://en.wikipedia.org/wiki/Litoria#mediaviewer/File:Caerulea_cropped(2).jpg (ii) Brian Gratwick (iii) Grand-Duc, Niabot (iv) Public domain.
Slides 19, 20, 21, 25: images from the PASCAL Visual Object Challenge 2008 data set.

