1
Natural language understanding
Zack Larsen
2
What is natural language understanding (NLU)?
“A subtopic of natural language processing in artificial intelligence that deals with machine reading comprehension. NLU is considered an AI-hard problem.” – Wikipedia
4
Word2vec
- A distributed vector space model that learns word embeddings
- Based on shallow neural networks
- Proposed in 2013 by Mikolov et al. (Google)
- Achieves performance superior to other leading neural network and latent semantic analysis models while being more scalable and time efficient (100 billion words/day)
- Uses a skip-gram context window or continuous bag-of-words instead of a traditional bag-of-words (BoW)
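A minimal training sketch using the gensim library (assuming gensim 4.x; the toy corpus and hyperparameters are illustrative, not the settings from the paper):

```python
# Minimal word2vec training sketch with gensim (assumes gensim >= 4.0).
# Corpus and hyperparameters are illustrative only.
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "and", "a", "woman", "walk"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # embedding dimensionality
    window=5,          # context window size
    sg=1,              # 1 = skip-gram, 0 = CBOW
    negative=5,        # number of negative samples
    min_count=1,
    epochs=50,
)

print(model.wv["king"][:5])          # first few dimensions of the embedding
print(model.wv.most_similar("king")) # nearest neighbours in the vector space
```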
5
Skip-gram objective function
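For reference, the skip-gram objective from Mikolov et al. (2013) maximizes the average log probability of the words within a window of size c around each position t, with the conditional probability defined by a softmax over inner products of input and output vectors:

```latex
J = \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p\left(w_{t+j} \mid w_t\right),
\qquad
p\left(w_O \mid w_I\right) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}
```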
6
Continuous bag-of-words (cBoW) objective function
Instead of feeding the n previous words into the model, the model receives a window of n words around the target word w_t at each time step t.
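For reference, the CBOW objective (Mikolov et al., 2013) predicts the target word from its surrounding context words:

```latex
J = \frac{1}{T} \sum_{t=1}^{T} \log p\left(w_t \mid w_{t-n}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+n}\right)
```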
7
How does word2vec work?
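At its core the model learns just two matrices: input (centre-word) embeddings and output (context-word) embeddings, and it scores how likely each vocabulary word is to appear in the context of a given centre word. A bare-bones numpy illustration of that forward pass (a sketch, not an efficient or faithful training implementation):

```python
# Toy skip-gram forward pass (illustrative only): score every vocabulary word
# as a context candidate for a given centre word, then softmax.
import numpy as np

V, d = 10_000, 100                       # vocabulary size, embedding dimension
W_in = np.random.randn(V, d) * 0.01      # input (centre-word) embeddings
W_out = np.random.randn(V, d) * 0.01     # output (context-word) embeddings

def context_probabilities(centre_id: int) -> np.ndarray:
    """p(context word | centre word) over the whole vocabulary."""
    scores = W_out @ W_in[centre_id]     # dot product with every context vector
    scores -= scores.max()               # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

probs = context_probabilities(centre_id=42)
print(probs.shape, probs.sum())          # (10000,) and ~1.0
```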
8
“Relations = Lines”: the first singular vector of the matrix of word vectors for an analogy represents the “direction” of the analogy or relationship, while the second singular component is mostly random noise.
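One way to check this empirically is to stack the difference vectors of several pairs that share a relation and inspect the singular values. A sketch (the embeddings here are random placeholders; in practice use trained vectors, e.g. model.wv[word]):

```python
# Sketch: test whether one direction dominates a set of analogy pairs.
import numpy as np

rng = np.random.default_rng(0)
d = 50
# Hypothetical embeddings for illustration; substitute real trained vectors.
vec = {w: rng.normal(size=d) for w in
       ["king", "queen", "man", "woman", "actor", "actress"]}

pairs = [("king", "queen"), ("man", "woman"), ("actor", "actress")]
diffs = np.stack([vec[a] - vec[b] for a, b in pairs])

# If the pairs really share one relation, the first singular value dominates
# and its right singular vector is the shared "direction" of the analogy.
U, S, Vt = np.linalg.svd(diffs, full_matrices=False)
print(S / S.sum())
relation_direction = Vt[0]
```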
9
Examples
King – man + woman = Queen
China – Beijing + London = England
Girl – boy + male = female
French – France + Mexico = Spanish
Captured – capture + go = went
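These queries map directly onto gensim's most_similar call. A sketch using pretrained vectors (the model name comes from the gensim-data catalogue; the download is large, roughly 1.6 GB):

```python
# Analogy queries on pretrained word2vec vectors via gensim-data.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")

print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# expected to be close to [('queen', ...)] with these vectors
print(wv.most_similar(positive=["China", "London"], negative=["Beijing"], topn=1))
```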
10
Bag-of-words vs. dependency-based syntactic contexts
11
Negative sampling (NCE / noise-contrastive estimation)
- “Compare the target word with a stochastic selection of contexts instead of all contexts.” Think tf-idf in text classification.
- “The original motivation for sub-sampling was that frequent words are less informative.” There is “…another explanation for its effectiveness: the effective window size grows, including context-words which are both content-full and linearly far away from the focus word, thus making the similarities more topical.”
- “The key assertion underlying NCE is that a good model should be able to differentiate data from noise by means of logistic regression.”
- We want to use “learning by comparison” to train binary logistic regression classifiers to estimate posterior probabilities.
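For reference, the negative-sampling objective from Mikolov et al. (2013) replaces the full softmax with k + 1 binary logistic terms: one for the observed context word and k for words drawn from a noise distribution P_n(w):

```latex
\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right)
+ \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]
```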
12
Sub-sampling of frequent words
Each word w_i in the training set is discarded with probability P(w_i), where f(w_i) is the frequency of word w_i and t is a chosen threshold, typically around 10^-5.
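The discard probability referenced here, as given in Mikolov et al. (2013), is:

```latex
P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}
```

A short sketch of the corresponding keep/discard decision:

```python
import random

def keep_word(freq: float, t: float = 1e-5) -> bool:
    """Randomly keep a word; more frequent words are dropped more often."""
    discard_prob = max(0.0, 1.0 - (t / freq) ** 0.5)
    return random.random() > discard_prob
```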
13
Levy and Goldberg, 2014
Why does vector arithmetic (adding and subtracting) reveal analogies? Because vector arithmetic is similarity arithmetic: mathematically, under the hood, we are maximizing two similarity terms and minimizing a third, dissimilarity term. In the famous “man is to woman as king is to queen” example, queen is the word w that maximizes cos(w, king) − cos(w, man) + cos(w, woman).
“…the neural embedding process is not discovering novel patterns, but rather is doing a remarkable job at preserving the patterns inherent in the word-context co-occurrence matrix.”
Is there a better way to extract analogies? “Yes! We suggest multiplying similarities instead of adding them, and show a significant improvement in every scenario.”
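A compact sketch of the two scoring rules (3COSADD vs. 3COSMUL). It assumes E is a matrix of unit-normalized word vectors and ids maps each word to its row; shifting cosines into [0, 1] before multiplying is one common way to keep all terms positive:

```python
# 3COSADD vs. 3COSMUL analogy scoring (sketch).
import numpy as np

def cos_all(E: np.ndarray, idx: int) -> np.ndarray:
    """Cosine similarity of every word with word `idx` (rows of E are unit-norm)."""
    return E @ E[idx]

def analogy(E, ids, a, b, c, method="mul", eps=1e-3):
    """Solve 'a is to b as c is to ?', e.g. analogy(E, ids, 'man', 'woman', 'king')."""
    sa, sb, sc = (cos_all(E, ids[w]) for w in (a, b, c))
    if method == "add":                       # 3COSADD: cos(w,c) - cos(w,a) + cos(w,b)
        scores = sc - sa + sb
    else:                                     # 3COSMUL, with cosines shifted to [0, 1]
        sa, sb, sc = ((s + 1) / 2 for s in (sa, sb, sc))
        scores = sc * sb / (sa + eps)
    for w in (a, b, c):                       # exclude the three query words
        scores[ids[w]] = -np.inf
    words = {i: w for w, i in ids.items()}
    return words[int(np.argmax(scores))]
```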
14
Levy and Goldberg: notice the imbalance here.
- The “England” similarities are dominated by the “Baghdad” aspect
- The additive approach yields “Mosul” for the query “England – London + Baghdad = ?”
- The multiplicative approach yields “Iraq”
15
Multiplicative combination
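For “man is to woman as king is to ?”, the multiplicative rule (3COSMUL, Levy and Goldberg) selects the word w maximizing the ratio below, where ε is a small constant (0.001 in the paper) that prevents division by zero and the similarities are shifted to be non-negative:

```latex
\arg\max_{w \in V} \ \frac{\cos(w, \text{king}) \cdot \cos(w, \text{woman})}{\cos(w, \text{man}) + \varepsilon}
```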
16
Non-linearity
17
Challenges with word2vec
“…the explicit representation is superior in some of the more semantic tasks, especially geography related ones, as well as the superlatives and nouns. The neural embedding, however, has the upper hand on most verb inflections, comparatives, and family (gender) relations.” Therefore, challenges exist mostly in syntactic tasks. – Levy and Goldberg
18
Problems with syntax
- Word2vec and GloVe perform better on semantic tasks than on syntactic tasks
- These methods are based on word-count co-occurrence matrices
- Semantically related words co-occur more often than syntactic variants do, e.g. “He lives in Chicago, Illinois.” vs. “She lives in Chicago” / “She lived in Chicago”
19
Latent semantic analysis pros and cons
Uses a term-document matrix; HAL (Hyperspace Analogue to Language) uses a term-frequency or term co-occurrence matrix
Pros:
- Fast training
- Efficient usage of statistics on the entire corpus
Cons:
- Primarily used to capture word similarity
- Disproportionate importance given to large counts
- Sub-optimal vector space structure
20
Word2vec and neural net pros and cons
Pros:
- Generates improved performance on other tasks
- Can capture complex patterns beyond word similarity
Cons:
- Scales with corpus size
- Inefficient usage of statistics (training on a window rather than the whole corpus)
21
GloVe (Global Vectors) – Pennington et al., 2014, Stanford NLP
- Uses ratios of word co-occurrence probabilities with various probe words k to examine relationships
- Fast training, scalable to huge corpora
- Good performance even with a small corpus and small vectors
- Dimensionality of ~300; a context window size of 8–10 works best
- Also applicable to named entity recognition
- “3COSMUL (Levy and Goldberg) performed worse than cosine similarity in almost all of our experiments.”
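Pretrained GloVe vectors can be loaded through gensim's downloader. A sketch (the dataset name is assumed from the gensim-data catalogue; the 300-dimensional vectors are a sizable download):

```python
# Load pretrained GloVe vectors via gensim-data.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-300")

print(glove.most_similar("ice", topn=5))
print(glove.similarity("ice", "steam"))
```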
22
Objective function
23
GloVe objective “… a weighted least squares objective J that directly aims to minimize the difference between the dot product of the vectors of two words and the logarithm of their number of co-occurrences.” Here w_i and b_i are the word vector and bias of word i, w_j and b_j are the context word vector and bias of word j, X_ij is the number of times word i occurs in the context of word j, and f is a weighting function that assigns relatively lower weight to rare and frequent co-occurrences.
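Written out (Pennington et al., 2014), with α = 3/4 and x_max = 100 used in the paper:

```latex
J = \sum_{i,j=1}^{V} f\left(X_{ij}\right)\left(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^{2},
\qquad
f(x) = \begin{cases} (x/x_{\max})^{\alpha} & x < x_{\max} \\ 1 & \text{otherwise} \end{cases}
```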
24
Why ratios? Consider a word strongly related to ice but not to steam, such as solid. P(solid | ice) will be relatively high, and P(solid | steam) will be relatively low, so the ratio P(solid | ice) / P(solid | steam) will be large. If we take a word such as gas that is related to steam but not to ice, the ratio P(gas | ice) / P(gas | steam) will instead be small. For a word related to both ice and steam, such as water, we expect the ratio to be close to one, and we would also expect a ratio close to one for words related to neither ice nor steam, such as fashion.
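A toy sketch of the quantity being discussed, using a hypothetical co-occurrence count matrix (the numbers are invented purely for illustration, not taken from the paper):

```python
# Toy co-occurrence counts (rows: target words, columns: probe words).
# All numbers are invented for illustration only.
import numpy as np

targets = ["ice", "steam"]
probes = ["solid", "gas", "water", "fashion"]
X = np.array([
    [190,   7, 300, 2],   # counts of probe words near "ice"
    [  6, 180, 310, 2],   # counts of probe words near "steam"
], dtype=float)

P = X / X.sum(axis=1, keepdims=True)   # P(probe | target)
ratio = P[0] / P[1]                    # P(k | ice) / P(k | steam)
for k, r in zip(probes, ratio):
    print(f"{k:8s} {r:6.2f}")  # large for 'solid', small for 'gas', ~1 otherwise
```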
25
Training time
26
Uses of word2vec and embedding models
- Text/document classification and sentiment classification
- Playlist generation (e.g. Spotify), composer classification, analysis of harmony
- Cross-language word-sense disambiguation
- Machine translation
- Biomedical named-entity recognition
- Dialog systems
- Automatic research (search engines that do the reading for you)
27
Sentiment classification
Aims to correctly identify the sentiment “polarity” of a sentence, which could be positive, negative, or neutral (mostly binary +/−). Examples are tweets, shopping reviews, Facebook posts, etc. (typically short-form text with one continuous concept). How can this be extended?
28
Machine learning for sentiment classification
- Multiclass support vector machines with a linear kernel (excellent accuracy, large memory requirements, high complexity)
- Naïve Bayes (fast, minimal memory required)
- Maximum entropy, i.e. logistic regression (good performance)
- Domain specificity is a constraint
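A minimal scikit-learn sketch of these baselines (toy texts and labels invented for illustration; maximum entropy is implemented here as logistic regression):

```python
# Baseline sentiment classifiers with scikit-learn (toy data for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression   # maximum entropy
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["great movie, loved it", "terrible plot and bad acting",
         "absolutely wonderful", "worst film I have seen"]
labels = ["pos", "neg", "pos", "neg"]

for clf in (LinearSVC(), MultinomialNB(), LogisticRegression(max_iter=1000)):
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(texts, labels)
    print(type(clf).__name__, model.predict(["what a wonderful movie"]))
```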
29
Non-textual human communication
Verbal speech and written text account for only part of human language and meaning. How do we capture the rest in digital communications? Prosodic cues (tone of voice), body language (posture, hand gestures, etc.), facial expression, symbolism.
30
SentiWordNet: builds on WordNet by annotating its synsets with sentiment scores, using classifiers trained with Support Vector Machines and Rocchio’s algorithm
31
Evaluation metrics
Accuracy may not be appropriate or sufficient because of class imbalance. (“Gruzd et al. examined the spreading of emotional content on Twitter and found that the positive posts are retweeted more often than the negative ones.”) Is there a way to measure the degree of positivity? What about precision, recall, F-measure?
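Precision, recall, and F-measure are available directly in scikit-learn. A sketch with hypothetical labels and predictions:

```python
# Class-imbalance-aware evaluation (hypothetical labels and predictions).
from sklearn.metrics import classification_report

y_true = ["pos", "pos", "pos", "pos", "pos", "pos", "pos", "pos", "neg", "neg"]
y_pred = ["pos", "pos", "pos", "pos", "pos", "pos", "pos", "neg", "neg", "pos"]

# Reports per-class precision, recall, and F1 rather than a single accuracy,
# which can look deceptively good when one class dominates.
print(classification_report(y_true, y_pred))
```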
32
Emoticons/emojis
33
How can emojis help with disambiguation?
- Sarcasm
- Negation (neural network approaches help with this “local compositionality” as well), e.g. “I did not love this movie or find it to be amazing and wonderful.”
- Word sense, e.g. the two readings of “I like to slap the bass” (the fish vs. the instrument), which identical text cannot distinguish but an accompanying emoji can
34
Negation handling: we can use only unambiguous emojis that are strongly associated with positive or negative sentiment and appear widely across various corpora or text sources