Natural language understanding

Natural language understanding Zack Larsen

What is natural language understanding (NLU)? “A subtopic of natural language processing in artificial intelligence that deals with machine reading comprehension. NLU is considered an AI-hard problem.” (Wikipedia)

Word2vec A distributed vector-space model that generates word embeddings. Based on shallow neural networks. Proposed in 2013 by Mikolov et al. (Google). Achieves superior performance to other leading neural-network and latent semantic analysis models while being more scalable and time-efficient (100 billion words/day). Uses a skip-gram context window or continuous bag-of-words instead of a traditional bag-of-words (BoW).
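A minimal sketch of training such embeddings with the gensim library; the corpus file name and the hyperparameter values below are illustrative assumptions, not part of the original slides:

```python
# Minimal word2vec training sketch using gensim (illustrative only).
# Assumes a plain-text file "corpus.txt" with one tokenizable sentence per line.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("corpus.txt")          # streams whitespace-tokenized sentences
model = Word2Vec(
    sentences,
    vector_size=300,   # embedding dimensionality
    window=5,          # context window size
    sg=1,              # 1 = skip-gram, 0 = continuous bag-of-words
    negative=5,        # number of negative samples per positive example
    min_count=5,       # ignore rare words
    workers=4,
)
# Nearest neighbors by cosine similarity (assumes "king" is in the vocabulary).
print(model.wv.most_similar("king", topn=5))
```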

Skip-gram objective function
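For reference, the skip-gram objective as given by Mikolov et al. (2013) maximizes the average log probability of the context words within a window of size c around each target word:

```latex
J = \frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p\!\left(w_{t+j} \mid w_t\right)
```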

Continuous bag-of-words (cBoW) objective function Instead of feeding the n previous words into the model, the model receives a window of n words around the target word w_t at each time step t.
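In the standard formulation (the exact formula on the original slide is not shown, so this is the usual textbook version), cBoW predicts the target word from its surrounding context:

```latex
J = \frac{1}{T} \sum_{t=1}^{T} \log p\!\left(w_t \mid w_{t-n}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+n}\right)
```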

How does word2vec work?
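In the basic full-softmax formulation from Mikolov et al., the probability of observing an output (context) word w_O given an input word w_I is defined through the input and output embedding vectors v and v' over the whole vocabulary of size W; negative sampling, discussed below, exists precisely to avoid computing this expensive denominator:

```latex
p(w_O \mid w_I) = \frac{\exp\!\left( {v'_{w_O}}^{\!\top} v_{w_I} \right)}{\sum_{w=1}^{W} \exp\!\left( {v'_{w}}^{\!\top} v_{w_I} \right)}
```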

“Relations = Lines” The first singular vector of the matrix of word vectors represents the “direction” of the analogy or relationship. The second singular vector of this matrix represents random noise.

Examples: king - man + woman ≈ queen; China - Beijing + London ≈ England; boy - girl + female ≈ male; French - France + Mexico ≈ Spanish; captured - capture + go ≈ went
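Queries like these can be reproduced with pretrained vectors, for example through gensim's analogy interface (the pretrained model name below is an assumption; any word2vec-format vectors work):

```python
# Analogy queries over pretrained word vectors (illustrative sketch).
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # pretrained Google News vectors

# "man is to king as woman is to ?"  ->  expected: queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# "Beijing is to China as London is to ?"  ->  expected: England / Britain
print(wv.most_similar(positive=["China", "London"], negative=["Beijing"], topn=3))
```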

Bag-of-words vs. dependency-based syntactic contexts

Negative sampling and NCE (Noise-Contrastive Estimation) “Compare the target word with a stochastic selection of contexts instead of all contexts.” (Think tf-idf in text classification.) “The original motivation for sub-sampling was that frequent words are less informative.” There is “…another explanation for its effectiveness: the effective window size grows, including context-words which are both content-full and linearly far away from the focus word, thus making the similarities more topical.” “The key assertion underlying NCE is that a good model should be able to differentiate data from noise by means of logistic regression.” We want to use “learning by comparison” to train binary logistic regression classifiers to estimate posterior probabilities.
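For reference, the negative-sampling objective from Mikolov et al. replaces the full softmax with k sampled "noise" words drawn from a noise distribution P_n(w), scoring the true context word against them with logistic regression:

```latex
\log \sigma\!\left({v'_{w_O}}^{\!\top} v_{w_I}\right)
\;+\; \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\!\left[ \log \sigma\!\left(-{v'_{w_i}}^{\!\top} v_{w_I}\right) \right]
```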

Sub-sampling of frequent words Each word in the training set is discarded with probability P(w_i) = 1 - sqrt(t / f(w_i)), where f(w_i) is the frequency of word w_i and t is a chosen threshold, typically around 10^-5.

Levy and Goldberg, 2014 Why does vector arithmetic (adding and subtracting) reveal analogies? Because vector arithmetic is similarity arithmetic: under the hood we are maximizing two similarity terms and minimizing a third (dissimilarity) term. In the famous "man is to woman as king is to queen" example, queen is the word w that maximizes cos(w, king) - cos(w, man) + cos(w, woman). “…the neural embedding process is not discovering novel patterns, but rather is doing a remarkable job at preserving the patterns inherent in the word-context co-occurrence matrix.” Is there a better way to extract analogies? “Yes! We suggest multiplying similarities instead of adding them, and show a significant improvement in every scenario.”

Levy and Goldberg Notice the imbalance here: the “England” similarities are dominated by the “Baghdad” aspect. The additive approach yields “Mosul” for the query “England - London + Baghdad = ?”; the multiplicative approach yields “Iraq”.

Multiplicative combination
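Levy and Goldberg's multiplicative objective (3CosMul) replaces the sum of similarities with a ratio; the cosines are first shifted to be non-negative, and ε is a small constant that prevents division by zero:

```latex
\arg\max_{w \in V} \; \frac{\cos(w, \mathrm{king}) \cdot \cos(w, \mathrm{woman})}{\cos(w, \mathrm{man}) + \varepsilon}
```

In practice, gensim exposes this measure as KeyedVectors.most_similar_cosmul, alongside the additive most_similar.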

Non-linearity

Challenges with word2vec “…the explicit representation is superior in some of the more semantic tasks, especially geography related ones, as well as the superlatives and nouns. The neural embedding, however, has the upper hand on most verb inflections, comparatives, and family (gender) relations.” Therefore, challenges exist mostly in syntactic tasks. (Levy and Goldberg)

Problems with syntax Word2vec and GloVe perform better on semantic tasks than on syntactic tasks. These methods are based on word co-occurrence count matrices, and semantic relationships appear together more frequently than syntactic ones do, e.g. “He lives in Chicago, Illinois.” vs. “She lives in Chicago” / “She lived in Chicago”.

Latent semantic analysis pros and cons Uses a term-document matrix; HAL (Hyperspace Analogue to Language) uses a term-frequency or term co-occurrence matrix. Fast training. Efficient usage of statistics on the entire corpus. Primarily used to capture word similarity. Disproportionate importance given to large counts. Sub-optimal vector space structure.
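A minimal LSA sketch using scikit-learn, assuming a TF-IDF term-document matrix and a truncated SVD; the toy documents and component count are placeholder assumptions:

```python
# Minimal latent semantic analysis sketch: TF-IDF matrix + truncated SVD (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
]
X = TfidfVectorizer().fit_transform(docs)        # documents x terms matrix
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(X)               # dense document vectors in the latent space
print(doc_vectors)
```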

Word2vec and neural net pros and cons Scales with corpus size. Inefficient usage of statistics (training on a window rather than the whole corpus). Generates improved performance on other tasks. Can capture complex patterns beyond word similarity.

GloVe (Global Vectors) Pennington et al., 2014, Stanford NLP. Uses the ratio of word co-occurrence probabilities with various probe words k to examine relationships. Fast training. Scalable to huge corpora. Good performance even with a small corpus and small vectors. Dimensionality ~300. A context window size of 8-10 is best. Also applicable to named entity recognition. “3COSMUL (Levy and Goldberg) performed worse than cosine similarity in almost all of our experiments.”

Objective function

GloVe objective “… a weighted least squares objective J that directly aims to minimize the difference between the dot product of the vectors of two words and the logarithm of their number of co-occurrences,” where w_i and b_i are the word vector and bias of word i, w_j and b_j are the context word vector and bias of word j, X_ij is the number of times word i occurs in the context of word j, and f is a weighting function that assigns relatively lower weight to rare and frequent co-occurrences.
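Written out as in Pennington et al. (2014), with the weighting function the paper uses (α is typically 3/4 and x_max around 100):

```latex
J = \sum_{i,j=1}^{V} f\!\left(X_{ij}\right)
    \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
\left( x / x_{\max} \right)^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
```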

Why ratios? Consider a word strongly related to ice, but not to steam, such as solid. P(solid | ice) will be relatively high and P(solid | steam) relatively low, so the ratio P(solid | ice) / P(solid | steam) will be large. If we take a word such as gas that is related to steam but not to ice, the ratio P(gas | ice) / P(gas | steam) will instead be small. For a word related to both ice and steam, such as water, we expect the ratio to be close to one. We would also expect a ratio close to one for words related to neither ice nor steam, such as fashion.

Training time

Uses of word2vec and embedding models Text/document classification and sentiment classification; playlist generation (e.g. Spotify), composer classification, and analysis of harmony; cross-language word-sense disambiguation; machine translation; biomedical named-entity recognition; dialog systems; automatic research (search engines that do the reading for you).

Sentiment classification Aims to correctly identify the sentiment “polarity” of a sentence: positive, negative, or neutral (mostly binary +/-). Examples are tweets, shopping reviews, Facebook posts, emails, etc. (typically short-form text with one continuous concept). How can this be extended?

Machine learning for sentiment classification Multiclass support vector machines with a linear kernel (excellent accuracy, large memory requirements, high complexity). Naïve Bayes (fast, minimal memory required). Maximum Entropy (good performance). Domain specificity is a constraint.
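A rough sketch of one such pipeline, averaging pretrained word vectors per document and feeding them to a linear SVM; the pretrained model name, the toy texts, and the labels are placeholder assumptions:

```python
# Sentiment classification sketch: averaged word vectors -> linear SVM (illustrative only).
import numpy as np
import gensim.downloader as api
from sklearn.svm import LinearSVC

wv = api.load("glove-wiki-gigaword-100")   # any pretrained word vectors work here

def doc_vector(text):
    # Average the vectors of in-vocabulary tokens; zero vector if none are found.
    tokens = [t for t in text.lower().split() if t in wv]
    return np.mean([wv[t] for t in tokens], axis=0) if tokens else np.zeros(wv.vector_size)

train_texts = ["great movie , loved it", "terrible plot and awful acting"]   # placeholder data
train_labels = [1, 0]                                                        # 1 = positive, 0 = negative

clf = LinearSVC().fit([doc_vector(t) for t in train_texts], train_labels)
print(clf.predict([doc_vector("what a wonderful film")]))
```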

Non-textual human communication Verbal speech and written text only account for a certain proportion of human language and meaning. How do we capture the other things in digital communications? Prosodic cues (tone of voice), body language (posture, hand gestures, etc.), facial expression, symbolism.

SentiWordNet Builds on WordNet by annotating synsets with sentiment scores (positivity, negativity, objectivity), using Support Vector Machines and Rocchio’s algorithm.
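SentiWordNet scores can be queried through NLTK; a minimal sketch (assumes the wordnet and sentiwordnet corpora have been downloaded, and the synset identifier is just an example):

```python
# Querying SentiWordNet sentiment scores via NLTK (illustrative sketch).
import nltk
nltk.download("wordnet")
nltk.download("sentiwordnet")
from nltk.corpus import sentiwordnet as swn

good = swn.senti_synset("good.a.01")          # first adjective sense of "good"
print(good.pos_score(), good.neg_score(), good.obj_score())
```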

Evaluation metrics Maybe accuracy is not appropriate or sufficient because of class imbalance. (“Gruzd et al. examined the spreading of emotional content on Twitter and found that the positive posts are retweeted more often than the negative ones.”) Is there a way to measure the degree of positivity? What about precision, recall, and F-measure?
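Those metrics are straightforward to compute with scikit-learn; the gold labels and predictions below are placeholders for an imbalanced, positive-heavy set:

```python
# Precision, recall, and F1 for an imbalanced binary sentiment task (illustrative).
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 1, 1, 1, 0, 0, 1, 1, 0, 1]   # placeholder gold labels
y_pred = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]   # placeholder predictions
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```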

Emoticons / emojis

How can emojis help with disambiguation? Sarcasm. Negation (neural-network approaches help with this “local compositionality” as well), e.g. “I did not love this movie or find it to be amazing and wonderful.” Word sense, e.g. “I like to slap the bass.” (the fish) vs. “I like to slap the bass.” (the instrument), where the accompanying emoji signals which sense is meant.

Negation handling We can use only unambiguous emojis that are strongly associated with positive or negative sentiment and that appear widely across corpora and text sources.