Natural Language Processing Is So Difficult

Presentation transcript:

Natural Language Processing Is So Difficult

Feature Engineering for Natural Language Processing (2019). Susan Li, Sr. Data Scientist. https://medium.com/@actsusanli

What are features? When we predict New York City taxi fares, the features might include: the distance between the pickup and drop-off locations, the time of day, the day of the week, and whether the day is a holiday.

What are the features for NLP?

How does a computer perceive text?

Meta features: number of words in the text, number of unique words, number of characters, number of stopwords, number of punctuation marks, number of upper-case words, number of title-case words, average word length, text distance, language of the text, page rank.
Text-based features: vectorization (CountVectorizer, TfidfVectorizer, HashingVectorizer), tokenization, lemmatization, stemming, n-grams, part-of-speech tagging, parsing, named entity and key phrase extraction, capitalization pattern.
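
A minimal sketch (not from the slides) of computing a few of the meta features above with pandas; the DataFrame, its text column, and the tiny stopword list are hypothetical:

import string
import pandas as pd

# Hypothetical toy data; in practice df would hold the real corpus.
df = pd.DataFrame({"text": ["How do I plot a dataframe bar graph?",
                            "SQL join on two tables"]})

stopwords = {"a", "an", "the", "do", "i", "on", "of", "how"}  # tiny illustrative list

df["num_words"] = df["text"].apply(lambda t: len(t.split()))
df["num_unique_words"] = df["text"].apply(lambda t: len(set(t.lower().split())))
df["num_chars"] = df["text"].str.len()
df["num_stopwords"] = df["text"].apply(lambda t: sum(w.lower() in stopwords for w in t.split()))
df["num_punctuation"] = df["text"].apply(lambda t: sum(c in string.punctuation for c in t))
df["num_title_words"] = df["text"].apply(lambda t: sum(w.istitle() for w in t.split()))
df["avg_word_len"] = df["text"].apply(lambda t: sum(len(w) for w in t.split()) / len(t.split()))
print(df)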

Meta features

Text based features
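
A small illustration (not from the slides) of several text-based features using spaCy; it assumes spaCy and its en_core_web_sm model are installed:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Susan plots a dataframe bar graph in New York City.")

# Tokenization, lemmatization, and part-of-speech tagging
for token in doc:
    print(token.text, token.lemma_, token.pos_)

# Named entity extraction
for ent in doc.ents:
    print(ent.text, ent.label_)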

Feature Engineering for NLP. Feature engineering is the most important part of developing NLP applications, and the most creative aspect of data science (part art, part skill). It draws on domain knowledge and brainstorming sessions, and on checking and revisiting what has worked before.

What is not feature engineering: data collection; removing stopwords, non-alphanumeric characters, and HTML tags, and lowercasing (these are data preprocessing); creating the target variable (labeling the data); scaling or normalization; PCA; hyperparameter optimization or tuning.

Our machine learning algorithm does not look at the textual content directly: the text is converted into word vectors before the SVM is applied.

Feature Engineering for NLP: TF-IDF, Word2vec, FastText, topic modeling, simple and complicated features, the future.

Problem 1: Predict the label of a Stack Overflow question. Data: https://storage.googleapis.com/tensorflow-workshop-examples/stack-overflow-data.csv

Problem formulation: This problem is best formulated as multi-class, single-label text classification, which assigns a given Stack Overflow question to exactly one of a set of target labels.

Example: a Stack Overflow question labeled "python".

Our machine learning algorithm does not look at the textual content directly: the text is converted into word vectors before the SVM is applied.

N-grams
Unigrams: "nlp multilabel classification problem" -> ["nlp", "multilabel", "classification", "problem"]
Bigrams: "nlp multilabel classification problem" -> ["nlp multilabel", "multilabel classification", "classification problem"]
Trigrams: "nlp multilabel classification problem" -> ["nlp multilabel classification", "multilabel classification problem"]
Character trigrams: "multilabel" -> ["mul", "ult", "lti", "til", "ila", "lab", "abe", "bel"]
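
A minimal sketch (not from the slides) of generating word and character n-grams in plain Python:

def word_ngrams(text, n):
    # Split on whitespace and slide a window of length n over the tokens.
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(word, n):
    # Slide a window of length n over the characters of a single word.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(word_ngrams("nlp multilabel classification problem", 2))
# ['nlp multilabel', 'multilabel classification', 'classification problem']
print(char_ngrams("multilabel", 3))
# ['mul', 'ult', 'lti', 'til', 'ila', 'lab', 'abe', 'bel']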

TF-IDF computation of the word “python”

TF-IDF sparse matrix example. Without going into the math, TF-IDF scores are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.
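
For reference (not spelled out on the slide), the standard definition combines a term's frequency in a document with its inverse document frequency across the corpus; scikit-learn uses a smoothed variant of this:

tfidf(t, d) = tf(t, d) * log(N / df(t))

where tf(t, d) is the count of term t in document d, N is the number of documents, and df(t) is the number of documents containing t.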

Coding TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=3, max_features=None,
                             token_pattern=r'\w{1,}',
                             strip_accents='unicode', analyzer='word',
                             ngram_range=(1, 3), stop_words='english')
tfidf_matrix = vectorizer.fit_transform(questions_train)

The traditional approach is to extract from the sentence a rich set of hand-designed features, which are then fed to a classical shallow classification algorithm, e.g. a Support Vector Machine (SVM), often with a linear kernel.
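
A minimal sketch (not from the slides) of feeding the TF-IDF matrix above into a linear-kernel SVM with scikit-learn; labels_train and questions_test are hypothetical objects aligned with questions_train:

from sklearn.svm import LinearSVC

# Fit a linear-kernel SVM on the sparse TF-IDF features.
clf = LinearSVC()
clf.fit(tfidf_matrix, labels_train)

# Transform new questions with the SAME fitted vectorizer before predicting.
predicted_tags = clf.predict(vectorizer.transform(questions_test))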

Word2vec. "You shall know a word by the company it keeps" (Firth, J. R. 1957:11). Word2vec is not deep learning; it turns input text into a numerical form that deep neural networks can process as input. We can either download one of the pre-trained models (e.g. GloVe vectors), or train a Word2vec model from scratch with gensim.

Word2vec
CBOW: predicts the current word given the neighboring words. If we have the phrase "how to plot dataframe bar graph", the parameters/features of {how, to, plot, bar, graph} are used to predict {dataframe}.
Skip-gram: predicts the neighboring words given the current word. If we have the phrase "how to plot dataframe bar graph", the parameters/features of {dataframe} are used to predict {how, to, plot, bar, graph}.
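
In gensim, the choice between the two architectures is a single flag (sg=0 for CBOW, the default; sg=1 for skip-gram); a minimal sketch assuming a hypothetical tokenized toy corpus:

from gensim.models import Word2Vec

corpus = [["how", "to", "plot", "dataframe", "bar", "graph"]]  # toy tokenized corpus

# sg=0 -> CBOW: predict the current word from its context
cbow_model = Word2Vec(corpus, sg=0, min_count=1)

# sg=1 -> skip-gram: predict the context from the current word
skipgram_model = Word2Vec(corpus, sg=1, min_count=1)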

https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

Word2vec https://www.slideshare.net/TraianRebedea/what-is-word2vec

Airbnb https://medium.com/airbnb-engineering/listing-embeddings-for-similar-listing-recommendations-and-real-time-personalization-in-search-601172f7603e

Uber https://eng.uber.com/uber-eats-query-understanding/

Word2vec vs. FastText
Word2vec: treats each word as the smallest unit to train on; does not perform well for rare words; cannot generate a word embedding if the word does not appear in the training corpus; training is faster.
FastText: an extension of Word2vec; treats each word as composed of character n-grams; generates better word embeddings for rare words; can construct the vector for a word even if it does not appear in the training corpus; training is slower.

Word2vec and FastText in gensim

from gensim.models import word2vec
model = word2vec.Word2Vec(corpus, size=100, window=20, min_count=500, workers=4)

from gensim.models import FastText
model_FastText = FastText(corpus, size=100, window=20, min_count=500, workers=4)
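
A short usage note (not from the slides): because FastText builds vectors from character n-grams, it can return a vector for a word that never appeared in training, whereas plain Word2vec raises a KeyError. A sketch assuming the models above have been trained and "dataframe" is in the vocabulary:

# Both models expose their vectors through the .wv attribute.
print(model_FastText.wv.most_similar("dataframe"))  # nearest neighbors by cosine similarity

# Out-of-vocabulary word: FastText composes a vector from its character n-grams,
# while the plain Word2vec model has no entry for it and raises KeyError.
vector = model_FastText.wv["dataframes"]
try:
    model.wv["dataframes"]
except KeyError:
    print("Word2vec cannot embed an unseen word")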

Problem 2: Topic Modeling (Unsupervised)

Topic Modeling

from sklearn.decomposition import LatentDirichletAllocation

lda_model = LatentDirichletAllocation(n_components=20, learning_method='online',
                                      random_state=0, n_jobs=-1)
lda_output = lda_model.fit_transform(df_vectorized)
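
A minimal sketch (not from the slides) of inspecting the topics: lda_model.components_ holds the per-topic word weights, and the vocabulary comes from whatever vectorizer produced df_vectorized (a hypothetical fitted count_vectorizer here):

import numpy as np

# get_feature_names_out() in newer scikit-learn versions
feature_names = np.array(count_vectorizer.get_feature_names())

# Print the ten highest-weighted words for each topic.
for topic_idx, weights in enumerate(lda_model.components_):
    top_words = feature_names[np.argsort(weights)[::-1][:10]]
    print(f"Topic {topic_idx}: {' '.join(top_words)}")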

Example topics discovered for "sql" and "python" questions.

Problem 3: Automatically detect duplicate Stack Overflow questions

Problem formulation: This is a binary classification problem in which we identify and classify whether a pair of questions are duplicates or not.

Meta features: the word count of each question; the character length of each question; the number of common words between the two questions; . . .
Fuzzy features: fuzzy string matching related features (https://github.com/seatgeek/fuzzywuzzy); . . .
Word2vec features: Word2vec vectors for each question; Word Mover's Distance (https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html); cosine distance between the vectors of question1 and question2; Manhattan distance between the vectors of question1 and question2; Euclidean distance between the vectors of question1 and question2; Jaccard similarity between the vectors of question1 and question2; . . .
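
A minimal sketch (not from the slides) of a few of these pairwise features, assuming a trained gensim Word2vec model w2v_model and the fuzzywuzzy package; averaging word vectors is one simple (hypothetical) way to get a per-question vector:

import numpy as np
from fuzzywuzzy import fuzz
from scipy.spatial.distance import cosine, cityblock, euclidean

def question_vector(question, w2v_model):
    # Average the Word2vec vectors of the in-vocabulary words.
    words = [w for w in question.lower().split() if w in w2v_model.wv]
    if not words:
        return np.zeros(w2v_model.vector_size)
    return np.mean([w2v_model.wv[w] for w in words], axis=0)

def pair_features(q1, q2, w2v_model):
    v1, v2 = question_vector(q1, w2v_model), question_vector(q2, w2v_model)
    return {
        "word_count_q1": len(q1.split()),
        "word_count_q2": len(q2.split()),
        "common_words": len(set(q1.lower().split()) & set(q2.lower().split())),
        "fuzz_ratio": fuzz.QRatio(q1, q2),
        "cosine_distance": cosine(v1, v2),
        "manhattan_distance": cityblock(v1, v2),
        "euclidean_distance": euclidean(v1, v2),
    }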

Debugging end-to-end and AutoML models: LIME, SHAP

LIME: "printf" is positive for c, but negative for Other. If we remove the word "printf" from the document, we expect the model to predict c with probability 0.97 - 0.28 = 0.69.
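
A minimal sketch (not from the slides) of producing such an explanation with the lime package, assuming a hypothetical fitted scikit-learn pipeline (vectorizer plus classifier) that exposes predict_proba, plus a list of class names and a raw question string:

from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=class_names)  # e.g. the Stack Overflow tags

# Explain one question: which words push the prediction toward or away from each class?
explanation = explainer.explain_instance(
    question_text,              # hypothetical raw question string
    pipeline.predict_proba,     # hypothetical fitted text-classification pipeline
    num_features=10,
)
print(explanation.as_list())    # (word, weight) pairs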

SHAP: "php" is the biggest signal word used by our model, contributing most to "php" predictions. It's unlikely you'd see the word "php" used in a python question.
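
A minimal sketch (not from the slides) of computing SHAP values for a linear model trained on the TF-IDF features; clf, vectorizer, and tfidf_matrix are the hypothetical objects from the earlier snippets:

import shap

# LinearExplainer handles linear models on (sparse) tabular features such as TF-IDF.
explainer = shap.LinearExplainer(clf, tfidf_matrix)
shap_values = explainer.shap_values(tfidf_matrix)

# Summarize which words (features) push predictions the most across the corpus.
shap.summary_plot(shap_values, tfidf_matrix,
                  feature_names=vectorizer.get_feature_names())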

Good features are the backbone of any machine learning model. And good feature creation often needs domain knowledge, creativity, and lots of time.

Resources:
GloVe: https://blog.acolyer.org/2016/04/22/glove-global-vectors-for-word-representation/
Word2vec: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
Mikolov et al. 2013: https://arxiv.org/pdf/1301.3781.pdf
Mikolov et al. 2013: https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
Rong 2014: https://arxiv.org/pdf/1411.2738v3.pdf
Goldberg and Levy 2014: https://arxiv.org/pdf/1402.3722v1.pdf