Word Embeddings and their Applications

Word Embeddings and their Applications. Presented by Samhaa R. El-Beltagy, Director of the Center for Informatics Science and Head of the Text Mining Research Group, Nile University, Egypt.

Outline: a general introduction to word embeddings (what they are, how they are built, available tools, and available pre-trained models), followed by AraVec, a set of Arabic word embedding models.

Language. Language is the way humans communicate, express ideas, and transfer knowledge. Written language is what preserves our knowledge and thoughts. Many factors influence our ability to understand each other, including context and culture. Natural language understanding is one of the biggest challenges facing the AI community and, despite the hype, it is far from being solved. [Slide image: Sumerian script]

Words. A word is defined as “a single distinct meaningful element of speech or writing”. Words are the main constituents of any language. When building computer systems that deal with natural language, one of the first decisions that needs to be made is how words in documents, sentences, etc. are going to be represented in the system. The bag-of-words (BOW) model is a very popular and widely used document representation scheme, but both it and the one-hot encoding of individual words suffer from extreme sparseness (see the sketch below). So what are word embeddings? And why are they important?
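
To make the sparseness point concrete, here is a minimal sketch (the vocabulary size and index below are made up for illustration) contrasting a one-hot word vector with a dense embedding:

```python
import numpy as np

# Hypothetical vocabulary of 50,000 words: the one-hot vector for a single
# word is 50,000-dimensional with exactly one non-zero entry.
vocab_size = 50_000
word_index = 1_234                 # index of some word in the vocabulary
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0          # 49,999 of the 50,000 entries stay zero

# A word embedding for the same word is a dense, low-dimensional,
# real-valued vector (e.g. 300 dimensions); here random numbers stand in
# for a learned vector.
embedding = np.random.randn(300)

print(one_hot.shape, np.count_nonzero(one_hot))   # (50000,) 1
print(embedding.shape)                            # (300,)
```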

Word embeddings. Word embeddings are the outcome of applying any set of techniques that results in words or phrases in some given vocabulary being mapped to vectors of real numbers. All widely used word embedding generation techniques take in a large text corpus as input and generate real-valued vectors for the unique terms that appear in that corpus (its vocabulary). The resulting vectors are usually dependent on the contexts in which a word appears: the better the technique used, the more semantically representative the vectors are. This is particularly important when using deep learning models. The alternative representation of a word as a vector is the sparse one-hot representation.

Word embedding properties. Similar words tend to have similar embeddings (vectors). Since words are represented as real-valued dense vectors, the similarity between them can be measured using the cosine similarity measure. [Slide figures: the top 10 most similar terms to مصر (Egypt), to باريس (Paris), and to روما (Rome).]
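
As a hedged sketch of how these similarities are computed in practice with gensim (the model file name is a placeholder, not an actual AraVec artifact):

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder path: any word2vec-format Arabic model can be substituted here.
wv = KeyedVectors.load_word2vec_format("arabic_vectors.bin", binary=True)

def cosine(a, b):
    """Cosine similarity between two dense word vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Similarity between "Egypt" and "Cairo".
print(cosine(wv["مصر"], wv["القاهرة"]))

# gensim computes the same measure internally; top-10 most similar terms:
print(wv.most_similar("مصر", topn=10))
```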

Examples of semantic expressiveness. The famous example of King - Man + Woman ≈ Queen. [Slide figures: the top 10 most similar terms to رئيس + مصر (president + Egypt), and the top 10 most similar terms to اردوغان (Erdoğan).]

A more expressive example. Model trained on 783 million words; word vectors have a dimensionality of 300. France - Paris + Italy.
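
A sketch of how such analogy queries are issued with gensim (assumes a downloaded pre-trained English model such as the Google News vectors mentioned later; note the capital-city analogy is phrased here in its standard form, Paris - France + Italy ≈ Rome):

```python
from gensim.models import KeyedVectors

# Placeholder: path to a downloaded pre-trained word2vec-format model.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                        binary=True)

# King - Man + Woman ≈ Queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Paris - France + Italy ≈ Rome (the capital-city analogy from this slide)
print(wv.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=3))
```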

Example techniques used to generate word embeddings: Latent Semantic Indexing (LSI), which represents words based on “the company they keep” and applies singular value decomposition (SVD) to identify patterns in the relationships between terms; fastText (Facebook); GloVe (Global Vectors for Word Representation, Stanford); and Word2Vec (Google).

Word2Vec. A shallow neural network which generates a fixed-length vector for each word in a corpus based on different context windows. Architectures: CBOW (Continuous Bag of Words) and the Skip-Gram model.

Word2Vec - CBOW (Continuous Bag of Words). Predicts the probability of a word given some input context (its neighboring words).

Word2Vec - Skip-Gram model. Skip-Gram follows the same topology as CBOW, but flips CBOW's architecture on its head: the aim of Skip-Gram is to predict the context words given a target word.
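
The difference between the two architectures is easiest to see in the training pairs they are built around. The toy sketch below (not gensim internals) enumerates them for one sentence and a context window of 2:

```python
sentence = "the quick brown fox jumps".split()
window = 2

for i, target in enumerate(sentence):
    # Words within `window` positions of the target form its context.
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    # CBOW: predict the target word from its surrounding context.
    print(f"CBOW      context={context} -> target={target!r}")
    # Skip-Gram: predict each context word from the target word.
    for c in context:
        print(f"Skip-Gram target={target!r} -> context={c!r}")
```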

Tools. Gensim (https://radimrehurek.com/gensim/): a Python-based tool that offers Word2Vec, LSI, and LDA implementations. GloVe (https://nlp.stanford.edu/projects/glove/): Stanford's tool for generating word embeddings. fastText: “a library for efficient learning of word representations and sentence classification”. Word2Vec (https://code.google.com/archive/p/word2vec/): the original Google implementation of the Word2Vec algorithm; the code is written in C. Word2Vec in Java (https://deeplearning4j.org/word2vec).

Applications: computing similarities between words, creating clusters of related words, using embeddings as features in text classification, document clustering, query expansion, and other NLP-related tasks such as sentiment analysis, named entity recognition, machine translation, language modelling, and POS tagging.
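
As one concrete example of using embeddings as classification features, a document can be represented by the average of its word vectors. A minimal sketch, assuming a generic pre-trained model (placeholder file name):

```python
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("arabic_vectors.bin", binary=True)

def doc_vector(tokens, wv):
    """Represent a document as the average of its in-vocabulary word vectors."""
    vecs = [wv[t] for t in tokens if t in wv]
    if not vecs:                              # no known words: fall back to zeros
        return np.zeros(wv.vector_size)
    return np.mean(vecs, axis=0)

# "The service was excellent" -> a dense feature vector that can be fed to any
# standard classifier (SVM, logistic regression, ...).
features = doc_vector("الخدمة كانت ممتازة".split(), wv)
```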

Pre-trained Embeddings. Embeddings for 294 languages, trained on Wikipedia using fastText: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md. Pre-trained vectors trained on part of the Google News dataset (about 100 billion words) using Word2Vec; the model contains 300-dimensional vectors for 3 million words and phrases: https://goo.gl/RhkUE8. GloVe vectors trained on Wikipedia 2014 and Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, and 300d vectors): http://nlp.stanford.edu/data/glove.6B.zip. GloVe vectors trained on Twitter (2B tweets, 27B tokens, 1.2M vocab, 25d, 50d, 100d, and 200d vectors): http://nlp.stanford.edu/data/glove.twitter.27B.zip.
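
A sketch of loading two of these with gensim (the file names are the ones the projects distribute, assumed here to have been downloaded and unpacked locally):

```python
from gensim.models import KeyedVectors

# fastText's Wikipedia vectors ship as ".vec" text files in word2vec format;
# "wiki.ar.vec" is the Arabic one.
ar_wv = KeyedVectors.load_word2vec_format("wiki.ar.vec", binary=False)

# The Google News vectors are distributed as a single binary file.
en_wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                           binary=True)

print(ar_wv.vector_size, en_wv.vector_size)   # 300 300
```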

Why AraVec? There are currently no publicly available sizable pre-trained Arabic word embeddings, and training a ‘decent’ word embedding model requires sizeable datasets and computation power. The main goal of this work is to make publicly available a set of pre-trained Arabic distributed word representations (word embeddings) for immediate use in NLP tasks across different text domains.

Data Resources. The first version of AraVec provides six different word embedding models built on top of three different Arabic text domains: Arabic tweets (Twitter), Arabic pages from the World Wide Web, and the Arabic Wikipedia corpus. The total number of tokens used to build the models amounts to more than 3,300,000,000 (a little over 3.3 billion).

Data Resources | Twitter. Motivation: many recent NLP researchers targeting social media analysis and applications have used Twitter as a main data resource for carrying out their work. Nature: has text in a variety of dialects and sub-dialects; examples include Modern Standard Arabic (MSA, فصحى), Egyptian (or more specifically Cairene), Gulf, Moroccan, Tunisian, Algerian, and Levantine dialects. Data Collection: we designed a crawler to collect tweets in order to overcome limitations posed by the Twitter search API. More than 2,100 queries were used to collect our final Twitter dataset, yielding more than 77,600,000 Arabic tweets posted between 2008 and 2016.

Data Resources | World Wide Web. Motivation: the web offers a diverse selection of web pages spanning multiple domains, many of which cannot be found in a single location. Data Collection: rather than directly crawl the web, we made use of the Common Crawl project. Common Crawl is a nonprofit organization that maintains an open repository of web crawl data. We used only 30% of the data contained in the January 2017 dump (about one billion web pages). The pages were filtered so that only Arabic content was left.
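
The slide does not say how Arabic-only content was identified; one plausible filter (an assumption, not necessarily the method actually used) keeps a page only if most of its letters fall in the Arabic Unicode block:

```python
import re

ARABIC_LETTER = re.compile(r"[\u0600-\u06FF]")   # basic Arabic Unicode block

def is_mostly_arabic(text, threshold=0.5):
    """Keep a page only if at least `threshold` of its letters are Arabic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    arabic = sum(1 for ch in letters if ARABIC_LETTER.match(ch))
    return arabic / len(letters) >= threshold

pages = ["هذه صفحة عربية عن التعلم الآلي", "This page is entirely in English"]
arabic_pages = [p for p in pages if is_mostly_arabic(p)]
print(arabic_pages)
```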

Data Resources | Wikipedia. Motivation: Wikipedia is a clean resource collaboratively written by users all over the world, and because its articles are written in many languages, it is a commonly used resource for many NLP tasks. Data Collection: we downloaded the Arabic dump dated January 2017. After segmenting the articles into paragraphs, we ended up with 1,800,000 paragraphs, each representing a document to be used for building our model.

Summary of used data:

                                          Twitter          WWW              Wikipedia
# of documents                            66,900,000       132,750,000      1,800,000
Total # of tokens                         1,090,082,092    2,225,317,169    78,993,036
Unique tokens (after min-count filter)    164,077          146,237          140,319

Data Preprocessing | Normalization. Normalization of Arabic characters is a common preprocessing step when dealing with Arabic text. In this step we started by unifying the different forms of some characters; for example, all of [أ، آ، إ] are mapped to [ا]. We also removed special characters such as diacritics, and normalized mentions, URLs, emojis, and emoticons.
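
A minimal sketch of such a normalization step (the exact rules and placeholder tokens used for AraVec may differ; emoji handling is omitted):

```python
import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")        # Arabic tashkeel marks

def normalize(text):
    text = re.sub(r"[أآإ]", "ا", text)             # unify alef forms
    text = DIACRITICS.sub("", text)                # strip diacritics
    text = re.sub(r"https?://\S+", "URL", text)    # normalize URLs
    text = re.sub(r"@\w+", "USER", text)           # normalize @mentions
    return text

print(normalize("أهلاً بِالعالَم @someone http://example.com"))
# -> "اهلا بالعالم USER URL"
```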

Data Preprocessing | X-Rated Content. While it may be desirable to have coverage of “X-Rated” content for its semantic value, we did not want this kind of content to dominate our dataset, as this would result in a dataset skewed towards it. We built an “X-Rated” lexicon and, to filter out paragraphs, we calculate the percentage of X-Rated words contained within each one; if this percentage is greater than 20%, we discard that paragraph.
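
A sketch of this lexicon-based filter (the lexicon entries and paragraph list below are placeholders):

```python
# Placeholder lexicon; the actual "X-Rated" lexicon is not reproduced here.
xrated_lexicon = {"badword1", "badword2"}

def keep_paragraph(paragraph, lexicon, max_ratio=0.20):
    """Discard a paragraph if more than 20% of its tokens are lexicon hits."""
    tokens = paragraph.split()
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t in lexicon)
    return hits / len(tokens) <= max_ratio

paragraphs = ["a clean paragraph of text",
              "badword1 badword2 badword1 short"]
kept = [p for p in paragraphs if keep_paragraph(p, xrated_lexicon)]
print(kept)    # only the clean paragraph survives
```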

Building the Models. The models that we constructed were built using the Python Gensim tool. An Intel quad-core i7-3770 @ 3.4 GHz PC with 32 GB of RAM running Ubuntu 16.04 was used to train the models. Training times were as follows: the Wikipedia model took 10 hours, the Twitter model 1.5 days, and the Common Crawl (Web) model 4 days.

Building the Models. We built two models for each of the three text domains, ending up with six different models as described in the following table:

Model            Docs No.       Dimension   Min Word Freq.   Window Size   Technique
Twitter-CBOW     66,900,000     300         500              3             CBOW
Twitter-SG       66,900,000     300         500              3             Skip-Gram
WWW-CBOW         132,750,000    300         500              5             CBOW
WWW-SG           132,750,000    300         500              5             Skip-Gram
Wikipedia-CBOW   1,800,000      300         20               -             CBOW
Wikipedia-SG     1,800,000      300         20               -             Skip-Gram
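
A sketch of how one of these configurations maps onto gensim's API, using the Twitter hyperparameters from the table (gensim 4.x parameter names; the two-document corpus is a toy placeholder, so min_count is lowered to let it run):

```python
from gensim.models import Word2Vec

# Stand-in for an iterable over the ~67M tokenized, normalized tweets.
tweets = [["مثال", "تغريدة", "اولى"], ["مثال", "تغريدة", "ثانية"]]

model = Word2Vec(
    sentences=tweets,
    vector_size=300,   # "Dimension" column of the table
    window=3,          # "Window Size" for the Twitter models
    min_count=1,       # the Twitter models use 500; 1 keeps this toy corpus non-empty
    sg=0,              # 0 = CBOW; sg=1 trains the Skip-Gram variant instead
    workers=4,
)
model.wv.save_word2vec_format("Twitter-CBOW.bin", binary=True)
```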

Examples of terms that are most similar to “Vodafone”, obtained from the Twitter models using CBOW and Skip-Gram.

This data was produced using the Twitter-trained model.

Qualitative Evaluation. The purpose of carrying out a qualitative evaluation of our models was to examine how well they capture similarities among words. We used the word vectors of a very small subset of sentiment words and applied a clustering algorithm to see whether words of the same polarity cluster together or not. We did the same thing with a set of randomly selected named entities of known types.
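
A sketch of this kind of check (placeholder model path and an illustrative six-word list with three positive and three negative Arabic sentiment words; the actual evaluation used curated word sets):

```python
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

wv = KeyedVectors.load_word2vec_format("arabic_vectors.bin", binary=True)

# "excellent", "wonderful", "beautiful", "bad", "terrible", "annoying"
words = ["ممتاز", "رائع", "جميل", "سيء", "فظيع", "مزعج"]
in_vocab = [w for w in words if w in wv]

labels = KMeans(n_clusters=2, n_init=10).fit_predict([wv[w] for w in in_vocab])
for word, label in zip(in_vocab, labels):
    print(label, word)   # inspect whether the two polarities separate cleanly
```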

Qualitative Evaluation using Sentiment words | Twitter Model Twitter CBOW Model Twitter Skip-Gram Model

Qualitative Evaluation using Sentiment words | WWW Model WWW CBOW Model WWW Skip-Gram Model

Qualitative Evaluation using Sentiment words | Wikipedia Model Wikipedia CBOW Model Wikipedia Skip-Gram Model

Qualitative Evaluation using Named Entities | Twitter Model Twitter CBOW Model Twitter Skip-Gram Model

Qualitative Evaluation using Named Entities | WWW Model WWW CBOW Model WWW Skip-Gram Model

Qualitative Evaluation using Named Entities | Wikipedia Model Wikipedia CBOW Model Wikipedia Skip-Gram Model

Get the models! Please access the models via this page: bit.ly/nu_aravec

Questions