Word Embeddings and their Applications
Presented by Samhaa R. El-Beltagy
Director of the Center for Informatics Science
Head of the Text Mining Research Group
Nile University, Egypt
Outline
A general introduction to word embeddings:
What they are
How they are built
Available Tools
Available pre-trained models
AraVec – A set of Arabic word embedding models
Language
Language is the way humans are able to communicate, express ideas and transfer knowledge. Written language is what preserves our knowledge and thoughts. There are many factors that influence our ability to understand each other, including context and culture. Natural language understanding is one of the biggest challenges facing the AI community and, despite the hype, it is far from being solved.
Words
A word is defined as “a single distinct meaningful element of speech or writing”. Words are the main constituents of any language. When building computer systems that deal with natural language, one of the first decisions that needs to be made is how the words in documents, sentences, etc., are going to be represented in the system. The bag-of-words (BOW) model is a very popular and widely used document representation scheme, but both it and the one-hot encoding of individual words suffer from sparseness. So what are word embeddings? And why are they important?
Word embeddings
Word embeddings are the outcome of applying any set of techniques that result in the words or phrases of some given vocabulary being mapped to vectors of real numbers. All widely used word embedding generation techniques take in a large text corpus as input and generate real-valued vectors for the unique terms that appear in that corpus (its vocabulary). The resulting vectors are usually dependent on the contexts in which a word appears, and the better the technique, the more semantically representative the vectors are. This dense representation is the alternative to the sparse one-hot representation of a word, and is particularly important when using deep learning models.
Word embedding properties
Similar words tend to have similar embeddings or vectors. Since words are represented as real-valued dense vectors, the similarity between them can be measured using the cosine similarity measure.
Examples from the slide: top 10 most similar terms to مصر, to باريس, and to روما.
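As a rough sketch (not code from the talk), queries like these map directly onto gensim's API; the model path and the word القاهرة below are illustrative placeholders, and the argument names assume gensim 4.x.

```python
import numpy as np
from gensim.models import Word2Vec

# Load a previously trained model (placeholder path, not an official file name).
model = Word2Vec.load("arabic_word2vec.model")

# Cosine similarity between two word vectors, computed by hand...
v1, v2 = model.wv["مصر"], model.wv["القاهرة"]
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# ...and via gensim, which uses the same measure internally.
print(model.wv.similarity("مصر", "القاهرة"))

# Top 10 most similar terms, as in the slide's examples.
print(model.wv.most_similar("مصر", topn=10))
```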
Examples of semantic expressiveness
The famous example: King – Man + Woman = Queen. Shown on the slide: the top 10 most similar terms to رئيس + مصر, and the top 10 most similar terms to اردوغان.
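In gensim these analogy queries are plain vector arithmetic; a sketch reusing the `model` loaded in the snippet above (the English example assumes an English model):

```python
# King - Man + Woman ≈ Queen (requires an English model).
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# The slide's رئيس + مصر query: add the two vectors and inspect the neighbourhood.
print(model.wv.most_similar(positive=["رئيس", "مصر"], topn=10))
```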
A more expressive example
Model trained on 783 million words. Word vectors have a dimensionality of 300. Example: France – Paris + Italy.
Example Techniques used to generate Word Embeddings
Latent Semantic Indexing (LSI)
Represents words based on “the company” they keep
Applies singular value decomposition (SVD) to identify patterns in the relationships between the terms
fastText - Facebook
GloVe (Global Vectors for Word Representation) - Stanford
Word2Vec - Google
Word2Vec
A shallow neural network which generates a fixed vector for each word in a corpus based on different context windows.
Architectures:
CBOW (Continuous Bag of Words)
Skip-Gram model
Word2Vec - CBOW (Continuous Bag of Words)
Predict the probability of a word given some input context (neighboring words).
Word2Vec – Skip-Gram model
Skip-Gram follows the same topology as CBOW, but flips CBOW’s architecture on its head: the aim of Skip-Gram is to predict the context given a word.
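In gensim's Word2Vec implementation the two architectures differ only in the `sg` flag; a minimal sketch on a toy corpus (gensim 4.x argument names):

```python
from gensim.models import Word2Vec

sentences = [["the", "king", "rules", "the", "kingdom"],
             ["the", "queen", "rules", "the", "kingdom"]]

# CBOW (sg=0, the default): predict the centre word from its context window.
cbow = Word2Vec(sentences, vector_size=100, window=3, min_count=1, sg=0)

# Skip-Gram (sg=1): predict the context words from the centre word.
skip_gram = Word2Vec(sentences, vector_size=100, window=3, min_count=1, sg=1)
```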
Tools
Gensim – a Python-based tool that offers Word2Vec, LSI, and LDA implementations.
GloVe – Stanford’s tool for generating word embeddings.
FastText – “a library for efficient learning of word representations and sentence classification”.
Word2Vec – the original Google implementation of the Word2Vec algorithm; the code is written in C.
Word2Vec in Java.
Applications
Compute similarities between words
Create clusters of related words
Use as features in text classification
Use for document clustering
Employ for query expansion
Use for other NLP related tasks such as:
Sentiment Analysis
Named Entity recognition
Machine Translation
Language modelling
POS tagging
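One common way (not necessarily the one meant here) to turn embeddings into classification features is to average a document's word vectors; a hedged sketch, where `docs`, `labels`, and `model` are assumed to be defined elsewhere:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def doc_vector(tokens, wv):
    # Average the vectors of in-vocabulary tokens; fall back to zeros if none are known.
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

# docs: list of token lists; labels: their classes (both assumed to exist already).
X = np.vstack([doc_vector(d, model.wv) for d in docs])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```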
Pre-trained Embeddings
Embeddings for 294 languages, trained on Wikipedia using fastText.
Pre-trained vectors trained on part of the Google News dataset (about 100 billion words) using word2vec; the model contains 300-dimensional vectors for 3 million words and phrases.
GloVe vectors trained on Wikipedia 2014 and Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors).
GloVe vectors trained on Twitter (2B tweets, 27B tokens, 1.2M vocab, 25d, 50d, 100d, & 200d vectors).
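These sets load with gensim's KeyedVectors; the file names below are the ones these collections are usually distributed under and should be checked against the actual downloads.

```python
from gensim.models import KeyedVectors

# Google News word2vec vectors (binary format, 300 dimensions, ~3M words and phrases).
gnews = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# GloVe text files lack the word2vec header; convert once with gensim's helper script:
#   python -m gensim.scripts.glove2word2vec --input glove.6B.300d.txt --output glove.6B.300d.w2v.txt
glove = KeyedVectors.load_word2vec_format("glove.6B.300d.w2v.txt")

print(gnews.most_similar("king", topn=5))
```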
Why AraVec?
There are currently no publicly available, sizable, pre-trained Arabic word embeddings. Training a ‘decent’ word embedding model requires sizeable datasets and computational power. The main goal of this work is to make publicly available a set of pre-trained Arabic distributed word representations (word embeddings) for immediate use in NLP tasks across different text domains.
Data Resources
The first version of AraVec provides six different word embedding models built on top of three different Arabic text domains:
Twitter (Arabic tweets)
World Wide Web (Arabic pages)
Arabic Wikipedia corpus
The total number of tokens used to build the models amounts to more than 3,300,000,000 (a little over 3 billion).
Data Resources | Twitter
Motivation: Many recent NLP researchers targeting social media analysis and applications have used Twitter as a main data resource for carrying out their work.
Nature: Has text in a variety of dialects and sub-dialects. Examples include Modern Standard Arabic (MSA, فصحى), Egyptian (more specifically Cairene), Gulf, Moroccan, Tunisian, Algerian, and Levantine dialects.
Data Collection: Designed a crawler to collect tweets in order to overcome limitations posed by the Twitter search API. More than 2,100 queries were used to collect our final Twitter dataset. Collected more than 77,600,000 Arabic tweets posted between 2008 and
Data Resources | World Wide Web
Motivation: The web offers a diverse selection of web pages spanning multiple domains, many of which cannot be found in a single location.
Data Collection: Rather than directly crawl the web, we made use of the Common Crawl project. Common Crawl is a nonprofit organization that maintains an open repository of web crawl data. We only used 30% of the data contained in the January 2017 dump (about one billion web pages). The pages were filtered so that only Arabic content was left.
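The slides do not spell out how the Arabic filter works; one simple approach, sketched here purely as an assumption, keeps a page only if most of its letters fall in the Arabic Unicode block. The `pages` list and the 0.5 threshold are placeholders.

```python
import re

ARABIC_LETTER = re.compile(r"[\u0600-\u06FF]")

def is_mostly_arabic(text, threshold=0.5):
    # Assumed rule: keep the page if at least `threshold` of its letters are Arabic.
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    arabic = sum(1 for c in letters if ARABIC_LETTER.match(c))
    return arabic / len(letters) >= threshold

# `pages` stands in for the extracted Common Crawl page texts.
arabic_pages = [p for p in pages if is_mostly_arabic(p)]
```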
Data Resources | Wikipedia
Motivation: Wikipedia is a clean resource collaboratively written by users all over the world. Because its articles are written in many languages, it is a commonly used resource for many NLP tasks.
Data Collection: Downloaded the Arabic dump dated January 2017. After segmenting the articles into paragraphs, we ended up with 1,800,000 paragraphs, each representing a document to be used for building our model.
Summary of used data

                                               Twitter          WWW              Wikipedia
# of documents                                 66,900,000       132,750,000      1,800,000
Total # of Tokens                              1,090,082,092    2,225,317,169    78,993,036
Unique tokens after applying min-count filter  164,077          146,237          140,319
Data Preprocessing | Normalization
Normalization of Arabic characters is a common preprocessing step when dealing with Arabic text. In this step we started by unifying the different forms of some characters; for example, all of [أ-آ-إ] become [ا]. We also removed special characters such as diacritics, and normalized mentions, URLs, emojis and emoticons.
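A minimal sketch of these normalization rules (the exact rules used for AraVec may differ; the USER/URL placeholder tokens are assumptions, not taken from the slides):

```python
import re

def normalize_arabic(text):
    text = re.sub(r"[أآإ]", "ا", text)            # unify alef forms
    text = re.sub(r"[\u064B-\u0652]", "", text)    # strip diacritics (tashkeel)
    text = re.sub(r"@\w+", "USER", text)           # normalize mentions
    text = re.sub(r"https?://\S+", "URL", text)    # normalize URLs
    # emoji/emoticon normalization omitted here for brevity
    return text

print(normalize_arabic("أهلاً بكم في مِصرَ @user https://example.com"))
```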
Data Preprocessing | X-Rated Content
While it may be desirable to have coverage of “X-Rated” content for its semantic value, we did not want this kind of content to dominate our dataset, as that would result in a dataset skewed towards it. We built an “X-Rated” lexicon to detect the occurrence of X-Rated words in a paragraph. To filter out paragraphs, we calculate the percentage of X-Rated words contained within each of them; if this percentage is greater than 20%, we discard that paragraph.
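The 20% rule translates into a simple filter; `xrated_lexicon` and `paragraphs` below are placeholders, since the lexicon itself is not distributed with the slides.

```python
def keep_paragraph(tokens, xrated_lexicon, threshold=0.20):
    # Discard the paragraph when more than 20% of its tokens are X-Rated words.
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t in xrated_lexicon)
    return hits / len(tokens) <= threshold

clean_paragraphs = [p for p in paragraphs if keep_paragraph(p, xrated_lexicon)]
```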
Building the Models
The models that we constructed were built using the Python Gensim tool.
An Intel quad-core PC with 32 GB of RAM running Ubuntu was used to train the models.
Training times were as follows:
The Wikipedia model: 10 hours
The Twitter model: 1.5 days
The Common Crawl (Web) model: 4 days
Building the Models
We built two models for each of the three text domains, ending up with six different models as described in the following table:

Model            Docs No.       Dimension   Min Word Freq.   Window Size   Technique
Twitter-CBOW     66,900,000     300         500              3             CBOW
Twitter-SG       66,900,000     300         500              3             Skip-Gram
WWW-CBOW         132,750,000    300                          5             CBOW
WWW-SG           132,750,000    300                          5             Skip-Gram
Wikipedia-CBOW   1,800,000      300         20                             CBOW
Wikipedia-SG     1,800,000      300         20                             Skip-Gram
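Read as gensim training calls (the slides confirm Gensim was used, but not the exact invocation), the fully specified Twitter row corresponds roughly to the sketch below; `twitter_docs` stands in for the tokenized tweets, and the argument names assume gensim 4.x.

```python
from gensim.models import Word2Vec

# Twitter row of the table: 300 dimensions, min word frequency 500, window size 3.
twitter_cbow = Word2Vec(twitter_docs, vector_size=300, min_count=500, window=3, sg=0)
twitter_sg   = Word2Vec(twitter_docs, vector_size=300, min_count=500, window=3, sg=1)
```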
Example of terms that are most similar to Vodafone, obtained from the Twitter models using CBOW and Skip-Gram.
This data was produced using the Twitter-trained model.
Qualitative Evaluation
The purpose of carrying out a qualitative evaluation of our models was to examine how well they capture similarities among words. We used the word vectors of a very small subset of sentiment words and applied a clustering algorithm to see whether words of the same polarity cluster together or not. We did the same with a set of randomly selected named entities of known types.
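A sketch of this kind of check; the clustering algorithm is not named in the slides, so k-means is used here as an assumption, and the word list is purely illustrative. The `model` is any of the trained embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans

candidates = ["رائع", "جميل", "ممتاز", "سيء", "فظيع", "مروع"]  # illustrative sentiment words
words = [w for w in candidates if w in model.wv]
X = np.vstack([model.wv[w] for w in words])

# Two clusters: ideally one per polarity.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for word, cluster in zip(words, labels):
    print(cluster, word)
```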
Qualitative Evaluation using Sentiment words | Twitter Model
Twitter CBOW Model Twitter Skip-Gram Model
Qualitative Evaluation using Sentiment words | WWW Model
WWW CBOW Model WWW Skip-Gram Model
Qualitative Evaluation using Sentiment words | Wikipedia Model
Wikipedia CBOW Model Wikipedia Skip-Gram Model
Qualitative Evaluation using Named Entities | Twitter Model
Twitter CBOW Model Twitter Skip-Gram Model
Qualitative Evaluation using Named Entities | WWW Model
WWW CBOW Model WWW Skip-Gram Model
Qualitative Evaluation using Named Entities | Wikipedia Model
Wikipedia CBOW Model Wikipedia Skip-Gram Model
Get the models!
Please reach the models via this page: bit.ly/nu_aravec
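Once downloaded, the models load like any other gensim Word2Vec model; the file name below is a placeholder for whichever of the six models you pick.

```python
from gensim.models import Word2Vec

model = Word2Vec.load("aravec_twitter_cbow.model")  # placeholder file name
print(model.wv.most_similar("مصر", topn=10))
```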
Questions