1
Korean version of GloVe
Applying the GloVe & word2vec models to a Korean corpus
speaker: 양희정
date:
2
0. Index
Introduction
Prerequisite
Corpus Construction
Building Evaluation Task
Training Vector Model
Result Analysis
Project Output
3
1. Introduction
project name: Korean version of GloVe
project abstract: We studied semantic vector space models that represent each Korean word as a real-valued vector. We conducted experiments on Korean word representations for the word analogy task and the word similarity task, in which the Global Vector model, the continuous bag-of-words model, and the skip-gram model are compared.
4
2. Prerequisite semantic vector space model
Representing each word as a real-valued vector yields features that can be used in a variety of applications: information retrieval, document classification, question answering, named entity recognition, parsing, and others.
[figures: vector representation of document space; indexing document space by vector representation]
5
2. Prerequisite Global Vector model
The Global Vector model (GloVe) is an unsupervised learning algorithm for obtaining vector representations of words. The main intuition behind GloVe is that the ratios of word-word co-occurrence probabilities carry meaning: ice co-occurs more frequently with solid than with gas, whereas steam co-occurs more frequently with gas than with solid.
[figure: co-occurrence probabilities for ice and steam with context words]
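In the notation of the GloVe paper (Pennington et al., 2014), with X_ik counting how often word k occurs in the context of word i, this intuition reads:

```latex
P_{ik} = P(k \mid i) = \frac{X_{ik}}{X_i}, \qquad
\frac{P(\text{solid} \mid \text{ice})}{P(\text{solid} \mid \text{steam})} \gg 1, \qquad
\frac{P(\text{gas} \mid \text{ice})}{P(\text{gas} \mid \text{steam})} \ll 1
```

For context words related to both ice and steam (e.g. water) or to neither, the ratio is close to 1, so the ratio discriminates relevant from irrelevant context words.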
6
2. Prerequisite Global Vector model
GloVe is a log-bilinear model with a weighted least-squares objective.
[figures: weighted least-squares regression model of GloVe; plot of the weighting function with α = 3/4]
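For reference, the objective and weighting function plotted on the slide, as given in the GloVe paper, are:

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}
```

Here w_i and \tilde{w}_j are word and context vectors, b_i and \tilde{b}_j are biases, X_ij is the co-occurrence count, and α = 3/4 is the exponent shown in the plot.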
7
2. Prerequisite word2vec model
The word2vec model is a simple single-layer neural network architecture comprising two models: the continuous bag-of-words (CBOW) and skip-gram (SG) models of Mikolov et al. (2013a).
[figure: word2vec model architecture]
8
2. Prerequisite word2vec model
The input of the skip-gram model is a single word w_I and the output is the set of words in w_I's context. In the sentence "I drove my car to the store", when the word "car" is given as input, {"I", "drove", "my", "to", "the", "store"} is the output.
[figure: skip-gram neural network architecture]
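A minimal sketch of enumerating such (input, context) pairs; the function name, window size, and whitespace tokenization are illustrative, not the project's actual code:

```python
# Generate (input word, context word) training pairs for skip-gram.
def skipgram_pairs(tokens, window=5):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "I drove my car to the store".split()
# With "car" as the input word, its context words are the rest of the sentence:
print([c for w, c in skipgram_pairs(sentence) if w == "car"])
# ['I', 'drove', 'my', 'to', 'the', 'store']
```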
9
2. Prerequisite word2vec model
The input of the continuous bag-of-words model is the set of words in w_I's context and the output is the single word w_I; CBOW can be considered the reverse of the SG model. In the sentence "I drove my car to the store", when the words {"I", "drove", "my", "to", "the", "store"} are given as input, "car" is the output.
[figure: continuous bag-of-words neural network architecture]
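The reversed pairing can be sketched the same way, again with illustrative names:

```python
# Generate (context words, target word) training pairs for CBOW.
def cbow_pairs(tokens, window=5):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs

sentence = "I drove my car to the store".split()
print(cbow_pairs(sentence)[3])
# (['I', 'drove', 'my', 'to', 'the', 'store'], 'car')
```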
10
2. Prerequisite project hypothesis
The Global Vector model is applicable to Korean data as a universal learning algorithm, and it fits Korean data better than the word2vec model, just as it does English data. Evaluation measures: correlation on the word similarity task and accuracy on the word analogy task.
11
2. Prerequisite summary
project component | topic | description
project target | Global Vector model (GloVe) | unsupervised learning algorithm trained from a global word-word co-occurrence matrix
project target | word2vec model | neural network architecture comprising the skip-gram and continuous bag-of-words models
project hypothesis | Global Vector model fits Korean data well | results can vary because of the particularities of Korean; the Global Vector model is expected to apply to Korean data better than the word2vec model
12
3. Corpus Construction
From one million sentences collected from the web, 3,552,280 words were collected.
[figures: samples of Korean sentences; samples of Korean words]
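A hedged sketch of the counting step, assuming one sentence per line and simple whitespace tokenization; the file name is hypothetical:

```python
# Count distinct word forms in a line-per-sentence corpus file.
from collections import Counter

vocab = Counter()
with open("corpus.txt", encoding="utf-8") as f:  # hypothetical corpus file
    for sentence in f:
        vocab.update(sentence.split())

print(len(vocab))  # number of distinct word forms collected
```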
13
3. Corpus Construction summary
work no. | description | output | date | detail
1.0 | corpus construction | one million Korean sentences crawled from the web | | example of Korean corpus
14
4. Building Evaluation Task
evaluation tasks in previous studies: the evaluation task sets previously used with GloVe
[figure: WordSim353 evaluation set used in GloVe]
15
4. Building Evaluation Task
word similarity task in Korean: the word similarity task compares the similarity values of word pairs obtained by the cosine similarity of the corresponding vectors with human judgement scores. The human judgement was done by two graduate students majoring in linguistics.
[figure: cosine similarity and human judgement score of '엄마' (mom) and '어머니' (mother)]
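A minimal sketch of this evaluation, assuming the word vectors are numpy arrays in a dict; the names and the human scoring scale are illustrative:

```python
import numpy as np
from scipy.stats import pearsonr

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate_similarity(pairs, vecs):
    # pairs: [(word1, word2, human_score), ...]; vecs: {word: np.ndarray}
    model_scores = [cosine(vecs[a], vecs[b]) for a, b, _ in pairs]
    human_scores = [h for _, _, h in pairs]
    r, _ = pearsonr(model_scores, human_scores)
    return r  # correlation between model similarities and human judgement
```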
16
4. Building Evaluation Task
word similarity task in Korean: Korean is an agglutinative language.
[figure: the seven slots of a finite Korean verb]
17
4. Building Evaluation Task
word similarity task in Korean: Korean allows a word to take multiple particles, and these derived forms should be recognized as the same word.
[figure: comparison of derived forms between Korean and English]
18
4. Building Evaluation Task
word similarity task in Korean: to reflect this agglutinative feature, we reset the vector w_i of word i by building the set V of forms sharing the same stem i and recalculating w_i as a weighted sum of the elements of V.
[figure: visualization of recalculating w_i for the word '밥' (rice), with derived-form frequencies 86,662 / 66,577 / 35,627 / 17,660 / 6,781 / 6,614]
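A sketch of this recalculation, under the assumption (suggested by the frequency counts on the slide) that each derived form is weighted by its corpus frequency, normalized to sum to one:

```python
def synthesize(stem_forms, vecs, freqs):
    # stem_forms: surface forms sharing one stem, e.g. '밥', '밥을', '밥이', ...
    # vecs: {form: numpy vector}; freqs: {form: corpus frequency}
    total = sum(freqs[w] for w in stem_forms)
    # Frequency-weighted sum of the vectors of all forms in the set V.
    return sum((freqs[w] / total) * vecs[w] for w in stem_forms)
```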
19
4. Building Evaluation Task
word similarity task in Korean: we constructed 819 word pairs based on their semantic relatedness and classified them into 4 categories; the vector synthesis method is either applied or not.
[table: word categorization]
20
4. Building Evaluation Task
word similarity task in Korean: examples of the 4 categories of pairs.
[figure: word pair examples based on '감정' (emotion); from left to right, the categories are modifier-nouns, entailment, relational pairs, and collocations]
21
4. Building Evaluation Task
word analogy task in Korean: the word analogy task tries to answer "a is to b as c is to __?" by finding the word d whose representation w_d is closest to w_b - w_a + w_c.
[figure: word analogy test on syntactic word pairs]
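A minimal sketch of the retrieval step, assuming unit-normalized vectors in a dict and excluding the three query words, as is standard for this task:

```python
import numpy as np

def analogy(a, b, c, vecs):
    # Answer "a is to b as c is to __?" with the word whose (unit) vector
    # has the highest cosine similarity to w_b - w_a + w_c.
    target = vecs[b] - vecs[a] + vecs[c]
    target /= np.linalg.norm(target)
    best, best_sim = None, -1.0
    for word, v in vecs.items():
        if word in (a, b, c):           # exclude the query words themselves
            continue
        sim = float(np.dot(v, target))  # cosine similarity for unit vectors
        if sim > best_sim:
            best, best_sim = word, sim
    return best
```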
22
4. Building Evaluation Task
word analogy task in Korean: the 3COSADD method finds the word d whose representation w_d is closest to w_b - w_a + w_c by cosine similarity; the 3COSMUL method is a modified, multiplicative version of 3COSADD.
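Written out (following Levy and Goldberg, 2014), the two methods are:

```latex
\text{3COSADD:}\quad d = \arg\max_{d' \in V}\ \cos\!\left(w_{d'},\, w_b - w_a + w_c\right)
\qquad
\text{3COSMUL:}\quad d = \arg\max_{d' \in V}\ \frac{\cos(w_{d'}, w_b)\,\cos(w_{d'}, w_c)}{\cos(w_{d'}, w_a) + \varepsilon}
```

where ε is a small constant that avoids division by zero and the cosine similarities in 3COSMUL are shifted to be non-negative.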
23
4. Building Evaluation Task
word analogy task in Korean: we constructed 90 word quadruplets based on their semantic relatedness and classified them into 2 categories: 48 semantic analogies and 42 syntactic analogies. Both analogy calculation methods (3COSADD and 3COSMUL) are applied.
[figures: examples of semantic and syntactic analogy quadruplets]
24
4. Building Evaluation Task
summary
work no. | description | output | date | detail
2.1 | prerequisite | information about previous studies | | WordSim353
2.2 | word similarity task | 819 word pairs with 4 categories | | example of collocation pairs; vector synthesis
2.3 | word analogy task | 90 word quadruplets with 2 categories | | example of semantic analogy quadruplets; 2 calculation methods
25
5. Training Vector Model
We trained GloVe and word2vec (SG, CBOW) on the 1.0 corpus, producing vector files of dimension 50, 100, 200, 300, 400, 500, and 1000.
[figure: word vector lists]
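The slide does not name the training tools; as an illustration, the word2vec runs could look like the following gensim sketch (all hyperparameters other than the dimensions are assumptions), with GloVe trained separately using its official toolkit:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("corpus.txt")  # hypothetical path: one tokenized sentence per line
for dim in (50, 100, 200, 300, 400, 500, 1000):
    for sg in (0, 1):  # 0 = CBOW, 1 = skip-gram
        model = Word2Vec(sentences, vector_size=dim, sg=sg, window=5, min_count=5)
        model.wv.save_word2vec_format(f"vectors_{'sg' if sg else 'cbow'}_{dim}.txt")
```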
26
5. Training Vector Model summary work no. description output date
work no. | description | output | date | detail
3.0 | trained vector models | vector files of various dimensions trained by GloVe or word2vec | | example of a vector file
27
6. Result Analysis similarity task result
The best Pearson correlation was achieved at dimension 500 with vector synthesis applied.
[figure: correlation coefficients at various dimensions, before and after vector synthesis]
28
6. Result Analysis similarity task result
comparison between word2vec and GloVe
[figure: correlation coefficients of word2vec and GloVe]
29
6. Result Analysis analogy task result
In the semantic analogy task, the 3COSADD method at dimension 1000 achieved the highest accuracy; in the syntactic task, the 3COSADD method at dimension 50 was the highest.
[figures: accuracy on the semantic analogy task; accuracy on the syntactic analogy task]
30
6. Result Analysis analogy task result
comparison between word2vec and GloVe
[figure: accuracy on the analogy task of word2vec and GloVe]
31
6. Result Analysis summary
work no. | description | output | date | detail
4.1 | similarity task result | GloVe correlations at various dimensions | | GloVe achieved a Pearson correlation coefficient of ___ at dimension 500
4.1 | similarity task result | correlation comparison between GloVe and word2vec | | CBOW achieved a Pearson correlation coefficient of ___, SG of ___
4.2 | analogy task result | GloVe accuracy at various dimensions | | GloVe achieved 69% accuracy on the semantic task, 64% on the syntactic task, and 67% overall
4.2 | analogy task result | accuracy comparison between GloVe and word2vec | | CBOW achieved 75% accuracy on the semantic task and 57% on the syntactic task (66% overall), while SG achieved 65% and 48% (57% overall)
32
7. Project Output A Study on Word Vector Models for Representing Korean Semantic Information