Download presentation
Presentation is loading. Please wait.
Published byClaud Edwards Modified over 6 years ago
1
Zhe Ye yezhejack@sjtu.edu.cn 2017.9.28
Word2vec Tutorial Zhe Ye
2
Outline One-hot representation vs word vectors Requirement
Virtual environment Python Corpus Gensim Training word vectors Evaluation Analogy Word clustering
3
One-hot representation vs word vectors
Sparse: using 3000K dimensions to represent vocabulary with 3000K word types Not related Word vectors Dense: using 300 (or less) to represent vocabulary with 3000K word types related
4
Outline One-hot representation vs word vectors Requirement
Virtual environment Python Corpus Gensim Training word vectors Evaluation Analogy Word clustering
5
Virtual environment Features
Provide separate dependency libraries Do not require admin or sudo to install package Two famous tools which provide these features Anaconda It’s convenient in windows (scipy) Virtualenv
6
Python It’s very popular in NLP or (Data Science)
It’s very simple and easy to understand Version: Python 2.7
7
Corpus Tokenized plain text (Chinese and English is ok)
我们 很 高兴 We are very happy . Tokenized plain text resource atmt.org/lm-benchmark/1-billion-word-language- modeling-benchmark-r13output.tar.gz Tokenizer LTP for Chinese ( Stanford Tokenizer (
8
Gensim Implementing a wrapper for word2vec ( It provide python api
9
Outline One-hot representation vs word vectors Requirement
Virtual environment Python Corpus Gensim Training word vectors Evaluation Analogy Word clustering
10
Training word vectors Linux+virtualenv+gensim is recommended
Windows 10 (64bit) + anaconda+gensim is ok
11
Outline One-hot representation vs word vectors Requirement
Virtual environment Python Corpus Gensim Training word vectors Evaluation Analogy Word clustering
12
Evaluation Analogy Word Clustering
vector(‘paris’)-vector(‘France’)+vector(‘Italy’)=vector(‘Rome’) Word Clustering
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.