Outline Problem Background Theory Extending to NLP and Experiment

Name: Outline Problem Background Theory Extending to NLP and Experiment
Uploaded: 2017-12-18T11:13:28+00:00
Duration: PTM12S4
Channel: Beryl Hunt
Description: Outline Problem Background Theory Extending to NLP and Experiment

Randomized Algorithms and NLP
Using Locality Sensitive Hash Functions for High Speed Noun Clustering
Chen LUO, Mohamed AbdElRahman
Instructor: Dr. Anshumali Shrivastava
Rice University, April 26, 2017

Outline
Problem Background
Theory
    LSH Function Preserving Cosine Similarity (Dimension Reduction)
    Fast Search Algorithm (From n^2 to n)
Extending to NLP and Experiment

Motivation: What is the meaning of the word “tezgüno”?

Motivation: Still not sure? Consider the following context:
A bottle of tezgüno is on the table. Everyone likes tezgüno. Tezgüno makes you drunk. We make tezgüno out of corn.

Motivation: Consider the following contexts:
A bottle of tezgüno is on the table. Everyone likes tezgüno. Tezgüno makes you drunk. We make tezgüno out of corn.
A bottle of beer is on the table. Everyone likes beer. Beer makes you drunk. We make beer out of corn.
“Beer” and “tezgüno” appear in similar contexts, so they likely have similar meanings.

Motivation: We want this process to be done automatically by a computer! So the main task here is to find similar nouns: noun clustering.

Problem Background
Task: clustering nouns at very large scale.
n nodes (n nouns); each node has k features (details later).
Calculating the full similarity matrix has complexity $O(n^2 k)$, which cannot be tolerated when n is very large!

Problem Background
By 2000: over 500 billion readily accessible words on the web.
Now: a very, very large amount!
We want linear time: hashing is a good way!


LSH Function Preserving Cosine Similarity
The similarity measure between nodes is cosine similarity:
$\cos\theta(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$
We want to design a hash function that preserves this similarity.

From the paper, the hash function is defined as follows:
$h_r(u) = 1$ if $r \cdot u \ge 0$, and $h_r(u) = 0$ if $r \cdot u < 0$.
Here, r is a spherically symmetric random vector of unit length.

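Then, for vectors u and v, we have:
$\Pr[h_r(u) = h_r(v)] = 1 - \frac{\theta(u, v)}{\pi}$
or, equivalently, $\Pr[h_r(u) \ne h_r(v)] = \frac{\theta(u, v)}{\pi}$: the probability that the two hash bits differ is directly proportional to the angle between u and v.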

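From the equation $\Pr[h_r(u) = h_r(v)] = 1 - \frac{\theta(u, v)}{\pi}$, we have
$\theta(u, v) = (1 - \Pr[h_r(u) = h_r(v)])\,\pi$.
Then we can estimate the cosine similarity using:
$\cos\theta(u, v) = \cos\big((1 - \Pr[h_r(u) = h_r(v)])\,\pi\big)$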

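Each vector u can be represented by a bit stream of length d by applying the hash function with d random vectors (e.g., with d = 6). Then $\Pr[h_r(u) = h_r(v)]$ is closely related to the Hamming distance between the bit streams of u and v: it can be estimated as 1 − (Hamming distance)/d.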

For example: Given: Then: So, the Hamming Distance:

Convert: finding the cosine distance of two vectors → finding the Hamming distance between bit streams. Dimension reduction! But the complexity is still $O(n^2)$.
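The conversion from cosine similarity to Hamming distance can be sketched in a few lines. This is a minimal illustration rather than the presenters' implementation: it assumes NumPy, draws the random vectors r from a Gaussian (which is spherically symmetric; the length of r does not change the sign of r · u), and uses the hypothetical names `lsh_signatures`, `hamming`, and `estimated_cosine`.

```python
import numpy as np

def lsh_signatures(vectors, d, seed=0):
    """Return an (n, d) array of bits; bit j of row i is 1 when r_j . u_i >= 0."""
    rng = np.random.default_rng(seed)
    # One Gaussian random vector per output bit (assumption: Gaussian sampling).
    r = rng.standard_normal((vectors.shape[1], d))
    return (vectors @ r >= 0).astype(np.uint8)

def hamming(a, b):
    """Number of positions where two bit streams differ."""
    return int(np.count_nonzero(a != b))

def estimated_cosine(sig_u, sig_v):
    """Estimate cos(theta) from the fraction of differing bits."""
    d = len(sig_u)
    theta = np.pi * hamming(sig_u, sig_v) / d  # Pr[bits differ] = theta / pi
    return float(np.cos(theta))

# Hypothetical example vectors.
u = np.array([1.0, 2.0, 0.0, 1.0])
v = np.array([2.0, 1.0, 1.0, 0.0])
sigs = lsh_signatures(np.vstack([u, v]), d=4096)
true_cos = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
approx_cos = estimated_cosine(sigs[0], sigs[1])
print(true_cos, approx_cos)
```

With a long signature the estimate should land close to the true cosine; with short signatures it is noticeably noisier.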


Fast Search Algorithm
Task: given the signature of each vector, i.e., a bit stream (e.g., 1001), find the nearest neighbors of each vector.

Fast Search Algorithm
Apply q random permutations to each bit stream. This gives q randomly permuted lists. Complexity: O(n).
For example: given a bit stream and two permutations (q = 2), each permutation produces one reordered copy of the bit stream.

Fast Search Algorithm
Sort each of the q randomly permuted lists, and find the nearest B neighbors of each vector in these sorted lists. Complexity: O(n log n).
For example, B = 2, q = 2 (constants).
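The permute-and-sort search can be sketched as follows. This is an illustration under stated assumptions, not the presenters' code: bit streams are Python strings of '0' and '1', "nearest B neighbors" is taken to mean the B entries on each side of a vector in every sorted list, and `fast_neighbors` and `hamming` are hypothetical names.

```python
import random

def hamming(a, b):
    """Number of positions where two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def fast_neighbors(signatures, q=2, B=2, seed=0):
    """For each bit string, return candidate neighbors ordered by Hamming distance."""
    rng = random.Random(seed)
    n, d = len(signatures), len(signatures[0])
    candidates = [set() for _ in range(n)]
    for _ in range(q):
        # Apply the same random permutation of bit positions to every signature.
        perm = list(range(d))
        rng.shuffle(perm)
        permuted = ["".join(s[p] for p in perm) for s in signatures]
        # Sort lexicographically: O(n log n) per permutation.
        order = sorted(range(n), key=lambda i: permuted[i])
        # Only look at the B entries on each side in the sorted list.
        for pos, i in enumerate(order):
            for j in order[max(0, pos - B):pos + B + 1]:
                if j != i:
                    candidates[i].add(j)
    return [
        sorted(c, key=lambda j: (hamming(signatures[i], signatures[j]), j))
        for i, c in enumerate(candidates)
    ]

# Hypothetical signatures; items 0 and 1 are identical.
sigs = ["1001", "1001", "0110", "0111"]
result = fast_neighbors(sigs, q=2, B=1)
print(result)
```

Because only a constant number of neighbors is inspected per sorted list, the sorting step dominates the cost, which matches the O(n log n) figure on the slide when q and B are constants.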

Question
What is the Hamming distance between two bit streams: A=[ ], B=[ ]? Ans. Hamming(A,B)=3
Suppose we have two 2-dimensional vectors u=[1,0], v=[0,1], and r is a spherically symmetric random vector. Then what is the value of $\Pr[h_r(u) = h_r(v)]$?
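For the second question, a short worked calculation, assuming the quantity asked for is the collision probability of the sign hash ($h_r(u) = 1$ if $r \cdot u \ge 0$, else 0) for u=[1,0] and v=[0,1]:

```latex
\cos\theta(u,v) = \frac{u \cdot v}{\|u\|\,\|v\|} = \frac{0}{1 \cdot 1} = 0
\;\Rightarrow\; \theta(u,v) = \frac{\pi}{2},
\qquad
\Pr[h_r(u) = h_r(v)] = 1 - \frac{\theta(u,v)}{\pi} = 1 - \frac{1}{2} = \frac{1}{2}.
```

Orthogonal vectors therefore agree on a hash bit half of the time, which is the same as two unrelated coin flips.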


Testing Datasets
Corpus | Web Corpus | Newspaper Corpus
Corpus Size | From 70 million to 31 million web pages (138GB), after “remove non-English docs & duplicate and near duplicate docs” | 6GB
Nouns Identification | Using a noun-phrase identifier | Using the dependency parser Minipar (Lin, 1994)
Unique Nouns | 655,495 | 65,547
Features Identification for noun vector | For each noun phrase: ←←noun phrase→→ | Take the grammatical context of the noun as in Minipar
Feature Size | 1,306,482 | 940,154

Calculation of Feature Vectors
Mutual Information Vector: used to measure the association strength between two words. Here, it is used between a word (e) and a feature (f):
$mi(e, f) = \log \frac{c_{ef} / N}{\left(\sum_{i=1}^{n} c_{if} / N\right) \times \left(\sum_{j=1}^{k} c_{ej} / N\right)}$
$c_{ef}$ is the number of times word e occurred in context f
N is the total frequency count of all features of all words
n is the number of words
For each noun, we have $MI(e) = (mi(e, f_1), mi(e, f_2), \ldots, mi(e, f_k))$
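A minimal sketch of turning a count table into mutual information vectors, assuming NumPy and assuming that mi(e, f) is the log of the joint relative frequency of (e, f) divided by the product of the relative frequencies of f and of e. The base-10 logarithm, the choice to map zero counts to 0, and the name `mi_matrix` are assumptions made for illustration.

```python
import numpy as np

def mi_matrix(counts):
    """counts[e, f] = times word e occurred with feature f; returns mi(e, f) per cell."""
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()                                   # total count over all words and features
    word_totals = counts.sum(axis=1, keepdims=True)    # row sums: one per word
    feature_totals = counts.sum(axis=0, keepdims=True) # column sums: one per feature
    expected = (word_totals / N) * (feature_totals / N)
    with np.errstate(divide="ignore", invalid="ignore"):
        mi = np.log10((counts / N) / expected)
    mi[counts == 0] = 0.0  # assumption: unseen (word, feature) pairs get 0
    return mi

# Hypothetical 2-word, 2-feature count table.
toy = [[1, 1],
       [2, 0]]
m = mi_matrix(toy)
print(m)
```

Each row of the result is the MI vector of one word, which is the vector that is later hashed into a bit stream.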

Example
Soccer quotes from the internet:
A soccer team is like a beautiful woman. When you do not tell her, she forgets she is beautiful. (Arsène Wenger)
In his life, a man can change wives, political parties or religions but he cannot change his favorite soccer team. (Eduardo Hughes Galeano)
→ “Don’t change your wife”
Removing stop words and identifying nouns:
Soccer team like beautiful woman tell forgets beautiful. Life man change wives political parties religions change favorite soccer team.
Features: for each noun, the 2 words to its left and the 2 words to its right.
All Nouns (5): {Soccer team, woman, wives, parties, religions}
All Features (11): {Like, beautiful, tell, forgets, man, change, political, parties, wives, religions, favorite}
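The window-based feature extraction in this example can be sketched as below. The sketch assumes the two stop-word-filtered sentences are already tokenized, that "soccer team" is treated as a single token (written `soccer_team` here), and that windows do not cross sentence boundaries; `context_counts` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

def context_counts(sentences, nouns, window=2):
    """Count, for each noun, the words within `window` positions to its left and right."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok in nouns:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                counts[tok].update(left + right)
    return counts

sentences = [
    ["soccer_team", "like", "beautiful", "woman", "tell", "forgets", "beautiful"],
    ["life", "man", "change", "wives", "political", "parties",
     "religions", "change", "favorite", "soccer_team"],
]
nouns = {"soccer_team", "woman", "wives", "parties", "religions"}

counts = context_counts(sentences, nouns)
features = sorted({f for row in counts.values() for f in row})
total = sum(sum(row.values()) for row in counts.values())
print(len(features), total)  # 11 20
```

Under these assumptions the sketch yields the 11 features listed above and a total count of 20, with 4 feature occurrences for the soccer team.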

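Example
Soccer team like beautiful woman tell forgets beautiful. Life man change wives political parties religions change favorite soccer team.
Feature counts within 2 words to the left and right of each noun:
Noun | Like | Beautiful | Tell | Forgets | Man | Change | Political | Parties | Wives | Religions | Favorite | Total
Soccer Team | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 4
Woman | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4
Wives | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 4
Parties | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 4
Religions | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 4
Total | 2 | 2 | 1 | 1 | 1 | 4 | 3 | 2 | 1 | 1 | 2 | 20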

Example
MI(soccer team) = (mi(soccer team, like), mi(soccer team, beautiful), …, mi(soccer team, favorite))
$mi(\text{soccer team}, \text{like}) = \log_{10} \frac{1/20}{(2/20) \times (4/20)} = \log_{10} 2.5 \approx 0.4$
Here 1 is the count of (soccer team, like), 2 is the total count of the feature "like", 4 is the total count for "soccer team", and N = 20 is the grand total.
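The arithmetic in this example can be checked with a few lines of Python. The sketch assumes the logarithm is base 10, since that is the base for which log(2.5) rounds to the 0.4 shown; the function name `mi` and its argument names are illustrative only.

```python
import math

def mi(c_ef, feature_total, word_total, N):
    """Mutual information between word e and feature f from raw counts (base-10 log)."""
    return math.log10((c_ef / N) / ((feature_total / N) * (word_total / N)))

# Counts from the example: (soccer team, like) = 1, "like" total = 2,
# "soccer team" total = 4, grand total N = 20.
value = mi(c_ef=1, feature_total=2, word_total=4, N=20)
print(round(value, 3))  # 0.398
```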

Evaluation: LSH function
As d increases, the error decreases and the running time increases.
Randomly choose 100 nouns (vectors) from the web collection (the web corpus dataset). The error is computed over all pairs i with CS(real, i) >= 0.15.
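The trade-off between d and error can be illustrated on synthetic data. This sketch does not reproduce the reported experiment: the vectors are random nonnegative stand-ins for noun feature vectors, the error is taken to be the mean absolute difference between the estimated and true cosine over pairs whose true cosine is at least 0.15, and `mean_cosine_error` is a hypothetical name.

```python
import numpy as np

def mean_cosine_error(vectors, d, threshold=0.15, seed=0):
    """Mean |estimated cosine - true cosine| over pairs with true cosine >= threshold."""
    rng = np.random.default_rng(seed)
    r = rng.standard_normal((vectors.shape[1], d))  # d random hyperplanes
    bits = vectors @ r >= 0                          # (n, d) boolean signatures
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    true_cos = unit @ unit.T
    # Hamming distance between every pair of signatures.
    ham = (bits[:, None, :] != bits[None, :, :]).sum(axis=2)
    est_cos = np.cos(np.pi * ham / d)
    iu = np.triu_indices(len(vectors), k=1)          # each unordered pair once
    keep = true_cos[iu] >= threshold
    return float(np.abs(est_cos - true_cos)[iu][keep].mean())

rng = np.random.default_rng(1)
data = rng.random((30, 50))  # synthetic data, not the web corpus
errors = {d: mean_cosine_error(data, d) for d in (16, 256, 4096)}
print(errors)
```

Longer signatures cost more time and memory per vector, which is the other half of the trade-off stated on the slide.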

Evaluation: Fast Hamming Distance
Randomly choose 100 nouns from the web corpus dataset. For each, calculate all pairwise Hamming distances manually. Filter for those >= the threshold, which gives the “Gold Standard test set”. Obtain a list of bit streams for all nouns from the Web Corpus dataset for Hamming distance calculation. Compare the top N elements retrieved by the fast Hamming distance search against those in the gold standard test set (calculate the percentage overlap).
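One way to compute such a percentage overlap is sketched below. The exact definition used in the evaluation is not given in the slides, so this assumes the overlap is the share of the gold standard's top N items that also appear in the top N retrieved items; `percentage_overlap` and the word lists are hypothetical.

```python
def percentage_overlap(retrieved, gold, top_n):
    """Percentage of the gold top-N items that also appear in the retrieved top-N."""
    gold_top = set(gold[:top_n])
    if not gold_top:
        return 0.0
    retrieved_top = set(retrieved[:top_n])
    return 100.0 * len(retrieved_top & gold_top) / len(gold_top)

# Hypothetical ranked neighbor lists for one noun.
retrieved = ["beer", "wine", "juice", "table"]
gold = ["beer", "juice", "vodka", "wine"]
overlap = percentage_overlap(retrieved, gold, top_n=3)
print(overlap)
```

Averaging this value over the 100 sampled nouns would give a single accuracy figure per setting of N.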

Evaluation: Fast Hamming Distance
As B and q increase, accuracy increases, and so does search time.

Evaluation: Final Similarity Lists
Using the Newspaper Corpus: randomly choose 100 nouns, calculate the top N elements using the randomized algorithm, compare them with those produced by the Pantel and Lin (2002) system, and calculate the percentage overlap.

Summary
Using random vectors, we manage to represent each noun as a bit stream of length d << number of features, which results in dimensionality reduction. The proposed method reduces the running time from quadratic in n to kn, with a similarity accuracy of ~70%.
