Search in Google's N-grams


1 Search in Google's N-grams
J.-S. Roger Jang (張智星), MIR Lab, CSIE Dept., National Taiwan University

2 Linggle
Goals
  Linguistic search
  Computer-assisted language learning
Demos
Query examples: present a method _ propose * to at/in the afternoon discuss ?about the issue v. death penalty to v. education to v. ?prep. ?det. difficulty/difficulties

3 Google's Web1T N-gram Dataset
Blog for Web1T
Statistics
  1,024,908,267,229 words of running text
  1,176,470,663 five-word sequences (each appearing at least 40 times)
  13,588,391 unique words (each appearing at least 200 times)
Applications: machine translation, speech recognition, spelling correction, language learning, and others
From sentences to n-grams: Document → N-grams
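As a rough illustration of the "from sentences to n-grams" step, the sketch below slides a window of n words over a sentence. It assumes plain whitespace tokenization, and the function name `ngrams` is hypothetical; the real Web1T preprocessing is not described on the slide.

```python
def ngrams(sentence, n):
    """Return all n-word sequences of a sentence, in order of appearance.

    Assumption: words are separated by whitespace only.
    """
    words = sentence.split()
    # A sentence with fewer than n words yields an empty list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, ngrams("it is a good book", n))
```

Counting how often each sequence occurs over a large collection, and dropping the rare ones, would then give entries similar in spirit to those in the dataset.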

4 Offline Task: Inverted Index
Documents (n-gram entries)
  0: have a book
  1: for good
  2: good to have
  3: good job
Inverted index (dictionary → postings)
  a → 0
  book → 0
  for → 1
  good → 1 2 3
  have → 0 2
  job → 3
  to → 2
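The index on this slide can be reproduced with a few lines of code. This is a minimal sketch, not the homework's reference solution: the function name `build_inverted_index` is hypothetical, and entry IDs are simply the positions of the entries in the input list.

```python
from collections import defaultdict


def build_inverted_index(entries):
    """Map each word to the sorted list of entry IDs that contain it."""
    index = defaultdict(list)
    for doc_id, entry in enumerate(entries):
        # set() avoids listing the same entry twice when a word repeats.
        for word in set(entry.split()):
            index[word].append(doc_id)
    # IDs are appended in increasing order, so each posting is already sorted.
    # Sorting the keys gives the dictionary in alphabetical order.
    return dict(sorted(index.items()))


if __name__ == "__main__":
    entries = ["have a book", "for good", "good to have", "good job"]
    for word, posting in build_inverted_index(entries).items():
        print(word, "->", posting)
```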

5 Online Task 1: Query Expansion
Supported queries
  Wild cards
    _ (exactly one word): listen _ music
    * (any number of words, including none): a * book
  Alternatives
    ? (optional word): discuss ?about the issue
    / (either word): in/at the afternoon
Query expansion
  Expand a query until it is composed only of literals and "_".
  Examples: discuss ?about the issue; in/at the afternoon; a * book; give * a *

6 Online Task 2: Merge Postings
Find the intersection of the postings of the query words (example query: give _ joy)
Check the word ordering of each candidate
Print the results in descending order of frequency
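Intersecting postings is commonly done with a two-pointer merge over sorted lists. The sketch below assumes each posting is a sorted list of integer entry IDs; the names `intersect` and `intersect_all` are hypothetical.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists with a linear two-pointer merge."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def intersect_all(postings):
    """Intersect several postings lists, starting from the shortest one."""
    if not postings:
        return []
    ordered = sorted(postings, key=len)
    result = ordered[0]
    for p in ordered[1:]:
        result = intersect(result, p)
    return result


if __name__ == "__main__":
    print(intersect([1, 2, 3], [0, 2]))                   # good AND have
    print(intersect_all([[1, 2, 3], [0, 2], [2]]))        # good AND have AND to
```

Starting from the shortest list keeps the intermediate result small. The intersection only says which entries contain all the words; the ordering check is still needed afterwards.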

7 Typical Approach
Offline task: Inverted index
  Extract all words from the n-gram dataset.
  Create a sorted dictionary of all unique words.
  Generate a posting for each word in the dictionary.
Online task: Query processing
  Expand the query until it contains only words or "_".
  Extract the words from the query.
  Retrieve each word (by binary search or the like) and its posting from one of the n-gram sets.
  Combine the postings to obtain the candidate output set.
  Generate the final output by checking word ordering, etc.
  Sort and print the output ordered by frequency.
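The offline and online steps above can be sketched end to end on a toy dataset. Everything here is illustrative: the function names are hypothetical, the frequency counts are invented, and Python sets are used for the intersection only to keep the sketch short. The query is assumed to be already expanded, so it contains only words and "_".

```python
from bisect import bisect_left


def build(entries):
    """Offline: return (sorted dictionary, postings aligned with it)."""
    postings = {}
    for doc_id, (text, _count) in enumerate(entries):
        for word in set(text.split()):
            postings.setdefault(word, []).append(doc_id)
    words = sorted(postings)
    return words, [postings[w] for w in words]


def lookup(words, plists, word):
    """Find a word's posting by binary search over the sorted dictionary."""
    i = bisect_left(words, word)
    return plists[i] if i < len(words) and words[i] == word else []


def matches(pattern, text):
    """Ordering check: same length, and every non-'_' token is in place."""
    p, t = pattern.split(), text.split()
    return len(p) == len(t) and all(a == "_" or a == b for a, b in zip(p, t))


def search(pattern, entries, words, plists):
    """Online: intersect postings, check ordering, sort by frequency."""
    literals = [w for w in pattern.split() if w != "_"]
    if literals:
        cand = set(lookup(words, plists, literals[0]))
        for w in literals[1:]:
            cand &= set(lookup(words, plists, w))
    else:
        cand = set(range(len(entries)))  # a query of only "_" matches by length
    hits = [entries[i] for i in cand if matches(pattern, entries[i][0])]
    return sorted(hits, key=lambda e: -e[1])


if __name__ == "__main__":
    # Invented counts, for illustration only.
    entries = [("have a book", 120), ("for good", 300),
               ("good to have", 80), ("good job", 500), ("good luck", 900)]
    words, plists = build(entries)
    for text, count in search("good _", entries, words, plists):
        print(count, text)
```

Note that "for good" contains the word "good" and so survives the intersection, but it is rejected by the ordering check because "good" is not in the first position.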

8 3 Steps for Query Expansion (1/2)
First, expand "?".
  abc ?x/y/z _ ?*/p →
    abc _
    abc _ */p
    abc x/y/z _
    abc x/y/z _ */p
Second, expand "/".
  abc x/y/z _ */p →
    abc x _ *
    abc x _ p
    abc y _ *
    abc y _ p
    abc z _ *
    abc z _ p

9 3 Steps for Query Expansion (2/2)
Last, expand "*".
  give * a * →
    give a
    give a _
    give a _ _
    give a _ _ _
    give _ a
    give _ a _
    give _ a _ _
    give _ _ a
    give _ _ a _
    give _ _ _ a
Note that it will be easier to write a recursive function for each of the above expansions.

10 Recursive Formula for Query Expansion
Expand "?"
  expand1({?a, b, c}) is the union of
    expand1({b, c})
    a+expand1({b, c})
Expand "/"
  expand2({a/b, c, d/e}) is the union of
    a+expand2({c, d/e})
    b+expand2({c, d/e})
Expand "*"
  expand3({*, b, c, d}) is the union of
    expand3({b, c, d})
    _+expand3({b, c, d})
    _+_+expand3({b, c, d})
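The three recursive formulas translate fairly directly into code. The sketch below is one possible reading, not the reference solution: the function names are hypothetical, queries are assumed to be whitespace-separated tokens, and the limit `MAX_N = 5` is an assumption inferred from the five-word sequences in the dataset and from the "give * a *" example.

```python
MAX_N = 5  # assumption: no n-gram is longer than five words


def expand_optional(tokens):
    """expand1: a '?x' token is either dropped or kept as 'x'."""
    if not tokens:
        return [[]]
    head, rest = tokens[0], expand_optional(tokens[1:])
    if head.startswith("?"):
        return rest + [[head[1:]] + r for r in rest]
    return [[head] + r for r in rest]


def expand_alternatives(tokens):
    """expand2: an 'a/b' token becomes 'a' or 'b'."""
    if not tokens:
        return [[]]
    rest = expand_alternatives(tokens[1:])
    return [[alt] + r for alt in tokens[0].split("/") for r in rest]


def expand_star(tokens, max_len=MAX_N):
    """expand3: '*' becomes zero or more '_', within max_len words in total."""
    if not tokens:
        return [[]]
    head, tail = tokens[0], tokens[1:]
    if head != "*":
        return [[head] + r for r in expand_star(tail, max_len - 1)]
    fixed = sum(1 for t in tail if t != "*")  # words that must still fit
    out = []
    for k in range(max_len - fixed + 1):
        out += [["_"] * k + r for r in expand_star(tail, max_len - k)]
    return out


def expand(query):
    """Apply the three steps in order: '?', then '/', then '*'."""
    results = []
    for q1 in expand_optional(query.split()):
        for q2 in expand_alternatives(q1):
            results += [" ".join(q3) for q3 in expand_star(q2)]
    return list(dict.fromkeys(results))  # drop duplicates, keep order


if __name__ == "__main__":
    for q in expand("give * a *"):
        print(q)
```

Duplicates can arise when different choices lead to the same literal query (for example, dropping an optional "*" versus expanding it to zero words), which is why the final list is deduplicated.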

11 Web Resources
Tutorials by Stanford NLP (Professors Dan Jurafsky & Chris Manning)
  Inverted index
    This HW does not require tokenization, normalization, stemming, or stop-word handling.
  Merge postings

12 How to Optimize Your Program
Strategies for speedup
  Use tries instead of binary search (***)
  When sorting, use pointers instead of moving entries (***)
  Do not use STL sets for postings. Write your own function for merging postings (**)
  Read a file into memory before further processing (*)
  Use constant-size arrays since n-gram files are fixed (*)
Strategies for saving memory
  Process tokens on the fly
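To illustrate the first strategy, here is a minimal trie that maps words to postings. It is a sketch under simple assumptions: class and method names are hypothetical, children are stored in a per-node dict, and Python is used only for readability. The speed benefit the slide refers to would be seen in a compiled implementation.

```python
class TrieNode:
    __slots__ = ("children", "postings")

    def __init__(self):
        self.children = {}    # next character -> child node
        self.postings = None  # list of entry IDs if a word ends here


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, doc_id):
        """Add doc_id to the posting of word, creating nodes as needed."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        if node.postings is None:
            node.postings = []
        node.postings.append(doc_id)

    def find(self, word):
        """Return the posting of word, or an empty list if it is absent."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return []
        return node.postings or []


if __name__ == "__main__":
    trie = Trie()
    for doc_id, entry in enumerate(["have a book", "for good", "good to have", "good job"]):
        for word in set(entry.split()):
            trie.insert(word, doc_id)
    print(trie.find("good"), trie.find("go"))
```

A lookup walks one node per character of the word, so its cost depends on the word length rather than on the dictionary size, whereas binary search needs a number of string comparisons that grows with the logarithm of the dictionary size.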

13 Suggested schedule
Time is limited!
  Finish the inverted index this week
  Finish query expansion and overall testing next week

