Published by Jonathan Parrish. Modified over 9 years ago.
1
N-gram Search Engine on Wikipedia
Satoshi Sekine (NYU), Kapil Dalwani (JHU)
2
July 30th, 2009, Lexical Knowledge from Ngrams
Hammer: Fast and multi-functional n-gram search engine
ngrams
Search ngram: FAST
INPUT: token, POS, chunk, NE
OUTPUT: frequency to text
3
Characteristics
Search up to 7-grams with wildcards
Multi-level input
  – Token, POS, chunk, NE, combinations
  – NOT, OR for POS, chunk, NE
Multi-level output
  – Token, POS, chunk, NE
  – Document information
  – Original sentences, KWIC, ngram
Display
  – Show the results in order of frequency
Running environment
  – Single CPU, PC-Linux, 400MB process, 500GB disk
4
Demo
http://linserv1.cims.nyu.edu:23232/ngram_wikipedia2
5
Available for you
Web system
  – At NYU: http://nlp.cs.nyu.edu/nsearch
  – At JHU?
USB hard drive
6
Implementation: Overview
1. Search candidates
2. Filtering
3. Display
[Diagram components: Search request; Inverted index for n-gram data; N-gram data; POS, chunk, NE for N-gram data; Suffix array for text; Wikipedia text; Wikipedia POS, chunk, NE]
7
Implementation: Overview
1. Search candidates
[Diagram components: Search request; Inverted index for n-gram data; N-gram data; POS, chunk, NE for N-gram data; Suffix array for text; Wikipedia text; Wikipedia POS, chunk, NE]
8
From n-gram to Inverted Index
Example: 3-grams

Ngram ID   Position=1   Position=2   Position=3
1          A            B            C
2          A            B            B
3          B            A            C

Posting list
A pos=1: 1, 2
A pos=2: 3
B pos=1: 3
B pos=2: 1, 2
B pos=3: 2
C pos=3: 1, 3
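The mapping from the 3-gram table to the posting lists can be sketched in a few lines of Python. This is only an illustration of the idea on the slide, not the authors' implementation; the function name `build_inverted_index` and the dictionary layout are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(ngrams):
    """Map each (token, position) pair to the sorted list of n-gram IDs
    that contain that token at that position (positions start at 1)."""
    index = defaultdict(list)
    for ngram_id in sorted(ngrams):  # visiting IDs in order keeps every list sorted
        for position, token in enumerate(ngrams[ngram_id], start=1):
            index[(token, position)].append(ngram_id)
    return dict(index)

# The three 3-grams from the slide.
ngrams = {1: ("A", "B", "C"), 2: ("A", "B", "B"), 3: ("B", "A", "C")}
index = build_inverted_index(ngrams)

for (token, position), ids in sorted(index.items()):
    print(f"{token} pos={position}: {ids}")
```

Keying the lists by both token and position means a request such as "A in position 1, C in position 3" only needs two lookups followed by an intersection.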
9
Posting list
Wide variation of posting list size (in 7-grams: 1.27B n-grams)
  – “#EOS#” (100,906,888), “,” (55,644,989), “the” (33,762,672)
  – conscipcuous, consiety, Mizuk (1 each)
3 types for faster speed and smaller index size
  – Bitmap (freq > 1%): for #EOS#, 1.27B bits as a bitmap vs. 3.2B bits as a list
  – List of ngramIDs
  – Encoded into the pointer (freq = 1)
[Figure: for "C pos=3", a list of ngramIDs (1, 3), a bitmap (1000110100001001), and a single ngramID (5) stored directly in the pointer]
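A minimal sketch of how the three representations might be chosen, assuming the 1% threshold from the slide. The function names, the tuple layout, and the use of a Python integer as the bitmap are hypothetical choices for illustration. The 32-bit ID width is also an assumption, although it is consistent with the slide's figure of 3.2B bits for the #EOS# list.

```python
def encode_posting(ngram_ids, total_ngrams, bitmap_threshold=0.01):
    """Pick one of three encodings for a posting list (hypothetical layout)."""
    ids = sorted(ngram_ids)
    if len(ids) == 1:
        # freq = 1: the single ID is stored in place of a pointer to a list
        return ("inline", ids[0])
    if len(ids) > bitmap_threshold * total_ngrams:
        # frequent token: one bit per n-gram ID, bit (i - 1) set if ID i is present
        bitmap = 0
        for i in ids:
            bitmap |= 1 << (i - 1)
        return ("bitmap", bitmap)
    return ("list", ids)

def bitmap_to_string(bitmap, total_ngrams):
    """Render the bitmap left to right, ID 1 first."""
    return "".join("1" if (bitmap >> (i - 1)) & 1 else "0"
                   for i in range(1, total_ngrams + 1))

# Size comparison for "#EOS#", assuming 32-bit n-gram IDs (an assumption).
EOS_FREQ = 100_906_888
list_bits = EOS_FREQ * 32  # about 3.2B bits, versus 1.27B bits for the bitmap

kind, payload = encode_posting([1, 5, 6, 8, 13, 16], total_ngrams=16)
print(kind, bitmap_to_string(payload, 16))
print(encode_posting([5], total_ngrams=16))
print(list_bits)
```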
10
Search
Given an n-gram request (A B C)
  – Get posting lists for A, B and C
  – Search intersections of posting lists
  – Use “look ahead” to speed up the search
      Look-ahead size = sqrt(size of posting list)
      Moffat and Zobel (1996)
[Figure: two sorted posting lists being intersected, with a SKIP over part of the longer list]
  4 33 34 55 76 80 89 92 99
  4 12 15 19 22 33 37 46 59 60 62 76 82 89 94 98
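The look-ahead intersection can be sketched as follows, using the two posting lists from the figure. This is an illustrative version under the assumption that posting lists are sorted lists of n-gram IDs held in memory; the function names are hypothetical and the actual index layout is not described on the slide.

```python
import math

def intersect_with_skips(short, long):
    """Intersect two sorted ID lists, jumping ahead in `long` by
    sqrt(len(long)) entries whenever that jump cannot overshoot the target."""
    skip = max(1, math.isqrt(len(long)))
    result = []
    j = 0
    for target in short:
        # look ahead: take a jump while the entry `skip` positions on is still <= target
        while j + skip < len(long) and long[j + skip] <= target:
            j += skip
        # finish with a short linear scan inside the current block
        while j < len(long) and long[j] < target:
            j += 1
        if j == len(long):
            break
        if long[j] == target:
            result.append(target)
    return result

def intersect_all(posting_lists):
    """Intersect several posting lists, starting from the shortest one."""
    ordered = sorted(posting_lists, key=len)
    result = ordered[0]
    for plist in ordered[1:]:
        result = intersect_with_skips(result, plist)
    return result

a = [4, 33, 34, 55, 76, 80, 89, 92, 99]
b = [4, 12, 15, 19, 22, 33, 37, 46, 59, 60, 62, 76, 82, 89, 94, 98]
print(intersect_with_skips(a, b))  # [4, 33, 76, 89]
```

Starting from the shortest list keeps the intermediate result small, so the long lists for frequent tokens are mostly skipped over rather than scanned.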
11
Implementation: Overview
1. Search candidates
2. Filtering
[Diagram components: Search request; Inverted index for n-gram data; N-gram data; POS, chunk, NE for N-gram data; Suffix array for text; Wikipedia text; Wikipedia POS, chunk, NE]
12
Filtering
Not all candidate ngramIDs match the request
We need frequency and sentence information for the matched n-grams
POS, chunk and NE information is represented as IDs
  – Reduces the index size by more than 200GB
[Figure: n-gram "A B" with tag labels NN, VB, PERSON, LOC and frequencies Freq=123, Freq=10, Freq=5]
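One way to picture the filtering step: each candidate n-gram carries several annotation variants, each with its own frequency, and only the variants that satisfy the request's POS constraints are counted. The sketch below is an assumption-heavy illustration. The table layout, the tag-to-ID mapping, and the assignment of the frequencies 123, 10 and 5 to particular tag sequences are all invented for the example; only the frequencies themselves appear on the slide.

```python
# Hypothetical tag-to-ID table: tags are stored as small integers, not strings.
POS_IDS = {"NN": 0, "VB": 1}
ALL_POS = set(POS_IDS.values())

# Hypothetical annotation table: ngram_id -> list of (POS ID per position, frequency).
annotations = {
    1: [((0, 1), 123), ((0, 0), 10), ((1, 1), 5)],
    2: [((1, 0), 7)],
}

def filter_candidates(candidates, annotations, pos_constraint):
    """Keep candidates with at least one annotation variant matching the request.

    pos_constraint has one entry per position: None is a wildcard, otherwise a
    set of allowed POS IDs (OR = several IDs, NOT = complement of a set).
    Returns {ngram_id: summed frequency of the matching variants}.
    """
    matched = {}
    for ngram_id in candidates:
        freq = sum(
            f
            for tags, f in annotations.get(ngram_id, [])
            if all(allowed is None or tag in allowed
                   for tag, allowed in zip(tags, pos_constraint))
        )
        if freq:
            matched[ngram_id] = freq
    return matched

nn = {POS_IDS["NN"]}
print(filter_candidates([1, 2], annotations, (nn, None)))            # first token is NN
print(filter_candidates([1, 2], annotations, (ALL_POS - nn, None)))  # first token is NOT NN
```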
13
Implementation: Overview
1. Search candidates
2. Filtering
3. Display
[Diagram components: Search request; Inverted index for n-gram data; N-gram data; POS, chunk, NE for N-gram data; Suffix array for text; Wikipedia text; Wikipedia POS, chunk, NE]
14
Display
N-grams are displayed in descending order of frequency
  – N-gram IDs are ordered by frequency
Sentences are searched using the suffix array
POS, chunk, NE are displayed with sentence, KWIC, ngram
Doc ID and Wikipedia title (and possibly other document features) are displayed with sentences and KWIC
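Looking up the occurrences of an n-gram with a suffix array, and rendering them as KWIC lines, can be sketched as below. This is a toy, token-level version with a naive construction that would not scale to the full Wikipedia text; the function names and the bracketed KWIC format are assumptions for illustration.

```python
def build_suffix_array(tokens):
    """Start positions of all token suffixes, sorted lexicographically.
    Naive construction, suitable only for small examples."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def find_occurrences(tokens, suffix_array, query):
    """Binary-search the suffix array for every position where `query` starts."""
    n = len(query)

    def prefix(k):
        start = suffix_array[k]
        return tokens[start:start + n]

    lo, hi = 0, len(suffix_array)
    while lo < hi:                      # first suffix whose prefix is >= query
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:                      # first suffix whose prefix is > query
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

def kwic(tokens, pos, n, width=3):
    """Keyword-in-context line: up to `width` tokens on each side of the match."""
    left = " ".join(tokens[max(0, pos - width):pos])
    key = " ".join(tokens[pos:pos + n])
    right = " ".join(tokens[pos + n:pos + n + width])
    return f"{left} [{key}] {right}"

tokens = "the cat sat on the mat and the cat slept".split()
suffix_array = build_suffix_array(tokens)
hits = find_occurrences(tokens, suffix_array, ["the", "cat"])
for pos in hits:
    print(kwic(tokens, pos, 2))
```

Because all suffixes that begin with the query are adjacent in the sorted array, two binary searches are enough to find every occurrence.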
15
Size of data
[Diagram components: Wikipedia text; Wikipedia POS, chunk, NE; N-gram data; Inverted index for n-gram data; Suffix array for text; POS, chunk, NE for N-gram data; Others]
[Sizes shown in the diagram, in extraction order: 108 GB, 6 GB, 8 GB, 260 GB, 100 GB; Others 40 GB]
Total: 530GB
Text
  – 1.7G words
  – 200M sentences
  – 2.4M articles
Ngram
  – 1: 8M
  – 2: 93M
  – 3: 377M
  – 4: 733M
  – 5: 1.00B
  – 6: 1.17B
  – 7: 1.27B
16
Future Work
Other information (e.g., parse, coref, relation, genre, discourse…)
Longer n-grams
Compress index, dictionary
Ease the indexing load
  – Now we need a big-memory machine
  – Distributed indexing
Union operation for tokens
17
Available for you
Web demo
  – At NYU: http://nlp.cs.nyu.edu/nsearch
  – At JHU?
USB hard drive