N-gram Search Engine on Wikipedia Satoshi Sekine (NYU) Kapil Dalwani (JHU)



July 30th, 2009: Lexical Knowledge from Ngrams

Hammer: Fast and multi-functional n-gram search engine
- Search n-grams: FAST
- INPUT: token, POS, chunk, NE
- OUTPUT: frequency to text

Characteristics
- Search up to 7-grams with wildcards
- Multi-level input
  - Token, POS, chunk, NE, combinations
  - NOT, OR for POS, chunk, NE
- Multi-level output
  - Token, POS, chunk, NE
  - Document information
  - Original sentences, KWIC, n-gram
- Display
  - Show the results in order of frequency
- Running environment
  - Single CPU, PC-Linux, 400MB process, 500GB disk

Demo

Available for you
- Web system
  - At NYU
  - At JHU?
- USB hard drive

Implementation: Overview
- Steps: 1. Search candidates, 2. Filtering, 3. Display
- Source data: Wikipedia text; Wikipedia POS, chunk, NE; N-gram data
- Indexes: inverted index for n-gram data; suffix array for text; POS, chunk, NE for n-gram data
- Input: search request

Implementation: Overview (step 1: Search candidates)

From n-gram to Inverted Index

Example: 3-grams

Ngram ID   Position=1   Position=2   Position=3
1          A            B            C
2          A            B            B
3          B            A            C

Posting list

Key       Ngram IDs
A pos=1   1, 2
A pos=2   3
B pos=1   3
B pos=2   1, 2
B pos=3   2
C pos=3   1, 3
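
The mapping from the 3-gram table to posting lists can be sketched in a few lines of Python. This is only an illustration of the idea on the slide, not the Hammer implementation; the function name `build_positional_index` and the dictionary layout are assumptions made for this example.

```python
from collections import defaultdict

def build_positional_index(ngrams):
    """Map (token, position) to a sorted list of n-gram IDs.

    `ngrams` is a dict {ngram_id: tuple of tokens}; positions start at 1,
    as on the slide. The layout is an assumption for illustration.
    """
    index = defaultdict(list)
    for ngram_id in sorted(ngrams):  # ascending IDs keep each posting list sorted
        for pos, token in enumerate(ngrams[ngram_id], start=1):
            index[(token, pos)].append(ngram_id)
    return dict(index)

# The three 3-grams from the slide.
ngrams = {1: ("A", "B", "C"), 2: ("A", "B", "B"), 3: ("B", "A", "C")}
index = build_positional_index(ngrams)

for key in sorted(index):
    print(key, index[key])
```

Keying each posting list by token and position is what lets a request such as (A B C) be answered by intersecting one list per position.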

Posting list
- Wide variation of posting list size (in 7-grams: 1.27B)
  - “#EOS#” (100,906,888), “,” (55,644,989), “the” (33,762,672)
  - conscipcuous, consiety, Mizuk (1 each)
- 3 types for faster speed and smaller index size
  - Bitmap (freq > 1%): for #EOS#, 1.27B bits as a bitmap vs. 3.2B bits as a list
  - List of n-gram IDs
  - Encoded into the pointer (freq = 1)
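
The size comparison for #EOS# can be checked with a small calculation. The sketch below assumes 32-bit n-gram IDs, which is inferred from the slide's figures (about 100.9M entries taking about 3.2B bits) rather than stated there; the function names and the way the 1% threshold is applied are also assumptions for illustration.

```python
ID_BITS = 32  # assumed width of one n-gram ID (inferred, not stated on the slide)

def choose_representation(freq, total_ngrams, bitmap_threshold=0.01):
    """Pick one of the three posting list types named on the slide."""
    if freq == 1:
        return "pointer"   # the single ID is stored in place of a list pointer
    if freq > bitmap_threshold * total_ngrams:
        return "bitmap"    # one bit per n-gram ID
    return "list"          # explicit list of n-gram IDs

def size_in_bits(freq, total_ngrams, kind):
    """Approximate storage cost of a posting list under each representation."""
    if kind == "bitmap":
        return total_ngrams
    if kind == "list":
        return freq * ID_BITS
    return 0  # "pointer": no separate list is stored

TOTAL_7GRAMS = 1_270_000_000
EOS_FREQ = 100_906_888

kind = choose_representation(EOS_FREQ, TOTAL_7GRAMS)
print(kind)
print(size_in_bits(EOS_FREQ, TOTAL_7GRAMS, "bitmap"))  # about 1.27B bits
print(size_in_bits(EOS_FREQ, TOTAL_7GRAMS, "list"))    # about 3.2B bits
```

Under these assumptions the bitmap for #EOS# costs roughly 40% of the explicit list, which matches the 1.27B vs. 3.2B figures on the slide.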

Search
- Given an n-gram request (A B C)
  - Get posting lists for A, B and C
  - Search intersections of posting lists
  - Use “look ahead” (skip) to speed up the search
- Look ahead size = sqrt(size of posting list), following Moffat and Zobel (1996)
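
A minimal sketch of intersecting sorted posting lists with a look-ahead of sqrt(list size) follows. It illustrates the general skipping idea only; the function names, and the choice to process lists from shortest to longest, are assumptions and not necessarily how Hammer does it.

```python
import math

def intersect_with_lookahead(short, long_):
    """Intersect two sorted ID lists, jumping ahead in the longer one."""
    skip = max(1, math.isqrt(len(long_)))  # look-ahead size = sqrt(list size)
    result = []
    j = 0
    for target in short:
        # Jump in blocks of `skip` while the look-ahead entry is not past the target.
        while j + skip < len(long_) and long_[j + skip] <= target:
            j += skip
        # Finish with a short linear scan inside the block.
        while j < len(long_) and long_[j] < target:
            j += 1
        if j == len(long_):
            break
        if long_[j] == target:
            result.append(target)
    return result

def intersect_all(posting_lists):
    """Intersect several posting lists, starting with the shortest."""
    ordered = sorted(posting_lists, key=len)
    result = ordered[0]
    for other in ordered[1:]:
        result = intersect_with_lookahead(result, other)
    return result

# Request (A B C) against the 3-gram example: A pos=1, B pos=2, C pos=3.
candidates = intersect_all([[1, 2], [1, 2], [1, 3]])
print(candidates)
```

Starting from the shortest list keeps the intermediate result small, so the look-ahead jumps in the longer lists do most of the work.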

Implementation: Overview (steps 1 and 2: Search candidates, Filtering)

Filtering
- Not all candidate n-gram IDs match the request
- We need frequency and sentence information for the matched n-grams
- POS, chunk and NE information is represented as IDs
  - Reduces the index by more than 200GB
- Figure labels: tokens A, B; tags NN, VB, PERSON, LOC; Freq=123, Freq=10, Freq=5
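
The filtering step can be pictured as checking each candidate's tag sequences against per-position constraints, with NOT and OR expressed over tag IDs. The sketch below is hypothetical throughout: the tag-to-ID table, the `annotations` layout and the constraint format are invented for illustration, and the frequencies merely reuse the numbers shown on the slide.

```python
# Hypothetical tag-to-ID table; storing small integer IDs instead of tag strings
# is the kind of encoding the slide refers to.
POS_IDS = {"NN": 1, "VB": 2, "JJ": 3}

# Hypothetical store: n-gram ID -> list of (POS ID per position, frequency).
annotations = {
    1: [((1, 2), 123), ((1, 1), 10)],
    2: [((3, 1), 5)],
}

def position_ok(tag_id, constraint):
    """constraint is None (wildcard) or (allowed_ids, negate)."""
    if constraint is None:
        return True
    allowed, negate = constraint
    return (tag_id in allowed) != negate  # OR over `allowed`, flipped by NOT

def filter_candidates(candidate_ids, annotations, constraints):
    """Keep (ngram_id, tag_ids, freq) entries that satisfy every position."""
    hits = []
    for ngram_id in candidate_ids:
        for tag_ids, freq in annotations.get(ngram_id, []):
            if all(position_ok(t, c) for t, c in zip(tag_ids, constraints)):
                hits.append((ngram_id, tag_ids, freq))
    return hits

# Position 1 must be NN; position 2 must NOT be NN.
constraints = [({POS_IDS["NN"]}, False), ({POS_IDS["NN"]}, True)]
hits = filter_candidates([1, 2], annotations, constraints)
print(hits)
```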

Implementation: Overview (steps 1 to 3: Search candidates, Filtering, Display)

Display
- N-grams are displayed in descending order of frequency
  - N-gram IDs are ordered by frequency
- Sentences are searched using the suffix array
- POS, chunk, NE are displayed with sentence, KWIC, n-gram
- Doc ID and Wikipedia title (and possibly other document features) are displayed with sentences and KWIC
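
Looking up occurrences of a matched n-gram in the text with a suffix array can be sketched as below. This is a toy, token-level version with a naive construction, meant only to show the binary search idea; the function names and the KWIC window format are assumptions, and a corpus of Wikipedia's size would need a far more memory-efficient construction.

```python
def build_suffix_array(tokens):
    """Naive token-level suffix array: start positions sorted by suffix."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def find_occurrences(tokens, suffix_array, query):
    """Return sorted start positions where `query` (a token list) occurs."""
    n = len(query)

    def prefix(k):
        start = suffix_array[k]
        return tokens[start:start + n]

    # Lower bound: first suffix whose n-token prefix is >= query.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose n-token prefix is > query.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

def kwic(tokens, pos, n, width=2):
    """Format one hit as keyword-in-context with `width` tokens on each side."""
    left = tokens[max(0, pos - width):pos]
    right = tokens[pos + n:pos + n + width]
    return " ".join(left + ["[" + " ".join(tokens[pos:pos + n]) + "]"] + right)

tokens = "the cat sat on the mat . the cat ran .".split()
sa = build_suffix_array(tokens)
positions = find_occurrences(tokens, sa, ["the", "cat"])
for p in positions:
    print(kwic(tokens, p, 2))
```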

Size of data
- Components: Wikipedia text; Wikipedia POS, chunk, NE; N-gram data; inverted index for n-gram data; suffix array for text; POS, chunk, NE for n-gram data
- Sizes shown: 108 GB, 6 GB, 8 GB, 260 GB, 100 GB; others: 40 GB
- Total: 530 GB
- Text: 1.7G words, 200M sentences, 2.4M articles

N-gram counts

n   Count
1   8M
2   93M
3   377M
4   733M
5   1.00B
6   1.17B
7   1.27B

Future Work
- Other information (e.g. parse, coref, relation, genre, discourse…)
- Longer n-grams
- Compress the index and dictionary
- Ease the indexing load
  - Indexing currently needs a large-memory machine
  - Distributed indexing
- Union operation for tokens

Available for you
- Web demo
  - At NYU
  - At JHU?
- USB hard drive