Search in Google's N-grams

J.-S. Roger Jang (張智星)
jang@mirlab.org
http://mirlab.org/jang
MIR Lab, CSIE Dept., National Taiwan University

Linggle
  Goals
    Linguistic search
    Computer-assisted language learning
  Demos
    http://linggle.com
  Query examples
    present a method _
    propose * to
    at/in the afternoon
    discuss ?about the issue
    v. death penalty
    to v. education
    to v. ?prep. ?det. difficulty/difficulties

Google's Web1T N-gram Dataset
  Blog for Web1T
  Statistics
    1,024,908,267,229 words of running text
    1,176,470,663 five-word sequences (each appearing at least 40 times)
    13,588,391 unique words (each appearing at least 200 times)
  Applications: machine translation, speech recognition, spelling correction, language learning, and others
  From sentences to n-grams
    [Figure: Document -> N-grams]
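The "from sentences to n-grams" step can be sketched as a sliding window over the tokens of a sentence. This is only an illustration: the function name `ngrams` and the sample sentence are assumptions, and the counting and frequency thresholds used to build Web1T are not shown.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    # When the sentence is shorter than n, the range is empty and so is the result.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up sentence, split on whitespace (no real tokenizer is used here).
sentence = "we have a good book".split()
for n in range(1, 4):
    print(n, ngrams(sentence, n))
```

A five-word sentence yields five 1-grams, four 2-grams, and three 3-grams.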

Offline Task: Inverted Index
  Documents (n-gram entries)
    0: have a book
    1: for good
    2: good to have
    3: good job
  Inverted index (dictionary -> postings)
    a    -> 0
    book -> 0
    for  -> 1
    good -> 1 2 3
    have -> 0 2
    job  -> 3
    to   -> 2
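A minimal sketch of building this index, using the four entries from the slide. The function name `build_inverted_index` and the choice of a Python dict for the postings are assumptions made for illustration, not the course's required design.

```python
from collections import defaultdict


def build_inverted_index(entries):
    """Map each word to the sorted list of entry ids that contain it."""
    postings = defaultdict(list)
    for entry_id, entry in enumerate(entries):
        # set() keeps an id from being added twice if a word repeats in one entry.
        for word in set(entry.split()):
            postings[word].append(entry_id)
    # Entries are visited in increasing id order, so each posting is already sorted.
    dictionary = sorted(postings)
    return dictionary, dict(postings)


entries = ["have a book", "for good", "good to have", "good job"]
dictionary, postings = build_inverted_index(entries)
for word in dictionary:
    print(word, "->", *postings[word])
```

The printed lines reproduce the dictionary and postings listed on the slide.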

Online Task 1: Query Expansion
  Supported queries
    Wild cards
      _ (exactly one word): listen _ music
      * (any number of words, including none): a * book
    Alternatives
      ? (optional word): discuss ?about the issue
      / (either word): in/at the afternoon
  Query expansion
    Expand a query until it is composed only of literals and "_".
    Queries that need expansion:
      discuss ?about the issue
      in/at the afternoon
      a * book
      give * a *

Online Task 2: Merge Postings
  Find the intersection of the postings
    give -> 2 5 7 54 78 97 267
    _    -> (any word)
    joy  -> 3 7 45 23 97 345 890
  Check word ordering
  Print in order of descending frequency
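The intersection can be sketched as a two-pointer merge over sorted postings. The function name `intersect` is an assumption; the two lists are the ones on the slide. The merge requires sorted input, and the slide lists 45 before 23 for "joy", so this sketch sorts that list first.

```python
def intersect(p1, p2):
    """Intersect two ascending posting lists in O(len(p1) + len(p2))."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # p1[i] cannot appear later in p2, skip it
        else:
            j += 1  # p2[j] cannot appear later in p1, skip it
    return out


give = [2, 5, 7, 54, 78, 97, 267]
joy = sorted([3, 7, 45, 23, 97, 345, 890])  # sorted because the merge needs ascending ids
print(intersect(give, joy))
```

Only the n-gram ids that survive the intersection need the ordering check that follows.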

Typical Approach
  Offline task: Inverted index
    Extract all words from the n-gram dataset.
    Create a sorted dictionary of all unique words.
    Generate a posting for each word in the dictionary.
  Online task: Query processing
    Expand the query until it contains only words or "_".
    Extract the words from the query.
    Retrieve each word (by binary search or the like) and its posting from one of the n-gram sets.
    Combine the postings to obtain the candidate output set.
    Generate the final output by checking word ordering, etc.
    Sort the output by frequency and print it.
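The last two online steps (ordering check, then sorting by frequency) might look like the sketch below. Everything here is hypothetical: the names `matches` and `answer`, the layout of `ngram_table`, and the frequencies are invented for illustration and do not come from Web1T.

```python
def matches(pattern, ngram):
    """True if the n-gram has the pattern's length and agrees at every literal position."""
    return len(pattern) == len(ngram) and all(
        p == "_" or p == w for p, w in zip(pattern, ngram)
    )


def answer(pattern, candidates, ngram_table):
    """Keep the candidates whose word order fits the pattern, most frequent first."""
    hits = [ngram_table[i] for i in candidates if matches(pattern, ngram_table[i][0])]
    return sorted(hits, key=lambda entry: entry[1], reverse=True)


# Hypothetical table: n-gram id -> (words, frequency). The counts are made up.
ngram_table = {
    7: (("give", "me", "joy"), 120),
    12: (("joy", "to", "give"), 55),  # contains both words, but in the wrong order
    97: (("give", "them", "joy"), 340),
}
result = answer(["give", "_", "joy"], [7, 12, 97], ngram_table)
for words, freq in result:
    print(" ".join(words), freq)
```

Intersecting postings only guarantees that every query word occurs somewhere in the n-gram, which is why the position-by-position check is still needed.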

3 Steps for Query Expansion (1/2)
  First, expand "?".
    abc ?x/y/z _ ?*/p ->
      abc _
      abc _ */p
      abc x/y/z _
      abc x/y/z _ */p
  Second, expand "/".
    abc x/y/z _ */p ->
      abc x _ *
      abc x _ p
      abc y _ *
      abc y _ p
      abc z _ *
      abc z _ p
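The first two steps can be sketched recursively on a list of tokens. The names `expand_optional` and `expand_alternatives` are assumptions; the sample queries are the ones on the slide.

```python
def expand_optional(tokens):
    """Step 1: each '?tok' yields one variant without it and one with 'tok'."""
    if not tokens:
        return [[]]
    head, rest = tokens[0], tokens[1:]
    tails = expand_optional(rest)
    if head.startswith("?"):
        # Variants that drop the optional token come first, then those that keep it.
        return tails + [[head[1:]] + t for t in tails]
    return [[head] + t for t in tails]


def expand_alternatives(tokens):
    """Step 2: each 'a/b/c' token yields one variant per alternative."""
    if not tokens:
        return [[]]
    head, rest = tokens[0], tokens[1:]
    tails = expand_alternatives(rest)
    return [[alt] + t for alt in head.split("/") for t in tails]


for q in expand_optional("abc ?x/y/z _ ?*/p".split()):
    print(" ".join(q))
for q in expand_alternatives("abc x/y/z _ */p".split()):
    print(" ".join(q))
```

Both loops print the expansions in the same order as the slide lists them.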

3 Steps for Query Expansion (2/2)
  Last, expand "*".
    give * a * ->
      give a
      give a _
      give a _ _
      give a _ _ _
      give _ a
      give _ a _
      give _ a _ _
      give _ _ a
      give _ _ a _
      give _ _ _ a
  Note that it will be easier to write a recursive function for each of the above expansions.

Recursive Formula for Query Expansion
  Expand "?"
    expand1({?a, b, c}) is the union of
      expand1({b, c})
      a+expand1({b, c})
  Expand "/"
    expand2({a/b, c, d/e}) is the union of
      a+expand2({c, d/e})
      b+expand2({c, d/e})
  Expand "*"
    expand3({*, b, c, d}) is the union of
      expand3({b, c, d})
      _+expand3({b, c, d})
      _+_+expand3({b, c, d})
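A sketch of the "*" formula follows. It assumes that an expanded query may hold at most five words, which is what the examples imply and matches the longest n-grams in Web1T; the parameter name `max_len` and the helper `rec` are inventions of this sketch.

```python
def expand_star(tokens, max_len=5):
    """Replace each '*' by zero or more '_' while keeping the query within max_len words."""
    fixed = sum(1 for t in tokens if t != "*")
    slack = max_len - fixed  # how many "_" may be inserted in total
    if slack < 0:
        return []  # the literals alone already exceed the longest n-gram

    def rec(toks, budget):
        if not toks:
            return [[]]
        head, rest = toks[0], toks[1:]
        if head != "*":
            return [[head] + t for t in rec(rest, budget)]
        out = []
        for k in range(budget + 1):  # this '*' becomes k underscores
            for t in rec(rest, budget - k):
                out.append(["_"] * k + t)
        return out

    return rec(tokens, slack)


for q in expand_star("give * a *".split()):
    print(" ".join(q))
```

With three literals after the "*", as in expand3({*, b, c, d}), only zero, one, or two "_" fit, which gives the three cases in the formula.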

Web Resources
  Tutorials by Stanford NLP (Professors Dan Jurafsky & Chris Manning)
    Inverted index
      This homework does not require tokenization, normalization, stemming, or stop-word removal.
    Merge postings

How to Optimize Your Program
  Strategies for speedup
    Use tries instead of binary search (***)
    When sorting, use pointers instead of moving entries (***)
    Do not use STL sets for postings. Write your own function for merging postings (**)
    Read a file into memory before further processing (*)
    Use constant-size arrays since n-gram files are fixed (*)
  Strategies for saving memory
    Process tokens on the fly
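The first speedup strategy replaces binary search over the sorted dictionary with a trie, so that a lookup costs one step per character of the query word. The sketch below only shows the structure; the class and method names are assumptions, and a C or C++ version would more likely use fixed-size child arrays than Python dicts.

```python
class TrieNode:
    __slots__ = ("children", "postings")

    def __init__(self):
        self.children = {}    # next character -> child node
        self.postings = None  # set only on nodes that end a dictionary word


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, postings):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.postings = postings

    def lookup(self, word):
        """Return the postings of word, or None if it is not in the dictionary."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.postings


trie = Trie()
trie.insert("good", [1, 2, 3])
trie.insert("have", [0, 2])
print(trie.lookup("good"), trie.lookup("go"))
```

A prefix such as "go" reaches a node but returns None because no posting was stored there.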

Suggested schedule
  Time is limited!
  Finish inverted index this week
  Finish query expansion and overall test next week