Search in Google's N-grams

J.-S. Roger Jang (張智星)
jang@mirlab.org, http://mirlab.org/jang
MIR Lab, CSIE Dept., National Taiwan University

Linggle
- Goals: linguistic search, computer-assisted language learning
- Demos: http://linggle.com
- Query examples: present a method _ propose * to at/in the afternoon discuss ?about the issue v. death penalty to v. education to v. ?prep. ?det. difficulty/difficulties

Google's Web1T N-gram Dataset
- Blog for Web1T
- Statistics
  - 1,024,908,267,229 words of running text
  - 1,176,470,663 five-word sequences (each appearing at least 40 times)
  - 13,588,391 unique words (each appearing at least 200 times)
- Applications: machine translation, speech recognition, spelling correction, language learning, and others
- From sentences to n-grams (figure: Document, N-grams)
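As a rough illustration of the "from sentences to n-grams" step, here is a minimal Python sketch. The names `ngrams` and `count_ngrams` and the whitespace tokenization are assumptions made for this example; the slide does not describe how Web1T was actually tokenized or counted.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(sentences, n):
    """Count n-grams over sentences, splitting on whitespace (an assumption)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.split(), n))
    return counts

if __name__ == "__main__":
    for gram, freq in count_ngrams(["good to have", "good to go"], 2).items():
        print(" ".join(gram), freq)
```

A dataset such as Web1T would additionally drop n-grams below a frequency threshold, as the statistics above indicate.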

Offline Task: Inverted Index
- Documents (n-gram entries)
    0: have a book
    1: for good
    2: good to have
    3: good job
- Inverted index (dictionary → postings)
    a    → 0
    book → 0
    for  → 1
    good → 1 2 3
    have → 0 2
    job  → 3
    to   → 2
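The index on this slide can be reproduced with a few lines of Python. This is only a sketch: `build_inverted_index` is a hypothetical name, and it assumes each n-gram entry is a whitespace-separated string whose ID is its position in the input list.

```python
from collections import defaultdict

def build_inverted_index(entries):
    """Map each word to the sorted list of entry IDs that contain it."""
    postings = defaultdict(list)
    for entry_id, entry in enumerate(entries):
        # set() ensures an ID is recorded once even if a word repeats.
        for word in set(entry.split()):
            postings[word].append(entry_id)
    # IDs are appended in increasing order, so each posting list is sorted.
    # Sorting the keys gives the dictionary of unique words.
    return dict(sorted(postings.items()))

if __name__ == "__main__":
    index = build_inverted_index(
        ["have a book", "for good", "good to have", "good job"])
    for word, ids in index.items():
        print(word, "->", *ids)
```

Keeping both the dictionary and every posting list sorted is what makes binary search and linear-time merging possible later on.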

Online Task 1: Query Expansion
- Supported queries
  - Wild cards
      _ → listen _ music
      * → a * book
  - Alternatives
      ? → discuss ?about the issue
      / → in/at the afternoon
- Query expansion: expand a query until it is composed only of literals and "_"
    discuss ?about the issue
    in/at the afternoon
    a * book
    give * a *

Online Task 2: Merge Postings
- Find the intersection of the postings (query: give _ joy)
    give → 2 5 7 54 78 97 267
    _    → (wildcard, no posting list)
    joy  → 3 7 23 45 97 345 890
- Check the word order
- Print the results in descending order of frequency
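A minimal sketch of the merge step follows. It assumes both posting lists are sorted in ascending order (the `joy` list is used here in sorted form), and it interprets the ordering check as "the candidate n-gram must agree with the query at every position that is not `_`". The function names `intersect` and `matches` are made up for this example.

```python
def intersect(p, q):
    """Intersect two ascending posting lists in O(len(p) + len(q))."""
    i = j = 0
    out = []
    while i < len(p) and j < len(q):
        if p[i] == q[j]:
            out.append(p[i])
            i += 1
            j += 1
        elif p[i] < q[j]:
            i += 1
        else:
            j += 1
    return out

def matches(pattern, ngram):
    """Ordering check: same length, and equal wherever pattern is not '_'."""
    return len(pattern) == len(ngram) and all(
        p == "_" or p == w for p, w in zip(pattern, ngram))

if __name__ == "__main__":
    give = [2, 5, 7, 54, 78, 97, 267]
    joy = [3, 7, 23, 45, 97, 345, 890]
    print(intersect(give, joy))
```

The intersection only yields candidates: an entry containing both words may still have them in the wrong positions, which is why the ordering check runs afterwards.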

Typical Approach
- Offline task: inverted index
  1. Extract all words from the n-gram dataset.
  2. Create a dictionary of all unique words, sorted.
  3. Generate a posting list for each word in the dictionary.
- Online task: query processing
  1. Expand the query until it contains only words or "_".
  2. Extract the words from the query.
  3. Retrieve each word (by binary search or the like) and its posting list from one of the n-gram sets.
  4. Combine the posting lists to obtain the candidate output set.
  5. Generate the final output by checking word order, etc.
  6. Sort the output by frequency and print it.
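The retrieval step ("by binary search or the like") might look as follows in Python. This is a sketch under the assumption that the dictionary is a sorted list of words with a parallel list of posting lists; `lookup` is a hypothetical helper, and `bisect_left` is the standard-library binary search.

```python
from bisect import bisect_left

def lookup(dictionary, postings, word):
    """Binary-search `word` in the sorted dictionary; return its postings.

    `postings[i]` is assumed to belong to `dictionary[i]`.
    An unknown word yields an empty list.
    """
    i = bisect_left(dictionary, word)
    if i < len(dictionary) and dictionary[i] == word:
        return postings[i]
    return []

if __name__ == "__main__":
    dictionary = ["a", "book", "for", "good", "have", "job", "to"]
    postings = [[0], [0], [1], [1, 2, 3], [0, 2], [3], [2]]
    print(lookup(dictionary, postings, "good"))
    print(lookup(dictionary, postings, "music"))
```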

3 Steps for Query Expansion (1/2)
- First, expand "?".
    abc ?x/y/z _ ?*/p →
      abc _
      abc _ */p
      abc x/y/z _
      abc x/y/z _ */p
- Second, expand "/".
    abc x/y/z _ */p →
      abc x _ *
      abc x _ p
      abc y _ *
      abc y _ p
      abc z _ *
      abc z _ p
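The first two steps can be sketched recursively in Python. The names `expand_optional` and `expand_alternatives` are assumptions for this example, and queries are taken to be lists of whitespace-separated tokens.

```python
def expand_optional(tokens):
    """Expand '?': a token starting with '?' is either dropped or kept."""
    if not tokens:
        return [[]]
    head, tail = tokens[0], tokens[1:]
    rest = expand_optional(tail)
    if head.startswith("?"):
        # Union of "without the token" and "with the token, '?' removed".
        return rest + [[head[1:]] + r for r in rest]
    return [[head] + r for r in rest]

def expand_alternatives(tokens):
    """Expand '/': a token such as x/y/z becomes one query per alternative."""
    if not tokens:
        return [[]]
    rest = expand_alternatives(tokens[1:])
    return [[alt] + r for alt in tokens[0].split("/") for r in rest]

if __name__ == "__main__":
    for q in expand_optional("abc ?x/y/z _ ?*/p".split()):
        print(" ".join(q))
    for q in expand_alternatives("abc x/y/z _ */p".split()):
        print(" ".join(q))
```

Running "?" before "/" matters: an optional token like `?x/y/z` must lose its `?` first so that splitting on `/` sees clean alternatives.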

3 Steps for Query Expansion (2/2)
- Last, expand "*".
    give * a * →
      give a
      give a _
      give a _ _
      give a _ _ _
      give _ a
      give _ a _
      give _ a _ _
      give _ _ a
      give _ _ a _
      give _ _ _ a
- Note that it is easier to write a recursive function for each of the above expansions.
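The "*" expansion can be sketched in the same recursive style. The example assumes that "*" stands for zero or more "_" and that an expanded query may have at most five tokens, which is consistent with the ten results listed above and with the five-word limit of the dataset. The name `expand_star` and the `max_len` parameter are hypothetical.

```python
def expand_star(tokens, max_len=5):
    """Replace each '*' by 0..k '_' so that the query has at most max_len tokens."""
    fixed = sum(1 for t in tokens if t != "*")

    def rec(rest, budget):
        # budget = number of '_' that may still be inserted
        if not rest:
            return [[]]
        head, tail = rest[0], rest[1:]
        if head != "*":
            return [[head] + r for r in rec(tail, budget)]
        out = []
        for k in range(budget + 1):
            out.extend(["_"] * k + r for r in rec(tail, budget - k))
        return out

    return [" ".join(q) for q in rec(tokens, max_len - fixed)]

if __name__ == "__main__":
    for q in expand_star("give * a *".split()):
        print(q)
```

Passing the remaining budget down the recursion avoids generating over-long queries and filtering them out afterwards.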

Recursive Formula for Query Expansion
- Expand "?": expand1({?a, b, c}) is the union of
    expand1({b, c})
    a+expand1({b, c})
- Expand "/": expand2({a/b, c, d/e}) is the union of
    a+expand2({c, d/e})
    b+expand2({c, d/e})
- Expand "*": expand3({*, b, c, d}) is the union of
    expand3({b, c, d})
    _+expand3({b, c, d})
    _+_+expand3({b, c, d})

Web Resources
- Tutorials by Stanford NLP (Professors Dan Jurafsky & Chris Manning)
  - Inverted index (this HW does not require tokenization, normalization, stemming, or stop-word removal)
  - Merge postings

How to Optimize Your Program
- Strategies for speedup
  - Use tries instead of binary search (***)
  - When sorting, use pointers instead of moving entries (***)
  - Do not use STL sets for postings; write your own function for merging postings (**)
  - Read a file into memory before further processing (*)
  - Use constant-size arrays, since n-gram files are fixed (*)
- Strategies for saving memory
  - Process tokens on the fly
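To make the first speedup strategy concrete, here is a minimal trie sketch in Python. The class and function names are assumptions for this example; a homework solution in C++ would likely use fixed-size child arrays rather than dictionaries. A lookup walks one node per character, so its cost depends on the word length rather than on the dictionary size.

```python
class TrieNode:
    __slots__ = ("children", "postings")

    def __init__(self):
        self.children = {}    # maps a character to the child node
        self.postings = None  # posting list if a word ends at this node

def trie_insert(root, word, postings):
    """Store `postings` under `word`, creating nodes as needed."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.postings = postings

def trie_find(root, word):
    """Return the postings stored for `word`, or None if it is absent."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return None
    return node.postings

if __name__ == "__main__":
    root = TrieNode()
    trie_insert(root, "good", [1, 2, 3])
    trie_insert(root, "have", [0, 2])
    print(trie_find(root, "good"))
    print(trie_find(root, "goo"))
```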

Suggested Schedule
- Time is limited!
- Finish the inverted index this week
- Finish query expansion and the overall test next week