Index Construction David Kauchak cs458 Fall 2012 adapted from:

Slides:



Advertisements
Similar presentations
Lecture 4: Index Construction
Advertisements

Introduction to Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan.
Index Construction David Kauchak cs160 Fall 2009 adapted from:
Srihari-CSE535-Spring2008 CSE 535 Information Retrieval Lecture 2: Boolean Retrieval Model.
Hinrich Schütze and Christina Lioma Lecture 4: Index Construction
Disk Access Model. Using Secondary Storage Effectively In most studies of algorithms, one assumes the “RAM model”: –Data is in main memory, –Access to.
PrasadL06IndexConstruction1 Index Construction Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford)
Indexes. Primary Indexes Dense Indexes Pointer to every record of a sequential file, (ordered by search key). Can make sense because records may be much.
FALL 2006CENG 351 Data Management and File Structures1 External Sorting.
Information Retrieval IR 4. Plan This time: Index construction.
Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.
1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information.
Using Secondary Storage Effectively In most studies of algorithms, one assumes the "RAM model“: –The data is in main memory, –Access to any item of data.
CpSc 881: Information Retrieval. 2 Hardware basics Many design decisions in information retrieval are based on hardware constraints. We begin by reviewing.
Hinrich Schütze and Christina Lioma Lecture 4: Index Construction
Lecture 4 Index construction
INF 2914 Information Retrieval and Web Search
Indexing Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata.
CSCI 5417 Information Retrieval Systems Jim Martin Lecture 4 9/1/2011.
Lecture 11: DMBS Internals
Introduction to Information Retrieval Information Retrieval and Data Mining (AT71.07) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor:
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 1 Boolean retrieval.
LIS618 lecture 2 the Boolean model Thomas Krichel
Dictionaries and Tolerant retrieval
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Christopher Manning and Prabhakar.
An Introduction to IR Lecture 4 Index construction 1.
Index Construction 1 Lecture 5: Index Construction Web Search and Mining.
Index Compression David Kauchak cs458 Fall 2012 adapted from:
Index Construction (Modified from Stanford CS276 Class Lecture 4 Index construction)
1 ITCS 6265 Lecture 4 Index construction. 2 Plan Last lecture: Dictionary data structures Tolerant retrieval Wildcards Spell correction Soundex This lecture:
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan.
CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13: Ramakrishnan & Gherke and Chapter 2.3: Garcia-Molina et al.
PrasadL06IndexConstruction1 Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
DMBS Internals I. What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the database concurrently.
Index Construction: sorting Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading Chap 4.
1. L01: Corpuses, Terms and Search Basic terminology The need for unstructured text search Boolean Retrieval Model Algorithms for compressing data Algorithms.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 4: Index Construction Related to Chapter 4:
Introduction to Information Retrieval COMP4210: Information Retrieval and Search Engines Lecture 4: Index Construction United International College.
Information Retrieval and Web Search Boolean retrieval Instructor: Rada Mihalcea (Note: some of the slides in this set have been adapted from a course.
Introduction to Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar.
Lecture 4: Index Construction Related to Chapter 4:
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 4: Index Construction Related to Chapter 4:
Web Information Retrieval Textbook by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze Notes Revised by X. Meng for SEU May 2014.
Introduction to Information Retrieval Introduction to Information Retrieval Modified from Stanford CS276 slides Chap. 4: Index Construction.
DMBS Internals I. What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the database concurrently.
Introduction to Information Retrieval CSE 538 MRS BOOK – CHAPTER IV Index Construction 1.
Introduction to Information Retrieval Introduction to Information Retrieval Introducing Information Retrieval and Web Search.
CS276 Lecture 4 Index construction. Plan Last lecture: Tolerant retrieval Wildcards Spell correction Soundex This time: Index construction.
CS315 Introduction to Information Retrieval Boolean Search 1.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 7: Index Construction.
Information Retrieval and Data Mining (AT71. 07) Comp. Sc. and Inf
Index Construction Some of these slides are based on Stanford IR Course slides at
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Large Scale Search: Inverted Index, etc.
Chapter 4 Index construction
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Modified from Stanford CS276 slides Lecture 4: Index Construction
Lecture 7: Index Construction
Lecture 11: DMBS Internals
CS276: Information Retrieval and Web Search
Index Construction: sorting
Lecture 7: Index Construction
Lecture 2- Query Processing (continued)
3-2. Index construction Most slides were adapted from Stanford CS 276 course and University of Munich IR course.
Lecture 4: Index Construction
Index construction 4장.
CENG 351 Data Management and File Structures
CS276: Information Retrieval and Web Search
Presentation transcript:

Index Construction David Kauchak cs458 Fall 2012 adapted from:

Administrative Homework 1? Homework 2 out soon Issues with assignment 1? Talks Thursday videos?

Chinese

Google trends Play with it at some point… Many terms peak: - Mitt Romney - Michael Jackson Some terms are cyclical: - sunscreen - christmas

Stemming Reduce terms to their “roots” before indexing The term “stemming” is used since it is accomplished mostly by chopping off part of the suffix of the word automate automates automatic automation run runs running automat run

Stemming example Take a cours in inform retriev is more excit than most cours Taking a course in information retrieval is more exciting than most courses or use the class from assign1 to try some examples out

Porter’s algorithm (1980) Most common algorithm for stemming English Results suggest it’s at least as good as other stemming options Multiple sequential phases of reductions using rules, e.g. sses  ss ies  i ational  ate tional  tion

Lemmatization Reduce inflectional/variant forms to base form Stemming is an approximation for lemmatization Lemmatization implies doing “proper” reduction to dictionary headword form e.g., am, are, is  be car, cars, car's, cars'  car the boy's cars are different colors the boy car be different color

What normalization techniques to use… What is the size of the corpus? small corpora often require more normalization Depends on the users and the queries Query suggestion (i.e. “did you mean”) can often be used instead of normalization Most major search engines do little to normalize data except lowercasing and removing punctuation (and not even these always)

Hardware basics Many design decisions in information retrieval are based on the characteristics of hardware disk main memory cpu

Hardware basics fast, particularly relative to hard-drive access times gigahertz processors multi-core 64-bit for larger workable address space disk main memory cpu ?

Hardware basics GBs to 100s of GBs for servers main memory buses run at hundreds of megahertz ~random access disk main memory cpu ?

Hardware basics No data is transferred from disk while the disk head is being positioned Transferring one large chunk of data from disk to memory is faster than transferring many small chunks Disk I/O is block-based: Reading and writing of entire blocks (as opposed to smaller chunks). Block sizes: 8KB to 256 KB. disk main memory cpu ?

Hardware basics A few TBs average seek time 5 ms transfer time per byte 0.02 μs disk main memory cpu ?

RCV1: Our corpus for this lecture As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection This is one year of Reuters newswire (part of 1995 and 1996) Still only a moderately sized data set

Reuters RCV1 statistics statisticvalue documents800K avg. # tokens per doc 200 terms 400K non-positional postings100M

Index construction word 1 word 2 word n documents 1 2 m … … index Input: tokenized, normalized documents Output: postings lists sorted by docID How can we do this?

I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. Doc 1 So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious Doc 2 Index construction: collecting docIDs now what? running time?  (tokens) memory? O(1)

Index construction: sort dictionary sort based on terms and then? running time?  (T log T) memory?  (T)

Index construction: create postings list create postings lists from identical entries word 1 word 2 word n … running time?  (tokens) What does this imply about the sorting algorithm?

Scaling index construction In-memory index construction does not scale! What is the major limiting step? both the collecting document IDs and creating posting lists require little memory since it’s just a linear traversal of the data sorting is memory intensive! Even in-place sorting algorithms still require O(n) memory

Scaling index construction In-memory index construction does not scale! For RCV1: statisticvalue documents800K avg. # tokens per doc 200 terms 400K non-positional postings100M How much memory is required?

Scaling index construction In-memory index construction does not scale! For RCV1: statisticvalue documents800K avg. # tokens per doc 200 terms 400K non-positional postings100M What about for 10 years of news data?

Scaling index construction In-memory index construction does not scale! For RCV1: statisticvalue documents800K avg. # tokens per doc 200 terms 400K non-positional postings100M What about for 300 billion web pages?

On-disk sorting What are our options? sort on-disk: keep all data on disk. When we need to access entries, access entries Random access on disk is slow…… Break up list into chunks. Sort chunks, then merge chunks (e.g. unix “merge” function or mergesort) split sort chunks merge chunks

On-disk sorting: splitting split Do this while processing When we reach a particular size, start the sorting process

On-disk sorting: sorting chunks Pick the chunk size so that we can sort the chunk in memory Generally, pick as large a chunk as possible while still being able to sort in memory sort chunks

On-disk sorting How can we do this? merge chunks ?

On-disk sorting How can we do this so that it is time and memory efficient? merge chunks ? ?

Binary merges Could do binary merges, with a merge tree For n chunks, how many levels will there be? log(n)

n-way merge More efficient to do an n-way merge, where you are reading from all blocks simultaneously Providing you read decent-sized chunks of each block into memory, you’re not killed by disk seeks Only one level of merges! Is it linear?

Another approach: SPIMI Sorting can still be expensive Is there any way to do the indexing without sorting? - Accumulate posting lists as they occur - When size gets too big, start a new chunk - Merge chunks at the end

Another approach I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. Doc 1 I did enact julius caesar was killed i’ the capitol brutus me

Another approach I did enact julius caesar was killed i’ the capitol brutus me so let it be with noble … So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious Doc

Another approach word 1 word 2 word n … word 1 word 2 word n … … When posting lists get to large to fit in memory, we write to disk and start another one

The merge word 1 word 2 word n … word 1 word 2 word m … word 1 word 2 word k … Running time? linear in the sizes of the postings list being merged As with merging sorted dictionary entries we can either do pairwise binary tree type merging or do an n-way merge

Distributed indexing For web-scale indexing we must use a distributed computing cluster Individual machines are fault-prone Can unpredictably slow down or fail How do we exploit such a pool of machines?

Google data centers Google data centers mainly contain commodity machines Data centers are distributed around the world Estimates: : a total of 1 million servers, 3 million processors - Google says: In planning 1 million – 10 million machines - Google installs 100,000 servers each quarter Based on expenditures of 200–250 million dollars per year This would be 10% of the computing capacity of the world!?! % of the total worldwide electricity

Fault tolerance Hardware failures What happens when you have 1 million servers? Hardware is always failing! >30% chance of failure within 5 years