Indexing and Complexity. Agenda Inverted indexes Computational complexity.

Presentation transcript:

Indexing and Complexity

Agenda
Inverted indexes
Computational complexity

Some Interesting Questions
How long will it take to find a document?
  –Is there any work we can do in advance? If so, how long will that take?
How big a computer will I need?
  –How much disk space? How much RAM?
What if more documents arrive?
  –How much of the advance work must be repeated?
  –Will searching become slower?
  –How much more disk space will be needed?

A Cautionary Tale
Searching is easy - just ask Microsoft!
  –“Find” can search my 1 GB disk in 30 seconds (well, actually it only looks at the file names...)
How long do you think “Find” would take for
  –The 100 GB disk we just got?
  –For the World Wide Web?
Computers are getting faster, but…
  –How does AltaVista give answers in 5 seconds?

The “Inverted File” Trick
Organize the bag of words matrix by terms
  –You know the terms that you are looking for
Look up terms like you search phone books
  –For each letter, jump directly to the right spot (for terms of reasonable length, this is very fast)
  –For each term, store the document identifiers (for every document that contains that term)
At query time, use the document identifiers
  –Consult a “postings file”
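The idea on this slide can be sketched in a few lines of Python. This is only a minimal illustration, not the method used by any particular system: the function name `build_inverted_index` and the three toy documents are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical toy collection (not the documents from the slides).
docs = {
    1: "quick brown fox",
    2: "lazy brown dog",
    3: "quick lazy fox",
}

index = build_inverted_index(docs)
print(index["brown"])  # [1, 2]
print(index["quick"])  # [1, 3]
```

A Python dict stands in for the term lookup here; on disk, the lookup structure and the postings would be stored separately.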

An Example
[Figure: a term-by-document matrix over Doc 1 to Doc 8 (cell values lost in extraction), an inverted file keyed on leading letters (A, B, C, D, F, G, J, L, M, N, O, P, Q, T, then AI, AL, BA, BR, TH, TI), and one postings list per term.]
Term     Postings
aid      4, 8
all      2, 4, 6
back     1, 3, 7
brown    1, 3, 5, 7
come     2, 4, 6, 8
dog      3, 5
fox      3, 5, 7
good     2, 4, 6, 8
jump     3
lazy     1, 3, 5, 7
men      2, 4, 8
now      2, 6, 8
over     1, 3, 5, 7, 8
party    6, 8
quick    1, 3
their    1, 5, 7
time     2, 4, 6

The Finished Product
[Figure: the same terms, inverted file (A, B, C, D, F, G, J, L, M, N, O, P, Q, T, then AI, AL, BA, BR, TH, TI) and postings lists as in the example, with the term-by-document matrix removed.]
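With the postings in hand, a Boolean AND query reduces to intersecting two sorted lists. The sketch below is one common way to do that, not necessarily what the lecture had in mind. The four postings lists are taken from the example slide on the assumption that the flattened postings are listed in alphabetical term order; the function name `intersect` is made up for the illustration.

```python
def intersect(a, b):
    """Merge two sorted postings lists, keeping document IDs present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Postings as read from the example slide (assumed alphabetical pairing).
postings = {
    "quick": [1, 3],
    "brown": [1, 3, 5, 7],
    "fox": [3, 5, 7],
    "dog": [3, 5],
}

print(intersect(postings["brown"], postings["fox"]))  # [3, 5, 7]
print(intersect(postings["quick"], postings["dog"]))  # [3]
```

Because both lists are sorted, the merge touches each posting at most once.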

What Goes in a Postings File?
Boolean retrieval
  –Just the document number
Ranked retrieval
  –Document number and term weight (TF*IDF, ...)
Proximity operators
  –Word offsets for each occurrence of the term
  –Example: Doc 3 (t17, t36), Doc 13 (t3, t45)
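To make the proximity case concrete, here is a rough sketch of positional postings and a "within k words" check. The offsets for the first term are the ones in the slide's example (Doc 3 at 17 and 36, Doc 13 at 3 and 45); the term names `alpha` and `beta`, the second term's offsets, and the function `within` are all hypothetical.

```python
# term -> {doc_id: sorted word offsets}
positional = {
    "alpha": {3: [17, 36], 13: [3, 45]},  # offsets from the slide's example
    "beta": {3: [18, 90], 13: [20]},      # hypothetical second term
}

def within(term_a, term_b, k, index):
    """Return IDs of documents where term_a occurs within k words of term_b."""
    docs_a = index.get(term_a, {})
    docs_b = index.get(term_b, {})
    hits = []
    for doc_id in sorted(docs_a.keys() & docs_b.keys()):
        if any(abs(pa - pb) <= k
               for pa in docs_a[doc_id]
               for pb in docs_b[doc_id]):
            hits.append(doc_id)
    return hits

print(within("alpha", "beta", 1, positional))  # [3]
```

Storing every offset is what makes this kind of postings file so much larger than one that holds document numbers alone.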

How Big Is the Postings File?
Very compact for Boolean retrieval
  –About 10% of the size of the documents (if an aggressive stopword list is used!)
Not much larger for ranked retrieval
  –Perhaps 20%
Enormous for proximity operators
  –Sometimes larger than the documents!
But access is fast - you know where to look

Building an Inverted Index
Simplest solution is a single sorted array
  –Fast lookup using binary search
  –But sorting large files on disk is very slow
  –And adding one document means starting over
Tree structures allow easy insertion
  –But the worst case lookup time is linear
Balanced trees provide the best of both
  –Fast lookup and easy insertion
  –But they require 45% more disk space
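The sorted-array option can be illustrated with Python's standard `bisect` module. This is an in-memory sketch only; the helper name `lookup` and the four-term array are assumptions for the example, and an on-disk array would pay far more for each insertion.

```python
import bisect

terms = ["all", "good", "now", "time"]  # a sorted term array

def lookup(sorted_terms, term):
    """Binary search: return the position of term, or -1 if it is absent."""
    i = bisect.bisect_left(sorted_terms, term)
    if i < len(sorted_terms) and sorted_terms[i] == term:
        return i
    return -1

print(lookup(terms, "now"))  # 2
print(lookup(terms, "men"))  # -1

# Inserting keeps the array sorted, but every later element must shift.
bisect.insort(terms, "men")
print(terms)  # ['all', 'good', 'men', 'now', 'time']
```

Lookup takes a logarithmic number of comparisons, while each insertion is linear in the array length, which is the trade-off the slide points at.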

Starting a B+ Tree
[Figure: a B+ tree inverted file built from the text “Now is the time for all good …”. Labels shown: now, time, good, all; aaaaa, now.]

Adding a New Term
[Figure: the B+ tree inverted file after the text grows to “Now is the time for all good men …”. Labels shown: now, time, good, all; aaaaa, now; aaaaa, men.]

How Big is the Inverted Index?
Typically smaller than the postings file
  –Depends on number of terms, not documents
Eventually almost all terms will be indexed
  –But the postings file will continue to grow
Postings dominate asymptotic space complexity
  –Linear in the number of documents (assuming that the documents remain about the same size)

Some Facts About Disks
It takes a long time to get the first byte
  –A Pentium can do 1,000,000 operations in 10 ms
But you can get 1,000 bytes just about as fast
  –40 MB/sec transfer rates are typical
So it pays to put related stuff in each “block”
  –M-ary B+ trees are better than binary B+ trees
Time complexity is measured in disk blocks read
  –Since computing time is negligible by comparison
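A back-of-the-envelope calculation shows why fanout matters when cost is counted in block reads. The sketch below assumes one block read per tree level and takes the slide's 10 ms as the cost of each read; the function name `block_reads`, the one-million-term vocabulary, and the fanout of 100 are illustrative assumptions.

```python
def block_reads(num_terms, fanout):
    """Levels a tree with the given fanout needs to address num_terms keys."""
    levels, reach = 0, 1
    while reach < num_terms:
        reach *= fanout
        levels += 1
    return levels

SEEK_MS = 10  # assumed cost of one block read, based on the slide's 10 ms figure

for fanout in (2, 100):
    reads = block_reads(1_000_000, fanout)
    print(fanout, reads, reads * SEEK_MS)
# 2 20 200
# 100 3 30
```

Under these assumptions a binary tree needs 20 block reads per lookup and a 100-way tree needs 3.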

Time Complexity
Indexing
  –Walk the inverted file, splitting if needed
  –Insert into the postings file in sorted order
  –Hours or days for large collections
Query processing
  –Walk the inverted file
  –Read the postings file
  –Seconds, even for enormous collections

Summary
Slow indexing yields fast query processing
We use extra disk space to save query time
  –Index space is in addition to document space
  –Time and space complexity must be balanced
Disk block reads are the critical resource
  –Fast disks are more useful than fast computers

A Question If insertions are more common than queries (for example, filtering news stories as they arrive and then never looking at them again), what kind of an index should you build?

Indexing High Volume Streams
Build an index based on dates
  –Index based on anticipated search strategies
Balanced trees allow easy insertions
  –Easier than sorted arrays
Unbalanced trees might be even faster
  –Indexing time saved could justify query time cost
Don’t do any indexing at all
  –If the queries are stable, just keep them in RAM
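The last option, keeping stable queries in RAM and skipping the index, might look something like the sketch below. It treats each standing query as a set of terms that must all appear in an incoming document. The query names, the query terms, the two sample documents, and the function `match_document` are all hypothetical.

```python
# Hypothetical standing queries: name -> terms that must all appear.
standing_queries = {
    "elections": {"vote", "party"},
    "wildlife": {"fox", "dog"},
}

def match_document(text, queries):
    """Return the names of the standing queries satisfied by one document."""
    terms = set(text.lower().split())
    return sorted(name for name, required in queries.items()
                  if required <= terms)

stream = [
    "the quick brown fox jumps over the lazy dog",
    "good men come to the aid of their party",
]
for doc in stream:
    print(match_document(doc, standing_queries))
# ['wildlife']
# []
```

Each document is examined once as it arrives and then discarded, so no index over the documents is ever built.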