Implementation Issues & IR Systems

Implementation Issues & IR Systems
(Lecture for CS410 Intro Text Info Systems) Feb. 14, 2007 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

Lecture Plan How to implement a simple IR system
Index construction Scoring Open source IR toolkits

IR System Architecture
docs INDEXING Query Rep query Doc Rep User Ranking SEARCHING results INTERFACE Feedback judgments QUERY MODIFICATION

Indexing Indexing = Convert documents to data structures that enable fast search Inverted index is the dominating indexing method (used by all search engines) Other indices (e.g., document index) may be needed for feedback

Inverted Index Fast access to all docs containing a given term (along with freq and pos information) For each term, we get a list of tuples (docID, freq, pos). Given a query, we can fetch the lists for all query terms and work on the involved documents. Boolean query: set operation Natural language query: term weight summing More efficient than scanning docs (why?)

Inverted Index Example
Doc 1 Dictionary Postings This is a sample document with one sample sentence Term # docs Total freq This 2 is sample 3 another 1 … Doc id Freq 1 2 … Doc 2 This is another sample document

Data Structures for Inverted Index
Dictionary: modest size Needs fast random access Preferred to be in memory Hash table, B-tree, trie, … Postings: huge Sequential access is expected Can stay on disk May contain docID, term freq., term pos, etc Compression is desirable

Inverted Index Compression
Observations Inverted list is sorted (e.g., by docid or termfq) Small numbers tend to occur more frequently Implications “d-gap” (store difference): d1, d2-d1, d3-d2-d1,… Exploit skewed frequency distribution: fewer bits for small (high frequency) integers Binary code, unary code, -code, -code

Integer Compression Methods
In general, to exploit skewed distribution Binary: equal-length coding Unary: x1 is coded as x-1 one bits followed by 0, e.g., 3=> 110; 5=>11110 -code: x=> unary code for 1+log x followed by uniform code for x-2 log x in log x bits, e.g., 3=>101, 5=>11001 -code: same as -code ,but replace the unary prefix with -code. E.g., 3=>1001, 5=>10101 Unary is best for small numbers γ-code: for sale number e.g: [log7]=2, 7=>11011 γδ-code:

Constructing Inverted Index
The main difficulty is to build a huge index with limited memory Memory-based methods: not usable for large collections Sort-based methods: Step 1: collect local (termID, docID, freq) tuples Step 2: sort local tuples (to make “runs”) Step 3: pair-wise merge runs Step 4: Output inverted file

... Sort-based Inversion ... ... Parse & Count “Local” sort Merge sort
<1,1,3> <2,1,2> <3,1,1> ... <1,2,2> <3,2,3> <4,2,2> … <1,300,3> <3,300,1> Sort by doc-id Parse & Count <1,1,3> <1,2,2> <2,1,2> <2,4,3> ... <1,5,3> <1,6,2> … <1,299,3> <1,300,1> Sort by term-id “Local” sort <1,1,3> <1,2,2> <1,5,2> <1,6,3> ... <1,300,3> <2,1,2> … <5000,299,1> <5000,300,1> Merge sort All info about term 1 Term Lexicon: the 1 cold 2 days 3 a 4 ... doc1 doc1 ... DocID Lexicon: doc1 1 doc2 2 doc3 3 ... doc300

Searching Given a query, score documents efficiently Boolean query
Fetch the inverted list for all query terms Perform set operations to get the subset of docs that satisfy the Boolean condition E.g., Q1=“info” AND “security” , Q2=“info” OR “security” info: d1, d2, d3, d4 security: d2, d4, d6 Results: {d2,d4} (Q1) {d1,d2,d3,d4,d6} (Q2)

Ranking Documents Assumption:score(d,q)=f[g(w(d,q,t1),…w(d,q,tn)), w(d),w(q)], where, ti’s are the matched terms Maintain a score accumulator for each doc to compute function g For each query term ti Fetch the inverted list {(d1,f1),…,(dn,fn)} For each entry (dj,fj), Compute w(dj,q,ti), and Update score accumulator for doc di Adjust the score to compute f, and sort

Ranking Documents: Example
Query = “info security” S(d,q)=g(t1)+…+g(tn) [sum of freq of matched terms] Info: (d1, 3), (d2, 4), (d3, 1), (d4, 5) Security: (d2, 3), (d4,1), (d5, 3) Accumulators: d d d3 d d5 (d1,3) => (d2,4) => (d3,1) => (d4,5) => (d2,3) => (d4,1) => (d5,3) => info security

Further Improving Efficiency
Keep only the most promising accumulators Sort the inverted list in decreasing order of weights and fetch only N entries with the highest weights Pre-compute as much as possible

Open Source IR Toolkits
Smart (Cornell) MG (RMIT & Melbourne, Australia; Waikato, New Zealand), Lemur (CMU/Univ. of Massachusetts) Terrier (Glasgow) Lucene (Open Source)

Smart The most influential IR system/toolkit
Developed at Cornell since 1960’s Vector space model with lots of weighting options Written in C The Cornell/AT&T groups have used the Smart system to achieve top TREC performance

MG A highly efficient toolkit for retrieval of text and images
Developed by people at Univ. of Waikato, Univ. of Melbourne, and RMIT in 1990’s Written in C, running on Unix Vector space model with lots of compression and speed up tricks People have used it to achieve good TREC performance

Lemur/Indri An IR toolkit emphasizing language models
Developed at CMU and Univ. of Massachusetts in 2000’s Written in C++, highly extensible Vector space and probabilistic models including language models Achieving good TREC performance with a simple language model

Terrier A large-scale retrieval toolkit with lots of applications (e.g., desktop search) and TREC support Developed at University of Glasgow, UK Written in Java, open source “Divergence from randomness” retrieval model and other modern retrieval formulas

Lucene Open Source IR toolkit
Initially developed by Doug Cutting in Java Now has been ported to some other languages Good for building IR/Web applications Many applications have been built using Lucene (e.g., Nutch Search Engine) Currently the retrieval algorithms have poor accuracy

Visit the course resource page for links to these toolkits and other tools…

What You Should Know What is an inverted index
Why does an inverted index help make search fast How to construct a large inverted index Simple integer compression methods How to use an inverted index to rank documents efficiently HOW TO IMPLEMENT A SIMPLE IR SYSTEM

Implementation Issues & IR Systems

Similar presentations

Presentation on theme: "Implementation Issues & IR Systems"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Implementation Issues & IR Systems

Similar presentations

Presentation on theme: "Implementation Issues & IR Systems"— Presentation transcript:

Similar presentations

About project

Feedback