Advanced topics in Computer Science Jiaheng Lu Department of Computer Science Renmin University of China
Course purpose Taught in English most of the time Introduce senior undergraduate students to some advanced topics in computer science
Course contents Introduction to information retrieval Approximate string processing XML data management Cloud computing
Lecturer Academic experience ~ University of California, Irvine, Postdoc researcher Supervisor : Prof. Chen Li ~ National University of Singapore, PhD candidate Supervisor : Prof. Ling Tok Wang ~ Shanghai Jiao Tong University Master candidate
University of California, Irvine
Postdoctoral research Data integration in medical system [US patent] Approximate string search [ICDE08]
National University of Singapore
Course grading Presentation in English/Chinese only 40% Programming only 40% In-class presence and quiz 20%
Any questions or comments?
Evaluating Information Retrieval
Online text book: Introduction to Information Retrieval book.html
Search engines Do you have any comments about search engines? Baidu Google Sogou Yahoo
Measures for a search engine How fast does it index Number of documents/hour (Average document size) How fast does it search Latency as a function of index size Expressiveness of query language Speed on complex queries
Measures for a search engine All of the preceding criteria are measurable: we can quantify speed/size; we can make expressiveness precise The key measure: user happiness What is this? Speed of response/size of index are factors But blindingly fast, useless answers won’t make a user happy Need a way of quantifying user happiness
Measuring user happiness Issue: who is the user we are trying to make happy? Depends on the setting Web engine: user finds what they want and returns to the engine Can measure rate of return users eCommerce site: user finds what they want and makes a purchase Is it the end-user, or the eCommerce site, whose happiness we measure? Measure time to purchase, or fraction of searchers who become buyers?
Measuring user happiness Enterprise (company/govt/academic): Care about “user productivity” How much time do my users save when looking for information? Many other criteria having to do with breadth of access, secure access… more later
Happiness: elusive to measure Most common proxy: relevance of search results. But how do you measure relevance? Will detail a methodology here, then examine its issues Requires 3 elements: 1. A benchmark document collection 2. A benchmark suite of queries 3. A binary assessment of either Relevant or Irrelevant for each query-doc pair
Evaluating an IR system Note: the information need is translated into a query Relevance is assessed relative to the information need, not the query E.g., Information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine. Query: wine red white heart attack effective
Standard relevance benchmarks TREC: the National Institute of Standards and Technology (NIST) has run a large IR benchmark for many years Reuters and other benchmark doc collections used “Retrieval tasks” specified, sometimes as queries Human experts mark, for each query and for each doc, Relevant or Irrelevant, or at least for the subset of docs that some system returned for that query
Precision and Recall Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved) Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant) Precision P = tp/(tp + fp) Recall R = tp/(tp + fn)
                 Relevant   Not Relevant
Retrieved        tp         fp
Not Retrieved    fn         tn
Accuracy – a different measure Given a query, an engine classifies each doc as “Relevant” or “Irrelevant”. Accuracy of an engine: the fraction of these classifications that is correct.
Why not just use accuracy? How to build a % accurate search engine on a low budget …. People doing information retrieval want to find something and have a certain tolerance for junk.
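A small numeric sketch of why accuracy is a poor yardstick for retrieval. The collection size and the number of relevant docs below are made-up values for illustration only: an engine that retrieves nothing at all classifies almost every doc correctly, yet finds none of the relevant ones.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of Relevant/Irrelevant classifications that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

def recall(tp, fn):
    """Fraction of relevant docs that are retrieved."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical collection: 1,000,000 docs, only 10 relevant to the query.
# An engine that retrieves nothing labels every doc "Irrelevant".
tp, fp, fn, tn = 0, 0, 10, 999_990

print(accuracy(tp, fp, fn, tn))  # very close to 1
print(recall(tp, fn))            # 0.0
```

Because relevant docs are a tiny fraction of most collections, the tn count dominates accuracy, which is why precision and recall are used instead.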
Precision/Recall Can get high recall (but low precision) by retrieving all docs for all queries! Recall is a non-decreasing function of the number of docs retrieved Precision usually decreases as the number of docs retrieved grows (in a good system)
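The behaviour of the two measures as more docs are retrieved can be sketched as follows. The ranked list and its relevance labels are hypothetical; the function name `precision_recall_at_k` is also just an illustrative choice.

```python
def precision_recall_at_k(ranked_relevance, total_relevant):
    """Return [(k, precision, recall)] for every cutoff k of a ranked result list.

    ranked_relevance: list of booleans, True if the doc at that rank is relevant.
    total_relevant:   number of relevant docs in the whole collection.
    """
    points = []
    hits = 0
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        hits += is_relevant                      # True counts as 1
        points.append((k, hits / k, hits / total_relevant))
    return points

# Hypothetical ranking: relevant docs at ranks 1, 2 and 5; 3 relevant docs overall.
ranking = [True, True, False, False, True]
curve = precision_recall_at_k(ranking, total_relevant=3)
for k, p, r in curve:
    print(k, round(p, 2), round(r, 2))
```

Recall never goes down as k grows, while precision drops whenever a non-relevant doc is added.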
Difficulties in using precision/recall Should average over large corpus/query ensembles Need human relevance assessments People aren’t reliable assessors Assessments have to be binary Nuanced assessments? Heavily skewed by corpus/authorship Results may not translate from one domain to another
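A combined measure: F Combined measure that assesses the precision/recall tradeoff is the F measure (weighted harmonic mean): $F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$, where $\beta^2 = \frac{1 - \alpha}{\alpha}$. People usually use the balanced $F_1$ measure, i.e., with $\beta = 1$ or $\alpha = \frac{1}{2}$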
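A minimal sketch of the F measure in its usual beta-weighted form, F = (beta^2 + 1)PR / (beta^2 P + R). The function name and the guard for the all-zero case are choices made here for illustration, not something the slide prescribes.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta > 0).

    beta = 1 gives the balanced F1 measure; beta > 1 weights recall more heavily.
    """
    if precision == 0 and recall == 0:
        return 0.0                      # avoid 0/0; no relevant doc was retrieved
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(f_measure(0.5, 0.5))   # equal P and R: F1 is the same value
print(f_measure(1.0, 0.0))   # the harmonic mean punishes a zero in either measure
```

Unlike an arithmetic mean, the harmonic mean stays low unless both precision and recall are reasonably high.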
Any questions or comments?
Precision and Recall Quiz Precision P = tp/(tp + fp) = 10/13 = 77% Recall R = tp/(tp + fn) = 10/15 = 67%
                 Relevant   Not Relevant
Retrieved        10         3
Not Retrieved    5          2
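The quiz arithmetic can be checked with a few lines of code. This is only a sketch; the function names are illustrative.

```python
def precision(tp, fp):
    """Fraction of retrieved docs that are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of relevant docs that are retrieved."""
    return tp / (tp + fn)

# Contingency counts from the quiz table.
tp, fp, fn, tn = 10, 3, 5, 2

p = precision(tp, fp)
r = recall(tp, fn)
print(f"Precision = {p:.0%}, Recall = {r:.0%}")
```

Note that tn does not enter either formula, which is why these measures are not distorted by the large number of irrelevant docs that were correctly left out.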
Introduction to Information Retrieval System
Query Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? Could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia? Slow (for large corpora) NOT Calpurnia is non-trivial Other operations (e.g., find the phrase Romans and countrymen) not feasible
Term-document incidence 1 if play contains word, 0 otherwise
Incidence vectors So we have a 0/1 vector for each term. To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND them: Brutus AND Caesar AND (NOT Calpurnia)
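A sketch of the bitwise AND over incidence vectors, using Python integers as bit vectors. The list of six plays and the 0/1 values are assumed here for illustration, since the incidence matrix itself is not reproduced in the text.

```python
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Assumed 0/1 incidence vectors, one bit per play (leftmost bit = first play).
incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

# Complementing a Python int flips infinitely many bits, so mask to len(plays) bits.
mask = (1 << len(plays)) - 1
result = incidence["Brutus"] & incidence["Caesar"] & (~incidence["Calpurnia"] & mask)

# Decode the result: bit i (counted from the left) corresponds to plays[i].
matches = [play for i, play in enumerate(plays)
           if result & (1 << (len(plays) - 1 - i))]
print(format(result, "06b"), matches)
```

With these assumed vectors the surviving bits point at Antony and Cleopatra and Hamlet.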
Answers to query Antony and Cleopatra, Act III, Scene ii Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain. Hamlet, Act III, Scene ii Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Bigger document collections Consider N = 1 million documents, each with about 1K terms. Avg 6 bytes/term incl spaces/punctuation 6GB of data in the documents. Say there are M = 500K distinct terms among these.
Can’t build the matrix 500K x 1M matrix has half-a-trillion 0’s and 1’s. But it has no more than one billion 1’s. The matrix is extremely sparse. What’s a better representation? We only record the 1 positions. Why?
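The sizes quoted above follow from simple arithmetic, sketched here with the numbers from the text (variable names are illustrative).

```python
num_docs = 1_000_000        # N
terms_per_doc = 1_000       # about 1K terms per document
bytes_per_term = 6          # average, including spaces/punctuation
num_distinct_terms = 500_000  # M

collection_bytes = num_docs * terms_per_doc * bytes_per_term   # size of the raw text
matrix_cells = num_distinct_terms * num_docs                   # cells in the 0/1 matrix
max_ones = num_docs * terms_per_doc                            # at most one 1 per token
density = max_ones / matrix_cells                              # upper bound on fraction of 1's

print(collection_bytes, matrix_cells, max_ones, density)
```

At most 0.2% of the cells can be 1, so storing only the 1 positions is far cheaper than storing the full matrix.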
Inverted index For each term T: store a list of all documents that contain T. Do we use an array or a list for this? [Figure: postings lists for Brutus, Calpurnia, Caesar] What happens if the word Caesar is added to document 14?
Inverted index Linked lists generally preferred to arrays: dynamic space allocation; insertion of terms into documents easy; cost: space overhead of pointers. [Figure: Dictionary (Brutus, Calpurnia, Caesar) pointing to Postings] Postings sorted by docID (more later on why).
Inverted index construction Documents to be indexed: Friends, Romans, countrymen. → Tokenizer → Token stream: Friends Romans Countrymen → Linguistic modules → Modified tokens: friend roman countryman → Indexer → Inverted index: friend roman countryman More on these later.
Indexer steps Sequence of (Modified token, Document ID) pairs. Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Sort by terms. Core indexing step.
Multiple term entries in a single document are merged. Frequency information is added. Why frequency? Will discuss later.
The result is split into a Dictionary file and a Postings file.
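The indexer steps (pairs, sort, merge with frequency, split) can be sketched in a few lines. The tokenizer here is a deliberate simplification (lowercasing and keeping letters and apostrophes) standing in for the linguistic modules, and all variable names are illustrative.

```python
import re
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    # Simplified normalisation (an assumption): lowercase, keep letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

# Step 1: sequence of (term, docID) pairs.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]

# Step 2: sort by term, then by docID.
pairs.sort()

# Step 3: merge duplicate entries within a document and record the frequency.
postings = defaultdict(dict)          # term -> {docID: term frequency in that doc}
for term, doc_id in pairs:
    postings[term][doc_id] = postings[term].get(doc_id, 0) + 1

# Step 4: split into a dictionary (term -> number of docs) and postings lists.
dictionary = {term: len(by_doc) for term, by_doc in postings.items()}
postings_lists = {term: sorted(by_doc) for term, by_doc in postings.items()}

print(dictionary["caesar"], postings_lists["caesar"], postings["caesar"])
```

A real system would write the dictionary and postings to separate files; plain Python dicts are used here only to keep the sketch short.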
Where do we pay in storage? Pointers Terms Will quantify the storage, later.
The index we just built How do we process a Boolean query? Later - what kinds of queries can we process? Today’s focus
Query processing Consider processing the query: Brutus AND Caesar Locate Brutus in the Dictionary; retrieve its postings. Locate Caesar in the Dictionary; retrieve its postings. “Merge” the two postings. [Figure: postings lists for Brutus and Caesar]
The merge Walk through the two postings simultaneously, in time linear in the total number of postings entries. [Figure: merging the Brutus and Caesar postings yields docIDs 2 and 8] If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.
Basic postings intersection
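A minimal sketch of the two-pointer intersection of docID-sorted postings lists. The two example lists are hypothetical, chosen so that the common docIDs are 2 and 8.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

# Hypothetical postings lists.
brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]
```

Each loop iteration advances at least one pointer, which gives the O(x+y) bound; this only works because both lists are sorted by docID.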
Boolean queries: Exact match Queries using AND, OR and NOT together with query terms Views each document as a set of words Is precise: document matches condition or not. Primary commercial retrieval tool for 3 decades. Professional searchers (e.g., lawyers) still like Boolean queries: You know exactly what you’re getting.
Example: WestLaw Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992) About 7 terabytes of data; 700,000 users Majority of users still use boolean queries Example query: What is the statute of limitations in cases involving the federal tort claims act? LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM Long, precise queries; proximity operators; incrementally developed; not like web search
More general merges Exercise: Adapt the merge for the queries: Brutus AND NOT Caesar Brutus OR NOT Caesar Can we still run through the merge in time O(x+y)?
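One possible adaptation of the merge for the first query, Brutus AND NOT Caesar, is sketched below; it is one way to approach the exercise rather than a prescribed solution, and the example lists are hypothetical.

```python
def and_not(p1, p2):
    """Return docIDs in p1 that are absent from p2 (both sorted by docID)."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])   # p1[i] cannot occur later in p2
            i += 1
        elif p1[i] == p2[j]:
            i += 1                 # present in both lists: exclude it
            j += 1
        else:
            j += 1                 # catch up in p2
    return answer

# Hypothetical postings lists.
brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(and_not(brutus, caesar))     # [4, 16, 32, 64, 128]
```

This still runs in O(x+y). Brutus OR NOT Caesar is different in kind: its answer contains every document that lacks Caesar, so the output alone can be much larger than x + y.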
Merging What about an arbitrary Boolean formula? (Brutus OR Caesar) AND NOT (Antony OR Cleopatra) Can we always merge in “linear” time? Linear in what? Can we do better?
Query optimization What is the best order for query processing? Consider a query that is an AND of t terms. For each of the t terms, get its postings, then AND together. Query: Brutus AND Calpurnia AND Caesar [Figure: postings lists for Brutus, Calpurnia, Caesar]
Query optimization example Process in order of increasing freq: start with the smallest set, then keep cutting further. [Figure: postings lists for Brutus, Calpurnia, Caesar] This is why we kept freq in the dictionary. Execute the query as (Caesar AND Brutus) AND Calpurnia.
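A sketch of processing an AND of several terms in order of increasing frequency. The postings lists are hypothetical, chosen so that Caesar is the rarest term and the order matches the one stated above; `intersect_terms` is an illustrative name.

```python
def intersect(p1, p2):
    """Linear-time intersection of two docID-sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_terms(index, terms):
    """AND together the postings of `terms`, shortest list first."""
    order = sorted(terms, key=lambda t: len(index[t]))
    result = index[order[0]]
    for term in order[1:]:
        if not result:             # nothing left to cut down
            break
        result = intersect(result, index[term])
    return order, result

# Hypothetical postings lists; Caesar is the rarest term here.
index = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Calpurnia": [1, 2, 3, 5, 8, 16, 21, 34],
    "Caesar":    [13, 16],
}
order, result = intersect_terms(index, ["Brutus", "Calpurnia", "Caesar"])
print(order, result)
```

Starting with the shortest list keeps every intermediate result no longer than that list, so later merges touch fewer entries.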
Query optimization
More general optimization e.g., (madding OR crowd) AND (ignoble OR strife) Get freq’s for all terms. Estimate the size of each OR by the sum of its freq’s (conservative). Process in increasing order of OR sizes.
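The ordering heuristic for an AND of OR groups can be sketched as follows. The document frequencies are invented for illustration and `order_or_groups` is a hypothetical helper name.

```python
def order_or_groups(groups, freq):
    """Sort OR groups by the sum of their terms' frequencies (an upper bound on size)."""
    return sorted(groups, key=lambda group: sum(freq[term] for term in group))

# Hypothetical document frequencies.
freq = {"madding": 10, "crowd": 500, "ignoble": 20, "strife": 100}
groups = [("madding", "crowd"), ("ignoble", "strife")]

for group in order_or_groups(groups, freq):
    print(group, sum(freq[t] for t in group))
```

The sum is conservative because a document containing both terms of a group is counted twice, so the true size of the OR can only be smaller.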
Exercise Recommend a query processing order for (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
Query processing exercises If the query is friends AND romans AND (NOT countrymen), how could we use the freq of countrymen? Exercise: Extend the merge to an arbitrary Boolean query. Can we always guarantee execution in time linear in the total postings size? Hint: Begin with the case of a Boolean formula query in which each query term appears only once.
Greedy optimization (Process in increasing order of term frequency): Is this always guaranteed to be optimal?
Beyond Boolean term search What about phrases? Proximity: Find Gates NEAR Microsoft. Need index to capture position information in docs. More later. Zones in documents: Find documents with (author = Ullman) AND (text contains automata).
Evidence accumulation 1 vs. 0 occurrence of a search term 2 vs. 1 occurrence 3 vs. 2 occurrences, etc. Need term frequency information in docs. Used to compute a score for each document Matching documents rank-ordered by this score.
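A very small sketch of evidence accumulation. Using the raw sum of term frequencies as the score is an assumption made here for illustration (the text does not fix a scoring formula), and the three documents are invented.

```python
from collections import Counter

# Hypothetical mini collection.
docs = {
    1: "brutus killed caesar",
    2: "caesar caesar brutus caesar",
    3: "the noble antony",
}

def score(query_terms, text):
    """Assumed score: total number of occurrences of the query terms in the doc."""
    counts = Counter(text.lower().split())
    return sum(counts[term] for term in query_terms)

query = ["brutus", "caesar"]
scored = [(doc_id, score(query, text)) for doc_id, text in docs.items()]

# Keep matching docs only and rank them by descending score.
ranked = sorted((pair for pair in scored if pair[1] > 0),
                key=lambda pair: pair[1], reverse=True)
print(ranked)
```

More occurrences of a query term push a document higher in the ranking, which is exactly the information a pure Boolean match throws away.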