Dictionary search Making one-sided errors Paper on Bloom Filters

Crawling How to keep track of the URLs visited by a crawler? URLs are long. The check should be very fast. Small errors are tolerable (a false positive ≈ a page not crawled). Solution: a Bloom Filter over the crawled URLs.
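
A minimal sketch of this idea in Python (the class name, the sizing, and the double-hashing scheme are illustrative assumptions, not details from the slides):

```python
import hashlib

class BloomFilter:
    """Illustrative Bloom filter: m bits, k hash functions via double hashing."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions from two halves of one digest (double hashing).
        h = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

seen = BloomFilter(m=8 * 10**6, k=6)       # m/n = 8 for ~10^6 URLs: about 1 MB
seen.add("http://example.com/page1")
print("http://example.com/page1" in seen)  # True
print("http://example.com/other" in seen)  # False (almost surely)
```

A hit may be a false positive (the crawler skips a page it never actually saw), but a miss is always correct: exactly the one-sided error tolerated here.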

Searching with errors...

Problem: false positives

Not perfectly true but...

m/n = 8, opt k ≈ 5.54 (use 5 or 6). We do have an explicit formula for the optimal k: k_opt = (m/n) ln 2 ≈ 0.693 · m/n, which gives a false-positive rate of ≈ 0.6185^(m/n) (about 2% for m/n = 8).

Document ranking Paolo Ferragina Dipartimento di Informatica Università di Pisa

The big fight: find the best ranking...

Ranking: Google vs Google.cn

Document ranking Text-based Ranking (1st generation) Reading 6.2 and 6.3

Similarity between binary vectors Documents are binary vectors X, Y in {0,1}^D. Score: the overlap measure |X ∩ Y|. What's wrong?

Normalization Dice coefficient (overlap wrt the average #terms): 2|X ∩ Y| / (|X| + |Y|), which does NOT satisfy the triangle inequality. Jaccard coefficient (overlap wrt the possible terms): |X ∩ Y| / |X ∪ Y|, whose complement 1 − J is a metric (OK, triangular).

What's wrong with doc-similarity? Overlap matching doesn't consider: Term frequency in a document (a doc talks more of t? then t should be weighted more). Term scarcity in the collection ("of" is commoner than "baby bed"). Length of documents (the score should be normalized).

A famous "weight": tf-idf. w(t,d) = tf(t,d) × idf(t), where tf(t,d) = number of occurrences of term t in doc d, idf(t) = log(n / n_t), n_t = #docs containing term t, and n = #docs in the indexed collection. (Vector Space model)
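
The weight above, as a hedged Python sketch (whitespace tokenization is assumed for brevity):

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """w(t,d) = tf(t,d) * log(n / n_t), as defined on the slide above."""
    n = len(docs)
    tfs = [Counter(doc.split()) for doc in docs]   # tf(t,d)
    df = Counter(t for tf in tfs for t in tf)      # n_t = #docs containing t
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in tf.items()} for tf in tfs]

docs = ["the cat sat", "the dog sat", "the cat purred"]
for vec in tf_idf_vectors(docs):
    print(vec)  # 'the' gets weight 0: it occurs in every doc, so idf = log(3/3) = 0
```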

Why distance is a bad idea Sec. 6.3

A graphical example Postulate: documents that are "close together" in the vector space talk about the same things. Euclidean distance is sensitive to vector length! [Figure: docs d1..d5 plotted on term axes t1, t2, t3.] Measure the angle instead: cos(α) = (v · w) / (||v|| · ||w||). The user query is a very short doc. Easy to spam. Sophisticated algorithms to find the top-k docs for a query Q.

cosine(query,document) cos(q,d) = (q · d) / (||q|| · ||d||) = Σ_i q_i d_i / (sqrt(Σ_i q_i²) · sqrt(Σ_i d_i²)), where q_i is the tf-idf weight of term i in the query and d_i is the tf-idf weight of term i in the document. cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d. Sec. 6.3

Cos for length-normalized vectors For length-normalized vectors, cosine similarity is simply the dot product (or scalar product): cos(q,d) = q · d = Σ_i q_i d_i, for q, d length-normalized.

Cosine similarity among 3 docs How similar are the novels SaS: Sense and Sensibility, PaP: Pride and Prejudice, and WH: Wuthering Heights? Term frequencies (counts):

term        SaS   PaP    WH
affection   115    58    20
jealous      10     7    11
gossip        2     0     6
wuthering     0     0    38

Note: to simplify this example, we don't do idf weighting.

3 documents example contd. Log frequency weighting (1 + log10 tf):

term        SaS   PaP    WH
affection  3.06  2.76  2.30
jealous    2.00  1.85  2.04
gossip     1.30     0  1.78
wuthering     0     0  2.58

After length normalization:

term         SaS    PaP     WH
affection  0.789  0.832  0.524
jealous    0.515  0.555  0.465
gossip     0.335      0  0.405
wuthering      0      0  0.588

cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
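
A short Python sketch reproducing this arithmetic (an illustration of the slide's numbers, not code from the lecture):

```python
import math

counts = {  # term frequencies from the table above
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58, "jealous": 7, "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20, "jealous": 11, "gossip": 6, "wuthering": 38},
}

def log_normalized(tf):
    # Log frequency weighting, then length normalization.
    w = {t: 1 + math.log10(c) if c > 0 else 0.0 for t, c in tf.items()}
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / norm for t, x in w.items()}

vecs = {d: log_normalized(tf) for d, tf in counts.items()}
cos = lambda a, b: sum(vecs[a][t] * vecs[b][t] for t in vecs[a])
print(round(cos("SaS", "PaP"), 2))  # 0.94
print(round(cos("SaS", "WH"), 2))   # 0.79
print(round(cos("PaP", "WH"), 2))   # 0.69
```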

Storage For every term, we store the IDF in memory in terms of n_t, which is actually the length of its posting list (so it is needed anyway). For every docID d in the posting list of term t, we store its frequency tf_{t,d}, which is typically small and thus stored with unary/gamma codes.

Vector spaces and other operators The vector space is OK for bag-of-words queries and a clean metaphor for similar-document queries, but not a good combination with operators: Boolean, wild-card, positional, proximity. It suits the first generation of search engines, invented before the "spamming" of web search.

Document ranking Top-k retrieval Reading 7

Speed-up of top-k retrieval The costly part is the computation of the cosine. Find a set A of contenders, with K < |A| << N. Set A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A, according to the score. The same approach is also used for other (non-cosine) scoring functions. We will look at several schemes following this approach.

Possible Approaches Consider docs containing at least one query term. Hence this means… Take this further: 1. Only consider high-idf query terms 2. Champion lists: top scores 3. Only consider docs containing many query terms 4. Fancy hits: sophisticated ranking functions 5. Clustering

Approach #1: High-idf query terms only For a query such as "catcher in the rye", only accumulate scores from "catcher" and "rye". Intuition: "in" and "the" contribute little to the scores and so don't alter rank-ordering much. Benefit: postings of low-idf terms have many docs → these (many) docs get eliminated from the set A of contenders.

Approach #2: Champion Lists Preprocess: assign to each term its m best documents. Search: if |Q| = q terms, merge their champion lists (≤ mq answers), compute the COS between Q and these docs, and choose the top k. Need to pick m > k to work well empirically. Nowadays search engines use tf-idf PLUS PageRank (PLUS other weights).
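
A compact sketch of the two phases (the postings layout and the `score` callable are assumptions made for illustration):

```python
import heapq

def build_champion_lists(postings, m):
    """postings: term -> {docID: tf-idf weight}; keep each term's m best docs."""
    return {t: {d for d, _ in heapq.nlargest(m, w.items(), key=lambda x: x[1])}
            for t, w in postings.items()}

def champion_search(query_terms, champions, score, k):
    # Union of the query terms' champion lists: at most m*q candidates.
    candidates = set().union(*(champions.get(t, set()) for t in query_terms))
    # Score only the candidates and return the top k.
    return heapq.nlargest(k, candidates, key=lambda d: score(query_terms, d))
```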

Approach #3: Docs containing many query terms For multi-term queries, compute scores only for docs containing several of the query terms, say at least 3 out of 4. This imposes a "soft conjunction" on queries, as seen on web search engines (early Google). Easy to implement in postings traversal.

3 of 4 query terms [Figure: posting lists for Antony, Brutus, Caesar, Calpurnia.] Scores are only computed for docs 8, 16 and 32, which appear in at least 3 of the 4 lists.

Complex scores Consider a simple total score combining cosine relevance and authority: net-score(q,d) = PR(d) + cosine(q,d). Can use some other linear combination than an equal weighting. Now we seek the top K docs by net score.

Approach #4: Fancy-hits heuristic Preprocess: assign docIDs by decreasing PR weight. Define FH(t) = the m docs for t with highest tf-idf weight, and IL(t) = the rest (so increasing docID = decreasing PR weight). Idea: a document that scores high should be in FH or in the front of IL. Search for a t-term query: First FH: take the docs common to the terms' FH lists, compute their scores, and keep the top-k docs. Then IL: scan the ILs and check the common docs; compute their scores and possibly insert them into the top-k. Stop when M docs have been checked or the PR score becomes smaller than some threshold.
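
A rough sketch of the two-phase scan (the `score` and `pr` callables, and the data layout, are assumptions; the slides leave them abstract):

```python
import heapq

def fancy_hits(terms, FH, IL, score, pr, k, M, pr_min):
    """FH[t]: set of t's m highest tf-idf docs; IL[t]: the remaining docs as a set.
    DocIDs were assigned by decreasing PageRank, so smaller docID = higher PR."""
    # Phase 1 (FH): docs common to every query term's fancy-hit list.
    candidates = set.intersection(*(FH[t] for t in terms))
    top = heapq.nlargest(k, candidates, key=lambda d: score(terms, d))
    # Phase 2 (IL): scan common IL docs by increasing docID (= decreasing PR).
    for i, d in enumerate(sorted(set.intersection(*(IL[t] for t in terms)))):
        if i >= M or pr(d) < pr_min:
            break  # checked M docs, or the remaining docs have too low a PageRank
        top = heapq.nlargest(k, top + [d], key=lambda x: score(terms, x))
    return top
```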

Approach #5: Clustering [Figure: a query point among leaders and their followers.]

Cluster pruning: preprocessing Pick √N docs at random: call these leaders. For every other doc, pre-compute its nearest leader; docs attached to a leader are its followers. Likely: each leader has ~ √N followers.

Cluster pruning: query processing Process a query as follows: given query Q, find its nearest leader L; seek the K nearest docs from among L's followers.
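
Both phases as a sketch under these definitions (the `cosine` helper over tf-idf dicts is an illustrative assumption):

```python
import math, random

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def preprocess(docs):  # docs: docID -> tf-idf vector
    leaders = random.sample(list(docs), round(math.sqrt(len(docs))))
    followers = {l: [] for l in leaders}
    for d in docs:  # attach every doc to its nearest leader
        nearest = max(leaders, key=lambda l: cosine(docs[d], docs[l]))
        followers[nearest].append(d)
    return leaders, followers

def query(q, docs, leaders, followers, K):
    L = max(leaders, key=lambda l: cosine(q, docs[l]))  # nearest leader
    return sorted(followers[L], key=lambda d: cosine(q, docs[d]), reverse=True)[:K]
```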

Why use random sampling? Fast, and the leaders reflect the data distribution.

General variants Have each follower attached to b1 = 3 (say) nearest leaders. From the query, find b2 = 4 (say) nearest leaders and their followers. Can recur on the leader/follower construction.

Document ranking Relevance feedback Reading 9

Relevance Feedback Relevance feedback: user feedback on relevance of docs in initial set of results User issues a (short, simple) query The user marks some results as relevant or non-relevant. The system computes a better representation of the information need based on feedback. Relevance feedback can go through one or more iterations. Sec. 9.1

Rocchio (SMART) Used in practice: q_m = α q_0 + (β/|D_r|) Σ_{d ∈ D_r} d − (γ/|D_nr|) Σ_{d ∈ D_nr} d, where D_r = set of known relevant doc vectors, D_nr = set of known irrelevant doc vectors, q_m = modified query vector, q_0 = original query vector, and α, β, γ are weights (hand-chosen or set empirically). The new query moves toward relevant documents and away from irrelevant documents.
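
A direct transcription of the formula into Python (vectors as dicts; the default weights shown are common choices, not values from the slides):

```python
from collections import defaultdict

def rocchio(q0, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q_m = alpha*q0 + beta*centroid(D_r) - gamma*centroid(D_nr)."""
    qm = defaultdict(float)
    for t, w in q0.items():
        qm[t] += alpha * w
    for d in relevant:
        for t, w in d.items():
            qm[t] += beta * w / len(relevant)
    for d in irrelevant:
        for t, w in d.items():
            qm[t] -= gamma * w / len(irrelevant)
    # Negative term weights are usually clipped to zero.
    return {t: w for t, w in qm.items() if w > 0}
```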

Relevance Feedback: Problems Users are often reluctant to provide explicit feedback It’s often harder to understand why a particular document was retrieved after applying relevance feedback There is no clear evidence that relevance feedback is the “best use” of the user’s time.

Pseudo relevance feedback Pseudo-relevance feedback automates the "manual" part of true relevance feedback: retrieve a list of hits for the user's query, assume that the top k are relevant, and do relevance feedback (e.g., Rocchio). Works very well on average, but can go horribly wrong for some queries; several iterations can cause query drift.

Query Expansion In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents. In query expansion, users give additional input (good/bad search term) on words or phrases.

How to augment the user query? Manual thesaurus (costly to generate), e.g. MedLine: physician, syn: doc, doctor, MD. Global analysis (static; all docs in the collection): automatically derived thesaurus (co-occurrence statistics), refinements based on query-log mining (common on the web). Local analysis (dynamic): analysis of the documents in the result set.

Query assist Would you expect such a feature to increase the query volume at a search engine?

Quality of a search engine Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 8

Is it good? How fast does it index: number of documents/hour (average document size). How fast does it search: latency as a function of index size. Expressiveness of the query language.

Measures for a search engine All of the preceding criteria are measurable The key measure: user happiness …useless answers won’t make a user happy

Happiness: elusive to measure The commonest approach is given by the relevance of search results. How do we measure it? Requires 3 elements: 1. A benchmark document collection 2. A benchmark suite of queries 3. A binary assessment of either Relevant or Irrelevant for each query-doc pair

Evaluating an IR system Standard benchmarks TREC: the National Institute of Standards and Technology (NIST) has run a large IR testbed for many years. Other doc collections: marked by human experts, for each query and for each doc, as Relevant or Irrelevant. On the Web everything is more complicated, since we cannot mark the entire corpus!

General scenario [Venn diagram: within the collection, the set of Relevant docs and the set of Retrieved docs overlap.]

Precision vs. Recall Precision: % of retrieved docs that are relevant [issue: how much "junk" is found]. Recall: % of relevant docs that are retrieved [issue: how much of the "info" is found]. [Venn diagram: Relevant vs. Retrieved within the collection.]

How to compute them Precision: fraction of retrieved docs that are relevant, P = tp/(tp + fp). Recall: fraction of relevant docs that are retrieved, R = tp/(tp + fn).

                 Relevant              Not Relevant
Retrieved        tp (true positive)    fp (false positive)
Not Retrieved    fn (false negative)   tn (true negative)
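
The two fractions in code (a trivial sketch; the judged sets of docIDs are assumed inputs):

```python
def precision_recall(retrieved, relevant):
    """retrieved, relevant: sets of docIDs."""
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

print(precision_recall({1, 2, 3, 4}, {2, 4, 5}))  # (0.5, 0.666...)
```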

Some considerations Can get high recall (but low precision) by retrieving all docs for all queries! Recall is a non-decreasing function of the number of docs retrieved Precision usually decreases

Precision-Recall curve We measure precision at various levels of recall. Note: it is an AVERAGE over many queries. [Figure: precision vs. recall scatter plot.]

A common picture [Figure: a typical precision-recall curve; precision decreases as recall increases.]

F measure Combined measure (weighted harmonic mean): 1/F = α (1/P) + (1 − α)(1/R). People usually use the balanced F_1 measure, i.e., with α = ½, thus 1/F = ½ (1/P + 1/R), i.e. F_1 = 2PR/(P + R). Use this if you need to optimize a single measure that balances precision and recall.
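
The same measure as a one-function sketch, with the balanced case as the default:

```python
def f_measure(p, r, alpha=0.5):
    """Weighted harmonic mean: 1/F = alpha/P + (1-alpha)/R."""
    return 1.0 / (alpha / p + (1 - alpha) / r) if p > 0 and r > 0 else 0.0

print(f_measure(0.5, 2 / 3))  # F1 of the precision/recall example above ≈ 0.571
```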

Recommendation systems Paolo Ferragina Dipartimento di Informatica Università di Pisa

Recommendations We have a list of restaurants with positive and negative ratings from some users. Which restaurant(s) should I recommend to Dave?

Basic Algorithm Recommend the most popular restaurants, say by # positive votes minus # negative votes. What if Dave does not like Spaghetti?

Smart Algorithm Basic idea: find the person "most similar" to Dave according to cosine similarity (i.e. Estie), and then recommend something this person likes. Perhaps recommend Straits Cafe to Dave. But do you want to rely on one person's opinions?
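
A sketch of that idea over ±1 rating vectors (the names and ratings below are invented for illustration, not the slide's data):

```python
import math

ratings = {  # user -> {restaurant: +1 / -1}; entirely made-up data
    "Dave":  {"Spaghetti": -1, "Sushi": 1},
    "Estie": {"Spaghetti": -1, "Sushi": 1, "Straits Cafe": 1},
    "Carl":  {"Spaghetti": 1, "Straits Cafe": -1},
}

def cos(u, v):
    dot = sum(u.get(t, 0) * v.get(t, 0) for t in set(u) | set(v))
    return dot / (math.sqrt(sum(x * x for x in u.values())) *
                  math.sqrt(sum(x * x for x in v.values())))

target = "Dave"
peer = max((u for u in ratings if u != target),
           key=lambda u: cos(ratings[target], ratings[u]))
picks = [r for r, v in ratings[peer].items()
         if v > 0 and r not in ratings[target]]
print(peer, picks)  # Estie ['Straits Cafe']
```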

A glimpse on XML retrieval Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 10

What is XML? eXtensible Markup Language A framework for defining markup languages No fixed collection of markup tags Each XML language targeted for application

XML vs HTML HTML is a markup language for a specific purpose (display in browsers) XML is a framework for defining markup languages HTML can be formalized as an XML language (XHTML) XML defines logical structure only HTML: same intention, but has evolved into a presentation language

XML Example (visual)

XML Example (textual) [The angle-bracket markup was stripped in transcription; the surviving element content is:] FileCab. This chapter describes the commands that manage the FileCab inet application.

Basic Structure An XML document is an ordered, labeled tree: character-data leaf nodes contain the actual data (text strings); element nodes, each labeled with a name (often called the element type) and a set of attributes (each consisting of a name and a value), can have child nodes.

XML: Design Goals Separate syntax from semantics to provide a common framework for structuring information Allow tailor-made markup for any imaginable application domain Support internationalization (Unicode) and platform independence Be the future of (semi)structured information (do some of the work now done by databases)

Why Use XML? Represent semi-structured data (data that are structured, but don’t fit relational model) XML is more flexible than DBs XML is more structured than simple IR You get a massive infrastructure for free

XML Schemas Schema = syntax definition of XML language Schema language = formal language for expressing XML schemas Examples Document Type Definition XML Schema (W3C) Relevance for XML IR Our job is much easier if we have a (one) schema

XML Indexing and Search Most native XML databases have taken a DB approach: exact match, evaluate path expressions, no IR-style relevance ranking. Only a few focus on relevance ranking.

Data vs. Text-centric XML Data-centric XML: used for messaging between enterprise applications; mainly a recasting of relational data. Text-centric XML: used for annotating content; rich in text; demands good integration of text retrieval functionality. E.g., find me the ISBN #s of Books with at least three Chapters discussing cocoa production, ranked by Price.

IR XML Challenges There is no document unit in XML: how do we compute tf and idf? Indexing granularity: need to go to the document for retrieving/displaying a fragment. E.g., give me the Abstracts of Papers on existentialism: where do you retrieve the Abstract from? Need to identify similar elements in different schemas. Example: employee.

IR XML Challenges: XQuery "SQL for XML." Usage scenarios: human-readable documents, data-oriented documents, mixed documents (e.g., patient records). Relies on XPath and XML Schema datatypes.

Queries Supported by XQuery Simple attribute/value: /play/title contains "hamlet". Path queries: title contains "hamlet"; /play//title contains "hamlet". Complex graphs: employees with two managers. What about relevance ranking?

Data structures for XML retrieval What are the primitives we need? Inverted index: give me all elements matching text query Q We know how to do this – treat each element as a document Give me all elements below any instance of the Book element (Parent/child relationship is not enough)

Positional containment [Figure: a document's position axis with the intervals of Play and Verse elements marked; the term droppeth occurs at position 720, under a Verse under a Play.] Containment can be viewed as merging postings.
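
A sketch of containment as a postings merge: element postings are (start, end) position intervals, term postings are positions, and both lists are scanned in lockstep (the interval representation is an assumption; the slide only shows the picture):

```python
def contained_in(term_positions, element_intervals):
    """Merge sorted term positions with sorted (start, end) element intervals,
    returning the positions that fall inside some element occurrence."""
    out, i = [], 0
    for pos in term_positions:
        while i < len(element_intervals) and element_intervals[i][1] < pos:
            i += 1  # this element ends before pos: skip it for good
        if i < len(element_intervals) and element_intervals[i][0] <= pos:
            out.append(pos)
    return out

verse = [(700, 740), (800, 830)]        # hypothetical Verse intervals
print(contained_in([720, 750], verse))  # [720]: only 720 lies inside a Verse
```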

Summary of data structures Path containment etc. can essentially be solved by positional inverted indexes Retrieval consists of “merging” postings All the compression tricks are still applicable Complications arise from insertion/deletion of elements, text within elements Beyond the scope of this course