
Parametric search and zone weighting (Lecture 6)

Recap of lecture 4
- Query expansion
- Index construction

This lecture
- Parametric and field searches
- Zones in documents
- Scoring documents: zone weighting
- Index support for scoring

Parametric search
Each document has, in addition to text, some "meta-data" in fields, e.g.:
- Language = French
- Format = pdf
- Subject = Physics
- Date = Feb 2000
A parametric search interface allows the user to combine a full-text query with selections on these field values, e.g., language, date range, etc.
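As a concrete sketch, such a parametric query can be thought of as a full-text part plus field constraints (all names and values below are illustrative, not a fixed schema):

```python
# A parametric query: a full-text query combined with field selections.
# Field names and values are illustrative assumptions, not a real API.
parametric_query = {
    "text": "superconducting magnets",           # full-text part
    "fields": {                                  # parametric part
        "language": "French",
        "format": "pdf",
        "subject": "Physics",
    },
    "date_range": ("2000-02-01", "2000-02-29"),  # range constraint on Date
}
```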

Parametric search example
Notice that the output is a (large) table. Various parameters in the table (column headings) can be clicked on to sort the results.

We can add text search.

Parametric/field search
In these examples, we select field values. Values can be hierarchical, e.g.:
Geography: Continent → Country → State → City
A paradigm for navigating through the document collection, e.g., "Aerospace companies in Brazil" can be arrived at first by selecting Geography, then Line of Business, or vice versa.
Winnow the docs in contention and run text searches scoped to the subset.

Index support for parametric search
Must be able to support queries of the form:
- Find pdf documents that contain "stanford university"
i.e., a field selection (on doc format) and a phrase query.
Field selection: use an inverted index of field values → docids, organized by field name. Use compression etc. as before.
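A minimal sketch of such a field index (names are illustrative; the phrase-query machinery from earlier lectures is assumed given):

```python
from collections import defaultdict

# Field selection via an inverted index of field values -> docids,
# organized by field name. Illustrative sketch, not a production layout.
field_index = defaultdict(lambda: defaultdict(list))  # field -> value -> [docids]

def add_doc(docid, fields):
    for name, value in fields.items():
        field_index[name][value].append(docid)  # docids assumed added in order

def select(field, value):
    """Docids whose field equals value (a sorted postings list)."""
    return field_index[field].get(value, [])

def intersect(a, b):
    """Sorted-list intersection, exactly as for term-term AND queries."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

add_doc(1, {"format": "pdf", "language": "French"})
add_doc(4, {"format": "pdf", "language": "English"})
print(select("format", "pdf"))  # [1, 4]

# Find pdf documents containing "stanford university": intersect the
# field postings with the phrase query's docids (phrase_query assumed):
# result = intersect(select("format", "pdf"), phrase_query("stanford university"))
```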

Parametric index support
Optional: provide richer search on field values, e.g., wildcards: find books whose Author field contains s*trup.
Range search: find docs authored between September and December. An inverted index doesn't work (as well) here; use techniques from database range search, and query optimization heuristics as before.
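For the range-search case, one standard database-style technique is to keep each field's (value, docid) pairs sorted by value and binary-search the endpoints. A minimal sketch with made-up data:

```python
import bisect

# Database-style range search: keep (value, docid) pairs sorted by value
# and binary-search the endpoints. Dates are invented ISO strings, which
# sort correctly as plain strings.
date_index = [
    ("2000-09-03", 7), ("2000-10-21", 2), ("2000-11-05", 9), ("2000-12-30", 4),
]

def range_search(lo, hi):
    """Docids whose date value lies in [lo, hi]."""
    left = bisect.bisect_left(date_index, (lo, -1))
    right = bisect.bisect_right(date_index, (hi, float("inf")))
    return [docid for _, docid in date_index[left:right]]

print(range_search("2000-09-01", "2000-12-31"))  # [7, 2, 9, 4]
```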

Field retrieval
In some cases we must retrieve field values, e.g., the ISBN numbers of books by s*trup.
Maintain a forward index: for each doc, those field values that are "retrievable". An indexing control file specifies which fields are retrievable.
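A sketch of such a forward index; the control-file policy is reduced here to a set of field names (an assumption), and the ISBN is just sample data:

```python
# Forward index sketch: docid -> retrievable field values.
RETRIEVABLE = {"isbn", "author"}  # stand-in for the indexing control file
forward_index = {}                # docid -> {field: value}

def store(docid, fields):
    forward_index[docid] = {f: v for f, v in fields.items() if f in RETRIEVABLE}

def retrieve(docid, field):
    return forward_index.get(docid, {}).get(field)

store(42, {"isbn": "0-201-70073-5", "author": "Stroustrup", "format": "pdf"})
print(retrieve(42, "isbn"))    # 0-201-70073-5
print(retrieve(42, "format"))  # None: format was not marked retrievable
```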

Zones
A zone is an identified region within a doc, e.g., Title, Abstract, Bibliography. Zones are generally culled from marked-up input or document metadata (e.g., PowerPoint). The contents of a zone are free text, not a "finite" vocabulary.
Indexes for each zone allow queries like:
sorting in Title AND smith in Bibliography AND recur* in Body
but not queries like "all papers whose authors cite themselves". Why?
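A toy illustration of per-zone indexes answering the query above (postings invented; the wildcard recur* is shown expanded to matching vocabulary terms, as in the wildcard lecture):

```python
# One inverted index per zone; a zone-scoped AND query is an intersection
# across the relevant zone indexes. All postings here are made up.
zone_index = {
    "title":        {"sorting": [1, 3, 5]},
    "bibliography": {"smith": [3, 4, 5]},
    "body":         {"recursion": [2, 3, 5], "recurrence": [5]},
}

def docs(zone, term):
    """Postings for term in the given zone's index, as a set."""
    return set(zone_index[zone].get(term, []))

# sorting in Title AND smith in Bibliography AND recur* in Body:
hits = (docs("title", "sorting")
        & docs("bibliography", "smith")
        & (docs("body", "recursion") | docs("body", "recurrence")))
print(sorted(hits))  # [3, 5]
```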

Zone indexes – simple view
[Diagram: one inverted index per zone – Title, Author, Body, etc.]

So we have a database now?
Not really. Databases do lots of things we don't need:
- Transactions
- Recovery (our index is not the system of record; if it breaks, simply reconstruct it from the original source)
Indeed, we never have to store the text in a search engine, only the indexes. We're focusing on optimized indexes for text-oriented queries, not a SQL engine.

Scoring

Thus far, our queries have all been Boolean: docs either match or not. This is good for expert users with a precise understanding of their needs and the corpus, and for applications that can consume 1000's of results. It is not good for (the majority of) users with poor Boolean formulation of their needs; most users don't want to wade through 1000's of results – cf. AltaVista.

Scoring
We wish to return, in order, the documents most likely to be useful to the searcher. How can we rank-order the docs in the corpus with respect to a query? Assign a score – say in [0,1] – for each doc on each query. Begin with a perfect world – no spammers, nobody stuffing keywords into a doc to make it match queries. (More on this in 276B under web search.)

Linear zone combinations
First generation of scoring methods: use a linear combination of Booleans, e.g.:
Score = 0.6*<sorting in Title> + 0.3*<sorting in Abstract> + 0.1*<sorting in Body>
Each expression such as <sorting in Title> takes on a value in {0,1}, so the overall score is in [0,1]. For this example the scores can only take on a finite set of values – what are they?
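The question on the slide can be checked mechanically by enumerating all 0/1 assignments to the three zone Booleans:

```python
from itertools import product

weights = (0.6, 0.3, 0.1)  # Title, Abstract, Body

def score(matches):
    """matches: one 0/1 Boolean value per zone."""
    return sum(w * m for w, m in zip(weights, matches))

# Enumerate all 2^3 Boolean assignments to list every achievable score:
print(sorted({round(score(m), 1) for m in product((0, 1), repeat=3)}))
# [0.0, 0.1, 0.3, 0.4, 0.6, 0.7, 0.9, 1.0]
```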

Linear zone combinations
In fact, the expressions between <> on the last slide could be any Boolean query. Who generates the Score expression (with weights such as 0.6 etc.)?
- In uncommon cases: the user, through the UI
- Most commonly: a query parser that takes the user's Boolean query and runs it on the indexes for each zone
Weights are determined from user studies and hard-coded into the query parser.

Exercise
On the query bill OR rights, suppose that we retrieve the following docs from the various zone indexes:
[Diagram: postings lists for bill and rights in each of the Abstract, Title, and Body zone indexes; the docids appear only in the slide image.]
Compute the score for each doc based on the weightings 0.6, 0.3, 0.1.

General idea
We are given a weight vector whose components sum to 1, with a weight for each zone/field. Given a Boolean query, we assign a score to each doc by adding up the weighted contributions of the zones/fields. Typically, users want to see the K highest-scoring docs.

Index support for zone combinations
In the simplest version we have a separate inverted index for each zone. Variant: have a single index with a separate dictionary entry for each (term, zone) pair, e.g.:
bill.abs
bill.title
bill.body
Of course, compress zone names like abstract/title/body.
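A sketch of this variant (postings invented): the zone is folded into the dictionary key, so the postings stay plain docid lists:

```python
# One dictionary entry per (term, zone); postings are plain docids.
zone_term_index = {
    "bill.abs":   [1, 2],
    "bill.title": [3],
    "bill.body":  [1, 2],
}

# A query scoped to the Title zone is just a lookup of the qualified key:
print(zone_term_index.get("bill.title", []))  # [3]
```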

Zone combinations index
The above scheme is still wasteful: each term is potentially replicated for each zone. In a slightly better scheme, we encode the zone in the postings. At query time, we accumulate contributions to the total score of a document from the various postings, e.g.:
bill → 1.abs, 1.body; 2.abs, 2.body; 3.title
As before, the zone names get compressed.

Score accumulation
As we walk the postings for the query bill OR rights, we accumulate scores for each doc in a linear merge as before:
bill → 1.abs, 1.body; 2.abs, 2.body; 3.title
rights → 3.title, 3.body; 5.title, 5.body
Note: we get both bill and rights in the Title field of doc 3, but we score it no higher. Should we give more weight to more hits?
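A runnable sketch of this accumulation, assuming the 0.6/0.3/0.1 weights from earlier. A real system would walk the two sorted postings lists in a single pass; the set-based version below computes the same OR-query scores:

```python
from collections import defaultdict

ZONE_WEIGHT = {"title": 0.6, "abs": 0.3, "body": 0.1}  # assumed weights

# Zone-in-postings encoding: term -> [(docid, zone), ...] sorted by docid.
postings = {
    "bill":   [(1, "abs"), (1, "body"), (2, "abs"), (2, "body"), (3, "title")],
    "rights": [(3, "title"), (3, "body"), (5, "title"), (5, "body")],
}

def score_or_query(terms):
    """Weighted zone scores for an OR query over the encoded postings."""
    matched = set()  # (docid, zone) pairs in which ANY query term occurs
    for term in terms:
        matched.update(postings.get(term, []))
    scores = defaultdict(float)
    for docid, zone in matched:             # each zone contributes its
        scores[docid] += ZONE_WEIGHT[zone]  # weight once, so doc 3 scores
    return dict(scores)                     # no higher for its double hit

print(score_or_query(["bill", "rights"]))
# {1: 0.4, 2: 0.4, 3: 0.7, 5: 0.7}  (up to float rounding)
```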

Scoring: density-based
Zone combinations relied on the position of terms in a doc: title, author, etc. The obvious next idea: if a document talks about a topic more, then it is a better match. This applies even when we only have a single query term. A query should then just specify terms that are relevant to the information need; a document is relevant if it has a lot of the terms. Boolean syntax is not required – this is more web-style.
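A first cut at the density idea (raw term counts over document length; later lectures replace this with tf-idf weighting and proper length normalization):

```python
from collections import Counter

def density_score(query_terms, doc_text):
    """Fraction of the doc's tokens that are query terms (raw counts)."""
    counts = Counter(doc_text.lower().split())
    total = sum(counts.values())
    return sum(counts[t] for t in query_terms) / total if total else 0.0

print(density_score(["bill", "rights"],
                    "the bill of rights lists rights retained by the people"))
# 0.3 -- the more the doc talks about the query terms, the higher the score
```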