ISP433/633 Week 3 Query Structure and Query Operations

Outline
- More on weight assignment
- Exercise (as a review)
- BREAK
- Query structure
- Query operations
  – Boolean query parse
  – Vector query reformulation

Vector Space Model Example
D1 = "computer information retrieval"
D2 = "computer, computer information"
Q1 = "information, information retrieval"

Term vectors:
       Computer   Information   Retrieval
  D1   1          1             1
  D2   2          1             0
  Q1   0          2             1
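
As a sanity check, a minimal Python sketch (assuming simple comma/whitespace tokenization of the toy texts above) that reproduces these count vectors:

    from collections import Counter

    def term_vector(text, vocabulary):
        # count raw occurrences of each vocabulary term in the text
        counts = Counter(text.replace(",", " ").lower().split())
        return [counts[term] for term in vocabulary]

    vocab = ["computer", "information", "retrieval"]
    print(term_vector("computer information retrieval", vocab))      # [1, 1, 1]
    print(term_vector("computer, computer information", vocab))      # [2, 1, 0]
    print(term_vector("information, information retrieval", vocab))  # [0, 2, 1]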

More on Weight Assignment
- Boolean model: binary weights
- Term frequency (freq) as weight
  – raw frequency of a term inside a document
- Problems with using raw frequency:
  – Zipf's law: non-distinguishing terms have high frequency
  – document length matters

Zipf's Law
rank × frequency ≈ constant

Zipf's Law
[Figure: rank–frequency distribution plotted on linear and log scales]

Inverse Document Frequency (idf)
- The inverse of the proportion of documents in the collection that contain the term
  – counteracts the Zipf's-law effect
- idf_i = log(N / n_i)
  – N: total number of documents in the collection
  – n_i: number of documents that contain term i
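
A one-function sketch of this formula (log base 10 here; the choice of base only rescales the values):

    import math

    def idf(N, n_i):
        # idf_i = log(N / n_i): N documents in total, n_i of them contain term i
        return math.log10(N / n_i)

    print(idf(10000, 10))     # rare term        -> 3.0
    print(idf(10000, 10000))  # ubiquitous term  -> 0.0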

Benefit of idf
idf gives high values to rare words and low values to common words.

Normalized Term Frequency (tf)
- Deals with varying document length
- Let m be the most frequent term in document j, with raw frequency freq_{m,j}
- Term i's normalized term frequency in document j:
  tf_{i,j} = freq_{i,j} / freq_{m,j}
- For a query q:
  tf_{i,q} = 0.5 + 0.5 × freq_{i,q} / freq_{m,q}
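
Both variants as a minimal sketch, assuming the raw counts freq_{i,j} and freq_{m,j} are already available:

    def tf_doc(freq_ij, freq_mj):
        # document tf: raw frequency normalized by the most frequent term
        return freq_ij / freq_mj

    def tf_query(freq_iq, freq_mq):
        # query tf with the usual 0.5 smoothing
        return 0.5 + 0.5 * freq_iq / freq_mq

    print(tf_doc(2, 4))    # 0.5
    print(tf_query(1, 2))  # 0.75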

tf*idf as Weight
Assign the weight w_{i,j} = tf_{i,j} × idf_i to each term i in each document j.
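
Putting tf and idf together, a compact end-to-end sketch over the toy two-document collection from the earlier example (whitespace tokenization assumed):

    import math
    from collections import Counter

    docs = ["computer information retrieval", "computer computer information"]
    counts = [Counter(d.split()) for d in docs]
    N = len(docs)

    def weight(term, j):
        # w_{i,j} = tf_{i,j} * idf_i, with max-frequency tf normalization
        c = counts[j]
        if term not in c:
            return 0.0
        tf = c[term] / max(c.values())
        n_i = sum(1 for c2 in counts if term in c2)
        return tf * math.log10(N / n_i)

    print(weight("retrieval", 0))  # only D1 has it -> ~0.30
    print(weight("computer", 0))   # in every doc   -> idf = 0 -> 0.0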

Exercise
D1 = "computer information retrieval"
D2 = "computer retrieval"
Q  = "information, retrieval"
Compute the tf*idf weight for each term in D2 and Q.

BREAK

Queries
- Single-word queries
- Context queries
  – phrases
  – proximity
- Boolean queries
- Natural-language queries

Pattern Matching
- Words
- Prefixes
- Suffixes
- Substrings
- Ranges
- Regular expressions
- Structured queries (e.g., XQuery to query XML, Z39.50)
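
For illustration, a few of these pattern types expressed over a toy word list in Python (a real engine matches against the index vocabulary, not a Python list):

    import re

    words = ["retrieve", "retrieval", "retriever", "query"]
    print([w for w in words if w.startswith("retriev")])        # prefix
    print([w for w in words if w.endswith("al")])               # suffix
    print([w for w in words if "riev" in w])                    # substring
    print([w for w in words if re.fullmatch(r"retriev.*", w)])  # regular expression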

Boolean Query Processing
The query must be parsed to determine what the search words, the optional field or index qualifications, and the Boolean operators are, and how they relate to one another.
Typical parsing uses lexical analysers (such as lex or flex) together with parser generators (such as YACC, Bison, or LLgen).
– These produce code that is compiled into programs.
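
lex/yacc-style tools generate the real parsers; purely as an illustration, here is a hand-rolled recursive-descent sketch for flat queries of the form "index value and index value ..." (this simplified grammar, with no parentheses or OR, is my assumption):

    def parse(tokens):
        # parse "INDEX VALUE (and INDEX VALUE)*" into a left-deep query tree
        def operand():
            index, value = tokens.pop(0), tokens.pop(0)
            return {"index": index, "value": value}
        tree = operand()
        while tokens and tokens[0].lower() == "and":
            tokens.pop(0)  # consume the operator
            tree = {"op": "AND", "left": tree, "right": operand()}
        return tree

    print(parse("title XXX and subject YYY".split()))
    # {'op': 'AND', 'left': {'index': 'title', 'value': 'XXX'},
    #  'right': {'index': 'subject', 'value': 'YYY'}}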

Z39.50 Query Structure (ASN.1 notation)

-- Query Definitions
Query ::= CHOICE{
    type-0   [0]   ANY,
    type-1   [1]   IMPLICIT RPNQuery,
    type-2   [2]   OCTET STRING,
    type-100 [100] OCTET STRING,
    type-101 [101] IMPLICIT RPNQuery,
    type-102 [102] OCTET STRING}

Z39.50 RPN Query (ASN.1 notation)

-- Definitions for RPN query
RPNQuery ::= SEQUENCE{
    attributeSet AttributeSetId,
    rpn          RPNStructure}

RPN Structure

RPNStructure ::= CHOICE{
    op       [0] Operand,
    rpnRpnOp [1] IMPLICIT SEQUENCE{
        rpn1 RPNStructure,
        rpn2 RPNStructure,
        op   Operator}}

Operand

Operand ::= CHOICE{
    attrTerm   AttributesPlusTerm,
    resultSet  ResultSetId,
    -- If version 2 is in force:
    -- - If query type is 1, one of the above two must be chosen;
    -- - resultAttr (below) may be used only if query type is 101.
    resultAttr ResultSetPlusAttributes}

Operator

Operator ::= [46] CHOICE{
    and     [0] IMPLICIT NULL,
    or      [1] IMPLICIT NULL,
    and-not [2] IMPLICIT NULL,
    -- If version 2 is in force:
    -- - For query type 1, one of the above three must be chosen;
    -- - prox (below) may be used only if query type is 101.
    prox    [3] IMPLICIT ProximityOperator}

Parse Result (Query Tree)
Z39.50 query: "title XXX and subject YYY"

                 Oper: AND
               /           \
           left             right
   Operand:                 Operand:
   Index = Title            Index = Subject
   Value = XXX              Value = YYY

Parse Result (Query Tree)
Query: "subject XXX and (title YYY and author ZZZ)"

                 Op: AND
               /         \
   Oper:                  Op: AND
   Index: Subject        /       \
   Value: XXX     Oper:           Oper:
                  Index: Title    Index: Author
                  Value: YYY      Value: ZZZ
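
Once the tree is built, evaluation is a bottom-up walk. A sketch, assuming a hypothetical posting map from (index, value) pairs to sets of matching document ids:

    postings = {  # hypothetical inverted index: (index, value) -> doc ids
        ("subject", "XXX"): {1, 2, 3},
        ("title", "YYY"): {2, 3},
        ("author", "ZZZ"): {3, 4},
    }

    def evaluate(node):
        # AND maps to set intersection; a leaf looks up its posting set
        if "op" in node:
            return evaluate(node["left"]) & evaluate(node["right"])
        return postings.get((node["index"], node["value"]), set())

    tree = {"op": "AND",
            "left": {"index": "subject", "value": "XXX"},
            "right": {"op": "AND",
                      "left": {"index": "title", "value": "YYY"},
                      "right": {"index": "author", "value": "ZZZ"}}}
    print(evaluate(tree))  # {3}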

Relevance Feedback
- A popular query-reformulation strategy
- Used for:
  – query expansion
  – term re-weighting
- Type:
  – manual
  – automatic
- Scope:
  – local
  – global

Vector Model
- D_r: set of relevant documents identified by the user
- D_n: set of non-relevant documents identified by the user
- Vec_q: vector of the original query
- Vec_q': vector of the expanded query
- A common query-reformulation strategy:
  Vec_q' = Vec_q + Σ_{d ∈ D_r} Vec_d − Σ_{d ∈ D_n} Vec_d
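
A sketch of this unweighted reformulation (the full Rocchio formula adds α, β, γ coefficients, which the slide's version omits), using Counter objects as term vectors and dropping terms whose weight falls to zero or below:

    from collections import Counter

    def reformulate(q, relevant, nonrelevant):
        # Vec_q' = Vec_q + sum of relevant vectors - sum of non-relevant vectors
        new_q = Counter(q)
        for d in relevant:
            new_q.update(d)     # add term weights
        for d in nonrelevant:
            new_q.subtract(d)   # subtract term weights (may go negative)
        return {t: w for t, w in new_q.items() if w > 0}

    q  = Counter({"safety": 1, "minivans": 1})
    dr = Counter({"safety": 1, "tests": 1})    # judged relevant
    dn = Counter({"reviews": 1, "tests": 1})   # judged non-relevant
    print(reformulate(q, [dr], [dn]))
    # {'safety': 2, 'minivans': 1}  ('tests' cancels out, 'reviews' goes negative)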

Example
Q  = "safety minivans"
D1 = "car safety minivans tests injury statistics"   (relevant)
D2 = "liability tests safety"                        (relevant)
D3 = "car passengers injury reviews"                 (non-relevant)

What should the reformulated Q' be?