Why the interest in Queries?

Presentation transcript:

Why the interest in Queries? Queries are the ways we interact with IR systems. Nonquery methods? Types of queries?

Issues with Query Structures Matching Criteria Given a query, what document is retrieved? In what order?

Types of Query Structures Query Models (languages) – most common Boolean Queries Extended-Boolean Queries Natural Language Queries Vector queries Others?

Simple query language: Boolean. Earliest query model. Terms + Connectors (or operators). Terms: words, normalized (stemmed) words, phrases, thesaurus terms. Connectors: AND, OR, NOT.

Simple query language: Boolean Geek-speak Variations are still used in search engines!

Truth Tables – Boolean Logic Presence of P, P = 1 Absence of P, P = 0 True = 1 False = 0
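
The 1/0 convention on this slide can be tabulated directly. The following is a minimal sketch, assuming two terms P and Q; the function name `truth_table` is illustrative and not from the slides.

```python
from itertools import product

def truth_table():
    """Return rows (P, Q, P AND Q, P OR Q, NOT P), with 1 = present/true and 0 = absent/false."""
    rows = []
    for p, q in product((0, 1), repeat=2):
        rows.append((p, q, p & q, p | q, 1 - p))
    return rows

if __name__ == "__main__":
    print("P Q | AND OR NOT-P")
    for p, q, p_and_q, p_or_q, not_p in truth_table():
        print(f"{p} {q} |  {p_and_q}   {p_or_q}    {not_p}")
```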

Problems with Boolean Queries How do you express your need in a Boolean Query???? (geekspeak) No good way to weight terms for significance Want music by Beethoven, preferably a sonata Query?

Problems with Boolean Queries Incorrect interpretation of Boolean connectives AND and OR. Example: seeking Saturday entertainment. Queries: Dinner AND sports AND symphony; Dinner OR sports OR symphony; Dinner AND sports OR symphony

Order of precedence of operators Example of query. Is A AND B the same as B AND A? Why?

Sample Boolean Queries: Cat; Cat OR Dog; Cat AND Dog; (Cat AND Dog); (Cat AND Dog) OR Collar; (Cat AND Dog) OR (Collar AND Leash); (Cat OR Dog) AND (Collar OR Leash)

Satisfaction of Boolean Query (Cat OR Dog) AND (Collar OR Leash) Each of the following combinations works: Cat x x x x Dog x x x x x Collar x x x x Leash x x x x Others?

Satisfaction of Boolean Query (Cat OR Dog) AND (Collar OR Leash) None of the following combinations work: Cat x x Dog x x Collar x x Leash x x
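
Which term combinations satisfy (Cat OR Dog) AND (Collar OR Leash) can be checked by brute force. This is a sketch, assuming a document is modelled as the set of query terms it contains; `matches` and `split_combinations` are illustrative names.

```python
from itertools import product

TERMS = ("Cat", "Dog", "Collar", "Leash")

def matches(doc_terms):
    """Evaluate (Cat OR Dog) AND (Collar OR Leash) against a set of terms."""
    return (("Cat" in doc_terms or "Dog" in doc_terms)
            and ("Collar" in doc_terms or "Leash" in doc_terms))

def split_combinations():
    """Try all 16 presence/absence combinations; return (satisfying, failing)."""
    satisfying, failing = [], []
    for flags in product((False, True), repeat=len(TERMS)):
        present = {term for term, flag in zip(TERMS, flags) if flag}
        (satisfying if matches(present) else failing).append(present)
    return satisfying, failing
```

A combination works only when it has at least one term from each parenthesized group, so a document holding just Cat and Dog fails.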

Boolean Logic [Venn diagram of two sets, A and B]

Order of Precedence Define order of precedence. Infix notation, EX: a OR b AND c. Parentheses are evaluated 1st, with left-to-right precedence of operators. Next NOT’s are applied, then AND’s, then OR’s. a OR b AND c becomes a OR (b AND c)
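
Python's own Boolean operators happen to follow the same order (not, then and, then or), so the grouping rule can be checked there. This is only an illustration in Python, not a query parser; the three function names are made up for the comparison.

```python
from itertools import product

def implicit(a, b, c):
    # No parentheses: Python binds `and` more tightly than `or`.
    return a or b and c

def and_first(a, b, c):
    # The grouping the slide gives: a OR (b AND c).
    return a or (b and c)

def left_to_right(a, b, c):
    # What a strictly left-to-right reading would give instead.
    return (a or b) and c

ALL_INPUTS = list(product((False, True), repeat=3))
```

The two readings differ, for example, when a is true and c is false: a OR (b AND c) is true while (a OR b) AND c is false.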

Infix Notation Usually expressed as INFIX operators in IR: ((a AND b) OR (c AND b)). NOT is a UNARY PREFIX operator: ((a AND b) OR (c AND (NOT b))). AND and OR can be n-ary operators: (a AND b AND c AND d). Some rules (De Morgan revisited): NOT(a) AND NOT(b) = NOT(a OR b); NOT(a) OR NOT(b) = NOT(a AND b); NOT(NOT(a)) = a
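
The three rules can be confirmed by trying every truth assignment. A small sketch; `rules_hold` is an illustrative name.

```python
from itertools import product

def rules_hold():
    """Check the two De Morgan rules and double negation for every value of a and b."""
    for a, b in product((False, True), repeat=2):
        if ((not a) and (not b)) != (not (a or b)):
            return False
        if ((not a) or (not b)) != (not (a and b)):
            return False
        if (not (not a)) != a:
            return False
    return True
```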

DNFs and CNFs All queries can be rewritten as Disjunctive Normal Forms (DNFs) or Conjunctive Normal Forms (CNFs). DNF Constituents: Terms (words or phrases); Conjuncts (terms joined by ANDs); Disjuncts (conjuncts joined by ORs). Ex: (A AND B) OR (A AND NOT C). CNF Constituents: Disjuncts (terms joined by ORs); Conjuncts (disjuncts joined by ANDs). Ex: (A OR B) AND (A OR NOT C)
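
One way to see that any query has a DNF is to read it off the truth table: each satisfying row becomes one conjunct. The sketch below does that for the slide's DNF example; `canonical_dnf` is a hypothetical helper, and the result is the full (unsimplified) form rather than the shortest one.

```python
from itertools import product

def canonical_dnf(query, names):
    """Build a DNF string with one conjunct per satisfying truth assignment of `query`."""
    conjuncts = []
    for values in product((False, True), repeat=len(names)):
        if query(*values):
            literals = [name if value else f"NOT {name}"
                        for name, value in zip(names, values)]
            conjuncts.append("(" + " AND ".join(literals) + ")")
    return " OR ".join(conjuncts)

def slide_example(a, b, c):
    # (A AND B) OR (A AND NOT C)
    return (a and b) or (a and not c)

dnf = canonical_dnf(slide_example, ("A", "B", "C"))
```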

Effect of CNFs All complex Boolean queries can be simplified Why do reference librarians like CNFs? AND’s reduce the size of the set returned and are easily expandable

Boolean Logic [Venn diagram: three terms t1, t2, t3 divide the document space into eight regions m1 to m8, each a combination of t1, t2, t3 present or absent; documents D1 to D11 are placed in these regions]

Boolean Searching Information need: “Measurement of the width of cracks in prestressed concrete beams”. Concepts: Cracks (C), Beams (B), Width measurement (W), Prestressed concrete (P). Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete. Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
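
The relaxed query accepts any three of the four concepts, which is the same as OR-ing every 3-element combination. A sketch under that reading, assuming a document is a set of concept labels; the function names are illustrative.

```python
from itertools import combinations

CONCEPTS = ("cracks", "beams", "width_measurement", "prestressed_concrete")

def formal_match(doc_terms):
    """Formal query: all four concepts must be present."""
    return all(concept in doc_terms for concept in CONCEPTS)

def relaxed_match(doc_terms, k=3):
    """Relaxed query: some k-subset of the concepts is fully present."""
    return any(all(concept in doc_terms for concept in combo)
               for combo in combinations(CONCEPTS, k))
```

With four concepts and k = 3 there are four combinations, matching the four disjuncts of the relaxed query.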

Pseudo-Boolean Queries A new notation, from web search +cat dog +collar leash Does not mean the same thing! Need a way to group combinations. Phrases: “stray cat” AND “frayed collar” +“stray cat” + “frayed collar”
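
A sketch of how the + notation might be read, assuming that a + prefix marks a required term and unmarked terms are optional. That reading is an assumption here, the slide does not spell it out, and phrase handling is left out. Both function names are illustrative.

```python
def parse_pseudo_boolean(query):
    """Split a query into (required, optional) term lists; '+' marks a required term."""
    required, optional = [], []
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:].lower())
        else:
            optional.append(token.lower())
    return required, optional

def matches(doc_terms, query):
    """A document matches when it contains every required term (assumed semantics)."""
    required, _optional = parse_pseudo_boolean(query)
    return all(term in doc_terms for term in required)
```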

[Diagram of the retrieval process; boxes: Information need, Collections, Pre-process, text input, Query, Index, Parse, Rank]

Result Sets Run a query, get a result set. Two choices: reformulate query, run on entire collection; or reformulate query, run on result set. Example: Dialog query (Redford AND Newman) -> S1, 1450 documents; (S1 AND Sundance) -> S2, 898 documents
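
Running a reformulated query on a result set amounts to intersecting with the earlier set. The sketch below uses a tiny made-up inverted index; the document ids are toy data, not Dialog's, and `search` is a hypothetical helper.

```python
# Hypothetical inverted index: term -> set of document ids (toy data).
INDEX = {
    "redford": {1, 2, 3, 5, 8},
    "newman": {2, 3, 5, 9},
    "sundance": {3, 4, 5},
}

def search(term, within=None):
    """Return ids of documents containing `term`, optionally restricted to a prior result set."""
    postings = INDEX.get(term, set())
    return postings if within is None else postings & within

s1 = search("newman", within=search("redford"))  # Redford AND Newman -> S1
s2 = search("sundance", within=s1)               # S1 AND Sundance    -> S2
```

Because S2 is computed inside S1, it can never be larger than S1, which is the narrowing shown in the example.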

[Diagram of the retrieval process with feedback; boxes: Information need, Collections, Pre-process, text input, Query, Index, Parse, Rank, Reformulated Query, Re-Rank]

Ordering (ranking) of Retrieved Documents Pure Boolean has no ordering Term is there or it’s not In practice: order chronologically order by total number of “hits” on query terms What if one term has more hits than others? Is it better to have one of each term or many of one term?
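
The "total number of hits" ordering, and the question it raises, can be made concrete. This is a sketch with two invented documents; `score` returns both the total hit count and the number of distinct query terms found so the two notions can be compared.

```python
from collections import Counter

def score(doc_tokens, query_terms):
    """Return (total hits on query terms, number of distinct query terms present)."""
    counts = Counter(doc_tokens)
    total_hits = sum(counts[term] for term in query_terms)
    distinct = sum(1 for term in query_terms if counts[term] > 0)
    return total_hits, distinct

def rank_by_total_hits(docs, query_terms):
    """Order document names by total hits, highest first."""
    return sorted(docs, key=lambda name: score(docs[name], query_terms)[0], reverse=True)

# Toy documents, invented for illustration.
DOCS = {
    "d1": "cat cat cat cat".split(),
    "d2": "cat dog collar".split(),
}
QUERY = ["cat", "dog", "collar"]
```

Here d1 ranks first on total hits even though it contains only one of the three query terms, which is exactly the concern the slide raises.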

Boolean Query - Summary Advantages simple queries are easy to understand relatively easy to implement Disadvantages difficult to specify what is wanted too much returned, or too little ordering not well determined Dominant language in commercial systems until the WWW

Vector Space Model Documents and queries are represented as vectors in term space Terms are usually stems Documents represented by binary vectors of terms Queries represented the same as documents Query and Document weights are based on length and direction of their vector A vector distance measure between the query and documents is used to rank retrieved documents

Document Vectors Documents are represented as “bags of words” Represented as vectors when used computationally A vector is like an array of floating point values Has direction and magnitude Each vector holds a place for every term in the collection Therefore, most vectors are sparse

Queries Vocabulary: (dog, house, white). Queries: dog = (1,0,0); house and dog = (1,1,0); dog and house = (1,1,0). Show 3-D space plot
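
The query vectors above can be produced mechanically from the vocabulary. In this sketch the cosine of the angle between two vectors stands in for the "vector distance measure"; cosine is one common choice, assumed here rather than taken from the slides, and both function names are illustrative.

```python
import math

VOCABULARY = ("dog", "house", "white")

def to_vector(text):
    """Binary vector with one position per vocabulary term; other words are ignored."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in VOCABULARY]

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Word order does not matter in this representation, so "house and dog" and "dog and house" map to the same vector.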

Documents (queries) in Vector Space

Documents in 3D Space Assumption: Documents that are “close together” in space are similar in meaning.

Vector Query Problems Significance of queries: can different values be placed on the different terms, e.g. 2 dog, 1 house? Scaling: size of vectors. Number of words in the dictionary? 100,000
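
Both problems have a simple illustration: weights can replace the 0/1 entries, and a dictionary that stores only non-zero terms avoids allocating 100,000 positions per vector. A sketch with toy documents, again using cosine as an assumed similarity measure; `cosine_sparse` is an illustrative name.

```python
import math

def cosine_sparse(u, v):
    """Cosine similarity for sparse vectors stored as {term: weight}; missing terms weigh 0."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

query = {"dog": 2.0, "house": 1.0}   # "2 dog, 1 house"
docs = {"d1": {"dog": 1.0}, "d2": {"house": 1.0}, "d3": {"white": 1.0}}  # toy data
ranking = sorted(docs, key=lambda name: cosine_sparse(query, docs[name]), reverse=True)
```

With the heavier weight on dog, the dog-only document outranks the house-only one.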

Proximity Searches Proximity: terms occur within K positions of one another pen w/5 paper A “Near” function can be more vague near(pen, paper) Sometimes order can be specified Also, Phrases and Collocations “United Nations” “Bill Clinton” Phrase Variants “retrieval of information” “information retrieval”
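
A proximity test such as pen w/5 paper can be sketched over token positions. This assumes "within K positions" means the position difference is at most K; the `ordered` flag and the function name `within` are illustrative.

```python
def within(tokens, first, second, k, ordered=False):
    """True if `first` and `second` occur at most k positions apart.

    With ordered=True, `first` must come before `second`.
    """
    first_positions = [i for i, token in enumerate(tokens) if token == first]
    second_positions = [i for i, token in enumerate(tokens) if token == second]
    for i in first_positions:
        for j in second_positions:
            gap = j - i
            if ordered and 0 < gap <= k:
                return True
            if not ordered and 0 < abs(gap) <= k:
                return True
    return False
```

For example, in "the pen lay on top of the paper" the two words are six positions apart, so a w/5 test fails and a w/6 test succeeds.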

Filters Filters: reduce the set of candidate docs. Often specified simultaneously with the query. Usually restrictions on metadata; restrict by: date range, internet domain (.edu .com .berkeley.edu), author, size, limit on number of documents returned
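
Metadata filters can be applied as a pass over candidate documents before or after matching. A minimal sketch; the `Doc` fields and the `apply_filters` parameters are assumptions chosen to mirror the restrictions listed, not a real system's schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Doc:
    doc_id: int
    published: date
    domain: str
    author: str

def apply_filters(docs, start=None, end=None, domain_suffix=None, author=None, limit=None):
    """Keep documents that pass every filter that was supplied, then cap the count."""
    kept = []
    for doc in docs:
        if start is not None and doc.published < start:
            continue
        if end is not None and doc.published > end:
            continue
        if domain_suffix is not None and not doc.domain.endswith(domain_suffix):
            continue
        if author is not None and doc.author != author:
            continue
        kept.append(doc)
    return kept if limit is None else kept[:limit]
```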

Natural Language Queries The “Holy Grail” of information retrieval Issues in Natural Language Processing syntax semantics pragmatics speech understanding speech generation

Search engine query models
