Query Models Use Types What do search engines do.


1 Query Models Use Types What do search engines do

2 What we have covered
What is IR; Web crawling; Evaluation; Tokenization and properties of text.
This time: Query models.

3 A Typical Web Search Engine
[Diagram components: Web, Crawler, Indexer, Index, Query Engine, Interface, Users]

4 Online vs offline processing
[Diagram components: Query Engine, Index, Users, Interface (labeled On-line); Indexer, Crawler, Web (labeled Off-line)]

5 A Typical Web Search Engine
[Diagram components: Web, Crawler, Indexer, Index, Query Engine, Interface, Users, Queries]

6 Why the interest in Queries?
Queries are the way we interact with IR systems.
A query is the expression of an information need.
Nonquery methods?
Types of queries?

7 Issues with Query Structures
Matching and ranking criteria: given a query, which documents are retrieved, and in what order (rank)?

8 Types of Query Structures
Query models (languages), most common:
Boolean queries
Extended-Boolean queries
Vector space
Boolean
Vector queries
Natural language queries
Others?

9 Databases vs. IR
What we're retrieving. IR: mostly unstructured; free text with some metadata. Databases: structured data; clear semantics based on a formal model.
Queries we're posing. IR: vague, imprecise information needs (often expressed in natural language). Databases: formally (mathematically) defined queries; unambiguous.
Results we get. IR: sometimes relevant, often not. Databases: exact; always correct in a formal sense.
Interaction with system. IR: interaction is important. Databases: one-shot queries.
Other issues. IR: issues downplayed. Databases: concurrency, recovery, atomicity are all critical.

10 Simple query language: Boolean
Earliest query model.
Terms + connectors (or operators).
Terms: words, normalized (stemmed) words, phrases, thesaurus terms.
Connectors: AND, OR, NOT.
Ex: Beethoven AND sonata
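A minimal sketch of how the connectors on this slide could be evaluated, assuming each term maps to the set of documents that contain it (an inverted index). The three-document collection and the helper name `build_index` are hypothetical and not part of the slides.

```python
from collections import defaultdict

# Hypothetical toy collection (not from the slides).
docs = {
    1: "beethoven piano sonata",
    2: "beethoven symphony",
    3: "mozart piano sonata",
}

def build_index(collection):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

index = build_index(docs)
all_ids = set(docs)

and_result = index["beethoven"] & index["sonata"]  # AND is set intersection
or_result = index["beethoven"] | index["sonata"]   # OR is set union
not_result = all_ids - index["beethoven"]          # NOT is complement within the collection

print(and_result, or_result, not_result)
```

Under this reading, a document is either in the result set or not, which is why pure Boolean retrieval gives no ranking by itself.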

11 Truth Tables – Boolean Logic
Presence of P: P = 1. Absence of P: P = 0.
True = 1, False = 0.
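The truth tables themselves did not survive in this transcript. As a sketch, they can be regenerated with the slide's 0/1 convention; the second operand name `Q` and the function name `truth_table` are assumptions made for illustration.

```python
from itertools import product

def truth_table():
    """Rows of (P, Q, P AND Q, P OR Q, NOT P) using 1 for true and 0 for false."""
    rows = []
    for p, q in product((0, 1), repeat=2):
        rows.append((p, q, p & q, p | q, 1 - p))
    return rows

print("P Q | AND | OR | NOT P")
for p, q, p_and_q, p_or_q, not_p in truth_table():
    print(f"{p} {q} |  {p_and_q}  | {p_or_q}  |   {not_p}")
```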

12 Problems with Boolean Queries
Ranking?
Incorrect interpretation of the Boolean connectives AND and OR.
Example: seeking Saturday entertainment. Queries:
Dinner AND sports AND symphony
Dinner OR sports OR symphony
Dinner AND sports OR symphony

13 Order of precedence of operators
Example of a query: is A AND B the same as B AND A? Why?

14 Sample Boolean Queries
Cat
Cat OR Dog
Cat AND Dog
(Cat AND Dog)
(Cat AND Dog) OR Collar
(Cat AND Dog) OR (Collar AND Leash)
(Cat OR Dog) AND (Collar OR Leash)

15 Satisfaction of Boolean Query
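(Cat OR Dog) AND (Collar OR Leash)
Each of the following column combinations works:
[Table: rows Cat, Dog, Collar, Leash; each column marks with x one combination of terms that satisfies the query. Column alignment was lost in extraction.]
Others?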

16 Satisfaction of Boolean Query
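(Cat OR Dog) AND (Collar OR Leash)
None of the following column combinations work:
[Table: rows Cat, Dog, Collar, Leash; each column marks with x one combination of terms that does not satisfy the query. Column alignment was lost in extraction.]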
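Which term combinations satisfy (Cat OR Dog) AND (Collar OR Leash) can be checked by enumerating all sixteen presence/absence patterns. This is a sketch; the names `satisfies`, `works`, and `fails` are illustrative.

```python
from itertools import product

TERMS = ("Cat", "Dog", "Collar", "Leash")

def satisfies(cat, dog, collar, leash):
    """Evaluate (Cat OR Dog) AND (Collar OR Leash) for one presence pattern."""
    return (cat or dog) and (collar or leash)

works, fails = [], []
for combo in product((False, True), repeat=len(TERMS)):
    present = tuple(term for term, flag in zip(TERMS, combo) if flag)
    (works if satisfies(*combo) else fails).append(present)

print(len(works), "combinations satisfy the query:", works)
print(len(fails), "combinations do not:", fails)
```

A document needs at least one term from each parenthesized group, so a pattern such as Cat plus Dog alone fails while Cat plus Leash works.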

17 Boolean Logic
[Venn diagram of two sets, A and B]

18 Order of Precedence
Define the order of precedence for infix notation. Ex: a OR b AND c
Parentheses are evaluated first, with left-to-right precedence of operators.
Next, NOTs are applied.
Then ANDs.
Then ORs.
a OR b AND c becomes a OR (b AND c)
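Python's own `not`, `and`, and `or` happen to follow the same precedence order (NOT, then AND, then OR), so the rule can be illustrated directly. This is only an analogy to show the grouping, not how any particular IR system parses queries; the function names are made up for the sketch.

```python
from itertools import product

def unparenthesized(a, b, c):
    return a or b and c      # parsed by Python as a or (b and c)

def and_first(a, b, c):
    return a or (b and c)

def or_first(a, b, c):
    return (a or b) and c

cases = list(product((False, True), repeat=3))

# The unparenthesized form always agrees with the AND-first grouping.
same_as_and_first = all(unparenthesized(*v) == and_first(*v) for v in cases)

# It disagrees with the OR-first grouping on these assignments of (a, b, c).
differs_from_or_first = [v for v in cases if unparenthesized(*v) != or_first(*v)]

print(same_as_and_first, differs_from_or_first)
```

The two groupings differ exactly when a is true and c is false, which is why a query like Dinner AND sports OR symphony can surprise a user who expected the other reading.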

19 Infix Notation
Usually expressed as INFIX operators in IR: ((a AND b) OR (c AND b))
NOT is a UNARY PREFIX operator: ((a AND b) OR (c AND (NOT b)))
AND and OR can be n-ary operators: (a AND b AND c AND d)
Some rules (De Morgan revisited):
NOT(a) AND NOT(b) = NOT(a OR b)
NOT(a) OR NOT(b) = NOT(a AND b)
NOT(NOT(a)) = a
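The three rewrite rules can be verified exhaustively, since two Boolean variables have only four assignments. A small sketch, with `rules_hold` as an illustrative name:

```python
from itertools import product

def rules_hold():
    """Check both De Morgan laws and double negation for every truth assignment."""
    for a, b in product((False, True), repeat=2):
        if ((not a) and (not b)) != (not (a or b)):
            return False
        if ((not a) or (not b)) != (not (a and b)):
            return False
        if (not (not a)) != a:
            return False
    return True

print(rules_hold())
```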

20 DNFs and CNFs
All queries can be rewritten as Disjunctive Normal Forms (DNFs) or Conjunctive Normal Forms (CNFs).
DNF constituents: terms (words or phrases); conjuncts (terms joined by ANDs); disjuncts (conjuncts joined by ORs).
Ex: (A AND B) OR (A AND NOT C)
CNF constituents: disjuncts (terms joined by ORs); conjuncts (disjuncts joined by ANDs).
Ex: (A OR B) AND (A OR NOT C)

21 Effect of CNFs
All complex Boolean queries can be simplified.
Why do reference librarians like CNFs?
ANDs reduce the size of the set returned and are easily expandable.
So do minuses.

22 Boolean Logic
[Venn diagram: three term sets t1, t2, t3 divide the collection into eight regions m1 through m8. Each region is a conjunction of t1, t2, t3 or their negations; the negation overbars were lost in extraction, so the individual definitions cannot be recovered. Documents D1 through D11 are placed in the regions.]
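Three terms always yield eight such regions, one per combination of present and negated terms. The sketch below enumerates them; the numbering it prints is arbitrary and is not claimed to match the slide's m1 through m8, which cannot be recovered from this transcript.

```python
from itertools import product

TERMS = ("t1", "t2", "t3")

def minterms(terms):
    """Return every conjunction in which each term appears either plain or negated."""
    result = []
    for signs in product((True, False), repeat=len(terms)):
        literals = [t if present else f"NOT {t}" for t, present in zip(terms, signs)]
        result.append(" AND ".join(literals))
    return result

for number, expression in enumerate(minterms(TERMS), start=1):
    print(f"region {number}: {expression}")
```

Every document falls in exactly one region, and any Boolean query over the three terms is a union of some of these regions, which is the idea behind the DNF rewrite.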

23 Boolean Searching
"Measurement of the width of cracks in prestressed concrete beams"
[Diagram labels: Cracks, Width measurement, Beams, Prestressed concrete]
Formal query: cracks AND beams AND Width_measurement AND Prestressed_concrete
Relaxed query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
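The relaxed query is the OR of every three-term subset of the four concepts, in other words "at least three of four". A sketch of generating and evaluating such a query; `relaxed_query` and `matches_relaxed` are illustrative names.

```python
from itertools import combinations

def relaxed_query(terms, k):
    """Build the OR of all k-term conjunctions over the given terms."""
    conjuncts = ["(" + " AND ".join(group) + ")" for group in combinations(terms, k)]
    return " OR ".join(conjuncts)

def matches_relaxed(doc_terms, terms, k):
    """A document matches if it contains at least k of the query terms."""
    return sum(1 for t in terms if t in doc_terms) >= k

terms = ["C", "B", "W", "P"]  # Cracks, Beams, Width measurement, Prestressed concrete
query = relaxed_query(terms, 3)
print(query)
```

The generated string contains the same four conjuncts as the slide, in a different order. Lowering `k` relaxes the query further and returns more documents.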

24 Pseudo-Boolean Queries
A new notation, from web search: +cat dog +collar leash
+ means this term must appear in the document.
Does not mean the same thing! Need a way to group combinations.
Phrases:
"stray cat" AND "frayed collar"
+"stray cat" +"frayed collar"
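One possible reading of the plus notation, as a sketch: terms marked with + are required, and unmarked terms are optional and only contribute to a score. That scoring rule is an assumption made for illustration, and quoted phrases are not handled here.

```python
def parse(query):
    """Split a pseudo-Boolean query into required (+term) and optional terms."""
    required, optional = set(), set()
    for token in query.lower().split():
        if token.startswith("+"):
            required.add(token[1:])
        else:
            optional.add(token)
    return required, optional

def score(query, doc_text):
    """Return None if a required term is missing, else the number of optional terms found."""
    required, optional = parse(query)
    words = set(doc_text.lower().split())
    if not required <= words:
        return None
    return len(optional & words)

print(parse("+cat dog +collar leash"))
print(score("+cat dog +collar leash", "cat with a collar and leash"))
print(score("+cat dog +collar leash", "dog collar leash"))
```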

25 [Flow diagram labels: Information need; Collections; Pre-process; text input; Query; Index; Parse; Rank]

26 Result Sets
Run a query, get a result set. Two choices:
Reformulate the query, run on the entire collection.
Reformulate the query, run on the result set.
Example (Dialog query):
(Redford AND Newman) -> S documents
(S1 AND Sundance) -> S2 898 documents
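A sketch of the second choice, refining within a stored result set, using set intersection. The posting sets below are invented for illustration and have nothing to do with the document counts in the Dialog example.

```python
# Hypothetical postings: term -> set of document ids.
index = {
    "redford": {1, 2, 3, 5},
    "newman": {2, 3, 5, 8},
    "sundance": {3, 4, 5},
}

# First query: keep the result set around under a name, as Dialog does with S1.
s1 = index["redford"] & index["newman"]

# Refinement runs only against the stored set, not the whole collection.
s2 = s1 & index["sundance"]

# Running the full reformulated query on the entire collection, for comparison.
rerun = index["redford"] & index["newman"] & index["sundance"]

print(s1, s2, rerun)
```

When the refinement only adds an AND term, both choices return the same documents; they diverge once the reformulated query drops or replaces terms from the original.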

27 [Flow diagram labels: Information need; Collections; Pre-process; text input; Query; Index; Parse; Rank; Reformulated Query; Re-Rank]

28 Ordering (ranking) of Retrieved Documents
Pure Boolean has no ordering: a term is there or it's not.
In practice: order chronologically, or order by total number of "hits" on query terms.
What if one term has more hits than others?
Is it better to have one of each term or many of one term?
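The last two questions can be made concrete by comparing two orderings on a toy collection: one by total hits, one that first counts how many distinct query terms appear. The documents and function names are hypothetical, and neither ordering is presented as the right answer.

```python
from collections import Counter

# Hypothetical documents (not from the slides).
docs = {
    "d1": "cat cat cat cat",
    "d2": "cat dog",
    "d3": "dog collar",
}
query_terms = ["cat", "dog"]

def hit_counts(text, terms):
    """Number of occurrences of each query term in the text."""
    counts = Counter(text.lower().split())
    return {t: counts[t] for t in terms}

def total_hits(text, terms):
    return sum(hit_counts(text, terms).values())

def distinct_terms_matched(text, terms):
    return sum(1 for c in hit_counts(text, terms).values() if c > 0)

# Order by total number of hits on query terms.
by_total = sorted(docs, key=lambda d: total_hits(docs[d], query_terms), reverse=True)

# Order by how many different query terms appear, breaking ties by total hits.
by_coverage = sorted(
    docs,
    key=lambda d: (distinct_terms_matched(docs[d], query_terms), total_hits(docs[d], query_terms)),
    reverse=True,
)

print(by_total, by_coverage)
```

Ordering by total hits puts the document that repeats one term first, while ordering by coverage prefers the document that contains both terms.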

29 Boolean Query - Summary
Advantages: simple queries are easy to understand; relatively easy to implement.
Disadvantages: difficult to specify what is wanted; too much returned, or too little; ordering not well determined.
Dominant language in commercial systems until the WWW.

30 Vector Space Model
Queries are treated as small documents.
Documents and queries are represented as vectors in term space.
Terms are usually stems.
Documents are represented by binary vectors of terms.
Query and document weights are based on the length and direction of their vectors.
A vector distance measure between the query and documents is used to rank retrieved documents.
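A sketch of ranking with one common vector measure, cosine similarity, which compares direction and normalizes away length. The slide does not say which measure is meant, so cosine is an assumption here, as are the use of raw term counts (no stemming) and the toy documents.

```python
import math
from collections import Counter

def to_vector(text):
    """Sparse term-count vector: only terms that occur are stored."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical documents (not from the slides).
docs = {
    "d1": "white dog house",
    "d2": "white house",
    "d3": "dog dog dog",
}

query = to_vector("dog house")  # the query is treated as a small document
ranked = sorted(docs, key=lambda d: cosine(query, to_vector(docs[d])), reverse=True)
print(ranked)
```

Because the measure is normalized, the document that repeats "dog" three times does not outrank the one that contains both query terms.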

31 Document Vectors
Documents are represented as "bags of words": words are terms with no order.
Represented as vectors when used computationally.
A vector is like an array of floating point values; it has direction and magnitude.
Each vector holds a place for every term in the collection; therefore, most vectors are sparse.

32 Queries
Vocabulary: (dog, house, white)
Queries:
dog: (1,0,0)
house and dog: (1,1,0)
dog and house: (1,1,0)
Show 3-D space plot
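The mapping from query text to a binary vector over this three-word vocabulary can be sketched as follows. Treating "and" as a stop word is an assumption made so that the slide's examples come out as shown.

```python
VOCABULARY = ("dog", "house", "white")
STOP_WORDS = {"and"}  # assumption: connective words are ignored

def query_vector(query):
    """One 0/1 component per vocabulary term, in vocabulary order."""
    words = {w for w in query.lower().split() if w not in STOP_WORDS}
    return tuple(1 if term in words else 0 for term in VOCABULARY)

for q in ("dog", "house and dog", "dog and house"):
    print(q, query_vector(q))
```

Word order is discarded, which is why "house and dog" and "dog and house" map to the same point in the space.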

33 Documents (queries) in Vector Space

34 Documents in 3D Space
Assumption: documents that are "close together" in space are similar in meaning.

35 Vector Query Problems
Significance of queries: can different values be placed on the different terms, e.g. 2 dog, 1 house?
Scaling: size of vectors.
Number of words in the dictionary? 100,000

36 Proximity Searches
Proximity: terms occur within K positions of one another, e.g. pen w/5 paper
A "near" function can be more vague: near(pen, paper)
Sometimes order can be specified.
Also, phrases and collocations: "United Nations", "Bill Clinton"
Phrase variants: "retrieval of information", "information retrieval"
Proximity - wikipedia
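A sketch of a proximity test such as pen w/5 paper, using token positions. Counting distance as the difference between token positions is an assumption, since systems differ on how they count; the function names are illustrative.

```python
def positions(tokens, term):
    """All positions at which the term occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def within(text, a, b, k, ordered=False):
    """True if some occurrence of a and some occurrence of b are at most k positions apart.

    With ordered=True, a must come before b.
    """
    tokens = text.lower().split()
    for i in positions(tokens, a):
        for j in positions(tokens, b):
            distance = j - i
            if ordered and distance <= 0:
                continue
            if 0 < abs(distance) <= k:
                return True
    return False

text = "the pen is on the paper"
print(within(text, "pen", "paper", 5))   # pen and paper are 4 positions apart
print(within(text, "pen", "paper", 3))
```

Setting `k` to 1 with `ordered=True` gives an exact two-word phrase match, which connects proximity search to the phrase queries listed on the slide.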

37 Filters/field limiters
Filters reduce the set of candidate docs.
Often specified simultaneously with the query.
Usually restrictions on metadata. Restrict by: date range; internet domain (.edu, .com, .berkeley.edu); author; size.
Limit the number of documents returned.
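Filters of this kind can be sketched as predicates over document metadata, applied to the candidate set. The `Doc` record, its field names, and the sample hosts are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Doc:
    host: str
    author: str
    published: date
    size_bytes: int

def apply_filters(docs, domain=None, author=None, start=None, end=None,
                  max_size=None, limit=None):
    """Keep documents that pass every supplied restriction, then cap the count."""
    kept = []
    for doc in docs:
        if domain is not None and not doc.host.endswith(domain):
            continue
        if author is not None and doc.author != author:
            continue
        if start is not None and doc.published < start:
            continue
        if end is not None and doc.published > end:
            continue
        if max_size is not None and doc.size_bytes > max_size:
            continue
        kept.append(doc)
    return kept if limit is None else kept[:limit]

# Hypothetical candidate set.
candidates = [
    Doc("www.berkeley.edu", "a", date(2009, 3, 1), 2000),
    Doc("www.example.com", "b", date(2009, 5, 1), 1000),
    Doc("cs.example.edu", "c", date(2007, 1, 1), 500),
]

edu_2009 = apply_filters(candidates, domain=".edu",
                         start=date(2009, 1, 1), end=date(2009, 12, 31))
print([d.host for d in edu_2009])
```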

38 Natural Language Queries
The "Holy Grail" of information retrieval.
Issues in natural language processing: syntax, semantics, pragmatics, speech understanding, speech generation.

39 What do search engines do?
Tags; Title; Meta; Term frequency and location; Popularity; Others

40 What do search engines do?
A collection of various methods, sometimes called pseudo-Boolean:
quotes, minus, plus
pseudo AND
"truth in" vs "in truth"
stop words?

41 What does Google do?
Basic search; search operators

42 UC Berkeley Search Engine Guide

43 UC Berkeley Search Engine Guide

44 Old: Search Engine Query Differences

45 Older: Search engine query models

46 Search query string
The portion of a dynamic URL that contains the search parameters when a dynamic Web site is searched. Query strings do not exist until a user plugs the variables into a database search, at which point the search engine will create the dynamic URL with the query string based on the results. Query strings typically contain ? and % characters.
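The role of the ? and % characters can be seen by taking a query string apart with Python's standard `urllib.parse` module. The URL and parameter names below are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

# Hypothetical dynamic URL; "?" starts the query string, "%20" encodes a space.
url = "https://www.example.com/search?q=information%20retrieval&page=2"

query_string = urlsplit(url).query
params = parse_qs(query_string)

# Going the other way: build a query string from search parameters.
rebuilt = urlencode({"q": "boolean query", "page": 1})

print(query_string)
print(params)
print(rebuilt)
```

Note that `urlencode` writes a space as + by default, which is another encoding commonly seen in search URLs.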

47 Search query string

48 Search query strings

49 Search query string

50 Lucene Basics
Searches are supported through a wide range of Query options: Keyword, Terms, Phrases, Wildcards. Many, many more.

51 QueryParser syntax examples
Query expression                       Document matches if…
java                                   Contains the term java in the default field
java junit                             Contains the term java or junit or both in the default field
  (same as: java OR junit)             (the default operator can be changed to AND)
+java +junit                           Contains both java and junit in the default field
  (same as: java AND junit)
title:ant                              Contains the term ant in the title field
title:extreme -subject:sports          Contains extreme in the title and not sports in subject
(agile OR extreme) AND java            Boolean expression matches
title:"junit in action"                Phrase matches in title
title:"junit action"~5                 Proximity matches (within 5) in title
java*                                  Wildcard matches
java~                                  Fuzzy matches
lastmodified:[1/1/09 TO 12/31/09]      Range matches

52 Types of Query Structures
Query models (languages), most common:
Boolean queries: old model.
Vector queries: very common, in all search engines to some extent.
Web queries: search engines.
Probabilistic models: mostly research (Indri).
Natural language queries: holy grail of search.

