Search and Retrieval: Query Languages Prof. Marti Hearst SIMS 202, Lecture 19.

1 Search and Retrieval: Query Languages Prof. Marti Hearst SIMS 202, Lecture 19

2 Marti A. Hearst SIMS 202, Fall 1997 Last Time
- Finding Out About
- Intro to Standard Information Retrieval
- Intro to Boolean Queries

3 Finding Out About
- Three phases:
  - Asking of a question
  - Construction of an answer
  - Assessment of the answer
- Part of an iterative process

4 Finding Out About is an Iterative Process
[Diagram labels: Repositories, Workspace, Goals]

5 Information Retrieval: A Restricted Form of FOA
- The system has available only pre-existing, “canned” text passages.
- Its response is limited to selecting from these passages and presenting them to the user.
- It must select, say, 10 or 20 passages out of millions or billions!

6 Query Languages
- Express the user’s information need
- Components:
  - query language
  - program to interpret the language
  - collection to compare the interpreted query against

7 Types of Query Languages
- Boolean
- Natural language (free style)
- Hybrid structured and free text
- Form-based
- SQL (for database queries)

8 Today
- More on Boolean Queries
- Database Queries
- IR vs. Database Queries

9 Basic Boolean Queries
- Components:
  - terms (operands)
  - connectors (operators)
    - AND
    - OR
    - NOT

10 Boolean Queries
- Cat
- Cat OR Dog
- Cat AND Dog
- (Cat AND Dog)
- (Cat AND Dog) OR Collar
- (Cat AND Dog) OR (Collar AND Leash)
- (Cat OR Dog) AND (Collar OR Leash)
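
The queries on this slide can be read as set operations over an inverted index: AND is intersection and OR is union. The following is a minimal sketch under that reading; the collection `DOCS` and the helpers `build_index` and `term` are made up for illustration and are not from the lecture.

```python
# Toy collection: document id -> text (hypothetical data for illustration).
DOCS = {
    1: "cat collar",
    2: "dog leash",
    3: "cat dog",
    4: "cat dog collar leash",
    5: "bird cage",
}

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

INDEX = build_index(DOCS)

def term(t):
    """Documents containing term t (empty set if the term never occurs)."""
    return INDEX.get(t.lower(), set())

# AND is set intersection, OR is set union.
cat_and_dog = term("Cat") & term("Dog")
cat_or_dog = term("Cat") | term("Dog")
and_then_or = (term("Cat") & term("Dog")) | (term("Collar") & term("Leash"))
or_then_and = (term("Cat") | term("Dog")) & (term("Collar") | term("Leash"))
```

The last two queries use the same four terms but return different documents, which is why the parentheses matter.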

11 Boolean Queries
- Usually expressed as INFIX operators in IR
  - ((a AND b) OR (c AND b))
- NOT is a UNARY PREFIX operator
  - ((a AND b) OR (c AND (NOT b)))
- AND and OR can be n-ary operators
  - (a AND b AND c AND d)
- Some rules
  - NOT(a) AND NOT(b) = NOT(a OR b)
  - NOT(a) OR NOT(b) = NOT(a AND b)
  - NOT(NOT(a)) = a
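
The three rules listed on this slide (De Morgan's laws and double negation) can be checked by brute force over every truth assignment. This is a small sketch; the function names are chosen here for illustration only.

```python
from itertools import product

def de_morgan_and(a, b):
    # NOT(a) AND NOT(b) = NOT(a OR b)
    return ((not a) and (not b)) == (not (a or b))

def de_morgan_or(a, b):
    # NOT(a) OR NOT(b) = NOT(a AND b)
    return ((not a) or (not b)) == (not (a and b))

def double_negation(a):
    # NOT(NOT(a)) = a
    return (not (not a)) == a

def holds_for_all(rule, arity):
    """True if the rule holds for every assignment of truth values."""
    return all(rule(*values) for values in product([False, True], repeat=arity))

# n-ary AND and OR: all() and any() take any number of operands.
n_ary_and = all([True, True, True, False])   # (a AND b AND c AND d)
n_ary_or = any([False, False, True, False])  # (a OR b OR c OR d)
```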

12 [Diagram labels: Information need; Index; Pre-process; Parse; Collections; Rank; Query; text input]

13 Result Sets
- Run a query, get a result set
- Two choices
  - Reformulate query, run on entire collection
  - Reformulate query, run on result set
- Example: Dialog query
  - (Redford AND Newman)
    -> S1 1450 documents
  - (S1 AND Sundance)
    -> S2 898 documents
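
The two choices on this slide can be sketched with named result sets. The code below is an assumption-laden illustration: the postings in `INDEX` are invented, and `save` only imitates the "-> S1 ..." style of the example; it is not Dialog's actual interface.

```python
# Hypothetical postings: term -> set of document ids.
INDEX = {
    "redford": {1, 2, 3, 4},
    "newman": {2, 3, 4, 5},
    "sundance": {3, 4, 6},
}

saved_sets = {}  # named result sets, e.g. "S1", "S2"

def save(name, result):
    """Store a result set under a name and report its size."""
    saved_sets[name] = result
    print(f"-> {name} {len(result)} documents")
    return result

# Choice 2: refine by running the new term against the saved set S1.
s1 = save("S1", INDEX["redford"] & INDEX["newman"])
s2 = save("S2", saved_sets["S1"] & INDEX["sundance"])

# Choice 1: reformulate and run the whole query on the entire collection.
full = INDEX["redford"] & INDEX["newman"] & INDEX["sundance"]
```

For a pure AND refinement both routes give the same documents; running against the saved set only has to examine the documents already in S1.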

14 [Diagram labels: Information need; Index; Pre-process; Parse; Collections; Rank; Query; text input; Reformulated Query; Re-Rank]

15 Ordering of Retrieved Documents
- Pure Boolean has no ordering
- In practice:
  - order chronologically
  - order by total number of “hits” on query terms
    - What if one term has more hits than others?
    - Is it better to have one of each term or many of one term?
- Fancier methods have been investigated
  - p-norm is most famous
    - usually impractical to implement
    - usually hard for user to understand
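
The question of "one of each term or many of one term" can be made concrete by comparing two simple scores. This sketch uses invented documents and two hypothetical scoring functions, `total_hits` and `distinct_hits`; it does not implement p-norm.

```python
from collections import Counter

# Hypothetical documents for illustration.
DOCS = {
    "d1": "cat cat cat cat",
    "d2": "cat dog",
    "d3": "dog dog collar",
}

def total_hits(text, query_terms):
    """Total number of occurrences of any query term."""
    counts = Counter(text.lower().split())
    return sum(counts[t] for t in query_terms)

def distinct_hits(text, query_terms):
    """Number of different query terms that occur at least once."""
    words = set(text.lower().split())
    return sum(1 for t in query_terms if t in words)

query = ["cat", "dog"]

# sorted() is stable, so tied documents keep their original order.
by_total = sorted(DOCS, key=lambda d: total_hits(DOCS[d], query), reverse=True)
by_distinct = sorted(DOCS, key=lambda d: distinct_hits(DOCS[d], query), reverse=True)
```

Ordering by total hits puts `d1` first even though it never mentions "dog"; ordering by distinct terms puts `d2` first.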

16 Boolean
- Advantages
  - simple queries are easy to understand
  - relatively easy to implement
- Disadvantages
  - difficult to specify what is wanted
  - too much returned, or too little
  - ordering not well determined
- Dominant language in commercial systems until the WWW

17 Faceted Boolean Query
- Strategy: break query into facets (polysemous with earlier meaning of facets)
  - conjunction of disjunctions
      (a1 OR a2 OR a3)
      AND (b1 OR b2)
      AND (c1 OR c2 OR c3 OR c4)
  - each facet expresses a topic
      (“rain forest” OR jungle OR amazon)
      AND (medicine OR remedy OR cure)
      AND (Smith OR Zhou)
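
A faceted query is a conjunction of disjunctions: every facet must be satisfied, and any one alternative satisfies a facet. The sketch below assumes naive lowercase substring matching (so "cure" would also match "secure"); `FACETS`, `facet_satisfied`, and `matches` are illustrative names.

```python
# Each inner list is one facet; alternatives within a facet are OR-ed.
FACETS = [
    ["rain forest", "jungle", "amazon"],
    ["medicine", "remedy", "cure"],
    ["smith", "zhou"],
]

def facet_satisfied(text, facet):
    """OR within a facet: any alternative appearing in the text is enough."""
    lowered = text.lower()
    return any(alternative in lowered for alternative in facet)

def matches(text, facets):
    """AND across facets: every facet must be satisfied."""
    return all(facet_satisfied(text, facet) for facet in facets)

hit = matches("A jungle remedy described by Smith", FACETS)
miss = matches("A jungle remedy", FACETS)  # no author facet
```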

18 Faceted Boolean Query
- Query still fails if one facet missing
- Alternative: Coordination level ranking
  - Order results in terms of how many facets (disjuncts) are satisfied
  - Also called Quorum ranking, Overlap ranking, and Best Match
- Problem: Facets still undifferentiated
- Alternative: assign weights to facets
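
Coordination level ranking replaces the all-or-nothing test with a count of satisfied facets, and weights let some facets count more than others. This is a rough sketch with invented documents and weights; `facet_score` is a hypothetical helper and uses naive substring matching.

```python
FACETS = [
    ["rain forest", "jungle", "amazon"],
    ["medicine", "remedy", "cure"],
    ["smith", "zhou"],
]

DOCS = {
    "a": "Smith reports a jungle remedy",
    "b": "A cure found in the Amazon",
    "c": "Zhou visits the city",
    "d": "Stock market news",
}

def facet_score(text, facets, weights=None):
    """Sum the weights of satisfied facets (each weight defaults to 1)."""
    if weights is None:
        weights = [1.0] * len(facets)
    lowered = text.lower()
    return sum(
        weight
        for facet, weight in zip(facets, weights)
        if any(alternative in lowered for alternative in facet)
    )

# Unweighted: plain coordination level (number of facets satisfied).
ranked = sorted(DOCS, key=lambda d: facet_score(DOCS[d], FACETS), reverse=True)

# Weighted: the author facet counts three times as much (an arbitrary choice).
weighted = sorted(
    DOCS, key=lambda d: facet_score(DOCS[d], FACETS, [1.0, 1.0, 3.0]), reverse=True
)
```

With equal weights, document `b` (two facets) outranks `c` (one facet); once the author facet is weighted more heavily, `c` moves ahead of `b`.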

19 Proximity Searches
- Proximity: terms occur within K positions of one another
  - pen w/5 paper
- A “Near” function can be more vague
  - near(pen, paper)
- Sometimes order can be specified
- Also, Phrases and Collocations
  - “United Nations” “Bill Clinton”
- Phrase Variants
  - “retrieval of information” “information retrieval”
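
A proximity operator such as `pen w/5 paper` can be sketched by comparing token positions. The code assumes that "within K positions" means the position difference is at most K and that tokens are split on whitespace; real systems may define both differently. `within` is a hypothetical helper.

```python
def positions(tokens, term):
    """All positions at which term occurs in the token list."""
    return [i for i, token in enumerate(tokens) if token == term]

def within(text, a, b, k, ordered=False):
    """True if a and b occur within k positions of each other.

    With ordered=True, a must come before b.
    """
    tokens = text.lower().split()
    for i in positions(tokens, a):
        for j in positions(tokens, b):
            distance = j - i
            if ordered and distance <= 0:
                continue
            if 0 < abs(distance) <= k:
                return True
    return False

sentence = "the pen lay next to the paper"
near_5 = within(sentence, "pen", "paper", 5)   # positions 1 and 6
near_4 = within(sentence, "pen", "paper", 4)
```

Under these assumptions an exact two-word phrase is the ordered case with K = 1, for example `within(text, "united", "nations", 1, ordered=True)`.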

20 Filters
- Filters: Reduce set of candidate docs
- Often specified simultaneously with query
- Usually restrictions on metadata
- restrict by:
  - date range
  - internet domain (.edu, .com, berkeley.edu)
  - author
  - size
- limit number of documents returned
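
Filters of this kind act on document metadata rather than on the text. The sketch below assumes a hypothetical `Doc` record with the fields named on the slide; the sample documents and the `apply_filters` signature are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Doc:
    doc_id: int
    author: str
    domain: str
    published: date
    size: int  # size in bytes (an assumption)

def apply_filters(docs, start=None, end=None, domain_suffix=None,
                  author=None, max_size=None, limit=None):
    """Keep documents that pass every filter that was supplied."""
    kept = []
    for doc in docs:
        if start is not None and doc.published < start:
            continue
        if end is not None and doc.published > end:
            continue
        if domain_suffix is not None and not doc.domain.endswith(domain_suffix):
            continue
        if author is not None and doc.author != author:
            continue
        if max_size is not None and doc.size > max_size:
            continue
        kept.append(doc)
    # Slicing with limit=None returns the whole list.
    return kept[:limit]

DOCS = [
    Doc(1, "hearst", "berkeley.edu", date(1997, 10, 1), 1200),
    Doc(2, "smith", "example.com", date(1997, 11, 5), 300),
    Doc(3, "zhou", "mit.edu", date(1996, 3, 2), 5000),
]

edu_1997 = apply_filters(
    DOCS, start=date(1997, 1, 1), end=date(1997, 12, 31), domain_suffix=".edu"
)
```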

21 SQL: Database Query Language
- Somewhat like Boolean
- Geared towards relational model
- A typical combination:
    SELECT X FROM Y WHERE Z

22 SQL basics
- SELECT: lists columns of tables
- FROM: which tables to use
- WHERE: which rows to include
- Example: get customer 1234’s bill:
    SELECT fee
    FROM accounts
    WHERE customer_number = 1234
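
The example query can be tried against an in-memory SQLite database using Python's standard `sqlite3` module. The table layout and the fee values below are assumptions made for this sketch; only the table name, column names, and customer number come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed schema: the slide only names the table and two of its columns.
conn.execute("CREATE TABLE accounts (customer_number INTEGER, fee REAL)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [(1234, 49.95), (5678, 12.50)],  # made-up rows
)

# SELECT picks the column, FROM the table, WHERE the rows.
rows = conn.execute(
    "SELECT fee FROM accounts WHERE customer_number = ?", (1234,)
).fetchall()

conn.close()
```

The `?` placeholder passes the customer number as a parameter instead of pasting it into the SQL text.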

23 Simple SQL Join
- Given two relations:
  - accounts (customer number and fee)
  - customers (customer number and address)
- Retrieve all customers’ numbers, addresses, and fees:
    SELECT customers.customer_num, address, fee
    FROM accounts, customers
    WHERE customers.customer_num = accounts.customer_num
- See reader for more details
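
The join can be run the same way with Python's `sqlite3` module. In this sketch the schemas and rows are invented, and the selected `customer_num` is qualified with a table name because the column exists in both tables and SQLite rejects the unqualified form as ambiguous.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed schemas for the two relations on the slide.
conn.execute("CREATE TABLE accounts (customer_num INTEGER, fee REAL)")
conn.execute("CREATE TABLE customers (customer_num INTEGER, address TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [(1234, 49.95), (5678, 12.50)],
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1234, "102 South Hall"), (5678, "1 Main St"), (9999, "No Account Rd")],
)

# The WHERE clause pairs each customer row with its matching account row.
rows = conn.execute(
    """
    SELECT customers.customer_num, address, fee
    FROM accounts, customers
    WHERE customers.customer_num = accounts.customer_num
    ORDER BY customers.customer_num
    """
).fetchall()

conn.close()
```

Customer 9999 has no row in `accounts`, so it does not appear in the result.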

24 IR vs. Database Systems (modified from van Rijsbergen)

25 Comparing Information Retrieval to Database Retrieval
Query: How many SCSI drives did Compaq buy last year?
- Database:
  - A natural query on a purchasing relation.
- Text Collection:
  - More difficult to search for
  - Less likely to find a reliable answer
  - Interpretation needed on results.

26 Comparing Information Retrieval to Database Retrieval
Query: What is the best SCSI disk drive to buy?
- Database:
  - Complex query combining information from many relations. Probably can’t be written.
- Text Collection:
  - Find Usenet articles with people’s opinions
  - Interpret and use text fragments that may have been written for other purposes.

27 Next Time: Assessing the Answer
- How well do the results answer the question?
- How relevant are they to the user?

28 Assignment -- Text Search!
- Get a Lexis-Nexis account from Roberta
- Use it in the lab or at home
- Many collections are accessible
  - News
  - Public Interest
  - Legal
- Get to know the interface
- Do some searches
- Answer some questions

