Search and Retrieval: Query Languages Prof. Marti Hearst SIMS 202, Lecture 19.

Marti A. Hearst, SIMS 202, Fall 1997

Last Time
- Finding Out About
- Intro to Standard Information Retrieval
- Intro to Boolean Queries

Finding Out About
- Three phases:
  - Asking of a question
  - Construction of an answer
  - Assessment of the answer
- Part of an iterative process

Finding Out About is an Iterative Process
[Figure: diagram with labels Repositories, Workspace, Goals]

Information Retrieval: A Restricted Form of FOA
- The system has available only pre-existing, “canned” text passages.
- Its response is limited to selecting from these passages and presenting them to the user.
- It must select, say, 10 or 20 passages out of millions or billions!

Query Languages
- Express the user’s information need
- Components:
  - query language
  - program to interpret the language
  - collection to compare the interpreted query against

Types of Query Languages
- Boolean
- Natural language (free style)
- Hybrid structured and free text
- Form-based
- SQL (for database queries)

Today
- More on Boolean Queries
- Database Queries
- IR vs. Database Queries

Basic Boolean Queries
- Components:
  - terms (operands)
  - connectors (operators)
    - AND
    - OR
    - NOT

Boolean Queries
- Cat
- Cat OR Dog
- Cat AND Dog
- (Cat AND Dog)
- (Cat AND Dog) OR Collar
- (Cat AND Dog) OR (Collar AND Leash)
- (Cat OR Dog) AND (Collar OR Leash)
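One way to read the example queries is as set operations over an inverted index: AND as intersection, OR as union. The sketch below is only an illustration under that assumption; the five-document collection and the helper names `build_index` and `term` are hypothetical and do not come from the lecture.

```python
# Hypothetical toy collection: document id -> text (not from the lecture).
DOCS = {
    1: "cat collar",
    2: "dog leash",
    3: "cat dog",
    4: "cat dog collar leash",
    5: "bird cage",
}

def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

INDEX = build_index(DOCS)

def term(t):
    """Documents containing term t (empty set if the term never occurs)."""
    return INDEX.get(t.lower(), set())

# AND is set intersection (&); OR is set union (|).
cat_or_dog = term("Cat") | term("Dog")
cat_and_dog = term("Cat") & term("Dog")
q5 = (term("Cat") & term("Dog")) | term("Collar")
q6 = (term("Cat") & term("Dog")) | (term("Collar") & term("Leash"))
q7 = (term("Cat") | term("Dog")) & (term("Collar") | term("Leash"))

for name, result in [("Cat OR Dog", cat_or_dog), ("Cat AND Dog", cat_and_dog),
                     ("(Cat AND Dog) OR Collar", q5),
                     ("(Cat AND Dog) OR (Collar AND Leash)", q6),
                     ("(Cat OR Dog) AND (Collar OR Leash)", q7)]:
    print(name, "->", sorted(result))
```

Note how moving the parentheses between the last two queries changes the result set even though the same four terms are used.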

Boolean Queries
- Usually expressed as INFIX operators in IR
  - ((a AND b) OR (c AND b))
- NOT is UNARY PREFIX operator
  - ((a AND b) OR (c AND (NOT b)))
- AND and OR can be n-ary operators
  - (a AND b AND c AND d)
- Some rules
  - NOT(a) AND NOT(b) = NOT(a OR b)
  - NOT(a) OR NOT(b) = NOT(a AND b)
  - NOT(NOT(a)) = a
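The three rules listed above (De Morgan's laws and double negation) can be checked by brute force over every truth assignment. This is a small sanity check rather than part of the lecture; the function names `check_rules`, `n_ary_and`, and `n_ary_or` are made up for illustration.

```python
from itertools import product

def check_rules():
    """Return True if all three rewrite rules hold for every truth assignment."""
    for a, b in product([False, True], repeat=2):
        # NOT(a) AND NOT(b) = NOT(a OR b)
        if ((not a) and (not b)) != (not (a or b)):
            return False
        # NOT(a) OR NOT(b) = NOT(a AND b)
        if ((not a) or (not b)) != (not (a and b)):
            return False
        # NOT(NOT(a)) = a
        if (not (not a)) != a:
            return False
    return True

def n_ary_and(*operands):
    """AND over any number of operands, e.g. (a AND b AND c AND d)."""
    return all(operands)

def n_ary_or(*operands):
    """OR over any number of operands."""
    return any(operands)

rules_hold = check_rules()
print("rules hold:", rules_hold)
```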

[Figure: retrieval process diagram with labels Information need, Index, Pre-process, Parse, Collections, Rank, Query, text input]

Result Sets
- Run a query, get a result set
- Two choices
  - Reformulate query, run on entire collection
  - Reformulate query, run on result set
- Example: Dialog query
  - (Redford AND Newman)
  - -> S documents
  - (S1 AND Sundance)
  - -> S2 898 documents
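The difference between the two choices can be sketched in a few lines. The example below is not Dialog syntax; it assumes a hypothetical four-document collection and a made-up `search` helper, and only shows that running a refinement on an earlier result set can return fewer documents than running the same terms on the whole collection.

```python
# Hypothetical collection: document id -> set of index terms.
COLLECTION = {
    1: {"redford", "newman", "sundance"},
    2: {"redford", "newman"},
    3: {"redford", "sundance"},
    4: {"newman"},
}

def search(required_terms, candidates=None):
    """Return ids of documents that contain all of the required terms.

    With candidates=None the entire collection is searched; otherwise only
    the documents in the given earlier result set are considered.
    """
    ids = COLLECTION.keys() if candidates is None else candidates
    return {d for d in ids if required_terms <= COLLECTION[d]}

s1 = search({"redford", "newman"})          # (Redford AND Newman) -> S1
s2 = search({"sundance"}, candidates=s1)    # (S1 AND Sundance)    -> S2
whole = search({"sundance"})                # Sundance alone, entire collection

print(sorted(s1), sorted(s2), sorted(whole))
```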

[Figure: retrieval process diagram with labels Information need, Index, Pre-process, Parse, Collections, Rank, Query, text input, Reformulated Query, Re-Rank]

Ordering of Retrieved Documents
- Pure Boolean has no ordering
- In practice:
  - order chronologically
  - order by total number of “hits” on query terms
    - What if one term has more hits than others?
    - Is it better to have one of each term or many of one term?
- Fancier methods have been investigated
  - p-norm is most famous
    - usually impractical to implement
    - usually hard for user to understand
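The two practical orderings, and the question of one term dominating the hit count, can be illustrated with a toy example. Everything here (the three documents, their dates, and the helper names) is hypothetical; it is a sketch of the idea, not of any particular system.

```python
from collections import Counter

# Hypothetical retrieved set for the query terms "cat" and "dog".
DOCS = {
    "d1": {"date": "1997-03-01", "text": "cat dog"},
    "d2": {"date": "1997-09-15", "text": "cat cat cat cat"},
    "d3": {"date": "1996-11-30", "text": "cat dog dog bird"},
}
QUERY_TERMS = ["cat", "dog"]

def total_hits(text, terms):
    """Total number of occurrences of any query term in the text."""
    counts = Counter(text.split())
    return sum(counts[t] for t in terms)

def distinct_terms_matched(text, terms):
    """Number of different query terms that occur at least once."""
    words = set(text.split())
    return sum(1 for t in terms if t in words)

# Chronological ordering, newest first (ISO dates sort correctly as strings).
by_date = sorted(DOCS, key=lambda d: DOCS[d]["date"], reverse=True)

# Ordering by total hits: d2 comes first although it matches only "cat".
by_hits = sorted(DOCS, key=lambda d: total_hits(DOCS[d]["text"], QUERY_TERMS),
                 reverse=True)

print(by_date, by_hits)
```

Here `d2` tops the hit-count ordering with four occurrences of a single term, while `d1` matches both terms once each, which is exactly the trade-off the slide asks about.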

Boolean
- Advantages
  - simple queries are easy to understand
  - relatively easy to implement
- Disadvantages
  - difficult to specify what is wanted
  - too much returned, or too little
  - ordering not well determined
- Dominant language in commercial systems until the WWW

Faceted Boolean Query
- Strategy: break query into facets (polysemous with earlier meaning of facets)
  - conjunction of disjunctions
      (a1 OR a2 OR a3)
      AND (b1 OR b2)
      AND (c1 OR c2 OR c3 OR c4)
  - each facet expresses a topic
      (“rain forest” OR jungle OR amazon)
      AND (medicine OR remedy OR cure)
      AND (Smith OR Zhou)

Faceted Boolean Query
- Query still fails if one facet missing
- Alternative: Coordination level ranking
  - Order results in terms of how many facets (disjuncts) are satisfied
  - Also called Quorum ranking, Overlap ranking, and Best Match
- Problem: Facets still undifferentiated
- Alternative: assign weights to facets
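Coordination level ranking can be sketched as counting, for each document, how many facets have at least one matching term. The example below reuses the rain forest query; the four documents, the facet weights, and all function names are assumptions made for illustration, and matching is plain whole-word substring search.

```python
# Facets from the example query; each facet is a set of alternative terms.
FACETS = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]

# Hypothetical documents.
DOCS = {
    "d1": "smith studied a jungle remedy",
    "d2": "the amazon rain forest",
    "d3": "zhou wrote about medicine",
    "d4": "stock market report",
}

def facet_matches(text, facet):
    """True if any term or phrase of the facet occurs as whole words."""
    padded = " " + text.lower() + " "
    return any(" " + t + " " in padded for t in facet)

def facets_satisfied(text, facets):
    """Coordination level: number of facets with at least one match."""
    return sum(1 for facet in facets if facet_matches(text, facet))

def coordination_rank(docs, facets):
    """Rank documents by coordination level, dropping those with no match."""
    scored = [(doc_id, facets_satisfied(text, facets)) for doc_id, text in docs.items()]
    scored = [pair for pair in scored if pair[1] > 0]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

def weighted_score(text, facets, weights):
    """Variant: sum the weights of the satisfied facets."""
    return sum(w for facet, w in zip(facets, weights) if facet_matches(text, facet))

ranking = coordination_rank(DOCS, FACETS)
# The strict faceted Boolean query keeps only documents satisfying every facet.
strict = [doc_id for doc_id, level in ranking if level == len(FACETS)]
print(ranking, strict)
```

The strict conjunction returns only `d1`, whereas the coordination ranking still surfaces `d3` and `d2` as partial matches.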

Proximity Searches
- Proximity: terms occur within K positions of one another
  - pen w/5 paper
- A “Near” function can be more vague
  - near(pen, paper)
- Sometimes order can be specified
- Also, Phrases and Collocations
  - “United Nations” “Bill Clinton”
- Phrase Variants
  - “retrieval of information” “information retrieval”
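A proximity operator such as `pen w/5 paper` can be sketched over token positions. The code below assumes that "within K positions" means the token indices differ by at most K, which is one reasonable reading rather than the definition used by any specific system; the function names and sample sentences are hypothetical.

```python
def positions(tokens, term):
    """All token indices at which the term occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def within(tokens, a, b, k, ordered=False):
    """True if a and b occur within k positions of one another.

    With ordered=True, a must appear before b.
    """
    for i in positions(tokens, a):
        for j in positions(tokens, b):
            if ordered and j <= i:
                continue
            if abs(i - j) <= k:
                return True
    return False

def phrase(tokens, words):
    """True if the words occur as an exact contiguous phrase."""
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))

text = "she put the pen down next to the paper".split()

print(within(text, "pen", "paper", 5))                  # pen w/5 paper
print(within(text, "paper", "pen", 5, ordered=True))    # order specified
print(phrase("the united nations met".split(), ["united", "nations"]))

# Phrase variants: an unordered window of 2 matches both surface forms.
v1 = "retrieval of information".split()
v2 = "information retrieval".split()
print(within(v1, "information", "retrieval", 2), within(v2, "information", "retrieval", 2))
```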

Filters
- Filters: Reduce set of candidate docs
- Often specified simultaneously with query
- Usually restrictions on metadata
  - restrict by:
    - date range
    - internet domain (.edu, .com, .berkeley.edu)
    - author
    - size
  - limit number of documents returned
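Filters of this kind amount to predicates over document metadata applied to the candidate set. The following is a minimal sketch; the metadata records, field names, and the `apply_filters` function are all hypothetical.

```python
from datetime import date

# Hypothetical metadata records; the field names are illustrative only.
DOCS = [
    {"id": 1, "date": date(1997, 1, 10), "host": "www.sims.berkeley.edu", "author": "hearst", "size": 1200},
    {"id": 2, "date": date(1996, 6, 2), "host": "www.example.com", "author": "smith", "size": 300},
    {"id": 3, "date": date(1997, 8, 21), "host": "cs.stanford.edu", "author": "zhou", "size": 5400},
    {"id": 4, "date": date(1997, 9, 5), "host": "www.berkeley.edu", "author": "hearst", "size": 800},
]

def apply_filters(docs, start=None, end=None, domain=None, author=None,
                  max_size=None, limit=None):
    """Return ids of documents passing every filter that was supplied."""
    kept = []
    for doc in docs:
        if start is not None and doc["date"] < start:
            continue
        if end is not None and doc["date"] > end:
            continue
        if domain is not None and not doc["host"].endswith(domain):
            continue
        if author is not None and doc["author"] != author:
            continue
        if max_size is not None and doc["size"] > max_size:
            continue
        kept.append(doc["id"])
    # Optionally cap the number of documents returned.
    return kept if limit is None else kept[:limit]

edu_1997 = apply_filters(DOCS, start=date(1997, 1, 1), end=date(1997, 12, 31), domain=".edu")
berkeley_first = apply_filters(DOCS, domain=".berkeley.edu", limit=1)
small_hearst = apply_filters(DOCS, author="hearst", max_size=1000)
print(edu_1997, berkeley_first, small_hearst)
```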

SQL: Database Query Language
- Somewhat like Boolean
- Geared towards relational model
- A typical combination:
    SELECT X
    FROM Y
    WHERE Z

SQL basics
- SELECT: lists columns of tables
- FROM: which tables to use
- WHERE: which rows to include
- Example: get customer 1234’s bill:
    SELECT fee
    FROM accounts
    WHERE customer_number = 1234
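The example query can be tried against an in-memory SQLite database using Python's standard `sqlite3` module. The table layout and the fee values below are invented for illustration; only the SELECT statement itself comes from the slide.

```python
import sqlite3

# In-memory database with a hypothetical accounts table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (customer_number INTEGER, fee REAL)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [(1234, 49.5), (5678, 20.0)],  # made-up rows
)

# SELECT picks the column, FROM the table, WHERE the rows.
rows = conn.execute(
    "SELECT fee FROM accounts WHERE customer_number = 1234"
).fetchall()
print(rows)
conn.close()
```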

Simple SQL Join
- Given two relations:
  - accounts (customer name and fee)
  - customers (customer name and address)
- Retrieve all customers’ names, addresses, and fees:
    SELECT customers.customer_num, address, fee
    FROM accounts, customers
    WHERE customers.customer_num = accounts.customer_num
- See reader for more details
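The join can likewise be run with `sqlite3`. This sketch assumes both relations carry a `customer_num` column, as the WHERE clause implies; the rows are invented. The selected `customer_num` is qualified with a table name because the column exists in both tables and SQLite would otherwise reject it as ambiguous.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (customer_num INTEGER, fee REAL)")
conn.execute("CREATE TABLE customers (customer_num INTEGER, address TEXT)")

# Made-up rows; customer 3 has no address and customer 4 has no account.
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [(1, 10.0), (2, 25.0), (3, 5.0)])
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "12 Oak St"), (2, "9 Elm Ave"), (4, "3 Pine Rd")])

# Only rows whose customer_num appears in both tables survive the join.
rows = conn.execute(
    "SELECT customers.customer_num, address, fee "
    "FROM accounts, customers "
    "WHERE customers.customer_num = accounts.customer_num "
    "ORDER BY customers.customer_num"
).fetchall()
print(rows)
conn.close()
```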

IR vs. Database Systems (modified from van Rijsbergen)

Comparing Information Retrieval to Database Retrieval
Query: How many SCSI drives did Compaq buy last year?
- Database:
  - A natural query on a purchasing relation.
- Text Collection:
  - More difficult to search for
  - Less likely to find a reliable answer
  - Interpretation needed on results.
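To see why this is a natural database query, suppose a purchasing relation with buyer, item, quantity, and year columns. The schema, the figures, and the choice of 1996 as "last year" are all hypothetical; the point is only that the question maps directly onto one aggregate query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical purchasing relation; all values are invented.
conn.execute(
    "CREATE TABLE purchases (buyer TEXT, item TEXT, quantity INTEGER, year INTEGER)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?)",
    [
        ("Compaq", "SCSI drive", 500, 1996),
        ("Compaq", "SCSI drive", 250, 1996),
        ("Compaq", "IDE drive", 900, 1996),
        ("Compaq", "SCSI drive", 100, 1995),
        ("Dell", "SCSI drive", 300, 1996),
    ],
)

# "How many SCSI drives did Compaq buy last year?" as a single aggregate.
(total,) = conn.execute(
    "SELECT SUM(quantity) FROM purchases "
    "WHERE buyer = ? AND item = ? AND year = ?",
    ("Compaq", "SCSI drive", 1996),
).fetchone()
print(total)
conn.close()
```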

Comparing Information Retrieval to Database Retrieval
Query: What is the best SCSI disk drive to buy?
- Database:
  - Complex query combining information from many relations. Probably can’t be written.
- Text Collection:
  - Find usenet articles with people’s opinions
  - Interpret and use text fragments that may have been written for other purposes.

Next Time: Assessing the Answer
- How well do the results answer the question?
- How relevant are they to the user?

Assignment -- Text Search!
- Get a Lexis-Nexis account from Roberta
- Use it in the lab or at home
- Many collections are accessible
  - News
  - Public Interest
  - Legal
- Get to know the interface
- Do some searches
- Answer some questions