Basic IR: Queries
A query is a statement of the user’s information need. The index is designed to map queries to the documents most likely to be relevant. The query’s type, content, and representation dictate what the index must do. Queries range from single keywords through specialized query languages to exemplar documents.

Documents as Queries
“Find other documents like this one.” The query is itself a document, so it can go through the same sort of pre-processing (e.g., stop word removal, stemming). The characteristics of such queries mimic those of documents.
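The shared pre-processing step can be sketched as follows. This is a minimal illustration, not a production pipeline: the stop list is a tiny hypothetical one, and `crude_stem` is a stand-in suffix stripper rather than a real stemming algorithm. The point is only that the query document and the indexed documents pass through the same function.

```python
import re

# Tiny illustrative stop list (an assumption); real systems use longer ones.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "for"}

def crude_stem(token):
    """Strip a few common English suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

# The same function is applied to an indexed document and to a query document.
doc_terms = preprocess("Indexing the documents of a collection")
query_terms = preprocess("The indexed document")
```

After pre-processing, both texts are reduced to comparable term lists, so a document used as a query can be matched against the index like any other query.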

Query Term Distribution in SavvySearch

Keyword Queries
The query is composed of a set of keywords. Retrieve the documents that best match the keywords.
Advantages: easy to use; supports fast indexing.
Disadvantages: coarse; easily led astray (e.g., by words with multiple meanings); difficult to express a complex information need.
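A keyword query over an inverted index might look like the sketch below. The scoring rule (count of distinct query keywords present) is an assumption chosen for brevity; the slide does not say how "best matches" is measured. The toy documents are invented for illustration.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def keyword_search(index, query):
    """Rank documents by how many distinct query keywords they contain."""
    hits = Counter()
    for term in set(query.lower().split()):
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    # Sort by descending match count, then by id for a stable order.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))

docs = {
    "d1": "windows software for children",
    "d2": "replacement windows and doors",
    "d3": "unix software",
}
index = build_index(docs)
results = keyword_search(index, "windows software")
```

Document d2 is retrieved even though it is about glass windows, which is the "words with multiple meanings" weakness in miniature.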

Boolean Queries
Combine queries (keywords) with Boolean operators:
OR: “children OR kids”
AND: “windows AND software”
BUT: “unix BUT solaris” (matches the first term but not the second; there is no standalone NOT!)
Advantage: more precise queries.
Disadvantages: does not support ranking; less intuitive.
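Over an inverted index, the three operators reduce to set operations on posting sets. The sketch below assumes BUT means "satisfies the first operand and not the second", i.e. set difference; the index contents are made up for illustration.

```python
def postings(index, term):
    """Return the set of document ids for a term (empty if unseen)."""
    return index.get(term, set())

# Hypothetical inverted index: term -> ids of documents containing it.
index = {
    "children": {"d1", "d4"},
    "kids": {"d2", "d4"},
    "windows": {"d1", "d3"},
    "software": {"d1", "d5"},
    "unix": {"d5", "d6"},
    "solaris": {"d6"},
}

# OR is set union, AND is set intersection, BUT is set difference.
children_or_kids = postings(index, "children") | postings(index, "kids")
windows_and_software = postings(index, "windows") & postings(index, "software")
unix_but_solaris = postings(index, "unix") - postings(index, "solaris")
```

Each result is an unordered set, which reflects the stated disadvantage: a document either satisfies the expression or it does not, so there is nothing to rank by.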

Phrase Search
Supplement single terms with phrases: exact sequences of terms. This requires an index that tracks the proximity of terms or stores both singletons and phrases. An extension is context, where the proximity between terms is stated (e.g., “adele w/2 howe” to CiteSeer).
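One way to track proximity is a positional index that records where each term occurs in each document. The sketch below makes an assumption about the proximity operator: `within(first, second, n)` is taken to mean that the second term appears at most n positions after the first. Whether CiteSeer's "w/2" was ordered in this way is not stated in the slide.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> doc id -> list of word positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def within(index, first, second, max_gap):
    """Doc ids where `second` occurs 1..max_gap positions after `first`."""
    matches = set()
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    for doc_id in first_postings.keys() & second_postings.keys():
        second_positions = set(second_postings[doc_id])
        if any(p + gap in second_positions
               for p in first_postings[doc_id]
               for gap in range(1, max_gap + 1)):
            matches.add(doc_id)
    return matches

def phrase(index, first, second):
    """A two-word exact phrase is proximity with a gap of exactly one."""
    return within(index, first, second, 1)

docs = {
    "d1": "adele howe teaches information retrieval",
    "d2": "adele e howe",
    "d3": "howe and adele",
}
pos_index = build_positional_index(docs)
exact = phrase(pos_index, "adele", "howe")
near = within(pos_index, "adele", "howe", 2)
```

The exact phrase matches only d1, while the looser proximity query also picks up d2, where a middle initial separates the two names.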

Query Language: Web Search Engines I
AltaVista Advanced Search:
Form: all of these words; this exact phrase; any of these words; none of these words
Boolean Expression: AND, OR, AND NOT, NEAR
Date, File Type, Location

Query Language: Web Search Engines II
Google Advanced Search:
Find Results: with all of the words; with the exact phrase; with at least one of the words; without the words
Language, File Format, Date, Occurrences, Domain, SafeSearch
Page-Specific Search: find pages similar to the page; find pages that link to the page
Topic Specific Searches

Typical Query Behavior on WWW
Query term distribution obeys Zipf’s Law (quite skewed, although the skew does drift). Length is ? terms. Few users exploit the full power of query languages; most enter terms without operators and do not use advanced search interfaces. Change in behavior?
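For reference, a common statement of Zipf's Law (not given on the slide) says that the frequency of the r-th most frequent term falls off roughly as a power of its rank. Here C is a normalizing constant and s is an exponent that is close to 1 in many text collections; both symbols are introduced here for illustration.

```latex
f(r) \approx \frac{C}{r^{s}}, \qquad s \approx 1
```

Under this form, a handful of top-ranked query terms account for a large share of all term occurrences, which is what "quite skewed" refers to.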

Natural Language
Augmented Boolean approach: treat the query as a document. Rank documents by how well they match the constraints of the query and return those above a certain threshold.
NLP approach: interpret semantics in a limited way to constrain the query (e.g., “who” indicates a person).
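Both ideas can be sketched together in a few lines. Everything specific here is an assumption made for illustration: the question-word table, the small stop list, and the score (fraction of query terms found in the document) are hypothetical choices, not something the slides prescribe.

```python
import re

# Hypothetical mapping from question words to expected answer types.
QUESTION_TYPES = {"who": "person", "where": "place", "when": "date"}
STOP_WORDS = {"the", "a", "an", "is", "of", "did"}

def interpret(question):
    """Return (expected answer type or None, remaining content terms)."""
    tokens = re.findall(r"[a-z]+", question.lower())
    expected = next((QUESTION_TYPES[t] for t in tokens if t in QUESTION_TYPES), None)
    terms = [t for t in tokens if t not in QUESTION_TYPES and t not in STOP_WORDS]
    return expected, terms

def rank(docs, terms, threshold):
    """Score each document by the fraction of query terms it contains."""
    scored = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        score = sum(t in words for t in terms) / len(terms) if terms else 0.0
        if score >= threshold:
            scored.append((doc_id, score))
    return sorted(scored, key=lambda item: (-item[1], item[0]))

docs = {
    "d1": "bell invented the telephone",
    "d2": "the telephone rang",
    "d3": "a history of radio",
}
expected_type, terms = interpret("Who invented the telephone?")
ranked = rank(docs, terms, threshold=0.5)
```

The expected answer type is only extracted here; a fuller system would use it to constrain which documents or passages qualify.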

NL Example: AskJeeves

Advanced Querying
Pattern Matching: combinations of syntactic features, e.g., regular expressions, wild-card queries.
Structural Queries: forms, hypertext, and hierarchies; typically support iterative querying, as in guided browsing (e.g., WebGlimpse or Letizia).
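A simple way to support pattern queries is to expand the pattern against the index vocabulary and then search for the matching terms. The sketch below uses Python's standard `fnmatch` and `re` modules; the vocabulary is a made-up example, and scanning the whole vocabulary is a simplification that real systems avoid with specialized term indexes.

```python
import fnmatch
import re

# Hypothetical index vocabulary.
vocabulary = ["retrieval", "retrieve", "retriever", "query", "queries", "index"]

def wildcard_terms(vocab, pattern):
    """Expand a shell-style wildcard pattern (e.g. 'retriev*') to vocabulary terms."""
    return [t for t in vocab if fnmatch.fnmatchcase(t, pattern)]

def regex_terms(vocab, pattern):
    """Expand a regular expression to the vocabulary terms it matches in full."""
    compiled = re.compile(pattern)
    return [t for t in vocab if compiled.fullmatch(t)]

wild = wildcard_terms(vocabulary, "retriev*")
rx = regex_terms(vocabulary, r"quer(y|ies)")
```

The expanded term lists can then be combined with OR and looked up in the index like an ordinary keyword query.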

Advanced Querying: Letizia
Recommends new pages based on the user’s browsing preferences. Infers interests by observing user behavior: saving a bookmark, following a link, spending time on a page… Models documents as lists of keywords.
Figure from

Caching
An astonishing number of people submit the same queries (e.g., “Harry Potter”). Just as single-word usage is skewed (Zipf’s Law), so is query submission on the WWW. A search engine can exploit this by caching results for frequently repeated queries.
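A result cache for repeated queries can be sketched as below. The least-recently-used eviction policy and the query normalization (lowercasing and collapsing whitespace) are assumptions made for this example; the slide only says that results for repeated queries can be cached. `slow_search` is a hypothetical stand-in for the real search backend.

```python
from collections import OrderedDict

class QueryCache:
    """Cache search results per normalized query, evicting the least recently used."""

    def __init__(self, search_fn, capacity=1000):
        self._search_fn = search_fn
        self._capacity = capacity
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def search(self, query):
        # Normalize so trivially different spellings share one cache entry.
        key = " ".join(query.lower().split())
        if key in self._entries:
            self.hits += 1
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        self.misses += 1
        results = self._search_fn(key)
        self._entries[key] = results
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # drop the least recently used entry
        return results

def slow_search(query):
    """Hypothetical backend; a real one would consult the index."""
    return [f"result for {query}"]

cache = QueryCache(slow_search, capacity=2)
for q in ["Harry Potter", "harry  potter", "zipf", "Harry Potter"]:
    cache.search(q)
```

Because query popularity is skewed, even a small cache of this kind can absorb a large share of the traffic from the most popular queries.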

Single vs. On-Going Queries: Filtering
Find new documents in an information stream that satisfy a static information need. A user profile represents interests; a threshold represents how closely documents must match. The user may provide the query, or it may be learned through relevance feedback.
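The profile-and-threshold idea can be sketched as follows. The similarity measure (fraction of profile terms that appear in the document) is an assumption picked for simplicity, and the profiles, thresholds, and stream are invented for illustration. Learning a profile from relevance feedback is not shown.

```python
def overlap_score(profile, document):
    """Fraction of the profile's terms that appear in the document."""
    words = set(document.lower().split())
    return len(profile & words) / len(profile) if profile else 0.0

def filter_stream(stream, profiles, thresholds):
    """Route each incoming document to every user whose threshold it meets."""
    delivered = {user: [] for user in profiles}
    for doc in stream:
        for user, profile in profiles.items():
            if overlap_score(profile, doc) >= thresholds[user]:
                delivered[user].append(doc)
    return delivered

# Hypothetical static profiles and per-user thresholds.
profiles = {
    "user1": {"unix", "solaris"},
    "user2": {"search", "engine", "ranking"},
}
thresholds = {"user1": 0.5, "user2": 0.6}
stream = [
    "solaris patch released",
    "new search engine ranking method",
    "search tips",
    "unix and solaris news",
]
delivered = filter_stream(stream, profiles, thresholds)
```

Unlike a one-off query, the profiles stay fixed while the documents change, so each new document is compared against every stored profile as it arrives.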

Filtering Process
[Figure: a document stream is matched against the User 1 profile and the User 2 profile, yielding the documents filtered for User 1 and the documents filtered for User 2. From the MIR text.]