LBSC 690 Session #9 Unstructured Information: Search Engines Jimmy Lin The iSchool University of Maryland Wednesday, October 29, 2008 This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.

Take-Away Messages: Search engines provide access to unstructured textual information. Searching is fundamentally about bridging the gap between words and meaning. Information seeking is an iterative process in which the search engine plays an important role.

You will learn about: dimensions of information seeking, why searching for relevant information is hard, Boolean and ranked retrieval, and how to assess the effectiveness of search systems.

Information Retrieval: satisfying an information need, or "scratching an information itch." It is what you search for!

(Diagram: User, Process, System, Information.)

What types of information? Text (documents and portions thereof), XML and structured documents, images, audio (sound effects, songs, etc.), video, source code, and applications/Web services. Our focus today is on textual information.

Types of Information Needs. Retrospective ("searching the past"): different queries posed against a static collection; time invariant. Prospective ("searching the future"): a static query posed against a dynamic collection; time dependent.

Retrospective Searches (I): topical search, open-ended exploration. Examples: Identify positive accomplishments of the Hubble telescope since it was launched. Compile a list of mammals that are considered to be endangered, identify their habitat and, if possible, specify what threatens them. Who makes the best chocolates? What technologies are available for digital reference desk services?

Retrospective Searches (II): known item search and question answering.
"Factoid" questions: Who discovered Oxygen? When did Hawaii become a state? Where is Ayer's Rock located? What team won the World Series in 1992?
"List" questions: What countries export oil? Name U.S. cities that have a "Shubert" theater.
"Definition" questions: Who is Aaron Copland? What is a quasar?
Known item searches: Find Jimmy Lin's homepage. What's the ISBN number of "Modern Information Retrieval"?

Prospective "Searches". Filtering: make a binary decision about each incoming document. Routing: sort incoming documents into different bins.

Scope of Information Needs: the right thing, a few good things, or everything.

Relevance: how well information addresses your needs. It is harder to pin down than you think! Relevance is a complex function of user, task, and context. Types of relevance: topical relevance (is it about the right thing?) and situational relevance (is it useful?).

The Information Retrieval Cycle. (Diagram: source selection, query formulation, search, selection, examination, and delivery, with the stages connected by a resource, a query, a ranked list, and documents; loops allow source reselection, query reformulation, vocabulary learning, and relevance feedback.)

The Central Problem in Search. (Diagram: the searcher expresses concepts as query terms; the author expresses concepts as document terms.) Do these represent the same concepts? For example, "tragic love story" versus "fateful star-crossed romance".

Ambiguity, synonymy, polysemy, morphology, paraphrase, anaphora, pragmatics.

How do we represent documents? Remember: computers don't "understand" anything! The "bag of words" representation: break a document into words; disregard order, structure, meaning, etc. of the words. Simple, yet effective!
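As a concrete illustration of the "bag of words" idea, here is a minimal sketch in Python; the tokenizer and the example sentence are illustrative assumptions, not part of the original slides.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Reduce a document to a multiset of lowercased word tokens,
    discarding order, structure, and punctuation."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

doc = "The quick brown fox jumped over the lazy dog's back."
print(bag_of_words(doc))
# Counter({'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, ...})
```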

Boolean Text Retrieval: keep track of which documents have which terms. Queries specify constraints on search results: a AND b (the document must have both terms "a" and "b"); a OR b (the document must have either term "a" or "b"); NOT a (the document must not have term "a"). Boolean operators can be arbitrarily combined. Results are not ordered!

Index Structure. Document 1: "The quick brown fox jumped over the lazy dog's back." Document 2: "Now is the time for all good men to come to the aid of their party."
(Table: each term is mapped to the documents that contain it. Stopwords the, is, for, to, and of are excluded. The remaining terms quick, brown, fox, over, lazy, dog, back, and jump point to Document 1; now, time, all, good, men, come, aid, their, and party point to Document 2.)
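A minimal sketch of how such an index might be built, using the slide's two example documents and stopword list; the tokenizer is an assumption, and no stemming is applied (so "jumped" is indexed as is rather than as "jump").

```python
import re

STOPWORDS = {"the", "is", "for", "to", "of"}  # the slide's stopword list

def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z']+", text.lower()):
            if term not in STOPWORDS:
                index.setdefault(term, set()).add(doc_id)
    return index

docs = {
    1: "The quick brown fox jumped over the lazy dog's back.",
    2: "Now is the time for all good men to come to the aid of their party.",
}
index = build_index(docs)
print(index["fox"])    # {1}
print(index["party"])  # {2}
```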

Boolean Searching (example results):
dog AND fox: Doc 3, Doc 5
dog NOT fox: empty
fox NOT dog: Doc 7
dog OR fox: Doc 3, Doc 5, Doc 7
good AND party: Doc 6, Doc 8
good AND party NOT over: Doc 6
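With an inverted index in hand, Boolean queries reduce to set operations. The sketch below hand-builds a tiny index chosen so that it reproduces the example results above; the larger collection behind the slide's document numbers is not shown in the original.

```python
# Hand-built index standing in for the slide's (unshown) collection.
index = {
    "dog": {3, 5},
    "fox": {3, 5, 7},
    "good": {6, 8},
    "party": {6, 8},
    "over": {8},
}
all_docs = set(range(1, 9))

print(index["dog"] & index["fox"])                       # AND: {3, 5}
print(index["dog"] - index["fox"])                       # dog NOT fox: set() (empty)
print(index["fox"] - index["dog"])                       # fox NOT dog: {7}
print(index["dog"] | index["fox"])                       # OR: {3, 5, 7}
print(all_docs - index["dog"])                           # NOT dog: every other document
print((index["good"] & index["party"]) - index["over"])  # {6}
```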

Extensions. Stemming ("truncation"): a technique for handling morphological variations; store word stems (love, loving, loves, … → lov). Proximity operators: more precise versions of AND; store a list of positions for each word in each document.
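The proximity idea can be sketched with a positional index, as below; the tokenizer, the example documents, and the window of the hypothetical NEAR operator are all illustrative assumptions.

```python
import re

def build_positional_index(docs):
    """Map each term to {doc_id: [positions where the term occurs]}."""
    index = {}
    for doc_id, text in docs.items():
        for pos, term in enumerate(re.findall(r"[a-z']+", text.lower())):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index

def near(index, a, b, window):
    """Documents in which terms a and b occur within `window` words of each other."""
    hits = set()
    for doc_id in index.get(a, {}).keys() & index.get(b, {}).keys():
        if any(abs(i - j) <= window
               for i in index[a][doc_id] for j in index[b][doc_id]):
            hits.add(doc_id)
    return hits

docs = {1: "the quick brown fox jumped over the lazy dog",
        2: "the dog slept while the fox ran far away"}
index = build_positional_index(docs)
print(near(index, "fox", "dog", window=4))  # {2}: 4 words apart there, 5 apart in doc 1
```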

Why Boolean Retrieval Works. Boolean operators approximate natural language: AND can specify relationships between concepts (good party); OR can specify alternate terminology (excellent party); NOT can suppress alternate meanings (Democratic party).

Why Boolean Retrieval Fails. Natural language is way more complex: AND "discovers" nonexistent relationships (terms in different paragraphs, chapters, …); guessing terminology for OR is hard (good, nice, excellent, outstanding, awesome, …); guessing terms to exclude is even harder (Democratic party, party to a lawsuit, …).

Strengths and Weaknesses.
Strengths: precise, if you know the right strategies; precise, if you have an idea of what you're looking for; implementations are fast and efficient.
Weaknesses: users must learn Boolean logic; Boolean logic is insufficient to capture the richness of language; no control over the size of the result set, either too many hits or none (when do you stop reading?); all documents in the result set are considered "equally good"; and what about partial matches? Documents that "don't quite match" the query may be useful also.

Ranked Retrieval Paradigm. Pure Boolean systems provide no ordering of results, but some documents are more relevant than others! "Best-first" ranking can be superior: select n documents, put them in order with the "best" ones first, and display them one screen at a time. Users can decide when they want to stop reading. "Best-first"? Easier said than done!

Extending Boolean retrieval: order results based on the number of matching terms (e.g., for the query a AND b AND c). But what if multiple documents have the same number of matching terms? And what if no single document matches the query?

Similarity-Based Queries. 1. Treat both documents and queries as "bags of words" and assign a weight to each word. 2. Find the similarity between the query and each document, computed from the weights of the words. 3. Rank order the documents by similarity, displaying the documents most similar to the query first. Surprisingly, this works pretty well!
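A minimal sketch of this three-step recipe, assuming term weights are already available as dictionaries (one way to compute them is sketched after the next slide); cosine similarity is used here as one common choice of similarity measure.

```python
import math

def cosine(query_weights, doc_weights):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * doc_weights.get(term, 0.0) for term, w in query_weights.items())
    norm_q = math.sqrt(sum(w * w for w in query_weights.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_weights.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

def rank(query_weights, docs_weights):
    """Return document ids sorted by similarity to the query, best first."""
    scores = {doc_id: cosine(query_weights, dw)
              for doc_id, dw in docs_weights.items()}
    return sorted(scores, key=scores.get, reverse=True)
```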

Term Weights. Terms tell us about documents: if "rabbit" appears a lot, the document is likely to be about rabbits. Documents tell us about terms: almost every document contains "the". Term weights incorporate both factors: term frequency (the higher, the better) and document frequency (the lower, the better).
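The two factors on this slide (term frequency: higher is better; document frequency: lower is better) are exactly what standard tf-idf weighting combines. A minimal sketch, assuming raw term frequency and a logarithmic idf, which is one of several common variants; the resulting dictionaries can be fed to the ranking sketch above.

```python
import math
from collections import Counter

def tfidf_weights(doc_tokens, all_docs_tokens):
    """Weight each term in a document by tf * log(N / df)."""
    n_docs = len(all_docs_tokens)
    weights = {}
    for term, tf in Counter(doc_tokens).items():
        df = sum(1 for tokens in all_docs_tokens if term in tokens)
        # A term that appears in every document (like "the") gets weight 0.
        weights[term] = tf * math.log(n_docs / df)
    return weights
```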

The Information Retrieval Cycle. (Diagram: the cycle shown earlier, repeated: source selection, query formulation, search, selection, examination, and delivery, with loops for source reselection, query reformulation, vocabulary learning, and relevance feedback.)

Supporting the Search Process. (Diagram: the same cycle, now showing the system side: acquisition builds the collection, and indexing builds the index that search runs against.)

Spiders, Crawlers, and Robots: Oh My!

The Information Retrieval Cycle. (Diagram: the cycle repeated once more, with its loops for source reselection, query reformulation, vocabulary learning, and relevance feedback.)

Search Output. What now? The user identifies relevant documents for "delivery" or issues a new query based on the content of the result set. What can the system do? Assist the user in identifying relevant documents, and assist the user in identifying potentially useful query terms.

Selection Interfaces. One-dimensional lists: What to display? (title, source, date, summary, ratings, ...) In what order? (retrieval status value, date, alphabetic, ...) How much to display? (number of hits) Other aids? (related terms, suggested queries, …) Two+ dimensional displays: clustering, projection, contour maps, VR; navigation by jump, pan, and zoom.

Query Enrichment. Relevance feedback: the user designates "more like this" documents, and the system adds terms from those documents to the query. Manual reformulation: the initial result set leads to a better understanding of the problem domain, and the new query better approximates the information need. Automatic query suggestion.
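A minimal sketch of the relevance feedback idea described here: append the most frequent terms from the documents the user marked "more like this", skipping terms already in the query. The frequency-based scoring and the number of terms added are illustrative assumptions; classic systems use more careful weighted updates.

```python
from collections import Counter

def expand_query(query_terms, relevant_docs_tokens, n_new_terms=5):
    """Add the most common terms from user-designated relevant documents."""
    counts = Counter()
    for tokens in relevant_docs_tokens:
        counts.update(tokens)
    already = set(query_terms)
    candidates = [term for term, _ in counts.most_common() if term not in already]
    return list(query_terms) + candidates[:n_new_terms]
```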

Example Interfaces. Google: keyword in context. Cuil: different approach to result presentation. Microsoft Live: query refinement suggestions. Exalead: faceted refinement. Vivisimo/Clusty: clustered results. Kartoo: cluster visualization. WebBrain: structure visualization. Grokker: "map view". PubMed: related article search.

Evaluating IR Systems. User-centered strategy: recruit several users, observe each user working with one or more retrieval systems, and measure which system works the "best". System-centered strategy: given documents, queries, and relevance judgments, try several variants of the retrieval method and measure which variant is more effective.

Good Effectiveness Measures: capture some aspect of what the user wants, have predictive value for other situations, are easily replicated by other researchers, and are easily compared.

Which is the Best Rank Order? (Figure: six ranked lists, A through F, with the relevant documents marked in each.)

Precision and Recall. Contingency table: A = retrieved and relevant, B = retrieved but not relevant, C = not retrieved but relevant, D = not retrieved and not relevant. Collection size = A+B+C+D; relevant = A+C; retrieved = A+B. Precision = A / (A+B). Recall = A / (A+C). When is precision important? When is recall important?
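A minimal sketch computing the two measures from sets of retrieved and relevant document ids, following the contingency table above.

```python
def precision_recall(retrieved, relevant):
    """Precision = A/(A+B), recall = A/(A+C), with A = |retrieved ∩ relevant|."""
    a = len(retrieved & relevant)                         # A: retrieved and relevant
    precision = a / len(retrieved) if retrieved else 0.0  # A / (A + B)
    recall = a / len(relevant) if relevant else 0.0       # A / (A + C)
    return precision, recall

print(precision_recall({1, 2, 3, 4}, {2, 4, 5, 6, 7}))  # (0.5, 0.4)
```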

Another View. (Diagram: within the space of all documents, the relevant set and the retrieved set overlap; their intersection is the set that is both relevant and retrieved, and everything outside both is neither relevant nor retrieved.)

Precision and Recall. Precision: how much of what was found is relevant? Often of interest, particularly for interactive searching. Recall: how much of what is relevant was found? Particularly important for law, patents, and medicine.

Abstract Evaluation Model. (Diagram: ranked retrieval takes a query and documents and produces a ranked list; evaluation compares the ranked list against relevance judgments to yield a measure of effectiveness.)

User Studies. The goal is to account for interface issues, either by studying the interface component or by studying the complete system. Formative evaluation provides a basis for system development; summative evaluation is designed to assess effectiveness.

Quantitative User Studies. Select independent variable(s), e.g., what information to display in the selection interface. Select dependent variable(s), e.g., time to find a known relevant document. Run subjects in different orders to average out learning and fatigue effects. Compute statistical significance; the null hypothesis is that the independent variable has no effect.
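A minimal sketch of the last step, assuming the dependent variable is completion time measured under two interface conditions; it uses SciPy's independent-samples t-test to check the null hypothesis that the interface has no effect. The numbers are simulated stand-ins, not real measurements.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated completion times in seconds for two interface conditions;
# in a real study these would come from the subjects.
times_interface_a = rng.normal(loc=60, scale=10, size=20)
times_interface_b = rng.normal(loc=52, scale=10, size=20)

t_stat, p_value = stats.ttest_ind(times_interface_a, times_interface_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A small p-value would lead us to reject the null hypothesis that the
# independent variable (the interface) has no effect.
```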

Qualitative User Studies: direct observation; think-aloud protocols.

Objective vs. Subjective Data. Subjective self-assessment: which did they think was more effective? Preference: which interface did they prefer, and why? These are often at odds with objective measures!

Take-Away Messages: Search engines provide access to unstructured textual information. Searching is fundamentally about bridging words and meaning. Information seeking is an iterative process in which the search engine plays an important role.

You have learned about: dimensions of information seeking, why searching for relevant information is hard, Boolean and ranked retrieval, and how to assess the effectiveness of retrieval systems.