Structure of IR Systems LBSC 796/INFM 718R Session 1, September 10, 2007 Doug Oard.

Slides:



Advertisements
Similar presentations
Testing Relational Database
Advertisements

Chapter 5: Introduction to Information Retrieval
UCLA : GSE&IS : Department of Information StudiesJF : 276lec1.ppt : 5/2/2015 : 1 I N F S I N F O R M A T I O N R E T R I E V A L S Y S T E M S Week.
LBSC 796/INFM 718R: Week 1 Introduction to Information Retrieval Jimmy Lin College of Information Studies University of Maryland Monday, January 30, 2006.
Search and Ye Shall Find (maybe) Seminar on Emergent Information Technology August 20, 2007 Douglas W. Oard.
© 2010 Bennett, McRobb and Farmer1 Use Case Description Supplementary material to support Bennett, McRobb and Farmer: Object Oriented Systems Analysis.
Information Retrieval: Human-Computer Interfaces and Information Access Process.
Ranked Retrieval INST 734 Module 3 Doug Oard. Agenda  Ranked retrieval Similarity-based ranking Probability-based ranking.
Information Retrieval in Practice
Search Engines and Information Retrieval
Information Retrieval Review
1 CS 430 / INFO 430 Information Retrieval Lecture 8 Query Refinement: Relevance Feedback Information Filtering.
Full-Text Indexing Session 10 INFM 718N Web-Enabled Databases.
The Vector Space Model LBSC 796/CMSC828o Session 3, February 9, 2004 Douglas W. Oard.
Search Session 12 LBSC 690 Information Technology.
Information Retrieval Interaction CMSC 838S Douglas W. Oard April 27, 2006.
Information Retrieval: Human-Computer Interfaces and Information Access Process.
1 CS 430 / INFO 430 Information Retrieval Lecture 24 Usability 2.
Advance Information Retrieval Topics Hassan Bashiri.
© Lethbridge/Laganière 2001 Chapter 7: Focusing on Users and Their Tasks1 7.1 User Centred Design (UCD) Software development should focus on the needs.
IMT530- Organization of Information Resources1 Feedback Like exercises –But want more instructions and feedback on them –Wondering about grading on these.
Search Engines Session 11 LBSC 690 Information Technology.
Exercise 1: Bayes Theorem (a). Exercise 1: Bayes Theorem (b) P (b 1 | c plain ) = P (c plain ) P (c plain | b 1 ) * P (b 1 )
The Big Six Theory Information Literacy
Overview of Search Engines
Unit 2: Engineering Design Process
Search Engines and Information Retrieval Chapter 1.
Web Searching Basics Dr. Dania Bilal IS 530 Fall 2009.
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
Information Retrieval Evaluation and the Retrieval Process.
Planning a search strategy.  A search strategy may be broadly defined as a conscious approach to decision making to solve a problem or achieve an objective.
Information in the Digital Environment Information Seeking Models Dr. Dania Bilal IS 530 Spring 2006.
Autumn Web Information retrieval (Web IR) Handout #0: Introduction Ali Mohammad Zareh Bidoki ECE Department, Yazd University
1 Introduction to Software Engineering Lecture 1.
Systems Analysis and Design in a Changing World, Fourth Edition
Matching LBSC 708A/CMSC 828O March 8, 1999 Douglas W. Oard and Dagobert Soergel.
WIRED Week 3 Syllabus Update (next week) Readings Overview - Quick Review of Last Week’s IR Models (if time) - Evaluating IR Systems - Understanding Queries.
The Structure of Information Retrieval Systems LBSC 708A/CMSC 838L Douglas W. Oard and Philip Resnik Session 1: September 4, 2001.
Structure of IR Systems INST 734 Module 1 Doug Oard.
Information in the Digital Environment Information Seeking Models Dr. Dania Bilal IS 530 Spring 2005.
Information Retrieval Techniques Israr Hanif M.Phil QAU Islamabad Ph D (In progress) COMSATS.
Structure of IR Systems INST 734 Module 1 Doug Oard.
How Do We Find Information?. Key Questions  What are we looking for?  How do we find it?  Why is it difficult? “A prudent question is one-half of wisdom”
Jane Reid, AMSc IRIC, QMUL, 30/10/01 1 Information seeking Information-seeking models Search strategies Search tactics.
1 Information Retrieval LECTURE 1 : Introduction.
Information Retrieval Transfer Cycle Dania Bilal IS 530 Fall 2007.
Evidence from Content INST 734 Module 2 Doug Oard.
Structure of IR Systems LBSC 796/INFM 718R Session 1, January 26, 2011 Doug Oard.
Ranked Retrieval INST 734 Module 3 Doug Oard. Agenda Ranked retrieval  Similarity-based ranking Probability-based ranking.
Evidence from Metadata INST 734 Doug Oard Module 8.
Augmenting (personal) IR Readings Review Evaluation Papers returned & discussed Papers and Projects checkin time.
Learning Objectives Understand the concepts of Information systems.
1 CS 430: Information Discovery Lecture 5 Ranking.
Feature Assignment LBSC 878 February 22, 1999 Douglas W. Oard and Dagobert Soergel.
Search Strategies Session 7 INST 301 Introduction to Information Science.
#1 Make sense of problems and persevere in solving them How would you describe the problem in your own words? How would you describe what you are trying.
Cross-Language Information Retrieval Applied Natural Language Processing October 29, 2009 Douglas W. Oard.
Seminar on Information Retrieval 정보검색론특강
Session 5: How Search Engines Work. Focusing Questions How do search engines work? Is one search engine better than another?
Information Retrieval in Practice
Information Storage and Retrieval Fall Lecture 1: Introduction and History.
Chapter 1: Introduction
Using computers to search electronic databases
Information Retrieval and Web Search
Information Retrieval and Web Search
Thanks to Bill Arms, Marti Hearst
Introduction into Knowledge and information
Introduction to Information Retrieval
Search Engine Architecture
Structure of IR Systems
Presentation transcript:

Structure of IR Systems LBSC 796/INFM 718R Session 1, September 10, 2007 Doug Oard

Agenda Teaching theater orientation The structure of interactive IR systems Course overview

What is IR? Information retrieval is a problem-oriented discipline, concerned with the problem of the effective and efficient transfer of desired information between human generator and human user Anomalous States of Knowledge as a Basis for Information Retrieval. (1980) Nicholas J. Belkin. Canadian Journal of Information Science, 5,

Information Retrieval Systems Information –What is “information”? Retrieval –What do we mean by “retrieval”? –What are different types information needs? Systems –How do computer systems fit into the human information seeking process?

Information Hierarchy Data InformationKnowledge Wisdom More refined and abstract

Information Hierarchy Data –The raw material of information Information –Data organized and presented in a particular manner Knowledge –“Justified true belief” –Information that can be acted upon Wisdom –Distilled and integrated knowledge –Demonstrative of high-level “understanding”

What do We Mean by “Information?” How is it different from “data”? –Information is data in context Databases contain data and produce information IR systems contain and provide information How is it different from “knowledge”? –Knowledge is a basis for making decisions Many “knowledge bases” contain decision rules

A (Facetious) Example Data –98.6 º F, 99.5 º F, º F, 101 º F, … Information –Hourly body temperature: 98.6 º F, 99.5 º F, º F, 101 º F, … Knowledge –If you have a temperature above 100 º F, you most likely have a fever Wisdom –If you don’t feel well, go see a doctor

What types of information? Text Structured documents (e.g., XML) Images Audio (sound effects, songs, etc.) Video Programs Services

What Do We Mean by “Retrieval?” Find something that you want –The information need may or may not be explicit Known item search –Find the class home page Answer seeking –Is Lexington or Louisville the capital of Kentucky? Directed exploration –Who makes videoconferencing systems?

Relevance Relevance relates a topic and a document –Duplicates are equally relevant, by definition –Constant over time and across users Pertinence relates a task and a document –Accounts for quality, complexity, language, … Utility relates a user and a document –Accounts for prior knowledge

Systems: The Memex

What Do People Search For? Searchers often don’t clearly understand –The problem they are trying to solve –What information is needed to solve the problem –How to ask for that information The query results from a clarification process Dervin’s “sense making”: Need GapBridge

Taylor’s Model of Question Formation Q1 Visceral Need Q2 Conscious Need Q3 Formalized Need Q4 Compromised Need (Query) End-user Search Intermediated Search

Types of Information Needs Retrospective (“Retrieval”) –“Searching the past” –Different queries posed against a static collection –Time invariant Prospective (“Filtering”) –“Searching the future” –Static query posed against a dynamic collection –Time dependent

Design Strategies Foster human-machine synergy –Exploit complementary strengths –Accommodate shared weaknesses Divide-and-conquer –Divide task into stages with well-defined interfaces –Continue dividing until problems are easily solved Co-design related components –Iterative process of joint optimization

Divide and Conquer Strategy: use encapsulation to limit complexity Approach: –Define interfaces (input and output) for each component –Define the functions performed by each component –Study each component in isolation –Repeat the process within components as needed –Make sure that this decomposition makes sense Result: a hierarchical decomposition

Process/System Co-Design

Human-Machine Synergy Machines are good at: –Doing simple things accurately and quickly –Scaling to larger collections in sublinear time People are better at: –Accurately recognizing what they are looking for –Evaluating intangibles such as “quality” Both are pretty bad at: –Mapping consistently between words and concepts

Supporting the Search Process Source Selection Search Query Selection Ranked List Examination Document Delivery Document Query Formulation IR System Query Reformulation and Relevance Feedback Source Reselection NominateChoose Predict

Supporting the Search Process Source Selection Search Query Selection Ranked List Examination Document Delivery Document Query Formulation IR System Indexing Index Acquisition Collection

Study the IR black box in isolation –Simple behavior: in goes query, out comes documents –Optimize the choice of documents that come out Where to Start? Search Query Ranked List

The IR Black Box Documents Query Hits

Inside The IR Black Box Documents Query Hits Representation Function Representation Function Query RepresentationDocument Representation Comparison Function Index

Search Component Model Comparison Function Representation Function Query Formulation Human Judgment Representation Function Retrieval Status Value Utility Query Information NeedDocument Query RepresentationDocument Representation Query Processing Document Processing

What about databases? What are examples of databases? –Banks storing account information –Retailers storing inventories –Universities storing student grades What exactly is a (relational) database? –Think of them as a collection of tables –They model some aspect of “the world”

A (Simple) Database Example Student Table Department TableCourse Table Enrollment Table

Database Queries What would you want to know from a database? –What classes is John Arrow enrolled in? –Who has the highest grade in LBSC 690? –Who’s in the history department? –Of all the non-CLIS students taking LBSC 690 with a last name shorter than six characters and were born on a Monday, who has the longest address?

Databases vs. IR Other issues Interaction with system Results we get Queries we’re posing What we’re retrieving IRDatabases Issues downplayed.Concurrency, recovery, atomicity are all critical. Interaction is important.One-shot queries. Sometimes relevant, often not. Exact. Always correct in a formal sense. Vague, imprecise information needs (often expressed in natural language). Formally (mathematically) defined queries. Unambiguous. Mostly unstructured. Free text with some metadata. Structured data. Clear semantics based on a formal model.

“Bag of Terms” Representation Bag = a “set” that can contain duplicates  “The quick brown fox jumped over the lazy dog’s back”  {back, brown, dog, fox, jump, lazy, over, quick, the, the} Vector = values recorded in any consistent order  {back, brown, dog, fox, jump, lazy, over, quick, the, the}  [ ]

Bag of Terms Example The quick brown fox jumped over the lazy dog’s back. Document 1 Document 2 Now is the time for all good men to come to the aid of their party. the quick brown fox over lazy dog back now is time for all good men to come jump aid of their party Term Document 1Document 2 Stopword List

Advantages of Ranked Retrieval Closer to the way people think –Some documents are better than others Enriches browsing behavior –Decide how far down the list to go as you read it Allows more flexible queries –Long and short queries can produce useful results

Counting Terms Terms tell us about documents –If “rabbit” appears a lot, it may be about rabbits Documents tell us about terms –“the” is in every document -- not discriminating Documents are most likely described well by rare terms that occur in them frequently –Higher “term frequency” is stronger evidence –Low “document frequency” makes it stronger still

Document Length Normalization Long documents have an unfair advantage –They use a lot of terms So they get more matches than short documents –And they use the same words repeatedly So they have much higher term frequencies Normalization seeks to remove these effects

Problems with “Free Text” Search Homonymy –Terms may have many unrelated meanings –Polysemy (related meanings) is less of a problem Synonymy –Many ways of saying (nearly) the same thing Anaphora –Alternate ways of referring to the same thing

Two Ways of Searching Write the document using terms to convey meaning Author Content-Based Query-Document Matching Document Terms Query Terms Construct query from terms that may appear in documents Free-Text Searcher Retrieval Status Value Construct query from available concept descriptors Controlled Vocabulary Searcher Choose appropriate concept descriptors Indexer Metadata-Based Query-Document Matching Query Descriptors Document Descriptors

Problems with Controlled Vocabulary New concepts Users and indexers may think differently Using thesauri effectively requires training

Behavior Category Minimum Scope

Some Examples Read/Ignored, Saved/Deleted, Replied to (Stevens, 1993) Reading time (Morita & Shinoda, 1994; Konstan et al., 1997) Hypertext Link (Brin & Page, 1998)

Estimating Authority from Links Authority Hub

Problems with Observed Behavior Protecting privacy –What absolute assurances can we provide? –How can we make remaining risks understood? Scalable rating servers –Is a fully distributed architecture practical? Non-cooperative users –How can the effect of spamming be limited?

Putting It All Together Free TextBehaviorMetadata Topicality Quality Reliability Cost Flexibility

The Big Picture Four Factors, working together –User –Process –System –Collection

Course Goals Appreciate IR system capabilities and limitations Understand IR system design & implementation –For a broad range of applications and media Evaluate IR system performance Identify current IR research problems

Course Design Text/readings provide background and detail –At least one recommended reading is required Class provides organization and direction –We will not cover every important detail Assignments and project provide experience –The TA can help with the project Final exam helps focus your effort

Assumed Background Everyone: –LBSC 690 or INFM 603 or equivalent –Comfortable with learning about technology MIM Students: –Basic systems analysis, scripting languages –Some programming is helpful MLS students: –LBSC 650 and LBSC 670 –LBSC 750 or a subject access course is helpful

Grading Assignments (20%) –Mastery of concepts and experience using tools Term project (50%) –Options are described on course Web page Final exam (30%) –In-class exam

Handy Things to Know Classes will be videotaped –Available outside my office Office hours: 5 PM Mondays –Or schedule by , or ask after class Everything is on the Web –At Doug is most easily reached by

Some Things to Do This Week Assignment 1 –Due at 6 PM next Monday!! At least skim the readings before class –Don’t fall behind! Explore the Web site –Start thinking about the term project