Final Review
Prof. Marti Hearst
SIMS 202
Marti Hearst UCB SIMS 202
Search and Retrieval: Outline of Part II of SIMS 202
  Overview: Finding Out About
  Standard Information Retrieval Models
  Evaluation of IR Systems
  IR Systems (Implementation Issues)
  Web-Specific Issues
  Search Strategies and Tactics
  User Interface Issues
  Search on Metadata
  Search on Hypertext
Human Aspects
  Finding Out About
    types of information needs
    specifying information needs (queries)
    the process of information access
    search strategies
    “sensemaking”
  Relevance
  User Interface
Finding Out About
  (This discussion is drawn from Belew’s manuscript)
  Three phases:
    Asking of a question
    Construction of an answer
    Assessment of the answer
  Part of an iterative process
Information Retrieval
  Revised Task Statement:
    Build a system that retrieves documents that users are likely to find relevant to their queries.
  This set of assumptions underlies the field of Information Retrieval.
[Figure: information retrieval process diagram. Labels: Information need, text input, Parse, Query, Collections, Pre-process, Index, Rank]
Query Languages
  Express the user’s information need
  Components:
    query language
    program to interpret the language
    document collection from which to retrieve documents that suit the interpreted query
Types of Query Languages
  Boolean
  Natural language (free style)
  Hybrid structured (metadata) and free text
  Form-based
  SQL (for database queries)
Boolean Queries
  How queries are satisfied
  Boolean logic
    meaning of AND, OR, NOT
    De Morgan’s law
    precedence ordering
  Variations
    faceted Boolean
    proximity operators
    phrases
    filters/segments
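As a review aid, here is a minimal sketch of how Boolean operators can be read as set operations over document ids. The toy collection and the helper names (`postings`, `and_`, `or_`, `not_`) are assumptions made for illustration; they are not taken from the course materials or from any particular system.

```python
# Toy collection: document id -> text (hypothetical data for illustration).
docs = {
    1: "boolean retrieval uses set operations",
    2: "ranking algorithms weight terms",
    3: "boolean queries and ranking compared",
}
all_ids = set(docs)


def postings(term):
    """Return the set of document ids whose text contains the term."""
    return {doc_id for doc_id, text in docs.items() if term in text.split()}


def and_(a, b):
    """A AND B: documents that are in both sets."""
    return a & b


def or_(a, b):
    """A OR B: documents that are in either set."""
    return a | b


def not_(a):
    """NOT A: every document in the collection that is not in A."""
    return all_ids - a


boolean = postings("boolean")
ranking = postings("ranking")

both = and_(boolean, ranking)                # boolean AND ranking
either = or_(boolean, ranking)               # boolean OR ranking
boolean_only = and_(boolean, not_(ranking))  # boolean AND NOT ranking

# De Morgan's law: NOT (A AND B) equals (NOT A) OR (NOT B).
lhs = not_(and_(boolean, ranking))
rhs = or_(not_(boolean), not_(ranking))
print(both, either, boolean_only, lhs == rhs)
```

Writing each operator as an explicit function call sidesteps precedence ordering; a real query parser would have to decide, for example, whether `a OR b AND c` groups as `a OR (b AND c)`.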
Evaluation of IR Systems
  Why, What, and How?
  Relevance
  Measuring Effectiveness
  Precision and Recall
  F-measure
  Cutoff levels
  TREC
  Blair & Maron study
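The effectiveness measures listed above can be sketched in a few lines. This assumes the common definitions: precision is the fraction of retrieved documents that are relevant, recall is the fraction of relevant documents that are retrieved, and the F-measure is their weighted harmonic mean. The function names and the sample ranking are hypothetical.

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Return (precision, recall, F) for a retrieved set and a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    # Weighted harmonic mean; beta = 1 weights precision and recall equally.
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


def precision_at_k(ranking, relevant, k):
    """Precision measured at a cutoff level: only the top k results count."""
    top_k = ranking[:k]
    return sum(1 for doc in top_k if doc in relevant) / k


# Hypothetical ranked result list and relevance judgments.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d5", "d8"}

p, r, f1 = precision_recall_f(ranking, relevant)
p_at_2 = precision_at_k(ranking, relevant, 2)
print(p, r, round(f1, 4), p_at_2)
```

Here two of the five retrieved documents are relevant, and two of the four relevant documents were retrieved, so precision and recall differ even though the number of hits is the same.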
Ranking Algorithms
  As opposed to Boolean
  How they work
  The vector document representation
  Assigning weights to terms
    why do it
    tf*idf measure
  Similarity measures
    vector space similarity measure
    how do ranking algorithms behave?
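A small sketch of tf*idf weighting with a vector space (cosine) similarity measure may help when reviewing these points. It assumes one common variant, raw term frequency times log(N / document frequency); other weightings exist, and the documents, query, and function names here are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical three-document collection.
docs = {
    "d1": "information retrieval ranks documents",
    "d2": "boolean retrieval matches documents exactly",
    "d3": "web crawling finds pages",
}


def tokenize(text):
    return text.lower().split()


n_docs = len(docs)
# Document frequency: in how many documents each term appears.
doc_freq = Counter(t for text in docs.values() for t in set(tokenize(text)))


def tfidf_vector(text):
    """Sparse tf*idf vector; terms unseen in the collection are skipped."""
    counts = Counter(tokenize(text))
    return {t: tf * math.log(n_docs / doc_freq[t])
            for t, tf in counts.items() if t in doc_freq}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


doc_vectors = {doc_id: tfidf_vector(text) for doc_id, text in docs.items()}
query_vector = tfidf_vector("information retrieval")

# Rank every document by similarity to the query, best first.
ranked = sorted(docs, key=lambda d: cosine(query_vector, doc_vectors[d]),
                reverse=True)
print(ranked)
```

Unlike a Boolean AND, the ranking still returns d2, which matches only one query term, but places it below d1. The rarer term "information" also contributes more to the score than "retrieval", which appears in two of the three documents.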
Web Search Engines
  Ranking algorithms
  Web crawling algorithms
  How web search differs from other kinds of search
IR Systems
  Inverted Files/Indexes
  How documents are converted to inverted indexes
  How the files are used for ranking documents
  The Cheshire II system
  Using Lexis/Nexis
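The conversion of documents into an inverted index, and its use for ranking, can be sketched as follows. This is a simplified illustration and not a description of how Cheshire II or Lexis/Nexis work internally: the index maps each term to a posting list of document ids with term frequencies, and the scoring rule (sum of matching term frequencies) is a deliberately naive stand-in for a real weighting scheme.

```python
from collections import defaultdict

# Hypothetical collection: document id -> text.
docs = {
    1: "the cat sat on the mat",
    2: "the dog sat",
    3: "cat and dog",
}


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def rank(query, index):
    """Score documents by visiting only the posting lists of query terms."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = build_inverted_index(docs)
results = rank("cat sat", index)
print(index["the"], results)
```

The point of the structure is that answering a query touches only the posting lists for the query terms, rather than scanning every document.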
Relevance Feedback
  Modify existing query based on relevance judgments
    add terms and/or
    reweight terms
  Automatic or allow users to select from automated list
  Rocchio algorithm
  How it affects search outcome
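A minimal sketch of Rocchio-style query modification, under stated assumptions: queries and documents are sparse term-weight dictionaries, the new query is alpha times the old query plus beta times the mean of the relevant documents minus gamma times the mean of the nonrelevant ones, and negative weights are dropped. The default values for `alpha`, `beta`, and `gamma` are commonly quoted textbook choices, not values taken from the course, and the example vectors are invented.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a reweighted query vector from relevance judgments."""
    terms = set(query)
    for vec in relevant + nonrelevant:
        terms.update(vec)

    new_query = {}
    for t in terms:
        # Mean weight of the term over the judged relevant documents.
        pos = (sum(v.get(t, 0.0) for v in relevant) / len(relevant)
               if relevant else 0.0)
        # Mean weight of the term over the judged nonrelevant documents.
        neg = (sum(v.get(t, 0.0) for v in nonrelevant) / len(nonrelevant)
               if nonrelevant else 0.0)
        weight = alpha * query.get(t, 0.0) + beta * pos - gamma * neg
        if weight > 0:  # assumption: terms with non-positive weight are dropped
            new_query[t] = weight
    return new_query


# Hypothetical judgments for the ambiguous query "jaguar".
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "cat": 1.0}, {"jaguar": 1.0, "cat": 0.5}]
nonrelevant = [{"jaguar": 1.0, "car": 1.0}]

new_query = rocchio(query, relevant, nonrelevant)
print(new_query)
```

The example shows both effects named on the slide: the term "cat" is added to the query, and "jaguar" is reweighted, while "car" never enters because it only occurs in the nonrelevant document.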
Information Seeking Behavior
  Search tactics
  Search strategies
  Theories or Models
    Bates
    O’Day and Jeffries
    Russell et al.
  How information is used after it is found
User Interfaces
  Why important, the role of the interface
  How to show the relationship between query, collection, and retrieval results
    TileBars
  How to support the process of search
    Sketchtrieve Informal Interface
    DLITE (only on videotape)
Metadata in Search
  What is metadata for?
  Pros and cons of search using controlled vocabulary
  Pros and cons of search using uncontrolled vocabulary
  Combining metadata and uncontrolled vocabulary in search
  Convert free text into controlled vocab
  Organizing result sets (Cat-a-Cone)
Hypertext and Search
  Components of a hypertext system
  Browsing vs. search on hypertext
  General tendencies for searching hypertext
  Egan et al. study (Superbook)
  Campagnoni & Ehrlich study
Things we didn’t get to
  Search Issues
  Source Selection
  Genre
  Quality/Verity
  Collaborative Filtering (in reader part II)
  Question Answering
  Multilingual Search (in reader part II)
  Machine Learning (in reader part II)
  AI/Language Analysis (in reader part I)
Some Follow-on Courses
  240 Principles of Information Retrieval (Larson Sp 98)
  257 Database Management (Larson Sp 98)
  247 Information Visualization (Hearst Sp 98)
  213 User Interface Design and Development (Hearst Sp 99)
  214 Needs Assessment and Evaluation of Information Systems
  245 Organization of Information in Collections