Search Session 12 LBSC 690 Information Technology.

Similar presentations
Publishers Web Sites Standard Features. Objectives Access publishers websites Identify general features available on most publishers websites Know how.

UCLA : GSE&IS : Department of Information Studies JF : 276lec1.ppt : 5/2/2015 : 1 INFS INFORMATION RETRIEVAL SYSTEMS Week.
Search and Ye Shall Find (maybe) Seminar on Emergent Information Technology August 20, 2007 Douglas W. Oard.
Natural Language Processing WEB SEARCH ENGINES August, 2002.
Ranked Retrieval INST 734 Module 3 Doug Oard. Agenda  Ranked retrieval Similarity-based ranking Probability-based ranking.
Search Engines and Information Retrieval
Information Retrieval Review
ISP 433/533 Week 2 IR Models.
1 CS 430 / INFO 430 Information Retrieval Lecture 8 Query Refinement: Relevance Feedback Information Filtering.
Evidence from Behavior LBSC 796/CMSC 828o Douglas W. Oard Session 5, February 23, 2004.
The Vector Space Model LBSC 796/CMSC828o Session 3, February 9, 2004 Douglas W. Oard.
Evidence from Behavior LBSC 796/INFM 719R Douglas W. Oard Session 7, October 22, 2007.
Information Retrieval in Practice
Text Retrieval and Spreadsheets Class 4 LBSC 690 Information Technology.
Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang National Central University
Information Access Douglas W. Oard College of Information Studies and Institute for Advanced Computer Studies Design Understanding.
Search Engines Session 11 LBSC 690 Information Technology.
Information Retrieval
1 CS/INFO 430 Information Retrieval Lecture 23 Usability 1.
Personalized Ontologies for Web Search and Caching Susan Gauch Information and Telecommunications Technology Center Electrical Engineering and Computer.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Search Engine Optimization HOW AND WHY Introduction to SEO SEO stands for “Search Engine Optimization” and often refers to the ability to easily locate.
Query Relevance Feedback and Ontologies How to Make Queries Better.
Search Engines and Information Retrieval Chapter 1.
It is a clinical search engine developed by Elsevier Health Books* - over 1,100 of Elsevier's medical and surgical reference books Journals* - over 500.
INFO624 - Week 4 Query Languages and Query Operations Dr. Xia Lin Associate Professor College of Information Science and Technology Drexel University.
Basics of Information Retrieval Lillian N. Cassel Some of these slides are taken or adapted from Source:
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
NCBI/WHO PubMed/Hinari Course Introduction Session #1, Sept 13, 2005 Session #2, Sept 14, 2005 Internet Concepts and Scientific Literature Resources Ho.
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.
1 Information Retrieval Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
XP New Perspectives on The Internet, Sixth Edition— Comprehensive Tutorial 3 1 Searching the Web Using Search Engines and Directories Effectively Tutorial.
Chapter 6: Information Retrieval and Web Search
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Collaborative Information Retrieval - Collaborative Filtering systems - Recommender systems - Information Filtering Why do we need CIR? - IR system augmentation.
Search Engine Architecture
Searching the web Enormous amount of information –In 1994, 100 thousand pages indexed –In 1997, 100 million pages indexed –In June, 2000, 500 million pages.
The Structure of Information Retrieval Systems LBSC 708A/CMSC 838L Douglas W. Oard and Philip Resnik Session 1: September 4, 2001.
Structure of IR Systems INST 734 Module 1 Doug Oard.
Web Search Module 6 INST 734 Doug Oard. Agenda The Web Crawling  Web search.
Personalized Interaction With Semantic Information Portals Eric Schwarzkopf DFKI
Information in the Digital Environment Information Seeking Models Dr. Dania Bilal IS 530 Spring 2005.
How Do We Find Information?. Key Questions  What are we looking for?  How do we find it?  Why is it difficult? “A prudent question is one-half of wisdom”
HTML Basic. What is HTML HTML is a language for describing web pages. HTML stands for Hyper Text Markup Language HTML is not a programming language, it.
Information Retrieval
Structure of IR Systems LBSC 796/INFM 718R Session 1, January 26, 2011 Doug Oard.
Text Retrieval and Spreadsheets Session 4 LBSC 690 Information Technology.
Ranked Retrieval INST 734 Module 3 Doug Oard. Agenda Ranked retrieval  Similarity-based ranking Probability-based ranking.
Characteristics of Information on the Web Dania Bilal IS 530 Spring 2006.
Augmenting (personal) IR Readings Review Evaluation Papers returned & discussed Papers and Projects checkin time.
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Query Refinement and Relevance Feedback.
Search Engines Session 5 INST 301 Introduction to Information Science.
Search Strategies Session 7 INST 301 Introduction to Information Science.
1 Personalizing Search via Automated Analysis of Interests and Activities Jaime Teevan, MIT Susan T. Dumais, Microsoft Eric Horvitz, Microsoft SIGIR 2005.
Knowledge and Information Retrieval Dr Nicholas Gibbins 32/4037.
Searching the Web for academic information Ruth Stubbings.
Search Engine Optimization
Information Retrieval in Practice
Search Engine Architecture
Personalized Social Image Recommendation
Introduction to Information Retrieval
Search Engine Architecture
Structure of IR Systems
Introduction to Search Engines
Presentation transcript:

Search Session 12 LBSC 690 Information Technology

Agenda The search process Information retrieval Recommender systems Evaluation

Information “Retrieval” Find something that you want –The information need may or may not be explicit Known item search –Find the class home page Answer seeking –Is Lexington or Louisville the capital of Kentucky? Directed exploration –Who makes videoconferencing systems?

Information Retrieval Paradigm (diagram: the user browses or searches with a query, selects and examines documents, leading to document delivery)

Supporting the Search Process (diagram, user side: source selection → query formulation → query → search → ranked list → selection → examination → document → document delivery, with query reformulation / relevance feedback and source reselection as feedback loops into the IR system; the stages are labeled nominate, choose, and predict)

Supporting the Search Process (diagram, system side: acquisition builds the collection, indexing builds the index, and the IR system uses that index to serve the same path from query formulation through search, ranked list, examination, and document delivery)

Human-Machine Synergy Machines are good at: –Doing simple things accurately and quickly –Scaling to larger collections in sublinear time People are better at: –Accurately recognizing what they are looking for –Evaluating intangibles such as “quality” Both are pretty bad at: –Mapping consistently between words and concepts

Search Component Model (diagram: human judgment and query formulation turn an information need into a query, and a representation function plus query processing produce the query representation; on the other side, a representation function plus document processing turn a document into a document representation; a comparison function matches the two and assigns a retrieval status value estimating each document's utility)

Ways of Finding Text Searching metadata –Using controlled or uncontrolled vocabularies Free text –Characterize documents by the words they contain Social filtering –Exchange and interpret personal ratings

“Exact Match” Retrieval Find all documents with some characteristic –Indexed as “Presidents -- United States” –Containing the words “Clinton” and “Peso” –Read by my boss A set of documents is returned –Hopefully, not too many or too few –Usually listed in date or alphabetical order
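
A minimal sketch of how exact-match retrieval can be implemented with an inverted index (the documents and names below are invented for illustration):

```python
# Exact-match ("Boolean AND") retrieval with an inverted index.

docs = {
    1: "clinton met mexican officials about the peso",
    2: "the peso fell sharply against the dollar",
    3: "clinton signed the budget bill",
}

# Build the inverted index: term -> set of document ids.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def exact_match(query):
    """Return the unordered set of documents containing every query term."""
    result = None
    for term in query.split():
        postings = index.get(term, set())
        result = postings if result is None else result & postings
    return result or set()

print(sorted(exact_match("clinton peso")))  # -> [1]
```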

Ranked Retrieval Put most useful documents near top of a list –Possibly useful documents go lower in the list Users can read down as far as they like –Based on what they read, time available,... Provides useful results from weak queries –Untrained users find exact match harder to use

Similarity-Based Retrieval Assume “most useful” = most similar to query Weight terms based on two criteria: –Repeated words are good cues to meaning –Rarely used words make searches more selective Compare weights with query –Add up the weights for each query term –Put the documents with the highest total first

Simple Example: Counting Words Documents: 1: Nuclear fallout contaminated Texas. 2: Information retrieval is interesting. 3: Information retrieval is complicated. Query: recall and fallout measures for information retrieval Term-document counts:

Term          Doc 1  Doc 2  Doc 3  Query
nuclear         1      -      -      -
fallout         1      -      -      1
Texas           1      -      -      -
contaminated    1      -      -      -
interesting     -      1      -      -
complicated     -      -      1      -
information     -      1      1      1
retrieval       -      1      1      1
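
The counting in this example is simple enough to run; here is a minimal Python sketch of the same ranking (raw counts only, as on the slide, with no stopword removal or weighting):

```python
# Rank documents by summed query-term counts, as in the slide's example.
from collections import Counter

docs = {
    1: "nuclear fallout contaminated texas",
    2: "information retrieval is interesting",
    3: "information retrieval is complicated",
}
query = "recall and fallout measures for information retrieval"

def score(text, query):
    counts = Counter(text.split())          # term -> count in this document
    return sum(counts[t] for t in query.split())

ranking = sorted(docs, key=lambda d: score(docs[d], query), reverse=True)
for d in ranking:
    print(d, score(docs[d], query))
# Documents 2 and 3 match "information" and "retrieval" (score 2);
# document 1 matches only "fallout" (score 1).
```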

Discussion Point: Which Terms to Emphasize? Major factors –Uncommon terms are more selective –Repeated terms provide evidence of meaning Adjustments –Give more weight to terms in certain positions Title, first paragraph, etc. –Give less weight to each term in longer documents –Ignore documents that try to “spam” the index Invisible text, excessive use of the “meta” field, …

“Okapi” Term Weights (the slide showed the formula as an image; the standard Okapi BM25 weight is) w_{i,j} = \frac{tf_{i,j}}{k_1\left[(1-b) + b \cdot \frac{dl_j}{avdl}\right] + tf_{i,j}} \cdot \log\frac{N - df_i + 0.5}{df_i + 0.5}, where the first factor is the TF component and the second is the IDF component.
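
A minimal sketch of computing such a weight in Python, using the common default parameters k1 = 1.2 and b = 0.75 (the parameter values and the +1 smoothing in the IDF are assumptions, not taken from the slide):

```python
# Okapi BM25 term weight: TF component x IDF component.
import math

def bm25_weight(tf, df, N, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """tf: term count in this document; df: number of documents containing
    the term; N: collection size; doc_len/avg_doc_len: length normalization."""
    tf_component = tf / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf)
    idf_component = math.log((N - df + 0.5) / (df + 0.5) + 1)  # +1 keeps it positive
    return tf_component * idf_component

# A term occurring twice in an average-length document, in 5 of 1000 docs:
print(bm25_weight(tf=2, df=5, N=1000, doc_len=100, avg_doc_len=100))
```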

Index Quality Crawl quality –Comprehensiveness, dead links, duplicate detection Document analysis –Frames, metadata, imperfect HTML, … Document extension –Anchor text, source authority, category, language, … Document restriction (ephemeral text suppression) –Banner ads, keyword spam, …

Indexing Anchor Text A type of “document expansion” –Terms near links describe content of the target Works even when you can’t index content –Image retrieval, uncrawled links, …
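
A sketch of the idea with invented data: gather the anchor text of every inbound link and index those terms under the target page, so that even a content-free target (an image, an uncrawled page) becomes searchable:

```python
# Document expansion with anchor text: terms near inbound links
# are indexed as if they occurred in the target page itself.

links = [
    # (source page, target page, anchor text) -- all invented
    ("a.html", "photo.jpg", "sunset over the chesapeake bay"),
    ("b.html", "photo.jpg", "chesapeake sunset photo"),
    ("c.html", "ua.html", "united airlines home page"),
]

anchor_index = {}  # term -> set of target pages
for _source, target, anchor in links:
    for term in anchor.split():
        anchor_index.setdefault(term, set()).add(target)

# The image is now findable by text, although it contains none:
print(anchor_index["sunset"])  # -> {'photo.jpg'}
```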

Queries on the Web (1999) Low query construction effort –2.35 (often imprecise) terms per query –20% use operators –22% are subsequently modified Low browsing effort –Only 15% view more than one page –Most look only “above the fold” One study showed that 10% don’t know how to scroll!

Types of User Needs Informational (30-40% of AltaVista queries) –What is a quark? Navigational –Find the home page of United Airlines Transactional –Data: What is the weather in Paris? –Shopping: Who sells a Vaio Z505RX? –Proprietary: Obtain a journal article

Searching Other Languages (diagram: query formulation → query translation → translated query → search → ranked list → examination → document use, with query reformulation as a feedback loop; MT-translated “headlines” and English term definitions support examination and reformulation)
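
One simple realization of the query-translation box is dictionary-based translation; a minimal sketch with an invented bilingual term list:

```python
# Dictionary-based query translation: replace each query term with its
# known translations, keeping the term itself as a fallback.

bilingual_dict = {  # toy English -> French entries, invented for illustration
    "nuclear": ["nucleaire"],
    "fallout": ["retombees"],
    "retrieval": ["recherche", "extraction"],
}

def translate_query(query):
    translated = []
    for term in query.split():
        translated.extend(bilingual_dict.get(term, [term]))
    return " ".join(translated)

print(translate_query("nuclear fallout retrieval"))
# -> "nucleaire retombees recherche extraction"
```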

Speech Retrieval Architecture (diagram: speech recognition, boundary tagging, and content tagging prepare the spoken collection; query formulation feeds automatic search, followed by interactive selection)

Rating-Based Recommendation Use ratings to describe objects –Personal recommendations, peer review, … Beyond topicality: –Accuracy, coherence, depth, novelty, style, … Has been applied to many modalities –Books, Usenet news, movies, music, jokes, beer, …
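
A minimal sketch of nearest-neighbor recommendation over explicit ratings (the ratings matrix and the agreement measure are invented for illustration): find the user whose co-rated items agree most with yours, then suggest items they rated highly that you have not seen.

```python
# Nearest-neighbor collaborative filtering over explicit ratings (1-5).

ratings = {
    "alice": {"dune": 5, "alien": 4, "amelie": 1},
    "bob":   {"dune": 4, "alien": 5, "heat": 4},
    "carol": {"amelie": 5, "heat": 2},
}

def agreement(u, v):
    """Score co-rated items: small rating differences score high."""
    common = set(ratings[u]) & set(ratings[v])
    return sum(4 - abs(ratings[u][i] - ratings[v][i]) for i in common)

def recommend(user):
    others = [v for v in ratings if v != user]
    neighbor = max(others, key=lambda v: agreement(user, v))
    unseen = set(ratings[neighbor]) - set(ratings[user])
    return sorted(unseen, key=lambda i: ratings[neighbor][i], reverse=True)

print(recommend("alice"))  # bob is the closest neighbor -> ['heat']
```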

Using Positive Information

Using Negative Information

Problems with Explicit Ratings Cognitive load on users -- people don’t like to provide ratings Rating sparsity -- enough raters are needed before recommendations can be made No way to recommend new items that have not yet been rated by any user

Implicit Evidence for Ratings

Click Streams Browsing histories are easily captured –Send all links to a central site –Record from and to pages and user’s cookie –Redirect the browser to the desired page Reading time is correlated with interest –Can be used to build individual profiles –Used to target advertising by doubleclick.com
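
The redirect trick described above can be sketched as a tiny web server (hypothetical; a real deployment would also record the user's cookie and a timestamp):

```python
# A toy redirect logger: links point here with ?to=<target>&from=<page>;
# the server records the click, then bounces the browser to the target.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class ClickLogger(BaseHTTPRequestHandler):
    def do_GET(self):
        params = parse_qs(urlparse(self.path).query)
        target = params.get("to", ["/"])[0]
        origin = params.get("from", ["?"])[0]
        print("click:", origin, "->", target)  # record from-page and to-page
        self.send_response(302)                # redirect the browser onward
        self.send_header("Location", target)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), ClickLogger).serve_forever()
```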

Estimating Authority from Links (diagram: hubs and authorities; a hub page links to many authorities, and an authority page is linked to by many hubs)
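
This mutual reinforcement is what Kleinberg's HITS algorithm computes: a page's authority score sums the hub scores of the pages linking to it, its hub score sums the authority scores of the pages it links to, and the two are iterated with normalization. A minimal sketch on an invented link graph:

```python
# HITS: iteratively refine hub and authority scores on a link graph.
links = {  # toy graph: page -> pages it links to
    "h1": ["a1", "a2"],
    "h2": ["a1", "a3"],
    "h3": ["a1"],
    "a1": [], "a2": [], "a3": [],
}

pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):
    # Authority: sum of hub scores of pages linking in.
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # Hub: sum of authority scores of pages linked to.
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # Normalize so scores do not blow up.
    for d in (auth, hub):
        norm = sum(v * v for v in d.values()) ** 0.5 or 1.0
        for p in d:
            d[p] /= norm

print(max(pages, key=auth.get))  # -> a1, linked from every hub
```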

Information Retrieval Types (figure; source: Ayse Goker)

Hands On: Try Some Search Engines Web Pages (using spatial layout) – Images (based on image similarity) – Multimedia (based on metadata) – Movies (based on recommendations) – Grey literature (based on citations) –

Evaluation What can be measured that reflects the searcher’s ability to use a system? (Cleverdon, 1966) –Coverage of Information –Form of Presentation –Effort required/Ease of Use –Time and Space Efficiency –Recall –Precision (recall and precision together measure effectiveness)

Measures of Effectiveness (the slide drew the relevant and retrieved documents as overlapping sets): \text{precision} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|}, \quad \text{recall} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|}
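
The two set-based definitions translate directly into code; a minimal sketch:

```python
# Precision and recall from judged document sets.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 10 documents retrieved, 4 of them among the 8 relevant ones:
print(precision_recall(range(10), [2, 5, 7, 9, 11, 12, 13, 14]))
# -> (0.4, 0.5)
```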

Precision-Recall Curves (figure; source: Ellen Voorhees, NIST)

Affective Evaluation Measure stickiness through frequency of use –Non-comparative, long-term Key factors (from cognitive psychology): –Worst experience –Best experience –Most recent experience Highly variable effectiveness is undesirable –Bad experiences are particularly memorable

Other Web Search Quality Factors Spam suppression –“Adversarial information retrieval” –Every source of evidence has been spammed Text, queries, links, access patterns, … “Family filter” accuracy –Link analysis can be very helpful

Summary Search is a process engaged in by people Human-machine synergy is the key Content and behavior offer useful evidence Evaluation must consider many factors