Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.


Marti A. Hearst, SIMS 202, Fall 1997

Today
- Search using Metadata
- Comparing search with controlled vocabulary vs. free text
- A GUI for browsing and search on metadata + free text
- More Videos!

What is Metadata for?
- “Normalizing” natural language
  - distinguish homonyms
  - group synonyms together
- Organizing information
  - for search
  - for browsing
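
A minimal sketch of what "normalizing" could look like, assuming a small hand-built lookup table. The descriptor names, the `SYNONYMS` and `HOMONYMS` tables, and the `normalize` function are all hypothetical illustrations, not part of the lecture.

```python
# Hypothetical controlled vocabulary (illustrative data only).
# Synonyms collapse to one descriptor; a homonym needs context to pick a sense.
SYNONYMS = {
    "insects": "Entomology",
    "bugs": "Entomology",
    "entomology": "Entomology",
}
HOMONYMS = {
    "crane": {"bird": "Birds", "construction": "Machinery"},
}

def normalize(term, context=None):
    """Return the controlled descriptor for a term, or None if it is unmapped."""
    key = term.lower()
    if key in SYNONYMS:
        return SYNONYMS[key]
    if key in HOMONYMS:
        # Without context, a homonym cannot be resolved to one descriptor.
        return HOMONYMS[key].get(context)
    return None

print(normalize("bugs"))                   # Entomology
print(normalize("Insects"))                # Entomology
print(normalize("crane", context="bird"))  # Birds
print(normalize("crane"))                  # None
```

Grouping synonyms is a plain many-to-one lookup, while distinguishing homonyms needs some extra signal; here that signal is a caller-supplied context string, which is only one possible design.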

UWMS Data Mining Workshop

What Categories Do
- Summarize a document according to pre-defined main topics
- Compress the many ways of representing a concept into one
- Identify which subset of attributes are salient for a collection

Controlled Vocabularies
- Assign metadata from a pre-defined set of allowable categories or descriptors
- Some studies confuse human-assigned categories and controlled vocabularies
  - Could be human-assigned but from an uncontrolled set
  - Could be computer-assigned from a controlled set

Controlled Vocabulary (Svenonius 86)
- Original uses of metadata were for classification and organization
- Computers allow for search
  - initially, search over subject codes
  - more recently, free text search
  - still more recent: free text search on full text
- Controlled vocab is seen in contrast to free text search on titles, abstracts, or body

Problems with Uncontrolled Vocabularies
- If terms are extracted from titles
  - Titles may not be informative
  - Docs on same topic may not be expressed using the same vocabulary
    - insects vs. entomology
    - free trade vs. tariff
- Additionally, if terms are extracted from full text
  - term co-occurrence may be incidental
  - passing references
  - many more candidate terms

Problems with Controlled Vocabularies
- Too vague or high-level
- Potentially out of date
- Expensive to build
- Difficult to search with
  - Most not designed for search
  - How to locate categories of interest?

Category Search and Browsing
Massicotte 88 (cited in Drabenstott & Weller 96):
“The problem we are faced with is the undue display length of a browse list under a given search term. … indexes will continue to expand at an ever-increasing rate. This factor alone will eventually make the alphabetical index less and less viable as a method of searching.”
How to make use of all that category information?

Free Text vs. Controlled Vocab
- Usually, the two methods
  - retrieve different sets of documents
  - controlled vocab -> higher recall
  - free text -> higher precision
- Studies usually find it’s best to use both

Free Text vs. Controlled Vocab
- Controlled vocab -> higher recall
  - Once you locate the right category, you can retrieve all docs within that category
    - all insects!
    - all insects + bugs + vermin

Free Text vs. Controlled Vocab
- Controlled vocab -> lower precision
  - accuracy traded off for consistency
  - limited number of categories
- Free text can be more precise
  - just two specific insects
  - insect name + what it eats
- Blair & Maron using free text got high precision (~70%) and low recall (~25%)
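
A short worked example of the precision/recall trade-off described above. The document IDs and result sets are invented for illustration; they are chosen so that the free-text run lands near the ~70% precision and ~25% recall figures quoted on the slide, and they are not data from any study.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Invented collection: documents 0..19 are the relevant ones.
relevant = set(range(20))

# Free text: a narrow result set, mostly on target (5 of 7 are relevant).
free_text = set(range(5)) | {100, 101}

# Controlled vocab: a whole category, broad but noisy (16 of 32 are relevant).
category = set(range(16)) | set(range(100, 116))

p, r = precision_recall(free_text, relevant)
print(round(p, 2), round(r, 2))   # 0.71 0.25

p, r = precision_recall(category, relevant)
print(round(p, 2), round(r, 2))   # 0.5 0.8
```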

Free Text vs. Controlled Vocab
- A contradiction: (Markey et al. 80)
  - ERIC dataset and descriptors (c.v.)
  - 165 free text queries
  - 1 in 8 free text queries could not be expressed with descriptors
  - C.V. produced higher precision and lower recall, contradicting most other studies

Free Text vs. Controlled Vocab
- Why do the Markey et al. results differ from most other studies?
  - Perhaps ERIC descriptors are sparse
  - Contradiction implies a need for more investigation

Free Text vs. Controlled Vocab
- General agreement:
  - Usually the two approaches retrieve different sets of (relevant) documents
- Implication:
  - Need ranking algorithms that combine the two
- Strategies:
  - Automatically map query words into c.v.
  - Modified relevance feedback (Srinivasan 96):
    - find some good documents
    - find more docs that share their category labels (as opposed to those docs that share their free text terms)
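
The modified relevance feedback strategy can be sketched roughly as follows. This is an illustration of the general idea only, not Srinivasan's actual method: the scoring rule (count of shared labels, weighted by how often each label occurs among the good documents), the function name `category_feedback`, and the sample data are all assumptions.

```python
from collections import Counter

def category_feedback(good_ids, doc_categories, top_k=5):
    """Rank documents not yet judged by overlap with the good docs' category labels."""
    # Count how often each category label occurs among the known-good documents.
    wanted = Counter()
    for doc_id in good_ids:
        wanted.update(doc_categories[doc_id])

    scores = {}
    for doc_id, cats in doc_categories.items():
        if doc_id in good_ids:
            continue
        score = sum(wanted[c] for c in cats)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_k]

# Hypothetical category assignments (labels are illustrative).
doc_categories = {
    "d1": {"Mastectomy", "Radiotherapy"},
    "d2": {"Mastectomy", "Adverse Effects"},
    "d3": {"Radiotherapy", "Adverse Effects"},
    "d4": {"Diet"},
    "d5": {"Mastectomy", "Radiotherapy", "Adverse Effects"},
}

print(category_feedback({"d1", "d2"}, doc_categories))  # ['d5', 'd3']
```

Note that the ranking never looks at document text, which is the contrast the slide draws with ordinary term-based relevance feedback.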

How to Use Text Categories
- Mapping query words to controlled vocabulary
  - lots of research on this
  - helps in some cases, hurts in others
- Organizing retrieval results (new!)
  - problems:
    - too many categories/document
    - too many documents/category
    - the right categories aren’t there
- Idea: address difficulties by devising a better user interface.

Example: MeSH and MedLine
- MeSH Medical Category Hierarchy
  - ~18,000 labels
  - manually assigned
  - ~8 labels/article on average
  - avg depth: 4.5, max depth 9
- Top Level Categories:

  anatomy    diagnosis    related disc
  animals    psych        technology
  disease    biology      humanities
  drugs      physics

Multiple Categories per Document

  Drug    Symptom    Anatomy
  D1      S1         A1
  D2      S2         A2
  D3      S3         A3

Medical articles contain combinations of these concept types

How to Group the Category Types?

[D1 S3 A1] [D3 S2 S3] [D1 D2 S2 A2] …

                Dx Sx Ax
  Dx Sx A1      Dx S1 Ax      D1 Sx Ax
  Dx S1 A1      D1 S1 Ax      D1 Sx A1
                D1 S1 A1
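
The grouping patterns on this slide form a lattice: each one is obtained by replacing some subset of a document's specific labels with a wildcard (Dx, Sx, Ax). A minimal sketch of enumerating that lattice, assuming exactly one label per concept type (the slide's own examples show that real documents can carry several labels of one type or none); the function name is hypothetical.

```python
from itertools import product

def generalizations(drug, symptom, anatomy):
    """All patterns formed by replacing any subset of the labels with a wildcard."""
    options = [(drug, "Dx"), (symptom, "Sx"), (anatomy, "Ax")]
    patterns = [" ".join(combo) for combo in product(*options)]
    # Order from most specific (no wildcards) to most general (all wildcards).
    return sorted(patterns, key=lambda p: sum(tok.endswith("x") for tok in p.split()))

for pattern in generalizations("D1", "S1", "A1"):
    print(pattern)
```

With three concept types there are 2^3 = 8 candidate groupings per document, which hints at why the number of possible groupings becomes a user interface problem as the number of category types grows.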

Large Category Sets
- Problems for User Interfaces
  - Too many categories to browse
  - Too many docs per category
  - Docs belong to multiple categories
  - Need to integrate search
  - Need to show the documents

Grateful Med Query Specification

Grateful Med Category SubTree

Using Grateful Med
- Problems:
  - Does not integrate category selection with viewing of categories
  - Only a few categories visible at a time, with little context
  - Does not show relationship of retrieved documents to the category structure

Cat-a-Cone (Hearst & Karadi 97): Multiple Simultaneous Categories
- Key Ideas:
  - Separate documents from category labels
  - Show both simultaneously
  - Link the two for iterative feedback
- Distinguish between:
  - Searching for Documents vs.
  - Searching for Categories

[Diagram: Collection, Retrieved Documents, and Category Hierarchy, connected by arrows labeled "search", "browse", and "query terms"]

Cat-a-Cone (Hearst & Karadi 97)
- Catacomb: (definition 2b, online Websters) “A complex set of interrelated things”
- Makes use of earlier PARC work on 3D+animation:
  - Rooms (Henderson and Card 86)
  - IV: Cone Tree (Robertson, Card, Mackinlay 93)
  - Web Book (Card, Robertson, York 96)

ConeTree for Category Labels
- Browse/explore category hierarchy
  - by search on label names
  - by growing/shrinking subtrees
  - by spinning subtrees
- Affordances
  - learn meaning via ancestors, siblings
  - disambiguate meanings
  - all cats simultaneously viewable
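
Setting the 3D display aside, the data operations behind these bullets (search on label names, then read a label's meaning from its ancestors and siblings) can be sketched over a plain child-to-parent map. The hierarchy fragment and function names below are hypothetical and heavily simplified; they are not taken from MeSH or from the Cat-a-Cone implementation.

```python
# Hypothetical fragment of a category hierarchy, stored as child -> parent.
PARENT = {
    "Diseases": None,
    "Neoplasms": "Diseases",
    "Breast Neoplasms": "Neoplasms",
    "Lung Neoplasms": "Neoplasms",
}

def find_labels(query):
    """Search on label names: case-insensitive substring match."""
    q = query.lower()
    return sorted(label for label in PARENT if q in label.lower())

def ancestors(label):
    """Path from the label's parent up to the root."""
    path = []
    node = PARENT[label]
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def siblings(label):
    """Other labels that share the same parent."""
    parent = PARENT[label]
    return sorted(l for l, p in PARENT.items() if p == parent and l != label)

print(find_labels("breast"))          # ['Breast Neoplasms']
print(ancestors("Breast Neoplasms"))  # ['Neoplasms', 'Diseases']
print(siblings("Breast Neoplasms"))   # ['Lung Neoplasms']
```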

Virtual Book for Result Sets
- Categories on Page (Retrieved Document) linked to Categories in Tree
- Flipping through Book Pages causes some Subtrees to Expand and Contract
- Most Subtrees remain unchanged
- Book can be Stored for later Re-Use
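
One way to picture the page-to-tree linkage is as a set difference between the category labels of the previous page and those of the next one. This is a guess at the bookkeeping involved, not a description of how Cat-a-Cone is actually implemented; `flip_page` and the sample labels are hypothetical.

```python
def flip_page(prev_categories, next_categories):
    """Decide which category subtrees to expand or contract on a page flip."""
    prev_categories = set(prev_categories)
    next_categories = set(next_categories)
    return {
        # Categories that appear only on the new page get expanded.
        "expand": sorted(next_categories - prev_categories),
        # Categories that appear only on the old page get contracted.
        "contract": sorted(prev_categories - next_categories),
        # Categories shared by both pages stay as they are.
        "unchanged": sorted(prev_categories & next_categories),
    }

page1 = {"Mastectomy", "Radiotherapy"}
page2 = {"Radiotherapy", "Adverse Effects"}
print(flip_page(page1, page2))
# {'expand': ['Adverse Effects'], 'contract': ['Mastectomy'], 'unchanged': ['Radiotherapy']}
```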

Example Query
Patient query on Breast Cancer dataset:
“Do I have to have radiation if I have a mastectomy, and what would be the effects?”
How does the user know which categories?

Interactive Category Hierarchy
- Smoothly interlink:
  - search over categories
  - search over document contents
  - browsing of categories
  - browsing of retrieved documents

Improvements over Grateful Med
- Integrate category selection with viewing of categories
- Show all categories + context
- Show relationship of retrieved documents to the category structure

Comparison Study
- H. Chen, A. Houston, R. Sewell, and B. Schatz, JASIS, to appear
- Comparison: Kohonen Map and Yahoo
- Task:
  - “Window shop” for interesting home page
  - Repeat with other interface
- Results:
  - Starting with map could repeat in Yahoo (8/11)
  - Starting with Yahoo unable to repeat in map (2/14)

Concept Landscapes (e.g., Lin, Chen, Wise et al.)
[Figure: concept map with regions labeled Pharmacology, Anatomy, Legal, Disease, Hospitals]
- Single concept per document
- No titles
- Browsing without search

Comparison Study (cont.)
- Participants liked:
  - Correspondence of region size to number of documents in region
  - Overview (but also wanted zoom)
  - Ease of jumping from one topic to another
  - Multiple routes to topics
  - Use of category and subcategory labels

Comparison Study (cont.)
- Participants wanted:
  - hierarchical organization
  - other ordering of concepts (alphabetical)
  - integration of browsing and search
  - correspondence of color to meaning
  - more meaningful labels
  - labels at same level of abstraction
  - fit more labels in the given space
  - combined keyword and category search
  - multiple category assignment (sports+entertain)

Comparison Study (cont.)
- Cat-a-Cone
  - contains most of the desired properties
  - lacks the disliked properties

Summary: Cat-a-Cone
- Interface that smoothly integrates
  - search over multiple categories
  - search over document contents
  - browsing of multiple categories
  - browsing of retrieved documents
- Iterative, Interactive
- Retain partial results in a workspace