Published by Terence Spencer. Modified over 9 years ago.
1
Using Metadata in Search
Prof. Marti Hearst
SIMS 202, Lecture 27
2
Marti A. Hearst, SIMS 202, Fall 1997
Today
- Search using Metadata
- Comparing search with controlled vocabulary vs. free text
- A GUI for browsing and search on metadata + free text
- More Videos!
3
What is Metadata for?
- “Normalizing” natural language
  - distinguish homonyms
  - group synonyms together
- Organizing information
  - for search
  - for browsing
4
UWMS Data Mining Workshop
What Categories Do
- Summarize a document according to pre-defined main topics
- Compress the many ways of representing a concept into one
- Identify which subset of attributes are salient for a collection
5
Controlled Vocabularies
- Assign metadata from a pre-defined set of allowable categories or descriptors
- Some studies confuse human-assigned categories and controlled vocabularies
  - Could be human-assigned but from an uncontrolled set
  - Could be computer-assigned from a controlled set
6
Controlled Vocabulary (Svenonius 86)
- Original uses of metadata were for classification and organization
- Computers allow for search
  - initially, search over subject codes
  - more recently, free text search
  - still more recent: free text search on full text
- Controlled vocab is seen in contrast to free text search on titles, abstracts, or body
7
Problems with Uncontrolled Vocabularies
- If terms extracted from titles
  - Titles may not be informative
  - Docs on same topic may not be expressed using the same vocabulary
    - insects vs. entomology
    - free trade vs. tariff
- Additionally, if terms extracted from full text
  - term co-occurrence may be incidental
  - passing references
  - many more candidate terms
8
Problems with Controlled Vocabularies
- Too vague or high-level
- Potentially out of date
- Expensive to build
- Difficult to search with
- Most not designed for search
- How to locate categories of interest?
9
Category Search and Browsing
Massicotte 88 (cited in Drabenstott & Weller 96):
“The problem we are faced with is the undue display length of a browse list under a given search term. … indexes will continue to expand at an ever-increasing rate. This factor alone will eventually make the alphabetical index less and less viable as a method of searching.”
How to make use of all that category information?
10
Free Text vs. Controlled Vocab
- Usually, the two methods
  - retrieve different sets of documents
  - controlled vocab -> higher recall
  - free text -> higher precision
- Studies usually find it’s best to use both
11
Free Text vs. Controlled Vocab
- Controlled vocab -> higher recall
  - Once you locate the right category, you can retrieve all docs within that category
    - all insects!
    - all insects + bugs + vermin
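The recall argument on this slide can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the four documents, the `INSECTS` and `TRADE` labels, and the `TERM_TO_CATEGORY` mapping are hypothetical and do not come from the lecture.

```python
# Hypothetical mapping from free-text terms to a single controlled category.
TERM_TO_CATEGORY = {"insects": "INSECTS", "bugs": "INSECTS", "vermin": "INSECTS"}

# Toy collection: each document has free text and assigned category labels.
DOCS = {
    "d1": {"text": "common garden insects", "categories": {"INSECTS"}},
    "d2": {"text": "bugs found in kitchens", "categories": {"INSECTS"}},
    "d3": {"text": "controlling vermin on farms", "categories": {"INSECTS"}},
    "d4": {"text": "free trade agreements", "categories": {"TRADE"}},
}


def free_text_search(term):
    """Return ids of documents whose text contains the exact term."""
    return sorted(d for d, doc in DOCS.items() if term in doc["text"].split())


def category_search(term):
    """Map the term to its category, then return every document in it."""
    category = TERM_TO_CATEGORY.get(term)
    if category is None:
        return []
    return sorted(d for d, doc in DOCS.items() if category in doc["categories"])
```

In this toy data, `free_text_search("insects")` finds only the one document that uses that word, while `category_search("insects")` also returns the documents written in terms of bugs and vermin.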
12
Free Text vs. Controlled Vocab
- Controlled vocab -> lower precision
  - accuracy traded off for consistency
  - limited number of categories
  - free text can be more precise
    - just two specific insects
    - insect name + what it eats
- Blair & Maron using free text got high precision (~70%) and low recall (~25%)
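For reference, precision and recall can be computed as in the minimal sketch below. The document ids are made up; the counts were chosen only so that the result echoes the approximate figures quoted on the slide, and they are not data from the Blair & Maron study.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative numbers only: 20 relevant documents exist in the collection;
# the query returns 7 documents, 5 of which are relevant.
relevant = {f"r{i}" for i in range(20)}
retrieved = {"r0", "r1", "r2", "r3", "r4", "n0", "n1"}
precision, recall = precision_recall(retrieved, relevant)
```

Here precision is 5/7 (about 71%) and recall is 5/20 (25%): most of what came back is relevant, but most of the relevant material was missed.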
13
Free Text vs. Controlled Vocab
- A contradiction: (Markey et al. 80)
  - ERIC dataset and descriptors (c.v.)
  - 165 free text queries
  - 1 in 8 free text queries could not be expressed with descriptors
  - C.V. produced higher precision and lower recall, contradicting most other studies
14
Free Text vs. Controlled Vocab
- Why do the Markey et al. results differ from most other studies?
  - Perhaps ERIC descriptors are sparse
- Contradiction implies a need for more investigation
15
Free Text vs. Controlled Vocab
- General agreement:
  - Usually the two approaches retrieve different sets of (relevant) documents
- Implication:
  - Need ranking algorithms that combine the two
- Strategies:
  - Automatically map query words into c.v.
  - Modified relevance feedback: (Srinivasan 96)
    - find some good documents
    - find more docs that share their category labels (as opposed to those docs that share their free text terms)
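The category-based feedback idea can be sketched roughly as follows. This is not the method from Srinivasan 96, only a simple label-overlap score written for illustration; the documents, the labels, and the `category_feedback` function are all hypothetical.

```python
from collections import Counter

# Hypothetical category assignments for a toy collection.
DOC_CATEGORIES = {
    "d1": {"Drug", "Symptom"},
    "d2": {"Drug", "Anatomy"},
    "d3": {"Symptom"},
    "d4": {"Technology"},
    "d5": {"Drug", "Symptom", "Anatomy"},
}


def category_feedback(good_docs, doc_categories):
    """Rank unseen docs by how many labels they share with the good docs."""
    good_docs = set(good_docs)
    label_counts = Counter()
    for doc_id in good_docs:
        label_counts.update(doc_categories[doc_id])

    scores = {}
    for doc_id, labels in doc_categories.items():
        if doc_id in good_docs:
            continue  # already judged by the user
        score = sum(label_counts[label] for label in labels)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

With `d1` marked as a good document, `d5` ranks first because it shares two labels with it, `d2` and `d3` follow with one shared label each, and `d4` is dropped because it shares none.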
16
How to Use Text Categories
- Mapping query words to controlled vocabulary
  - lots of research on this
  - helps in some cases, hurts in others
- Organizing retrieval results (new!)
  - problems:
    - too many categories/document
    - too many documents/category
    - the right categories aren’t there
- Idea: address difficulties by devising a better user interface.
17
Example: MeSH and MedLine
- MeSH Medical Category Hierarchy
  - ~18,000 labels
  - manually assigned
  - ~8 labels/article on average
  - avg depth: 4.5, max depth 9
- Top Level Categories:

  anatomy    diagnosis    related disc
  animals    psych        technology
  disease    biology      humanities
  drugs      physics
18
Multiple Categories per Document

  Drug    Symptom    Anatomy
  D1      S1         A1
  D2      S2         A2
  D3      S3         A3

Medical articles contain combinations of these concept types
19
How to Group the Category Types?
Documents: [D1 S3 A1] [D3 S2 S3] [D1 D2 S2 A2] …
Diagram rows:
  Dx Sx Ax
  Dx Sx A1    Dx S1 Ax    D1 Sx Ax
  Dx S1 A1    D1 S1 Ax    D1 Sx A1
  D1 S1 A1
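One way to read the diagram is that `x` stands for "any value" in a slot, so each row generalizes one more slot of `D1 S1 A1`. Under that assumption (it is an interpretation, not something the slide states), the eight groupings can be enumerated as below; `generalizations` and `matches` are hypothetical helper names.

```python
from itertools import product


def generalizations(drug, symptom, anatomy):
    """Enumerate every pattern made by keeping or wildcarding each slot."""
    slots = [(drug, "Dx"), (symptom, "Sx"), (anatomy, "Ax")]
    patterns = [" ".join(choice) for choice in product(*slots)]

    def wildcard_count(pattern):
        return sum(token.endswith("x") for token in pattern.split())

    # Most general pattern first, fully specific pattern last.
    return sorted(patterns, key=lambda p: (-wildcard_count(p), p))


def matches(doc_labels, pattern):
    """Assumed semantics: a wildcard slot places no constraint on the doc."""
    return all(
        token.endswith("x") or token in doc_labels for token in pattern.split()
    )


patterns = generalizations("D1", "S1", "A1")
```

Under these assumptions, a document labeled `[D1 S3 A1]` falls into the `D1 Sx A1` group but not into `D1 S1 Ax`, which shows why one document can land in several groups at once.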
20
Large Category Sets
- Problems for User Interfaces
  - Too many categories to browse
  - Too many docs per category
  - Docs belong to multiple categories
  - Need to integrate search
  - Need to show the documents
21
Grateful Med Query Specification
22
Grateful Med Category SubTree
23
Using Grateful Med
- Problems:
  - Does not integrate category selection with viewing of categories
  - Only a few categories visible at a time, with little context
  - Does not show relationship of retrieved documents to the category structure
24
Cat-a-Cone: (Hearst & Karadi 97) Multiple Simultaneous Categories
- Key Ideas:
  - Separate documents from category labels
  - Show both simultaneously
  - Link the two for iterative feedback
- Distinguish between:
  - Searching for Documents vs.
  - Searching for Categories
25
[Diagram: Collection, Retrieved Documents, Category Hierarchy; link labels: search, browse, query terms]
26
[Diagram: Collection, Retrieved Documents, Category Hierarchy; link labels: search, browse, query terms]
27
Cat-a-Cone (Hearst & Karadi 97)
- Catacomb: (definition 2b, online Websters) “A complex set of interrelated things”
- Makes use of earlier PARC work on 3D+animation:
  - Rooms (Henderson and Card 86)
  - IV: Cone Tree (Robertson, Card, Mackinlay 93)
  - Web Book (Card, Robertson, York 96)
28
ConeTree for Category Labels
- Browse/explore category hierarchy
  - by search on label names
  - by growing/shrinking subtrees
  - by spinning subtrees
- Affordances
  - learn meaning via ancestors, siblings
  - disambiguate meanings
  - all cats simultaneously viewable
29
Virtual Book for Result Sets
- Categories on Page (Retrieved Document) linked to Categories in Tree
- Flipping through Book Pages causes some Subtrees to Expand and Contract
- Most Subtrees remain unchanged
- Book can be Stored for later Re-Use
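The page-to-tree linkage can be sketched as a set difference over tree nodes. This is a guess at the bookkeeping, not the Cat-a-Cone implementation; the `PARENT` map is a hypothetical toy hierarchy (not the real MeSH tree), and `expanded_nodes` and `flip` are illustrative names.

```python
# Hypothetical category tree as a child -> parent map (None marks the root).
PARENT = {
    "root": None,
    "disease": "root",
    "neoplasms": "disease",
    "breast neoplasms": "neoplasms",
    "technology": "root",
    "radiotherapy": "technology",
    "anatomy": "root",
}


def expanded_nodes(page_categories, parent=PARENT):
    """Return the page's categories together with all of their ancestors."""
    expanded = set()
    for category in page_categories:
        node = category
        # Stop early once we reach a node whose ancestors are already added.
        while node is not None and node not in expanded:
            expanded.add(node)
            node = parent[node]
    return expanded


def flip(old_page, new_page):
    """Compute which tree nodes change when moving between two book pages."""
    old, new = expanded_nodes(old_page), expanded_nodes(new_page)
    return {"expand": new - old, "contract": old - new, "unchanged": old & new}


change = flip({"breast neoplasms"}, {"radiotherapy"})
```

Only the branches that differ between the two pages appear under `expand` or `contract`; any category not involved in either page, such as `anatomy` here, is not touched at all.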
30
Example Query
Patient Query on Breast Cancer dataset:
“Do I have to have radiation if I have a mastectomy, and what would be the effects?”
How does the user know which categories?
31
Interactive Category Hierarchy
- Smoothly interlink:
  - search over categories
  - search over document contents
  - browsing of categories
  - browsing of retrieved documents
32
Improvements over Grateful Med
- Integrate category selection with viewing of categories
- Show all categories + context
- Show relationship of retrieved documents to the category structure
33
Comparison Study
- H. Chen, A. Houston, R. Sewell, and B. Schatz, JASIS, to appear
- Comparison: Kohonen Map and Yahoo
- Task:
  - “Window shop” for interesting home page
  - Repeat with other interface
- Results:
  - Starting with map could repeat in Yahoo (8/11)
  - Starting with Yahoo unable to repeat in map (2/14)
34
Concept Landscapes (e.g., Lin, Chen, Wise et al.)
[Figure: map regions labeled Pharmacology, Anatomy, Legal, Disease, Hospitals]
- Single concept per document
- No titles
- Browsing without search
35
Comparison Study (cont.)
- Participants liked:
  - Correspondence of region size to number of documents in region
  - Overview (but also wanted zoom)
  - Ease of jumping from one topic to another
  - Multiple routes to topics
  - Use of category and subcategory labels
36
Comparison Study (cont.)
- Participants wanted:
  - hierarchical organization
  - other ordering of concepts (alphabetical)
  - integration of browsing and search
  - correspondence of color to meaning
  - more meaningful labels
  - labels at same level of abstraction
  - fit more labels in the given space
  - combined keyword and category search
  - multiple category assignment (sports+entertain)
37
Comparison Study (cont.)
- Cat-a-Cone
  - contains most of the desired properties
  - lacks the disliked properties
38
Summary: Cat-a-Cone
- Interface that smoothly integrates
  - search over multiple categories
  - search over document contents
  - browsing of multiple categories
  - browsing of retrieved documents
- Iterative, Interactive
- Retain partial results in a workspace