1
Semi-Automated Creation of Facet Hierarchies Marti Hearst School of Information, UC Berkeley Joint work with Dr. Emilia Stoica
2
Marti Hearst, Taxonomy Bootcamp ‘06
Outline
  Faceted Metadata: Definition, Advantages
  Flamenco: Search Interface Design using Faceted Metadata
  Castanet: (Semi) Automated Tool for Creation of Category Systems
  Comparison to State-of-the-Art Alternatives
  Conclusions
3
Focus: Search and Navigation of Large Collections
  Image Collections
  E-Government Sites
  Shopping Sites
  Digital Libraries
4
Problems with Site Search
Study by Vividence in 2001 on 69 Sites
  70% eCommerce
  31% Service
  21% Content
  2% Community
Poorly organized search results
Frustration and wasted time
Poor information architecture
Confusion
Dead ends
"Back and forthing"
Forced to search
5
What we want to Achieve
  Integrate browsing and searching seamlessly
  Support exploration and learning
  Avoid dead-ends, “pogo’ing”, and “lostness”
6
Main Idea
Use hierarchical faceted metadata
Design the interface to:
  Allow flexible navigation
  Provide previews of next steps
  Organize results in a meaningful way
  Support both expanding and refining the search
7
The Problem With Hierarchy
Most things can be classified in more than one way.
Most organizational systems do not handle this well.
Example: Animal Classification
[Figure: the same animals (otter, penguin, robin, salmon, wolf, cobra, bat, seal) grouped three different ways: by Skin Covering, by Locomotion, and by Diet.]
8
The Problem with Hierarchy
Inflexible
  Forces the user to start with a particular category
  What if I don’t know the animal’s diet, but the interface makes me start with that category?
Wasteful
  Have to repeat combinations of categories
  Makes for extra clicking and extra coding
Difficult to modify
  To add a new category type, must duplicate it everywhere or change things everywhere
9
The Problem With Hierarchy
[Figure: a single tree that starts at "start", branches on skin covering (fur, scales, feathers), then on locomotion (swim, fly, run, slither), then repeats the diet categories (fish, rodents, insects) under every branch, with animals such as salmon, bat, robin, and wolf at the leaves.]
10
The Idea of Facets
Facets are a way of labeling data
  A kind of Metadata (data about data)
  Can be thought of as properties of items
Facets vs. Categories
  Items are placed INTO a category system
  Multiple facet labels are ASSIGNED TO items
11
The Idea of Facets
Create INDEPENDENT categories (facets)
  Each facet has labels (sometimes arranged in a hierarchy)
Assign labels from the facets to every item
Example: recipe collection
  Course: Main Course
  Cooking Method: Stir-fry
  Cuisine: Thai
  Ingredient: Bell Pepper, Curry, Chicken
12
The Idea of Facets
Break out all the important concepts into their own facets
Sometimes the facets are hierarchical
  Assign labels to items from any level of the hierarchy
Preparation Method: Fry, Saute, Boil, Bake, Broil, Freeze
Desserts: Cakes, Cookies, Dairy (Ice Cream, Sorbet, Flan)
Fruits: Cherries, Berries (Blueberries, Strawberries), Bananas, Pineapple
13
Using Facets
Now there are multiple ways to get to each item
Preparation Method: Fry, Saute, Boil, Bake, Broil, Freeze
Desserts: Cakes, Cookies, Dairy (Ice Cream, Sherbet, Flan)
Fruits: Cherries, Berries (Blueberries, Strawberries), Bananas, Pineapple
Example item 1:
  Fruit > Pineapple
  Dessert > Cake
  Preparation > Bake
Example item 2:
  Dessert > Dairy > Sherbet
  Fruit > Berries > Strawberries
  Preparation > Freeze
14
Example: Nobel Prize Winners Collection (Before and After Facets)
15
Only One Way to View Laureates
16
First, Choose Prize Type
17
Next, view the list! The user must first choose an award type (literature), then browse through the laureates in chronological order. No choice is given to, say, organize by year and then award, or by country, then decade, then award, etc.
18
Flamenco Interface: Using Hierarchical Faceted Metadata
19
Opening View Select literature from PRIZE facet
20
Group results by YEAR facet
21
Select 1920’s from YEAR facet
22
Current query is PRIZE > literature AND YEAR: 1920’s. Now remove PRIZE > literature
23
Now Group By YEAR > 1920’s
24
Hierarchy Traversal: Group By YEAR > 1920’s, and drill down to 1921
25
Select an individual item
26
Use Endgame to expand out
27
Use Endgame to expand out
28
Or use “More like this” to find similar items
29
Start a new search using keyword “California”
30
Note that category structure remains after the keyword search
31
The query is now a keyword ANDed with a facet subhierarchy
32
Using Facets
The system only shows the labels that correspond to the current set of items
  Start with all items and all facets
  The user then selects a label within a facet
  This reduces the set of items (only those that have been assigned to the subcategory label are displayed)
  This also eliminates some subcategories from the view.
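The narrowing behavior described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not Flamenco's implementation: the item names, the flat (non-hierarchical) labels, and the helpers `refine` and `label_counts` are all hypothetical.

```python
from collections import Counter

# Hypothetical toy collection: each item carries one label per facet.
ITEMS = {
    "thai basil chicken": {"Course": "Main Course", "Cuisine": "Thai", "Method": "Stir-fry"},
    "pad see ew": {"Course": "Main Course", "Cuisine": "Thai", "Method": "Stir-fry"},
    "mango sticky rice": {"Course": "Dessert", "Cuisine": "Thai", "Method": "Boil"},
    "apple pie": {"Course": "Dessert", "Cuisine": "American", "Method": "Bake"},
}

def refine(items, selections):
    """Keep only the items that carry every selected (facet, label) pair."""
    return {
        name: labels
        for name, labels in items.items()
        if all(labels.get(facet) == label for facet, label in selections.items())
    }

def label_counts(items):
    """Count, per facet, the labels that still occur in the current item set."""
    counts = {}
    for labels in items.values():
        for facet, label in labels.items():
            counts.setdefault(facet, Counter())[label] += 1
    return counts

# Selecting Cuisine > Thai shrinks the item set and the visible labels.
current = refine(ITEMS, {"Cuisine": "Thai"})
previews = label_counts(current)
```

Because only labels with a nonzero count are ever offered, a click on a displayed label cannot lead to an empty result set; in this toy data, "Bake" and "American" drop out of view once Thai is selected.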
33
Advantages of Facets
Can’t end up with empty result sets (except with keyword search)
Helps avoid feelings of being lost.
Easier to explore the collection.
  Helps users infer what kinds of things are in the collection.
  Evokes a feeling of “browsing the shelves”
Is preferred over standard search for collection browsing in usability studies. (Interface must be designed properly)
34
Advantages of Facets
Seamless to add new facets and subcategories
Seamless to add new items.
Helps with “categorization wars”
  Don’t have to agree exactly where to place something
Interaction can be implemented using a standard relational database.
May be easier for automatic categorization
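To illustrate the relational-database point, here is a minimal sketch using Python's built-in `sqlite3` module. The single-table schema `item_label`, the placeholder item names, and the helper `preview_counts` are assumptions made for this example; they are not taken from Flamenco.

```python
import sqlite3

# One row per (item, facet, label) assignment; an item may have many rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item_label (item TEXT, facet TEXT, label TEXT);
    INSERT INTO item_label VALUES
        ('laureate_a', 'PRIZE', 'literature'),
        ('laureate_a', 'YEAR', '1920s'),
        ('laureate_b', 'PRIZE', 'physics'),
        ('laureate_b', 'YEAR', '1920s'),
        ('laureate_c', 'PRIZE', 'literature'),
        ('laureate_c', 'YEAR', '1930s');
""")

def preview_counts(conn, facet, label, group_by):
    """Count items per label of `group_by` among items tagged (facet, label)."""
    rows = conn.execute(
        """
        SELECT g.label, COUNT(DISTINCT g.item)
        FROM item_label AS sel
        JOIN item_label AS g ON g.item = sel.item
        WHERE sel.facet = ? AND sel.label = ? AND g.facet = ?
        GROUP BY g.label
        ORDER BY g.label
        """,
        (facet, label, group_by),
    ).fetchall()
    return dict(rows)

# "Select literature from PRIZE facet, then group results by YEAR facet."
by_year = preview_counts(conn, "PRIZE", "literature", "YEAR")
```

Adding a new facet in this layout needs no schema change, only new rows, which is one way to read "seamless to add new facets and subcategories".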
35
Information previews
Use the metadata to show where to go next
  More flexible than canned hyperlinks
  Less complex than full search
Help users see and return to previous steps
Reduces mental work
  Recognition over recall
  Suggests alternatives
More clicks are ok only if (J. Spool):
  The “scent” of the target does not weaken
  Users feel they are going towards, rather than away from, their target.
36
Facets vs. Hierarchy
Early Flamenco studies compared allowing multiple hierarchical facets vs. just one facet.
The multiple-facet design was preferred and more successful.
37
Limitation of Facets
Do not naturally capture MAIN THEMES
Facets do not show RELATIONS explicitly
  Example labels on one photo: Aquamarine, Red, Orange; Door, Doorway, Wall
  Which color is associated with which object?
Photo by J. Hearst, jhearst.typepad.com
38
Terminology Clarification
Facets vs. Attributes
  Facets are shown independently in the interface
  Attributes are just associated with individual items
    E.g., ID number, Source, Affiliation
  However, can always convert an attribute to a facet
Facets vs. Labels
  Labels are the names used within facets
  These are organized into subhierarchies
Synonyms
  There should be alternate names for the category labels
  Currently (in Flamenco) this is done with subcategories
    E.g., Deer has subcategories “stag”, “fawn”, “doe”
39
Usability Study Results
40
Flamenco Usability Studies
Usability studies done on 3 collections:
  Recipes (epicurious): 13,000 items
  Architecture Images: 40,000 items
  Fine Arts Images: 35,000 items
Conclusions:
  Users like and are successful with the dynamic faceted hierarchical metadata, especially for browsing tasks
  Very positive results, in contrast with studies on earlier iterations.
41
Most Recent Usability Study
Participants & Collection
  32 Art History Students
  ~35,000 images from SF Fine Arts Museum
Study Design
  Within-subjects
    Each participant sees both interfaces
    Balanced in terms of order and tasks
  Participants assess each interface after use
  Afterwards they compare them directly
Data recorded in behavior logs, server logs, paper surveys; one or two experienced testers at each trial.
Used 9-point Likert scales.
Session took about 1.5 hours; pay was $15/hour
42
Post-Interface Assessments All significant at p<.05 except “simple” and “overwhelming”
43
Post-Test Comparison
                                              Baseline  Faceted
Which Interface Preferable For:
  Find images of roses                           15        16
  Find all works from a given period              2        30
  Find pictures by 2 artists in same media        1        29
Overall Assessment
  More useful for your tasks                      4        28
  Easiest to use                                  8        23
  Most flexible                                   6        24
  More likely to result in dead ends             28         3
  Helped you learn more                           1        31
  Overall preference                              2        29
44
How to Create Facet Hierarchies? Our Approach: Castanet
45
Example: Recipes (3500 docs)
46
Castanet Output (shown in Flamenco)
47
Castanet Output (shown in Flamenco)
48
Castanet Output (shown in Flamenco)
49
Castanet Output (shown in Flamenco)
50
Castanet Output (shown in Flamenco)
52
Our Approach: Leverage the structure of WordNet
53
Our Approach
Leverage the structure of WordNet
Inputs: Documents, WordNet
Steps: Select terms > Get hypernym paths > Build tree > Compress tree > Divide into facets
54
1. Select Terms
Select well-distributed terms from the collection
  Eliminate stopwords
  Retain only those terms with a distribution higher than a threshold (default: top 10%)
[Pipeline diagram: Documents, WordNet > Select terms > Build core tree > Augm. core tree > Comp. tree > Remove top level categ.]
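A rough sketch of this term-selection step follows. It assumes that "distribution" means the number of documents containing a term and that "top 10%" means the top tenth of the vocabulary ranked by that number; the tokenizer, the tiny stopword list, and the function name `select_terms` are hypothetical.

```python
import math
import re
from collections import Counter

STOPWORDS = {"a", "and", "the", "with", "of", "in"}  # tiny illustrative list

def select_terms(docs, top_fraction=0.10):
    """Return {term: document frequency} for the top `top_fraction` of terms."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, after removing stopwords.
        words = set(re.findall(r"[a-z]+", doc.lower())) - STOPWORDS
        doc_freq.update(words)
    # Rank by document frequency (ties broken alphabetically).
    ranked = sorted(doc_freq, key=lambda term: (-doc_freq[term], term))
    keep = max(1, math.ceil(len(ranked) * top_fraction))
    return {term: doc_freq[term] for term in ranked[:keep]}

docs = [
    "Chocolate sundae with cherries",
    "Lemon sherbet",
    "Sundae with caramel and pecans",
    "Strawberry sundae",
]
selected = select_terms(docs, top_fraction=0.10)
```

With these four made-up recipe titles, only "sundae" (found in three documents) survives the default cutoff.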
55
2. Build Core Tree
Build a “backbone”
  Create paths from unambiguous terms only
  Bias the structure towards appropriate senses of words
Get hypernym path if term:
  - has only one sense, or
  - matches a pre-selected WordNet domain
Adding a new term increases a count at each node on its path by # of docs with the term.
Example hypernym paths:
  sundae: entity > substance, matter > nutriment > dessert > frozen dessert > ice cream sundae > sundae
  sherbet: entity > substance, matter > nutriment > dessert > frozen dessert > sherbet, sorbet > sherbet
56
2. Build Core Tree (cont.)
Merge hypernym paths to build a tree
[Figure: the sundae and sherbet paths share entity > substance, matter > nutriment > dessert > frozen dessert; after merging, frozen dessert has two children, ice cream sundae (> sundae) and sherbet, sorbet (> sherbet).]
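The merge step can be sketched as a trie-style insertion, where shared path prefixes collapse into shared nodes and every node on a path accumulates the term's document count. The nested-dict layout and the function name `build_core_tree` are assumptions of this sketch, and the document counts are made up.

```python
def build_core_tree(paths_with_counts):
    """Merge root-to-term hypernym paths into one tree of nested dicts.

    Each node stores `count` (documents on paths through it) and
    `children`, a dict keyed by node name.
    """
    root = {"count": 0, "children": {}}
    for path, doc_count in paths_with_counts:
        node = root
        for name in path:
            # Reuse the node if an earlier path already created it.
            node = node["children"].setdefault(name, {"count": 0, "children": {}})
            node["count"] += doc_count
    return root

SUNDAE = ["entity", "substance, matter", "nutriment", "dessert",
          "frozen dessert", "ice cream sundae", "sundae"]
SHERBET = ["entity", "substance, matter", "nutriment", "dessert",
           "frozen dessert", "sherbet, sorbet", "sherbet"]

# Document counts (3 and 1) are invented for the example.
tree = build_core_tree([(SUNDAE, 3), (SHERBET, 1)])
```

After the merge, the node for frozen dessert holds a count of 4 and has two children, matching the shape of the merged tree on the slide.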
57
3. Augment Core Tree
Attach to the core tree the terms with more than one sense
  Favor the more common path over other alternatives
58
Augment Core Tree (cont.)
Date (p1): entity > abstraction > measure, quantity > fundamental quality > time period > calendar day (18) > date
Date (p2): entity > substance, matter > food, nutrient > nutriment > food > edible fruit (78) > date
Choose path p2 since it has more items assigned
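One way to read "favor the more common path" is to compare the counts already stored at each candidate's parent node in the core tree. The sketch below assumes that reading; the helpers `add_path`, `node_count`, and `choose_path` are hypothetical, and the tree layout is the same nested-dict form assumed for illustration only.

```python
def add_path(tree, path, doc_count):
    """Insert `path` into the tree, adding `doc_count` at every node on it."""
    node = tree
    for name in path:
        node = node["children"].setdefault(name, {"count": 0, "children": {}})
        node["count"] += doc_count

def node_count(tree, path):
    """Return the count stored at the end of `path`, or 0 if it is absent."""
    node = tree
    for name in path:
        node = node["children"].get(name)
        if node is None:
            return 0
    return node["count"]

def choose_path(tree, candidate_paths):
    """Pick the candidate whose parent node already has the most items."""
    return max(candidate_paths, key=lambda path: node_count(tree, path[:-1]))

# Core tree holding the two parents from the slide, with their counts.
core = {"count": 0, "children": {}}
add_path(core, ["entity", "abstraction", "measure, quantity",
                "fundamental quality", "time period", "calendar day"], 18)
add_path(core, ["entity", "substance, matter", "food, nutrient",
                "nutriment", "food", "edible fruit"], 78)

P1 = ["entity", "abstraction", "measure, quantity",
      "fundamental quality", "time period", "calendar day", "date"]
P2 = ["entity", "substance, matter", "food, nutrient",
      "nutriment", "food", "edible fruit", "date"]
best = choose_path(core, [P1, P2])
```

Here `best` is the edible fruit path, since its parent has 78 items against 18 for calendar day.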
59
4. Compress Tree
Rule 1: Eliminate a parent with fewer than k children unless it is the root or its distribution is larger than 0.1*max dist
[Figure: before, dessert > frozen dessert > {ice cream sundae > sundae; sherbet, sorbet > sherbet; parfait}. After, dessert > frozen dessert > {sundae, parfait, sherbet}.]
60
4. Compress Tree (cont.)
Rule 2: Eliminate a child whose name appears within the parent’s name
[Figure: before, dessert > frozen dessert > {sundae, parfait, sherbet}. After, dessert > {sundae, parfait, sherbet}.]
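The two compression rules can be sketched as two bottom-up passes. Several details are assumptions of this sketch rather than facts from the slides: "max dist" is taken to be the largest node count in the tree, Rule 2 fires when either name contains the other (the slide's own example removes frozen dessert under dessert), leaves are never removed by Rule 2, and all counts are invented.

```python
def node(count, children=None):
    """Build a tree node with a document count and named children."""
    return {"count": count, "children": children or {}}

def max_count(n):
    """Largest count anywhere in the tree rooted at n."""
    return max([n["count"]] + [max_count(c) for c in n["children"].values()])

def rule1(n, k, floor):
    """Drop parents with fewer than k children (count <= floor), promoting their children."""
    kids = {}
    for name, child in n["children"].items():
        child = rule1(child, k, floor)           # compress from the leaves up
        grandkids = child["children"]
        if grandkids and len(grandkids) < k and child["count"] <= floor:
            kids.update(grandkids)               # eliminate child, keep its children
        else:
            kids[name] = child
    return node(n["count"], kids)                # the root itself is never dropped

def rule2(n, name):
    """Drop an internal child whose name overlaps its parent's name."""
    kids = {}
    for child_name, child in n["children"].items():
        child = rule2(child, child_name)
        overlaps = child_name in name or name in child_name
        if overlaps and child["children"]:
            kids.update(child["children"])
        else:
            kids[child_name] = child
    return node(n["count"], kids)

tree = node(50, {
    "frozen dessert": node(6, {
        "ice cream sundae": node(3, {"sundae": node(3)}),
        "sherbet, sorbet": node(1, {"sherbet": node(1)}),
        "parfait": node(2),
    }),
})
floor = 0.1 * max_count(tree)
step1 = rule1(tree, k=2, floor=floor)
step2 = rule2(step1, "dessert")
```

With k = 2, the first pass leaves dessert > frozen dessert > {sundae, sherbet, parfait}, and the second pass removes frozen dessert so that the three terms sit directly under dessert.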
61
5. Divide into Facets
62
5. Divide into Facets (Remove top levels)
Rule 1: Eliminate very general categories (e.g., entity, abstraction). If no paths are longer than threshold t, then done.
Else:
Rule 2: Undo first step. Then eliminate all top levels until the maximum length of any path in the resulting hierarchy is t.
[Figure: the top levels (entity; substance, matter; food, nutriment; ingredient, fixings; food stuff, food product) are removed, leaving flavorer with children sweetening (sugar, syrup) and herb (parsley, oregano).]
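A minimal sketch of the two rules, assuming the hierarchy is a nested dict of `{name: subtree}` and that path length is counted in nodes. The function names, the choice to keep top-level leaves when stripping a level, and the ordering of the intermediate food nodes in the example tree are all assumptions made for illustration.

```python
def height(forest):
    """Length in nodes of the longest root-to-leaf path in a {name: subtree} forest."""
    return max((1 + height(sub) for sub in forest.values()), default=0)

def drop_general(forest, general):
    """Rule 1: remove the named very general categories from the top."""
    result = {}
    for name, sub in forest.items():
        if name in general:
            result.update(drop_general(sub, general))
        else:
            result[name] = sub
    return result

def strip_top_level(forest):
    """Remove the current top level, promoting children (top-level leaves are kept)."""
    promoted = {}
    for name, sub in forest.items():
        promoted.update(sub if sub else {name: sub})
    return promoted

def divide_into_facets(forest, t, general=("entity", "abstraction")):
    pruned = drop_general(forest, general)
    if height(pruned) <= t:
        return pruned
    # Rule 2: start again from the original forest and strip whole levels.
    facets = forest
    while height(facets) > t:
        facets = strip_top_level(facets)
    return facets

forest = {"entity": {"substance, matter": {"food, nutriment": {
    "food stuff, food product": {"ingredient, fixings": {"flavorer": {
        "herb": {"parsley": {}, "oregano": {}},
        "sweetening": {"sugar": {}, "syrup": {}},
    }}}}}}}
facets = divide_into_facets(forest, t=3)
```

With t = 3 the surviving top-level node is flavorer, which would then serve as one facet with herb and sweetening as its subcategories.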
63
Disambiguation
Ambiguity in:
  Word senses
  Paths up the hypernym tree
Sense 1 for word “tuna”:
  organism, being => plant, flora => vascular plant => succulent => cactus => tuna
Sense 2 for word “tuna”:
  organism, being => fish => food fish => tuna
  organism, being => fish => bony fish => spiny-finned fish => percoid fish => tuna
Senses 1 and 2 give 2 paths for the same word; sense 2 alone gives 2 paths for the same sense.
64
How to Select the Right Senses and Paths?
First: build core tree
  (1) Create paths for words with only one sense
  (2) Use Domains
    WordNet has 212 Domains: medicine, mathematics, biology, chemistry, linguistics, soccer, etc.
    Automatically scan the collection to see which domains apply
    The user selects which of the suggested domains to use, or may add their own
    Paths for terms that match the selected domains are added to the core tree
Then: add remaining terms to the core tree.
65
Optional Step: Domains
To disambiguate, use Domains
  WordNet has 212 Domains: medicine, mathematics, biology, chemistry, linguistics, soccer, etc.
  A better collection has been developed by Magnini 2000
    Assigns a domain to every noun synset
Automatically scan the collection to see which domains apply
The user selects which of the suggested domains to use, or may add their own
Paths for terms that match the selected domains are added to the core tree
66
Using Domains
dip glosses:
  Sense 1: A depression in an otherwise level surface
  Sense 2: The angle that a magnetic needle makes with the horizon
  Sense 3: Tasty mixture into which bite-size foods are dipped
dip hypernyms:
  Sense 1: solid => concave shape => depression
  Sense 2: shape, form => space => angle
  Sense 3: food => ingredient, fixings => flavorer
Given domain “food”, choose sense 3
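The domain filter can be sketched as a lookup over senses tagged with domain labels. The `SENSES` structure and the function `pick_senses` are hypothetical; only the "food" label for sense 3 comes from the slide, so the other two senses are left with no domain rather than guessing one.

```python
# Hypothetical sense inventory for "dip". Domain labels for senses 1 and 2
# are not given on the slide, so they are left as None here.
SENSES = {
    "dip": [
        {"sense": 1, "domain": None, "path": ["solid", "concave shape", "depression"]},
        {"sense": 2, "domain": None, "path": ["shape, form", "space", "angle"]},
        {"sense": 3, "domain": "food", "path": ["food", "ingredient, fixings", "flavorer"]},
    ]
}

def pick_senses(term, selected_domains, senses=SENSES):
    """Return the senses of `term` whose domain label is among the selected domains."""
    return [s for s in senses.get(term, []) if s["domain"] in selected_domains]

chosen = pick_senses("dip", {"food"})
```

Only the hypernym path of the chosen sense would then be added to the core tree; a term with no sense in the selected domains is left for the later augmentation step.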
67
Castanet Evaluation
68
Castanet Evaluation
This is a tool for information architects, so information architects did the evaluation
We compared output on:
  Recipes
  Biomedical journal titles
We compared to two state-of-the-art algorithms:
  LDA (Blei et al. ’04)
  Subsumption (Sanderson & Croft ’99)
69
Subsumption Output (shown in Flamenco)
70
Subsumption Output (shown in Flamenco)
71
Subsumption Output (shown in Flamenco)
72
Subsumption Output (shown in Flamenco)
73
LDA Output (shown in Flamenco)
74
LDA Output (shown in Flamenco)
75
LDA Output (shown in Flamenco)
76
Evaluation Method
Information architects assessed the category systems
For each of 2 systems’ output:
  Examined and commented on top-level
  Examined and commented on two sub-levels
Then comment on overall properties
  Meaningful?
  Systematic?
  Likely to use in your work?
77
Evaluation Results
Results on recipes collection for “Would you use this system in your work?”
Yes in some cases or yes definitely:
  Pine (Castanet): 29/34
  Oak (LDA): 0/18
  Birch (Subsumption): 6/16
Results on quality of categories:
78
Opportunities for Tagging
New opportunity: Tagging, folksonomies (Flickr, del.icio.us)
People are creating facets in a decentralized manner
They are assigning multiple facets to items
This is done on a massive scale
This leads naturally to meaningful associations
79
Conclusions
Flexible application of hierarchical faceted metadata is a proven approach for navigating large information collections.
  Midway in complexity between simple hierarchies and deep knowledge representation.
  Currently in use on e-commerce sites; spreading to other domains
Systems are needed to help create faceted metadata structures
Our WordNet-based algorithm, while not perfect, seems like it will be a useful tool for Information Architects.
80
Acknowledgements
Flamenco Team: Brycen Chun, Ame Elliott, Jennifer English, Kevin Li, Rashmi Sinha, Emilia Stoica, Kirsten Swearingen, Ka-Ping Yee
Castanet: Emilia Stoica
Funding: This work was supported in part by NSF (IIS-9984741)
81
For more information: flamenco.berkeley.edu Thank you! Marti Hearst