1
Castanet: Using WordNet to Build Facet Hierarchies
Emilia Stoica and Marti Hearst
School of Information, Berkeley
2
Focus: Search and Navigation of Large Collections
- Image Collections
- E-Government Sites
- Shopping Sites
- Digital Libraries
3
Problems with Site Search
- Study by Vividence in 2001 on 69 sites: 70% eCommerce, 31% Service, 21% Content, 2% Community
- Poorly organized search results
  - Frustration and wasted time
- Poor information architecture
  - Confusion
  - Dead ends
  - "back and forthing"
  - Forced to search
4
The Problem With Hierarchy
- Most things can be classified in more than one way.
- Most organizational systems do not handle this well.
- Example: Animal Classification
  - Animals: otter, penguin, robin, salmon, wolf, cobra, bat
  - Ways to classify: Skin Covering, Locomotion, Diet
[Figure: the same animals (plus seal) sorted into different groups under each of the three classifications]
5
The Problem With Hierarchy
[Figure: a single tree rooted at "start" whose levels branch on fur / scales / feathers, on swim / fly / run / slither, and on fish / rodents / insects, with the same branches repeated under every node; leaves include salmon, bat, robin, wolf, …]
6
The Idea of Facets
- Facets are a way of labeling data
  - A kind of metadata (data about data)
  - Can be thought of as properties of items
- Facets vs. Categories
  - Items are placed INTO a category system
  - Multiple facet labels are ASSIGNED TO items
7
The Idea of Facets
- Hot and Sweet Chicken: 1 pepper, 2 apricots, 1 pound chicken breast, 1 Tbsp gingerroot
- Facet labels: Meat > Chicken; Vegetables > pepper; Fruit > Apricot; Flavor > gingerroot
8
Using Facets
- Now there are multiple ways to get to each item
- Facet hierarchies:
  - Preparation Method: Fry, Saute, Boil, Bake, Broil, Freeze
  - Desserts: Cakes, Cookies, Dairy (Ice Cream, Sherbet, Flan)
  - Fruits: Cherries, Berries (Blueberries, Strawberries), Bananas, Pineapple
- Labels on one item: Fruit > Pineapple; Dessert > Cake; Preparation > Bake
- Labels on another item: Dessert > Dairy > Sherbet; Fruit > Berries > Strawberries; Preparation > Freeze
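To make "multiple ways to get to each item" concrete, here is a minimal sketch of a facet index in Python. The item names (`item A`, `item B`) and the functions `build_index` and `select` are hypothetical, introduced only for illustration; the facet paths are the ones listed on the slide.

```python
from collections import defaultdict

# Hypothetical item names; the facet paths are the ones shown on the slide.
ITEMS = {
    "item A": [("Fruit", "Pineapple"), ("Dessert", "Cake"), ("Preparation", "Bake")],
    "item B": [("Dessert", "Dairy", "Sherbet"),
               ("Fruit", "Berries", "Strawberries"),
               ("Preparation", "Freeze")],
}

def build_index(items):
    """Map every facet path prefix to the set of items labeled under it."""
    index = defaultdict(set)
    for item, paths in items.items():
        for path in paths:
            for depth in range(1, len(path) + 1):
                index[path[:depth]].add(item)
    return index

def select(index, *prefixes):
    """Return the items that match all of the given facet path prefixes."""
    matches = [index.get(prefix, set()) for prefix in prefixes]
    return set.intersection(*matches) if matches else set()

INDEX = build_index(ITEMS)
print(sorted(select(INDEX, ("Dessert",))))
print(sorted(select(INDEX, ("Fruit", "Berries"), ("Preparation", "Freeze"))))
```

Because each item carries several labels, a user can start from any facet and narrow by any other, rather than following one fixed path.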
9
Castanet
- Semi-automatic algorithm for creating hierarchical faceted metadata
- Carves out a structure from the hypernym (IS-A) relations within WordNet
- Produces surprisingly good results for a wide range of subjects, e.g., arts, medicine, recipes, math, news, bibliographical records
10
WordNet Challenges
- A word may have more than one sense
  - Fine granularity of word sense distinctions, e.g., newspaper (#1): daily publication on folded sheets; newspaper (#3): physical object
  - Ambiguity for the same sense, e.g., tuna: #1 cactus; #2 fish (food fish, bony fish)
11
WordNet Challenges (cont.)
- The hypernym path may be quite long (e.g., sense #3 of tuna has 14 nodes)
- Sparse coverage of proper names and noun phrases (not addressed)
12
Algorithm Goals
- Build a set of facet hierarchies
- Balance depth and breadth
  - Avoid “skinny” paths
  - Don’t go too deep or too broad
- Choose understandable labels
- Disambiguate words
  - Currently a word can take on only one sense
13
Our Approach
[Pipeline diagram: inputs Documents and WordNet; stages Select terms, Build core tree, Augment core tree, Compress tree, Remove top level categories, Divide into facets]
14
1. Select Terms
- Select well-distributed terms from the collection
- Eliminate stopwords
- Retain only those terms with a distribution higher than a threshold (default: top 10%)
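The term selection step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: "distribution" is taken to mean the number of documents containing the term, "top 10%" is read as the top tenth of the ranked vocabulary, and the stopword list, the whitespace tokenizer, and the function name `select_terms` are hypothetical.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (assumption; a real list would be longer).
STOPWORDS = {"a", "and", "the", "with", "of", "in"}

def select_terms(docs, top_fraction=0.10):
    """Keep the top fraction of non-stopword terms, ranked by document count."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document.
        terms = {w for w in doc.lower().split() if w not in STOPWORDS}
        doc_freq.update(terms)
    # Rank by descending document count, breaking ties alphabetically.
    ranked = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))
    keep = max(1, math.ceil(len(ranked) * top_fraction))
    return {t: doc_freq[t] for t in ranked[:keep]}

DOCS = ["sundae with chocolate", "sherbet and lemon", "lemon sundae", "the sundae"]
print(select_terms(DOCS, top_fraction=0.5))
```

The returned document counts are the numbers a later step could add to each node on a term's hypernym path.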
15
2. Build Core Tree
- Build a “backbone”
  - Create paths from unambiguous terms only
  - Bias the structure towards appropriate senses of words
- Get the hypernym path if the term:
  - has only one sense, or
  - matches a pre-selected WordNet domain
- Adding a new term increases a count at each node on its path by the number of docs with the term.
- Example paths:
  - sundae: entity > substance, matter > nutriment > dessert > frozen dessert > ice cream sundae > sundae
  - sherbet: entity > substance, matter > nutriment > dessert > frozen dessert > sherbet, sorbet > sherbet
16
2. Build Core Tree (cont.)
- Merge hypernym paths to build a tree
[Figure: the sundae and sherbet paths share entity > substance, matter > nutriment > dessert > frozen dessert; in the merged tree, frozen dessert has two children, ice cream sundae > sundae and sherbet, sorbet > sherbet]
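Merging hypernym paths while accumulating counts can be sketched with nested dictionaries. The two paths are the ones from the slides; the document counts (12 and 5), the node layout, and the helper names `add_path` and `find` are assumptions made for illustration.

```python
def add_path(tree, path, doc_count):
    """Merge one root-to-term path into the tree, adding doc_count at each node."""
    children = tree
    for name in path:
        node = children.setdefault(name, {"count": 0, "children": {}})
        node["count"] += doc_count
        children = node["children"]

def find(tree, path):
    """Return the node reached by following path from the top of the tree."""
    node, children = None, tree
    for name in path:
        node = children[name]
        children = node["children"]
    return node

PREFIX = ["entity", "substance, matter", "nutriment", "dessert", "frozen dessert"]
PATHS = {
    "sundae": PREFIX + ["ice cream sundae", "sundae"],
    "sherbet": PREFIX + ["sherbet, sorbet", "sherbet"],
}
DOC_COUNTS = {"sundae": 12, "sherbet": 5}  # hypothetical document counts

core_tree = {}
for term, path in PATHS.items():
    add_path(core_tree, path, DOC_COUNTS[term])

shared = find(core_tree, PREFIX)
print(shared["count"], sorted(shared["children"]))
```

Shared prefixes collapse into a single branch, so the count at "frozen dessert" reflects both terms while each leaf keeps only its own count.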
17
3. Augment Core Tree
- Attach to the core tree the terms with more than one sense
- Favor the more common path over other alternatives
18
Augment Core Tree (cont.)
- Date (p1): entity > substance, matter > food, nutrient > nutriment > food > edible fruit (78) > date
- Date (p2): abstraction > measure, quantity > fundamental quantity > time period > calendar day (18) > date
- Choose p1, since that path has more items assigned (78 vs. 18)
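The choice between the two candidate paths for "date" can be sketched as a comparison of counts already present in the core tree. Scoring each candidate by the count at the term's immediate parent is an assumption based on where the slide shows its numbers; only the tail of each path is listed, and `choose_path` is a hypothetical name.

```python
# Counts shown on the slide for the parent node on each candidate path.
CORE_COUNTS = {"edible fruit": 78, "calendar day": 18}

# Tail of each candidate hypernym path for the ambiguous term "date".
CANDIDATES = {
    "p1": ["food", "edible fruit", "date"],
    "p2": ["time period", "calendar day", "date"],
}

def choose_path(candidates, core_counts):
    """Pick the candidate whose parent node has the most items assigned."""
    def score(label):
        parent = candidates[label][-2]
        # A parent that is not in the core tree scores zero.
        return core_counts.get(parent, 0)
    return max(candidates, key=score)

print(choose_path(CANDIDATES, CORE_COUNTS))
```

In a recipe collection the food sense wins because the unambiguous terms have already put far more documents under "edible fruit" than under "calendar day".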
19
Optional Step: Domains
- To disambiguate, use Domains
- WordNet has 212 Domains: medicine, mathematics, biology, chemistry, linguistics, soccer, etc.
  - A better collection has been developed by Magnini 2000
  - Assigns a domain to every noun synset
- Automatically scan the collection to see which domains apply
- The user selects which of the suggested domains to use, or may add their own
- Paths for terms that match the selected domains are added to the core tree
20
Using Domains
- dip glosses:
  - Sense 1: A depression in an otherwise level surface
  - Sense 2: The angle that a magnetic needle makes with the horizon
  - Sense 3: Tasty mixture into which bite-size foods are dipped
- dip hypernyms:
  - Sense 1: solid => concave shape => depression
  - Sense 2: shape, form => space => angle
  - Sense 3: food => ingredient, fixings => flavorer
- Given domain “food”, choose sense 3
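A minimal sketch of domain-based sense selection for "dip" follows. The slide only states that the domain “food” selects sense 3; the domain labels given here to senses 1 and 2 are hypothetical placeholders, and the rule that a term is disambiguated only when exactly one sense matches is an assumption, as is the function name `pick_sense`.

```python
DIP_SENSES = [
    # "domain" for senses 1 and 2 is a hypothetical placeholder, not from the slide.
    {"sense": 1, "path": ["solid", "concave shape", "depression"], "domain": "geography"},
    {"sense": 2, "path": ["shape, form", "space", "angle"], "domain": "physics"},
    # The slide states that the "food" domain selects sense 3.
    {"sense": 3, "path": ["food", "ingredient, fixings", "flavorer"], "domain": "food"},
]

def pick_sense(senses, selected_domains):
    """Return the single sense whose domain was selected, or None if ambiguous."""
    matches = [s for s in senses if s["domain"] in selected_domains]
    return matches[0] if len(matches) == 1 else None

chosen = pick_sense(DIP_SENSES, {"food"})
print(chosen["sense"], " > ".join(chosen["path"]))
```

When the selected domains match no sense, or more than one, the sketch returns `None` and the term would be left for the augmentation step.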
21
4. Compress Tree
- Rule 1: Eliminate a parent with fewer than k children unless it is the root or its distribution is larger than 0.1*max dist
[Figure: before, dessert > frozen dessert with children ice cream sundae > sundae, sherbet, sorbet > sherbet, and parfait; after, dessert > frozen dessert with children sundae, parfait, sherbet]
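Rule 1 can be sketched as a bottom-up pass that splices out weak single-child parents. The value k = 2, all counts, the node layout, and the names `node`, `compress_rule1`, and `outline` are assumptions for illustration; the slide does not give k or the counts.

```python
def node(name, count, *children):
    """Build a tree node (hypothetical representation)."""
    return {"name": name, "count": count, "children": list(children)}

def compress_rule1(n, k, min_count, is_root=True):
    """Return the list of nodes that replaces n after applying Rule 1 below it."""
    children = []
    for child in n["children"]:
        children.extend(compress_rule1(child, k, min_count, is_root=False))
    kept = {"name": n["name"], "count": n["count"], "children": children}
    # Eliminate a non-root parent with fewer than k children unless its
    # count exceeds the threshold; its children move up one level.
    removable = (not is_root) and 0 < len(children) < k and n["count"] <= min_count
    return children if removable else [kept]

def outline(n):
    """Nested dict of names, convenient for printing and comparing trees."""
    return {n["name"]: {name: sub for c in n["children"]
                        for name, sub in outline(c).items()}}

TREE = node("dessert", 100,
            node("frozen dessert", 12,
                 node("ice cream sundae", 5, node("sundae", 5)),
                 node("sherbet, sorbet", 4, node("sherbet", 4)),
                 node("parfait", 3)))
MAX_COUNT = 100  # hypothetical maximum count in the tree

(compressed,) = compress_rule1(TREE, k=2, min_count=0.1 * MAX_COUNT)
print(outline(compressed))
```

With these numbers, "ice cream sundae" and "sherbet, sorbet" each have a single child and a low count, so they are removed and their children attach directly to "frozen dessert".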
22
4. Compress Tree (cont.)
- Rule 2: Eliminate a child whose name appears within the parent’s name
[Figure: before, dessert > frozen dessert with children sundae, parfait, sherbet; after, dessert with children sundae, parfait, sherbet]
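Rule 2 can be sketched as follows. The slide's example removes "frozen dessert" under "dessert", so this sketch treats the rule as a word overlap in either direction: a child is dropped when all the words of one name occur in the other. That reading, the promotion of the dropped child's children, and the names `words`, `names_overlap`, and `compress_rule2` are assumptions.

```python
import re

def words(name):
    """Lowercased set of the words in a node name."""
    return set(re.findall(r"\w+", name.lower()))

def names_overlap(parent, child):
    """True if every word of one name occurs in the other (assumed reading)."""
    p, c = words(parent), words(child)
    return p <= c or c <= p

def compress_rule2(n):
    """Drop children whose name overlaps the parent's, promoting their children."""
    children = []
    for child in n["children"]:
        child = compress_rule2(child)
        if names_overlap(n["name"], child["name"]):
            children.extend(child["children"])
        else:
            children.append(child)
    return {"name": n["name"], "children": children}

def leaf(name):
    return {"name": name, "children": []}

TREE = {"name": "dessert", "children": [
    {"name": "frozen dessert",
     "children": [leaf("sundae"), leaf("parfait"), leaf("sherbet")]},
]}

RESULT = compress_rule2(TREE)
print([c["name"] for c in RESULT["children"]])
```

Comparing whole words rather than raw substrings avoids accidental matches between unrelated names that merely share letters.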
23
5. Divide into Facets
24
5. Divide into Facets (Remove top levels)
- Rule 1: Manually eliminate the top t levels (t = 4 for recipe collection).
- Rule 2: For each resulting tree, test if it has more than n children (n = 2). If yes, stop. If not, delete the root and test again.
[Figure: before, entity > substance, matter > food, nutriment, which branches into ingredient, fixings > flavorer > herb > parsley, oregano and food stuff, food product > sweetening > sugar, syrup; after removing the top levels, two trees remain, sweetening > sugar, syrup and flavorer > herb > parsley, oregano]
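Both rules can be sketched on the small tree from the slide. The tuple representation and the names `remove_top_levels` and `split_roots` are hypothetical. Leaving childless trees untouched in Rule 2 is an assumption, since deleting a leaf root would drop the term entirely. The toy tree is too small for the slide's n = 2, so the usage line passes n = 1.

```python
# Each node is (name, [child nodes]); the shape follows the slide's figure.
TREE = ("entity", [("substance, matter", [("food, nutriment", [
    ("ingredient, fixings", [
        ("flavorer", [("herb", [("parsley", []), ("oregano", [])])]),
    ]),
    ("food stuff, food product", [
        ("sweetening", [("sugar", []), ("syrup", [])]),
    ]),
])])])

def remove_top_levels(root, t):
    """Rule 1: drop the top t levels and return the subtrees that remain."""
    forest = [root]
    for _ in range(t):
        forest = [child for _, children in forest for child in children]
    return forest

def split_roots(forest, n):
    """Rule 2: delete any root that does not have more than n children."""
    result = []
    for name, children in forest:
        # Assumption: a tree with no children is kept as it is.
        if not children or len(children) > n:
            result.append((name, children))
        else:
            result.extend(split_roots(children, n))
    return result

after_rule1 = remove_top_levels(TREE, t=4)
print([name for name, _ in after_rule1])

facets = split_roots(after_rule1, n=1)
print([name for name, _ in facets])
```

Each surviving root becomes the top of one facet hierarchy.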
25
Example: Recipes (3500 docs)
26
Castanet Output (shown in Flamenco)
27
Castanet Output
32
Castanet Evaluation
- This is a tool for information architects, so information architects did the evaluation
- We compared output on two collections:
  - Recipes
  - Biomedical journal titles
- We compared to two state-of-the-art algorithms:
  - LDA (Blei et al. ’04)
  - Subsumption (Sanderson & Croft ’99)
33
Subsumption Output
37
LDA Output
40
Evaluation Method
- Information architects assessed the category systems
- For each of 2 systems’ output:
  - Examined and commented on the top level
  - Examined and commented on two sub-levels
- Then commented on overall properties
  - Meaningful?
  - Systematic?
  - Likely to use in your work?
41
Evaluation (cont.)
- Sample questions for top level categories:
  - Would you add/remove/rename any category?
  - Did this category match your expectations?
- Sample questions for a specific category:
  - Would you add/move/remove any sub-categories?
  - Would you promote any sub-category to top level?
- General questions:
  - Would you use Castanet?
  - Would you use LDA?
  - Would you use Subsumption?
  - Would you use list of most frequent terms?
42
Evaluation Results
- Results on recipes collection for “Would you use this system in your work?”
- Number answering “Yes in some cases” or “yes, definitely”:
  - Castanet: 29/34
  - LDA: 0/18
  - Subsumption: 6/16
  - Baseline: 25/34
[Table: average response to questions about quality (4 = “strongly agree”)]
43
Evaluation Results
[Table: average responses for top-level categories (4 = no changes, 1 = change many)]
[Table: average responses for 2 subcategories]
44
Needed Improvements
- Take spelling variations and morphological variants into account
- Use verbs and adjectives, not just nouns
- Normalize noun phrases
- Allow terms to have more than one sense
- Improve algorithm for assigning documents to categories
45
Opportunities for Tagging
- New opportunity: Tagging, folksonomies (flickr, del.icio.us)
- People are creating facets in a decentralized manner
- They are assigning multiple facets to items
- This is done on a massive scale
- This leads naturally to meaningful associations
46
Conclusions
- Flexible application of hierarchical faceted metadata is a proven approach for navigating large information collections.
  - Midway in complexity between simple hierarchies and deep knowledge representation.
  - Currently in use on e-commerce sites; spreading to other domains
- Systems are needed to help create faceted metadata structures
- Our WordNet-based algorithm, while not perfect, seems like it will be a useful tool for Information Architects.
47
Conclusions
- Castanet builds a set of faceted hierarchies by finding IS-A relations between terms using WordNet.
- The method has been tested on various domains: medicine, recipes, math, news, arts, bibliographical records
- Usability study shows:
  - Castanet is preferred to other state-of-the-art solutions.
  - Information architects want to use the tool in their work.
48
Learn More
- Funding: This work was supported in part by NSF (IIS-9984741)
- For more information: Stoica, E., Hearst, M., and Richardson, M., Automating Creation of Hierarchical Faceted Metadata Structures, NAACL/HLT 2007
- See http://flamenco.berkeley.edu
49
Motivation
- Want to assign labels from multiple hierarchies
50
The Problem with Hierarchy
- Inflexible
  - Forces the user to start with a particular category
  - What if I don’t know the animal’s diet, but the interface makes me start with that category?
- Wasteful
  - Have to repeat combinations of categories
  - Makes for extra clicking and extra coding
- Difficult to modify
  - To add a new category type, must duplicate it everywhere or change things everywhere