1
Nearly-Automated Metadata Hierarchy Creation
Emilia Stoica and Marti Hearst
SIMS, University of California, Berkeley
2
Motivation
We want to assign each item labels from multiple hierarchies
3
Motivation
Description: 19th c. paint horse; saddle and hackamore; spurs; bandana on rider; old time cowboy hat; underchin thong; flying off.
Labels from multiple hierarchies:
  Nature => Animal => Mammal => Horse
  Occupations => Cowboy
  Clothing => Hats => Cowboy Hat
  Media => Engraving => Wood Eng.
  Location => North America => America
4
Use in Browsing Interfaces like Flamenco
6
How to Obtain the Hierarchies?
Goal: help an information architect get started
  Currently they do it all by hand!
  Assume they will do some editing
Nearly automated
Multiple hierarchies (facets)
Automatically assign items to multiple hierarchies
7
Related Work
Automated text categorization
  LOTS of work on this
  Assumes that a set of categories is already created
To be intuitive, a categorization should contain sets of IS-A relations (hierarchical)
  Rosenfeld and Morville (2002)
  Pratt, Hearst, and Fagan (1999)
Current automated approaches contain only associative relations
8
Examples of Associative Relations
Hofmann 1999
  Collection: Machine learning abstracts
  Top-level categories: learn, paper, base, model, new, train
  Problem: These are not intuitive categories for machine learning
Sanderson and Croft 1999
  Collection: Medical texts
  Top-level categories: disease, post polio, serious disease, dengue, infection control, immunology, …
  Problem: These are at different levels of generality
9
Examples of Associative Relations
Schuetze 1993
  Collection: Arts descriptions
  Sample groupings: carriage cart horse ride walk passing horseback wagon men chicken rider bald balding head facing hand faced arm hat haired glove long
  Problem: Terms are associated with one another, but are not organized into hierarchies that can be navigated.
10
Our Approach
Leverage the structure of WordNet
Pipeline: Documents => Select terms => Get hypernym paths (from WordNet) => Build tree => Compress tree
11
1. Select Terms
Select well-distributed terms from the collection
Example terms: red, blue
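The slide does not define "well distributed". As a minimal sketch, one possible reading is a document-frequency filter: keep terms that occur in enough documents to be useful but not in nearly all of them. The function name `select_terms`, the thresholds `min_docs` and `max_fraction`, and the whitespace tokenization are all assumptions made for illustration, not the authors' criterion.

```python
from collections import Counter

def select_terms(documents, min_docs=2, max_fraction=0.9):
    """Keep terms found in at least min_docs documents but in no more than
    max_fraction of all documents (both thresholds are hypothetical)."""
    doc_freq = Counter()
    for text in documents:
        # Count each term once per document (document frequency).
        doc_freq.update(set(text.lower().split()))
    limit = max_fraction * len(documents)
    return sorted(t for t, df in doc_freq.items() if min_docs <= df <= limit)

docs = [
    "red apple on a plate",
    "blue plate with a red rim",
    "a blue bowl",
    "a green bowl",
]
print(select_terms(docs))
```

Here "a" is dropped because it appears in every document, and words seen only once are dropped as too rare.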
12
2. Get Hypernym Path
Get the hypernym path for each term
  red: abstraction => property => visual property => color => chromatic color => red, redness => red
  blue: abstraction => property => visual property => color => chromatic color => blue, blueness => blue
13
3. Build Tree
Merge hypernym paths to build a tree
Paths:
  red: abstraction => property => visual property => color => chromatic color => red, redness => red
  blue: abstraction => property => visual property => color => chromatic color => blue, blueness => blue
Merged tree:
  abstraction
    property
      visual property
        color
          chromatic color
            red, redness
              red
            blue, blueness
              blue
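The merge step can be sketched as inserting each hypernym path into a shared prefix tree. The nested-dict representation and the names `build_tree` and `show` are assumptions chosen for brevity; the two paths are the red/blue paths from the slide.

```python
def build_tree(paths):
    """Merge root-to-term paths into one nested-dict tree."""
    tree = {}
    for path in paths:
        node = tree
        for name in path:
            # Reuse the node if an earlier path already created it.
            node = node.setdefault(name, {})
    return tree

def show(node, depth=0):
    """Print the tree with two spaces of indentation per level."""
    for name, kids in node.items():
        print("  " * depth + name)
        show(kids, depth + 1)

prefix = ["abstraction", "property", "visual property", "color", "chromatic color"]
paths = [prefix + ["red, redness", "red"], prefix + ["blue, blueness", "blue"]]
tree = build_tree(paths)
show(tree)
```

Because both paths share the prefix up to "chromatic color", the tree branches only below that node.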
14
4. Compress Tree
Eliminate a parent with fewer than n children, unless it is the root or its distribution is larger than 0.1 * max dist
Before:
  color
    chromatic color
      red, redness
        red
      blue, blueness
        blue
      green, greenness
        green
After:
  color
    chromatic color
      red
      blue
      green
15
4. Compress Tree (cont.)
Eliminate a child whose name appears within the parent's
Before:
  color
    chromatic color
      red
      blue
      green
After:
  color
    red
    blue
    green
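Both compression rules can be sketched on a nested-dict tree. Several points here are assumptions rather than details from the slides: `dist` is taken to map a node name to the number of items assigned to it; the second rule is read from the slide's example ("chromatic color" under "color"), that is, a child is dropped when the parent's name occurs inside the child's name, using a naive substring test; and leaf terms are never removed. An eliminated node is replaced by its children.

```python
def compress(tree, n=2, dist=None):
    """Return a compressed copy of a nested-dict tree {name: subtree}."""
    dist = dist or {}
    max_dist = max(dist.values(), default=0)

    def walk(name, kids, parent):
        # Compress the subtree first; an eliminated child is
        # replaced by its own (already compressed) children.
        new_kids = {}
        for child, sub in kids.items():
            new_kids.update(walk(child, sub, name))
        if parent is None or not new_kids:
            return {name: new_kids}  # always keep the root and leaf terms
        frequent = dist.get(name, 0) > 0.1 * max_dist
        too_few = len(new_kids) < n and not frequent  # rule 1
        redundant = parent in name                    # rule 2 (assumed reading)
        return new_kids if too_few or redundant else {name: new_kids}

    out = {}
    for root, kids in tree.items():
        out.update(walk(root, kids, None))
    return out

tree = {"color": {"chromatic color": {
    "red, redness": {"red": {}},
    "blue, blueness": {"blue": {}},
    "green, greenness": {"green": {}},
}}}
print(compress(tree))
```

With n = 2, the single-child nodes such as "red, redness" are removed by the first rule, and "chromatic color" is removed by the second, leaving red, blue, and green directly under color.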
16
5. Remove Top Levels
Top levels of WordNet are too general, e.g.:
  entity
  substance, matter
  abstraction
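A minimal sketch of this step, assuming the tree is a nested dict and that the too-general nodes are given as an explicit set of names (the slide lists three examples but does not say how the cut is decided). Each removed top node is replaced by its children, so the result is a set of separate hierarchies. The name `remove_top_levels` is hypothetical.

```python
TOO_GENERAL = {"entity", "substance, matter", "abstraction"}  # examples from the slide

def remove_top_levels(tree, too_general=TOO_GENERAL):
    """Replace overly general top nodes by their children, repeatedly."""
    result = {}
    for name, kids in tree.items():
        if name in too_general:
            # Promote the children, which may themselves be too general.
            result.update(remove_top_levels(kids, too_general))
        else:
            result[name] = kids
    return result

tree = {
    "entity": {
        "substance, matter": {"food, nutrient": {"nutriment": {}}},
        "object": {"artifact": {}},
    },
    "abstraction": {"property": {"visual property": {"color": {}}}},
}
facets = remove_top_levels(tree)
print(list(facets))
```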
17
Disambiguation
Ambiguity in:
  Word senses
  Paths up the hypernym tree
2 paths for same word:
  Sense 1 for word “tuna”: organism, being => plant, flora => vascular plant => succulent => cactus => tuna
  Sense 2 for word “tuna”: organism, being => fish => food fish => tuna
2 paths for same sense (sense 2):
  organism, being => fish => food fish => tuna
  organism, being => fish => bony fish => spiny-finned fish => percoid fish => tuna
18
How to Select the Right Senses and Paths?
(This part is not in the paper.)
Solution: modify the algorithm
First: build a core tree
  (1) Create paths for words with only one sense
  (2) Use domains
    WordNet has 212 domains: medicine, mathematics, biology, chemistry, linguistics, soccer, etc.
    Automatically scan the collection to see which domains apply
    The user selects which of the suggested domains to use, or may add their own
    Paths for terms that match the selected domains are added to the core tree
Then: add the remaining terms to the core tree
19
Using Domains
Glosses for “dip”:
  Sense 1: A depression in an otherwise level surface
  Sense 2: The angle that a magnetic needle makes with the horizon
  Sense 3: Tasty mixture into which bite-size foods are dipped
Hypernyms for “dip”:
  Sense 1: solid => concave shape => depression
  Sense 2: shape, form => space => angle
  Sense 3: food => ingredient, fixings => flavorer
Given domain “food”, choose sense 3
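The slide does not say how a sense is matched to a domain. As a sketch, assume a sense matches when the domain name appears as a node on that sense's hypernym path; the hypernym data is copied from the "dip" example, and `senses_for_domains` is a hypothetical helper, not a WordNet API.

```python
DIP_SENSES = {
    1: ["solid", "concave shape", "depression"],
    2: ["shape, form", "space", "angle"],
    3: ["food", "ingredient, fixings", "flavorer"],
}

def senses_for_domains(senses, domains):
    """Return the sense numbers whose hypernym path contains a selected domain."""
    wanted = {d.lower() for d in domains}
    return [num for num, path in senses.items()
            if wanted & {name.lower() for name in path}]

print(senses_for_domains(DIP_SENSES, {"food"}))  # [3]
```

If no sense matches any selected domain, the list is empty and the term would be left for the later enrichment step.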
20
Enrich Core Tree
For each new term t:
  Q(t) = {}   // set of candidate paths
  for each path p of t:
    compute the fraction f_p(t) of nodes in p that are shared with a path in the core tree
    if f_p(t) > thresh:
      Q(t) = Q(t) U {p}
  if Q(t) = {}:
    choose the first sense of t
  else:
    among all p's in Q(t), choose the path in the core tree with the most items assigned
21
Enrich Core Tree
Core tree:
  entity => substance, matter => food, nutrient => nutriment => dish => fondue, fondu
  entity => object => artifact => instrumentality => device => conductor => semiconductor => diode => light-emitting diode (led)
Example item: Toaster with led indicators
Candidate paths for the new term “chip”:
  Chip (p1): entity => substance, matter => food, nutrient => nutriment => dish => snack food => chip
  Chip (p2): entity => object => artifact => instrumentality => device => conductor => semiconductor => chip
22
Enrich Core Tree
Core tree:
  entity => substance, matter => food, nutrient => nutriment => dish => fondue, fondu
  entity => object => artifact => instrumentality => device => conductor => semiconductor => diode => light-emitting diode (led)
Chip (p1): entity => substance, matter => food, nutrient => nutriment => dish => snack food => chip
Chip (p2): entity => object => artifact => instrumentality => device => conductor => semiconductor => chip
f_p1(Chip) = 5/7 (5 of the 7 nodes in p1 are shared with the core tree)
Q = {p1}
23
Enrich Core Tree
Core tree:
  entity => substance, matter => food, nutrient => nutriment => dish => fondue, fondu
  entity => object => artifact => instrumentality => device => conductor => semiconductor => diode => light-emitting diode (led)
Chip (p1): entity => substance, matter => food, nutrient => nutriment => dish => snack food => chip
Chip (p2): entity => object => artifact => instrumentality => device => conductor => semiconductor => chip
f_p1(Chip) = 5/7
f_p2(Chip) = 7/8 (7 of the 8 nodes in p2 are shared with the core tree)
Q = {p1, p2}
24
Enrich Core Tree (cont’d)
Core tree with item counts:
  entity => substance, matter => food, nutrient => nutriment => dish (1699) => fondue, fondu (40)
  entity => object => artifact => instrumentality => device => conductor => semiconductor (45) => diode => light-emitting diode (led)
Candidate placements for “chip”:
  dish => snack food => chip
  semiconductor => chip
Choose the dish => snack food => chip path, since it has more items assigned
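The enrichment procedure and the chip example can be sketched end to end. The paths and item counts are taken from the slides. Everything else is an assumption: the threshold value 0.5 (the slides only call it thresh), measuring "shared nodes" as the common prefix with a core path, and scoring a candidate by summing the item counts on its nodes. The function names are hypothetical.

```python
from fractions import Fraction

CORE_PATHS = [
    ["entity", "substance, matter", "food, nutrient", "nutriment", "dish",
     "fondue, fondu"],
    ["entity", "object", "artifact", "instrumentality", "device", "conductor",
     "semiconductor", "diode", "light-emitting diode (led)"],
]
ITEMS = {"dish": 1699, "fondue, fondu": 40, "semiconductor": 45}

P1 = ["entity", "substance, matter", "food, nutrient", "nutriment", "dish",
      "snack food", "chip"]
P2 = ["entity", "object", "artifact", "instrumentality", "device", "conductor",
      "semiconductor", "chip"]

def shared_prefix(path, core):
    """Number of leading nodes that path has in common with one core path."""
    n = 0
    for a, b in zip(path, core):
        if a != b:
            break
        n += 1
    return n

def shared_fraction(path, core_paths):
    """f_p(t): best fraction of nodes in path shared with a core path."""
    return max(Fraction(shared_prefix(path, c), len(path)) for c in core_paths)

def choose_path(paths, core_paths, items, thresh=Fraction(1, 2)):
    """Pick a path for a new term; paths are listed in sense order."""
    candidates = [p for p in paths if shared_fraction(p, core_paths) > thresh]
    if not candidates:
        return paths[0]  # Q(t) is empty: fall back to the first sense
    # Assumed scoring: total items assigned to the nodes on the path.
    return max(candidates, key=lambda p: sum(items.get(n, 0) for n in p))

print(shared_fraction(P1, CORE_PATHS))  # 5/7
print(shared_fraction(P2, CORE_PATHS))  # 7/8
print(choose_path([P1, P2], CORE_PATHS, ITEMS)[-2:])  # ['snack food', 'chip']
```

Both candidate paths pass the threshold, and the food path wins because dish carries 1699 items against 45 for semiconductor.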
25
Results on a Recipes/Kitchen Appliances Data Set
27
Discussion
This is very simple, but works very well
Why hasn’t this been done before?
  Because WordNet did not have enough coverage?
28
Conclusions
Can nearly automatically build a set of hierarchies by finding IS-A relations between terms using WordNet
The method has been tested on various domains: medicine, mathematics, recipes, news, arts
User study in progress
Limitations:
  The ontology has to be appropriate for the target domain
  No disambiguation between nouns, verbs, and adjectives