Automating Creation of Hierarchical Faceted Metadata Structures Emilia Stoica, Marti Hearst and Megan Richardson* School of Information, Berkeley *Dept.

Presentation transcript:

Automating Creation of Hierarchical Faceted Metadata Structures Emilia Stoica, Marti Hearst and Megan Richardson* School of Information, Berkeley *Dept. of Mathematical Sciences, NMSU

Focus: Browse Large Datasets
- The standard search interface (query box + retrieved results) is not suited for browsing and navigation
- User interfaces need to group and organize the results

How do we Create Faceted Hierarchies?
Goals:
- Help an information architect to create the hierarchy
  - Currently they do it all by hand!
- Balance depth and breadth
  - Avoid "skinny" paths
  - Don't go too deep or too broad
- Choose understandable labels
- Disambiguate between word senses

Related Work
- Automated text categorization
  - LOTS of work on this
  - Assumes that a set of categories is already created
- Little if any work on building facet hierarchies

Castanet
- Carves out a structure from the hypernym (IS-A) relations within WordNet
- Semi-automatic algorithm for creating hierarchical faceted metadata
- Produces surprisingly good results for a wide range of subjects, e.g., recipes, medicine, math, news, fine arts image description

WordNet Challenges
- A word may have more than one sense
- Fine granularity of word sense distinctions, e.g.:
  - newspaper (#1): daily publication on folded sheets
  - newspaper (#3): physical object
- Ambiguity for the same sense
[Figure: hypernym diagram for "tuna" with the labels #1, cactus, #2, fish, food fish, bony fish]

WordNet Challenges (cont.)
- The hypernym path may be quite long (e.g., sense #3 of tuna has 14 nodes)
- Sparse coverage of proper names and noun phrases (not addressed)

Our Approach
Inputs: Documents, WordNet
1. Select terms
2. Build core tree
3. Augment core tree
4. Compress tree
5. Divide into facets (remove top level categories)

1. Select Terms
- Select well-distributed terms from the collection
- Eliminate stopwords
- Retain only those terms with a distribution higher than a threshold (default: top 10%)
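
The selection step could be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes that "distribution" means document frequency and that "top 10%" means the top tenth of the ranked vocabulary. The stopword list, the `select_terms` name, and the sample documents are all made up for the example.

```python
import math
from collections import Counter

# Illustrative stopword list (an assumption, not from the slides).
STOPWORDS = {"a", "and", "in", "of", "the", "with"}


def select_terms(docs, top_fraction=0.10, stopwords=STOPWORDS):
    """Return {term: document frequency} for the best-distributed terms."""
    doc_freq = Counter()
    for text in docs:
        # A term counts once per document, however often it occurs there.
        doc_freq.update(set(text.lower().split()) - stopwords)
    # Rank by document frequency; break ties alphabetically for determinism.
    ranked = sorted(doc_freq, key=lambda term: (-doc_freq[term], term))
    keep = max(1, math.ceil(len(ranked) * top_fraction))
    return {term: doc_freq[term] for term in ranked[:keep]}


# Hypothetical four-document collection.
docs = ["sundae with ice cream", "sherbet and sundae", "sundae parfait", "the parfait"]
top_terms = select_terms(docs)
```

The returned document frequencies are what a later step would need if each term adds its number of documents to the nodes on its hypernym path.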

2. Build Core Tree
- Build a "backbone"
  - Create paths from unambiguous terms only
  - Bias the structure towards appropriate senses of words
- Get hypernym path if term:
  - has only one sense, or
  - matches a pre-selected WordNet domain
- Adding a new term increases a count at each node on its path by # of docs with the term
[Figure: hypernym paths for the terms "sundae" and "sherbet"; nodes: entity; substance, matter; nutriment; dessert; frozen dessert; ice cream sundae; sherbet, sorbet]

2. Build Core Tree (cont.)
- Merge hypernym paths to build a tree
[Figure: the "sundae" and "sherbet" hypernym paths merged into a single tree; nodes: entity; substance, matter; nutriment; dessert; frozen dessert; ice cream sundae; sherbet, sorbet]
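
A rough sketch of merging hypernym paths while keeping per-node counts is given below. The node layout (a dict with `count` and `children`), the function names, and the document counts 30 and 12 are assumptions for illustration. The ordering of the two paths is one reading of the slide's figure, and the paths are hard-coded rather than looked up in WordNet.

```python
def new_node():
    """An empty tree node: a document count plus labelled children."""
    return {"count": 0, "children": {}}


def build_core_tree(paths_with_counts):
    """Merge root-to-term hypernym paths into one tree.

    paths_with_counts: iterable of (path, n_docs), where path is a list of
    labels from the most general node down to the term's own synset.
    """
    root = new_node()
    for path, n_docs in paths_with_counts:
        node = root
        for label in path:
            # Shared prefixes reuse the same node, so the paths merge.
            node = node["children"].setdefault(label, new_node())
            # Each node on the path gains the number of docs with the term.
            node["count"] += n_docs
    return root


PREFIX = ["entity", "substance, matter", "nutriment", "dessert", "frozen dessert"]
core = build_core_tree([
    (PREFIX + ["ice cream sundae"], 30),  # term "sundae", hypothetical 30 docs
    (PREFIX + ["sherbet, sorbet"], 12),   # term "sherbet", hypothetical 12 docs
])
```

With these inputs the shared "frozen dessert" node ends up with a count of 42 and two children, one per term.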

3. Augment Core Tree
- Attach to the core tree the terms with more than one sense
- Favor the more common path over other alternatives

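Augment Core Tree (cont.)
[Figure: two candidate hypernym paths for the ambiguous term "date"]
- Date (p1): entity; abstraction; measure, quantity; fundamental quantity; time period; calendar day (18); terms attached: Sunday, date
- Date (p2): entity; substance, matter; food, nutrient; nutriment; food; edible fruit (78); terms attached: berries, date
- Choose the path through edible fruit (78), since it has more items assigned than the path through calendar day (18)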
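
One way to express "favor the more common path" is sketched below. It assumes that a candidate path is scored by the count of the deepest node it shares with the existing core tree. That scoring rule is an interpretation of the slide, and the shortened paths, the count of 96 on "entity", and the helper names are hypothetical.

```python
def make_node(count, children=None):
    return {"count": count, "children": children or {}}


def path_support(tree, path):
    """Count stored at the deepest node of `path` already present in the tree."""
    node, support = tree, 0
    for label in path:
        node = node["children"].get(label)
        if node is None:
            break  # the rest of the path is not in the core tree yet
        support = node["count"]
    return support


def choose_path(tree, candidate_paths):
    """Pick the candidate hypernym path with the most items already assigned."""
    return max(candidate_paths, key=lambda path: path_support(tree, path))


# Abbreviated core tree with the counts shown on the slide (18 and 78).
core = make_node(0, {"entity": make_node(96, {
    "abstraction": make_node(18, {"calendar day": make_node(18)}),
    "substance, matter": make_node(78, {"edible fruit": make_node(78)}),
})})
p1 = ["entity", "abstraction", "calendar day", "date"]
p2 = ["entity", "substance, matter", "edible fruit", "date"]
chosen = choose_path(core, [p1, p2])
```

Here `chosen` is the edible-fruit path, matching the slide's choice for a recipe collection.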

Optional Step: Domains
- To disambiguate, use Domains
- WordNet has 212 Domains: medicine, mathematics, biology, chemistry, linguistics, soccer, etc.
- A better collection has been developed by Magnini (2000)
  - Assigns a domain to every noun synset
- Automatically scan the collection to see which domains apply
- The user selects which of the suggested domains to use, or may add their own
- Paths for terms that match the selected domains are added to the core tree

Using Domains
dip glosses:
- Sense 1: A depression in an otherwise level surface
- Sense 2: The angle that a magnetic needle makes with the horizon
- Sense 3: Tasty mixture into which bite-size foods are dipped
dip hypernyms:
- Sense 1: solid => concave shape => depression
- Sense 2: shape, form => space => angle
- Sense 3: food => ingredient, fixings => flavorer
Given domain "food", choose sense 3
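
The domain-based choice can be illustrated with a small sketch. The hypernym chains come from the slide, but the slide only names the domain of sense 3, so the domain labels for senses 1 and 2 are hypothetical placeholders. Returning `None` when zero or several senses match is also an assumption, since the slides do not say how such cases are handled.

```python
# Senses of "dip"; "geology" and "physics" are made-up domain labels.
DIP_SENSES = [
    {"sense": 1, "hypernyms": ["solid", "concave shape", "depression"], "domain": "geology"},
    {"sense": 2, "hypernyms": ["shape, form", "space", "angle"], "domain": "physics"},
    {"sense": 3, "hypernyms": ["food", "ingredient, fixings", "flavorer"], "domain": "food"},
]


def pick_sense(senses, selected_domains):
    """Return the single sense whose domain was selected, else None."""
    matches = [sense for sense in senses if sense["domain"] in selected_domains]
    # Only an unambiguous match is used; otherwise leave the term undecided.
    return matches[0] if len(matches) == 1 else None


chosen = pick_sense(DIP_SENSES, {"food"})
```

With the "food" domain selected, `chosen` is sense 3, whose hypernym path would then be added to the core tree.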

4. Compress Tree
- Rule 1: Eliminate a parent with fewer than k children, unless it is the root or its distribution is larger than 0.1 * max dist
[Figure: example tree before and after Rule 1. Before: dessert; frozen dessert; ice cream sundae; sundae; sherbet, sorbet; sherbet; parfait. After: dessert; frozen dessert; sundae; parfait; sherbet.]
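
Rule 1 might be sketched as below. The sketch assumes that "distribution" is the node's document count, that "max dist" is the largest count in the tree, that leaves are never treated as parents, and that the children of an eliminated parent move up to the grandparent. The counts and k = 2 are hypothetical, and label collisions after promotion are not handled.

```python
def make_node(count, children=None):
    return {"count": count, "children": children or {}}


def max_node_count(tree):
    """Largest count anywhere in the tree (stand-in for 'max dist')."""
    return max([tree["count"]] + [max_node_count(c) for c in tree["children"].values()])


def compress_rule1(label, tree, k=2, max_count=None, is_root=True):
    """Return the list of (label, node) pairs that replace this node."""
    if max_count is None:
        max_count = max_node_count(tree)
    children = {}
    for child_label, child in tree["children"].items():
        # Compress bottom-up; an eliminated child is replaced by its children.
        children.update(compress_rule1(child_label, child, k, max_count, False))
    protected = is_root or tree["count"] > 0.1 * max_count
    if children and len(children) < k and not protected:
        return list(children.items())  # splice children into the grandparent
    return [(label, {"count": tree["count"], "children": children})]


dessert = make_node(1000, {"frozen dessert": make_node(60, {
    "ice cream sundae": make_node(30, {"sundae": make_node(30)}),
    "sherbet, sorbet": make_node(12, {"sherbet": make_node(12)}),
    "parfait": make_node(18),
})})
[(root_label, compressed)] = compress_rule1("dessert", dessert)
```

In this example "ice cream sundae" and "sherbet, sorbet" each have one child and a small count, so they are removed and "sundae" and "sherbet" become direct children of "frozen dessert".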

4. Compress Tree (cont.)
- Rule 2: Eliminate a child whose name appears within the parent's name
[Figure: example tree before and after Rule 2. Before: dessert; frozen dessert; sundae; parfait; sherbet. After: dessert; sundae; parfait; sherbet.]
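
A possible reading of Rule 2 is sketched below. Following the slide's example, where "frozen dessert" is removed from under "dessert", the check is whether every word of the parent's label also occurs in the child's label. That word-level test is an interpretation, and keeping matching leaves (so that no term disappears) is a further assumption. Counts are hypothetical.

```python
def make_node(count, children=None):
    return {"count": count, "children": children or {}}


def compress_rule2(label, tree):
    """Return (label, node) with name-repeating children spliced out."""
    children = {}
    for child_label, child in tree["children"].items():
        _, child = compress_rule2(child_label, child)
        # True when the parent's words all recur in the child's label,
        # e.g. parent "dessert" and child "frozen dessert".
        repeats_parent = set(label.split()) <= set(child_label.split())
        if repeats_parent and child["children"]:
            children.update(child["children"])  # drop child, keep grandchildren
        else:
            children[child_label] = child
    return label, {"count": tree["count"], "children": children}


dessert = make_node(60, {"frozen dessert": make_node(60, {
    "sundae": make_node(30),
    "parfait": make_node(18),
    "sherbet": make_node(12),
})})
root_label, compressed = compress_rule2("dessert", dessert)
```

After the call, "sundae", "parfait" and "sherbet" hang directly under "dessert", as in the slide's after-picture.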

5. Divide into Facets

5. Divide into Facets (Remove top levels)
- Rule 1: Eliminate the top t levels (t = 4 for recipe collection)
- Rule 2: For each resulting tree, test if it has at least n children (n = 2). If yes, stop. If not, delete the root and repeat.
- Manual cleaning: remove facets that don't make sense
[Figure: example tree before and after removing the top levels. Before: entity; substance, matter; food, nutriment; ingredient, fixings; food stuff, food product; sweetening; sugar; syrup; flavorer; herb; parsley; oregano. After: sweetening; sugar; syrup; flavorer; herb; parsley; oregano.]
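
The two rules can be sketched as follows, using the slide's defaults t = 4 and n = 2. The function names and the exact parent/child arrangement of the example tree are assumptions, since the figure's layout is not fully recoverable from the transcript. Manual cleaning is left out.

```python
def node(children=None):
    return {"children": children or {}}


def remove_top_levels(label, tree, t):
    """Rule 1: return the (label, subtree) pairs left after deleting t levels."""
    if t == 0:
        return [(label, tree)]
    remaining = []
    for child_label, child in tree["children"].items():
        remaining.extend(remove_top_levels(child_label, child, t - 1))
    return remaining


def divide_into_facets(label, tree, t=4, n=2):
    """Apply Rule 1, then Rule 2, and return the facet roots in order."""
    facets = []
    pending = remove_top_levels(label, tree, t)
    while pending:
        root_label, subtree = pending.pop(0)
        if len(subtree["children"]) >= n:
            facets.append((root_label, subtree))  # enough children: a facet
        else:
            # Too few children: delete this root and test its subtrees.
            pending.extend(subtree["children"].items())
    return facets


entity = node({"substance, matter": node({"food, nutriment": node({
    "food stuff, food product": node({"sweetening": node({"sugar": node(), "syrup": node()})}),
    "ingredient, fixings": node({"flavorer": node({"herb": node({"parsley": node(), "oregano": node()})})}),
})})})
facets = divide_into_facets("entity", entity)
```

Under these assumptions the facets are "sweetening" and "herb": "flavorer" has a single child, so Rule 2 deletes it and promotes "herb".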

Example: Recipes (13,500 docs)

Castanet Output (shown in Flamenco)

Castanet Output

Castanet Evaluation
- This is a tool for information architects (IAs), so information architects did the evaluation
- Each IA compared Castanet to other state-of-the-art algorithms
  - LDA (Blei et al. 04)
  - Subsumption (Sanderson & Croft '99)
  - Baseline: most frequent terms in the collection
- Datasets: 13,000 recipes from Southwestcooking.com

Subsumption Output

LDA Output

Evaluation Method
- For each of 2 systems' output:
  - Examined and commented on top-level
  - Examined and commented on two sub-levels
- Then comment on overall properties
  - Meaningful?
  - Systematic?
  - Likely to use in your work?
[Figure: system pairings labeled "L C" and "S C", bracketed with the numbers 16 and 18]

Evaluation (cont.)
Sample questions for top level categories:
- Would you add/remove/rename any category?
- Did this category match your expectations?
Sample questions for a specific category:
- Would you add/move/remove any sub-categories?
- Would you promote any sub-category to top level?
General questions:
- Would you use Castanet?
- Would you use LDA?
- Would you use Subsumption?
- Would you use list of most frequent terms?

Evaluation Results
"Would you use this system in your work?" (answers "yes definitely" or "yes, in some cases")
- Castanet: 85%
- LDA: 0%
- Subsumption: 37%
- Baseline: 74%
Average response to questions about quality (4 = "strongly agree", 3 = "agree somewhat", 2 = "disagree somewhat", 1 = "strongly disagree")

Evaluation Results
- Average responses for top-level categories (4 = "no changes", 3 = "one or two", 2 = "a few", 1 = "many")
- Average responses for 2 subcategories

Needed Improvements
- Take spelling variations and morphological variants into account
- Use verbs and adjectives, not just nouns
- Normalize noun phrases
- Allow terms to have more than one sense
- Improve algorithm for assigning documents to categories

Conclusions
- Flexible application of hierarchical faceted metadata is a proven approach for navigating large information collections
  - Midway in complexity between simple hierarchies and deep knowledge representation
  - Currently in use on e-commerce sites; spreading to other domains
- Systems are needed to help create faceted metadata structures
- Our WordNet-based algorithm, while not perfect, seems like it will be a useful tool for Information Architects

Conclusions
- Castanet builds a set of faceted hierarchies by finding IS-A relations between terms using WordNet
- The method has been tested on various domains: medicine, recipes, math, news, description of images
- Usability study shows:
  - Castanet is preferred to other state-of-the-art solutions
  - Information architects want to use the tool in their work
- Future work: apply to tags (flickr, delicious)

Learn More
- Funding: This work supported in part by NSF (IIS )
- For more information: Stoica, E., Hearst, M., and Richardson, M., Automating Creation of Hierarchical Faceted Metadata Structures, NAACL/HLT 2007
- See