Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time? Nancy Ide and Jean Veronis. Proc. KB&KB’93 Workshop, 1993, pp. 257-266.


Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time? Nancy Ide and Jean Veronis. Proc. KB&KB’93 Workshop, 1993, pp. 257-266. As (mis-)interpreted by Peter Clark

The Postulates of MRD Work P1: MRDs contain information that is useful for NLP

The Postulates of MRD Work P1: MRDs contain information that is useful for NLP P2: This info is relatively easy to extract from MRDs e.g., extraction of hypernyms (generalizations):  Dipper isa Ladle isa Spoon isa Utensil
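The extraction step P2 has in mind can be sketched with the classic "genus term" heuristic: take the head noun of a definition's opening noun phrase as the hypernym, then follow the chain upward. The toy entries, the stop-word list, and the helper names below are invented for illustration, not the authors' actual method.

```python
# Toy MRD: headword -> first-sense definition text.
ENTRIES = {
    "dipper": "a ladle with a long handle, used for dipping liquids",
    "ladle": "a large deep spoon for serving soup or stew",
    "spoon": "a utensil consisting of a small shallow bowl on a handle",
}

def genus_term(definition):
    """Head noun of the genus phrase: the last word before the first
    preposition, participle, or comma (a crude stand-in for real parsing)."""
    stop = {"with", "for", "of", "on", "in", "used", "consisting",
            "that", "which", ","}
    words = definition.replace(",", " , ").split()
    if words and words[0] in {"a", "an", "the"}:
        words = words[1:]
    head = None
    for w in words:
        if w in stop:
            break
        head = w
    return head

def hypernym_chain(word, entries):
    """Follow genus terms upward until we leave the dictionary or loop."""
    chain = [word]
    while word in entries:
        word = genus_term(entries[word])
        if word is None or word in chain:  # guard against circular definitions
            break
        chain.append(word)
    return chain

print(hypernym_chain("dipper", ENTRIES))  # ['dipper', 'ladle', 'spoon', 'utensil']
```

Even this tiny sketch shows why the postulate is tempting: three flat definitions yield a four-level taxonomy with no hand-built knowledge at all.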

But… Not much to show for it so far (1993) –handful of limited and imperfect taxonomies –few studies on the quality of knowledge in MRDs –few studies on extracting more complex info

Complaints… P1: useful info in MRDs: –C1a: 50%-70% of info in dictionaries is “garbled” –C1b: sense definitions ≠ concept usage (“real concepts”) –C1c: some types of knowledge simply not there P2: Info can be easily extracted Most successes have been for hypernyms only –C2a: MRD formats are a nightmare to deal with –C2b: A virtually open-ended set of ways of describing facts –C2c: Bootstrapping: Need a KB to build a KB from an MRD

C1a: MRD information is “garbled” Multiple people, multiple years effort Space restrictions, syntactic restrictions Particular problem 1: –Attachment of terms too high (21%-34%) e.g., “pan” and “bottle” are “vessels”, but “cup” and “bowl” are simply “containers” occurs fairly randomly –Categories less clear at top levels “fork” and “spoon” are ok, but “implement” and “utensil” = ? Sometimes no word there to refer to a concept –leads to circular definitions


C1a: MRD information is “garbled” Particular problem 2: –Categories less clear at top levels “fork” and “spoon” are ok, but “implement” and “utensil” = ? Leads to disjuncts e.g. “implement or utensil” Sometimes no word there to refer to a concept –leads to circular definitions –leads to “covert categories”, e.g., INSTRUMENTAL-OBJECT (a hypernym for “tool”, “utensil”, “instrument”, and “implement”)

C1a: MRD information is “garbled” Particular problem 3: –And hypernyms are relatively consistent!! Other semantic relations are given in a less consistent way, e.g., smell, taste, etc.

C1b: sense definitions ≠ concept usage (“real concepts”) Ambiguity of word senses, e.g., –87% of words in a sample fit > 1 word sense Word senses don’t reflect actual use Word sense distinctions differ between MRDs –level of detail –way lines are drawn between senses –no definitive set of distinctions

C1c: some types of knowledge simply not there no broad contextual or world knowledge, e.g., –no connection between “lawn” and “house”, or between “ash” and “tobacco” –“restaurant, eating house, eating place -- (a building where people go to eat)” [WordNet] No mention that it’s a commercial business, e.g., for “the waitress collected the check.”

C2a: MRD formats are a nightmare to deal with Ambiguities / inconsistencies in typesetter format Complex grammars for entries Conventions are inconsistent, e.g. bracketing for –“Canopic jar, urn, or vase” vs. –“Junggar Pendi, Dzungaria, or Zungaria” Need a lot of hand pre-processing –not much general value to this –is a vast task in itself –not many processed dictionaries available
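A minimal sketch of the kind of ad-hoc parsing C2a complains about, assuming an invented typesetter markup (`\B…\b` for the bold headword, `\I…\i` for the italic part of speech, numbered senses in running text); real dictionary tapes were far messier and less regular than this.

```python
import re

# One entry in a made-up typesetter format (for illustration only).
RAW = r"\Bpan\b \Inoun\i 1. a wide metal vessel used in cooking. 2. a leaf of the betel vine."

def parse_entry(raw):
    """Pull headword, part of speech, and sense definitions out of the
    typesetter codes with hand-written patterns."""
    head = re.search(r"\\B(.*?)\\b", raw).group(1)
    pos = re.search(r"\\I(.*?)\\i", raw).group(1)
    senses = re.findall(r"\d+\.\s*([^.]+)\.", raw)
    return {"headword": head, "pos": pos, "senses": senses}

print(parse_entry(RAW))
```

Every dictionary (and often every section of one dictionary) needed its own such patterns, which is why the slide calls the hand pre-processing "a vast task in itself" with little general value.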

C2b: A virtually open-ended set of ways of describing facts But… there is “virtually an open-ended set of phrases…” by which a given fact can be expressed

C2c: Bootstrapping: Need a KB to build a KB Need knowledge to do NLP on MRDs! –e.g. “carry by means of a handle” vs. “carry by means of a wagon” But undisambiguated hierarchy is unusable, e.g., –“saucepan” isa “pan” isa “leaf” ⇒ need to build your KB before you even start on the MRD
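The circularity shows up even in picking which sense of a hypernym a definition points to ("pan" the vessel vs. "pan" the betel leaf). A crude Lesk-style word-overlap score can sometimes break the tie, but note it already presupposes usable glosses; the senses and glosses below are invented for illustration.

```python
# Candidate senses of the extracted hypernym "pan" (invented glosses).
SENSES = {
    "pan#1": "a wide metal vessel used in cooking",
    "pan#2": "a leaf of the betel vine chewed in South Asia",
}

def best_sense(context, senses):
    """Pick the sense whose gloss shares the most words with the context
    (a bag-of-words Lesk-style heuristic; no stemming, no stop list)."""
    ctx = set(context.lower().split())
    def overlap(sense):
        return len(ctx & set(senses[sense].lower().split()))
    return max(senses, key=overlap)

saucepan_def = "a deep cooking pan made of metal with a long handle"
print(best_sense(saucepan_def, SENSES))  # 'pan#1'
```

When the glosses share no vocabulary with the defining text, the heuristic fails, which is exactly the bootstrapping problem: you need background knowledge to connect the words.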

Synthesis Underlying postulate of P1 and P2: –P0: Large KBs cannot be built by hand Counterexamples: –Cyc –Dictionaries themselves! And besides… –KBs are too hard to extract from MRDs –don’t contain all the knowledge needed But: MRD contributions: –understanding the structure of dictionaries –convergence of NLP, lexicography, and electronic publishing interests

Ways forward… Combining Knowledge Sources: –One dictionary has 55%-70% of “problematic cases” [of incompleteness], but 5 dictionaries reduced this to 5% Also should combine knowledge from corpora as a means of “filling out” KBs Prediction: –KBs built by people, using corpora and text extraction technology tools, and combined together by hand (Schubert-style; Code4; Ikarus)
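The "combine knowledge sources" idea can be sketched as a simple majority vote over the hypernym each source assigns to a word; the five per-dictionary assignments below are invented for illustration, not taken from the study.

```python
from collections import Counter

# Hypernym assigned to each word by five (hypothetical) dictionaries.
ASSIGNMENTS = {
    "cup": ["container", "vessel", "vessel", "vessel", "container"],
    "pan": ["vessel", "vessel", "utensil", "vessel", "vessel"],
}

def merged_hypernym(word, assignments):
    """Keep the attachment the majority of sources agree on, smoothing
    over any single dictionary's too-high or idiosyncratic attachment."""
    counts = Counter(assignments[word])
    return counts.most_common(1)[0][0]

print(merged_hypernym("cup", ASSIGNMENTS))  # 'vessel'
```

Voting is the crudest possible merge; the slide's point is just that independent sources rarely garble the same entry in the same way, so agreement recovers structure no single dictionary states reliably.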

Ways forward… MRDs will become encoded more consistently Better analysis needed of the types of knowledge needed for NLP –perhaps don’t need the kind of precision in a KB Exploitation of associational information –Very useful for sense disambiguation (e.g., Harabagiu)

Ways forward… Lexicographers increasingly interested in using lexical databases for their work Could create an NLP-like KB directly –Create explicit semantic links between word entries –Ensure consistency of content (e.g., using templates/frames ensures all the important information is provided) –Ensure consistency of “metatext” (i.e., be consistent about how semantic relations are stated) –Ensure consistency of sense division e.g., “cup” and “bowl” have two senses (literal and metonymic) but “glass” only has one (literal) ⇒ could spot this inconsistency
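The templates/frames point can be sketched as a required-slot check: if every artefact noun must fill the same frame, missing information becomes machine-detectable instead of an editorial accident. The slot names and the sample entry below are invented for illustration.

```python
# Slots every artefact-noun entry must fill under a (hypothetical) template.
REQUIRED_SLOTS = {"hypernym", "function", "material"}

ENTRY = {"headword": "cup",
         "hypernym": "vessel",
         "function": "drinking",
         "material": "ceramic or glass"}

def check_entry(entry, required=REQUIRED_SLOTS):
    """Return the sorted list of required slots the entry fails to fill."""
    missing = required - set(entry)
    return sorted(missing)

print(check_entry(ENTRY))                                        # []
print(check_entry({"headword": "glass", "hypernym": "vessel"}))  # ['function', 'material']
```

The same mechanism could flag the sense-division inconsistency on the slide: if "cup" and "bowl" each carry a metonymic sense slot and "glass" does not, the gap is visible to a script rather than left to a reader to notice.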