Published by Trevor Golden. Modified over 9 years ago.
1
Using Ontologies to Enable Access to Multiple Heterogeneous Databases
CARDGIS
Eduard Hovy
Information Sciences Institute, University of Southern California
(in collaboration with Columbia University)
2
Context: CARDGIS Project
Enable access to multiple, heterogeneous Federal agency data sources through a single interface using standardized nomenclature, while accounting for semantic variability.
Sources:
–Energy Information Administration (quarterly CD-ROM).
–Bureau of Labor Statistics (http://stats.bls.gov).
–Census Bureau (CD-ROM for 1992 data).
–California Energy Commission (weekly data at http://energy.ca.gov).
3
System Architecture
Construction phase: deploy DBs, extend ontology
–Ontology Construction: DB analysis, text analysis
User phase: compose query
–User Interface: ontology browser, query constructor
Access phase: create DB query, retrieve data
–Query Processor: reformulation, cost optimization
Integrated Ontology
–global terminology
–source descriptions
–integration axioms
Sources (RST)
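The access phase can be pictured with a small sketch. Everything in it is hypothetical: the source names, the column names, the `ANCHORS` table and the `reformulate` helper are illustrative assumptions, not the CARDGIS source descriptions or query processor. It only shows how a query phrased in global terminology might be rewritten once per source.

```python
from dataclasses import dataclass, field


@dataclass
class SourceDescription:
    """Hypothetical description of one database in terms of global concepts."""
    name: str
    columns: dict = field(default_factory=dict)  # global concept -> local column


# Invented source descriptions; the real sources use their own schemas.
SOURCES = [
    SourceDescription("bls", {"income": "avg_wage", "region": "state"}),
    SourceDescription("census", {"income": "hh_income", "region": "st_code", "year": "yr"}),
]

# Terminology variants anchored to a single global concept (assumed table).
ANCHORS = {"salary": "income", "wage": "income", "income": "income",
           "state": "region", "region": "region", "year": "year"}


def reformulate(user_terms, sources=SOURCES):
    """Map user terms to global concepts, then to per-source column lists.

    A source is used only if it can supply every requested concept.
    """
    concepts = [ANCHORS[term.lower()] for term in user_terms]
    plans = {}
    for src in sources:
        if all(c in src.columns for c in concepts):
            plans[src.name] = [src.columns[c] for c in concepts]
    return plans


print(reformulate(["salary", "state"]))
# {'bls': ['avg_wage', 'state'], 'census': ['hh_income', 'st_code']}
```

A real query processor would also apply integration axioms and cost optimization; this sketch covers only the term mapping.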
4
So What is an Ontology?
Desiderata:
–‘anchor points’ for terminology variants (salary, income…),
–wide coverage,
–some degree of taxonomic organization for inference/program behavior control.
Terminological (not domain) ontology.
5
ISI’s SENSUS Ontology
–Taxonomy, multiple superclass links.
–Approx. 90,000 items.
–Top level: Penman Upper Model (ISI).
–Body: WordNet (Princeton), rearranged.
–Used at ISI for machine translation, text summarization, database access.
http://vigor.isi.edu:8002/sensus2/
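A taxonomy with multiple superclass links is a directed acyclic graph rather than a tree, so inference has to follow every upward path. The sketch below uses an invented mini-taxonomy (the concept names are not SENSUS entries) and two hypothetical helpers, `ancestors` and `subsumes`.

```python
# Hypothetical mini-taxonomy: each concept lists its direct superclasses.
SUPERCLASSES = {
    "salary": ["income", "payment"],        # two superclass links
    "income": ["financial-quantity"],
    "payment": ["transfer-event"],
    "financial-quantity": ["quantity"],
    "transfer-event": ["event"],
    "quantity": [],
    "event": [],
}


def ancestors(concept, superclasses=SUPERCLASSES):
    """Return every concept reachable by following superclass links upward."""
    seen = set()
    stack = list(superclasses.get(concept, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(superclasses.get(parent, []))
    return seen


def subsumes(general, specific, superclasses=SUPERCLASSES):
    """True if `general` is `specific` itself or one of its ancestors."""
    return general == specific or general in ancestors(specific, superclasses)


print(sorted(ancestors("salary")))
print(subsumes("event", "salary"), subsumes("salary", "income"))
```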
6
3 Ways of Building Ontologies
1. Combine existing knowledge resources: ontology alignment.
2. Learn from texts and Web: extract word families for thousands of concepts.
3. Parse dictionary definitions: extract information and place into ontology.
7
1. Cross-Ontology Alignment
Why create a new ontology? Merge and reuse existing ones!
Problem: automatically find corresponding concepts.
1. Text Matches
–concept names (cognates; reward for delimiter confluence...)
–textual definitions (string matching, demorphing, stop words...) [Knight & Luk 94, Dalianis & Hovy 98]
2. Hierarchy Matches
–shared superconcepts, to filter ambiguity [Knight & Luk 94]
–semantic distance [Agirre et al. 94]
3. Data Item and Form Matches
–inter-concept relations [Ageno et al. 94; Rigau & Agirre 95]
–slot-filler restrictions [Okumura & Hovy 94]
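The text-match heuristics can be sketched as follows. This is a simplified illustration, not the cited algorithms: it assumes Jaccard overlap as the similarity measure, a tiny hand-picked stop-word list, equal weights for names and definitions, and no demorphing. The names `tokens`, `match_score` and `suggest` are hypothetical.

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "or", "and", "to", "in", "for", "by", "is"}


def tokens(text):
    """Lowercase, split on any non-letter (so '-' and '_' act alike), drop stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


def jaccard(x, y):
    """Size of the intersection over size of the union; 0.0 for two empty sets."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def match_score(c1, c2, w_name=0.5, w_def=0.5):
    """Weighted sum of name overlap and definition overlap (weights are assumptions)."""
    name = jaccard(tokens(c1["name"]), tokens(c2["name"]))
    definition = jaccard(tokens(c1["definition"]), tokens(c2["definition"]))
    return w_name * name + w_def * definition


def suggest(onto_a, onto_b, threshold=0.3):
    """Return (score, name_a, name_b) for pairs at or above the threshold, best first."""
    pairs = [(match_score(a, b), a["name"], b["name"]) for a in onto_a for b in onto_b]
    return sorted((p for p in pairs if p[0] >= threshold), reverse=True)


a = {"name": "financial-gain", "definition": "money received as income"}
b = {"name": "FINANCIAL_GAIN", "definition": "income or money received"}
c = {"name": "tree", "definition": "tall woody plant"}
print(suggest([a], [b, c]))
```

Hierarchy and slot-filler matches would then filter these suggestions; they are not shown here.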
8
Cross-Ontology Alignment Results
Ontologies:
–SENSUS Upper Model (350)
–CYC top region (2400) [Lenat; Lehmann 96]
–MIKROKOSMOS (4790 concepts) [Mahesh 96]
–SENSUS top region (6768)
Recall (how many links were missed?): difficult to count! … 32.4 million pairs
Precision (how many suggested links are correct?):
–0.252 (strict)
–0.517 (lenient)
After 5 runs:
–883 suggestions (= 13% of SENSUS candidates)
–correct: 244 (= 3.6%)
–near miss: 256 (= 3.8%)
–wrong: 383 (= 5.6%)
1996 1997
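The counts on this slide can be cross-checked with a few lines of arithmetic. Two readings are assumed here because the slide does not state them: that the 32.4 million pairs come from pairing the SENSUS top region with MIKROKOSMOS, and that the percentages are taken over the 6768 SENSUS top-region concepts. The strict and lenient precision values are quoted from the slide and not recomputed.

```python
# Counts copied from the slide.
sensus_top, mikrokosmos = 6768, 4790
suggestions, correct, near_miss, wrong = 883, 244, 256, 383

# Assumed reading: every SENSUS top-region concept paired with every MIKROKOSMOS concept.
candidate_pairs = sensus_top * mikrokosmos

# Assumed reading: percentages are shares of the SENSUS top-region concepts.
shares = {
    "suggestions": round(100 * suggestions / sensus_top, 1),
    "correct": round(100 * correct / sensus_top, 1),
    "near miss": round(100 * near_miss / sensus_top, 1),
    "wrong": round(100 * wrong / sensus_top, 1),
}

print(candidate_pairs)   # 32418720, i.e. about 32.4 million
print(shares)            # {'suggestions': 13.0, 'correct': 3.6, 'near miss': 3.8, 'wrong': 5.7}
```

Under these readings the three outcome counts add up to the 883 suggestions, and the shares agree with the slide apart from the last one, which rounds to 5.7 where the slide shows 5.6.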
9
2. The Websucker
Corpus
–Training set WSJ 1987: 16,137 texts (32 topics).
–Test set WSJ 1988: 12,906 texts (31 topics).
–Texts indexed into categories by humans.
Signature data
–300 terms each, using tf.idf.
–Word forms: single words, demorphed words, multi-word phrases.
How many terms in signatures?
–5, 10, 15, …, 300 terms.
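A topic signature of this kind can be sketched as the top-ranked tf.idf terms per topic. The slide does not give the exact tf.idf variant, so the sketch assumes one common form: tf is the term count over all texts of a topic, and idf is the log of the number of topics divided by the number of topics containing the term. The function name `topic_signatures` and the toy texts are hypothetical, and only single lowercase words are handled (no demorphing, no multi-word phrases).

```python
import math
from collections import Counter


def topic_signatures(docs_by_topic, n_terms=300):
    """Rank each topic's terms by tf.idf and keep the top n_terms."""
    # Term frequency: counts pooled over all texts of a topic.
    tf = {topic: Counter(w for doc in docs for w in doc.lower().split())
          for topic, docs in docs_by_topic.items()}
    n_topics = len(tf)
    # Document frequency: in how many topics each term occurs.
    df = Counter(term for counts in tf.values() for term in counts)
    signatures = {}
    for topic, counts in tf.items():
        scored = {t: c * math.log(n_topics / df[t]) for t, c in counts.items()}
        # Highest score first; ties broken alphabetically for a stable order.
        ranked = sorted(scored, key=lambda t: (-scored[t], t))
        signatures[topic] = ranked[:n_terms]
    return signatures


docs = {
    "banking": ["bank raises interest rates", "the bank cut rates"],
    "aerospace": ["the jet engine failed", "new jet ordered"],
}
print(topic_signatures(docs, n_terms=2))
# {'banking': ['bank', 'rates'], 'aerospace': ['jet', 'engine']}
```

Varying `n_terms` from 5 to 300 corresponds to the signature-length question on the slide.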
10
Pollution on the Web
Cleanup: try various methods: tf.idf, χ², Latent Semantic Analysis...
11
3. Dictionary Extraction
Step 1: find unencumbered dictionary (Webster 1913).
Step 2: reformat and then parse entries (http://www.isi.edu/natural-language/dpp/).
Step 3: identify individual propositions and their heads.
Step 4: convert preps to semantic relations (EM alg).
Step 5: place entries into ontology (not yet done).
Example parse:
Babel n 2
[ SENT [ NP OR [ NP A/DT place/NN ] [ NP scene/NN ] ] [ PP of/IN [ NP AND [ NP noise/NN ] [ NP confusion/NN ] ] ] ] ;/:
[ SENT [ NP a/DT confused/JJ mixture/NN ] [ PP of/IN [ NP sounds/NNS ] ],/, as/IN [ PP of/IN [ NP languages/NNS ] ] ]./.
12
Disambiguating Extracted Info.
Identify propositions and their parts:
–Impression: “A communicating [of a mold or trait] [by an external force or influence]”
–Reflection: “The return [of light or sound waves] [by or as if by a mirror]”
by = AGENT or PATH?
–communication by force; return by mirror; return by road
of = OWNER or NUMBER-PART or SOURCE or …?
–the car of Joe; 1 of 15 people smoke; man of La Mancha
Apply EM algorithm to disambiguate.
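The slides do not describe the model behind the EM step, so the following is only a generic latent-class sketch under stated assumptions: each instance is a (head class, preposition, object class) triple, the hidden relation is drawn from the preposition's candidate list, and head and object classes are generated independently given the relation. The class labels, the seed bonuses and the `run_em` function are all invented for illustration.

```python
import math
from collections import defaultdict

# Candidate relations per preposition, as listed on the slide.
CANDIDATES = {"by": ["AGENT", "PATH"],
              "of": ["OWNER", "NUMBER-PART", "SOURCE"]}


def normalize(weights):
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


def run_em(instances, seeds, iterations=10):
    """instances: (head_class, prep, object_class) triples.
    seeds: {(relation, object_class): bonus}, used only to break the initial symmetry.
    Returns (posteriors from the last E-step, log-likelihood per iteration)."""
    heads = sorted({h for h, _, _ in instances})
    objects = sorted({o for _, _, o in instances})
    relations = [r for rs in CANDIDATES.values() for r in rs]
    prior = {p: normalize({r: 1.0 for r in rs}) for p, rs in CANDIDATES.items()}
    p_head = {r: normalize({h: 1.0 for h in heads}) for r in relations}
    p_obj = {r: normalize({o: 1.0 + seeds.get((r, o), 0.0) for o in objects})
             for r in relations}
    log_likelihoods = []
    for _ in range(iterations):
        # E-step: posterior over the candidate relations of each instance.
        posteriors, loglik = [], 0.0
        for head, prep, obj in instances:
            joint = {r: prior[prep][r] * p_head[r][head] * p_obj[r][obj]
                     for r in CANDIDATES[prep]}
            loglik += math.log(sum(joint.values()))
            posteriors.append(normalize(joint))
        log_likelihoods.append(loglik)
        # M-step: re-estimate all parameters from expected counts.
        c_prior = defaultdict(lambda: defaultdict(float))
        c_head = defaultdict(lambda: defaultdict(float))
        c_obj = defaultdict(lambda: defaultdict(float))
        for (head, prep, obj), post in zip(instances, posteriors):
            for r, g in post.items():
                c_prior[prep][r] += g
                c_head[r][head] += g
                c_obj[r][obj] += g
        for prep, counts in c_prior.items():
            prior[prep] = normalize({r: counts[r] for r in CANDIDATES[prep]})
        for r in c_head:
            if sum(c_head[r].values()) > 0:  # skip relations that lost all mass
                p_head[r] = normalize({h: c_head[r][h] for h in heads})
                p_obj[r] = normalize({o: c_obj[r][o] for o in objects})
    return posteriors, log_likelihoods


# Toy data loosely following the slide's examples; class labels are invented.
instances = [("communication", "by", "force"), ("return", "by", "mirror"),
             ("return", "by", "road"), ("travel", "by", "road"),
             ("car", "of", "person"), ("one", "of", "group"), ("man", "of", "place")]
seeds = {("AGENT", "force"): 1.0, ("PATH", "road"): 1.0, ("OWNER", "person"): 1.0,
         ("NUMBER-PART", "group"): 1.0, ("SOURCE", "place"): 1.0}
posteriors, log_likelihoods = run_em(instances, seeds)
for inst, post in zip(instances, posteriors):
    print(inst, max(post, key=post.get))
```

Without some asymmetry in the starting parameters (here the seeds), every candidate relation would stay equally likely, so the seeds stand in for whatever prior knowledge a real system would bring.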
13
Dictionary Extraction Results
Ambiguity reduction
Readings   Instances
60         1
48         1
36         1
24         1
18         7
12         8
10         2
6          764
5          12
4          20
3          108
2          310
1          902
Evaluation for sentence #1: "As a prefix to english words."
0.000000000621871299: NIL relation<abst PHRASAL speech_act
Score: 1/1 = 1
Evaluation for sentence #13: "Gives up to underwriters."
0.000000041080864587: create,make NIL RECIPIENT capitalist<so
0.000000038652300894: transmit_thou NIL RECIPIENT capitalist<so
Score: 1/2 = 0.5
Evaluation for sentence #14: "Gives all claim to the property."
0.000000002594561718: emit,utter human_action PHRASAL possessn>tr
0.000000002564569212: chnge_pos human_action PHRASAL possessn>tr
0.000000002451809783: create,make human_action PHRASAL possessn>tr
0.000000002368122454: cogitate human_action PHRASAL possessn>tr
0.000000002366411877: utilize human_action PHRASAL possessn>tr
0.000000002307022303: transmit_thou human_act PHRASAL possessn>tr
0.000000002177555675: transfer>comm human_act PHRASAL possessn>tr
0.000000002049017956: chnge>go_mad human_act PHRASA possessn>tr
Score: 1/8 = 0.125
14
The Future: Terminology Standard?
Reasons for terminology standardization:
1. Non-duplication
–similar domain models built for many applications
2. Consistency
–across experts within domain, and across domains
3. Efficient model building
–complex: many decisions required simultaneously
ANSI Ad Hoc Group on Ontology Standards (NCITS): draw together Ontology work worldwide
–IBM (Santa Teresa), Stanford, ISI, CYC, TextWise, EDR, CSLI, NMSU, Lawrence Livermore, OnTek, Government...
–Meetings: 3/96, 9/96, 3/97, 11/97, 1/98, (6/98)…
15
Questions?