What Do You Want—Semantic Understanding? (You’ve Got to be Kidding)
David W. Embley, Brigham Young University
ADC’04, January 2004. Funded in part by the National Science Foundation.


January 2004, ADC’04
What Do You Want—Semantic Understanding? (You’ve Got to be Kidding)
David W. Embley, Brigham Young University
Funded in part by the National Science Foundation

Presentation Outline
- Grand Challenge
- Meaning, Knowledge, Information, Data
- Fun and Games with Data
- Information Extraction Ontologies
- Applications
- Limitations and Pragmatics
- Summary and Challenges

Grand Challenge: Semantic Understanding
Can we quantify & specify the nature of this grand challenge?

Grand Challenge: Semantic Understanding
“If ever there were a technology that could generate trillions of dollars in savings worldwide …, it would be the technology that makes business information systems interoperable.” (Jeffrey T. Pollock, VP of Technology Strategy, Modulant Solutions)

Grand Challenge: Semantic Understanding
“The Semantic Web: … content that is meaningful to computers [and that] will unleash a revolution of new possibilities … Properly designed, the Semantic Web can assist the evolution of human knowledge …” (Tim Berners-Lee, …, Weaving the Web)

Grand Challenge: Semantic Understanding
“20th Century: Data Processing. 21st Century: Data Exchange. The issue now is mutual understanding.” (Stefano Spaccapietra, Editor in Chief, Journal on Data Semantics)

Grand Challenge: Semantic Understanding
“The Grand Challenge [of semantic understanding] has become mission critical. Current solutions … won’t scale. Businesses need economic growth dependent on the web working and scaling (cost: $1 trillion/year).” (Michael Brodie, Chief Scientist, Verizon Communications)

Why Semantic Understanding?
- Because we’re overwhelmed with data: point and click is too slow; “Give me what I want when I want it.”
- Because it’s the key to revolutionary progress: automated interoperability and knowledge sharing; negotiation in e-business; large-scale, in-silico experiments in e-science
We succeed in managing information if we can “[take] data and [analyze] it and [simplify] it and [tell] people exactly the information they want, rather than all the information they could have.” (Jim Gray, Microsoft Research)

What is Semantic Understanding?
Understanding: “To grasp or comprehend [what’s] intended or expressed.”
Semantics: “The meaning or the interpretation of a word, sentence, or other language form.” (Dictionary.com)

Can We Achieve Semantic Understanding?
“A computer doesn’t truly ‘understand’ anything.” But computers can manipulate terms “in ways that are useful and meaningful to the human user.” (Tim Berners-Lee)
Key Point: it only has to be good enough. And that’s our challenge and our opportunity! …

Presentation Outline
- Grand Challenge
- Meaning, Knowledge, Information, Data
- Fun and Games with Data
- Information Extraction Ontologies
- Applications
- Limitations and Pragmatics
- Summary and Challenges

Information Value Chain
Meaning
Knowledge
Information
Data
Translating data into meaning

Foundational Definitions
- Meaning: knowledge that is relevant or activates
- Knowledge: information with a degree of certainty or community agreement
- Information: data in a conceptual framework
- Data: attribute-value pairs
(Adapted from [Meadow92])

Foundational Definitions
- Meaning: knowledge that is relevant or activates
- Knowledge: information with a degree of certainty or community agreement (ontology)
- Information: data in a conceptual framework
- Data: attribute-value pairs
(Adapted from [Meadow92])


Data
- Attribute-Value Pairs
Fundamental for information
Thus, fundamental for knowledge & meaning

Data
- Attribute-Value Pairs
Fundamental for information
Thus, fundamental for knowledge & meaning
- Data Frame
Extensive knowledge about a data item
  - Everyday data: currency, dates, time, weights & measures
  - Textual appearance, units, context, operators, I/O conversion
An abstract data type with an extended framework
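The data-frame idea above can be sketched in code. This is a minimal, illustrative sketch, not the actual data-frame implementation; the class name, fields, and patterns are assumptions. A recognizer pairs a value pattern with context keywords for one kind of everyday data item, here Mileage.

```python
import re

class DataFrame:
    """Toy data frame: a value recognizer for one everyday data item."""

    def __init__(self, name, value_pattern, keyword_pattern):
        self.name = name
        self.value_re = re.compile(value_pattern)
        self.keyword_re = re.compile(keyword_pattern, re.IGNORECASE)

    def recognize(self, text):
        """Return (value, start, end) tuples for values whose nearby
        context contains one of the data frame's keywords."""
        hits = []
        for m in self.value_re.finditer(text):
            window = text[m.end():m.end() + 15]  # short look-ahead context
            if self.keyword_re.search(window):
                hits.append((m.group(), m.start(), m.end()))
        return hits

mileage = DataFrame("Mileage", r"\d{1,3}(,\d{3})*", r"\bmiles?\b|\bmi\b")
```

Applied to a car-ad snippet, only the number followed by a mileage keyword is recognized; the price is rejected because no keyword appears near it.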

Presentation Outline
- Grand Challenge
- Meaning, Knowledge, Information, Data
- Fun and Games with Data
- Information Extraction Ontologies
- Applications
- Limitations and Pragmatics
- Summary and Challenges

?
Olympus C-750 Ultra Zoom
Sensor Resolution: 4.2 megapixels
Optical Zoom: 10x
Digital Zoom: 4x
Installed Memory: 16 MB
Lens Aperture: F/8-2.8/3.7
Focal Length min: 6.3 mm
Focal Length max: 63.0 mm


Digital Camera
Olympus C-750 Ultra Zoom
Sensor Resolution: 4.2 megapixels
Optical Zoom: 10x
Digital Zoom: 4x
Installed Memory: 16 MB
Lens Aperture: F/8-2.8/3.7
Focal Length min: 6.3 mm
Focal Length max: 63.0 mm

?
Year: 2002
Make: Ford
Model: Thunderbird
Mileage: 5,500 miles
Features: Red, ABS, 6 CD changer, keyless entry
Price: $33,000
Phone: (916)


Car Advertisement
Year: 2002
Make: Ford
Model: Thunderbird
Mileage: 5,500 miles
Features: Red, ABS, 6 CD changer, keyless entry
Price: $33,000
Phone: (916)

?
Flight #, Class, From, Time/Date, To, Time/Date, Stops
Delta 16, Coach, JFK 6:05 pm, CDG 7:35 am
Delta 119, Coach, CDG 10:20 am, JFK 1:00 pm


Airline Itinerary
Flight #, Class, From, Time/Date, To, Time/Date, Stops
Delta 16, Coach, JFK 6:05 pm, CDG 7:35 am
Delta 119, Coach, CDG 10:20 am, JFK 1:00 pm

?
Monday, October 13, 2003
Group A (W, L, T, GF, GA, Pts.): USA, Sweden, North Korea, Nigeria
Group B (W, L, T, GF, GA, Pts.): Brazil …


World Cup Soccer
Monday, October 13, 2003
Group A (W, L, T, GF, GA, Pts.): USA, Sweden, North Korea, Nigeria
Group B (W, L, T, GF, GA, Pts.): Brazil …

?
Calories: 250 cal
Distance: 2.50 miles
Time: 23.35 minutes
Incline: 1.5 degrees
Speed: 5.2 mph
Heart Rate: 125 bpm


Treadmill Workout
Calories: 250 cal
Distance: 2.50 miles
Time: 23.35 minutes
Incline: 1.5 degrees
Speed: 5.2 mph
Heart Rate: 125 bpm

?
Place: Bonnie Lake
County: Duchesne
State: Utah
Type: Lake
Elevation: 10,000 feet
USGS Quad: Mirror Lake
Latitude: 40.711ºN
Longitude: ºW


Maps
Place: Bonnie Lake
County: Duchesne
State: Utah
Type: Lake
Elevation: 10,100 feet
USGS Quad: Mirror Lake
Latitude: 40.711ºN
Longitude: ºW

Presentation Outline
- Grand Challenge
- Meaning, Knowledge, Information, Data
- Fun and Games with Data
- Information Extraction Ontologies
- Applications
- Limitations and Pragmatics
- Summary and Challenges

Information Extraction Ontologies
Source → Target: information extraction, information exchange

What is an Extraction Ontology?
- Augmented Conceptual-Model Instance
Object & relationship sets
Constraints
Data frame value recognizers
- Robust Wrapper (Ontology-Based Wrapper)
Extracts information
Works even when a site changes or when new sites come on-line

Extraction Ontology: Example
Car [-> object];
Car [0:1] has Year [1:*];
Car [0:1] has Make [1:*];
…
Car [0:*] has Feature [1:*];
PhoneNr [1:*] is for Car [0:1];
Year matches [4]
  constant {extract “\d{2}”; context “\b’[4-9]\d\b”; …}
…
Mileage matches [8]
  keyword {“\bmiles\b”, “\bmi\b.”, …}
…
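A value recognizer like the Year rule above works in two stages: find the context pattern, then pull the value out of the matched text with the extract pattern. A minimal sketch, assuming simplified regular expressions (the slide's OSM-style rule syntax is the real notation; the function name is illustrative):

```python
import re

def recognize(text, context_pat, extract_pat):
    """Find substrings matching the context pattern, then extract the
    value from each matched substring."""
    values = []
    for m in re.finditer(context_pat, text):
        v = re.search(extract_pat, m.group())
        if v:
            values.append(v.group())
    return values

# A two-digit year like '97 appears in context as an apostrophe followed
# by two digits; the extracted value is just the digits.
years = recognize("'97 CHEVY Cavalier, only 7,000 miles",
                  r"'[4-9]\d\b", r"\d{2}")
```

Note that "7,000" is never proposed as a Year: it fails the context pattern even though it contains two-digit substrings.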

Extraction Ontologies: An Example of Semantic Understanding
- “Intelligent” Symbol Manipulation
- Gives the “Illusion of Understanding”
- Obtains Meaningful and Useful Results

Presentation Outline
- Grand Challenge
- Meaning, Knowledge, Information, Data
- Fun and Games with Data
- Information Extraction Ontologies
- Applications
- Limitations and Pragmatics
- Summary and Challenges

A Variety of Applications
- Information Extraction
- High-Precision Classification
- Schema Mapping
- Semantic Web Creation
- Agent Communication
- Ontology Generation

Application #1: Information Extraction

Constant/Keyword Recognition
’97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, or
Descriptor|String|Position(start/end)
Year|97|2|3
Make|CHEV|5|8
Make|CHEVY|5|9
Model|Cavalier|11|18
Feature|Red|21|23
Feature|5 spd|26|30
Mileage|7,000|38|42
KEYWORD(Mileage)|miles|44|48
Price|11,995|100|105
Mileage|11,995|100|105
PhoneNr| |136|143
PhoneNr| |148|155

Heuristics
- Keyword proximity
- Subsumed and overlapping constants
- Functional relationships
- Nonfunctional relationships
- First occurrence without constraint violation

Keyword Proximity
’97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, or
Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr| |136|143 PhoneNr| |148|155
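The keyword-proximity heuristic can be sketched as follows. Both "7,000" and "11,995" match the Mileage value pattern, but only "7,000" is close to the keyword "miles"; the spans are the start/end positions from the descriptor list above, and the character threshold is an assumption for illustration.

```python
def near_keyword(candidates, keyword_span, threshold=20):
    """Keep only candidate (value, start, end) triples whose span lies
    within `threshold` characters of the keyword's (start, end) span."""
    ks, ke = keyword_span
    return [(v, s, e) for (v, s, e) in candidates
            if min(abs(s - ke), abs(ks - e)) <= threshold]

mileage_candidates = [("7,000", 38, 42), ("11,995", 100, 105)]
kept = near_keyword(mileage_candidates, (44, 48))  # KEYWORD(Mileage)|miles|44|48
```

Here "7,000" survives (2 characters from "miles") while "11,995" is rejected (52 characters away), leaving it free to be assigned to Price.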

Subsumed/Overlapping Constants
’97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, or
Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr| |136|143 PhoneNr| |148|155

Functional Relationships
’97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, or
Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr| |136|143 PhoneNr| |148|155

Nonfunctional Relationships
’97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, or
Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr| |136|143 PhoneNr| |148|155

First Occurrence without Constraint Violation
’97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, or
Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr| |136|143 PhoneNr| |148|155

Database-Instance Generator
Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr| |136|143 PhoneNr| |148|155
insert into Car values(1001, “97”, “CHEVY”, “Cavalier”, “7,000”, “11,995”, “ ”)
insert into CarFeature values(1001, “Red”)
insert into CarFeature values(1001, “5 spd”)
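The generator step can be sketched like this: once each functional attribute has one accepted value and the nonfunctional Feature attribute has a list, emit one Car insert plus one CarFeature insert per feature. The record id follows the slide; the phone value below is a hypothetical placeholder, since the slide's phone number did not survive transcription.

```python
def make_inserts(car_id, car_values, features):
    """Emit SQL insert statements for one extracted car record plus one
    CarFeature row per nonfunctional feature value."""
    quoted = ", ".join('"{}"'.format(v) for v in car_values)
    stmts = ["insert into Car values({}, {})".format(car_id, quoted)]
    for f in features:
        stmts.append('insert into CarFeature values({}, "{}")'.format(car_id, f))
    return stmts

stmts = make_inserts(1001,
                     ["97", "CHEVY", "Cavalier", "7,000", "11,995", "555-0100"],
                     ["Red", "5 spd"])
```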

Application #2: High-Precision Classification

An Extraction Ontology Solution

Density Heuristic
Document 1: Car Ads
Document 2: Items for Sale or Rent
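One way to read the density heuristic: measure the fraction of a document's characters covered by strings the ontology recognizes. A car-ads ontology covers much of a car-ads page and little of a general for-sale page. A sketch, assuming this simple coverage measure (the system's exact density measure may differ):

```python
def density(doc_length, matched_spans):
    """Fraction of document characters inside recognized (start, end)
    spans, counting overlapping spans only once."""
    covered = set()
    for start, end in matched_spans:
        covered.update(range(start, end))
    return len(covered) / doc_length
```

For example, spans (0, 10), (5, 20), and (50, 60) over a 100-character document cover 30 distinct characters, giving a density of 0.30.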

Expected Values Heuristic
Document 1: Car Ads: Year: 3, Make: 2, Model: 3, Mileage: 1, Price: 1, Feature: 15, PhoneNr: 3
Document 2: Items for Sale or Rent: Year: 1, Make: 0, Model: 0, Mileage: 1, Price: 0, Feature: 0, PhoneNr: 4

Vector Space of Expected Values
Vectors ov, D1, and D2 over the attributes Year, Make, Model, Mileage, Price, Feature, PhoneNr
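The expected-values heuristic compares the counts of recognized values in a document with the counts the ontology expects, for example with cosine similarity. The document vectors below come from the slide; the ontology's expected-value vector is an assumed illustration, since the slide's numbers were lost in transcription.

```python
import math

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Attribute order: Year, Make, Model, Mileage, Price, Feature, PhoneNr
ov = [2, 1, 1, 1, 1, 5, 1]   # assumed expected counts for the car-ads ontology
d1 = [3, 2, 3, 1, 1, 15, 3]  # Document 1: car ads (from the slide)
d2 = [1, 0, 0, 1, 0, 0, 4]   # Document 2: items for sale or rent (from the slide)
```

With these vectors the car-ads document scores much closer to the ontology vector than the items-for-sale document does, which is the basis for classifying Document 1 as a car-ads page.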

Grouping Heuristic
Document 1: Car Ads: {Year Make Model Price} {Year Model} {Year Make Model Mileage} …
Document 2: Items for Sale or Rent: {Year Mileage …} {Mileage Year Price …}

Grouping
Car Ads: {Year Make Model Price} {Year Model} {Year Make Model Mileage} {Year Model Mileage Price} {Year …}
Sale Items: {Year Mileage} {Mileage Year Price} {Year Price} {Year Price} …
Expected Number in a Group = floor(∑Ave) = 4 (for our example)
Measure = (Sum of Distinct 1-Max Object Sets in each Group) / (Number of Groups × Expected Number in a Group); for the sale items this yields 0.500
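The grouping measure can be approximated as follows. This sketch chunks the attribute sequence into fixed-size groups (a simplification: the system forms groups from record boundaries) and computes distinct attributes per group over the maximum possible.

```python
def grouping_score(attr_sequence, expected_group_size):
    """Sum of distinct attributes per group, divided by
    (number of groups * expected group size)."""
    groups = [attr_sequence[i:i + expected_group_size]
              for i in range(0, len(attr_sequence), expected_group_size)]
    distinct = sum(len(set(g)) for g in groups)
    return distinct / (len(groups) * expected_group_size)

# Items-for-sale attribute sequence, adapted from the slide.
sale_items = ["Year", "Mileage", "Mileage", "Year",
              "Price", "Year", "Price", "Year"]
score = grouping_score(sale_items, 4)
```

With an expected group size of 4, the repetitive sale-items sequence scores 0.5, consistent with the 0.500 shown on the slide; a varied car-ads sequence scores higher.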

Application #3: Schema Mapping

Problem: Different Schemas
Target Database Schema: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Different Source Table Schemas:
{Run #, Yr, Make, Model, Tran, Color, Dr}
{Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
{Vehicle, Distance, Price, Mileage}
{Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}

Solution: Remove Internal Factoring
Discover Nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)*
Unnest: μ(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
(Example table: ACURA listings)
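The unnest operator μ can be sketched as below: each factored-out value is repeated on every inner row, flattening one level of nesting per application. The row values are illustrative stand-ins, since the slide's ACURA table values were lost in transcription.

```python
def unnest(rows):
    """Flatten one level of factoring: each (value, nested_rows) pair
    becomes one flat tuple per nested row, with the factored-out value
    repeated on every row."""
    flat = []
    for value, nested in rows:
        for inner in nested:
            flat.append((value,) + tuple(inner))
    return flat

# Make factored over Model, Model factored over (Year, Colour, Price) rows.
models = [("Integra", [("1998", "Red", "9,500")]),
          ("RL", [("2000", "White", "24,995"), ("2001", "Black", "28,500")])]
cars = unnest([("ACURA", unnest(models))])
```

Applying μ twice, as on the slide, yields fully flat rows with Make and Model repeated on each one.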

Solution: Replace Boolean Values
β Auto, β Air Cond., β AM/FM, β CD Table: a “Yes” in a boolean column is replaced by the column’s attribute name (e.g., “Yes” under CD becomes CD)
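The β step can be sketched like this: a Yes/No column's meaning lives in its header, so each "Yes" is replaced by the attribute name (turning it into a Feature value) and blank entries are dropped. The function name is illustrative.

```python
def debooleanize(attribute, column_values):
    """Replace "Yes" entries with the column's attribute name; other
    entries (blank or "No") become None so they can be filtered out."""
    return [attribute if v.strip().lower() in ("yes", "y") else None
            for v in column_values]

auto_features = debooleanize("Auto", ["Yes", "", "Yes"])
```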

Solution: Form Attribute-Value Pairs
(Attribute-value pairs formed from the ACURA example table)

Solution: Adjust Attribute-Value Pairs
(Adjusted attribute-value pairs from the ACURA example table)

Solution: Do Extraction
(Extraction applied to the ACURA example table)

Solution: Infer Mappings
Target: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Each row is a car.
π Model μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
π Make μ(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
π Year Table
Note: Mappings produce sets for attributes. Joining to form records is trivial because we have OIDs for table rows (e.g., for each Car).

Solution: Do Extraction
Target: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
π Model μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table

Solution: Do Extraction
Target: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
π Price Table

Solution: Do Extraction
Target: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
ρ Colour←Feature π Colour Table ∪ ρ Auto←Feature π Auto β Auto Table ∪ ρ Air Cond.←Feature π Air Cond. β Air Cond. Table ∪ ρ AM/FM←Feature π AM/FM β AM/FM Table ∪ ρ CD←Feature π CD β CD Table

Application #4: Semantic Web Creation

The Semantic Web
- Make web content accessible to machines
- What prevents this from working?
Lack of content
Lack of tools to create useful content
Difficulty of converting the web to the Semantic Web

Converting the Web to the Semantic Web

Superimposed Information

Application #5: Agent Communication

The Problem
Agents must: (1) share ontologies, (2) speak the same language, and (3) pre-agree on message format. Requiring these assumptions precludes agents from interoperating on the fly.
“The holy grail of semantic integration in architectures” is to “allow two agents to generate needed mappings between them on the fly without a priori agreement and without them having built-in knowledge of any common ontology.” [Uschold 02]

Solution
Eliminate all assumptions: agents need not share ontologies, speak the same language, or pre-agree on message format.
This requires:
- Dynamically capturing a message’s semantics
- Matching a message with a service
- Translating (developing mutual understanding)

MatchMaking System (MMS)

Application #6: Ontology Generation

TANGO: Table Analysis for Generating Ontologies
- Recognize and normalize table information
- Construct mini-ontologies from tables
- Discover inter-ontology mappings
- Merge mini-ontologies into a growing ontology
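The final merge step above can be sketched with a toy representation: a mini-ontology as a map from a concept to its set of related concepts, merged into the growing ontology by unioning relationship sets. This is a drastic simplification of the actual merge algorithm, and the "Capital" and "Percent" concepts are illustrative assumptions (only Country, Population, and Religion come from the example table).

```python
def merge(growing, mini):
    """Union a mini-ontology's relationship sets into the growing ontology."""
    for concept, related in mini.items():
        growing.setdefault(concept, set()).update(related)
    return growing

ontology = {}
merge(ontology, {"Country": {"Population", "Religion"}})          # first table
merge(ontology, {"Country": {"Capital"}, "Religion": {"Percent"}})  # next table
```

Concepts shared between mini-ontologies (here, Country and Religion) are where inter-ontology mappings let the growing ontology accumulate structure.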

Recognize Table Information
(A table with nested headers: Country; Population (July 2001 est.); Religion, subdivided into Albanian Orthodox, Roman Catholic, Shi’a Muslim, Sunni Muslim, Muslim, and other)
Afganistan: 26,813,057; 15%, 84%, 1%
Albania: 3,510,484; 20%, 70%, 30%

Construct Mini-Ontology
(Mini-ontology constructed from the same country/population/religion table)

Discover Mappings

Merge

Presentation Outline
- Grand Challenge
- Meaning, Knowledge, Information, Data
- Fun and Games with Data
- Information Extraction Ontologies
- Applications
- Limitations and Pragmatics
- Summary and Challenges

Limitations and Pragmatics
- Data-Rich, Narrow Domain
- Ambiguities ~ Context Assumptions
- Incompleteness ~ Implicit Information
- Common-Sense Requirements
- Knowledge Prerequisites
- …

Busiest Airport in 2003?
Chicago: 928,735 landings (Nat. Air Traffic Controllers Assoc.); 931,000 landings (Federal Aviation Admin.)
Atlanta: 58,875,694 passengers (Sep., latest numbers available)
Memphis: 2,494,190 metric tons (Airports Council Int’l.)


Busiest Airport in 2003?
Chicago: 928,735 landings (Nat. Air Traffic Controllers Assoc.); 931,000 landings (Federal Aviation Admin.)
Atlanta: 58,875,694 passengers (Sep., latest numbers available)
Memphis: 2,494,190 metric tons (Airports Council Int’l.)
Ambiguous: whom do we trust? (How do they count?)

Busiest Airport in 2003?
Chicago: 928,735 landings (Nat. Air Traffic Controllers Assoc.); 931,000 landings (Federal Aviation Admin.)
Atlanta: 58,875,694 passengers (Sep., latest numbers available)
Memphis: 2,494,190 metric tons (Airports Council Int’l.)
Important qualification: “Sep., latest numbers available”

Graphics, Icons, …
Dow Jones Industrial Average: High, Low, Last, Chg for 30 Indus, Transp, Utils, Stocks

Graphics, Icons, …
Dow Jones Industrial Average: High, Low, Last, Chg for 30 Indus, Transp, Utils, Stocks
Weekly and daily figures reported on the same date. Implicit information: “weekly” is stated in the upper corner of the page; “daily” is not stated.

“Mad Cow” hurts Utah jobs
“Utah stands to lose 1,200 jobs from Asian countries’ import bans on beef products, …”
Common sense: a cow can’t hurt jobs.

“Mad Cow” hurts Utah jobs
“Utah stands to lose 1,200 jobs from Asian countries’ import bans on beef products, …”
Knowledge required for understanding:
- Mad Cow disease was discovered in Washington.
- Washington state (not DC), which is in the western US.
- Humans can get the disease by eating contaminated beef.
- Utah is in the western US.
- Beef cattle are regionally linked (somehow?).
- People in Asian countries don’t want to get sick.

Presentation Outline
- Grand Challenge
- Meaning, Knowledge, Information, Data
- Fun and Games with Data
- Information Extraction Ontologies
- Applications
- Limitations and Pragmatics
- Summary and Challenges

Some Key Ideas
- Data, Information, and Knowledge
- Data Frames
Knowledge about everyday data items
Recognizers for data in context
- Ontologies
Resilient extraction ontologies
Shared conceptualizations
- Limitations and Pragmatics

Some Research Issues
- Building a library of open-source data recognizers
- Creating corpora of test data for extraction, integration, table understanding, …
- Precisely finding and gathering relevant information
Subparts of larger data
Scattered data (linked, factored, implied)
Data behind forms in the hidden web
- Improving concept matching
Indirect matching
Calculations and unit conversions
- …

Some Research Challenges
- Automating ontology construction
- Converting web data to Semantic Web data
- Accommodating different views
- Developing effective personal software agents
- …