Information Integration: A Status Report. Alon Halevy, University of Washington, Seattle. IJCAI 2003.

Presentation transcript:

Mediated Schema. Sources: OMIM, Swiss-Prot, HUGO, GO, GeneClinics, Entrez, LocusLink, GEO. Mediated-schema concepts: Entity, Sequenceable Entity, Gene, Phenotype, Structured Vocabulary, Experiment, Protein, Nucleotide Sequence, Microarray Experiment. Query: For the microarray experiment I just ran, what are the related nucleotide sequences, and for what protein do they code?

Motivation and Activity. Application areas of data integration: enterprise information integration ($$), the government, data sources on the web, scientific data sharing. Several data sharing architectures: virtual data integration, warehousing, message passing, web services. Many research projects; mine: Information Manifold, Tukwila, LSD, Piazza. EII: a new industry buzzword.

Today's Agenda. Recent progress: mediation languages; query processing (XML and other); some lessons from the commercial world. Current challenges: enabling large-scale data sharing with peer data management systems; the age-old problem of semantic heterogeneity. A new agenda item for AI: corpus-based KR. AI is more vital than ever for progress here!

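Mediation Languages. Goal: a language for specifying semantic relationships between the mediated schema and the sources (not full FOL). [Figure: a query Q posed over the mediated schema is reformulated into queries Q' over the sources.] Assume: data at the sources is structured (or seems so).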

Global-as-View (GAV). Mediated schema: Title, Actor, … Sources: R1, R2, R3, R4, R5. Mediated relations are defined as views over the sources:
Actor(x,y) :- R1(x,y,z)
Actor(x,y) :- R2(x,z), R3(z,y)
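
A minimal Python sketch of how the two GAV rules for Actor can be read: a query over the mediated schema is answered by unfolding Actor into its definitions over the sources. The source tuples and the helper `actors_in` are invented for illustration and are not from the talk.

```python
# Hypothetical source instances (invented for illustration).
R1 = [("Casablanca", "Bogart", 1942)]
R2 = [("Vertigo", "m7")]
R3 = [("m7", "Stewart")]


def actor():
    """Mediated relation Actor(x, y), defined as a view over the sources."""
    # Actor(x,y) :- R1(x,y,z)
    answers = {(x, y) for (x, y, _z) in R1}
    # Actor(x,y) :- R2(x,z), R3(z,y)
    answers |= {(x, y) for (x, z) in R2 for (z2, y) in R3 if z == z2}
    return answers


def actors_in(title):
    """Query over the mediated schema: q(y) :- Actor(title, y)."""
    return sorted(y for (x, y) in actor() if x == title)
```

Because the mediated relation is itself a view, reformulation under GAV amounts to view unfolding, which is why query answering is comparatively simple in this setting.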

Local-as-View (LAV, GLAV). Mediated schema: Title, Actor, … Sources: R1, R2, R3, R4, R5. Each source is described as a view over the mediated schema:
R1(x,y,z) :- Title(x,y), Actor(x,z), y < 1970
R5(x,y,z) :- Movie(x,y,"French")
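
One known way to answer queries under LAV descriptions is to invert the source descriptions ("inverse rules"). The Python sketch below applies this to the R1 description only, under the assumption that y can be compared as a year; the tuples and function names are invented for illustration. Since every variable in the body of R1's description also appears in its head, the inverse rules need no invented (Skolem) values.

```python
# Hypothetical contents of source R1 (invented for illustration).
# R1(x,y,z) :- Title(x,y), Actor(x,z), y < 1970
R1 = [("m1", 1942, "Bogart"), ("m2", 1958, "Stewart")]


def inverse_rules(r1):
    """Derive mediated-schema facts that must hold given the tuples in R1.

    Title(x,y) :- R1(x,y,z)
    Actor(x,z) :- R1(x,y,z)
    """
    title = {(x, y) for (x, y, _z) in r1}
    actor = {(x, z) for (x, _y, z) in r1}
    return title, actor


def query(title, actor, before):
    """q(x,z) :- Title(x,y), Actor(x,z), y < before."""
    return sorted(
        (x, z) for (x, y) in title for (x2, z) in actor if x == x2 and y < before
    )
```

The answers obtained this way are only those the source can guarantee: R1 says nothing about tuples with y >= 1970, so a query asking for them gets no answers from this source.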

Mediation Languages: Summary. A lot of nice theory and practical algorithms. Careful choice of expressive power mattered. Algorithms for answering queries using views are in every commercial DBMS. Description logics are also an attractive formalism for mediation. The bottleneck is coming up with the mapping expressions.

Outline. Recent progress: mediation languages; query processing (XML and other); some lessons from the commercial world. Current challenges: enabling large-scale data sharing with peer data management systems; the age-old problem of semantic heterogeneity. A new agenda item for AI: corpus-based KR.

Adaptive Query Processing. Problem: no statistics are available and the network is unstable, so we cannot "plan and then execute"; the plan must adapt during execution. The ideas were already present in Ingres (1976), an early database system, and in interleaved planning and execution (AI). Key question: when to adapt, and at what granularity: for every tuple? At materialization points? See [Ives et al., 2002] for our solution.
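
To make the idea of adapting at materialization points concrete, here is a small Python sketch. It is not the algorithm of [Ives et al., 2002]; it is a simplified illustration that re-chooses the next join after each intermediate result is materialized, using a size estimate obtained by probing with a prefix of that result. All names and the example data are assumptions, and every row of a relation is assumed to have the same columns.

```python
def hash_join(left, right):
    """Natural join of two lists of dict rows on their shared column names."""
    if not left or not right:
        return []
    keys = sorted(set(left[0]) & set(right[0]))
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    return [
        {**l, **r}
        for l in left
        for r in index.get(tuple(l[k] for k in keys), [])
    ]


def estimate(current, candidate, sample=50):
    """Estimate a join's size by probing with a prefix of the current result."""
    probe = current[:sample]
    if not probe:
        return 0.0
    return len(hash_join(probe, candidate)) * len(current) / len(probe)


def adaptive_join(relations):
    """Join all relations, re-planning at every materialization point."""
    pending = dict(relations)
    # With no statistics, start from whichever input turned out smallest.
    first = min(pending, key=lambda name: len(pending[name]))
    current, order = pending.pop(first), [first]
    while pending:
        # Re-plan using what has actually been observed so far.
        nxt = min(pending, key=lambda name: estimate(current, pending[name]))
        current = hash_join(current, pending.pop(nxt))  # materialize
        order.append(nxt)
    return order, current


# Hypothetical inputs (invented for illustration).
relations = {
    "orders": [{"item": i, "cust": i % 3} for i in range(6)],
    "instock": [{"item": 0}, {"item": 1}],
    "shipping": [{"cust": c, "city": "X"} for c in range(3)],
}
order, rows = adaptive_join(relations)
```

Re-planning only at materialization points is the coarse end of the granularity spectrum raised on the slide; adapting per tuple would require operators that can change behavior mid-stream.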

Convergent Query Processing [Ives et al., 2002]. [Figure: the join of In-stock (I), Orders (O), and Shipping (S) shown over partitions I0 O0 S0, I1 O1 S1, and I2 O2 S2, followed by a "cleanup" query plan.]

XML Query Processing. XML facilitates integration. The mediator's query processor may manipulate XML directly. Challenges: XML is not flat but nested; path queries; data can be irregular and doesn't adhere to a strict schema. Progress: defining and optimizing XQuery; going back and forth between XML and relational.

The Commercial World. Some startups: Nimble, MetaMatrix, Calixa, Composite, Enosys. Big guys making announcements: IBM, BEA, MS (Oracle still being defiant). Integration technology in different layers, e.g., reporting companies want it (Actuate). Progress: analysts have a buzzword: EII. Challenges: Integration with EAI? Yet another middleware? Horizontal vs. vertical?

What Worked? Performance was not an issue. Tools, tools, tools: for managing sources and creating mediated schemas. XML query processing was needed. Concordance: common keys are needed to join sources. An active research area!

Limitations of Mediated Schema. [Figure: a query Q posed over a single mediated schema is reformulated into queries Q' over each source.]

Peer Data-Management. PDMS: a network of peers (data sources). Peers can export base data or combinations of data, and can serve as logical mediators for other peers. A peer can be both a server and a client. Semantic relationships are specified locally (between small sets of peers). This is a Semantic Web (from a different angle).

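Network of Mappings (Piazza). [Figure: peers UW, Stanford, DBLP, Roma, Paris, CiteSeer, and Vienna connected by GAV, LAV, and GLAV mappings; a query Q posed at one peer is reformulated into Q' and Q'' at other peers.]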

Advantages of PDMS. No need for a central mediated schema. Data can be mapped opportunistically, as is most convenient. Queries are posed using the peer's schema; answers come from anywhere in the system. Infrastructure for Semantic Web applications. This is not P2P file sharing: data has rich semantics, and membership is not as dynamic.

Schema Mediation for PDMS. [Figure: peers UW, Stanford, DBLP, Roma, Paris, CiteSeer, and Vienna connected by GAV, LAV, and GLAV mappings; a query Q is reformulated into Q' and Q''.] When can LAV and GAV be combined to form such a network structure? (The semantics are not yet obvious.) See [ICDE-03] and [WWW-03] (for XML).

Efficient Query Answering. [Figure: peers UW, Stanford, DBLP, Roma, Paris, CiteSeer, and Vienna; a query Q is reformulated into Q' and Q''.] Problems: redundant paths; expensive reformulation. Possible solution: pre-compose some paths.
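
The two problems named here can be illustrated with a deliberately simplified Python sketch in which every mapping is a plain attribute renaming. Real PDMS mappings are GAV/LAV/GLAV formulas and are far harder to compose; the peers' attribute names, the mapping table, and the function names below are all invented for illustration.

```python
from collections import deque

# Hypothetical peer-to-peer mappings, each a simple attribute renaming.
MAPPINGS = {
    ("UW", "Stanford"): {"title": "paper_title", "author": "writer"},
    ("Stanford", "DBLP"): {"paper_title": "name", "writer": "creator"},
    ("UW", "DBLP"): {"title": "name"},
}


def compose(first, second):
    """Pre-compose two renamings into one that skips the middle peer."""
    return {a: second[b] for a, b in first.items() if b in second}


def reformulate(attrs, start):
    """Translate query attributes from `start` to every reachable peer."""
    reached = {start: {a: a for a in attrs}}
    queue = deque([start])
    while queue:
        peer = queue.popleft()
        for (src, dst), mapping in MAPPINGS.items():
            if src != peer:
                continue
            translated = compose(reached[peer], mapping)
            known = reached.setdefault(dst, {})
            # Revisit a peer only if this path contributes something new;
            # redundant paths are otherwise pruned.
            new = {a: b for a, b in translated.items() if a not in known}
            if new:
                known.update(new)
                queue.append(dst)
    return reached
```

In this toy setting the direct UW-to-DBLP mapping translates only `title`, while the longer path through Stanford also translates `author`, so pruning a path is safe only when it adds nothing new. When two paths translate the same attribute differently, the sketch simply keeps the first translation found.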

Mapping Composition [Madhavan and Halevy, VLDB 2003]. Incredibly subtle! In general, a composition can be an infinite set of GLAV formulas. Results: it is finite in many cases; even when infinite, it often has a finite, useful encoding. Hence, compositions can usually be pre-optimized.

Other Research Issues. [Figure: peers UW, Stanford, DBLP, Saarbruecken, Leipzig, CiteSeer, and Berlin; a query Q is reformulated into Q' and Q''.] Intelligent data placement. Management of mapping networks. Improving networks: finding additional connections. Handling inconsistencies.

PDMS-Related Projects. Hyperion (Toronto), PeerDB (Singapore), Local Relational Models (Trento), Edutella (Hannover, Germany), Semantic Gossiping (EPFL, Lausanne), Raccoon (UC Irvine), Orchestra (Ives, U. Penn).

Schema/Ontology Matching. Schema heterogeneity is a key roadblock for information integration: different data sources speak their own schema, and mapping is key to any data sharing architecture. [Figure labels: Mediator, Consumer, Data Source; example schemas: (Hotel, Gaststätte, Brauerei, Kathedrale), (Lodges, Restaurants, Beaches, Volcanoes), (Hotel, Restaurant, AdventureSports, HistoricalSites).]

Schema Matching. Schema matching: discovering correspondences between similar elements. Eventually: BooksAndMusic(x:Title, …) = Books(x:Title, …) ∪ CDs(x:Album, …).
Inventory Database A: BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords).
Inventory Database B: Books(Title, ISBN, Price, DiscountPrice, Edition); CDs(Album, ASIN, Price, DiscountPrice, Studio); BookCategories(ISBN, Category); CDCategories(ASIN, Category); Artists(ASIN, ArtistName, GroupName); Authors(ISBN, FirstName, LastName).

Typical Approaches. Multiple sources of evidence in the schemas: (1) schema element names, e.g., BooksAndCDs/Categories ~ BookCategories/Category; (2) descriptions and documentation, e.g., ItemID: unique identifier for a book or a CD; ISBN: unique identifier for any book; (3) data types and data instances, e.g., DateTime ≠ Integer, addresses have similar formats; (4) schema structure, e.g., all books have similar attributes. Use domain knowledge. Combine multiple techniques to exploit all available evidence: in isolation, techniques are incomplete or brittle.
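
As an illustration of combining evidence, the Python sketch below scores candidate matches with two simple matchers, name similarity and data-type agreement, and takes a weighted sum. This is not the method of any system named in the talk; the weights, the type labels, and the function names are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def name_score(a, b):
    """Evidence from element names: string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def type_score(a_type, b_type):
    """Evidence from data types: 1.0 if they agree, else 0.0."""
    return 1.0 if a_type == b_type else 0.0


def match(source, target, w_name=0.7, w_type=0.3):
    """For each source element, return the best-scoring target element.

    `source` and `target` map element names to (assumed) type labels.
    """
    result = {}
    for s_name, s_type in source.items():
        scored = {
            t_name: w_name * name_score(s_name, t_name)
            + w_type * type_score(s_type, t_type)
            for t_name, t_type in target.items()
        }
        result[s_name] = max(scored, key=scored.get)
    return result


# Element names follow the slide's example schemas; types are assumed.
source = {"Title": "string", "SuggestedPrice": "decimal"}
target = {"Title": "string", "ISBN": "string", "Price": "decimal"}
matches = match(source, target)
```

Either matcher alone is brittle: names fail on synonyms such as Album and Title, and types alone cannot tell Title from ISBN. A fixed weighted sum is the simplest possible combiner; learning the weights from known mappings is the natural next refinement.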

Philosophy of Solutions. Effective schema matching requires a principled combination of techniques. Like human experts, the matcher should improve over time. LSD: mapping data sources to a mediated schema; it uses a few mappings as training examples to learn hypotheses for elements of the mediated schema. See [Doan et al., SIGMOD-2001, MLJ-2003]. Next step: corpus-based matching.

Corpus-Based Matching. A collection of schemas and mappings; reuse extracted information to match new schemas. [Figure: a corpus of book and inventory schemas with elements such as CDs, Categories, Artists, Items, Authors, Books, Music, Information, Literature, Publisher.] Identify common concepts and patterns: Books, Authors, Publishers, …; Books → Title, Author, Price, Publisher.

Mapping Knowledge Base. Schemas and mappings: accumulated over time. Learners: extract knowledge from schemas and mappings (Data Instances Learner, Name Learner, Data Type Learner, Description Learner, Structure Learner, and a Meta Learner). Learned models: for each unique element in any schema (C1 … CN), one model per learner (NL, DIL, DTL, DL, SL, ML).

Preliminary results: Corpus is useful

With and without the corpus

Corpus vs. Traditional KR. A large corpus of uncoordinated knowledge fragments vs. a carefully designed knowledge base. Can a corpus offer a more attractive solution for some KR problems?

Pause: KR vs. Corpus. Knowledge base: hard to engineer, brittle at the boundaries; only one way of saying things. Corpus: "easier" to build, coverage not predefined; many views of the domain. See the proceedings for the full argument.

Corpus-based KR. Contents: schemas, ontologies, meta-data, data, queries, mappings. Collect statistics on the corpus: How often does a word appear as a relation name? When it does, what tend to be the attribute names? What other tables are there? Support a KR-style interface on the corpus (OKBC-like).
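
The three statistics asked for above can be sketched in a few lines of Python. The corpus format (a list of schemas, each mapping relation names to attribute lists), the toy corpus itself, and the function name are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical corpus: each schema maps relation names to attribute lists.
CORPUS = [
    {"Books": ["Title", "ISBN", "Price"], "Authors": ["ISBN", "LastName"]},
    {"Books": ["Title", "Author", "Publisher"]},
    {"CDs": ["Album", "ASIN", "Price"]},
]


def collect_statistics(corpus):
    """Count relation names, their attributes, and co-occurring tables."""
    relation_counts = Counter()              # how often a word names a relation
    attribute_counts = defaultdict(Counter)  # attributes seen for that relation
    sibling_counts = defaultdict(Counter)    # other tables in the same schema
    for schema in corpus:
        for relation, attributes in schema.items():
            relation_counts[relation] += 1
            attribute_counts[relation].update(attributes)
            sibling_counts[relation].update(r for r in schema if r != relation)
    return relation_counts, attribute_counts, sibling_counts


relation_counts, attribute_counts, sibling_counts = collect_statistics(CORPUS)
```

A KR-style interface could then be layered on top of such counts, for example by answering "what attributes does a Books relation usually have?" with `attribute_counts["Books"].most_common()`.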

Other Applications of C-B-KR. Question answering on the web. Focused crawling. Natural language interfaces to DBs. Schema and ontology authoring. Semantic query optimization. Whenever we need knowledge to help us rank multiple answers/plans.

Example Queries. How are two terms related? GPA(studentID, $value) vs. Student(studentID, GPA, address). Find different ways of saying the same thing: Class(Lexus, Luxury) vs. LuxuryCar(Lexus, Toyota). When do two terms play similar roles? IJCAIReview(p1, rev2, accept) vs. AIJReferees(round2, p3, rev4, reject).

Challenges for C-B-KR. Building the corpus. How focused should the corpus be? Is human tuning needed or helpful? How do we accommodate inference? How do we leverage traditional KR?

Summary. The vision: data authoring, querying, and sharing by everyone. We got the plumbing to work; to go further, we need AI techniques. Challenge: cross the structure chasm. It's hard to author and query structured data! PDMS: an architecture for ad hoc sharing. Ontology/schema matching is key! Are we providing the right tools? Corpus-based knowledge representation. We need benchmarks!

Some References. Piazza: ICDE-03, WWW-03, VLDB-03. The Structure Chasm: CIDR-03. Mediation surveys: VLDB Journal 01; Lenzerini tutorial. Schema matching: Rahm and Bernstein, VLDB Journal 01. Workshops: IJCAI, Semantic Web Conf. Teaching integration to undergraduates: SIGMOD Record, September 2003.