Data Integration: Achievements and Perspectives in the Last Ten Years
AiJing

Outline
- Motivation & Background
- Best Paper: Information Manifold
- Building on the Foundation
- Data Integration Industry
- Future Challenges
- Conclusion

Motivation & Background
- Data integration is a pervasive challenge faced in applications that need to query across multiple autonomous and heterogeneous data sources.
- Data integration is crucial in large enterprises that own a multitude of data sources.
- It is also needed for better cooperation among agencies, each with its own data sources.

Data Integration
[Figure: data integration across legacy databases, services and applications, and enterprise databases]

Ten-Year Best Paper
- Paper: "Querying Heterogeneous Information Sources Using Source Descriptions", VLDB 1996.
- Alon Halevy was a principal member of technical staff at AT&T Bell Laboratories, and then at AT&T Laboratories.
- Impact: the Information Manifold led to tremendous progress on data integration and to quite a few commercial data integration products.

The Information Manifold
- An implemented data integration system.
- Goal: provide a uniform query interface to a heterogeneous collection of Web data sources.
- Main contribution: the way it described the contents of the data sources it knew about.
- IM contains declarative descriptions of the contents and capabilities of the information sources (source descriptions).

An example of a complex query
- Query: find reviews of movies directed by Woody Allen that are playing in my area.
- Answering it requires a join across three web sites:
  1. a movie site containing actor and director information (IMDB)
  2. movie-playing sources (e.g., 777film.com)
  3. movie review sites (e.g., a newspaper)
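The three-way join in this example can be sketched in Python. The lists below are hypothetical in-memory stand-ins for the three web sites; every record, field name, and the function name are assumptions made for illustration, not data or code from the Information Manifold.

```python
# Hypothetical stand-ins for the three web sources (all records invented).
MOVIE_SITE = [  # director information, as a movie database might expose it
    {"title": "Annie Hall", "director": "Woody Allen"},
    {"title": "Manhattan", "director": "Woody Allen"},
    {"title": "Vertigo", "director": "Alfred Hitchcock"},
]
SHOWTIMES = [  # which movie is playing in which area
    {"title": "Annie Hall", "area": "Seattle"},
    {"title": "Vertigo", "area": "Seattle"},
    {"title": "Manhattan", "area": "Boston"},
]
REVIEWS = [  # reviews, as a newspaper site might expose them
    {"title": "Annie Hall", "review": "A classic comedy."},
    {"title": "Vertigo", "review": "Suspenseful."},
]


def reviews_of_movies_playing(director, area):
    """Join the three sources on the movie title."""
    directed = {m["title"] for m in MOVIE_SITE if m["director"] == director}
    playing = {s["title"] for s in SHOWTIMES if s["area"] == area}
    wanted = directed & playing
    return [(r["title"], r["review"]) for r in REVIEWS if r["title"] in wanted]


print(reviews_of_movies_playing("Woody Allen", "Seattle"))
```

In a real integration system the user would pose this query once against the mediated schema, and the system would decide which sources to contact and how to join them.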

[Figure: architecture of a data integration system, showing wrappers, the mediated schema, semantic mappings, query reformulation, and optimization & execution, split into design time and run time]

Semantic Mappings
- Information sources:
  - Books(Title, ISBN, Price, DiscountPrice, Edition)
  - Authors(ISBN, FirstName, LastName)
  - BookCategories(ISBN, Category)
  - CDs(Album, ASIN, Price, DiscountPrice, Studio)
  - CDCategories(ASIN, Category)
  - Artists(ASIN, ArtistName, GroupName)
- Mediated schema:
  - CD(ASIN, Title, Genre, …)
  - Artist(ASIN, name, …)
- Mapping logic relates the information sources to the mediated schema.

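Global-as-View (GAV) (previous approaches)
- Sources: R1, R2, R3, R4, R5
- Mediated schema: CD(ASIN, Title, Genre, …), Artist(ASIN, name, …)
- Mapping: each relation of the mediated schema is defined as a view over the source relations.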
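A minimal sketch of the GAV idea, assuming the mediated relation CD(ASIN, Title, Genre) is defined as a join of the CDs and CDCategories sources. That particular mapping, the sample rows, and the function names are assumptions; the mapping on the original slide is not recoverable.

```python
# Hypothetical source data; relation names follow the slides, rows are invented.
CDS = [
    {"Album": "Kind of Blue", "ASIN": "A1", "Price": 12.0},
    {"Album": "Abbey Road", "ASIN": "A2", "Price": 15.0},
]
CD_CATEGORIES = [
    {"ASIN": "A1", "Category": "Jazz"},
    {"ASIN": "A2", "Category": "Rock"},
]


def mediated_cd():
    """Assumed GAV definition: CD(ASIN, Title, Genre) = CDs joined with CDCategories.

    Simplification: at most one category per ASIN.
    """
    genre_by_asin = {c["ASIN"]: c["Category"] for c in CD_CATEGORIES}
    for cd in CDS:
        if cd["ASIN"] in genre_by_asin:
            yield {
                "ASIN": cd["ASIN"],
                "Title": cd["Album"],
                "Genre": genre_by_asin[cd["ASIN"]],
            }


def titles_in_genre(genre):
    """A query over the mediated schema, answered by unfolding the view."""
    return [row["Title"] for row in mediated_cd() if row["Genre"] == genre]


print(titles_in_genre("Jazz"))
```

With GAV, answering a query amounts to unfolding the view definition, but adding a source means revisiting the definitions of the mediated relations.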

Local-as-View (LAV)
- Sources: R1, R2, R3, R4, R5
- Mediated schema: CD(ASIN, Title, Genre, Year), Artist(ASIN, Name, …)
- Mapping: each source is described as a view over the mediated schema.

Benefits of LAV
- Describing information sources became easier, so a data integration system could accommodate new sources easily.
- The descriptions of the information sources could be more precise: it became easier to describe precise constraints on the contents of the sources.
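Both benefits can be illustrated with a small sketch. It assumes, purely for illustration, that each source is described by a single constraint on Genre; the class, the function, and the descriptions of R1 to R4 are hypothetical and far simpler than real source descriptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceDescription:
    """LAV-style description: the source holds CD tuples of one genre.

    genre=None means the source places no constraint on Genre.
    """
    name: str
    genre: Optional[str]


def relevant_sources(descriptions, query_genre):
    """Keep the sources whose declared contents can contain CDs of query_genre."""
    return [d.name for d in descriptions if d.genre is None or d.genre == query_genre]


DESCRIPTIONS = [
    SourceDescription("R1", "Jazz"),
    SourceDescription("R2", "Rock"),
    SourceDescription("R3", None),
]

print(relevant_sources(DESCRIPTIONS, "Jazz"))

# Adding a source only means adding one description; nothing else changes.
extended = DESCRIPTIONS + [SourceDescription("R4", "Jazz")]
print(relevant_sources(extended, "Jazz"))
```

The constraint on each source lets the system skip R2 for a query about jazz CDs, and the new source R4 is picked up without touching the mediated schema or the other descriptions.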

Query reformulation
- Information sources:
  - Books(Title, ISBN, Price, DiscountPrice, Edition)
  - Authors(ISBN, FirstName, LastName)
  - BookCategories(ISBN, Category)
  - CDs(Album, ASIN, Price, DiscountPrice, Studio)
  - CDCategories(ASIN, Category)
  - Artists(ASIN, ArtistName, GroupName)
- Mediated schema: CD(ASIN, Title, Genre, …)
- A query posed over the mediated schema, e.g. over CD(A, T, G), is reformulated into a set of queries on the data sources.

Query Answering in LAV = Answering Queries Using Views (AQUV)
- A problem that was earlier considered in the context of query optimization.
- Given a set of views V1, …, Vn and a query Q, can we answer Q using only the answers to V1, …, Vn?

AQUV
- Earlier uses: query optimization and supporting physical data independence.
- AQUV for data integration:
  - the rewriting is not necessarily equivalent
  - find the maximally contained rewriting
- Main AQUV algorithms:
  - Bucket
  - Inverse rules
  - MiniCon
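To give a feel for the first of these algorithms, here is a heavily simplified sketch of its bucket-creation step. It matches views to query subgoals by predicate name only; the real Bucket algorithm also unifies variables, checks distinguished variables, and tests each candidate for containment. The views V1 to V3 and the function names are hypothetical.

```python
from itertools import product

# A view body is reduced to a list of predicate names; variables are omitted,
# which is a deliberate simplification of conjunctive queries.
VIEWS = {
    "V1": ["CD", "Artist"],
    "V2": ["CD"],
    "V3": ["Artist"],
}


def build_buckets(query, views):
    """For each query subgoal, collect the views that mention its predicate."""
    return [
        [name for name, body in views.items() if subgoal in body]
        for subgoal in query
    ]


def candidate_rewritings(query, views):
    """Pick one view per bucket; a full algorithm would then test containment."""
    return list(product(*build_buckets(query, views)))


query = ["CD", "Artist"]
print(build_buckets(query, VIEWS))
print(candidate_rewritings(query, VIEWS))
```

The union of the candidates that survive the containment test is what gives a maximally contained rewriting in the full algorithm.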

Building on the Foundation
- Generating schema mappings
- Adaptive query processing
- XML
- Model management
- Peer-to-peer data management
- The role of artificial intelligence

Generating Schema Mappings
- Observation: who is going to write all these LAV/GAV formulas (the semantic mappings between the sources and the mediated schema)?
  1. Creating the source descriptions
  2. Writing the semantic mappings
- This was the main bottleneck.

Techniques for Schema Mapping
- Approach: semi-automatically generating schema mappings.
- Goal: create tools that speed up the creation of the mappings and reduce the amount of human effort involved.
- Compare schema elements based on:
  - linguistic similarities
  - overlaps in data values or data types
- Schema mapping tasks are often repetitive.
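The two comparison signals can be combined into a simple match score. This is only a sketch: the equal weighting, the function names, and the sample values are assumptions, and `difflib.SequenceMatcher` stands in for whatever linguistic measure a real matcher would use.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Linguistic signal: string similarity of the two attribute names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_overlap(values_a, values_b):
    """Data signal: Jaccard overlap of the sampled data values."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def match_score(name_a, values_a, name_b, values_b, name_weight=0.5):
    """Weighted combination of both signals (the weight is an arbitrary choice)."""
    return (name_weight * name_similarity(name_a, name_b)
            + (1 - name_weight) * value_overlap(values_a, values_b))


def best_match(attribute, values, candidates):
    """candidates maps each mediated attribute name to sample values."""
    return max(candidates,
               key=lambda c: match_score(attribute, values, c, candidates[c]))


MEDIATED = {
    "Title": ["Kind of Blue", "Abbey Road"],
    "Genre": ["Jazz", "Rock"],
}
print(best_match("Album", ["Abbey Road", "Kind of Blue", "Thriller"], MEDIATED))
```

Here the names "Album" and "Title" barely resemble each other, so the overlap in data values is what drives the suggested match; a human would still confirm it.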

A Machine Learning Approach
- Map multiple schemas in the same domain to the same mediated schema.
- Learn from previous experience:
  - use the manually created schema mappings as training data
  - generalize from them to predict mappings between unseen schemas
[Figure: sources mapped to a mediated schema; given matches are used to predict new ones]
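As a toy illustration of learning from given matches, the sketch below predicts the mediated attribute of an unseen source attribute by copying the label of its nearest neighbor among the training names. The training pairs and the function name are hypothetical, and a nearest-neighbor rule on names is a stand-in for the richer learners used in actual systems.

```python
from difflib import SequenceMatcher

# Hypothetical training data: manually created (source attribute, mediated attribute) matches.
GIVEN_MATCHES = [
    ("Album", "Title"),
    ("AlbumName", "Title"),
    ("Category", "Genre"),
    ("MusicStyle", "Genre"),
    ("ASIN", "ASIN"),
]


def predict(attribute, training=GIVEN_MATCHES):
    """Return the mediated attribute of the most similar known source attribute."""
    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    nearest = max(training, key=lambda pair: similarity(attribute, pair[0]))
    return nearest[1]


print(predict("album_title"))
print(predict("Categories"))
```

Every match a person confirms can be appended to the training data, which is how the repetitive nature of schema mapping gets exploited.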

Adaptive query processing
- Observation: once we have mappings, how can we execute queries?
- Traditional plan-then-execute doesn't work.
- Root cause: the dynamic nature of data integration contexts.

Adaptive query processing
- In a data integration system, the context is very dynamic and the optimizer has much less information than in the traditional setting.
- Two results:
  - the optimizer cannot decide on a good plan
  - a plan may be arbitrarily bad
- Solution: adjust the query plan dynamically.
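One simple form of dynamic adjustment is to reorder selection predicates while tuples stream in, based on the pass rates observed so far. The sketch below is an illustration of that idea only; the function name and the reordering policy are assumptions, not the technique of any specific system mentioned in the slides.

```python
def adaptive_filter(tuples, predicates):
    """Yield the tuples that satisfy every predicate.

    Before each tuple the predicates are re-sorted so that the one with the
    lowest observed pass rate runs first; predicates with no observations yet
    are tried first so that statistics get collected.
    """
    seen = [0] * len(predicates)
    passed = [0] * len(predicates)
    order = list(range(len(predicates)))
    for t in tuples:
        # Re-plan using the statistics gathered so far.
        order.sort(key=lambda i: passed[i] / seen[i] if seen[i] else 0.0)
        for i in order:
            seen[i] += 1
            if predicates[i](t):
                passed[i] += 1
            else:
                break  # tuple rejected; skip the remaining predicates
        else:
            yield t  # no predicate rejected the tuple


is_even = lambda x: x % 2 == 0
is_large = lambda x: x > 15
print(list(adaptive_filter(range(20), [is_even, is_large])))
```

The result is the same for any evaluation order; only the amount of work changes, which is why the order can safely be revised in the middle of execution.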

XML for data integration
- XML offered a common syntactic format for sharing data among data sources.
- It fueled interest in data integration, since it appeared as if data could actually be shared.
- Integration systems were built using XML as the underlying data model and XML query languages (XQuery).

Model Management
- Goal: provide an algebra for manipulating schemas and mappings.
- With such an algebra, complex operations on data sources become simple sequences of operators in the algebra.
- Some of the operators in Model Management: create & compose mappings, merge & diff models.
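A toy version of two of these operators, assuming a mapping is just a dictionary of attribute correspondences. Real model-management operators work on much richer schemas and mapping expressions; the function bodies and the sample mappings here are hypothetical.

```python
def compose(m1, m2):
    """Compose a mapping A -> B with a mapping B -> C into A -> C.

    Elements of A whose image has no counterpart in C are dropped.
    """
    return {a: m2[b] for a, b in m1.items() if b in m2}


def diff(schema, mapping):
    """Return the elements of a schema that the mapping does not cover."""
    return [element for element in schema if element not in mapping]


source_to_mediated = {"Album": "Title", "ASIN": "ASIN", "Studio": "Label"}
mediated_to_target = {"Title": "name", "ASIN": "id"}

print(compose(source_to_mediated, mediated_to_target))
print(diff(["Album", "ASIN", "Price"], source_to_mediated))
```

Chaining such operators is what turns a complex task, such as migrating a source to a new target schema, into a short sequence of algebra steps.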

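Peer Data Management Systems
[Figure: a network of peers (Berkeley, Stanford, DBLP, UW (Washington), UW (Wisconsin), CiteSeer, UW (Waterloo)) connected by LAV and GLAV mappings; a query Q and the queries Q1 to Q6 are shown on the peers]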

Two Additional Benefits
- A P2P architecture offers a truly distributed mechanism for sharing data.
  - Every data source provides semantic mappings only to a set of neighbors.
  - Complex integrations emerge by following semantic paths.
- A P2P architecture is more appropriate than a single mediated schema in a data sharing context.
  - There is never a single global mediated schema.
  - Data sharing occurs in local neighborhoods of the network.
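Following semantic paths can be sketched as a graph search over peer-to-peer mappings. The neighbor table below is hypothetical (the peer names echo the figure, but the edges are invented), and breadth-first search is just one reasonable way to find a chain of mappings.

```python
from collections import deque

# Hypothetical mappings: each peer only lists the neighbors it has mappings to.
NEIGHBORS = {
    "UW": ["Stanford", "DBLP"],
    "Stanford": ["Berkeley"],
    "DBLP": ["CiteSeer"],
    "Berkeley": [],
    "CiteSeer": [],
}


def semantic_path(start, goal, neighbors=NEIGHBORS):
    """Breadth-first search for a chain of mappings from start to goal."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for peer in neighbors.get(path[-1], []):
            if peer not in visited:
                visited.add(peer)
                queue.append(path + [peer])
    return None  # no chain of mappings connects the two peers


print(semantic_path("UW", "CiteSeer"))
```

No peer needs a global view: a query posed at UW can reach CiteSeer because the UW-to-DBLP and DBLP-to-CiteSeer mappings can be chained.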

Description Logics
- Describe relationships between data sources:
  - data sources need to be represented declaratively
  - the mediated schema of IM was based on the Classic Description Logic
- Description Logics offered more flexible mechanisms for representing a mediated schema.
- Recent work: combine the expressive power of Description Logics with the ability to manage large amounts of data.

The Data Integration Industry
- Late 1990s: commercialization.
- Enterprise Information Integration (EII): query across data sources without having to first load all the data into a central warehouse.
- Factors behind the development of the EII industry:
  - technologies from research labs had matured enough
  - the needs of data management
  - XML
- Inappropriate alternatives: data warehousing solutions, ad hoc solutions.

A data integration scenario
- Identify the data sources that will participate in the application.
- Build the mediated (virtual) schema and the semantic mappings; applications are built on top of it.
Query processing
- A query is posed over the virtual schema.
- Query reformulation produces a query over the data sources.
- Execute with an engine that creates plans that span multiple data sources.

Other EII Products
- Products based on the XML data model and XQuery.
  - Challenge: the research on integration for XML was only in its infancy.
- Products for customer-relationship management.
  - Challenge: how to provide the customer-facing worker with a global view of a customer whose data resides in multiple sources, and track information from multiple sources in real time.

Future Challenges
- The factors behind data integration challenges:
  - Social: data integration is fundamentally about getting people to collaborate and share data.
  - Complexity of integration: data integration has been referred to as a problem as hard as AI, maybe even harder!
- Our goal: create tools that facilitate data integration in a variety of scenarios.

Several Specific Challenges
- Dataspaces: pay-as-you-go data management
- Uncertainty and lineage
- Reusing human attention

Dataspaces
- Database system: create the schema first!
- Data integration system: create the semantic mappings first!
- Fundamental shortcoming: long setup time!
- Dataspaces: the idea of pay-as-you-go data management.

Pay-as-you-go
- Offer some services immediately, without any setup time, and improve the services as more investment is made into creating semantic relationships.
- A dataspace should offer keyword search over any data in any source with no setup time.
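Keyword search with no setup can be sketched as a scan over whatever records each source exposes, without any schema or mappings. The sources, their records, and the function name below are hypothetical, and a real dataspace system would use an index rather than a linear scan.

```python
def keyword_search(keyword, sources):
    """Return (source name, record) pairs whose values mention the keyword.

    No schema or semantic mapping is consulted: every value is treated as text.
    """
    needle = keyword.lower()
    hits = []
    for name, records in sources.items():
        for record in records:
            if any(needle in str(value).lower() for value in record.values()):
                hits.append((name, record))
    return hits


# Hypothetical sources with unrelated schemas.
SOURCES = {
    "books": [{"Title": "Jazz Theory", "ISBN": "111"}],
    "cds": [
        {"Album": "Kind of Blue", "Category": "Jazz"},
        {"Album": "Abbey Road", "Category": "Rock"},
    ],
}

for source, record in keyword_search("jazz", SOURCES):
    print(source, record)
```

Semantic mappings added later would let the system answer structured queries as well, which is the "improve as more investment is made" half of the idea.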

Pay-as-you-go Data Management
[Figure: benefit plotted against investment (time, cost), comparing dataspaces with data integration solutions]
Dataspaces: Franklin, Halevy, Maier [see PODS 2006]

Uncertain data & data lineage
- A necessity in a data integration system: it must introspect about the certainty of its data.
- When it cannot automatically determine the certainty of the data, it should refer the user to the lineage of the data.
- Example: web search engines provide URLs along with their search results, so users can consider the URLs when deciding which results to explore further.
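A minimal sketch of keeping lineage next to integrated values, assuming the input arrives as (entity, value, source) triples. The sample observations, the function names, and the rule that disagreement between sources signals uncertainty are all illustrative assumptions.

```python
from collections import defaultdict


def integrate_with_lineage(observations):
    """Group (entity, value, source) triples into entity -> {value: [sources]}."""
    result = defaultdict(lambda: defaultdict(list))
    for entity, value, source in observations:
        result[entity][value].append(source)
    return {entity: dict(values) for entity, values in result.items()}


def is_uncertain(values_with_lineage):
    """Treat an entity as uncertain when its sources report different values."""
    return len(values_with_lineage) > 1


# Hypothetical genre reports for two CDs from three sources.
OBSERVATIONS = [
    ("cd-1", "Jazz", "R1"),
    ("cd-1", "Jazz", "R3"),
    ("cd-1", "Blues", "R2"),
    ("cd-2", "Rock", "R2"),
]

integrated = integrate_with_lineage(OBSERVATIONS)
for entity, values in integrated.items():
    print(entity, values, "uncertain" if is_uncertain(values) else "consistent")
```

When the system cannot resolve the conflict on cd-1 by itself, it can still show the user which sources back each value, much as a search engine shows the URL behind each result.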

Reusing human attention
- Goal: achieving tighter semantic integration among data sources.
- Any operation a user performs on data sources gives a semantic clue about the data or about relationships between data sources.
- Systems that leverage these semantic clues can obtain semantic integration much faster.
- This is an area for additional research and development.

Conclusion
- Not so long ago, data integration was a nice feature and an area for intellectual curiosity; today it is a necessity.
- Today's economy further emphasizes the need for data integration solutions.
- Thomas Friedman: The World Is Flat.

A Framework for Deep Web Integration
[Figure: framework components marked as developed, developing, or undeveloped issues, with our focuses highlighted]

Q & A