ANHAI DOAN ALON HALEVY ZACHARY IVES CHAPTER 1: INTRODUCTION TO DATA INTEGRATION PRINCIPLES OF DATA INTEGRATION.

Slides:



Advertisements
Similar presentations
Università di Modena e Reggio Emilia ;-)WINK Maurizio Vincini UniMORE Researcher Università di Modena e Reggio Emilia WINK System: Intelligent Integration.
Advertisements

Distributed Data Processing
1 Data Integration June 3 rd, What is Data Integration? uniform accessmultiple autonomousheterogeneousdistributed Provide uniform access to data.
CSE 636 Data Integration Data Integration Approaches.
CHAPTER 3: DESCRIBING DATA SOURCES
CHAPTER 7 Roderick Dickson Kelli Grubb Tracyann Pryce Shakita White.
Chapter 3 Database Management
Data Integration Helena Galhardas DEI IST (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)
1 Lecture 13: Database Heterogeneity Debriefing Project Phase 2.
Integration and Insight Aren’t Simple Enough Laura Haas IBM Distinguished Engineer Director, Computer Science Almaden Research Center.
McGraw-Hill/Irwin Copyright © 2008, The McGraw-Hill Companies, Inc. All rights reserved.McGraw-Hill/Irwin Copyright © 2008 The McGraw-Hill Companies, Inc.
Infomaster: An information Integration Tool O. M. Duschka and M. R. Genesereth Presentation by Cui Tao.
CSE 636 Data Integration Overview. 2 Data Warehouse Architecture Data Source Data Source Relational Database (Warehouse) Data Source Users   Applications.
CSE 636 Data Integration Introduction. 2 Staff Instructor: Dr. Michalis Petropoulos Location: 210 Bell Hall Office Hours:
Chapter 4: Database Management. Databases Before the Use of Computers Data kept in books, ledgers, card files, folders, and file cabinets Long response.
Chapter 14 The Second Component: The Database.
Dataspaces: A New Abstraction for Data Management Mike Franklin, Alon Halevy, David Maier, Jennifer Widom.
1 LBSC 690: Week 9 SQL, Web Forms. 2 Discussion Points Websites that are really databases Deep vs. Surface Web.
1 Information Integration and Source Wrapping Jose Luis Ambite, USC/ISI.
Web-based Portal for Discovery, Retrieval and Visualization of Earth Science Datasets in Grid Environment Zhenping (Jane) Liu.
MDC Open Information Model West Virginia University CS486 Presentation Feb 18, 2000 Lijian Liu (OIM:
XML, distributed databases, and OLAP/warehousing The semantic web and a lot more.
Overview of SQL Server Alka Arora.
Databases From A to Boyce Codd. What is a database? It depends on your point of view. For Manovich, a database is a means of structuring information in.
Chapter 5 Lecture 2. Principles of Information Systems2 Objectives Understand Data definition language (DDL) and data dictionary Learn about popular DBMSs.
Course Introduction Introduction to Databases Instructor: Joe Bockhorst University of Wisconsin - Milwaukee.
Fundamentals of Information Systems, Fifth Edition
Organizing Data and Information AD660 – Databases, Security, and Web Technologies Marcus Goncalves Spring 2013.
Chapter 7: Database Systems Succeeding with Technology: Second Edition.
CSE 636 Data Integration Overview Fall What is Data Integration? The problem of providing uniform (sources transparent to user) access to (query,
© 2005 Prentice Hall, Decision Support Systems and Intelligent Systems, 7th Edition, Turban, Aronson, and Liang 5-1 Chapter 5 Business Intelligence: Data.
Databases From A to Boyce Codd. What is a database? It depends on your point of view. For Manovich, a database is a means of structuring information in.
XML & Mediators Thitima Sirikangwalkul Wai Sum Mong April 10, 2003.
1 XML Based Networking Method for Connecting Distributed Anthropometric Databases 24 October 2006 Huaining Cheng Dr. Kathleen M. Robinette Human Effectiveness.
Database Design and Management CPTG /23/2015Chapter 12 of 38 Functions of a Database Store data Store data School: student records, class schedules,
Keyword Searching and Browsing in Databases using BANKS Seoyoung Ahn Mar 3, 2005 The University of Texas at Arlington.
Ihr Logo Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization Turban, Aronson, and Liang.
2007. Software Engineering Laboratory, School of Computer Science S E Web-Harvest Web-Harvest: Open Source Web Data Extraction tool 이재정 Software Engineering.
6.1 © 2010 by Prentice Hall 6 Chapter Foundations of Business Intelligence: Databases and Information Management.
INFO1408 Database Design Concepts Week 15: Introduction to Database Management Systems.
Chapter 8 Data and Knowledge Management. 2 Learning Objectives When you finish this chapter, you will  Know the difference between traditional file organization.
1 Introduction to Oracle Chapter 1. 2 Before Databases Information was kept in files: Each field describes one piece of information about student Fields.
ITGS Databases.
1 Technology in Action Chapter 11 Behind the Scenes: Databases and Information Systems Copyright © 2010 Pearson Education, Inc. Publishing as Prentice.
Database Principles. Basics A database is a collection of data, along with the relationships between the data The data has to be entered into a structure,
Chapter 5 DATA WAREHOUSING Study Sections 5.2, 5.3, 5.5, Pages: & Snowflake schema.
Management Information Systems, 4 th Edition 1 Chapter 8 Data and Knowledge Management.
Information Integration BIRN supports integration across complex data sources – Can process wide variety of structured & semi-structured sources (DBMS,
Data Integration Hanna Zhong Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009.
Data Integration: Achievements and Perspectives in the Last Ten Years AiJing.
Issues in Ontology-based Information integration By Zhan Cui, Dean Jones and Paul O’Brien.
3/6: Data Management, pt. 2 Refresh your memory Relational Data Model
© 2003 Prentice Hall, Inc.3-1 Chapter 3 Database Management Information Systems Today Leonard Jessup and Joseph Valacich.
Providing web services to mobile users: The architecture design of an m-service portal Minder Chen - Dongsong Zhang - Lina Zhou Presented by: Juan M. Cubillos.
A Portrait of the Semantic Web in Action Jeff Heflin and James Hendler IEEE Intelligent Systems December 6, 2010 Hyewon Lim.
Data Integration Approaches
1 Copyright © 2009, Oracle. All rights reserved. Oracle Business Intelligence Enterprise Edition: Overview.
ASET 1 Amity School of Engineering & Technology B. Tech. (CSE/IT), III Semester Database Management Systems Jitendra Rajpurohit.
Fundamentals of Information Systems, Sixth Edition Chapter 3 Database Systems, Data Centers, and Business Intelligence.
1 Information Retrieval and Use De-normalisation and Distributed database systems Geoff Leese September 2008, revised October 2009.
Chapter 1: Introduction. 1.2 Database Management System (DBMS) DBMS contains information about a particular enterprise Collection of interrelated data.
DO YOU TRUST YOUR DATA? KNOW THE ANSWER WITH EIM! Jose Hernandez Director, Business Intelligence Dunn Solutions Group.
1 Case Study: Business Intelligence & Customer Data Customer Support Web-based Dashboard VP Marketing SQL XSLT XML Data Grid Customer Data Customer Order.
Moderator Don Pearson Chief Strategy Officer Inductive Automation.
The LIBI Federated database
Integrating Enterprise Applications Into SharePoint® Portal Server
Ch 4. The Evolution of Analytic Scalability
MANAGING DATA RESOURCES
Yannis Papakonstantinou Associate Prof., CSE, UCSD
McGraw-Hill Technology Education
Presentation transcript:

ANHAI DOAN ALON HALEVY ZACHARY IVES CHAPTER 1: INTRODUCTION TO DATA INTEGRATION PRINCIPLES OF DATA INTEGRATION

Outline  Introduction: data integration as a new abstraction  Examples of data integration applications  Schema heterogeneity  Goal of data integration, why it’s a hard problem  Data integration architectures

Data Integration  Databases are great: they let us manage huge amounts of data  Assuming you’ve put it all into your schema.  In reality, data sets are often created independently  Only to discover later that they need to combine their data!  At that point, they’re using different systems, different schemata and have limited interfaces to their data.  The goal of data integration: tie together different sources, controlled by many people, under a common schema.

DBMS: it’s all about abstraction  Logical vs. Physical; What vs. How. Students:Takes: Courses: SELECT C.name FROM Students S, Takes T, Courses C WHERE S.name=“Mary” and S.ssn = T.ssn and T.cid = C.cid SELECT C.name FROM Students S, Takes T, Courses C WHERE S.name=“Mary” and S.ssn = T.ssn and T.cid = C.cid

Data Integration: A Higher-level Abstraction Mediated Schema Query S1S2S3 …… Semantic Mappings Independence of: source & location data model, syntax semantic variations … The best of … Carreras Pavarotti Domingo The best of … Carreras Pavarotti Domingo 19.95

Outline Introduction: data integration as a new abstraction  Examples of data integration applications  Schema heterogeneity  Goal of data integration, why it’s a hard problem  Data integration architectures

Applications of Data Integration  Business  Science  Government  The Web  Pretty much everywhere

Application Area 1: Business Single Mediated View Legacy Databases Services and Applications Enterprise Databases CRM ERP Portals … EII Apps: 50% of all IT $$$ spent here!

OMIM Swiss- Prot HUGOGO Gene- Clinics Entrez Locus- Link GEO Sequenceable Entity GenePhenotype Structured Vocabulary Experiment Protein Nucleotide Sequence Microarray Experiment Hundreds of biomedical data sources available; growing rapidly! Application Area 2: Science

Application Area 3: The Web

Hundreds of millions of high-quality tables on the Web

The Deep Web  Millions of high quality HTML forms out there  Each form has its own special interface  Hard to explore data across sites.  Goal (for some domains):  A single interface into a multitude of deep-web sources.

Create a single site to search for jobs/rentals/…

Easily traverse between the site by clicking its name

Outline Introduction: data integration as a new abstraction Examples of data integration applications  Schema heterogeneity  Goal of data integration, why it’s a hard problem  Data integration architectures

Enterprise Data Integration: FullServe Corporation Employees Training Sales Resumes Services HelpLine FullTimeEmp Hire TempEmployees Courses Enrollments Products Sales Interview CV Services Customers Contracts Calls

EuroCard Corporation Employees Credit Cards Resumes HelpLine Employees Hire Customer CustDetail Interview Calls

Examples of Heterogeneity FullServe FullTimeEmp ssn, empId, firstName middleName, lastName Hire empId, hireDate, recruiter TempEmployees ssn, hireStart, hireEnd EuroCard Employees ID, firstNameMiddleInitial, lastName Hire ID, hireDate, recruiter Find all employees (making over $100K)

Customer Call Center Sales Services Products Sales Services Customers Contracts Credit Cards Customer CustDetail Agents should have a full view of customer when they call in.

Other Reasons to Integrate Data  Create a (useful) web site for tracking services  Collaborate with third parties  E.g., create branded services  Comply with government regulations  Find “risky” employees  Business intelligence  What’s really wrong with our products?

Outline Introduction: data integration as a new abstraction Examples of data integration applications Schema heterogeneity  Goal of data integration, why it’s a hard problem  Data integration architectures

Goal of Data Integration  Uniform query access to a set of data sources  Handle:  Scale of sources: from tens to millions  Heterogeneity  Autonomy  Semi-structure

Why is it Hard?  Systems-level reasons:  Managing different platforms  SQL across multiple systems is not so simple  Distributed query processing  Logical reasons:  Schema (and data) heterogeneity  ‘Social’ reasons:  Locating and capturing relevant data in the enterprise.  Convincing people to share (data fiefdoms)  Security, privacy and performance implications.

Setting Expectations Data integration is AI-Complete.  Completely automated solutions unlikely. Goal 1:  Reduce the effort needed to set up an integration application. Goal 2:  Enable the system to perform gracefully with uncertainty (e.g., on the web)

Data Integration Smorgasbord Something for everyone:  Theory of modeling data sources  Systems aspects of data integration  Architectural issues: e.g., P2P data sharing  work: automated schema matching  Web: latest on data integration & web  Commercial products: BEA, IBM  Semantic Web: what does it have to offer?  New trends in DBMS: uncertainty, dataspaces

Outline Introduction: data integration as a new abstraction Examples of data integration applications Schema heterogeneity Goal of data integration, why it’s a hard problem  Data integration architectures

Virtual, Warehousing and in Between  Data warehousing: integrate by bringing the data into a single physical warehouse  Virtual data integration: leave the data at the sources and access it at query time.  Some differences, but semantic heterogeneity arises in both cases.  Numerous intermediate architectures.  The course illustrates data integration technology mostly through the virtual architecture.

Virtual Data Integration Architecture Mediated Schema or Warehouse Wrapper/ Extractor Wrapper/ Extractor Wrapper/ Extractor Wrapper/ Extractor RDBMS 1 2 HTML 1 XML 1 Source descriptions/ Transforms Query reformulation/ Query over materialized data

Example

Wrappers The best of … Abiteboul Pavarotti Domingo … The best of … Abiteboul Pavarotti Domingo … Send queries to data sources and transform answers into tuples (or other internal data model). (Chapter 9)

Mediation Languages Books Title ISBN Price DiscountPrice Edition CDs Album ASIN Price DiscountPrice Studio BookCategories ISBN Category CDCategories ASIN Category Artists ASIN ArtistName GroupName Authors ISBN FirstName LastName CD: ASIN, Title, Genre,… Artist: ASIN, name, … Mediated Schema logic Describe relationships between mediated schema and data sources (Chapter 3).

Woody Allen Comedies in NY select title, startTime from Movie, Plays where Movie.title=Plays.movie AND location=“New York” AND director=“Woody Allen” Movie: Title, director, year, genre Actors: title, actor Plays: movie, location, startTime Reviews: title, rating, description Mediated schema:

Cinemas: place, movie, start Reviews: title, date grade, review Movies: name, actors, director, genre Cinemas in NYC: cinema, title, startTime Cinemas in SF: location, movie, startingTime Movie: Title, director, year, genre Actors: title, actor Plays: movie, location, startTime Reviews: title, rating, description S1S2S3S4S5 select title, startTime from Movie, Plays where Movie.title=Plays.movie AND location=“New York” AND director=“Woody Allen” Sources S1 and S3 are relevant, sources S4 and S5 are irrelevant, and source S2 is relevant but possibly redundant.

Execution engine source wrapper source wrapper source wrapper source wrapper source wrapper Query optimizer Query reformulation Query Logical query plan Physical query plan Replanning request Query Processing Chapter 8

36 Data Warehouses – Offline Replication  Determine physical schema  Define a database with this schema  Define procedural mappings in an “ETL tool” to import the data and clean it.  Periodically copy all of the data from the data sources  Note that the sources and the warehouse are basically independent at this point Data Warehouse Query Results

37 Pros and Cons of Data Warehouses  Need to spend time to design the physical database layout, as well as logical  This actually takes a lot of effort!  Data is generally not up-to-date (lazy or offline refresh) Queries over the warehouse don’t disrupt the data sources Can run very heavy-duty computations, including data mining and cleaning

Summary of Chapter 1  Data integration: abstract away the fact that data comes from multiple sources in varying schemata.  Problem occurs everywhere: it’s key to business, science, Web and government.  Goal: reduce the effort involved in integrating.  Regardless of the architecture, heterogeneity is a key issue.  Architectures range from warehousing to virtual integration.