Data integration and transformation Paolo Atzeni Dipartimento di Informatica e Automazione Università Roma Tre 29/09/2010.


P. Atzeni, GII School, 29/09/2010

A ten-year goal for database research
The “Asilomar report” (Bernstein et al., SIGMOD Record)
– The information utility: make it easy for everyone to store, organize, access, and analyze the majority of human information online
A lot of interesting work has been done, but …
… integration, translation, and exchange are still difficult …
… 2009 has come … we are late!

A general framework: cooperation
"The capacity of a system to interact (effectively) with other systems, possibly managed by different organizations"

Forms of cooperation
Process-centered cooperation:
– the systems offer services to one another by exchanging messages, information, or documents, or by triggering activities, without making remote data explicitly visible
Data-centered cooperation:
– the systems offer data to one another; the data is distributed, heterogeneous, and autonomous, and is accessible from remote locations according to some cooperation agreement

Databases in the Internet era
Databases before the Internet
– An internal infrastructure, a precious resource, but usually hidden, with some controlled cooperation
The Internet changes the requirements
– More users (not only humans), more diverse cooperating systems (distributed, heterogeneous, autonomous), more types of data
The "future" Internet changes more
– New devices (embedded everywhere), even more users (many “per person”), real mobility, need for personalization and adaptation
– Social networks
– Cloud computing applications

The most studied form of data-centered cooperation: integration
We are interested in data-centered cooperation, often referred to as integration
“The unification of related, heterogeneous data from disparate sources, for example, to enable collaboration” (Hammer & Stonebraker 2005)
Some "paradigms" …

Multidatabase
[Figure: architecture diagram with a Global Manager and three sources, each with a Mediator, a Local mgr, and a DB]

Data Warehousing System
[Figure: architecture diagram; labels: Mediator and Local manager (three times), “Integrator”, DB, Data Warehouse, DW manager]

Intermediate solutions in practice
[Figure: architecture diagram combining the previous components; labels: Mediator, Local manager, DB, Integrator, Application]

Peer-based architecture
[Figure: three peers, each with a Peer mgr, a Mediator, and two Local mgr / DB pairs]

Data is not just in databases
Mail messages
Social networks
Web pages
Spreadsheets
Textual documents
Palmtop devices, mobile phones
Multimedia annotations (e.g., in digital photos)
XML documents

Data spaces
Information and data are often unstructured and not preprocessed

The same data in the same form?
Adaptivity:
– Personalization: content adapted to the user (upon the system's decision, upon the user's request)
– Customization: structure adapted to the user (according to the user's role, upon the user's request)
– Context dependence: User, Device, Network, Place, Time, Rate

A general need
We have data at various places, and data has to be
– exchanged
– replicated
– migrated
– integrated
– adapted

A major difficulty
Heterogeneity
– System level
– Model level
– Structural (different structure for similar data)
– Semantic (different meaning for the same structure)
Many efforts, but current techniques are mostly manual and ad hoc

Three problems
Schema and data translation
Schema and data integration
Data exchange

Schema and data translation
Given a schema, find another one with respect to some specific goal (better quality, another model, …)

Many different models
[Figure: data models shown: OR, Relational, OO, ER, XSD, …]

Many different models (and variants …)
[Figure: model variants shown: Relational; OR w/ PK, gen, ref, FK; OR w/ PK, gen, ref; OR w/ PK, gen, FK; OR w/ PK, ref, FK; OR w/ gen, ref; OR w/ PK, FK; OR w/ PK, ref; OR w/ ref; …]

Schema and data integration
Given two or more sources, build an integrated schema or database

Data exchange
Given a source and a target schema, find a transformation from the former to the latter

Data exchange, a typical approach (the Clio project)
[Figure: Source schema and Target schema, with the steps Schema Match, Mapping generation, Query generation]

Data exchange, example
PayRate (Rank, HrRate)
Professor (Id, Name, Sal)
Student (Name, GPA, Yr)
WorksOn (Name, Proj, Hrs, ProjRank)
Personnel (Id, Name, Sal, Addr)
Address (Id, Addr)


The process, example
PayRate (Rank, HrRate)
Professor (Id, Name, Sal)
Student (Name, GPA, Yr)
WorksOn (Name, Proj, Hrs, ProjRank)
Personnel (Id, Name, Sal, Addr)
Address (Id, Addr)

SELECT P.Id, P.Name, P.Sal, A.Addr
FROM Professor P, Address A
WHERE A.Id = P.Id
UNION ALL
SELECT NULL AS Id, S.Name, P.HrRate * W.Hrs, NULL AS Addr
FROM PayRate P, Student S, WorksOn W
WHERE W.Name = S.Name AND S.Yr = P.Rank
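The query on this slide can be tried out on a toy database. The sketch below is only an illustration: it assumes that Personnel is the target relation (as the query's output columns suggest), the column types and the sample rows are made up, and SQLite is used just because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PayRate   (Rank INTEGER, HrRate REAL);
CREATE TABLE Professor (Id INTEGER, Name TEXT, Sal REAL);
CREATE TABLE Student   (Name TEXT, GPA REAL, Yr INTEGER);
CREATE TABLE WorksOn   (Name TEXT, Proj TEXT, Hrs REAL, ProjRank INTEGER);
CREATE TABLE Address   (Id INTEGER, Addr TEXT);
CREATE TABLE Personnel (Id INTEGER, Name TEXT, Sal REAL, Addr TEXT);

-- Made-up sample rows, for illustration only.
INSERT INTO Professor VALUES (1, 'Ada', 5000);
INSERT INTO Address   VALUES (1, 'Rome');
INSERT INTO PayRate   VALUES (2, 10);
INSERT INTO Student   VALUES ('Bob', 3.5, 2);
INSERT INTO WorksOn   VALUES ('Bob', 'X', 20, 1);
""")

# The transformation from the slide, used to populate the assumed target.
conn.execute("""
INSERT INTO Personnel (Id, Name, Sal, Addr)
SELECT P.Id, P.Name, P.Sal, A.Addr
FROM Professor P, Address A
WHERE A.Id = P.Id
UNION ALL
SELECT NULL AS Id, S.Name, P.HrRate * W.Hrs, NULL AS Addr
FROM PayRate P, Student S, WorksOn W
WHERE W.Name = S.Name AND S.Yr = P.Rank
""")

rows = conn.execute(
    "SELECT Id, Name, Sal, Addr FROM Personnel ORDER BY Name"
).fetchall()
for row in rows:
    print(row)
```

Professors arrive with their address; students arrive with a salary computed from hours and hourly rate, and with NULL for the Id and Addr values that the source does not provide.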

A direction for the solutions
Be general!
– ad hoc solutions can work in the small, but they are repetitive and time-consuming, do not scale, and are not maintainable
Historical notes:
– W. C. McGee: Generalization: Key to Successful Electronic Data Processing. J. ACM 1959
Indeed, databases are the result of generalization applied to secondary storage management!

Generality requires …
… high-level descriptions of problems within the family of interest:
– Metadata: “data about data”, a (formal or informal) description of structures and meaning
General solutions leverage metadata (and then operate on data as a consequence)

A wider perspective
(Generic) Model Management:
– A proposal by Bernstein et al. (2000 +)
– Includes a set of operators on schemas and mappings between schemas

Terminology: a warning
Model Mgmt people      Traditional DB people
Meta-metamodel         Metamodel
Model                  Schema

Schemas and mappings
More on the issue later
For the time being:
– Schema: a set of elements, related in some way to one another
– Mapping: a set of correspondences (pairs of elements) or its reification, a third schema related to the other two via two sets of correspondences
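The two views of a mapping can be made concrete with a small sketch. Everything here is an assumption made for illustration: the `Schema` class, the `reify` helper, and the element names are hypothetical, and a schema is reduced to a plain set of element names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schema:
    """Hypothetical representation: a schema is a named set of element names."""
    name: str
    elements: frozenset


def reify(s1, s2, correspondences):
    """Turn a set of (e1, e2) pairs into a third schema plus two sets of
    correspondences, one towards s1 and one towards s2."""
    pairs = sorted(correspondences)
    ids = [f"c{i}" for i in range(len(pairs))]  # one mapping element per pair
    m = Schema(f"{s1.name}_{s2.name}_map", frozenset(ids))
    to_s1 = {(c, e1) for c, (e1, _) in zip(ids, pairs)}
    to_s2 = {(c, e2) for c, (_, e2) in zip(ids, pairs)}
    return m, to_s1, to_s2


s1 = Schema("S1", frozenset({"Professor.Name", "Professor.Sal"}))
s2 = Schema("S2", frozenset({"Personnel.Name", "Personnel.Sal"}))
corr = {
    ("Professor.Name", "Personnel.Name"),
    ("Professor.Sal", "Personnel.Sal"),
}

m, to_s1, to_s2 = reify(s1, s2, corr)
print(sorted(m.elements), sorted(to_s1), sorted(to_s2))
```

The reified form makes each correspondence an element in its own right, so that extra information (for example an expression or a condition) could later be attached to it.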

Model mgmt operators, a first set
map = Match (S1, S2)
S3 = Merge (S1, S2, map)
S2 = Diff (S1, map)
and more
– map3 = Compose (map1, map2)
– S2 = Select (S1, pred)
– Apply (S, f)
– list = Enumerate (S)
– S2 = Copy (S1)
– …

Match
map = Match (S1, S2)
– given two schemas S1, S2
– returns a mapping between them
the “classical” initial step in data integration:
– find the common elements of two schemas and the correspondences between them
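As a rough illustration of Match, the sketch below pairs elements whose names agree when case is ignored. This is a deliberately naive stand-in chosen for brevity: real matchers use much richer evidence and usually need human review, and the `match` function and the element names are hypothetical.

```python
def match(s1, s2):
    """Return correspondences (e1, e2) between elements of two schemas
    (sets of element names) whose names agree, ignoring case."""
    by_name = {e.lower(): e for e in s2}
    return {(e, by_name[e.lower()]) for e in s1 if e.lower() in by_name}


professor = {"Id", "Name", "Sal"}
personnel = {"id", "name", "sal", "addr"}

mapping = match(professor, personnel)
print(sorted(mapping))
```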

Merge
S3 = Merge (S1, S2, map)
– given two schemas and a mapping between them
– returns a third schema (and two mappings)
the “classical” second step in data integration:
– given the correspondences, find a way to obtain one schema out of two
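A minimal sketch of Merge under strong simplifying assumptions: schemas are sets of element names, a mapping is a set of pairs, and matched elements are unified by keeping the name used in the first schema. The `merge` function and this conflict policy are hypothetical choices, not the operator's definition.

```python
def merge(s1, s2, mapping):
    """Merge two schemas given correspondences (e1, e2).

    Returns the merged schema and two mappings that relate each input
    schema to the merged one."""
    s2_to_s1 = {e2: e1 for e1, e2 in mapping}
    # Unmatched elements of s2 are added; matched ones are already in s1.
    s3 = set(s1) | {e for e in s2 if e not in s2_to_s1}
    map1 = {(e, e) for e in s1}
    map2 = {(e, s2_to_s1.get(e, e)) for e in s2}
    return s3, map1, map2


professor = {"Id", "Name", "Sal"}
address = {"PersonId", "Addr"}
correspondences = {("Id", "PersonId")}

merged, map1, map2 = merge(professor, address, correspondences)
print(sorted(merged), sorted(map2))
```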

Diff
S2 = Diff (S1, map)
– given a schema and a mapping from it (to some other schema, not relevant)
– returns a (sub-)schema, with the elements that do not participate in the mapping
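With the same simplified representation (a schema as a set of element names, a mapping as a set of pairs whose first component belongs to the schema), Diff reduces to a set difference. The `diff` function and the element names below are hypothetical.

```python
def diff(s1, mapping):
    """Return the elements of s1 that do not participate in the mapping."""
    mapped = {e1 for e1, _ in mapping}
    return {e for e in s1 if e not in mapped}


personnel = {"Id", "Name", "Sal", "Addr"}
to_other = {("Id", "PersonId"), ("Name", "FullName")}

rest = diff(personnel, to_other)
print(sorted(rest))
```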

Example (Bernstein and Rahm, ER 2000)
A database (a “source”), a data warehouse, and a mapping between the two
We want to add a source, with some similarity to the first one, and update the DW
[Figure: DB1, DW1, DB2, DW2]

Example, the "solution"
m2 = Match (DB1, DB2)
m3 = Compose (m2, m1)
DB2’ = Diff (DB2, m3)
DW2’, m4 user defined
m5 = Match (DW1, DW2’)
DW2 = Merge (DW1, DW2’, m5)
[Figure: DB1, DW1, DB2, DB2’, DW2’, DW2, with the mappings m1 to m5]
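The script can be traced on toy schemas. The sketch below is an interpretation, not the authors' implementation: schemas are sets of element names, Match is plain name equality, Merge returns only the merged schema, and Compose is read as joining m2 (DB1 to DB2) with m1 (DB1 to DW1) on their shared DB1 element, which yields a mapping from DB2 to DW1. All element names are invented.

```python
def match(s1, s2):
    """Naive Match: pair elements that have the same name."""
    return {(e, e) for e in s1 & s2}


def compose(m2, m1):
    """Join m2 (DB1 -> DB2) and m1 (DB1 -> DW1) on the DB1 element."""
    return {(b2, w) for b1, b2 in m2 for a1, w in m1 if a1 == b1}


def diff(s, mapping):
    """Elements of s that do not participate in the mapping."""
    return s - {e for e, _ in mapping}


def merge(s1, s2, mapping):
    """Simplified Merge: unify matched elements, keep everything else."""
    matched = {e2 for _, e2 in mapping}
    return s1 | (s2 - matched)


db1 = {"cust_id", "cust_name", "order_total"}
dw1 = {"customer_key", "customer_name", "sales_amount"}
m1 = {
    ("cust_id", "customer_key"),
    ("cust_name", "customer_name"),
    ("order_total", "sales_amount"),
}
db2 = {"cust_id", "cust_name", "order_total", "region"}

m2 = match(db1, db2)
m3 = compose(m2, m1)           # which parts of DB2 are already covered by DW1
db2_rest = diff(db2, m3)       # the part of DB2 that is new

# User-defined step: a warehouse fragment for the new part, and its mapping.
dw2_new = {"customer_key", "region_name"}
m4 = {("region", "region_name")}

m5 = match(dw1, dw2_new)
dw2 = merge(dw1, dw2_new, m5)
print(sorted(db2_rest), sorted(dw2))
```

Only the step that designs the warehouse fragment for the new elements is left to the user here; in practice the Match steps would also need review.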

Magic does not exist
Operators might require human intervention:
– Match is the main case
Scripts involving operators might require human intervention as well (or at least benefit from it):
– a full implementation of each operator might not always be available
– a mapping might require manual specification
– incomparable alternatives might exist

The “data level”
The major operators also have an extended version that operates on data, and not only on schemas
Especially apparent for
– Merge

We also have heterogeneity
Round trip engineering (Bernstein, CIDR 2003)
– A specification, an implementation
– then a change to the implementation: we want to revise the specification
We need a translation from the implementation model to the specification one
[Figure: S1, I1, I2, S2]

Model management with heterogeneity
The previous operators have to be “model generic” (capable of working on different models)
We need a “translation” operator
– <S2, map> = ModelGen (S1)

ModelGen, an additional operator
<S2, map> = ModelGen (S1)
– given a schema (in a model)
– returns a schema (in a different data model) and a mapping between the two
A “translation” from one model to another
I should call it “SchemaGen” …
It would be better to write
– <S2, map> = ModelGen (S1, mod2)
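To give a feel for what ModelGen returns, here is a toy translation from a tiny ER-like description to relational tables. It is only a sketch under invented conventions: the dictionary layout of the input, the `model_gen` function, the naming of the generated columns, and the single supported target model are all assumptions.

```python
def model_gen(er_schema, target_model="relational"):
    """Translate a toy ER schema; return (tables, mapping).

    Entities become tables with the same attributes; each binary
    relationship becomes a table holding the keys of its two entities."""
    if target_model != "relational":
        raise ValueError("this sketch only targets the relational model")
    entities = er_schema["entities"]
    tables, mapping = {}, set()
    for name, spec in entities.items():
        tables[name] = list(spec["attributes"])
        for attr in spec["attributes"]:
            mapping.add((f"{name}.{attr}", f"{name}.{attr}"))
    for rel, (left, right) in er_schema["relationships"].items():
        left_key = entities[left]["key"]
        right_key = entities[right]["key"]
        tables[rel] = [f"{left}_{left_key}", f"{right}_{right_key}"]
        mapping.add((rel, rel))
    return tables, mapping


er = {
    "entities": {
        "Student": {"key": "Name", "attributes": ["Name", "GPA", "Yr"]},
        "Project": {"key": "Proj", "attributes": ["Proj", "ProjRank"]},
    },
    "relationships": {"WorksOn": ("Student", "Project")},
}

tables, mapping = model_gen(er, "relational")
print(tables)
```

Passing the target model explicitly mirrors the two-argument form suggested on the slide.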

Round trip engineering
m2 = Match (I1, I2)
m3 = Compose (m1, m2)
I2’ = Diff (I2, m3)
<S2’, m4> = ModelGen (I2’)
… Match, Merge
[Figure: S1, I1, I2, I2’, S2’, S2, with the mappings m1 to m4]

Summary
data management in the Internet world
data integration
schema and data translation, data exchange
model management
Part II