Automating the adaptation of evolving data-intensive ecosystems Petros Manousis, Panos Vassiliadis University of Ioannina, Ioannina, Greece George Papastefanatos.

Slides:

Advertisements

Similar presentations

Chapter 9. Performance Management Enterprise wide endeavor Research and ascertain all performance problems – not just DBMS Five factors influence DB performance.

Advertisements

Serializability in Multidatabases Ramon Lawrence Dept. of Computer Science

Justification-based TMSs (JTMS) JTMS utilizes 3 types of nodes, where each node is associated with an assertion: 1.Premises. Their justifications (provided.

Lecture # 2 : Process Models

The Zebra Striped Network File System Presentation by Joseph Thompson.

6/1/2015WM-001 Planning for Measurement - copyright Paul Sorenson slide 1 Planning for Measurement WM Software Process and Quality Measurement is.

G. Papastefanatos 1, P. Vassiliadis 2, A. Simitsis 3, T. Sellis 1,4, Y. Vassiliou 1 (1) National Technical University of Athens, Athens, Hellas (Greece)

William Stallings Data and Computer Communications 7 th Edition (Selected slides used for lectures at Bina Nusantara University) Internetworking.

Management of the Evolution of Database-Centric Information Systems Panos Vassiliadis 2, George Papastefanatos 1, Timos Sellis 1, Yannis Vassiliou 1 1.

George Papastefanatos 1, Fotini Anagnostou 1 Panos Vassiliadis 2, Yannis Vassiliou 1 (1) National Technical University of Athens

Academic Advisor: Prof. Ronen Brafman Team Members: Ran Isenberg Mirit Markovich Noa Aharon Alon Furman.

Chapter 1 and 2 Computer System and Operating System Overview

B + -Trees (Part 1). Motivation AVL tree with N nodes is an excellent data structure for searching, indexing, etc. –The Big-Oh analysis shows most operations.

George Papastefanatos 1, Panos Vassiliadis 2, Alkis Simitsis 3,Yannis Vassiliou 1 (1) National Technical University of Athens

Maximal Independent Set Distributed Algorithms for Multi-Agent Networks Instructor: K. Sinan YILDIRIM.

G. Papastefanatos 1, P. Vassiliadis 2, A. Simitsis 3, Y. Vassiliou 1 (1) National Technical University of Athens, Athens, Hellas (Greece)

G. Papastefanatos 1, P. Vassiliadis 2, A. Simitsis 3, K. Aggistalis 2, F. Pechlivani 2, Yannis Vassiliou 1 (1) National Technical University of Athens.

LSU 10/09/2007Project Schedule1 The Project Schedule Project Management Unit #4.

A Billiards Point of Sale Application Christopher Ulmer CS 470 Final Presentation.

Data Structures Introduction Phil Tayco Slide version 1.0 Jan 26, 2015.

H ECATAEUS A Framework for Representing SQL Constructs as Graphs George Papastefanatos 1, Kostis Kyzirakos 1, Panos Vassiliadis 2, Yannis Vassiliou 1 1.

CSC271 Database Systems Lecture # 30.

The Worlds of Database Systems Chapter 1. Database Management Systems (DBMS) DBMS: Powerful tool for creating and managing large amounts of data efficiently.

1 A Petri Net Siphon Based Solution to Protocol-level Service Composition Mismatches Pengcheng Xiong 1, Mengchu Zhou 2 and Calton Pu 1 1 College of Computing,

Lecture 9 Methodology – Physical Database Design for Relational Databases.

File Processing - Database Overview MVNC1 DATABASE SYSTEMS Overview.

Database Management 9. course. Execution of queries.

Topology aggregation and Multi-constraint QoS routing Presented by Almas Ansari.

Designing Persistency Delos NoE, Preservation Cluster Workshop: Persistency in Digital Libraries 14. February 2006, Oxford Internet Institute.

Event Management & ITIL V3

Data Warehousing Concepts, by Dr. Khalil 1 Data Warehousing Design Dr. Awad Khalil Computer Science Department AUC.

©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.

UKSMA 2005 Lessons Learnt from introducing IT Measurement Peter Thomas –

Disclosure risk when responding to queries with deterministic guarantees Krish Muralidhar University of Kentucky Rathindra Sarathy Oklahoma State University.

Threshold Phenomena and Fountain Codes Amin Shokrollahi EPFL Joint work with M. Luby, R. Karp, O. Etesami.

Database Design and Management CPTG /23/2015Chapter 12 of 38 Functions of a Database Store data Store data School: student records, class schedules,

10/10/2012ISC239 Isabelle Bichindaritz1 Physical Database Design.

1 Maximal Independent Set. 2 Independent Set (IS): In a graph G=(V,E), |V|=n, |E|=m, any set of nodes that are not adjacent.

Chapter 1 Introduction to Databases. 1-2 Chapter Outline   Common uses of database systems   Meaning of basic terms   Database Applications  

West Virginia University Slide 1 Copyright © K.Goseva 2010 CS 736 Software Performance Engineering Comments on Homework #1  Please revise the solution.

Lecture 11 Data Structures, Algorithms & Complexity Introduction Dr Kevin Casey BSc, MSc, PhD GRIFFITH COLLEGE DUBLIN.

Configuration Management and Change Control Change is inevitable! So it has to be planned for and managed.

E.Bertino, L.Matino Object-Oriented Database Systems 1 Chapter 5. Evolution Seoul National University Department of Computer Engineering OOPSLA Lab.

Introduction to database system What is a Database system? What is a Database system? Data System Components Data System ComponentsDataHardwareSoftwareUser.

Methodology – Physical Database Design for Relational Databases.

1 CSCD 326 Data Structures I Software Design. 2 The Software Life Cycle 1. Specification 2. Design 3. Risk Analysis 4. Verification 5. Coding 6. Testing.

Software Architecture Evaluation Methodologies Presented By: Anthony Register.

Chapter 5 : Integrity And Security  Domain Constraints  Referential Integrity  Security  Triggers  Authorization  Authorization in SQL  Views 

Reasoning about the Behavior of Semantic Web Services with Concurrent Transaction Logic Presented By Dumitru Roman, Michael Kifer University of Innsbruk,

Understanding General Software Development Lesson 3.

Concepts and Realization of a Diagram Editor Generator Based on Hypergraph Transformation Author: Mark Minas Presenter: Song Gu.

UML - Development Process 1 Software Development Process Using UML.

Onlinedeeneislam.blogspot.com1 Design and Analysis of Algorithms Slide # 1 Download From

1 Chapter 11 Global Properties (Distributed Termination)

CS223: Software Engineering Lecture 21: Unit Testing Metric.

1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.

Written By: Presented By: Swarup Acharya,Amr Elkhatib Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy Join Synopses for Approximate Query Answering.

1 Visual Computing Institute | Prof. Dr. Torsten W. Kuhlen Virtual Reality & Immersive Visualization Till Petersen-Krauß | GUI Testing | GUI.

MAIME: A Maintenance Manager for ETL Processes

Chapter 8: Concurrency Control on Relational Databases

The Development Process of Web Applications

Database Management System

CS222P: Principles of Data Management Lecture #15 Query Optimization (System-R) Instructor: Chen Li.

Maximal Independent Set

COTS testing Tor Stålhane.

Chapter 10: Process Implementation with Executable Models

Introduction to Database Management System

Algorithm An algorithm is a finite set of steps required to solve a problem. An algorithm must have following properties: Input: An algorithm must have.

Multi-Way Search Trees

CS222: Principles of Data Management Lecture #15 Query Optimization (System-R) Instructor: Chen Li.

Presentation transcript:

Automating the adaptation of evolving data-intensive ecosystems Petros Manousis, Panos Vassiliadis University of Ioannina, Ioannina, Greece George Papastefanatos Research Center “Athena” \ IMIS, Athens, Greece 32nd International ER International Conference on Conceptual Modeling (ER 2013) Hong Kong, 11-13, November, 2013.

Software Evolution and Data-intensive Ecosystems Software evolution causes at least as much as 60% of the costs for the entire software lifecycle Data-intensive ecosystems are no exception: – DBA View: Databases change their internal structure, schema and semantics, due to changes on reqs. – Application View: Users / Applications change their view on collected data (e.g., reports, workflows). – DBA and development teams do not sync well all the time ER 2013

Software Evolution and Data-intensive Ecosystems Software evolution causes at least as much as 60% of the costs for the entire software lifecycle Data-intensive ecosystems are no exception: – DBA View: Databases change their internal structure, schema and semantics, due to changes on reqs. – Application View: Users / Applications change their view on collected data (e.g., reports, workflows). – DBA and development teams do not sync well all the time ER 2013 Smooth evolution Achieve ecosystem evolution without impacting the smooth operation or the semantic consistency of its components

Evolving data-intensive ecosystem ER 2013

5 ER 2013 Evolving data-intensive ecosystem Remove CS.C_NAME Add exam year

6 ER 2013 Evolving data-intensive ecosystem Remove CS.C_NAME Add exam year The impact can be syntactical (causing crashes) Syntactically invalid

7 ER 2013 Evolving data-intensive ecosystem Remove CS.C_NAME Add exam year The impact can be syntactical (causing crashes), semantic (causing info loss or inconsistencies) and related to the performance Semantically unclear Syntactically invalid

8 ER 2013 Evolving data-intensive ecosystem Remove CS.C_NAME Add exam year Which parts are affected and how?

9 ER 2013 Evolving data-intensive ecosystem Remove CS.C_NAME Add exam year Can we predetermine their reaction? Allow addition Block Deletion

Overview of solution Architecture Graphs: graph with the dependencies between data modules (i.e., relations, views or queries); module internals are also modeled as subgraphs of the Architecture Graph Evolution Events: Changes on data modules definition Policies: rules that annotate a module with a reaction for each possible event that it can withstand, in one of two possible modes: – (a) block, to veto the event and demand that the module retains its previous structure and semantics, or, – (b) propagate, to allow the event and adapt the module to a new internal structure. Given a potential change in the ecosystem – we identify which parts of the ecosystem are affected via a “change propagation” algorithm – we rewrite the ecosystem to reflect the new version in the parts that are affected and do not veto the change via a rewriting algorithm we resolve conflicts (different modules dictate conflicting reactions) via a conflict resolution algorithm ER 2013

BACKGROUND Ecosystem model, event propagation and policies ER 2013

University E/S Architecture Graph ER

DB constructs  Graph Modules ER Modules and Module Encapsulation Input part Output part Semantics part SELECT V.STUDENT_ID, S.STUDENT_NAME, AVG(V.TGRADE) AS GPA FROM V_TR V |><| STUDENT S ON STUDENT_ID WHERE V.TGRADE > 4 / 10 GROUP BY V.STUDENT_ID, S.STUDENT_NAME

DB Changes  Graph events ER Remove CS.C_NAME Add exam year

Annotation with Policies ER On attribute addition Then propagate On attribute deletion Then block

STATUS DETERMINATION: WHO IS AFFECTED AND HOW Background Status Determination Path check Rewriting Experiments and Results ER 2013

Correctness of “event flooding” ER How do we guarantee that when a change occurs at the nodes of the AG, this is correctly propagated to exactly the nodes of the graph that should learn about it? We notify exactly the nodes that should be notified The status of a node is determined independently of how messages arrive at the node Without infinite looping – i.e., termination Q V1 V2 R

Propagation mechanism ER Modules communicate with each other via a single means: the schema of a provider module notifies the input schema of a consumer module when this is necessary Two levels of propagation: Inter-module level: At the module level, we need to determine the order and mechanism to visit each module Intra-module level: within each module, we need to determine the order and mechanism to visit the module’s components and decide who is affected and how it reacts + notify consumers

Method at a glance Topologically sort the graph Visit affected modules with its topological order and process its incoming messages for it. Principle of locality: process locally the incoming messages and make sure that within each module – Affected internal nodes are appropriately highlighted – The reaction to the event is determined correctly – If the final status is not a veto, notify appropriately the next modules ER

Status Determination ER 2013

Inter-Module Level Propagation ER Add Exam Year

Inter-Module Level Propagation ER Add Exam Year 1

Inter-Module Level Propagation ER Add Exam Year 1 2 2

Intra-module processing Message arrives at a module : 1)Input schema and its attributes if applicable, are probed. 2)If the parameter of the Message has any kind of connection with the semantics tree, then the Semantics schema is probed. 3)Likewise if the parameter of the Message has any kind of connection with the output schema, then the Output schema and its attributes (if applicable) is probed. Finally, new Messages are produced for its consumers ER 2013

PATH CHECK: HANDLING POLICY CONFLICTS Background Status Determination Path check Rewriting Experiments and Results ER 2013

Conflicts: what they are and how to handle them ER R View0 View1View2 Query1Query2 R View0 n View1 n View2 n Query1 n View0 View2 Query2 BEFORE AFTER View0 initiates a change View1 and View 2 accept the change Query2 rejects the change Query1 accepts the change The path to Query2 is left intact, so that it retains it semantics View1 and Query1 are adapted View0 and View2 are adapted too, however, we need two version for each: one to serve Query2 and another to serve View1 and Query1

Path Check ER 2013

Path Check If there exists any Block Module: we travel in reverse the Architecture Graph from blocker node to initiator of change In each step, we inform the visited Module to keep current version and produce a new one adapting to the change We inform the blocker node that it should not change at all ER 2013

Path Check ER Relation R View0 View1View2 Query1Query2

Path Check ER Query2 starts Path Check algorithm Searching which of his providers sent him the message and notify him that he does not want to change Relation R View0 View1View2 Query1Query2

Path Check ER View2 is notified to keep current version for Query2 and produce new version for Query1 Relation R View0 View1View2 Query1Query2

Path Check ER View0 is notified To keep current version for Query2 and Produce new version for Query1 Relation R View0 View1View2 Query1Query2

Path Check ER We make sure that Query2 will not change since it is the blocker Relation R View0 View1View2 Query1Query2

REWRITING: ONCE WE IDENTIFIED AFFECTED PARTS AND RESOLVED CONFLICTS, HOW WILL THE ECOSYSTEM LOOK LIKE? Background Status Determination Path check Rewriting Experiments and Results 34

Rewriting ER 2013

Rewriting If there is Propagate, we perform the rewriting. If there is Block We clone the Modules that are part of a block path and were informed by Path Check and we perform the rewrite on the clones We perform the rewrite on the Module if it is not part of a block path ER 2013

Rewriting ER Relation R View0 n View1 n View2 n Query1 n View0 View2 Query2 Relation R View0 View1View2 Query1Query2 Keep current& produce new Keep only current

EXPERIMENTS AND RESULTS Background Status Determination Path check Rewriting Experiments and Results ER 2013

Experimental setup TPC-DS ecosystem in 3 variants: a)a large ecosystem, WCS, with queries using all the available fact tables,(web, catalog, store tables) b)an ecosystem CS, where the queries to WEB SALES have been removed, and c)an ecosystem S, with queries using only the STORE SALES fact table. Events Workload: taken by a real-world case study Policies : – MixtureDBA, consisting of 20% of the relation modules annotated with BLOCK policy and – MixtureAD, consisting of 15% of the query modules annotated with BLOCK policy ER 2013

HECATAEUS ER A tool for visualizing and performing what-if analysis for evolution scenarios

Effectiveness How useful is our method for the application developers and the DBA's? Assess the effort gain of a developer using the highlighting of affected modules of Hecataeus compared to the situation where he would have to perform all checks by hand – We exclude the object that initiates the sequence of events from the computation, as it would be counted in both occasions ER 2013 %AM : the percentage of useless checks the user would have made

Effectiveness On average, the effort gain is around 90% in the case of the AD mixture and 97% in the case of the DBA mixture. As the graph size increases, the benefits from the highlighting of affected modules increase too. DBA case (flooding of events is restricted early enough at the database's relations): the minimum benefit in all 51 events ranges between 60% - 84% ER 2013

Efficiency ER 2013

Lessons Learned Effort gains are significant! The annotation of few database relations significantly restricts the rewriting time (and consequently the overall execution time) If the rewriting is not constrained early enough, then the execution cost grows linearly with the size of the ecosystem ER 2013

CONCLUSIONS AND FUTURE WORK... and follow up’s not included in the paper ER 2013

Managing the evolution of ecosystems is possible We need to model the ecosystem and annotate it with evolution management techniques that dictate its reaction to future events We can highlight what is impacted and if there is a veto or not. We can handle conflicts, suggest automated rewritings and guarantee correctness We can do it fast and gain effort for all involved stakeholders ER 2013

With an Eye to the Future Automatic policy suggestion Visualization Extend Hecataeus for other changes (create an index) that change the performance of DBMS. Complex events (delete & etc) ER 2013

48 Many thanks for your attention ER 2013

AUXILIARY SLIDES ER 2013

The impact of changes & a wish-list Syntactic: scripts & reports simply crash Semantic: views and applications can become inconsistent or information losing Performance: can vary a lot We would like: evolution predictability, i.e., control of what will be affected, before changes happen s.t., we can find ways to quarantine effects ER 2013

Problem definition Changes on a database schema may cause syntactic or semantic inconsistency in its surrounding applications; is there a way to regulate the evolution of the database in a way that application needs are taken into account? If there are conflicts between the applications’ needs on the acceptance or rejection of a change in the database, is there a possibility of satisfying all the different constraints? If conflicts are eventually resolved and, for every affected module we know whether to accept or reject a change, how can we rewrite the ecosystem to reflect the new status? ER 2013

ER 2013 Policies at various nodes Remove CS.C_NAME Add exam year Allow addition Allow deletion Policies to predetermine the modules’ reaction to a hypothetical event RELATION.OUT.SELF: on ADD_ATTRIBUTE then PROPAGATE; RELATION.OUT.SELF: on DELETE_SELF then PROPAGATE; RELATION.OUT.SELF: on RENAME_SELF then PROPAGATE; RELATION.OUT.ATTRIBUTES: on DELETE_SELF then PROPAGATE; RELATION.OUT.ATTRIBUTES: on RENAME_SELF then PROPAGATE; VIEW.OUT.SELF: on ADD_ATTRIBUTE then PROPAGATE; VIEW.OUT.SELF: on ADD_ATTRIBUTE_PROVIDER then PROPAGATE; VIEW.OUT.SELF: on DELETE_SELF then PROPAGATE; VIEW.OUT.SELF: on RENAME_SELF then PROPAGATE; VIEW.OUT.ATTRIBUTES: on DELETE_SELF then PROPAGATE; VIEW.OUT.ATTRIBUTES: on RENAME_SELF then PROPAGATE; VIEW.OUT.ATTRIBUTES: on DELETE_PROVIDER then PROPAGATE; VIEW.OUT.ATTRIBUTES: on RENAME_PROVIDER then PROPAGATE; VIEW.IN.SELF: on DELETE_PROVIDER then PROPAGATE; VIEW.IN.SELF: on RENAME_PROVIDER then PROPAGATE; VIEW.IN.SELF: on ADD_ATTRIBUTE_PROVIDER then PROPAGATE; VIEW.IN.ATTRIBUTES: on DELETE_PROVIDER then PROPAGATE; VIEW.IN.ATTRIBUTES: on RENAME_PROVIDER then PROPAGATE; VIEW.SMTX.SELF: on ALTER_SEMANTICS then PROPAGATE; QUERY.OUT.SELF: on ADD_ATTRIBUTE then PROPAGATE; QUERY.OUT.SELF: on ADD_ATTRIBUTE_PROVIDER then PROPAGATE; QUERY.OUT.SELF: on DELETE_SELF then PROPAGATE; QUERY.OUT.SELF: on RENAME_SELF then PROPAGATE; QUERY.OUT.ATTRIBUTES: on DELETE_SELF then PROPAGATE; QUERY.OUT.ATTRIBUTES: on RENAME_SELF then PROPAGATE; QUERY.OUT.ATTRIBUTES: on DELETE_PROVIDER then PROPAGATE; QUERY.OUT.ATTRIBUTES: on RENAME_PROVIDER then PROPAGATE; QUERY.IN.SELF: on DELETE_PROVIDER then PROPAGATE; QUERY.IN.SELF: on RENAME_PROVIDER then PROPAGATE; QUERY.IN.SELF: on ADD_ATTRIBUTE_PROVIDER then PROPAGATE; QUERY.IN.ATTRIBUTES: on DELETE_PROVIDER then PROPAGATE; QUERY.IN.ATTRIBUTES: on RENAME_PROVIDER then PROPAGATE; QUERY.SMTX.SELF: on ALTER_SEMANTICS then PROPAGATE;

Theoretical Guarantees At the inter-module level Theorem 1 (termination). The message propagation at the inter- module level terminates. Theorem 2 (unique status). Each module in the graph will assume a unique status once the message propagation terminates. Theorem 3 (correctness). Messages are correctly propagated to the modules of the graph At the intra-module level Theorem 4 (termination and correctness). The message propagation at the intra-module level terminates and each node assumes a status ER 2013

Message initiation The Message is initiated in one of the following schemata: – Output schema and its attributes if the user wants to change the output of a module (add / delete / rename attribute). – Semantics schema if the user wants to change the semantics tree of the module. ER

Efficiency: rewritings can cost a lot! AD: as the events are allowed to flow within the ecosystem, the amount of rewriting increases with the size of the graph & dominates the overall execution (starts from a 24% - 74% for the small graph and ends to a 7% - 93% for the large graph). DBA: the times are not only significantly smaller, but also equi-balanced: 57% - 42% for the small graph (Status Determination costs more in this case) and 49% - 50% for the two other graphs ER 2013

Efficiency as the graph size increases DBA blocks early => orders of magnitude faster than AD Scale up due to policy: status determination time is scaled up by 2; rewriting time is scaled up by a factor of 10, 20, and 30 for the small, medium and large graph respectively! Rate of increase: linear increase for AD (both status determination and rewriting), very slow increase for DBA Rewritings can cost a lot! ER 2013

Rewriting If there is Propagate, we perform the rewriting. If there is Block If the change initiator is a relation we stop further processing. Otherwise: We clone the Modules that are part of a block path and were informed by Path Check and we perform the rewrite on the clones We perform the rewrite on the Module if it is not part of a block path. Within each module, all its internals are appropriately adjusted (attribute / selection conditions / … additions and removals) ER 2013