Grid Data Integration In the CMS Experiment

Presentation transcript:

Grid Data Integration In the CMS Experiment
Saima Iqbal, Tony Solomonides & Ian Willers
CERN & University of the West of England, Bristol
November 24, 2018

Outline
Project requirements
Use of Data Warehouse and Data Marts
Architectural design
Use of POOL
Prototype critical review
Conclusion & Future Work

CMS data flow
[Diagram: tiered data flow from CERN (Tier 0) to the Tier 1 centres (IN2P3, RAL, FNAL), on to Tier 2 sites (Lab a, Uni b, Lab c, ... Uni n), and then to departments and desktops. Link speeds shown: 2.5 Gbps, 622 Mbps and 155 Mbps.]

Project Requirements
Provide and maintain a read-only view of the data for the analysis applications.
Provide a performant persistency mechanism to support data retrieval from Distributed Heterogeneous Relational Databases (DHRD) across a Grid environment.
Provide a flexible architecture that supports changes in the persistency requirements (such as schema evolution) and in the backend database technologies.
Analysis applications may run on any Tier.
Always work on a copy: the prototype provides a local copy from the source database.
Heterogeneity: all sources are relational databases, but with different schemas, different platforms and different technologies (Oracle, MySQL, etc.).
A data warehouse meets the need for a performant persistency mechanism.

Use of Data Warehouse and Data Marts
A data warehouse is a database with a performant persistency mechanism; it is often remote and contains snapshots of data integrated from (distributed) heterogeneous data sources.
It is a technology-independent repository.
It provides a read-only view of the data (i.e. no transactions allowed).
To support fast data access, it is built with a denormalised database schema (i.e. maximum indices and minimum relations).
It is populated through the ETL (Extraction, Transformation, Loading) process and provides a flexible persistency architecture.
It is best supported by relational database management technologies.
Extraction, Transportation, Transformation and Loading: data are extracted from the heterogeneous data sources, transformed according to the schema supported by the warehouse, then transported and loaded into the data warehouse.
Data marts: databases that store replicated or distributed data from the centralised data warehouse.
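The ETL steps described on this slide can be sketched roughly as follows. This is only an illustration using Python's built-in sqlite3 module in place of the Oracle and MySQL sources; the table names, column names and the GeV/MeV unit difference are all assumptions invented for the example, not part of the prototype.

```python
import sqlite3

def make_source_a():
    """Hypothetical source A: columns (run, evt, pt_gev), momentum in GeV."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE ntuple (run INTEGER, evt INTEGER, pt_gev REAL)")
    db.executemany("INSERT INTO ntuple VALUES (?, ?, ?)",
                   [(1, 10, 25.0), (1, 11, 40.5)])
    return db

def make_source_b():
    """Hypothetical source B: a different schema, momentum in MeV."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (run_no INTEGER, event_no INTEGER, pt_mev REAL)")
    db.executemany("INSERT INTO events VALUES (?, ?, ?)", [(2, 7, 31000.0)])
    return db

def extract(db, query):
    """Extraction: pull raw rows out of one source database."""
    return db.execute(query).fetchall()

def transform_a(rows):
    """Transformation for source A: only tag the origin, units already match."""
    return [("source_a", run, evt, pt) for run, evt, pt in rows]

def transform_b(rows):
    """Transformation for source B: tag the origin and convert MeV to GeV."""
    return [("source_b", run, evt, pt_mev / 1000.0) for run, evt, pt_mev in rows]

def load(warehouse, rows):
    """Loading: append transformed rows to the single warehouse table."""
    warehouse.executemany("INSERT INTO event_fact VALUES (?, ?, ?, ?)", rows)
    warehouse.commit()

def run_etl():
    """Build a toy warehouse with one denormalised, indexed table."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute(
        "CREATE TABLE event_fact (origin TEXT, run INTEGER, evt INTEGER, pt_gev REAL)")
    warehouse.execute("CREATE INDEX idx_event_fact_run ON event_fact (run)")
    load(warehouse, transform_a(
        extract(make_source_a(), "SELECT run, evt, pt_gev FROM ntuple")))
    load(warehouse, transform_b(
        extract(make_source_b(), "SELECT run_no, event_no, pt_mev FROM events")))
    return warehouse

warehouse = run_etl()
rows = warehouse.execute(
    "SELECT origin, run, evt, pt_gev FROM event_fact ORDER BY run, evt").fetchall()
print(rows)
```

Each source needs its own extract query and transform function, while the analysis side only ever sees the one event_fact schema.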

Use of POOL
[Diagram: a C++ class uses the POOL RelationalFileCatalog and the Relational Abstraction Layer. Labelled steps: 1- LFN (Logical Database Name); 2- PFN (Physical Database Name); 3- Connection String (database URL, user name and password), provided by the catalog. Components shown: Relational Access Component and ODBC Component. Databases shown: ORACLE at Tier-0 and MySQL at Tier-2.]
POOL is a common persistency framework for the LHC Computing Grid (LCG) application area.
POOL is tasked to store petabytes of experiment data and metadata in a distributed, grid-enabled way.
POOL combines C++ object streaming technology, such as ROOT I/O, for the bulk data with a transactionally safe relational database (RDBMS) store, such as MySQL.
POOL provides navigational access to distributed data without exposing details of the underlying technology.
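The three lookup steps named on this slide (logical name, physical name, connection string) can be illustrated with a toy in-memory catalog. This is not the POOL API: the class, its methods and all the names and URLs below are hypothetical stand-ins chosen for the sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConnectionInfo:
    """Step 3: what a client needs to open the database (all values made up)."""
    url: str
    user: str
    password: str

class ToyCatalog:
    """Hypothetical stand-in for a relational file catalog, not POOL itself."""

    def __init__(self):
        self._lfn_to_pfn = {}
        self._pfn_to_conn = {}

    def register(self, lfn, pfn, conn):
        """Register a database under a logical name (LFN) and a physical name (PFN)."""
        self._lfn_to_pfn[lfn] = pfn
        self._pfn_to_conn[pfn] = conn

    def lookup_pfn(self, lfn):
        """Steps 1 and 2: resolve the logical name to the physical name."""
        try:
            return self._lfn_to_pfn[lfn]
        except KeyError:
            raise LookupError(f"no database registered under LFN {lfn!r}") from None

    def connection_for(self, lfn):
        """Step 3: return the connection details for the resolved database."""
        return self._pfn_to_conn[self.lookup_pfn(lfn)]

catalog = ToyCatalog()
catalog.register(
    lfn="ntuples_mart",                       # hypothetical logical name
    pfn="oracle://tier1.example.org/mart",    # hypothetical physical name
    conn=ConnectionInfo("oracle://tier1.example.org/mart", "reader", "secret"),
)

info = catalog.connection_for("ntuples_mart")
print(info.url, info.user)
```

The point of the indirection is that an analysis application only has to know the logical name; the database can move or change technology as long as the catalog entry is updated.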

Architectural Design
[Diagram, described from the source databases up to the analysis application.]
Source databases: ORACLE@CERN (row-wise ntuples) and MySQL@CALTECH (row-wise ntuples).
POOL RAL (ODBC Access Component) is used to extract data from the MySQL (source) database.
Data from the source databases are integrated into the data warehouse (ORACLE).
Views are created on the data stored in the warehouse.
Views from the data warehouse are materialised in the data mart (ORACLE) at Tier-1.
POOL's RelationalFileCatalog is used to register the databases, and is queried to retrieve the database URL for the requested data-set.
POOL RAL (Relational Access Component) is used to extract data from the data mart.
C++ classes access the data via POOL RAL.
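The step in which views on the warehouse are materialised in the data mart can be sketched as below. The sketch uses a single sqlite3 connection with an attached second in-memory database playing the role of the mart; the table, view and column names are assumptions for illustration, and the prototype itself used Oracle.

```python
import sqlite3

# The main database plays the warehouse; the attached one plays the data mart.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS mart")

# Hypothetical warehouse table holding integrated ntuple rows.
conn.execute("CREATE TABLE event_fact (run INTEGER, evt INTEGER, pt_gev REAL)")
conn.executemany("INSERT INTO event_fact VALUES (?, ?, ?)",
                 [(1, 10, 25.0), (1, 11, 40.5), (2, 7, 31.0)])

# A view defined on the warehouse data.
conn.execute(
    "CREATE VIEW high_pt AS "
    "SELECT run, evt, pt_gev FROM event_fact WHERE pt_gev > 30.0")

# Materialise the view: copy its current result set into a real table in the mart.
conn.execute("CREATE TABLE mart.high_pt_snapshot AS SELECT * FROM high_pt")
conn.commit()

# A later change in the warehouse shows up in the view, not in the snapshot.
conn.execute("INSERT INTO event_fact VALUES (3, 1, 99.0)")
conn.commit()

mart_rows = conn.execute(
    "SELECT run, evt, pt_gev FROM mart.high_pt_snapshot ORDER BY run, evt").fetchall()
view_rows = conn.execute(
    "SELECT run, evt, pt_gev FROM high_pt ORDER BY run, evt").fetchall()
print(mart_rows)
print(view_rows)
```

The snapshot behaves like the local, read-only copy that the requirements ask for: it stays fixed until it is refreshed from the warehouse.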

Prototype Critical Review
The proposed use of a data warehouse provides a lightweight approach for the analysis applications: they access data locally without worrying about the individual relational database technologies and their respective database schemas.
If 'D' DHRD technologies with 'S' distinct schemas must be made available in the Grid environment, then up to 'DxS' database implementations could be required.
By contrast, the data warehouse approach provides a single denormalised schema (which could be replicated and distributed) to access the data stored in the 'DxS' DHRD.
However, a separate ETL process is required for each newly added database technology.
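The counting argument can be made concrete with a small worked example. The figures below are invented for illustration (two technologies, three schemas), and the count of one ETL process per technology follows the slide's last statement rather than any measurement.

```python
from itertools import product

# Hypothetical inputs: D = 2 technologies, S = 3 distinct schemas.
technologies = ["Oracle", "MySQL"]
schemas = ["schema_1", "schema_2", "schema_3"]

# Without a warehouse: one implementation per (technology, schema) pair, D x S.
direct_access = list(product(technologies, schemas))

# With a warehouse: one denormalised schema for the applications,
# plus one ETL process per database technology.
warehouse_schemas = 1
etl_processes = len(technologies)

print(len(direct_access), warehouse_schemas, etl_processes)
```

With these numbers the applications face one schema instead of six combinations, at the cost of maintaining two ETL processes.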

Conclusion
The software prototype was successful in handling the project requirements according to the architectural design.
Use of the POOL RelationalFileCatalog makes it possible to use this data warehouse in the Grid environment.
It provides an integrated approach for registering distributed heterogeneous relational databases and for accessing these databases in a globally distributed environment (Grid).
Studies the impact of select…….
These are averages of different comparative runs at busy and quiet times.
Granularity argument about scaling up: we would expect smoother curves, but the difference to persist.

Questions

Future Directions
Databases could be searched according to the type of data they store instead of by logical database name.
Monitoring of databases, especially of databases that store replicated data.
Use of data warehouse metadata.
The approach can be made Grid-Services compliant by using POOL file catalog features.
Data mining instead of hard-coded SQL statements.
A single ETL process (research question).