Grid Data Integration In the CMS Experiment

Saima Iqbal, Tony Solomonides & Ian Willers CERN & University of the West of England, Bristol November 24, 2018

Outline
Project requirements
Use of Data Warehouse and Data Marts
Architectural design
Use of POOL
Prototype critical review
Conclusion & Future Work

CMS data flow
[Diagram: tiered CMS data distribution. CERN is Tier 0; FNAL, RAL and IN2P3 are Tier 1; Lab a, Uni b, Lab c and Uni n are Tier 2; Department and Desktop levels follow. Link speeds shown: 622 Mbps, 2.5 Gbps and 155 Mbps.]

Project Requirements
Provide and maintain a read-only view of the data for the analysis applications.
Provide a performant persistency mechanism to support data retrieval from the Distributed Heterogeneous Relational Databases (DHRD) across a Grid environment.
Provide a flexible architecture to support changes in the persistency requirements (such as schema evolution) and in the backend database technologies.
Analysis applications may be on any Tier.
Always work on a copy: the prototype provides a local copy from the source database.
Heterogeneity: all sources are RDBs, but with different schemas, different platforms and different technologies (Oracle, MySQL, etc.).
A data warehouse provides a solution to the need for a performant persistency mechanism.

Use of Data Warehouse and Data Marts
A data warehouse is a database with a performant persistency mechanism, often remote, that contains snapshots of data integrated from (distributed) heterogeneous data sources.
It is a technology-independent repository.
It provides a read-only view of the data (i.e. no transactions allowed).
To support fast data access, it is built with a denormalised database schema (i.e. maximum indices and minimum relations).
It is populated through the ETL (Extraction, Transformation, Loading) process and provides a flexible persistency architecture.
It is best supported by relational database management technologies.
Extraction, Transportation, Transformation and Loading: data is extracted from the heterogeneous data sources, transformed according to the schema supported by the warehouse, then transported and loaded into the data warehouse.
Data Marts: databases that store the replicated or distributed data from the centralised data warehouse.
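The ETL process described above can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for the Oracle and MySQL databases of the prototype; the table names, column names, units and sample values are all hypothetical and are not taken from the CMS schemas.

```python
import sqlite3


def make_source_a():
    """Hypothetical source A: energy already stored in GeV."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (run INTEGER, event INTEGER, energy_gev REAL)")
    db.executemany("INSERT INTO events VALUES (?, ?, ?)",
                   [(1, 1, 12.5), (1, 2, 40.0)])
    return db


def make_source_b():
    """Hypothetical source B: different column names, energy stored in MeV."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE evt (run_no INTEGER, evt_no INTEGER, e_mev REAL)")
    db.executemany("INSERT INTO evt VALUES (?, ?, ?)", [(2, 1, 7000.0)])
    return db


def extract(db, query):
    """Extraction: pull rows out of one source database."""
    return db.execute(query).fetchall()


def transform_b(rows):
    """Transformation: convert source B rows to the warehouse units (GeV)."""
    return [(run, event, mev / 1000.0) for run, event, mev in rows]


def load(warehouse, rows, origin):
    """Loading: insert rows into the single denormalised warehouse table."""
    warehouse.executemany(
        "INSERT INTO event_facts (origin, run, event, energy_gev) "
        "VALUES (?, ?, ?, ?)",
        [(origin, *row) for row in rows])


def build_warehouse():
    warehouse = sqlite3.connect(":memory:")
    # One flat table with an index, no foreign keys (denormalised schema).
    warehouse.execute(
        "CREATE TABLE event_facts "
        "(origin TEXT, run INTEGER, event INTEGER, energy_gev REAL)")
    warehouse.execute("CREATE INDEX idx_event_facts_run ON event_facts (run)")
    source_a, source_b = make_source_a(), make_source_b()
    load(warehouse,
         extract(source_a, "SELECT run, event, energy_gev FROM events"),
         "source_a")
    load(warehouse,
         transform_b(extract(source_b, "SELECT run_no, evt_no, e_mev FROM evt")),
         "source_b")
    warehouse.commit()
    return warehouse


warehouse = build_warehouse()
rows = warehouse.execute(
    "SELECT origin, run, event, energy_gev FROM event_facts "
    "ORDER BY run, event").fetchall()
```

Each source needs its own extraction query and, where the schemas differ, its own transformation, while analysis code only ever queries the single `event_facts` table.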

Use of POOL
POOL is a common persistency framework for the LHC Computing Grid (LCG) application area.
POOL is tasked to store petabytes of experiment data and metadata in a distributed, Grid-enabled way.
POOL combines C++ object streaming technology, such as ROOT I/O, for the bulk data with a transactionally safe relational database (RDBMS) store, such as MySQL.
POOL provides navigational access to distributed data without exposing details of the underlying technology.
[Diagram: a C++ class, the POOL RelationalFileCatalog and the Relational Abstraction Layer. Numbered labels: 1- LFN (Logical Database Name); 2- PFN (Physical Database Name); 3- Connection String (database URL, User Name and Password), marked "Provide Connection String". Other labels: Relational Access Component, ODBC Component, ORACLE (Tier-0), MySQL (Tier-2).]
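The three numbered steps on this slide (logical name, physical name, connection string) can be sketched as a simple lookup. The code below is a toy stand-in written for illustration only: it is not the POOL RelationalFileCatalog API, and the class names, logical names, URLs and credentials are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    pfn: str       # physical database name (step 2)
    url: str       # database URL, part of the connection string (step 3)
    user: str
    password: str


class ToyCatalog:
    """Hypothetical catalog mapping logical database names to entries."""

    def __init__(self):
        self._entries = {}

    def register(self, lfn, entry):
        """Register a database under its logical name (LFN, step 1)."""
        self._entries[lfn] = entry

    def lookup(self, lfn):
        try:
            return self._entries[lfn]
        except KeyError:
            raise LookupError(f"no database registered under LFN {lfn!r}") from None


def connection_details(catalog, lfn):
    """Resolve an LFN to its PFN and the details needed to connect."""
    entry = catalog.lookup(lfn)
    return {"pfn": entry.pfn, "url": entry.url,
            "user": entry.user, "password": entry.password}


catalog = ToyCatalog()
catalog.register("cms_warehouse",
                 CatalogEntry("WAREHOUSE_DB",
                              "oracle://tier0.example.org/warehouse",
                              "reader", "secret"))
catalog.register("cms_source",
                 CatalogEntry("SOURCE_DB",
                              "mysql://tier2.example.org/source",
                              "reader", "secret"))

details = connection_details(catalog, "cms_warehouse")
```

The point of the indirection is that the analysis code only needs the logical name; where the database lives and which technology backs it can change without touching that code.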

Architectural Design
[Diagram, labels listed from the source databases to the application:]
Use POOL RAL (ODBC Access Component) to extract data from the MySQL (source) database.
Data Warehouse (ORACLE): data from the source databases is integrated into the data warehouse.
Views are created on the data stored in the warehouse.
Data Mart (ORACLE) @Tier-1: views from the data warehouse are materialised in the data mart.
POOL's RelationalFileCatalog: used to register databases, and queried to retrieve the database URL for the requested data-set.
Use POOL RAL (Relational Access Component) to extract data from the data mart.
C++ class: data access via POOL RAL.
The diagram labels also include CERN Row-Wise-Ntuples and CALTECH Row-Wise-Ntuples.
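The step in which views created on the warehouse are materialised in the data mart can be sketched as follows. This is an illustration under stated assumptions, not the prototype's code: sqlite3 stands in for the two Oracle databases, an attached in-memory database plays the role of the data mart, and the table, view and column names are hypothetical.

```python
import sqlite3

# The main database plays the role of the warehouse; the attached one
# plays the role of the data mart. ATTACH is issued before any INSERT
# because SQLite does not allow it inside an open transaction.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS mart")

conn.execute(
    "CREATE TABLE event_facts (run INTEGER, event INTEGER, energy_gev REAL)")
conn.executemany("INSERT INTO event_facts VALUES (?, ?, ?)",
                 [(1, 1, 10.0), (1, 2, 30.0), (2, 1, 5.0)])

# A view defined on the data stored in the warehouse.
conn.execute("""
    CREATE VIEW run_summary_v AS
    SELECT run, COUNT(*) AS n_events, AVG(energy_gev) AS mean_energy_gev
    FROM event_facts
    GROUP BY run
""")

# Materialise the view as a real table in the data mart, so that
# analysis code can read it without touching the warehouse.
conn.execute("CREATE TABLE mart.run_summary AS SELECT * FROM main.run_summary_v")
conn.commit()

mart_rows = conn.execute(
    "SELECT run, n_events, mean_energy_gev FROM mart.run_summary "
    "ORDER BY run").fetchall()
```

The materialised table is a snapshot: later changes in the warehouse do not appear in the mart until it is refreshed, which matches the read-only, work-on-a-copy requirement.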

Prototype Critical Review
The proposed use of a data warehouse provides a lightweight approach for the analysis applications: they access data locally without worrying about the individual relational database technologies and their respective database schemas.
If ‘D’ DHRD technologies with ‘S’ distinct schemas need to be made available in the Grid environment, then up to ‘DxS’ database implementations could be required.
By contrast, the data warehouse approach provides a single denormalised schema (which could be replicated and distributed) to access the data stored in the ‘DxS’ DHRD.
However, a separate ETL process is required for each newly added database technology.
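The counting argument above can be made concrete with a small worked example. The function names and the values D = 2 and S = 3 are chosen only for illustration, and the count of one ETL process per technology follows the slide's last statement.

```python
def direct_access_implementations(d, s):
    """Implementations needed when every (technology, schema) pair is
    accessed directly: up to D x S."""
    return d * s


def warehouse_components(d):
    """With a warehouse: one denormalised schema for the applications,
    plus one ETL process per database technology."""
    return {"schemas": 1, "etl_processes": d}


# Hypothetical case: 2 technologies (e.g. Oracle and MySQL), 3 schemas.
direct = direct_access_implementations(2, 3)
with_warehouse = warehouse_components(2)
```

In this hypothetical case, direct access needs up to 6 implementations, while the warehouse approach exposes 1 schema to the applications and needs 2 ETL processes.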

Conclusion
The software prototype was successful in handling the project requirements according to the architectural design.
Use of the POOL RelationalFileCatalog makes it possible to use this data warehouse in the Grid environment.
It provides an integrated approach for registering distributed heterogeneous relational databases and for accessing these databases in a globally distributed environment (Grid).
Studies the impact of select…….
These are averages of different comparative runs at busy and quiet times.
Granularity argument about scaling up: we would expect smoother curves, but the difference to persist.

Questions

Future Directions
Databases could be searched according to the type of data they store instead of by logical database names.
Monitoring of databases, especially those that store replicated data.
Use of data warehouse meta-data.
Can be made Grid-Services compliant by using POOL file catalog features.
Data mining instead of hard-coded SQL statements.
A single ETL process (research question).
