
Distributed Database Services: A Fundamental Component of WLCG Service, Experience and Outlook
CHEP 2009, 23 March 2009
Maria Girone, CERN – IT, Data Management Group

Outline
- Databases in the LHC Computing Grid
- Key technologies: Real Application Clusters (RAC), Streams, Data Guard
- Lessons learned in 2008
- Preparing for the 2009-2010 running

Databases and the LHC
- Relational databases play a key role today in the experiments' production chains: online acquisition, offline production, data (re)processing, data distribution and analysis
  - PVSS, conditions, geometry, alignment, calibration, file bookkeeping, file transfers, etc.
- Grid infrastructure and operation services: GridView, SAM, Dashboards, GridMap, VOMS, ...
- Data management services: FTS, LFC, CASTOR + SRM, ...
- Originally deployed at CERN for the construction of LEP, relational databases are now listed among the critical services for the LHC experiments

Distributed Database Operations
- The WLCG computing model gives key and significant roles to Tier0, Tier1 and Tier2 sites, e.g. re-processing at the Tier1s; some data therefore need to be accessible at CERN and at other WLCG sites, so a distributed database infrastructure has been set up for the WLCG
- Oracle Streams replication is used to form a database backbone:
  - between online and offline at Tier0 for ATLAS, CMS and LHCb (PVSS, conditions data)
  - between Tier0 and 10 Tier1 sites for ATLAS and LHCb (conditions and file bookkeeping metadata)
- Frontier/Squid provides a distributed cache of database queries via a web protocol; used by CMS for conditions and being evaluated by ATLAS

DBA & Developer Community
- "Effective Oracle by Design" strategy: a partnership between DBAs and developers, building on best practices
  - Regular tutorials, starting from workshops & colloquia in early 2005
  - Consultancy to the database developers
- Performance tuning of critical applications is a major task
  - Concurrency and stress testing are key for smooth production; most performance and locking-contention issues come up at this stage (a minimal test sketch follows below)
  - ATLAS scalability tests for conditions data (see R. Walker's talk)
- Small DBA community: essential to collect and share know-how among the (often limited) resources
  - Emphasis on homogeneity (e.g. RAC/Linux, SAN infrastructure) and on sharing policies and procedures
  - Distributed Database Operations regular meetings and workshops, now including discussions on CASTOR@Tier1
- Figure (legend labels: "h/w upgrade", "Optimized COOL query"): hardware upgrade can provide "a boost", but query tuning is essential
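To make the concurrency and stress testing concrete, here is a minimal sketch that fires the same query from many parallel sessions and reports the average latency per session. It assumes the cx_Oracle driver; the DSN, credentials, table and query are hypothetical placeholders, not the tooling actually used at CERN.

```python
# Minimal concurrency stress-test sketch.
# Assumptions: cx_Oracle driver; the DSN, account and COND_IOV table are placeholders.
import threading
import time
import cx_Oracle

DSN = "testdb-rac.example.org:1521/TESTSVC"  # hypothetical service name
QUERY = "SELECT COUNT(*) FROM cond_iov WHERE since <= :t AND until > :t"

def worker(thread_id, iterations, results):
    # Each thread opens its own session, as independent clients would.
    conn = cx_Oracle.connect("reader", "secret", DSN)
    cur = conn.cursor()
    latencies = []
    for i in range(iterations):
        start = time.time()
        cur.execute(QUERY, t=1234567890 + i)
        cur.fetchone()
        latencies.append(time.time() - start)
    conn.close()
    results[thread_id] = sum(latencies) / len(latencies)

def run(sessions=20, iterations=100):
    results = {}
    threads = [threading.Thread(target=worker, args=(n, iterations, results))
               for n in range(sessions)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    for tid, avg in sorted(results.items()):
        print("session %2d: average latency %.3f s" % (tid, avg))

if __name__ == "__main__":
    run()
```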

Service Key Requirements
- Data availability, reliability and scalability
  - Oracle Real Application Clusters (RAC) with Automatic Storage Management (ASM)
  - Hardware with redundancy for fault tolerance
- Data distribution
  - Oracle Streams: for sharing information between databases
- Data protection
  - Oracle RMAN on TSM for backups (a minimal sketch of the backup path follows below)
  - Oracle Data Guard: for additional protection against failures (human errors, disaster recovery, ...)
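As an illustration of the RMAN-on-TSM backup path, here is a minimal sketch that drives RMAN towards a tape channel from Python. The Tivoli option file path, channel settings and invocation are assumptions for illustration, not CERN's actual backup scripts.

```python
# Sketch: launch an RMAN backup to tape (TSM via the SBT media interface).
# Assumptions: rman in PATH, OS authentication, and a placeholder TDPO option file.
import subprocess

RMAN_COMMANDS = """
RUN {
  ALLOCATE CHANNEL t1 DEVICE TYPE sbt
    PARMS 'ENV=(TDPO_OPTFILE=/opt/tivoli/tsm/client/oracle/bin64/tdpo.opt)';
  BACKUP DATABASE PLUS ARCHIVELOG;
  RELEASE CHANNEL t1;
}
"""

# Feed the command block to RMAN on stdin and fail loudly on errors.
subprocess.run(["rman", "target", "/"], input=RMAN_COMMANDS, text=True, check=True)
```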

Key Technologies

Real Application Clusters
- Building-block architecture for the distributed database services at CERN and the Tier1 sites
- Key to providing availability, reliability, scalability, consolidation and the required service level
- ~25 clusters in production at Tier0, ~20 at Tier1 sites, with similar hardware set-up (SAN), the same OS and the same Oracle versions
- Rolling upgrade capabilities are essential for service continuity: maintenance with no user-visible downtime
  - 0.04% service unavailability (2008) = 3.5 hours/year
  - 0.22% server unavailability (2008) = 19 hours/year (see the conversion sketch below)
- Expansion to this level of users / applications / data would have been impossible within resource constraints using individual disk servers
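As a quick check of the quoted figures, the unavailability percentages convert to hours per year as follows (a trivial sketch assuming a non-leap, 365-day year):

```python
# Convert an unavailability percentage into hours per year (365-day year assumed).
HOURS_PER_YEAR = 365 * 24  # 8760

def unavailability_hours(percent):
    return HOURS_PER_YEAR * percent / 100.0

print("%.1f hours/year" % unavailability_hours(0.04))  # ~3.5 h: service unavailability
print("%.1f hours/year" % unavailability_hours(0.22))  # ~19.3 h: server unavailability
```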

Oracle RAC/SAN Architecture
- Dual-CPU quad-core Dell 2950 servers, 16 GB memory, Intel 5400-series "Harpertown" at 2.33 GHz
- Dual power supplies, mirrored local disks, 4 NICs (2 private, 2 public), redundant FC switches, dual HBAs, "RAID 1+0 like" mirroring with ASM

Oracle Streams Replication
- Technology for sharing information between databases
- Database changes are captured from the redo log and propagated asynchronously as Logical Change Records (LCRs)
- Diagram: capture reads the redo logs at the source database and propagation delivers the LCRs to the apply process at the target databases (ATLAS, LHCb, CMS); a monitoring sketch follows below
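To illustrate how such a replication flow is typically kept an eye on, the sketch below polls two standard Streams dictionary views from Python. It assumes the cx_Oracle driver and the Oracle 10g views V$STREAMS_CAPTURE and DBA_APPLY; the connection details are placeholders and this is not the monitoring actually deployed for WLCG.

```python
# Minimal Streams health-check sketch.
# Assumptions: cx_Oracle driver, 10g dictionary views, placeholder DSN/credentials.
import cx_Oracle

def streams_status(dsn, user="strmadmin", password="secret"):
    conn = cx_Oracle.connect(user, password, dsn)
    cur = conn.cursor()

    # Capture processes: current state and number of LCRs captured so far.
    cur.execute("SELECT capture_name, state, total_messages_captured "
                "FROM v$streams_capture")
    for name, state, total in cur:
        print("capture %-20s state=%-30s captured=%d" % (name, state, total))

    # Apply processes: enabled/disabled status at this site.
    cur.execute("SELECT apply_name, status FROM dba_apply")
    for name, status in cur:
        print("apply   %-20s status=%s" % (name, status))

    conn.close()

if __name__ == "__main__":
    streams_status("source-db.example.org:1521/SRCSVC")
```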

Data Guard
- Limits database downtime in the event of:
  - Multi-point hardware failures
  - A wide range of corruptions
  - Disasters
  - H/W upgrades
  - Human errors, within the configured redo apply lag of 24 hours (a configuration sketch follows below)
- Allows ad-hoc testing of major schema upgrades or data reorganisation on the standby
- Diagram: clients connect to the primary database; data changes are shipped to the standby
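The protection against human errors relies on keeping the standby deliberately behind the primary. Below is a minimal sketch of one common way to configure such a lag, assuming Oracle's DELAY attribute (in minutes) on the redo transport destination; the destination number, service name and credentials are placeholders, not CERN's actual configuration.

```python
# Sketch: configure a 24-hour (1440-minute) apply delay towards a standby.
# Assumptions: DELAY attribute of LOG_ARCHIVE_DEST_n; all names are placeholders.
import cx_Oracle

conn = cx_Oracle.connect("sys", "secret", "primary-db.example.org:1521/PRIMSVC",
                         mode=cx_Oracle.SYSDBA)
cur = conn.cursor()
cur.execute("ALTER SYSTEM SET log_archive_dest_2="
            "'SERVICE=standby_svc DELAY=1440 VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE)' "
            "SCOPE=BOTH")
conn.close()
```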

2008 Experience and Lessons Learned

Service Priorities in 2008
- Met the requirements of CCRC'08, the cosmic runs and the LHC start-up
  - H/W upgrade to multi-core servers validated at CERN and at many Tier1s
- Took over responsibility for supporting the online databases at CERN
  - Additional funding from the LHC experiments for the DBAs
- Standardise on coherent setups to minimise complexity and diversity
  - Oracle version 10.2.0.4, RHEL4, x86, 64-bit
  - Upgrade to quad-core servers
- Enforce robust pre-production testing to eliminate service problems at the production stage

Service Enhancements in 2008
- Data Guard for critical databases
  - At CERN, physical standby: validated during CCRC'08 and deployed in production prior to the LHC start-up
  - At CNAF, logical standby: for the ATLAS LFC, from CNAF to INFN-ROMA1 (see B. Martelli's talk)
- Streams replication
  - Automatic split & merge procedures
  - Use of transportable tablespaces for site re-synchronisation
- Backup and recovery
  - On-tape backup performance improvements (from 30 MB/s to 70 MB/s)
  - Automated test recoveries
  - LAN-free tape backup tests
  - On-disk image copies to speed up recovery from physical and logical data corruptions

Service Enhancements in 2008 (2)
- DB monitoring
  - Coherent, integrated tool for database (online, offline & standby) and Streams monitoring/alerts
  - Extended to display Tier1 status
- Database security
  - Isolating the online DBs from external access (typically limited to the experiments' networks)
  - Oracle Critical Security updates regularly validated and applied
- Account policies
  - Enforcing use of reader and writer accounts; owner accounts locked on production, with an automatic tool for temporary unlocking (a sketch follows below)
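A minimal sketch of what such a temporary-unlock tool could look like, using plain ALTER USER ... ACCOUNT UNLOCK/LOCK statements. The cx_Oracle driver, DSN, DBA account and schema name are placeholders; this is not the actual CERN tool.

```python
# Sketch: temporarily unlock a production schema-owner account, then re-lock it.
# Assumptions: cx_Oracle, a DBA account, placeholder names and a fixed time window.
import time
import cx_Oracle

def temporary_unlock(dsn, owner, minutes=30, dba_user="dbadmin", dba_pwd="secret"):
    conn = cx_Oracle.connect(dba_user, dba_pwd, dsn)
    cur = conn.cursor()
    # Identifiers cannot be bound, so a real tool must validate 'owner' carefully.
    cur.execute("ALTER USER %s ACCOUNT UNLOCK" % owner)
    print("Account %s unlocked for %d minutes" % (owner, minutes))
    try:
        time.sleep(minutes * 60)  # placeholder for a proper scheduler
    finally:
        cur.execute("ALTER USER %s ACCOUNT LOCK" % owner)
        print("Account %s locked again" % owner)
        conn.close()

if __name__ == "__main__":
    temporary_unlock("offline-db.example.org:1521/PRODSVC", "SOME_APP_OWNER")
```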

Preparing for 2009-2010 running

2009 Priorities
- H/W and O/S upgrade
  - Study the usage of system resources (CPU, random read IO, available storage space) and correlate it with the experiments' requests
  - Evaluation of RHEL5
- DB security
  - Rule-based tool to detect malicious access to the DB (see the notes and the sketch below)
- Client connection to the DB
  - CORAL server deployment (see A. Valassi's talk)
- Data life cycle
  - "Old" data taken offline with the possibility of accessing it on demand (SAM, conditions, PVSS, etc.): multiple schema archiving, Oracle partitioning, compression
- Regular Oracle reviews (discussed at the WLCG workshop)
  - To follow up service requests/bugs which affect WLCG services

We are working on a tool that provides security-oriented information on database access and discovers dangerous or improper actions. It will alert the DBAs whenever a situation rated as dangerous occurs, and it will be customisable according to the experiments' needs. The tool will be based on a set of rules describing suspicious actions and an engine that searches for situations where the rules were disobeyed. The main requirement is that the rules be flexible and easy to change or add.
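To make the rule-based approach concrete, here is a minimal sketch of an engine that evaluates a list of rules against recent audit records. It assumes the cx_Oracle driver and the DBA_AUDIT_TRAIL view; the example rules, account and connection details are illustrative, and this is not the tool under development at CERN.

```python
# Minimal rule-engine sketch for flagging suspicious database access.
# Assumptions: cx_Oracle, the DBA_AUDIT_TRAIL view, placeholder rules and DSN.
import cx_Oracle

# Each rule is a (description, predicate) pair applied to one audit record.
RULES = [
    ("Destructive DDL from outside the CERN network",
     lambda r: r["action"] in ("DROP TABLE", "TRUNCATE TABLE")
               and not r["host"].endswith(".cern.ch")),
    ("Logon with a schema-owner account",
     lambda r: r["username"].endswith("_OWNER") and r["action"] == "LOGON"),
]

def scan(dsn, user="audit_reader", password="secret"):
    conn = cx_Oracle.connect(user, password, dsn)
    cur = conn.cursor()
    # Look at the last 24 hours of audited actions.
    cur.execute("SELECT username, userhost, action_name "
                "FROM dba_audit_trail WHERE timestamp > SYSDATE - 1")
    for username, userhost, action in cur:
        record = {"username": username or "", "host": userhost or "",
                  "action": action or ""}
        for description, predicate in RULES:
            if predicate(record):
                print("ALERT: %s -- %s" % (description, record))
    conn.close()

if __name__ == "__main__":
    scan("offline-db.example.org:1521/PRODSVC")
```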

Oracle 11g
- Understand the service implications and the production deployment schedule for the Oracle 11g version; this will be a coordinated effort with the Tier1 sites
- Tested 11.1 features
  - Streams: performance improvements, Combined Capture & Apply, source and target data Compare & Converge
  - Active Data Guard: makes it possible to keep a physical standby database continuously open for read-only access (critical for the online DBs)
  - Execution plan stability
  - Real Application Testing
  - ASM and cluster storage improvements
- Now involved in 11gR2 beta testing

Notes on the tested Streams features:
- Performance improvements: minimise disk I/O, reduce capture latency and reduce CPU consumption
- Combined Capture and Apply: capture sends LCRs directly to apply (queues are not used); only one propagation job running, i.e. only one destination site; useful for the online-offline setups and for the recovery of split destination sites (Tier1); big performance improvement, with a rate of 14,000 LCRs/s (previously 5,000 LCRs/s)
- Source and target data compare and converge: lets you compare the rows of a shared database object, such as a table, at two different databases and converge the objects in case of differences so that they are consistent

Summary
- We are operating a complex, world-wide distributed database infrastructure for the LHC Computing Grid
  - RAC/ASM: key for the DB services at Tier0 & Tier1s
  - Streams for detector conditions: key for data (re-)processing
  - Data Guard: key for data protection of the critical databases
- Large-scale tests show that the experiments' requirements have been met
- Testing and validation of hardware, DB versions and applications have proven key to smooth production, together with close cooperation between application developers and DBAs
- Monitoring of database & Streams performance has been implemented: key to maintaining and optimising any larger system
- Several successful data challenges, but data taking starts soon...

More Details
- CERN Physics Databases wiki: general advice, activities, interventions, service procedures, policies
  - http://cern.ch/phydb/wiki
- Support: phydb.support@cern.ch
- LCG 3D wiki: interventions, service procedures, performance pages
  - http://cern.ch/phydb/lcg3d

Backup Slides

Streams Procedures
- Downstream capture to de-couple the Tier0 production databases from destination or network problems
  - Redo log retention optimised to allow for a re-synchronisation window of 5 days
- If one site is down, LCRs are not removed from the queue and the capture process might be paused by flow control
  - Split (and then merge) procedures
  - Re-synchronisation when the site is up again: transportable tablespaces
- Diagram: redo transport from the source database to a downstream database, where capture and propagation run, with apply at the target database

Streams Throughput: Conditions
- ATLAS conditions data: 2 GB per day, 600-800 LCRs per second

... and PVSS
- ATLAS PVSS tests: 6 GB per day, 2500-3000 LCRs per second