3D Project Status Report


3D Project Status Report
Maria Girone, CERN IT, on behalf of the LCG 3D project
http://lcg3d.cern.ch
LHCC Comprehensive Review, 19th-20th November 2007

Introduction
- Set up a distributed database infrastructure for the WLCG (according to the WLCG MoU)
- RAC as building-block architecture: several 8-node clusters at Tier 0, typically 2-node clusters at Tier 1
- Oracle Streams replication used to form a database backbone between Tier 0 and 10 Tier 1 sites
  - In production since April '07 for ATLAS and LHCb
  - Online-offline replication at Tier 0 for ATLAS, CMS and LHCb
- Frontier/Squid from Fermilab, distributing and caching database data via a web protocol, is used by CMS
- Database services are listed among the critical services by the experiments

Building Block: Database Clusters
- Tier 0: all networking and storage redundant
- Scalability and high availability achieved
- Storage and CPU scale independently
- Maintenance operations without down-time!
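Clients reach a RAC cluster through a database service rather than a fixed instance, which is what makes rolling maintenance transparent. A minimal sketch, assuming hypothetical host, account and service names (cx_Oracle against a placeholder cluster alias, not the actual CERN configuration):

```python
# Minimal sketch (hypothetical names): clients connect to a RAC *service*,
# not to an individual instance, so a node can be taken out for maintenance
# while sessions are served by the remaining cluster nodes.
import cx_Oracle

# EZConnect string pointing at the cluster alias / virtual listener and a
# service name; both are placeholders, not the production CERN setup.
dsn = "racnode-cluster.example.cern.ch:1521/lcg_cond_svc"

conn = cx_Oracle.connect(user="reader", password="secret", dsn=dsn)
cur = conn.cursor()
# Ask the database which instance actually served us -- it may differ
# between connections, which is exactly the point of the service abstraction.
cur.execute("SELECT sys_context('USERENV', 'INSTANCE_NAME') FROM dual")
print(cur.fetchone()[0])
conn.close()
```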

Oracle Streams
- New or updated data are detected and queued for transmission to the destination databases
- Database changes are captured from the redo log and propagated asynchronously as Logical Change Records (LCRs)
- All changes are queued until successful application at all destinations
  - Need to control the change rate at the source in order to minimise the replication latency
- 2 GB/day of user data to Tier 1 can be sustained with the current DB setups
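As an illustration of how such a replication stream is declared, the sketch below uses the standard Oracle DBMS_STREAMS_ADM package, driven from Python via cx_Oracle. All account, queue, schema and database names are placeholders, not the production 3D configuration:

```python
# Hedged sketch: declare schema-level capture rules so that DML and DDL on one
# conditions schema are mined from the redo log and enqueued as LCRs.
import cx_Oracle

conn = cx_Oracle.connect(user="strmadmin", password="secret",
                         dsn="t0db.example.cern.ch:1521/lcg_cond_svc")
cur = conn.cursor()
cur.execute("""
    BEGIN
      DBMS_STREAMS_ADM.ADD_SCHEMA_RULES(
        schema_name     => 'ATLAS_COOL',
        streams_type    => 'capture',
        streams_name    => 'T0_CAPTURE',
        queue_name      => 'STRMADMIN.T0_CAPTURE_QUEUE',
        include_dml     => TRUE,
        include_ddl     => TRUE,
        source_database => 'T0DB.CERN.CH');
    END;""")
conn.commit()
conn.close()
```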

Downstream Capture & Network Optimisations
- Downstream database capture, which decouples the Tier 0 production databases from Tier 1 or network problems, is in place for ATLAS and LHCb
- TCP and Oracle protocol optimisations yielded significant throughput improvements (almost a factor of 10)
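A hedged sketch of the downstream-capture idea: the capture process is created on the downstream database and mines redo shipped from the Tier 0 source, so log mining load and propagation problems stay off the production instance. Connection details, queue and capture names are assumptions for illustration only:

```python
# Hedged sketch (placeholder names): create a capture process on the
# *downstream* database that mines redo from the Tier 0 source database.
import cx_Oracle

conn = cx_Oracle.connect(user="strmadmin", password="secret",
                         dsn="downstrdb.example.cern.ch:1521/downstr_svc")
cur = conn.cursor()
cur.execute("""
    BEGIN
      DBMS_CAPTURE_ADM.CREATE_CAPTURE(
        queue_name        => 'STRMADMIN.DOWNSTR_QUEUE',
        capture_name      => 'ATLAS_DOWNSTR_CAPTURE',
        source_database   => 'T0DB.CERN.CH',
        use_database_link => TRUE);
    END;""")
conn.commit()
conn.close()
```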

Frontier/Squid
- Database queries are encoded as URL requests and the corresponding query results are transferred as HTML pages; the use of HTTP and HTML enables the use of standard web cache servers such as Squid
- Numerous significant performance improvements since CSA'06: experiment data model, Frontier client/server software, CORAL integration
- CMS is confident that possible cache coherency issues can be avoided by
  - Cache expiration windows for data and meta-data
  - Policy implemented by the client applications
- Successfully used also in CSA'07
- Experiment data model sometimes resulted in too many queries; being addressed by CMS changing the data model
- Problems seen recently for POOL-ORA caching of container meta-data
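The key point is that a Frontier lookup is an ordinary HTTP GET, so any standard proxy such as Squid can cache it. A minimal illustration with the Python requests library, using a hypothetical Frontier server, Squid host and query payload (the real payload encoding is produced by the Frontier/CORAL client, not written by hand):

```python
# Hedged illustration (hypothetical hosts and payload): a Frontier request is a
# plain HTTP GET, so repeated identical requests can be served from a Squid
# cache instead of hitting the database again.
import requests

squid_proxy = {"http": "http://squid.tier2.example.org:3128"}
url = ("http://frontier.example.cern.ch:8000/Frontier/"
       "type=frontier_request:1:DEFAULT&encoding=BLOB&p1=PLACEHOLDER")

# The first call fills the cache; later identical calls are answered by Squid.
resp = requests.get(url, proxies=squid_proxy, timeout=30)
print(resp.status_code, len(resp.content), "bytes")
```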

Streams Replication Operation
- Streams procedures are included in the operations of the Oracle Tier 0 physics database service team
- Optimised the redo log retention on the downstream database to allow a sufficient re-synchronisation window (5 days) without recall from tape
- Preparing a review of the most recurrent problems for the WLCG Service Reliability Workshop, 26-30 Nov 2007
  - Will be the input for further automation of service procedures
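A small sketch of the kind of check implied by the retention tuning above: confirm on the downstream database that archived redo covering the last 5 days is still on disk, so a site can be re-synchronised without a tape recall. The view and column names are the generic Oracle ones; credentials and DSN are placeholders:

```python
# Hedged sketch: is the 5-day re-synchronisation window still covered by
# archived redo logs that are present on disk on the downstream database?
import datetime
import cx_Oracle

conn = cx_Oracle.connect(user="strmadmin", password="secret",
                         dsn="downstrdb.example.cern.ch:1521/downstr_svc")
cur = conn.cursor()
cur.execute("SELECT MIN(first_time) FROM v$archived_log WHERE deleted = 'NO'")
oldest = cur.fetchone()[0]
window_start = datetime.datetime.now() - datetime.timedelta(days=5)
print("re-sync window covered:", oldest is not None and oldest <= window_start)
conn.close()
```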

Some Issues
- Several bugs reported to Oracle have been fixed or are being fixed
  - CERN openlab has set an excellent ground!
- Some user-defined types are not supported by Streams; reformatting is done with filtering rules
- LFC was hit by a known Oracle bug because it was running on an older instant client version
  - Synchronisation with the AA instant client established
  - Read-only replicas set up at 5 other sites, together with a code change to support user ACLs
- Further automate the split/merge procedure used when one site has to be dropped/re-synchronised
  - Progress in Oracle 11g, but we need a stop-gap solution
- Implement failover for the downstream capture component

Streams problems
Fixed by Oracle patches:
- Capture process aborted with error ORA-01280: Fatal LogMiner error - fixed in patches 5581472 and 5170394, related to the capture process and LogMiner
- Bug 6163622: SQL apply degrades with larger transactions - fixed by applying patch 6163622
- Bug 5093060: Streams 5000-LCR limit causing unnecessary flow control at the apply site - fixed by applying patch 5093060
- With recyclebin=on, ORA-26687 on the parent after a child table is dropped (Metalink note 412449.1) - fixed in 10.2.0.4 and 11g
- Apply gets stuck (APPLY SERVER WAITING FOR EVENT), a generic performance issue on RAC - fixed by applying patch 5500044 (fix for bug 5977546) and a one-off patch for 5964485
- OEM agent blocks Streams processes - fixed by applying patch 5330663
Workarounds found and implemented for all the remaining open bugs:
- ORA-600 [KWQBMCRCPTS101] after dropping a propagation job
- ORA-26687 after a table is dropped when there are two Streams setups between the same source and destination databases
- ORA-00600 [KWQPCBK179], a memory leak from the propagation job, observed after applying the fix patch - not a critical problem

Service Levels and Policies
- DB service level according to the WLCG MoU
  - Need more production experience to confirm manpower coverage at all Tier 1 sites
  - Piquet service being set up at Tier 0 to replace the existing 24x7 (best effort) service; Streams interventions for now 8x5
  - Criticality of this service rated by most experiments (ATLAS, CMS and LHCb) as "very high" or "high"
- Proposals from the CERN Tier 0 have also been accepted by the collaborating Tier 1 sites
  - Backup and recovery: RMAN-based backups mandatory, data retention period of 1 month
  - Security patch frequency and application procedure
  - Database software upgrade procedure and patch validation window
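As a hedged sketch of the backup policy mentioned above (not the actual CERN scripts), the RMAN retention window and a full backup could be driven from a small wrapper like the following; the 31-day window and the OS-authenticated target connection are illustrative assumptions:

```python
# Hedged sketch: drive RMAN from a script to enforce a one-month retention
# policy and take a full backup including archived logs.
import subprocess

rman_script = """
CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 31 DAYS;
BACKUP DATABASE PLUS ARCHIVELOG;
DELETE NOPROMPT OBSOLETE;
"""

# 'rman target /' uses OS authentication on the database host.
subprocess.run(["rman", "target", "/"], input=rman_script,
               text=True, check=True)
```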

Database & Streams Monitoring
- Monitoring and diagnostics have been extended and integrated into the experiments' dashboards
- Weekly/monthly database and replication performance summaries have been added
  - Extensive data and plots about replication activity, server usage and server availability available from the 3D wiki site (HTML or PDF)
  - A summary plot with LCR rates during the last week is shown on the 3D home page and can be referenced/included in other dashboard pages
- Complemented by weekly Tier 0 database usage reports, which have already been in use for more than a year
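A minimal sketch of how LCR counters could be pulled for such summaries, assuming the generic Oracle Streams dynamic view V$STREAMS_CAPTURE and placeholder credentials (the actual 3D monitoring code is not reproduced here):

```python
# Hedged sketch: read LCR counters from the generic Streams view, e.g. to feed
# a weekly rate plot like the one shown on the 3D home page.
import cx_Oracle

conn = cx_Oracle.connect(user="strmadmin", password="secret",
                         dsn="t0db.example.cern.ch:1521/lcg_cond_svc")
cur = conn.cursor()
cur.execute("""
    SELECT capture_name, state,
           total_messages_captured, total_messages_enqueued
      FROM v$streams_capture""")
for name, state, captured, enqueued in cur:
    print(f"{name}: state={state} captured={captured} enqueued={enqueued}")
conn.close()
```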

Integration with WLCG Procedures and Tools
- 3D monitoring and alerting has been integrated with WLCG procedures and tools
  - A dedicated workshop at SARA in March 2007 focussed on this
- Interventions are announced according to the established WLCG procedures, e.g. EGEE broadcasts and GOCDB entries
- To help reporting to the various coordination meetings, we also collect all 3D intervention plans on the 3D wiki
  - Web-based registration will be replaced as soon as a common intervention registry is in production

Intervention & Streams Reports
- A factor of 5 improvement in apply speed was achieved by tuning the filtering rules

Tier 1 DB Scalability Tests
- The experiments have started to evaluate and confirm the estimated size of the server resources at Tier 1: number of DB nodes, CPUs, memory, network and storage configuration
- A realistic workload is needed, which now becomes available as the experiment s/w frameworks approach complete coverage of their detectors
- ATLAS conducted two larger tests with ATHENA jobs against IN2P3 (shared Solaris server) and CNAF (Linux)
  - Total throughput of several thousand jobs/h achieved with some 50 concurrent test jobs per Tier 1
- LHCb s/w framework integration is done and scalability tests are starting as well
  - Lower throughput requirements assumed than for ATLAS
  - Tests with several hundred concurrent jobs in progress

Database Resource Requests
- Experiment resource requests for Tier 1 have been unchanged for more than a year
  - 2 (3) node DB cluster for LHCb (ATLAS) with fibre-channel based shared storage
  - 2 Squid nodes for CMS on standard worker nodes with local storage
- Setup shown to sustain the replication throughput required for conditions data (1.7 GB/day for ATLAS)
  - ATLAS has switched to production with the Tier 1 sites
- LHCb requested read-only LFC replicas at all 6 Tier 1 sites, as for conditions data
  - Successful replication tests for LFC between CERN and CNAF
- Updated requests for LHC startup were collected at the WLCG workshop at CHEP'07; no major h/w extension was requested at that time

Online-Offline Replication
- Oracle Streams is used by ATLAS, CMS and LHCb
- Joint effort with ATLAS on replication of PVSS data between online and offline
  - Required to allow detector groups to analyse detailed PVSS logs without adverse impact on the online database
- Significantly higher rates required than for COOL-based conditions (some 6 GB of user data per day); these have been confirmed in extensive tests
- Oracle Streams seems an appropriate technology also for this area

Backup & Recovery Exercise
- Organised a dedicated database workshop at CNAF in June 2007 on a recovery exercise
  - Show that the database implementation and procedures at each site are working
  - Show that coordination and re-synchronisation after a site recovery work
  - Show that replication procedures continue unaffected while some other sites are under recovery
- Exercise well appreciated by all participants
  - Several set-up problems have been resolved during this hands-on activity with all sites present
- Six sites have now successfully completed a full local recovery and re-synchronisation
  - The remaining sites will be scheduled shortly after the Service Reliability Workshop

More Details
- LCG 3D wiki: interventions, performance summaries - http://lcg3d.cern.ch
- Recent LCG 3D workshops
  - Monitoring and Service Procedures workshop @ SARA: http://indico.cern.ch/conferenceDisplay.py?confId=11365
  - Backup and Recovery workshop @ CNAF: http://indico.cern.ch/conferenceDisplay.py?confId=15803
- Next @ LCG Service Reliability Workshop: http://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=20080

Approved by the MB on 13.03.07
- TAGS at volunteer Tier-1 sites (BNL, TRIUMF, ...)

Updated Request from ATLAS
- Not yet approved; may need a funding discussion
- ATLAS provides more info at https://twiki.cern.ch/twiki/bin/view/Atlas/DatabaseVolumes

  Year   Total (TB)   TAGS (TB)   COOL (TB)   PVSS etc. (TB)
  2007      6.4          1.0         0.4          5
  2008     14.5          5.6         0.9          8
  2009     27.3         13.7         1.6         12
  2010     46.8         27.5         2.9         16.4

Approved by the MB on 13.03.07

Summary
- The LCG 3D project has set up a world-wide distributed database infrastructure for the LHC
  - Close collaboration between the LHC experiments and the LCG sites
  - With more than 100 DB nodes at CERN plus several tens of nodes at Tier 1 sites, this is one of the largest distributed database deployments world-wide
- Large-scale experiment tests have validated the experiment resource requests implemented by the sites
- Backup & recovery tests have been performed to validate the operational procedures for error recovery
- Regular monitoring of the database and Streams performance is available to the experiments and sites
- Tier 0+1 ready for experiment ramp-up to LHC production
- Next steps:
  - Replication of ATLAS muon calibrations back to CERN; test as part of CCRC'08
  - Started testing 11gR1 features; production deployment plan for 11gR2 by end of 2008