WLCG Service Roadmap 2009 – 2010 ~~~ Distributed DB Operations Workshop, PIC, April 2009.


1 WLCG Service Roadmap 2009 – 2010 Jamie.Shiers@cern.ch ~~~ Distributed DB Operations Workshop, PIC, April 2009

2 Agenda
Summary of recommendations from recent LHCC review and WLCG Collaboration workshop
Scale Test for Experiment Program – STEP 2009
 Service Issues – Outstanding bugs & service requests
WLCG Operations Summary – What do DBAs need to know?
Remaining challenges and directions... (2011 and beyond?)

3 CHEP 2009
A recurrent theme during the recent conference on Computing in High Energy Physics was that we are ready for LHC data taking.
This was based on the experience during CCRC’08, the brief data taking during 2008, together with ongoing experiment production & commissioning activities.
Explicit mention of database activities was made during numerous talks, including the opening plenary:

4 Databases
The Online systems use a lot of databases:
–Run database, Configuration DB, Conditions DB, DB for logs, for logbooks, histogram DB, inventory DB, …
–Not to forget: the archiving of data collected from PVSS (used by all detector control systems)
All experiments run Oracle RAC infrastructures; some use in addition MySQL, and an object database for the ATLAS Configuration (OKS)
Administration of Oracle DBs is largely outsourced to our good friends in the CERN IT/DM group
Exchange of conditions between offline and online uses Oracle Streams (like replication to Tier1s)
LHC DAQ, CHEP09 – Niko Neufeld

5 CHEP 2009 (repeats slide 3, adding:)
 For me, what was equally important was the message that was retained in the track summary:

6 [image slide – presumably the CHEP09 track summary referred to above]

7 WLCG Viewpoint
 From the point of view of the WLCG project – which includes the experiment spokesmen and computing coordinators (as well as the sites) in the project management structure – the LHC machine, the experiments and the WLCG project (itself a collaboration) work together to achieve a common goal:
 To allow the science to be extracted swiftly & efficiently
The Grid: the power of 3

8 Experiment Viewpoint
 From the point of view of the experiments, the LHC machine and the experiments work together to achieve a common goal
 Oh, yes, and [ The WLCG service ] “should not limit the ability of physicists to exploit the performance of the detectors nor the LHC’s physics potential“
“…whilst being stable, reliable and easy to use”
 Whilst this is clearly simplified and/or exaggerated, it makes an important point that we must not forget: we (WLCG) must provide a service – that is the only reason we exist!

9 To summarize… [diagram: LHC Machine – LHC Experiments – WLCG Service]
 An important differentiator between WLCG and a number of other large-scale computing projects is that its goal has always been to deliver a 24x365 high-availability, redundant and resilient peta-scale distributed computing service for thousands of scientists worldwide
 At minimal (acquisition; operational) cost
 Our role is to provide a service – excellence must be our target!

10 What Needs to be Improved?
We still see “emergency” interventions that could be “avoided” – or at least foreseen and scheduled at a more convenient time
Often DB applications where e.g. tables grow and grow until performance hits a wall  emergency cleanup (that goes wrong?); indices “lost”; bad manipulations; … (a monitoring sketch follows after this slide)
We still see scheduled interventions that are not sufficiently well planned – that run well into “overtime” or have to be “redone”
e.g. several Oracle upgrades at the end of last year overran – we should be able to schedule such an upgrade by now (after 25+ years!)
Or interventions that are not well planned or discussed with service providers / users and that have a big negative impact on ongoing production
e.g. some network reconfigurations and other interventions – particularly on more complex links, e.g. to the US; debugging and follow-up not always satisfactory
There are numerous concrete examples of the above concerning many sites: they are covered in the weekly reports to the MB and are systematically followed up
 Much more serious are chronic (weeks, months) problems that have affected a number of Tier1 sites – more later…
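As an illustration of the kind of proactive monitoring that can turn an emergency cleanup into a scheduled intervention, here is a minimal sketch in Python, assuming the cx_Oracle driver and an account that can read DBA_SEGMENTS; the connect string, schema names and 10 GB threshold are invented for the example and not taken from any WLCG setup.

```python
import cx_Oracle

DSN = "dbhost.example.org/wlcg"          # hypothetical connect string
APP_SCHEMAS = ["CASTOR_STAGER", "LFC"]   # hypothetical application schemas
THRESHOLD_GB = 10                        # alert threshold, example value

def report_large_segments(user, password):
    """Print every segment in the watched schemas above the size threshold."""
    conn = cx_Oracle.connect(user, password, DSN)
    try:
        cur = conn.cursor()
        for schema in APP_SCHEMAS:
            cur.execute(
                """SELECT owner, segment_name, segment_type, bytes
                     FROM dba_segments
                    WHERE owner = :owner AND bytes > :min_bytes
                    ORDER BY bytes DESC""",
                owner=schema, min_bytes=THRESHOLD_GB * 1024 ** 3)
            for owner, name, seg_type, nbytes in cur:
                print("%s.%s (%s): %.1f GB" % (owner, name, seg_type,
                                               nbytes / 1024.0 ** 3))
    finally:
        conn.close()
```

Run regularly (e.g. from cron), a report like this gives the DBA and the application owner time to agree on a cleanup window instead of reacting once performance has already hit the wall.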

11 Major Service Incidents
Quite a few such incidents are “DB-related” in the sense that they concern services with a DB backend
The execution of a “not quite tested” procedure on ATLAS online led – partly due to the Xmas shutdown – to a break in replication of ATLAS conditions from online out to Tier1s of over 1 month (online-offline was restored much earlier)
Various Oracle problems over many weeks affected numerous services (CASTOR, SRM, FTS, LFC, ATLAS conditions) at ASGC  need for ~1 FTE of suitably qualified personnel at WLCG Tier1 sites, particularly those running CASTOR; recommendations to follow the CERN/3D DB configuration & perform a clean Oracle+CASTOR install; communication issues
Various problems affecting CASTOR+SRM services at RAL over a prolonged period, including “Oracle bugs” strongly reminiscent of those seen at CERN with an earlier Oracle version: very similar (but not identical) problems seen recently at CERN & ASGC (not CNAF…)
Plus not infrequent power + cooling problems [ + weather! ]
Can take out an entire site – main concern is controlled recovery (and communication)

12 At the November 2008 WLCG workshops a recommendation was made that each WLCG Tier1 site should have at least 1 FTE of DBA effort. This effort (preferably spread over multiple people) should proactively monitor the databases behind the WLCG services at that site: CASTOR/dCache, LFC/FTS, conditions and other relevant applications. The skills required include the ability to backup and recover, tune and debug the database and associated applications. At least one WLCG Tier1 does not have this effort available today.

13 [map of the WLCG sites: CERN (Tier0), the Tier1s – BNL, ASGC/Taipei, IN2P3/Lyon, TRIUMF/BC, RAL, CNAF, FZK, NDGF, PIC, FNAL, NIKHEF/SARA – and the Tier2s]

14 [map slide listing CERN and the Tier1 sites: FZK, FNAL, TRIUMF, NDGF, Barcelona/PIC, Lyon/CC-IN2P3, Bologna/CNAF, Amsterdam/NIKHEF-SARA, BNL, RAL, Taipei/ASGC]

15 [map slide, same site list as slide 14]

16 ALICE critical services
Critical service – Rank – Comment
AliEn – 10 – ALICE computing framework
Site VO boxes – 10 – Site becomes unusable if down
CASTOR and xrootd at Tier-0 – 10 – Stops 1st pass reco (24 hours buffer)
Mass storage at Tier-1 – 5 – Downtime does not prevent data access
File Transfer Service at Tier-0 – 7 – Stops 2nd pass reco
gLite workload management – 5 – Redundant
PROOF at Tier-0 CERN Analysis Facility – 5 – User analysis stops
Rank 10: critical, max downtime 2 hours
Rank 7: serious disruption, max downtime 5 hours
Rank 5: reduced efficiency, max downtime 12 hours

17 ATLAS criticality
10: Very high – interruption of these services affects online data-taking operations or stops any offline operations
7: High – interruption of these services perturbs seriously offline computing operations
4: Moderate – interruption of these services perturbs software development and part of computing operations

18 ATLAS critical services
Services at Tier-0:
Very high – Oracle (online), DDM central catalogues, Tier-0 LFC
High – Cavern  T0 transfers, online-offline DB connectivity, CASTOR internal data movement, Tier-0 CE, Oracle (offline), Tier-0/1 FTS, VOMS, Tier-1 LFC, Dashboard, Panda/Bamboo, DDM site services/VO boxes
Moderate – 3D streaming, gLite WMS, Tier-0/1/2 SRM/SE, Tier-1/2 CE, CAF, CVS, Subversion, AFS, build system, Tag Collector
Services at Tier-1:
High – LFC, FTS, Oracle
Moderate – 3D streaming, SRM/SE, CE
Services at Tier-2:
Moderate – SRM/SE, CE
Services elsewhere:
High – AMI database

19 CMS criticality
Rank 10: 24x7 on call
Rank 8, 9: expert call-out
CMS gives a special meaning to all rank values (Spinal Tap)

20 CMS critical services
Rank – Services
10 – Oracle, CERN SRM-CASTOR, DBS, Batch, Kerberos, Cavern-T0 transfer+processing
9 – CERN FTS, PhEDEx, FroNTier launchpad, AFS, CAF
8 – gLite WMS, VOMS, Myproxy, BDII, WAN, non-T0 prod tools
7 – APT servers, build machines, Tag collector, testbed machines, CMS web server, Twiki
6 – SAM, Dashboard, PhEDEx monitoring, Lemon
5 – WebTools, e-mail, Hypernews, Savannah, CVS server
4 – Linux repository, phone conferencing, valgrind machines
3 – Benchmarking machines, Indico

21 LHCb critical services
Rank – Services
10 – Tier-0 CASTOR, AFS, CERN VO boxes (DIRAC3 central services), Tier-0 LFC master, Oracle at CERN
7 – VOMS, CERN FTS, Tier-0 ConditionDB, LHCb bookkeeping service, Oracle Streams, SAM
5 – Tier-0/1 CE and batch, Tier-0 gLite WMS, Tier-1 ConditionDB
3 – Tier-1 SE, Tier-0/1 Replica LFC, Dashboard, Tier-1 FTS, Tier-1 gLite WMS
1 – Tier-1 VO boxes

22 Critical Service Follow-up
Targets (not commitments) proposed for Tier0 services; similar targets requested for Tier1s/Tier2s
Experience from the first week of CCRC’08 suggests targets for problem resolution should not be too high (if they are to be ~achievable)
The MoU lists targets for responding to problems (12 hours for Tier1s)
¿ Tier1s: 95% of problems resolved < 1 working day?
¿ Tier2s: 90% of problems resolved < 1 working day? (a small worked check of these targets follows below)
 Post-mortem triggered when targets are not met!
Time interval – Issue (Tier0 services) – Target
End 2008 – Consistent use of all WLCG Service Standards – 100%
30’ – Operator response to alarm / call to x5011 / alarm e-mail – 99%
1 hour – Operator response to alarm / call to x5011 / alarm e-mail – 100%
4 hours – Expert intervention in response to above – 95%
8 hours – Problem resolved – 90%
24 hours – Problem resolved – 99%
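To make the proposed resolution targets concrete, here is a tiny worked example in plain Python with invented ticket timestamps; a real check would take its input from GGUS and would count working days rather than calendar days.

```python
from datetime import datetime, timedelta

TARGET = timedelta(hours=24)   # simplification: one calendar day, not one working day

def fraction_within_target(tickets):
    """tickets: list of (opened, resolved) datetime pairs."""
    met = sum(1 for opened, resolved in tickets if resolved - opened <= TARGET)
    return met / float(len(tickets))

tickets = [
    (datetime(2009, 4, 1, 9, 0), datetime(2009, 4, 1, 17, 0)),  # resolved same day
    (datetime(2009, 4, 2, 9, 0), datetime(2009, 4, 6, 12, 0)),  # target missed
]
print("Resolved within target: %.0f%%" % (100 * fraction_within_target(tickets)))
# A Tier1 would aim for >= 95%, a Tier2 for >= 90%.
```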

23 Service Impact
At least for ATLAS and CMS, database services and the services that depend on them are ranked amongst the most critical of all!
 We must remember our motto “by design” in ensuring not only that service interruptions and degradations are avoided where possible, but also that any incidents are rapidly resolved.
10: Very high – interruption of these services affects online data-taking operations or stops any offline operations
7: High – interruption of these services perturbs seriously offline computing operations

24 Concrete Actions
1. Review on a regular (3-6 monthly?) basis open Oracle “Service Requests” that are significant risk factors for the WLCG service (Tier0 + Tier1s + Oracle). The first such meeting is being set up and will hopefully take place prior to CHEP 2009.
2. Perform “technology-oriented” reviews of the main storage solutions (CASTOR, dCache) focusing on service and operational issues. Follow-on to the Jan/Feb workshops in these areas; again report at the pre-CHEP WLCG Collaboration Workshop.
3. Perform site reviews – initially Tier0 and Tier1 sites – focusing again on service and operational issues. It will take some time to cover all sites; the proposal is for the review panel to include members of the site to be reviewed, who will also participate in the reviews before and after their own site’s.

25 The Goal
The goal is that – by end 2009 – the weekly WLCG operations / service report is quasi-automatically generated 3 weeks out of 4, with no major service incidents – just a (tabular?) summary of the KPIs (a sketch of such a summary follows below)
 We are currently very far from this target, with (typically) multiple service incidents that are either:
New in a given week;
Still being investigated or resolved several to many weeks later
By definition, such incidents are characterized by severe (or total) loss of a service or even of a complete site (or even a Cloud, in the case of ATLAS)
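A quasi-automatic weekly report could be as simple as the following sketch, which turns a handful of KPIs into the tabular summary mentioned above. The KPI names and values are invented; a real report would be fed from GGUS, the availability monitors and the daily meeting minutes.

```python
# Hypothetical KPI values for one week
KPIS = [
    ("GGUS alarm tickets",        2),
    ("Unscheduled interventions", 1),
    ("Scheduled interventions",   4),
    ("Major service incidents",   0),
]

def weekly_summary(kpis):
    """Render the KPIs as a small fixed-width table."""
    width = max(len(name) for name, _ in kpis)
    return "\n".join("%-*s  %3d" % (width, name, value) for name, value in kpis)

print(weekly_summary(KPIS))
```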

26 Examples and Current High-Priority Issues
Atlas, Streams and LogMiner crash
CMS Frontier, change notification and Streams incompatibility
Castor, BigID issue
Castor, ORA-600 crash
Castor, crosstalk and wrong SQL executed
Oracle clients on SLC5, connectivity problem
Restore performance for VLDB

27 Atlas Streams Setup: Capture Aborted with Fatal LogMiner Error
Atlas Streams replication aborted with a critical error on 12-12-2008
–Full re-instantiation of Streams was necessary
–PVSS replication was down for 4 days
–Conditions replication was down for 3 days
–Replication to Tier1s was severely affected; part of the replication was down during the CERN annual shutdown period
–SR 7239707.994 open with Oracle

28 Streams Issue and LogMiner Error
Streams issue:
–Generated by LogMiner aborting and not being able to skip ‘wrong redo’ and/or bypass single transactions
–The issue could not be reproduced exactly, but is likely to come up any time LogMiner runs into a critical error while reading a redolog/archivelog file
–In a test environment we could see similar behaviour by manually corrupting redo files
Workaround proposed by Oracle support on 10g:
–Skip the log file which contains the transaction causing the problem
–Not acceptable in a production environment
–Would cause logical corruption, i.e. data loss at the target database
(a minimal capture-status check is sketched below)
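A minimal sketch, assuming cx_Oracle and an account that can read DBA_CAPTURE (the account name and connect string below are hypothetical), of the kind of check that would flag an aborted capture process immediately rather than days later:

```python
import cx_Oracle

def unhealthy_captures(user, password, dsn):
    """Return every Streams capture process that is not currently ENABLED."""
    conn = cx_Oracle.connect(user, password, dsn)
    try:
        cur = conn.cursor()
        cur.execute("""SELECT capture_name, status, error_number, error_message
                         FROM dba_capture
                        WHERE status <> 'ENABLED'""")
        return cur.fetchall()
    finally:
        conn.close()

for name, status, err_no, err_msg in unhealthy_captures("strmadmin", "secret",
                                                         "atlas-db.example.org/svc"):
    print("capture %s is %s: error %s %s" % (name, status, err_no, err_msg))
```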

29 CMS Frontier: Change Notification Incompatibility with Streams
Streams capture process aborts with errors ORA-16146 and ORA-07445
–ORA-16146 reported intermittently, and capture aborts (SR 20080444.6)
–ORA-07445: [kghufree()+485] repeatedly in a Streams environment (SR 7280176.993)
–Streams and change notification have incompatibilities
–The issue also causes high load on the server (performance issue)
–Forces manual capture restarts
–Oracle support recommendations and actions: proposed patch 5868257, which did not solve the issue; Oracle support will try to reproduce in house
–Alternative solution for CMS: replace the use of change notification with information about the last modified tables (one possible shape of this is sketched below)
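One possible shape of that alternative – an assumption for illustration, not a description of what CMS actually implemented – is to poll Oracle's DML monitoring view instead of relying on change notification. The connect string, account and schema name are hypothetical.

```python
import cx_Oracle

def modified_tables(conn, owner):
    """List tables in the given schema with their accumulated DML counts."""
    cur = conn.cursor()
    # Flush the in-memory DML monitoring data so the view is up to date.
    cur.callproc("DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO")
    cur.execute("""SELECT table_name, inserts, updates, deletes, timestamp
                     FROM dba_tab_modifications
                    WHERE table_owner = :owner""", owner=owner)
    return cur.fetchall()

conn = cx_Oracle.connect("frontier_mon", "secret", "cms-db.example.org/svc")
for name, ins, upd, dele, ts in modified_tables(conn, "CMS_CONDITIONS"):
    print("%s: %d inserts, %d updates, %d deletes, last flushed %s"
          % (name, ins, upd, dele, ts))
conn.close()
```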

30 Castor and the BigID Issue
Wrong bind values being inserted into Castor tables
–i.e. the ‘Big ID’ problem in CASTOR at RAL, ASGC and CNAF
–OCCI NUMBER type overflow
Oracle support SR 7299845.994
–OCCI 10.2.0.3 application built with GCC 3.4.6 on Linux x86, Oracle 10.2.0.4 database
–Randomly the application inserts huge values (10^17 instead of 10, for example)
–Difficult to trace OCI or OCCI in a multi-threaded environment
–RAL working on a test case to reproduce the problem for Oracle support
(a simple detection query is sketched below)
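While the root cause sits in the OCCI client, the symptom is easy to watch for on the database side. A simple detection sketch, assuming cx_Oracle; the table, column and threshold are hypothetical and do not reflect the real CASTOR schema.

```python
import cx_Oracle

SUSPECT_THRESHOLD = 10 ** 15   # far beyond any legitimately generated id

def find_big_ids(conn, table, column):
    """Return id values implausibly large for the application."""
    # table/column are trusted constants here, not user input.
    cur = conn.cursor()
    cur.execute("SELECT %s FROM %s WHERE %s > :min_id" % (column, table, column),
                min_id=SUSPECT_THRESHOLD)
    return [row[0] for row in cur]

conn = cx_Oracle.connect("castor_mon", "secret", "castor-db.example.org/svc")
for bad in find_big_ids(conn, "stager.subrequest", "id"):   # hypothetical names
    print("suspect id value: %d" % bad)
conn.close()
```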

31 Castor and Crosstalk Between Different Schemas
SQL executed in the wrong schema when fully qualified names are not used (see the sketch below)
–Seen in 10.2.0.2 (LFC)
–Similar bugs are reported on 10.2.0.3
–Supposedly fixed in 10.2.0.4
–Although Castor at RAL has reported one occurrence of ‘crosstalk’ which caused the loss of 14K files: SR 7103432.994 (September 2008)
–Issue not reproduced since
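The defensive practice this bug argues for is simple: always qualify objects with their owner instead of relying on the session's current schema. A trivial sketch; the schema and table names are made up for illustration.

```python
# Resolved against whatever the session's current schema happens to be --
# exactly the situation in which the reported "crosstalk" could bite:
RISKY_SQL = "SELECT fileid, status FROM file_metadata"

# Explicitly owner-qualified -- the statement always targets the same object:
SAFE_SQL = "SELECT fileid, status FROM ns_owner.file_metadata"

def fetch_metadata(conn):
    cur = conn.cursor()
    cur.execute(SAFE_SQL)   # prefer the fully qualified form
    return cur.fetchall()
```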

32 Castor and Block Corruption Issue
Oracle crash due to ORA-600 on 14-3-2009
–Castor stager for Atlas down for about 10 hours
–Caused by bug 6647480
–Instance crashed and crash recovery failed
–SR 7398656.993 opened with Oracle support
–Issue resolved with a patch; CERN also requested a merge of 7046187 and 6647480
–Note: the patch for bug 6647480 had been available since October 2008, but was not listed as critical in the main Oracle support document “10.2.0.4 Patch Set - Availability and Known Issues”
–See https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090314

33 Oracle Clients Cannot Connect on SLC5 and RHEL5
Clients cannot connect
–It’s a blocking issue for Oracle clients on SLC5 (and RHEL5)
–Based on bug 6140224, "SQLPLUS FAILS TO LOAD LIBNNZ11.SO WITH SELINUX ENABLED ON EL5/RHEL5"
–The root cause is explained in Metalink Note 454196.1
The first workaround proposed by support (running SELinux in ‘permissive’ mode) is not acceptable in our environment
A second workaround proposed by support involves repackaging the Oracle client as RPMs installed on local disks (instead of AFS)
–This requires significant changes in the CERN and Tier1 deployment model
–According to Oracle support, the issue should be fixed in 11gR2 Beta 2; we have requested a backport to 10.2
(a small diagnostic sketch follows below)
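A small diagnostic sketch of what a site could check when hitting this class of problem. The library path is illustrative and the SELinux handling shown is a general assumption, not the fix eventually adopted; note also that relabelling a local file would not help clients served from AFS, which is part of why repackaging as RPMs was being discussed.

```python
import subprocess

LIB = "/opt/oracle/client/lib/libnnz11.so"   # hypothetical local client path

def selinux_mode():
    """Return 'Enforcing', 'Permissive' or 'Disabled'."""
    return subprocess.check_output(["getenforce"]).decode().strip()

def security_context(path):
    """Show the SELinux security context of the library ('ls -Z')."""
    return subprocess.check_output(["ls", "-Z", path]).decode().strip()

if __name__ == "__main__":
    print("SELinux mode:    %s" % selinux_mode())
    print("Library context: %s" % security_context(LIB))
```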

34 RMAN Restore Performance Issue
Backup restore performance issue
–Significantly affects the time to restore very large databases (VLDB)
–Restores are limited by gigabit Ethernet speed (~100 MB/s)
The issue effectively prevents the use of multiple RAC nodes/NICs to restore a VLDB (see the sketch below)
–Example: for a 10 TB database, the use of a single node/channel means a restore time of 1 day or more
Status:
–Opened as SR 7241741.994
–Bug 8269674 opened by Oracle support for this issue
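The intent behind "multiple RAC nodes/NICs" can be sketched as an RMAN run block with one channel per instance, here generated by a small Python helper. The node names and connect-string pattern are hypothetical, and whether this actually circumvents the reported limitation is exactly what the SR/bug above is about.

```python
NODES = ["rac-node1", "rac-node2", "rac-node3"]   # hypothetical RAC hosts

def rman_restore_script(nodes, user="rman", service="vldb"):
    """Emit an RMAN run block allocating one disk channel per RAC node."""
    lines = ["RUN {"]
    for i, node in enumerate(nodes, start=1):
        lines.append("  ALLOCATE CHANNEL c%d DEVICE TYPE DISK"
                     " CONNECT '%s/***@%s/%s';" % (i, user, node, service))
    lines += ["  RESTORE DATABASE;", "  RECOVER DATABASE;", "}"]
    return "\n".join(lines)

print(rman_restore_script(NODES))
```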

35 Summary of Issues
Issue – Services involved – Service Request – Impact
LogMiner – ATLAS Streams – SR 7239707.994 – PVSS and conditions service interruptions (2/3 days); replication to Tier1s severely affected
LogMiner – Streams – Proposed ‘workaround’ not acceptable -> data loss!
Streams / change notification – CMS Frontier – SR 7280176.993 – Incompatibility with Streams
BigID – CASTOR – SR 7299845.994 – Service interruption
“Crosstalk” – CASTOR + ?? – SR 7103432.994 – Logical data corruption
ORA-600 – CASTOR – SR 7398656.993 – Service interruption
Clients on SL5 – Clients can’t connect!

36 Status of Discussions with Oracle
After some initial contact and e-mail discussions, a first preparatory meeting is scheduled for April 6th at Oracle’s offices in Geneva
There is already a draft agenda for this meeting:
–Introductions & Welcome
–WLCG Service & Operations – Issues and Concerns
–WLCG Distributed Database Operations: Summary of Current & Recent Operational Problems
–Oracle Support – Reporting and Escalating Problems for Distributed Applications (Customers)
–Discussion on Frequency, Attendance, Location, …
–Wrap-up & Actions
http://indico.cern.ch/conferenceDisplay.py?confId=53482
 A report on this meeting will be given at the Distributed DB Operations meeting to be held at PIC later in April
Agenda: http://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=54037

37 Outlook
Depending on the outcome of the discussions with Oracle, we would expect to hold such “WLCG Oracle Operational” (W) reviews roughly quarterly
RAL have already offered to host such a meeting
The proposed initial scope is to:
–Follow up on outstanding service problems;
–Be made aware of critical bugs in production releases, together with associated patches and / or work-arounds.

38 Summary of Meeting
Oracle described their “support portfolio”, suggesting possible options that might be of relevance to WLCG
The target is – if deemed appropriate / affordable – to have an agreement in place well ahead of 2009 LHC data taking (some months at least…)
This does not give much time for the necessary technical and contractual discussions…
Action on both CERN & Oracle to follow up…

39 Service Priorities
1. Stability – up to the limits that are currently possible
2. Clarity – where not (well, always…)
 All sites and experiments should consistently use the existing meetings and infrastructure where applicable
Join the daily WLCG operations con-call regularly – particularly when there have been problems at your site and / or when upcoming interventions are foreseen
Always submit GGUS tickets – private e-mails / phone calls do not leave a trace
Use the established mailing lists – quite a few mails still “get lost” or do not reach all of the intended people / sites
 The LCG Home Page is your entry point!

40 2009 Data Taking – The Prognosis
Production activities will work sufficiently well – many Use Cases have been tested extensively and for prolonged periods at a level equal to (or even greater than) the peak loads that can be expected from 2009 LHC operation
Yes, there will be problems, but we must focus on restoring the service as rapidly and as systematically as possible
Analysis activities are an area of larger concern – by definition the load is much less predictable
Flexible Analysis Services and Analysis User Support will be key
In parallel, we must transition to a post-EGEE III environment – whilst still not knowing exactly what this entails…
But we do know what we need to run stable Grid Services!

41 Schedule for 2009 - 2010 (Ian.Bird@cern.ch)
From Chamonix summary: http://indico.cern.ch/conferenceDisplay.py?confId=45433

42 Likely scenario (Ian.Bird@cern.ch)
 Injection: end September 2009
 Collisions: end October 2009
 Long run from ~November 2009 for ~44 weeks
 This is equivalent to the full 2009 + 2010 running as planned, with 2010 being a nominal year
 Short stop (2 weeks) over Christmas/New Year
 Energy will be limited to 5 TeV
 Heavy Ion run at the end of 2010
 No detailed planning yet
 6 month shutdown between 2010/2011 (?) – restart in May?
 Now understand the effective amount of data taking in 2009+2010 will be ~6.1 x 10^6 seconds (cf. 2 x 10^7 anticipated in the original planning)

43 Pending issues for 2009 (Ian.Bird@cern.ch)
 Plan to have visits of Tier1 sites – to understand service issues
MSS
Databases – often seem to be a source of problems
Share and spread knowledge of running reliable services
 SRM performance
Need good testing of Tier1 tape recall/reprocessing, together with Tier1 tape writing – for several experiments together
Encapsulated tests?
Data access with many users for analysis – need to see experiment testing of analysis
 Transition plan for 2010 – to cover services today provided by EGEE
May be short or long term – but is probably going to be needed

44 WLCG timeline 2009-2010 (Ian.Bird@cern.ch)
[timeline chart spanning Jan 2009 – Feb 2011: shutdown, pp running, HI?; STEP’09 in May + June; 2009 capacity commissioned; 2010 capacity commissioned; switch to SL5/64bit completed?; deployment of glexec/SCAS, CREAM, SRM upgrades, SL5 WN; EGEE-III ends, EGI... ???; WLCG workshops]
The machine will ramp up slowly in efficiency – incremental improvements will still be possible – larger changes will need to be carefully planned, tested & agreed!

45 WLCG Service Summary
Great strides have been made in the past year, witnessed by key achievements such as wide-scale production deployment of SRM v2.2 services, successful completion of CCRC’08 and support for experiment production and data taking
Daily operations con-calls – together with the weekly summary – are key to the follow-up of service problems
Some straightforward steps for improving service delivery have been identified and are being carried out
 Full 2009-scale testing of the remaining production + analysis Use Cases is urgently required (STEP’09) – without a successful and repeatable demonstration we cannot assume that this will work!

46 Outlook: Grids, Clouds, Virtualization etc.
Numerous discussions on this at CHEP – as well as at other fora (OGF etc.)
Investigations are already ongoing in a number of areas – expect these to increase, with possibly one or more “WLCG” projects in the near future (to fit with priorities from 2009 / 2010 data taking)
Questions regarding databases and data management – important that this community is engaged with this work (imho)

47 Current Data Management vs Database Strategies
Data Management:  specify only the interface (e.g. SRM) and allow sites to choose the implementation (both of SRM and of the backend s/w & h/w mass storage system)
Databases:  agree on a single technology (for specific purposes) and agree on detailed implementation and deployment details
WLCG experience from both areas shows that you need to have very detailed control down to the lowest levels to get the required performance and scalability.
How can this be achieved through today’s (or tomorrow’s) Cloud interfaces? Are we just dumb???

48 Can Clouds Replace Grids? – The Checklist
 We have established a short checklist that will allow us to determine whether clouds can replace – or be used in conjunction with – Grids for LHC-scale data intensive applications:
1. Non-trivial quality of service must be achieved;
2. The scale of the test(s) must be meaningful for petascale computing;
3. Data volumes, rates and access patterns representative of LHC data acquisition, (re-)processing and analysis;
4. Cost (of entry; of ownership).

49 Summary
 The LHC project is simply massive on any scale
There are many unsung heroes – and we should be aware of our role & contributions
 If / when major discoveries are made, no-one is likely to remember to mention the contribution of the services that allowed such advances…
But we will know and should be proud of it nonetheless

50 [closing slide – no text content]

