Virtualisation: status and plans Dag Toppe Larsen 13.02.2013.


1 Virtualisation: status and plans Dag Toppe Larsen 13.02.2013

2 Task outline (NA61/NA49 meeting, CERN, 13.02.2013)
- Set up cluster (private and/or LXCLOUD)
- Make experiment software available
- Create automation scripts
- Validate outputs
- Overall integration
- Production reconstruction

3 LXCLOUD infrastructure
- IT has switched from OpenNebula to OpenStack as cloud infrastructure
- IT is planning to internally run “everything” on virtual machines
  - “Agile” strategy
  - Required for 2nd data centre in Budapest
- Currently we are migrating to the 3rd version/iteration of the LXCLOUD test facility
  - Past two based on OpenNebula, current on OpenStack (close to final service)
  - Initially some contextualisation issues, but now OK
  - Will again be decommissioned in March...
- When full test production has been completed, will request more cloud resources from IT
- On LXCLOUD, users will be charged by running time of virtual machines
  - Does not matter if they process or not -> have to create/kill virtual machines automatically as processing requirements change
  - Example: start cluster of 100 nodes for processing of data for ongoing production
    - Kill/start nodes as needed
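Since charging is by running time, the number of running nodes should follow the job backlog. A minimal sketch of that decision, assuming one job slot per node and the 100-node example above as the upper limit; the function names and parameters are hypothetical and not taken from the actual automation scripts:

```python
import math


def nodes_needed(active_jobs: int, jobs_per_node: int = 1, max_nodes: int = 100) -> int:
    """Number of nodes that should be running for the current backlog.

    active_jobs counts queued plus running jobs, so busy nodes are kept.
    jobs_per_node and max_nodes are assumed values, not measured ones.
    """
    if active_jobs <= 0:
        return 0
    return min(max_nodes, math.ceil(active_jobs / jobs_per_node))


def scaling_action(running_nodes: int, active_jobs: int,
                   jobs_per_node: int = 1, max_nodes: int = 100) -> int:
    """Positive: nodes to start. Negative: nodes to kill. Zero: no change."""
    return nodes_needed(active_jobs, jobs_per_node, max_nodes) - running_nodes
```

For example, `scaling_action(0, 250)` asks for the full 100 nodes, while `scaling_action(100, 40)` asks to kill 60 idle ones. Which nodes to kill (only idle ones) is left to the cluster scripts.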

4 CernVM On-line
- Simplifies creation of a single virtual cluster distributed over multiple physical clouds
  - Example: one virtual cluster with nodes running on LXCLOUD, the Belgrade cluster and the Amazon cloud
  - Completely transparent
- New development/service from the CernVM team
- Switching to this infrastructure

5 Make experiment software available
- NA61/NA49 software must be available on CVMFS for CernVM to process data
- NA61
  - Legacy software chain installed
  - SHINE software installed
    - ROOT and other dependencies provided via CVMFS
    - SVN checkout compiles “out of the box”
    - Using 32-bit CernVM image
- NA49
  - Software has been installed

6 Obtaining required information prior to production
- Data processing web interface: a front-end to the data processing scripts
  - Reaction list from bookkeeping system
  - Reaction run list from bookkeeping system
  - Software list from CVMFS directory tree
  - Global key list from global key database
- User selects data and parameters, and clicks “process”

7 Data production sequence

8 Main steps/scripts of data processing
- Initiate processing
- Access bookkeeping
- Processing script
- Actual processing
- Detect failed jobs
- Resubmit failed jobs
- Close production
- Merge mini shoe, QA, logs
- Update bookkeeping

9 Processing script
- Input to script:
  - Reaction name
  - Software version
  - Global key
  - (CernVM version)
- Different batch systems CernVM/LXBATCH
  - CernVM: Condor
  - LXPLUS: PBS
  - Try to unify as much as possible
- Create/modify set-up file depending on global key, magnetic field, beam momentum, reaction
- Set up environment variables depending on version of legacy software/SHINE
- Actual processing
  - Get raw data from Castor
  - Set environment
  - Process
  - Upload reconstructed data to Castor
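One way to keep the script “unified” is to isolate the batch-system choice in a single lookup and share everything else. A rough sketch under that assumption; `ProcessingRequest`, `plan_job` and the dictionary layout are hypothetical names for illustration, and the example values are borrowed from other slides purely as placeholders:

```python
from dataclasses import dataclass

# Platform -> batch system, as listed on the slide.
BATCH_SYSTEMS = {"CernVM": "Condor", "LXPLUS": "PBS"}

# The four processing steps from the slide, shared by both platforms.
PROCESSING_STEPS = (
    "get raw data from Castor",
    "set environment",
    "process",
    "upload reconstructed data to Castor",
)


@dataclass(frozen=True)
class ProcessingRequest:
    """Inputs to the processing script (hypothetical container)."""
    reaction: str
    software_version: str
    global_key: str


def plan_job(request: ProcessingRequest, platform: str) -> dict:
    """Describe one job; only the batch system depends on the platform."""
    if platform not in BATCH_SYSTEMS:
        raise ValueError(f"unknown platform: {platform}")
    return {
        "batch_system": BATCH_SYSTEMS[platform],
        "reaction": request.reaction,
        "software_version": request.software_version,
        "global_key": request.global_key,
        "steps": list(PROCESSING_STEPS),
    }


# Placeholder values only; not a statement about any real production.
example = plan_job(ProcessingRequest("pp31", "v12j", "11_015"), "CernVM")
```

With this layout the two platforms differ only in the `batch_system` entry, which is what makes log files from both sides directly comparable.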

10 Detecting failed jobs
- Failed jobs identified from:
  - Non-existing/empty/small output DSPACK, SHOE, ROOT files
  - Failed/exited/terminated chunks/events (from log file)
- Mostly working OK, but a small number of false positives (short runs with only 1 or 2 “empty” events)
- Script being updated to store job status in SQLite database rather than text files
  - Acrontab entry to regularly update database as jobs are being processed
- Failed jobs resubmitted (up to 3 times?)
- Merge small files (mini shoe, QA, logs)
- When all jobs done, update bookkeeping database
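The SQLite bookkeeping of job status could look roughly like the sketch below. The table layout, the function names, the minimum output size and the resubmission limit of 3 (the slide leaves it as an open question) are all assumptions for illustration, not the actual script:

```python
import sqlite3

MAX_RESUBMISSIONS = 3  # assumed; the slide only says "up to 3 times?"


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the job database and create the (hypothetical) jobs table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "chunk TEXT PRIMARY KEY, status TEXT NOT NULL, attempts INTEGER NOT NULL)"
    )
    return db


def record_result(db, chunk: str, output_size: int, log_ok: bool,
                  min_size: int = 1) -> str:
    """Mark a chunk 'done' or 'failed' and count this run as one attempt.

    A chunk fails if its output is missing/too small or its log reports errors.
    """
    status = "done" if (output_size >= min_size and log_ok) else "failed"
    cur = db.execute(
        "UPDATE jobs SET status = ?, attempts = attempts + 1 WHERE chunk = ?",
        (status, chunk),
    )
    if cur.rowcount == 0:  # first time this chunk is seen
        db.execute(
            "INSERT INTO jobs (chunk, status, attempts) VALUES (?, ?, 1)",
            (chunk, status),
        )
    db.commit()
    return status


def chunks_to_resubmit(db) -> list:
    """Failed chunks that have not yet used up their resubmissions."""
    rows = db.execute(
        "SELECT chunk FROM jobs WHERE status = 'failed' AND attempts <= ? "
        "ORDER BY chunk",
        (MAX_RESUBMISSIONS,),
    )
    return [row[0] for row in rows]
```

A periodic (e.g. acrontab-driven) caller would invoke `record_result` for each finished chunk and then resubmit whatever `chunks_to_resubmit` returns. Raising `min_size` too far would add to the false positives seen for short runs.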

11 Output validation – status
- run-015252x000 has been processed on both CernVM/CVMFS and LXPLUS/AFS
  - Software v12j, global key 11_015
  - /afs/cern.ch/user/n/na61cld/public/compare/run-015252x000.ds.*
- Initially hard to compare log files since CernVM/LXBATCH used different scripts for processing
  - Now using “unified” production script
- All initial log file differences investigated/accounted for
  - Currently the only differences are related to paths/processing node
- Using ds_diff, numerical differences are generally of the order 1/1000 – 1/100000 (some exceptions)
  - File: run-015252x000.ds.diff
- Asked Grzegorz to compare the outputs using his script
  - Should give a clearer “high-level” picture
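To make “differences of the order 1/1000” concrete, the sketch below flags value pairs whose relative difference exceeds a tolerance. It does not reproduce ds_diff or its output format; the function names and the default tolerance of 1e-3 are assumptions chosen to match the figures quoted above:

```python
def relative_difference(a: float, b: float) -> float:
    """|a - b| scaled by the larger magnitude; 0.0 if both values are zero."""
    scale = max(abs(a), abs(b))
    if scale == 0.0:
        return 0.0
    return abs(a - b) / scale


def values_outside_tolerance(reference, candidate, tolerance: float = 1e-3):
    """Return (index, reference, candidate) for each pair above the tolerance."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must have the same length")
    return [
        (i, r, c)
        for i, (r, c) in enumerate(zip(reference, candidate))
        if relative_difference(r, c) > tolerance
    ]
```

Applied to matching quantities from the two productions, an empty result would mean every value agrees within the tolerance, and the “some exceptions” would show up as explicit entries to investigate.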

12 Output validation – next steps
- Compare using Grzegorz' script
  - Evaluate whether results are consistent
- Redo processing/comparison for run-008688x000 pp31, should give “cleaner” results
- Process whole reaction using CernVM
  - Fresh BeBe30 might be a good candidate since it is already processed using new software v12j/global key 13_001 on LXBATCH
  - Compare physics results from both productions?

13 Preliminary resource estimate
- On current LXCLOUD test facility one chunk (~400 events) requires ~45 minutes for processing
  - Includes Castor access
- One reaction has ~4 million events -> ~10000 chunks -> ~7500 CPU (core) hours
- What is acceptable processing time?
  - 200 cores -> ~37.5 hours = ~1.5 days
- More precise estimate will be obtained from the pending production of complete reaction
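The arithmetic behind this estimate can be written out as a small function, which also makes it easy to try other core counts. This is a sketch that assumes perfect parallel scaling (no queueing or start-up overhead); the function name is hypothetical, and the input numbers are the rough figures from the slide:

```python
import math


def estimate_resources(events: int, events_per_chunk: int,
                       minutes_per_chunk: float, cores: int):
    """Return (chunks, CPU core hours, wall-clock hours on `cores` cores)."""
    chunks = math.ceil(events / events_per_chunk)
    cpu_hours = chunks * minutes_per_chunk / 60
    wall_hours = cpu_hours / cores  # assumes all cores are busy all the time
    return chunks, cpu_hours, wall_hours


# Slide figures: ~4 million events, ~400 events/chunk, ~45 min/chunk, 200 cores.
chunks, cpu_hours, wall_hours = estimate_resources(4_000_000, 400, 45, 200)
print(chunks, cpu_hours, wall_hours)  # 10000 7500.0 37.5
```

37.5 hours is about 1.6 days, in line with the ~1.5 days quoted. Halving the core count to 100 doubles the wall-clock time to 75 hours.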

14 Status
- Currently, most pieces of the puzzle are more or less ready
  - CVMFS software installation
  - Software validation
  - Scripts for creating/killing virtual machines/clusters
  - New production scripts for CernVM
  - Scripts for detecting failed jobs
  - Dimitrije's production manager
  - Interface to Alexander's bookkeeping database
- Main outstanding issue: finish putting all the pieces fully together
- Many of the challenges encountered are related to the legacy software
  - Will be easier after the switch to SHINE for complete reconstruction

15 Next steps
- Parallel task 1: testing/validation
  - Compare using Grzegorz' script
  - Rerun validation for run 8688
  - Process full reaction
  - Estimate/request resources
- Parallel task 2: putting the pieces together
  - Integrate automatic virtual cluster creation
  - Integrate automatic detection/resubmission of failed jobs
  - Integrate production manager
  - Integrate bookkeeping database
- Transfer to NA61

16 Roadmap
Task | Status/done | Remaining | Expected
NA61 software installation | OK | |
NA49 software installation | OK | Data validation | OK
Scripts for production, resubmit failed jobs, create/kill clusters, etc. | Standalone scripts created | Scripts need to be modified to be integrated into overall system | Next few months
Validate outputs | Mostly done | Greg's script, full reaction | February/March
Production mngr., bookkeeping | Interfaces created | Integrate into overall system | Next few months
Integrate components | Components mostly ready | Integration | Next few months
Production of full reaction | Scripts ready | Set up sufficiently large cluster on new LXCLOUD | February/March
Estimate resources | | Requires production of full reaction | February/March

