Slide 1/14: LSST Simulations on OSG
Sep 21, 2010
Gabriele Garzoglio, for the OSG Task Force on LSST
Computing Division, Fermilab
Overview:
– OSG Engagement of LSST
– LSST Simulation Workflow Requirements
– System Architecture
– Resource Utilization
– Operational Experience
– Results Validation

Slide 2/14: Introduction
LSST at Purdue (Ian Shipsey) and OSG are collaborating to explore the use of OSG to run LSST computations
We have integrated the “current” LSST version of the image simulation with OSG
In September, Bo Xin (LSST Purdue) has:
– produced 400 / 529 image pairs (some jobs may still require recovery)
– validated the OSG production against one official reference (PT1) image

Slide 3/14: OSG “engagement” of LSST
The goal of OSG is to empower LSST to use OSG resources independently
OSG is currently “engaging” LSST, i.e. it provides experts for a limited time to help with:
– commissioning a submission system (software & resources) to run LSST workflows on OSG resources
– supporting the integration of LSST applications with OSG
– supporting the initial LSST operations

Slide 4/14: History
Feb 2010 – Proof of principle: 1 LSST image simulated on OSG
Jun 2010 – OSG EB forms an OSG task force for LSST
  – Goal of the project is to simulate one night of image taking (500 image pairs)
Jul 2010 – Commissioning of the LSST submission system on OSG
  – Using an old application release (svn-11853), 1 person produced the same image 183 times in 1 day
Aug 2010 – Integrated the “current” LSST release (svn-16264) with the system
Sep 2010 – LSST operator (Bo Xin) ran operations to produce 529 pairs

Slide 5/14: Workflow Requirements
LSST simulation of 1 image: 189 trivially parallel jobs, one per chip
Input to the workflow:
– SED catalog and focal plane configuration files: 15 GB uncompressed, pre-installed at all sites
– Instance Catalog (SED files + wind speed, etc.): 500 MB compressed per image pair
Workflow:
– Trim the catalog file into 189 chip-specific files
– Submit 2 x 189 jobs: 1 image pair (same image with 2 exposures)
Output: 2 x 189 FITS files, 25 MB each compressed
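To make the per-pair workflow concrete, here is a minimal sketch of how the trim-then-simulate dependency could be expressed as a Condor DAG, matching the DAG-based submission in the architecture slide. The submit-file names (trim.sub, simchip.sub) and variable names are hypothetical illustrations, not the actual LSST submission scripts.

```python
# Minimal sketch: generate a Condor DAGMan file for one image pair
# (1 trim job, then 2 x 189 chip-simulation jobs).
# Submit-file and variable names are hypothetical.

N_CHIPS = 189      # chips in the LSST focal plane
N_EXPOSURES = 2    # an image pair = the same image with 2 exposures

def write_pair_dag(pair_id, dag_path="image_pair.dag"):
    with open(dag_path, "w") as dag:
        # One trim job splits the instance catalog into 189 chip-specific files.
        dag.write("JOB trim trim.sub\n")
        dag.write(f'VARS trim pair="{pair_id}"\n')
        for exp in range(N_EXPOSURES):
            for chip in range(N_CHIPS):
                job = f"sim_e{exp}_c{chip:03d}"
                dag.write(f"JOB {job} simchip.sub\n")
                dag.write(f'VARS {job} pair="{pair_id}" exposure="{exp}" chip="{chip}"\n')
                # Every chip-simulation job waits for the trim job to finish.
                dag.write(f"PARENT trim CHILD {job}\n")

if __name__ == "__main__":
    write_pair_dag("pair_0001")
```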

Slide 6/14: Production by Numbers
Goal: simulate 1 night of LSST data collection: 500 pairs
– 200k simulation jobs (1 chip at a time), plus the trim jobs
– Assume 4 hours / job for trim and simulation (over-estimate) → 800,000 CPU hours
– Assume 2000 jobs DC (running concurrently) → ~50,000 CPU hours / day
– 17 days to complete (without counting failures)
– 12,000 jobs / day, i.e. 31 image pairs / day
– 50 GB / day of input files moved (different for every job)
– 300 GB / day of output
– Total number of files = 400,000 (50% input, 50% output)
– Total output compressed = 5.0 TB (25 MB per job)
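The estimates above follow from simple arithmetic; a minimal sketch that reproduces them, with the numbers taken from the slide and some rounding:

```python
# Back-of-the-envelope check of the production estimates above.
pairs = 500
chips = 189
exposures = 2

sim_jobs = pairs * exposures * chips            # 189,000 -> "200k simulation jobs"
hours_per_job = 4                               # over-estimate for trim and simulation
total_cpu_hours = sim_jobs * hours_per_job      # 756,000 -> "800,000 CPU hours"

concurrent_jobs = 2000
cpu_hours_per_day = concurrent_jobs * 24        # 48,000 -> "~50,000 CPU hours / day"
days_to_complete = total_cpu_hours / cpu_hours_per_day    # ~16 -> "17 days"

jobs_per_day = cpu_hours_per_day / hours_per_job          # 12,000 jobs / day
pairs_per_day = jobs_per_day / (exposures * chips)        # ~31 image pairs / day

output_mb_per_job = 25                          # compressed FITS output per chip job
total_output_tb = sim_jobs * output_mb_per_job / 1e6      # ~4.7 -> "5.0 TB"

print(days_to_complete, pairs_per_day, total_output_tb)
```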

Slide 7/14: Architecture
[Architecture diagram (glideinWMS): the user interface runs the DAG submitter and the Condor scheduler; the glidein VO frontend and the OSG glidein factory, with the Condor collector, submit glideins to the grid-site CE; glideins run on the worker nodes (WN), using the local disk for job info and data; output goes to Hadoop storage. Binaries and SED files are pre-installed at the sites via OSG MM.]
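From the operator's point of view this architecture reduces to submitting a DAG to the local Condor scheduler; glideinWMS then provisions OSG worker nodes transparently. A minimal, hypothetical driver sketch, assuming a local HTCondor installation with condor_submit_dag on the PATH and a DAG file already generated (e.g. by the sketch above):

```python
# Hypothetical driver: submit an image-pair DAG from the user-interface machine.
# Assumes a local HTCondor installation with condor_submit_dag available.
import subprocess

def submit_pair_dag(dag_file="image_pair.dag"):
    # The local Condor scheduler queues the DAG jobs; the glidein VO frontend sees
    # the idle jobs and requests glideins from the OSG glidein factory, so the jobs
    # end up running on grid-site worker nodes as if in a local Condor pool.
    subprocess.run(["condor_submit_dag", dag_file], check=True)

if __name__ == "__main__":
    submit_pair_dag()
```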

Slide 8/14: Monitoring Operations …
Periodically look at the output dir
If a job exits due to a system failure → automatically resubmit indefinitely
If a job exits due to an application failure → resubmit 5 times
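A minimal sketch of this resubmission policy; the helper callbacks (output_exists, classify_failure, resubmit) are hypothetical stand-ins for the actual operations scripts.

```python
# Minimal sketch of the retry policy described above; helper callbacks are hypothetical.
import time

APP_FAILURE_LIMIT = 5   # application failures: resubmit at most 5 times
POLL_INTERVAL = 600     # seconds between checks of the output directory

def monitor(jobs, output_exists, classify_failure, resubmit):
    """jobs: dict job_id -> count of application failures seen so far."""
    while jobs:
        for job_id in list(jobs):
            if output_exists(job_id):
                del jobs[job_id]             # output file appeared: job done
            elif (reason := classify_failure(job_id)) is None:
                continue                     # still running
            elif reason == "system":
                resubmit(job_id)             # system failure: retry indefinitely
            else:                            # application failure
                jobs[job_id] += 1
                if jobs[job_id] <= APP_FAILURE_LIMIT:
                    resubmit(job_id)
                else:
                    del jobs[job_id]         # give up after 5 application retries
        time.sleep(POLL_INTERVAL)
```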

Slide 9/14: Resource Utilization
By Sep 3, produced 150 pairs in 5 days using 13 sites. Now 400 / 529 pairs are produced (some chip jobs may require recovery).
[Plots: frontend status (jobs & glideins) and Gratia resource-utilization plots, annotated “150 pairs produced”.]

Slide 10/14: Typical Workflow Statistics

Slide 11/14: Operational Challenges
LSST binaries were not Grid-ready:
– the application assumed a writable software distribution
– the application assumed shorter path lengths than those encountered on the Grid
– the orchestration script did not exit with a failure code upon error (required manual recovery until fixed)
Typical failures at sites:
– jobs required more memory than the batch system allotted
– storage unavailable due to maintenance at some of the most productive sites
Limited disk quota on the submission machine
After the production of 150 pairs, the operator was mostly traveling and had limited time to dedicate to operations

Slide 12/14: Validation: reproducibility on OSG
An image is validated by subtracting it pixel by pixel from a reference.
We produced 2 different images twice on OSG, each time on a different mix of resources.
In both cases, the pixel subtraction is consistently 0.
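A minimal sketch of such a pixel-by-pixel comparison, assuming the chip images are FITS files and that astropy and numpy are available; the file names are hypothetical.

```python
# Minimal sketch: pixel-by-pixel validation of a simulated chip image against a reference.
import numpy as np
from astropy.io import fits

def compare_images(produced_path, reference_path, tolerance=0.0):
    produced = fits.getdata(produced_path).astype(np.float64)
    reference = fits.getdata(reference_path).astype(np.float64)
    diff = produced - reference
    max_abs = np.max(np.abs(diff))
    # A maximum absolute difference of 0 means the two images agree pixel by pixel.
    return max_abs <= tolerance, max_abs

if __name__ == "__main__":
    ok, max_abs = compare_images("osg_run1_chip_000.fits", "osg_run2_chip_000.fits")
    print("consistent" if ok else f"differs (max |diff| = {max_abs})")
```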

Slide 13/14: Validation: comparison with PT1
We compared 1 image pair (378 chips) with an “official” reference (PT1).
For 374 chips the pixel subtraction is consistently 0. For 4 chips it is negligible*.
*More investigations are under way.
Study done by Bo Xin.

Slide 14/14: Conclusions
This project has demonstrated that OSG is a valuable platform to simulate LSST images
1 LSST person, Bo Xin, with expert support, has produced 150 image pairs in 5 days, using on average 50,000 CPU hours / day. Bo is currently finishing the production of 529 pairs (400 done, with possible need for recovery)
We see image simulation only as the first application integrated with OSG; we are interested in the integration of data-intensive applications as well
We are interested in discussing the creation of LSST as an independent Virtual Organization in OSG
Thank you to the OSG Task Force
– Especially to Brian Bockelman, Parag Mhashilkar, and Derek Weitzel