On-demand Services for the Scientific Program at Fermilab
Gabriele Garzoglio, Grid and Cloud Services Department, Scientific Computing Division, Fermilab
ISGC – Thu, March 27, 2014

Overview
Frontier Physics at Fermilab: Computing Challenges
Strategy: Grid and Cloud Computing
On-Demand Services for Scalability: Use Cases
The Costs of Physics at Amazon

The Cloud Program Team
Fermilab Grid and Cloud Services Department
– Gabriele Garzoglio (Department Head)
– Steven Timm (FermiCloud Project Lead)
– Gerard Bernabeu Altayo (lead, FermiCloud operations)
– Hyun Woo Kim (lead, FermiCloud development)
– Tanya Levshina, Parag Mhashilkar, Kevin Hill, Karen Shepelak (technical support)
– Keith Chadwick (Scientific Computing Division Technical Architect)
KISTI-GSDC team
– Haeng-Jin Jang, Seo-Young Noh, Gyeongryoon Kim
Funded through the Collaborative Agreement with KISTI
– Nick Palombo (consultant)
– Hao Wu, Tiago Pais (2013 summer students, IIT)
– Francesco Ceccarelli (2013 summer student, INFN Italy)
Professors at the Illinois Institute of Technology (IIT)
– Ioan Raicu, Shangping Ren

Fermilab Experiment Schedule
Measurements at all frontiers – electroweak physics, neutrino oscillations, muon g-2, dark energy, dark matter
8 major experiments in 3 frontiers running simultaneously in 2016, sharing both beam and computing resources
Impressive breadth of experiments at FNAL
[Slide chart: FY16 experiment schedule, including 4 Intensity Frontier experiments]

Motivation: Computing needs for IF / CF experiments
Requests for increasing slot allocations
Expect stochastic load rather than a constant (DC) load, so focus on concurrent peak utilization
Addressing the problem for IF / CF experiments as an ensemble under FIFE
Collaborating with CMS on computing and infrastructure
See “Fabric for Frontier Experiments at Fermilab” – Thu 3/27, 14:40, Rm 2
[Chart: slot requests, from the Fermilab Scientific Computing Review; slot numbers have been updated periodically since then.]

Slots: Fermilab, OSG, and Clouds
Current full capacity of FermiGrid: ~30k slots (clusters: General Purpose, CMS, CDF, DZero)
Full capacity of OSG: ~85k slots
Additional OSG opportunistic slots: 15k – 30k
Additional pay-per-use slots at commercial clouds
Porting the experiments’ computing to distributed resources is part of the strategy.
[Chart: FermiGrid capacity, ~30k slots]

Grid vs. Clouds (…in my world…)

Grid
– Statically-allocated, federated resources
– Diverse per-site computing environments
– Mature workload management systems using Grid interfaces
– Lots of capacity
– Dedicated slots: “free” at collaborating sites (after site onboarding)
– Opportunistic slots: “free” after onboarding; no guarantee

Commercial Cloud
– Dynamically-allocated resources
– User-defined computing environment
– Allocation and provisioning being integrated into workload management through diverse interfaces (EC2 emulation, Google, OpenStack, …)
– Growing capacity
– Dedicated slots: ~$0.12 / CPU h; ~$0.09 – $0.12 / GB
– Spot pricing: bid for your slot, ~$0.01 (better guarantee with a higher bid)

Community Cloud
– Dynamically-allocated resources
– User-defined computing environment
– Allocation and provisioning being integrated into workload management through diverse interfaces
– Growing capacity
– Dedicated slots: “free” at (a few) collaborating sites; ~$0.04 / CPU h at BNL, ~$0.03 / CPU h at FNAL
– Opportunistic model not developed yet

On-demand Services for the Intensity and Cosmic Frontiers
Goal: each experiment can access O(10k) slots on distributed resources when needed.
Problem: how many supporting services of type X do we deploy for experiment Y?
Use cases for service X:
…batch computation
– data transfer, submission nodes, coordinated clusters, DB proxies, worker nodes for Grid-bursting, …
…interactive sessions
– build nodes, WN-like environments for debugging, DAQ emulation, …
…advanced use cases
– private batch systems, private-network clusters, Hadoop / MapReduce, generating customized VMs, …
Solution platform: dynamically instantiated virtualization

Dynamic provisioning drivers
Workloads are becoming more heterogeneous:
– Very large data sets (DES, DarkSide)
– Consecutive tasks on the same large data set (DES)
– Large memory requirements (LBNE)
– Large shower libraries (neutrino experiments)
– Executions of order minutes on O(30 GB) of data
– Whole-node scheduling
Need flexible ways to reconfigure compute clusters and supporting services for changing workloads.
In FY16: 8 major experiments at Fermilab from 3 frontiers. Usage patterns are hard to plan over “slow” (yearly?) cycles; we need to be more dynamic.

Cloud Computing Environments
FermiCloud Services
– Production-quality Infrastructure-as-a-Service based on OpenNebula
– Made technical breakthroughs in authentication, authorization, fabric deployment, accounting, and high-availability cloud infrastructure
– Continues to evolve and develop cloud computing techniques for scientific workflows in collaboration with KISTI, South Korea
– Ref: ISGC 2011/12/13, ISGC12 Interoperability Workshop, Keith Chadwick
Amazon Web Services
– Startup allocation to integrate commercial clouds
Community Clouds
– Resources available through collaborative agreements (KISTI, OSG, OpenStack Federation, etc.)

§1 - On-demand storage & data movement services
Stochastic load of data movement: different experiments running different workflows at different times.
Locally on FermiCloud…
– scale experiment-specific data movement servers (GridFTP, xrootd, dCache doors, …) with the expected load
– configure the servers to work as an ensemble, gathering them up under a common entry gateway (SRM / BeStMan, a cluster-level xrootd redirector, …)
At the Grid / Cloud site…
– it helps to have a temporary storage element on the remote grid/cloud site as well
– web caching and database proxies on the remote grid/cloud site
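One way to read the “scale servers with the expected load” point above is as a simple capacity calculation: given a forecast peak transfer rate for a campaign and the throughput a single server VM can sustain, decide how many instances to launch behind the common gateway. The sketch below only illustrates that reasoning; the per-server throughput, the headroom factor, and the example forecast are hypothetical parameters, not measured FermiCloud figures.

```python
import math

# Hypothetical capacity rule for on-demand data-movement services:
# start enough transfer servers (GridFTP / xrootd / dCache doors) behind a
# common gateway to cover the forecast peak rate, with some headroom.
# All numbers below are illustrative, not measured FermiCloud values.

def servers_needed(peak_rate_mbps, per_server_mbps=1000.0, headroom=1.25):
    """Number of transfer-server VMs to instantiate for a forecast peak rate."""
    return max(1, math.ceil(peak_rate_mbps * headroom / per_server_mbps))

# Example: a campaign forecast of a 3.5 Gb/s peak, with each VM assumed
# to sustain ~1 Gb/s, calls for 5 server instances.
print(servers_needed(3500))  # -> 5
```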

§2 - On-demand coordinated cluster
Launch a batch system head node + DB proxy + storage:
– instantiate dynamically-provisioned worker nodes depending on the characteristics of the workflow and the current load
– supports legacy clusters, heterogeneous workflows, and overlay networks
Launch an interactive computing environment with supporting services:
– a CMS T3 “in-a-box”
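The provisioning flow described above can be sketched as simple orchestration code. This is a minimal illustration assuming a generic IaaS layer: launch_vm, the image names, and the scaling rule are hypothetical placeholders, not the actual FermiCloud / OpenNebula interfaces.

```python
from dataclasses import dataclass

@dataclass
class VM:
    role: str
    image: str

def launch_vm(role, image):
    # In a real deployment this call would go to the IaaS API
    # (e.g. OpenNebula on FermiCloud or EC2); here it only records the request.
    print(f"launching {role} VM from image {image}")
    return VM(role, image)

def provision_cluster(queued_jobs, jobs_per_worker=8, max_workers=100):
    """Bring up the head-node services, then enough workers for the queue."""
    cluster = [
        launch_vm("batch-head", "headnode-image"),  # batch system head node
        launch_vm("db-proxy", "dbproxy-image"),     # database proxy
        launch_vm("storage", "storage-image"),      # temporary storage element
    ]
    # Simple scaling rule: one worker per jobs_per_worker queued jobs,
    # capped at the share of the cloud this experiment may use.
    n_workers = min(max_workers, -(-queued_jobs // jobs_per_worker))
    cluster += [launch_vm("worker", "wn-image") for _ in range(n_workers)]
    return cluster

if __name__ == "__main__":
    provision_cluster(queued_jobs=500)
```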

§3 - On-demand Grid-bursting
Add worker nodes from clouds to local batch systems.
NOvA campaign: 1,000,000 events generated with ~10k jobs, using 88,535 CPU hours over 2 weeks of operations and producing 2 TB of data.
– Ran on OSG at SMU (dedicated), UNL, UWisc, UC, UCSD, plus O(100) jobs at FermiCloud
– Operations consisted of 1 submission + 2 recoveries
– Spent about 10% more resources than expected due to preemption
We ran 400 jobs of this application on Amazon Web Services.
How much would it cost to run the full campaign on AWS?

GlideinWMS – Grid Bursting
[Architecture diagram: Users, Job Queue, GlideinWMS Frontend, Pilot Factory, Pilots, and Jobs, with pilots provisioned on FermiCloud, Gcloud (KISTI), and Amazon AWS.]
Ref: P. Mhashilkar et al., “Cloud Bursting with GlideinWMS”, CHEP 2013

Targeting use of AWS general-purpose instances
1 ECU = 1.2 GHz Xeon (2007)
Options: standard instances and spot pricing
FermiCloud economic model: $0.03 / h
T. Wong (HEPiX Fall 2013): cost at RACF (BNL): $0.04 / h
…watch out for data charges (network providers are negotiating with the cloud vendors)
[AWS price table for general-purpose instances: m1.small (32-bit or 64-bit, 1 x 160 GB instance storage, Low network performance) and m1.medium (32-bit or 64-bit, 1 x 410 GB instance storage, Moderate network performance), with vCPU, ECU, memory, and EBS-optimized columns.]

AWS / OSG Worker Node Equivalence
AWS m1.medium ≈ “standard” OSG worker node
– Ran 400 NOvA Monte Carlo jobs, generating 100 events each, on m1.medium (not enough memory on m1.small)
[Table comparing the runs at UNL via OSG, on FermiCloud, and on AWS (m1.medium): failed jobs, RAM (GB), average and min/max wall time (min), total wall time (hr), data transferred (GB), and cost; the AWS run cost $42.]
– T. Wong (HEPiX Fall 2013, BNL): “EC2 jobs on m1.small took ~50% longer than on dedicated facility”

Cost range for NOvA event generation
Full campaign: 1,000,000 events, 2 TB of data, 90,000 CPU hours
– Data transfer cost: ~$240
– Standard m1.medium instances: $11k
– Spot price: $2k
– FermiCloud-equivalent cost estimate: $3k
– RACF-equivalent cost estimate: $4k
Next test – MC generation of cosmics in the NOvA Near Detector
– Jobs: 1,000 jobs of ~1 h each, 250,000 events per job
– Data: ~100 MB of output per job, ~100 GB total
– Estimate: 1,000 CPU hours in a day with 100 VMs
– Standard m1.medium: $130 ($12 data charges)
– Spot price: $24 ($12 data charges)
– FermiCloud-equivalent cost estimate: $30
– RACF-equivalent cost estimate: $40
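The figures above follow from straightforward rate arithmetic, as in the sketch below. The hourly rates for on-demand m1.medium, FermiCloud-equivalent, and RACF-equivalent capacity and the ~$0.12/GB outbound data charge come from the earlier slides; the spot rate of roughly $0.02 per CPU hour is an assumption inferred from the quoted ~$2k spot total.

```python
# Back-of-the-envelope cost model for the NOvA event-generation campaign.
# Hourly rates and the data charge are taken from the preceding slides;
# the spot rate is an assumption inferred from the quoted ~$2k spot total.

RATES_PER_CPU_HOUR = {
    "AWS m1.medium (on-demand)": 0.12,
    "AWS m1.medium (spot, assumed)": 0.02,
    "FermiCloud-equivalent": 0.03,
    "RACF-equivalent": 0.04,
}
DATA_EGRESS_PER_GB = 0.12  # upper end of the $0.09 - $0.12 / GB range quoted earlier

def campaign_cost(cpu_hours, data_gb):
    """Return {resource: estimated total cost}, adding data egress for the AWS options."""
    egress = data_gb * DATA_EGRESS_PER_GB
    return {
        name: cpu_hours * rate + (egress if name.startswith("AWS") else 0.0)
        for name, rate in RATES_PER_CPU_HOUR.items()
    }

if __name__ == "__main__":
    # Full campaign: 1,000,000 events, ~90,000 CPU hours, ~2 TB of output,
    # reproducing the ~$11k / ~$2k / ~$3k / ~$4k figures on the slide.
    for name, cost in campaign_cost(cpu_hours=90_000, data_gb=2_000).items():
        print(f"{name:30s} ~${cost:,.0f}")
```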

Recent publications
– Automatic Cloud Bursting under FermiCloud – ICPADS CSS workshop, Dec 2013
– A Reference Model for Virtual Machine Launching Overhead – CCGrid conference, May 2014
– Grids, Virtualization, and Clouds at Fermilab – CHEP conference, Oct 2013
– Exploring Infiniband Hardware Virtualization in OpenNebula towards Efficient High-Performance Computing – CCGrid SCALE workshop, May

Summary
Computing on distributed resources is strategic for the Intensity and Cosmic Frontier experiments.
We will support these experiments with both the Grid and Cloud paradigms.
We observe the emergence of strong drivers for on-demand service deployment:
– more heterogeneous workloads, cluster profile adaptation, …
Running jobs for the Intensity Frontier experiments on the Cloud has been successful at an initial small scale.