High Throughput & Resilient Fabric Deployments on FermiCloud Steven C. Timm Grid & Cloud Services Department Fermilab Work supported by the U.S. Department of Energy under contract No. DE-AC02-07CH11359

Outline
Introduction: Fermilab and High-Throughput Computing
HPC vs. HTC
Drivers for FermiCloud Project
Virtualized MPI
Virtualized File Systems
Resilience
Summary and Acknowledgements

Fermilab and HTC
Fermi National Accelerator Laboratory: lead United States particle physics laboratory, with ~60 PB of data on tape.
High Throughput Computing is characterized by:
"Pleasingly parallel" tasks,
A high CPU-instruction / bytes-of-I/O ratio, but still lots of I/O. See Pfister, "In Search of Clusters".
Fermilab does HPC too; see the Lattice QCD program for a description.

Grid and Cloud Services
Operations: Grid authorization, Grid accounting, Computing Elements, batch submission.
All require high availability.
All require multiple integration systems to test, which in turn requires virtualization and access as root.
Solutions: development of authorization, accounting, and batch submission software; packaging and integration.
This requires development machines that are not used all the time, plus environments that are easily reset, and access as root.

HTC Virtualization Drivers
Large multi-core servers have evolved from 2 to 64 cores per box; a single "rogue" user/application can impact 63 other users/applications. Virtualization provides mechanisms to securely isolate users/applications.
Typical "bare metal" hardware has significantly more performance than is usually needed for a single-purpose server; virtualization provides mechanisms to harvest/utilize the remaining cycles.
Complicated software stacks are difficult to distribute on the grid; distribution of preconfigured virtual machines together with GlideinWMS and HTCondor can help address this problem.
Large demand for transient development/testing/integration work; virtual machines are ideal for this work.
Science is increasingly turning to complex, multiphase workflows; virtualization coupled with cloud provides the ability to flexibly reconfigure hardware "on demand" to meet the changing needs of science.
Legacy code: data and code preservation for recently completed experiments at the Fermilab Tevatron and elsewhere.
Burst capacity: systems are full all the time; more cycles are needed just before conferences.

HPC Virtualization Drivers
HPC is dependent on high-performance shared file systems (GPFS, Lustre) and heterogeneous hardware (Phi, GPU, etc.).
Need ways to dynamically configure infrastructure.
Sandbox small users to keep them from crashing the big cluster.
But what about the performance hit?

FermiCloud – Initial Project Specifications
The FermiCloud project was established in 2009 with the goal of developing and establishing Scientific Cloud capabilities for the Fermilab scientific program, building on the very successful FermiGrid program that supports the full Fermilab user community and makes significant contributions as a member of the Open Science Grid Consortium.
Reuse high availability, AuthZ/AuthN, and virtualization expertise from the Grid.
In (very) broad brush, the mission of the FermiCloud project is:
To deploy a production-quality Infrastructure-as-a-Service (IaaS) cloud computing capability in support of the Fermilab scientific program.
To support additional IaaS, PaaS, and SaaS cloud computing capabilities based on the FermiCloud infrastructure at Fermilab.
The FermiCloud project is a program of work that is split over several overlapping phases; each phase builds on the capabilities delivered as part of the previous phases.

Overlapping Phases
Phase 1: "Build and Deploy the Infrastructure"
Phase 2: "Deploy Management Services, Extend the Infrastructure and Research Capabilities"
Phase 3: "Establish Production Services and Evolve System Capabilities in Response to User Needs & Requests"
Phase 4: "Expand the service capabilities to serve more of our user communities"
(Timeline figure: the phases overlap in time, with the later phases continuing today.)

Current FermiCloud Capabilities
The current FermiCloud hardware capabilities include:
Public network access via the high-performance Fermilab network,
-This is a distributed, redundant network.
Private 1 Gb/sec network,
-This network is bridged across FCC and GCC on private fiber,
High-performance InfiniBand network,
-Currently split into two segments,
Access to a high-performance FibreChannel-based SAN,
-This SAN spans both buildings.
Access to the high-performance BlueArc-based filesystems,
-The BlueArc is located on FCC-2,
Access to the Fermilab dCache and Enstore services,
-These services are split across FCC and GCC,
Access to the 100 Gbit Ethernet test bed in LCC (integration nodes),
-Intel 10 Gbit Ethernet converged network adapter X540-T1.

Typical Use Cases
Public-network virtual machine:
On the Fermilab network, open to the Internet,
Can access dCache and BlueArc mass storage,
Common home directory between multiple VMs.
Public/private cluster:
One gateway VM on the public/private net,
Cluster of many VMs on the private net,
Data acquisition simulation.
Storage VM:
VM with large non-persistent storage,
Used for large MySQL or Postgres databases,
Lustre/Hadoop/Bestman/xRootd/dCache/OrangeFS/iRODS servers.

OpenNebula
OpenNebula was picked as the result of an evaluation of open-source cloud management software.
An OpenNebula 2.0 pilot system in GCC has been available to users since November; it began with 5 nodes and gradually expanded to 13 nodes. Virtual machines have run on the pilot system over its 3+ years of operation.
An OpenNebula 3.2 production-quality system was installed in FCC in June 2012, in advance of the GCC total power outage; it now comprises 18 nodes.
The transition of virtual machines and users from the ONe 2.0 pilot system to the production system is almost complete.
In the meantime OpenNebula has done five more releases; we will catch up shortly.
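For context, launching a VM on an OpenNebula-managed cloud typically amounts to instantiating a registered template. Below is a minimal sketch (not FermiCloud's actual tooling) that wraps the standard onetemplate/onevm command-line tools from Python; the template and VM names are hypothetical examples, and ONE_AUTH/ONE_XMLRPC are assumed to be configured for the target cloud.

    #!/usr/bin/env python
    # Minimal sketch: instantiate a VM from an OpenNebula template.
    # Assumes the OpenNebula CLI tools (onetemplate, onevm) are installed and
    # the user's ONE_AUTH credentials point at the target cloud. The template
    # and VM names below are hypothetical examples.

    import subprocess

    def instantiate(template, vm_name):
        # 'onetemplate instantiate' prints the new VM ID on success.
        out = subprocess.check_output(
            ["onetemplate", "instantiate", template, "--name", vm_name])
        return out.decode().strip()

    def list_vms():
        # Plain listing of the caller's VMs; parse further as needed.
        return subprocess.check_output(["onevm", "list"]).decode()

    if __name__ == "__main__":
        print(instantiate("sl6-worker", "dev-test-01"))   # hypothetical names
        print(list_vms())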

FermiCloud – InfiniBand & MPI
FermiCloud purchased Mellanox SysConnect II InfiniBand adapters in May; these cards were advertised to support SR-IOV virtualization.
We received the SR-IOV drivers from Mellanox as a result of a meeting at SC2011. These were a private patch to the normal open-source OFED drivers.
After a couple of months of kernel building, MPI recompilation, firmware burning, and finding the right double-pre-alpha libvirt, etc., we were ready to measure.
Using this driver, we were able to make measurements of High Performance Linpack, comparing MPI on "bare metal" to MPI on KVM virtual machines on identical hardware. The next three slides detail the configurations and measurements that were completed in March.
MPI is compiled using IP over InfiniBand (IPoIB); this allows a direct measurement of the MPI "virtualization overhead".
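The comparison itself reduces to simple arithmetic: run HPL on bare metal and inside VMs on the same hardware, then compare Gflops. A small illustrative sketch follows; the numbers in it are placeholders, not the measured FermiCloud results (those appear in the table two slides on).

    # Sketch: quantify MPI "virtualization overhead" from HPL results.
    # The Gflops values here are placeholders for illustration only.

    def overhead(bare_metal_gflops, vm_gflops):
        """Fraction of bare-metal HPL throughput lost when running in VMs."""
        return 1.0 - vm_gflops / bare_metal_gflops

    if __name__ == "__main__":
        bare, virt = 100.0, 96.0          # placeholder Gflops figures
        retained = 100.0 * (1.0 - overhead(bare, virt))
        print("VM retains %.1f%% of bare-metal performance" % retained)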

FermiCloud "Bare Metal MPI" (diagram): two hosts, each with an InfiniBand card and an MPI process pinned to a CPU, connected through an InfiniBand switch.

FermiCloud "Virtual MPI" (diagram): two hosts, each with an InfiniBand card and a VM pinned to a CPU, connected through an InfiniBand switch.

MPI on FermiCloud (Note 1)
Table columns: Configuration | #Host Systems | #VM/host | #CPU | Total Physical CPU | HPL Benchmark (Gflops) | Gflops/Core
Bare Metal without pinning
Bare Metal with pinning (Note 2)
VM without pinning (Notes 2,3): 2 hosts, 8 VM/host, 1 vCPU each
VM with pinning (Notes 2,3): 2 hosts, 8 VM/host, 1 vCPU each
VM+SRIOV with pinning (Notes 2,4): 2 hosts, 7 VM/host, 2 vCPU each
Notes: (1) Work performed by Dr. Hyunwoo Kim of KISTI in collaboration with Dr. Steven Timm of Fermilab. (2) Process/virtual machine "pinned" to CPU and associated NUMA memory via use of numactl. (3) Software-bridged virtual network using IP over IB (seen by the virtual machine as a virtual Ethernet). (4) The SR-IOV driver presents native InfiniBand to the virtual machine(s); a 2nd virtual CPU is required to start SR-IOV, but it is only a virtual CPU, not an actual physical CPU.
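The pinning in Note 2 can be reproduced with numactl. A minimal sketch follows; the NUMA node index and the target executable are illustrative, not the exact FermiCloud invocation.

    # Sketch of CPU/NUMA pinning as described in Note 2: bind a process to
    # one NUMA node's CPUs and memory via numactl. Node index and target
    # command are illustrative only.

    import subprocess

    def run_pinned(cmd, numa_node=0):
        """Run `cmd` with CPUs and memory bound to a single NUMA node."""
        pinned = ["numactl",
                  "--cpunodebind=%d" % numa_node,
                  "--membind=%d" % numa_node] + cmd
        return subprocess.call(pinned)

    if __name__ == "__main__":
        # e.g. pin an HPL rank (hypothetical local binary) to NUMA node 0
        run_pinned(["./xhpl"], numa_node=0)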

InfiniBand Two Years Later
KVM SR-IOV drivers are now part of the standard OFED distribution; no special kernel or driver build is needed.
OpenNebula can launch your VM with an InfiniBand interface, with no libvirt hackery needed.
Still: be careful burning new firmware on the IB card.

Virtualized Storage Service Investigation
Motivation:
General-purpose systems from various vendors are being used as file servers,
Systems can have many more cores than needed to perform the file service,
Cores go unused => inefficient power, space, and cooling usage,
Custom configurations => complicates sparing issues.
Question: Can virtualization help here? What (if any) is the virtualization penalty?

FermiCloud Test Bed - Virtualized Server (diagram)
FCL server under test: 2 TB across 6 disks, on-board Ethernet; 3 data nodes and 1 name node; BlueArc mount; dom0 with 8 CPUs and 24 GB RAM; on-board client VM plus optional storage-server VMs (8 KVM VMs per machine; 1 core / 2 GB RAM each).
ITB / FCL external clients (7 nodes, 21 VMs): dual quad-core Xeon 2.66 GHz with 4 MB cache and 16 GB RAM; 3 Xen SL5 VMs per machine, 2 cores / 2 GB RAM each.

Virtualized File Service Results Summary
See the ISGC talk for the details.
Table columns: File System | Benchmark | Read (MB/s) | "Bare Metal" Write (MB/s) | VM Write (MB/s) | Notes
Lustre, IOZone: significant write penalty when the file system is hosted on a VM. Lustre, Root-based benchmark also run.
Hadoop, IOZone: varies with the number of replicas; FUSE does not export a full POSIX FS. Hadoop, Root-based: read 7.9 MB/s.
OrangeFS, IOZone: varies with the number of name nodes. OrangeFS, Root-based: read 8.1 MB/s.
BlueArc, IOZone: read 300 MB/s, bare-metal write 330 MB/s, VM write n/a; varies with system conditions. BlueArc, Root-based benchmark also run.
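As a rough illustration of the kind of measurement summarized above, the sketch below drives IOZone for a sequential write/read test and is meant to be run once on a bare-metal client and once inside a VM, then compared. The file size, record size, and mount point are simplified assumptions, not the exact benchmark configuration used for these results.

    # Rough sketch of an IOZone-style throughput comparison (bare metal vs. VM).
    # Run the same script on a bare-metal client and inside a VM, then compare.
    # Sizes and mount point below are assumptions, not the exact configuration
    # used for the FermiCloud measurements.

    import subprocess

    def run_iozone(path, size="4g", record="1m"):
        # -i 0: sequential write/rewrite, -i 1: sequential read/reread
        out = subprocess.check_output(
            ["iozone", "-i", "0", "-i", "1", "-s", size, "-r", record,
             "-f", path]).decode()
        return out   # parse the throughput columns from this report as needed

    if __name__ == "__main__":
        report = run_iozone("/mnt/testfs/iozone.tmp")   # hypothetical mount point
        print(report)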

FermiGrid-HA2 Experience
In 2009, based on operational experience and plans for redevelopment of the FCC-1 computer room, the FermiGrid-HA2 project was established to split the set of FermiGrid services across computer rooms in two separate buildings (FCC-2 and GCC-B).
This project was completed on 7-Jun-2011 (and was tested by a building failure less than two hours later). FermiGrid-HA2 worked exactly as designed.
Our operational experience with FermiGrid-HA and FermiGrid-HA2 has shown the benefits of virtualization and service redundancy:
Benefits to the user community – increased service reliability and uptime.
Benefits to the service maintainers – flexible scheduling of maintenance and upgrade activities.

Experience with FermiGrid = Drivers for FermiCloud
Access to pools of resources using common interfaces: monitoring, quotas, allocations, accounting, etc.
Opportunistic access: users can use the common interfaces to "burst" to additional resources to meet their needs.
Efficient operations: deploy common services centrally.
High-availability services: flexible and resilient operations.

Some Recent Major Facility and Network Outages at Fermilab
FCC main breaker 4x (Feb, Oct 2010).
FCC-1 network cuts 2x (Spring 2011).
GCC-B load-shed events (June-Aug 2011); this accelerated the planned move of nodes to FCC-3.
GCC load-shed events and maintenance (July 2012); the FCC-3 cloud was ready just in time to keep server VMs up.
FCC-2 outage (Oct 2012); FermiCloud wasn't affected, our VMs stayed up.

Service Outages on Commercial Clouds
Amazon has had several significant service outages over the past few years:
An outage of storage services in April 2011 resulted in actual data loss.
An electrical storm that swept across the East Coast late on Friday 29-Jun knocked out power at a Virginia data center run by Amazon Web Services.
An outage of one of Amazon's cloud computing data centers knocked out popular sites like Reddit, Foursquare, Pinterest, and TMZ on Monday 22-Oct-2012.
An Amazon outage affected Netflix operations over Christmas 2012 and New Year's 2013.
The latest outage was on Thursday 31-Jan.
Microsoft Azure: leap-day bug on 29-Feb.
Google: outage on 26-Oct.

FermiCloud – Fault Tolerance
As we have learned from FermiGrid, having a distributed fault-tolerant infrastructure is highly desirable for production operations.
We are actively working on deploying the FermiCloud hardware resources in a fault-tolerant infrastructure:
The physical systems are split across two buildings,
There is a fault-tolerant network infrastructure in place that interconnects the two buildings,
We have deployed SAN hardware in both buildings,
We have a dual head-node configuration with failover,
We have GFS2 + CLVM for our multi-user filesystem and distributed SAN,
The SAN is replicated between buildings using CLVM mirroring.
GOAL: If a building is "lost", automatically relaunch "24x7" VMs on the surviving infrastructure, then relaunch "9x5" VMs if there is sufficient remaining capacity, and perform notification (via Service-Now) when exceptions are detected (see the sketch below).
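The goal above is essentially a prioritized relaunch policy. A minimal sketch of the decision logic follows, assuming hypothetical helper functions for capacity accounting, VM relaunch, and Service-Now notification; this is not FermiCloud's actual failover code.

    # Minimal sketch of the prioritized relaunch policy described above.
    # `surviving_capacity`, `relaunch`, and `notify_service_now` are
    # hypothetical helpers, not FermiCloud's actual failover tooling.

    def failover(vms, surviving_capacity, relaunch, notify_service_now):
        """Relaunch 24x7 VMs first, then 9x5 VMs if capacity remains."""
        capacity = surviving_capacity()      # e.g. free cores on surviving hosts
        for tier in ("24x7", "9x5"):
            for vm in (v for v in vms if v["tier"] == tier):
                if vm["cores"] > capacity:
                    notify_service_now("insufficient capacity for %s" % vm["name"])
                    continue
                if relaunch(vm):
                    capacity -= vm["cores"]
                else:
                    notify_service_now("relaunch failed for %s" % vm["name"])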

FCC and GCC
(Figure showing the two buildings.) The FCC and GCC buildings are separated by approximately 1 mile (1.6 km). FCC has UPS and generator; GCC has UPS.

Distributed Network Core Provides Redundant Connectivity
(Diagram) The distributed network core is built from Nexus 7010 switches in GCC-A, GCC-B, FCC-2, and FCC-3, connected by private networks over dedicated fiber. It provides a 20 Gigabit/s L3 routed network, an 80 Gigabit/s L2 switched network, and 40 Gigabit/s L2 switched networks linking the robotic tape libraries (4 and 3), disk servers, grid worker nodes, FermiGrid, and FermiCloud. Note – intermediate-level switches and top-of-rack switches are not shown in this diagram. Deployment was completed in June 2012.

Distributed Shared File System
Design:
Dual-port FibreChannel HBA in each node,
Two Brocade SAN switches per rack,
Brocades linked rack-to-rack with dark fiber,
60 TB Nexsan SATABeast in FCC-3 and GCC-B,
Red Hat Clustering + CLVM + GFS2 used for the file system,
Each VM image is a file in the GFS2 file system,
LVM mirroring provides RAID 1 across buildings.
Benefits:
Fast launch – almost immediate, as compared to 3-4 minutes with ssh/scp,
Live migration – virtual machines can be moved from one host to another for scheduled maintenance, transparently to users (see the sketch below),
Persistent data volumes – can move quickly with machines,
Virtual machines can be relaunched in the surviving building in case of building failure/outage.
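Live migration on a KVM/libvirt host is typically driven with virsh migrate --live. A minimal sketch of scripting it follows, assuming shared storage visible to both hosts; the host and domain names are hypothetical and this is not FermiCloud's tooling.

    # Sketch: live-migrate a KVM guest between two hosts that share the same
    # GFS2-backed image storage. Hostnames and domain name are hypothetical.

    import subprocess

    def live_migrate(domain, dest_host):
        """Move a running guest to dest_host without interrupting it."""
        dest_uri = "qemu+ssh://%s/system" % dest_host
        subprocess.check_call(
            ["virsh", "migrate", "--live", "--persistent", domain, dest_uri])

    if __name__ == "__main__":
        live_migrate("fgtest-vm01", "fcl-host02.example.gov")  # hypothetical names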

FermiCloud – Network & SAN "Today"
(Diagram) Private Ethernet and Fibre Channel both run over dedicated fiber between the buildings. Nexus 7010 switches in FCC-2, GCC-A, FCC-3, and GCC-B interconnect the sites. Hosts fcl315 to fcl323 in FCC-3 attach through Brocade switches to a SATABeast; hosts fcl001 to fcl013 in GCC-B attach through Brocade switches to a SATABeast.

FermiCloud-HA Head Node Configuration
(Diagram) fcl001 (GCC-B) runs ONED/SCHED, fcl-ganglia2, fermicloudnis2, fermicloudrepo2, fclweb2, fcl-cobbler, fermicloudlog, fermicloudadmin, fcl-lvs2, and fcl-mysql2. fcl301 (FCC-3) runs ONED/SCHED, fcl-ganglia1, fermicloudnis1, fermicloudrepo1, fclweb1, fermicloudrsv, fcl-lvs1, and fcl-mysql1. The two head nodes are kept in sync via two-way rsync, live migration, multi-master MySQL, and CLVM/rgmanager.
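A minimal sketch of the "two-way rsync" piece between the head nodes follows, assuming SSH trust between them; the peer name and directory list are examples, and this is not the production synchronization script.

    # Sketch of a two-way rsync between the HA head nodes. Peer name and
    # paths are examples; update-only (-u) avoids clobbering newer files
    # on either side. Not the production FermiCloud synchronization script.

    import subprocess

    PEER = "fcl301"                                   # the other head node
    DIRS = ["/var/lib/one/", "/etc/one/"]             # example directories

    def two_way_sync(remote, path):
        # Push newer local files to the peer, then pull newer files back.
        subprocess.check_call(["rsync", "-au", path, "%s:%s" % (remote, path)])
        subprocess.check_call(["rsync", "-au", "%s:%s" % (remote, path), path])

    if __name__ == "__main__":
        for d in DIRS:
            two_way_sync(PEER, d)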

Security
Main areas of cloud security development:
Secure contextualization: secrets such as X.509 service certificates and Kerberos keytabs are not stored in virtual machines (see the following talk for more details).
X.509 authentication/authorization: X.509 authentication was written by T. Hesselroth; the code was submitted to and accepted by OpenNebula and has been publicly available since January.
Security policy: a security task force met and delivered a report to the Fermilab Computer Security Board, recommending the creation of a new Cloud Computing Environment, now in progress.
We also participated in the HEPiX Virtualisation Task Force; we respectfully disagree with the recommendations regarding VM endorsement.
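The idea behind secure contextualization is that the VM retrieves its secrets at boot from a protected repository instead of carrying them in the image. A minimal sketch under that assumption follows; the repository URL, authentication, and destination path are hypothetical, and FermiCloud's actual mechanism is described in the companion talk.

    # Sketch of the secure-contextualization idea: fetch a secret at boot from
    # a protected secrets repository instead of baking it into the VM image.
    # URL and destination are hypothetical; not the FermiCloud mechanism.

    import os
    import urllib.request

    SECRETS_URL = "https://secrets.example.gov/keytabs/myservice.keytab"  # hypothetical
    DEST = "/etc/krb5.keytab"

    def fetch_secret(url, dest):
        with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
            out.write(resp.read())
        os.chmod(dest, 0o600)    # keep the keytab readable by root only

    if __name__ == "__main__":
        fetch_secret(SECRETS_URL, DEST)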

Current FermiCloud Focus
Virtual provisioning:
"Grid Bursting"
-Extending our grid cluster into FermiCloud and Amazon EC2 via the KISTI vcluster utility.
"Cloud Bursting"
-Using the native OpenNebula burst feature.
-Using Condor and GlideinWMS.
Idle VM detection/suspension (see the sketch below).
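Idle-VM detection/suspension can be approximated by polling each VM's monitored CPU usage and suspending VMs that stay below a threshold. A minimal sketch against the OpenNebula CLI follows; the threshold and the column layout assumed when parsing onevm list output are assumptions (the layout varies by OpenNebula version), and this is not the FermiCloud implementation.

    # Sketch of idle-VM detection/suspension against an OpenNebula cloud.
    # Threshold and assumed column layout of `onevm list` are assumptions;
    # not the FermiCloud implementation.

    import subprocess

    IDLE_CPU_THRESHOLD = 5.0      # percent of a core; assumed cutoff for "idle"

    def list_vms():
        out = subprocess.check_output(["onevm", "list"]).decode()
        return [line.split() for line in out.strip().splitlines()[1:]]

    def suspend_if_idle(rows, id_col=0, stat_col=4, cpu_col=5):
        # Check the header of `onevm list` on your installation before
        # trusting these column indices.
        for row in rows:
            try:
                if row[stat_col] == "runn" and float(row[cpu_col]) < IDLE_CPU_THRESHOLD:
                    subprocess.check_call(["onevm", "suspend", row[id_col]])
            except (IndexError, ValueError):
                continue   # skip rows that do not parse cleanly

    if __name__ == "__main__":
        suspend_if_idle(list_vms())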

Current FermiCloud Focus 2
Interoperability:
User guides on how to convert VM images to and from FermiCloud/OpenNebula, VirtualBox, VMware, OpenStack, Amazon EC2, and Google Cloud.
API studies for EC2 compatibility.
Scientific workflows:
CMS is using the CERN High Level Trigger farm as a cloud, with GlideinWMS/Condor/OpenStack for reconstruction.
Replicate that success with the NOvA neutrino simulation workflow on Amazon EC2, OpenNebula, and OpenStack.
On-demand services:
Scalable batch submission and data movement services.
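Image conversion between those formats is commonly done with qemu-img convert; a minimal sketch follows (file names are hypothetical, and each cloud still needs its own registration/upload step, as covered in the user guides).

    # Sketch: convert a VM image between formats with qemu-img, as a first
    # step in moving images between clouds. File names are hypothetical.

    import subprocess

    def convert_image(src, dst, src_fmt="qcow2", dst_fmt="vmdk"):
        subprocess.check_call(
            ["qemu-img", "convert", "-f", src_fmt, "-O", dst_fmt, src, dst])

    if __name__ == "__main__":
        convert_image("sl6-worker.qcow2", "sl6-worker.vmdk")   # hypothetical files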

Reframing the Cloud Discussion
Purpose of Infrastructure-as-a-Service: on demand only? No; it is a whole new way to think about IT infrastructure, both internal and external.
The cloud API is just a part of rethinking IT infrastructure for data-intensive science (and MIS).
A cloud is only as good as the hardware and software it's built on; network fabric, storage, and applications are all crucial.
Buy or build? Both! We will always need some in-house capacity.
Performance hit? Most of it can be traced to badly written applications or a misconfigured OS.

FermiCloud Project Summary - 1
Science is directly and indirectly benefiting from FermiCloud: CDF, D0, Intensity Frontier, Cosmic Frontier, CMS, ATLAS, Open Science Grid, and more.
FermiCloud operates at the forefront of delivering cloud computing capabilities to support scientific research. By starting small, developing a list of requirements, and building on existing Grid knowledge and infrastructure to address those requirements, FermiCloud has managed to deliver a production-class Infrastructure-as-a-Service cloud computing capability that supports science at Fermilab.
FermiCloud has provided FermiGrid with an infrastructure that has allowed us to test Grid middleware at production scale prior to deployment.
The Open Science Grid software team used FermiCloud resources to support their RPM "refactoring" and is currently using it to support their ongoing middleware development/integration.

FermiCloud Project Summary - 2
The FermiCloud collaboration with KISTI has leveraged the resources and expertise of both institutions to achieve significant benefits.
vCluster has demonstrated proof-of-principle "Grid Bursting" using FermiCloud and Amazon EC2 resources.
Using SR-IOV drivers on FermiCloud virtual machines, MPI performance has been demonstrated to be >96% of the native "bare metal" performance.
The future is mostly cloudy.

Acknowledgements
None of this work could have been accomplished without:
The excellent support from other departments of the Fermilab Computing Sector, including Computing Facilities, Site Networking, and Logistics,
The excellent collaboration with the open-source communities, especially Scientific Linux and OpenNebula,
The excellent collaboration and contributions from KISTI,
Talented summer students from the Illinois Institute of Technology.