1
The FermiCloud Infrastructure-as-a-Service Facility
Steven C. Timm, FermiCloud Project Leader
Grid & Cloud Computing Department, Fermilab
Work supported by the U.S. Department of Energy under contract No. DE-AC02-07CH11359
2
FermiCloud Facility
Primary goal of the project: to deploy a production-quality Infrastructure-as-a-Service (IaaS) cloud computing capability in support of the Fermilab scientific program.
This section details the work we have done from proof-of-principle working code in 2010 to a production-class facility.
3
Outline
- Hardware and Operating System Stack
- OpenNebula Cloud System
- Distributed High Availability Infrastructure: Network Topology, SAN Topology, Head Nodes
- Security
- Web Interface
- Auxiliary Services: Monitoring, Accounting, ...
- Proposed Service Level Agreement
- Proposed Economic Model
- Production Readiness
- Conclusions
4
FermiCloud – Hardware Specifications
Currently 23 systems split across FCC-3 and GCC-B:
- 2 x 2.67 GHz Intel "Westmere" 4-core CPUs: 8 physical cores total, potentially 16 cores with Hyper-Threading (HT)
- 48 GBytes of memory (we started with 24 and upgraded to 48)
- 2 x 1 GBit Ethernet interfaces (1 public, 1 private)
- 8-port RAID controller
- 2 x 300 GBytes of high-speed local disk (15K RPM SAS)
- 6 x 2 TBytes = 12 TB raw of RAID SATA disk (~10 TB formatted)
- InfiniBand SysConnect II DDR HBA
- Brocade FibreChannel HBA (added in Fall 2011/Spring 2012)
- 2U SuperMicro chassis with redundant power supplies
Since August 2011: 10 machines in FCC-3 and 13 in GCC-B. (Aggregate capacity is sketched below.)
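The per-node figures above imply a nominal aggregate capacity. A quick back-of-the-envelope calculation, ignoring head-node and operational reserves that the slide does not break out:

```python
# Nominal FermiCloud aggregate capacity from the per-node specs above.
# These are raw multiplications, not usable capacity after reserves.
nodes = 23
physical_cores = nodes * 8          # 2 sockets x 4 cores
ht_cores = nodes * 16               # with Hyper-Threading enabled
memory_gb = nodes * 48
local_raid_tb_fmt = nodes * 10      # ~10 TB formatted local RAID per node

print(f"{physical_cores} physical cores ({ht_cores} HT), "
      f"{memory_gb} GB RAM, ~{local_raid_tb_fmt} TB formatted local RAID")
# -> 184 physical cores (368 HT), 1104 GB RAM, ~230 TB formatted local RAID
```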
5
OS/Hypervisor
- FCC-3 nodes: SLF6/KVM
- GCC-B nodes: SLF5/KVM (1 node reserved for Xen); transitioning node by node to SLF6
- Red Hat Clustering + Clustered LVM (CLVM) and GFS2 manage the shared file system
6
FermiCloud Typical VM Specifications
Unit:
- 1 virtual CPU [2.67 GHz "core" with Hyper-Threading (HT)]
- 2 GBytes of memory
- 10-20 GBytes of SAN-based "VM image" storage
- Additional ~20-50 GBytes of "transient" local storage
Additional CPU "cores", memory and storage are available for "purchase":
- Based on the (draft) FermiCloud Economic Model, raw VM costs are competitive with Amazon EC2
- FermiCloud VMs can be custom configured per "client"
- Access to Fermilab science datasets is much better than from Amazon EC2
7
FermiCloud – VM Format
Virtual machine images are stored so that they can be exported as a device:
- The image contains the / (root) file system plus a boot sector and partition table; it is not compressed
- Kernel and initrd are stored inside the image; it is possible to have both Xen and KVM kernels in the same VM image and run it under either hypervisor
- Secrets are not stored in the image
- Stock SLF5 and SLF6 images are provided to users
Image transfer by several mechanisms:
- Copy via SCP to the VM host and run from local disk
- Run from a shared file system such as GFS2/CLVM on the SAN; makes launch faster and allows live migration
- Distribute to all nodes via BitTorrent and launch via LVM quick copy-on-write (sketched below)
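A minimal sketch of the copy-on-write launch path: a writable LVM snapshot of a BitTorrent-distributed base image becomes the VM's disk. Volume-group and image names are hypothetical, and FermiCloud's actual provisioning is driven through OpenNebula transfer-manager scripts rather than standalone code like this.

```python
import subprocess

def cow_clone(vg, base_lv, vm_name, snap_size="20G"):
    """Create a writable copy-on-write snapshot of a base image volume.

    The snapshot only stores blocks the VM changes, so "cloning" a
    multi-GB image takes seconds instead of a full copy.
    """
    snap = f"{vm_name}-disk"
    subprocess.check_call([
        "lvcreate", "--snapshot",
        "--name", snap,
        "--size", snap_size,          # space reserved for changed blocks
        f"/dev/{vg}/{base_lv}",       # base image previously seeded via BitTorrent
    ])
    return f"/dev/{vg}/{snap}"        # hand this device to the hypervisor

# Example (hypothetical names):
# disk = cow_clone("fclvg", "slf6-base", "fcl-vm042")
```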
8
Typical Use Cases
- Public-net virtual machine: on the Fermilab network, open to the Internet; can access dCache and BlueArc mass storage; shared home directories between multiple VMs
- Public/private cluster: one gateway VM on the public/private net; cluster of many VMs on the private net
- Storage VM: VM with large non-persistent storage; used for large MySQL or Postgres databases, Lustre/Hadoop/Bestman/xRootd servers
9
OpenNebula
- OpenNebula 2.0 pilot system in GCC available to users since November 2010; began with 5 nodes and gradually expanded to 13 nodes; 4000 virtual machines run on the pilot system in 2+ years
- OpenNebula 3.2 production-quality system installed in FCC in June 2012, in advance of the GCC total power outage; now comprises 10 nodes
- Gradually transitioning virtual machines and users from the ONe 2.0 pilot system to the production system
10
Distributed Resilient Infrastructure
Why:
- Users benefit from increased uptime
- Operators benefit from resiliency and gain flexibility to schedule routine maintenance
How:
- Distribute hardware across multiple buildings and computer rooms
- Distributed network infrastructure
- Distributed shared file system on the SAN
- Distributed and redundant head nodes
11
Some Recent Major Facility and Network Outages
- FCC main breaker 4x (Feb, Oct 2010)
- FCC-1 network cuts 2x (Spring 2011)
- GCC-B load-shed events (June-Aug 2011); this accelerated the planned move of nodes to FCC-3
- GCC load-shed events and maintenance (July 2012); the FCC-3 cloud was ready just in time to keep server VMs up
- FCC-2 outage (Oct 2012); FermiCloud wasn't affected, our VMs stayed up
12
Service Outages on Commercial Clouds
Amazon has had several significant service outages over the past few years:
- Outage of storage services in April 2011 resulted in actual data loss
- An electrical storm that swept across the East Coast late Friday 29-Jun-2012 knocked out power at a Virginia data center run by Amazon Web Services
- An outage of one of Amazon's cloud computing data centers knocked out popular sites like Reddit, Foursquare, Pinterest and TMZ on Monday 22-Oct-2012
- Amazon outage affecting Netflix operations over Christmas 2012 and New Year's 2013
- Latest outage on Thursday 31-Jan-2013
Microsoft Azure: leap-day bug on 29-Feb-2012.
Google: outage on 26-Oct-2012.
13
FermiCloud – Fault Tolerance
As we have learned from FermiGrid, a distributed fault-tolerant infrastructure is highly desirable for production operations.
We are actively working on deploying the FermiCloud hardware resources in a fault-tolerant infrastructure:
- The physical systems are split across two buildings
- A fault-tolerant network infrastructure interconnects the two buildings
- We have deployed SAN hardware in both buildings
- We have a dual head-node configuration with heartbeat for failover
- We have GFS2 + CLVM for our multi-user file system and a distributed, replicated SAN
GOAL: if a building is "lost", automatically relaunch "24x7" VMs on the surviving infrastructure, then relaunch "9x5" VMs if there is sufficient remaining capacity (prioritization sketched below), and perform notification (via Service-Now) when exceptions are detected.
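A hypothetical sketch of the relaunch policy stated in the goal above, not FermiCloud's actual implementation: the SLA field names and the relaunch/notify hooks are assumptions, and a real system would drive OpenNebula and Service-Now.

```python
def relaunch_after_building_loss(lost_vms, free_slots, relaunch, notify):
    """Restart 24x7 VMs first, then 9x5 VMs while capacity lasts."""
    priority = {"24x7": 0, "9x5": 1, "opportunistic": 2}
    placed, skipped = [], []
    for vm in sorted(lost_vms, key=lambda v: priority[v["sla"]]):
        if vm["sla"] == "opportunistic":
            skipped.append(vm)            # opportunistic VMs are not relaunched
        elif free_slots >= vm["slots"]:
            relaunch(vm)                  # e.g. resubmit to OpenNebula
            free_slots -= vm["slots"]
            placed.append(vm)
        else:
            skipped.append(vm)
    if skipped:
        notify(f"{len(skipped)} VMs not relaunched after building loss")
    return placed, skipped
```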
14
FCC and GCC
[Map of the FCC and GCC buildings.]
The FCC and GCC buildings are separated by approximately 1 mile (1.6 km). FCC has UPS and generator. GCC has UPS.
15
Distributed Network Core Provides Redundant Connectivity
[Diagram: Nexus 7010 switches in FCC-2, FCC-3, GCC-A and GCC-B form a 20 Gigabit/s L3 routed network core, with 80 Gigabit/s and 40 Gigabit/s L2 switched networks connecting FermiGrid and FermiCloud nodes, grid worker nodes, disk servers and robotic tape libraries; private networks run over dedicated fiber. Intermediate-level and top-of-rack switches are not shown. Deployment completed in June 2012.]
16
Distributed Shared File System Design
- Dual-port FibreChannel HBA in each node
- Two Brocade SAN switches per rack
- Brocades linked rack-to-rack with dark fiber
- 60 TB Nexsan SATABeast in FCC-3 and GCC-B
- Red Hat Clustering + CLVM + GFS2 used for the file system
- Each VM image is a file in the GFS2 file system
- Qdisk for cluster quorum
- Next step: use LVM mirroring to do RAID 1 across buildings (a documented feature in the LVM manual; sketched below)
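A minimal sketch of the cross-building mirror plus GFS2 layering described above, issued here as shell commands through Python. Device paths, volume-group and cluster names, sizes and journal counts are made up, and a real clustered mirror also requires the CLVM/cmirror services that the slide's Red Hat Clustering stack provides.

```python
import subprocess

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

FCC_PV = "/dev/mapper/satabeast-fcc3"   # LUN from the FCC-3 SATABeast (hypothetical)
GCC_PV = "/dev/mapper/satabeast-gccb"   # LUN from the GCC-B SATABeast (hypothetical)

# RAID 1 across buildings: one mirror leg on each building's SAN LUN.
run("lvcreate", "--mirrors", "1", "-L", "2T", "-n", "vmimages",
    "fclvg", FCC_PV, GCC_PV)

# Cluster-aware file system on the mirrored volume: lock_dlm coordinates
# concurrent access from all hosts in the cluster.
run("mkfs.gfs2", "-p", "lock_dlm", "-t", "fclcluster:vmimages",
    "-j", "16", "/dev/fclvg/vmimages")
run("mount", "-t", "gfs2", "/dev/fclvg/vmimages", "/var/lib/one/images")
```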
17
FermiCloud – Network & SAN "Today"
[Diagram: Nexus 7010 switches in FCC-2, GCC-A, FCC-3 and GCC-B carry private Ethernet over dedicated fiber; Brocade FibreChannel switches and a SATABeast in each of FCC-3 (hosts fcl315 to fcl323) and GCC-B (hosts fcl001 to fcl013) carry Fibre Channel over dedicated fiber.]
18
FermiCloud – Network & SAN (Possible Future – FY2013/2014)
[Diagram: Fibre Channel fabric extended to all four computer rooms, with Brocade switches in FCC-2, GCC-A, FCC-3 and GCC-B; SATABeasts in FCC-3 (hosts fcl316 to fcl330) and GCC-B (hosts fcl001 to fcl015); and additional hosts (names to be assigned) in FCC-2 and GCC-A.]
19
Distributed Shared File System Benefits
- Fast launch: almost immediate, compared to 3-4 minutes with ssh/scp
- Live migration: can move virtual machines from one host to another for scheduled maintenance, transparently to users (see the sketch below)
- Persistent data volumes can move quickly with machines
- Once the mirrored volume is in place, virtual machines can be relaunched in the surviving building in case of a building failure/outage
- Head nodes have an independent shared file system based on GFS2/DRBD active/active
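In production this is driven through OpenNebula rather than called directly, but the underlying mechanism is KVM live migration over libvirt: because both hosts see the same image file on the shared GFS2 store, only memory and device state move. A minimal illustration with hypothetical host and VM names:

```python
import libvirt

src = libvirt.open("qemu:///system")                     # current host
dst = libvirt.open("qemu+ssh://fcl002.fnal.gov/system")  # maintenance/failover target

dom = src.lookupByName("one-1234")                       # running VM to move
flags = (libvirt.VIR_MIGRATE_LIVE
         | libvirt.VIR_MIGRATE_PERSIST_DEST
         | libvirt.VIR_MIGRATE_UNDEFINE_SOURCE)
dom.migrate(dst, flags, None, None, 0)                   # returns when cutover completes
```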
20
FermiCloud-HA Head Node Configuration
[Diagram: two head nodes, fcl001 (GCC-B) and fcl301 (FCC-3), each running ONED/SCHED plus one member of each replicated service pair: fcl-ganglia1/2, fermicloudnis1/2, fermicloudrepo1/2, fclweb1/2, fcl-lvs1/2, fcl-mysql1/2. Singleton services (fcl-cobbler, fermicloudlog, fermicloudadmin, fermicloudrsv) are split between the two nodes. The pair is kept in sync by 2-way rsync, live migration, multi-master MySQL and Pulse/Piranha heartbeat.]
21
Security Contributions
- Security policy
- Proposed Cloud Computing Environment
- X.509 authentication and authorization
- Secure contextualization
- Participation in the HEPiX Virtualisation taskforce
22
Cloud Computing Environment
The FermiCloud security taskforce recommended to the CSBoard/CST that a new Cloud Computing Environment be established; this is currently under preparation.
Normal FermiCloud use is authenticated by Fermi Kerberos credentials, either X.509 or MIT Kerberos or both.
Special concerns:
- Users have root
- Usage can be a combination of Grid usage (Open Science Environment) and interactive usage (General Computing Environment)
Planning for a "secure cloud" to handle expected use cases:
- Archival systems at old patch levels or legacy OS
- Data and code preservation systems
- Non-baselined OS (Ubuntu, CentOS, SUSE)
- Non-Kerberos services which can live only on the private net
23
OpenNebula Authentication
OpenNebula came with "pluggable" authentication, but few plugins were initially available. The OpenNebula 2.0 web services by default used an access key / secret key mechanism similar to Amazon EC2; no https was available.
Four ways to access OpenNebula:
- Command line tools
- Sunstone web GUI
- "ECONE" web service emulation of the Amazon RESTful (Query) API
- OCCI web service
Fermilab wrote X.509-based authentication plugins:
- Patches to OpenNebula to support this were developed at Fermilab and submitted back to the OpenNebula project in Fall 2011 (generally available in OpenNebula V3.2 onwards)
- X.509 plugins are available for command line and for web services authentication
In continued discussion with the Open Grid Forum and those who want to do X.509 authentication in OpenStack, trying to establish a standard.
24
X.509 Authentication – How It Works
Command line:
- The user creates an X.509-based token using the "oneuser login" command
- This makes a base64 hash of the user's proxy and certificate chain, combined with a username:expiration date, signed with the user's private key (illustrated below)
Web services:
- The web services daemon contacts the OpenNebula XML-RPC core on the user's behalf, using the host certificate to sign the authentication token
- Apache mod_ssl or gLite's GridSite is used to pass the grid certificate DN (and optionally FQAN) to the web services
Known current limitation: with web services, one DN can map to only one user.
See the next talk for X.509 authorization plans.
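An illustrative sketch of building such a signed login token in Python with the `cryptography` package. The exact field layout and signing scheme of the OpenNebula x509 plugin differ in detail, and the file paths and one-hour lifetime here are assumptions.

```python
import base64, time
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

username = "timm"                                         # hypothetical user
expiration = int(time.time()) + 3600                      # assumed 1-hour lifetime
cert_chain = open("/tmp/x509up_u500", "rb").read()        # proxy + certificate chain (PEM)
key = serialization.load_pem_private_key(
    open("/tmp/userkey.pem", "rb").read(), password=None)

# Combine chain with username:expiration, sign with the user's private key,
# then base64-encode the result as the login token.
payload = cert_chain + f":{username}:{expiration}".encode()
signature = key.sign(payload, padding.PKCS1v15(), hashes.SHA256())
token = base64.b64encode(payload + b"|" + signature).decode()
```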
25
Security: Incident Response
FermiCloud does not accept that the third-party VM signing advocated by the HEPiX Virtualisation Working Group is sufficient.
In the event of an incident on FermiCloud, the actions of the following individuals will be examined:
- The individual who transferred the VM image to FermiCloud (yes, we have logs)
- The individual who requested that the VM image be launched on FermiCloud (yes, we have logs)
If neither individual can provide an acceptable answer, then both individuals may lose FermiCloud access privileges.
Note that virtualization allows us to snapshot the memory and process stack of a running VM and capture all evidence.
26
Sunstone Web UI
27
Selecting a template
28
Launching the Virtual Machine
29
Monitoring VMs
30
Auxiliary Services
- Monitoring: usage monitoring, Nagios, Ganglia, RSV
- Accounting/Billing: Gratia
- Installation/Provisioning: Cobbler, Puppet
- Web server
- Secure secrets repositories
- Syslog forwarder
- NIS servers
- Dual MySQL database server (OpenNebula backend), LVS frontend
31
FermiCloud – Monitoring
FermiCloud Usage Monitor: http://fclweb.fnal.gov/metrics/fermicloud-usage.html
- Data collection dynamically "ping-pongs" across systems deployed in FCC and GCC to offer redundancy
- See the plot on the next page
We have deployed a production monitoring infrastructure based on Nagios, utilizing the OSG Resource Service Validation (RSV) scripts plus the check_mk agent, which verifies that:
- The hardware is up
- The OpenNebula services are running
- All 24x7 and 9x5 VMs that should be running are running
- Any critical services on those VMs are up
(A minimal check of this style is sketched below.)
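A minimal Nagios-style plugin illustrating the kind of check listed above: verify that the OpenNebula daemon answers on its XML-RPC port and exit with the standard Nagios codes. The real FermiCloud checks are built on RSV and check_mk; the host name and port 2633 here are assumptions for the sketch.

```python
#!/usr/bin/env python
# Nagios plugin convention: exit 0 = OK, 1 = WARNING, 2 = CRITICAL.
import socket
import sys

HOST, PORT = "fcl301.fnal.gov", 2633   # assumed oned XML-RPC endpoint

try:
    with socket.create_connection((HOST, PORT), timeout=5):
        print(f"OK - oned reachable on {HOST}:{PORT}")
        sys.exit(0)
except OSError as err:
    print(f"CRITICAL - cannot reach oned on {HOST}:{PORT}: {err}")
    sys.exit(2)
```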
32
FermiCloud Monitoring
33
FermiCloud – Accounting
We currently have two "probes" based on the Gratia accounting framework used by Fermilab and the Open Science Grid: https://twiki.grid.iu.edu/bin/view/Accounting/WebHome
- Standard process accounting ("psacct") probe: installed and runs within the virtual machine image, and on bare metal; reports to the standard gratia-fermi-psacct.fnal.gov collector
- OpenNebula Gratia accounting probe: runs on the OpenNebula management node, collects data from ONE logs, emits standard Gratia usage records, and reports to the "virtualization" Gratia collector; the "virtualization" collector runs the existing standard Gratia collector software (no development was required); the development of the OpenNebula Gratia accounting probe was performed by Tanya Levshina and Parag Mhashilkar
Additional Gratia accounting probes could be developed: commercial (OracleVM, VMware, ...) and open source (Nimbus, Eucalyptus, OpenStack, ...).
In contact with the Usage Record working group of OGF on a proposed cloud accounting usage record. (A schematic of the probe's job is sketched below.)
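A schematic sketch of what such a probe does: turn per-VM lifecycle data into usage records for a collector. The input tuple, record fields and element names below are simplified stand-ins, not the real Gratia probe API or record schema.

```python
import xml.etree.ElementTree as ET

def usage_record(vm_id, owner, wall_seconds, vcpus, host):
    """Build a simplified, OGF-UR-like record for one completed VM."""
    rec = ET.Element("UsageRecord")
    ET.SubElement(rec, "RecordId").text = f"fermicloud-one-{vm_id}"
    ET.SubElement(rec, "UserIdentity").text = owner
    ET.SubElement(rec, "WallDuration").text = str(wall_seconds)
    ET.SubElement(rec, "Processors").text = str(vcpus)
    ET.SubElement(rec, "Host").text = host
    return ET.tostring(rec, encoding="unicode")

# Example: one record per completed VM, as might be extracted from ONE logs.
print(usage_record(1234, "timm", 86400, 1, "fcl005.fnal.gov"))
```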
34
FermiCloud Accounting - 1
35
FermiCloud Accounting - 2
36
FermiCloud Service Availability
37
FermiCloud – InfiniBand & MPI
To enable HPC stakeholders, the FermiCloud hardware specifications included the Mellanox SysConnect II InfiniBand card, which Mellanox claimed supported virtualization via the InfiniBand SRIOV driver.
Unfortunately, despite promises from Mellanox prior to the purchase, we were unable to take delivery of the InfiniBand SRIOV driver through the standard sales support channels.
While at SuperComputing in Seattle in November 2011, Steve Timm, Amitoj Singh and Keith Chadwick met with the Mellanox engineers and were able to make arrangements to receive the InfiniBand SRIOV driver.
The driver was delivered to us in December 2011. Using it, we were able to make measurements comparing MPI on "bare metal" to MPI on KVM virtual machines on identical hardware, giving a direct measurement of the MPI "virtualization overhead". The next three slides detail the configurations and measurements, which were completed in March 2012.
38
FermiCloud "Bare Metal MPI"
[Diagram: two hosts, each with an InfiniBand card and its MPI process pinned to a CPU, connected through an InfiniBand switch.]
39
FermiCloud "Virtual MPI"
[Diagram: two hosts, each with an InfiniBand card and its VM pinned to a CPU, connected through an InfiniBand switch.]
40
MPI on FermiCloud (Note 1)

Configuration                     | #Host Systems | #VM/host | #CPU   | Total Physical CPU | HPL Benchmark (Gflops) | Gflops/Core
Bare Metal without pinning        | 2             | --       | 8      | 16                 | 13.9                   | 0.87
Bare Metal with pinning (Note 2)  | 2             | --       | 8      | 16                 | 24.5                   | 1.53
VM without pinning (Notes 2,3)    | 2             | 8        | 1 vCPU | 16                 | 8.2                    | 0.51
VM with pinning (Notes 2,3)       | 2             | 8        | 1 vCPU | 16                 | 17.5                   | 1.09
VM+SRIOV with pinning (Notes 2,4) | 2             | 7        | 2 vCPU | 14                 | 23.6                   | 1.69

Notes:
(1) Work performed by Dr. Hyunwoo Kim of KISTI in collaboration with Dr. Steven Timm of Fermilab.
(2) Process/virtual machine "pinned" to CPU and associated NUMA memory via use of numactl.
(3) Software-bridged virtual network using IP over IB (seen by the virtual machine as a virtual Ethernet).
(4) SRIOV driver presents native InfiniBand to the virtual machine(s); a 2nd virtual CPU is required to start SRIOV, but it is only a virtual CPU, not an actual physical CPU.
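Using the figures in the table above, the virtualization overhead falls out of a one-line calculation; the reading of "overhead" as total HPL throughput relative to pinned bare metal is our interpretation of the slide.

```python
# Gflops/core and overhead relative to pinned bare metal, from the table above.
results = {
    "Bare Metal with pinning": (24.5, 16),
    "VM with pinning (IP over IB)": (17.5, 16),
    "VM+SRIOV with pinning": (23.6, 14),
}
baseline_gflops, _ = results["Bare Metal with pinning"]
for config, (gflops, cores) in results.items():
    per_core = gflops / cores
    overhead = 1 - gflops / baseline_gflops
    print(f"{config}: {per_core:.2f} Gflops/core, {overhead:.1%} below bare metal")
# SRIOV recovers most of the loss: ~3.7% below bare metal vs ~28.6% for IP over IB.
```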
41
Service Level Agreements
- 24x7: virtual machine will be deployed on the FermiCloud infrastructure 24x7. Typical use case: production services.
- 9x5: virtual machine will be deployed on the FermiCloud infrastructure 8-5, M-F, and may be "suspended or shelved" at other times. Typical use case: software/middleware developer.
- Opportunistic: virtual machine may be deployed on the FermiCloud infrastructure providing that sufficient unallocated virtual machine "slots" are available, and may be "suspended or shelved" at any time. Typical use case: short-term computing needs.
- HyperThreading / No HyperThreading: virtual machine will be deployed on FermiCloud infrastructure that [has / does not have] HyperThreading enabled.
- Nights and Weekends: make FermiCloud resources (other than 24x7 SLA) available for "Grid Bursting".
42
FermiCloud Economic Model
Calculate rack cost:
- Rack, public Ethernet switch, private Ethernet switch, InfiniBand switch
- $11,000 USD (one time)
Calculate system cost:
- Based on a 4 year lifecycle
- $6,500 USD / 16 processors / 4 years => $125 USD / year
Calculate storage cost:
- 4 x FibreChannel switches, 2 x SATABeasts, 5 year lifecycle
- $130K USD / 60 TBytes / 5 years => $430 USD / TB-year
Calculate fully burdened system administrator cost:
- Current estimate is 400 systems per administrator
- $250K USD / year / 400 systems => $750 USD / system-year
43
FermiCloud Draft Economic Model Results (USD)

SLA                     | 24x7 | 9x5 | Opportunistic
"Unit" (HT CPU + 2 GB)  | $125 | $45 | $25

Add'l core: $125
Add'l memory per GB: $30
Add'l local disk per TB: $40
SAN disk per TB: $475
BlueArc per TB: $430
System administrator: $750
Specialized service support: "Market"

Note: costs in the above chart are per year. (A pricing example is sketched below.)
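As a worked example of how these rates compose, here is the annual cost of a hypothetical 24x7 VM with a few add-ons, priced straight from the draft table above; the VM configuration itself is made up.

```python
# Hypothetical 24x7 VM: 1 unit + 1 extra core + 2 GB extra memory + 1 TB SAN disk.
# Rates are the draft per-year figures from the table above (USD).
unit_24x7 = 125
extra_core = 125
extra_mem_per_gb = 30
san_disk_per_tb = 475

annual_cost = unit_24x7 + 1 * extra_core + 2 * extra_mem_per_gb + 1 * san_disk_per_tb
print(f"Annual cost: ${annual_cost}")   # -> Annual cost: $785
```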
44
FermiCloud Facility Summary - 1
- FermiCloud usage has continuously expanded as more machines have been made available to general users; we currently average 15.7 cores allocated out of the 16 HT cores available per machine
- More than 4000 virtual machines have already run under OpenNebula on FermiCloud
- FermiCloud already hosts scientific services (SAMGrid forwarders and Intensity Frontier GridFTP) at very high levels of performance and reliability
- These services have already utilized (transparently to the end-user community) the FermiCloud high-availability infrastructure
45
FermiCloud Facility Summary - 2
- FermiCloud replaced seven racks of obsolete test machines with 1½ racks of new machines; these racks support development, integration and production
- The FermiCloud Facility is ready for production: change management, Service Level Agreements, support rotation, monitoring and accounting are all in place or about to be in place
- The environment is flexible and reconfigurable in software: FermiCloud capabilities allow us to do tasks we couldn't do before
46
FermiCloud Facility Summary - 3
The FermiCloud Facility is economical to operate:
- Commodity hardware, open-source software, low personnel costs to support
- Costs comparable to or better than Amazon.com
- Access to datasets significantly better than from commercial clouds
Build vs. buy is a false dichotomy: heavy use of commercial clouds will increase the need for local facilities like FermiCloud to support users learning how to run their services/applications efficiently on the cloud.
FermiCloud operates at the forefront of delivering cloud computing capabilities to support physics research!
47
Thank You! Any Questions?
48
Extra Slides Are in the “Extra Slides” Presentation