FermiCloud Review: Response to Questions
Keith Chadwick, Steve Timm, Gabriele Garzoglio
Work supported by the U.S. Department of Energy under contract No. DE-AC02-07CH11359
Red Pill or Blue Pill?
6-Feb-2013 FermiCloud Review - Response to Questions
Question 1
Please provide a list of use cases (users, applications, required capabilities and capacities) for the immediate plans to move into production and projected as supportable with the existing resources. What additional resources might be needed in the next year based on other known use cases/users?
Current Production Summary

User VM (by SLA)              SLA      VM Quantity
Server VM                     24x7     12
Persistent Integration VM     9x5      42
Test VM                       Oppor.   129
User VM Sub Total                      183

Mgmt VM (by SLA)              SLA      VM Quantity
Auxiliary Services            24x7     7
Auxiliary Services            9x5      10
Auxiliary Services            Oppor.   0
Mgmt VM Sub Total                      17

Auxiliary Services are those that are required to manage FermiCloud (nagios, ganglia, mysql, lvs, etc.).

VM Grand Total / % subscription level: Existing FermiCloud 384
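The subtotals in this summary can be cross-checked with a short calculation. The sketch below is illustrative only: the grand total is derived here as the sum of the two subtotals, and reading the figure 384 as the capacity against which the subscription level is measured is an assumption, since the slide does not state this unambiguously. The names `ASSUMED_CAPACITY` and `subscription_percent` are hypothetical.

```python
# VM counts transcribed from the Current Production Summary slide.
user_vms = {
    "Server VM (24x7)": 12,
    "Persistent Integration VM (9x5)": 42,
    "Test VM (Oppor.)": 129,
}
mgmt_vms = {
    "Auxiliary Services (24x7)": 7,
    "Auxiliary Services (9x5)": 10,
    "Auxiliary Services (Oppor.)": 0,
}

# ASSUMPTION: 384 is the capacity that the subscription level is
# measured against. The slide only lists "Existing FermiCloud 384".
ASSUMED_CAPACITY = 384


def subscription_percent(total_vms: int, capacity: int) -> float:
    """Return total_vms as a percentage of capacity."""
    return 100.0 * total_vms / capacity


user_total = sum(user_vms.values())
mgmt_total = sum(mgmt_vms.values())
grand_total = user_total + mgmt_total
percent = subscription_percent(grand_total, ASSUMED_CAPACITY)

print(f"User VM Sub Total: {user_total}")
print(f"Mgmt VM Sub Total: {mgmt_total}")
print(f"VM Grand Total:    {grand_total}")
print(f"Subscription (assumed capacity {ASSUMED_CAPACITY}): {percent:.1f}%")
```

Under that assumption, the two subtotals of 183 and 17 give a grand total of 200 VMs, or roughly 52% of 384.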
Server VMs
- dzero (J. Boyd): 2 VMs, 4 cores, 120GB storage. SAMGRID forwarding nodes for LCG submission.
- IF (Votava): 9 VMs, 1 core. Intensity Frontier GridFTP servers (9 currently, 2 more requested).
- geant4 (Wenzel): 1 VM, 1 core. GEANT4 validation server.
Persistent Integration VMs
- geant4 (Wenzel): 1 VM. Development GEANT4 validation server.
- IRODS (Levshina): 2 VMs. IRODS server for OSG IRODS users.
- minos (Tagg): 1 VM. MINOS event display test.
- FGS (Timm): 1 VM. ITB Storage Element, FermiGrid ITB site.
- FGS (Timm): 7 VMs. FermiGrid Services stress testing (SAZ, GUMS, MySQL), Xen.
- FGS (Timm): 2 VMs. FermiGrid GUMS stress test.
- FGS (Timm): 2 VMs. FermiGrid KVM-based MySQL stress test.
- FGS (Sharma): 6 VMs. dCache test stand/OSG SW.
- FGS (Timm, Sharma): 9 VMs. SHA-2 testing.
- DOCS (Dykstra): 1 VM. Extenci project, test Lustre over WAN.
- DOCS (Levshina): 1 VM. VOMRS testing.
- DOCS (Dykstra): 1 VM. Gateway to 100GB ANI testbed.
Test VMs
Big categories:
- OSG Software Team,
- GlideinWMS Project,
- Gratia development/integration,
- CVMFS testing,
- FermiGrid “at scale” testing of Grid middleware (GUMS, SAZ, etc.).
(Known) Big Upcoming Use Cases
- KISTI joint project, grid bursting tests: virtual machines.
- Mike Wang, NFS v4.1 testing: he would like 25+ VMs.
- SHA2/IPv6 testing: could need 20+ VMs.
- These can be accommodated on existing hardware.
- It would be nice to have the idle VM detection/reclamation feature working.
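As an illustration of what idle VM detection might involve, the sketch below flags a VM as idle when every recent CPU utilization sample falls below a threshold. This is a minimal sketch under stated assumptions, not the FermiCloud implementation: the `VMUsage` record, the `find_idle_vms` function, the 5% threshold, the minimum sample count, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VMUsage:
    """Hypothetical record of recent CPU utilization samples for one VM."""
    name: str
    cpu_percent_samples: List[float]


def find_idle_vms(vms: List[VMUsage],
                  cpu_threshold: float = 5.0,
                  min_samples: int = 3) -> List[str]:
    """Return names of VMs whose samples are all below cpu_threshold.

    VMs with fewer than min_samples readings are never flagged, so a
    freshly started VM is not reclaimed before it has been observed.
    Both default values are illustrative assumptions.
    """
    idle = []
    for vm in vms:
        samples = vm.cpu_percent_samples
        if len(samples) >= min_samples and all(s < cpu_threshold for s in samples):
            idle.append(vm.name)
    return idle


# Made-up sample data for illustration only.
vms = [
    VMUsage("test-vm-a", [0.5, 1.0, 0.2]),   # consistently quiet
    VMUsage("test-vm-b", [0.5, 40.0, 0.3]),  # one busy sample
    VMUsage("test-vm-c", [0.1]),             # too few samples to judge
]
idle_names = find_idle_vms(vms)
print(idle_names)
```

A reclamation step would then act on the returned names, for example by notifying the owner before suspending the VM. CPU alone is a coarse signal, so a production version would likely also consider network and disk activity.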
Other Good Potential Use Cases
- Possible MPI use case: G. Lukhanin started a NOvA DAQ simulation in spring 2012 but didn’t finish. MPI was used as an isolated private net for heavy multicast activity.
- DMS Dept. dCache testing: they were stakeholders of the old FAPL cluster and haven’t had a chance to start FermiCloud work.
Run II Data Preservation
Cloud technology offers the possibility for:
- Legacy/unpatched OS,
- Dedicated private net (software defined network),
- Hosting database servers,
- Compute servers which are only booted on demand.
FermiCloud would need a new HW buy to absorb this capacity.
Data-intensive Science
- DES & Darkside50: high I/O per compute instruction, large data sets.
- These could be addressed in a cloud-like configuration, but we expect that we would need more hardware, particularly closely attached high performance storage.
Budget
- In the current FY2013 budget request, we have requested $28K + $90K (total $118K) in funding for additional FermiCloud “host” systems, locations TBD. FCC-3, FCC-2, GCC-A, or LCC are all possible locations.
- We also have another $64K in the FY2013 budget request that was targeted for GP Grid Worker nodes and could be reprogrammed.
- It is likely that similar expansion could be needed in FY2014, with possibilities for significant additional expansion depending on stakeholder requirements.
- In FY2015, the first set of 23 systems will reach 5 years in service and are likely candidates for retirement.
- As previously said, new stakeholders such as DES, Darkside, or Run II data preservation may require additional hardware acquisitions. If we know that they are coming, we can work with them to ensure that our planned hardware acquisitions address at least some of their needs.
Question 2
We would like to learn the process used to choose the list of the development features being proposed and how the prioritization is done based on the use case or other drivers.
List of Development Features
The list of development features is determined as a combination of the following constraints and input:
- Compliance with the policies of Fermilab,
- The judgment of the FermiCloud Project Management Team, based on their multiple years of supporting scientific computing,
- Input from our collaborators, users and stakeholders. Currently, this is gathered at the weekly project management meetings, where potential development topics and priorities are discussed [Examples – SAN, InfiniBand, 100G, authorization]. We believe that a dedicated (possibly monthly) stakeholder forum, separate from the project management meetings, would serve this need more inclusively.
- Based on this collective thinking, we believe that certain capabilities are going to be expected by our users [Example - Cloud bursting].
Prioritization of Development Items
The priority is determined as a combination of the following constraints and input:
- Compliance with the security policies of the open science and general computing environments [Example: X509 authentication],
- Operational needs of the administrative team [Examples: development of the X509 AuthZ call-out module to allow the central management of privileges; resource accounting; resource optimization via idle VM detection to prioritize VM survival through building downtimes and to implement off-hours Grid-bursting],
- Input from scientific stakeholders,
- Priorities of the major collaborators, including negotiations on what makes a collaboration possible / successful [Example: exchanging workloads with KISTI through a Cloud Federation],
- In addition, we want to develop capabilities that position us strategically for certain high-profile initiatives, such as data preservation [Example - accepting VMs as jobs to retain the computational environment without maintaining a central VM repository of all supported stakeholders],
- Finally, our priorities are informed by interactions with colleagues, program managers, task forces, standardization bodies, etc.
Development vs. Operational Effort
- The GCC Department is already organized into distinct operations and development groups:
  - Operations – FGS
  - Development – DOCS
- Members of each group work very closely with the other group to deliver solutions to our stakeholders.
- The GCC Department does have a couple of very capable “switch hitters” (Neha Sharma and Hyunwoo Kim) who have shown that they can support both development and operations.
- Later this month, we will lose Doug Strain (he has taken a position with Google), so we do have an opportunity to slightly rebalance the department personnel.
- At the present time we are proposing that the replacement have more exposure to Cloud computing (Doug’s effort was split across OSG Storage and GlideinWMS).
- We could consider tasking this replacement towards operations.
Future FermiCloud Hardware
- >=64 cores,
- >=192 Gbytes of memory,
- Fibre Channel HBA,
- RAID card with >=8 high speed disks,
- Possible 10 Gb/s interface,
- 2U chassis.
Summary
- Today, the FermiCloud project is driven by the “best judgment” of the Grid & Cloud Computing Department Management, coupled with those stakeholders that attend the weekly FermiCloud project meeting.
- Allowing the FermiCloud project to formally engage the set of FermiCloud Stakeholders to collect requests and recommendations would greatly improve this state of affairs.
- If we are going to be part of the Run II Data Preservation efforts, they will expect (and we would agree to) significant input on our future plans.

“Give me a place to stand, and I will move the Earth.” - Archimedes
Cast of Characters
- The Earth – Science
- The Lever – Virtualization
- The Clouds – Cloud Computing
- Archimedes – Fermilab Scientists
- The Fulcrum – FermiCloud and the GCC Department
Thank You
Any Questions?