Managing Scale and Complexity of Next Generation HPC Systems and Clouds
Peter ffoulkes, Vice President of Marketing
April 2011
The World’s Most Capable Computing Systems Are Powered by Moab
- The world’s largest HPC system, No. 2-ranked Jaguar, with over 18,500 nodes, 224,000 cores and a speed of 1.75 petaflop/s
- Half of the top 10 systems
- Over one third of the top 50 systems (17 systems)
- 38% of the compute cores in the top 100 systems
Rank and site:
 2  Oak Ridge National Laboratory
 5  Lawrence Berkeley National Lab
 7  Los Alamos National Laboratory
 8  University of Tennessee
10  Los Alamos National Laboratory
12  Lawrence Livermore National Lab
14  Sandia National Laboratories
16  Lawrence Livermore National Lab
23  Forschungszentrum Jülich
26  Lawrence Berkeley National Lab
30  Oak Ridge National Laboratory
31  Sandia National Laboratories
32  NOAA / Oak Ridge National Lab
39  SciNet, University of Toronto
Source: The Top 500 List, Nov 2010 rankings
Oak Ridge National Laboratory
- Jaguar, the second most capable HPC system in the world, running at 1.75 petaflop/s
- 18,686 nodes, 224,256 processing cores, 300TB of memory
- Diversity of users was severely limiting system workload-management capability
- Moab resolved Jaguar’s workload management problems, increased system utilization, decreased downtime, and allowed more control over resources
What’s Next…
Tsubame 2.0
- 2,816 (6 core) CPUs (16,896 cores) combined with 4,224 of NVIDIA’s Tesla M2050 (448 core) general-purpose GPUs
- Dual-rail, non-blocking fabric employing two Voltaire 40 Gb/s InfiniBand connections on each node
Tianhe-1A
- 3,600 nodes, 14,336 (6 core) CPUs, 7,168 (448 core) GPUs: 86,000 general-purpose cores and 7,168 GPUs
- 160 Gbit/s Galaxy interconnect developed in China
Managing Scale and Complexity
Moab 6.0
- A new command communication architecture that delivers a 100-fold increase in internal communications throughput
- Support for the most commonly used Moab commands and grids deploying multiple Moab instances, dramatically increasing the manageability of complex supercomputing environments
- Support for hybrid installations deploying GPGPU technologies in conjunction with TORQUE
Managing GPGPUs
Moab 6.0 and TORQUE
- Specify GPGPUs in the same manner as CPUs
- GPGPUs are requested as a defined resource
- Applications receive indexed GPGPU information about which GPGPU(s) to access
- Moab’s intelligent scheduling ensures GPGPUs never get oversubscribed
- GPGPU usage is recorded in utilization reports
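As an illustration of how an application might consume the indexed GPGPU information, here is a minimal Python sketch. It assumes, and the slides do not state this, that the resource manager exposes the assignment through a file named by a PBS_GPUFILE environment variable, with one entry per line of the form hostname-gpuN. The function names are hypothetical and are not part of Moab or TORQUE.

```python
import os


def parse_gpu_file(text):
    """Map each hostname to the list of GPU indices assigned on it.

    Assumed entry format (not confirmed by the slides): 'hostname-gpuN'.
    """
    assigned = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last '-gpu' so hostnames containing dashes still work.
        host, sep, index = line.rpartition("-gpu")
        if not sep or not index.isdigit():
            raise ValueError(f"unrecognized GPU entry: {line!r}")
        assigned.setdefault(host, []).append(int(index))
    return assigned


def gpus_for_this_job(environ=os.environ):
    """Read the assignment file named by PBS_GPUFILE (assumed variable name)."""
    path = environ.get("PBS_GPUFILE")
    if path is None:
        return {}  # not running under a GPU-aware job
    with open(path) as handle:
        return parse_gpu_file(handle.read())
```

An application would then restrict itself to the returned indices on its own host, which is what keeps two jobs from contending for the same device.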
Managing Scale and Complexity
Moab 6.0
- New on-demand dynamic provisioning and management capabilities that support both virtual and physical resources, including VM migration for load balancing, workload packing and consolidation
- Idle-resource management to deliver increased utilization, efficiency and energy conservation for HPC and enterprise cloud deployments
- Improved administration and reporting, including new parameterized administration functions; enhanced limits for event, group and account management; and new formats for job and reservation event reporting
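To make "workload packing and consolidation" concrete, the sketch below uses first-fit decreasing, a standard bin-packing heuristic. It is a generic illustration under simplifying assumptions (a single integer load per VM, identical hosts) and is not a description of Moab's actual placement policy. All names are hypothetical.

```python
def pack_first_fit_decreasing(vm_loads, host_capacity):
    """Place VMs on as few hosts as the heuristic finds.

    vm_loads: dict of VM name -> integer load (e.g. cores requested).
    host_capacity: integer capacity of every host (assumed identical).
    Returns a list of hosts, each a list of VM names in placement order.
    """
    hosts = []  # each entry is [remaining_capacity, [vm names]]
    # Largest VMs first; ties broken by name so the result is deterministic.
    for name, load in sorted(vm_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        if load > host_capacity:
            raise ValueError(f"{name} does not fit on any host")
        for host in hosts:
            if host[0] >= load:
                host[0] -= load
                host[1].append(name)
                break
        else:
            # No powered-on host has room, so bring another one into use.
            hosts.append([host_capacity - load, [name]])
    return [names for _, names in hosts]
```

Hosts that end up with no VMs after such a pass are the candidates for idle-resource management.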
Managing Scale and Complexity
Moab Viewpoint 2.0
- HPC as a service and HPC cloud capability: creation, management and status reporting of reservations and job queues for HPC and batch workloads and system maintenance
- On-demand dynamic management of VMs and physical nodes
- Increased scalability to support management of tens of thousands of nodes and hundreds of thousands of VMs
- Flexible security options at installation, including built-in security, single sign-on (SSO), or Lightweight Directory Access Protocol (LDAP) models
- Service-based administration and reporting for easy access and management of HPC and cloud resources
University of Cambridge: COSMOS
Overview
COSMOS has expanded:
- New SGI Altix UV1000, 6-core Nehalem-EX chips, 768 cores, 2TB of global shared memory
- Existing SGI Altix 4700, 920 cores and 2.5TB RAM
- Both compute systems are supported by 64TB of high performance storage
Challenge
Managing both cluster-based workloads and SMP shared memory workloads in the same environment
University of Birmingham
Overview
The University of Birmingham’s 1,500-core cluster runs a mixed workload, from many (often hundreds of) short single-core parameter-sweep jobs to massively parallel multi-core computations, some running for over a week.
Challenges
The workload is variable, especially at different times of the year, and keeping the whole cluster powered up during less busy periods wastes power. The power-management system must be aware of the scheduled as well as the active workload, so that resources are always available when required without power being wasted.
Solution
Moab Adaptive HPC Suite™
Results
An annual saving of about 10% of the current power costs, amounting to £50,000, from powering off nodes that are not in use. Further savings from ancillary supplies, especially the air conditioning in the datacentre, are expected.
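The requirement to consider scheduled as well as active workload can be sketched in a few lines of Python. This is a simplified illustration, not the Moab Adaptive HPC Suite logic: it assumes the scheduler can report how many nodes are needed by jobs due to start within some look-ahead window, and the function and parameter names are hypothetical.

```python
def nodes_to_power_off(idle_nodes, scheduled_demand, spare=0):
    """Choose idle nodes that can be powered off safely.

    idle_nodes: names of nodes with no active job.
    scheduled_demand: nodes needed by jobs due to start soon
        (assumed to come from the scheduler's reservations).
    spare: extra idle nodes kept powered as a safety margin.
    """
    # Keep enough idle nodes powered to cover upcoming jobs plus the margin.
    keep = min(len(idle_nodes), scheduled_demand + spare)
    # Sorting makes the choice deterministic; a real policy might instead
    # prefer nodes by rack, boot time or hardware age.
    return sorted(idle_nodes)[keep:]
```

With four idle nodes, one node of upcoming demand and one spare, the sketch keeps two nodes on and powers off the other two; if upcoming demand exceeds the idle pool, nothing is powered off.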
What’s in a cloud: Vapor-ware or silver lining?
Is the Future of Computing Clear or is it Obscured by Clouds?
National Institute of Standards and Technology: “Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model promotes availability and is composed of five essential characteristics, three service models, and four deployment models.”
Essential Characteristics: On-demand self-service, Broad network access, Resource pooling, Rapid elasticity, Measured Service.
Service Models: Cloud Software as a Service (SaaS), Cloud Platform as a Service (PaaS), Cloud Infrastructure as a Service (IaaS).
Deployment Models: Private cloud, Community cloud, Public cloud, Hybrid cloud.
Note: Cloud software takes full advantage of the cloud paradigm by being service oriented with a focus on statelessness, low coupling, modularity, and semantic interoperability.
Source: The NIST Definition of Cloud Computing.
Three Essential Cloud Characteristics
- Agile: Delivers business services rapidly, efficiently and successfully
- Automated: Eliminates human error, enables scaling and capacity, reduces management complexity and cost
- Adaptive: Anticipates and adapts intelligently to dynamic business service needs and conditions
SciNet: University of Toronto
Solution
- Energy-aware, stateless, on-demand multi-OS provisioning
- Moab Adaptive HPC Suite™ and xCAT provisioning software
- 4,000-server supercomputer system with 30,000 Intel Xeon 5500 cores and a theoretical peak of 306 TFlops
Results
- A state-of-the-art data center that saves enough energy to power more than 700 homes yearly.
- On-demand provisioning allows users to make their OS choice part of their automated job template. SciNet always has several different flavors of Linux running simultaneously.
“Why should we pay for cooling when it’s so cold outside? Toronto is pretty cold for at least half of the year. We could have bought a humongous pile of cheap x86 boxes but couldn’t power, maintain or operate them in any logical way.”
Dr. Daniel Gruner, PhD, chief technology officer of software for SciNet.
A Global Bank based in the USA
Who: Top 3 financial services company
What: Moab - Automation Intelligence Manager will manage 80-90% of workloads (10,000+ applications on more than 100,000 servers across more than 10 datacenters)
Use Case: IaaS, PaaS, AaaS, using Workload-Driven Cloud 2.0
Objective: Increase agility, reduce risk and save over $1 billion in 3 years.