Creating a Dynamic HPC Infrastructure with Platform Computing
Chris Porter
Sept. 8th, 2011
The Business Problems
Rigidity: Workload requirements are dynamic and only becoming more so
Inefficiency: Unbalanced utilization of distributed resources (hot and cold spots)
Incapacity: Peak demand occasionally exceeds local AND total supply of resources
Platform’s Roadmap to HPC in the Cloud
Efficiency, Flexibility, Service Levels
Public Cloud
5. Burst to external providers when needed
Private Cloud
Grid
4. Make infrastructure dynamic within a shared private cloud
Coming soon!
Cluster
3. Make infrastructure dynamic
Platform LSF + Adaptive Cluster
2. Employ effective resource management and sharing
Platform LSF + Adaptive Cluster
1. Harness the power of commodity clusters
We Are Here
Time
Typical Solutions
Static infrastructure = trade-offs
[Chart: Utilization plotted against Service Level, with five typical solutions]

Static HPC Capacity
Effective sharing and high utilization, but simply more demand than supply
The Problem? Capacity constrained by the static dedicated HPC resources

Large Job Starvation
A shared cluster with diverse sizes of jobs
The Problem? Resource fragmentation leading to large job starvation

Application Stack Silos
A single cluster with smart sharing, but with applications that require specific environments
The Problem? Better still, but capacity is constrained by the static application stacks and OS versions

Queue Sprawl - Multiple Host Groups
A single cluster with partitioned resources and dedicated queues based on OS and possibly application
The Problem? Better, but still expensive
Under-utilized, limited sharing
High administrative costs due to “queue sprawl”

Cluster Sprawl - Multiple Static Clusters
To ensure acceptable service levels, each department has their own cluster
The Problem? Costly, under-utilized silos that are expensive to manage and maintain
Introducing Platform Adaptive Cluster
[Chart: Utilization plotted against Service Levels / QoS]
With Platform Adaptive Cluster, you can avoid this trade-off
Delivers both better economic value and service levels!
Platform Adaptive Cluster
Transform a static cluster into a fully dynamic cloud environment
Reduce complexity of user environment
Control resource allocation policies at the group level
Benefit from a mature, stable HPC cloud product
Increase user service level
Reduce cost & save power
Increase resource sharing & utilization
Redeploy servers quickly & efficiently
Achieve self-service for users
Allows the environment to flex as requirements change over time
Automate administration
Create a Dynamic HPC Environment
Platform LSF + Adaptive Cluster

Dynamic Provisioning of OS
[Diagram: hosts running RHEL 5.5 and RHEL 4.8]
Workload out of balance with OS provisioning
Multi-boot or use over the wire provisioning. Job requirements are driving the provisioning requirements.

Memory Consolidation
[Diagram: Big Mem Job waiting on hosts running RHEL 5.5 and RHEL 4.8]
Large memory jobs starved by running jobs
Have plenty of memory available, but don’t have enough available memory on one server. Use job containers to move the smaller jobs.
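The memory consolidation idea can be illustrated with a toy model. The sketch below is not Platform LSF or Adaptive Cluster code; the `Host` class, the `consolidate_for` function, the host names, and the greedy policy are all hypothetical, chosen only to show how moving small jobs can open up enough memory on a single server for a large job.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical host model: total memory plus the jobs running on it."""
    name: str
    total_gb: int
    jobs: dict = field(default_factory=dict)  # job name -> memory in GB

    @property
    def free_gb(self):
        return self.total_gb - sum(self.jobs.values())


def consolidate_for(hosts, need_gb):
    """Plan migrations that free need_gb on a single host.

    Returns (target_host_name, moves), where each move is
    (job, from_host, to_host), or None if no plan is found.
    The hosts themselves are not modified.
    """
    # Try hosts with the most free memory first: fewer jobs need to move.
    for target in sorted(hosts, key=lambda h: h.free_gb, reverse=True):
        if target.total_gb < need_gb:
            continue
        others = [h for h in hosts if h is not target]
        spare = {h.name: h.free_gb for h in others}
        freed = target.free_gb
        moves = []
        # Move the largest jobs on the target first.
        for job, mem in sorted(target.jobs.items(), key=lambda kv: -kv[1]):
            if freed >= need_gb:
                break
            dest = next((h for h in others if spare[h.name] >= mem), None)
            if dest is None:
                continue  # this job fits nowhere else; try the next one
            spare[dest.name] -= mem
            freed += mem
            moves.append((job, target.name, dest.name))
        if freed >= need_gb:
            return target.name, moves
    return None


hosts = [
    Host("A", 64, {"a1": 16, "a2": 16}),
    Host("B", 64, {"b1": 24}),
    Host("C", 64, {"c1": 30}),
]
plan = consolidate_for(hosts, need_gb=60)
print(plan)
```

In this example 106 GB is free across the three hosts, but no single host has more than 40 GB free, so a 60 GB job cannot start. The plan moves job `b1` from host B to host A, which leaves 64 GB free on B.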
Use Case: Large EDA Customer

Problem
Service level requirements for high priority workload conflict with the need to keep utilization high
Resources are reserved for critical workload
When there is no priority workload, utilization is low

Alternatives
Use reserved resources for other work and force priority workload to wait
Preempt low priority workload: kill it, or make priority workload wait for it

Solution: Platform LSF + Platform Adaptive Cluster
Use reserved infrastructure for low priority workload
When priority workload arrives, migrate low priority workload to other resources
Utilization is high, no workload is starved or lost
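The policy in this use case can be sketched as a small toy scheduler. This is an illustration only, not the Platform LSF or Adaptive Cluster interface; the `admit_priority` function, the one-slot-per-host model, and the host and job names are assumptions made for the example.

```python
def admit_priority(reserved, other, job_name):
    """Place a priority job on reserved hosts, migrating low priority work.

    reserved, other: dict of host name -> (job name, is_priority) or None.
    Each host holds one job (a simplifying assumption).
    Returns {"host": ..., "migrated": ...} or None if the job must wait.
    Both dicts are updated in place.
    """
    # 1. Use an idle reserved host if there is one.
    for host, running in reserved.items():
        if running is None:
            reserved[host] = (job_name, True)
            return {"host": host, "migrated": None}

    # 2. Otherwise move a low priority job from a reserved host
    #    to a free host elsewhere, instead of killing it.
    dest = next((h for h, r in other.items() if r is None), None)
    victim = next((h for h, r in reserved.items() if not r[1]), None)
    if dest is None or victim is None:
        return None  # nowhere to migrate, or reserved hosts are all priority

    moved_job = reserved[victim]
    other[dest] = moved_job
    reserved[victim] = (job_name, True)
    return {"host": victim, "migrated": (moved_job[0], victim, dest)}


reserved = {"r1": ("regress-17", False), "r2": ("tapeout-1", True)}
other = {"g1": ("sim-3", False), "g2": None}
result = admit_priority(reserved, other, "tapeout-2")
print(result)
```

Here the low priority job `regress-17` keeps the reserved host `r1` busy until the priority job `tapeout-2` arrives, then continues on `g2`, so the reserved host is neither idle nor is any work lost.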
Any Questions?
Chris Porter
cporter@platform.com