Scaling and Packing on a Chip Multiprocessor
Vincent W. Freeh, Tyler K. Bletsch, Freeman L. Rawson, III
Austin Research Laboratory
Introduction
Goal: save power without a performance hit
Dynamic Voltage and Frequency Scaling (DVFS)
–Slow down the CPU
–Linear speed loss, quadratic CPU power drop
–Efficient, but limited range
–A number of fixed p-states
CPU Packing
–Run a workload on fewer CPU cores
–Linear speed loss, linear CPU power drop
–Less efficient, but greater range
–A number of fixed configurations
Using both? (see the sketch below)
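To make the stated tradeoffs concrete, here is a minimal first-order sketch (not from the talk): it simply assumes, as above, that scaling loses speed linearly while cutting CPU power roughly quadratically, and that packing loses speed and CPU power both linearly. The 95W peak-power figure is an illustrative assumption.

    # Hypothetical first-order model of the scaling-vs-packing tradeoff.
    # All constants are illustrative assumptions, not measurements.

    def scaled(freq_ghz, f_max=1.8, p_max=95.0):
        """Relative speed and CPU power at a reduced frequency:
        speed falls linearly with f, power roughly quadratically."""
        s = freq_ghz / f_max
        return s, p_max * s ** 2

    def packed(cores, cores_max=4, p_max=95.0):
        """Relative speed and CPU power with fewer active cores:
        for an embarrassingly parallel job, both fall linearly."""
        s = cores / cores_max
        return s, p_max * s

    for f in (1.8, 1.4, 1.0):
        speed, power = scaled(f)
        print(f"scale to {f} GHz : {speed:.2f}x speed, {power:.0f} W CPU power")
    for c in (4, 2, 1):
        speed, power = packed(c)
        print(f"pack to {c} core(s): {speed:.2f}x speed, {power:.0f} W CPU power")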
Hardware architecture
16 nodes: complete systems, each with:
–2 CPU sockets per node (physical dies)
–2 cores per socket (4 total cores per node)
4-level memory hierarchy:
–L1 & L2 cache: per-core
–Local memory: per-socket
–Remote memory: accessible via the HyperTransport bus
[Diagram: one node; each socket holds two AMD64 cores with private L1 instruction, L1 data, and L2 caches, plus 1GB of local memory per socket, linked by HyperTransport. Socket 0 has cores 0 and 1; socket 1 has cores 2 and 3.]
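As a side note (not from the slides), on a Linux node like this the socket-to-core layout can be recovered from sysfs; the sketch below assumes the standard /sys/devices/system/cpu/*/topology files are present and simply groups logical CPUs by physical package.

    # Hypothetical topology probe for a node like the one above.
    # Assumes Linux sysfs topology files; paths differ on other systems.
    import glob, os
    from collections import defaultdict

    def cpu_topology():
        """Group logical CPU ids by physical package (socket)."""
        sockets = defaultdict(list)
        for cpu_dir in glob.glob("/sys/devices/system/cpu/cpu[0-9]*"):
            cpu = int(os.path.basename(cpu_dir)[3:])      # "cpu2" -> 2
            with open(os.path.join(cpu_dir, "topology", "physical_package_id")) as f:
                sockets[int(f.read())].append(cpu)
        return {s: sorted(cpus) for s, cpus in sockets.items()}

    print(cpu_topology())   # e.g. {0: [0, 1], 1: [2, 3]} on a 2-socket, 4-core node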
P-states and configurations
Scaling:
–The entire socket must scale together
–5 p-states: every 200MHz from 1.8GHz down to 1.0GHz
Packing:
–5 configurations:
  All four cores: ×4
  Three cores: ×3
  Cores 0 and 1 (same socket): ×2
  Cores 0 and 2 (one per socket): ×2*
  One core: ×1
–For multi-node tests, prepend the number of nodes
  4×2: 4 nodes, cores 0 and 1 active, 8 total cores
–Packing results "simulate" a full socket shutdown (subtract 20W)
[Diagram: node layout showing sockets, cores, memory, and the HyperTransport link.]
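For illustration only, here is one way these knobs could be exercised on a Linux box; it assumes the cpufreq "userspace" governor is available (so scaling_setspeed is writable, which requires root) and uses CPU affinity to restrict a job to a configuration's cores. This is a sketch, not the authors' measurement harness.

    # Hypothetical helpers for applying a p-state and a packing configuration.
    # Assumes Linux cpufreq with the "userspace" governor; needs root.
    import os

    def set_pstate(cpu, khz):
        """Pin one core to a fixed frequency in kHz (1800000 .. 1000000)."""
        base = f"/sys/devices/system/cpu/cpu{cpu}/cpufreq"
        with open(f"{base}/scaling_governor", "w") as f:
            f.write("userspace")
        with open(f"{base}/scaling_setspeed", "w") as f:
            f.write(str(khz))

    # Packing configurations from this slide: name -> active core ids
    CONFIGS = {"x4": {0, 1, 2, 3}, "x3": {0, 1, 2},
               "x2": {0, 1}, "x2*": {0, 2}, "x1": {0}}

    def pack(pid, config):
        """Restrict a process to the configuration's cores."""
        os.sched_setaffinity(pid, CONFIGS[config])

    # Example: all cores at 1.4GHz, current process packed onto cores 0 and 2
    # for cpu in range(4): set_pstate(cpu, 1400000)
    # pack(os.getpid(), "x2*")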
Three application classes
CPU-bound
–No communication, fits in cache
–100% CPU utilization
–Similar to while(1){}
High-Performance Computing (HPC)
–Inter-node communication
–Significant memory usage
–Performance = execution time
Commercial
–Constant servicing of remote requests
–Possibly significant memory usage
–Performance = throughput
(1) CPU-bound workloads
Workload:
–DAXPY: a small linear algebra kernel (y ← a·x + y)
–Representative of the entire class
Scaling:
–Linear slowdown
–Quadratic power cut
Packing:
–×4 is most efficient
–×2* is no good here
–×3 is right out
–Single-socket configs ×1 and ×2 save power, but kill performance
[Plot: throughput vs. power (W) across the different p-states and packing configurations.]
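For reference, DAXPY ("double-precision a·x plus y") is the vector update y ← a·x + y. A minimal Python stand-in for such a kernel is sketched below; the actual benchmark behind these results is presumably a compiled loop, not this code.

    # Minimal DAXPY-style kernel: y <- a*x + y. The vectors are kept small
    # so the working set stays in cache, matching the CPU-bound class above.
    import numpy as np

    def daxpy(a, x, y):
        y += a * x            # elementwise multiply-add, in place
        return y

    n = 4096                  # small enough to stay cache-resident
    x = np.random.rand(n)
    y = np.random.rand(n)
    for _ in range(100_000):  # spin on the kernel: ~100% CPU, no I/O
        daxpy(2.0, x, y)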
(2) HPC workloads: packing with fixed nodes
[Plots: power, energy, EDP, and execution time. Callouts: "×2* has no effect", "LU ×2* speedup", "CG slowdown", "CG CPU utilization falls".]
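The energy-delay product (EDP) used in these plots weighs energy against slowdown: with average power P and execution time T, energy is E = P·T and EDP is E·T = P·T². A tiny helper (the numbers in the comments are made up) shows how a run can save energy yet lose on EDP if it runs long enough.

    # Metrics used on these slides: energy and energy-delay product (EDP).
    def metrics(avg_power_w, time_s):
        energy_j = avg_power_w * time_s    # E = P * T
        edp = energy_j * time_s            # EDP = E * T = P * T^2
        return energy_j, edp

    # Illustrative: cutting power 300W -> 250W but running 10% longer
    # metrics(300, 100) -> (30000 J, 3.000e6 J*s)
    # metrics(250, 110) -> (27500 J, 3.025e6 J*s)   less energy, worse EDP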
(2) HPC workloads: packing with fixed cores
[Plots: power, energy, EDP, and execution time.]
(3) Commercial workloads
Scale first, then pack
[Plot: power (W) vs. throughput (replies/second).]
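The "scale first, then pack" rule can be read as an ordering of knobs: exhaust the (more efficient) p-state range before shedding cores. A hypothetical controller sketch follows; the knob ordering mirrors the slide, but the mapping from load level to setting is invented for illustration.

    # Hypothetical "scale first, then pack" policy for a throughput-driven
    # workload. Knob ordering from the slide; load mapping is illustrative.
    P_STATES = [1800, 1600, 1400, 1200, 1000]   # MHz, fastest -> slowest
    CONFIGS  = ["x4", "x3", "x2", "x1"]         # most -> fewest active cores

    def choose_setting(load):
        """Pick a (p-state, packing config) for a load level in [0, 1]."""
        steps = [(f, CONFIGS[0]) for f in P_STATES]          # scale first ...
        steps += [(P_STATES[-1], c) for c in CONFIGS[1:]]    # ... then pack
        index = min(int((1 - load) * len(steps)), len(steps) - 1)
        return steps[index]

    print(choose_setting(1.0))   # full load  -> (1800, 'x4')
    print(choose_setting(0.1))   # light load -> (1000, 'x1')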
Conclusions
Packing less efficient than scaling
–Therefore: scale first, then pack
Nothing can help CPU-bound apps
Memory/IO-bound workloads are scalable
Resource utilization affects (predicts?) the effectiveness of scaling and packing
Business workloads can benefit from scaling/packing
–Especially at low utilization levels
Future work
How does resource utilization influence the effectiveness of scaling/packing?
–A predictive model based on resource usage?
–A power management engine based on resource usage?
Dynamic packing
–Virtualization allows live migration
–Can this be used to do packing on the fly?