1 Scaling and Packing on a Chip Multiprocessor
Vincent W. Freeh, Tyler K. Bletsch, Freeman L. Rawson, III
Austin Research Laboratory

2 Introduction
Goal: save power without a performance hit.
Dynamic Voltage and Frequency Scaling (DVFS):
– Slows down the CPU
– Linear speed loss, quadratic CPU power drop
– Efficient, but limited range
– A number of fixed p-states
CPU Packing:
– Runs a workload on fewer CPU cores
– Linear speed loss, linear CPU power drop
– Less efficient, but greater range
– A number of fixed configurations
Can the two be used together?
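The "linear speed loss, quadratic power drop" claim for DVFS follows from the classic CMOS dynamic-power model, P ≈ C·V²·f: if supply voltage scales roughly with frequency, power falls at least quadratically while CPU-bound speed falls only linearly. The sketch below illustrates that relationship for the five p-states on these machines; the V ∝ f assumption and the resulting numbers are illustrative, not measurements from the talk.

```python
F_MAX = 1.8  # GHz, the fastest p-state on the test machines

def relative_speed(f):
    """CPU-bound work slows linearly with clock frequency."""
    return f / F_MAX

def relative_power(f):
    """Relative dynamic power under P ~ C * V^2 * f, assuming (for
    illustration only) that voltage scales proportionally with f.
    That makes power cubic in f, so the power drop is at least
    quadratic in the slowdown."""
    v = f / F_MAX
    return v * v * (f / F_MAX)

for f in [1.8, 1.6, 1.4, 1.2, 1.0]:
    print(f"{f:.1f} GHz: speed {relative_speed(f):.2f}, power {relative_power(f):.2f}")
```

Packing, by contrast, removes cores without lowering voltage, so both speed and power fall roughly linearly, which is why it is the less efficient of the two knobs.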

3 Hardware architecture
16 nodes; each node is a complete system with:
– 2 CPU sockets per node (physical dies)
– 2 cores per socket (4 cores per node)
4-level memory hierarchy:
– L1 & L2 cache: per-core
– Local memory: per-socket
– Remote memory: accessible via the HyperTransport bus
[Diagram: two sockets, each holding two AMD64 cores with per-core L1 instruction/data and L2 caches and 1 GB of local memory, connected by HyperTransport]

4 P-states and configurations
Scaling:
– The entire socket must scale together
– 5 p-states: every 200 MHz from 1.8 GHz down to 1.0 GHz
Packing (5 configurations):
– All four cores: ×4
– Three cores: ×3
– Cores 0 and 1 (same socket): ×2
– Cores 0 and 2 (one per socket): ×2*
– One core: ×1
For multi-node tests, prepend the number of nodes:
– 4×2 = 4 nodes, cores 0 and 1 active on each, 8 total cores
Packing results "simulate" a full socket shutdown (subtract 20 W).
[Diagram: two sockets on a HyperTransport bus; one socket (cores 2 and 3) and its local memory powered off]
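The "nodes × packing" notation can be made concrete with a small helper. This is our own illustrative code, not the authors' tooling; the configuration names and core counts are the ones listed above.

```python
# Active cores per node for each packing configuration from the talk.
CORES_PER_PACKING = {
    "x4": 4,    # all four cores
    "x3": 3,    # three cores
    "x2": 2,    # cores 0 and 1 (same socket)
    "x2*": 2,   # cores 0 and 2 (one per socket)
    "x1": 1,    # one core
}

def total_cores(config: str) -> int:
    """Expand 'nodes x packing' notation: '4x2' means 4 nodes with
    cores 0 and 1 active on each, i.e. 8 cores in total."""
    nodes, _, packing = config.partition("x")
    return int(nodes) * CORES_PER_PACKING["x" + packing]

print(total_cores("4x2"))  # 4 nodes * 2 cores = 8
```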

5 Three application classes
CPU-bound:
– No communication; fits in cache
– 100% CPU utilization
– Similar to while(1){}
High-Performance Computing (HPC):
– Inter-node communication
– Significant memory usage
– Performance metric: execution time
Commercial:
– Constant servicing of remote requests
– Possibly significant memory usage
– Performance metric: throughput

6 (1) CPU-bound workloads
Workload:
– DAXPY: a small linear-algebra kernel
– Representative of the entire class
Scaling:
– Linear slowdown
– Quadratic power cut
Packing:
– ×4 is the most efficient configuration
– ×2* is no good here; ×3 is right out
– Single-socket configurations ×1 and ×2 save power, but kill performance
[Chart: power (W) vs. throughput across the different p-states and configurations]
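DAXPY ("double-precision a·x plus y") is the classic BLAS Level 1 kernel: it scales a vector and adds it to another, with a working set small enough to stay in cache. The pure-Python version below is only a sketch of the operation the benchmark performs, not the benchmark code itself.

```python
def daxpy(a, x, y):
    """Elementwise a*x[i] + y[i] over two vectors.  With a small vector
    the working set fits in L1/L2 cache, so the loop is purely
    CPU-bound: no memory or network stalls to hide a slower clock."""
    return [a * xi + yi for xi, yi in zip(x, y)]

print(daxpy(2.0, [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # [6.0, 9.0, 12.0]
```

Because the kernel never waits on memory, any drop in clock frequency translates one-for-one into lost throughput, which is why scaling (and especially packing) cannot help this class.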

7 (2) HPC workloads: packing with fixed nodes
[Charts: power, energy, energy-delay product (EDP), and execution time per configuration. Annotations: ×2* has no effect in some cases; LU speeds up under ×2*; CG slows down as its CPU utilization falls]

8 (2) HPC workloads: packing with fixed cores
[Charts: power, energy, energy-delay product (EDP), and execution time per configuration]

9 (3) Commercial workloads
Scale first, then pack.
[Chart: power (W) vs. throughput (replies/second)]

10 Conclusions
Packing is less efficient than scaling:
– Therefore: scale first, then pack
Nothing can help CPU-bound applications.
Memory- and I/O-bound workloads are scalable.
Resource utilization affects (and may predict) the effectiveness of scaling and packing.
Business workloads can benefit from scaling and packing:
– Especially at low utilization levels
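The "scale first, then pack" rule implies a fixed ordering of power-saving steps: walk down the p-states at full packing before giving up any cores. The sketch below encodes that ladder using the p-states and configurations described earlier; the ordering is the talk's conclusion, but the code structure and function name are our own illustration.

```python
P_STATES = [1.8, 1.6, 1.4, 1.2, 1.0]   # GHz, highest power first
PACKINGS = ["x4", "x3", "x2", "x1"]    # per-node configs, most cores first

def power_saving_ladder():
    """Yield (GHz, packing) settings from fastest to most frugal.
    Scaling is the more efficient knob, so exhaust all p-states at
    full packing first; only then start packing cores, at the
    lowest p-state."""
    for f in P_STATES:
        yield (f, PACKINGS[0])
    for packing in PACKINGS[1:]:
        yield (P_STATES[-1], packing)

for step in power_saving_ladder():
    print(step)
```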

11 Future work
How does resource utilization influence the effectiveness of scaling and packing?
– A predictive model based on resource usage?
– A power-management engine based on resource usage?
Dynamic packing:
– Virtualization allows live migration
– Can live migration be used to pack on the fly?
