A Roadmap to Restoring Computing's Former Glory David I. August Princeton University (Not speaking for Parakinetics, Inc.)
Golden era of computer architecture: performance is now ~3 years behind the old trend. [Chart: SPEC CINT performance (log scale) vs. year, across CPU92, CPU95, CPU2000, CPU2006.] Era of DIY: multicore, reconfigurable, GPUs, clusters. "10 Cores!" — 10-Core Intel Xeon, "Unparalleled Performance."
P6 SUPERSCALAR ARCHITECTURE (CIRCA 1994): automatic speculation, automatic pipelining, parallel resources, automatic allocation/scheduling, commit.
MULTICORE ARCHITECTURE (CIRCA 2010): parallel resources — automatic speculation, automatic pipelining, automatic allocation/scheduling, commit.
Realizable parallelism. [Chart: threads vs. time for parallel library calls. Credit: Jack Dongarra.]
“Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law
Multicore Needs:
1. Automatic resource allocation/scheduling, speculation/commit, and pipelining.
2. Low-overhead access to programmer insight.
3. Code reuse, ideally including support for legacy codes as well as new codes.
4. Intelligent automatic parallelization.
Parallel programming + automatic parallelization + parallel libraries + computer architecture. Implicitly parallel programming with critique-based iterative, occasionally interactive, speculatively pipelined automatic parallelization: a roadmap to restoring computing's former glory.
Multicore Needs (1–4, as above). [System diagram: new or existing sequential code and new or existing libraries, each with insight annotations, feed the DSWP family of optimizations, speculative optimizations, and other optimizations; a Complainer/Fixer guides one implementation; the output is parallelized code over machine-specific performance primitives.]
[Figure: Spec-PS-DSWP schedule on four cores — load (LD:1–5), work (W:1–4), and commit (C:1–3) operations per iteration — mirroring the P6 superscalar architecture.]
Example:
  A: while (node) {
  B:   node = node->next;
  C:   res = work(node);
  D:   write(res);
     }
[Program Dependence Graph (PDG): control dependences from A to B, C, D; data dependences B→B (loop-carried), B→C, C→D. Timeline of A, B, C, D instances on three cores.]
Spec-DOALL applied to the example. [Same code and PDG as above; three-core timeline.]
Spec-DOALL schedule. [Timeline: full iterations (A, B, C, D) 1, 2, 3 run concurrently, one iteration per core.]
Spec-DOALL speculates the loop exit, rewriting A: while (node) { as while (true) { so iterations can begin before the exit condition resolves. [Timeline: B, C, D of iterations 2–4 issued across three cores.] On 197.parser the result is a slowdown: the loop-carried dependence through node = node->next still serializes the iterations.
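To make the blocking dependence concrete, here is a runnable sequential version of the slide's loop. The list type and the bodies of work() and write() are not from the talk; they are minimal stand-ins chosen only to make the B→B loop-carried dependence visible.

```c
#include <assert.h>
#include <stddef.h>

struct node { int val; struct node *next; };

static int results[8];
static int nresults;

/* Hypothetical stand-ins for the slide's work() and write(). */
static int work(struct node *n) { return n ? n->val * 2 : 0; }
static void write_res(int res) { results[nresults++] = res; }

/* The slide's loop. Each iteration's B consumes the node produced
 * by the previous iteration's B (a loop-carried dependence), so
 * iterations cannot simply be distributed DOALL-style. */
static void run(struct node *node) {
    while (node) {             /* A */
        node = node->next;     /* B: loop-carried dependence    */
        int res = work(node);  /* C: depends on B               */
        write_res(res);        /* D: depends on C               */
    }
}
```

Note that, as on the slide, C runs on the node *after* B advances, so the final iteration calls work() on the list's end.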
Spec-DOACROSS vs. Spec-DSWP. [Timelines on three cores: Spec-DOACROSS staggers whole iterations across cores; Spec-DSWP assigns stages B, C, D to different cores and streams values between them. Both achieve a throughput of 1 iteration/cycle.]
Comparison: Spec-DOACROSS and Spec-DSWP. With communication latency 1, both sustain 1 iteration/cycle. With communication latency 2, Spec-DOACROSS drops to 0.5 iterations/cycle — the latency problem: cross-core communication sits on the critical path between iterations. Spec-DSWP still sustains 1 iteration/cycle after a longer pipeline fill time, because communication latency affects only the fill, not steady-state throughput.
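The DSWP idea above can be sketched with two pthreads: stage 1 runs the sequential recurrence (A and B) and streams nodes through a queue; stage 2 runs C and D. This is a hand-written illustration, not the paper's runtime; the queue, node type, and work() body are assumptions.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct node { int val; struct node *next; };

#define QCAP 64
static struct node *queue[QCAP];
static int qhead, qtail;   /* single producer, single consumer */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

static void enqueue(struct node *n) {
    pthread_mutex_lock(&qlock);
    while ((qtail + 1) % QCAP == qhead)   /* full: wait */
        pthread_cond_wait(&qcond, &qlock);
    queue[qtail] = n;
    qtail = (qtail + 1) % QCAP;
    pthread_cond_signal(&qcond);
    pthread_mutex_unlock(&qlock);
}

static struct node *dequeue(void) {
    pthread_mutex_lock(&qlock);
    while (qhead == qtail)                /* empty: wait */
        pthread_cond_wait(&qcond, &qlock);
    struct node *n = queue[qhead];
    qhead = (qhead + 1) % QCAP;
    pthread_cond_signal(&qcond);
    pthread_mutex_unlock(&qlock);
    return n;
}

static int work(struct node *n) { return n ? n->val * 2 : 0; }
static long total;   /* stands in for write(res) */

/* Stage 1: the recurrence (A and B) stays on one core; each B value
 * is forwarded to stage 2 through the queue. */
static void *traverse_stage(void *arg) {
    struct node *node = arg;
    while (node) {          /* A */
        node = node->next;  /* B */
        enqueue(node);      /* NULL from the last B is a natural sentinel */
    }
    return NULL;
}

/* Stage 2: C and D consume the stream on another core. */
static void *work_stage(void *arg) {
    (void)arg;
    for (;;) {
        struct node *n = dequeue();
        total += work(n);   /* C + D */
        if (!n) break;
    }
    return NULL;
}

static long run_dswp(struct node *head) {
    pthread_t t1, t2;
    total = 0;
    qhead = qtail = 0;
    pthread_create(&t1, NULL, traverse_stage, head);
    pthread_create(&t2, NULL, work_stage, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return total;
}
```

Because stage 1 never waits on stage 2 (until the queue fills), queue latency delays only when results appear, not how fast iterations retire — the pipeline-fill behavior the comparison slide shows.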
TLS vs. Spec-DSWP [MICRO 2010] Geomean of 11 benchmarks on the same cluster
Multicore Needs (recap). [Outline slide repeated.]
char *memory;
void *alloc(int size);

void *alloc(int size) {
    void *ptr = memory;
    memory = memory + size;
    return ptr;
}
[Execution plan: alloc calls 1–6 on cores 1–3 over time.]
With alloc calls freed to run in any order, the execution plan spreads alloc 1–6 across cores 1–3. The outcome is easily understood non-determinism: the addresses returned vary from run to run, but every caller still gets a valid private region.
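A runnable sketch of why the slide's alloc calls commute. The arena backing store and the disjoint() check are illustrative assumptions; the point is that callers depend only on receiving a fresh region, not on which address they get, so any interleaving of calls is acceptable.

```c
#include <assert.h>
#include <stddef.h>

static char arena[1024];
static char *memory = arena;

/* The slide's bump allocator. Two calls conflict on `memory`, so a
 * dependence analysis serializes them — yet either call order is fine. */
static void *alloc(int size) {
    void *ptr = memory;
    memory = memory + size;
    return ptr;
}

/* What callers actually rely on: regions do not overlap. This holds
 * under every interleaving of alloc calls, which is the sense in
 * which the calls commute. */
static int disjoint(const char *p, int psz, const char *q, int qsz) {
    return p + psz <= q || q + qsz <= p;
}
```

In the talk's system, annotating alloc as Commutative tells the parallelizer that this weaker, order-independent contract is all the program needs.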
~50 of ½ million LOC modified in SPEC CINT 2000; modifications also include the non-deterministic branch. [MICRO '07, Top Picks '08; automatic: PLDI '11]
Multicore Needs (recap). [Outline slide repeated.]
Iterative compilation [Cooper '05; Almagor '04; Triantafyllis '05]. [Search tree over transformation sequences — sum reduction, unroll, rotate — with measured speedups at each step ranging from 0.10X to 30.0X.]
PS-DSWP Complainer.
PS-DSWP Complainer. The Complainer reports the dependences blocking parallelization:
- Red edges: dependences between malloc() and free()
- Blue edges: dependences between rand() calls
- Green edges: flow dependences inside the inner loop
- Orange edges: dependences between function calls
Who can help me? → Programmer annotation. [Transformation sequence: unroll, sum reduction, rotate.]
PS-DSWP Complainer: the sum reductions are recognized.
PS-DSWP Complainer: the programmer annotates calls as Commutative, resolving reported dependences.
PS-DSWP Complainer: Commutative annotations come from both the programmer and the library.
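The sum reduction the Complainer recognizes can be sketched directly. This is a hand-written illustration of the transformation, not the compiler's output; the 4-worker round-robin chunking is an assumption. Because addition commutes and associates, the accumulator can be privatized per worker and merged afterward, breaking the loop-carried dependence.

```c
#include <assert.h>

#define NWORKERS 4

/* Sequential form: sum += a[i] is a loop-carried dependence. */
static long sum_seq(const int *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* After sum-reduction: each worker accumulates into a private
 * partial, then the partials are merged. Workers are shown as a
 * loop here; the real optimization runs each chunk on its own core. */
static long sum_reduced(const int *a, int n) {
    long partial[NWORKERS] = {0};
    for (int w = 0; w < NWORKERS; w++)
        for (int i = w; i < n; i += NWORKERS)  /* round-robin chunks */
            partial[w] += a[i];
    long sum = 0;
    for (int w = 0; w < NWORKERS; w++)
        sum += partial[w];
    return sum;
}
```

The Commutative annotations on the slide generalize the same idea beyond arithmetic: any operation whose calls may reorder without changing what the program needs can be privatized or unordered the same way.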
Multicore Needs (recap). [Outline slide repeated.]
Performance relative to best sequential: 128 cores in 32 nodes with Intel Xeon processors [MICRO 2010].
Restoration of Trend
“Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law. Compiler technology vs. architecture/devices. Era of DIY: multicore, reconfigurable, GPUs, clusters. A compiler-technology-inspired class of architectures?
The End