1
A Roadmap to Restoring Computing's Former Glory. David I. August, Princeton University (Not speaking for Parakinetics, Inc.)
2
Golden era of computer architecture, 1992–2012. [Chart: SPEC CINT performance (log scale) by year across the CPU92, CPU95, CPU2000, and CPU2006 suites, each roughly 3 years behind; the trend flattens into the era of DIY: multicore, reconfigurable hardware, GPUs, clusters. Inset: 10-core Intel Xeon advertised as "Unparalleled Performance".]
3
P6 Superscalar Architecture (circa 1994): automatic speculation, automatic pipelining, parallel resources, automatic allocation/scheduling, commit.
4
Multicore Architecture (circa 2010): automatic pipelining, parallel resources, automatic speculation, automatic allocation/scheduling, commit.
6
Realizable parallelism. [Chart: threads over time for parallel library calls. Credit: Jack Dongarra.]
7
“Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law
8
Multicore Needs:
1. Automatic resource allocation/scheduling, speculation/commit, and pipelining.
2. Low-overhead access to programmer insight.
3. Code reuse. Ideally, this includes support of legacy codes as well as new codes.
4. Intelligent automatic parallelization.
[Diagram: Parallel Programming, Automatic Parallelization, Parallel Libraries, and Computer Architecture converge.]
Implicitly parallel programming with critique-based iterative, occasionally interactive, speculatively pipelined automatic parallelization: a roadmap to restoring computing's former glory.
9
Multicore Needs:
1. Automatic resource allocation/scheduling, speculation/commit, and pipelining.
2. Low-overhead access to programmer insight.
3. Code reuse. Ideally, this includes support of legacy codes as well as new codes.
4. Intelligent automatic parallelization.
[Diagram: new or existing sequential code and libraries, with insight annotations, feed the DSWP family of optis (plus other optis and speculative optis) and a Complainer/Fixer, producing parallelized code over machine-specific performance primitives. One implementation.]
10
[Execution schedule: Spec-PS-DSWP on 4 cores over cycles 0–5, with loads (LD:1–5), work (W:1–4), and commits (C:1–3) pipelined across cores; juxtaposed with the P6 superscalar architecture.]
11
Example:

    A: while (node) {
    B:     node = node->next;
    C:     res = work(node);
    D:     write(res);
       }

[Program dependence graph (PDG) over statements A–D, with control and data dependence edges; execution schedule of iterations 1–2 across 3 cores over time.]
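For readers who want to run the slide's example, here is a self-contained C version. The list type, `work`, and the helper `make_list` are stand-ins invented here (the slide does not define them); the statement labels A–D match the slide, including its order of B before C.

```c
#include <stdlib.h>

typedef struct Node {
    int value;
    struct Node *next;
} Node;

/* Stand-in for the slide's work(); tolerates NULL because the
   slide advances the pointer (B) before computing (C). */
static int work(const Node *n) { return n ? n->value * 2 : 0; }

/* The slide's loop, labels preserved. B carries the only
   loop-carried dependence (the pointer chase); C and D depend
   on B within each iteration. */
static int run(Node *node) {
    int sum = 0;
    while (node) {              /* A */
        node = node->next;      /* B */
        int res = work(node);   /* C */
        sum += res;             /* D: stand-in for write(res) */
    }
    return sum;
}

/* Build a list from an array, so the loop can be exercised. */
static Node *make_list(const int *vals, int n) {
    Node *head = NULL;
    for (int i = n - 1; i >= 0; i--) {
        Node *node = malloc(sizeof *node);
        node->value = vals[i];
        node->next = head;
        head = node;
    }
    return head;
}
```

The pointer chase in B is what the rest of the talk's schedules are working around: every iteration needs the previous iteration's `node`.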
12
Example (Spec-DOALL): same loop and program dependence graph as on the previous slide. [Execution schedule of iterations 1–2 across 3 cores over time.]
13
Example (Spec-DOALL): same loop and program dependence graph. [Schedule: whole iterations run concurrently, one per core: A1 B1 C1 D1 on core 1, A2 B2 C2 D2 on core 2, A3 B3 C3 D3 on core 3.]
14
Spec-DOALL speculates the loop exit: A: while (node) becomes while (true), so iterations B2 C2 D2, B3 C3 D3, B4 C4 D4 can be dispatched to cores before the exit test resolves. [Chart: on 197.parser the result is a slowdown.]
15
Spec-DOACROSS vs. Spec-DSWP. [Schedules on 3 cores over time: Spec-DOACROSS round-robins whole iterations (B, C, D together) across cores; Spec-DSWP runs stage B on core 1, stage C on core 2, and stage D on core 3, with iterations flowing through the pipeline. Both achieve a throughput of 1 iter/cycle.]
16
Comparison: Spec-DOACROSS and Spec-DSWP. With communication latency 1, both sustain 1 iter/cycle. With communication latency 2, Spec-DOACROSS falls to 0.5 iter/cycle because the loop-carried dependence crosses cores on every iteration, while Spec-DSWP pays the latency only as pipeline fill time and still sustains 1 iter/cycle.
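The latency argument can be captured in a toy throughput model (an illustration, not the paper's analysis): assume every pipeline stage takes one cycle, and `latency` is the cross-core communication delay.

```c
/* Toy steady-state throughput model, in iterations per cycle. */

/* Spec-DOACROSS: the loop-carried dependence crosses cores on
   every iteration, so successive iterations start `latency`
   cycles apart. */
static double doacross_throughput(int latency) {
    return 1.0 / (double)latency;
}

/* Spec-DSWP: dependences flow forward through a fixed pipeline,
   so communication latency only lengthens the one-time fill;
   steady-state throughput stays at 1 iteration per cycle. */
static double dswp_throughput(int latency) {
    (void)latency;  /* affects fill time only, not steady state */
    return 1.0;
}
```

With latency 2 the model reproduces the slide's numbers: 0.5 iter/cycle for Spec-DOACROSS versus 1 iter/cycle for Spec-DSWP.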
17
TLS vs. Spec-DSWP [MICRO 2010] Geomean of 11 benchmarks on the same cluster
18
Multicore Needs (roadmap slide repeated):
1. Automatic resource allocation/scheduling, speculation/commit, and pipelining.
2. Low-overhead access to programmer insight.
3. Code reuse. Ideally, this includes support of legacy codes as well as new codes.
4. Intelligent automatic parallelization.
[Diagram: sequential code and libraries, with insight annotations, feed the DSWP family of optis plus a Complainer/Fixer, producing parallelized code over machine-specific performance primitives.]
19
char *memory;
void * alloc(int size);

void * alloc(int size) {
    void * ptr = memory;
    memory = memory + size;
    return ptr;
}

[Execution plan: alloc calls 1–6 across cores 1–3 over time.]
20
char *memory;
void * alloc(int size);

@Commutative
void * alloc(int size) {
    void * ptr = memory;
    memory = memory + size;
    return ptr;
}

[Execution plan: alloc calls 1–6 across cores 1–3 over time.]
21
char *memory;
void * alloc(int size);

@Commutative
void * alloc(int size) {
    void * ptr = memory;
    memory = memory + size;
    return ptr;
}

[Execution plan: alloc calls 1–6 across cores 1–3 over time.] Easily understood non-determinism!
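The slide's bump allocator races on `memory` when calls run in parallel; @Commutative tells the compiler that any order of alloc() calls is acceptable, but each call must still execute atomically. Below is a minimal sketch of such an allocator using C11 atomics; the fixed-size pool and the names are assumptions for illustration, not the talk's implementation.

```c
#include <stdatomic.h>
#include <stddef.h>

#define POOL_SIZE 1024            /* assumed pool size */
static char pool[POOL_SIZE];
static atomic_size_t offset;      /* next free byte in the pool */

/* Commutative by design: concurrent calls may be ordered either
   way, so the addresses returned are non-deterministic, but each
   call atomically claims a disjoint region -- the slide's
   "easily understood non-determinism". */
static void *commutative_alloc(size_t size) {
    size_t old = atomic_fetch_add(&offset, size);
    if (old + size > POOL_SIZE)
        return NULL;              /* pool exhausted */
    return pool + old;
}
```

Callers still get distinct, non-overlapping blocks no matter how the scheduler interleaves the calls; only the addresses vary from run to run.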
22
[MICRO '07, Top Picks '08; automatic: PLDI '11] ~50 of ½ million lines of code modified in SPEC CINT 2000; the modifications also include a Non-Deterministic Branch annotation.
23
Multicore Needs (roadmap slide repeated):
1. Automatic resource allocation/scheduling, speculation/commit, and pipelining.
2. Low-overhead access to programmer insight.
3. Code reuse. Ideally, this includes support of legacy codes as well as new codes.
4. Intelligent automatic parallelization.
[Diagram: sequential code and libraries, with insight annotations, feed the DSWP family of optis plus a Complainer/Fixer, producing parallelized code over machine-specific performance primitives.]
24
Iterative compilation [Cooper '05; Almagor '04; Triantafyllis '05]: candidate transformation sequences built from sum reduction, unroll, and rotate, applied in different orders, yield widely varying speedups (0.10x, 0.8x, 0.90x, 1.1x, 1.5x, up to 30.0x); the search keeps the best-performing sequence.
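In miniature, iterative compilation just measures each candidate transformation sequence and keeps the winner. A hedged sketch follows; the pairing of the slide's speedup figures with particular sequences is invented for illustration, and only the search loop is the point.

```c
typedef struct { const char *seq; double speedup; } Trial;

/* Hypothetical measurements: the numbers echo the slide, but
   their pairing with sequences is invented for illustration. */
static const Trial trials[] = {
    { "rotate",                          0.10 },
    { "sum-reduction",                   0.80 },
    { "unroll",                          0.90 },
    { "unroll + sum-reduction",          1.10 },
    { "rotate + unroll",                 1.50 },
    { "unroll + rotate + sum-reduction", 30.0 },
};

/* Iterative compilation in miniature: try every candidate
   sequence, measure it, keep the best. Returns the index of
   the winning sequence. */
static int best_trial(const Trial *t, int n) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (t[i].speedup > t[best].speedup)
            best = i;
    return best;
}
```

Real iterative compilers search this space adaptively rather than exhaustively, since the number of sequences grows combinatorially.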
25
PS-DSWP Complainer
26
The PS-DSWP complainer reports the blocking dependences: red edges between malloc() and free(), blue edges between rand() calls, green flow dependences inside the inner loop, and orange edges between function calls. Transformations so far: unroll, sum reduction, rotate. Who can help me? Programmer annotation.
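The edge colors map onto a loop shaped like the following hypothetical reconstruction in C; each marked line is the kind of statement that would generate the corresponding dependence edge the complainer reports.

```c
#include <stdlib.h>

/* Hypothetical loop of the kind the complainer is analyzing:
   heap allocation, rand(), and an inner-loop flow dependence
   each add edges that block PS-DSWP until they are annotated
   or transformed away. */
static double noisy_sum(int n, int m) {
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        double *buf = malloc(m * sizeof *buf);  /* red edge: malloc/free pair */
        double acc = 0.0;
        for (int j = 0; j < m; j++) {
            buf[j] = (double)rand() / RAND_MAX; /* blue edge: rand() ordering */
            acc += buf[j];                      /* green edge: inner flow dep */
        }
        total += acc;                           /* sum reduction candidate */
        free(buf);                              /* red edge: matches malloc */
    }
    return total;
}
```

Annotating the allocator and random-number generator as commutative, and recognizing `total` as a sum reduction, removes exactly these edges, which is the progression the next slides walk through.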
27
PS-DSWP Complainer; Sum Reduction.
28
PS-DSWP Complainer; Sum Reduction; PROGRAMMER Commutative.
29
PS-DSWP Complainer; Sum Reduction; PROGRAMMER Commutative; LIBRARY Commutative.
30
PS-DSWP Complainer; Sum Reduction; PROGRAMMER Commutative; LIBRARY Commutative.
31
Multicore Needs (roadmap slide repeated):
1. Automatic resource allocation/scheduling, speculation/commit, and pipelining.
2. Low-overhead access to programmer insight.
3. Code reuse. Ideally, this includes support of legacy codes as well as new codes.
4. Intelligent automatic parallelization.
[Diagram: sequential code and libraries, with insight annotations, feed the DSWP family of optis plus a Complainer/Fixer, producing parallelized code over machine-specific performance primitives.]
32
Performance relative to best sequential, measured on 128 cores in 32 nodes with Intel Xeon processors [MICRO 2010].
33
Restoration of Trend
34
“Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law. [Chart: compiler technology trend vs. architecture/devices trend; Era of DIY: multicore, reconfigurable, GPUs, clusters.] Could a compiler-technology-inspired class of architectures restore the trend?
35
The End