1 Many-cores: Supercomputer-on-chip. How many? And how? (And how not to?) Ran Ginosar, Technion, March 2010
2 Disclosure: I am a co-inventor / co-founder of Plurality, based on 30 years of (on-and-off) research, including the MSc of Dr. Nimrod Bayer (~1990). This talk, however, is forward-looking research, though it takes advantage of Plurality tools and data.
3 Many-cores
CMP / multi-core is "more of the same": several high-end, complex, powerful processors; each processor manages itself, and each can execute the OS. Good for many unrelated tasks (e.g. Windows). Reasonable on 2–8 processors, then it breaks.
Many-cores: 100 – 1,000 – 10,000 cores, useful for heavy compute-bound tasks. So far (50 years), many disasters, but there is light at the end of the tunnel.
4 Agenda: review 4 cases; analyze; how NOT to make a many-core.
5 Many many-core contenders Ambric Aspex Semiconductor ATI GPGPU BrightScale ClearSpeed Technologies Coherent Logix, Inc. CPU Technology, Inc. Element CXI Elixent/Panasonic IBM Cell IMEC Intel Larrabee Intellasys IP Flex MathStar Motorola Labs NEC Nvidia GPGPU PACT XPP Picochip Plurality Rapport Inc. Recore Silicon Hive Stream Processors Inc. Tabula Tilera (many are dead / dying / will die / should die)
6 PACT XPP: German company, since 1999. Martin Vorbach, an ex-user of Transputers. (figure: 42× Transputer mesh, 1980s)
7 PACT XPP (96 elements)
8 PACT XPP die photo
9 PACT: static mapping, circuit-switched reconfigurable NoC
10 PACT ALU-PAE
11 PACT static task mapping, and a debug tool for it
12 PACT analysis: fine-granularity computing; heterogeneous processors. Static mapping → complex programming. Circuit-switched NoC with static reconfigurations → complex programming. Limited parallelism. Doesn't scale easily.
13 picoChip: UK company, inspired by the Transputers of the 1980s (David May's design). (figure: 42× Transputer mesh, 1980s)
14 picoChip array: 322× 16-bit LIW RISC cores
15–23 (figure slides; no recoverable text)
24 picoChip: static task mapping, performed at compile time
25 picoChip analysis: MIMD, fine granularity, homogeneous cores. Static mapping → complex programming. Circuit-switched NoC with static reconfigurations → complex programming. Doesn't scale easily: can we create / debug / understand a static mapping on 10K cores?
26 Tilera: USA company, based on the RAW research @ MIT (A. Agarwal). Heavy DARPA funding, university IP. Classic homogeneous MIMD on a mesh NoC: "upgraded" Transputers with "powerful" uniprocessor features (caches, complex communications). The "tiles era".
27 Tiles: a powerful processor per tile. High frequency: ~1 GHz. High power (~0.5 W per tile). Five mesh NoCs (P-M / P-P / P-IO). "2.5" levels of cache: L1 + L2, and a tile can fetch from the L2 of other tiles. Variable access time: 1 – 7 – 70 cycles.
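A back-of-the-envelope illustration of why variable access time hurts (the hit fractions here are my assumptions, not from the slides): suppose 90% of accesses take 1 cycle, 8% take 7 cycles, and 2% take 70 cycles. Then

    AMAT = 0.90·1 + 0.08·7 + 0.02·70 = 0.90 + 0.56 + 1.40 ≈ 2.9 cycles

so the rare 70-cycle accesses alone contribute about half of the average.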
28 Caches Kill Performance
Cache is great for a single processor: it exploits locality (in time and space). But locality only happens locally on many-cores; other (shared) data are buried elsewhere. Caches help speed up the parallel (local) phases, yet Amdahl [1967] showed the challenge is NOT the parallel phases. You need to program many looong tasks on the same data. Hard! That's the software gap in parallel programming.
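Amdahl's argument in one formula: if a fraction p of the work parallelizes perfectly over N cores,

    Speedup(N) = 1 / ((1 − p) + p/N)

Taking p = 0.95 as an illustrative figure (my assumption, not from the slides), N = 256 cores yield only 1 / (0.05 + 0.95/256) ≈ 18.6×, and no core count can ever exceed 1/0.05 = 20×. The serial phases, not the parallel ones, bound the machine.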
29 Tilera array: 36–64 processors, MIMD / SIMD. Total 5+ MB of memory, held in distributed caches. High power: ~27 W. (die photo)
30 Tilera allows statics: pre-programmed streams span multiple processors → static mapping.
31 Tilera co-mapping: code, memory, routing
32 Tilera static mapping debugger
33 Tilera analysis: achieves good performance; bad on power; hard to scale; hard to program.
34 Plurality: Israeli company, Technion research (since the 1980s)
35 Architecture, part I: "anti-local" addressing by interleaving over MANY banks / ports → negligible conflicts → fine granularity. NO PRIVATE MEMORY: tightly coupled shared memory, equi-distant from all cores (1 cycle each way) over a fast combinational P-to-M resolving NoC, with external memory behind it. (block diagram: cores, P-to-M resolving NoC, shared memory, external memory)
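A minimal C sketch of the interleaving idea (my illustration; the bank count is a placeholder, not Plurality's actual parameter). Consecutive word addresses land in consecutive banks, so a dense parallel sweep spreads almost uniformly over all banks and conflicts stay negligible; this is what makes the addressing "anti-local".

    #include <stdint.h>

    #define NBANKS 256u  /* hypothetical bank count */

    /* Low-order interleaving: word address -> (bank, row within bank). */
    static inline uint32_t bank_of(uint32_t word_addr)     { return word_addr % NBANKS; }
    static inline uint32_t row_in_bank(uint32_t word_addr) { return word_addr / NBANKS; }

With this mapping, cores sweeping an array in parallel touch every bank about equally often instead of hammering any single "local" memory.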
36 Architecture, part II: add a scheduler. Low-latency parallel scheduling enables fine granularity; the scheduler drives the cores through a P-to-S scheduling NoC. Part I is otherwise unchanged. (block diagram: part I plus scheduler and P-to-S scheduling NoC)
37 Plurality floorplan (figure)
38 Other possible floorplans? (figures: NoC to instruction memory; NoC to data memory)
39 Programming model: compile the program into a task-dependency-graph (= "task map") plus the task codes. Task maps are loaded into the scheduler; tasks are loaded into memory. Task template:

    regular / duplicable task xxx( dependencies ) join/fork
    {
        … INSTANCE …
    }
40 Fine-Grain Parallelization: convert (independent) loop iterations

    for ( i = 0; i < 10000; i++ ) {
        a[i] = b[i]*c[i];
    }

into parallel tasks

    duplicable task XX(…) 10000
    {
        ii = INSTANCE;
        a[ii] = b[ii]*c[ii];
    }

All tasks, or any subset, can be executed in parallel.
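To show how dependencies enter the task map, here is a hypothetical continuation in the same style (my sketch, following the template on slide 39; the dependency syntax is an assumption, not verified against Plurality's actual language): a regular task that the scheduler may start only after all 10000 instances of XX have completed.

    regular task SUM( XX )   /* depends on every instance of XX (assumed syntax) */
    {
        s = 0;
        for ( i = 0; i < 10000; i++ )
            s += a[i];
    }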
41 Task map example (2D FFT). Legend: duplicable task; conditional task; join/fork task. (figure)
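A plausible reading of the map (my reconstruction; the slide itself is a figure): a 2D FFT is row FFTs followed by column FFTs, so the map would be a duplicable task of row FFTs (one instance per row), a join, a duplicable task of column FFTs (one instance per column), and a conditional task deciding whether another frame follows.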
42 Another task map (linear solver)
43 Linear solver: simulation snapshots
44 Architectural Benefits
- Shared, uniform (equi-distant) memory: no worry which core does what; no advantage to any core because it already holds the data.
- Many-bank memory + fast P-to-M NoC: low latency; no bottleneck accessing shared memory.
- Fast scheduling of tasks to free cores (many at once) enables fine-grain data parallelism, impossible in other architectures due to task-scheduling overhead and data locality.
- Any core can do any task equally well on short notice: scales automatically.
- Programming model: intuitive to programmers, easy for an automatic parallelizing compiler.
45 Target design (no silicon yet): 256 cores at 500 MHz. Memory access time: 2 cycles (+) for 2 MB, slower for 20 MB. 3 Watts. Designed to be attractive to programmers (simple), scalable, and to fight Amdahl's rule.
46 Analysis
47 Two (analytic) approaches to many-cores: (1) How many cores can fit into a fixed-size VLSI chip? (2) One core is of fixed size; how many cores can be integrated? The following analysis employs approach (1).
48 The VLSI-aware many-core (crude) analysis
49 The VLSI-aware many-core (crude) analysis. For N cores on a fixed-size chip (plotted for N = 64, 256, 1K, 4K):

    freq        ∝ 1/√N
    power       ∝ 1/√N
    perf        ∝ √N
    perf/power  ∝ N
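One way to arrive at these scalings (my reconstruction of the reasoning, under assumptions consistent with the table): fix the die area, give each of the N cores area ∝ 1/N, and assume the clock of the smaller, simpler core scales as f ∝ 1/√N. Then

    perf       = N · f ∝ N · (1/√N) = √N
    power      ∝ C_total · V² · f ∝ f = 1/√N   (total switched capacitance fixed by the fixed area)
    perf/power ∝ √N / (1/√N) = N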
50 A modified analysis
51 The modified analysis (again for N = 64, 256, 1K, 4K):

    freq        ∝ 1/√N
    power       ∝ √N
    perf        ∝ √N
    perf/power  ≈ const
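The difference from the crude analysis is the power row: once total power grows as √N instead of shrinking (a plausible reading; the mechanism is not stated in the surviving text), perf/power = √N / √N = const, and adding cores no longer improves efficiency for free.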
52 Things we shouldn't do in many-cores
- No processor-sensitive code.
- No heterogeneous processors.
- No speculation: no speculative execution; no speculative storage (aka cache); no speculative latency (aka packet-switched or circuit-switched NoC).
- No bottlenecks: no scheduling bottleneck (aka OS); no issue bottlenecks (aka multithreading); no memory bottlenecks (aka local storage); no programming bottlenecks (no multithreading / GPGPU / SIMD / static mappings / heterogeneous processors / …).
- No statics: no static task mapping; no static communication patterns.
53 Conclusions
- Powerful processors are inefficient: the principles of high-end CPUs are damaging (speculative anything, cache, locality, hierarchy).
- Complexity harms (when exposed): hard to program, doesn't scale.
- Hacking (static anything) is hacking: hard to program, doesn't scale.
- Keep it simple, stupid [Pythagoras, 520 BC].
54 Research issues
- Scalability. Memory and P-to-M NoC: async / multicycle? Add local memories, and how to reduce power there? Scheduler and P-to-S NoC: active / distributed? Schedule a large number of tasks together? Shared memory as a cache to external memory? Special instruction memories?
- Programming model. New PRAM, a special case of functional? Hierarchical task map? Other task maps? Dynamic task map? Algorithm / program mapping to tasks. Combine HPC and GP models?
- Power delivery.