1
Reconfigurable Computing and the von Neumann Syndrome Reiner Hartenstein
2
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern
Questions: Who is familiar with FPGAs? Is programming them easy? Who is familiar with systolic arrays? The duality of data streams vs. instruction streams? Programming a multicore microprocessor: will it be easy?
3
pervasiveness
4
Outline:
- The Pervasiveness of FPGAs
- The Reconfigurable Computing Paradox
- The Gordon Moore gap
- The von Neumann syndrome
- We need a dual paradigm approach
- Conclusions
5
FPGAs found everywhere
6
Pervasiveness of RC: http://www.fpl.uni-kl.de/RCeducation08/pervasiveness.html http://hartenstein.de/pervasiveness.html
7
RCeducation 2008: The 3rd International Workshop on Reconfigurable Computing Education, April 10, 2008, Montpellier, France. http://www.fpl.uni-kl.de/RCeducation08/
8
Outline (repeated; highlighting: the hardware/software chasm, the configware/software chasm, the instruction-stream tunnel, the overhead-prone paradigm)
9
Outline (repeated; highlighting: instruction stream vs. data stream; bridging the chasm: an old hat; stubborn curriculum task forces)
10
Outline (repeated)
11
paradox
12
RC education: http://www.fpl.uni-kl.de/RCeducation/ http://www.fpl.uni-kl.de/RCeducation08/pervasiveness.html
13
Outline (repeated; highlighting: platform FPGAs, coarse-grained arrays, saving energy)
14
FPGA with island architecture: reconfigurable logic boxes, switch boxes, and connect boxes embedded in reconfigurable interconnect fabrics.
15
Deficiencies of (fine-grained) reconfigurable fabrics (FPGAs): reconfigurability overhead, routing congestion, wiring overhead (overhead >>10,000); immense area inefficiency; power guzzler; slow clock; overall deficiency factor >10,000. [Chart, 1980-2010, transistors per microchip on a log scale 10^0 to 10^9: the Gordon Moore curve (microprocessor) vs. FPGA physical, FPGA logical, and FPGA routed density; "1st DeHon's Law" (1996 Ph.D. thesis, MIT); general-purpose "simple" FPGA.]
16
Software-to-Configware (FPGA) migration, some published speed-up factors [2003-2005]: DSP and wireless: MAC 1000. Image processing, pattern matching, multimedia: real-time face detection 6000, video-rate stereo vision 900, pattern recognition 730, SPIHT wavelet-based image compression 457, FFT 100, Reed-Solomon decoding 2400, Viterbi decoding 400. Bioinformatics: BLAST 52, protein identification 40, Smith-Waterman pattern matching 288, molecular dynamics simulation 88. Astrophysics: GRAPE 20. Crypto 1000. Oil and gas 17. [Chart: speedup factor, 10^0 to 10^6, 1980-2010; trend roughly x2 per year.]
17
18
The same speed-up chart, now annotated with the RC paradox: deficiency factor >10,000, yet speed-up factors up to 6,000: a total discrepancy of >60,000,000. (PISA)
21
Software-to-Configware (FPGA) migration, some published speed-up factors [2003-2005]: these examples worked fine with on-chip memory. There are other algorithms that are more difficult to accelerate, where data caching might be useful (ASM).
22
platform FPGA
23
How much on-chip embedded BRAM? [Table, LatticeCS series: 256-1704 BGA, 56-424, 8-32 fast on-chip block RAMs (BRAMs); DPU: coarse-grained, on-chip.]
24
coarse
25
Coarse-grained reconfigurable array: the (supersystolic) KressArray, mainly a pipe network. Array size: 10 x 16 rDPUs (rDPU: reconfigurable DataPath Unit, 32 bits wide; no CPU). An SNN filter mapped onto the array; some rDPUs serve as route-through only or are unused; backbus connect. Note the software perspective without instruction streams: pipelining. Compiled by Nageldinger's KressArray Xplorer with Juergen Becker's CoDe-X inside. Question after the talk: "but you can't implement decisions!"
26
Simple KressArray Configuration Example
27
Far fewer deficiencies with coarse-grained arrays: rDPA physical and rDPA logical area efficiency come very close to Moore's law ("Hartenstein's Law" [1996: ISIS, Austin, TX]); very compact configuration code, hence very fast reconfiguration. [Chart, 1980-2010, transistors per microchip, 10^0 to 10^9: Gordon Moore curve; rDPU vs. DPU vs. CPU with program counter.]
28
energy
29
Software-to-Configware (FPGA) migration: oil and gas [2005], speedup factor 17. Side effect: slashing the electricity bill by more than an order of magnitude. [Chart: speedup factor, 10^0 to 10^6, 1980-2010, x2/yr.]
30
An accidentally discovered side effect. Software-to-FPGA migration of an oil and gas application: speed-up factor of 17; electricity bill down to <10%; hardware cost down to <10%. Saves >$10,000 in electricity bills per year per 64-processor 19" rack (at 7 cents/kWh) [Herb Riley, R. Associates]. All other publications reporting speed-ups did not report energy consumption. What about higher speed-up factors? Even more dramatic electricity savings? ($70 oil in 2010? This will change.)
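The savings figure on this slide can be sanity-checked with a quick back-of-envelope calculation. This is only a sketch: the $10,000/yr and 7 cents/kWh figures are from the slide, while the around-the-clock duty cycle is an assumption.

```python
# Back-of-envelope check of the slide's claim: >$10,000/yr saved at $0.07/kWh.
# Assumes the rack runs around the clock (8760 h/yr).
RATE = 0.07                      # dollars per kWh (from the slide)
HOURS_PER_YEAR = 24 * 365        # 8760

saved_kwh = 10_000 / RATE        # ~142,857 kWh saved per year
avg_kw = saved_kwh / HOURS_PER_YEAR

print(round(avg_kw, 1))          # ~16.3 kW of continuous power saved per rack
```

So the claim implies roughly 16 kW of continuous power saved per rack, which is plausible for a 64-processor rack of that era.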
31
What's Really Going On With Oil Prices? [BusinessWeek, January 29, 2007]. $52: price of delivery in February 2007 [New York Mercantile Exchange: Jan. 17]. $200: minimum oil price in 2010, in a bet by investment banker Matthew Simmons.
32
Energy as a strategic issue: Google's annual electricity bill: $50,000,000. Amsterdam: 25% of its electricity goes into server farms. NY city server farms: 1/4 km^2 of building floor area [Mark P. Mills]. Predicted for the USA in 2020: 30-50% of the entire national electricity consumption goes into the cyber infrastructure. PetaFlop supercomputer (by 2012?): extreme power consumption.
33
Energy: an important motivation. [Table: platform | W/Gflops | energy factor] MDgrape-3* (domain-specific, 2004): 0.2 | 1. Pentium 4: 14 | 70. Earth Simulator (supercomputer, 2003): 128 | 640. *) feasible also on reconfigurable platforms.
34
Moore gap
35
Outline (repeated; highlighting: the Gordon Moore gap & the multicore crisis)
36
What is the reason for the paradox? Moore's law is not applicable to all aspects of VLSI. The Gordon Moore curve does not indicate performance. The peak clock frequency does not indicate performance. (The law of Gates.)
37
Rapid decline of computational density [BWRC, UC Berkeley, 2004] (slide stolen from Bob Colwell). [Chart: SPECfp2000/MHz/billion transistors, 1990-2005, scale 0-200, for DEC Alpha, Sun, HP, IBM.] Alpha: down by 100 in 6 years; IBM: down by 20 in 6 years. The CPU memory wall, caches, ...; the primary design goal was avoiding a paradigm shift. A dramatic demo of the von Neumann syndrome.
38
Monstrous steam engines of computing: 5120 processors, 5000 pins each; crossbar weight 220 t; 3000 km of thick cable; larger than a battleship; power measured in tens of megawatts, floor space measured in tens of thousands of square feet; ready 2003.
39
Dead Supercomputer Society, 1985-1995 [Gordon Bell, keynote ISCA 2000]: ACRI, Alliant, American Supercomputer, Ametek, Applied Dynamics, Astronautics, BBN, CDC, Convex, Cray Computer, Cray Research, Culler-Harris, Culler Scientific, Cydrome, Dana/Ardent/Stellar/Stardent, DAPP, Denelcor, Elexsi, ETA Systems, Evans and Sutherland Computer, Floating Point Systems, Galaxy YH-1, Goodyear Aerospace MPP, Gould NPL, Guiltech, ICL, Intel Scientific Computers, International Parallel Machines, Kendall Square Research, Key Computer Laboratories Research, MasPar, Meiko, Multiflow, Myrias, Numerix, Prisma, Tera, Thinking Machines, Saxpy, Scientific Computer Systems (SCS), Soviet Supercomputers, Supertek, Supercomputer Systems, Suprenum, Vitesse Electronics.
40
We are in a computing crisis. [Table: platform | hardware cost $/Gflops | cost factor | energy W/Gflops | energy factor] MDgrape-3* (domain-specific, 2004): 15 | 1 | 0.2 | 1. Pentium 4: 400 | 27 | 14 | 70. Earth Simulator (supercomputer, 2003): 8000 | 533 | 128 | 640. *) feasible also with an rDPA. Microprocessor crisis: going multicore. Supercomputing crisis: MPP parallelism does not scale.
41
Syndrome
42
The von Neumann paradigm trap [Burks, Goldstine, von Neumann; 1946]: a program counter (auto-increment, jump, goto, branch); a datapath unit with ALU etc.; an I/O unit, ...; RAM (memory cells have addresses ...). CS education got stuck in this paradigm trap, which stems from the technology of the 1940s. We need a dual paradigm approach: CS education's right eye is blind, and its left eye suffers from tunnel view.
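The trap is easy to state in code. Below is a minimal, purely illustrative sketch (not any real ISA) of the cycle the slide describes: every operation, however trivial, is funneled through an instruction fetch governed by the program counter.

```python
# Illustrative von Neumann interpreter: one sequencer (the program counter),
# and every action is triggered by an instruction fetch.
def run(program, memory):
    pc = 0                                   # program counter
    while pc < len(program):
        op, a, b, dst = program[pc]          # instruction fetch: the bottleneck
        if op == "add":
            memory[dst] = memory[a] + memory[b]
        elif op == "goto":
            pc = a                           # jump: pc is overwritten
            continue
        pc += 1                              # auto-increment
    return memory

mem = run([("add", "x", "y", "z")], {"x": 2, "y": 3, "z": 0})
print(mem["z"])   # 5: even a single addition costs a fetch through the pc
```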
43
What is the reason for the paradox? It is the result of decades of tunnel view in CS R&D and education: the basic mind set is completely wrong; this is the von Neumann syndrome. "The CPU is the most flexible platform"? >1000 CPUs running in parallel are the most inflexible platform. FPGAs and rDPAs, however, are very flexible. The "Law of More": drastically declining programmer productivity.
44
multicore
45
Understanding the paradox? An executive summary doesn't help: we must first understand the nature of the paradigm. (Von Neumann chickens?)
46
47
models
48
Von Neumann: CPU = program counter + DPU, attached to RAM memory. Term: CPU; program counter: yes; execution triggered by: instruction fetch; paradigm: instruction-stream-based. This is the world of software engineering; the program source is software (tunnel view with the left eye).
49
Von Neumann is not the common model. Mainframe age: the von Neumann instruction-stream-based machine (program counter + DPU = CPU, RAM memory, the von Neumann bottleneck); software only. Microprocessor age: a CPU plus co-processors and accelerators, i.e. instruction-stream-based software plus data-stream-based hardware.
50
Here is the contemporary common model: the same picture, extended. Now we are in the configware age: a CPU, hardwired accelerators, and reconfigurable accelerators.
51
Machine models. [Table: term | program counter | execution triggered by | paradigm] CPU: yes | instruction fetch | instruction-stream-based. DPU**: no | data arrival* | data-stream-based. Von Neumann machine: program counter + CPU + RAM memory. Anti machine: RAMs with data counters feeding a DPU or rDPU. *) "transport-triggered". **) has no program counter: no instruction fetch at run time.
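A hedged sketch of the table's right-hand column (illustrative names, not a published interface): memories with data counters stream operands to a DPU that fires on data arrival, with no program counter anywhere.

```python
# Anti-machine sketch: data counters sit at the memories (ASMs), and the
# DPU is transport-triggered: it executes whenever operands arrive.
def asm_stream(memory, address_sequence):
    """Auto-sequencing memory: a data counter walks a compiled address sequence."""
    for addr in address_sequence:    # data counter, located at the memory
        yield memory[addr]

def dpu(stream_a, stream_b):
    """Datapath unit: no instruction fetch; one operation per data arrival."""
    for a, b in zip(stream_a, stream_b):
        yield a + b

mem_a = [1, 2, 3, 4]
mem_b = [10, 20, 30, 40]
result = list(dpu(asm_stream(mem_a, range(4)), asm_stream(mem_b, range(4))))
print(result)   # [11, 22, 33, 44]
```

Note that the "schedule" lives entirely in the address sequences, matching row 5 of the table: sequencing by data counters, not by a program counter.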
52
53
Nick Tredennick's paradigm shifts: early historic machines: algorithm fixed, resources fixed. Von Neumann: one programming source needed (software, for the CPU): algorithm variable, resources fixed. (Slowly preparing to use both eyes, for a dual-paradigm point of view.)
54
Compilation, software side (software engineering): source program -> software compiler -> software code, i.e. an instruction schedule (Befehls-Fahrplan, "instruction timetable"), sequential (von Neumann model).
55
Nick Tredennick's paradigm shifts, continued: early historic machines: algorithm fixed, resources fixed. Von Neumann: one programming source needed (software): algorithm variable, resources fixed. Reconfigurable computing: two programming sources needed: configware (resources variable) and flowware (algorithm variable).
56
Configware compilation (configware engineering): source "program" (C, FORTRAN, MATLAB) -> configware compiler, consisting of a mapper (placement & routing) emitting configware code, and a scheduler (programming the data counters) emitting flowware code. The data streams flow through an rDPA pipe network, fed by ASMs (auto-sequencing memories: data counter + GAG + RAM). Configware compilation is fundamentally different from software compilation.
57
The first archetype machine model: the mainframe with its CPU; procedural personalization by compiling or assembling; RAM-based personalization; an instruction-stream-based mind set ("von Neumann"). The software industry's secret of success: a simple basic machine paradigm. But now we live in the configware age.
58
systolic
59
Synthesis method? Of course algebraic (linear projection), but then only for applications with regular data dependencies: mathematicians caught by their own paradigm trap (a reductionist approach). Rainer Kress discarded their algebraic synthesis methods and replaced them with simulated annealing: the rDPA, 1995. The super-systolic array: a generalization of the systolic array.
60
Having introduced data streams (~1980, H. T. Kung): the systolic array, a DPA (pipe network); execution is transport-triggered; no memory wall. Input and output data streams are schedules over time and port number. Systolic array research throughout the '80s: a mathematicians' hobby. The road map to HPC: ignored for decades.
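The pipe-network idea can be simulated in a few lines. The sketch below (illustrative only, not Kung's notation) models the result of a systolic 3-tap FIR filter: each cell holds one weight, samples stream past, and one finished partial sum leaves the pipe per beat.

```python
# Systolic-style FIR: each of `taps` cells holds one weight; at every beat a
# new sample enters and a completed sum leaves the end of the pipe.
def systolic_fir(weights, samples):
    taps = len(weights)
    padded = [0] * (taps - 1) + list(samples)   # the pipe starts empty
    out = []
    for t in range(len(samples)):               # one beat per input sample
        acc = 0
        window = padded[t : t + taps]           # the samples currently in the pipe
        for w, x in zip(weights, reversed(window)):
            acc += w * x                        # one MAC per cell, summed along the pipe
        out.append(acc)
    return out

print(systolic_fir([1, 2, 3], [1, 0, 0, 1]))   # [1, 2, 3, 1]
```

There is no memory access inside the loop body beyond the streaming window itself, which is the "no memory wall" point of the slide.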
61
Who generates the data streams? Mathematicians: "it's not our job" (it's not algebraic). "Systolic."
62
Without a sequencer ... it's not a machine (machine = resources + sequencer). With their reductionist approach ("it's not our job"), the mathematicians failed to invent the new machine paradigm ... the anti machine.
63
The counterpart of the von Neumann machine: the Kress/Kung anti machine, a coarse-grained (r)DPA fed by ASMs (ASM: auto-sequencing memory, data counter + GAG + RAM). Data counters instead of a program counter; the data counters are located at the memories, not at the data path.
64
Acceleration mechanisms of the ASM-based MoM: parallelism by a multi-bank memory architecture; reconfigurable address computation, done before run time; avoiding multiple accesses to the same data; avoiding memory cycles for address computation; improving parallelism by storage-scheme transformations; minimizing data movement across chip boundaries.
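One mechanism from the list, parallelism by a multi-bank memory with a storage-scheme transformation, can be sketched as follows. This is an illustration under assumptions (four banks, modulo interleaving); it is not the MoM's actual storage scheme.

```python
# Modulo-interleaved storage scheme: element i lives in bank i % BANKS, so a
# stride-1 burst of BANKS elements touches every bank exactly once and can be
# read in a single parallel cycle.
BANKS = 4

def interleave(data):
    """Distribute a linear array across BANKS memory banks."""
    banks = [[] for _ in range(BANKS)]
    for i, value in enumerate(data):
        banks[i % BANKS].append(value)
    return banks

def fetch_burst(banks, start, count):
    """Read `count` consecutive elements; each comes from a different bank."""
    return [banks[(start + k) % BANKS][(start + k) // BANKS] for k in range(count)]

banks = interleave(list(range(16)))
print(fetch_burst(banks, 4, 4))   # [4, 5, 6, 7], one word per bank
```

The point of the transformation is that the bank index and in-bank address are fixed functions of i, so they can be computed before run time, matching "reconfigurable address computation before run time" above.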
65
FPGAs in supercomputing: synergisms. Coarse-grained parallelism through conventional parallel processing (CPUs: program counter + DPU; DataPath Units, 32/64 bit), and fine-grained parallelism through direct configware execution on the FPGAs (reconfigurable logic boxes, 1 bit; millions of rLBs embedded in a reconfigurable interconnect fabric).
66
The anti machine (resources + sequencers). Hardwired anti machine: memory with data counters; algorithm source: flowware. Reconfigurable anti machine: memory with data counters; algorithm sources: flowware and configware.
67
The von Neumann machine (machine = resources + sequencer): memory with a program counter; algorithm source: software.
68
The clash of paradigms (microprocessor age: µprocessor + accelerators). The programmer's basic mind set is procedural and instruction-stream-based; the hardware guy's is structural, a kind of data-stream-based mind set: the software/hardware chasm. A programmer does not understand function evaluation without machine mechanisms, without a program counter ... We need a data-stream-based machine paradigm.
69
Xputer principles: reconfigurable address generators (ASMs) plus a reconfigurable data path (rALU: the DPLA), vs. the CPU. Contemporary? 1984: the first FPGAs were very tiny and very expensive; we used the VAX-11/750 of my group.
70
super
71
72
Outline:
- The von Neumann Paradigm
- Accelerators and FPGAs
- The Reconfigurable Computing Paradox
- The new Paradigm
- Coarse-grained
- Bridging the Paradigm Chasm
- Conclusions
73
dynamic
74
FPGA modes of operation: simple, static reconfigurability. Configware code is loaded from external flash memory, e.g. after power-on (~milliseconds). Legend: C ph = configuration phase, E ph = execution phase. (Requiring new OS principles.)
75
Dynamically reconfigurable: an established R&D area. Swapping and scheduling of relocatable configware code macros (modules X, Y, Z alternating configuration and execution phases; X configures Y) is managed by a configware operating system (partial reconfiguration). A configware OS is fundamentally different from a software OS. Reconfigurable computing at Microsoft: "Microsoft ReconVista"?
76
Outline (repeated)
77
Reconfigurable HPC: this area is almost 10 years old.
79
We have to re-think basic assumptions behind computing. Instead of physical limits, fundamental misconceptions of algorithmic complexity theory limit progress and will necessitate new breakthroughs. It is not processing that is costly, but moving data and messages.
80
Illustrating the von Neumann paradigm trap: the watering pot model [Hartenstein]. The instruction-stream-based approach has the von Neumann bottleneck; the data-stream-based approach (many watering pots) has no von Neumann bottleneck.
81
82
Outline:
- The (non-von-Neumann) anti machine (Xputer)
- Speed-up by address generators
- Data-procedural Programming Language
- Generalization of the Systolic Array
- Partitioning Compilation Techniques
- Design Space Exploration
- Bridging the Paradigm Chasm
83
More compute power from configware than from software: 75% of all (micro)processors are embedded (4 : 1). 25% of embedded µprocessors are accelerated by FPGA(s), 1 : 4 (a very cautious estimate), trending to 1 : 1, i.e. every 2nd µprocessor accelerated by FPGA(s). Average acceleration factor >2, so rMIPS : MIPS > 2 (rMIPS: MIPS replaced by FPGA compute power; the difference is probably an order of magnitude). Conclusion: most compute power comes from configware.
84
Xputer Lab (around 1990)
85
anti
86
Programming language paradigms: the principles of MoPL [1994]; very easy to learn; multiple GAGs.
87
Avoiding the paradigm shift? "It is feared that domain scientists will have to learn how to design hardware. Can we avoid the need for hardware design skills and understanding?" (Tarek El-Ghazawi, panelist at SuperComputing 2006). "A leap too far for the existing HPC community" (panelist Allan J. Cantle). SuperComputing, Nov 11-17, 2006, Tampa, Florida: over 7000 registered attendees and 274 exhibitors. We need a bridge strategy: developing advanced tools for training the software community to think in fine-grained parallelism and pipelining techniques. A shorter leap: coarse-grained platforms, which allow a software-like pipelining perspective.
88
Outline (repeated)
89
We need a new machine paradigm: a programmer does not understand function evaluation without machine mechanisms, without a program counter ... We urgently need a data-stream-based machine paradigm and mind set. It was prepared almost 30 years ago: data streams.
90
Generic address generator (GAG): a generalization of the DMA (data counter + GAG). The GAG and its enabling technology were published in 1989 (survey: [M. Herz et al.: IEEE ICECS 2003, Dubrovnik]) and patented by TI in 1995; storage-scheme optimization methodology, etc. Acceleration factors come from address computation without memory cycles, avoiding e.g. 94% address-computation overhead (software-to-Xputer migration).
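A GAG can be pictured as a descriptor-driven nested-loop address walker. The sketch below is illustrative only (the descriptor fields are assumptions, not the published GAG interface): the descriptor is fixed before run time, and at run time addresses stream out with no per-access CPU arithmetic.

```python
# Descriptor-driven generic address generator (GAG): base address plus
# per-dimension extents and strides are fixed before run time; at run time
# the generator streams out the address sequence.
def gag(base, extents, strides):
    """Yield the address sequence of a nested loop (e.g. a 2-D block scan)."""
    def walk(dim, addr):
        if dim == len(extents):
            yield addr
            return
        for i in range(extents[dim]):       # one data counter per dimension
            yield from walk(dim + 1, addr + i * strides[dim])
    yield from walk(0, base)

# A 2x3 block starting at address 100 inside a row-major array of width 10:
print(list(gag(100, (2, 3), (10, 1))))   # [100, 101, 102, 110, 111, 112]
```

Block scans, stride scans, and similar access patterns all reduce to choosing extents and strides, which is why the slide can claim address computation without memory cycles.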
91
The 2nd "archetype" machine model: the reconfigurable accelerator; structural personalization by compiling; RAM-based personalization; a data-stream-based mind set ("Kress-Kung"). The configware industry's secret of success: again a simple basic machine paradigm.
92
Outline (repeated)
93
A symptom of the von Neumann syndrome, yielded by a single-paradigm mind set. After the talk (on the KressArray: 10 x 16 = 160 rDPUs, supersystolic, mainly a pipe network; rDPU: reconfigurable DataPath Unit, e.g. 32 bits wide; no CPU; note the software perspective without instruction streams), a high-level R&D manager of a large Japanese IT industry group asked: "but you can't implement decisions!" Yet an if clause simply turns into a multiplexer. Executive summary? Forget it! How about a microprocessor giant having >100 vice presidents?
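The answer to the manager's objection ("an if clause turns into a multiplexer") can be illustrated directly. In a pipe network both branches are computed by the datapath and the condition merely selects one result; the Python below is only a behavioral sketch of that idea.

```python
def mux(sel, a, b):
    """Behaves like a hardware 2:1 multiplexer: sel picks one of two computed values."""
    return a if sel else b

def abs_stream(xs):
    """|x| as a pipe network would do it: compute both x and -x, select with a mux."""
    return [mux(x < 0, -x, x) for x in xs]

print(abs_stream([-3, 5, -1]))   # [3, 5, 1]
```

In hardware there is no branching at all: both operands arrive every cycle and the select wire routes one of them onward, so decisions cost a multiplexer, not a control-flow mechanism.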
94
Dual-paradigm application development (Juergen Becker's CoDe-X, 1996): a C language source goes through a partitioner into an SW compiler (for the CPU) and a CW compiler (for the rDPU), with automatic parallelization by loop transformations generating a pipe network, followed by placement and routing.
95
Hybrid multi-core example: a twin-paradigm machine with 64 cores; each core can run in CPU mode or in rDPU mode. How about a microprocessor giant having >100 vice presidents? What if the customer refuses the paradigm shift, or is disabled for the paradigm shift?
96
Compilation for a dual-paradigm multicore (Juergen Becker's CoDe-X, 1996): a C language source goes through a partitioner into an SW compiler and a CW compiler, compiling to the hybrid multicore; automatic parallelization by loop transformations generating a pipe network; placement and routing.
97
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 97 Outline The von Neumann Paradigm Accelerators and FPGAs The Reconfigurable Computing Paradox The new Paradigm Coarse-grained Bridging the Paradigm Chasm Conclusions
98
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 98 Here is the common model program counter DPU CPU RAM memory von Neumann bottleneck von Neumann instruction-stream- based machine co-processors accelerator CPU instruction- stream- based data- stream- based hardware software mainframe age: microprocessor age: configware age: CPU accelerator reconfigurable software/configware co-compiler software configware accelerator reconfigurable accelerator hardwired CPU
99
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 99 Outline The von Neumann Paradigm Accelerators and FPGAs The Reconfigurable Computing Paradox The new Paradigm Coarse-grained Bridging the Paradigm Chasm Conclusions
100
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 100 Multi Core: Just more CPUs ? Complexity and clock frequency of single- core microprocessors come to an end Without a paradigm shift just more CPUs on chip lead to the dead roads known from supercomputing Multi-core microprocessor chips emerging: soon 32 cores on an AMD chip, and 80 on an intel Multi-threading is not the silver bullet We’ve to re-think basic assumptions behind computing
101
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 101 Solution not expected from CS officers We need mutual efforts, like EE/CS cooperation known from the Mead & Conway revolution Progress of the joint task force on CS curriculum recommendations is extremely disillusioning For RC other motivations are similarly high-grade: growing cost and looming shortage of energy. The personal supercomputer: a far-ranging massive push of innovation in all areas of science and economy: by Reconfigurable Computing it‘s more like a lobby: „my area is the most important“
102
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 102 The Computing Sciences are in a severe crisis. We urgently need to shape the Reconfigurable Computing revolution to enable going toward incredibly promising new horizons of affordable highest-performance computing. This cannot be achieved with the classical software-based mindset. We need a new dual-paradigm approach. Watch out not to get screwed! Supercomputing titans may be your enemies.
103
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 103 The Configware Age Mainframe age and microprocessor (-only) age are history We are living in the configware age right now! Attempts to avoid the paradigm shift will again create a disaster
104
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 104 thank you for your patience
105
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 105 overhead
106
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 106
107
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 107 Von Neumann vs. anti machine

 # | feature                 | von Neumann machine   | anti machine (hardwired / reconfigurable)
---|-------------------------|-----------------------|-------------------------------------------
 1 | machine code schedules  | instruction stream    | data streams
 2 | # of program sources    | 1                     | 2
 3 | source 1                | none                  | configware
 4 | source 2                | software              | flowware
 5 | sequenced by            | program counter       | data counter(s)
 6 | counter co-located with | PU (data path): CPU   | memory block: ASM
 9 | inter-PU communication  | common memory         | piped through
10 | data meeting the PU     | move data at run time | move locality of execution at compile time
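Row 10 of the table can be sketched in a few lines of Python (an illustrative model only; function names and parameters are invented here): on the von Neumann machine the instruction stream computes every data address at run time, while on the anti machine an auto-sequencing memory (ASM) with a data counter emits the data stream and the datapath only consumes it.

```python
def vn_sum(mem, base, n, stride):
    """von Neumann style: address arithmetic inside the loop body."""
    s = 0
    for i in range(n):            # sequenced by the program counter
        addr = base + i * stride  # run-time address computation overhead
        s += mem[addr]
    return s

def asm_stream(mem, base, n, stride):
    """Anti machine style: the ASM's data counter generates the stream."""
    addr = base                   # data counter, stepped without instructions
    for _ in range(n):
        yield mem[addr]
        addr += stride

def am_sum(stream):
    """The datapath never sees an address, only the data stream."""
    return sum(stream)
```

Both compute the same result; the difference is where the address sequencing lives, which is exactly the duality the table states.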
108
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 108 Overhead avoided by the anti machine

 # | overhead at run time      | von Neumann machine | anti machine (hardwired / reconfigurable)
---|---------------------------|---------------------|-------------------------------------------
11 | state address computation | instruction stream  | none
12 | data address computation  | instruction stream  | none
13 | inter-PU communication    | instruction stream  | none
14 | instruction fetch         | instruction stream  | none
15 | data meeting the PU       | instruction stream  | none
109
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 109 GAG
110
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 110 MoM Scan Window (MoMSW) illustration. MoM architectural primary features: multiple (typically 3) vari-size reconfigurable MoMSW scan windows; each MoMSW controlled by a reconfigurable GAG (generic address generator); 2-dimensional (data) memory address space. ASM: Auto-Sequencing Memory.
111
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 111 CGFFT: Parallel Scan Pattern Animation: MoM-3 with 3 vari-size scan windows (figure labels: datapath; ASM: Auto-Sequencing Memory).
112
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 112 Reconfigurable Generic Address Generator (GAG): a generalization of the DMA data counter. The GAG and its enabling technology were published in 1989 (survey: [M. Herz et al.: IEEE ICECS 2003, Dubrovnik]) and patented by TI in 1995; they include a storage scheme optimization methodology, etc. Acceleration factors by: address computation without memory cycles, avoiding e.g. 94% of address computation overhead; supporting scratchpad optimization strategies (smart d-caching).
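A software model of the idea, assuming a simple rectangular scan window over row-major memory (the function and its parameters are hypothetical, for illustration): the GAG is configured before run time and then produces the whole address sequence with nothing but counters and adds, so no memory cycles and no instruction stream are spent on address computation.

```python
def gag_block_scan(base, row_pitch, width, height):
    """Yield linear addresses of a width x height scan window at 'base'
    in a row-major 2-D memory with the given row pitch."""
    row_addr = base
    for _ in range(height):
        addr = row_addr
        for _ in range(width):
            yield addr            # one address per data word, no ALU code
            addr += 1
        row_addr += row_pitch     # step the data counter to the next line
```

For a 3-wide, 2-high window at address 10 in a memory with row pitch 8 this yields 10, 11, 12, 18, 19, 20: the complete access schedule, known at configuration time.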
113
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 113 GAG: 2-D Generic Data Sequence Examples
114
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 114 GAG Slider Operation: Demo Example (figure: address space from base B to floor F).
115
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 115 XMDS Scan Pattern Editor GUI
116
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 116 JPEG zigzag scan pattern (data counter demo; MoPL source):

*> Declarations
EastScan is step by [1,0] end EastScan;
SouthScan is step by [0,1] end SouthScan;
NorthEastScan is
  loop 8 times until [*,1] step by [1,-1] endloop
end NorthEastScan;
SouthWestScan is
  loop 8 times until [1,*] step by [-1,1] endloop
end SouthWestScan;
HalfZigZag is
  EastScan
  loop 3 times
    SouthWestScan SouthScan NorthEastScan EastScan
  endloop
end HalfZigZag;

goto PixMap[1,1]
HalfZigZag;
SouthWestScan
uturn (HalfZigZag)
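The scan pattern the MoPL declarations describe can be rendered in Python for comparison (a 0-based sketch, not generated by any MoPL tool): walk the anti-diagonals of an n-by-n block, alternating direction, exactly as a JPEG zigzag data counter would.

```python
def zigzag(n=8):
    """Return the (x, y) visiting order of an n x n zigzag scan."""
    order = []
    for d in range(2 * n - 1):               # anti-diagonal index: x + y
        cells = [(x, d - x) for x in range(n) if 0 <= d - x < n]
        if d % 2 == 1:
            cells.reverse()                  # odd diagonals run down-left
        order.extend(cells)
    return order
```

Shifting each coordinate by +1 gives the 1-based sequence the MoPL program produces from PixMap[1,1]: (1,1), (2,1), (1,2), (1,3), (2,2), (3,1), …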
117
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 117 Significance of reconfigurable MoMSW scan windows: Scan windows have the potential to drastically reduce traffic to and from slow off-chip memory. No instruction streams are needed to implement scratchpad optimization strategies using fast on-chip memory. Scan windows may contribute to speed-up by a factor of 10, and sometimes even much more. Scan windows are the deterministic alternative ("d-caching") to indeterministic, speculative classical cache usage: performance can be well predicted. For data-stream-based computing scan windows are highly effective, whereas classical caches are entirely useless.
118
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 118 Linear filter application, with an example image of x=22 by y=11 pixels. Design steps: initial design; after inner scan line loop unrolling; after scan line unrolling; hardware-level access optimization; final design (parallelized merged buffer). Speed-up factor >11 due to MoMSW-based d-caching and storage scheme optimization.
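A back-of-the-envelope sketch of why the scan window acts as a deterministic data cache for a k-by-k linear filter (assumed counting model, not the slide's exact design): without on-chip buffering every window position re-fetches all k*k pixels from off-chip memory, while a scan window with line buffers fetches each pixel across the chip boundary exactly once.

```python
def offchip_reads_naive(w, h, k=3):
    """Every k x k window position re-fetches all k*k pixels."""
    return (w - k + 1) * (h - k + 1) * k * k

def offchip_reads_windowed(w, h, k=3):
    """A scan window with k line buffers fetches each pixel once."""
    return w * h
```

For the 22-by-11 example image and k=3 this alone cuts off-chip reads from 1620 to 242, roughly a factor of 7; the slide's >11 speed-up additionally includes storage scheme optimization.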
119
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 119 PISA-MoM
120
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 120 Processing 4-by-4 reference patterns. Mead & Conway nMOS design rules: 256 4-by-4 reference patterns; Mead & Conway CMOS design rules: >800. Reference patterns are automatically generated from the design rules. MoM: all reference patterns matched in a single clock cycle. vN software: some reference patterns can be skipped, depending on earlier patterns. DPLA: fabricated by the E.I.S. multi-university project for the PISA DRC accelerator [ICCAD 1984]; in 1984, 1 DPLA replaced 256 FPGAs. PISA: a forerunner of the reconfigurable MoM accelerator.
121
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 121 Speed-up by MoM-1 compared to 68020 PISA project
122
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 122 Speed-up by MoM-3 compared to SPARC 10/51
123
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 123 1985 – 1990: Multimedia & DSP: MoM-3 speedup
124
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 124 Outline The (non-v-N) anti-machine (Xputer) Speed-up by address generators Data-procedural Programming Language Generalization of the Systolic Array Partitioning Compilation Techniques Design Space Exploration Bridging the Paradigm Chasm
125
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 125 Significance of address generators: they have the potential to reduce computation time significantly. In a grid-based design rule check a speed-up of more than 2000 has been achieved*; reconfigured address generators contributed a factor of 10 by avoiding memory cycles for address computation overhead. *) 15,000 if the same algorithm is used
126
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 126 hardware vs. software perspective. Columns: platform; hardware perspective (data-stream-driven); software perspective (instruction-stream-driven); flexibility; performance potential.
1 single paradigm: simple FPGA** : XX +++++
2 single paradigm: µprocessor & multi core : XX +++-
3 single paradigm: coarse-grained : XX +++++
4 single paradigm: platform FPGA, 1 & (2)* & 3 : XXX (X)* +++++++
5 dual paradigm: 1 & 2 : XXXX ++
6 dual paradigm: 2 & 4 : XXXX +++++++
7 dual paradigm: 2 & 3 : XXX ++++
8 dual paradigm: reconfigurable instr. set : XXX ++++
*) with soft cores and/or on-chip microprocessor **) without soft cores; for software people
127
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 127 Ingredients (all multi-core, on-chip): simple FPGA (rLB, soft CPU); platform FPGA (rDPU, BRAM, CPU, rLB, hardwired special functions, soft CPU); coarse-grained array (rDPU, BRAM, ASM); anti machine / Xputer (data counter, ASM; Kress/Kung machine); CPU with program counter and RAM, for running legacy software; CPU with reconfigurable instruction set extension (rDPU, rLB).
128
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 128 Which perspective, and what expertise is needed? Hardware perspective: simple FPGA (fine-grained); platform FPGA (domain-specific core assortment embedded in FPGA fabrics); coarse-grained reconfigurable array; reconfigurable instruction set processor. Software perspective: von Neumann microprocessor (also multi-core). The mishmash model: a nightmare for undergraduate studies, but by far the best optimization potential.
129
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 129 Objectives: flexibility (for accelerators), avoiding specific silicon; rapid prototyping, field-patching, emulation; cheap, compact vHPC for every area which needs it.
130
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 130 Conclusion (1): Reconfigurable Computing opens many spectacular new horizons: cheap vHPC without needing specific silicon, no masks, …; massive reduction of the electricity bill, locally and nationally; cheap embedded vHPC; the cheap desktop supercomputer (a new market); fast and cheap prototyping; replacing expensive hardwired accelerators; supporting fault tolerance, self-repair and self-organization; flexibility for systems with unstable multiple standards, by dynamic reconfigurability; emulation logistics for very-long-term spare part provision and part type count reduction (automotive, aerospace, …).
131
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 131 Conclusion (2): Needed: a universal vHPC co-architecture demonstrator. The compilation tool problem, the language selection problem, and the education backlog problems have to be solved. Use this to develop a very good high school and undergraduate lab course. A motivator: preparing for the Top 500 contest. For widely spreading its use successfully: select killer applications for demos.
132
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 132 Conclusion: most compute power comes from configware, not software. 75% of all (micro)processors are embedded (4 : 1). Today 25% of embedded µProcs are accelerated by FPGA(s), i.e. 1 : 4 (a very cautious estimation**) -> 1 : 1 -> soon every 2nd µProc accelerated by FPGA(s). With an average acceleration factor >2 this gives rMIPS* : MIPS > 2. *) rMIPS: MIPS replaced by FPGA compute power. **) the difference is probably an order of magnitude.
133
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 133 Conclusion (3): a self-repair and self-organization methodology; an embedded r-emulation logistics methodology; a universal vHPC co-architecture demonstrator. For widely spreading its use successfully: select a killer application for demo.
134
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 134 Some goals: a universal HPC co-architecture for embedded vHPC (nomadic, automotive, …) and desktop vHPC (scientific computing, …). An application co-development environment for hardware non-experts; acceptability by software-type users. Meet product lifetime >> embedded system life: FPGA emulation logistics from development down to maintenance and repair stations (examples: automotive, aerospace, industrial, …).
135
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 135 SuperComputing 06 (Nov 11-17, 2006, Tampa, Florida; over 7000 registered attendees and 274 exhibitors). Panel: Is High-Performance Reconfigurable Computing the Next Supercomputing Paradigm?
- Tarek El-Ghazawi, The George Washington University: Is High-Performance Reconfigurable Computing the Next Supercomputing Paradigm?
- Dave Bennett, Xilinx, Inc.: Reconfigurable Computing: The Future of HPC
- Daniel S. Poznanovic, SRC Computers, Inc.: Is High-Performance Reconfigurable Computing the Next Supercomputing Paradigm?
- Allan J. Cantle, Nallatech Ltd.: Challenges for Reconfigurable Computing in HPC
- Keith D. Underwood, Sandia National Laboratories: Reconfigurable Computing - Are We There Yet?
- Rob Pennington, National Center for Supercomputing Applications: Reconfigurable Computing: The Road Ahead
- Duncan Buell, University of South Carolina: Opportunities and Challenges with Reconfigurable HPC
- Alan D. George, University of Florida
138
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 138 Acceleration mechanisms of the ASM-based MoMSW: parallelism by a multi-bank memory architecture; reconfigurable address computation, done before run time; avoiding multiple accesses to the same data; avoiding memory cycles for address computation; improved parallelism by storage scheme transformations; minimized data movement across chip boundaries.
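The storage scheme transformation point can be made concrete with a small sketch (the interleaving layout below is an assumed textbook scheme, not the MoM's actual one): with n interleaved banks and bank = address mod n, a column access whose stride equals the bank count hits a single bank and is serialized, while skewing the layout so the stride is coprime to the bank count spreads the same access over all banks in parallel.

```python
def banks_hit(addresses, n_banks):
    """Which interleaved bank each address falls into (bank = addr mod n)."""
    return [a % n_banks for a in addresses]

# column access, stride 4, 4 banks: every word lands in bank 0 (serialized)
column = [i * 4 for i in range(4)]

# skewed storage scheme: row pitch 5 is coprime to 4, one word per bank
column_skewed = [i * 5 for i in range(4)]
```

This is exactly what a compile-time storage scheme optimization buys: the same logical access pattern, but conflict-free bank parallelism.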
142
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 142 C or FORTRAN? Gordon Bell: Computer scientists haven't been interested in programming clusters. If putting the cluster on a chip is what excites them, fine. It will still have to run Fortran! Reiner Hartenstein (conclusion of this talk): or C (X-C), it's a shorter leap. Support tools* have been demonstrated by academia. Classical programming languages, but with a slightly different (data-procedural) semantics, are good candidates for parallel programming. *) like CoDe-X
143
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 143 Newton's 1st Law à la Gordon Bell: scientists do not change their direction.
144
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 144 Edu defic
145
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 145 Dual paradigm: an old hat. Software mind set, instruction-stream-based: flow chart -> control instructions (FSM: state transitions). Mapped into a hardware mind set: action box = flip-flop (FF), decision box = (de)multiplexer, evoked by a token bit. Register Transfer Modules (DEC, mid-1970s); a similar concept at Case Western Reserve University.
146
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 146 Dual paradigm: an old hat (2). "It is so simple! Why did it take 25 years to find out?" Hardware Description Language scene, ~1970: because of the reductionists' tunnel view; because of a lack of transdisciplinary thinking.
147
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 147 Dual paradigm: an old hat (3). The same construct, call Module-name (parameters); is a "procedure call" or function call in software (time domain) and a hardware description in Hardware Description Languages (space domain).
148
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 148 ASM
149
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 149
150
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 150 Co-comp
151
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 151
152
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 152 Apropos HiPEAC: Software / Configware Co-Compilation (CoDe-X, 1996). A C language source enters a partitioner; the SW compiler targets the CPU, the CW compiler targets the rDPU; automatic parallelization by loop transformations. (Figure: the von Neumann instruction-stream-based machine, CPU with program counter, DPU, RAM memory and von Neumann bottleneck, with accelerator co-processors; mainframe age: hardwired; microprocessor age: software plus hardwired accelerator; configware age: software plus reconfigurable accelerator, served by a software/configware co-compiler.)
153
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 153 Jürgen Becker's CoDe-X-1 co-compiler: an analyzer/profiler and partitioner split the X-C source between the GNU C compiler (computer machine paradigm: CPU, also running legacy software) and the X-C compiler (anti machine paradigm: Xputer). X-C is the C language extended by MoPL. rALU: array size 1-by-1.
154
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 154 Jürgen Becker's CoDe-X-2 co-compiler: an analyzer/profiler and partitioner split the X-C source between the GNU C compiler (computer machine paradigm: CPU) and the X-C compiler with DPSS (anti machine paradigm: rDPU), driven by resource parameters supporting the KressArray family. X-C is the C language extended by MoPL. Pipelining: a shorter leap.
155
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 155 Jürgen Becker's CoDe-X-2 co-compiler on a heterogeneous multi-core with dual-mode cores: CPU mode vs. rDPU mode.
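The partitioning idea behind these co-compilers can be caricatured in a few lines (this is an invented toy, not CoDe-X's actual analysis; the threshold, kernel names and timings below are illustrative assumptions): kernels whose profiled share of the run time exceeds a threshold are mapped to configware on the reconfigurable datapath, everything else stays as software on the CPU.

```python
def partition(profile, threshold=0.10):
    """Toy software/configware partitioner: profile maps kernel name to
    profiled run time; hot kernels go to configware, the rest to software."""
    total = sum(profile.values())
    configware, software = [], []
    for kernel, t in profile.items():
        (configware if t / total >= threshold else software).append(kernel)
    return {"configware": sorted(configware), "software": sorted(software)}

# hypothetical profile of an application
prof = {"fir_loop": 70.0, "dct_loop": 25.0, "init": 3.0, "io": 2.0}
```

With this profile the two loop kernels (95% of the run time) land on the anti machine side, mirroring the analyzer/profiler plus partitioner front end of the slides.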
156
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 156 Why better
157
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 157
158
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 158 hardware vs. software perspective. Columns: platform; hardware perspective (data-stream-driven); software perspective (instruction-stream-driven); flexibility; performance potential.
1 single paradigm: simple FPGA** : XX +++++
2 single paradigm: µprocessor & multi core : XX +++-
3 single paradigm: coarse-grained : XX +++++
4 single paradigm: platform FPGA, 1 & (2)* & 3 : XXX (X)* +++++++
5 dual paradigm: 1 & 2 : XXXX ++
6 dual paradigm: 2 & 4 : XXXX +++++++
7 dual paradigm: 2 & 3 : XXX ++++
8 dual paradigm: reconfigurable instr. set : XXX ++++
*) with soft cores and/or on-chip microprocessor **) without soft cores; for software people
159
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 159 Data meeting the Processing Unit (PU): we have 2 choices. By software: routing the data by memory-cycle-hungry instruction streams through shared memory. By configware: placement of the execution locality, a pipe network generated by configware compilation. This partly explains the RC paradox.
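The configware choice can be sketched with a generator pipeline (an illustrative analogy only, not an RC tool): the pipe network is composed once, standing in for compile-time placement, and the data then streams through it with no instruction stream routing it through shared memory.

```python
def scale(stream, k):
    """One pipe stage: multiply every word of the stream by k."""
    for x in stream:
        yield k * x

def offset(stream, b):
    """Another pipe stage: add b to every word of the stream."""
    for x in stream:
        yield x + b

def run(pipe):
    """Drain the pipe network: data flows stage to stage, never back
    through a shared memory."""
    return list(pipe)
```

Composing offset(scale(source, 10), 5) fixes the "placement" once; each input word then simply flows through both stages.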
160
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 160 Data meeting the Processing Unit by Configware placement of the execution locality... … pipe network generated by configware compilation
161
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 161 conclus
162
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 162 thank you for your patience
163
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 163 END
165
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 165 Avoiding the paradigm shift? SuperComputing, Nov 11-17, 2006, Tampa, Florida (over 7000 registered attendees and 274 exhibitors). Tarek El-Ghazawi, panelist: "It is feared that domain scientists will have to learn how to design hardware. Can we avoid the need for hardware design skills and understanding?" Panelist Allan J. Cantle: "A leap too far for the existing HPC community". We need a bridge strategy: developing advanced tools for training the software community to think in fine-grained parallelism and pipelining techniques. A shorter leap: coarse-grained platforms which allow a software-like pipelining perspective.
166
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 166 Avoiding the paradigm shift? … the promise of almost unimagined computing power. Have the hardware developers raced too far ahead of many programmers' ability to create software? Parallel computing has been an esoteric skill limited to people involved with high-performance supercomputing. That is changing now that desktop computers and even laptops are going multicore. "High-performance computing experts have learned to deal with this, but they are a fraction of the programmers," Saied says. "In the future you won't be able to get a computer that's not multicore; as multicore chips become ubiquitous, all programmers will have to learn new tricks." Even in high-performance computing there are areas that aren't yet ready for the new multicore machines. "In industry, much of their high-performance code is not parallel," Saied says. "These corporations have a lot of time and money invested in their software, and they are rightly worried about having to re-engineer that code base."
167
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 167 Avoiding the paradigm shift? "Moore's Gap." Steve Kirsch, an engineering fellow for Raytheon Systems Co., says that multicore computing presents both the dream of infinite computing power and the nightmare of programming. "The real lesson here is that the hardware and software industries have to pay attention to each other," Kirsch says. "Their futures are tied together in a way that they haven't been in recent memory, and that will change the way both businesses will operate." In February, Intel released research details about a chip with 80 cores: a fingernail-sized chip with the same processing power that in 1996 required a supercomputer with a 2,000-square-foot footprint, using 1,000 times the electrical power. A problem for those who depend on previously written software that has been steadily improving and evolving over decades: "Our legacy software is a real concern to us." Parallel programming for multicore computers may require new computer languages: "Today we program in sequential languages. Do we need to express our algorithms at a higher level of abstraction? Research into these areas is critical to our success."
168
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 168 Avoiding the paradigm shift? "Our programming languages researchers are exploring new programming paradigms and models," Hambrusch says. "Our course on multicore architectures is also preparing students for future software development positions. Purdue is clearly playing a defining role in this critical technology." "In five or six years, laptop computers will have the same capabilities, and face the same obstacles, as today's supercomputers," Saied says. "This challenge will face people who program for desktop computers, too. People who think they have nothing to do with supercomputers and parallel processing will find out that they need these skills, too." Remote Direct Memory Access (RDMA) is a technology that allows computers in a network to exchange data in main memory without involving the processor, cache, or operating system of either computer. Like locally based Direct Memory Access (DMA), RDMA improves throughput and performance because it frees up resources; it also facilitates a faster data transfer rate. RDMA implements a transport protocol in the network interface card (NIC) hardware.
169
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 169 Avoiding the paradigm shift? Three ways to make multicore work, number 1: mathematics. Do more computational work with less data motion. E.g., higher-order methods: trade memory motion for more operations per word, producing an accurate answer in less elapsed time than lower-order methods. Different problem decompositions (no stratified solvers): the mathematical equivalent of loop fusion, e.g., nonlinear Schwarz methods. Ensemble calculations: compute ensemble values directly. It is time (really past time) to rethink algorithms for memory locality and latency tolerance. I didn't say threads; see, e.g., Edward A. Lee, "The Problem with Threads," Computer, vol. 39, no. 5, pp. 33-42, May 2006; Robert O'Callahan, "Night of the Living Threads," http://weblogs.mozillazine.org/roc/archives/2005/12/night_of_the_living_threads.html, 2005; John Ousterhout, "Why Threads Are A Bad Idea (for most purposes)" (~2004); Allen Holub, "If I were king: A proposal for fixing the Java programming language's threading problems," http://www128.ibm.com/developerworks/library/j-king.html, 2000. Allen Holub has been working in the computer industry since 1979; he is widely published in magazines (Dr. Dobb's Journal, Programmers Journal, Byte, MSJ, among others) and writes the "Java Toolbox" column for the online magazine JavaWorld.
Breaking the assumptions: (1) Don't have any off-chip memory. Consequence: need algorithms, programming models, and software tools that work in more limited memory (a few GB). (2) Have off-chip memory, but manage it more effectively. Consequence: need to find a true, general-purpose hardware/software model. (3) Overlap latency with split operations. Consequence: need to find massive amounts of concurrency and manage the programming challenges of split operations (these are hard for programmers to use correctly; may be an opportunity for formal methods). Multicore doesn't just stress bandwidth, it increases the need for perfectly parallel algorithms. All systems will look like attached processors: high latency, low (relative) bandwidth to main memory. 128 cores? "When [a] request for data from Core 1 results in an L1 cache miss, the request is sent to the L2 cache. If this request hits a modified line in the L1 data cache of Core 2, certain internal conditions may cause incorrect data to be returned to the Core 1." Everything does not double: traveling from New York to Chicago took 3 weeks before 1830, 1.5 days in 1857, and 6 hours now: only a factor of 6. MPI on multi-core: 340 ns MPI ping/pong latency; improvement will require better SWE tools. Benchmarks: ping-pong latency (ring-based ping-pong exchange between all nodes); nearest-neighbor ghost-area exchange (test code from Argonne used to evaluate one-sided and point-to-point operations); CPU availability (calculates the percentage of CPU available at the receiver by doing a fixed amount of work during message arrival).
170
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 170 in Memoriam Stamatis Vassiliadis 1951 - 2007 in Memoriam Richard Newton 1951 - 2007 in Memoriam …
171
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 171 KressArray Xplorer (platform design space explorer); DPSS published at ASP-DAC 1995. Components (figure): user interface; architecture editor and mapping editor; ALE-X compiler (ALE-X code, expression tree, intermediate forms 2 and 3); analyzer with statistical data, delay estimator, architecture estimator, power estimator (power data); mapper (design rules) and scheduler (data stream schedule); datapath generator (Kress rDPU layout); HDL generator (VHDL, Verilog HDL) and simulator; improvement proposal generator, suggestion selection, Xplorer inference engine (FOX); application set; KressArray family parameters.
172
© 2007, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 172 KressArray family generic fabrics: a few examples. Select nearest-neighbour (NN) interconnect (an example); select mode, number, and width of NN ports (2, 4, 8, 16, 24, 32); rich routing resources: more NN ports, rout-through only, or rout-through and function; select function repertory. Examples of 2nd-level interconnect: layouted over the rDPU cell, no separate routing areas! http://kressarray.de