The von Neumann Syndrome calls for a Revolution
HPRCTA'07 - First International Workshop on High-Performance Reconfigurable Computing Technology and Applications - in conjunction with SC07 - Reno, NV, November 11, 2007
Reiner Hartenstein, TU Kaiserslautern
About Scientific Revolutions
Thomas S. Kuhn: The Structure of Scientific Revolutions
Ludwik Fleck: Genesis and Development of a Scientific Fact
What is the von Neumann Syndrome?
Computing in the von Neumann style is tremendously inefficient. Multiple layers of massive overhead phenomena at run time often lead to code sizes of astronomic dimensions, resident in drastically slower off-chip memory. The manycore programming crisis requires complete re-mapping and re-implementation of applications, yet a sufficiently large population of programmers qualified to program applications for four and more cores is far from available.
Education for multi-core
[Cartoon slide, attributed to Mateo Valero: programming multicores, illustrated with a multicore-based pacifier]
Will Computing be affordable in the Future?
Another problem is a high-priority political issue: the very high energy consumption of von-Neumann-based systems. The electricity consumption of all visible and hidden computers already reaches more than 20% of our total electricity consumption, and one study predicts a still higher share for the US by the year 2020.
Reconfigurable Computing highly promising
Fundamental concepts from Reconfigurable Computing promise a speed-up of almost one order of magnitude, and for some application areas up to two or three orders of magnitude, while slashing the electricity bill to 10% or less. It is really time to fully exploit the most disruptive revolution since the mainframe: Reconfigurable Computing - also to reverse the downward trend in CS enrolment. Reconfigurable Computing shows us the road map to the personal desktop supercomputer, making HPC affordable also for small firms and individuals, and to a drastic reduction of energy consumption. Contracts between microprocessor firms and Reconfigurable Computing system vendors are under way but not yet published. The technology is ready, but most users are not. Why?
A Revolution is overdue
The talk sketches a road map requiring a redefinition of the entire discipline, inspired by the mind set of Reconfigurable Computing.
much more saved by coarse-grain
platform (example)                      energy (W/Gflops)   energy factor
MDgrape-3 (domain-specific, 2004)*      0.2                 1
Pentium 4                               14                  70
Earth Simulator (supercomputer, 2003)   128                 640
*) feasible also with an rDPA
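Reading the table: the energy factor is each platform's W/Gflops figure divided by the MDgrape-3 baseline of 0.2 W/Gflops, for example:

```latex
\mathrm{factor}_{\text{Pentium 4}} = \frac{14}{0.2} = 70, \qquad
\mathrm{factor}_{\text{Earth Simulator}} = \frac{128}{0.2} = 640
```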
(3) Power-aware Applications
[Chart: cyber infrastructure energy consumption - several predictions; the most pessimistic: almost 50% by 2025 in the USA. Segments shown: mobile computation, communication, entertainment, etc. (high-volume market, 2003 and later); PCs and servers (high volume); HPC and supercomputing.]
An Example: FPGAs in Oil and Gas .... (1)
[Herb Riley, R. Associates]: "Application migration [from a supercomputer] has resulted in a 17-to-1 increase in performance." For this example the speed-up is not my key issue (Jürgen Becker's tutorial showed much higher speed-ups, going up to a factor of 6000). For this oil and gas example a side effect is much more interesting than the speed-up.
An Example: FPGAs in Oil and Gas .... (2)
[Herb Riley, R. Associates]: "Application migration [from a supercomputer] has resulted in a 17-to-1 increase in performance." It also saves more than $10,000 in electricity bills per year (at 7¢/kWh) per 64-processor 19" rack. Did you know that this is a strategic issue: 25% of Amsterdam's electric energy consumption goes into server farms, and a quarter square kilometer of office floor space within New York City is occupied by server farms?
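As a rough plausibility check (my own arithmetic, not from the slide), $10,000 per year at 7¢/kWh corresponds to about

```latex
\frac{\$10{,}000/\text{yr}}{\$0.07/\text{kWh}} \approx 143{,}000\ \text{kWh/yr}
\approx \frac{143{,}000\ \text{kWh}}{8760\ \text{h}} \approx 16\ \text{kW}
```

i.e. roughly 16 kW of continuously saved power per rack.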
Oil and Gas as a strategic issue
Low-power design matters not only to keep the chips cool. Do you know the size of Google's electricity bill? It should be investigated how far the migration gains obtained for computationally intensive applications can also be utilized for servers. Recently the US Senate ordered a study on the energy consumption of servers.
Flagship conference series: IEEE ISCA
[Chart: "migration of the lemmings" - topic statistics of ISCA papers: 98.5% von Neumann (other topics: cache coherence? speculative scheduling?); parallelism has faded away (2001: 84%). Sources: David Padua, John Hennessy, et al.; Jean-Loup Baer]
Unqualified for RC? Hiring a student from the EE dept.?
Using FPGAs for scientific computation? Hiring a student from the EE department? Application disciplines use their own boxes of tricks: a transdisciplinary fragmentation of methodology. CS is responsible for providing a common RC model for transdisciplinary education, and for fixing its own intradisciplinary fragmentation.
Computing Curricula 2004 fully ignores Reconfigurable Computing
The Joint Task Force for Computing Curricula 2004 fully ignores Reconfigurable Computing: "FPGA" and its synonyms score 0 hits in the curricula (versus 10 million hits on Google) - not even here.
Curriculum Recommendations, v. 2005
Upon my complaints the only change was an addition to the last paragraph of the survey volume: "programmable hardware (including FPGAs, PGAs, PALs, GALs, etc.)." However, there were no structural changes at all, and v. 2005 is intended to be the final version (?) - torpedoing the transdisciplinary responsibility of CS curricula. This is criminal!
fine-grained vs. coarse-grained reconfigurability
"fine-grained" means a data path width of ~1 bit; "coarse-grained" means a path width of many bits (e.g. 32 bits).
instruction-stream-based: CPU with extensible instruction set (partially reconfigurable); domain-specific CPU design (not reconfigurable); soft-core CPU (reconfigurable)
data-stream-based: domain-specific rDPU design; rDPU with extensible "instruction" set
coarse-grained: terminology
          program counter   execution triggered by   paradigm
CPU       yes               instruction fetch        instruction-stream-based
DPU**     no                data arrival*            data-stream-based
*) "transport-triggered"
**) a DPU does not have a program counter
PACT Corp., Munich, offers rDPU arrays (rDPAs).
The Paradigm Shift to Data-Stream-Based
The method of communication and data transport: by Software - the von Neumann syndrome; by Configware - a complex pipe network on the rDPA.
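A small software analogy of the two transport styles (my own sketch, not from the talk; the operators and function names are hypothetical): in the von Neumann version every intermediate result is shipped through a memory buffer by separate instruction-stream loops, while the "pipe" version chains the operators so each datum flows directly from producer to consumer, as a configured pipe network on an rDPA would.

```c
#include <stdio.h>
#define N 8

static int square(int v)   { return v * v; }
static int clamp100(int v) { return v > 100 ? 100 : v; }

/* von Neumann style data transport: each step is a separate instruction-stream
   loop, and every intermediate result travels through a memory buffer. */
int sum_via_memory(const int *in) {
    int tmp[N], sum = 0;
    for (int i = 0; i < N; i++) tmp[i] = square(in[i]);     /* write to RAM   */
    for (int i = 0; i < N; i++) tmp[i] = clamp100(tmp[i]);  /* read + rewrite */
    for (int i = 0; i < N; i++) sum += tmp[i];              /* read again     */
    return sum;
}

/* Pipe-network style (software analogy): the operators are chained so each
   datum flows straight from one operator to the next - no intermediate
   buffer, no extra instruction streams just to move data around. */
int sum_via_pipe(const int *in) {
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += clamp100(square(in[i]));
    return sum;
}

int main(void) {
    int in[N] = {1, 2, 3, 4, 5, 11, 12, 13};
    printf("%d %d\n", sum_via_memory(in), sum_via_pipe(in)); /* same result */
    return 0;
}
```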
The Anti Machine (generalization of the systolic array model)
A kind of trans(sub)disciplinary effort: the fusion of paradigms. Interpretation [Thomas S. Kuhn]: clean up the terminology! A non-von-Neumann machine paradigm (a generalization of the systolic array model). A twin paradigm? Split up into two paradigms? Like matter and anti-matter: still one elementary particle physics.
Languages turned into Religions
Java is a religion, not a language [Yale Patt]. We teach students the tunnel view of language designers who fall in love with the subtleties of formalisms instead of meeting the needs of the user.
The language and tool disaster
At the end of April, a DARPA brainstorming conference. Software people do not speak VHDL; hardware people do not speak MPI. The quality of the application development tools is bad: a poll at FCCM'98 revealed that a large share of hardware designers hate their tools.
The first Reconfigurable Computer
Prototyped by Herman Hollerith (*29 Feb 1860, Buffalo) a century before the introduction of the FPGA - and it was data-stream-based. Sixty years later the von Neumann (vN) model took over: instruction-stream-based.
Reconfigurable Computing came back
As a separate community - the clash of paradigms:
1960: "fixed plus variable structure computer" proposed by G. Estrin
1970: PLD (programmable logic device*)
1985: FPGA (Field Programmable Gate Array)
1989: Anti Machine Model - counterpart of von Neumann
1990: Coarse-grained Reconfigurable Datapath Array
(when?): foundation of PACT
(when?): reconfigurable address generator - MoPL
*) Boolean equations in sum-of-products form, implemented by an AND matrix and an OR matrix (PLA: both matrices reconfigurable; ePROM: AND matrix fixed, OR matrix reconfigurable; PAL: AND matrix reconfigurable, OR matrix fixed); structured VLSI design like memory chips: integration density very close to the Moore curve.
Outline
- von Neumann overhead hits the memory wall
- The manycore programming crisis
- Reconfigurable Computing is the solution
- We need a twin paradigm approach
- Conclusions
The spirit of the Mainframe Age
For decades, we have trained programmers to think sequentially, breaking complex parallelism down into atomic instruction steps ... finally tending toward code sizes of astronomic dimensions. Even in "hardware" courses (the unloved child of CS scenes) we often teach von Neumann machine design, deepening this tunnel view. 1951: hardware design goes von Neumann (microprogramming).
von Neumann: array of massive overhead phenomena
... piling up to code sizes of astronomic dimensions. The overhead of the von Neumann machine: instruction fetch (the instruction stream), state address computation, data address computation, data meeting the PU, i/o to/from off-chip RAM, multi-threading overhead, ... and other overhead.
Temptations by von Neumann style software engineering have long been criticized:
[Dijkstra 1968] the "go to" considered harmful
[R.H., Koch 1975] massive communication congestion: the universal bus considered harmful
[Backus 1978] Can programming be liberated from the von Neumann style?
[Arvind et al. 1983] A critique of multiprocessing the von Neumann style
von Neumann overhead: just one example
[1989, image processing example]: 94% of the computation load went only into moving the sliding window, i.e. into von Neumann overhead rather than into the actual image processing.
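A minimal sketch of why such an overhead share is plausible (my own illustration, a hypothetical 3x3 averaging filter, not the 1989 application): per output pixel the instruction stream spends most of its operations on loop bookkeeping, address arithmetic and loads, with only a handful of "useful" arithmetic operations.

```c
#include <stdio.h>

/* Hypothetical 3x3 averaging filter: for every output pixel the instruction
   stream performs ~9 address computations, 9 loads and the surrounding loop
   bookkeeping - the actual filter arithmetic is only a small fraction of the
   executed operations. */
static void window_filter(const unsigned char *in, unsigned char *out,
                          int width, int height)
{
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            int sum = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    sum += in[(y + dy) * width + (x + dx)]; /* address calc + load */
            out[y * width + x] = (unsigned char)(sum / 9);  /* the useful work */
        }
    }
}

int main(void) {
    unsigned char in[5 * 5], out[5 * 5] = {0};
    for (int i = 0; i < 25; i++) in[i] = (unsigned char)i;
    window_filter(in, out, 5, 5);
    printf("center pixel: %d\n", out[2 * 5 + 2]); /* 3x3 neighborhood of 12 averages to 12 */
    return 0;
}
```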
instruction stream code size of astronomic dimensions ...
... needs off-chip RAM, which fully hits the Memory Wall.
[Chart after Dave Patterson, 1980-2005: microprocessor performance grows ~60%/yr while DRAM grows ~7%/yr, so the "performance gap" grows ~50%/yr and reaches ~1000x in 2005. CPU clock speed is not performance: the processor's silicon is mostly cache, so it is better to compare off-chip versus fast on-chip memory - processors are not that good.]
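As a back-of-the-envelope check (my arithmetic, not on the slide): with processors improving about 60% per year and DRAM about 7% per year, the gap grows by roughly

```latex
\frac{1.60}{1.07} \approx 1.5 \ \text{per year}, \qquad 1.5^{17} \approx 1000
```

so roughly seventeen years of compounding is enough to reach the ~1000x gap quoted for 2005.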
Benchmarked Computational Density
[Chart "stolen from Bob Colwell" [BWRC, UC Berkeley, 2004]: SPECfp2000 per MHz per billion transistors, 1990-2005, for DEC Alpha, IBM, SUN and HP (the Intel curve was removed; meanwhile all curves have been removed from the RAMP website). Alpha: down by 100x in 6 years; IBM: down by 20x in 6 years. CPU clock speed is not performance: the processor's silicon is mostly cache.]
Outline
- von Neumann overhead hits the memory wall
- The manycore programming crisis
- Reconfigurable Computing is the solution
- We need a twin paradigm approach
- Conclusions
The Manycore future
We are embarking on a new computing age: the age of massive parallelism [Burton Smith]. Everyone will have multiple parallel computers [B.S.]. Even mobile devices will exploit multicore processors, also to extend battery life [B.S.]. But multiple von Neumann CPUs on the same microprocessor chip lead to exploding (vN) instruction-stream overhead [R.H.].
von Neumann parallelism
The watering pot model [Hartenstein]: the sprinkler head has only a single hole - the von Neumann bottleneck.
Several overhead phenomena
The instruction-stream-based parallel von Neumann approach (the watering pot model [Hartenstein]) suffers several von Neumann overhead phenomena - per CPU!
Explosion of overhead by von Neumann parallelism
Monoprocessor (local) overhead, proportionate to the number of processors: instruction fetch (instruction stream), state address computation, data address computation, data meeting the PU, i/o to/from off-chip RAM, ... other overhead.
Parallel (global) overhead, disproportionate to the number of processors: inter-PU communication, message passing. [R.H. 2006]: MPI considered harmful.
Rewriting Applications
More processors means rewriting applications: we need to map an application onto manycore configurations of different sizes, yet most applications are not readily mappable onto a regular array. Mapping is much less problematic with Reconfigurable Computing.
Disruptive Development
The computer industry is probably going to be disrupted by some very fundamental changes [Ian Barron]. We must reinvent computing [Burton J. Smith]. A parallel [vN] programming model for manycore machines will not emerge for five to ten years [experts from Microsoft Corp.] - it does not support massive parallelism in large systems ... I don't agree: we have a model. Reconfigurable Computing: the technology is ready, the users are not. The Education Wall: it's mainly an education problem.
Outline
- von Neumann overhead hits the memory wall
- The manycore programming crisis
- Reconfigurable Computing is the solution
- We need a twin paradigm approach
- Conclusions
The Reconfigurable Computing Paradox
FPGA technology is bad: reconfigurability overhead, wiring overhead, routing congestion, slow clock speed. Yet migration to FPGAs yields up to four orders of magnitude of speed-up while tremendously slashing the electricity bill. The reason for this paradox? There is something fundamentally wrong with using the von Neumann paradigm: the spirit of the Mainframe Age is collapsing under the von Neumann syndrome.
beyond von Neumann Parallelism
The watering pot model [Hartenstein]: instead of the instruction-stream-based von Neumann approach, with its several von Neumann overhead phenomena per CPU, we need an approach like this - data-stream-based Reconfigurable Computing (RC).
von Neumann overhead vs. Reconfigurable Computing
rDPA = reconfigurable datapath array (coarse-grained reconfigurable); rDPU = reconfigurable datapath unit.
[Table comparing the von Neumann machine (using a program counter), the hardwired anti machine (using data counters) and the reconfigurable anti machine (using reconfigurable data counters) across the overhead categories: instruction fetch (instruction stream vs. none* - no instruction fetch at run time), state address computation, data address computation, data meeting the PU and other overhead, i/o to/from off-chip RAM, and inter-PU communication / message-passing overhead.]
*) configured before run time
[1989, image processing example]: 17x speed-up by the GAG** alone; 15,000x total speed-up from this migration project.
**) just by the reconfigurable address generator
Reconfigurable Computing means ...
For HPC, run time is more precious than compile time. Reconfigurable Computing means moving overhead from run time to compile time (or loading time). Reconfigurable Computing replaces "looping" at run time - e.g. complex address computation - by configuration before run time.
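One way to make "run time is more precious than compile time" concrete (my own framing, the symbols are mine, not from the slide): work moved out of the run-time loop is paid once at configuration time instead of on every run and every iteration,

```latex
T_{\text{total}} = T_{\text{configure}} + N_{\text{runs}} \cdot T_{\text{run}}
```

so eliminating a per-iteration overhead o executed n times per run saves about N_runs * n * o at the cost of a one-time increase in T_configure - a clear net win for HPC codes that run long or often.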
Data meeting the Processing Unit (PU)
... explaining the RC advantage. We have two choices: by Software, routing the data via memory-cycle-hungry instruction streams through shared memory; or by Configware, data-stream-based, with placement* of the execution locality (the PU) in a pipe network generated by configware compilation.
*) before run time
Generalization* of the systolic array
What pipe network? A pipe network on the rDPA, organized at compile time depending on the connect fabrics; array ports receive or send data streams. This is a generalization* of the systolic array [R. Kress, 1995].
rDPA = rDPU array, i.e. coarse-grained; rDPU = reconfigurable datapath unit (no program counter)
*) supporting non-linear pipes on free-form heterogeneous arrays
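To recall what the classic linear systolic model being generalized looks like, here is a small software model (my own sketch; the coefficients are hypothetical): a chain of processing elements, each configured with a fixed coefficient and holding one pipeline register, through which the samples are shifted while each stage contributes its product to the result - no program counter, no instruction fetch per operation.

```c
#include <stdio.h>

#define TAPS 4  /* number of processing elements in the linear array */

/* One processing element of a linear FIR pipeline: its coefficient is its
   "configuration", x_reg is its pipeline register for the sample stream. */
typedef struct { int coeff; int x_reg; } pe_t;

int main(void) {
    pe_t pe[TAPS] = { {1, 0}, {2, 0}, {3, 0}, {4, 0} }; /* configured before "run time" */
    int input[8] = {1, 0, 0, 0, 5, 0, 0, 0};

    for (int t = 0; t < 8; t++) {        /* one clock tick per input sample */
        int x = input[t];
        int y = 0;
        for (int s = 0; s < TAPS; s++) {
            int delayed = pe[s].x_reg;   /* sample stored in the previous tick   */
            pe[s].x_reg = x;             /* shift the sample stream by one stage */
            y += pe[s].coeff * x;        /* this stage's contribution            */
            x = delayed;                 /* hand the delayed sample onward       */
        }
        printf("y[%d] = %d\n", t, y);    /* y[t] = sum_k coeff[k] * input[t-k]   */
    }
    return 0;
}
```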
Migration benefit by on-chip RAM
Some RC chips have hundreds of on-chip RAM blocks, orders of magnitude faster than off-chip RAM, so that the drastic code size reduction by software-to-configware migration can beat the memory wall. Multiple on-chip RAM blocks are the enabling technology for ultra-fast anti machine solutions: GAGs inside ASMs generate the data streams.
ASM = Auto-Sequencing Memory (RAM block with a data counter and a GAG); GAG = generic address generator; rDPA = rDPU array, i.e. coarse-grained; rDPU = reconfigurable datapath unit (no program counter)
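A minimal software model of the ASM idea (my own sketch; the struct and field names are hypothetical): an on-chip RAM block paired with a generic address generator whose loop parameters are part of the configuration, so that at run time it simply emits a data stream - no program counter and no per-element address instructions in an instruction stream.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical GAG configuration: a 2-level strided scan, fixed before run time. */
typedef struct {
    size_t base, stride_x, stride_y;
    size_t count_x, count_y;
} gag_config;

/* Hypothetical ASM: an on-chip RAM block plus its generic address generator. */
typedef struct {
    const int *ram;
    gag_config gag;
} asm_block;

/* Stream the configured scan pattern out of the ASM into a consumer. */
void asm_stream(const asm_block *a, void (*consume)(int)) {
    for (size_t y = 0; y < a->gag.count_y; y++)
        for (size_t x = 0; x < a->gag.count_x; x++)
            consume(a->ram[a->gag.base + y * a->gag.stride_y + x * a->gag.stride_x]);
}

static long total;
static void add(int v) { total += v; }

int main(void) {
    int ram[16];
    for (int i = 0; i < 16; i++) ram[i] = i;
    asm_block a = { ram, { .base = 0, .stride_x = 1, .stride_y = 4,
                           .count_x = 4, .count_y = 4 } };
    asm_stream(&a, add);          /* a 4x4 block scan delivered as a data stream */
    printf("sum = %ld\n", total); /* 0 + 1 + ... + 15 = 120 */
    return 0;
}
```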
Coarse-grained Reconfigurable Array example
Image processing: an SNN filter (mainly a pipe network), coming close to the programmer's mind set - much closer than an FPGA. Compiled by Nageldinger's KressArray Xplorer (with Juergen Becker's CoDe-X inside). Array size: 10 x 16 = 160 rDPUs, 32 bits wide, mesh-connected (with a few exceptions); some rDPUs are used for routing through only, some are not used; backbus connections and fast on-chip RAM (ASMs) complete the picture. Note: this is a kind of software perspective, but without instruction streams - data streams plus pipelining.
Outline
- von Neumann overhead hits the memory wall
- The manycore programming crisis
- Reconfigurable Computing is the solution
- We need a twin paradigm approach
- Conclusions
Software / Configware Co-Compilation
Apropos compilation: the CoDe-X co-compiler [Juergen Becker, 1996] takes C language source through an analyzer/profiler and a partitioner and produces SW code (via the SW compiler, for the "vN" machine paradigm) alongside CW code (configware, for the anti machine paradigm) and FW code. Both the partitioner and the DPSS use simulated annealing for mapping and optimization. We need such a dual paradigm approach to run legacy software together with configware. Reconfigurable Computing: the technology is ready - the users are not(?).
Curricula from the mainframe age
[Diagram, brain-hemisphere metaphor (this is not a lecture on brain regions): current curricula are procedural and structurally disabled - the education wall is the main problem; non-von-Neumann accelerators have no common model and are not really taught; the common model is ready, but users are not.]
We need a twin paradigm education
Brain usage: both hemispheres (this is not a lecture on brain regions) - each side, procedural and structural, needs its own common model.
RCeducation 2008 - teaching RC? http://fpl.org/RCeducation/
The 3rd International Workshop on Reconfigurable Computing Education, April 10, 2008, Montpellier, France
We need new courses (2007)
We need undergraduate lab courses with HW / CW / SW partitioning, and new courses with an extended scope on parallelism and algorithmic cleverness for HW / CW / SW co-design. "We urgently need a Mead-&-Conway-like text book" [R.H., Dagstuhl Seminar 03301, Germany, 2003]. Here it is!
Outline
- von Neumann overhead hits the memory wall
- The manycore programming crisis
- Reconfigurable Computing is the solution
- We need a twin paradigm approach
- Conclusions
Conclusions
We need to increase the population of HPC-competent people [B.S.] and of RC-competent people [R.H.]. Data streaming, not von Neumann, is the key model of parallel computation. Von-Neumann-type instruction streams considered harmful [R.H.] - but we still need them for some small code sizes, old legacy software, etc. The twin paradigm approach is inevitable, also in education [R.H.].
An Open Question: which effect is delaying the breakthrough?
Coarse-grained arrays: the technology is ready*, the users are not. They come much closer to the programmer's mind set - really much closer than FPGAs**. So which effect is delaying the breakthrough? Please reply to:
*) offered by startups (PACT Corp. and others)
**) "FPGAs? Do we need to learn hardware design?"
thank you
END