ISICT 2005 Supercomputing going Reconfigurable Reiner Hartenstein TU Kaiserslautern Jan. 4-6, 2005, Capetown, South Africa.


1 ISICT 2005 Supercomputing going Reconfigurable Reiner Hartenstein TU Kaiserslautern Jan. 4-6, 2005, Capetown, South Africa

2 © 2004, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 2 my relations to South Africa October 1981: reporting about KARL language and CHDL 1981 at Kaiserslautern http://hartenstein.de/Star.jpg Un. Stellenbosch: an early KARL licensee http://hartenstein.de/KARL-users.html

3 >> The methodology gap << Outline: The methodology gap; The wrong roadmap; HPC goes reconfigurable; Coarse grain is the way to go; Curricula are obsolete; Final remarks. http://www.uni-kl.de

4 FPGA with island architecture. rLB: reconfigurable logic box. Reconfigurable interconnect fabrics. A logic design issue: far from the computing mind set.

5 Software to configware migration: speed-up examples
application example: 16 tap FIR filter; method: straightforward; speed-up factor: x16 (MOPS/mW); platform: PACT Xtreme 4-by-4 array [2003]
application example: grid-based DRC**, 1-metal 1-poly nMOS***, 256 reference patterns; method: multiple aspects (key issue: algorithmic cleverness); speed-up factor: > x1000 (computation time); platform: MoM anti machine with DPLA* [1983]
application example: migration of several simple application examples; method: high-level synthesis; speed-up factor: x7 – x46 (compute time); platform: CPU to FPGA [FPGA 2004]
application example: from fastest DSP: 10 gMACs to 1 teraMAC; method: not specified; speed-up factor: x100 (compute time); platform: DSP to FPGA [Xilinx 2004, 2]
*) DPLA: MPC fabr. via E.I.S. multi-university project
**) Design Rule Check
***) for 10-metal 3-poly cMOS expected: >> x10,000
2) Wim Roelandts

6 FPGAs are mainstream. The FPGA market is worth $3 billion and is predicted to grow to $5 billion by 2007: the fastest growing segment of the IC market. Other names: soft hardware, morphware [DARPA]. But my talk is not about FPGAs (reconfigurable logic); it's about coarse grain: Reconfigurable Computing.

7 >> The wrong roadmap << Outline: The methodology gap; The wrong roadmap; HPC goes reconfigurable; Coarse grain is the way to go; Curricula are obsolete; Final remarks. http://www.uni-kl.de

8 Data are moved around by software (slower than the CPU clock by 2 orders of magnitude), i.e. by memory-cycle-hungry instruction streams which fully hit the memory wall. Extremely unbalanced. [Figure: CPU, stolen from Bob Colwell]

9 The wrong roadmap. Configware methodology moves data around more efficiently. But HPC and supercomputing have stubbornly avoided configware use for the past 20 years. An exception: Splash, which has been discarded. Educational deficits are one reason. Adopting a new mindset: the brain hurts.

10 It's not Software, it's Configware. The programming sources for morphware and Reconfigurable Computing are fundamentally different: structural programming instead of procedural, data-stream-based instead of instruction-stream-based. Software to Configware migration includes a time to space conversion.
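A minimal sketch of the time to space conversion named on this slide, as a Python analogy only: Python itself still runs on an instruction-stream machine, so the point is the shape of the program, not its speed. The function names (procedural, stage, structural) and the tiny computation 3x + 1 are illustrative assumptions, not taken from the talk.

```python
def procedural(xs):
    """Instruction-stream style: one resource steps through the operations in time."""
    results = []
    for x in xs:
        t = x * 3          # step 1, executed at one point in time
        t = t + 1          # step 2, executed later on the same resource
        results.append(t)
    return results


def stage(op, stream):
    """One operator laid out 'in space': it transforms whatever streams through it."""
    for item in stream:
        yield op(item)


def structural(xs):
    """Data-stream style: the operators are wired into a pipe once,
    then the data stream flows through the fixed structure."""
    multiplier = stage(lambda x: x * 3, iter(xs))
    adder = stage(lambda x: x + 1, multiplier)
    return list(adder)


if __name__ == "__main__":
    data = [1, 2, 3]
    print(procedural(data))
    print(structural(data))
```

Both versions compute the same function; in the structural one, the "program" is the wiring of the stages, which is fixed before any data arrives.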

11 No instruction fetch. Configware execution does not need instruction fetch at run time, which saves memory cycles, because a (super) instruction fetch happens before run time: (re)configuration. More performance benefit comes from other acceleration mechanisms which stem from time to space migration by Reconfigurable Computing. You don't believe? *) also programming for HPC!

12 Path of least resistance*: avoiding a paradigm shift. Many researchers seem never to stop working on sophisticated solutions for marginal improvements ... continuously ignoring methodologies promising speed-ups by orders of magnitude ... continue to bang their heads against the memory wall instead of ... Blinders to ignore the impact of morphware. *) [Michel Dubois]

13 The instruction-stream-based approach: von Neumann bottlenecks ... cannot cope with this one. … understand only this parallelism solution. The data-stream-based approach has no von Neumann bottleneck.

14 Living dangerously … me talking to HPC people: "the wrong roadmap the past 20 years"

15 The Hardware / Software chasm: typical programmers don't understand function evaluation without machine mechanisms (counters, state registers). It's the gap between the procedural (instruction-stream-based) and the structural (data-stream-based) mind set. [Figure: µprocessor and accelerators]

16 >> HPC goes reconfigurable << Outline: The methodology gap; The wrong roadmap; HPC goes reconfigurable; Coarse grain is the way to go; Curricula are obsolete; Final remarks. http://www.uni-kl.de

17 By the way ... 15th International Conference on Field-Programmable Logic and Applications (FPL), Aug. 24 – 26, 2005, Tampere, Finland, http://fpl.org ... the oldest and largest conference in the field; in 2004: 288 submissions! ... going into every type of application; they all work on high performance. [Figure: µProc. and accelerators]

18 Another conference ... http://hartenstein.de/raw05.html, April 4 – 5, 2005, Denver, Colorado, in conjunction with IPDPS. 2004: 99 submissions! ... going into every type of application; they all work on high performance. [Figure: µProc. and accelerators]

19 More conferences ...
FPL series: 15th at Tampere, Finland; RAW series: 12th at Denver, Colorado; DRS, W. on Dynamically Reconf. Systems, Innsbruck, Austria, 2005; W. Archit. Res. using FPGA platforms, in conj. w. HPCA-11, SFO, Febr. 2005; PACT'98 w. RC, Paris 1998; MAPLD (mil. appl. ...), USA only - 8th; FPGA, Monterey, California; FCCM, Napa, California, April 2005; FPT 2004 Int'l Conf.; ERSA, Las Vegas, Nevada, USA; RHPC, in conj. w. HPC Asia; ARC 2005 Int'l w. Applied RC, Algarve, Portugal, February 22, 2005; ReConFig04, Univ. of Colima, Mexico; DFG w. 04 at DaimlerChrysler, Germany; ICES'05 Int'l C. on Evolvable Systems, From Biology To Hardware.
Special tracks in other conferences: ICECS, DATE, DAC, others. Research programs: DARPA, EU, DFG ...
EuroGP - European Conference on Genetic Programming; GECCO - Genetic Evolutionary Computation Conference; CEC - Congress on Evolutionary Computation; SEAL - Asia-Pacific Conf. on Simulated Evolution And Learning; EA - International Conference on Artificial Evolution, 7th, October 26-28, 2005, Lille, France; ECML - European Conference on Machine Learning; IEEE Conference on Evolutionary Computation; International Conference on Evolutionary Programming (EP); European Conference on Artificial Evolution (AE); EUNITE - European Symp. on Intelligent Technologies, Hybrid Systems and their Implementation in Smart Adaptive Systems; EUROGEN - Evolutionary Methods for Design, Optimisation and Control with Applications to Industrial Problems; ACDM - International Conference on Adaptive Computing in Design and Manufacture; EvoRobot - European Workshop on Evolutionary Robotics ...

20 First Indications of Change
10th RAW at IPDPS, Nice, France, April 2003: after a decade of non-overlap, first IPDPS people coming
HPC Asia 2004 - 7th Int'l Conference on High Performance Computing, July 20-22, 2004, Omiya Sonic City, Tokyo Area, Japan: Workshop on Reconfigurable Systems f. HPC (RHPC) + keynote address*
HPCA-11, 11th International Symposium on High-Performance Computer Architecture, San Francisco, Febr. 12-16, 2005: topic area explicitly: Embedded and reconfigurable architectures
SBAC-PAD 2004 - 16th Symposium on Computer Architecture and High Performance Computing, Foz do Iguacu, PR, Brazil, October 27-29, 2004: topic area explicitly: Reconfigurable Systems
PARS & Speed-up, Basel, Switzerland, March 2003: keynote address*
IPDPS, Santa Fe, NM, USA, April 2004: keynote address*
PDP'04, La Coruna, Spain, Febr. 2004: keynote address*
*) keynote speaker:

21 Cray XD1. FPGAs as programmable co-processors for parallelization; 6 optional DSPs as accelerators. Xilinx Virtex-II Pro FPGAs cooperate with AMD Opteron. The FPGAs are dynamically programmable by Configware. Cray provides a configware library with special algorithms for search and sort, DSP, and encryption. Obtains a speed-up of >100 in genome sequencing.

22 >> Coarse grain is the way to go << Outline: The methodology gap; The wrong roadmap; HPC goes reconfigurable; Coarse grain is the way to go; Curricula are obsolete; Final remarks. http://www.uni-kl.de

23 rDPA (coarse grain) vs. FPGA (fine grain)
Roughly: area efficiency (transistors/chip, orders of magnitude): hardwired 4; rDPA (coarse grain) 4; FPGA 2; µProc 0
Roughly: performance (MOPS/mW, orders of magnitude): hardwired 3; rDPA (coarse grain) 3; FPGA 2; DSP 1; µProc 0

24 Coarse grain is about computing, not logic. Example: SNN filter on KressArray (mainly a pipe network) [Ulrich Nageldinger], mapped onto the rDPA by DPSS, which is based on simulated annealing. Array size: 10 x 16 = 160 rDPUs. rDPU: reconfigurable function block, e.g. 32 bits wide. No CPU. Figure legend: rout thru only; not used; backbus connect.
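To give a feeling for what a simulated-annealing mapper does, here is a minimal sketch that places operators on a 10 x 16 grid and tries to shorten the connections between them. It is not the DPSS tool: the cost function (Manhattan wire length), the cooling schedule and all names (wire_length, anneal_placement, nets) are assumptions chosen for illustration.

```python
import math
import random


def wire_length(placement, nets):
    """Sum of Manhattan distances between connected operators (assumed cost)."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in nets)


def anneal_placement(num_ops, nets, rows=10, cols=16, steps=5000,
                     t_start=5.0, t_end=0.01, seed=0):
    """Place num_ops operators on a rows x cols array by simulated annealing."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    rng.shuffle(cells)
    placement = cells[:num_ops]        # operator i sits at placement[i]
    free = cells[num_ops:]             # unused array cells
    cost = wire_length(placement, nets)
    initial_cost = cost
    best, best_cost = list(placement), cost
    for step in range(steps):
        # geometric cooling from t_start towards t_end
        temp = t_start * (t_end / t_start) ** (step / steps)
        i = rng.randrange(num_ops)
        move_to_free = bool(free) and rng.random() < 0.5
        if move_to_free:
            j = rng.randrange(len(free))
            placement[i], free[j] = free[j], placement[i]
        else:
            j = rng.randrange(num_ops)
            placement[i], placement[j] = placement[j], placement[i]
        new_cost = wire_length(placement, nets)
        accept = (new_cost <= cost or
                  rng.random() < math.exp((cost - new_cost) / temp))
        if accept:
            cost = new_cost
            if cost < best_cost:
                best, best_cost = list(placement), cost
        elif move_to_free:
            placement[i], free[j] = free[j], placement[i]            # undo move
        else:
            placement[i], placement[j] = placement[j], placement[i]  # undo swap
    return best, best_cost, initial_cost


if __name__ == "__main__":
    chain = [(i, i + 1) for i in range(7)]     # an 8-stage pipe, as a toy netlist
    best, best_cost, initial_cost = anneal_placement(8, chain)
    print(initial_cost, best_cost)
```

Worse placements are accepted with a probability that shrinks as the temperature falls, which is what lets the search escape poor local arrangements early on.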

25 Branching example: time-to-space migration
S = R + (if C then A else B endif);
On a very simple CPU, case C = 1 (step: memory cycles, nanoseconds):
if C then read A: read instruction: 1, 100; instruction decoding; read operand*: 1, 100; operate & register transfers
if not C then read B: read instruction: 1, 100; instruction decoding
add & store: read instruction: 1, 100; instruction decoding; operate & register transfers; store result: 1, 100
total: 5, 500
*) if no intermediate storage in register file
As a section of a very large pipe network (inputs A, B, R, C; output S): clock 200 MHz (5 nanoseconds), no memory cycles.
Speed-up factor = 100
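The arithmetic behind this slide's speed-up factor can be checked with a short sketch. The figures (5 memory cycles of 100 ns each, a 200 MHz clock) are the ones on the slide; the variable and function names are illustrative only.

```python
# Figures taken from the slide; names are illustrative only.
MEMORY_CYCLE_NS = 100      # one memory cycle on the "very simple CPU"
CPU_MEMORY_CYCLES = 5      # 3 instruction reads + 1 operand read + 1 result store
PIPE_CLOCK_MHZ = 200       # clock of the pipe-network section


def select_add(r, c, a, b):
    """S = R + (if C then A else B): the function both versions compute."""
    return r + (a if c else b)


cpu_time_ns = CPU_MEMORY_CYCLES * MEMORY_CYCLE_NS   # time spent in memory cycles
pipe_time_ns = 1000 / PIPE_CLOCK_MHZ                # one clock period in ns
speedup = cpu_time_ns / pipe_time_ns

print(cpu_time_ns, pipe_time_ns, speedup)
```

Note that the CPU figure counts memory cycles only, so decoding and register transfers are not even included in the 500 ns.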

26 Commercial rDPA example: PACT XPP - XPU128. XPP128 rDPA: full 32 or 24 bit design; working silicon; 2 configuration hierarchies. Evaluation board available, and XDS development tool with simulator. [Figure: (r)DPA made of rDPUs, buses not shown] © PACT AG, Munich, http://pactcorp.com

27 XPP64A: Coarse-grain Reconfigurable Array
– 64 ALU-PAEs
– 16 RAM-PAEs
– 24-bit architecture
– Split (12,12)-bit opcodes: complex addition, complex multiplication, conditional sign-flip
– JTAG debug interface
– Synthesis, P&R, layout from ACCENT
– 0.13 µm silicon from STMicro, Crolles
– Available since March 2003

28 XPP64A: Platform Development Board. SDR board in debug phase. XPP64A chips from STMicro fab: assembly & test; available since March 2003.

29 SMeXPP: Multimedia Co-Processor Concept. A single device handles all modes. [Block diagram: SMeXPP with camera, baseband processor, radio interface, audio interface, SD/MMC cards and LCD display; content: games, music, videos]
Requirements: variable resolutions and refresh rates; variable scan mode characteristics; noise reduction and artifact removal; high performance requirements; variable file encoding formats; variable content security formats; variable displays; luminance processing; detail enhancement; color processing; sharpness enhancement; shadow enhancement; differentiation; programmable deinterlacing heuristics; frame rate detection and conversion; motion detection & estimation & compensation; different standards (MPEG2/4, H.264).

30 >> Curricula are obsolete << Outline: The methodology gap; The wrong roadmap; HPC goes reconfigurable; Coarse grain is the way to go; Curricula are obsolete; Final remarks. http://www.uni-kl.de

31 Completely wrong mind set. Beef up old architecture principles by new technology? The key problem, the memory wall, cannot be solved by new CPU technology. The vN paradigm is not a communication paradigm. Its monopoly creates a completely wrong mind set. We need an architectural communication paradigm: we need a 2nd machine paradigm (a 2nd mind set ...). But we need both paradigms: a dichotomy.

32 von Neumann is not the common model
Mainframe age: von Neumann instruction-stream-based machine: CPU (program counter, DPU), RAM memory, von Neumann bottleneck.
Microprocessor age: CPU (software, instruction-stream-based) plus accelerator co-processors (hardware, data-stream-based).
Configware age: CPU (software) plus reconfigurable accelerator, i.e. morphware (configware); software/configware co-compiler.

33 Why a new machine paradigm??? The anti machine as the 2nd machine paradigm is the key to curricular innovation ... a Trojan horse to introduce the structural domain to the procedural-only mind set of programmers. A RAM-based platform is needed for flexibility and programmability, avoiding the need for specific silicon (mask cost: > 2 million $, growing). A 2nd machine paradigm is needed as a common model: to avoid the need for circuit expertise; needed to educate zillions of programmers. [Figure: CPU (program counter, DPU) next to rDPA with asM (data counter, memory bank), programmed by flowware]

34 >> Final Remarks << Outline: The methodology gap; The wrong roadmap; HPC goes reconfigurable; Coarse grain is the way to go; Curricula are obsolete; Final remarks. http://www.uni-kl.de

35 http://configware.org/

36 END

37 CS Education. You cannot* teach hardware to a programmer, but to a hardware guy or gal you always can teach programming. *) efficiently. Programmer: procedural, structural "have not" (stems from the vN monopoly). Hardware guy or gal: structural "have", natural.

38 Growth Rate of Embedded Software. [Chart: growth factor (0 to 2) over months (10, 12, 18); curves: Moore's law (1.4/year), embedded software [DTI*] (~2.5/yr)] >10 times more programmers will write embedded applications than computer software by 2010. Already today, more than 98% of all microprocessors are used within embedded systems. *) Department of Trade and Industry, London

39 De facto duality of RAM-based platforms. We now have 2 types of programmable platforms:
traditional: machine paradigm: von Neumann etc. (instruction-stream-based); RAM-based platform: CPU; "running" on it: software
2nd paradigm: machine paradigm: anti machine (data-stream-based); RAM-based platform: morphware (FPGA, rDPA ...); "running" on it: configware, flowware
[Figure label: synthesis]

40 CW has become mainstream ... going into every type of application [Gordon Bell]. Morphware: fastest growing sector of the IC market. The HPC scene believed it was being smart when smiling about us CW guys. Others experienced that the brain hurts when trying the paradigm shift [Gordon Bell].

41 Nick Tredennick's paradigm shifts explain the differences
Software Engineering (CPU): resources: fixed; algorithm: variable; 1 programming source needed: software
Configware Engineering: resources: variable; algorithm: variable; 2 programming sources needed: configware, flowware

42 Compilation: Software vs. Configware
Software Engineering: source program -> software compiler -> software code
Configware Engineering: source "program" -> configware compiler, consisting of mapper (placement & routing) -> configware code, and data scheduler -> flowware code

43 Compilation: Software vs. Flowware (for hardwired anti machine)
Software Engineering: source program -> software compiler -> software code
Flowware Engineering: source "program" -> flowware compiler (data scheduler) -> flowware code

44 Flowware programs data streams. Flowware defines which data item appears at which time at which port. [Figure: a DPA with input data streams and output data streams, each drawn as data items (x) over time and port #]
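The statement "which data item at which time at which port" can be read as a schedule of (time, port, item) triples. The sketch below builds such a schedule under one assumed convention: each port's stream is delayed by one time step per port index, in the manner of the skewed streams of systolic arrays. The function name and the skew rule are illustrative assumptions, not the definition of any flowware language.

```python
def flowware_schedule(streams):
    """Return a list of (time, port, item) triples for the given per-port streams.

    Assumed convention: the k-th item of port p appears at time k + p,
    i.e. each port is skewed by one time step relative to its neighbour.
    """
    schedule = []
    for port, items in enumerate(streams):
        for k, item in enumerate(items):
            schedule.append((k + port, port, item))
    schedule.sort(key=lambda entry: (entry[0], entry[1]))
    return schedule


if __name__ == "__main__":
    for time, port, item in flowware_schedule([["a0", "a1"], ["b0", "b1"]]):
        print(f"t={time} port={port} item={item}")
```

No instruction appears anywhere in this description; the schedule alone determines what the array sees at each port and time step.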

45 Data streams*: not new
1980: data streams (Kung, Leiserson: systolic arrays)
1989: data-stream-based Xputer architecture
1990: rDPU (Rabaey)
1994: Flowware Language MoPL (Becker et al.)
1995: super systolic array (rDPA) + DPSS tool (Kress)
1996: configware / software partitioning compiler (Becker)
1996+: Streams-C language, SCCC (Los Alamos), SCORE, ASPRC, Bee (UC Berkeley), DSP-C, Brook, ...
*) please, don't confuse with "data flow"

46 >> Dual Machine Paradigms << Outline: HPC; Embedded Computing; The wrong Roadmap; Configware Engineering; Dual Machine Paradigms; Speed-up Examples; Final Remarks. http://www.uni-kl.de

47 von Neumann vs. anti machine
Instruction stream machine (von Neumann etc.): CPU (program counter, DPU), RAM memory, von Neumann bottleneck.
Data stream machine (anti machine): (r)DPA without sequencer, no CPU!, surrounded by asMs.
asM: auto-sequencing memory (data counter, memory bank). asMA: auto-sequencing memory array.

48 Behavior of the counter
CPU (program counter, DPU): programmed by software.
asM (data counter, memory bank): programmed by flowware; delivers data streams to the (r)DPA.
(r)DPA: programmed by flowware. [as labeled in the figure]

50 Counters: the same micro architecture? Instruction stream machine (von Neumann etc.): program counter in the CPU (with DPU). Data stream machine (anti machine): data counter in the asM (with memory bank). Yes, it is possible, but for data counters ... a much better AGU methodology is available*. AGU: address generator unit. *) for the history of AGUs see Herz et al.: Proc. ICECS 2002, Dubrovnik, Croatia
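A minimal sketch of what a data counter does, assuming a simple two-level (row/column) scan as the address pattern. The names data_counter and auto_sequencing_memory and the chosen pattern are illustrative assumptions; the AGU methodology the slide refers to is more general than this.

```python
def data_counter(base, rows, cols, row_stride, col_stride=1):
    """Generate the address sequence of a 2-D block scan over a linear memory."""
    for r in range(rows):
        for c in range(cols):
            yield base + r * row_stride + c * col_stride


def auto_sequencing_memory(memory, addresses):
    """Turn an address sequence into a data stream read from the memory bank."""
    for address in addresses:
        yield memory[address]


if __name__ == "__main__":
    bank = list(range(100, 116))                 # a toy 16-word memory bank
    stream = auto_sequencing_memory(bank, data_counter(5, 2, 2, 4))
    print(list(stream))
```

The address computation sits next to the memory and involves no instruction fetch and no memory cycles of its own; only the data accesses touch the memory bank.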

51 Software / Configware Co-Compilation (Juergen Becker's CoDe-X, 1996)
High level PL source -> Analyzer / Profiler -> Partitioner (resource parameters, supporting different platforms)
-> SW compiler -> SW code ("vN" machine paradigm)
-> CW compiler -> CW code, FW code (anti machine paradigm)

52 Better solutions by Configware instead of software
– Memory cycles minimized, e.g.: no instruction fetch at run time & other effects
– Memory access for data: caches do not help anyhow
– Loop transformations: no intra-stream data memory cycles
– Complex address computation: no memory cycles
– No cache misses!
Methodologies not new: high level synthesis (1980+); loop transformations (1970+); many other areas

53 Why the speed-up ... although the FPGA clock is slower by x3 or even more? (Most know-how from the "high level synthesis" discipline.)
– moving the operator to the data stream (before run time)
– support operations: no clock nor memory cycle
– decisions without memory cycles nor clock cycles
– most "data fetch" without memory cycle

54 HPC experts coming ... Configware by astrophysics, by example: N-body problem going configware. Simulation of star clusters: x10 speed-up by supercomputer-to-morphware migration (also molecular biology et al.); paper already at FPL 1999, http://fpl.org. Rainer Spurzem, University of Heidelberg; Reinhard Maenner, University of Mannheim, HPC pioneer since 1976 (Physics Dept. Heidelberg). ARI, Astronomisches Rechen-Institut, founded 1700 in Berlin, moved 1945 to Heidelberg by August Kopff. [Portraits: Gottfried Kirch, August Kopff]

55 August Kopff, 18th Director of the Astronomisches Rechen-Institut (ARI), 1924 - 1954. Discovered the Kopff comet at Koenigstuhl Observatory, Heidelberg, Germany, 1906. Discovered the asteroid 631 Philippina, 21 March 1907, which became the first asteroid ever visited by a spacecraft - on the Galileo mission to Jupiter. The Galileo spacecraft's 14-year odyssey came to an end on Sunday, Sept. 21, 2003. [Image: Copyright © 1996 by Masayuki Suzuki]

56 Conclusions. RC has become mainstream in all kinds of applications ... CS education deficits: a curricular revision is overdue ... by a merger with the embedded systems mind set. We need an academic grass roots movement for ... free material & tools for undergraduate lab courses to program and emulate small SW/CW/HW examples. All know-how needed is readily available: get involved!

57 State-of-the-art FPGA [courtesy Xilinx Corp.]
– 500 MHz flexible soft logic architecture, 200K logic cells
– 500 MHz programmable DSP execution units
– 0.6-11.1 Gbps serial transceivers
– 500 MHz PowerPC™ processors (680 DMIPS) with auxiliary processor unit
– 1 Gbps differential I/O
– 500 MHz multi-port distributed 10 Mb SRAM
– 500 MHz DCM digital clock management

58 The Platform FPGA [courtesy Xilinx Corp.]. Resources: MGTs, I/Os, memory, PowerPC, soft logic, DSP. Roles: communication port, custom logic, internal memory, external memory port, DSP accelerator, µP.

59 Virtex-4 – First Embodiment of ASMBL. One family, multiple platforms: Domain A -> Platform A; Domain B -> Platform B; ... Domain x -> Platform x.
Logic domain (highest logic density): Virtex-4 LX logic platform
DSP domain (highest DSP performance): Virtex-4 SX signal processing platform
Embedded processing domain (embedded processors, high-speed serial I/O): Virtex-4 FX full featured platform
Legend: logic, DSP, memory, CPUs, Gbps I/O

60 Comparison of Hardware Solutions (by performance, flexibility, power consumption)
General purpose processors: CISC (Complex Instruction Set Computing); RISC (Reduced Instruction Set Computing)
Special purpose processors: microcontroller; DSPs (Digital Signal Processors); Application-Specific Instruction Set Processors (ASIPs)
Programmable hardware: FPGA (Field-Programmable Gate Arrays); FPFA (Field-Programmable Function Arrays)
Application-Specific Integrated Circuits (ASICs)

61 Reconfigurable Hardware (FPL)
PALs, PLAs: 10 - 100 gate equivalents
Fine-grain: FPGAs, FPLDs
– Altera MAX family: FPLD (EEPROM)
– Actel programmable gate array: FPGA (anti-fuse)
– Xilinx logic cell array: FPGA (RAM-based)
Several thousand to multiple millions of gate equivalents, e.g. Xilinx Virtex XCV2000: 2 million gate equivalents
Coarse-grain: Field Programmable Functional Arrays = FPFAs
– PACT XPP Technologies AG (Munich, Germany): Xtreme Processing Platform: XPP architecture
– Universitaet Karlsruhe (TH): new multi-grain adaptive architectures
[Chart: Virtex density (system gates), 180k to 10M, over years 1997 to 2004; devices: XC4085XL, XC40250XV, Virtex; 10M gates in 2004]

62 Terminology
Configurable: type of flexible computation in which only one or a few instructions per processing element are loaded and the execution is performed in the dimensions time and space (-> area) concurrently.
Programmable: type of flexible computation in which a sequence of instructions is loaded and executed in the dimension time by using one or several processing elements.
Reconfigurable: general term which expresses the feature of a hardware architecture to be configured more than once (-> technology dependent).
Dynamically reconfigurable: type of reconfiguration which realizes modifications of configurations during run time of the system. This is also called run-time reconfiguration, on-the-fly reconfiguration or in-circuit reconfiguration.

63 Xilinx Virtex FPGA: Logic Realization. A collection of programmable "gates" embedded in a flexible interconnect network … a "user programmable" alternative to gate arrays. Solution: programmable gate (2-LUT).
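How a look-up table acts as a programmable gate can be shown in a few lines. This is a minimal sketch of a 2-input LUT, assuming the four configuration bits are indexed by (a << 1) | b; the function name make_lut2 and the bit ordering are illustrative assumptions, not Xilinx's actual configuration format.

```python
def make_lut2(truth_table):
    """Build a 2-input gate from 4 configuration bits.

    truth_table[i] is the output for inputs (a, b) with i = (a << 1) | b.
    """
    if len(truth_table) != 4:
        raise ValueError("a 2-LUT needs exactly 4 configuration bits")
    bits = tuple(int(bool(bit)) for bit in truth_table)

    def gate(a, b):
        return bits[(a << 1) | b]

    return gate


# "Configuring" the same structure with different bits yields different gates.
AND = make_lut2([0, 0, 0, 1])
XOR = make_lut2([0, 1, 1, 0])
NAND = make_lut2([1, 1, 1, 0])

if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, AND(a, b), XOR(a, b), NAND(a, b))
```

Any of the 16 possible 2-input Boolean functions is obtained by loading a different 4-bit table; no gate is ever rewired.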

64 Xilinx Virtex-II Pro FPGA: Architecture. It contains PowerPC® 405 RISC CPU (PPC405) cores and an FPGA fabric based on the Virtex-II architecture. [Figure: integration process; layer stack: substrate, poly, metal 1 to metal 9; hard IP block (PowerPC); on-chip memory controller, PowerPC core, BRAM (embedded RAM)] Source: Xilinx Virtex-II Pro documentation

65 PACT XPP with RISC Core. PACT XPP will boost RISC cores. The PACT XPP technology will allow RISC IP manufacturers to conquer new markets:
– RISC manufacturers will be able to extend their road map
– RISC manufacturers will regain significance in silicon occupancy
[Chart: performance over control flow vs. data flow; labels: MIPS, ARM, FPGA, DSP, ASIC, DP]

66 FPGA Performance vs. KressArray. [Chart: transistors/chip (1 to 100 000 000, log scale) over years 1980 to 2010; curves: memory, KressArray physical, KressArray logical, FPGA physical, FPGA routed, FPGA logical; gaps marked "> 10 000" and "< 10"] E.g. KressArray: 18 bit PU cell; 8 NN ports; 4 buses; 0.18 µm CMOS: 0.06 mm² area. Sources: Proc. ISSCC, ICSPAT, DAC, DSPWorld

67 Computer Performance Trend History [Hennessy, Jouppi, 1991]. [Chart: performance (0.1 to 1000, log scale) over year (1965 to 2005); curves: supercomputers, mainframes, minicomputers, microprocessors]

68 Software to Configware Migration. This talk will illustrate the performance benefit which may be obtained from Reconfigurable Computing (RC). Stressing the coarse grain RC point of view, this talk hardly mentions FPGAs (but coarse grain may always be mapped onto FPGAs). Software to Configware Migration is the most important source of speed-up. Hardware is just frozen Configware.

69 Reconfigurable Computing (1)
Goal: software flexibility with hardware performance
– Microprocessors for any software, but slow
– ASICs fully specialized, but very fast
– Reconfigurable computing to bridge this gap
Reconfiguration to execute a wide variety of applications on the same hardware platform. Hardware yields higher performance than software.
Microprocessors: highest flexibility, performance? ASICs: highest performance, lowest flexibility. Reconfigurable computing: high flexibility, high performance.

70 Reconfigurable Computing (2)
Configuration
– "Programming" the hardware
– Implements desired functionality in hardware. For example, if addition is necessary, an adder is made in hardware
Reconfiguration
– Changing the configuration of the device
– Typical overhead: ~3 to 100 ms
– Two types: static and dynamic
[Figure: reconfigurable hardware with a functionality (a design) -> reconfiguration -> reconfigurable hardware with a new functionality (a new design)]
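Whether a reconfiguration overhead in the range of ~3 to 100 ms is acceptable depends on how often the loaded configuration is used before it is replaced. The sketch below works out that break-even point; the function name and the example timings (10 ms in software, 1 ms in configware) are illustrative assumptions, not measurements from the talk.

```python
import math


def break_even_runs(software_ms, configware_ms, reconfig_ms):
    """Smallest number of runs n per configuration for which
    reconfig_ms + n * configware_ms < n * software_ms.

    Returns None if the configware version is not faster per run.
    """
    gain_per_run = software_ms - configware_ms
    if gain_per_run <= 0:
        return None
    return math.floor(reconfig_ms / gain_per_run) + 1


if __name__ == "__main__":
    # Assumed example: 10 ms per run in software, 1 ms in configware,
    # 100 ms reconfiguration overhead (upper end of the slide's range).
    print(break_even_runs(10, 1, 100))
```

With static reconfiguration the overhead is paid once per application, so the break-even count is usually reached easily; with short-duration dynamic reconfiguration it has to be reached within each phase.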

71 Types of Reconfiguration
Static reconfiguration
– Configure the device once for the application
– Do not reconfigure until the application is finished, or do not reconfigure at all. Hardware is still flexible in the design phase
– Application: high performance computing that needs hardware performance without the high cost of designing an ASIC
Dynamic/run-time reconfiguration
– Hardware is reconfigured while the application is executing
– Partial reconfiguration: part of the reconfigurable hardware is reconfigured while the rest stays the same and continues to execute
– Duration between reconfigurations varies. Long duration: e.g. router updating tables and/or protocols, cell phone switching protocols. Short duration: e.g. regular expression matching, DSP application switching between stages

72 FPGA Basics
FPGA consists of
– Matrix of programmable logic cells. Implement any logic function (AND, OR, NOT, etc.). Can also implement storage (registers or small SRAMs). Groups of cells make up higher level structures (adders, multipliers, etc.)
– Programmable interconnects. Connect the logic cells to one another
– Embedded features. ASICs within the FPGA fabric for specific functions (hardware multipliers, dedicated memory, microprocessors)
FPGAs are SRAM-based
– Configure device by writing to configuration memory
[Figure: logic cells, interconnects]

73 History of Machine Models. [Timeline 1957 to 2007, coordinates by Makimoto's wave] Mainframe age: mainframe (compile); procedural mind set: instruction-stream-based. Computer age (PC age): µProc. (compile) plus accelerators (design, by hardware guys); structural mind set: data-stream-based; e.g. GRAPE, RIKEN institute.

74 Flowware: not new. [Timeline 1957 to 2007, Makimoto's wave] Mainframe age: mainframe (compile). Computer age (PC age): µProc. (compile) plus accelerators (design). Morphware age: µProc. plus rDPA. Data stream* ... Flowware: around 1980. *) no confusion, please: no "dataflow machine"!!!

75 3rd machine model became mainstream. [Timeline 1957 to 2007, Makimoto's wave] Mainframe age: mainframe (compile), instruction-stream-based. Computer age (PC age): µProc. (compile) plus accelerators (design); most CS curricula & HPC are still here. Morphware age: µProc. plus rDPA, programmable.

76 Symbiosis of machine models. [Timeline 1957 to 2007, Makimoto's wave] Mainframe age: mainframe (compile). Computer age (PC age): µProc. (compile) plus accelerators (design). Morphware age: µProc. plus rDPA; symbiosis via co-compiler; replace PC by PS.

77 From Software to Configware Industry. [Timeline 1957 to 2007] 1) Software industry's secret of success: procedural personalization via a RAM-based machine paradigm (computer age / PC age: µProc., compile). 2) Repeat the success story by a 2nd machine paradigm, the anti machine: structural personalization, RAM-based (morphware age: rDPA). Growing configware industry.

78 Earth Simulator. 5120 processors, 5000 pins each. ES 20: TFLOPS. Crossbar weight: 220 t, 3000 km of cable. Moving data around inside the

79 Hardware / Configware / Software partitioning skills urgently needed. Algorithm partitioning into HW, CW, SW: to cope with each of them (SW, CW, HW), or to cope with any combination of co-design: SW / HW, SW / CW, CW / HW, SW / CW / HW. Software to Configware Migration is the most important source of speed-up. Hardware is just frozen Configware.

80 Typical CS graduates: the "havenots". Today, "typical" CS graduates are unqualified for this labor market … cannot cope with Hardware / Configware / Software partitioning issues … cannot implement Configware.

81 The "havenots". Configware methodology to move data around more efficiently: Configware engineering as a qualification for programming embedded systems*. The "havenots" are our typical CS graduates; "havenots" are found in the HPC community. *) also programming for HPC!

