IPDPS 2004 Software or Configware? About the Digital Divide of Parallel Computing Reiner Hartenstein TU Kaiserslautern Santa Fe, NM, April 26 - 30, 2004.

© 2004, TU Kaiserslautern 2 Preface
The White House, Sept 2000: Bill Clinton condemns the Digital Divide in America: access to the internet.
World Economic Forum 2002: The Global Digital Divide, the disparity between the "haves" and "have-nots".
The Digital Divide of Parallel Computing: access to Configware (CW) solutions.

© 2004, TU Kaiserslautern 3 The "have-nots" Configware is the methodology to move data around more efficiently, and configware engineering is a qualification for programming embedded systems*. The "have-nots" are found in the HPC community; the "have-nots" are our typical CS graduates. Reconfigurable HPC is torpedoed by deficits in education: curricular revisions are overdue. *) and also for HPC!

© 2004, TU Kaiserslautern 4 Software to Configware Migration This talk illustrates the performance benefit which may be obtained from Reconfigurable Computing. Stressing the coarse-grain Reconfigurable Computing (RC) point of view, it hardly mentions FPGAs (but coarse grain may always be mapped onto FPGAs). Software to Configware migration is the most important source of speed-up: hardware is just frozen configware.

© 2004, TU Kaiserslautern 5 >> HPC << HPC Embedded Computing The wrong Roadmap Configware Engineering Dual Machine Paradigms Speed-up Examples Final Remarks

© 2004, TU Kaiserslautern 6 Earth Simulator 5120 processors, 5000 pins each; ~40 TFLOPS peak. Crossbar weight: 220 t, 3000 km of cable, all for moving data around inside the machine.

© 2004, TU Kaiserslautern 7 Data are moved around by software, i.e. by memory-cycle-hungry instruction streams (slower than the CPU clock by 2 orders of magnitude) which fully hit the memory wall: the CPU is extremely unbalanced. [Figure stolen from Bob Colwell]

© 2004, TU Kaiserslautern 8 The path of least resistance*: avoiding a paradigm shift. Many researchers seem never to stop working on sophisticated solutions for marginal improvements, continuously ignoring methodologies promising speed-ups by orders of magnitude... wearing blinders to ignore the impact of morphware... they continue to bang their heads against the memory wall instead. *) [Michel Dubois]

© 2004, TU Kaiserslautern 9 The solution: the data-stream-based approach has no von Neumann bottleneck. The instruction-stream-based approach has von Neumann bottlenecks; those who understand only this kind of parallelism cannot cope with the other one.

© 2004, TU Kaiserslautern 10 More offending statements to come.

© 2004, TU Kaiserslautern 11 >> Embedded Computing << HPC Embedded Computing The wrong Roadmap Configware Engineering Dual Machine Paradigms Speed-up Examples Final Remarks

© 2004, TU Kaiserslautern 12 History of Machine Models (coordinates by Makimoto's wave):
mainframe age: mainframe; compile; procedural mind set: instruction-stream-based.
computer age (PC age): µProc. plus accelerators; compile / design; structural mind set: data-stream-based, designed by hardware guys, e.g. GRAPE (RIKEN institute).

© 2004, TU Kaiserslautern 13 The Hardware / Software Chasm: typical programmers don't understand function evaluation without machine mechanisms (counters, state registers). It's the gap between the procedural (instruction-stream-based) mind set of the µprocessor and the structural (data-stream-based) mind set of accelerators.

© 2004, TU Kaiserslautern 14 Growth Rate of Embedded Software Embedded software grows by ~2.5x/year [DTI*], versus ~1.4x/year for Moore's law. By 2010, more than 10 times as many programmers will write embedded applications as computer software. Already today, more than 98% of all microprocessors are used within embedded systems. *) Department of Trade and Industry, London

© 2004, TU Kaiserslautern 15 Typical CS graduates: the "have-nots" Today, "typical" CS graduates are unqualified for this labor market: they cannot cope with Hardware / Configware / Software partitioning issues, and they cannot implement configware.

© 2004, TU Kaiserslautern 16 The current CS mind set is based on the Submarine Model: hardware is invisible, under the surface; the visible layers run from algorithm via procedural high-level programming language and assembly language (Software) down to the hidden hardware. This model does not support Hardware / Configware / Software partitioning.

© 2004, TU Kaiserslautern 17 Hardware / Configware / Software partitioning skills are urgently needed: partitioning an algorithm into HW, CW and SW requires coping with each of them, and with any combination of co-design: SW/HW, SW/CW, CW/HW, SW/CW/HW. Software to Configware migration is the most important source of speed-up: hardware is just frozen configware.

© 2004, TU Kaiserslautern 18 By the way... the International Conference on Field-Programmable Logic and Applications (FPL), Aug. 30 – Sept 1, 2004, Antwerp, Belgium: 288 submissions! The oldest and largest conference in the field, going into every type of application; they all work on high performance.

© 2004, TU Kaiserslautern 19 Dominance of the Submarine Model (hardware hidden)... indicates that our CS education system produces zillions of structurally disabled (procedural-only) CS graduates: disabled to cope with solutions other than instruction-stream-based ones.

© 2004, TU Kaiserslautern 20 CS Education: you cannot* teach hardware to a programmer (the procedural "have-not"), but you can always teach programming to a hardware guy (the structural "have": a natural). *) efficiently

© 2004, TU Kaiserslautern 21 >> the wrong Roadmap << HPC Embedded Computing the wrong Roadmap Configware Engineering Dual Machine Paradigms Speed-up Examples Final Remarks

© 2004, TU Kaiserslautern 22 Completely wrong roadmap: beef up old architectural principles with new technology? "Pollack's Law" (simplified) [Intel]: performance grows only with roughly the square root of the area invested. The CPU is a Methuselah, the steam engine of the silicon age.

© 2004, TU Kaiserslautern 23 Completely wrong mind set The key problem, the memory wall, cannot be solved by new CPU technology. The vN paradigm is not a communication paradigm; its monopoly creates a completely wrong mind set. We need a 2nd machine paradigm (a 2nd mind set...): an architectural communication paradigm. But we need both paradigms: a dichotomy.

© 2004, TU Kaiserslautern 24 A 3rd machine model became mainstream (Makimoto's wave): after the mainframe age (mainframe; compile; instruction-stream-based) and the computer age / PC age (µProc. plus accelerators; compile / design) comes the morphware age (programmable rDPA and reconfigurable µProc.). Most CS curricula & HPC are still stuck in the computer age.

© 2004, TU Kaiserslautern 25 >> Configware Engineering << Supercomputing (HPC) Embedded Computing The wrong Roadmap Configware Engineering Dual Machine Paradigms Speed-up Examples Final Remarks

© 2004, TU Kaiserslautern 26 de facto Duality of RAM-based platforms. We now have 2 types of programmable platforms:
RAM-based platform: traditional: CPU; new: morphware (FPGA, rDPA, ...).
"running" on it: traditional: software; new: configware.
machine paradigm: traditional: von Neumann etc., instruction-stream-based; new (the 2nd paradigm): anti machine, data-stream-based.
Morphware is "soft hardware" [DARPA]; hardware can be viewed as frozen configware: just earlier binding.

© 2004, TU Kaiserslautern 27 CW has become mainstream, going into every type of application [Gordon Bell]. Others experienced that "the brain hurts" when trying the paradigm shift. The HPC scene believed itself smart when smiling about us CW guys. Morphware: the fastest growing sector of the IC market.

© 2004, TU Kaiserslautern 28 Software Industry's Secret of Success: 1) procedural personalization, RAM-based, via the von Neumann machine paradigm (computer / PC age: µProc., compile) created the Software Industry. 2) Repeat the success story by a 2nd machine paradigm, the anti machine: structural personalization, RAM-based (morphware age: rDPA) creates a growing Configware Industry.

© 2004, TU Kaiserslautern 29 Benefit from RAM-based platforms & the 2nd paradigm. 1) A RAM-based platform is needed for flexibility and programmability, avoiding the need for specific silicon: mask cost is currently $2 million and rapidly growing. 2) A simple 2nd machine paradigm is needed as a common model, to avoid the need of circuit expertise and to educate zillions of programmers. By the way: relocatability is more difficult here (vN relocatability is based on the von Neumann bottleneck and its high price), but not hopeless.

© 2004, TU Kaiserslautern 30 Nick Tredennick's paradigm shifts explain the differences. Software Engineering: 1 programming source needed; resources fixed (the CPU), algorithm variable (software). Configware Engineering: 2 programming sources needed; resources variable (configware), algorithm variable (flowware).

© 2004, TU Kaiserslautern 31 Compilation: Software vs. Configware. Software Engineering: source program → software compiler → software code. Configware Engineering: source "program" → configware compiler, consisting of a mapper (placement & routing) emitting configware code and a data scheduler emitting flowware code.

© 2004, TU Kaiserslautern 32 [Figure: a DPA with input data streams entering and output data streams leaving; each stream is a sequence of data items over time at a port.] Flowware defines which data item arrives at which time at which port: flowware programs data streams.
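The "which data item, at which time, at which port" contract can be sketched as a schedule generator in plain C. This is an illustrative model only (a hypothetical round-robin stream over two input ports), not a real flowware language:

```c
/* Flowware schedule sketch: each entry says which data item appears
   at which port at which time step. All names are illustrative. */
#define N     6   /* items in the stream           */
#define PORTS 2   /* assumed number of input ports */

struct stream_event { int time; int port; float value; };

/* Stream an N-element vector into PORTS ports round-robin,
   one item per port per time step. */
void make_schedule(const float data[N], struct stream_event ev[N])
{
    for (int i = 0; i < N; ++i) {
        ev[i].time  = i / PORTS;   /* time step advances every PORTS items */
        ev[i].port  = i % PORTS;   /* alternate between the input ports    */
        ev[i].value = data[i];
    }
}
```

In an actual anti machine the asM address generators would enumerate such schedules in hardware; the C structure only makes the flowware contract explicit.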

© 2004, TU Kaiserslautern 33 Flowware: not new. Data streams* date from around 1980. *) no confusion, please: not a "dataflow machine"!

© 2004, TU Kaiserslautern 34 Data streams*: not new.
1980: data streams (Kung, Leiserson: systolic arrays)
1989: data-stream-based Xputer architecture
1990: rDPU (Rabaey)
1994: flowware language MoPL (Becker et al.)
1995: super systolic array (rDPA) + DPSS tool (Kress)
1996: configware / software partitioning compiler (Becker)
1996+: Streams-C language, SCCC (Los Alamos), SCORE, ASPRC, Bee (UC Berkeley); streaming languages (Stanford et al.)
*) please don't confuse with "data flow"

© 2004, TU Kaiserslautern 35 >> Dual Machine Paradigms << HPC Embedded Computing The wrong Roadmap Configware Engineering Dual Machine Paradigms Speed-up Examples Final Remarks

© 2004, TU Kaiserslautern 36 Why a new machine paradigm? The anti machine (rDPA vs. µprocessor) as the 2nd paradigm is the key to curricular innovation: a Trojan horse to introduce the structural domain to the procedural-only mind set of programmers. Programming by flowware instead of software is very easy to learn (same language primitives). Flowware education: no fully fledged hardware expert is needed to program embedded systems.

© 2004, TU Kaiserslautern 37 von Neumann vs. anti machine. Instruction stream machine (von Neumann etc.): CPU with program counter and DPU; RAM memory behind the von Neumann bottleneck; programmed by software; overhead phenomena: instruction fetch, control flow, routing congestion, and others. Data stream machine (anti machine): no CPU! An (r)DPA without sequencer, fed by data streams from asMs (auto-sequencing memories, each with its own data counter), distributed as an asMA (auto-sequencing memory array); programmed by flowware. Also see books by Francky Catthoor et al.

© 2004, TU Kaiserslautern 38 Counters: the same microarchitecture? The program counter (instruction stream machine: von Neumann etc.) and the data counter (data stream machine: anti machine) could share a microarchitecture; yes, that is possible, but for data counters a much better AGU methodology is available*. AGU: address generator unit. *) for the history of AGUs see Herz et al.: Proc. ICECS 2002, Dubrovnik, Croatia

© 2004, TU Kaiserslautern 39 Commercial rDPA example: PACT XPP / XPU128. The XPP128 rDPA evaluation board is available, along with the XDS development tool with simulator. rDPU: full 32- or 24-bit design; working silicon; 2 configuration hierarchies (buses not shown). © PACT AG, Munich

© 2004, TU Kaiserslautern 40 Mapping algorithms efficiently onto an rDPA: an SNN filter on the KressArray [Ulrich Nageldinger], mapped by DPSS, which is based on simulated annealing. Array size: 10 x 16 = 160 rDPUs (some used for routing-through only, some not used; backbus connect).

© 2004, TU Kaiserslautern 41 Symbiosis of machine models (Makimoto's wave): in the morphware age, µProc. and rDPA cooperate, served by a co-compiler (replace the PC by a PS).

© 2004, TU Kaiserslautern 42 Software / Configware Co-Compilation (Jürgen Becker's CoDe-X, 1996): a high-level PL source goes through an analyzer / profiler and a partitioner (driven by resource parameters supporting different platforms), feeding the SW compiler ("vN" machine paradigm) for SW code and the CW compiler (anti machine paradigm) for CW code.

© 2004, TU Kaiserslautern 43 >> Speed-up Examples << HPC Embedded Computing The wrong Roadmap Configware Engineering Dual Machine Paradigms Speed-up Examples Final Remarks

© 2004, TU Kaiserslautern 44 Better solutions by configware instead of software methodologies. Memory cycles are minimized, e.g.: no instruction fetch at run time, & other effects; caches do not help for data access anyway; loop transforms: no intra-stream data memory cycles; complex address computation: no memory cycles; no cache misses! Not new: high-level synthesis (1980+), loop transformations (1970+), many other areas.

© 2004, TU Kaiserslautern 45 Speed-up examples:
PACT Xtreme 4-by-4 array [2003]: 16-tap FIR filter; x16 MOPS/mW; straightforward method.
MoM anti machine with DPLA* [1983]: grid-based DRC**, 1-metal 1-poly nMOS, 256 reference patterns; > x1000 computation time; multiple aspects; key issue: algorithmic cleverness; expected for 10-metal 3-poly cMOS: >> x10,000.
CPU to FPGA [FPGA 2004]: migration of several simple application examples; x7 – x46 compute time; high-level synthesis.
DSP to FPGA [Xilinx, Wim Roelands]: from the fastest DSP (10 GMACs) to 1 TeraMAC; x100 compute time; method not specified.
*) DPLA: MPC fabrication via the E.I.S. multi-university project **) Design Rule Check
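As a reference point for the FIR entry above, a minimal C model of one streaming FIR step (shortened here to 4 taps; on an rDPA each multiply-accumulate would be a configured rDPU rather than a loop iteration, and the delay line would be the data stream itself):

```c
#include <string.h>

#define TAPS 4  /* shortened from the slide's 16-tap example */

/* One streaming FIR step: y[n] = sum over k of h[k] * x[n-k].
   The delay array models the sample stream flowing through the array;
   function and variable names are illustrative. */
float fir_step(const float h[TAPS], float delay[TAPS], float x)
{
    /* shift the delay line: oldest sample drops out */
    memmove(&delay[1], &delay[0], (TAPS - 1) * sizeof(float));
    delay[0] = x;                      /* newest sample enters the stream */

    float y = 0.0f;
    for (int k = 0; k < TAPS; ++k)     /* the MAC chain: one rDPU per tap */
        y += h[k] * delay[k];
    return y;
}
```

With moving-average coefficients (all 0.25) and a step input, the output ramps 0.25, 0.5, 0.75, 1.0 as the window fills.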

© 2004, TU Kaiserslautern 46 Software to Configware Migration (RAW'99 at Orlando): a question by a highly respected industrial senior researcher at Ulrich Nageldinger's talk about the KressArray Xplorer: "But you can't implement decisions!" (a symptom of...)

© 2004, TU Kaiserslautern 47 Conditional operation example: S = R + (if C then A else B endif); case C = 1, on a very simple CPU:
if C then read A: read instruction (1 memory cycle), instruction decoding, read operand* (1), operate & register transfers;
if not C then read B: read instruction (1), instruction decoding;
add & store: read instruction (1), instruction decoding, operate & register transfers, store result (1);
total: 5 memory cycles.
On an rDPU: no memory cycles; a multiplexer selects A or B within a single clock.
*) if no intermediate storage in the register file
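The rDPU's zero-memory-cycle decision can be modeled in C as a branch-free multiplexer. A minimal sketch (the function name and masking trick are illustrative, not taken from any configware tool):

```c
/* The slide's S = R + (if C then A else B): on an rDPU the decision
   is a 2-to-1 multiplexer in the datapath, evaluated every clock with
   no instruction fetch and no memory cycle. Branch-free C model: */
int select_add(int r, int a, int b, int c)
{
    int mask = -(c != 0);                    /* all-ones if C, all-zeros if not */
    return r + ((a & mask) | (b & ~mask));   /* mux feeding the adder           */
}
```

The point of the contrast: the CPU spends five memory cycles on instruction fetches and operand traffic for this one operation, while the configured datapath evaluates it continuously as data streams through.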

© 2004, TU Kaiserslautern 48 rDPA (coarse grain) vs. FPGA (fine grain commodity), status ~1998, rough orders of magnitude. Area efficiency (transistors/chip): hardwired 4, rDPA 4, FPGA 2, µProc 0. Performance (MOPS/mW): hardwired 3, rDPA 3, FPGA 2, DSP 1, µProc 0.

© 2004, TU Kaiserslautern 49 Why the speed-up, although the FPGA clock is slower by x3 or even more? Moving the operator to the data stream (before run time); support operations need neither clock nor memory cycles; decisions take no memory cycles nor clock cycles; most "data fetch" happens without a memory cycle. (Most know-how comes from the "high-level synthesis" discipline.)

© 2004, TU Kaiserslautern 50 >> Final Remarks << HPC Embedded Computing The wrong Roadmap Configware Engineering Dual Machine Paradigms Speed-up Examples Final Remarks

© 2004, TU Kaiserslautern 51 First Indications of Change.
10th RAW at IPDPS, Nice, France, April 2003: after a decade of non-overlap, the first IPDPS people coming.
HPC Asia 2004, Int'l Conference on High Performance Computing, July 20-22, 2004, Omiya Sonic City, Tokyo Area, Japan: Workshop on Reconfigurable Systems for HPC (RHPC), with keynote*.
HPCA, International Symposium on High-Performance Computer Architecture, San Francisco, February 12-16, 2005: topic area "Embedded and reconfigurable architectures".
SBAC-PAD, Symposium on Computer Architecture and High Performance Computing, Foz do Iguacu, PR, Brazil, October 27-29, 2004: topic area "Reconfigurable Systems".
*) keynote speaker also at: PARS & Speed-up, Basel, Switzerland, March 2003; IPDPS, Santa Fe, NM, USA, April 2004; PDP'04, La Coruna, Spain, Feb. 2004

© 2004, TU Kaiserslautern 52 HPC experts coming... Example: astrophysics went configware. ARI, Astronomisches Rechen-Institut (founded 1700 in Berlin under Gottfried Kirch; moved 1945 to Heidelberg under August Kopff): simulation of star clusters, the N-body problem: x10 speed-up by supercomputer-to-morphware migration (also molecular biology et al.); paper already at FPL. Rainer Spurzem, University of Heidelberg; Reinhard Männer, University of Mannheim, HPC pioneer since 1976 (Physics Dept., Heidelberg).

© 2004, TU Kaiserslautern 53 LANL also working on RC

© 2004, TU Kaiserslautern 54 Conclusions We need an academic grass-roots movement, for... RC has become mainstream in all kinds of applications... by a merger with the embedded systems mind set. CS education deficits: a curricular revision is overdue... free material & tools for undergraduate lab courses to program and emulate small SW/CW/HW examples. All know-how needed is readily available: get involved!

© 2004, TU Kaiserslautern 55 thank you for your patience

© 2004, TU Kaiserslautern 56 END

© 2004, TU Kaiserslautern 57 MD-GRAPE-2 PCI board: 4 MD-GRAPE-2 chips for N-body simulation; converts a PC into a 64-GFLOPS machine.

© 2004, TU Kaiserslautern 58 for discussion

© 2004, TU Kaiserslautern 59 … a curricular revision is overdue. Very high throughput on low-power, slow FPGAs may be obtained only by algorithmic cleverness*, under the mind set of CW. *) still mainly ignored by our CS curricula

© 2004, TU Kaiserslautern 60 not your job? next winner of the „Not My Job“ Award ?

© 2004, TU Kaiserslautern 61 Xilinx Virtex-II Pro FPGA architecture (source: Ivo Bolsens, Xilinx): an FPGA fabric based on the Virtex-II architecture, with PowerPC 405 RISC CPU (PPC405) cores, on-chip memory controller, embedded RAM, and Rocket IO. An entire system on a single chip: all you need on board.

© 2004, TU Kaiserslautern 62 Mega-rGAs. [Chart: system gates per rGA chip over the years: XC 4085XL, XC 40250XV, Virtex, Virtex II (planned); Xilinx data]

© 2004, TU Kaiserslautern 63 Configware / Flowware co-compilation for the asM distributed memory architecture: from an intermediate high-level source, the mapper generates configware for the rDPA (reconfigurable Data Path Array), and the scheduler generates flowware for the data sequencers (address generators) in the memory wrapper (the M banks): "instruction" fetch happens before runtime; at runtime only data streams flow.

© 2004, TU Kaiserslautern 64 Software / Configware co-compilation: the source program passes a partitioner; the software compiler emits software code, while the configware compiler's mapper emits configware code and its scheduler emits flowware code.

© 2004, TU Kaiserslautern 65 Loop Transformation Examples (resource-parameter-driven co-compilation). Sequential process on the host: loop 1-16 body endloop. Strip mining into parallel processes: fork; loop 1-8 body endloop || loop 9-16 body endloop; join. Loop unrolling onto the reconfigurable array: the host keeps only a trigger loop (loop 1-8, loop 1-4, or loop 1-2 trigger endloop, depending on the unroll factor).
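The slide's strip mining can be sketched in plain C, assuming a hypothetical array width of 4 body instances (all names illustrative):

```c
#define N     16  /* total iterations, as on the slide       */
#define WIDTH 4   /* assumed body instances fitting the array */

/* Illustrative loop body: any computation free of loop-carried
   dependencies qualifies for mapping onto the array. */
static void body(int i, int out[N]) { out[i] = i * i; }

/* Strip mining: the sequential loop 1-16 becomes an outer host loop
   that triggers strips; within a strip the WIDTH body instances would
   run in parallel on the reconfigurable array (the inner loop stands
   in for the unrolled, configured datapaths). */
void strip_mined(int out[N])
{
    for (int strip = 0; strip < N; strip += WIDTH)    /* host: trigger */
        for (int i = strip; i < strip + WIDTH; ++i)   /* array: unrolled */
            body(i, out);
}
```

Shrinking WIDTH toward 1 recovers the sequential host loop; growing it toward N is full unrolling, which is exactly the resource-parameter trade-off the co-compiler navigates.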

© 2004, TU Kaiserslautern 66 Speedup by Xputers (data stream machine): asMA distributed memory (data counter per memory bank), (r)DPU with smart memory interface. MoM architecture: 2-D memory space with an adjustable scan window (example: 4x4 scan window). Grid-based design rule check example: speed-up >1000; complex boolean expressions evaluated in 1 clock cycle; address computation overhead (94%) eliminated.

© 2004, TU Kaiserslautern 67 paradigm

© 2004, TU Kaiserslautern 68 Reconfigurable Computing: a second programming domain. Migration of programming to the structural domain; the structural domain has become RAM-based: the opportunity to introduce the structural domain to programmers, and to bridge the gap by clever abstraction mechanisms using a simple new machine paradigm.

© 2004, TU Kaiserslautern 69 The dichotomy of models. Note for von Neumann: the state register is with the CPU. Note for the anti machine: the state registers are within the memory banks.

© 2004, TU Kaiserslautern 70 Control-procedural vs. data-procedural. The structural domain is primarily data-stream-based, but mostly not yet modelled that way: most flowware is hidden by its indirect instruction-stream-based implementation. Flowware converts "procedural vs. structural" into "control-procedural vs. data-procedural".

© 2004, TU Kaiserslautern 71 Traditional environment, by machine paradigm: instruction-stream-based and programmable: Software; data-stream-based and not programmable: Hardware.

© 2004, TU Kaiserslautern 72 Machine paradigms ("instruction fetch"), including hardwired implementations*. *) e.g. the Bee project, Prof. Brodersen

© 2004, TU Kaiserslautern 73 Importance of binding time: the time of "instruction fetch". Run time: read a new instruction: microprocessor, parallel computer (Software). Load time: configure datapaths: Reconfigurable Computing (Configware); a configuration is like a kind of pre-packed, frozen-in "super instruction fetch"; not all switching is done by configware. Fabrication time: fabricate a datapath: full custom or ASIC (Hardware, e.g. for a pipe network).

© 2004, TU Kaiserslautern 74 Roadmap. Old CS lab course philosophy: given an application, implement it by a program. New CS freshman lab course environment: given an application, a) implement it by writing a program; b) implement it as a morphware prototype; c) partition it into P and Q: c.1) implement P by software, c.2) implement Q by morphware, c.3) implement the P/Q communication interface.

© 2004, TU Kaiserslautern 75 All enabling technologies are available anti machine and all its architectural resources parallel memory IP cores and generators anything else needed languages & (co-)compilation techniques morphware vendors like PACT.... literature from last 30 years

© 2004, TU Kaiserslautern 76 "EDA industry shifts into CS mentality" [Wojciech Maly]: microprogramming to replace FSM design; hardware languages replace EE-type schematics; EDA software and its interfacing languages; newer system-level languages like SystemC etc.; small and large module re-use; hierarchical organization of designs, EDA, et al.


© 2004, TU Kaiserslautern 78 Reconfigurable Computing. Using coarse-grain morphware platforms leads to Reconfigurable Computing, which really is computing, whereas physical use of fine-grain morphware (FPGAs etc.) means a kind of logic design on a strange platform.

© 2004, TU Kaiserslautern 79 flowware

© 2004, TU Kaiserslautern 80 Configware and Flowware are the sources for programming morphware. Software is the source for programming traditional hardwired processors (instruction-stream-driven: the von Neumann machine paradigm and its derivatives). For Configware and Flowware we prefer the anti machine paradigm, the counterpart of von Neumann.

© 2004, TU Kaiserslautern 81 Software vs. Flowware and Configware. Programming source for instruction-stream-based computing (von Neumann etc.): Software. Programming source for data-stream-based computing operations (the anti machine paradigm): Flowware. Programming sources for Reconfigurable Computing (morphware): Flowware and Configware. Sources for embedded systems: Flowware, Configware and Software.

© 2004, TU Kaiserslautern 82 Flowware Flowware provides a (data-)procedural abstraction from the (data-stream-based) structural domain: a Trojan horse to introduce the structural domain to the procedural-only mind set of programmers. Flowware instead of software bridges the digital divide. Flowware education: no fully fledged hardware expert is needed to program embedded systems.

© 2004, TU Kaiserslautern 83 Flowware heading toward mainstream. Data-stream-based computing is heading for mainstream: 1997 SCCC (LANL) Streams-C Configurable Computing; SCORE (UCB) Stream Computations Organized for Reconfigurable Execution; ASPRC (UCB) Adapting Software Pipelining for Reconfigurable Computing; 2000 Bee (UCB), ...; most stream-based multimedia systems, etc.; many other areas.... Flowware: managing data streams. Software: managing instruction streams.

© 2004, TU Kaiserslautern 84 Flowware-based paradigms, methodologies.
1946: (von Neumann machine paradigm)
1980: data streams (Kung, Leiserson; systolic arrays: for regular data dependencies only)
1989: anti machine paradigm
1990: rDPA (Rabaey: coarse-grain reconfigurable array)
1994: anti machine high-level programming language
1995: super systolic array (rDPA)
1996+: SCCC (LANL), SCORE, ASPRC, Bee (UCB); discipline of distributed memory architecture
1997: configware / software partitioning compiler

© 2004, TU Kaiserslautern 85 Programming Language Paradigms [comparison chart: flowware languages are very easy to learn, support multiple GAGs, and are much simpler yet much more powerful]

© 2004, TU Kaiserslautern 86 Flowware Languages MoPL: fully supporting the anti machine paradigm, the counterpart of the von Neumann paradigm. General purpose: Streams-C: defines 1-D streams; generates VHDL. DSP-C: allows describing key features of DSPs. Specialized: Brook: for modern graphics hardware.

© 2004, TU Kaiserslautern 87 Gokhale: Streaming Languages: a different mind set "After a few years of looking at the problem, I realized that most of the application space could be described well with a stream-oriented communicating sequential processes model," Gokhale said. The compiler makes FPGA design available to software engineers, but they should still have an "abstract notion" of hardware: "One way to get performance is to tile application-specific arithmetic units across a chip," she noted. "Telling the compiler to unroll inner loops is a way to do that." What's not needed, she said, is knowledge of the hardware at the clock-cycle level. In contrast to SystemC, which provides both behavioral and structural views, Streams-C is purely behavioral. It assigns operations to clock cycles, thus providing behavioral synthesis.

© 2004, TU Kaiserslautern 88 Stanford Streaming Languages: DSP-C A set of language extensions to the ISO C programming language. Allows application programmers to describe key features of DSPs that enable efficient source-code compilation: fixed-point data types, divided memory spaces, circular arrays and pointers. DSP-C uses arrays; sections of the arrays are selected for use in calculations using array indices or array range specifications.
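The circular-array feature can be emulated in plain ISO C (this is a sketch of the idea, not actual DSP-C syntax; `delay_t` and `fir_step` are illustrative names): the delay line wraps around via modular index arithmetic, so no per-sample data shuffling is needed.

```c
#include <stddef.h>

#define TAPS 4

/* Plain-C emulation of a DSP-C circular array: the filter delay line
 * wraps around, so inserting a sample never moves the other samples. */
typedef struct { int line[TAPS]; size_t head; } delay_t;

int fir_step(delay_t *d, const int coeff[TAPS], int sample) {
    d->line[d->head] = sample;          /* newest sample overwrites oldest */
    int acc = 0;
    size_t idx = d->head;
    for (size_t k = 0; k < TAPS; k++) {
        acc += coeff[k] * d->line[idx];
        idx = (idx + TAPS - 1) % TAPS;  /* circular address wrap-around */
    }
    d->head = (d->head + 1) % TAPS;
    return acc;
}
```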

© 2004, TU Kaiserslautern 89 Stanford Streaming Languages: Brook Brook defines much more abstract streams: dynamic-length streams; can be multidimensional (but only fixed length?). The entire stream is a monolithic object. A stream programming language for modern graphics hardware.

© 2004, TU Kaiserslautern 90 Stanford Streaming Languages: Streams-C Streams-C defines 1-D, array-like streams –Arbitrary sections of the base arrays can be selected in advance for streaming into execution kernels. The LANL open-source C compiler for reconfigurable logic, called Streams-C, accepts a subset of the C programming language, performs behavioral synthesis, and outputs synthesizable RTL VHDL code. It currently targets Xilinx Virtex-2000 devices on Annapolis Microsystems' Firebird board. (Gokhale's remarks on the streaming mind set are quoted on slide 87.)

© 2004, TU Kaiserslautern 91 Streaming Languages: Parallelism Independent thread parallelism –Stick with pthreads or another high-level definition. Loop-iteration, data-division parallelism –Divide loop iterations among functional units –Loop iterations must be data-independent (no critical dependencies). Pipelining of segments of “serial” code –Find places to overlap non-dependent portions of serial code. Ex. 1: Start a later loop before an earlier one finishes. Ex. 2: Start different functions on different processors –Harder than loop-iteration parallelism because of load balancing. Pipelining between timesteps –Run multiple timesteps in parallel, using a pipeline –Doesn't necessarily require finding overlap of loops or functions; running them on different timesteps makes them data-parallel –StreamIt is the best example of a language designed to optimize for this, and up to now I don't think any of our proposals have addressed it.
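The loop-iteration, data-division style can be sketched with pthreads, as the slide itself suggests (a minimal illustration; `parallel_sum` and `chunk_t` are hypothetical names, and the iterations are data-independent, so the chunks need no synchronization beyond the final join):

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 2

typedef struct { const int *data; size_t lo, hi; long sum; } chunk_t;

/* One worker per chunk: iterations are data-independent, so each
 * functional unit (here: thread) gets a disjoint slice of the loop. */
static void *sum_chunk(void *arg) {
    chunk_t *c = arg;
    c->sum = 0;
    for (size_t i = c->lo; i < c->hi; i++)
        c->sum += c->data[i];
    return NULL;
}

long parallel_sum(const int *data, size_t n) {
    pthread_t tid[NTHREADS];
    chunk_t chunk[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        chunk[t] = (chunk_t){ data, n * t / NTHREADS,
                              n * (t + 1) / NTHREADS, 0 };
        pthread_create(&tid[t], NULL, sum_chunk, &chunk[t]);
    }
    long total = 0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);     /* only sync point: the final join */
        total += chunk[t].sum;
    }
    return total;
}
```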

© 2004, TU Kaiserslautern 92 AGUs

© 2004, TU Kaiserslautern 93 Speed-up Enablers DRC: 4 orders of magnitude. Avoid address computation overhead. Translate into super-systolic structures rather than into instruction streams. Determine interconnect fabrics by compilation, not before fabrication. Determine memory architecture by compilation, not before fabrication.

© 2004, TU Kaiserslautern 94 application-specific distributed memory Application-specific memory: rapidly growing markets: –IP cores –Module generators –EDA environments Optimization of memory bandwidth for application-specific distributed memory Power and area optimization as a further benefit Key issues of address generators will be discussed

© 2004, TU Kaiserslautern 95 Acceleration Mechanisms Parallelism by multi-bank memory architecture; auxiliary hardware for address calculation; address calculation before run time; avoiding multiple accesses to the same data; avoiding memory cycles for address computation; improving parallelism by storage scheme transformations; improving parallelism by memory architecture transformations; alleviating interconnect overhead (delay, power and area).

© 2004, TU Kaiserslautern 96 Significance of Address Generators Address generators have the potential to reduce computation time significantly. In a grid-based design rule check, a speed-up of more than 2000 has been achieved, compared to a VAX-11/750. Dedicated address generators contributed a factor of 10 by avoiding memory cycles for address computation overhead.
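The "address calculation before run time" idea can be sketched in C (a hypothetical model loosely inspired by GAG-style scan windows, not the actual hardware interface; `gag_video_scan` is an invented name): the entire address sequence for a rectangular scan window is generated once, up front, so the compute loop spends no memory cycles on address arithmetic.

```c
#include <stddef.h>

/* Generate the full address sequence for a w-by-h scan window at
 * (x0, y0) in a row-major image, before the compute loop runs.
 * Returns the number of addresses written to out. */
size_t gag_video_scan(size_t *out, size_t x0, size_t y0,
                      size_t w, size_t h, size_t row_stride) {
    size_t n = 0;
    for (size_t y = 0; y < h; y++)
        for (size_t x = 0; x < w; x++)
            out[n++] = (y0 + y) * row_stride + (x0 + x);
    return n;
}
```

A compute loop then just consumes `out[i]` as a pure data stream: the address stream and the data stream are decoupled, which is the anti-machine view of memory access.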

© 2004, TU Kaiserslautern 97 Smart Address Generators 1983: The Structured Memory Access (SMA) Machine 1984: The GAG (generic address generator) 1989: Application-specific Address Generator (ASAG) 1990: The slider method: GAG of the MoM-2 machine 1991: The AGU 1994: The GAG of the MoM-3 machine 1997: The Texas Instruments TMS320C54x DSP 1997: Intersil HSP45240 Address Sequencer 1999: Adopt (IMEC)

© 2004, TU Kaiserslautern 98 Adopt (from IMEC) A cMMU (customized MMU) synthesis environment: application-specific ACUs (Address Calculation Units) for array index reference; ACU as a counter modified by multi-level logic filter; ACU with ASUs (Application-Specific Units) from a Cathedral-3 library; distributed ACU alleviates interconnect overhead (delay, power, area); nested-loop minimization by algebraic transformations; AE (address expression) splitting/clustering; AE multiplexing to obtain interleaved ASs (address sequences); other features. For more details on Adopt see the paper on the proceedings CD-ROM.

© 2004, TU Kaiserslautern 99 Linear Filter Application [figure: design progression from the initial design via hardware-level access optimization and scan-line unrolling to the final design after inner scan-line loop unrolling, with parallelized, merged buffers; example image of x=22 by y=11 pixels]
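The inner scan-line loop unrolling step can be sketched for a 1-D 3-tap linear filter (an illustrative reduction of the 2-D example; `filter_rolled` and `filter_unrolled` are invented names, and the fixed kernel 1-2-1 is an assumption): unrolling the tap loop merges the accesses so the per-tap loop overhead and repeated buffer reads disappear.

```c
#include <stddef.h>

/* Rolled version: an inner loop over the filter taps per pixel. */
void filter_rolled(const int *in, int *out, size_t n) {
    static const int k[3] = {1, 2, 1};
    for (size_t i = 1; i + 1 < n; i++) {
        int acc = 0;
        for (int t = -1; t <= 1; t++)
            acc += k[t + 1] * in[i + t];
        out[i] = acc;
    }
}

/* Unrolled version: taps merged into one expression, no tap loop,
 * which exposes the accesses for buffer merging and parallelization. */
void filter_unrolled(const int *in, int *out, size_t n) {
    for (size_t i = 1; i + 1 < n; i++)
        out[i] = in[i - 1] + 2 * in[i] + in[i + 1];
}
```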

© 2004, TU Kaiserslautern 100 super systolic

© 2004, TU Kaiserslautern 101 Supersystolic Array Principles Take systolic array principles; replace classical synthesis by simulated annealing. This yields the supersystolic array: a generalization of the systolic array, no longer restricted to regular data dependencies. Now reconfigurability makes sense: use morphware.
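The systolic starting point can be sketched in C as a clock-by-clock cell model (a minimal sketch of the local-communication idea only: a transposed-form FIR pipeline where partial sums ripple one register per clock; it is not the KressArray's actual schedule, and `fir_cell_clock` is an invented name):

```c
#define TAPS 3

/* One clock tick of a 3-tap FIR pipeline in transposed form: each
 * partial sum moves exactly one register per clock, so every "cell"
 * talks only to its neighbour -- the local-communication principle
 * behind systolic (and supersystolic) arrays.  Returns the output
 * for the sample fed in at this clock. */
int fir_cell_clock(int s[TAPS - 1], const int h[TAPS], int x) {
    int y = s[0] + h[0] * x;   /* last cell emits the finished sum */
    s[0] = s[1] + h[1] * x;    /* middle cell: add tap, pass sum on */
    s[1] = h[2] * x;           /* first cell starts a new partial sum */
    return y;
}
```

Feeding a unit impulse reproduces the tap weights one clock apart, which shows the pipeline depth equals the number of cells.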

© 2004, TU Kaiserslautern 102 KressArray Family generic fabrics: a few examples [figure: select the nearest-neighbour (NN) interconnect; select mode, number and width of the rDPU's NN ports; select the function repertory; examples of 2nd-level interconnect layouted over the rDPU cell, with no separate routing areas; rout-through and function; more NN ports yield richer routing resources]

© 2004, TU Kaiserslautern 103 Super Pipe Networks The key is mapping, rather than architecture * *) KressArray [1995]

© 2004, TU Kaiserslautern 104 KressArray principles Take systolic array principles; replace classical synthesis by simulated annealing. This yields the super systolic array: a generalization of the systolic array, no longer restricted to regular data dependencies. Now reconfigurability makes sense.

© 2004, TU Kaiserslautern 105 Hardware / Configware / Software Partitioning [diagram: an algorithm is partitioned across the procedural (instruction-stream-based) and structural (data-stream-based) domains; brain usage: both hemispheres; targets: hardware only, configware only, or software only]

© 2004, TU Kaiserslautern 106 Hardware / Configware / Software Partitioning [diagram, continued: partitioning targets Configware & Software: SW / CW co-design]

© 2004, TU Kaiserslautern 107 Hardware / Configware / Software Partitioning [diagram, continued: partitioning targets Hardware & Configware: HW / CW co-design]

© 2004, TU Kaiserslautern 108 Hardware / Configware / Software Partitioning [diagram, continued: partitioning targets Hardware & Software: HW / SW co-design]

© 2004, TU Kaiserslautern 109 Hardware / Configware / Software Partitioning [diagram, continued: partitioning targets Hardware & Configware & Software: HW / CW / SW co-design]

© 2004, TU Kaiserslautern 110 Hardware / Configware / Software Partitioning: skills needed The partitioning of an algorithm spans the procedural (instruction-stream-based) and structural (data-stream-based) domains (brain usage: both hemispheres); the skills needed: to cope with each of HW, CW and SW, or with any combination of co-design: SW / HW, SW / CW / HW, SW / CW, CW / HW.

© 2004, TU Kaiserslautern 111 Please excuse: my mission is to contribute to dissolving this digital divide. Because of the limited time slot of my talk, some of the models and perspectives used are somewhat dramatized. Where this looks offensive, please forgive me.