Warp Processors Frank Vahid (Task Leader) Department of Computer Science and Engineering University of California, Riverside Associate Director, Center.

Slides:



Advertisements
Similar presentations
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture VLIW Steve Ko Computer Sciences and Engineering University at Buffalo.
Advertisements

Conjoining Soft-Core FPGA Processors David Sheldon a, Rakesh Kumar b, Frank Vahid a*, Dean Tullsen b, Roman Lysecky c a Department of Computer Science.
High Performance Embedded Computing © 2007 Elsevier Chapter 7, part 1: Hardware/Software Co-Design High Performance Embedded Computing Wayne Wolf.
The Warp Processor Dynamic SW/HW Partitioning David Mirabito A presentation based on the published works of Dr. Frank Vahid - Principal Investigator Dr.
Extensible Processors. 2 ASIP Gain performance by:  Specialized hardware for the whole application (ASIC). −  Almost no flexibility. −High cost.  Use.
Self-Improving Computer Chips – Warp Processing Contributing Ph.D. Students Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ. of Arizona Greg Stitt (Ph.D.
Warp Processors Frank Vahid (Task Leader) Department of Computer Science and Engineering University of California, Riverside Associate Director, Center.
Lecture 26: Reconfigurable Computing May 11, 2004 ECE 669 Parallel Computer Architecture Reconfigurable Computing.
Application-Specific Customization of Parameterized FPGA Soft-Core Processors David Sheldon a, Rakesh Kumar b, Roman Lysecky c, Frank Vahid a*, Dean Tullsen.
A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning Roman Lysecky, Frank Vahid* Department of Computer Science and Engineering.
A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning Roman Lysecky, Frank Vahid* Department.
Behavioral Design Outline –Design Specification –Behavioral Design –Behavioral Specification –Hardware Description Languages –Behavioral Simulation –Behavioral.
Dynamic FPGA Routing for Just-in-Time Compilation Roman Lysecky a, Frank Vahid a*, Sheldon X.-D. Tan b a Department of Computer Science and Engineering.
The New Software: Invisible Ubiquitous FPGAs that Enable Next-Generation Embedded Systems Frank Vahid Professor Department of Computer Science and Engineering.
Warp Processors Frank Vahid (Task Leader) Department of Computer Science and Engineering University of California, Riverside Associate Director, Center.
Warp Processor: A Dynamically Reconfigurable Coprocessor Frank Vahid Professor Department of Computer Science and Engineering University of California,
Warp Processors (a.k.a. Self-Improving Configurable IC Platforms) Frank Vahid (Task Leader) Department of Computer Science and Engineering University of.
Portability for FPGA Applications—Warp Processing and SystemC Bytecode Contributing Ph.D. Students Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ.
Configurable System-on-Chip: Xilinx EDK
Warp Processing – Towards FPGA Ubiquity Frank Vahid Professor Department of Computer Science and Engineering University of California, Riverside Associate.
Reconfigurable Computing S. Reda, Brown University Reconfigurable Computing (EN2911X, Fall07) Lecture 08: RC Principles: Software (1/4) Prof. Sherief Reda.
Frank Vahid, UC Riverside 1 Recent Results at UCR with Configurable Cache and Hw/Sw Partitioning Frank Vahid Associate Professor Dept. of Computer Science.
Application-Specific Customization of Microblaze Processors, and other UCR FPGA Research Frank Vahid Professor Department of Computer Science and Engineering.
Trend towards Embedded Multiprocessors Popular Examples –Network processors (Intel, Motorola, etc.) –Graphics (NVIDIA) –Gaming (IBM, Sony, and Toshiba)
Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators Greg Stitt Dept. of ECE University of Florida This research was supported in part.
Self-Improving Computer Chips – Warp Processing Contributing Ph.D. Students Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ. of Arizona Greg Stitt (Ph.D.
1 Energy Savings and Speedups from Partitioning Critical Software Loops to Hardware in Embedded Systems Greg Stitt, Frank Vahid, Shawn Nematbakhsh University.
Automatic Tuning of Two-Level Caches to Embedded Applications Ann Gordon-Ross and Frank Vahid* Department of Computer Science and Engineering University.
Frank Vahid, UC Riverside 1 Self-Improving Configurable IC Platforms Frank Vahid Associate Professor Dept. of Computer Science and Engineering University.
Dynamic Hardware/Software Partitioning: A First Approach Greg Stitt, Roman Lysecky, Frank Vahid* Department of Computer Science and Engineering University.
Dynamic Hardware Software Partitioning A First Approach Komal Kasat Nalini Kumar Gaurav Chitroda.
A Fast On-Chip Profiler Memory Roman Lysecky, Susan Cotterell, Frank Vahid* Department of Computer Science and Engineering University of California, Riverside.
COMPUTER SCIENCE &ENGINEERING Compiled code acceleration on FPGAs W. Najjar, B.Buyukkurt, Z.Guo, J. Villareal, J. Cortes, A. Mitra Computer Science & Engineering.
Paper Review: XiSystem - A Reconfigurable Processor and System
Automated Design of Custom Architecture Tulika Mitra
Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis Greg Stitt Department of Electrical and Computer Engineering University of Florida.
Sogang University Advanced Computing System Chap 1. Computer Architecture Hyuk-Jun Lee, PhD Dept. of Computer Science and Engineering Sogang University.
1 Computer Architecture Research Overview Rajeev Balasubramonian School of Computing, University of Utah
Advanced Computer Architecture, CSE 520 Generating FPGA-Accelerated DFT Libraries Chi-Li Yu Nov. 13, 2007.
CSE 494: Electronic Design Automation Lecture 2 VLSI Design, Physical Design Automation, Design Styles.
1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.
“Politehnica” University of Timisoara Course No. 2: Static and Dynamic Configurable Systems (paper by Sanchez, Sipper, Haenni, Beuchat, Stauffer, Uribe)
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) Reconfigurable Architectures Forces that drive.
Jason Li Jeremy Fowers 1. Speedups and Energy Reductions From Mapping DSP Applications on an Embedded Reconfigurable System Michalis D. Galanis, Gregory.
Making Good Points : Application-Specific Pareto-Point Generation for Design Space Exploration using Rigorous Statistical Methods David Sheldon, Frank.
Survey of multicore architectures Marko Bertogna Scuola Superiore S.Anna, ReTiS Lab, Pisa, Italy.
Exploiting Parallelism
Codesigned On-Chip Logic Minimization Roman Lysecky & Frank Vahid* Department of Computer Science and Engineering University of California, Riverside *Also.
WARP PROCESSORS ROMAN LYSECKY GREG STITT FRANK VAHID Presented by: Xin Guan Mar. 17, 2010.
Memory Buffering Techniques Greg Stitt ECE Department University of Florida.
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) CPRE 583 Reconfigurable Computing Lecture.
Scott Sirowy, Chen Huang, and Frank Vahid † Department of Computer Science and Engineering University of California, Riverside {ssirowy,chuang,
On-Chip Logic Minimization Roman Lysecky & Frank Vahid* Department of Computer Science and Engineering University of California, Riverside *Also with the.
Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis Greg Stitt Department of Electrical and Computer Engineering University of Florida.
1 Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware Ann Gordon-Ross and Frank Vahid* Department of Computer Science and Engineering.
Buffering Techniques Greg Stitt ECE Department University of Florida.
Buffering Techniques Greg Stitt ECE Department University of Florida.
Automated Software Generation and Hardware Coprocessor Synthesis for Data Adaptable Reconfigurable Systems Andrew Milakovich, Vijay Shankar Gopinath, Roman.
A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation Roman Lysecky a, Frank Vahid a*, Sheldon X.-D. Tan b a Department of Computer.
Dynamic and On-Line Design Space Exploration for Reconfigurable Architecture Fakhreddine Ghaffari, Michael Auguin, Mohamed Abid Nice Sophia Antipolis University.
Memory Buffering Techniques
Dynamo: A Runtime Codesign Environment
Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis
Autonomously Adaptive Computing: Coping with Scalability, Reliability, and Dynamism in Future Generations of Computing Roman Lysecky Department of Electrical.
A High Performance SoC: PkunityTM
Dynamic FPGA Routing for Just-in-Time Compilation
Dynamic Hardware/Software Partitioning: A First Approach
Warp Processor: A Dynamically Reconfigurable Coprocessor
Portable SystemC-on-a-Chip
Automatic Tuning of Two-Level Caches to Embedded Applications
Presentation transcript:

Warp Processors Frank Vahid (Task Leader) Department of Computer Science and Engineering University of California, Riverside Associate Director, Center for Embedded Computer Systems, UC Irvine Task ID: July 2005 – June 2008 Ph.D. students: Greg Stitt – Ph.D. June 2007, now Asst. Prof. at Univ. of Florida, Gainseville Ann Gordon-Ross – Ph.D. June 2007, now Asst. Prof. at Univ. of Florida, Gainseville David Sheldon Ph.D. expected 2009 Scott Sirowy Ph.D. expected 2010 Industrial Liaisons: Brian W. Einloth, Motorola Dave Clark, Darshan Patra, Intel Jeff Welser, Scott Lekuch, IBM

Frank Vahid, UCR 2 Task Description Warp processing background Idea: Invisibly move binary regions from microprocessor to FPGA  10x speedups or more, energy gains too Task– Mature warp technology Years 1/2 Automatic high-level construct recovery from binaries In-depth case studies (with Freescale) Warp-tailored FPGA prototype (with Intel) Years 2/3 Reduce memory bottleneck by using smart buffer Investigate domain-specific-FPGA concepts (with Freescale) Consider desktop/server domains (with IBM)

Frank Vahid, UCR 3 Binary Translation VLIW µP Background Motivated by commercial dynamic binary translation of early 2000s x86 Binary VLIW Binary FPGA µP Binary Warp processing (Lysecky/Stitt/Vahid ): dynamically translate binary to circuits on FPGAs Performance e.g., Transmeta Crusoe “code morphing” Binary “Translation”

Frank Vahid, UCR 4 µP FPGA On-chip CAD Warp Processing Background Profiler Initially, software binary loaded into instruction memory 1 I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary

Frank Vahid, UCR 5 µP FPGA On-chip CAD Warp Processing Background Profiler I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary Microprocessor executes instructions in software binary 2 µP

Frank Vahid, UCR 6 µP FPGA On-chip CAD Warp Processing Background Profiler µP I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary Profiler monitors instructions and detects critical regions in binary 3 Profiler add beq Critical Loop Detected

Frank Vahid, UCR 7 µP FPGA On-chip CAD Warp Processing Background Profiler µP I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary On-chip CAD reads in critical region 4 Profiler On-chip CAD

Frank Vahid, UCR 8 µP FPGA Dynamic Part. Module (DPM) Warp Processing Background Profiler µP I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary On-chip CAD decompiles critical region into control data flow graph (CDFG) 5 Profiler On-chip CAD loop: reg4 := reg4 + mem[ reg2 + (reg3 << 1)] reg3 := reg3 + 1 if (reg3 < 10) goto loop ret reg4 reg3 := 0 reg4 := 0 Decompilation surprisingly effective at recovering high-level program structures Stitt et al ICCAD’02, DAC’03, CODES/ISSS’05, ICCAD’05, FPGA’05, TODAES’06, TODAES’07 Recover loops, arrays, subroutines, etc. – needed to synthesize good circuits

Frank Vahid, UCR 9 µP FPGA Dynamic Part. Module (DPM) Warp Processing Background Profiler µP I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit 6 Profiler On-chip CAD loop: reg4 := reg4 + mem[ reg2 + (reg3 << 1)] reg3 := reg3 + 1 if (reg3 < 10) goto loop ret reg4 reg3 := 0 reg4 :=

Frank Vahid, UCR 10 µP FPGA Dynamic Part. Module (DPM) Warp Processing Background Profiler µP I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary On-chip CAD maps circuit onto FPGA 7 Profiler On-chip CAD loop: reg4 := reg4 + mem[ reg2 + (reg3 << 1)] reg3 := reg3 + 1 if (reg3 < 10) goto loop ret reg4 reg3 := 0 reg4 := CLB SM ++ FPGA Lean place&route/FPGA  10x faster CAD (Lysecky et al DAC’03, ISSS/CODES’03, DATE’04, DAC’04, DATE’05, FCCM’05, TODAES’06) Multi-core chips – use 1 powerful core for CAD

Frank Vahid, UCR 11 µP FPGA Dynamic Part. Module (DPM) Warp Processing Background Profiler µP I Mem D$ Mov reg3, 0 Mov reg4, 0 loop: Shl reg1, reg3, 1 Add reg5, reg2, reg1 Ld reg6, 0(reg5) Add reg4, reg4, reg6 Add reg3, reg3, 1 Beq reg3, 10, -5 Ret reg4 Software Binary8 Profiler On-chip CAD loop: reg4 := reg4 + mem[ reg2 + (reg3 << 1)] reg3 := reg3 + 1 if (reg3 < 10) goto loop ret reg4 reg3 := 0 reg4 := CLB SM ++ FPGA On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to “warp” by an order of magnitude or more Mov reg3, 0 Mov reg4, 0 loop: // instructions that interact with FPGA Ret reg4 FPGA Software-only “Warped” >10x speedups for some apps

Frank Vahid, UCR 12 Warp Scenarios µP Time µP (1 st execution) Time On-chip CAD µP FPGA Speedup Long Running Applications Recurring Applications Long-running applications Scientific computing, etc. Recurring applications (save FPGA configurations) Common in embedded systems Might view as (long) boot phase On-chip CAD Single-execution speedup FPGA Possible platforms: Xilinx Virtex II Pro, Altera Excalibur, Cray XD1, SGI Altix, Intel QuickAssist,... Warping takes time – when useful?

Frank Vahid, UCR 13 µP Thread Warping - Overview FPGA µP OS µP f() Compiler Binary for (i = 0; i < 10; i++) { thread_create( f, i ); } f() µP On-chip CAD Acc. Lib f() OS schedules threads onto available µPs Remaining threads added to queue OS invokes on-chip CAD tools to create accelerators for f() OS schedules threads onto accelerators (possibly dozens), in addition to µPs Thread warping: use one core to create accelerator for waiting threads Very large speedups possible – parallelism at bit, arithmetic, and now thread level too Performance Multi-core platforms  multi- threaded apps

Frank Vahid, UCR 14 Decompilation Memory Access Synchronization High-level Synthesis Thread Functions Netlist Binary Updater Updated Binary Hw/Sw Partitioning Hw Sw Thread Group Table Thread Warping Tools Invoked by OS Uses pthread library (POSIX) Mutex/semaphore for synchronization Defined methods/algorithms of a thread warping framework Accelerator Instantiation Thread Queue Thread Functions Thread Counts Accelerator Synthesis Accelerator Library FPGA Not In Library? Done Accelerators Synthesized? Queue Analysis false true Updated Binary Schedulable Resource List Place&Route Thread Group Table Netlist Bitfile On-chip CAD FPGA µPµP Accelerator Synthesis Memory Access Synchronization

Frank Vahid, UCR 15 Must deal with widely known memory bottleneck problem FPGAs great, but often can’t get data to them fast enough void f( int a[], int val ) { int result; for (i = 0; i < 10; i++) { result += a[i] * val; }.... } Memory Access Synchronization (MAS) Same array FPGA b()a() RAM Data for dozens of threads can create bottleneck for (i = 0; i < 10; i++) { thread_create( thread_function, a, i ); } DMA Threaded programs exhibit unique feature: Multiple threads often access same data Solution: Fetch data once, broadcast to multiple threads (MAS) ….

Frank Vahid, UCR 16 Memory Access Synchronization (MAS) 1) Identify thread groups – loops that create threads for (i = 0; i < 100; i++) { thread_create( f, a, i ); } void f( int a[], int val ) { int result; for (i = 0; i < 10; i++) { result += a[i] * val; }.... } Thread Group Def-Use: a is constant for all threads Addresses of a[0-9] are constant for thread group f() ……………… f() DMA RAM A[0-9] Before MAS: 1000 memory accesses After MAS: 100 memory accesses Data fetched once, delivered to entire group 2) Identify constant memory addresses in thread function Def-use analysis of parameters to thread function 3) Synthesis creates a “combined” memory access Execution synchronized by OS enable (from OS)

Frank Vahid, UCR 17 Memory Access Synchronization (MAS) Also detects overlapping memory regions – “windows” void f( int a[], int i ) { int result; result += a[i]+a[i+1]+a[i+2]+a[i+3];.... } for (i = 0; i < 100; i++) { thread_create( thread_function, a, i ); } a[0]a[1]a[2]a[3]a[4]a[5] ……… f() ……………… f() DMA RAM A[0-103] A[0-3] A[1-4] A[6-9] Data streamed to “smart buffer” Smart Buffer Buffer delivers window to each thread W/O smart buffer: 400 memory accesses With smart buffer: 104 memory accesses Synthesis creates extended “smart buffer” [Guo/Najjar FPGA04] Caches reused data, delivers windows to threads Each thread accesses different addresses – but addresses may overlap enable

Frank Vahid, UCR 18 Framework Accelerator Instantiation Thread Queue Thread Functions Thread Counts Accelerator Synthesis Accelerator Library FPGA Not In Library? Done Accelerators Synthesized? Queue Analysis false true Updated Binary Schedulable Resource List Place&Route Thread Group Table Netlist Bitfile Also developed initial algorithms for: Queue analysis Accelerator instantiation OS scheduling of threads to accelerators and cores

Frank Vahid, UCR 19 Thread Warping Example int main( ) { for (i=0; i < 50; i++) { thread_create( filter, a, b, i ); } } void filter( int a[51], int b[50], int i, ) { b[i] =avg( a[i], a[i+1], a[i+2], a[i+3] ); } FPGA µP OS µP main() filter() µP On-chip CAD filter() Thread Queue Queue Analysis Thread functions: filter() filter() threads execute on available cores Remaining threads added to queue OS invokes CAD (due to queue size or periodically) CAD tools identify filter() for synthesis

Frank Vahid, UCR 20 Example int main( ) { for (i=0; i < 50; i++) { thread_create( filter, a, b, i ); } } void filter( int a[51], int b[50], int i, ) { b[i] =avg( a[i], a[i+1], a[i+2], a[i+3] ); } FPGA µP OS µP main() filter() µP On-chip CAD filter() filter() binary Decompilation CDFG Memory Access Synchronization MAS detects overlapping windows MAS detects thread group CAD reads filter() binary

Frank Vahid, UCR 21 Example int main( ) { for (i=0; i < 50; i++) { thread_create( filter, a, b, I ); } } void filter( int a[51], int b[50], int i, ) { b[i] =avg( a[i], a[i+1], a[i+2], a[i+3] ); } FPGA µP OS µP main() filter() µP On-chip CAD filter() filter() binary Decompilation CDFG Memory Access Synchronization High-level Synthesis >> 2 filter()..... Smart Buffer RAM Accelerator Library filter Synthesis creates pipelined accelerator for filter() group: 8 accelerators Stored for future use Accelerators loaded into FPGA

Frank Vahid, UCR 22 Example int main( ) { for (i=0; i < 50; i++) { thread_create( filter, a, b, I ); } } void filter( int a[53], int b[50], int i, ) { b[i][=avg( a[i], a[i+1], a[i+2], a[i+3] ); } FPGA µP OS µP main() filter() µP On-chip CAD filter()..... Smart Buffer RAM filter a[0-52] a[2-5] a[9-12] Smart buffer streams a[] data After buffer fills, delivers a window to all eight accelerators OS schedules threads to accelerators enable (from OS)

Frank Vahid, UCR 23 Example int main( ) { for (i=0; i < 50; i++) { thread_create( filter, a, b, I ); } } void filter( int a[53], int b[50], int i, ) { b[i] =avg( a[i], a[i+1], a[i+2], a[i+3] ); } FPGA µP OS µP main() filter() µP On-chip CAD filter()..... Smart Buffer RAM filter a[0-53] a[10-13] a[17-20] Each cycle, smart buffer delivers eight more windows – pipeline remains full

Frank Vahid, UCR 24 Example int main( ) { for (i=0; i < 50; i++) { thread_create( filter, a, b, I ); } } void filter( int a[53], int b[50], int i, ) { b[i] =avg( a[i], a[i+1], a[i+2], a[i+3] ); } FPGA µP OS µP main() filter() µP On-chip CAD filter()..... Smart Buffer RAM filter a[0-53] b[2-9] Accelerators create 8 outputs after pipeline latency passes

Frank Vahid, UCR 25 Example int main( ) { for (i=0; i < 50; i++) { thread_create( filter, a, b, I ); } } void filter( int a[53], int b[50], int i, ) { b[i] =avg( a[i], a[i+1], a[i+2], a[i+3] ); } FPGA µP OS µP main() filter() µP On-chip CAD filter()..... Smart Buffer RAM filter a[0-53] b[10-17] Thread warping: 8 pixel outputs per cycle Software: 1 pixel output every ~9 cycles 72x cycle count improvement Additional 8 outputs each cycle

Frank Vahid, UCR 26 Experiments to Determine Thread Warping Performance: Simulator Setup main filter ………… main …… Parallel Execution Graph (PEG) – represents thread level parallelism Nodes: Sequential execution blocks (SEBs) Edges: pthread calls Generate PEG using pthread wrappers Determine SEB performances Sw: SimpleScalar Hw: Synthesis/simulation (Xilinx) Event-driven simulation – use defined algoritms to change architecture dynamically Simulation Summary Complete when all SEBs simulated Observe total cycles 1) 2) 3) 4) Optimistic for Sw execution (no memory contention) Pessimistic for warped execution (accelerators/microprocessors execute exclusively) int main( ) { for (i=0; i < 50; i++) { thread_create( filter, a, b, I ); } }

Frank Vahid, UCR 27 Experiments Benchmarks: Image processing, DSP, scientific computing Highly parallel examples to illustrate thread warping potential We created multithreaded versions Base architecture – 4 ARM cores Focus on recurring applications (embedded) TW: FPGA running at whatever frequency determined by synthesis On-chip CAD FPGA µP 4 ARM MHz Compared to 4 ARM MHz + FPGA (synth freq) Multi-core Thread Warping µP

Frank Vahid, UCR 28 Speedup from Thread Warping Average 130x speedup 11x faster than 64-core system Simulation pessimistic, actual results likely better But, FPGA uses additional area So we also compare to systems with 8 to 64 ARM11 uPs – FPGA size = ~36 ARM11s

Frank Vahid, UCR 29 Why Dynamic? Static good, but hiding FPGA opens technique to all sw platforms Standard languages/tools/binaries On-chip CAD FPGA µPµP Any Compiler FPGA µPµP Specialized Compiler Binary Netlist Binary Specialized Language Any Language Static Compiling to FPGAs Dynamic Compiling to FPGAs Can adapt to changing workloads Smaller & more accelerators, fewer & large accelerators, … Can add FPGA without changing binaries – like expanding memory, or adding processors to multiprocessor Custom interconnections, tuned processors, …

Frank Vahid, UCR 30 µP Cache Warp Processing Enables Expandable Logic Concept RAM Expandable RAM uP Performance Profiler µP Cache Warp Tools DMA FPGA RAM Expandable RAM – System detects RAM during start, improves performance invisibly Expandable Logic – Warp tools detect amount of FPGA, invisibly adapt application to use less/more hardware. Expandable Logic Planning MICRO submission

Frank Vahid, UCR 31 Expandable Logic Used our simulation framework Large speedups – 14x to 400x (on scientific apps) Different apps require different amounts of FPGA Expandable logic allows customization of single platform User selects required amount of FPGA No need to recompile/synthesize

Frank Vahid, UCR 32 Dynamic enables Custom Communication µP NoC – Network on a Chip provides communication between multiple cores Problem: Best topology is application dependent Bus Mesh Bus Mesh App1 App2

Frank Vahid, UCR 33 Dynamic enables Custom Communication FPGA NoC – Network on a Chip provides communication between multiple cores Problem: Best topology is application dependent Bus Mesh Bus Mesh App1 App2 µP Warp processing can dynamically choose topology FPGA µP FPGA µP

Frank Vahid, UCR 34 Industrial Interactions Year 2 / 3 Freescale Research visit: F. Vahid to Freescale, Chicago, Spring’06. Talk and full-day research discussion with several engineers. Internships –Scott Sirowy, summer 2006 in Austin (also 2005) Intel Chip prototype: Participated in Intel’s Research Shuttle to build prototype warp FPGA fabric – continued bi-weekly phone meetings with Intel engineers, visit to Intel by PI Vahid and R. Lysecky (now prof. at UofA), several day visit to Intel by Lysecky to simulate design, ready for tapout. June’06–Intel cancelled entire shuttle program as part of larger cutbacks. Research discussions via with liaison Darshan Patra (Oregon). IBM Internship: Ryan Mannion, summer and fall 2006 in Yorktown Heights. Caleb Leak, summer/fall Platform: IBM’s Scott Lekuch and Kai Schleupen 2-day visit to UCR to set up Cell development platform having FPGAs. Technical discussion: Numerous ongoing and phone interactions with S. Lekuch regarding our research on Cell/FPGA platform. Several interactions with Xilinx also

Frank Vahid, UCR 35 Patents “Warp Processor” patent Filed with USPTO summer 2004 Granted winter 2007 SRC has non-exclusive royalty-free license

Frank Vahid, UCR 36 Year 1 / 2 publications New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Fast Configurable-Cache Tuning with a Unified Second-Level Cache. A. Gordon-Ross, F. Vahid, N. Dutt. Int. Symp. on Low-Power Electronics and Design (ISLPED), Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), (Co-authored paper with Freescale) Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware. A. Gordon-Ross and F. Vahid. IEEE Trans. on Computers, Special Issue- Best of Embedded Systems, Microarchitecture, and Compilation Techniques in Memory of B. Ramakrishna (Bob) Rau, Oct A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid and S. Tan. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), A First Look at the Interplay of Code Reordering and Configurable Caches. A. Gordon-Ross, F. Vahid, N. Dutt. Great Lakes Symposium on VLSI (GLSVLSI), April A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning. R. Lysecky and F. Vahid. Design Automation and Test in Europe (DATE), March A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt and F. Vahid. Design Automation and Test in Europe (DATE), March 2005.

Frank Vahid, UCR 37 Year 2 / 3 publications Warp Processing: Dynamic Translation of Binaries to FPGA Circuits. F. Vahid, G. Stitt, and R. Lysecky.. IEEE Computer, 2008 (to appear). C is for Circuits: Capturing FPGA Circuits as Sequential Code for Portability. S. Sirowy, G. Stitt, and F. Vahid. Int. Symp. on FPGAs, Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators. G. Stitt and F. Vahid.. Int. Conf. on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2007, pp A Self-Tuning Configurable Cache. A. Gordon-Ross and F. Vahid. Design Automation Conference (DAC), Binary Synthesis. G. Stitt and F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), Aug Integrated Coupling and Clock Frequency Assignment. S. Sirowy and F. Vahid. International Embedded Systems Symposium (IESS), Soft-Core Processor Customization Using the Design of Experiments Paradigm. D. Sheldon, F. Vahid and S. Lonardi. Design Automation and Test in Europe, A One-Shot Configurable-Cache Tuner for Improved Energy and Performance. A Gordon-Ross, P. Viana, F. Vahid and W. Najjar. Design Automation and Test in Europe, Two Level Microprocessor-Accelerator Partitioning. S. Sirowy, Y. Wu, S. Lonardi and F. Vahid. Design Automation and Test in Europe, Clock-Frequency Partitioning for Multiple Clock Domains Systems-on-a-Chip. S. Sirowy, Y. Wu, S. Lonardi and F. Vahid Conjoining Soft-Core FPGA Processors. D. Sheldon, R. Kumar, F. Vahid, D.M. Tullsen, R. Lysecky. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov A Code Refinement Methodology for Performance-Improved Synthesis from C. G. Stitt, F. Vahid, W. Najjar. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov Application-Specific Customization of Parameterized FPGA Soft-Core Processors. D. Sheldon, R. Kumar, R. Lysecky, F. Vahid, D.M. Tullsen. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov Warp Processors. R. Lysecky, G. Stitt, F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), July 2006, pp Configurable Cache Subsetting for Fast Cache Tuning. P. Viana, A. Gordon-Ross, E. Keogh, E. Barros, F. Vahid. IEEE/ACM Design Automation Conference (DAC), July Techniques for Synthesizing Binaries to an Advanced Register/Memory Structure. G. Stitt, Z. Guo, F. Vahid, and W. Najjar. ACM/SIGDA Symp. on Field Programmable Gate Arrays (FPGA), Feb. 2005, pp