
Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators
Greg Stitt, Dept. of ECE, University of Florida
Frank Vahid, Dept. of CS&E, University of California, Riverside (also with the Center for Embedded Computer Systems, UC Irvine)
This research was supported in part by the National Science Foundation and the Semiconductor Research Corporation.

2/30 Background: Binary Translation. Motivated by commercial dynamic binary translation of the early 2000s, e.g., Transmeta Crusoe "code morphing," which translated x86 binaries to VLIW binaries at run time. Warp processing (Lysecky/Stitt/Vahid) instead dynamically translates a microprocessor binary into circuits on an FPGA, for performance.

3/30 Warp Processing Background. A warp processor pairs a µP (with instruction memory, data cache, and an on-chip profiler) with an FPGA and on-chip CAD. Step 1: Initially, the software binary is loaded into instruction memory.
Software Binary:
Mov reg3, 0
Mov reg4, 0
loop: Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4

4/30 Warp Processing Background. Step 2: The microprocessor executes the instructions in the software binary (the same binary as on the previous slide).

5/30 Warp Processing Background. Step 3: The profiler monitors the executed instructions and detects critical regions in the binary; here the backward branch (add/beq) reveals the critical loop.

6/30 Warp Processing Background. Step 4: The on-chip CAD reads in the critical region.

7/30 Warp Processing Background. Step 5: The on-chip CAD (the dynamic partitioning module, DPM) decompiles the critical region into a control/data flow graph (CDFG):
reg3 := 0
reg4 := 0
loop: reg4 := reg4 + mem[ reg2 + (reg3 << 1) ]
reg3 := reg3 + 1
if (reg3 < 10) goto loop
ret reg4
Decompilation is surprisingly effective at recovering high-level program structures such as loops, arrays, and subroutines, which are needed to synthesize good circuits (Stitt et al., ICCAD'02, DAC'03, CODES/ISSS'05, ICCAD'05, FPGA'05, TODAES'06, TODAES'07).
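For reference, a hypothetical C rendering of the loop this CDFG recovers (the names f, a, and sum are illustrative, and the 16-bit element size is inferred from the shift-by-1 address computation):

int f(const short a[10]) {         /* reg2 holds the base address of a */
    int sum = 0;                   /* reg4 accumulates the result */
    for (int i = 0; i < 10; i++)   /* reg3 is the loop counter */
        sum += a[i];               /* Ld of mem[reg2 + (i << 1)], then Add */
    return sum;                    /* Ret reg4 */
}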

8/30 Warp Processing Background. Step 6: The on-chip CAD synthesizes the decompiled CDFG into a custom (parallel) circuit.

9/30 Warp Processing Background. Step 7: The on-chip CAD maps the circuit onto the FPGA, placing and routing it onto the CLBs and switch matrices. Lean place & route and FPGA fabric yield roughly 10x faster CAD (Lysecky et al., DAC'03, ISSS/CODES'03, DATE'04, DAC'04, DATE'05, FCCM'05, TODAES'06). On multi-core chips, one powerful core can be dedicated to the CAD.

10/30 Warp Processing Background. Step 8: The on-chip CAD replaces instructions in the binary so that the critical region now invokes the hardware, causing performance and energy to "warp" by an order of magnitude or more:
Mov reg3, 0
Mov reg4, 0
loop: // instructions that interact with FPGA
Ret reg4
Relative to software-only execution, the warped binary achieves >10x speedups for some applications.

11/30 Warp Scenarios. Warping takes time, so when is it useful? Long-running applications (scientific computing, etc.): the on-chip CAD runs during the first execution, after which the FPGA accelerates the rest, giving a single-execution speedup. Recurring applications (common in embedded systems): FPGA configurations are saved and reused, so the CAD time can be viewed as a (long) boot phase. Possible platforms: Xilinx Virtex II Pro, Altera Excalibur, Cray XD1, SGI Altix, Intel QuickAssist, ...

12/30 Thread Warping – Overview. Multi-core platforms mean multi-threaded applications, e.g.:
for (i = 0; i < 10; i++) { thread_create( f, i ); }
The OS schedules threads onto the available µPs and adds the remaining threads to a queue. The OS then invokes the on-chip CAD tools (using one core) to create accelerators for f(), and schedules the waiting threads onto those accelerators, possibly dozens of them, in addition to the µPs. Thread warping thus exploits parallelism at the bit and arithmetic levels, and now at the thread level too, so very large speedups are possible.
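The slides use a simplified thread_create(f, i) notation; below is a minimal sketch of the same boss/worker pattern in the POSIX pthread library that the framework assumes (NUM_THREADS, the argument passing, and the empty thread body are illustrative):

#include <pthread.h>

#define NUM_THREADS 10

void *f(void *arg) {
    int i = *(int *)arg;             /* per-thread work item */
    (void)i;                         /* ... thread body ... */
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, f, &ids[i]);   /* boss creates workers */
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);                  /* wait for all workers */
    return 0;
}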

13/30 Thread Warping Tools. The tools are invoked by the OS; applications use the POSIX pthread library, with mutexes/semaphores for synchronization. We defined the methods and algorithms of a thread warping framework: queue analysis examines the thread queue (thread functions and thread counts); if a function is not already in the accelerator library, accelerator synthesis runs (decompilation, memory access synchronization, high-level synthesis, and hw/sw partitioning produce a netlist and a thread group table, and a binary updater produces an updated binary), followed by place & route to produce a bitfile; accelerator instantiation then loads the accelerators onto the FPGA and updates the schedulable resource list used to schedule threads onto the FPGA and µPs.

14/30 Memory Access Synchronization (MAS). We must deal with the widely known memory bottleneck problem: FPGAs are great, but often can't get data to them fast enough, and the data for dozens of threads can itself create a bottleneck. Example:
void f( int a[], int val ) { int result; for (i = 0; i < 10; i++) { result += a[i] * val; } .... }
for (i = 0; i < 10; i++) { thread_create( f, a, i ); }
Threaded programs exhibit a unique feature: multiple threads often access the same data; here every thread reads the same array a[] from RAM over DMA. Solution: fetch the data once and broadcast it to the multiple threads (MAS).

15/30 Memory Access Synchronization (MAS). 1) Identify thread groups, i.e., loops that create threads:
for (i = 0; i < 100; i++) { thread_create( f, a, i ); }
void f( int a[], int val ) { int result; for (i = 0; i < 10; i++) { result += a[i] * val; } .... }
2) Identify constant memory addresses in the thread function via def-use analysis of its parameters: here a is constant for all threads, so the addresses of a[0-9] are constant for the whole thread group. 3) Synthesis creates a "combined" memory access: the data is fetched once over DMA and delivered to the entire group of f() accelerators, with execution synchronized by an enable from the OS. Before MAS: 1000 memory accesses (100 threads x 10 accesses each). After MAS: 100 memory accesses.
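To make the fetch-once idea concrete, here is a rough software analogy of the combined access (a hypothetical sketch; in the framework this is a synthesized DMA plus an OS-controlled enable signal, not pthread code):

#include <pthread.h>

#define N_THREADS 100
#define N_ELEMS   10

static int shared_a[N_ELEMS];        /* one shared copy, filled once (plays the role of the DMA fetch) */
static int results[N_THREADS];
static pthread_barrier_t start;      /* plays the role of the OS "enable" */

static void *f(void *arg) {
    int id = *(int *)arg;
    pthread_barrier_wait(&start);    /* wait until the shared data is ready */
    int result = 0;
    for (int i = 0; i < N_ELEMS; i++)
        result += shared_a[i] * id;  /* every thread reads the single shared copy */
    results[id] = result;
    return NULL;
}

int main(void) {
    pthread_t t[N_THREADS];
    int ids[N_THREADS];
    pthread_barrier_init(&start, NULL, N_THREADS + 1);
    for (int i = 0; i < N_THREADS; i++) {
        ids[i] = i;
        pthread_create(&t[i], NULL, f, &ids[i]);
    }
    for (int i = 0; i < N_ELEMS; i++)    /* fetch/fill the data exactly once ... */
        shared_a[i] = i;
    pthread_barrier_wait(&start);        /* ... then release the whole group */
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}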

16/30 Memory Access Synchronization (MAS). MAS also detects overlapping memory regions, or "windows":
void f( int a[], int i ) { int result; result += a[i]+a[i+1]+a[i+2]+a[i+3]; .... }
for (i = 0; i < 100; i++) { thread_create( f, a, i ); }
Each thread accesses different addresses, but the addresses may overlap: thread 0 reads a[0-3], thread 1 reads a[1-4], and so on. Synthesis creates an extended "smart buffer" [Guo/Najjar FPGA'04]: a[0-103] is streamed from RAM over DMA into the buffer, which caches the reused data and, when enabled, delivers a window to each thread. Without the smart buffer: 400 memory accesses; with the smart buffer: 104.
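A rough software model of the window delivery, assuming a 4-element window and unit stride (illustrative only; the real smart buffer is synthesized hardware that feeds many accelerators in parallel):

#include <stdio.h>

#define N_WINDOWS 100    /* one window per thread */
#define WIN       4      /* window size: a[i..i+3] */
#define BUF_LEN   104    /* a[0-103] streamed in, as on the slide */

int main(void) {
    int a[BUF_LEN];
    for (int i = 0; i < BUF_LEN; i++)
        a[i] = i;                          /* data streamed into the buffer once */

    for (int i = 0; i < N_WINDOWS; i++) {  /* deliver one overlapping window per thread */
        int window[WIN];
        for (int k = 0; k < WIN; k++)
            window[k] = a[i + k];          /* reused elements come from the buffer, not from RAM */
        int result = window[0] + window[1] + window[2] + window[3];
        printf("thread %d: %d\n", i, result);
    }
    return 0;
}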

17/30 Framework. Beyond the accelerator synthesis flow shown on slide 13, we also developed initial algorithms for: queue analysis; accelerator instantiation; and OS scheduling of threads to accelerators and cores.

18/30 Thread Warping Example.
int main( ) { for (i=0; i < 50; i++) { thread_create( filter, a, b, i ); } }
void filter( int a[53], int b[50], int i ) { b[i] = avg( a[i], a[i+1], a[i+2], a[i+3] ); }
The filter() threads execute on the available cores; the remaining threads are added to the thread queue. The OS invokes the CAD (due to the queue size, or periodically), and queue analysis identifies filter() as the thread function to synthesize.

19/30 Example (continued). The CAD reads the filter() binary and decompiles it into a CDFG. Memory access synchronization then detects the thread group and the overlapping windows on a[].

20/30 Example (continued). High-level synthesis creates a pipelined accelerator for the filter() group: eight accelerator copies, each computing the four-input average as additions followed by a shift right by 2. The accelerator is stored in the accelerator library for future use, and the accelerators, together with the smart buffer, are loaded into the FPGA.
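A minimal C view of what each accelerator instance computes per delivered window (the adder-tree-plus-shift structure is our reading of the ">> 2" on the slide; the real accelerator is a pipelined circuit, not C):

/* One filter() accelerator: average of a 4-element window.
   In hardware this is an adder tree followed by a shift, pipelined so that
   each of the eight copies can accept a new window every cycle. */
int filter_accel(const int w[4]) {
    int s01 = w[0] + w[1];      /* first pipeline stage */
    int s23 = w[2] + w[3];      /* first pipeline stage */
    int sum = s01 + s23;        /* second pipeline stage */
    return sum >> 2;            /* divide by 4: the ">> 2" on the slide */
}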

21/30 Example (continued). The OS schedules the threads to the accelerators and asserts the enable. The smart buffer streams the a[] data from RAM; once the buffer fills, it delivers a window (a[2-5] through a[9-12]) to all eight accelerators.

22/30 Example (continued). Each cycle the smart buffer delivers eight more windows (a[10-13] through a[17-20]), so the pipeline remains full.

23/30 Example (continued). Once the pipeline latency has passed, the accelerators produce eight outputs per cycle (b[2-9]).

24/30 Example (continued). Eight additional outputs (b[10-17]) are produced each cycle. Thread warping delivers 8 pixel outputs per cycle, versus 1 pixel output every ~9 cycles in software, for a cycle-count improvement of ~9 x 8 = 72x.

25/30 Experiments to Determine Thread Warping Performance: Simulator Setup. 1) Build a Parallel Execution Graph (PEG) representing the thread-level parallelism: nodes are sequential execution blocks (SEBs) and edges are pthread calls; the PEG is generated using pthread wrappers. 2) Determine SEB performances: software via SimpleScalar, hardware via synthesis and simulation (Xilinx). 3) Run an event-driven simulation, using the defined algorithms to change the architecture dynamically. 4) Simulation summary: the run completes when all SEBs have been simulated, and we observe total cycles. The setup is optimistic for software execution (no memory contention) and pessimistic for warped execution (accelerators and microprocessors execute exclusively).
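A bare-bones skeleton of the kind of event-driven loop step 3 implies (all names are hypothetical; the real simulator walks the PEG and assigns SEBs to accelerators and µPs using the defined algorithms):

#include <stdio.h>

#define MAX_EVENTS 1024

typedef struct { long time; int seb_id; } Event;    /* completion of one SEB */

static Event queue[MAX_EVENTS];
static int n_events = 0;

static void push(long time, int seb_id) {           /* insert, keeping the queue sorted by time */
    int i = n_events++;
    while (i > 0 && queue[i - 1].time > time) { queue[i] = queue[i - 1]; i--; }
    queue[i].time = time;
    queue[i].seb_id = seb_id;
}

int main(void) {
    long now = 0;
    push(0, 0);                                      /* seed with the first SEB of main() */
    while (n_events > 0) {
        Event e = queue[0];                          /* pop the earliest completion event */
        for (int i = 1; i < n_events; i++) queue[i - 1] = queue[i];
        n_events--;
        now = e.time;
        /* On SEB completion: follow the PEG edges, place successor SEBs on a free
           accelerator or microprocessor, and push their completion events using the
           latencies measured by SimpleScalar (sw) or Xilinx simulation (hw). */
    }
    printf("total cycles: %ld\n", now);
    return 0;
}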

26/30 Experiments. Benchmarks: image processing, DSP, and scientific computing; highly parallel examples chosen to illustrate thread warping's potential, for which we created multithreaded versions. The focus is on recurring applications (embedded). Multi-core baseline: 4 ARM cores. Thread warping: the same 4 ARM cores (one running the on-chip CAD) plus an FPGA running at whatever frequency synthesis determines.

27/30 Speedup from Thread Warping. Average 130x speedup. The simulation is pessimistic, so actual results are likely better. But the FPGA uses additional area, so we also compare against systems with 8 to 64 ARM11 µPs (the FPGA occupies roughly the area of 36 ARM11s): thread warping is 11x faster than even the 64-core system.

28/30 Limitations. Dependent on coding practices: assumes a boss/worker thread model. Not all applications are amenable to FPGA speedup. Commercial CAD is slow, so warping takes time; but in the worst case, the FPGA simply goes unused by the application.

29/30 Why not Partition Statically? Static partitioning is good, but hiding the FPGA opens the technique to all software platforms and keeps standard languages, tools, and binaries. Static thread synthesis: a specialized language and specialized compiler produce a binary plus a netlist. Dynamic thread warping: any language and any compiler produce an ordinary binary, and the on-chip CAD creates the netlist at run time. An FPGA can therefore be added without changing binaries, much like expanding memory or adding processors to a multiprocessor, and the system can adapt to changing workloads (smaller and more accelerators, fewer and larger accelerators, ...). Memory-access synchronization is also applicable to the static approach.

30/30 Conclusions. The thread warping framework dynamically synthesizes accelerators for thread functions. Memory access synchronization helps reduce the memory bottleneck problem. We observed 130x speedups for the chosen examples. Future work: handle a wider variety of coding constructs, improve support for different thread models, and address numerous open problems, e.g., dynamic reallocation of FPGA resources.

31/30 Accelerator Synthesis. Thread functions pass through decompilation, memory access synchronization, and hw/sw partitioning; the hardware portion goes through high-level synthesis to produce a netlist and a thread group table, while the binary updater produces the updated binary for the software portion.

32/30 Operating System Scheduler (discussed at poster). The schedulable resource list (SRL) specifies resource amounts for each synthesized thread function, e.g., a() – 1 accelerator, b() – 1 accelerator, 4 microprocessors. The scheduler first checks whether an accelerator is available, then checks the microprocessors; if the current thread cannot be scheduled, it tries the next thread, which prevents blocking of other threads. Worst-case complexity is O(n), where n is the queue size, but this case rarely occurs.
scheduler( queue )
begin
  thread = GetQueueHead( queue );
  while no thread scheduled
    f = GetThreadFunction( thread );
    if (SRL[f] > 0) scheduleHw();
    else if (SRL[uP] > 0) scheduleSw();
    thread = GetNextThread( queue, thread );
  end while
end
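A compact C sketch of the same policy under assumed types (ThreadDesc, MAX_FUNCS, and the schedule_hw/schedule_sw helpers are hypothetical, not part of the published framework):

#include <stddef.h>

#define MAX_FUNCS 16                  /* illustrative bound on distinct thread functions */

typedef struct ThreadDesc {
    int func_id;                      /* which thread function this thread runs */
    struct ThreadDesc *next;          /* next thread in the queue */
} ThreadDesc;

void schedule_hw(ThreadDesc *t);      /* dispatch to an accelerator (assumed elsewhere) */
void schedule_sw(ThreadDesc *t);      /* dispatch to a microprocessor (assumed elsewhere) */

static int srl[MAX_FUNCS];            /* SRL: free accelerators per thread function */
static int srl_up;                    /* free microprocessors */

/* Walk the queue and schedule the first thread that fits, so one stalled
   thread does not block the others. Returns the scheduled thread, or NULL. */
ThreadDesc *scheduler(ThreadDesc *queue_head) {
    for (ThreadDesc *t = queue_head; t != NULL; t = t->next) {
        if (srl[t->func_id] > 0) {
            srl[t->func_id]--;
            schedule_hw(t);           /* accelerator available for this function */
            return t;
        } else if (srl_up > 0) {
            srl_up--;
            schedule_sw(t);           /* otherwise use a free microprocessor */
            return t;
        }
    }
    return NULL;                      /* nothing could be placed this time */
}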

33/30 Accelerator Instantiation (discussed at poster). How many accelerators should be instantiated for each function in the thread queue? We map the question to a 0-1 knapsack problem: place items (accelerators) into the knapsack (the FPGA) without exceeding its capacity (the FPGA area). Profit = accelerator speedup; weight = accelerator area; both are determined during synthesis.
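For concreteness, a textbook 0-1 knapsack solver of the kind this mapping implies (the candidate list and the area/speedup values are made up; the framework's actual selection procedure may differ):

#include <stdio.h>
#include <string.h>

#define N_ITEMS   4         /* candidate accelerator instances, e.g., a(), a(), b(), b() */
#define CAPACITY  100       /* available FPGA area, arbitrary units */

int main(void) {
    int weight[N_ITEMS] = { 40, 40, 30, 30 };   /* accelerator areas (illustrative) */
    int profit[N_ITEMS] = { 12, 12,  8,  8 };   /* accelerator speedups (illustrative) */

    int best[CAPACITY + 1];
    memset(best, 0, sizeof(best));

    for (int i = 0; i < N_ITEMS; i++)                 /* classic 0-1 knapsack DP */
        for (int w = CAPACITY; w >= weight[i]; w--)   /* iterate remaining area downward */
            if (best[w - weight[i]] + profit[i] > best[w])
                best[w] = best[w - weight[i]] + profit[i];

    printf("best total profit within area %d: %d\n", CAPACITY, best[CAPACITY]);
    return 0;
}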

34/30 Operating System Scheduler: Thread Groups (discussed at poster). Synthesis creates accelerators for a group (or a subset of it). The scheduler uses the thread group table (TGT) to schedule groups, which are identified by instruction address; it waits until all threads in a group are ready, then enables the accelerators, so the OS schedules the group by enabling multiple threads simultaneously.
scheduleHw( thread )
begin
  if (TGT[thread] != NULL) then
    group = TGT[thread];
    group.notReadyThreads -= 1;
    if (group.notReadyThreads = 0) then
      scheduleGroup();
      group.notReadyThreads = group.size;
    end if
  else
    // schedule threads not in any group
  end if
end