Continuous Flow Multithreading on FPGA Gilad Tsoran & Benny Fellman Supervised by Dr. Shahar Kvatinsky Bsc. Winter 2014 Final Presentation March 1 st,


1 Continuous Flow Multithreading on FPGA
Gilad Tsoran & Benny Fellman
Supervised by Dr. Shahar Kvatinsky
BSc, Winter 2014
Final Presentation, March 1st, 2015

2 Current Trend: Multithreading
- SMT (Simultaneous MT)
  - Complex pipeline
  - Number of threads is limited
  - High performance
- Interleaved MT (switch each cycle)
  - Simple pipeline
  - No switch penalty
  - Poor resource utilization
- Block MT (switch on event)
  - Simple pipeline, 1 thread active
  - Pipe is flushed on each switch
  - High penalty

3 CFMT: an SoE Improvement
[Figure: pipeline timing diagram. Labels: Cache Miss, Memory unit, End of Memory Operation, Cache Miss again]
- Stands for Continuous Flow Multithreading
- Stores instruction state during a thread switch
- Reduces the switch penalty to the pipe restore time
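The effect of the switch penalty can be illustrated with a toy cycle-count comparison. This is only a sketch: the function names and the assumption that a flushed pipe costs roughly one pipe depth to refill are illustrative and do not come from the presentation.

```python
def soe_switch_penalty(pipe_depth: int) -> int:
    """Assumed SoE cost per switch: the pipe is flushed, so refilling
    it takes about one cycle per pipeline stage."""
    return pipe_depth


def cfmt_switch_penalty(restore_cycles: int) -> int:
    """Assumed CFMT cost per switch: only the time needed to restore
    the stored pipe state."""
    return restore_cycles


def total_switch_overhead(num_switches: int, penalty_per_switch: int) -> int:
    """Total cycles lost to thread switches over a run."""
    return num_switches * penalty_per_switch


# Hypothetical numbers: 1000 switches, 8-stage pipe, 1-cycle restore.
soe_cycles = total_switch_overhead(1000, soe_switch_penalty(8))
cfmt_cycles = total_switch_overhead(1000, cfmt_switch_penalty(1))
```

With these made-up inputs the SoE run loses eight times as many cycles to switching as the CFMT run; the real ratio depends on the actual pipe depth and restore latency.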

4 Memory Intensive
- CFMT consumes a lot of memory to store the pipe state
- Memory is currently considered a bottleneck
- Cheap, fast, on-die memory is required!

5 Memristors
- Envisioned in 1971 by Leon Chua
- Discovered in 2008 by HP Labs
- Provide abundant non-volatile memory

6 Goals
- Evaluate CFMT performance on the SPEC CPU 2006 benchmark suite
- Implement on FPGA
- Memristors modeled by standard registers with an estimated latency
- Construct test environment

7 Our Contribution
- Design was ported from a previous project:
  - Partial support of Alpha ISA
  - Functional simulation
  - G. Satat, N. Wald and S. Kvatinsky, "SOE Multithreading with Memristors," Technion-IIT, 2012.
- We extended the design by:
  - Further validation
  - Modifying it to comply with synthesis design rules
  - Configuring the FPGA
  - Building a collection toolchain

8 Development Environment
- Design written in SystemVerilog
- Simulated on Mentor ModelSim
- Synthesized with Xilinx Vivado
- Data gathered through Xilinx ChipScope
- Programmed on Virtex-7 FPGA
- SPEC2006 compiled with Dan Kegel's Crosstool
- Automated with Python and TCL

9 Core: High Level Design
[Block diagram. Labels: Pipeline, Execution Unit Controller (EUC), Fetch, Dependency check, Addr calc., Exe. unit route, int, Write Back, Thread Memory Controller (TMC), Thread Switch Controller (TSC), Thread State Table (TST), Pipe control, Decode, FP, Mem, Register Read, MPR]

10 Multistate Pipeline Register
- Contains multiple shadowed values in one register
- Building block for the state-storing mechanism
[Figure labels: thread_rd_en, thread_st_en, select_data, signals_from_pipe, signals_to_next_level]
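A behavioral sketch in Python may help picture the register. The port names are taken from the slide, but their exact semantics are assumptions here: `select_data` is treated as the index of a shadow slot, `thread_st_en` as "save the current value into that slot", and `thread_rd_en` as "load the register from that slot instead of from the pipe". The class itself is hypothetical and is not the project's SystemVerilog.

```python
class MultistatePipelineRegister:
    """Toy model of a pipeline register with one shadow slot per thread."""

    def __init__(self, num_slots: int, reset_value: int = 0):
        self.shadow = [reset_value] * num_slots  # shadowed values
        self.value = reset_value                 # signals_to_next_level

    def clock(self, signals_from_pipe: int, thread_st_en: bool = False,
              thread_rd_en: bool = False, select_data: int = 0) -> int:
        """Model one clock edge and return the new register output."""
        if thread_st_en:
            # Save the value currently held for the outgoing thread.
            self.shadow[select_data] = self.value
        if thread_rd_en:
            # Restore a previously saved value for the incoming thread.
            self.value = self.shadow[select_data]
        else:
            # Normal operation: latch whatever the previous stage drives.
            self.value = signals_from_pipe
        return self.value


mpr = MultistatePipelineRegister(num_slots=2)
mpr.clock(0xA)                                       # thread 0 state in flight
mpr.clock(0xB, thread_st_en=True, select_data=0)     # save 0xA, latch 0xB
mpr.clock(0xC)                                       # other thread keeps running
restored = mpr.clock(0xD, thread_rd_en=True, select_data=0)  # restore 0xA
```

The point of the sketch is that a saved value survives while other data flows through the same register and can be brought back in a single step.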

11 Uncore: Xilinx IPs
- Data Memory and Instruction Cache with blk_mem_gen
  - Block Memory Generator, 2^18 deep
- Virtual Input for asynchronous reset with vio
- Clk signal scaling to 10 MHz with clk_wiz

12 Design Deficiencies
- Design does not implement the complete Alpha ISA
- Emulated execution:
  - Floating Point
  - Memory
- Static branch prediction

13 Memory Emulation
- Memory access is emulated
- 4 MB of physical memory with a single-cycle latency
- Latency determined by hit rate
- Hit rate defined per thread
- An increase in thread number degrades the hit rate
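One way to picture hit-rate-driven latency emulation is sketched below. The linear degradation per added thread, the 100-cycle miss latency, and both function names are assumptions made for illustration; the slides do not state how the design derives the hit rate or the miss latency.

```python
import random


def hit_rate_for(base_hit_rate: float, num_threads: int,
                 degradation_per_thread: float = 0.02) -> float:
    """Assumed model: every thread beyond the first lowers the hit rate
    by a fixed amount, never dropping below zero."""
    return max(0.0, base_hit_rate - degradation_per_thread * (num_threads - 1))


def access_latency(hit_rate: float, rng: random.Random,
                   hit_latency: int = 1, miss_latency: int = 100) -> int:
    """Draw the latency of one emulated access: a hit costs a single
    cycle, a miss costs the (assumed) miss latency."""
    return hit_latency if rng.random() < hit_rate else miss_latency


rng = random.Random(0)  # fixed seed so runs are repeatable
rate = hit_rate_for(0.9, num_threads=4)
latencies = [access_latency(rate, rng) for _ in range(1000)]
```

Every drawn latency is either the hit latency or the miss latency, and the share of misses grows as the thread count lowers the hit rate.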

14 SPEC CPU 2006
- Issued by the Standard Performance Evaluation Corporation
- Used to evaluate CPUs worldwide
- A collection of 29 representative programs
- 444.namd was used for our tests
  - Biomolecular systems simulation
  - ALU bound
  - Compiled to Alpha with Optimized Space strategy

15 Testing Methodology
- Test parameters
  - Number of threads
  - Pipe depth
  - Number of execution units
  - Cache hit rate
- 100 million instructions per thread
  - Or until the trace halts
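A sweep over the four test parameters can be enumerated as a simple Cartesian product. The sketch below is a hypothetical helper, not the project's run_exp.py, and the parameter values are placeholders rather than the values used in the experiments.

```python
from itertools import product


def experiment_grid(thread_counts, pipe_depths, exe_unit_counts, hit_rates):
    """Return one configuration dict per combination of test parameters."""
    return [
        {"threads": t, "pipe_depth": d, "exe_units": u, "hit_rate": h}
        for t, d, u, h in product(thread_counts, pipe_depths,
                                  exe_unit_counts, hit_rates)
    ]


# Placeholder values chosen only to show the shape of the sweep.
grid = experiment_grid(
    thread_counts=[1, 2],
    pipe_depths=[5],
    exe_unit_counts=[1, 2],
    hit_rates=[0.9],
)
```

Each entry of `grid` would correspond to one synthesized design and one run, so the sweep size is the product of the list lengths.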

16 Project Toolchain
- run_exp.py
  - Creates Vivado projects with predefined parameters
- synth_design.tcl
  - Synthesis and implementation
- program_design.tcl
  - Controls the run through VIO

17 Output Samples
- Performance counters
  - clock_cycle_counter x #threads
  - pipe_unutilized_cycles_counter
  - long_exe_switch_counter
  - no_exe_unit_switch_counter
  - thread_switch_counter x #threads
  - branch_miss_counter x #threads
  - dep_stall_counter x #threads
  - no_thread_in_pipe_cycles_counter
  - units_busy_counter x #units
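Counters like these are typically reduced to a few summary metrics. The sketch below shows two such reductions under stated assumptions: the retired-instruction count is a hypothetical input (it is not among the listed counters, though the per-thread instruction budget could serve), and the utilization formula is an assumed interpretation of pipe_unutilized_cycles_counter.

```python
def ipc(instructions_retired: int, clock_cycles: int) -> float:
    """Instructions per cycle for one run."""
    if clock_cycles <= 0:
        raise ValueError("clock_cycles must be positive")
    return instructions_retired / clock_cycles


def pipe_utilization(clock_cycles: int, pipe_unutilized_cycles: int) -> float:
    """Fraction of cycles in which the pipe did useful work (assumed
    meaning of pipe_unutilized_cycles_counter)."""
    if clock_cycles <= 0:
        raise ValueError("clock_cycles must be positive")
    return 1.0 - pipe_unutilized_cycles / clock_cycles


# Made-up counter values for illustration only.
example_ipc = ipc(instructions_retired=100_000_000, clock_cycles=200_000_000)
example_util = pipe_utilization(clock_cycles=200, pipe_unutilized_cycles=50)
```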

18 Output Samples (Cont.)
- Utilization Report

19 Utilization Overhead per Thread Num

20 Expected Performance
- Performance should increase with thread count until saturation is met
- Saturation is met when the gap is filled
[Figure: Performance vs. Threads, marking the Gap, CFMT Saturation, and SoE Saturation]
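The expected curve can be approximated with a common first-order model of multithreaded processors. The model is an assumption brought in for illustration and is not taken from the slides: each thread runs for `run` cycles, then stalls for `gap` cycles, and each switch costs `switch` cycles. Utilization grows linearly with the thread count until the gap is filled, then saturates at a level set by the switch cost.

```python
def utilization(num_threads: int, run: float, gap: float, switch: float) -> float:
    """First-order multithreading model (assumed, not from the slides).

    Below saturation: num_threads * run / (run + gap)
    At saturation:    run / (run + switch)
    """
    linear = num_threads * run / (run + gap)
    saturated = run / (run + switch)
    return min(linear, saturated)


# Hypothetical numbers: 10-cycle run, 90-cycle gap.
# SoE pays a 10-cycle flush per switch; CFMT is idealized as 0 cycles.
soe_curve = [utilization(n, run=10, gap=90, switch=10) for n in range(1, 11)]
cfmt_curve = [utilization(n, run=10, gap=90, switch=0) for n in range(1, 11)]
```

With these made-up inputs both curves rise together at first, the SoE curve flattens at 0.5 after five threads, and the idealized CFMT curve keeps rising to 1.0 at ten threads, which matches the shape the slide describes.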

21 Stall Types
- Not all stalls can be masked by a thread switch
  - Branch misprediction
  - Stalls shorter than the MPR load time (like dependency stalls)
- Some are not masked if the pipe is flushed
  - ALU operations
  - L2 cache access
SoE performance depends on this ability!

22 IPC per Thread Num (Multiple Exe Units)

23 IPC per Thread Num (Single Exe Unit)

24 IPC per Pipe Length

25 Failure Points
- Single test level
  - One thread takes over the machine
  - Per-trace data is identical
- Experiment level
  - SoE and CFMT do not always behave the same for a single thread
  - An increase in thread number leads to a performance drop, while increased speedup and saturation are expected
  - SoE performs better than CFMT in some cases

26 Thank You

27 Backup

28 Experiment Process
[Flow diagram. Labels: Verilog sources, Xilinx IP, Generic TCL, run_exp.py, Experiment Definitions, Vivado Project, Design Parameters, Constraints, Trace, Trace Repository, Impl_design.tcl, Bitstream, Virtex 7, program_design.tcl, Chipscope, Results]

