Continuous Flow Multithreading on FPGA  Gilad Tsoran & Benny Fellman  Supervised by Dr. Shahar Kvatinsky  BSc, Winter 2014  Final Presentation, March 1st, 2015

Current Trend - Multithreading  SMT – Simultaneous MT  Complex pipeline  Number of threads is limited  High performance  Interleaved MT – Switch each Cycle  Simple pipeline  No switch penalty  Poor resource utilization  Block MT – Switch on Event (SoE)  Simple pipeline – 1 thread active  Pipe is flushed on each switch  High penalty

CFMT – an SoE Improvement  Stands for Continuous Flow Multithreading  Store instruction state during thread switch  Reduce switch penalty to pipe restore time [Figure: pipeline animation with labels Cache Miss, Memory unit, End of Memory Operation, Cache Miss again]
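The difference in switch cost can be sketched with a small back-of-the-envelope model. This is only an illustration: the function name, the assumption that an SoE flush costs roughly one pipe depth to refill, and all numbers are hypothetical and do not come from the project.

```python
def switch_overhead(num_switches, pipe_depth, mpr_restore_cycles):
    """Cycles lost to thread switches under SoE and under CFMT.

    Assumption: an SoE switch flushes the pipe and costs about
    pipe_depth cycles to refill, while a CFMT switch costs only
    mpr_restore_cycles (the pipe restore time).
    """
    soe_cycles = num_switches * pipe_depth
    cfmt_cycles = num_switches * mpr_restore_cycles
    return soe_cycles, cfmt_cycles


# Hypothetical numbers, chosen only to illustrate the comparison.
soe, cfmt = switch_overhead(num_switches=1000, pipe_depth=8, mpr_restore_cycles=2)
print(f"SoE: {soe} cycles lost, CFMT: {cfmt} cycles lost")
```

Under this model CFMT only wins when the restore time is shorter than the refill time, which is the premise of the technique.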

Memory Intensive  CFMT consumes a lot of memory to store the pipe state  Memory is currently considered a bottleneck  Cheap, fast, on-die memory is required!

Memristors  Envisioned in 1971 by Leon Chua  Discovered in 2008 by HP Labs  Provides abundant non-volatile memory

Goals  Evaluate CFMT performance on SPEC CPU 2006 benchmark suite  Implement on FPGA  Memristors modeled by standard registers with an estimated latency  Construct test environment

Our Contribution  Design was ported from a previous project:  Partial support of Alpha ISA  Functional simulation  G. Satat, N. Wald and S. Kvatinsky, "SOE Multithreading with Memristors," Technion - IIT, 2012.  We extended the design by:  Validating it further  Modifying it to comply with synthesis design rules  Configuring the FPGA  Building a data collection toolchain

Development Environment  Design written in SystemVerilog  Simulated on Mentor ModelSim  Synthesized with Xilinx Vivado  Data gathered through Xilinx ChipScope  Programmed on Virtex-7 FPGA  SPEC2006 compiled with Dan Kegel's Crosstool  Automated with Python and Tcl

Core - High Level Design [Block diagram labels: Pipe Line, Fetch, Decode, Register Read, Dependency check, Addr calc., Exe. unit route, int, FP, Mem, Write Back, MPR, Pipe control, Execution Unit Controller (EUC), Thread Memory Controller (TMC), Thread Switch Controller (TSC), Thread State Table (TST)]

Multistate Pipeline Register (MPR)  Contains multiple shadowed values in one register  Building block for state storing mechanism [Figure signal labels: thread_rd_en, thread_st_en, select_data, signals_from_pipe, signals_to_next_level]
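A behavioral sketch of such a register may help make the idea concrete. The signal names are taken from the slide, but their exact semantics here are assumptions: `thread_st_en` is assumed to park the current value in the shadow slot chosen by `select_data`, and `thread_rd_en` is assumed to restore from that slot instead of latching the incoming pipe signals. The actual design is written in SystemVerilog; this Python class is only a model.

```python
class MultistatePipelineRegister:
    """Behavioral model of a pipeline register with per-thread shadow slots.

    Assumed semantics (not confirmed by the slides):
      thread_st_en: save the currently held value into shadow[select_data]
      thread_rd_en: load shadow[select_data] instead of the pipe input
    """

    def __init__(self, num_threads):
        self.active = None                   # value driven to the next stage
        self.shadow = [None] * num_threads   # one stored value per thread

    def clock(self, signals_from_pipe, thread_st_en=False,
              thread_rd_en=False, select_data=0):
        if thread_st_en:
            # Park the outgoing thread's state before it is overwritten.
            self.shadow[select_data] = self.active
        if thread_rd_en:
            # Restore a previously parked thread.
            self.active = self.shadow[select_data]
        else:
            # Normal operation: latch the value from the previous stage.
            self.active = signals_from_pipe
        return self.active                   # signals_to_next_level


mpr = MultistatePipelineRegister(num_threads=2)
mpr.clock("T0-instr")                                     # thread 0 in flight
mpr.clock("T1-instr", thread_st_en=True, select_data=0)   # switch: park thread 0
restored = mpr.clock(None, thread_rd_en=True, select_data=0)  # switch back
print(restored)
```

The point of the model is that a switch back to a parked thread resumes from the stored value rather than refetching the instruction.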

Uncore - Xilinx IPs  Data Memory and Instruction Cache with blk_mem_gen  Block Memory Generator, 2^18 deep  Virtual Input for asynchronous reset with vio  Clock signal scaled to 10 MHz with clk_wiz

Design Deficiencies  Design does not implement the complete Alpha ISA  Emulated execution:  Floating Point  Memory  Static branch prediction

Memory Emulation  Memory access is emulated  4 MB of physical memory with a single cycle latency  Latency determined by hit rate  Hit rate defined per thread  Increase in Thread Num degrades hit rate
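One way to picture hit-rate-driven latency is the sketch below. It is not the project's implementation: the single-cycle hit latency comes from the slide, while the miss latency, the linear degradation per extra thread, and both function names are hypothetical placeholders.

```python
import random


def emulated_latency(hit_rate, rng, hit_latency=1, miss_latency=100):
    """Return the latency of one emulated access, in cycles.

    hit_latency=1 follows the slide; miss_latency=100 is a made-up value.
    """
    return hit_latency if rng.random() < hit_rate else miss_latency


def degraded_hit_rate(base_hit_rate, num_threads, penalty_per_thread=0.02):
    """Hypothetical model: each extra thread lowers the hit rate linearly."""
    return max(0.0, base_hit_rate - penalty_per_thread * (num_threads - 1))


rng = random.Random(0)  # fixed seed so a run is repeatable
for threads in (1, 2, 4, 8):
    rate = degraded_hit_rate(0.95, threads)
    accesses = [emulated_latency(rate, rng) for _ in range(10_000)]
    print(threads, round(rate, 2), sum(accesses) / len(accesses))
```

The printed average latency grows with the thread count because the per-thread hit rate drops, which is the effect the last bullet describes.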

SPEC CPU 2006  Issued by the Standard Performance Evaluation Corporation  Used to evaluate CPUs worldwide  A collection of 29 representative programs  444.namd was used for our tests  Biomolecular systems simulation  ALU bound  Compiled to Alpha with Optimized Space strategy

Testing Methodology  Test Parameters  Number of threads  Pipe depth  Number of execution units  Cache Hit rate  100 million instructions per thread  Or until trace halts
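A sweep over these four parameters can be enumerated as a Cartesian product. The sketch below is an assumption about how such a grid might be built; the parameter names and values are hypothetical and are not taken from the project's scripts.

```python
from itertools import product

# Hypothetical values for the four test parameters listed above.
PARAMS = {
    "num_threads": [1, 2, 4, 8],
    "pipe_depth": [5, 10],
    "num_exe_units": [1, 2],
    "hit_rate": [0.80, 0.95],
}


def experiment_grid(params):
    """Return one dict per combination of parameter values."""
    names = list(params)
    return [dict(zip(names, combo)) for combo in product(*params.values())]


grid = experiment_grid(PARAMS)
print(len(grid), "experiments; first:", grid[0])
```

Each dict would correspond to one synthesized configuration, so the grid size directly determines how many builds and FPGA runs an experiment needs.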

Project Toolchain  run_exp.py  Creates Vivado projects with predefined parameters  synth_design.tcl  synthesis and implementation  program_design.tcl  Controls run through VIO

Output Samples  Performance Counters  clock_cycle_counter X #threads  pipe_unutilized_cycles_counter  long_exe_switch_counter  no_exe_unit_switch_counter  thread_switch_counter X #threads  branch_miss_counter X #threads  dep_stall_counter X #threads  no_thread_in_pipe_cycles_counter  units_busy_counter X #units
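Counters like these are typically reduced to a few derived metrics. The sketch below shows two plausible ones. It assumes the retired instruction count is known separately (for example, the per-thread instruction budget of a test), since no instruction counter appears in the list, and the sample values are invented.

```python
def ipc(instructions_retired, clock_cycles):
    """Instructions per cycle for one run."""
    return instructions_retired / clock_cycles


def pipe_utilization(clock_cycles, pipe_unutilized_cycles):
    """Fraction of cycles in which the pipe did useful work."""
    return 1 - pipe_unutilized_cycles / clock_cycles


# Invented sample values, for illustration only.
sample = {
    "instructions_retired": 100_000_000,   # assumed known from the test setup
    "clock_cycle_counter": 250_000_000,
    "pipe_unutilized_cycles_counter": 50_000_000,
}

print("IPC:", ipc(sample["instructions_retired"], sample["clock_cycle_counter"]))
print("Utilization:", pipe_utilization(sample["clock_cycle_counter"],
                                       sample["pipe_unutilized_cycles_counter"]))
```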

Output Samples – Cont.  Utilization Report

Utilization Overhead per Thread Num

Expected Performance  Performance should increase with thread count until saturation is reached  Saturation is reached when the gap is filled [Figure: performance vs. number of threads, with labels Gap, CFMT Saturation, SoE Saturation]
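This expectation matches a common first-order model of coarse-grained multithreading, sketched below. The model and its parameters (`run_cycles` between long stalls, `stall_latency`, `switch_penalty`) are assumptions for illustration, not measurements or formulas from this project.

```python
def utilization(num_threads, run_cycles, stall_latency, switch_penalty):
    """First-order pipeline utilization for a switch-on-event machine.

    Below saturation, other threads fill the stall gap, so utilization
    grows linearly with thread count. At saturation the gap is filled
    and only the per-switch penalty is lost.
    """
    linear = num_threads * run_cycles / (run_cycles + stall_latency)
    if num_threads == 1:
        return linear                      # a single thread never switches
    saturated = run_cycles / (run_cycles + switch_penalty)
    return min(linear, saturated)


# Hypothetical parameters: 20 useful cycles between 80-cycle stalls.
for n in range(1, 9):
    soe = utilization(n, run_cycles=20, stall_latency=80, switch_penalty=8)
    cfmt = utilization(n, run_cycles=20, stall_latency=80, switch_penalty=2)
    print(n, round(soe, 3), round(cfmt, 3))
```

In this model a smaller switch penalty raises the saturation level, so CFMT would saturate higher than SoE while both follow the same line before saturation.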

Stall Types  Not all stalls can be masked by a thread switch  Branch Misprediction  Stalls shorter than MPR load time (like Dependency Stalls)  Some are not masked if the pipe is flushed  ALU operations  L2 Cache access  SoE performance depends on this ability!

IPC per Thread Num (Multiple Exe Units)

IPC per Thread Num (Single Exe Unit)

IPC per Pipe Length

Failure Points  Single test level  One thread takes over the machine  Per-trace data is identical  Experiment level  SoE and CFMT do not always behave the same for a single thread  An increase in thread count leads to a performance drop, whereas speedup followed by saturation was expected  SoE performs better than CFMT in some cases

Thank You

Backup

Experiment Process [Flow diagram labels: Verilog sources, Xilinx IP, Generic TCL, run_exp.py, Experiment Definitions, Vivado Project, Design Parameters, Constraints, Trace, Trace Repository, Impl_design.tcl, Bitstream, Virtex 7, program_design.tcl, Chipscope, Results]
