Continuous Flow Multithreading on FPGA
Gilad Tsoran & Benny Fellman
Supervised by Dr. Shahar Kvatinsky
BSc. Winter 2014 Final Presentation
March 1st, 2015
Current Trend: Multithreading
- SMT (Simultaneous MT): complex pipeline; number of threads is limited; high performance
- Interleaved MT (switch each cycle): simple pipeline; no switch penalty; poor resource utilization
- Block MT (switch on event): simple pipeline (one thread active); pipe is flushed on each switch; high penalty
CFMT
- Stands for Continuous Flow Multithreading, an SoE (Switch-on-Event) improvement
- Stores instruction state during a thread switch
- Reduces the switch penalty to the pipe restore time
- (Diagram: cache miss → memory unit → end of memory operation → cache miss again)
Memory Intensive
- CFMT consumes a lot of memory to store the pipe state
- Memory is currently considered a bottleneck
- Cheap, fast, on-die memory is required!
Memristors
- Envisioned in 1971 by Leon Chua
- Discovered in 2008 by HP Labs
- Provide abundant non-volatile memory
Goals
- Evaluate CFMT performance on the SPEC CPU 2006 benchmark suite
- Implement on FPGA
  - Memristors modeled by standard registers with an estimated latency
- Construct a test environment
Our Contribution
- Design was ported from a previous project:
  - Partial support of the Alpha ISA
  - Functional simulation
  - G. Satat, N. Wald and S. Kvatinsky, "SoE Multithreading with Memristors," Technion-IIT
- We extended the design by:
  - Further validation
  - Modifying it to comply with synthesis design rules
  - Configuring the FPGA
  - Building a data collection toolchain
Development Environment
- Design written in SystemVerilog
- Simulated on Mentor ModelSim
- Synthesized with Xilinx Vivado
- Data gathered through Xilinx ChipScope
- Programmed on a Virtex-7 FPGA
- SPEC CPU 2006 compiled with Dan Kegel's Crosstool
- Automated with Python and TCL
Core: High-Level Design
- (Block diagram) Pipeline stages: Fetch, Decode, Register Read, Dependency Check, Addr. Calc., Exe. Unit Route (Int / FP / Mem), Write Back
- Control blocks: Execution Unit Controller (EUC), Thread Memory Controller (TMC), Thread Switch Controller (TSC), Thread State Table (TST), Pipe Control, MPR
Multistate Pipeline Register (MPR)
- Contains multiple shadowed values in one register
- Building block for the state-storing mechanism
- Interface signals: thread_rd_en, thread_st_en, select_data, signals_from_pipe, signals_to_next_level
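A behavioral sketch of the MPR idea in Python (the actual design is SystemVerilog; the class name, the `select_thread` argument, and the store/restore semantics are illustrative): each thread gets a shadow slot, `thread_st_en` captures the live pipeline value into the selected slot, and `thread_rd_en` restores it.

```python
class MultistatePipelineRegister:
    """Behavioral model of a Multistate Pipeline Register (MPR):
    one live pipeline register plus a shadow copy per thread."""

    def __init__(self, num_threads):
        self.shadow = [None] * num_threads  # per-thread stored state
        self.value = None                   # live pipeline register

    def clock(self, signals_from_pipe, select_thread,
              thread_st_en=False, thread_rd_en=False):
        if thread_st_en:
            # Switching out: shadow the live value for this thread.
            self.shadow[select_thread] = self.value
        if thread_rd_en:
            # Switching back in: restore the thread's saved state.
            self.value = self.shadow[select_thread]
        else:
            # Normal operation: the pipeline advances.
            self.value = signals_from_pipe
        return self.value  # signals_to_next_level
```

On a thread switch, store and restore can happen in the same cycle: the outgoing thread's in-flight value is shadowed while the incoming thread's value is driven back into the pipe, which is what reduces the switch penalty to the restore time.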
Uncore: Xilinx IPs
- Data memory and instruction cache with blk_mem_gen (Block Memory Generator), 2^18 deep
- Virtual input for asynchronous reset with vio
- Clock signal scaled to 10 MHz with clk_wiz
Design Deficiencies
- Design does not implement the complete Alpha ISA
- Emulated execution: floating point, memory
- Static branch prediction
Memory Emulation
- Memory access is emulated
- 4 MB of physical memory with a single-cycle latency
- Latency determined by hit rate
- Hit rate defined per thread
- Increase in thread count degrades hit rate
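The per-thread hit-rate model implies an average access latency that can be sketched as follows (the 20-cycle default miss penalty is an illustrative assumption, not a figure from the project):

```python
def expected_mem_latency(hit_rate, hit_cycles=1, miss_cycles=20):
    """Average emulated memory latency: hits cost hit_cycles (the
    single-cycle physical memory above), misses cost miss_cycles
    (illustrative). hit_rate is defined per thread."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles
```

Because the hit rate is defined per thread and degrades as threads are added, the average latency seen by each thread grows with thread count.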
SPEC CPU 2006
- Issued by the Standard Performance Evaluation Corporation
- Used to evaluate CPUs worldwide
- A collection of 29 representative programs
- 444.namd was used for our tests
  - Biomolecular systems simulation
  - ALU bound
  - Compiled to Alpha with an optimize-for-space strategy
Testing Methodology
- Test parameters:
  - Number of threads
  - Pipe depth
  - Number of execution units
  - Cache hit rate
- 100 million instructions per thread, or until the trace halts
Project Toolchain
- run_exp.py: creates Vivado projects with predefined parameters
- synth_design.tcl: synthesis and implementation
- program_design.tcl: controls the run through the VIO
Output Samples: Performance Counters
- clock_cycle_counter (x #threads)
- pipe_unutilized_cycles_counter
- long_exe_switch_counter
- no_exe_unit_switch_counter
- thread_switch_counter (x #threads)
- branch_miss_counter (x #threads)
- dep_stall_counter (x #threads)
- no_thread_in_pipe_cycles_counter
- units_busy_counter (x #units)
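The raw counters can be turned into headline metrics with a few divisions; a minimal sketch, assuming the counters are read out into a dict keyed by the names above (treating the cycle and switch counters as totals, though the hardware keeps per-thread copies):

```python
def summarize_counters(c):
    """Derive headline metrics from the raw performance counters.
    c: dict keyed by counter name, as listed on the slide."""
    cycles = c["clock_cycle_counter"]
    # Fraction of cycles in which the pipe did useful work.
    util = 1.0 - c["pipe_unutilized_cycles_counter"] / cycles
    # Thread switches per cycle.
    switch_rate = c["thread_switch_counter"] / cycles
    return {"pipe_utilization": util, "switch_rate": switch_rate}
```
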
Output Samples – Cont. Utilization Report
Utilization Overhead per Thread Num
Expected Performance
- Performance should increase with thread count until saturation is met
- Saturation is met when the gap is filled
- (Chart: performance vs. number of threads; the CFMT saturation point sits above the SoE saturation point, separated by a gap)
Stall Types
- Not all stalls can be masked by a thread switch:
  - Branch misprediction
  - Stalls shorter than the MPR load time (like dependency stalls)
- Some are not masked if the pipe is flushed:
  - ALU operations
  - L2 cache access
- SoE performance depends on this ability!
IPC per Thread Num (Multiple Exe Units)
IPC per Thread Num (Single Exe Unit)
IPC per Pipe Length
Failure Points
- Single-test level:
  - One thread takes over the machine
  - Per-trace data is identical
- Experiment level:
  - SoE and CFMT do not always behave the same for a single thread
  - Increase in thread number leads to a performance drop, while increased speedup and saturation are expected
  - SoE performs better than CFMT in some cases
Thank You
Backup
Experiment Process (diagram)
- run_exp.py takes experiment definitions and design parameters
- Verilog sources, Xilinx IPs, and generic TCL feed a Vivado project, together with constraints
- Impl_design.tcl produces the bitstream
- Traces come from the trace repository
- program_design.tcl programs the Virtex-7; results are collected through ChipScope