Published by Loren Tate. Modified over 8 years ago.
1
Continuous Flow Multithreading on FPGA
Gilad Tsoran & Benny Fellman
Supervised by Dr. Shahar Kvatinsky
BSc, Winter 2014 Final Presentation
March 1st, 2015
2
Current Trend: Multithreading
SMT (Simultaneous MT): complex pipeline; number of threads is limited; high performance.
Interleaved MT (switch each cycle): simple pipeline; no switch penalty; poor resource utilization.
Block MT (switch on event): simple pipeline with 1 thread active; pipe is flushed on each switch; high penalty.
3
CFMT: Continuous Flow Multithreading, an SoE Improvement
Stores instruction state during a thread switch.
Reduces the switch penalty to the pipe restore time.
[Figure: animated pipeline and memory unit diagram, with the events "Cache Miss", "End of Memory Operation", and "Cache Miss again"]
4
Memory Intensive
CFMT consumes a lot of memory to store the pipe state.
Memory is currently considered a bottleneck.
Cheap, fast, on-die memory is required!
5
Memristors
Envisioned in 1971 by Leon Chua.
Discovered in 2008 by HP Labs.
Provide abundant non-volatile memory.
6
Goals
Evaluate CFMT performance on the SPEC CPU 2006 benchmark suite.
Implement on FPGA, with memristors modeled by standard registers with an estimated latency.
Construct a test environment.
7
Our Contribution
The design was ported from a previous project, which provided partial support of the Alpha ISA and functional simulation: G. Satat, N. Wald and S. Kvatinsky, "SOE Multithreading with Memristors," Technion-IIT, 2012.
We extended the design by: further validation; modifying it to comply with synthesis design rules; configuring the FPGA; building a collection toolchain.
8
Development Environment
Design written in SystemVerilog.
Simulated on Mentor ModelSim.
Synthesized with Xilinx Vivado.
Data gathered through Xilinx ChipScope.
Programmed on a Virtex-7 FPGA.
SPEC2006 compiled with Dan Kegel's Crosstool.
Automated with Python and TCL.
9
Core: High Level Design
[Figure: block diagram. Labels: Pipeline; Fetch; Decode; Dependency check; Register Read; Addr calc.; Exe. unit route; int; FP; Mem; FP; Write Back; MPR; Pipe control; Execution Unit Controller (EUC); Thread Memory Controller (TMC); Thread Switch Controller (TSC); Thread State Table (TST)]
10
Multistate Pipeline Register
Contains multiple shadowed values in one register.
Building block for the state storing mechanism.
Signals: thread_rd_en, thread_st_en, select_data, signals_from_pipe, signals_to_next_level.
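The idea of a register holding several shadowed values can be illustrated with a small behavioral model. This is only a sketch under assumptions: the slide names the signals but not their exact semantics, so the store/restore behavior below, the one-slot-per-thread layout, and the class and method names are hypothetical and are not taken from the project's SystemVerilog.

```python
class MultistatePipelineRegister:
    """Behavioral sketch of a pipeline register with one shadow slot per thread.

    Assumed semantics (not confirmed by the slides):
      thread_st_en - save the current value into the slot picked by select_data
      thread_rd_en - restore the value from that slot instead of latching input
    """

    def __init__(self, num_threads, reset_value=0):
        self.active = reset_value                  # value driven to the next stage
        self.shadow = [reset_value] * num_threads  # per-thread stored state

    def clock(self, signals_from_pipe, thread_st_en=False,
              thread_rd_en=False, select_data=0):
        """Advance one clock edge and return signals_to_next_level."""
        if thread_st_en:
            # Keep the outgoing thread's in-flight value for a later restore.
            self.shadow[select_data] = self.active
        if thread_rd_en:
            # Resume a thread from its stored state.
            self.active = self.shadow[select_data]
        else:
            # Normal operation: behave like an ordinary pipeline register.
            self.active = signals_from_pipe
        return self.active
```

In this model a thread switch costs one store on the way out and one restore on the way back in, which is the sense in which the switch penalty is bounded by the restore time rather than by a pipe flush.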
11
Uncore: Xilinx IPs
Data Memory and Instruction Cache with blk_mem_gen (Block Memory Generator), 2^18 deep.
Virtual Input for asynchronous reset with vio.
Clk signal scaling to 10 MHz with clk_wiz.
12
Design Deficiencies
The design does not implement the complete Alpha ISA.
Emulated execution: floating point; memory.
Static branch prediction.
13
Memory Emulation
Memory access is emulated.
4 MB of physical memory with a single-cycle latency.
Latency is determined by the hit rate.
The hit rate is defined per thread.
An increase in the number of threads degrades the hit rate.
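A hit-rate-driven latency can be sketched as follows. This is an illustration under assumptions: the slides do not give the miss penalty or the way the hit rate is applied, so the 100-cycle miss latency, the random draw per access, and the function names are hypothetical.

```python
import random


def access_latency(hit_rate, rng, hit_cycles=1, miss_cycles=100):
    """Return the latency of one emulated access.

    hit_cycles=1 follows the single-cycle latency on the slide;
    miss_cycles=100 is a placeholder value, not a project figure.
    """
    return hit_cycles if rng.random() < hit_rate else miss_cycles


def mean_latency(hit_rate, hit_cycles=1, miss_cycles=100):
    """Expected latency per access for a given per-thread hit rate."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


# Each thread carries its own hit rate, as on the slide.
rng = random.Random(0)
per_thread_hit_rate = [0.95, 0.90]  # hypothetical values
latencies = [access_latency(h, rng) for h in per_thread_hit_rate]
```

Lowering the per-thread hit rate as threads are added raises the mean latency, which is how the degradation described above would show up in such a model.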
14
SPEC CPU 2006
Issued by the Standard Performance Evaluation Corporation.
Used to evaluate CPUs worldwide.
A collection of 29 representative programs.
444.namd was used for our tests: a biomolecular systems simulation, ALU bound.
Compiled to Alpha with the Optimized Space strategy.
15
Testing Methodology
Test parameters: number of threads; pipe depth; number of execution units; cache hit rate.
Each test runs 100 million instructions per thread, or until the trace halts.
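A sweep over these four parameters can be enumerated as a Cartesian product. The sketch below is an assumption about how such a sweep might be organized; it is not the project's run_exp.py, and the function name, dictionary keys, and example values are hypothetical.

```python
from itertools import product


def experiment_grid(threads, pipe_depths, exe_units, hit_rates,
                    max_instructions=100_000_000):
    """Yield one configuration dict per combination of test parameters."""
    for t, d, u, h in product(threads, pipe_depths, exe_units, hit_rates):
        yield {
            "threads": t,
            "pipe_depth": d,
            "exe_units": u,
            "hit_rate": h,
            # Stop condition from the slide: 100 million instructions per
            # thread, unless the trace halts first.
            "max_instructions_per_thread": max_instructions,
        }


# Hypothetical sweep values, chosen only to show the shape of the output.
configs = list(experiment_grid([1, 2, 4], [5, 10], [1, 2], [0.9]))
```

Each resulting dictionary would correspond to one synthesized design and one run on the FPGA.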
16
Project Toolchain
run_exp.py: creates Vivado projects with predefined parameters.
synth_design.tcl: synthesis and implementation.
program_design.tcl: controls the run through VIO.
17
Output Samples: Performance Counters
clock_cycle_counter (x #threads)
pipe_unutilized_cycles_counter
long_exe_switch_counter
no_exe_unit_switch_counter
thread_switch_counter (x #threads)
branch_miss_counter (x #threads)
dep_stall_counter (x #threads)
no_thread_in_pipe_cycles_counter
units_busy_counter (x #units)
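Counters like these are typically reduced to a few derived metrics. The sketch below shows two such reductions; the formulas are standard definitions, but how the project actually post-processed its counters is not stated, so the function names and the choice of inputs are assumptions.

```python
def ipc(instructions_retired, clock_cycles):
    """Instructions per cycle; the instruction count comes from the test length."""
    if clock_cycles <= 0:
        raise ValueError("clock_cycles must be positive")
    return instructions_retired / clock_cycles


def pipe_utilization(clock_cycles, pipe_unutilized_cycles):
    """Fraction of cycles in which the pipe did useful work."""
    if clock_cycles <= 0:
        raise ValueError("clock_cycles must be positive")
    return 1.0 - pipe_unutilized_cycles / clock_cycles
```

For example, 100 instructions retired in 200 cycles give an IPC of 0.5, and 50 unutilized cycles out of 200 give a utilization of 0.75.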
18
Output Samples – Cont. Utilization Report
19
Utilization Overhead per Thread Num
20
Expected Performance
Performance should increase with thread count until saturation is reached.
Saturation is reached when the gap is filled.
[Figure: performance vs. number of threads, with labels for the gap, the CFMT saturation level, and the SoE saturation level]
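The expected shape can be reproduced with a common first-order model of switch-on-event multithreading. This is a textbook-style approximation and not a model taken from the presentation: each thread is assumed to run for `run_cycles` between stalls, each stall lasts `stall_cycles`, and every switch costs `switch_cycles`. All parameter names and numbers are hypothetical.

```python
def utilization(n_threads, run_cycles, stall_cycles, switch_cycles):
    """First-order pipeline utilization for switch-on-event multithreading.

    Below saturation, utilization grows linearly with the thread count as
    other threads fill the stall gap. At saturation the gap is fully hidden
    and only the switch penalty is lost.
    """
    linear = n_threads * run_cycles / (run_cycles + stall_cycles)
    saturated = run_cycles / (run_cycles + switch_cycles)
    return min(linear, saturated)


# Hypothetical numbers: a flush-based switch (10 cycles) versus a cheaper
# restore-based switch (2 cycles), with a 30-cycle stall every 10 cycles.
soe_like = [utilization(n, 10, 30, 10) for n in range(1, 9)]
cfmt_like = [utilization(n, 10, 30, 2) for n in range(1, 9)]
```

With these numbers the cheaper switch saturates at a higher utilization, which matches the two saturation levels the slide expects for SoE and CFMT.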
21
Stall Types
Not all stalls can be masked by a thread switch: branch misprediction; stalls shorter than the MPR load time (like dependency stalls).
Some are not masked if the pipe is flushed: ALU operations; L2 cache access.
SoE performance depends on this ability!
22
IPC per Thread Num (Multiple Exe Units)
23
IPC per Thread Num (Single Exe Unit)
24
IPC per Pipe Length
25
Failure Points
Single test level: one thread takes over the machine; per-trace data is identical.
Experiment level: SoE and CFMT do not always behave the same for a single thread; an increase in thread number leads to a performance drop, while increased speedup followed by saturation is expected; SoE performs better than CFMT in some cases.
26
Thank You
27
Backup
28
Experiment Process
[Figure: experiment flow diagram. Labels: Verilog sources; Xilinx IP; Generic TCL; run_exp.py; Experiment Definitions; Vivado Project; Design Parameters; Constraints; Trace; Trace Repository; Impl_design.tcl; Bitstream; Virtex 7; program_design.tcl; Chipscope; Results]