Continuous Flow Multithreading on FPGA. Gilad Tsoran & Benny Fellman. Supervised by Dr. Shahar Kvatinsky. BSc, Winter 2014. Final Presentation, March 1st, 2015

Presentation transcript:

Continuous Flow Multithreading on FPGA
Gilad Tsoran & Benny Fellman
Supervised by Dr. Shahar Kvatinsky
BSc, Winter 2014
Final Presentation, March 1st, 2015

Current Trend - Multithreading
- SMT (Simultaneous MT)
  - Complex pipeline
  - Number of threads is limited
  - High performance
- Interleaved MT: switch each cycle
  - Simple pipeline
  - No switch penalty
  - Poor resource utilization
- Block MT: switch on event (SoE)
  - Simple pipeline, 1 thread active
  - Pipe is flushed on each switch
  - High penalty

CFMT - an SoE Improvement
- Stands for Continuous Flow Multithreading
- Stores instruction state during a thread switch
- Reduces the switch penalty to the pipe restore time
[Figure labels: Cache Miss; Memory unit; End of Memory Operation; Cache Miss again]

Memory Intensive
- CFMT consumes a lot of memory to store the pipe state
- Memory is currently considered a bottleneck
- Cheap, fast, on-die memory is required!

Memristors
- Envisioned in 1971 by Leon Chua
- Discovered in 2008 by HP Labs
- Provide abundant non-volatile memory

Goals
- Evaluate CFMT performance on the SPEC CPU 2006 benchmark suite
- Implement on FPGA
  - Memristors modeled by standard registers with an estimated latency
- Construct a test environment

Our Contribution
- The design was ported from a previous project:
  - Partial support of the Alpha ISA
  - Functional simulation
  - G. Satat, N. Wald and S. Kvatinsky, "SOE Multithreading with Memristors," Technion - IIT,
- We extended the design by:
  - Further validating it
  - Modifying it to comply with synthesis design rules
  - Configuring the FPGA
  - Building a data collection toolchain

Development Environment
- Design written in SystemVerilog
- Simulated on Mentor ModelSim
- Synthesized with Xilinx Vivado
- Data gathered through Xilinx ChipScope
- Programmed on a Virtex-7 FPGA
- SPEC2006 compiled with Dan Kegel's Crosstool
- Automated with Python and TCL

Core - High Level Design
[Block diagram labels: Pipeline; Fetch; Decode; Dependency check; Register Read; Addr calc.; Exe. unit route; int; FP; Mem; Write Back; MPR; Pipe control; Execution Unit Controller (EUC); Thread Memory Controller (TMC); Thread Switch Controller (TSC); Thread State Table (TST)]

Multistate Pipeline Register
- Contains multiple shadowed values in one register
- Building block for the state storing mechanism
[Diagram signals: thread_rd_en, thread_st_en, select_data, signals_from_pipe, signals_to_next_level]
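The shadowing idea can be illustrated with a small behavioural model. This is only a sketch in Python, not the project's SystemVerilog: the class and method names are hypothetical, and mapping thread_st_en to a store operation and thread_rd_en to a restore operation is an assumption based on the signal names on the slide.

```python
class MultistatePipelineRegister:
    """Behavioural sketch: one active value plus one shadow slot per thread."""

    def __init__(self, num_threads, reset_value=0):
        self.active = reset_value                  # value seen by the next stage
        self.shadow = [reset_value] * num_threads  # per-thread saved state

    def clock(self, signals_from_pipe):
        """Normal operation: latch the incoming pipeline signals."""
        self.active = signals_from_pipe

    def store(self, thread_id):
        """Thread switched out (assumed thread_st_en): save the active value."""
        self.shadow[thread_id] = self.active

    def restore(self, thread_id):
        """Thread switched in (assumed thread_rd_en): reload the saved value."""
        self.active = self.shadow[thread_id]

    @property
    def signals_to_next_level(self):
        return self.active


mpr = MultistatePipelineRegister(num_threads=2)
mpr.clock(0xA)   # thread 0 has an instruction in this stage
mpr.store(0)     # thread 0 stalls, its state is shadowed instead of flushed
mpr.clock(0xB)   # thread 1 uses the stage meanwhile
mpr.restore(0)   # thread 0 resumes where it stopped
```

After the restore, the stage holds thread 0's original value again, which is what lets the switch penalty shrink to the restore time rather than a full pipeline refill.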

Uncore - Xilinx IPs
- Data Memory and Instruction Cache with blk_mem_gen
  - Block Memory Generator, 2^18 deep
- Virtual input for asynchronous reset with vio
- Clock signal scaling to 10 MHz with clk_wiz

Design Deficiencies
- Design does not implement the complete Alpha ISA
- Emulated execution:
  - Floating point
  - Memory
- Static branch prediction

Memory Emulation
- Memory access is emulated
- 4 MB of physical memory with a single cycle latency
- Latency determined by hit rate
- Hit rate defined per thread
- Increase in thread number degrades hit rate
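One way to picture latency being "determined by hit rate" is the sketch below. It is an assumption-laden illustration, not the project's logic: the function names, the miss_penalty parameter, and the use of a pseudo-random draw per access are all hypothetical. Only the single-cycle hit latency and the per-thread hit rate come from the slide.

```python
import random

HIT_LATENCY = 1  # cycles; the slide states single-cycle physical memory


def access_latency(hit_rate, miss_penalty, rng):
    """Emulated latency in cycles for one access (miss_penalty is assumed)."""
    if rng.random() < hit_rate:
        return HIT_LATENCY
    return HIT_LATENCY + miss_penalty


def average_latency(hit_rate, miss_penalty):
    """Expected latency per access for a given hit rate."""
    return HIT_LATENCY + (1.0 - hit_rate) * miss_penalty


# Hit rate is defined per thread; these values are made up for illustration.
per_thread_hit_rate = [0.95, 0.90]
rng = random.Random(0)
sample = [access_latency(h, miss_penalty=20, rng=rng) for h in per_thread_hit_rate]
```

Lowering a thread's hit rate raises its average latency linearly in this model, which is how a degraded hit rate at higher thread counts would translate into more stall cycles.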

SPEC CPU 2006
- Issued by the Standard Performance Evaluation Corporation
- Used to evaluate CPUs worldwide
- A collection of 29 representative programs
- 444.namd was used for our tests
  - Biomolecular systems simulation
  - ALU bound
  - Compiled to Alpha with the Optimized Space strategy

Testing Methodology
- Test parameters
  - Number of threads
  - Pipe depth
  - Number of execution units
  - Cache hit rate
- 100 million instructions per thread
  - Or until the trace halts

Project Toolchain
- run_exp.py
  - Creates Vivado projects with predefined parameters
- synth_design.tcl
  - Synthesis and implementation
- program_design.tcl
  - Controls the run through VIO
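The parameter sweep behind such a toolchain might be enumerated as in the sketch below. The contents of run_exp.py are not shown in the slides, so everything here is hypothetical: the function names, the dictionary keys, the project naming scheme, and the example values. The four swept parameters are the ones listed under Testing Methodology. The sketch deliberately stops before invoking Vivado.

```python
from itertools import product


def experiment_grid(threads, pipe_depths, exe_units, hit_rates):
    """Yield one experiment definition per parameter combination."""
    for t, d, u, h in product(threads, pipe_depths, exe_units, hit_rates):
        yield {"threads": t, "pipe_depth": d, "exe_units": u, "hit_rate": h}


def project_name(exp):
    """Build a unique, readable project name (naming scheme is an assumption)."""
    return "cfmt_t{}_d{}_u{}_h{}".format(
        exp["threads"], exp["pipe_depth"], exp["exe_units"],
        round(exp["hit_rate"] * 100),
    )


# Illustrative values only; the real sweep ranges are not given in the slides.
grid = list(experiment_grid([1, 2], [5], [1, 2], [0.9]))
names = [project_name(e) for e in grid]
```

Keeping each experiment as a plain dictionary makes it easy to pass the same definition to the synthesis and programming steps and to label the collected results.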

Output Samples
- Performance counters
  - clock_cycle_counter x #threads
  - pipe_unutilized_cycles_counter
  - long_exe_switch_counter
  - no_exe_unit_switch_counter
  - thread_switch_counter x #threads
  - branch_miss_counter x #threads
  - dep_stall_counter x #threads
  - no_thread_in_pipe_cycles_counter
  - units_busy_counter x #units
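Counters like these are typically reduced to a few summary metrics. The sketch below shows one possible reduction under stated assumptions: the slides do not list an instruction counter, so the per-thread instruction counts are assumed to come from the trace length, and total_cycles is assumed to be the length of the run in clock cycles. The function names are hypothetical.

```python
def ipc(instructions_per_thread, total_cycles):
    """Instructions per cycle over the whole run, summed across threads."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return sum(instructions_per_thread) / total_cycles


def pipe_utilization(total_cycles, pipe_unutilized_cycles):
    """Fraction of cycles in which the pipe was not idle."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return 1.0 - pipe_unutilized_cycles / total_cycles


# Made-up numbers, only to show the call shape.
example_ipc = ipc([100, 100], total_cycles=400)
example_util = pipe_utilization(total_cycles=400, pipe_unutilized_cycles=100)
```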

Output Samples - Cont.
- Utilization report

Utilization Overhead per Thread Num

Expected Performance
- Performance should increase with thread count until saturation is met
- Saturation is met when the gap is filled
[Figure: performance vs. number of threads; labels: Gap, CFMT Saturation, SoE Saturation]
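The expected shape can be reproduced with a first-order model that is often used for switch-on-event multithreading. This is not taken from the slides and is not the authors' evaluation method. It assumes each thread runs for run_length cycles between long-latency events, each event takes miss_latency cycles, and each switch costs switch_cost cycles. All names and numbers below are illustrative.

```python
def utilization(num_threads, run_length, miss_latency, switch_cost):
    """Pipeline utilization under a simple switch-on-event model.

    Below saturation, utilization grows linearly with the thread count.
    At saturation, other threads fully hide the latency and only the
    switch cost is lost.
    """
    linear = num_threads * run_length / (run_length + switch_cost + miss_latency)
    saturated = run_length / (run_length + switch_cost)
    return min(linear, saturated)


# Assumed costs: a flush and refill for SoE, a short MPR restore for CFMT.
soe = [utilization(n, run_length=20, miss_latency=100, switch_cost=10) for n in range(1, 9)]
cfmt = [utilization(n, run_length=20, miss_latency=100, switch_cost=1) for n in range(1, 9)]
```

In this model both curves rise with thread count and then flatten. The curve with the smaller switch cost flattens at a higher level, which corresponds to the separate CFMT and SoE saturation lines in the figure.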

Stall Types
- Not all stalls can be masked by a thread switch
  - Branch misprediction
  - Stalls shorter than the MPR load time (like dependency stalls)
- Some are not masked if the pipe is flushed
  - ALU operations
  - L2 cache access
SoE performance depends on this ability!

IPC per Thread Num (Multiple Exe Units)

IPC per Thread Num (Single Exe Unit)

IPC per Pipe Length

Failure Points
- Single test level
  - One thread takes over the machine
  - Per-trace data is identical
- Experiment level
  - SoE and CFMT do not always behave the same for a single thread
  - An increase in thread number leads to a performance drop, while increased speedup and saturation are expected
  - SoE performs better than CFMT in some cases

Thank You

Backup

Experiment Process
[Flow diagram labels: Verilog sources; Xilinx IP; Generic TCL; Experiment Definitions; run_exp.py; Vivado Project (Design Parameters, Constraints, Trace); Trace Repository; Impl_design.tcl; Bitstream; Virtex 7; program_design.tcl; Chipscope; Results]