Detailed look at the TigerSHARC pipeline Cycle counting for the IALU version of the DC_Removal algorithm.

Presentation transcript:

Detailed look at the TigerSHARC pipeline Cycle counting for the IALU version of the DC_Removal algorithm

DC_Removal algorithm performance 2 / 28 To be tackled today
Expected and actual cycle count for the J-IALU version of the DC_Removal algorithm
Understanding why the stalls occur and how to fix them
Differences between the first time into a function (cache empty) and the second time into the function

DC_Removal algorithm performance 3 / 28 Set up time – in principle, 1 cycle / instruction

DC_Removal algorithm performance 4 / 28
First key element – Sum Loop – Order(N): 4 instructions, then N * 5 instructions
Second key element – Shift Loop – Order(log2 N): * log2 N instructions
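The two loops can be sketched in C++ under the assumption, consistent with these slides, that the DC estimate is the mean of an N-sample FIFO: a sum loop of Order(N), then a shift loop of Order(log2 N) that divides by N (a power of two) one arithmetic shift right at a time. The names below are illustrative, not the course's actual code.

```cpp
const int N = 128;  // FIFO length; assumed to be a power of two

// Sum loop -- Order(N): accumulate the N stored samples
int sum_fifo(const int fifo[N]) {
    int sum = 0;
    for (int i = 0; i < N; ++i)
        sum += fifo[i];
    return sum;
}

// Shift loop -- Order(log2 N): divide the sum by N using one
// arithmetic shift right (the ASHIFTR of later slides) per iteration
int mean_by_shift(int sum) {
    for (int n = N; n > 1; n >>= 1)
        sum >>= 1;
    return sum;
}
```

Removing the DC component from a sample is then `sample - mean_by_shift(sum_fifo(fifo))`; the ASHIFTR stalls discussed later occur inside the second loop.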

DC_Removal algorithm performance 5 / 28 Third key element – FIFO circular buffer – Order(N): * N instructions (per-line counts were in the slide figure)
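One plausible C++ reading of the Order(N) FIFO update: every stored sample is moved down one slot and the newest sample inserted at the head. This is a sketch with illustrative names, not the TigerSHARC assembly whose cycles the slides count; a true circular buffer, as covered in the circular-buffer lectures, would instead wrap an index and make the update Order(1).

```cpp
const int FIFO_LEN = 128;  // assumed FIFO length

// FIFO update -- Order(N): shift each sample down one slot,
// then store the newest sample at the head of the buffer
void fifo_update(int fifo[FIFO_LEN], int newest) {
    for (int i = FIFO_LEN - 1; i > 0; --i)
        fifo[i] = fifo[i - 1];
    fifo[0] = newest;
}
```

The move-by-copy form keeps the oldest sample at the highest index, which is why this element stays Order(N) in the theoretical cycle count.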

DC_Removal algorithm performance 6 / 28 TigerSHARC pipeline

DC_Removal algorithm performance 7 / 28 Using the “Pipeline Viewer”
Available with the TigerSHARC simulator ONLY
View | Debug Windows | Pipeline Viewer
F1 to F4 – instruction fetch unit pipeline
PD, D, I – integer ALU pipeline
A, EX1, EX2 – Compute Block pipeline

DC_Removal algorithm performance 8 / 28 Pipeline symbols (Control-click for details)
A – Abort
B – Bubble
H – BTB Hit (jumps)
S – Stall
W – Wait
X – Illegal fetch (F1 – F4)
X – Illegal instruction (PD – E2)

DC_Removal algorithm performance 9 / 28 Time in theory
Set up pointers to buffers
Insert values into buffers
SUM LOOP – count grows as N
SHIFT LOOP – count grows as log2 N
Update outgoing parameters
Update FIFO – count grows as N
Function return
N = 128 – instructions = cycles + delay cycles (per-item counts were in the slide table)
C++ debug mode – 9500 cycles???????
Note: other tests executed before this test – means “caches filled”

DC_Removal algorithm performance 10 / 28 Test environment
Examine the pipeline the 2nd time around the loop – “caches are filled”?

DC_Removal algorithm performance 11 / 28 Set up time
Expected – instructions
Actual – instructions + 2 stalls
Why not 4 stalls?

DC_Removal algorithm performance 12 / 28 First time round the sum loop
Expected – 9 instructions
LC0 load – 3 stalls
Each memory fetch – 4 stalls
Actual – stalls

DC_Removal algorithm performance 13 / 28 Other times around the loop
Expected – 5 instructions
Each memory fetch – 4 stalls
Actual – stalls

DC_Removal algorithm performance 14 / 28 Shift Loop – 1st time around
Expected – 3 instructions
No stalls on LC0 load?
4 stalls on ASHIFTR
BTB hit followed by 5 aborts

DC_Removal algorithm performance 15 / 28 Shift loop – 2nd and later times around
Expect 2 – get 2
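Putting the per-iteration numbers from the last few slides together (sum loop: 9 instructions the first time, 5 thereafter; shift loop: 3 the first time, 2 thereafter), the stall-free instruction counts for the two loops can be sketched as below. The setup, parameter-update, FIFO and return counts are not reproduced in this transcript, so this covers only the two loops.

```cpp
// Stall-free instruction counts taken from the slides:
// sum loop -- 9 instructions on the first iteration, 5 after that
int sum_loop_instructions(int n) {
    return 9 + (n - 1) * 5;      // Order(N)
}

// shift loop -- 3 instructions on the first iteration, 2 after that
int shift_loop_instructions(int log2n) {
    return 3 + (log2n - 1) * 2;  // Order(log2 N)
}
```

For N = 128 (log2 N = 7) this gives 644 + 15 = 659 instructions before any stalls; the LC0-load and memory-fetch stalls then add on top of this.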

DC_Removal algorithm performance 16 / 28 Store back of &left, &right
Expect 6
Actual – stalls

DC_Removal algorithm performance 17 / 28 Exercise 1 Based on knowledge to this point – determine the expected stalls during the last piece of code – the FIFO buffer operation

DC_Removal algorithm performance 18 / 28 Third key element – FIFO circular buffer – Order(N): * N instructions (per-line counts were in the slide figure)

DC_Removal algorithm performance 19 / 28 Answer

DC_Removal algorithm performance 20 / 28

DC_Removal algorithm performance 21 / 28

DC_Removal algorithm performance 22 / 28

DC_Removal algorithm performance 23 / 28 Second time into function

DC_Removal algorithm performance 24 / 28 What happens if the cache is not full – the first time the function is called?
Was – stalls in the loop
Now – stalls in the loop

DC_Removal algorithm performance 25 / 28 First time the function is called – 2nd time around the loop
Ditto the 3rd, 4th, 5th, 6th, 7th, 8th times

DC_Removal algorithm performance 26 / 28 9th time around the loop – ditto the 17th, 25th, 33rd, 41st, 49th

DC_Removal algorithm performance 27 / 28 What is happening?
With the cache filled – memory read accesses require 4 cycles
Unfilled – the first read requires “12 cycles”, then the next 7 require 4 cycles
Total guess – the extra time is associated with doing extra reads to fill the cache?
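The slide's cold-versus-warm numbers can be turned into a toy cost model. The 8-reads-per-line grouping is an assumption inferred from the 9th/17th/25th-iteration pattern: with a cold cache the first read of each line costs the quoted ~12 cycles and the next seven cost 4; with a warm cache every read costs 4.

```cpp
// Per-read cost model: 4 cycles when the cache is warm; when cold,
// ~12 cycles on every 8th read (a line fill), 4 cycles otherwise.
// Cycle numbers come from the slide; the grouping is an assumption.
int read_cost(int i, bool cache_warm) {
    if (cache_warm) return 4;
    return (i % 8 == 0) ? 12 : 4;
}

// Sum the model cost over n consecutive reads
int total_read_cycles(int n, bool cache_warm) {
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += read_cost(i, cache_warm);
    return total;
}
```

For 128 reads the model gives 512 cycles warm versus 640 cold – the 128 extra cycles being the cache-fill overhead the slide guesses at.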

DC_Removal algorithm performance 28 / 28 Tackled today
Expected and actual cycle count for the J-IALU version of the DC_Removal algorithm
Understanding why the stalls occur and how to fix them
Differences between the first time into a function (cache empty) and the second time into the function
Further unknowns – how memory operations really work