EENG449b/Savvides Lec 13.1 2/24/05 February 24, 2005 Prof. Andreas Savvides Spring 2005 EENG 449bG/CPSC 439bG.

EENG 449bG/CPSC 439bG Computer Systems
Lecture 13: ARM Performance Issues and Programming

ARM Thumb Benchmark Performance
Dhrystone benchmark result
Memory System Performance

ARM vs. THUMB
“THUMB code will provide up to 65% of the code size and 160% of [the performance of] an equivalent ARM connected to a 16-bit memory system.”
Advantage of ARM over THUMB
  – Able to manipulate 32-bit integers in a single instruction
THUMB’s advantage over 32-bit architectures with 16-bit instructions
  – It can switch back and forth between 16-bit and 32-bit instructions
  – Fast interrupts and DSP algorithms can be implemented in 32-bit code, and the processor can switch back and forth

Not the case when you have loads and stores!

Optimizing Code Execution in Hardware
ARM7 uses a three-stage pipeline
Each instruction takes 3 cycles to execute, but the pipeline has a CPI of 1
Two possible ways to increase performance
  – Increase CPU frequency
  – Reduce CPI (increase pipeline stages & optimizations)
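The difference between a 3-cycle instruction latency and a CPI of 1 can be checked with a small model. This is only a sketch: it assumes an ideal pipeline with no stalls, and the function names are made up for illustration rather than taken from the lecture.

```python
def pipeline_cycles(num_instructions: int, stages: int = 3) -> int:
    """Total cycles for an ideal (stall-free) pipeline.

    The first instruction needs `stages` cycles to pass through the pipe;
    every later instruction completes one cycle after its predecessor.
    """
    if num_instructions <= 0:
        return 0
    return stages + (num_instructions - 1)


def effective_cpi(num_instructions: int, stages: int = 3) -> float:
    """Average cycles per instruction over the whole run."""
    return pipeline_cycles(num_instructions, stages) / num_instructions


def execution_time_us(num_instructions: int, cpi: float, freq_mhz: float) -> float:
    """Execution time in microseconds = instruction count x CPI / clock (MHz)."""
    return num_instructions * cpi / freq_mhz
```

For a single instruction the model gives 3 cycles; over a long run the average approaches 1 cycle per instruction, which is why raising the clock frequency and lowering CPI are the two levers named on the slide.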

Where are the other bottlenecks?
Exceptions
Stalls due to Memory Accesses
Inefficient Software

Exceptions & Memory Performance
Refer to handout from last class for exceptions discussion
  – Lec 11 of handout for exceptions
  – Lec 10 of handout for memory

Exception Priorities & Latencies
Exception priorities
  1. Reset
  2. Data Abort
  3. FIQ
  4. IRQ
  5. Prefetch Abort
  6. Software Interrupt
Interrupt latencies
  – FIQ worst case: time to pass through synchronizer + time for longest instruction to complete + time for data abort entry + time for FIQ entry = 1.4 µs on a 20 MHz processor
  – IRQ worst case: same as FIQ, but if an FIQ occurs right before, then you need 2x the FIQ latency
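The worst-case figures above are easier to compare across clock rates when expressed in cycles. The sketch below only converts between the two units; the helper names are hypothetical, and the 2x rule for IRQ is taken directly from the slide rather than derived.

```python
def us_to_cycles(latency_us: float, freq_mhz: float) -> int:
    """Convert a latency in microseconds to whole clock cycles."""
    return round(latency_us * freq_mhz)


def cycles_to_us(cycles: int, freq_mhz: float) -> float:
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / freq_mhz


FIQ_WORST_US = 1.4   # worst-case FIQ latency quoted on the slide
FREQ_MHZ = 20.0      # clock rate quoted on the slide

fiq_worst_cycles = us_to_cycles(FIQ_WORST_US, FREQ_MHZ)

# Slide's rule of thumb: an IRQ arriving just after an FIQ waits
# roughly twice the FIQ latency.
irq_worst_cycles = 2 * fiq_worst_cycles
irq_worst_us = cycles_to_us(irq_worst_cycles, FREQ_MHZ)
```

Under these numbers the FIQ worst case corresponds to 28 cycles, so the same core clocked at 40 MHz would see about 0.7 µs, assuming the cycle count itself does not change.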

Interrupts
FIQ – does not require saving context
  – ARM has sufficient banked register state to bypass this
  – When leaving the interrupt handler, program should execute
    » SUBS PC,R14_fiq,#4
IRQ – lower priority, masked out when FIQ is entered
  – When leaving the handler, program should execute
    » SUBS PC,R14_irq,#4
Software Interrupt: SWI
  – Returning handler should execute
    » MOVS PC,R14_svc

Exceptions

Instruction Latencies

Bus Cycle Types
Nonsequential
  – Requests a transfer to or from an address which is unrelated to the address used in the preceding cycle
Sequential
  – Requests a transfer to or from an address which is either the same as, or one word or one halfword greater than, the address used in the preceding cycle
Internal
  – Does not require a transfer because the processor is performing an internal function
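The sequential/nonsequential distinction can be written as a small classifier over a trace of byte addresses. This is a rough sketch, not the processor's actual bus logic: it assumes 4-byte words and 2-byte halfwords, treats the first access in a trace as nonsequential, and leaves internal cycles out because they involve no transfer. All names are illustrative.

```python
WORD_BYTES = 4
HALFWORD_BYTES = 2


def classify_bus_cycle(prev_addr, addr):
    """Return 'S' for a sequential transfer, 'N' for a nonsequential one.

    prev_addr is the address used in the preceding cycle, or None if
    there was none (assumed nonsequential in this sketch).
    """
    if prev_addr is None:
        return "N"
    # Sequential: same address, or one halfword / one word greater.
    if addr - prev_addr in (0, HALFWORD_BYTES, WORD_BYTES):
        return "S"
    return "N"


def classify_trace(addresses):
    """Classify every transfer in a list of byte addresses."""
    kinds = []
    prev = None
    for addr in addresses:
        kinds.append(classify_bus_cycle(prev, addr))
        prev = addr
    return kinds
```

For example, `classify_trace([0x100, 0x104, 0x108, 0x200])` marks the two word-stride accesses as sequential and the jump to `0x200` as nonsequential.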

Optimizing Code in Software
Try to perform similar optimizations as the compiler does
  – Loop unrolling
  – Eliminate redundant code
  – Optimize memory accesses
    » Use load-multiple and store-multiple instructions (LDM/STM)
    » Avoid using 32-bit data types: they reduce your load and store performance

Note that we are switching to the MIPS architecture to discuss software optimizations…

Running Example
This code adds a scalar to a vector:
    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
Assume the following latencies for all examples:

Instruction producing result   Instruction using result   Execution in cycles   Latency in cycles
FP ALU op                      Another FP ALU op          4                     3
FP ALU op                      Store double               3                     2
Load double                    FP ALU op                  1                     1
Load double                    Store double               1                     0
Integer op                     Integer op                 1                     0

FP Loop: Where are the Hazards?
First translate into MIPS code:
  – To simplify, assume 8 is lowest address

Loop: L.D    F0,0(R1)     ;F0=vector element
      ADD.D  F4,F0,F2     ;add scalar from F2
      S.D    F4,0(R1)     ;store result
      DADDUI R1,R1,#-8    ;decrement pointer 8B (DW)
      BNEZ   R1,Loop      ;branch R1!=zero
      NOP                 ;delayed branch slot

Where are the stalls?

FP Loop Showing Stalls

 1 Loop: L.D    F0,0(R1)     ;F0=vector element
 2       stall
 3       ADD.D  F4,F0,F2     ;add scalar in F2
 4       stall
 5       stall
 6       S.D    F4,0(R1)     ;store result
 7       DADDUI R1,R1,#-8    ;decrement pointer 8B (DW)
 8       stall
 9       BNEZ   R1,Loop      ;branch R1!=zero
10       stall               ;delayed branch slot

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1

10 clocks: rewrite code to minimize stalls?

Revised FP Loop Minimizing Stalls

1 Loop: L.D    F0,0(R1)
2       DADDUI R1,R1,#-8
3       ADD.D  F4,F0,F2
4       stall
5       BNE    R1,R2,Loop    ;delayed branch
6       S.D    F4,8(R1)

Swap BNE and S.D by changing address of S.D

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1

6 clocks, but just 3 for execution and 3 for loop overhead. How can we make it faster?
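The 10-clock and 6-clock totals can be reproduced with a very small in-order issue model. This is a sketch under stated assumptions, not a pipeline simulator: each instruction is reduced to a hypothetical `(kind, dest, sources)` tuple, one instruction issues per cycle, and an instruction waits until every source has been produced plus the latency from the table. The `("int", "branch")` entry of one cycle is an assumption inferred from the stall shown between DADDUI and the branch.

```python
# Stall cycles required between a producer kind and a consumer kind.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "branch"): 1,  # assumption: inferred from the stall before the branch
}


def count_cycles(program):
    """Return the cycle in which the last instruction of one loop pass issues.

    program is a list of (kind, dest, sources) tuples in issue order.
    """
    produced = {}  # register -> (issue cycle of producer, producer kind)
    cycle = 0
    for kind, dest, sources in program:
        issue = cycle + 1  # earliest slot after the previous instruction
        for reg in sources:
            if reg in produced:
                p_cycle, p_kind = produced[reg]
                gap = LATENCY.get((p_kind, kind), 0)
                issue = max(issue, p_cycle + 1 + gap)
        cycle = issue
        if dest is not None:
            produced[dest] = (issue, kind)
    return cycle


UNSCHEDULED = [
    ("load",   "F0", ["R1"]),        # L.D    F0,0(R1)
    ("fp_alu", "F4", ["F0", "F2"]),  # ADD.D  F4,F0,F2
    ("store",  None, ["F4", "R1"]),  # S.D    F4,0(R1)
    ("int",    "R1", ["R1"]),        # DADDUI R1,R1,#-8
    ("branch", None, ["R1"]),        # branch on R1
    ("nop",    None, []),            # empty delayed branch slot
]

SCHEDULED = [
    ("load",   "F0", ["R1"]),        # L.D    F0,0(R1)
    ("int",    "R1", ["R1"]),        # DADDUI R1,R1,#-8
    ("fp_alu", "F4", ["F0", "F2"]),  # ADD.D  F4,F0,F2
    ("branch", None, ["R1", "R2"]),  # BNE    R1,R2,Loop
    ("store",  None, ["F4", "R1"]),  # S.D    F4,8(R1) fills the delay slot
]
```

In this model `count_cycles(UNSCHEDULED)` gives 10 and `count_cycles(SCHEDULED)` gives 6. The model places the remaining stall of the scheduled loop just before the store rather than before the branch, which does not change the total.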

Unroll Loop Four Times (straightforward way)

 1 Loop: L.D    F0,0(R1)
 2       ADD.D  F4,F0,F2
 3       S.D    F4,0(R1)       ;drop DADDUI & BNEZ
 4       L.D    F6,-8(R1)
 5       ADD.D  F8,F6,F2
 6       S.D    F8,-8(R1)      ;drop DADDUI & BNEZ
 7       L.D    F10,-16(R1)
 8       ADD.D  F12,F10,F2
 9       S.D    F12,-16(R1)    ;drop DADDUI & BNEZ
10       L.D    F14,-24(R1)
11       ADD.D  F16,F14,F2
12       S.D    F16,-24(R1)
13       DADDUI R1,R1,#-32     ;alter to 4*8
14       BNEZ   R1,Loop

Stalls: 1 cycle after each L.D, 2 cycles after each ADD.D, 1 cycle after DADDUI, 1 cycle for the delayed branch slot
14 + (4 x (1+2)) + 2 = 28 clock cycles, or 7 per iteration
Rewrite loop to minimize stalls?

Unrolled Loop Detail
Do not usually know upper bound of loop
Suppose it is n, and we would like to unroll the loop to make k copies of the body
Instead of a single unrolled loop, we generate a pair of consecutive loops:
  – 1st executes (n mod k) times and has a body that is the original loop
  – 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times
  – For large values of n, most of the execution time will be spent in the unrolled loop
Problem: although it improves execution performance, it increases the code size substantially!
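The pair of loops described above can be sketched for the running example with k = 4. This is an illustration in Python rather than the compiler's output: the function name is hypothetical, and the loop counts upward for readability where the lecture's loop counts down.

```python
K = 4  # unroll factor (number of copies of the body)


def add_scalar_unrolled(x, s):
    """Add s to every element of list x in place.

    Returns (first_loop_iterations, unrolled_loop_iterations) so the
    n mod k / n div k split is visible to the caller.
    """
    n = len(x)
    first_iters = n % K
    unrolled_iters = n // K

    # 1st loop: the original body, executed (n mod k) times.
    i = 0
    while i < first_iters:
        x[i] = x[i] + s
        i += 1

    # 2nd loop: k copies of the body, executed (n div k) times.
    while i < n:
        x[i] = x[i] + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
        i += K

    return first_iters, unrolled_iters
```

With 1000 elements the first loop never runs and the unrolled loop runs 250 times; with 1003 elements the first loop handles the 3 leftover elements. The code-size cost is visible even here: the body appears five times in total.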

Unrolled Loop That Minimizes Stalls (scheduled based on the latencies from slide 4)

 1 Loop: L.D    F0,0(R1)
 2       L.D    F6,-8(R1)
 3       L.D    F10,-16(R1)
 4       L.D    F14,-24(R1)
 5       ADD.D  F4,F0,F2
 6       ADD.D  F8,F6,F2
 7       ADD.D  F12,F10,F2
 8       ADD.D  F16,F14,F2
 9       S.D    F4,0(R1)
10       S.D    F8,-8(R1)
11       S.D    F12,-16(R1)
12       DADDUI R1,R1,#-32
13       BNEZ   R1,Loop
14       S.D    F16,8(R1)      ;8-32 = -24

14 clock cycles, or 3.5 per iteration
Better than 7 before scheduling and 6 when scheduled and not unrolled
What assumptions were made when moving code?
  – OK to move store past DADDUI even though it changes the register
  – OK to move loads before stores: get right data?
  – When is it safe for compiler to do such changes?

Compiler Perspectives on Code Movement
Compiler concerned about dependencies in program
Whether or not a dependence is a HW hazard depends on the pipeline
Try to schedule to avoid hazards that cause performance losses
(True) Data dependencies (RAW if a hazard for HW)
  – Instruction i produces a result used by instruction j, or
  – Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
If dependent, can’t execute in parallel
Easy to determine for registers (fixed names)
Hard for memory (“memory disambiguation” problem):
  – Does 100(R4) = 20(R6)?
  – From different loop iterations, does 20(R6) = 20(R6)?

Compiler Perspectives on Code Movement
Name dependencies are hard to discover for memory accesses
  – Does 100(R4) = 20(R6)?
  – From different loop iterations, does 20(R6) = 20(R6)?
Our example required the compiler to know that if R1 doesn’t change then:
    0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)
There were no dependencies between some loads and stores, so they could be moved past each other
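The easy case of memory disambiguation, two references off the same unchanged base register, can be sketched as a conservative check. The `(base, offset)` representation and the function name are assumptions made for this example; real compilers use more elaborate address analysis.

```python
DOUBLE_BYTES = 8  # size of one L.D / S.D access


def may_alias(ref_a, ref_b, access_bytes=DOUBLE_BYTES):
    """Return False only when the two references provably do not overlap.

    Each reference is a (base_register, offset) pair. The check is valid
    only if the base register is not modified between the two accesses.
    """
    base_a, off_a = ref_a
    base_b, off_b = ref_b
    if base_a != base_b:
        # Different base registers, e.g. 100(R4) vs 20(R6): nothing is
        # known about their values here, so conservatively assume overlap.
        return True
    # Same base: the accesses overlap only if the offsets are closer
    # together than one access width.
    return abs(off_a - off_b) < access_bytes
```

Applied to the unrolled loop, every pair among `0(R1)`, `-8(R1)`, `-16(R1)` and `-24(R1)` is proven independent, which is what lets the loads move ahead of the stores. The `100(R4)` versus `20(R6)` question stays unanswered, so such accesses must keep their order.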

Steps Compiler Performed to Unroll
Check OK to move the S.D after DADDUI and BNEZ, and find amount to adjust S.D offset
Determine unrolling the loop would be useful by finding that the loop iterations were independent
Rename registers to avoid name dependencies
Eliminate extra test and branch instructions and adjust the loop termination and iteration code
Determine loads and stores in unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent
  – Requires analyzing memory addresses and finding that they do not refer to the same address
Schedule the code, preserving any dependences needed to yield same result as the original code

Where are the name dependencies?

 1 Loop: L.D    F0,0(R1)
 2       ADD.D  F4,F0,F2
 3       S.D    F4,0(R1)       ;drop DADDUI & BNEZ
 4       L.D    F0,-8(R1)
 5       ADD.D  F4,F0,F2
 6       S.D    F4,-8(R1)      ;drop DADDUI & BNEZ
 7       L.D    F0,-16(R1)
 8       ADD.D  F4,F0,F2
 9       S.D    F4,-16(R1)     ;drop DADDUI & BNEZ
10       L.D    F0,-24(R1)
11       ADD.D  F4,F0,F2
12       S.D    F4,-24(R1)
13       DADDUI R1,R1,#-32     ;alter to 4*8
14       BNEZ   R1,Loop
15       NOP

How can we remove them? (See pg. 310 of text)

Where are the name dependencies?

 1 Loop: L.D    F0,0(R1)
 2       ADD.D  F4,F0,F2
 3       S.D    F4,0(R1)       ;drop DADDUI & BNEZ
 4       L.D    F6,-8(R1)
 5       ADD.D  F8,F6,F2
 6       S.D    F8,-8(R1)      ;drop DADDUI & BNEZ
 7       L.D    F10,-16(R1)
 8       ADD.D  F12,F10,F2
 9       S.D    F12,-16(R1)    ;drop DADDUI & BNEZ
10       L.D    F14,-24(R1)
11       ADD.D  F16,F14,F2
12       S.D    F16,-24(R1)
13       DADDUI R1,R1,#-32     ;alter to 4*8
14       BNEZ   R1,Loop
15       NOP

The original “register renaming”: instruction execution can be overlapped or performed in parallel
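The renaming step can be sketched as a mechanical pass over the unrolled copies. This is a simplified illustration, not a register allocator: instructions are hypothetical `(op, dest, sources)` tuples, memory offsets are ignored, and the list of free registers is supplied by the caller.

```python
def unroll_with_renaming(body, copies, free_regs):
    """Replicate body `copies` times, giving each later copy fresh registers.

    Registers that the body only reads (here F2 and R1) keep their names.
    Raises ValueError when the pool of free registers runs out.
    """
    pool = iter(free_regs)
    out = []
    for copy in range(copies):
        mapping = {}  # original name -> new name, valid within this copy
        for op, dest, sources in body:
            # Sources use whatever names earlier instructions of this copy got.
            new_sources = [mapping.get(reg, reg) for reg in sources]
            new_dest = dest
            if dest is not None and copy > 0:
                new_dest = next(pool, None)
                if new_dest is None:
                    raise ValueError("out of free registers")
                mapping[dest] = new_dest
            out.append((op, new_dest, new_sources))
    return out


LOOP_BODY = [
    ("L.D",   "F0", ["R1"]),
    ("ADD.D", "F4", ["F0", "F2"]),
    ("S.D",   None, ["F4", "R1"]),
]

renamed = unroll_with_renaming(
    LOOP_BODY, 4, ["F6", "F8", "F10", "F12", "F14", "F16"]
)
```

With the six free registers given, the copies use F0/F4, F6/F8, F10/F12 and F14/F16, matching the listing above. Asking for a fifth copy exhausts the pool, which is a small-scale picture of register pressure.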

Limits to Loop Unrolling
Decrease in the amount of loop overhead amortized with each unroll
  – After a few unrolls the loop overhead amortization is very small
Code size limitations
  – Memory is not infinite, especially in embedded systems
Compiler limitations
  – Shortfall in registers due to excessive unrolling (register pressure)
  – Optimized code may lose its advantage due to the lack of registers
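The shrinking benefit of each extra unroll shows up in a one-line cost model. The sketch assumes the running example's shape, 3 useful instructions per element and 2 overhead instructions per loop pass, and that scheduling hides every stall. That holds for the four-way unrolled schedule (14 cycles) but not for the single-iteration loop, which needs 6 cycles. The constant and function names are illustrative.

```python
USEFUL_PER_ELEMENT = 3  # L.D, ADD.D, S.D
OVERHEAD_PER_PASS = 2   # DADDUI and the branch


def cycles_per_element(k: int) -> float:
    """Stall-free cycles per vector element when the loop is unrolled k times."""
    if k < 1:
        raise ValueError("unroll factor must be at least 1")
    return (USEFUL_PER_ELEMENT * k + OVERHEAD_PER_PASS) / k


def gain_from_doubling(k: int) -> float:
    """Cycles per element saved by going from k to 2k copies."""
    return cycles_per_element(k) - cycles_per_element(2 * k)
```

In this model, going from 1 to 2 copies saves 1 cycle per element, while going from 8 to 16 saves only 0.125. Code size and register demand keep growing roughly linearly with k the whole time.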

Static Branch Prediction
Simplest: predict taken
  – Average misprediction rate = untaken branch frequency, which for the SPEC programs is 34%
  – Unfortunately, the misprediction rate ranges from not very accurate (59%) to highly accurate (9%)
Predict on the basis of branch direction?
  – Choose backward-going branches to be taken (loop)
  – Choose forward-going branches to be not taken (if)
  – In the SPEC programs, however, most forward-going branches are taken => predict taken is better
Predict branches on the basis of profile information collected from earlier runs
  – Misprediction varies from 5% to 22%
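The first two schemes can be compared on a branch trace with a few lines of code. The trace below is made up purely for illustration and is not SPEC data; each entry is a hypothetical `(is_backward, taken)` pair.

```python
def predict_taken(is_backward: bool) -> bool:
    """Always predict taken, regardless of direction."""
    return True


def predict_by_direction(is_backward: bool) -> bool:
    """Backward branches predicted taken, forward branches not taken."""
    return is_backward


def mispredict_rate(branches, predictor) -> float:
    """Fraction of branches whose outcome differs from the static prediction."""
    if not branches:
        return 0.0
    wrong = sum(
        1 for is_backward, taken in branches if predictor(is_backward) != taken
    )
    return wrong / len(branches)


# Invented 20-branch trace in which most forward branches are taken.
TRACE = (
    [(True, True)] * 9      # backward, taken (loop continues)
    + [(True, False)] * 1   # backward, not taken (loop exit)
    + [(False, True)] * 6   # forward, taken
    + [(False, False)] * 4  # forward, not taken
)
```

On this invented trace predict-taken mispredicts 5 of 20 branches (exactly the untaken fraction), while the direction-based scheme mispredicts 7 of 20. That mirrors the slide's observation that direction-based prediction loses when forward branches are mostly taken.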

Next Time
Quiz – no regular lecture
March 3 – hardware optimizations and HW ILP