© 2010 IBM Corporation
Code Alignment for Architectures with Pipeline Group Dispatching
Helena Kosachevsky, Gadi Haber, Omer Boehm
Code Optimization Technologies, IBM Research – Haifa

Agenda
• Background
• Code alignment algorithm
  – General concepts, code chains
  – Genetic algorithm
• Code alignment for Power 6
  – Architecture specifics
  – Evaluation strategies
• Results

Background
• Proper code placement strongly impacts
  – instruction cache performance
  – branch prediction
  – the instruction fetch mechanism
• Previous work: only a few works treat code alignment as placement of code chains without reordering, using padding of a certain size; it is usually applied as a complementary optimization, producing mixed results
• We propose a profile-guided, generic optimization algorithm that produces a stable performance gain

Code Alignment Algorithm – general concepts
• A code chain is a code sequence that is executed more or less continuously, with no significant differences in the execution frequency of its instructions
• A chain satisfies one of the following properties (a rough construction sketch follows below):
  – it terminates with an unconditional jump or a branch via register
  – it terminates with a conditional branch whose fall-through is taken infrequently
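The talk shows no code, but a minimal sketch of how such chains might be built from profiled basic blocks could look as follows. The block representation and the "taken infrequently" threshold are assumptions for illustration, not taken from the slides:

```python
# Hypothetical sketch: grouping profiled basic blocks into code chains.
# A chain ends at an unconditional jump / branch via register, or at a
# conditional branch whose fall-through is rarely taken.

FALLTHRU_RATIO = 0.1  # assumed threshold for "taken infrequently"

def build_chains(blocks):
    """blocks: list of dicts with keys
       'freq'          - execution count of the block
       'kind'          - 'uncond', 'indirect', 'cond', or 'fallthru'
       'fallthru_freq' - count of fall-through executions (cond only)
    """
    chains, current = [], []
    for block in blocks:
        current.append(block)
        ends_chain = (
            block['kind'] in ('uncond', 'indirect') or
            (block['kind'] == 'cond' and
             block.get('fallthru_freq', 0) <= FALLTHRU_RATIO * max(block['freq'], 1))
        )
        if ends_chain:
            chains.append(current)
            current = []
    if current:
        chains.append(current)
    return chains
```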

Code Alignment Algorithm – working on chains
• Aligns each chain by inserting non-executable padding between the chains (see the padding sketch below)
• Working on chains, rather than on basic blocks, limits code inflation
• The profile allows the algorithm to focus on frequently executed chains, avoiding long run times and code inflation
• Tries to determine the best position for each given chain
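As an illustration of the padding step, here is a small sketch assuming a fixed-width 4-byte ISA (as on POWER) and the 8-instruction buffer described later in the talk; the helper name and example address are hypothetical:

```python
# Hypothetical sketch: how many padding instructions are needed so that
# a chain starts at a chosen offset within the instruction buffer.

INSN_SIZE = 4                        # POWER instructions are 4 bytes wide
IBUFF_INSNS = 8                      # assumed instruction buffer size
IBUFF_BYTES = IBUFF_INSNS * INSN_SIZE

def padding_needed(current_addr, wanted_offset):
    """Number of padding instructions to insert before a chain so that it
    starts at 'wanted_offset' (in instructions) within the buffer."""
    current_offset = (current_addr % IBUFF_BYTES) // INSN_SIZE
    return (wanted_offset - current_offset) % IBUFF_INSNS

# Example: a chain currently starting at 0x104 (offset 1) aligned to offset 4.
print(padding_needed(0x104, 4))      # -> 3 padding instructions
```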

Code Chains and Around
[Figure: two code chains laid out at addresses 0x100–0x160; an alignment offset of 3 instructions and a gap of 4 instructions between the chains are shown relative to the instruction buffer boundary]

Code Alignment Algorithm – filtering alignment options
• The algorithm works in phases; in each phase a different measure determines the best alignment alternatives
• An initial set of alignment options is defined
• This set is filtered in several steps, with a different filter at each step
• These filters, or evaluation strategies, are specific to the architecture and model how performance depends on code placement
• The strategies are applied according to predefined priorities: each filter is applied only to the options that survived the previous filters, so its results refine, rather than override, the earlier ones (see the sketch below)
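A sketch of this prioritized filtering, with hypothetical strategy callables that return a penalty per alignment offset (lower is better); names and the tie-breaking rule are assumptions:

```python
# Hypothetical sketch of the phased filtering: each evaluation strategy
# scores the surviving alignment options, and only the best-scoring ones
# are passed on to the next, lower-priority strategy.

def filter_alignments(chain, strategies, options=range(8)):
    """strategies: list of callables ordered by priority; each returns a
    penalty (lower is better) for placing 'chain' at a given offset."""
    surviving = list(options)
    for strategy in strategies:
        penalties = {off: strategy(chain, off) for off in surviving}
        best = min(penalties.values())
        surviving = [off for off in surviving if penalties[off] == best]
        if len(surviving) == 1:
            break                   # a unique winner; no further filtering
    return surviving[0]             # any remaining option is equally good
```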

Power 6 Pipeline
The generic pipeline stages of instruction processing:
• Fetch: instructions are copied from the instruction cache or memory into the fetch buffer
• Decode: instructions in the fetch buffer are interpreted
• Dispatch: instructions are sent to the appropriate execution units
• Execute: the operations indicated by the instructions are carried out in the execution units
• Complete: at the end of execution, the result of an instruction can be forwarded to other pending instructions while it awaits write back
• Write Back: the results of execution are written to the architected registers, cache or memory in program order, and any exceptions are recognized

Alignment for Power 6
• In-order architecture with static dispatch grouping
• Very sensitive to code alignment

Alignment for Power 6 – architecture specifics
• The fetch buffer contains 8 instructions
• Instructions that are not to be executed are discarded
• "Good" instructions are delivered for dispatch
• Dispatch groups are formed; each cycle one group is executed
• A new dispatch group starts on the instruction buffer boundary

Alignment for Power 6 – evaluation strategies
Start from 8 possible alignment options and filter them by:
1. Dispatch groups – minimize the number of dispatch groups formed within the chain, normalized by their execution frequency

Alignment Evaluation Using Grouping Analysis
Grouping analysis is performed on the chain (addresses 0x00–0x60) for each candidate offset:

offset  dispatch groups
0       8
1       8
2       7
3       6
4       7
5       7
6       7
7       6

The penalty for an offset is the sum of the execution counters of each created group (see the sketch below).
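One possible way to compute this penalty for a single offset, assuming the 8-instruction buffer above plus an additional, assumed per-group instruction limit (on real hardware, group boundaries also depend on instruction types and execution resources, which is why the counts in the figure vary between offsets that cross the same number of buffer boundaries):

```python
# Hypothetical sketch of the grouping-analysis penalty for one alignment
# offset. Assumptions: the fetch buffer holds 8 instructions, a new
# dispatch group always starts at an instruction-buffer boundary, and a
# group may hold at most 'group_limit' instructions.

IBUFF_INSNS = 8

def grouping_penalty(chain_insns, offset, group_limit=5):
    """chain_insns: list of per-instruction execution counts.
    The penalty is the sum of the execution counts of the instructions
    at which new dispatch groups are started."""
    penalty = 0
    position = offset            # position within the instruction buffer
    group_size = 0
    for count in chain_insns:
        if group_size == 0:
            penalty += count     # a new dispatch group is formed here
        group_size += 1
        position += 1
        if position % IBUFF_INSNS == 0 or group_size == group_limit:
            group_size = 0       # next instruction opens a new group
    return penalty

# Example: a uniformly hot chain of 20 instructions, compared at two offsets.
hot = [1000] * 20
print(grouping_penalty(hot, 0), grouping_penalty(hot, 7))
# -> 5000 6000: offset 0 forms 5 groups, offset 7 forms 6 for this toy chain
```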

Alignment for Power 6 – evaluation strategies
2. Hot targets – align targets of frequently executed branch instructions, i.e. those with high incoming control flow (a scoring sketch follows the next figure)
  – Best case: a hot target is placed at the beginning of the instruction buffer
  – Worst case: the first instruction of the hot target is the last instruction of the ibuff, forming a dispatch group with a single instruction, which is the only executed instruction of that ibuff

Alignment Evaluation by Aligned Hot Targets
[Figure: chains 1 and 2 at addresses 0x100–0x160; a gap is inserted between the chains to place the frequently taken target on the ibuff boundary]
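A sketch of how the hot-targets strategy could be scored, again assuming an 8-instruction buffer; the target representation and the linear slot penalty are illustrative choices, not taken from the talk:

```python
# Hypothetical sketch of the hot-targets strategy: prefer offsets that
# place frequently taken branch targets at the start of an instruction
# buffer, and penalize placements near the end of the buffer, where the
# target would open a nearly empty dispatch group.

IBUFF_INSNS = 8

def hot_target_penalty(hot_targets, offset):
    """hot_targets: list of (insn_index_within_chain, incoming_count).
    The penalty grows with how far into the instruction buffer each hot
    target lands, weighted by its incoming control-flow count."""
    penalty = 0
    for index, count in hot_targets:
        slot = (offset + index) % IBUFF_INSNS   # 0 = buffer start (best)
        penalty += count * slot                 # 7 = last slot (worst)
    return penalty

# Example: one hot target at instruction 5 of the chain, taken 10,000 times.
print(hot_target_penalty([(5, 10_000)], 0))   # lands in slot 5 -> 50000
print(hot_target_penalty([(5, 10_000)], 3))   # lands in slot 0 -> 0
```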

Alignment for Power 6 – evaluation strategies
Other possible strategies:
• Branch instruction alignment
• Reducing dispatch stalls

Results
• The algorithm was implemented in IBM FDPR-Pro, a profile-based post-link optimizer
• In some cases of extremely bad initial code alignment, an improvement of up to 40% is achieved
• A stable performance gain was measured on the SPEC 2006 INT64 benchmarks, running on AIX 6.1 on Power 6; applied on top of the standard O3 FDPR-Pro optimization set, the alignment showed up to 5% improvement

Results – SPEC 2006

Thanks!