M. Mateen Yaqoob The University of Lahore Spring 2014.

PIPELINE AND VECTOR PROCESSING


 Introduction  IA-64 Architecture Announcement  IA-64 - Inside the Architecture  Features for E-business  Features for Technical Computing  Summary

 Most significant architecture advancement since 32-bit computing with the 80386  80386: multi-tasking, advances from 16-bit to 32-bit  Merced: explicit parallelism, advances from 32-bit to 64-bit  Application Instruction Set Architecture Guide  Complete disclosure of the IA-64 application architecture  Result of the successful collaboration between Intel and HP

[Slide diagram: industry-wide IA-64 development for Internet, Enterprise, and Workstation solutions, spanning Enterprise Technology Centers, Application Solution Centers, operating systems, tools, the Application Instruction Set Architecture Guide, development systems, software enabling programs, high-end platform initiatives, the Intel 64 Fund, and the Intel Developer Forum.]

[Slide diagram: IA server/workstation roadmap from the Pentium II Xeon and Pentium III Xeon processors ('98-'99) through Merced, McKinley, Madison, and Deerfield ('00-'03) on .25µ, .18µ, and .13µ processes, with Foster continuing the IA-32 line; IA-64 starts with the Merced processor. All dates specified are target dates provided for planning purposes only and are subject to change.]

 Application instructions and opcodes  Instructions available to an application programmer  Machine code for these instructions  Unique architecture features & enhancements  Explicit parallelism and templates  Predication, speculation, memory support, and others  Floating-point and multimedia architecture  IA-64 resources available to applications  Large, application-visible register set  Rotating registers, register stack, register stack engine  IA-32 & PA-RISC compatibility models Details now available to the broad industry

 Performance barriers:  Memory latency  Branches  Loop pipelining and call/return overhead  Headroom constraints:  Hardware-based instruction scheduling  Unable to efficiently schedule parallel execution  Resource constrained  Too few registers  Unable to fully utilize multiple execution units  Scalability limitations:  Memory addressing efficiency IA-64 addresses these limitations

 Overcome the limitations of today's architectures  Provide world-class floating-point performance  Support large memory needs with 64-bit addressability  Protect existing investments  Full binary compatibility with existing IA-32 instructions in hardware  Full binary compatibility with PA-RISC instructions through software translation  Support growing high-end application workloads  E-business and internet applications  Scientific analysis and 3D graphics Define the next generation computer architecture

[Slide diagram: today's processors are often 60% idle. The compiler turns the original source code into sequential machine code; the hardware then tries to extract parallelism for its multiple functional units at run time, so the available execution units are used inefficiently.]

[Slide diagram: IA-64 increases parallel execution. The compiler views a wider scope of the original source code and emits parallel machine code directly, so the hardware's multiple functional units are used more efficiently.]

 Explicitly parallel:  Instruction level parallelism (ILP) in machine code  Compiler schedules across a wider scope  Enhanced ILP :  Predication, Speculation, Software pipelining,...  Fully compatible:  Across all IA-64 family members  IA-32 in hardware and PA-RISC through instruction mapping  Inherently scalable  Massively resourced:  Many registers  Many functional units

 Removes branches, converts to predicated execution  Executes multiple paths simultaneously  Increases performance by exposing parallelism and reducing the critical path  Better utilization of wider machines  Reduces mispredicted branches
[Slide diagram: a traditional if/then/else compare-and-branch structure versus the IA-64 form, in which a single cmp sets predicates p1 and p2 and both paths execute under them.]
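To make the idea concrete, here is a minimal Python sketch (not IA-64 code; the function names are invented) contrasting a branchy computation with a branch-free, "predicated" form in which both arms execute and the predicates select the live result:

```python
def branchy(a, b):
    # Traditional form: a conditional branch guards each arm.
    if a > b:
        return a - b
    else:
        return b - a

def predicated(a, b):
    # Predicated form: one compare sets two predicates, BOTH arms
    # execute, and the predicates select the result - no branch.
    p1 = a > b        # "then" predicate
    p2 = not p1       # "else" predicate
    then_val = a - b  # executed under p1
    else_val = b - a  # executed under p2
    return p1 * then_val + p2 * else_val
```

Both functions compute |a - b|; the predicated form trades a little extra computation for the absence of a mispredictable branch.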

 Two kinds of normal compares  Regular: p3 is set just once, e.g. (p1) p3 = ... and (p2) p3 = ...  Unconditional (nested IFs): p3 and p4 are AND'ed with p2, e.g. p1,p2 <- cmp(...), then (p2) p3,p4 <- cmp.unc(...), so the guarded instructions (p3)... and (p4)... execute under p2&p3 and p2&p4 Opportunity for even more parallelism

 Three new types of compares:  AND: both target predicates set FALSE if the compare is false  OR: both target predicates set TRUE if the compare is true  ANDOR: if true, sets one predicate TRUE and the other FALSE
[Slide diagram: parallel compares reduce the critical path, collapsing the dependence chain among blocks A, B, C, and D.]

Tbit (Test Bit) Also Sets Predicates  (qp) p1,p2 <- cmp.relation  if(qp) {p1 = relation; p2 = !relation};  (qp) p1,p2 <- cmp.relation.unc  p1 = qp&relation; p2 = qp&!relation;  (qp) p1,p2 <- cmp.relation.and  if(qp & (relation==FALSE)) { p1=0; p2=0; }  (qp) p1,p2 <- cmp.relation.or  if(qp & (relation==TRUE)) { p1=1; p2=1; }  (qp) p1,p2 <- cmp.relation.or.andcm  if(qp & (relation==TRUE)) { p1=1; p2=0; }
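The predicate-writing rules listed above can be modeled in a few lines of Python. This is an illustrative sketch of the slide's semantics only, not real IA-64 behavior; each function takes the qualifying predicate, the compare result, and the current values of the two target predicates, and returns their new values:

```python
def cmp_normal(qp, rel, p1, p2):
    # Regular compare: writes both predicates only if qp is true.
    if qp:
        return rel, not rel
    return p1, p2  # predicates left unchanged

def cmp_unc(qp, rel, p1, p2):
    # Unconditional compare: always writes; results AND'ed with qp.
    return qp and rel, qp and (not rel)

def cmp_and(qp, rel, p1, p2):
    # AND-type: clears BOTH targets when the compare is false.
    if qp and not rel:
        return False, False
    return p1, p2

def cmp_or(qp, rel, p1, p2):
    # OR-type: sets BOTH targets when the compare is true.
    if qp and rel:
        return True, True
    return p1, p2

def cmp_or_andcm(qp, rel, p1, p2):
    # or.andcm: sets p1 TRUE and p2 FALSE when the compare is true.
    if qp and rel:
        return True, False
    return p1, p2
```

The AND/OR forms are what make parallel compares possible: several independent compares can all target the same predicate pair without ordering constraints.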

 Reduces branches and mispredict penalties  50% fewer branches and 37% faster code*  Parallel compares further reduce critical paths  Greatly improves code with hard-to-predict branches  Large server apps (capacity limited)  Sorting, data mining (large database apps)  Data compression  Traditional architectures' "bolt-on" approach can't efficiently approximate predication  Cmove: 39% more instructions, 23% slower performance*  Instructions must all be speculative

 Compiler can issue a load prior to a preceding, possibly-conflicting store  Unique feature of IA-64
[Slide diagram: in traditional architectures the st8 acts as a barrier, so the ld8 and its use must wait behind instr 1, instr 2, and the store; in IA-64 the ld8.a (advanced load) is hoisted above them and a later ld.c checks it just before the use.]

 Instructions  ld.a - advanced load  ld.c - check load  chk.a - advanced load check  ld.sa - speculative advanced load (an advanced load with deferral)
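The interaction of these instructions can be sketched with a toy Python model of the tracking structure (the ALAT). All names here are invented for illustration, and the model ignores entry counts, sizes, and registers:

```python
class AdvancedLoadTable:
    """Toy model of the ALAT: tracks addresses loaded early (ld.a)."""

    def __init__(self, memory):
        self.memory = memory
        self.valid = {}  # address -> value loaded speculatively

    def ld_a(self, addr):
        # Advanced load: load early and remember the address.
        self.valid[addr] = self.memory[addr]
        return self.valid[addr]

    def st(self, addr, value):
        # A store to a tracked address invalidates the advanced load.
        self.memory[addr] = value
        self.valid.pop(addr, None)

    def ld_c(self, addr, spec_value):
        # Check load: reuse the speculative value if still valid,
        # otherwise redo the load from (possibly updated) memory.
        if addr in self.valid:
            return spec_value
        return self.memory[addr]
```

If no intervening store conflicts, the check load is essentially free; if one does, the load is simply re-executed, which is the recovery the slide's "deferral" refers to.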

 Reduces impact of memory latency  Study demonstrates performance improvement of 79% when combined with predication*  Greatest improvement to code with many cache accesses  Large databases  Operating systems  Scheduling flexibility enables new levels of performance headroom

 Overlapping execution of different loop iterations vs. executing each iteration to completion  More iterations in the same amount of time

Especially useful for integer code with a small number of loop iterations  IA-64 features that make this possible  Full predication  Special branch handling features  Register rotation: removes loop copy overhead  Predicate rotation: removes prologue & epilogue  Traditional architectures use loop unrolling  High overhead: extra code for loop body, prologue, and epilogue

 Loop pipelining maximizes performance; minimizes overhead  Avoids code expansion of unrolling and code explosion of prologue and epilogue  Smaller code means fewer cache misses  Greater performance improvements in higher latency conditions  Reduced overhead allows S/W pipelining of small loops with unknown trip counts  Typical of integer scalar codes

 A parallel processing system is able to perform concurrent data processing to achieve faster execution time  The system may have two or more ALUs and be able to execute two or more instructions at the same time  Goal is to increase the throughput – the amount of processing that can be accomplished during a given interval of time

 Single instruction stream, single data stream - SISD  Single instruction stream, multiple data stream - SIMD  Multiple instruction stream, single data stream - MISD  Multiple instruction stream, multiple data stream - MIMD

 A single control unit, a single processor unit, and a memory unit  Instructions are executed sequentially; parallel processing may still be achieved by means of multiple functional units or by pipeline processing

 Represents an organization that includes many processing units under the supervision of a common control unit  All processors receive the same instruction but operate on different data

 Theoretical only  Processors receive different instructions but operate on the same data

 A computer system capable of processing several programs at the same time.  Most multiprocessor and multicomputer systems can be classified in this category

PIPELINING  Decomposes a sequential process into segments  Divides the processor into segment processors, each dedicated to a particular segment  Each segment is executed in its dedicated segment processor, which operates concurrently with all the other segments  Information flows through these multiple hardware segments

PIPELINING  Instruction execution is divided into k segments or stages  Instruction exits pipe stage k-1 and proceeds into pipe stage k  All pipe stages take the same amount of time; called one processor cycle  Length of the processor cycle is determined by the slowest pipe stage k segments

PIPELINING  Suppose we want to perform the combined multiply and add operations with a stream of numbers:  Ai * Bi + Ci for i =1,2,3,…,7

PIPELINING  The suboperations performed in each segment of the pipeline are as follows:  R1  Ai, R2  Bi  R3  R1 * R2 R4  Ci  R5  R3 + R4

 n: number of instructions  k: number of stages in the pipeline  τ: clock cycle  Tk: total time, Tk = (k + n - 1)τ  In the laundry analogy, n is the number of loads and k is the number of stages (washing, drying, and folding); the clock cycle is the slowest task time

SPEEDUP  Consider a k-segment pipeline operating on n data sets. (In the above example, k = 3 and n = 4.)  > It takes k clock cycles to fill the pipeline and get the first result from the output of the pipeline.  After that the remaining (n - 1) results will come out at each clock cycle.  > It therefore takes (k + n - 1) clock cycles to complete the task.

SPEEDUP  If we execute the same task sequentially in a single processing unit, it takes (k * n) clock cycles.  The speedup gained by using the pipeline is:  S = k * n / (k + n - 1 )

SPEEDUP  S = k * n / (k + n - 1 ) For n >> k (such as 1 million data sets on a 3-stage pipeline),  S ~ k  So we can gain the speedup which is equal to the number of functional units for a large data sets. This is because the multiple functional units can work in parallel except for the filling and cleaning-up cycles.

SOME DEFINITIONS  Pipeline: is an implementation technique where multiple instructions are overlapped in execution.  Pipeline stage: The computer pipeline is to divided instruction processing into stages. Each stage completes a part of an instruction and loads a new part in parallel. The stages are connected one to the next to form a pipe - instructions enter at one end, progress through the stages, and exit at the other end.

SOME DEFINITIONS  Throughput of the instruction pipeline is determined by how often an instruction exits the pipeline  Pipelining does not decrease the time for individual instruction execution; instead, it increases instruction throughput  Machine cycle: the time required to move an instruction one step further in the pipeline; its length is determined by the time required for the slowest pipe stage

[Slide diagram: instruction pipeline versus sequential processing.]

Instruction pipeline (contd.): for a small number of instructions, sequential processing can actually be faster, because the pipeline fill overhead dominates

 If a complicated memory access occurs in stage 1, stage 2 will be delayed and the rest of the pipe is stalled  If there is a branch (a conditional or a jump), then some of the instructions that have already entered the pipeline should not be processed  We need to deal with these difficulties to keep the pipeline moving

[Slide diagram: space-time diagram of a five-stage pipeline with stages S1-S5: Fetch Instruction (FI), Decode Instruction (DI), Fetch Operand (FO), Execute Instruction (EI), and Write Operand (WO); each successive instruction occupies each stage one clock cycle later than its predecessor.]

 Fetch instruction  Decode instruction  Fetch operands  Execute instructions  Write result
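The overlap of these five steps can be printed as a space-time diagram with a short Python sketch (the stage mnemonics and helper name are my own choice for illustration):

```python
def space_time_diagram(stages, n_instr):
    """Rows: one per instruction; columns: stage occupied each cycle."""
    k = len(stages)
    total_cycles = k + n_instr - 1   # fill + one completion per cycle
    rows = []
    for i in range(n_instr):
        row = ["--"] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name        # instruction i is in stage s at cycle i + s
        rows.append(row)
    return rows

for row in space_time_diagram(["FI", "DI", "FO", "EI", "WO"], 3):
    print(" ".join(row))
# FI DI FO EI WO -- --
# -- FI DI FO EI WO --
# -- -- FI DI FO EI WO
```

The staircase shape is exactly what the garbled S1-S5 diagram on the previous slide was depicting: 3 instructions finish in 5 + 3 - 1 = 7 cycles instead of 15.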

[Slide diagram: space-time diagram of a six-stage pipeline with stages S1-S6: Instruction Fetch, Decode, Calculate Operand, Fetch Operand, Execute, and Write Operand.]

 Fetch instruction  Decode instruction  Calculate operands (Find effective address)  Fetch operands  Execute instructions  Write result

[Slide diagram: flow chart for a four-segment pipeline.]

 Branch Difficulties  Data Dependency

 Prefetch the target instruction in addition to the instruction following the branch  If the branch condition succeeds, the pipeline continues from the branch target instruction

 The BTB (Branch Target Buffer) is an associative memory  Each entry in the BTB consists of the address of a previously executed branch instruction and the target instruction for that branch

 A pipeline with branch prediction uses some additional logic to guess the outcome of a conditional branch instruction before it is executed

 In this procedure, the compiler detects the branch instruction and rearranges the machine language code sequence by inserting useful instructions that keep the pipeline operating without interruption  An example of the delayed branch is presented in the next section

 Long range weather forecasting.  Petroleum explorations.  Seismic data analysis.  Medical diagnosis.  Aerodynamics and space flight simulations.

Vector instruction format: Operation code | Base address source 1 | Base address source 2 | Base address destination | Vector length
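The fields above can be captured as a simple record; here is a hypothetical Python encoding (the field names and widths are illustrative, not from any real ISA):

```python
from dataclasses import dataclass

@dataclass
class VectorInstruction:
    # Fields from the slide's vector instruction format.
    opcode: int         # operation code (e.g. vector multiply-add)
    src1_base: int      # base address of source operand vector 1
    src2_base: int      # base address of source operand vector 2
    dest_base: int      # base address of the destination vector
    vector_length: int  # number of elements to process
```

A single such instruction lets the hardware stream vector_length elements through its arithmetic pipelines without fetching an instruction per element.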

[Slide diagram: Source A and Source B feed a multiplier pipeline whose output feeds an adder pipeline.]

 How can predicates replace a conditional branch instruction?  What is the difference between limited and explicit parallelism? Explain it using a simple example.  A non-pipelined system takes 50 ns to process a task. The same task can be processed in a six-segment pipeline with a clock cycle of 10 ns. Determine the speedup ratio of the pipeline for 100 tasks. What is the maximum speedup that can be achieved?