
1 M. Mateen Yaqoob The University of Lahore Spring 2014

2  Introduction  IA-64 Architecture Announcement  IA-64 - Inside the Architecture  Features for E-business  Features for Technical Computing  Summary

3  Most significant architecture advancement since 32-bit computing with the 80386  80386: multi-tasking, advances from 16 bit to 32 bit  Merced: explicit parallelism, advances from 32 bit to 64 bit  Application Instruction Set Architecture Guide  Complete disclosure of IA-64 application architecture  Result of the successful collaboration between Intel and HP

4 Industry-wide IA-64 development (figure): IA-64 solutions for the Internet, Enterprise, and Workstation, supported by Enterprise Technology Centers, Application Solution Centers, Operating Systems, Tools, the Application Instruction Set Architecture Guide, Development Systems, Software Enabling Programs, High-end Platform Initiatives, the Intel 64 Fund, and the Intel Developer Forum.

5 IA Server/Workstation Roadmap (figure): IA-64 starts with the Merced processor, followed by McKinley, then Madison (performance) and Deerfield (price/performance); the IA-32 line continues with the Pentium II Xeon processor, the Pentium III Xeon processor, and Foster, across 0.25µ, 0.18µ, and 0.13µ process generations ('98–'03). All dates specified are target dates provided for planning purposes only and are subject to change.

6  Application instructions and opcodes  Instructions available to an application programmer  Machine code for these instructions  Unique architecture features & enhancements  Explicit parallelism and templates  Predication, speculation, memory support, and others  Floating-point and multimedia architecture  IA-64 resources available to applications  Large, application visible register set  Rotating registers, register stack, register stack engine  IA-32 & PA-RISC compatibility models Details now available to the broad industry

7  Performance barriers:  Memory latency  Branches  Loop pipelining and call / return overhead  Headroom constraints:  Hardware-based instruction scheduling  Unable to efficiently schedule parallel execution  Resource constrained:  Too few registers  Unable to fully utilize multiple execution units  Scalability limitations:  Memory addressing efficiency IA-64 addresses these limitations

8  Overcome the limitations of today’s architectures  Provide world-class floating-point performance  Support large memory needs with 64-bit addressability  Protect existing investments  Full binary compatibility with existing IA-32 instructions in hardware  Full binary compatibility with PA-RISC instructions through software translation  Support growing high-end application workloads  E-business and internet applications  Scientific analysis and 3D graphics Define the next generation computer architecture

9 Today’s processors are often 60% idle (figure): the compiler translates the original source code into sequential machine code, and hardware with multiple functional units must rediscover the parallelism at run time, so the available execution units are used inefficiently.

10 IA-64 increases parallel execution (figure): the compiler views a wider scope of the original source code and emits parallel machine code, so the hardware’s multiple functional units are used more efficiently.

11  Explicitly parallel:  Instruction level parallelism (ILP) in machine code  Compiler schedules across a wider scope  Enhanced ILP:  Predication, Speculation, Software pipelining,...  Fully compatible:  Across all IA-64 family members  IA-32 in hardware and PA-RISC through instruction mapping  Inherently scalable  Massively resourced:  Many registers  Many functional units

12  Removes branches, converts to predicated execution  Executes multiple paths simultaneously  Increases performance by exposing parallelism and reducing critical path  Better utilization of wider machines  Reduces mispredicted branches (figure: traditional architectures branch on a cmp to separate then/else paths; IA-64 evaluates the cmp into predicates p1 and p2 and executes both paths under them)
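
As a rough C-level illustration (not actual IA-64 output; the function and names are hypothetical), the fragment below shows the branchy source pattern that if-conversion removes: IA-64 would compute predicates for both outcomes of the compare and issue both assignments under those predicates, with no branch to mispredict.

```c
/* Illustrative only: the if/else pattern that predication removes.
 * Conceptually, IA-64 sets p1 = (a > b) and p2 = !(a > b) with one
 * compare, then executes both assignments, each guarded by its
 * predicate, so no conditional branch is needed. */
int select_count(int a, int b, int max_count, int other_count)
{
    int count;
    if (a > b)                 /* traditional: conditional branch          */
        count = max_count;     /* (p1) would execute only if a > b         */
    else
        count = other_count;   /* (p2) would execute only if !(a > b)      */
    return count;
}
```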

13 Opportunity for even more parallelism  Two kinds of normal compares
 Regular: p3 is set just once
p1,p2 <- cmp...
(p1) p3 = ...
(p2) p3 = ...
 Unconditional (nested IF’s): p3 and p4 are AND’ed with p2, giving p2&p3 and p2&p4
(p2) p3,p4 <- cmp.unc...
(p3) ...
(p4) ...

14 Reduces Critical Path (figure: the dependence chain among blocks A, B, C, and D is shortened)  Three new types of compares:  AND: both target predicates set FALSE if compare is false  OR: both target predicates set TRUE if compare is true  ANDOR: if true, sets one TRUE, sets other FALSE
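
A hedged C rendering of the AND-type behaviour for a single target predicate (variable names are illustrative, not IA-64 semantics in full): several independent conditions can clear the same predicate, so a conjunction is evaluated without a chain of dependent branches.

```c
/* Illustrative only: AND-type parallel compares let several independent
 * conditions clear the same predicate, so p ends up as (c1 && c2 && c3)
 * without one branch feeding the next. */
int and_type_compare(int c1, int c2, int c3)
{
    int p = 1;            /* predicate initialized TRUE               */
    if (!c1) p = 0;       /* cmp.and: clear predicate if compare false */
    if (!c2) p = 0;
    if (!c3) p = 0;
    return p;
}
```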

15 Tbit (Test Bit) Also Sets Predicates
 (qp) p1,p2 <- cmp.relation: if(qp) {p1 = relation; p2 = !relation;}
 (qp) p1,p2 <- cmp.relation.unc: p1 = qp & relation; p2 = qp & !relation;
 (qp) p1,p2 <- cmp.relation.and: if(qp & (relation==FALSE)) { p1=0; p2=0; }
 (qp) p1,p2 <- cmp.relation.or: if(qp & (relation==TRUE)) { p1=1; p2=1; }
 (qp) p1,p2 <- cmp.relation.or.andcm: if(qp & (relation==TRUE)) { p1=1; p2=0; }

16  Reduces branches and mispredict penalties  50% fewer branches and 37% faster code*  Parallel compares further reduce critical paths  Greatly improves code with hard-to-predict branches  Large server apps: capacity limited  Sorting, data mining: large database apps  Data compression  Traditional architectures’ “bolt-on” approach can’t efficiently approximate predication  Cmove: 39% more instructions, 23% slower performance*  Instructions must all be speculative

17  Compiler can issue a load prior to a preceding, possibly-conflicting store (unique feature of IA-64)
(figure) Traditional architectures: the store is a barrier the load cannot be moved above:
instr 1
instr 2
...
st8
ld8
use
IA-64: the load is hoisted as an advanced load and checked after the store:
ld8.a
instr 1
instr 2
st8
ld.c
use

18  Instructions  ld.a: advanced loads  ld.c: check loads  chk.a: advanced load checks  Speculative advanced loads: ld.sa is an advanced load with deferral
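
As a sketch of the problem these instructions solve (plain C, not IA-64 code; the function and names are illustrative), the load below cannot normally be hoisted above the store because the two pointers might alias; ld.a/chk.a let the compiler hoist it anyway and verify the result after the store.

```c
/* Illustrative only: the aliasing pattern that ld.a / chk.a target.
 * If p and q may point to the same location, a conventional compiler
 * cannot move the load of *p above the store to *q. */
long use_after_store(long *p, long *q, long v)
{
    *q = v;          /* store that may conflict with the load below    */
    long x = *p;     /* load the compiler would like to issue earlier  */
    return x + 1;    /* the loaded value is needed as soon as possible */
}
```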

19  Reduces impact of memory latency  Study demonstrates performance improvement of 79% when combined with predication*  Greatest improvement to code with many cache accesses  Large databases  Operating systems  Scheduling flexibility enables new levels of performance headroom

20  Overlapping execution of different loop iterations: more iterations in the same amount of time (figure: sequential loop iterations vs. overlapped, pipelined iterations)

21 Especially Useful for Integer Code With Small Number of Loop Iterations  IA-64 features that make this possible  Full Predication  Special branch handling features  Register rotation: removes loop copy overhead  Predicate rotation: removes prologue & epilogue  Traditional architectures use loop unrolling  High overhead: extra code for loop body, prologue, and epilogue

22  Loop pipelining maximizes performance; minimizes overhead  Avoids code expansion of unrolling and code explosion of prologue and epilogue  Smaller code means fewer cache misses  Greater performance improvements in higher latency conditions  Reduced overhead allows S/W pipelining of small loops with unknown trip counts  Typical of integer scalar codes
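
For contrast, a minimal C sketch of the loop unrolling that traditional architectures rely on; the duplicated loop body plus the clean-up epilogue is exactly the code expansion that register and predicate rotation avoid (the function and names are illustrative).

```c
/* Illustrative only: 4x manual unrolling of a simple loop.
 * The extra body copies and the clean-up loop (epilogue) are the
 * code-expansion overhead that IA-64 software pipelining avoids. */
void scale(float *a, const float *b, float s, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {   /* unrolled body: 4 copies           */
        a[i]     = b[i]     * s;
        a[i + 1] = b[i + 1] * s;
        a[i + 2] = b[i + 2] * s;
        a[i + 3] = b[i + 3] * s;
    }
    for (; i < n; i++)             /* epilogue for leftover iterations  */
        a[i] = b[i] * s;
}
```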

23  A parallel processing system is able to perform concurrent data processing to achieve faster execution time  The system may have two or more ALUs and be able to execute two or more instructions at the same time  Goal is to increase the throughput – the amount of processing that can be accomplished during a given interval of time

24 Single instruction stream, single data stream – SISD
Single instruction stream, multiple data stream – SIMD
Multiple instruction stream, single data stream – MISD
Multiple instruction stream, multiple data stream – MIMD

25  Single control unit, single computer, and a memory unit  Instructions are executed sequentially. Parallel processing may be achieved by means of multiple functional units or by pipeline processing

26  Represents an organization that includes many processing units under the supervision of a common control unit.  Includes multiple processing units with a single control unit. All processors receive the same instruction, but operate on different data.
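
A loose software analogy (illustrative C only, not hardware): the loop below applies one operation to many data elements, which is the pattern a SIMD machine executes across its processing units in lockstep.

```c
/* Illustrative analogy only: one instruction, many data elements.
 * SIMD hardware would apply the add to all elements simultaneously. */
void vector_add(const int *a, const int *b, int *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];   /* same operation, different data */
}
```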

27  Theoretical only  Processors receive different instructions, but operate on the same data.

28  A computer system capable of processing several programs at the same time.  Most multiprocessor and multicomputer systems can be classified in this category

29 PIPELINING Decomposes a sequential process into segments. Divides the processor into segment processors, each dedicated to a particular segment. Each segment is executed in its dedicated segment processor, which operates concurrently with all other segments. Information flows through these multiple hardware segments.

30 PIPELINING  Instruction execution is divided into k segments or stages  Instruction exits pipe stage k-1 and proceeds into pipe stage k  All pipe stages take the same amount of time; called one processor cycle  Length of the processor cycle is determined by the slowest pipe stage (figure: k segments)

31 PIPELINING  Suppose we want to perform the combined multiply and add operations with a stream of numbers:  Ai * Bi + Ci for i = 1, 2, 3, …, 7

32 PIPELINING  The suboperations performed in each segment of the pipeline are as follows:  Segment 1: R1 ← Ai, R2 ← Bi  Segment 2: R3 ← R1 * R2, R4 ← Ci  Segment 3: R5 ← R3 + R4
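
A small C sketch that simulates these segment registers, one loop iteration per clock cycle (the data values are illustrative; only the Ai * Bi + Ci structure comes from the slide).

```c
#include <stdio.h>

/* Illustrative only: simulate the 3-segment pipeline computing Ai*Bi + Ci.
 * Segments are updated back-to-front so each segment sees the previous
 * clock cycle's register values. */
int main(void)
{
    double A[] = {1, 2, 3, 4, 5, 6, 7};
    double B[] = {7, 6, 5, 4, 3, 2, 1};
    double C[] = {1, 1, 1, 1, 1, 1, 1};
    const int n = 7, k = 3;                 /* 7 data sets, 3 segments */
    double R1 = 0, R2 = 0, R3 = 0, R4 = 0, R5 = 0;

    for (int clock = 0; clock < n + k - 1; clock++) {
        R5 = R3 + R4;                                    /* segment 3 */
        R3 = R1 * R2;                                    /* segment 2 */
        R4 = (clock >= 1 && clock - 1 < n) ? C[clock - 1] : 0;
        if (clock < n) { R1 = A[clock]; R2 = B[clock]; } /* segment 1 */
        if (clock >= k - 1)
            printf("i=%d: A*B + C = %g\n", clock - (k - 1) + 1, R5);
    }
    return 0;   /* n + k - 1 = 9 clock cycles produce all 7 results */
}
```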

33

34

35  n: number of instructions  k: number of stages in the pipeline  τ: clock cycle  Tk: total time  n is equivalent to the number of loads in the laundry example; k is the number of stages (washing, drying, and folding); the clock cycle τ is the slowest task time

36 SPEEDUP  Consider a k-segment pipeline operating on n data sets. (In the laundry example above, k = 3 and n = 4.)  It takes k clock cycles to fill the pipeline and get the first result from the output of the pipeline.  After that, the remaining (n - 1) results will come out at each clock cycle.  It therefore takes (k + n - 1) clock cycles to complete the task.

37 SPEEDUP  If we execute the same task sequentially in a single processing unit, it takes (k * n) clock cycles.  The speedup gained by using the pipeline is:  S = k * n / (k + n - 1 )

38 SPEEDUP  S = k * n / (k + n - 1)  For n >> k (such as 1 million data sets on a 3-stage pipeline), S ≈ k  So for large data sets the speedup approaches k, the number of pipeline segments (functional units), because the multiple functional units work in parallel except for the filling and cleaning-up cycles.
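
A quick numeric check of the formula, with illustrative values of n on a 3-stage pipeline:

```c
#include <stdio.h>

/* Illustrative only: evaluate S = k*n / (k + n - 1) for a few values of n. */
int main(void)
{
    const double k = 3.0;                       /* 3-stage pipeline      */
    const double ns[] = {4, 100, 1000000};      /* numbers of data sets  */
    for (int i = 0; i < 3; i++) {
        double n = ns[i];
        double s = (k * n) / (k + n - 1);
        printf("k = %.0f, n = %.0f  ->  speedup S = %.3f\n", k, n, s);
    }
    return 0;                                   /* S approaches k as n grows */
}
```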

39 SOME DEFINITIONS  Pipeline: an implementation technique where multiple instructions are overlapped in execution.  Pipeline stage: the computer pipeline divides instruction processing into stages. Each stage completes a part of an instruction and loads a new part in parallel. The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end.

40 SOME DEFINITIONS Throughput of the instruction pipeline is determined by how often an instruction exits the pipeline. Pipelining does not decrease the time for individual instruction execution. Instead, it increases instruction throughput. Machine cycle: the time required to move an instruction one step further in the pipeline. The length of the machine cycle is determined by the time required for the slowest pipe stage.

41 Instruction pipeline versus sequential processing (figure: sequential processing compared with an instruction pipeline)

42 Instruction pipeline (contd.) (figure: sequential processing is faster for a small number of instructions)

43

44  If a complicated memory access occurs in stage 1, stage 2 will be delayed and the rest of the pipe is stalled.  If there is a branch (an if or a jump), then some of the instructions that have already entered the pipeline should not be processed.  We need to deal with these difficulties to keep the pipeline moving.

45 (figure: space-time diagram of a five-stage pipeline with stages S1–S5: Fetch Instruction (FI), Decode Instruction (DI), Fetch Operand (FO), Execution Instruction (EI), and Write Operand (WO), showing instructions 1–9 flowing through the stages over time)

46  Fetch instruction  Decode instruction  Fetch operands  Execute instructions  Write result

47 (figure: space-time diagram of a six-stage pipeline with stages S1–S6: Instruction Fetch, Decode, Calculate Operand, Fetch Operand, Execution, and Write Operand, showing instructions flowing through the stages over time)

48  Fetch instruction  Decode instruction  Calculate operands (Find effective address)  Fetch operands  Execute instructions  Write result
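
A small illustrative C program that prints a space-time table like the diagrams above, showing which instruction occupies each of the six stages in each clock cycle (the stage abbreviations and the instruction count are assumptions for the sketch):

```c
#include <stdio.h>

/* Illustrative only: print a space-time diagram for a 6-stage pipeline.
 * Instruction i (0-based) occupies stage s during clock c = i + s. */
int main(void)
{
    const char *stages[] = {"FI", "DI", "CO", "FO", "EI", "WO"};
    const int k = 6, n = 4;                 /* 6 stages, 4 instructions */

    printf("cycle:");
    for (int c = 0; c < n + k - 1; c++) printf("%4d", c + 1);
    printf("\n");

    for (int s = 0; s < k; s++) {
        printf("%5s:", stages[s]);
        for (int c = 0; c < n + k - 1; c++) {
            int i = c - s;                  /* instruction in stage s at cycle c */
            if (i >= 0 && i < n) printf("  I%d", i + 1);
            else                 printf("    ");
        }
        printf("\n");
    }
    return 0;
}
```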

49 Flow chart for a four-segment pipeline (figure)

50  Branch Difficulties  Data Dependency

51  Prefetch the target instruction in addition to the instruction following the branch  If the branch condition is successful, the pipeline continues from the branch target instruction

52  BTB is an associative memory  Each entry in the BTB consists of the address of a previously executed branch instruction and the target instruction for the branch
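
A minimal C sketch of such an associative lookup; the structure, table size, and field widths are illustrative assumptions, not a description of real BTB hardware.

```c
#include <stdint.h>

/* Illustrative only: a tiny branch target buffer modeled as an
 * associative array of (branch address, target address) entries. */
#define BTB_ENTRIES 16

typedef struct {
    uint32_t branch_addr;   /* address of a previously executed branch */
    uint32_t target_addr;   /* target instruction address for the branch */
    int      valid;
} btb_entry_t;

static btb_entry_t btb[BTB_ENTRIES];

/* Returns 1 and sets *target if the branch address hits in the BTB. */
int btb_lookup(uint32_t branch_addr, uint32_t *target)
{
    for (int i = 0; i < BTB_ENTRIES; i++) {
        if (btb[i].valid && btb[i].branch_addr == branch_addr) {
            *target = btb[i].target_addr;
            return 1;               /* hit: fetching continues at the target */
        }
    }
    return 0;                       /* miss: fetching falls through */
}
```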

53  A pipeline with branch prediction uses some additional logic to guess the outcome of a conditional branch instruction before it is executed

54  In this procedure, the compiler detects the branch instruction and rearranges the machine language code sequence by inserting useful instructions that keep the pipeline operating without interruptions  An example of a delayed branch is presented in the next section

55  Long range weather forecasting.  Petroleum explorations.  Seismic data analysis.  Medical diagnosis.  Aerodynamics and space flight simulations.

56 Vector instruction format (fields): Operation code | Base address source 1 | Base address source 2 | Base address destination | Vector length
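
The same fields written as a C structure; the field widths are arbitrary choices for the sketch, not part of any particular machine.

```c
#include <stdint.h>

/* Illustrative only: the vector instruction fields as a C structure. */
typedef struct {
    uint8_t  opcode;        /* operation code (e.g. vector add, multiply) */
    uint32_t base_src1;     /* base address of source operand 1 */
    uint32_t base_src2;     /* base address of source operand 2 */
    uint32_t base_dest;     /* base address of the destination  */
    uint16_t vector_length; /* number of elements to process    */
} vector_instruction_t;
```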

57 (figure: Source A and Source B operand streams feeding a multiplier pipeline whose output feeds an adder pipeline)

58  How can predicates replace a conditional branch instruction?  What is the difference between limited and explicit parallelism? Explain it by using a simple example  A non-pipeline system takes 50 ns to process a task. The same task can be processed in a six-segment pipeline with a clock cycle of 10 ns. Determine the speedup ratio of the pipeline for 100 tasks. What is the maximum speedup that can be achieved?

