M. Mateen Yaqoob, The University of Lahore, Spring 2014
- Introduction
- IA-64 Architecture Announcement
- IA-64: Inside the Architecture
- Features for E-business
- Features for Technical Computing
- Summary
Most significant architecture advancement since 32-bit computing with the 80386:
- 80386: multi-tasking, advanced from 16-bit to 32-bit
- Merced: explicit parallelism, advances from 32-bit to 64-bit
Application Instruction Set Architecture Guide:
- Complete disclosure of the IA-64 application architecture
- Result of the successful collaboration between Intel and HP
[Slide graphic: industry-wide IA-64 development for Internet, enterprise, and workstation solutions, spanning Enterprise Technology Centers, Application Solution Centers, operating systems, tools, the Application Instruction Set Architecture Guide, development systems, software enabling programs, high-end platform initiatives, the Intel 64 Fund, and the Intel Developer Forum.]
IA Server/Workstation Roadmap: IA-64 starts with the Merced processor. [Slide graphic: roadmap from '98 to '03 plotting performance against price/performance, showing the Pentium II Xeon and Pentium III Xeon processors, Foster (future IA-32), and the IA-64 line of Merced, then McKinley, then Madison (performance) and Deerfield (price/performance), on .25µ, .18µ, and .13µ processes. All dates specified are target dates provided for planning purposes only and are subject to change.]
Details now available to the broad industry:
- Application instructions and opcodes: the instructions available to an application programmer, and the machine code for those instructions
- Unique architecture features and enhancements: explicit parallelism and templates; predication, speculation, memory support, and others; floating-point and multimedia architecture
- IA-64 resources available to applications: a large, application-visible register set; rotating registers, the register stack, and the register stack engine
- IA-32 and PA-RISC compatibility models
Limitations of today's architectures:
- Performance barriers: memory latency, branches, loop pipelining and call/return overhead
- Headroom constraints: hardware-based instruction scheduling is unable to efficiently schedule parallel execution
- Resource constraints: too few registers, so multiple execution units cannot be fully utilized
- Scalability limitations: memory addressing efficiency
IA-64 addresses these limitations.
Define the next-generation computer architecture:
- Overcome the limitations of today's architectures
- Provide world-class floating-point performance
- Support large memory needs with 64-bit addressability
- Protect existing investments: full binary compatibility with existing IA-32 instructions in hardware, and full binary compatibility with PA-RISC instructions through software translation
- Support growing high-end application workloads: e-business and internet applications, scientific analysis and 3D graphics
Today's processors are often 60% idle. [Slide graphic: original source code passes through the compiler into sequential machine code; the hardware must rediscover parallelism at run time, so the multiple functional units that are available end up used inefficiently.]
IA-64 increases parallel execution. [Slide graphic: the IA-64 compiler views a wider scope of the original source code and emits parallel machine code, so the hardware's multiple functional units are used more efficiently.]
- Explicitly parallel: instruction-level parallelism (ILP) is expressed in the machine code, and the compiler schedules across a wider scope
- Enhanced ILP: predication, speculation, software pipelining, ...
- Fully compatible: across all IA-64 family members, with IA-32 in hardware and PA-RISC through instruction mapping
- Inherently scalable
- Massively resourced: many registers, many functional units
Predication:
- Removes branches, converting them to predicated execution
- Executes multiple paths simultaneously
- Increases performance by exposing parallelism and reducing the critical path
- Better utilization of wider machines
- Reduces mispredicted branches
[Slide graphic: a traditional compare-and-branch with separate then/else paths versus the IA-64 form, where a single cmp sets predicates p1 and p2 and both paths issue under their predicates.]
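To make this if-conversion concrete, here is a minimal C sketch (variable names are illustrative, not from the slides). The first form uses a branch; the second computes both paths unconditionally and commits under complementary guards, which is the effect IA-64 predicate registers achieve without any branch:

```c
#include <stdio.h>

int main(void) {
    int a = 7, b = 3, x;

    /* Traditional form: a conditional branch picks one path. */
    if (a > b)
        x = a - b;          /* "then" path */
    else
        x = b - a;          /* "else" path */

    /* Predicated form: one compare produces two complementary
       guards; both paths execute, but each result is committed
       only under its guard, so there is no branch to mispredict. */
    int p1 = (a > b);       /* predicate for the "then" path */
    int p2 = !p1;           /* predicate for the "else" path */
    int t_then = a - b;     /* issued unconditionally */
    int t_else = b - a;     /* issued unconditionally */
    x = p1 * t_then + p2 * t_else;   /* guarded commit */

    printf("x = %d\n", x);  /* prints 4 either way */
    return 0;
}
```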
Two kinds of normal compares, and an opportunity for even more parallelism:
- Regular: p3 is set just once, e.g. p1,p2 <- cmp...; (p1) p3 = ...; (p2) p3 = ...; (p3) ...
- Unconditional (nested IFs): p3 and p4 are ANDed with the qualifying predicate p2, e.g. (p2) p3,p4 <- cmp.unc...; (p3) ...; (p4) ..., so the guards behave as p2&p3 and p2&p4
Parallel compares reduce the critical path. Three new types of compares:
- AND: both target predicates set FALSE if the compare is false
- OR: both target predicates set TRUE if the compare is true
- ANDOR: if true, sets one TRUE and the other FALSE
[Slide graphic: the dependence chain A -> B -> C -> D is flattened so the conditions can be evaluated in parallel.]
Tbit (test bit) also sets predicates. Compare semantics:
- (qp) p1,p2 <- cmp.relation          : if (qp) { p1 = relation; p2 = !relation; }
- (qp) p1,p2 <- cmp.relation.unc      : p1 = qp & relation; p2 = qp & !relation;
- (qp) p1,p2 <- cmp.relation.and      : if (qp & (relation == FALSE)) { p1 = 0; p2 = 0; }
- (qp) p1,p2 <- cmp.relation.or       : if (qp & (relation == TRUE)) { p1 = 1; p2 = 1; }
- (qp) p1,p2 <- cmp.relation.or.andcm : if (qp & (relation == TRUE)) { p1 = 1; p2 = 0; }
Predication benefits:
- Reduces branches and misprediction penalties: 50% fewer branches and 37% faster code*
- Parallel compares further reduce critical paths
- Greatly improves code with hard-to-predict branches: large server apps (capacity limited), sorting and data mining (large database apps), data compression
- Traditional architectures' "bolt-on" approach cannot efficiently approximate predication: cmove needs 39% more instructions with 23% slower performance*, and its instructions must all be speculative
Speculation: the compiler can issue a load prior to a preceding, possibly conflicting store. This feature is unique to IA-64. [Slide graphic: in traditional architectures the store acts as a barrier (instr 1, instr 2, ..., st8, ld8, use); in IA-64 the load is hoisted above the store as ld8.a, and a ld.c check before the use verifies it.]
Speculation instructions:
- ld.a: advanced load
- ld.c: check load
- chk.a: advanced load check
- ld.sa: speculative advanced load, i.e. an advanced load with deferral
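The slides do not show these instructions in use, so the following is a schematic C model (the names ld_a, st, and ld_c are invented for the sketch) of how an advanced load, a possibly conflicting store, and a check load interact:

```c
#include <stdio.h>

/* Schematic model of data speculation.  ld_a records the loaded
   address (a one-entry stand-in for the ALAT); a later store to
   the same address invalidates the entry; ld_c re-loads only if
   the speculation failed.  chk.a would instead branch to
   compiler-generated recovery code. */
static const int *alat_addr;
static int        alat_valid;

static int ld_a(const int *p) {            /* advanced load */
    alat_addr = p;
    alat_valid = 1;
    return *p;
}

static void st(int *p, int v) {            /* store, may conflict */
    if (alat_valid && p == alat_addr) alat_valid = 0;
    *p = v;
}

static int ld_c(const int *p, int spec) {  /* check load */
    return alat_valid ? spec : *p;         /* re-load on conflict */
}

int main(void) {
    int x = 10;
    int v = ld_a(&x);  /* load hoisted above the store by the compiler */
    st(&x, 42);        /* the store turns out to alias the load */
    v = ld_c(&x, v);   /* check detects the conflict and re-loads */
    printf("v = %d\n", v);   /* 42: same result as the unhoisted order */
    return 0;
}
```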
Speculation benefits:
- Reduces the impact of memory latency: one study demonstrates a performance improvement of 79% when combined with predication*
- Greatest improvement to code with many cache accesses: large databases, operating systems
- Scheduling flexibility enables new levels of performance headroom
Software pipelining: overlapping the execution of different loop iterations, completing more iterations in the same amount of time. [Slide graphic: sequential loop iterations versus overlapped iterations.]
Especially useful for integer code with a small number of loop iterations. IA-64 features that make this possible:
- Full predication
- Special branch-handling features
- Register rotation: removes loop copy overhead
- Predicate rotation: removes the prologue and epilogue
Traditional architectures use loop unrolling instead, with high overhead: extra code for the loop body, prologue, and epilogue.
Loop pipelining maximizes performance and minimizes overhead:
- Avoids the code expansion of unrolling and the code explosion of the prologue and epilogue; smaller code means fewer cache misses
- Greater performance improvements in higher-latency conditions
- Reduced overhead allows software pipelining of small loops with unknown trip counts, typical of integer scalar code
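As a rough illustration (hand-written C, not compiler output), the second loop below overlaps the load for iteration i+1 with the work of iteration i; the explicit prologue and epilogue it requires are exactly the per-loop overhead that IA-64 register rotation and predicate rotation eliminate:

```c
#include <stdio.h>

/* Plain loop: each iteration's load, multiply, and store run
   strictly in sequence. */
void scale_plain(const int *a, int *b, int n, int k) {
    for (int i = 0; i < n; i++)
        b[i] = a[i] * k;
}

/* Hand-pipelined loop: the load for iteration i+1 overlaps the
   multiply/store of iteration i, exposing the parallelism a
   software-pipelining compiler exploits. */
void scale_pipelined(const int *a, int *b, int n, int k) {
    if (n <= 0) return;
    int cur = a[0];                  /* prologue: prime the pipe */
    for (int i = 0; i < n - 1; i++) {
        int next = a[i + 1];         /* stage 1: load ahead */
        b[i] = cur * k;              /* stage 2: compute and store */
        cur = next;
    }
    b[n - 1] = cur * k;              /* epilogue: drain the pipe */
}

int main(void) {
    int a[5] = {1, 2, 3, 4, 5}, b[5];
    scale_pipelined(a, b, 5, 10);
    for (int i = 0; i < 5; i++) printf("%d ", b[i]);  /* 10 20 30 40 50 */
    printf("\n");
    return 0;
}
```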
A parallel processing system is able to perform concurrent data processing to achieve faster execution time. The system may have two or more ALUs and be able to execute two or more instructions at the same time. The goal is to increase throughput: the amount of processing that can be accomplished during a given interval of time.
Flynn's classification:
- Single instruction stream, single data stream (SISD)
- Single instruction stream, multiple data stream (SIMD)
- Multiple instruction stream, single data stream (MISD)
- Multiple instruction stream, multiple data stream (MIMD)
SISD: a single control unit, a single processor unit, and a memory unit. Instructions are executed sequentially; parallel processing may still be achieved by means of multiple functional units or by pipeline processing.
SIMD: an organization that includes many processing units under the supervision of a common control unit. All processors receive the same instruction but operate on different data.
MISD: largely theoretical; processors would receive different instructions but operate on the same data.
MIMD: a computer system capable of processing several programs at the same time. Most multiprocessor and multicomputer systems fall in this category.
PIPELINING
- Decomposes a sequential process into segments
- Divides the processor into segment processors, each dedicated to a particular segment
- Each segment is executed in its dedicated segment processor, which operates concurrently with all other segments
- Information flows through these multiple hardware segments
PIPELINING
- Instruction execution is divided into k segments or stages
- An instruction exits pipe stage k-1 and proceeds into pipe stage k
- All pipe stages take the same amount of time, called one processor cycle
- The length of the processor cycle is determined by the slowest pipe stage
PIPELINING
Suppose we want to perform the combined multiply and add operation with a stream of numbers: Ai * Bi + Ci for i = 1, 2, 3, ..., 7.
PIPELINING
The suboperations performed in each segment of the pipeline are:
- Segment 1: R1 <- Ai, R2 <- Bi
- Segment 2: R3 <- R1 * R2, R4 <- Ci
- Segment 3: R5 <- R3 + R4
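A cycle-by-cycle C simulation of this three-segment pipeline (a sketch that follows the R1..R5 register assignments above, with B and C held constant for readability):

```c
#include <stdio.h>

/* Segment 1 latches Ai and Bi; segment 2 forms the product and
   latches Ci; segment 3 forms the sum Ai*Bi + Ci.  Stages update
   back-to-front so each stage reads the previous cycle's values. */
int main(void) {
    int A[7] = {1, 2, 3, 4, 5, 6, 7};
    int B[7] = {1, 1, 1, 1, 1, 1, 1};
    int C[7] = {10, 10, 10, 10, 10, 10, 10};
    int R1 = 0, R2 = 0, R3 = 0, R4 = 0, R5 = 0;

    /* 7 items need k + n - 1 = 3 + 7 - 1 = 9 cycles in total */
    for (int t = 0; t < 9; t++) {
        if (t >= 2) {                 /* segment 3 */
            R5 = R3 + R4;
            printf("cycle %d: result %d\n", t + 1, R5);
        }
        if (t >= 1 && t <= 7) {       /* segment 2 */
            R3 = R1 * R2;
            R4 = C[t - 1];
        }
        if (t <= 6) {                 /* segment 1 */
            R1 = A[t];
            R2 = B[t];
        }
    }
    return 0;  /* results 11..17 emerge on cycles 3..9 */
}
```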
SPEEDUP
Notation: n = number of instructions; k = number of stages in the pipeline; τ = clock cycle; T_k = total time. In the classic laundry analogy, n is the number of loads and k the stages (washing, drying, and folding); the clock cycle is the time of the slowest task.
SPEEDUP
Consider a k-segment pipeline operating on n data sets (in the laundry example above, k = 3 and n = 4). It takes k clock cycles to fill the pipeline and get the first result out; after that, the remaining (n - 1) results emerge one per clock cycle. It therefore takes (k + n - 1) clock cycles to complete the task: T_k = (k + n - 1)τ.
SPEEDUP
If we execute the same task sequentially in a single processing unit, it takes k * n clock cycles. The speedup gained by using the pipeline is: S = k * n / (k + n - 1)
SPEEDUP
S = k * n / (k + n - 1). For n >> k (such as 1 million data sets on a 3-stage pipeline), S ≈ k. So for large data sets the speedup approaches the number of pipeline stages, because the stages work in parallel except during the fill and drain cycles.
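For reference, the formula is trivial to evaluate; a minimal C helper:

```c
#include <stdio.h>

/* Pipeline speedup from the formula above: S = k*n / (k + n - 1). */
static double speedup(int k, int n) {
    return (double)(k * n) / (k + n - 1);
}

int main(void) {
    printf("k=3, n=4:       S = %.2f\n", speedup(3, 4));        /* 2.00 */
    printf("k=3, n=1000000: S = %.2f\n", speedup(3, 1000000));  /* ~3.00 */
    return 0;
}
```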
SOME DEFINITIONS
- Pipeline: an implementation technique whereby multiple instructions are overlapped in execution.
- Pipeline stage: instruction processing is divided into stages; each stage completes a part of one instruction while loading a part of the next in parallel. The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end.
SOME DEFINITIONS
- Throughput of the instruction pipeline: determined by how often an instruction exits the pipeline. Pipelining does not decrease the time for an individual instruction's execution; instead, it increases instruction throughput.
- Machine cycle: the time required to move an instruction one step further in the pipeline. Its length is determined by the time required for the slowest pipe stage.
[Slide graphic: instruction pipeline versus sequential processing.]
Instruction pipeline (contd.): sequential processing can be faster when there are only a few instructions, before the pipeline's overlap pays off.
Pipeline difficulties:
- If a complicated memory access occurs in stage 1, stage 2 will be delayed and the rest of the pipe stalls.
- If there is a branch (an if or a jump), then some of the instructions that have already entered the pipeline should not be processed.
- We need to deal with these difficulties to keep the pipeline moving.
[Slide graphic: space-time diagram of a five-stage instruction pipeline, stages S1-S5: Fetch Instruction (FI), Decode Instruction (DI), Fetch Operand (FO), Execute Instruction (EI), and Write Operand (WO), with instructions 1-9 progressing through the stages over successive cycles.]
Five-stage instruction pipeline:
1. Fetch instruction
2. Decode instruction
3. Fetch operands
4. Execute instruction
5. Write result
[Slide graphic: space-time diagram of a six-stage pipeline, stages S1-S6: instruction fetch, decode, calculate operand, fetch operand, execution, and write operand, with instructions 1-9 progressing through the stages over successive cycles.]
Six-stage instruction pipeline:
1. Fetch instruction
2. Decode instruction
3. Calculate operands (find the effective address)
4. Fetch operands
5. Execute instruction
6. Write result
[Slide graphic: flow chart for a four-segment pipeline.]
Major pipeline difficulties: branch difficulties and data dependency.
Prefetch the target instruction in addition to the instruction following the branch. If the branch condition succeeds, the pipeline continues from the branch target instruction.
The branch target buffer (BTB) is an associative memory. Each entry in the BTB consists of the address of a previously executed branch instruction and the target instruction for that branch.
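The slide describes the BTB as associative; the toy C sketch below uses a direct-mapped table instead for brevity (the structure and names are invented for illustration, and real BTBs store more state):

```c
#include <stdint.h>
#include <stdio.h>

#define BTB_SIZE 16

struct btb_entry {
    uint32_t branch_pc;   /* address of a previously executed branch */
    uint32_t target_pc;   /* where it went */
    int      valid;
};

static struct btb_entry btb[BTB_SIZE];

/* Record a resolved branch and its target. */
static void btb_update(uint32_t pc, uint32_t target) {
    struct btb_entry *e = &btb[(pc >> 2) % BTB_SIZE];
    e->branch_pc = pc;
    e->target_pc = target;
    e->valid = 1;
}

/* During fetch: on a hit, fetch next from the stored target
   instead of falling through. */
static int btb_lookup(uint32_t pc, uint32_t *target) {
    struct btb_entry *e = &btb[(pc >> 2) % BTB_SIZE];
    if (e->valid && e->branch_pc == pc) { *target = e->target_pc; return 1; }
    return 0;
}

int main(void) {
    uint32_t t;
    btb_update(0x1000, 0x2000);   /* branch at 0x1000 went to 0x2000 */
    if (btb_lookup(0x1000, &t))
        printf("predict fetch from 0x%x\n", (unsigned)t);
    return 0;
}
```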
A pipeline with branch prediction uses additional logic to guess the outcome of a conditional branch instruction before it is executed.
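The slide does not name a specific scheme; one classic choice is a 2-bit saturating counter per branch, sketched here in C. States 0-1 predict not-taken and states 2-3 predict taken, so a single anomalous outcome does not flip a well-established prediction:

```c
#include <stdio.h>

static int counter = 2;   /* start in the "weakly taken" state */

static int predict(void) { return counter >= 2; }

/* Nudge the counter toward the actual outcome, saturating at 0 and 3. */
static void train(int taken) {
    if (taken  && counter < 3) counter++;
    if (!taken && counter > 0) counter--;
}

int main(void) {
    int outcomes[] = {1, 1, 0, 1, 1};   /* a loop branch: mostly taken */
    for (int i = 0; i < 5; i++) {
        printf("predict %s, actual %s\n",
               predict()   ? "taken" : "not-taken",
               outcomes[i] ? "taken" : "not-taken");
        train(outcomes[i]);
    }
    return 0;
}
```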
Delayed branch: the compiler detects the branch instruction and rearranges the machine-language code sequence by inserting useful instructions that keep the pipeline operating without interruption. An example of the delayed branch is presented in the next section.
Applications of vector processing:
- Long-range weather forecasting
- Petroleum exploration
- Seismic data analysis
- Medical diagnosis
- Aerodynamics and space flight simulations
Vector instruction format: [ operation code | base address, source 1 | base address, source 2 | base address, destination | vector length ]
[Slide graphic: Source A and Source B feed a multiplier pipeline whose output feeds an adder pipeline.]
Review questions:
1. How can predicates replace a conditional branch instruction?
2. What is the difference between limited and explicit parallelism? Explain using a simple example.
3. A non-pipelined system takes 50 ns to process a task. The same task can be processed in a six-segment pipeline with a clock cycle of 10 ns. Determine the speedup ratio of the pipeline for 100 tasks. What is the maximum speedup that can be achieved?