CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture VLIW Steve Ko Computer Sciences and Engineering University at Buffalo.

Slides:

Advertisements

Similar presentations

Asanovic/Devadas Spring VLIW/EPIC: Statically Scheduled ILP Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology.

Advertisements

Krste Asanovic Electrical Engineering and Computer Sciences

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Advanced Computer Architecture COE 501.

1 COMP 206: Computer Architecture and Implementation Montek Singh Wed, Oct 19, 2005 Topic: Instruction-Level Parallelism (Multiple-Issue, Speculation)

1 Lecture 5: Static ILP Basics Topics: loop unrolling, VLIW (Sections 2.1 – 2.2)

ENGS 116 Lecture 101 ILP: Software Approaches Vincent H. Berk October 12 th Reading for today: , 4.1 Reading for Friday: 4.2 – 4.6 Homework #2:

Instruction Level Parallelism 2. Superscalar and VLIW processors.

CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.

POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? ILP: VLIW Architectures Marco D. Santambrogio:

Dynamic ILP: Scoreboard Professor Alvin R. Lebeck Computer Science 220 / ECE 252 Fall 2008.

CS252 Graduate Computer Architecture Spring 2014 Lecture 9: VLIW Architectures Krste Asanovic

1 Lecture: Static ILP Topics: compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2)

Dynamic Branch PredictionCS510 Computer ArchitecturesLecture Lecture 10 Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining.

CS152 Lec15.1 Advanced Topics in Pipelining Loop Unrolling Super scalar and VLIW Dynamic scheduling.

Pipelining 5. Two Approaches for Multiple Issue Superscalar –Issue a variable number of instructions per clock –Instructions are scheduled either statically.

1 Advanced Computer Architecture Limits to ILP Lecture 3.

Computer Architecture Lec 8 – Instruction Level Parallelism.

Pipeline Computer Organization II 1 Hazards Situations that prevent starting the next instruction in the next cycle Structural hazards – A required resource.

1 Lecture 10: Static ILP Basics Topics: loop unrolling, static branch prediction, VLIW (Sections 4.1 – 4.4)

Microprocessors VLIW Very Long Instruction Word Computing April 18th, 2002.

Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture ILP II Steve Ko Computer Sciences and Engineering University at Buffalo.

Limits on ILP. Achieving Parallelism Techniques – Scoreboarding / Tomasulo’s Algorithm – Pipelining – Speculation – Branch Prediction But how much more.

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture ILP III Steve Ko Computer Sciences and Engineering University at Buffalo.

Instruction Level Parallelism (ILP) Colin Stevens.

Krste Asanovic Electrical Engineering and Computer Sciences

CS 152 Computer Architecture and Engineering Lecture 15 - Advanced Superscalars Krste Asanovic Electrical Engineering and Computer Sciences University.

March 9, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Krste Asanovic Electrical.

CIS 629 Fall 2002 Multiple Issue/Speculation Multiple Instruction Issue: CPI < 1 To improve a pipeline’s CPI to be better [less] than one, and to utilize.

March 14, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 13 - VLIW Machines and Statically Scheduled ILP Krste Asanovic Electrical.

7/2/ _23 1 Pipelining ECE-445 Computer Organization Dr. Ron Hayne Electrical and Computer Engineering.

ENGS 116 Lecture 91 Dynamic Branch Prediction and Speculation Vincent H. Berk October 10, 2005 Reading for today: Chapter 3.2 – 3.6 Reading for Wednesday:

CS 252 Graduate Computer Architecture Lecture 6: VLIW/EPIC, Statically Scheduled ILP Krste Asanovic Electrical Engineering and Computer Sciences University.

COMP381 by M. Hamdi 1 Commercial Superscalar and VLIW Processors.

Spring 2003CSE P5481 VLIW Processors VLIW (“very long instruction word”) processors instructions are scheduled by the compiler a fixed number of operations.

1 Advanced Computer Architecture Dynamic Instruction Level Parallelism Lecture 2.

CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 1 Compiler Techniques for ILP  So far we have explored dynamic hardware techniques for.

CMPE 421 Parallel Computer Architecture

StaticILP.1 2/12/02 Static ILP Static (Compiler Based) Scheduling Σημειώσεις UW-Madison Διαβάστε κεφ. 4 βιβλίο, και Paper on Itanium στην ιστοσελίδα.

Branch.1 10/14 Branch Prediction Static, Dynamic Branch prediction techniques.

Lecture 9. MIPS Processor Design – Pipelined Processor Design #1 Prof. Taeweon Suh Computer Science Education Korea University 2010 R&E Computer System.

Lecture 1: Introduction Instruction Level Parallelism & Processor Architectures.

Csci 136 Computer Architecture II – Superscalar and Dynamic Pipelining Xiuzhen Cheng

Advanced Pipelining 7.1 – 7.5. Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike.

March 1, 2012CS152, Spring 2012 CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Krste Asanovic Electrical.

CENG709 Computer Architecture and Operating Systems Lecture 13 - VLIW Machines and Statically Scheduled ILP Murat Manguoglu Department of Computer Engineering.

CSE431 L13 SS Execute & Commit.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 13: SS Backend (Execute, Writeback & Commit) Mary Jane.

Use of Pipelining to Achieve CPI < 1

CS 352H: Computer Systems Architecture

Instruction Level Parallelism

Multiscalar Processors

/ Computer Architecture and Design

CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue

COSC3330 Computer Architecture

Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2

Krste Asanovic Electrical Engineering and Computer Sciences

Krste Asanovic Electrical Engineering and Computer Sciences

Lecture 7: Dynamic Scheduling with Tomasulo Algorithm (Section 2.4)

CC423: Advanced Computer Architecture ILP: Part V – Multiple Issue

Midterm 2 review Chapter

CSC3050 – Computer Architecture

Krste Asanovic Electrical Engineering and Computer Sciences

Design of Digital Circuits Lecture 19a: VLIW

Lecture 10: ILP Innovations

Lecture 9: ILP Innovations

Overcoming Control Hazards with Dynamic Scheduling & Speculation

Lecture 5: Pipeline Wrap-up, Static ILP

Conceptual execution on a processor which exploits ILP

Presentation transcript:

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture VLIW Steve Ko Computer Sciences and Engineering University at Buffalo

CSE 490/590, Spring Last time… BTB allows prediction very early in pipeline In practice, use BHT and BTB together Speculative store buffer holds store values before commit to allow load-store forwarding Can execute later loads past earlier stores when addresses known, or predicted no dependence

CSE 490/590, Spring Fetch Decode & Rename Reorder BufferPC Branch Prediction Update predictors Commit Datapath: Branch Prediction and Speculative Execution Branch Resolution Branch Unit ALU Reg. File MEM Store Buffer D$ Execute kill

CSE 490/590, Spring Superscalar Control Logic Scaling Each issued instruction must somehow check against W*L instructions, i.e., growth in hardware  W*(W*L) For in-order machines, L is related to pipeline latencies and check is done during issue (interlocks or scoreboard) For out-of-order machines, L also includes time spent in instruction buffers (instruction window or ROB), and check is done by broadcasting tags to waiting instructions at write back (completion) As W increases, larger instruction window is needed to find enough parallelism to keep machine busy => greater L => Out-of-order control logic grows faster than W 2 (~W 3 ) Lifetime L Issue Group Previously Issued Instructions Issue Width W

CSE 490/590, Spring Out-of-Order Control Complexity: MIPS R10000 Control Logic [ SGI/MIPS Technologies Inc., 1995 ]

CSE 490/590, Spring Check instruction dependencies Superscalar processor Sequential ISA Bottleneck a = foo(b); for (i=0, i< Sequential source code Superscalar compiler Find independent operations Schedule operations Sequential machine code Schedule execution

CSE 490/590, Spring VLIW: Very Long Instruction Word Multiple operations packed into one instruction Each operation slot is for a fixed function Constant operation latencies are specified Architecture requires guarantee of: –Parallelism within an instruction => no cross-operation RAW check –No data use before data ready => no data interlocks Two Integer Units, Single Cycle Latency Two Load/Store Units, Three Cycle Latency Two Floating-Point Units, Four Cycle Latency Int Op 2Mem Op 1Mem Op 2FP Op 1FP Op 2Int Op 1

CSE 490/590, Spring 2011 VLIW Compiler Responsibilities Schedules to maximize parallel execution Guarantees intra-instruction parallelism Schedules to avoid data hazards (no interlocks) –Typically separates operations with explicit NOPs 8

CSE 490/590, Spring Early VLIW Machines FPS AP120B (1976) –scientific attached array processor –first commercial wide instruction machine –hand-coded vector math libraries using software pipelining and loop unrolling Multiflow Trace (1987) –commercialization of ideas from Fisher’s Yale group including “trace scheduling” –available in configurations with 7, 14, or 28 operations/instruction –28 operations packed into a 1024-bit instruction word Cydrome Cydra-5 (1987) –7 operations encoded in 256-bit instruction word –rotating register file

CSE 490/590, Spring CSE 490/590 Administrivia HW2 is out Midterm solution will be up today Quiz 2 (next Friday 4/8)

CSE 490/590, Spring Loop Execution for (i=0; i<N; i++) B[i] = A[i] + C; Int1Int 2M1M2FP+FPx loop: How many FP ops/cycle? ldadd r1 fadd sdadd r2bne 1 fadd / 8 cycles = loop: ld f1, 0(r1) add r1, 8 fadd f2, f0, f1 sd f2, 0(r2) add r2, 8 bne r1, r3, loop Compile Schedule

CSE 490/590, Spring Loop Unrolling for (i=0; i<N; i++) B[i] = A[i] + C; for (i=0; i<N; i+=4) { B[i] = A[i] + C; B[i+1] = A[i+1] + C; B[i+2] = A[i+2] + C; B[i+3] = A[i+3] + C; } Unroll inner loop to perform 4 iterations at once Need to handle values of N that are not multiples of unrolling factor with final cleanup loop

CSE 490/590, Spring Scheduling Loop Unrolled Code loop: ld f1, 0(r1) ld f2, 8(r1) ld f3, 16(r1) ld f4, 24(r1) add r1, 32 fadd f5, f0, f1 fadd f6, f0, f2 fadd f7, f0, f3 fadd f8, f0, f4 sd f5, 0(r2) sd f6, 8(r2) sd f7, 16(r2) sd f8, 24(r2) add r2, 32 bne r1, r3, loop Schedule Int1Int 2M1M2FP+FPx loop: Unroll 4 ways ld f1 ld f2 ld f3 ld f4add r1fadd f5 fadd f6 fadd f7 fadd f8 sd f5 sd f6 sd f7 sd f8add r2bne How many FLOPS/cycle? 4 fadds / 11 cycles = 0.36

CSE 490/590, Spring Software Pipelining loop: ld f1, 0(r1) ld f2, 8(r1) ld f3, 16(r1) ld f4, 24(r1) add r1, 32 fadd f5, f0, f1 fadd f6, f0, f2 fadd f7, f0, f3 fadd f8, f0, f4 sd f5, 0(r2) sd f6, 8(r2) sd f7, 16(r2) add r2, 32 sd f8, -8(r2) bne r1, r3, loop Int1Int 2M1M2FP+FPx Unroll 4 ways first ld f1 ld f2 ld f3 ld f4 fadd f5 fadd f6 fadd f7 fadd f8 sd f5 sd f6 sd f7 sd f8 add r1 add r2 bne ld f1 ld f2 ld f3 ld f4 fadd f5 fadd f6 fadd f7 fadd f8 sd f5 sd f6 sd f7 sd f8 add r1 add r2 bne ld f1 ld f2 ld f3 ld f4 fadd f5 fadd f6 fadd f7 fadd f8 sd f5 add r1 loop: iterate prolog epilog How many FLOPS/cycle? 4 fadds / 4 cycles = 1

CSE 490/590, Spring Software Pipelining vs. Loop Unrolling time performance time performance Loop Unrolled Software Pipelined Startup overhead Wind-down overhead Loop Iteration Software pipelining pays startup/wind-down costs only once per loop, not once per iteration

CSE 490/590, Spring What if there are no loops? Branches limit basic block size in control-flow intensive irregular code Difficult to find ILP in individual basic blocks Basic block

CSE 490/590, Spring Trace Scheduling [ Fisher,Ellis] Pick string of basic blocks, a trace, that represents most frequent branch path Use profiling feedback or compiler heuristics to find common branch paths Schedule whole “trace” at once Add fixup code to cope with branches jumping out of trace

CSE 490/590, Spring Problems with “Classic” VLIW Object-code compatibility –have to recompile all code for every machine, even for two machines in same generation Object code size –instruction padding wastes instruction memory/cache –loop unrolling/software pipelining replicates code Scheduling variable latency memory operations –caches and/or memory bank conflicts impose statically unpredictable variability Knowing branch probabilities –Profiling requires an significant extra step in build process Scheduling for statically unpredictable branches –optimal schedule varies with branch path

CSE 490/590, Spring VLIW Instruction Encoding Schemes to reduce effect of unused fields –Compressed format in memory, expand on I-cache refill »used in Multiflow Trace »introduces instruction addressing challenge –Mark parallel groups »used in TMS320C6x DSPs, Intel IA-64 –Provide a single-op VLIW instruction » Cydra-5 UniOp instructions Group 1Group 2Group 3

CSE 490/590, Spring Acknowledgements These slides heavily contain material developed and copyright by –Krste Asanovic (MIT/UCB) –David Patterson (UCB) And also by: –Arvind (MIT) –Joel Emer (Intel/MIT) –James Hoe (CMU) –John Kubiatowicz (UCB) MIT material derived from course UCB material derived from course CS252