VLIW: Very Long Instruction Word
Introduction
Very Long Instruction Word (VLIW) is a processor design concept that dates back to the early 1980s. The term refers to the size of each instruction the processor executes: "very long" compared with the instruction word size used by most current mainstream (superscalar) processors.
Introduction
Most non-VLIW processors use complex hardware units to schedule instructions in an overlapping fashion known as pipelining. Pipelining allows multiple operations to execute simultaneously, in a cascading fashion, to maximize utilization of the processing units. Because this scheduling happens at runtime, the hardware is under pressure to order instructions correctly on the fly.
Introduction
Many techniques are used to predict upcoming instructions for maximum scheduling efficiency: which branches the code will take, which registers will be accessed next, which operations will be requested. These algorithms are complicated and tend to bloat the processing hardware, and since the scheduling is done on the fly, mispredictions waste time.
Introduction
VLIW code, by contrast, is ordered for the processor at compile time, before the code is ever executed. As a VLIW compiler processes the code, it determines which instructions can execute simultaneously, often via a process called trace scheduling, and groups them to form the long instruction words the technology is named for.
Introduction
The long instructions can be executed easily by the hardware, which in turn is made less complex by the structure of the bits being fed to it. The hardware generally consists of multiple identical execution units.
Introduction VLIW processing ideas have roots in Alan Turing's 1946 parallel computing studies and Maurice Wilkes's 1951 microprogramming work.
Introduction
In a microprogrammed CPU, each program instruction corresponds to a macroinstruction, and each macroinstruction has a corresponding sequence of microinstructions kept in ROM on the CPU. When these microinstructions are arranged as wide sets of control signals, the scheme is called horizontal microprogramming.
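To make this concrete, here is a minimal sketch of a horizontal microinstruction modeled as a C bit-field struct. The field names and widths are invented for illustration and do not correspond to any real CPU:

    #include <stdint.h>

    /* Hypothetical horizontal microinstruction: each field drives a group
       of control signals directly, with no further decoding step. */
    struct micro_word {
        uint32_t alu_op     : 4;  /* ALU operation select */
        uint32_t src_a_sel  : 3;  /* register-file read port A */
        uint32_t src_b_sel  : 3;  /* register-file read port B */
        uint32_t dest_sel   : 3;  /* write-back destination */
        uint32_t mem_read   : 1;  /* assert memory read this cycle */
        uint32_t mem_write  : 1;  /* assert memory write this cycle */
        uint32_t branch_ctl : 2;  /* next-microaddress selection mode */
        uint32_t next_addr  : 12; /* successor microinstruction address */
    };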
Introduction
When Joseph Fisher was writing horizontal microcode for a CDC-6600 emulator in 1979, he began to work on the problem of generating long instruction words from short sequential instructions. The techniques he developed, called "trace scheduling," were essential for generating VLIW-compatible code.
Introduction VLIW has been slow to gain market acceptance due in large part to the human programming difficulties involved. VLIW's advantages come largely from having an intelligent compiler that can schedule many instructions simultaneously (in a large word).
Introduction
Early VLIW implementations looked for instruction-level parallelism (ILP) only within basic blocks and could not follow complex branches, so little optimization was possible.
Introduction Authoring a compiler to effectively predict code paths is easily the largest hurdle of VLIW design. Hence the interest in SequenceL as a VLIW language.
Introduction
Another big problem is that VLIW-compatible code is largely proprietary to the hardware of the chip it is designed for: code compiled for a processor with five execution units will be incompatible with one that has seven. The inflexibility inherent in microchip design makes this a problem.
Introduction
VLIW also suffers from the inflexibility of its compiler-first design. Since instructions are ordered at compile time, unanticipated memory delays (e.g., latency, cache misses) cannot be accounted for without deviating from a pure VLIW design, that is, without adding superscalar elements to the processor.
Example of a VLIW
A VLIW instruction is a set of independent operations that are issued simultaneously; there is no sequential notion within a VLIW. One instruction is issued every cycle, which provides the notion of time, and resource assignment is indicated by an operation's position in the word:

    add | sub | load | store | mpy | shift | branch
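As an illustration, such a word could be modeled in C as one operation per functional-unit slot. The encoding below is hypothetical; only the slot layout mirrors the format above:

    #include <stdint.h>

    /* One operation slot: opcode plus register operands (invented encoding;
       opcode 0 is taken to mean NOP for an unused slot). */
    typedef struct {
        uint8_t opcode;
        uint8_t dest, src1, src2;
    } op_t;

    /* A VLIW instruction: an operation's position selects its functional
       unit, so no run-time scheduling hardware is needed. */
    typedef struct {
        op_t add_slot, sub_slot, load_slot, store_slot;
        op_t mpy_slot, shift_slot, branch_slot;
    } vliw_word_t;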
[Figure: a VLIW datapath. The Icache feeds one VLIW instruction of 5 independent operations (visible slots include add, nop, load, and store) to the execution units (Add, Mpy, Mem), all sharing a single register file.]
VLIW
How can the processing units be kept busy by the compiler? One answer: unroll loops.
Unroll loops

    for (i = 0; i < n; i++) {
        a[i] = b[i] * c[i];
    }

becomes (assuming n is even):

    for (i = 0; i < n; i += 2) {
        a[i]     = b[i]     * c[i];
        a[i + 1] = b[i + 1] * c[i + 1];
    }
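A runnable version of the same transformation, including the cleanup loop needed when n is not a multiple of the unroll factor (a detail the slide omits; the function name is my own):

    /* Unrolled element-wise multiply with cleanup for odd n. */
    void vmul_unrolled(float *a, const float *b, const float *c, int n) {
        int i = 0;
        /* Main loop: two independent multiplies per iteration, which a
           VLIW compiler can pack into the same wide instruction. */
        for (; i + 1 < n; i += 2) {
            a[i]     = b[i]     * c[i];
            a[i + 1] = b[i + 1] * c[i + 1];
        }
        for (; i < n; i++)   /* cleanup when n is odd */
            a[i] = b[i] * c[i];
    }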
Optimizing unrolled loops
Unroll = replicate the loop body n-1 times, hoping to enable overlap of operation execution from different iterations. The original loop:

    loop:
        r1 = load(r2)
        r3 = load(r4)
        r5 = r1 * r3
        r6 = r6 + r5
        r2 = r2 + 4
        r4 = r4 + 4
        if (r4 < 400) goto loop

unrolled 3 times:

    loop:
        r1 = load(r2)      ; iter1
        r3 = load(r4)
        r5 = r1 * r3
        r6 = r6 + r5
        r2 = r2 + 4
        r4 = r4 + 4
        r1 = load(r2)      ; iter2
        r3 = load(r4)
        r5 = r1 * r3
        r6 = r6 + r5
        r2 = r2 + 4
        r4 = r4 + 4
        r1 = load(r2)      ; iter3
        r3 = load(r4)
        r5 = r1 * r3
        r6 = r6 + r5
        r2 = r2 + 4
        r4 = r4 + 4
        if (r4 < 400) goto loop

Overlap is still not possible: every iteration reuses the same registers.
Register renaming on unrolled loop
Renaming gives each unrolled iteration its own result registers:

    loop:
        r1 = load(r2)          ; iter1
        r3 = load(r4)
        r5 = r1 * r3
        r6 = r6 + r5
        r2 = r2 + 4
        r4 = r4 + 4
        r11 = load(r2)         ; iter2
        r13 = load(r4)
        r15 = r11 * r13
        r6 = r6 + r15
        r2 = r2 + 4
        r4 = r4 + 4
        r21 = load(r2)         ; iter3
        r23 = load(r4)
        r25 = r21 * r23
        r6 = r6 + r25
        r2 = r2 + 4
        r4 = r4 + 4
        if (r4 < 400) goto loop
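A C-level analog of this renaming, sketched for the element-wise multiply loop from the earlier slide (function name and cleanup loop are my additions): each unrolled iteration gets its own temporary instead of reusing one, removing the false dependences between iterations.

    void vmul_renamed(float *a, const float *b, const float *c, int n) {
        int i = 0;
        for (; i + 2 < n; i += 3) {
            float t0 = b[i]     * c[i];     /* iter1 */
            float t1 = b[i + 1] * c[i + 1]; /* iter2 */
            float t2 = b[i + 2] * c[i + 2]; /* iter3 */
            a[i] = t0; a[i + 1] = t1; a[i + 2] = t2;
        }
        for (; i < n; i++)                  /* cleanup for leftovers */
            a[i] = b[i] * c[i];
    }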
Register renaming is not enough!
Even after renaming, not much overlap is possible. The problem: r2, r4, and r6 sequentialize the iterations, so these need to be renamed too. Two specialized renaming optimizations do this:
- Accumulator variable expansion (r6)
- Induction variable expansion (r2, r4)
Accumulator variable expansion
An accumulator variable has the form x = x + y or x = x - y where y is loop variant. The expansion creates n-1 temporary accumulators, each iteration targets a different accumulator, and the accumulators are summed at the end:

        r16 = r26 = 0
    loop:
        r1 = load(r2)          ; iter1
        r3 = load(r4)
        r5 = r1 * r3
        r6 = r6 + r5
        r2 = r2 + 4
        r4 = r4 + 4
        r11 = load(r2)         ; iter2
        r13 = load(r4)
        r15 = r11 * r13
        r16 = r16 + r15
        r2 = r2 + 4
        r4 = r4 + 4
        r21 = load(r2)         ; iter3
        r23 = load(r4)
        r25 = r21 * r23
        r26 = r26 + r25
        r2 = r2 + 4
        r4 = r4 + 4
        if (r4 < 400) goto loop
        r6 = r6 + r16 + r26
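In C terms, accumulator expansion on a dot-product loop (unroll factor 3, hypothetical function name) looks like this sketch:

    float dot_expanded(const float *b, const float *c, int n) {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f; /* one accumulator per slot */
        int i;
        for (i = 0; i + 2 < n; i += 3) {
            s0 += b[i]     * c[i];     /* the three additions no longer */
            s1 += b[i + 1] * c[i + 1]; /* depend on each other, so they */
            s2 += b[i + 2] * c[i + 2]; /* can issue in the same cycle   */
        }
        for (; i < n; i++)             /* cleanup for leftover elements */
            s0 += b[i] * c[i];
        return s0 + s1 + s2;           /* sum the accumulators at the end */
    }

Note that for floating-point data this reassociates the additions, so the result can differ in the last bits from the sequential sum; compilers therefore apply it only under relaxed floating-point settings.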
Induction variable expansion
An induction variable has the form x = x + y or x = x - y where y is loop invariant. The expansion creates n-1 additional induction variables; each iteration uses and modifies its own, initialized to init, init+step, init+2*step, etc., and the step is increased to n times the original step. Now the iterations are completely independent:

        r16 = r26 = 0
        r12 = r2 + 4, r22 = r2 + 8
        r14 = r4 + 4, r24 = r4 + 8
    loop:
        r1 = load(r2)          ; iter1
        r3 = load(r4)
        r5 = r1 * r3
        r6 = r6 + r5
        r2 = r2 + 12
        r4 = r4 + 12
        r11 = load(r12)        ; iter2
        r13 = load(r14)
        r15 = r11 * r13
        r16 = r16 + r15
        r12 = r12 + 12
        r14 = r14 + 12
        r21 = load(r22)        ; iter3
        r23 = load(r24)
        r25 = r21 * r23
        r26 = r26 + r25
        r22 = r22 + 12
        r24 = r24 + 12
        if (r4 < 400) goto loop
        r6 = r6 + r16 + r26
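The same transformation sketched at the C level: each unrolled iteration gets its own pointer pair, initialized with a different offset and stepped by three times the original stride (function name and cleanup loop are my additions):

    float dot_ive(const float *b, const float *c, int n) {
        /* Three induction-variable copies, initialized to init,
           init+step, init+2*step, each stepped by 3 * original step. */
        const float *p0 = b,     *q0 = c;
        const float *p1 = b + 1, *q1 = c + 1;
        const float *p2 = b + 2, *q2 = c + 2;
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f;
        int i;
        for (i = 0; i + 2 < n; i += 3) {
            s0 += p0[0] * q0[0]; p0 += 3; q0 += 3;
            s1 += p1[0] * q1[0]; p1 += 3; q1 += 3;
            s2 += p2[0] * q2[0]; p2 += 3; q2 += 3;
        }
        for (; i < n; i++)          /* cleanup for leftover elements */
            s0 += b[i] * c[i];
        return s0 + s1 + s2;
    }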
Better induction variable expansion
With base+displacement addressing, additional induction variables are often unnecessary: just change the offsets in each iteration to reflect the step, and change the final increments to n times the original step:

        r16 = r26 = 0
    loop:
        r1 = load(r2)          ; iter1
        r3 = load(r4)
        r5 = r1 * r3
        r6 = r6 + r5
        r11 = load(r2+4)       ; iter2
        r13 = load(r4+4)
        r15 = r11 * r13
        r16 = r16 + r15
        r21 = load(r2+8)       ; iter3
        r23 = load(r4+8)
        r25 = r21 * r23
        r26 = r26 + r25
        r2 = r2 + 12
        r4 = r4 + 12
        if (r4 < 400) goto loop
        r6 = r6 + r16 + r26
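A C sketch of the base+displacement variant: one pointer per array, constant offsets inside the body, and a single increment per unrolled iteration (again with an added cleanup loop):

    float dot_base_disp(const float *b, const float *c, int n) {
        const float *pb = b, *pc = c;  /* single pair of induction variables */
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f;
        int i;
        for (i = 0; i + 2 < n; i += 3) {
            s0 += pb[0] * pc[0];  /* base + 0 */
            s1 += pb[1] * pc[1];  /* base + displacement 1 */
            s2 += pb[2] * pc[2];  /* base + displacement 2 */
            pb += 3; pc += 3;     /* one increment of 3 * original step */
        }
        for (; i < n; i++)        /* cleanup for leftover elements */
            s0 += b[i] * c[i];
        return s0 + s1 + s2;
    }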
Scheduling
Loop unrolling generates straight-line code that can be scheduled for parallel execution using local scheduling techniques. Scheduling code across branches requires a more complex global scheduling algorithm.
Global Scheduling
One global scheduling technique is trace scheduling, which has two steps (a selection sketch follows below):
1. Trace selection tries to find a likely sequence of basic blocks whose operations can be put together into a smaller number of instructions; this sequence is called a trace.
2. Trace compaction tries to squeeze the trace into a small number of wide instructions.
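A toy sketch of the trace-selection step, using invented data structures: the trace seeds at the most frequently executed unscheduled block and grows forward along each block's most likely successor. Real implementations also grow traces backward and respect loop boundaries:

    /* Hypothetical profile data for one basic block. */
    typedef struct {
        int count;        /* profiled execution frequency */
        int likely_succ;  /* index of most probable successor, -1 if none */
        int scheduled;    /* already placed in an earlier trace? */
    } block_t;

    /* Greedy trace selection: fills 'trace' with block indices and
       returns the trace length (0 when every block is scheduled). */
    int select_trace(block_t *blk, int n, int *trace) {
        int seed = -1, len = 0;
        for (int i = 0; i < n; i++)   /* seed = hottest unscheduled block */
            if (!blk[i].scheduled && (seed < 0 || blk[i].count > blk[seed].count))
                seed = i;
        for (int b = seed; b >= 0 && !blk[b].scheduled; b = blk[b].likely_succ) {
            trace[len++] = b;         /* grow forward along the likely path */
            blk[b].scheduled = 1;
        }
        return len;
    }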
VLIW Processor
Transmeta's Crusoe line of processors was one of the first general-purpose VLIW architectures to be launched. It was designed with mobile applications in mind, running at low temperatures and consuming 60 to 70% less power than a comparable RISC chip, according to Transmeta. The chip can be found in notebook computers (pictured: Toshiba Satellite R15-829).