TI C6701 VLIW MIMD.

Presentation Outline
Introduction / Overview
Differentiating Features
Assembly Syntax
Instruction Flow
Pipelining and Optimization
Conclusion

Introduction
TI’s C6000 family: VLIW architectures
Flexibility from software

Characteristics Chart
Characteristic                          C6701
Architecture                            VLIW
FPU                                     Yes
MFLOPs (Peak)                           1000
16x16 MACs (MMAC/s)                     334
8x8 MACs (MMAC/s)                       -
MIPS (Peak)                             1336
MOPS (Peak)                             336
Memory Bus Bandwidth (MB/s)             332
1K FP cfft (µsec)                       108
1K 16 bit cfft (µsec)                   -
1K FP dot product (µsec)                3.07
1K 16 bit dot product (µsec)            -
512² xFP Conv3x3 (msec)                 7.11
512² x8 bit Conv3x3 (msec)              -
512² x8 bit Erosion/Dilation (msec)     3.62
Figure 1: TI Data Sheet
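Several of the peak figures in the chart can be roughly cross-checked from the unit counts. The sketch below assumes a 167 MHz clock, which is not stated in the chart, and assumes six of the eight functional units can issue a floating-point operation each cycle. The names CLOCK_MHZ and peak_rate are made up for this illustration.

```python
# Assumed clock rate in MHz; this value is NOT given in the chart.
CLOCK_MHZ = 167


def peak_rate(units_per_cycle, clock_mhz=CLOCK_MHZ):
    """Millions of operations per second if every unit issues once per cycle."""
    return units_per_cycle * clock_mhz


peak_mips = peak_rate(8)    # eight instructions per fetch packet
peak_mmacs = peak_rate(2)   # two multipliers
peak_mflops = peak_rate(6)  # assumption: six units handle floating point

print(peak_mips, peak_mmacs, peak_mflops)
```

Under these assumptions the MIPS and 16x16 MAC figures match the chart, and the MFLOPs figure lands within rounding of the listed 1000.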

Basic Overview
Eight 32-bit instructions are fetched per clock cycle; together they are called a fetch packet.
Eight functional units (.L1, .L2, .S1, .S2, .M1, .M2, .D1, and .D2): two multipliers and six ALUs.
Two general-purpose register files (A and B).
Two load-from-memory data paths per register file (LD1a and LD1b for A, LD2a and LD2b for B).
Two data address paths (DA1 and DA2).
Two register file data cross paths (1X and 2X).

Architecture Overview

Differentiating Features
The features that differentiate the TI C6701 from other VLIW architectures are:
Execute packets of variable length (each instruction itself is a fixed 32 bits)
Predication in all instructions
Pipelining of the branch functions
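Predication can be pictured as follows: every instruction names an optional condition register, and its result is committed only when the condition holds. This is a minimal Python sketch of that idea, not a model of the real encoding. The function run_predicated, the dictionary register file, and the choice of B0 as the condition register are all assumptions made for illustration.

```python
def run_predicated(regs, cond, dest, value):
    """Write value to regs[dest] only if the condition holds.

    cond is None (always execute), "B0" (execute if B0 is non-zero)
    or "!B0" (execute if B0 is zero). Returns True if the write happened.
    """
    if cond is not None:
        negate = cond.startswith("!")
        reg = cond.lstrip("!")
        is_nonzero = regs[reg] != 0
        if is_nonzero == negate:
            return False  # condition is false: the instruction has no effect
    regs[dest] = value
    return True


regs = {"B0": 0, "A1": 0}
took_first = run_predicated(regs, "!B0", "A1", 5)   # B0 is zero, so this writes
took_second = run_predicated(regs, "B0", "A1", 9)   # B0 is zero, so this is skipped
print(regs["A1"], took_first, took_second)
```

Because the skipped instruction still occupies its slot, short if/else bodies can be scheduled without a branch.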

Assembly Syntax
Fields of an assembly line, in order:
Label
Parallel Bars
Conditions
Instruction
Functional Unit
Operands
Comments
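To make the field order concrete, here is a small Python sketch that splits one assembly line into those fields. The regular expression is a simplification written for this example, not the assembler's actual grammar, and the two sample lines are hypothetical.

```python
import re

# Simplified pattern: label, ||, [condition], instruction, functional unit,
# operands, ; comment. Only the instruction field is required.
LINE_RE = re.compile(
    r"""^\s*
    (?:(?P<label>[A-Za-z_]\w*):)?\s*     # optional label ending in a colon
    (?P<parallel>\|\|)?\s*               # optional parallel bars
    (?:\[(?P<condition>!?\w+)\])?\s*     # optional condition in brackets
    (?P<instruction>[A-Za-z]\w*)\s*      # mnemonic
    (?P<unit>\.[LSMD][12]X?)?\s*         # optional functional unit
    (?P<operands>[^;]*?)\s*              # operands up to any comment
    (?:;(?P<comment>.*))?$               # optional comment
    """,
    re.VERBOSE,
)


def parse_line(line):
    """Return a dict with one entry per field of the assembly line."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"cannot parse: {line!r}")
    fields = match.groupdict()
    fields["parallel"] = fields["parallel"] is not None
    if fields["comment"] is not None:
        fields["comment"] = fields["comment"].strip()
    return fields


first = parse_line("LOOP: [!B0] ADD .L1 A1,A2,A3 ; running sum")
second = parse_line("|| MPY .M1X A4,B4,A5")
print(first)
print(second)
```

The parallel bars on the second line mark it as executing in the same cycle as the line before it.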

Assembly Example

Instruction Flow
Eight functional units, in two separate groups of four.
Each group has its own data path and its own half of the general-purpose registers.
The units in the two groups are named in pairs: .L1 and .L2, .M1 and .M2, .S1 and .S2, and .D1 and .D2.
The .L units are responsible for:
Logical operations
Data packing and unpacking
Some arithmetic

Instruction Flow
32 general-purpose registers.
64-bit loads use the LDDW instruction: LD1a carries the least-significant 32 bits and LD1b the most-significant 32 bits.
The .D units are connected to both register files, so data can go to or come from either file regardless of which file supplied the data address.
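The split of a 64-bit load across the two load paths amounts to taking the low and high 32-bit halves of the doubleword. A minimal Python sketch follows; the function name split_doubleword is an assumption for illustration.

```python
def split_doubleword(value):
    """Split a 64-bit value into (low, high) 32-bit halves.

    low corresponds to the half carried on LD1a, high to the half on LD1b.
    """
    if not 0 <= value < 2**64:
        raise ValueError("value does not fit in 64 bits")
    low = value & 0xFFFFFFFF   # least-significant 32 bits
    high = value >> 32         # most-significant 32 bits
    return low, high


low, high = split_doubleword(0x1122334455667788)
print(hex(low), hex(high))
```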

Instruction Flow
Fetch packets are aligned on 256-bit boundaries.
Important! An execute packet can’t cross a fetch packet boundary.
Execute packets are formed from the p-bit, the least-significant bit (bit 0) of each instruction: when it is set, the next instruction executes in parallel with this one.
At most eight instructions execute in parallel.
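The grouping rule can be sketched in a few lines of Python. The sketch assumes the p-bit is bit 0 of each 32-bit instruction word and that a set p-bit chains the next instruction into the same execute packet. The function name and the sample words are made up, and the sample words are not real opcodes.

```python
def split_execute_packets(fetch_packet):
    """Group the eight words of a fetch packet into execute packets."""
    if len(fetch_packet) != 8:
        raise ValueError("a fetch packet holds exactly eight instructions")
    packets = []
    current = []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:      # p-bit clear: this word ends the execute packet
            packets.append(current)
            current = []
    if current:
        # The last word had its p-bit set, so the packet would spill over.
        raise ValueError("execute packet would cross the fetch packet boundary")
    return packets


# p-bits of the sample words are 1,1,0,0,1,1,1,0
words = [0x1, 0x3, 0x2, 0x0, 0x5, 0x7, 0x9, 0x4]
packets = split_execute_packets(words)
print([len(p) for p in packets])
```

A fetch packet of eight fully serial instructions yields eight execute packets, and a fully parallel one yields a single packet of eight.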

Architecture Overview

Pipelining & Optimization
The C6701 has no hardware to look ahead and schedule instructions; the schedule is fixed in the code.
The number of instructions in each execute packet is the key to optimizing the code.
The number of extra clock cycles before an instruction’s result is available is called its number of delay slots.
Multi-cycle instructions such as multiplies, loads, and branches have delay slots that the schedule must account for.

Pipelining & Optimization
It is possible to have an execute packet that contains NOPs.
Padding a multi-cycle instruction with NOPs lets the next execute packet use that instruction’s result.
If a multi-cycle instruction uses a cross path, that cross path can’t be used again until the instruction has finished.
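The amount of NOP padding is just the delay slots left unfilled by useful work. The Python sketch below illustrates that arithmetic. The delay-slot values in the table are assumptions recalled from TI's C6000 documentation, not figures given in these slides, and nops_needed is a hypothetical helper.

```python
# Assumed delay slots per instruction class; not taken from the slides.
DELAY_SLOTS = {"single_cycle": 0, "multiply": 1, "load": 4, "branch": 5}


def nops_needed(kind, useful_cycles=0):
    """NOP cycles to insert before the instruction's effect can be relied on.

    useful_cycles counts independent execute packets already scheduled
    between the multi-cycle instruction and the code that depends on it.
    """
    return max(0, DELAY_SLOTS[kind] - useful_cycles)


print(nops_needed("load"))         # nothing fills the slots
print(nops_needed("load", 3))      # three independent packets fill most of them
print(nops_needed("multiply", 2))  # already covered by useful work
```

Every NOP cycle removed by scheduling independent work into a delay slot is a cycle saved, which is why filling the slots matters for optimization.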

Execution Pipeline

AD vs. TI vs. Motorola

Conclusion
The C6701 leaves instruction scheduling to the assembly code.
Unfortunately, a good understanding of the hardware is still necessary to schedule instructions in an optimized way.
Thank You
