Intro to the “c6x” VLIW processor

Presentation transcript:

Intro to the “c6x” VLIW processor
Texas Instruments TMS320C6000 series
TMS320C6700 subseries: includes floating point
VLIW = Very Long Instruction Word

Operations in Parallel
[Figure: registers, function units]

Operations in Parallel
[Figure: registers, bypassing, function units]

Non-orthogonal
[Figure: two register files, bypass, function units]

Non-orthogonal (see TI's picture)
[Figure: register files A and B, bypass, function units L1 S1 M1 D1 and L2 S2 M2 D2]

Specialized Function Units
L units: arithmetic, compare, and logical ops
S units: arithmetic, logical, branches, constant generation
M units: multiplies
D units: address generation / memory accesses
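The unit classes above suggest a simple resource check, sketched here in Python. It assumes two units of each kind (L1/L2, S1/S2, M1/M2, D1/D2, as named on the earlier slide) and that each instruction is already tagged with the kind of unit it needs. The function name `fits_in_one_packet` is made up for illustration, and real packing rules (cross paths, register ports, and so on) are not modeled.

```python
from collections import Counter

# Assumption: two units of each kind, one per register-file side.
UNITS_PER_KIND = {"L": 2, "S": 2, "M": 2, "D": 2}

def fits_in_one_packet(unit_kinds):
    """Return True if no unit kind is requested more often than it exists.

    unit_kinds is a list such as ["M", "M", "D"], one entry per instruction.
    """
    counts = Counter(unit_kinds)
    return all(
        kind in UNITS_PER_KIND and n <= UNITS_PER_KIND[kind]
        for kind, n in counts.items()
    )

# Two multiplies, a memory access and a branch can share a cycle;
# three multiplies cannot, because only M1 and M2 exist.
print(fits_in_one_packet(["M", "M", "D", "S"]))
print(fits_in_one_packet(["M", "M", "M"]))
```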

Complicated hardware
[Figure: two register files]

Explicit parallelism
[Figure: two register files]

Simple VLIW encoding
Slots that cannot be utilized are filled with no-ops.
Bad for code density, cache utilization, energy, ...
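To make the code-density cost concrete, here is a small Python sketch that counts how many slots a fixed-width encoding spends on no-ops. The schedule is invented for illustration, and the 8-slot width is an assumption matching the 8-wide machine discussed in these slides.

```python
WIDTH = 8  # assumed issue slots per instruction word

def padded_slots(schedule, width=WIDTH):
    """Slots stored when every cycle is encoded as one full-width word."""
    return len(schedule) * width

def useful_slots(schedule):
    """Slots that actually hold an operation."""
    return sum(len(cycle) for cycle in schedule)

# Hypothetical schedule: each inner list is the set of ops issued in one cycle.
schedule = [["ld", "ld"], ["mpy"], ["add", "sub", "cmp"], ["st"]]

total = padded_slots(schedule)
used = useful_slots(schedule)
print(f"{used} useful ops in {total} slots, {total - used} no-ops")
```

In this made-up schedule most of the encoded slots are padding, which is the waste that a packet-based encoding tries to avoid.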

C6X: Packets
One bit of each instruction indicates whether the next instruction can be executed in parallel (0 = “EOP”, end of packet).
Any slot can go to any function unit.
[Figure: per-instruction bits 1 1 1 1 1 1]


C6X: Packets
One bit of each instruction indicates whether the next instruction can be executed in parallel.
Any slot can go to any function unit.
[Figure: per-instruction bits 1 1 1 1 1 1 1 1 1 1 1 1]
A packet cannot cross an 8-word boundary.
Resources constrain which instructions can be combined in the same packet.
You can branch into the middle of a packet!
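The parallel-bit rule can be sketched as a short Python routine that groups one 8-instruction fetch window into packets. The function name and the list-based representation are illustrative assumptions; the only rules modeled are the ones stated on the slide (a bit of 1 chains to the next instruction, 0 ends the packet, and a packet may not run past the 8-word boundary).

```python
def split_execute_packets(instrs, p_bits):
    """Group instructions into packets using one parallel bit per instruction.

    p_bits[i] == 1 means instruction i+1 executes in parallel with
    instruction i; p_bits[i] == 0 marks the end of a packet.
    """
    if len(instrs) != len(p_bits):
        raise ValueError("need exactly one bit per instruction")
    packets, current = [], []
    for ins, p in zip(instrs, p_bits):
        current.append(ins)
        if p == 0:
            packets.append(current)
            current = []
    if current:
        # The last bit was 1, so the packet would continue past this window.
        raise ValueError("packet crosses the 8-word boundary")
    return packets

# Hypothetical 8-instruction window with placeholder instruction names.
window = ["A", "B", "C", "D", "E", "F", "G", "H"]
bits = [1, 1, 0, 1, 0, 0, 1, 0]
print(split_execute_packets(window, bits))
```

With these bits the window splits into four packets of sizes 3, 2, 1 and 2, and no slot is spent on padding.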

Explicit scheduling
Delay slots must be respected: no HW interlocks or scoreboarding.
Multiply: 1 delay slot
Load: 4 delay slots
Branch: 5 delay slots
Right:
  B5 := B3 * B2
  (one delay slot)
  B7 := B5 + B1
Wrong:
  B5 := B3 * B2
  B7 := B5 + B1   (issues too early and reads the old B5)
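The following Python sketch is a toy model of why the schedule matters when there are no interlocks. It is not a cycle-accurate C6x simulator: it assumes one instruction per cycle, only models `mpy`, `add` and `nop`, and uses the delay-slot counts from the slide (the add is assumed to have none). The register values are arbitrary.

```python
DELAY_SLOTS = {"mpy": 1, "add": 0}  # from the slide; add assumed to have none

def run(program, regs):
    """Execute (op, dst, src1, src2) tuples, one per cycle, with no interlocks."""
    regs = dict(regs)
    pending = []  # (cycle when the write becomes visible, dst, value)
    for cycle, (op, dst, a, b) in enumerate(program):
        # Commit every result whose delay slots have elapsed.
        for _, reg, value in sorted(w for w in pending if w[0] <= cycle):
            regs[reg] = value
        pending = [w for w in pending if w[0] > cycle]
        if op == "nop":
            continue
        value = regs[a] * regs[b] if op == "mpy" else regs[a] + regs[b]
        pending.append((cycle + 1 + DELAY_SLOTS[op], dst, value))
    for _, reg, value in sorted(pending):  # drain outstanding writes
        regs[reg] = value
    return regs

init = {"B1": 10, "B2": 3, "B3": 4, "B5": 100, "B7": 0}
right = [("mpy", "B5", "B3", "B2"), ("nop", None, None, None), ("add", "B7", "B5", "B1")]
wrong = [("mpy", "B5", "B3", "B2"), ("add", "B7", "B5", "B1")]

print(run(right, init)["B7"])  # add sees the product
print(run(wrong, init)["B7"])  # add sees the stale B5
```

In this model the correctly spaced version computes 4 * 3 + 10, while the back-to-back version silently adds the stale value of B5 instead; nothing stalls or faults.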

Predicated execution
Why? To get rid of branches (a branch has 5 delay slots on an 8-wide machine, so up to 40 instruction slots to fill).
Basic idea: a comparison result is stored to a condition register; this register is then used as an operand of other instructions, and its value causes those operations to be selectively enabled or squashed.
[Condition registers: A1, A2, B0, B1, B2]
Example: if (B3<B4) B3++ else B4++

Predicated execution
With branches:
  cmp B3, B4
  bge L2
  <nop>
  B3 := B3+1
  b DONE
  L2: B4 := B4+1
  DONE:
With predicates:
  cmplt B3, B4, B0
  [B0] B3 := B3+1
  [!B0] B4 := B4+1
...and the last two can be issued in parallel!
Control dependency has been converted to data dependency...
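As a rough illustration of the conversion, the Python sketch below writes the same if/else both ways. The helper `predicated` is hypothetical: it stands in for the hardware's enable-or-squash behavior, where the operation always issues and the predicate only decides whether its result is written back.

```python
def with_branches(b3, b4):
    """Control-flow version: if (B3 < B4) B3++ else B4++."""
    if b3 < b4:
        b3 += 1
    else:
        b4 += 1
    return b3, b4

def predicated(pred, old, new):
    """Hypothetical model of a predicated write: keep old unless pred is set."""
    return new if pred else old

def with_predicates(b3, b4):
    """Data-flow version: both increments issue, one of them is squashed."""
    b0 = int(b3 < b4)                          # cmplt B3, B4, B0
    new_b3 = predicated(b0, b3, b3 + 1)        # [B0]  B3 := B3+1
    new_b4 = predicated(not b0, b4, b4 + 1)    # [!B0] B4 := B4+1
    return new_b3, new_b4

print(with_branches(2, 5), with_predicates(2, 5))
print(with_branches(5, 2), with_predicates(5, 2))
```

The two predicated updates read only the original B3, B4 and B0, which is why they have no dependency on each other and can share a cycle.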

Assembly details
        .text
        .align 32
        .global proc
proc:
        mvk 4, b3
        cmpgt b3, b4, b0
  [ b0] mvk.S2 9, b5
||[!b0] mvk.S1 8, a5
        stw a5, *-a15[4]
        .....

Fetch/execute pipeline
PG  generate program address
PS  program address send
PW  program memory access
PR  fetch reaches CPU boundary
DP  instruction dispatch
DC  instruction decode
E1  execute 1
E2  execute 2
E3  execute 3
E4  execute 4
E5  execute 5

Addressing Modes
Mode              C equivalent
*R                (*R)
*+R[ucst5]        (R[ucst5])
*+R[offsetR]      (R[offsetR])
*-R[offsetR]      (R[-offsetR])
Special case: 15-bit offsets
*+B15[ucst15]
*+B14[ucst15]

Addressing Modes
Pre/post increment/decrement
*++R, *R++
*++R[ucst5], *R++[ucst5]
*--R[ucst5], *R--[ucst5]
*++R[offsetR], *R++[offsetR]
*--R[offsetR], *R--[offsetR]
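The difference between the plain-offset, pre-increment and post-increment forms can be sketched in Python. This is a simplified model following the "C equivalent" column: memory is a list of words, the base register holds an element index rather than a byte address, and the `Reg` class and function names are invented for illustration.

```python
class Reg:
    """Hypothetical base register holding an element index into memory."""
    def __init__(self, value):
        self.value = value

def load_offset(mem, r, off):
    """*+R[off]: read at R + off, leave R unchanged."""
    return mem[r.value + off]

def load_pre_inc(mem, r, off):
    """*++R[off]: update R first, then read at the new R."""
    r.value += off
    return mem[r.value]

def load_post_inc(mem, r, off):
    """*R++[off]: read at the old R, then update R."""
    value = mem[r.value]
    r.value += off
    return value

mem = [10, 20, 30, 40, 50]
r = Reg(1)
print(load_offset(mem, r, 2), r.value)    # reads mem[3], R stays at 1
print(load_pre_inc(mem, r, 2), r.value)   # R becomes 3, reads mem[3]
print(load_post_inc(mem, r, 1), r.value)  # reads mem[3], R becomes 4
```

The decrement forms would mirror these with a subtraction in place of the addition.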

Resources
http://www.cs.cmu.edu/~tcal/15745/