C66x CorePac: Achieving High Performance. Agenda 1.CorePac Architecture 2.Single Instruction Multiple Data (SIMD) 3.Memory Access 4.Pipeline Concept.

Slides:



Advertisements
Similar presentations
CPU Structure and Function
Advertisements

DSPs Vs General Purpose Microprocessors
Lecture 4 Introduction to Digital Signal Processors (DSPs) Dr. Konstantinos Tatas.
KeyStone Training Multicore Applications Literature Number: SPRP814
Computer Organization and Architecture
CSCI 4717/5717 Computer Architecture
Lecture 6 Programming the TMS320C6x Family of DSPs.
Dr. Rabie A. Ramadan Al-Azhar University Lecture 3
KeyStone C66x CorePac Overview
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture VLIW Steve Ko Computer Sciences and Engineering University at Buffalo.
Parallell Processing Systems1 Chapter 4 Vector Processors.
Chapter 8. Pipelining. Instruction Hazards Overview Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline.
Chapter 12 CPU Structure and Function. CPU Sequence Fetch instructions Interpret instructions Fetch data Process data Write data.
Computer Organization and Architecture
Computer Organization and Architecture
Computer Architecture and Data Manipulation Chapter 3.
C66x CorePac: Achieving High Performance. Agenda 1.CorePac Architecture 2.Single Instruction Multiple Data (SIMD) 3.Memory Access 4.Pipeline Concept.
This presentation will probably involve audience discussion, which will create action items. Use PowerPoint to keep track of these action items during.
Instructor: Senior Lecturer SOE Dan Garcia CS 61C: Great Ideas in Computer Architecture Pipelining Hazards 1.
Chapter 12 Pipelining Strategies Performance Hazards.
Alyssa Concha Microprocessors Final Project ADSP – SHARC Digital Signal Processor.
Chapter 12 CPU Structure and Function. Example Register Organizations.
(6.1) Central Processing Unit Architecture  Architecture overview  Machine organization – von Neumann  Speeding up CPU operations – multiple registers.
Intel Pentium 4 Processor Presented by Presented by Steve Kelley Steve Kelley Zhijian Lu Zhijian Lu.
SUPERSCALAR EXECUTION. two-way superscalar The DLW-2 has two ALUs, so it’s able to execute two arithmetic instructions in parallel (hence the term two-way.
1 Instant replay  The semester was split into roughly four parts. —The 1st quarter covered instruction set architectures—the connection between software.
Motivation Mobile embedded systems are present in: –Cell phones –PDA’s –MP3 players –GPS units.
Basics and Architectures
Lecture 15: Pipelining and Hazards CS 2011 Fall 2014, Dr. Rozier.
RICE UNIVERSITY Implementing the Viterbi algorithm on programmable processors Sridhar Rajagopal Elec 696
© 2009, Renesas Technology America, Inc., All Rights Reserved 1 Course Introduction  Purpose:  This course provides an overview of the SH-2 32-bit RISC.
DSP Processors We have seen that the Multiply and Accumulate (MAC) operation is very prevalent in DSP computation computation of energy MA filters AR filters.
Lecture 8: Processors, Introduction EEN 312: Processors: Hardware, Software, and Interfacing Department of Electrical and Computer Engineering Spring 2014,
Lecture 14: Processors CS 2011 Fall 2014, Dr. Rozier.
Chapter 8 CPU and Memory: Design, Implementation, and Enhancement The Architecture of Computer Hardware and Systems Software: An Information Technology.
Overview of Super-Harvard Architecture (SHARC) Daniel GlickDaniel Glick – May 15, 2002 for V (Dewar)
Chapter 2 Data Manipulation. © 2005 Pearson Addison-Wesley. All rights reserved 2-2 Chapter 2: Data Manipulation 2.1 Computer Architecture 2.2 Machine.
Introduction to Microprocessors
Computer Architecture 2 nd year (computer and Information Sc.)
Different Microprocessors Tamanna Haque Nipa Lecturer Dept. of Computer Science Stamford University Bangladesh.
Processor Structure and Function Chapter8:. CPU Structure  CPU must:  Fetch instructions –Read instruction from memory  Interpret instructions –Instruction.
Playstation2 Architecture Architecture Hardware Design.
Different Microprocessors Tamanna Haque Nipa Lecturer Dept. of Computer Science Stamford University Bangladesh.
Instructor: Senior Lecturer SOE Dan Garcia CS 61C: Great Ideas in Computer Architecture Pipelining Hazards 1.
CPUz 4 n00bz.
Chapter 11 System Performance Enhancement. Basic Operation of a Computer l Program is loaded into memory l Instruction is fetched from memory l Operands.
STUDY OF PIC MICROCONTROLLERS.. Design Flow C CODE Hex File Assembly Code Compiler Assembler Chip Programming.
GCSE Computing - The CPU
Central Processing Unit Architecture
A Closer Look at Instruction Set Architectures
Dr. Michael Nasief Lecture 2
Pipelining: Advanced ILP
CSCE 212 Chapter 5 The Processor: Datapath and Control
Introduction to Digital Signal Processors (DSPs)
Superscalar Processors & VLIW Processors
EE 445S Real-Time Digital Signal Processing Lab Spring 2014
The Von Neumann Model Basic components Instruction processing
* From AMD 1996 Publication #18522 Revision E
TI C6701 VLIW MIMD.
Digital Signal Processors-1
Superscalar and VLIW Architectures
GCSE Computing - The CPU
Guest Lecturer: Justin Hsia
Computer Architecture Assembly Language
Presentation transcript:

C66x CorePac: Achieving High Performance

Agenda 1.CorePac Architecture 2.Single Instruction Multiple Data (SIMD) 3.Memory Access 4.Pipeline Concept

CorePac Architecture 1.CorePac Architecture 2.Single Instruction Multiple Data (SIMD) 3.Memory Access 4.Pipeline Concept

C66x CorePac DSP Core Instruction Fetch M S L D bit M S L D Level 1 Data Memory (L1D)  Single-Cycle  Cache / RAM Reg A [32] Reg B [32] Level 1 Program Memory (L1P)  Single-Cycle  Cache / RAM Level 2 Memory (L2)  Program / Data  Cache / RAM Memory Controller CorePac includes: DSP Core Two registers Four functional units per register side L1P memory (Cache/RAM) L1D memory (Cache/RAM) L2 memory (Cache/RAM)

C66x DSP Core Four functional units per side: o Multiplier (.M) o ALU (.L) o Data (.D) o Control (.S) These independent functional units enable efficient execution of parallel specialized instructions: o Multiplier (.M1and.M2) and ALU (.L1 and.L2) provide MAC (multiple accumulation) operations. o Data (.D) provides data input/output. o Control (.S) provides control functions (loop, branch, call). Each DSP core dispatches up to eight parallel instructions each cycle. All instructions are conditional, which enables efficient pipelining. The optimized C compiler generates efficient target code. Memory A0 A31..S1.D1.L1.S2.M1.M2.D2.L2 B0 B31. Controller/Decoder MACs

C66x DSP Core Cross-Path A0 A1 A2 A3 A4 Register File A B0 B1 B2 B3 B4 Register File B A31B31 Any 64-bit pair of registers from A can be one of the inputs to a B functional unit, and vice versa. A A.D1.S1.M1.L1 B B.D1.S1.M1.L1

Partial List of.D Instructions

Partial List of.L Instructions

Partial List of.M Instructions

Partial List of.S Instructions

C66x CorePac Improvements Over C64x+ Wider internal bus – 64 bit for the.L and.S functional units – 128 bit for the.M functional unit Wider cross path – 64 bit for each direction 4x number of multipliers – More SIMD instructions Enhanced instruction set – More than 100 new instructions added (compared to c64+)

Enhanced C66x Instruction Set New SIMD instructions:  QMPY32 – 4-way SIMD of MYP32  DDOTP4H – 2-way SIMD of DOTP4H  DPACKL2 – SIMD version of PACKL2  DAVGU4 – Average of 8 packed unsigned bytes New floating-point instructions:  MPYDP – Double Precision Multiplication  FMPYDP – Fast Double Precision multiplication  DINTSP – 2-Way SIMD Convert 32-bits Unsigned Integer to Single Precision Floating Point

Interesting New C66x Instructions MFENCE (Memory Fence) Stall instruction pipeline until memory system is done. RCPSP (Single-Precision Floating-Point Reciprocal Approximation) RSQRSP (Single-Precision Floating-Point Square-Root Reciprocal Approximation)

Single Instruction Multiple Data (SIMD) 1.CorePac Architecture 2.Single Instruction Multiple Data (SIMD) 3.Memory Access 4.Pipeline Concept

C66x SIMD Instructions: Examples ADDDP – Add Two Double-Precision Floating-Point Values DADD2 – 4-Way SIMD Addition, Packed Signed 16-bit – Performs 4 additions of two sets of 4 16-bit numbers packed into 64- bit registers. – The 4 results are rounded to 4 packed 16-bit values – unit =.L1,.L2,.S1,.S2 FMPYDP - Fast Double-Precision Floating Point Multiply QMPY Way SIMD Multiply, Packed Signed 32-bit. – Performs 4 multiplications of two sets of 4 32-bit numbers packed into 128-bit registers. – The 4 results are packed 32-bit values. – unit =.M1 or.M2

C66x SIMD Instruction: CMATMPY Many applications use complex matrix arithmetic. CMATMPY – 2x1 Complex Vector Multiply 2x2 Complex Matrix – Results in 1x2 signed complex vector. – All values are 16-bit (16-bit real/16-bit Imaginary) – unit =.M1 or.M2 How many multiplications are complex multiplication, where each complex multiplication has the following? – 4 complex multiplications (4 real multiplications each) – Two M units (16 multiplications each) = 32 multiplications – Core cycles per second (1.25 G) – Total multiplications per second = 40 G multiplications – 8 cores = 320 G multiplications The issue here is, can we feed the functional units data fast enough?

Feeding the Functional Units There are two challenges: How to provide enough data from memory to the core – Access to L1 memory is wide (2 x 64 bit) and fast (0 wait state) – Multiple mechanisms are used to efficiently transfer new data to L1 from L2 and external memory. How to get values in and out of the functional units – Hardware pipeline enables execution of instructions every cycle. – Software pipeline enables efficient instruction scheduling to maximize functional unit throughput.

Memory Access 1.CorePac Architecture 2.Single Instruction Multiple Data (SIMD) 3.Memory Access 4.Pipeline Concept

Internal Buses PC Program Addressx32 Program Data x256 A Regs A Regs B Regs B Regs Data Address - T1 x32 Data Data - T1 x32/64 Data Address - T2x32 Data Data - T2 x32/64 L1 Memories L2 and External Memory Peripherals C62x: Dual 32-Bit Load/Store C67x: Dual 64-Bit Load / 32-Bit Store C64x, C674x, C66x: Dual 64-Bit Load/Store Fetch

Pipeline Concept 1.CorePac Architecture 2.Single Instruction Multiple Data (SIMD) 3.Memory Access 4.Pipeline Concept

Non-Pipelined vs. Pipelined CPU CPU Type F 2 D 2 E 2 F 3 D 3 E 3 F 1 D 1 E 1 Non-Pipelined Clock Cycles Pipeline full Now look at the C66x pipeline. StagePipeline Function F Fetch Generate program fetch address Read opcode D Decode Route opcode to functional units Decode instructions E Execute Execute instructions F 1 D 1 E 1 F 2 D 2 E 2 F 3 D 3 E 3 Pipelined

Program Fetch Phases PW C66x Core PS Memory PG PhaseDescription PGGenerate fetch address PSSend address to memory PWWait for data ready PRRead opcode Functional Units PR

Pipeline Phases - Review  Single-cycle performance is not affected by adding three program fetch phases.  That is, there is still an execute every cycle. PGPSPWPRDE Program Fetch Execute Decode How about decode? Is it only one cycle?

Decode Phases Decode PhaseDescription DPIntelligently routes instruction to functional unit (dispatch) DCInstruction decoded at functional unit (decode) PW C66x Core PS Memory PR PG Functional Units DP DC

Pipeline Full PG PS PW PR DP DC E1 Program Fetch Execute Decode Pipeline Phases How many cycles does it take to execute an instruction?

All C66x instructions require only one cycle to execute, but some results are delayed. Instruction Delays DescriptionInstruction ExampleDelay Single CycleAll instructions except0 Integer multiplication and new floating point MPY, FMPYSP1 Legacy floating point multiplication MPYSP2 LoadLDW4 BranchB5

Software Pipeline Example LDH || LDH MPY ADD LDH || LDH MPY ADD How many cycles would it take to perform this loop five times? (Disregard delay slots). ______________ cycles Dot product; A typical DSP MAC operation.

Software Pipeline Example LDH || LDH MPY ADD LDH || LDH MPY ADD How many cycles would it take to perform this loop five times? (Disregard delay slots). 5 x 3 = 15 cycles Dot product; A typical DSP MAC operation.

Non-Pipelined Code.M1.M2.L1.L2.S1.S2.D1.D2 1 Cycle ldh 2 mpy 3 add 4 ldh 5 mpy 6 add 7 ldh 8 mpy 9 add

Pipelining Code.M1.M2.L1.L2.S1.S2.D1.D21 Cycle ldh 2 mpy ldh 3 add mpy ldh 4 add mpy ldh 5 add mpy ldh 6 add mpy 7 add No LDHs? Pipelining these instructions took 1/2 the cycles!

Software Pipeline Support The compiler is smart enough to schedule instructions efficiently. DSP algorithms are typically loop intensive. Generally speaking, servicing of interrupts is not allowed in the middle of the loop because fixed timing is essential. The C66x hardware SPLOOP enables servicing of interrupts in the middle of loops. NOTE: For more information on SPLOOP, refer to Chapter 8 of the C66x CPU and Instruction Set Reference Guide.C66x CPU and Instruction Set Reference Guide

For More Information For more information, refer to the C66x CPU and Instruction Set Reference Guide.C66x CPU and Instruction Set Reference Guide For questions regarding topics covered in this training, visit the support forums at the TI E2E Community website. TI E2E Community