© 2007 Elsevier Chapter 2, part 1: CPUs High Performance Embedded Computing Wayne Wolf.

Slides:

Advertisements

Similar presentations

Philips Research ICS 252 class, February 3, The Trimedia CPU64 VLIW Media Processor Kees Vissers Philips Research Visiting Industrial Fellow

Advertisements

DSPs Vs General Purpose Microprocessors

PIPELINE AND VECTOR PROCESSING

Lecture 4 Introduction to Digital Signal Processors (DSPs) Dr. Konstantinos Tatas.

Lecture 6 Programming the TMS320C6x Family of DSPs.

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture VLIW Steve Ko Computer Sciences and Engineering University at Buffalo.

Survey of Digital Signal Processors Michael Warner ECD: VLSI Communication Systems.

Real time DSP Professors: Eng. Julian S. Bruno Eng. Jerónimo F. Atencio Sr. Lucio Martinez Garbino.

© 2008 Wayne Wolf Overheads for Computers as Components 2nd ed. Instruction sets Computer architecture taxonomy. Assembly language. 1.

Computer Architecture and Data Manipulation Chapter 3.

C66x CorePac: Achieving High Performance. Agenda 1.CorePac Architecture 2.Single Instruction Multiple Data (SIMD) 3.Memory Access 4.Pipeline Concept.

1 Microprocessor-based Systems Course 4 - Microprocessors.

Embedded Systems Programming

Processor Technology and Architecture

COMP3221: Microprocessors and Embedded Systems Lecture 2: Instruction Set Architecture (ISA) Lecturer: Hui Wu Session.

1 Architectural Analysis of a DSP Device, the Instruction Set and the Addressing Modes SYSC5603 (ELG6163) Digital Signal Processing Microprocessors, Software.

1 SHARC ‘S’uper ‘H’arvard ‘ARC’hitecture Nagendra Doddapaneni ER hit HAR ect VARD ure SUP Arc.

Chapter 17 Parallel Processing.

State Machines Timing Computer Bus Computer Performance Instruction Set Architectures RISC / CISC Machines.

RISC. Rational Behind RISC Few of the complex instructions were used –data movement – 45% –ALU ops – 25% –branching – 30% Cheaper memory VLSI technology.

PSU CS 106 Computing Fundamentals II Introduction HM 1/3/2009.

GCSE Computing - The CPU

© 2007 Elsevier Lecture 6: Embedded Processors Embedded Computing Systems Mikko Lipasti, adapted from M. Schulte Based on slides and textbook from Wayne.

Ehsan Shams Saeed Sharifi Tehrani. What is DSP ? Digital Signal Processing (DSP) is used in a wide variety of applications, and it is hard to find a good.

Advanced Computer Architectures

Motivation Mobile embedded systems are present in: –Cell phones –PDA’s –MP3 players –GPS units.

Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

RICE UNIVERSITY Implementing the Viterbi algorithm on programmable processors Sridhar Rajagopal Elec 696

ICME 2004© 2004 Marilyn Wolf Architectures for Multimedia Computing 3: DSPs and Systems-on- Chips Marilyn Wolf Dept. of Electrical Engineering Princeton.

CLEMSON U N I V E R S I T Y AVR32 Micro Controller Unit Atmel has created the first processor architected specifically for 21st century applications that.

Classifying GPR Machines TypeNumber of Operands Memory Operands Examples Register- Register 30 SPARC, MIPS, etc. Register- Memory 21 Intel 80x86, Motorola.

ARM for Wireless Applications ARM11 Microarchitecture On the ARMv6 Connie Wang.

Chapter 8 CPU and Memory: Design, Implementation, and Enhancement The Architecture of Computer Hardware and Systems Software: An Information Technology.

Ch. 2 Data Manipulation 4 The central processing unit. 4 The stored-program concept. 4 Program execution. 4 Other architectures. 4 Arithmetic/logic instructions.

What is µP? “An integrated circuit containing … a central processing unit (CPU) and a means to access external memory” -- (Ball 2000)

Chapter 2 Data Manipulation © 2007 Pearson Addison-Wesley. All rights reserved.

Overview of Super-Harvard Architecture (SHARC) Daniel GlickDaniel Glick – May 15, 2002 for V (Dewar)

Chapter 2 Data Manipulation. © 2005 Pearson Addison-Wesley. All rights reserved 2-2 Chapter 2: Data Manipulation 2.1 Computer Architecture 2.2 Machine.

Crosscutting Issues: The Rôle of Compilers Architects must be aware of current compiler technology Compiler Architecture.

Different Microprocessors Tamanna Haque Nipa Lecturer Dept. of Computer Science Stamford University Bangladesh.

Pipelining and Parallelism Mark Staveley

DIGITAL SIGNAL PROCESSORS. Von Neumann Architecture Computers to be programmed by codes residing in memory. Single Memory to store data and program.

Computer Organization CDA 3103 Dr. Hassan Foroosh Dept. of Computer Science UCF © Copyright Hassan Foroosh 2002.

DSP Architectures Additional Slides Professor S. Srinivasan Electrical Engineering Department I.I.T.-Madras, Chennai –

Chapter 2 Data Manipulation © 2007 Pearson Addison-Wesley. All rights reserved.

Overview von Neumann Architecture Computer component Computer function

EKT303/4 Superscalar vs Super-pipelined.

Chapter 2 Data Manipulation © 2007 Pearson Addison-Wesley. All rights reserved.

Chapter 2: Data Manipulation

Different Microprocessors Tamanna Haque Nipa Lecturer Dept. of Computer Science Stamford University Bangladesh.

Fundamentals of Programming Languages-II

Processor Performance & Parallelism Yashwant Malaiya Colorado State University With some PH stuff.

Protection in Virtual Mode

Advanced Architectures

Instruction Level Parallelism

Embedded Systems Design

Getting the Most Out of Low Power MCUs

Chapter 2: Data Manipulation

* From AMD 1996 Publication #18522 Revision E

TI C6701 VLIW MIMD.

Chapter 2: Data Manipulation

Overheads for Computers as Components 2nd ed.

Digital Signal Processors-1

Superscalar and VLIW Architectures

COMPUTER ARCHITECTURES FOR PARALLEL ROCESSING

Chapter 2: Data Manipulation

Presentation transcript:

© 2007 Elsevier Chapter 2, part 1: CPUs High Performance Embedded Computing Wayne Wolf

High Performance Embedded Computing © 2007 Elsevier Topics CPU metrics. Categories of CPUs. CPU mechanisms.

High Performance Embedded Computing © 2007 Elsevier Performance as a design metric Performance = speed:  Latency.  Throughput. Average vs. peak performance. Worst-case and best- case performance.

High Performance Embedded Computing © 2007 Elsevier Other metrics Cost (area). Energy and power. Predictability. Security.

High Performance Embedded Computing © 2007 Elsevier Flynn’s taxonomy of processors Single-instruction single-data (SISD): RISC, etc. Single-instruction multiple-data (SIMD): all processors perform the same operations. Multiple-instruction multiple-data (MIMD): homogeneous or heterogeneous multiprocessor. Multiple-instruction multiple data (MISD).

High Performance Embedded Computing © 2007 Elsevier Other axes of comparison RISC vs. CISC---Instruction set style. Instruction issue width. Static vs. dynamic scheduling for multiple- issue machines. Vector processing. Multithreading.

High Performance Embedded Computing © 2007 Elsevier Embedded vs. general-purpose processors Embedded processors may be optimized for a category of applications.  Customization may be narrow or broad. We may judge embedded processors using different metrics:  Code size.  Memory system performance.  Preditability.

High Performance Embedded Computing © 2007 Elsevier RISC processors RISC generally means highly-pipelinable, one instruction per cycle. Pipelines of embedded RISC processors have grown over time:  ARM7 has 3-stage pipeline.  ARM9 has 5-stage pipeline.  ARM11 has eight-stage pipeline. ARM11 pipeline [ARM05].

High Performance Embedded Computing © 2007 Elsevier RISC processor families ARM: ARM7 is relatively simple, no memory management; ARM11 has memory management, other features. MIPS: MIPS32 4K has 5-stage pipeline; 4KE family has DSP extension; 4KS is designed for security. PowerPC: 400 series includes several embedded processors; MPD7410 is two- issue machine; 970FX has 16-stage pipeline.

High Performance Embedded Computing © 2007 Elsevier Digital signal processors First DSP was AT&T DSP16:  Hardware multiply- accumulate unit.  Harvard architecture. Today, DSP is often used as a marketing term. Modern DSPs are heavily pipelined.

High Performance Embedded Computing © 2007 Elsevier Example: TI C5x DSP 40-bit arithmetic unit (32-bit values with 8 guard bits). Barrel shifter. 17 x 17 multiplier. Comparison unit for Viterbi encoding/decoding. Single-cycle exponent encoder for wide- dynamic-range arithmetic. Two address generators.

High Performance Embedded Computing © 2007 Elsevier TI C55x microarchitecture

High Performance Embedded Computing © 2007 Elsevier TI C55x pixel interpretation co-processor Designed to support motion estimation. Interpolates U, M, R values given A, B, C, D pixels. AUB MR CD

High Performance Embedded Computing © 2007 Elsevier TI C55x DCT co-processor Performs 1-D 8x8 DCT. Pipelines several DCT iterations.

High Performance Embedded Computing © 2007 Elsevier Parallelism extraction Static:  Use compiler to analyze program.  Simpler CPU.  Can make use of high- level language constructs.  Can’t depend on data values. Dynamic:  Use hardware to identify opportunities.  More complex CPU.  Can make use of data values.

High Performance Embedded Computing © 2007 Elsevier Simple VLIW architecture Large register file feeds multiple function units. Register file E box Add r1,r2,r3; Sub r4,r5,r6; Ld r7,foo; St r8,baz; NOP ALU Load/store FU

High Performance Embedded Computing © 2007 Elsevier Clustered VLIW architecture Register file, function units divided into clusters. Execution Register file Execution Register file Cluster bus

High Performance Embedded Computing © 2007 Elsevier TI C62/C67 Up to 8 instructions/cycle bit registers. Function units:  Two multipliers.  Six ALUs. All instructions execute conditionally.

High Performance Embedded Computing © 2007 Elsevier TI C6x data operations 8/16/32-bit arithmetic. 40-bit operations. Bit manipulation operations.

High Performance Embedded Computing © 2007 Elsevier C6x system On-chip RAM. 32-bit external memory: SDRAM, SRAM, etc. Host port. Multiple serial ports. Multichannel DMA. 32-bit timer.

High Performance Embedded Computing © 2007 Elsevier C6x block diagram Data path 1/ Reg file 1 Data path 2/ Reg file 2 Execute DMA timers Serial Program RAM/cache 512K bits Data RAM 512K bits JTAG PLL bus

High Performance Embedded Computing © 2007 Elsevier C6x data paths General-purpose register files (A and B, 16 words each). Eight function units: .L1,.L2,.S1,.S2,.M1,.M2,.D1,.D2 Two load units (LD1, LD2). Two store units (ST1, ST2). Two register file cross paths (1X and 2X). Two data address paths (DA1 and DA2).

High Performance Embedded Computing © 2007 Elsevier C6x function units.L  32/40-bit arithmetic.  Leftmost 1 counting.  Logical ops..S  32-bit arithmetic.  32/40-bit shift and 32-bit field.  Branches.  Constants..M  16 x 16 multiply..D  32-bit add, subtract, circular address.  Load, store with 5/15-bit constant offset.

High Performance Embedded Computing © 2007 Elsevier Motorola Starcore SC140 DALU includes 4 ALUs, 1 register file. AGU includes 2 address arithmetic units (AAU), 1 address register file. Program sequencer and control unit (PSEQ). Performance:  4 million MACs per second per MHz clock.  10 RISC MIPS per MHz clock.

High Performance Embedded Computing © 2007 Elsevier Typical SC140 configuration Level 1 memory expansion RAM, ROM peripherals DMA, Cache, Interrupts, Level 2 mem, Etc. Program sequencer ALU AAU

High Performance Embedded Computing © 2007 Elsevier SC140 Core Program sequencer Power mgt Clock/PLL AGU DALU Address Register file 2 AAUsBMU Data Register file 4 ALUs

High Performance Embedded Computing © 2007 Elsevier Instruction format 16-bit instructions. Up to 6 instructions per cycle. Instructions are grouped to define allowable simultaneous operations. MACR –D0,D1,D7 AND D4,D5MOVE.L (R0),+N0,R6ADDA R2,R3 DALU AGU

High Performance Embedded Computing © 2007 Elsevier AGU addressing Allowable addressing modes:  Linear.  Modulo.  Reverse-carry.  Automatic updating during register indirect.  Stack. Array addressing: base, offsett, modifier registers.

High Performance Embedded Computing © 2007 Elsevier Motorola MC8126 multiprocessor Four SC140 cores.  224 KB M1 memory with zero wait states. Shared M2 memory.  476 KB running at core frequency.  M2-accessible multi-core bus. 64/32-bit data, 32-bit address 60x bus. Eight-bank memory controller. Hardware semaphores.

High Performance Embedded Computing © 2007 Elsevier TM-1 characteristics Characteristics  Floating point support  Sub-word parallelism support  If Conversion  Additional custom operations

High Performance Embedded Computing © 2007 Elsevier Trimedia TM-1 memory interface video in audio in I2CI2C timers image co-p PCI video out audio out serial VLD co-p VLIW CPU

High Performance Embedded Computing © 2007 Elsevier TM-1 VLIW CPU register file read/write crossbar FU1FU27 slot 1slot 2slot 3slot 4slot 5...

High Performance Embedded Computing © 2007 Elsevier Superscalar processors Instructions are dynamically scheduled.  Dependencies are checked at run time in hardware. Used to some extent in embedded processors.  Embedded Pentium is two-issue in-order.

High Performance Embedded Computing © 2007 Elsevier SIMD and subword parallelism Many special-purpose SIMD machines. Subword parallelism is widely used for video.  ALU is divided into subwords for independent operations on small operands. Vector processing is widely used for integer values.

High Performance Embedded Computing © 2007 Elsevier Multithreading Low-level parallelism mechanism. Hardware multithreading alternately fetches instructions from separate threads. Simultaneous multithreading (SMT) fetches instructions from several threads on each cycle.

High Performance Embedded Computing © 2007 Elsevier Example: Sandbridge Sandblaster

High Performance Embedded Computing © 2007 Elsevier Available parallelism in multimedia applications (Talla et al.)

High Performance Embedded Computing © 2007 Elsevier Operand characteristics in MediaBench (Fritts)

High Performance Embedded Computing © 2007 Elsevier Dynamic behavior of loops in MediaBench (Fritts) Path ratio = (instructions executed per iteration) / (total number of loop instructions). MediaBench shows small path ratio -> considerable conditional behavior in loops.

High Performance Embedded Computing © 2007 Elsevier Dynamic voltage scaling (DVS) Power scales with V 2 while performance scales roughly as V. Reduce operating voltage, add parallel operating units to make up for lower clock speed. DVS doesn’t work in high-leakage processors.

High Performance Embedded Computing © 2007 Elsevier Dynamic voltage and frequency scaling (DVFS) Scale both voltage and clock frequency. Can use control algorithms to match performance to application, reduce power.

High Performance Embedded Computing © 2007 Elsevier Razor architecture Critical path not always executed Reduce clock frequency to match average path Used specialized latch to detect errors. Recovers only on errors, gains average- case performance.