Lecture 6: Embedded Processors
Embedded Computing Systems
Mikko Lipasti, adapted from M. Schulte
Based on slides and textbook from Wayne Wolf
© 2007 Elsevier

Topics
- Embedded microprocessor market.
- Categories of CPUs.
- RISC, DSP, and multimedia processors.
- CPU mechanisms.

Demand for Embedded Processors
Embedded processors account for:
- Over 97% of total processors sold.
- Over 60% of total sales from processors.
Sales are expected to increase by roughly 15% each year.

Flynn's taxonomy of processors
- Single-instruction single-data (SISD).
- Single-instruction multiple-data (SIMD).
- Multiple-instruction multiple-data (MIMD).
- Multiple-instruction single-data (MISD).
What is an example of each? Which would you expect to see in embedded systems?

Other axes of comparison
- RISC vs. CISC: instruction set style.
- Instruction issue width.
- Static vs. dynamic scheduling for multiple-issue machines.
- Scalar vs. vector processing.
- Single-threaded vs. multithreaded.
A single CPU can fit into multiple categories.

Embedded vs. general-purpose processors
Embedded processors may be customized for a category of applications.
- Customization may be narrow or broad.
We may judge embedded processors using different metrics:
- Code size.
- Energy efficiency.
- Memory system performance.
- Predictability.

Embedded RISC processors
RISC processors often have simple, highly pipelinable instructions.
Pipelines of embedded RISC processors have grown over time:
- ARM7 has a 3-stage pipeline.
- ARM9 has a 5-stage pipeline.
- ARM11 has an 8-stage pipeline.
[Figure: ARM11 pipeline [ARM05].]

RISC processor families
ARM:
- ARM7 has in-order execution and no memory management or branch prediction.
- ARM9 and ARM11 add memory management and branch prediction; the ARM11 also provides out-of-order execution.
MIPS:
- MIPS32 4K has a 5-stage pipeline.
- The 4KE family has a DSP extension.
- The 4KS is designed for security.
PowerPC:
- The PowerPC 400 series includes several embedded processors.
- Motorola and IBM offer superscalar versions of the PowerPC.

Embedded DSP Processors
DSP processors feature:
- Deterministic execution times.
- Fast multiply-accumulate instructions.
- Multiple data accesses per cycle.
- Specialized addressing modes.
- Efficient support for loops and interrupts.
- Efficient processing of "streaming" data.
Embedded DSP processors are optimized to perform DSP algorithms: speech coding, filtering, convolution, fast Fourier transforms, and discrete cosine transforms.
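To make the multiply-accumulate and streaming-data points concrete, here is a minimal C sketch of the canonical DSP kernel, a direct-form FIR filter. The function name and Q15 fixed-point format are illustrative assumptions, not from the slides.

    #include <stdint.h>

    /* Direct-form FIR filter: y[n] = sum over k of h[k] * x[n-k].
       The inner loop is one multiply-accumulate per tap; a DSP can
       issue it every cycle while its address generators fetch h[k]
       and x[n-k] in parallel, often under a zero-overhead loop. */
    int16_t fir_q15(const int16_t *x, const int16_t *h, int ntaps)
    {
        /* x points at the newest sample; older samples precede it,
           so the caller must pass &samples[n] with n >= ntaps-1. */
        int32_t acc = 0;
        for (int k = 0; k < ntaps; k++)
            acc += (int32_t)h[k] * x[-k];   /* multiply-accumulate */
        return (int16_t)(acc >> 15);        /* rescale Q30 sum to Q15 */
    }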

Example: TI C55x/C54x DSPs
- 40-bit arithmetic (32-bit values + 8 guard bits).
- Barrel shifter.
- 17 x 17 multiplier.
- Two address generators.
- Lots of special-purpose registers and addressing modes.
- Coprocessors for compute-intensive functions, including pixel interpolation, motion estimation, and DCT/IDCT computations.
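The guard bits deserve a one-line justification: eight guard bits let up to 2^8 = 256 full-scale products be accumulated before the 40-bit range can overflow. Below is a hedged C emulation of a 40-bit saturating accumulator in a 64-bit variable; the hardware does this natively, so the function is illustrative only.

    #include <stdint.h>

    /* Emulated 40-bit accumulator (32-bit value + 8 guard bits).
       A 16x16 multiply yields a 32-bit product; up to 256 of them
       can be summed before the 40-bit range can overflow. */
    int64_t mac40(int64_t acc, int16_t a, int16_t b)
    {
        acc += (int32_t)a * b;                        /* multiply-accumulate */
        const int64_t max40 = (INT64_C(1) << 39) - 1; /* 40-bit saturation bounds */
        if (acc > max40)           acc = max40;
        else if (acc < -max40 - 1) acc = -max40 - 1;
        return acc;
    }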

TI C55x microarchitecture
[Figure not reproduced in transcript.]

Parallelism extraction
Static:
- Use the compiler to analyze the program.
- Simpler CPU.
- Can't depend on data values.
- VLIW.
Dynamic:
- Use hardware to identify opportunities.
- More complex CPU.
- Can make use of data values.
- Superscalar.
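The difference is easiest to see in code. In this hedged C sketch, the four statements in the middle have no dependences on one another, so a static (VLIW) compiler can pack them into one long instruction word at compile time; the return expression depends on all four and must be scheduled later. All names are illustrative.

    /* Four independent operations: a 4-wide VLIW can issue them in
       a single cycle; the return expression forms a dependent chain
       that must follow in later cycles. */
    int combine(int a, int b, int c, int d)
    {
        int t0 = a + b;   /* slot 1 */
        int t1 = a - b;   /* slot 2 */
        int t2 = c + d;   /* slot 3 */
        int t3 = c - d;   /* slot 4 */
        return (t0 * t1) + (t2 * t3);   /* dependent on t0..t3 */
    }

A dynamic (superscalar) machine discovers the same independence at run time in hardware, which is also why it can adapt its schedule to data values.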

VLIW architectures
- Each very long instruction word (VLIW) performs multiple operations in parallel.
- Needs a good compiler that understands the architecture.
- Allows deterministic execution times.
- Code growth can be reduced by allowing:
  - Operations within an instruction to be performed sequentially.
  - A given field to specify different types of operations.
[Figure: example VLIW instruction formats with fields such as Branch, Memory, Arithmetic, Logic, and Vector, plus shared fields like Branch/Mem, Mem/Arith, Arith/Logic, and a sequential (Seq) bit.]

Simple VLIW architecture
Large register file feeds multiple function units.
[Figure: register file feeding an E-box with ALU and load/store function units; example instruction word: Add r1,r2,r3; Sub r4,r5,r6; Ld r7,foo; St r8,baz; NOP.]

Clustered VLIW architecture
The register file and function units are divided into clusters.
What are the advantages/disadvantages of having clusters in VLIW architectures?
[Figure: two clusters, each with a register file and execution units, connected by a cluster bus.]

TI C62x/C67x DSPs
- VLIW with up to 8 instructions/cycle.
- Thirty-two 32-bit registers.
- Function units: two multipliers and six ALUs.
- All instructions execute conditionally.
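Conditional (predicated) execution lets the compiler remove short branches that would otherwise disturb the deep pipeline. A hedged C illustration: the conditional below can be compiled to a predicated negate instead of a conditional branch on a fully predicated machine like the C6x; the function itself is illustrative, not vendor code.

    /* Branch-free absolute value: with full predication the compiler
       can guard the negation with a predicate register rather than
       emitting a branch. (INT_MIN overflow ignored for clarity.) */
    int abs_nobranch(int x)
    {
        return (x < 0) ? -x : x;
    }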

TI C6x data operations
- 8/16/32-bit arithmetic.
- 40-bit operations.
- Bit manipulation operations.
- C67x processors add floating-point arithmetic.

C6x block diagram
[Figure: two data paths with register files (data path 1/reg file 1, data path 2/reg file 2), execute stage, DMA, timers, serial ports, 512-Kbit program RAM/cache, 512-Kbit data RAM, JTAG, PLL, and bus.]

Texas Instruments C62x
N. Seshan, "High VelociTI processing [Texas Instruments VLIW DSP architecture]," IEEE Signal Processing Magazine, vol. 15, no. 2, 1998.

Emerging DSP Architectures
Parallelism at multiple levels:
- Multiple processors: system-on-a-chip designs.
- Multiple simultaneous tasks: multithreaded processors.
- Multiple instructions per cycle: very long instruction word (VLIW) architectures.
- Multiple operations per instruction: single-instruction multiple-data (SIMD) instructions.
Architecture/compiler pairs improve performance and help manage application complexity.

Superscalar processors
Instructions are dynamically scheduled.
- Dependencies are checked at run time in hardware.
Used to some extent in embedded processors.
- The embedded Pentium is two-issue, in-order.
- Some PowerPCs are superscalar.
What advantages/disadvantages do VLIW processors have compared to superscalar processors?

SIMD and subword parallelism
- Many special-purpose SIMD machines exist: all processors perform the same operation on different data.
- Subword parallelism is widely used for video: the ALU is divided into subwords for independent operations on small operands.
- Vector processing is another form of SIMD processing.
These terms are often used interchangeably.

SIMD Instructions
Recent multimedia processors commonly support single-instruction multiple-data (SIMD) instructions.
The same operation is performed on multiple data operands using a single instruction.
These instructions exploit the low precision and high data parallelism of multimedia applications.
[Figure: four-lane SIMD addition of packed operands A3..A0 and B3..B0, producing A3+B3, A2+B2, A1+B1, A0+B0 in one instruction.]
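Subword parallelism can even be approximated without SIMD hardware, which makes the idea concrete. This hedged C sketch adds four packed 8-bit lanes in one 32-bit word, masking the lanes' top bits so carries cannot ripple between them; a multimedia ISA provides the same effect as a single instruction.

    #include <stdint.h>

    /* Per-lane 8-bit addition (mod 256) of four packed lanes.
       Clear each lane's MSB, add, then restore the MSBs via XOR
       so no carry crosses a lane boundary. */
    uint32_t padd8(uint32_t a, uint32_t b)
    {
        uint32_t sum = (a & 0x7F7F7F7FU) + (b & 0x7F7F7F7FU);
        return sum ^ ((a ^ b) & 0x80808080U);
    }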

Operand characteristics in MediaBench
[Figure not reproduced in transcript.]

Dynamic behavior of loops in MediaBench
The loops of media applications are in many cases not very deep.
Path ratio = (instructions executed per iteration) / (total number of loop instructions).
What does the path ratio reveal?
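A small worked example with hypothetical counts: if a 60-instruction loop body executes only 45 instructions on an average iteration, the path ratio is 0.75, meaning a quarter of the body is skipped by control flow, which limits straight-line optimizations such as SIMD.

    #include <stdio.h>

    /* Path ratio with hypothetical counts: dynamic instructions per
       iteration divided by static instructions in the loop body. */
    int main(void)
    {
        double executed_per_iter = 45.0;   /* hypothetical dynamic count */
        double loop_body_insns   = 60.0;   /* hypothetical static count  */
        printf("path ratio = %.2f\n", executed_per_iter / loop_body_insns);
        return 0;                          /* prints: path ratio = 0.75  */
    }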

TriMedia TM-1 characteristics
- Floating-point support.
- Subword parallelism support.
- VLIW.
- Additional custom operations.

TriMedia TM-1 memory interface
[Figure: VLIW CPU surrounded by peripherals: video in/out, audio in/out, I2C, timers, image coprocessor, VLD coprocessor, serial port, and PCI.]

TM-1 VLIW CPU
[Figure: register file connected through a read/write crossbar to function units FU1 through FU27, with five issue slots (slot 1 through slot 5).]

Multithreading
Low-level parallelism mechanism.
- Interleaved multithreading (IMT) alternately fetches instructions from separate threads. Often used with VLIW and vector processors.
- Simultaneous multithreading (SMT) fetches instructions from several threads on each cycle. Often used with superscalar processors.
What advantages/disadvantages does IMT have relative to SMT?
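To see the IMT policy, here is a toy C model in which the fetch stage simply round-robins across threads, one per cycle; thread state is reduced to a program counter, and everything here is an illustrative assumption rather than a real pipeline.

    #include <stdio.h>

    #define NTHREADS 4

    /* Interleaved multithreading: fetch rotates across threads each
       cycle, so a stalled thread simply forfeits its slot while the
       other threads keep the pipeline busy. */
    int main(void)
    {
        unsigned pc[NTHREADS] = {0};
        for (int cycle = 0; cycle < 8; cycle++) {
            int t = cycle % NTHREADS;        /* one thread per cycle */
            printf("cycle %d: fetch thread %d at pc %u\n", cycle, t, pc[t]);
            pc[t] += 4;                      /* advance that thread's PC */
        }
        return 0;
    }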

Dynamic voltage scaling (DVS)
Power scales with V^2, while performance scales roughly as V.
Reduce the operating voltage and add parallel operating units to make up for the lower clock speed.
DVS doesn't work well in processors with high leakage power.
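The arithmetic behind the slide, as a hedged numeric sketch: with dynamic power modeled as P proportional to V^2 * f and attainable frequency roughly proportional to V, halving voltage and frequency while doubling the parallel units preserves throughput at about one quarter of the power. The constants below are illustrative.

    #include <stdio.h>

    /* Normalized dynamic power model: P = V^2 * f. */
    static double power(double v, double f) { return v * v * f; }

    int main(void)
    {
        double p1 = power(1.0, 1.0);        /* one unit at full V and f  */
        double p2 = 2.0 * power(0.5, 0.5);  /* two units at half V and f */
        /* Throughput matches (2 units x 0.5 f = 1.0); power ratio 0.25 */
        printf("baseline %.2f, parallel+DVS %.2f\n", p1, p2);
        return 0;
    }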

Dynamic voltage and frequency scaling (DVFS)
Scale both voltage and clock frequency.
Control algorithms can match performance to the application and reduce power.

Razor architecture
Razor runs the clock faster than the worst case allows.
Uses a specialized latch to detect timing errors.
Recovers only on errors, gaining average-case performance.