By : Majid Namaki Custom Implementation of DSP Systems, Spring 2010 Instructor: Dr S. M. Fakhraei May 2010 1 Nathan J. Ickes, “A Micropower DSP for Sensor.

Presentation transcript:

By: Majid Namaki. Custom Implementation of DSP Systems, Spring 2010. Instructor: Dr. S. M. Fakhraei. May 2010. Nathan J. Ickes, “A Micropower DSP for Sensor Applications,” PhD thesis, MIT, 2008.

Introduction
Why low power? Heat-dissipation limits and battery-lifetime concerns.
Wireless microsensor networks and implanted medical devices are two examples of such applications.

Microsensor Applications
Microsensor networks may consist of many (perhaps hundreds or thousands of) miniature sensor nodes scattered throughout an area of interest and linked by a wireless network.
The network of sensors collaborates as a whole, combining the measurements made by each individual node and delivering high-quality observations to a central base station.
Large number of nodes in a microsensor network => high-resolution, multi-dimensional observations and fault tolerance superior to more traditional sensing systems.

Microsensor Applications (cont.)
Applications: inventory tracking, environmental monitoring, machine-mounted sensing, medical monitoring, and building climate control.
Primary advantage of microsensor networks: the spatial diversity of the data collected by the network as a whole.
Alternatively, the sensor network may be used to imitate a single very large sensor, one that might be impractically large to build or deploy.

Microsensor Applications (cont.)
Extremely small, yet long-lived sensor => power efficiency is the central issue in the design of microsensors.
Self-powered node: scavenges energy from ambient solar, thermal, or mechanical sources, but it is physically large and limited to outdoor applications.

Common Characteristics
Low duty cycle: Nodes can be idle over 99% of the time => standby power must be minimized.
Event driven: Typical events handled by nodes include sending or receiving radio data and collecting measurement data => events must be handled quickly and efficiently to maximize node lifetime.
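
To make the low-duty-cycle point concrete, the sketch below computes a time-weighted average power. The function name and all three numbers are hypothetical illustrations, not measurements from the thesis; the only thing taken from the slide is the idea that a node is idle about 99% of the time.

```python
def average_power_uw(active_uw, standby_uw, duty_cycle):
    """Time-weighted average power for a node that is active a fraction
    `duty_cycle` of the time and in standby otherwise."""
    return duty_cycle * active_uw + (1.0 - duty_cycle) * standby_uw


# Hypothetical figures chosen only for illustration.
ACTIVE_UW = 40.0   # assumed active power, microwatts
STANDBY_UW = 1.0   # assumed standby power, microwatts
DUTY = 0.01        # active 1% of the time, idle 99%

avg = average_power_uw(ACTIVE_UW, STANDBY_UW, DUTY)
standby_share = (1.0 - DUTY) * STANDBY_UW / avg
print(f"average power: {avg:.2f} uW, standby share: {standby_share:.0%}")
```

Under these assumed numbers, standby accounts for most of the average power even though it is 40 times smaller than the active power, which is why standby power is the quantity to minimize.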

Common Characteristics (cont.)
Localized data processing: Preliminary signal processing and data analysis occur within the network. For example, to save energy, nearby nodes might aggregate their data, reducing the amount of data that must be sent to the network base station => the peak processing capability required on each node increases.
Unpredictable performance requirements: Performance demands on any given node are variable and unpredictable before deployment => variations in the node's required radio transmission power and in the amount and type of signal processing required.

Acoustic Tracking Application

The µAMPS DSP
MIT µAMPS (micro, adaptive, multi-domain, power-aware sensors) project.
µAMPS microsensors are designed for acoustic tracking and other applications requiring sensor sampling rates of kS/s and significant post-acquisition signal processing, such as filtering, compression, or spectral analysis.
4 MIPS, 10 pJ per instruction DSP designed to form the core of a µAMPS sensor node.
The DSP is implemented in 90 nm low-power CMOS.
6.3 million transistors (6 million of which are contained in the on-chip memory).
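
The two headline figures on this slide imply an active power level, worked out in the short sketch below. It assumes the 10 pJ-per-instruction figure applies at the full 4 MIPS rate and that every instruction costs the same energy; the slide does not state either explicitly, and the variable names are illustrative.

```python
MIPS = 4                          # from the slide: 4 million instructions per second
ENERGY_PER_INSTRUCTION_PJ = 10    # from the slide: 10 pJ per instruction

instructions_per_second = MIPS * 1_000_000
# power (W) = instructions/s * joules/instruction
active_power_w = instructions_per_second * ENERGY_PER_INSTRUCTION_PJ * 1e-12
active_power_uw = active_power_w * 1e6

print(f"implied active power: {active_power_uw:.0f} uW")
```

That is, roughly 40 µW while executing instructions continuously, under the assumptions above.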

µAMPS Sensor Node Architecture
The node consists of three primary components: the DSP, a custom 12-bit 100 kSPS ADC, and a commercial ZigBee radio (the ChipCon CC2420).

DSP Block Diagram

Performance

Main Contributions
Memory power optimization
Instruction cache design
Modeling of power gating
Hardware accelerators

Miniature Instruction Cache
The cache is direct-mapped and organized as sixteen lines of four words.
The cache memory is implemented using flip-flops (rather than SRAM), allowing it to operate at the lower logic power supply voltage.
The tag comparison and valid-flag logic is asynchronous, so that in the event of a cache miss, a main memory access can be initiated on the same cycle. An instruction can therefore be fetched on every clock cycle, regardless of whether a cache hit or miss occurs.
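
The sixteen-line, four-word organization can be illustrated with a small behavioral model. This is only a sketch: it assumes word addressing, assumes a miss fills the whole four-word line, and ignores timing entirely. The names `split_address` and `DirectMappedCache` are hypothetical and do not come from the thesis.

```python
LINES = 16           # from the slide: sixteen lines
WORDS_PER_LINE = 4   # from the slide: four words per line


def split_address(word_addr):
    """Split a word address into (tag, line index, word offset)."""
    offset = word_addr % WORDS_PER_LINE
    index = (word_addr // WORDS_PER_LINE) % LINES
    tag = word_addr // (WORDS_PER_LINE * LINES)
    return tag, index, offset


class DirectMappedCache:
    """Behavioral hit/miss model only; stores tags and valid flags, no data."""

    def __init__(self):
        self.valid = [False] * LINES
        self.tags = [0] * LINES
        self.hits = 0
        self.misses = 0

    def access(self, word_addr):
        tag, index, _ = split_address(word_addr)
        hit = self.valid[index] and self.tags[index] == tag
        if hit:
            self.hits += 1
        else:
            # Assumption: a miss replaces the whole line.
            self.misses += 1
            self.valid[index] = True
            self.tags[index] = tag
        return hit


# An 8-instruction loop executed 10 times fits in two cache lines.
cache = DirectMappedCache()
for _ in range(10):
    for addr in range(8):
        cache.access(addr)
print(cache.hits, cache.misses)  # prints "78 2"
```

With 64 words of capacity, a tight signal-processing loop mostly hits in the cache after its first pass, so main-memory instruction fetches become rare.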

Power Gating
Clock gating => reduces dynamic power consumption in idle logic.
Power gating => reduces idle-mode leakage power consumption, particularly for deep-sleep states and modern sub-100 nm process technologies.
Power gating is complicated: power cannot be turned on and off on a cycle-by-cycle basis, as is the case in clock gating. Some amount of planning ahead is required before powering off a logic block, to ensure that power can be restored in time before the logic is needed again.
I. Higher threshold voltage device for the power switch
II. Boosting the gate voltage to the power switch
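
The "planning ahead" requirement can be pictured as a break-even test: gating a block only pays off if the leakage energy saved during the idle period exceeds the energy spent switching the block off and on again, and only if the block can wake up in time. The sketch below is a generic illustration of that reasoning, not the model developed in the thesis; the function names and all numeric values are hypothetical.

```python
def break_even_idle_s(switch_energy_j, leakage_saved_w):
    """Idle time at which the leakage energy saved equals the switching overhead."""
    return switch_energy_j / leakage_saved_w


def should_power_gate(expected_idle_s, wakeup_latency_s,
                      switch_energy_j, leakage_saved_w):
    """Gate only if the block can be restored in time and the idle period
    is longer than the break-even time."""
    if expected_idle_s <= wakeup_latency_s:
        return False
    return expected_idle_s > break_even_idle_s(switch_energy_j, leakage_saved_w)


# Hypothetical numbers for illustration only.
SWITCH_ENERGY_J = 1e-9    # assumed energy to power a domain off and back on
LEAKAGE_SAVED_W = 1e-6    # assumed leakage power eliminated while gated
WAKEUP_LATENCY_S = 10e-6  # assumed time to restore power

print(break_even_idle_s(SWITCH_ENERGY_J, LEAKAGE_SAVED_W))
print(should_power_gate(10e-3, WAKEUP_LATENCY_S, SWITCH_ENERGY_J, LEAKAGE_SAVED_W))
print(should_power_gate(0.5e-3, WAKEUP_LATENCY_S, SWITCH_ENERGY_J, LEAKAGE_SAVED_W))
```

With these assumed values the break-even idle time is about 1 ms, so a 10 ms idle period justifies gating and a 0.5 ms one does not.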

Power Gating (cont.)
12 independent power domains: nine memory banks, the FFT and FIR accelerator cores, and the CPU.

µAMPS CPU Architecture
The primary design strategy was to minimize the complexity of the control logic in the processor => all instructions execute in one clock cycle (CPI = 1), and all instructions have the same 16-bit length.
A second design goal was to minimize the number of data memory accesses.
The processor contains three functional units: an ALU implementing add, subtract, and bitwise logical operations (AND, OR, XOR, NOT), a barrel shifter, and a multiply-accumulate (MAC) unit.
The MAC consists of a 16 x 16-bit single-cycle multiplier and a 48-bit accumulator register. The accumulator is readable and writable as special-purpose registers r8, r9, and r10.
3-stage pipeline (fetch, execute, and write back).
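
The MAC datapath described above can be modeled in a few lines. This is a sketch under assumptions the slide does not confirm: operands are treated as signed 16-bit two's-complement values, the accumulator wraps modulo 2^48, and r8 is taken to be the least significant 16-bit word (with r10 the most significant). The function names are hypothetical.

```python
MASK16 = 0xFFFF
MASK48 = (1 << 48) - 1


def to_signed16(x):
    """Interpret the low 16 bits of x as a two's-complement value."""
    x &= MASK16
    return x - 0x10000 if x & 0x8000 else x


def mac(acc, a, b):
    """One multiply-accumulate step: acc + a*b, wrapped to 48 bits."""
    return (acc + to_signed16(a) * to_signed16(b)) & MASK48


def split_accumulator(acc):
    """Return the accumulator as three 16-bit words (r8, r9, r10).
    Assumption: r8 is the least significant word."""
    return acc & MASK16, (acc >> 16) & MASK16, (acc >> 32) & MASK16


# Dot product of two short vectors, one MAC per element pair.
acc = 0
for a, b in zip([1, 2, 3], [4, 5, 6]):
    acc = mac(acc, a, b)
print(acc, split_accumulator(acc))  # prints "32 (32, 0, 0)"
```

A 16 x 16-bit product needs 32 bits, so the 48-bit accumulator leaves 16 guard bits for summing many products before overflow becomes a concern.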

Accelerator Cores
The µAMPS DSP, being designed for acoustic sensing applications, incorporates accelerators for both FIR filtering and FFTs.
The accelerators are implemented as memory-mapped devices.
Energy savings obtained by using a hardware accelerator:
Intrinsic savings in performing the actual computation (e.g., reduced cycle count and control logic overhead).
Extrinsic savings from reduced utilization of global resources (e.g., reducing the number of main memory accesses).

FIR Accelerator
The FIR filter accelerator implements (symmetric) filters of up to 16 taps.
The accelerator consists of a register file holding up to eight 16-bit tap coefficients, a 16×16 circular buffer for holding the input samples, a single multiply-accumulate unit, an adder/subtracter, and a control state machine.
Due to their small size, the sample and coefficient memories are implemented using flip-flops rather than SRAM macros.
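
Storing only eight coefficients for a 16-tap filter works because a symmetric filter has h[k] = h[15-k]: the two samples that share a coefficient can be added first and multiplied once. The sketch below illustrates that folding in plain Python and checks it against a direct 16-tap computation. It is a functional illustration only; it uses a shifted list instead of a circular buffer, and the function names are hypothetical.

```python
def symmetric_fir(samples, half_coeffs):
    """FIR filter with symmetric taps, using one multiply per coefficient pair.
    half_coeffs holds h[0..M-1]; the full filter is h[k] = h[2M-1-k]."""
    taps = 2 * len(half_coeffs)
    history = [0] * taps          # history[k] holds x[n-k]
    out = []
    for x in samples:
        history = [x] + history[:-1]
        acc = 0
        for k, c in enumerate(half_coeffs):
            # Pre-add the two samples that share coefficient h[k].
            acc += c * (history[k] + history[taps - 1 - k])
        out.append(acc)
    return out


def direct_fir(samples, coeffs):
    """Reference implementation: one multiply per tap."""
    history = [0] * len(coeffs)
    out = []
    for x in samples:
        history = [x] + history[:-1]
        out.append(sum(c * h for c, h in zip(coeffs, history)))
    return out


half = [1, -2, 3, 4, -5, 6, 7, 8]        # eight stored coefficients
full = half + half[::-1]                 # the implied 16-tap symmetric filter
signal = list(range(1, 41))
print(symmetric_fir(signal, half) == direct_fir(signal, full))  # prints "True"
```

The folded form needs eight multiplies per output sample instead of sixteen, which suits an accelerator built around a single multiply-accumulate unit.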

FFT Accelerator
The FFT core computes transforms on 128-, 256-, 512-, or 1024-point real-valued inputs, with 16-bit precision.
The accelerator performs a complete butterfly in one clock cycle, compared to ~95 cycles per butterfly required for a software implementation.
The local memory for the accelerator is split into four banks, based on the MSB and parity of each address. Each butterfly computation operates on values from two different banks, allowing both values to be fetched at the same time.
The butterfly operations are specifically ordered so that sequential butterflies involve disjoint sets of memory banks. This allows processing one butterfly per clock cycle, with the results from one butterfly being written back to two memory banks while the inputs to the next butterfly are read from the other two banks.
A small number of hazards are unavoidable and result in stalling the datapath for one cycle.
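
The bank-selection rule can be checked with a short script. It assumes "parity" means the XOR of all address bits and that the butterflies follow a radix-2 in-place pattern, where the two operands of a butterfly differ in exactly one address bit; neither detail is spelled out on the slide, and the function names are hypothetical. Under those assumptions the two operands always have opposite parity and therefore land in different banks.

```python
def parity(x):
    """XOR of all bits of x (1 if the number of set bits is odd)."""
    return bin(x).count("1") & 1


def bank(addr, addr_bits):
    """Bank number 0..3 formed from the address MSB and the address parity."""
    msb = (addr >> (addr_bits - 1)) & 1
    return (msb << 1) | parity(addr)


def butterfly_pairs(n_points):
    """Operand address pairs of a radix-2 in-place FFT, for every stage."""
    bits = n_points.bit_length() - 1
    for stage in range(bits):
        stride = 1 << stage
        for i in range(n_points):
            if not i & stride:
                yield i, i | stride


for n in (128, 256, 512, 1024):
    bits = n.bit_length() - 1
    conflict_free = all(bank(a, bits) != bank(b, bits)
                        for a, b in butterfly_pairs(n))
    print(n, conflict_free)
```

The script only verifies that the two operands of a single butterfly never collide. The ordering that keeps consecutive butterflies on disjoint bank pairs is a separate scheduling question that this sketch does not attempt to reproduce.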

Comparison of the µAMPS DSP with other micropower processors