L4: Architectural Level Design (Prof. Jun-Dong Cho, Sungkyunkwan University)

System-Level Solutions
– Spatial locality: an algorithm can be partitioned into natural clusters based on connectivity.
– Temporal locality: average lifetimes of variables; data referenced in the recent past has a higher probability of future access, so less temporary storage is needed.
– Precompute the physical capacitance of the interconnect and the switching activity (number of bus accesses).
– Architecture-driven voltage scaling: choose a more parallel architecture.
– Supply voltage scaling: lowering Vdd reduces energy but increases delay.

Software Power Issues
Up to 40% of the on-chip power is dissipated on the buses! System software (OS, BIOS, compilers) and application software can affect energy consumption at various levels.
Inter-instruction effects: the energy cost of an instruction varies depending on the previous instruction. For example, for the pair XOR BX, 1; ADD AX, DX: I_est = (319.2 + 313.6)/2 = 316.4 mA, while I_obs = 323.2 mA. The difference is defined as the circuit-state overhead, so the overhead must be specified as a function of pairs of instructions. Further costs are due to pipeline stalls and cache misses; instruction reordering can improve the cache hit ratio.
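A minimal C sketch of such an instruction-level energy model (the opcode set, the trace, and the 6.8 mA overhead entry, taken as the 323.2 − 316.4 mA difference above, are illustrative; real models use measured tables for every opcode pair):

    #include <stdio.h>

    enum { I_XOR, I_ADD, N_OPS };  /* hypothetical opcode IDs for the pair above */

    /* Hypothetical per-instruction base current, in mA (the two values above). */
    static const double base_mA[N_OPS] = { 319.2, 313.6 };

    /* Hypothetical circuit-state overhead for each ordered opcode pair, in mA. */
    static const double ovh_mA[N_OPS][N_OPS] = {
        { 0.0, 6.8 },
        { 6.8, 0.0 },
    };

    /* Average current of a trace: mean base cost plus the circuit-state
     * overhead charged at every adjacent instruction pair.                */
    static double avg_current(const int *trace, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += base_mA[trace[i]];
            if (i > 0)
                sum += ovh_mA[trace[i - 1]][trace[i]];
        }
        return sum / n;
    }

    int main(void)
    {
        int trace[] = { I_XOR, I_ADD, I_XOR, I_ADD };
        printf("estimated average current: %.1f mA\n", avg_current(trace, 4));
        return 0;
    }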

Software Power Optimization
– Instruction packing: reduces cache misses, which carry a high power penalty. Example: the Fujitsu DSP permits an ALU operation and a memory data transfer to be packed into one instruction.
– Instruction ordering: attempts to minimize the energy associated with the circuit-state effect by reordering instructions to minimize the total power for a given overhead table (see the sketch below).
– Operand swapping: minimizes the activity associated with the operands by swapping the operands presented to the ALU or FPU.
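A greedy ordering sketch in C (the four-instruction overhead table is hypothetical, and real schedulers must also respect data dependences, which this sketch ignores):

    #include <stdio.h>
    #include <stdbool.h>

    #define N 4   /* number of independent instructions to order (hypothetical) */

    /* Hypothetical circuit-state overhead between every ordered pair. */
    static const double ovh[N][N] = {
        { 0, 5, 9, 4 },
        { 5, 0, 3, 8 },
        { 9, 3, 0, 6 },
        { 4, 8, 6, 0 },
    };

    int main(void)
    {
        bool used[N] = { false };
        int order[N];

        order[0] = 0;                 /* start from instruction 0 */
        used[0] = true;
        for (int k = 1; k < N; k++) { /* greedily append the cheapest successor */
            int prev = order[k - 1], best = -1;
            for (int j = 0; j < N; j++)
                if (!used[j] && (best < 0 || ovh[prev][j] < ovh[prev][best]))
                    best = j;
            order[k] = best;
            used[best] = true;
        }

        printf("low-overhead order:");
        for (int k = 0; k < N; k++)
            printf(" I%d", order[k]);
        printf("\n");
        return 0;
    }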

Software Power Optimization
– Minimizing memory access costs: minimize the number of memory accesses required by an algorithm. Example (loop merging, so B[i] can be reused immediately instead of being written and re-read):

  Before:
    FOR i := 1 TO N DO B[i] = f(A[i]); END_FOR;
    FOR i := 1 TO N DO C[i] = g(B[i]); END_FOR;
  After:
    FOR i := 1 TO N DO
      B[i] = f(A[i]);
      C[i] = g(B[i]);
    END_FOR;

– Memory bank assignment: formulated as a graph partitioning problem; each group corresponds to a memory bank, and the optimum code sequence can vary when dual loads are available (a greedy partitioning sketch follows below).
(Figure: access graph for the code fragment over variables a–e, and its partitioned version split across Bank A and Bank B.)
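A greedy sketch of the bank-assignment idea in C (the co-access weights are illustrative, not the slide's actual access graph; frequently co-accessed variables are steered to different banks so a dual load can fetch both operands in one cycle):

    #include <stdio.h>

    #define V 5   /* variables a..e of the access graph */

    /* Hypothetical co-access weights (symmetric): how often two variables
     * are needed by the same instruction.                                 */
    static const int w[V][V] = {
        /*      a  b  c  d  e */
        /* a */{ 0, 3, 1, 0, 2 },
        /* b */{ 3, 0, 0, 2, 0 },
        /* c */{ 1, 0, 0, 3, 1 },
        /* d */{ 0, 2, 3, 0, 0 },
        /* e */{ 2, 0, 1, 0, 0 },
    };

    int main(void)
    {
        int bank[V];

        /* Greedy cut: put each variable opposite the bank that holds most
         * of its co-access weight to the variables placed so far.         */
        for (int v = 0; v < V; v++) {
            int toA = 0, toB = 0;
            for (int u = 0; u < v; u++) {
                if (bank[u] == 0) toA += w[v][u];
                else              toB += w[v][u];
            }
            bank[v] = (toA >= toB) ? 1 : 0;
        }

        for (int v = 0; v < V; v++)
            printf("%c -> Bank %c\n", 'a' + v, bank[v] ? 'B' : 'A');
        return 0;
    }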

Power Management Mode
– Support power management with easy control for applications and the OS.
– APM (Advanced Power Management) power states: Full On, APM Enabled, APM Standby, APM Suspend, Off.
(Figure: the APM software stack: APM-aware applications and APM-aware device drivers call the OS-dependent APM driver in the operating system, which calls the OS-independent APM BIOS that controls the hardware and add-in devices.)

Power Management Mode
(Figure: APM state-transition diagram among Full On, APM Enabled, APM Standby, APM Suspend/Hibernation, and Off. Transitions are triggered by APM enable/disable calls, standby and suspend calls, short or long inactivity, interrupts, and the power switch; moving from Full On toward Off decreases both device responsiveness and power usage.)

Power Management Mode
– PowerPC 603: Doze (clock running to the data cache, snooping logic, and time base/decrementer only), Nap (clocks running to the time base/decrementer only), Sleep (all clocks stopped, no external input clock).
– MIPS 4200: Reduced Power mode (clocks at 1/4 of the bus clock frequency).
– Hitachi SH7032: Sleep (CPU clocks stopped, peripherals remain clocked), Standby (all clocks stopped, peripherals initialized).

Power Optimization
– Modeling and Technology
– Circuit Design Level
– Logic and Module Design Level
– Architecture and System Design Level
– Some Design Examples: ARM7TDMI

Some Design Examples
ARM7TDMI core:
– size: 1 mm² (0.25 um process)
– power efficiency: 143 MIPS/W at 5 V
– features: 32-bit addressing, 32x8 DSP multiplier, 32-bit register bank and ALU, 32-bit barrel shifter
– Thumb instruction set: compressed form of the 32-bit ARM instructions, for high code density
(Table: Processor / System / Power (W) / MIPS/W comparison of ARM7D, ARM7TDMI, PC403GA, V DX, and i960SA at clock/supply points between 16 MHz and 40 MHz at 5 V; the power and MIPS/W values are not preserved in this transcript.)

Processor with Power Management
Clock power management; the basic logical method is gated clocking:
– hardware method: external pin + control register bit
– software method: specific instructions + control register bit (a hedged sketch of the software method follows below)
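A C sketch of the software method (every register address, bit name, and the idle sequence below are hypothetical placeholders, not the interface of any processor named on these slides):

    #include <stdint.h>

    /* Hypothetical memory-mapped power-control register and bit layout. */
    #define PWR_CTRL        (*(volatile uint32_t *)0x40001000u)
    #define PWR_GATE_FPU    (1u << 0)   /* gate the clock to the FPU block  */
    #define PWR_GATE_CACHE  (1u << 1)   /* gate the clock to the data cache */
    #define PWR_SLEEP_EN    (1u << 31)  /* arm the sleep/standby state      */

    static inline void enter_idle(void)
    {
        /* Software clock gating: set control-register bits for the blocks
         * that are not needed, then arm the low-power state.              */
        PWR_CTRL |= PWR_GATE_FPU | PWR_GATE_CACHE | PWR_SLEEP_EN;
        __asm__ volatile ("nop"); /* a real part would execute its
                                     processor-specific sleep instruction
                                     here; a nop keeps the sketch portable */
    }

    static inline void exit_idle(void)
    {
        /* Re-enable the gated clocks on wake-up. */
        PWR_CTRL &= ~(PWR_GATE_FPU | PWR_GATE_CACHE | PWR_SLEEP_EN);
    }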

Avoiding Wasteful Computation
– Preservation of data correlation
– Distributed computing / locality of reference
– Application-specific processing
– Demand-driven operation
– Transformation for memory size reduction: consider arrays A and C that are already available in memory. When A is consumed another array B is generated; when C is consumed a scalar value D is produced. The memory size can be reduced by executing the j loop before the i loop, so that C is consumed before B is generated and the same memory space can be used for both arrays (see the sketch below).
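A C sketch of that transformation (f(), g(), the array size, and the reuse-by-aliasing trick are illustrative; the point is only the loop order):

    #define N 1024

    static int f(int x) { return 2 * x + 1; }   /* stand-in computations */
    static int g(int x) { return x * x; }

    /* Executing the j loop first consumes C completely (producing only the
     * scalar D), so the storage that held C can then be reused to hold B
     * when the i loop runs; A, B and C never need to be live together.    */
    static int transform(const int A[N], int C[N])
    {
        int D = 0;

        for (int j = 0; j < N; j++)     /* j loop first: C is consumed   */
            D += g(C[j]);

        int *B = C;                     /* B overwrites C's storage      */
        for (int i = 0; i < N; i++)     /* i loop second: B is generated */
            B[i] = f(A[i]);

        return D;
    }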

Avoiding Wasteful Computation

Architectural Lower Power Design
– Optimum supply voltage: reach the target performance at a lower Vdd through hardware duplication (trading area for lower power) and/or pipelining.
– Instruction set choices: fewer, more complex instructions require less encoding but larger decode logic. Use compact instructions with a smaller instruction length (e.g., Hitachi SH: 16-bit fixed length, arithmetic instructions use only two operands; NEC V800: variable-length instructions add decoding overhead).
– Superscalar (CPI < 1): parallel instruction execution; VLIW architectures.

Variable Supply Voltage Block Diagram
The computational workload varies with time. An approach to reducing the energy consumption of such systems beyond shutdown involves the dynamic adjustment of the supply voltage based on the computational workload. The basic idea is to lower the power supply during periods of low workload, rather than running at a fixed supply and idling for some fraction of the time; the supply voltage and clock rate are increased during high-workload periods.

Power Reduction using Variable Supply
Circuits with a fixed supply voltage work at a fixed speed and idle if the data sample requires less than the maximum amount of computation; power is then reduced only in a linear fashion, since the energy per operation is fixed. If the workload for a given sample period is less than the peak, the delay of the processing element can be increased by a factor of 1/workload without loss in throughput, allowing the processor to operate at a lower supply voltage. Thus the energy per operation varies (a numerical sketch follows below).
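A numerical sketch of that relation in C, using a first-order delay model (delay proportional to Vdd/(Vdd − Vt)^2, energy per operation proportional to Vdd^2); the Vt = 0.7 V and nominal 3.3 V constants are assumed for illustration, not measured data:

    #include <stdio.h>

    #define VT   0.7   /* threshold voltage, V (assumed) */
    #define VMAX 3.3   /* nominal supply, V (assumed)    */

    static double rel_delay(double vdd)   /* delay(vdd) / delay(VMAX) */
    {
        double d    = vdd  / ((vdd  - VT) * (vdd  - VT));
        double dmax = VMAX / ((VMAX - VT) * (VMAX - VT));
        return d / dmax;
    }

    /* Lowest Vdd whose delay still fits in 1/workload of the sample period. */
    static double scaled_vdd(double workload)
    {
        double lo = VT + 0.1, hi = VMAX;
        for (int i = 0; i < 60; i++) {            /* bisection */
            double mid = 0.5 * (lo + hi);
            if (rel_delay(mid) > 1.0 / workload) lo = mid; else hi = mid;
        }
        return hi;
    }

    int main(void)
    {
        for (double w = 1.0; w >= 0.25; w -= 0.25) {
            double v = scaled_vdd(w);
            printf("workload %.2f -> Vdd %.2f V, energy/op %.2f of nominal\n",
                   w, v, (v * v) / (VMAX * VMAX));
        }
        return 0;
    }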

Data-Driven Signal Processing
The basic idea of averaging: two samples are buffered and their workloads are averaged. The averaged workload is then used as the effective workload to drive the power supply. Using a ping-pong buffering scheme, data samples I(n+2) and I(n+3) are being buffered while I(n) and I(n+1) are being processed.
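A minimal C sketch of the averaging policy (workload_of the samples, set_supply(), and the numbers are hypothetical stand-ins; the sketch only shows how the averaged workload of each buffered pair drives the supply, not the ping-pong buffering hardware itself):

    #include <stdio.h>

    /* Hypothetical regulator interface and DSP work. */
    static void set_supply(double workload)
    {
        printf("regulator target for workload %.2f\n", workload);
    }
    static void process(double sample) { (void)sample; }

    int main(void)
    {
        double sample[]   = { 10, 11, 12, 13, 14, 15 };
        double workload[] = { 0.9, 0.3, 0.5, 0.7, 0.2, 0.4 };
        int n = 6;

        /* Samples are taken two at a time (the pair that was buffered while
         * the previous pair was being processed); their averaged workload is
         * the effective workload used to pick the supply for the pair.      */
        for (int i = 0; i + 2 <= n; i += 2) {
            double avg = 0.5 * (workload[i] + workload[i + 1]);
            set_supply(avg);
            process(sample[i]);
            process(sample[i + 1]);
        }
        return 0;
    }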

Datapath Parallelization

Memory Parallelization
To first order, P = C * (f/2) * Vdd^2.

Pipelined Micro-P

Architecture Trade-Off
Pipelined implementation: P_pipeline = (1.15C)(0.58V)^2 (f) = 0.39P
Parallel implementation: P_parallel = (2.15C)(0.58V)^2 (0.5f) = 0.36P
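A sketch of the arithmetic behind the two figures, using the first-order relation P = C V^2 f assumed on the preceding slides (0.58^2 ≈ 0.336, so 1.15 × 0.336 ≈ 0.39 and 2.15 × 0.336 × 0.5 ≈ 0.36):

    \[
    P_{\mathrm{ref}} = C_{\mathrm{ref}}\, V_{\mathrm{ref}}^{2}\, f_{\mathrm{ref}}
    \]
    \[
    P_{\mathrm{pipeline}} = (1.15\,C_{\mathrm{ref}})\,(0.58\,V_{\mathrm{ref}})^{2}\, f_{\mathrm{ref}}
      \approx 0.39\, P_{\mathrm{ref}}
    \]
    \[
    P_{\mathrm{parallel}} = (2.15\,C_{\mathrm{ref}})\,(0.58\,V_{\mathrm{ref}})^{2}\,(0.5\, f_{\mathrm{ref}})
      \approx 0.36\, P_{\mathrm{ref}}
    \]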

Different Classes of RISC Micro-P

Application-Specific Coprocessor
DSPs are increasingly called upon to perform tasks for which they are not ideally suited, for example Viterbi decoding, and they may take considerably more energy than a custom solution. Use the DSP for the portions of the algorithm for which it is well suited, and craft an application-specific coprocessor (i.e., custom hardware) for the other tasks. This is an example of the difference between power and energy: the application-specific coprocessor may actually consume more power than the DSP, but it may be able to accomplish the same task in far less time, resulting in a net energy savings. Power consumption also varies dramatically with the instruction being executed.

Clocks per Instruction (CPI)

SUPERPIPELINE micro-P

VLIW Architecture
The compiler takes the responsibility for finding the operations that can be issued in parallel and creating a single very long instruction containing these operations. VLIW instruction decoding is easier than superscalar decoding because of the fixed format and the absence of instruction-dependency checking, although the fixed format can place more limitations on how operations may be combined. In the Intel P6, CISC instructions are translated on chip into a set of micro-operations (i.e., a long instruction word) that can be executed in parallel. As power becomes a major issue in the design of fast microprocessors, the simpler architecture is the better one. VLIW architectures, being simpler than N-issue superscalar machines, can therefore be considered promising for achieving high speed and low power simultaneously.

Architecture Optimization
Two's-complement architecture, correlator example:
– 64 MHz random input, 64 kHz accumulated output, length 1024
– the accumulator acts as a low-pass filter, so the higher-order bits have little switching activity
– the adder nevertheless shows high switching activity: all of its input bits switch each time the input changes sign, because of sign extension
(Figure: block diagram with the in_latched, current_sum and add_out registers clocked at 64 MHz/64 kHz, and bit-position transition-activity profiles of add_out, in_latched and current_sum showing the sign-extension region.)

Architecture Optimization
Sign-magnitude architecture:
– low switching activity in the high-order bits: no sign extension is performed, and the higher-order bits only need an incrementer
– power is not sensitive to very rapid fluctuations in the input data
(Figure: bit-position transition activity of the two's-complement sum versus the sign-magnitude suma + sumb, the sign-bit-controlled, clock-gated POSACC/NEGACC accumulators, and a table comparing power in mW of the two's-complement and sign-magnitude datapaths for constant, ramp, random, and min->max->min input patterns; the numeric values are not preserved in this transcript.)
A small transition-counting sketch follows below.
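A C sketch that makes the transition-count difference concrete (the 14-bit width, the input values, and the pure software bit-flip count are assumed for illustration; in hardware the flips correspond to switched capacitance on the accumulator bits):

    #include <stdio.h>
    #include <stdint.h>

    #define WIDTH 14   /* accumulator width, assumed for illustration */

    static int popcount_w(uint32_t x) {
        int c = 0;
        for (int i = 0; i < WIDTH; i++) c += (x >> i) & 1;
        return c;
    }

    /* 2's-complement encoding: the low WIDTH bits of the value. */
    static uint32_t enc_twos(int v) { return (uint32_t)v & ((1u << WIDTH) - 1); }

    /* Sign-magnitude encoding: MSB is the sign, the rest is |v|. */
    static uint32_t enc_sm(int v) {
        uint32_t mag = (uint32_t)(v < 0 ? -v : v) & ((1u << (WIDTH - 1)) - 1);
        return mag | ((v < 0 ? 1u : 0u) << (WIDTH - 1));
    }

    int main(void)
    {
        /* Small input that keeps crossing zero, like the slide's
         * random correlator input (values are illustrative).      */
        int in[] = { 3, -2, 1, -4, 2, -1, 4, -3, 1, -2 };
        int n = sizeof in / sizeof in[0];
        int sum = 0, t_twos = 0, t_sm = 0;
        uint32_t p2 = enc_twos(0), ps = enc_sm(0);

        for (int i = 0; i < n; i++) {
            sum += in[i];                                 /* accumulate      */
            t_twos += popcount_w(p2 ^ enc_twos(sum));     /* flips, 2's comp */
            t_sm   += popcount_w(ps ^ enc_sm(sum));       /* flips, sign-mag */
            p2 = enc_twos(sum);
            ps = enc_sm(sum);
        }
        printf("bit transitions: 2's complement %d, sign-magnitude %d\n",
               t_twos, t_sm);
        return 0;
    }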

Architecture Optimization
Ordering of input signals:
– the ordering of operations can result in reduced switching activity
– example: multiplication by a constant, IN + (IN >> 7) + (IN >> 8)
– in topology II the output of the first adder has a small amplitude, hence lower switching activity; it switched about 30% less
(Figure: topology I computes IN + (IN >> 7) first, topology II computes (IN >> 7) + (IN >> 8) first; bit-position transition-activity plots compare SUM1 and SUM2 for the two topologies.)
A sketch comparing the two topologies follows below.
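A C sketch comparing the switching activity on the first adder's output for the two topologies (random test inputs; arithmetic right shift of negative values is assumed):

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    /* Count how many bits change between successive values of a signal. */
    static int flips(int16_t prev, int16_t next)
    {
        uint16_t x = (uint16_t)prev ^ (uint16_t)next;
        int c = 0;
        while (x) { c += x & 1; x >>= 1; }
        return c;
    }

    int main(void)
    {
        /* Constant multiplication IN + (IN >> 7) + (IN >> 8), computed
         * with two different association orders.                        */
        int16_t p_i = 0, p_ii = 0;     /* previous intermediate sums     */
        int t_i = 0, t_ii = 0;
        srand(1);
        for (int k = 0; k < 1000; k++) {
            int16_t in = (int16_t)(rand() % 2001 - 1000);   /* test input */

            int16_t s_i  = in + (in >> 7);          /* topology I:  large */
            int16_t s_ii = (in >> 7) + (in >> 8);   /* topology II: small */
            /* the final sums are equal either way:                       */
            /* s_i + (in >> 8) == in + s_ii                               */

            t_i  += flips(p_i,  s_i);    /* activity on the first adder's */
            t_ii += flips(p_ii, s_ii);   /* output in each topology       */
            p_i  = s_i;
            p_ii = s_ii;
        }
        printf("intermediate transitions: topology I %d, topology II %d\n",
               t_i, t_ii);
        return 0;
    }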

Architecture Optimization
Reducing glitching activity:
– static designs can exhibit spurious transitions because of the finite propagation delay from one logic block to the next
– it is important to balance all signal paths and to reduce the logic depth
– multiple-input addition: a chained implementation has about 1.5x the spurious activity of a tree implementation for 4 inputs, and about 2.5x for 8 inputs
(Figure: chained implementation ((A+B)+C)+D versus tree implementation (A+B)+(C+D).)
A structural comparison is sketched below.
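A structural illustration in C (the software results are identical; the comments describe the corresponding hardware adder structures, which is where the glitching difference appears):

    /* Chained: logic depth 3; a late-arriving or glitching A ripples
     * through every adder in the chain.                               */
    static int sum4_chain(int a, int b, int c, int d)
    {
        return ((a + b) + c) + d;
    }

    /* Balanced tree: logic depth 2; shorter, more balanced paths, so
     * fewer spurious transitions propagate.                           */
    static int sum4_tree(int a, int b, int c, int d)
    {
        return (a + b) + (c + d);
    }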

Synchronous vs. Asynchronous Systems
Synchronous system: a signal path starts from a clocked flip-flop, passes through combinational gates, and ends at another clocked flip-flop. The clock signals do not participate in the computation but are required for synchronization. With advances in technology, systems tend to get bigger and bigger, and as a result the delay on the clock wires can no longer be ignored; clock skew is thus becoming a bottleneck for many system designers. Many gates switch unnecessarily just because they are connected to the clock, not because they have to process new inputs, and the biggest gate is the clock driver itself, which must always switch.
Asynchronous (self-timed) system: an input signal (request) starts the computation in a module, and an output signal (acknowledge) signifies the completion of the computation and the availability of the requested data. Asynchronous systems can potentially respond to transitions on any of their inputs at any time, since they have no clock with which to sample their inputs.

Synchronous vs. Asynchronous Systems
Asynchronous designs are more difficult to implement, requiring explicit synchronization between communicating blocks without clocks. If a signal feeds directly into conventional gate-level circuitry, invalid logic levels can propagate through the system, and glitches, which are filtered out by the clock in synchronous designs, may cause an asynchronous design to malfunction. Asynchronous designs are not widely used because designers cannot find the supporting design tools and methodologies they need. Nevertheless, the DCC error corrector of a compact cassette player saves 80% of the power compared to its synchronous counterpart. Asynchronous design also offers more architectural options and freedom: it encourages distributed, localized control and offers more freedom to adapt the supply voltage. (A software model of the request/acknowledge handshake is sketched below.)
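A software model of the four-phase request/acknowledge handshake in C with pthreads (purely illustrative: real self-timed hardware uses completion-detection circuits, not busy-waiting threads; compile with -pthread):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int req = 0, ack = 0;
    static int data;   /* the "bundled data" accompanying the handshake */

    static void *receiver(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 3; i++) {
            while (!atomic_load(&req)) ;        /* wait for request      */
            printf("received %d\n", data);      /* consume the data      */
            atomic_store(&ack, 1);              /* acknowledge           */
            while (atomic_load(&req)) ;         /* wait for req release  */
            atomic_store(&ack, 0);              /* release acknowledge   */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, receiver, NULL);
        for (int i = 0; i < 3; i++) {
            data = 10 * i;                      /* set up bundled data   */
            atomic_store(&req, 1);              /* request               */
            while (!atomic_load(&ack)) ;        /* wait for acknowledge  */
            atomic_store(&req, 0);              /* release request       */
            while (atomic_load(&ack)) ;         /* wait for ack release  */
        }
        pthread_join(t, NULL);
        return 0;
    }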

Asynchronous Modules

Example: ABCS protocol (6% more logic)

Control Synthesis Flow

PIPELINED SELF-TIMED micro P

Programming Style

Speed vs. Power Optimization