Variable Word Width Computation for Low Power


Variable Word Width Computation for Low Power By Bret Victor and Sayf Alalusi

Motivation A 32-bit architecture is required for most general-purpose computing. However, many applications don't need a full 32-bit data word:
Video: 24 bit
Audio: 16 bit
Text: 8 bit
Logic: 1 bit
How can we exploit this to save power?

Possibilities An architecture that supports 32, 24, 16, 8, and 1-bit operations? Or some subset?
Switch the processor between modes, or specify the width for each instruction?
Global or distributed control?
Gated clocks? Don't drive unused outputs? Power down unused blocks?

Implementation Based on the MIPS architecture and ISA.
Two widths: 16 bit and 32 bit.
Width is chosen on an instruction-by-instruction basis: a flag bit in the instruction word selects the width.
Modified ISA:
arithmetic: add16, add32; mul16, mul32
logical: and16, and32
memory: lw16, lw32; sw16, sw32
branch compare: beq16, beq32

Energy Energy consumption occurs when a node transitions, and is proportional to the capacitance at that node. Prevent nodes from transitioning unnecessarily. Energy savings can be calculated by adding all the capacitance that is switching.
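The switched-capacitance model above can be sketched in a few lines. This is an illustrative first-order model, not from the slides: node capacitances, the supply voltage, and the function name are all made up for the example.

```python
# First-order dynamic-energy sketch (illustrative values, not from the
# slides): each node that transitions contributes C * Vdd^2 of energy,
# so savings come from nodes that are prevented from switching.

VDD = 1.8  # assumed supply voltage, volts


def dynamic_energy(node_caps_fF, switched):
    """Sum C * Vdd^2 over the nodes that actually toggled (result in fJ)."""
    return sum(c for c, s in zip(node_caps_fF, switched) if s) * VDD ** 2


caps = [10.0, 10.0, 10.0, 10.0]                       # per-node capacitance, fF
e_full = dynamic_energy(caps, [True] * 4)             # all four nodes switch
e_gated = dynamic_energy(caps, [True, True, False, False])  # high word gated off

savings = 1 - e_gated / e_full
print(f"energy saved by gating half the nodes: {savings:.0%}")
```

With equal per-node capacitances, gating half the nodes saves half the dynamic energy, which is why the slides repeatedly quote "50%" for structures split cleanly into high and low 16-bit halves.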

Where We Save Energy Our design saves energy over a traditional processor in three main areas:
Clock and control line energy
HWTE (High Word Transition Energy)
Memory control energy
We will see these three areas as we step through the pipeline.

Pipeline Overview [Datapath diagram: the five-stage pipeline with IF/ID, ID/EX, EX/MEM, and MEM/WB registers; 32-bit instruction, address, and data buses; 5-bit register specifiers; a branch-address MUX at the PC; and forwarding paths into the EX stage from MEM and WB.]

IF Stage [Diagram: PC, +4 adder, branch-address MUX, and instruction cache, all 32 bits wide, feeding the IF/ID register.]
Instruction words and addresses must be 32 bits, so we can't modify much here.

ID Stage [Diagram: IF/ID register feeding the register file (srcA, srcB, outA, outB), the sign-extended 16-bit immediate, the branch-offset adder, and the equality comparator, into the ID/EX register.]
We can:
gate the clocks of the pipeline register
drive the high words out of the register file only on 32-bit operations

Pipeline Register (ID) [Diagram: the ID/EX register split into 16-bit halves. The low halves of reg A and reg B use the ungated clock; the high halves use a width-gated clock; the 16-bit immediate field uses an immediate-gated clock; destReg: 5 comes from the instruction word.]
Fit the gating into the clock distribution network: little energy overhead, and it helps control skew.
On the ID stage, gating reduces clock energy by:
56% on 16-bit operations
19% on 32-bit non-immediate operations

Register File Read Port (ID) [Diagram: decoder driving each register's output enable, with the high and low 16-bit halves of each register enabled separately; one AND gate per register combines the decoder output with the Width signal.]
The decoder selects which register drives the output bus; we add one AND gate per register. Switching capacitance is dominated by the output bus. A 16-bit operation takes 50% less energy than a 32-bit operation... but that is not necessarily a savings!

EX Stage [Diagram: reg A and reg B pass through forwarding MUXes (from MEM and from WB) into the ALU; the immediate, store-word data, and dest reg: 5 paths continue to the EX/MEM register.]
Modify the ALU to perform 16-bit operations.
Prevent the high-word outputs of the MUXes from changing on 16-bit operations.
Gate the clock of the pipeline register:
Only latch the high word of the ALU result on 32-bit operations
Only latch reg B on "store word" operations

Logical Inst.'s (EX) [Diagram: one gate per bit pair (X0·Y0 through X31·Y31), e.g. X AND Y.]
Just don't let the unused bits (the high 16) transition. If they don't transition, they won't drive the next stage either.
50% less energy

Adder (EX) [Diagram: a carry-lookahead adder built from 4-bit CLA blocks (bits 0-3, 4-7, ..., 28-31), with an upper-level CLA generation stage combining them.]
The 4-bit CLA blocks are simply replicated for the number of bits, but the upper-level CLA structure grows with the number of bits.
16 bits: 58% less energy

Multiplier (EX) A 32-bit shift-add multiply: 32 32-bit adds, 32 32-bit register writes, and 32 shifts, in 32 cycles.
A 16-bit multiply: 16 16-bit adds, 16 16-bit register writes, and 16 shifts, in 16 cycles.
Multiply complexity grows as N², so a 16-bit multiply takes 77% less energy.
Even if the upper 16 bits are 0, a 32-bit multiply still does 16 extra shifts.
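The N² scaling can be checked with a quick back-of-the-envelope sketch. Under a pure N² cost model a 16-bit multiply saves 75% rather than the 77% the slides report, which is plausible since the slides' figure includes effects the idealized model ignores; the function below is illustrative only.

```python
# Sketch: shift-add multiply cost under a pure N^2 model. N cycles, each
# performing an N-bit add and an N-bit shift, gives cost ~ N * N.
# (Illustrative model; the slides quote 77% from a more detailed estimate.)

def shift_add_cost(n_bits):
    """Rough operation count for an n_bits x n_bits shift-add multiply."""
    return n_bits * n_bits

saving = 1 - shift_add_cost(16) / shift_add_cost(32)
print(f"16-bit vs 32-bit multiply saving (idealized): {saving:.0%}")
```

The idealized model prints 75%, in the same ballpark as the slides' 77%.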

HWTE HWTE = High Word Transition Energy
Two types of data in a 16-bit application:
Computational data (16-bit): high word = 0
Pointers and addresses (32-bit): high word = C
Assume C is "mostly constant" (memory accesses mostly within a 64K block).
A traditional processor only consumes more datapath energy than our processor when transitioning between these data types.

HWTE With such a model, our processor effectively executes only "16-bit operations". The traditional processor effectively executes "32-bit operations" only when transitioning between data types.
E32 = energy of a 32-bit operation
E16 = energy of a 16-bit operation
N = average number of consecutive instructions that use the same data type
HWTE = (E32 - E16) / N
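The HWTE formula from the slide amortizes the high-word transition cost over a run of same-type instructions. A minimal sketch, using made-up energy units for E32 and E16:

```python
# Sketch of the slide's HWTE formula: the per-instruction penalty of a
# traditional processor shrinks as runs of same-width data get longer.
# E32/E16 values below are arbitrary units for illustration.

def hwte(e32, e16, n):
    """High Word Transition Energy, amortized over a run of n instructions."""
    return (e32 - e16) / n

# Long runs of one data type: small per-instruction penalty.
print(hwte(2.0, 1.0, 10))
# Data types alternating every instruction: the full 32-bit penalty.
print(hwte(2.0, 1.0, 1))
```

This makes the slide's point concrete: when N is large (accesses stay within one data type, e.g. a 64K block), the traditional processor's extra datapath energy per instruction approaches zero.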

Barrel Shifter (EX) [Diagram: a 4-bit barrel-shifter slice with inputs A0-A3 and B0-B3 and shift-select control lines SH0-SH3.]
The big win comes from not driving the control lines to the upper 16 bits.
Save about 50% in energy

MEM Stage [Diagram: EX/MEM register driving the data cache: 32-bit address, write data, and read data, with dest reg: 5 passing through to MEM/WB.]
This is a big, regular memory (SRAM) structure that can easily be segmented into blocks. Exploit this fact.

DCache (MEM) [Diagram: 2-way set-associative cache array, indexed by block number, with a Width signal selecting word lines.]
Only drive the word line that you need!
2-way set associative, write-back.
Blocks are 2 x 32b or 4 x 16b, i.e. 16-bit data values are aligned on 16-bit boundaries and 32-bit values on 32-bit boundaries.

DCache (MEM) Only drive the word lines that are needed. A little logic is needed to figure out which lines are correct, but the large word-line capacitance dominates.
The block size is larger for 16-bit values, which better exploits spatial locality.
Associativity does not change between 16-bit and 32-bit word lengths.
Energy savings: 50%. These are control-line savings, so no HWTE!
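The word-line selection logic described above can be sketched as a small decode step. The layout below (four 16-bit slots per set, with the alignment rules the slides state) and the function name are assumptions for illustration, not the authors' actual circuit:

```python
# Sketch: which 16-bit word lines to drive for a cache access, assuming a
# hypothetical layout of four 16-bit slots per set (2 x 32b or 4 x 16b),
# with 16-bit values 16-bit aligned and 32-bit values 32-bit aligned.

def word_lines_to_drive(addr, width_bits):
    """Return the indices of the 16-bit word lines needed for this access."""
    half = (addr >> 1) & 0x3            # which 16-bit slot within the set
    if width_bits == 16:
        return [half]                    # drive one line instead of two: ~50% saved
    assert addr % 4 == 0, "32-bit values are 32-bit aligned"
    return [half, half + 1]              # low and high halves of the 32-bit word

print(word_lines_to_drive(0x1002, 16))
print(word_lines_to_drive(0x1004, 32))
```

Because the choice depends only on the address and the width flag, not on data values, this is a pure control-line saving, which is why no HWTE term applies here.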

WB Stage [Diagram: MEM/WB register; a MUX selects between the 32-bit memory data and the 32-bit ALU result and drives the register-file write port (dest reg: 5).]
On a 16-bit operation, we can:
Only drive the low word out of the MUX. The capacitive load on the register write port is large, and driving 16 bits out of the MUX consumes 50% less energy than driving 32 bits... the HWTE formula applies.
Only latch the low word into the register?

Reg. File Write Port (WB) [Diagram: decoder driving each register's write clock, with the high and low 16-bit halves clocked separately; a HiWrite signal gates the high-half clock.]
We could add one AND gate for each register, but a 16-bit write uses the same amount of clock energy as a 32-bit write without modifications. There is little savings from not writing into the register, because the high word would not change in a 16-bit application anyway.
Not worth it!

Summary Typical power distribution in the core (non-memory), as fraction of power x remaining energy after our modifications:
ALU: 34% x 66%
I-decode: 23% x 100%
Register file: 13% x 66%
Clock: 10% x 50%
Shifter: 11% x 50%
Pipeline: 9% x 74%
Core energy reduced by 29%.

Summary Typical power distribution in memory:
Instruction cache: 60% x 100%
Data cache: 40% x 50%
Cache energy reduced by 20%.
Total processor power consumption:
Cache: 66% x 80%
Core: 33% x 71%
Total energy reduced by 24% when executing a 16-bit application.

Conclusions The primary drawback is the modification of the ISA.
The energy savings are reasonable.
Our modifications are fairly easy to implement and can fit into existing processor designs with minimal area increase.

Where do we go from here? More accurate capacitance models and SPICE simulation
More accurate models of the instruction mix