Very low power pipelines using significance compression. R. Canal, A. González, J. E. Smith. Dept. d'Arquitectura de Computadors, Univ. Politècnica de Catalunya


Very low power pipelines using significance compression. R. Canal, A. González, J. E. Smith. Dept. d'Arquitectura de Computadors, Univ. Politècnica de Catalunya, Barcelona, Spain. 33rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-33), 2000.

Abstract
- Data, addresses, and instructions are compressed by maintaining only the significant bytes, with two or three extension bits appended to indicate the significant byte positions.
- This significance compression method is integrated into a 5-stage pipeline, with the extension bits flowing down the pipeline to enable pipeline operations only for the significant bytes.
- Consequently, register logic and cache activity (and hence dynamic power) are substantially reduced.

Abstract (cont.)
- An initial trace-driven study shows activity reductions of approximately 30-40% in each pipeline stage.
- Several pipeline organizations are studied. A byte-serial pipeline is the simplest implementation, but suffers a CPI (cycles per instruction) increase of 79% compared with a conventional 32-bit pipeline.
- Widening certain pipeline stages to balance processing bandwidth leads to an implementation with a CPI 24% higher than the baseline 32-bit design.
- Finally, full-width pipeline stages with operand gating achieve a CPI within 2-6% of the baseline 32-bit pipeline.

What's the problem?
- Energy consumption is the most critical design constraint in some microprocessor applications.
- For example, in battery-powered embedded applications, energy saving is more important than performance.

Approaches to reducing the activity level
- Dynamic energy consumption is proportional to switching activity.
- Gate off execution units when they are not in use.
- Reduce the activity of memory accesses.
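The proportionality in the first bullet is the standard CMOS dynamic-power relation P_dyn = alpha * C * V^2 * f, where alpha is the switching-activity factor. A minimal sketch of the scaling (the component values below are hypothetical, chosen only for illustration):

```python
# Standard CMOS dynamic-power relation: P_dyn = alpha * C * V^2 * f,
# where alpha is the switching-activity factor.

def dynamic_power(alpha, capacitance, voltage, frequency):
    """Dynamic power in watts for a given activity factor alpha."""
    return alpha * capacitance * voltage ** 2 * frequency

# Hypothetical numbers: 1 nF switched capacitance, 1.8 V supply, 500 MHz clock.
baseline = dynamic_power(alpha=0.5, capacitance=1e-9, voltage=1.8, frequency=500e6)
gated    = dynamic_power(alpha=0.3, capacitance=1e-9, voltage=1.8, frequency=500e6)

# A 40% drop in switching activity gives a 40% drop in dynamic power.
print(round(1 - gated / baseline, 2))  # prints 0.4
```

This is why the paper targets activity: with C, V, and f fixed, dynamic power falls linearly with the fraction of bytes the pipeline actually processes.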

Data representation
- Data values often have only a few numerically significant low-order bits (e.g., two's-complement integer data).
  - Use two extension bits to compress:
  - ex. 00 00 00 04 -> 04 : 11
  - ex. FF FF F5 04 -> F5 04 : 10
- Address data usually has some internal bits that are insignificant.
  - Use three extension bits to compress:
  - ex. 00 00 E7 04 -> E7 04 : 011
  - ex. FF E7 00 04 -> E7 04 : 101
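The two-extension-bit scheme for data can be sketched as a small codec: drop high-order bytes that are pure sign extension of the byte below them, and record how many bytes remain. The mapping of byte counts to extension-bit patterns below follows the slide's examples (1 byte -> 11, 2 bytes -> 10); the full encoding is an assumption, not necessarily the paper's exact one.

```python
# Sketch of two-extension-bit significance compression for 32-bit data.
# The extension-bit encoding (11 -> 1 byte, 10 -> 2, 01 -> 3, 00 -> 4) is an
# assumption consistent with the slide's examples.

def compress(word):
    """Return (significant_bytes, ext) for a 32-bit two's-complement word."""
    b = [(word >> (8 * i)) & 0xFF for i in range(4)]  # b[0] = low byte
    n = 4
    # Drop high bytes that are pure sign extension of the byte below them.
    while n > 1:
        sign = 0xFF if b[n - 2] & 0x80 else 0x00
        if b[n - 1] == sign:
            n -= 1
        else:
            break
    ext = {1: 0b11, 2: 0b10, 3: 0b01, 4: 0b00}[n]
    return b[:n], ext

def decompress(sig_bytes, ext):
    """Rebuild the 32-bit word by sign-extending the top significant byte."""
    word = 0
    for i, byte in enumerate(sig_bytes):
        word |= byte << (8 * i)
    if sig_bytes[-1] & 0x80:  # negative: fill the upper bytes with FF
        for i in range(len(sig_bytes), 4):
            word |= 0xFF << (8 * i)
    return word

print(compress(0x00000004))  # prints ([4], 3)    i.e. byte 04, ext 11
print(compress(0xFFFFF504))  # prints ([4, 245], 2)  i.e. F5 04, ext 10
```

Only the surviving bytes (plus two extension bits) need to move down the pipeline; decompression is a cheap sign extension on the way out.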

PC increment
- Use a block-serial implementation: the higher-order bits change infrequently.
- The block size N can range from 1 to 30 bits.
- With N = 5, activity is reduced by 83% at a performance loss of 3%.
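A block-serial incrementer always touches the lowest N-bit block, but activates a higher block only when a carry ripples into it. A minimal sketch, assuming N = 5 as in the slide (for simplicity it steps the PC by 1 rather than 4, and the "blocks activated" count is an illustrative metric, not the paper's exact activity model):

```python
# Sketch of a block-serial PC increment: only the low N-bit block is always
# active; each higher block is touched only when a carry ripples into it.

def block_serial_increment(pc, n=5, width=32):
    """Increment pc block by block; return (new_pc, blocks_activated)."""
    mask = (1 << n) - 1
    shift, carry, activated = 0, 1, 0
    while carry and shift < width:
        activated += 1
        block = ((pc >> shift) & mask) + carry
        carry = block >> n                      # carry out of this block?
        pc = (pc & ~(mask << shift)) | ((block & mask) << shift)
        shift += n
    return pc & ((1 << width) - 1), activated

# Most increments touch only the lowest block:
print(block_serial_increment(0x00400018))  # prints (4194329, 1)
# A carry out of the low block activates the next one:
print(block_serial_increment(0x0040001F))  # prints (4194336, 2)
```

Since the low block carries out only once every 2^N increments, larger N makes upper-block activity rarer at the cost of a longer serial worst case, which is the trade-off behind the 83% activity / 3% performance numbers above.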

Instruction cache
- R-format: recode the function field and exchange it with the shift (shamt) field.
  - The eight common cases use a three-bit function encoding.
  - Shifts do not use the rs field.
- I-format: divide the immediate field into two parts.
  - 59.1% of instructions use an immediate value, and 80% of those need only 8 bits.

ALU operations
- Case 1: both operand bytes are significant.
  - A full byte addition must be performed.
- Case 2: only one operand has a significant byte in the position being added.
  - Non-significant byte all 0s (all 1s) with carry-in 0 (1): the result equals the significant byte.
  - Non-significant byte all 0s (all 1s) with carry-in 1 (0): the result is the significant byte plus 1 (minus 1).
- Case 3: neither operand has a significant byte in the position being added.
  - The result depends only on the carry C_{i-1}.
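The three cases reduce to one observation: a real byte-wide add is needed only where at least one operand byte is significant; where both bytes are mere sign extensions, the sum byte follows from the two sign bytes and the carry-in alone. An illustrative model of this (not the paper's actual circuit; in hardware the case-3 result would come from a tiny lookup rather than an adder):

```python
# Sketch of byte-wise addition over significance-compressed operands.
# A full byte add fires only when at least one operand byte is significant.

def byte_serial_add(a, b, a_sig, b_sig):
    """a, b: little-endian 4-byte operands; a_sig[i] marks significant bytes.
    Returns (result bytes, number of full byte adds actually performed)."""
    out, carry, full_adds = [], 0, 0
    for i in range(4):
        if a_sig[i] or b_sig[i]:
            # cases 1 and 2: at least one significant byte -> use the adder
            full_adds += 1
            s = a[i] + b[i] + carry
        else:
            # case 3: both bytes are sign extensions (0x00 or 0xFF), so the
            # sum byte is determined by the sign bytes and carry-in alone,
            # e.g. 00+FF+1 -> 00 with carry out; FF+FF+0 -> the exceptional
            # full byte value FE. (Hardware would use a lookup, not an adder.)
            s = a[i] + b[i] + carry
        out.append(s & 0xFF)
        carry = s >> 8
    return out, full_adds

# 4 + (-11): only the low byte of each operand is significant, so a single
# byte add suffices; the upper bytes resolve from sign extension plus carry.
a = [0x04, 0x00, 0x00, 0x00]; a_sig = [True, False, False, False]
b = [0xF5, 0xFF, 0xFF, 0xFF]; b_sig = [True, False, False, False]
print(byte_serial_add(a, b, a_sig, b_sig))  # prints ([249, 255, 255, 255], 1)
```

The result bytes 249, 255, 255, 255 encode -7, with only one of four byte positions doing real adder work.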

ALU operations (cont.)
- When A_i and B_i are both sign extensions of their preceding bytes A_{i-1} and B_{i-1}:
  - C_i = A_i + B_i
  - In general, C_i is not significant.
  - In the exceptional cases, the ALU generates a full byte value.

Data cache operation
- Extension bits are appended to each data word.
- Only the bytes containing significant data are read and written.
- The low-order tag bits are compared first; if they do not match, an early miss can be signaled.
- Since miss rates are usually low, this early-miss method is not very effective.
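The early-miss idea from the third bullet can be sketched as a two-step tag compare: check a few low-order tag bits first and declare a miss as soon as they differ, so the full-width comparator fires only when the low bits match. The field width below is illustrative, not the paper's cache parameters:

```python
# Sketch of an early-miss tag compare: the low-order tag bits are checked
# first; only when they match does the full-width compare activate.

def tag_match(stored_tag, request_tag, low_bits=8):
    """Return (hit, full_compare_done)."""
    low_mask = (1 << low_bits) - 1
    if (stored_tag & low_mask) != (request_tag & low_mask):
        return False, False          # early miss: full compare never fires
    return stored_tag == request_tag, True

print(tag_match(0x1A2B3C, 0x1A2B3D))  # prints (False, False): low bits differ
print(tag_match(0x1A2B3C, 0x1B2B3C))  # prints (False, True): full compare needed
print(tag_match(0x1A2B3C, 0x1A2B3C))  # prints (True, True)
```

Because hits dominate in practice, the full compare still runs on almost every access, which is why the slide concludes the technique saves little.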

Activity performance
[figure in the original slides: per-stage activity reduction]

Byte-serial implementation
- One-byte-wide datapath.
- Operations take multiple cycles when the data requires it.
- One-byte-wide PC increment unit.
- Three-byte-wide instruction cache, to avoid excessive fetch stalls.

Byte-serial implementation (cont.)
[diagram in the original slides: byte-serial implementation]

Semi-parallel implementation
- Reduces the performance losses of the byte-serial design.
- Two-byte-wide register files.
- Multiple byte-wide ALUs; an ALU can be disabled when it is not in use.
- Adds an extra pipeline stage.
- 72% of stalls are caused by structural hazards in the EX stage.

Semi-parallel implementation (cont.)
[diagram in the original slides: byte semi-parallel implementation]

Fully parallel implementations
- Byte-parallel skewed pipeline:
  - 4-byte parallelism at each stage.
  - Works in a similar way to the semi-parallel implementation.
  - Optimized for the long-data cases.
- Byte-parallel compressed pipeline:
  - 4-byte parallelism at each stage.
  - Consists of the original 5 stages.
  - Spends one more cycle in the same stage to read additional data.
  - Stores use only a single cycle.
  - Works well for short data.

Fully parallel implementations (cont.)
[diagram in the original slides: the two fully parallel implementations]

Performance of the three implementations
[figure in the original slides]

Conclusion
- The paper proposes a number of pipeline implementations that achieve these low activity levels (roughly 30-40% reductions per stage) while providing a reasonable level of performance.