Re-configurable Bus Encoding Scheme for Reducing Power Consumption of the Cross Coupling Capacitance for Deep Sub-micron Instructions Bus Siu-Kei Wong.

Presentation transcript:

Re-configurable Bus Encoding Scheme for Reducing Power Consumption of the Cross Coupling Capacitance for Deep Sub-micron Instructions Bus Siu-Kei Wong and Chi-ying Tsui

2/18 Outline
- Introduction
- Bus model and embedded system model
- Overview of the encoding scheme
- Static bus encoding scheme
- Dynamic bus encoding scheme
- Experimental results and comparison
- Conclusions

3/18 Bus model and embedded system model
Bus model:
- Cc: cross-coupling capacitances
- Cs: stand-alone capacitances
- Y: cross-coupling switching
- X: bit-line switching

4/18 Bus model and embedded system model (cont.)
i: bit line index; j: cycle index
$X_{ij} = \begin{cases} 1, & \text{if bit line } i \text{ makes a 0-to-1 transition in cycle } j \\ 0, & \text{otherwise} \end{cases}$
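The transcript defines X precisely but gives no formula for Y or for the bus energy. The sketch below computes X exactly as defined above. As an assumption of ours, not the authors' model, it uses |Δv_i − Δv_{i+1}| (0, 1 or 2 per cycle) as a proxy for the cross-coupling switching Y between adjacent lines i and i+1, and hypothetical weights `cs` and `cc` standing in for Cs and Cc.

```python
from typing import List, Sequence


def bit(word: int, i: int) -> int:
    """Value of bit line i (bit 0 = least significant) in a bus word."""
    return (word >> i) & 1


def rising_transitions(trace: Sequence[int], width: int) -> List[List[int]]:
    """X[i][j] = 1 if bit line i makes a 0 -> 1 transition in cycle j, else 0.

    Cycle 0 has no previous word, so column 0 is all zeros.
    """
    X = [[0] * len(trace) for _ in range(width)]
    for j in range(1, len(trace)):
        for i in range(width):
            if bit(trace[j - 1], i) == 0 and bit(trace[j], i) == 1:
                X[i][j] = 1
    return X


def coupling_activity(trace: Sequence[int], width: int) -> List[List[int]]:
    """ASSUMED proxy for Y: change of the voltage difference between adjacent
    lines i and i+1, |d_i - d_(i+1)| with d = new - old (0, 1 or 2)."""
    Y = [[0] * len(trace) for _ in range(width - 1)]
    for j in range(1, len(trace)):
        d = [bit(trace[j], i) - bit(trace[j - 1], i) for i in range(width)]
        for i in range(width - 1):
            Y[i][j] = abs(d[i] - d[i + 1])
    return Y


def bus_cost(trace: Sequence[int], width: int, cs: float, cc: float) -> float:
    """Hypothetical weighted activity: cs * sum(X) + cc * sum(Y)."""
    x_total = sum(map(sum, rising_transitions(trace, width)))
    y_total = sum(map(sum, coupling_activity(trace, width)))
    return cs * x_total + cc * y_total
```

For the trace 0b00, 0b01, 0b10, 0b11 on a 2-bit bus, X contains three 1s and the assumed Y proxy sums to 4, because the middle cycle toggles the two lines in opposite directions.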

5/18 Bus model and embedded system model (cont.)

6/18 Overview of the encoding scheme
- Goal: reduce instruction bus energy
- Instructions are encoded off-line, during compilation (static or dynamic encoding)
- The decoding information is attached to the program
- The decoding information is loaded into a lookup table and used when the program executes
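The flow on this slide (encode off-line, attach decoding information, decode at run time) can be sketched as follows. We assume the decoding information is an inversion mask plus a bit-line permutation, the two transformations of the static scheme. `encode_word`, `decode_word` and `encode_program` are hypothetical names, and `perm[p]` is our convention for the original bit line driven onto bus line p.

```python
from typing import List, Sequence, Tuple


def encode_word(word: int, mask: int, perm: Sequence[int]) -> int:
    """Invert the bit lines selected by `mask`, then drive original bit
    perm[p] onto bus line p."""
    inverted = word ^ mask
    out = 0
    for p, src in enumerate(perm):
        out |= ((inverted >> src) & 1) << p
    return out


def decode_word(bus_word: int, mask: int, perm: Sequence[int]) -> int:
    """Run-time side: the crossbar undoes the permutation, then the
    inverters undo the mask."""
    restored = 0
    for p, src in enumerate(perm):
        restored |= ((bus_word >> p) & 1) << src
    return restored ^ mask


def encode_program(
    words: Sequence[int], mask: int, perm: Sequence[int]
) -> Tuple[Tuple[int, List[int]], List[int]]:
    """Off-line (compile-time) step: return the decoding information to attach
    to the program, plus the encoded instruction image."""
    return (mask, list(perm)), [encode_word(w, mask, perm) for w in words]
```

Because decoding only needs the attached `(mask, perm)` pair, the processor side reduces to table lookups, a crossbar and inverters, with no run-time search.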

7/18 Static bus encoding scheme
Phase one:
- Invert a set of bit lines
- Problem formulation: a graph optimization problem
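The slides state only that phase one is formulated as a graph optimization problem. As a stand-in, the sketch below reuses an assumed cost model: `cs` per 0-to-1 transition and `cc` per unit of a |Δv_i − Δv_{i+1}| coupling proxy. Inverting line i swaps its rising and falling transitions and negates its Δv, so for a fixed line order the cost is a chain of per-line and adjacent-pair terms, which a Viterbi-style dynamic program minimizes exactly. This is our substitute technique, not the authors' algorithm, and all names are hypothetical.

```python
from typing import List, Sequence


def line_deltas(trace: Sequence[int], width: int) -> List[List[int]]:
    """d[i][j-1] = new - old value of bit line i between cycles j-1 and j."""
    return [[((trace[j] >> i) & 1) - ((trace[j - 1] >> i) & 1)
             for j in range(1, len(trace))] for i in range(width)]


def unary_cost(d_line: List[int], inverted: int, cs: float) -> float:
    """cs times the 0 -> 1 transitions of one line; inversion turns -1 into +1."""
    sign = -1 if inverted else 1
    return cs * sum(1 for x in d_line if sign * x == 1)


def pair_cost(d_a: List[int], d_b: List[int], inv_a: int, inv_b: int,
              cc: float) -> float:
    """cc times the assumed coupling proxy |d_a - d_b| summed over cycles."""
    sa, sb = (-1 if inv_a else 1), (-1 if inv_b else 1)
    return cc * sum(abs(sa * x - sb * y) for x, y in zip(d_a, d_b))


def total_cost(trace: Sequence[int], width: int, mask: int,
               cs: float = 1.0, cc: float = 1.0) -> float:
    """Cost of the trace after XOR-ing every word with `mask`."""
    d = line_deltas(trace, width)
    inv = [(mask >> i) & 1 for i in range(width)]
    return (sum(unary_cost(d[i], inv[i], cs) for i in range(width))
            + sum(pair_cost(d[i], d[i + 1], inv[i], inv[i + 1], cc)
                  for i in range(width - 1)))


def choose_inversions(trace: Sequence[int], width: int,
                      cs: float = 1.0, cc: float = 1.0) -> int:
    """Exact DP over the chain of adjacent lines; returns the inversion mask."""
    d = line_deltas(trace, width)
    # best[s] = (lowest cost for lines 0..i with line i in state s, mask used)
    best = {s: (unary_cost(d[0], s, cs), s) for s in (0, 1)}
    for i in range(1, width):
        best = {
            s: min((cost + pair_cost(d[i - 1], d[i], t, s, cc)
                    + unary_cost(d[i], s, cs), mask | (s << i))
                   for t, (cost, mask) in best.items())
            for s in (0, 1)
        }
    return min(best.values())[1]
```

For a two-word trace 0b00, 0b11 the DP inverts both lines (mask 0b11): the rises become falls and the two lines still move together, so the assumed cost drops to zero.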

8/18 Static bus encoding scheme
Phase two:
- Rearrange the order of the bit lines
- Graph optimization problem on a completely connected undirected graph
- Traveling salesman problem
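Reordering the lines does not change how often each line rises, so under an assumed cost model only the coupling term depends on the order. If every pair of lines (a, b) is weighted by the coupling activity they would cause when routed side by side, phase two becomes a search for a cheapest Hamiltonian path in a complete undirected graph, the path form of the traveling salesman problem. The slides do not name the solver. This sketch uses a nearest-neighbour heuristic tried from every start line, the |d_a − d_b| weight is our assumed proxy, and `order[p]` is the original line placed at bus position p (hypothetical convention).

```python
from typing import List, Sequence


def line_deltas(trace: Sequence[int], width: int) -> List[List[int]]:
    """d[i][j-1] = new - old value of bit line i between cycles j-1 and j."""
    return [[((trace[j] >> i) & 1) - ((trace[j - 1] >> i) & 1)
             for j in range(1, len(trace))] for i in range(width)]


def coupling_weights(trace: Sequence[int], width: int) -> List[List[int]]:
    """W[a][b]: assumed coupling proxy sum |d_a - d_b| if a and b were adjacent."""
    d = line_deltas(trace, width)
    return [[sum(abs(x - y) for x, y in zip(d[a], d[b])) for b in range(width)]
            for a in range(width)]


def path_cost(order: Sequence[int], W: List[List[int]]) -> int:
    """Total weight of the edges between consecutive lines in `order`."""
    return sum(W[order[k]][order[k + 1]] for k in range(len(order) - 1))


def nearest_neighbour_order(trace: Sequence[int], width: int) -> List[int]:
    """Heuristic Hamiltonian path: from each start, repeatedly append the
    cheapest unvisited line; keep the cheapest path found."""
    W = coupling_weights(trace, width)
    best: List[int] = []
    for start in range(width):
        order, left = [start], set(range(width)) - {start}
        while left:
            last = order[-1]
            nxt = min(left, key=lambda b: (W[last][b], b))  # ties -> lower index
            order.append(nxt)
            left.remove(nxt)
        if not best or path_cost(order, W) < path_cost(best, W):
            best = order
    return best
```

On a 3-bit trace where lines 0 and 2 always toggle together and line 1 toggles against them, the heuristic places 0 and 2 side by side, halving the assumed coupling cost of the original order. A nearest-neighbour path is not guaranteed optimal; an exact TSP solver could be swapped in for small widths.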

9/18 Static bus encoding scheme
Required overhead:
- Extra bus transition activity: sending the decoding information (32 × log2(32) bits for the permutation) takes 6 cycles
- A 32-bit mux-based crossbar
- Inverting back the flipped bit lines
Required hardware:
- A crossbar for rearranging the bit lines
- Inverters for inverting back the flipped bit lines
- A set of registers
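A quick check of the overhead figure: naming, for each of 32 bus positions, which of the 32 source lines it carries takes 32 × log2 32 = 160 bits, or 5 cycles on a 32-bit bus. The slide gives 6 cycles, i.e. 192 bits. Our assumption, not stated on the slide, is that the extra 32 bits are the phase-one inversion mask. The function names are hypothetical.

```python
import math


def permutation_bits(width: int) -> int:
    """Bits to store one bit-line permutation: width entries of ceil(log2 width) bits."""
    return width * math.ceil(math.log2(width))


def static_overhead_cycles(width: int = 32, with_inversion_mask: bool = True) -> int:
    """Bus cycles to send the static decoding information over a width-bit bus.

    Assumption: the information is one permutation plus, optionally,
    a width-bit inversion mask.
    """
    total_bits = permutation_bits(width) + (width if with_inversion_mask else 0)
    return math.ceil(total_bits / width)
```

With the mask the estimate matches the slide's 6 cycles; without it the permutation alone needs 5, the same per-permutation figure that appears in the dynamic scheme's 5m cycles.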

10/18 Dynamic bus encoding scheme
Encoding strategy:
- Multiple permutations are generated
- Two strategies
- Phase two of the static bus encoding scheme
- The number of permutations is based on the number of blocks in the cache: one permutation per block
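One reading of "one permutation for one block" is sketched below: instruction fetches are grouped by the cache block they map to, and each group gets its own bit-line order. The direct-mapped index `(addr // block_bytes) % num_blocks`, the function names and the `identity_order` placeholder are assumptions. In practice `choose_order` would be a phase-two ordering routine, such as a TSP heuristic, run on that block's words.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


def cache_block_index(addr: int, block_bytes: int, num_blocks: int) -> int:
    """Cache block an instruction address maps to (direct-mapped assumption)."""
    return (addr // block_bytes) % num_blocks


def identity_order(words: List[int], width: int) -> List[int]:
    """Placeholder ordering routine: keeps the original bit-line order."""
    return list(range(width))


def per_block_permutations(
    fetches: Sequence[Tuple[int, int]],
    width: int,
    block_bytes: int,
    num_blocks: int,
    choose_order: Callable[[List[int], int], List[int]] = identity_order,
) -> Dict[int, List[int]]:
    """Group (address, word) fetches by cache block; one permutation per block."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for addr, word in fetches:
        groups[cache_block_index(addr, block_bytes, num_blocks)].append(word)
    return {blk: choose_order(words, width) for blk, words in groups.items()}
```

With 32-byte blocks and two blocks, addresses 0x00, 0x04 and 0x40 share block 0 while 0x20 and 0x24 fall in block 1, so each set of words is optimized separately.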

11/18 Dynamic bus encoding scheme Decoding strategy
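The transcript carries only the title of this slide. A plausible decoder model, consistent with the lookup table and crossbar listed as dynamic-scheme hardware, is sketched below: the per-block permutations sit in a table indexed by cache block, and the crossbar routes each bus line back to its original position. The class name, the direct-mapped indexing and the `perm[p]` convention are our assumptions.

```python
from typing import Dict, List, Sequence


class PermutationLUT:
    """Hypothetical decoder model: per-cache-block permutations plus a crossbar."""

    def __init__(self, table: Dict[int, Sequence[int]], block_bytes: int,
                 num_blocks: int):
        self.table = {blk: list(perm) for blk, perm in table.items()}
        self.block_bytes = block_bytes
        self.num_blocks = num_blocks

    def _perm(self, addr: int) -> List[int]:
        # Direct-mapped block index (assumption) selects the table entry.
        return self.table[(addr // self.block_bytes) % self.num_blocks]

    def encode(self, addr: int, word: int) -> int:
        """Compile-time side: original bit perm[p] is driven on bus line p."""
        out = 0
        for p, src in enumerate(self._perm(addr)):
            out |= ((word >> src) & 1) << p
        return out

    def decode(self, addr: int, bus_word: int) -> int:
        """Run-time side: the crossbar sends bus line p back to line perm[p]."""
        out = 0
        for p, src in enumerate(self._perm(addr)):
            out |= ((bus_word >> p) & 1) << src
        return out
```

Only the table lookup depends on the address; the crossbar itself is the same as in the static scheme, just reconfigured per block.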

12/18 Dynamic bus encoding scheme
Required overhead:
- Sending m sets of decoding information: 32 × log2(32) bits each, taking 5m cycles
- Additional hardware for decoding: a lookup table and a crossbar
- Reduced dynamic encoding: k groups, where k = 32/n
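The arithmetic behind "5m cycles" can be reproduced as follows. Each of the m permutations is 32 × log2 32 = 160 bits, which is 5 cycles on a 32-bit bus, and the lookup table then holds m × 160 bits. The sketch does not model the reduced scheme with k = 32/n groups, whose exact encoding the slide does not spell out; the function name is hypothetical.

```python
import math
from typing import Tuple


def dynamic_overhead(m: int, width: int = 32) -> Tuple[int, int]:
    """Return (bus cycles, lookup-table bits) for m per-block permutations."""
    bits_per_perm = width * math.ceil(math.log2(width))  # 160 for width 32
    cycles = m * math.ceil(bits_per_perm / width)         # 5 per permutation
    return cycles, m * bits_per_perm
```

For example, m = 8 permutations cost 40 bus cycles and 1,280 bits of table storage under these assumptions.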

13/18 Experimental results
- ARM processor architecture
- 32-bit bus
- Bus length: 20 mm for memory, 15 mm for cache
- 0.07 µm technology with a 1 V power supply
- Three different cache block sizes

14/18 Experimental results: static encoding scheme

15/18 Experimental results: dynamic encoding scheme

16/18 Experimental results: comparison with previous work

17/18 Experimental results

18/18 Conclusions
- Both static and dynamic encoding schemes
- Software encoding, performed at compilation time
- Good experimental results