Improving the Efficiency of Memory Partitioning by Address Clustering. Alberto Macii, Enrico Macii, Massimo Poncino. Proceedings of the Design, Automation and.


Improving the Efficiency of Memory Partitioning by Address Clustering
Alberto Macii, Enrico Macii, Massimo Poncino
Proceedings of the Design, Automation and Test in Europe Conference and Exhibition
Presenter: Hung Yu Chen

Hung-Yu Chen 2/ /6/21
Abstract
Memory partitioning is an effective approach to memory energy optimization in embedded systems. Spatial locality of the memory address profile is the key property that partitioning exploits to determine an efficient multi-bank memory architecture. This paper presents an approach, called address clustering, for increasing the locality of a given memory access profile, and thus improving the efficiency of partitioning. Results obtained on several embedded applications running on an ARM7 core show average energy reductions of 25% (maximum 57%) with respect to a partitioned memory architecture synthesized without resorting to address clustering.

Outline
- What's the problem?
- Memory Energy
- Memory Partitioning
- Address Clustering
- Experimental Result
- Conclusions

What's the problem?
- Modern SoC platforms usually contain one or more processors.
- The gap between processor and memory speed keeps increasing.
- Various types of on-chip embedded memories provide shorter latencies and wider interfaces.
- Problem: the ubiquity of embedded memories makes them the largest contributor to the overall energy budget of a chip.

Memory Energy
- Model: $E_{mem} = \sum_{i=1}^{N} \mathrm{Cost}(i)$
  - $N$: number of accesses during the computation.
  - $\mathrm{Cost}(i)$: cost of access $i$, determined by the memory organization and by the cost of the physical access given by the technology.
- Memory energy optimization:
  1. Reduce Cost(i): build a low-energy memory architecture.
  2. Reduce N: modify the memory access pattern.
  3. Do both.
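The energy model on this slide can be illustrated with a minimal Python sketch. The function names, the two-bank split at address 256, and the per-access cost values are hypothetical placeholders chosen for illustration; none of them come from the paper.

```python
def memory_energy(trace, cost):
    """Sum Cost(i) over every access in the trace (the E_mem model)."""
    return sum(cost(addr) for addr in trace)


def example_cost(addr):
    """Hypothetical cost: accesses to a small bank are cheaper than to a large one."""
    return 1.0 if addr < 256 else 4.0  # arbitrary energy units, illustration only


# A toy access trace: five accesses hit the cheap bank, one hits the expensive bank.
trace = [0, 1, 1, 300, 2, 1]
total = memory_energy(trace, example_cost)
```

In this toy setting, lowering the per-access cost corresponds to option 1 on the slide and shortening the trace corresponds to option 2.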

Memory Partitioning
- The memory partitioning technique.

Memory Partitioning (cont.)
- Figure 1-a: the whole address space of the application is mapped to a single SRAM memory array.
- Figure 1-b: a dynamic access profile.
- Figure 1-c: the partitioned memory.
- Note that the power consumed by the entire partitioned memory system must be accounted for.

Address Clustering-Example
- MPEG decoding application for an ARM7 core
- Instruction stream

Address Clustering-Example (cont.)
- Figure 2 shows:
  - Total number of addresses: 31,233 (ranging from 0 to 124,892). The memory cut has 1,952 rows * 512 columns.
  - Energy consumed: 170 mJ (44.4 million total reads).
- Memory partitioning:
  - Three memory blocks of sizes 736*256, 696*512, and 892*512.
  - Energy consumed: 96 mJ (inclusive of the overhead).
- 43.5% energy reduction:
  - The 696*512 block serves the majority (82%) of the memory accesses (36 million out of 44.4 million).

Address Clustering-Example (cont.)
- Figure 3: clustered address profile of an MPEG decoder.
- Two memory block sizes: 212* *512
- Energy: 42 mJ (an additional 56% of energy saved).
- One block serves 99% of the memory accesses (43.99 million out of 44.4 million).
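The percentages quoted in the MPEG example follow directly from the three energy figures on the slides (170 mJ monolithic, 96 mJ partitioned, 42 mJ partitioned with clustering). The short check below reproduces them; the helper name `reduction` is an illustrative choice, and the combined saving against the monolithic memory is derived here rather than stated on the slides.

```python
def reduction(before_mj, after_mj):
    """Relative energy saving, in percent, when going from before_mj to after_mj."""
    return 100.0 * (before_mj - after_mj) / before_mj


# Energy figures in mJ, as quoted on the slides for the MPEG decoder example.
monolithic, partitioned, clustered = 170.0, 96.0, 42.0

part_vs_mono = reduction(monolithic, partitioned)   # about 43.5 %
clust_vs_part = reduction(partitioned, clustered)   # about 56.3 %
clust_vs_mono = reduction(monolithic, clustered)    # about 75.3 % (derived, not on the slides)
```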

Address Clustering-Problem
- Find a relocation of a proper subset of the address space.
- Maximize the locality of the dynamic trace.
- Minimize the energy consumption of the memory architecture.
- Cost metrics
  - Dynamic access profile: $C = \{c_0, c_1, \ldots, c_{N-1}\}$
  - $D(C,W) = \max_i S_i$ for $i = 0, 1, \ldots, N-W$, where $S_i = \sum_{j=0}^{W-1} c_{i+j}$ and $W$ is the size of a sliding window.
  - $d(C,W) = D(C,W) / \mathrm{Tot}$, where $\mathrm{Tot} = \sum_{i=0}^{N-1} c_i$.
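As a rough illustration of the cost metric, the sketch below computes d(C, W) for a profile given as a list of per-address access counts. The function name `locality` and the toy profile are assumptions made for this example; the paper's own implementation is not shown on the slides.

```python
def locality(profile, window):
    """Return d(C, W): the largest share of all accesses that falls
    inside any window of W consecutive addresses."""
    total = sum(profile)
    if total == 0 or not 0 < window <= len(profile):
        raise ValueError("need a non-empty profile and 0 < W <= N")
    current = sum(profile[:window])          # S_0
    best = current
    for i in range(1, len(profile) - window + 1):
        # Slide the window by one address: add the new count, drop the old one.
        current += profile[i + window - 1] - profile[i - 1]
        best = max(best, current)            # D(C, W) = max_i S_i
    return best / total                      # d(C, W) = D(C, W) / Tot


# Toy profile: addresses 3 and 4 receive 70 of the 80 accesses.
profile = [5, 0, 1, 40, 30, 0, 2, 2]
d = locality(profile, 2)
```

A value of d close to 1 means that a single block of W addresses captures almost all accesses, which is the situation partitioning benefits from.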

Address Clustering-Problem (cont.)
- Figure 4 shows the values of d(C,W) for W = 32, 64, 128, 256, 512, computed on the profile of Figure 2.
- Figure annotation: 80%

Address Clustering-Exploration
- High-level pseudo-code:
  - Explore: find a good value of W.

Address Clustering-Clustering Algorithm
- Cluster: returns a modified trace whose first M locations contain the M most visited addresses.
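One way to picture what Cluster produces is the sketch below. It is an assumption-laden illustration, not the paper's routine: it works on a per-address access-count profile rather than on the raw trace, it picks the M hottest addresses by simple sorting, and the names `cluster`, `new_profile`, and `swaps` are hypothetical.

```python
def cluster(profile, m):
    """Relocate the m most-visited addresses into locations 0..m-1 by pairwise swaps.

    Returns (new_profile, swaps), where swaps maps every moved address
    to its new location (each swap contributes two entries).
    """
    n = len(profile)
    hot = sorted(range(n), key=lambda a: profile[a], reverse=True)[:m]
    hot_set = set(hot)
    moving_in = [a for a in hot if a >= m]                 # hot addresses outside 0..m-1
    moving_out = [a for a in range(m) if a not in hot_set]  # cold addresses inside 0..m-1
    new_profile = list(profile)
    swaps = {}
    for src, dst in zip(moving_in, moving_out):
        swaps[src], swaps[dst] = dst, src
        new_profile[src], new_profile[dst] = new_profile[dst], new_profile[src]
    return new_profile, swaps


profile = [5, 0, 1, 40, 30, 0, 2, 2]
new_profile, swaps = cluster(profile, 2)
```

Because every relocation is a swap, at most 2M addresses are affected, and all other addresses keep their original location.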

Address Clustering-Encoder
- Hardware encoder:
  - Swapping address pairs yields 2M clustered addresses.
  - f(X) is a function that indicates whether X belongs to the set of 2M clustered addresses.
  - Clustered address: X' = R(X).
  - The encoder is a 32-input combinational network.
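The encoder itself is a hardware block, so the following is only a software model of its behaviour under the description above: addresses that belong to the swapped set are remapped through R, and every other address passes through unchanged. The swap table and the names `make_encoder` and `encode` are hypothetical.

```python
def make_encoder(swaps):
    """Build a software model of the address encoder.

    Membership in the swap table plays the role of f(X);
    the table lookup plays the role of R(X).
    """
    def encode(x):
        return swaps.get(x, x)  # remap clustered addresses, pass the rest through
    return encode


# Hypothetical swap table for M = 2 (two swapped pairs, four affected addresses).
swaps = {3: 0, 0: 3, 4: 1, 1: 4}
encode = make_encoder(swaps)
remapped = [encode(x) for x in [3, 4, 7, 0]]
```

Since the mapping is built from swaps, applying it twice returns the original address, so the same table describes both directions of the translation.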

Experimental Result
- Some benchmarks are taken from the Ptolemy distribution; others come from the MediaBench suite.
- Platform: ARM software development kit.
- Table 1:
  - #Addr: total number of distinct addresses.
  - E_mono: the energy of the monolithic memory that contains all the data and instructions.
  - E_partitioned: total memory energy of a partitioned memory architecture.
  - M = 256, 512, 1024: memory partitioning combined with address clustering.

Experimental Result (cont.)

Experimental Result (cont.)
- Original vs. Clustering (Energy)

Encoder Overhead Analysis
- Encoders were synthesized with Synopsys Design Compiler on a 0.25 µm technology by STMicroelectronics.
- Power figures (Figure 8) were obtained with Synopsys Power Compiler.
- The variation of the energy figures across the applications is relatively small, because:
  1. The complexity of the decoder is basically independent of the set of addresses that are clustered.
  2. The switching activity of the address lines is very similar for all benchmarks.

Encoder Overhead Analysis (cont.)
- Reference: a 16K memory that dissipates about 375 mW at a frequency of 150 MHz.
- Power = 7.5 mW for M = 1024.

Conclusions
- The energy reduction achievable by memory partitioning can be improved significantly by increasing the locality of the trace.
- The paper proposes an architectural solution called address clustering.
- Experimental results were obtained on a set of typical embedded applications running on an ARM-based system.
- Address clustering reduces the energy consumption of a partitioned memory architecture by 25% on average (maximum 57%) with respect to partitioning driven by the original trace.