Estimating Influence of Data Layout Optimizations on SDRAM Energy Consumption
H.S. Kim†, V. Narayanan†, M. Kandemir†, E. Brockmeyer‡, F. Catthoor‡, M.J. Irwin†
† Dept. of Computer Science and Engineering, The Pennsylvania State University
‡ IMEC, Belgium
Aug. 2003

2 Estimating influence of data layout optimizations on SDRAM energy
- Applications demand ever larger memory bandwidth (e.g., video applications).
- There has been much work on reducing the off-chip memory access frequency by improving the locality of local (intermediate) memories.
- Locality within the SDRAM itself also makes a significant difference in energy: a page open operation is about 6 times more expensive than a data read operation.
- Estimating the number of page open operations (page breaks) can therefore serve as an energy estimate for various optimizations.
- Data layout optimization: conventional layout vs. blocked layout.

3 Preliminaries (SDRAMs)
- Banked architecture
[Figure: SDRAM organization: several banks, each with its own memory array and row decoder; shared sense amps, column decoder, and data buffer; control logic and a mode register driven by the command and address inputs.]
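As a minimal sketch of how a flat word address might map onto such a banked device (the experiments later use 4 banks, 1KB pages, and a 32-bit bus), the split below puts the column bits lowest, then the bank bits, then the row bits; this particular bit assignment is an assumption for illustration, not taken from the slides.

```c
/* Hedged sketch: decompose a flat word address into (bank, row, column)
 * for a 4-bank SDRAM with 1KB pages and 4-byte words. The bit ordering
 * (column, then bank, then row) is an illustrative assumption. */
#include <stdio.h>

#define PAGE_BYTES 1024
#define WORD_BYTES 4
#define NUM_BANKS  4
#define COLS_PER_PAGE (PAGE_BYTES / WORD_BYTES)   /* 256 columns per page */

typedef struct { unsigned bank, row, col; } sdram_addr;

static sdram_addr decode(unsigned long word_addr) {
    sdram_addr a;
    a.col  =  word_addr % COLS_PER_PAGE;               /* low bits: column    */
    a.bank = (word_addr / COLS_PER_PAGE) % NUM_BANKS;  /* next bits: bank     */
    a.row  =  word_addr / COLS_PER_PAGE / NUM_BANKS;   /* high bits: row/page */
    return a;
}

int main(void) {
    sdram_addr a = decode(123456);
    printf("bank %u, row %u, col %u\n", a.bank, a.row, a.col);
    return 0;
}
```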

4 Preliminaries (SDRAM operations) tRPtRCDCAS latency Precharge bank 0 Activate bank 0 Read data command DQ D0D1D2D3 Bank /Page x D0 tRRD DQD0D1D2D3 Bank 0 /Page y Lost cycles command Two consecutive operations to two different rows of one bank One operation

5 SDRAM energy consumption
- Parameters: D words transferred, burst size B, page miss rate P_miss.
- e_act = x * e_d and e_stat_act = y * e_d, where e_act is the energy per page activation, e_d is the energy per one-word data transfer, and e_stat_act is the static energy per activation.
- Example: Micron's 8MB SDRAM has e_act = 13nJ, e_stat_act = 7nJ, e_d = 3.6nJ, so x + y ≈ 6.
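As a minimal sketch of how these parameters combine (the total-energy expression below is an assumption for illustration; it charges every page break with both the dynamic and the static activation cost, which may differ from the exact formula on the original slide):

```c
/* Hedged sketch of a per-access SDRAM energy model built from the slide's
 * parameters. Assumed model: E = D*e_d + N_pagebreaks*(e_act + e_stat_act). */
#include <stdio.h>

int main(void) {
    /* Example numbers quoted for Micron's 8MB SDRAM */
    const double e_act      = 13.0;  /* nJ per page activation          */
    const double e_stat_act =  7.0;  /* nJ static energy per activation */
    const double e_d        =  3.6;  /* nJ per one-word data transfer   */

    long words       = 1000000;      /* D: words transferred (hypothetical)        */
    long page_breaks =   20000;      /* number of page open operations (hypothetical) */

    double energy_nj = words * e_d + page_breaks * (e_act + e_stat_act);
    printf("estimated SDRAM energy: %.2f mJ\n", energy_nj * 1e-6);
    return 0;
}
```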

6 Page break estimation of data layouts
- Page break estimation can be used to estimate the energy and performance of various optimization techniques.
- The estimation itself should take little time.
- In a blocked layout, different tile/block sizes and shapes result in different numbers of page breaks (see the counting sketch after this slide).
- Two kinds of page breaks: intra page breaks and inter page breaks.
[Figure: a blocked array with the tile size equal to the page size, illustrating intra and inter page breaks on a block of the array.]
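The following brute-force sketch illustrates why the layout matters. It is not the paper's analytical model; it simply replays the address stream of a tile-by-tile traversal under a row-major and under a blocked layout and counts how often the SDRAM page changes. All sizes are hypothetical (a single bank is assumed).

```c
/* Illustrative page-break count for row-major vs. blocked (tiled) layout.
 * Not the paper's polyhedral/Presburger model: just replay the address
 * sequence of a tile-by-tile traversal and count page changes. */
#include <stdio.h>

#define N 512           /* array is N x N 32-bit words (hypothetical) */
#define TILE 16         /* tile is TILE x TILE words (hypothetical)   */
#define PAGE_WORDS 256  /* 1KB page / 4-byte words                    */

/* Row-major: element (i,j) sits at offset i*N + j. */
static long addr_row_major(int i, int j) { return (long)i * N + j; }

/* Blocked: tiles laid out contiguously, elements row-major inside a tile. */
static long addr_blocked(int i, int j) {
    long tile_id = (long)(i / TILE) * (N / TILE) + (j / TILE);
    return tile_id * TILE * TILE + (long)(i % TILE) * TILE + (j % TILE);
}

/* Count page breaks for a tile-by-tile traversal under a given layout. */
static long count_page_breaks(long (*addr)(int, int)) {
    long breaks = 0, last_page = -1;
    for (int ti = 0; ti < N; ti += TILE)
        for (int tj = 0; tj < N; tj += TILE)
            for (int i = ti; i < ti + TILE; i++)
                for (int j = tj; j < tj + TILE; j++) {
                    long page = addr(i, j) / PAGE_WORDS;
                    if (page != last_page) { breaks++; last_page = page; }
                }
    return breaks;
}

int main(void) {
    printf("row-major layout: %ld page breaks\n", count_page_breaks(addr_row_major));
    printf("blocked layout:   %ld page breaks\n", count_page_breaks(addr_blocked));
    return 0;
}
```

With these sizes a tile exactly fills one page, so the blocked layout opens one page per tile, while the row-major layout opens a new page on nearly every row of every tile.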

7 Estimation
Modeling
- Polyhedral modeling of page breaks, implemented using Presburger formulas:
  - valid iteration points
  - lexicographical ordering
  - data layouts in memory
  - mapping memory locations to memory banks
  - page break estimation model for the blocked layout
Implementation
- Omega Calculator to simplify the models (existential operators are allowed, which is not possible in Polylib)
- Polylib to count the number of integer points in the resulting sets

8 Intra/Inter page break models for blocked data layout
- Intra page break model
- Inter page break model
(An illustrative sketch of this style of formulation follows.)
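Purely as an illustration of the style of Presburger formulation the previous slide describes (the set below, the bound N, the page size W, and the affine address function addr are assumptions, not the paper's actual intra/inter models), a one-dimensional page-break set could be written as:

\[
PB \;=\; \bigl\{\, i \;\bigm|\; 2 \le i \le N \;\wedge\; \exists\, p_1, p_2 :\;
p_1 W \le \mathrm{addr}(i-1) < (p_1+1)W \;\wedge\;
p_2 W \le \mathrm{addr}(i) < (p_2+1)W \;\wedge\; p_1 \ne p_2 \,\bigr\}
\]

The number of page breaks is the cardinality of PB: Omega eliminates the existential quantifiers and Polylib counts the remaining integer points.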

9 Experiments
- E_ACT = (IDD0 - IDD3) * Trc * Vdd * T_cycle * #pagebreaks
- E_STAT = IDD3 * Vdd * T_cycle * total_cycles
- Benchmarks:
  - qsdpcm (quadtree-structured motion estimation)
  - phods (parallel hierarchical motion estimation)
  - an edge_detect code from the UTDSP benchmark suite
  - various fetch tile/block shapes (set_1, set_2, set_3)
- Architectural assumptions:
  - a block of data is fetched from SDRAM into local data memory via DMA (i.e., a software-controlled intermediate memory)
  - SDRAM: Micron's 8MB, 4-banked device with a 32-bit bus and 1KB pages
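A minimal sketch of plugging datasheet-style numbers into these two formulas; the IDD, Vdd, and timing values below are placeholders, not the ones used in the paper.

```c
/* Hedged sketch of the E_ACT / E_STAT formulas on this slide with
 * hypothetical Micron-style datasheet values. */
#include <stdio.h>

int main(void) {
    const double IDD0   = 0.090;   /* A, activate-precharge current (assumed) */
    const double IDD3   = 0.035;   /* A, active standby current (assumed)     */
    const double Vdd    = 3.3;     /* V                                       */
    const double Trc    = 10.0;    /* row cycle time, in clock cycles (assumed) */
    const double Tcycle = 10e-9;   /* s, clock period at 100 MHz (assumed)    */

    long page_breaks  = 20000;     /* from the page-break estimate/simulation */
    long total_cycles = 5000000;   /* total execution cycles                  */

    double e_act  = (IDD0 - IDD3) * Trc * Vdd * Tcycle * page_breaks;  /* E_ACT  */
    double e_stat = IDD3 * Vdd * Tcycle * total_cycles;                /* E_STAT */

    printf("E_ACT  = %.3f mJ\n", e_act  * 1e3);
    printf("E_STAT = %.3f mJ\n", e_stat * 1e3);
    return 0;
}
```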

10 Experiments
- Reference flow (an SDRAM power and cycle simulator, used to compare the estimates with):
  C code -> ATOMIUM (memory instrumentation tool) -> memory reference log (addr., size, time) -> SDRAM cycle simulator -> number of page activations -> Micron's SDRAM Power Calculator -> total activation energy

11 Results (qsdpcm, simulation)
- The conventional layout shows varying energy numbers depending on the array size (800x640 vs. 176x144).
- The blocked layout shows no variation with the array size.

12 Results (row-major vs. blocked, phods)
- The estimated numbers match the corresponding simulated numbers reasonably well for both the row-major and the blocked layout.

13 Results (blocked layout, estimation vs. simulation)
- Arrays with manifest (compile-time-known) indexes can be estimated without error (edge_detect).
- Arrays with dynamic elements (e.g., motion vectors) can be estimated reasonably well (phods, qsdpcm).
- Energy numbers vary depending on the block/tile shapes (set_1 through set_3).
[Charts: qsdpcm, phods, edge_detect]

14 Conclusions and Future Work
- The estimation framework tracks page breaks well.
- The blocked layout reduces the number of page breaks significantly.
- Tile/block shapes should be chosen carefully.
- On-going work:
  - refinement of the estimation formulas for conventional/blocked layouts of higher-dimensional arrays
  - automation: automatic incorporation of the Omega library and Polylib, and automatic transformation of each array into a main-memory-efficient data layout
  - exploration techniques to find the optimal data layout