Concurrency, Latency, or System Overhead: Which Has the Largest Impact on Uniprocessor DRAM-System Performance?
Vinodh Cuppu and Bruce Jacob, University of Maryland
Presented by Richard Wells, ECE 7810, April 21, 2009

Reservations
The paper is old:
 Presented at ISCA 2001
 Only considers uniprocessor systems
The authors draw some conclusions that, while valid, are focused on their own research goals.
Papers related to our group's project have not been prevalent in recent years, except one already presented at the architecture reading club.

Overview
Investigate DRAM system organization parameters to determine the bottleneck
Determine synergy or antagonism between groups of parameters
Empirically determine the optimal DRAM system configuration

Methodologies to increase system performance
Support concurrent transactions
Reduce latency
Reduce system overhead

Previous approaches to reduce memory system overhead
DRAM component:
 Increase bandwidth (the current "tack" taken by the PC industry)
 Reduce DRAM latency
ESDRAM (timing sketch below):
 SRAM cache for the full row buffer
 Allows precharge to begin immediately after an access
FCRAM:
 Subdivides the internal bank by activating only a portion of each wordline
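
To make the ESDRAM point concrete, here is a minimal timing sketch in C. The tRP/tRCD/tCAS values are illustrative assumptions, not datasheet numbers; the idea is only that caching the row in SRAM lets precharge overlap useful work instead of sitting on the critical path of a row conflict.

```c
/* Hypothetical timing sketch of ESDRAM's benefit: with the row cached
 * in SRAM, precharge overlaps the next access rather than adding to it.
 * All numbers are illustrative, not datasheet values. */
#include <stdio.h>

int main(void) {
    int t_rp  = 20;  /* ns: precharge (assumed) */
    int t_rcd = 20;  /* ns: row activate (assumed) */
    int t_cas = 20;  /* ns: column access (assumed) */

    /* Conventional DRAM, row conflict: precharge, activate, then read. */
    int conventional = t_rp + t_rcd + t_cas;

    /* ESDRAM: the old row was copied to the SRAM buffer, so precharge
     * already happened in the background; only activate + read remain. */
    int esdram = t_rcd + t_cas;

    printf("row conflict, conventional: %d ns\n", conventional);
    printf("row conflict, ESDRAM:       %d ns\n", esdram);
    return 0;
}
```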

Previous approaches to reduce memory system overhead (cont.)
FCRAM (cont.):
 Reduces capacitance, bringing word access down to 30 ns (2001)
MoSys:
 Subdivides storage into a large number of very small banks
 Reduces the latency of the DRAM core to nearly that of SRAM
VCDRAM:
 Set-associative SRAM buffer that holds a number of sub-pages

The Jump
DRAM-oriented approaches do reduce application execution time
Because even zero-latency DRAM would not reduce the memory system overhead to zero, bus transactions are considered
Other factors considered:
 Turnaround time
 Queuing delays
 Inefficiencies due to asymmetric read/write requests
 Multiprocessor: arbitration and cache coherence would add to overhead

CPU–DRAM channel
Access reordering (cites the Impulse group here at the U; sketched below):
 Compacts sparse data into densely packed bus transactions
 Reduces the number of bus transactions
 Possibly reduces the duration of a bus transaction
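
A minimal sketch of the reordering idea, with hypothetical sizes (64-byte bus transactions, 4-byte words wanted by the CPU): compacting the sparse words at the controller turns sixteen mostly wasted transactions into one dense one.

```c
/* Sketch of Impulse-style access reordering (hypothetical numbers).
 * A strided access pattern touches one 4-byte word per 64-byte line,
 * so a naive memory system moves a full 64-byte burst per word.
 * A remapping controller can gather the sparse words into one dense
 * burst before putting them on the bus. */
#include <stdio.h>

#define LINE_BYTES  64
#define WORD_BYTES  4
#define N_WORDS     16   /* words the CPU actually wants */

int main(void) {
    /* Naive: one 64-byte bus transaction per sparse word. */
    int naive_txns  = N_WORDS;
    int naive_bytes = N_WORDS * LINE_BYTES;

    /* Reordered: gather the 16 useful words (64 bytes) into a
     * single dense transaction at the controller. */
    int dense_txns  = (N_WORDS * WORD_BYTES + LINE_BYTES - 1) / LINE_BYTES;
    int dense_bytes = dense_txns * LINE_BYTES;

    printf("naive: %d txns, %d bytes on bus\n", naive_txns, naive_bytes);
    printf("dense: %d txns, %d bytes on bus\n", dense_txns, dense_bytes);
    return 0;
}
```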

Increasing concurrency
Different banks on the same channel (see the sketch below)
Independent channels to different banks
Pipelined requests
Split-transaction bus
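
A back-of-the-envelope model of the first item, using the paper's 70 ns read-access assumption and an assumed 20 ns burst time: two reads to different banks overlap their core accesses and serialize only on the shared channel.

```c
/* Bank-level concurrency model (simplified): two reads that hit
 * different banks overlap their core access time; only the data
 * burst serializes on the shared channel. */
#include <stdio.h>

int main(void) {
    int t_core  = 70;  /* ns until a burst can start (paper's assumption) */
    int t_burst = 20;  /* ns of data transfer per request (assumed) */

    /* Same bank: the second access waits for the first to finish. */
    int same_bank = 2 * (t_core + t_burst);

    /* Different banks: core accesses overlap; bursts share the channel. */
    int diff_bank = t_core + 2 * t_burst;

    printf("same bank:       %d ns\n", same_bank);
    printf("different banks: %d ns\n", diff_bank);
    return 0;
}
```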

Decreasing channel latency
Channel contention arises from:
 Back-to-back read requests
 A read arriving during precharge
 Narrow channels
 Large data burst sizes

Addressing system overhead
Bus turnaround time
Dead cycles due to asymmetric read/write shapes
Queuing overhead
Coalescing of queued requests
Dynamic re-prioritization of requests

Timing assumptions
10 ns to transfer the address
70 ns until the burst starts on a read
40 ns until a write can start
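
A worked example combining these assumptions with the 800 MHz bus used in the results: the end-to-end time for one read returning a 64-byte burst over an assumed 4-byte-wide channel.

```c
/* Worked example with the paper's timing assumptions: end-to-end time
 * for one read returning a 64-byte burst over a 4-byte-wide channel
 * at 800 MHz (1.25 ns per bus cycle). */
#include <stdio.h>

int main(void) {
    double t_addr    = 10.0;       /* ns: address transfer */
    double t_access  = 70.0;       /* ns: until the read burst starts */
    double bus_cycle = 1.0 / 0.8;  /* ns per cycle at 800 MHz */
    int    burst     = 64;         /* bytes per burst */
    int    width     = 4;          /* bytes per bus cycle */

    double t_burst = (burst / width) * bus_cycle;  /* 16 cycles = 20 ns */
    printf("read latency = %.1f ns\n", t_addr + t_access + t_burst);
    return 0;
}
```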

Split-transaction bus assumptions
Overlapping is supported for:
 Back-to-back reads
 Back-to-back read/write pairs

Burst ordering, coalescing
Critical burst first, non-critical burst second, writes last
Coalesce writes, followed by reads
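
A minimal sketch of the ordering policy, with assumed data structures (the paper does not specify an implementation): queued requests are drained in priority order, critical read bursts first and writes last.

```c
/* Sketch of the burst-ordering policy: critical read bursts are served
 * first, non-critical read bursts next, writes last.  Request layout
 * is assumed for illustration. */
#include <stdio.h>
#include <stdlib.h>

enum kind { CRITICAL_READ = 0, NONCRITICAL_READ = 1, WRITE = 2 };

struct req { int id; enum kind k; };

static int by_priority(const void *a, const void *b) {
    const struct req *x = a, *y = b;
    return (int)x->k - (int)y->k;  /* lower enum value = higher priority */
}

int main(void) {
    struct req q[] = {
        {0, WRITE}, {1, NONCRITICAL_READ}, {2, CRITICAL_READ}, {3, WRITE},
    };
    size_t n = sizeof q / sizeof q[0];
    qsort(q, n, sizeof q[0], by_priority);
    for (size_t i = 0; i < n; i++)
        printf("serve request %d (kind %d)\n", q[i].id, (int)q[i].k);
    return 0;
}
```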

Bit addressing & page policy
Bit assignments are chosen to exploit page mode and maximize the degree of memory concurrency:
 The most-significant bits identify the smallest-scale component in the system
 The least-significant bits identify the largest-scale component in the system
 Allows sequential addresses to be striped across channels, maximizing concurrency (see the decode sketch below)
Close-page, auto-precharge policy
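
A decode sketch under assumed field widths (64-byte bursts, 1 channel bit, 2 bank bits; the real widths vary with the configuration being simulated): placing the channel and bank fields just above the burst offset makes sequential addresses stripe across channels first, then banks.

```c
/* Address decode sketch (hypothetical field widths): the bits just
 * above the burst offset select the channel, then the bank, so
 * sequential bursts rotate across channels and banks before reusing
 * a row. */
#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS  6   /* 64-byte burst */
#define CHANNEL_BITS 1   /* 2 channels (assumed) */
#define BANK_BITS    2   /* 4 banks per channel (assumed) */

int main(void) {
    for (uint32_t addr = 0; addr < 4 * 64; addr += 64) {
        uint32_t chan = (addr >> OFFSET_BITS) & ((1u << CHANNEL_BITS) - 1);
        uint32_t bank = (addr >> (OFFSET_BITS + CHANNEL_BITS))
                        & ((1u << BANK_BITS) - 1);
        printf("addr 0x%04x -> channel %u, bank %u\n",
               (unsigned)addr, (unsigned)chan, (unsigned)bank);
    }
    return 0;
}
```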

Simulation environment
SimpleScalar (used in 6810):
 2 GHz clock
 L1 caches: 64KB/64KB, 2-way set associative
 L2 cache: unified 1MB, 4-way set associative, 10-cycle access time
 Lockup-free caches using miss status holding registers (MSHRs)

Timing calculations
The split between CPU time and DRAM time is determined by running a second simulation with perfect primary memory (data available on the next cycle)
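
In other words, the DRAM-attributable time is the difference between the two runs. A sketch with made-up cycle counts:

```c
/* Timing decomposition sketch (made-up numbers): the DRAM-attributable
 * portion of execution time is the difference between a run with the
 * realistic DRAM model and a run with perfect (next-cycle) memory. */
#include <stdio.h>

int main(void) {
    double cycles_realistic = 1.50e9;  /* assumed: realistic DRAM model */
    double cycles_perfect   = 0.80e9;  /* assumed: perfect primary memory */

    double dram_cycles = cycles_realistic - cycles_perfect;
    printf("DRAM share of execution time: %.0f%%\n",
           100.0 * dram_cycles / cycles_realistic);
    return 0;
}
```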

Results – Degrees of Freedom
Bus speed: 800 MHz
Bus width: 1, 2, 4, 8 bytes
Channels: 1, 2, 4
Banks/channel: 1, 2, 4, 8
Queue size: infinite, 0, 1, 2, 8, 16, 32
Turnaround: 0, 1 cycles
R/W shapes: symmetric, asymmetric
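
For scale, counting the cross product of the swept parameters (assuming every combination is simulated, which the paper may not do for every benchmark):

```c
/* Size of the swept design space listed above: the cross product of
 * bus widths, channels, banks, queue sizes, turnaround settings, and
 * R/W shapes. */
#include <stdio.h>

int main(void) {
    int widths = 4, channels = 3, banks = 4, queues = 7, turn = 2, shapes = 2;
    printf("configurations per benchmark: %d\n",
           widths * channels * banks * queues * turn * shapes);
    return 0;
}
```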

Results – Execution Times
Assumes an infinite request queue
System parameters can lead to widely varying CPI

Results – Turnaround and Banks
Turnaround accounts for only 5% of system-related overhead
Banks/channel accounts for a 1.2x–2x variation, showing that concurrency is important
Latency accounts for about 50% of CPI

Results – Burst Length vs. BW
Accounts for 10–30% of execution time
Wider channels have optimal performance with larger bursts
Narrow channels have optimal performance with smaller bursts

Results – Concurrency

Results – Concurrency (Cont.)
Increasing the number of banks typically increases performance, but not always by much
Many narrow channels are risky because the application might not have much inherent concurrency
Optimal configurations: 1 channel x 4 bytes x 64-byte burst; 2 channels x 2 bytes x 64-byte burst; 1 channel x 4 bytes x 128-byte burst
Performance varies depending on the concurrency of the benchmark

Results – Concurrency (Cont.)
"We find that, in a uniprocessor setting, concurrency is very important, but it is not more important than latency. ... However, we find that if, in an attempt to increase support for concurrent transactions, one interleaves very small bursts or fragments the DRAM bus into multiple channels, one does so at the expense of latency, and this expense is too great for the levels of concurrency being produced."

Results – Request Queue Size

Results – Request Queue Size
How queuing benefits system performance:
 Sub-blocks of different read requests can be interleaved
 Writes can be buffered until read-burst traffic has died down
 Read and write requests may be coalesced
Applications with significant write activity see more benefit from queuing:
 Bzip has many more writes than GCC
Anomalies are attributed to requests with temporal locality going to the same bank; with a small queue, they are delayed

Conclusions
Tuning system-level parameters can improve memory system performance by 40%:
 Bus turnaround: 5–10%
 Banks: 1.2x–2x
 Burst length vs. bandwidth: 10–30%
 Concurrency
Interleaving smaller bursts is not a good idea: the latency cost is too great for the concurrency gained

Our Project
Evaluate the effect of mat array size on the power and latency of DRAM chips
Simulators:
 Cacti
 DRAMSim
 Simics
Predicted results:
 Positive: decreased memory latency, decreased power profile, increased DIMM parallelism
 Negative: decreased row-buffer hit rates, decreased memory capacity (for the same chip area), increased cost/bit (an important metric)

How the project relates to the paper
We are trying to decrease memory system bottlenecks:
 Although we have evaluated the bottlenecks differently
Jacob indirectly showed the importance of minimizing DRAM latency:
 DRAM latency was the largest portion of CPI, so Amdahl's law justifies reducing latency
 Both solutions could work together synergistically

Additional thoughts
The current path of DRAM innovation has limitations
DRAM chips and DIMMs need to undergo fundamental changes, of which this could be one step:
 Helps power efficiency
 Can be balanced with cost effectiveness
 Partially addresses the memory gap

Questions?