Temporal Memory Streaming

Temporal Memory Streaming
Babak Falsafi
Team: Mike Ferdman, Brian Gold, Nikos Hardavellas, Jangwoo Kim, Stephen Somogyi, Tom Wenisch
Collaborators: Anastassia Ailamaki & Andreas Moshovos
STEMS, Computer Architecture Lab (CALCM), Carnegie Mellon
http://www.ece.cmu.edu/CALCM

The Memory Wall
- The logic/DRAM speed gap continues to increase!
- Database performance on modern processors suffers because of the processor/memory speed gap.
- The problem persists because (a) memories keep getting bigger, (b) storage managers are smart and hide disk I/O, and (c) DBMSs have too many knobs and benchmarks are too complex to be used in early processor design stages.
- Database instruction streams contain many data accesses (accesses to the memory hierarchy) and many dependencies (so instruction-level parallelism is hard).
- On a 1980 machine (VAX 11/750), a memory access took about as long as executing one instruction; on a 2000-era machine (Pentium II), a memory access took as long as executing 1000s of instructions. The trend is accelerating: on the Pentium 4, memory latency is ~300 cycles, another order of magnitude in only two years.
- We project that soon caches will play the role of memory (on-chip memory), main memory will become the disk, and the disk will be tertiary storage. Unless cache usage is optimal, memory latency will dominate: no matter how fast the processor is, database applications will be slow. Hence cache-aware database research.
[Figure: logic/DRAM speed gap over time — VAX/1980, PPro/1996, 2010+]

Current Approach
Cache hierarchies:
- Trade off capacity for speed
- Exploit "reuse"
But, in modern servers:
- Only 50% utilization of one proc. [Ailamaki, VLDB'99]
- A much bigger problem in MPs
What is wrong? Demand fetch/replace of data
[Figure: cache hierarchy — CPU; L1 64K; L2 2M; L3 64M; DRAM 100G, with access latencies growing from 1 clk nearest the CPU to 1000 clk]
Speaker notes: The electronic world encompasses much of our daily activities; electronic commerce and search engines are two examples. People use client devices (handhelds, desktop computers) to exchange information through the internet. The backbone of this e-world consists of compute servers that process and store information and make it available to users. This talk focuses on designing servers, and in particular on two key aspects: performance and programmability.

Prior Work (SW-Transparent)
- Prefetching [Joseph 97] [Roth 96] [Nesbit 04] [Gracia Pérez 04]: simple patterns or low accuracy
- Large execution windows / runahead [Mutlu 03]: fetch dependent addresses serially
- Coherence optimizations [Stenström 93] [Lai 00] [Huh 04]: limited applicability (e.g., migratory)
Need solutions for arbitrary access patterns

Our Solution: Spatio-Temporal Memory Streaming (STEMS)
Observation: data are spatially/temporally correlated
- Arbitrary, yet repetitive, patterns
Approach: memory streaming
- Extract spatial/temporal patterns
- Stream data to/from the CPU
- Manage resources for multiple blocks
- Break dependence chains
- In HW, SW, or both
[Figure: CPU and L1 with a stream of blocks A, C, D, B, ..., X; the engine replaces/streams block W and fetches Z from DRAM or other CPUs]

Contribution #1: Temporal Shared-Memory Streaming
Recent coherence miss sequences recur:
- 50% of misses closely follow a previous sequence
- Large opportunity to exploit MLP
Temporal Streaming Engine:
- Ordered streams allow practical HW
- Performance improvement: 7%-230% in scientific apps., 6%-21% in commercial Web & OLTP apps.

Contribution #2: Last-Touch Correlated Data Streaming
Last-touch prefetchers:
- Cache block deadtime >> livetime
- Fetch on a predicted "last touch"
- But designs are impractical (> 200MB on-chip)
Last-touch correlated data streaming:
- Miss order ~ last-touch order
- Stream table entries from off-chip
- Eliminates 75% of all L1 misses with ~200KB

Outline
- STEMS Overview
- Example Streaming Techniques
  - Temporal Shared-Memory Streaming
  - Last-Touch Correlated Data Streaming
- Summary

Temporal Shared-Memory Streaming [ISCA'05]
- Record sequences of memory accesses
- Transfer data sequences ahead of requests
[Figure: Baseline system — CPU misses on A, Mem fills A; CPU misses on B, Mem fills B. Streaming system — CPU misses on A, Mem fills A, B, C, ...]
- Accelerates arbitrary access patterns
- Parallelizes the critical path of pointer-chasing

Relationship Between Misses
Intuition: miss sequences repeat
- Because code sequences repeat
- Observed for uniprocessors in [Chilimbi'02]
Temporal address correlation: the same miss addresses repeat in the same order
- A correlated miss sequence = a stream
Example miss sequence: Q W A B C D E R ... T A B C D E Y — the stream A B C D E recurs.
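To make temporal address correlation concrete, here is a minimal Python sketch (illustrative only, not the paper's hardware): log every miss address, and on a repeat miss predict the addresses that followed its previous occurrence. The class name, stream length, and letter addresses are assumptions for illustration.

```python
class TemporalCorrelationPredictor:
    """Sketch: predict that the addresses following a miss's previous
    occurrence will follow it again (temporal address correlation)."""

    def __init__(self, stream_len=4):
        self.miss_log = []        # ordered history of miss addresses
        self.last_pos = {}        # address -> index of its latest occurrence
        self.stream_len = stream_len

    def on_miss(self, addr):
        # Predict the addresses that followed addr's previous occurrence.
        prediction = []
        if addr in self.last_pos:
            start = self.last_pos[addr] + 1
            prediction = self.miss_log[start:start + self.stream_len]
        # Record this miss for future predictions.
        self.last_pos[addr] = len(self.miss_log)
        self.miss_log.append(addr)
        return prediction

# The slide's example: after the sequence Q W A B C D E R T, a repeat
# miss on A predicts the stream B, C, D, E.
p = TemporalCorrelationPredictor()
for a in "QWABCDERT":
    p.on_miss(a)
print(p.on_miss("A"))   # -> ['B', 'C', 'D', 'E']
```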

Relationship Between Streams
Intuition: streams exhibit temporal locality
- Because the working set exhibits temporal locality
- For shared data, the repetition is often across nodes
Temporal stream locality: recent streams are likely to recur
Example: Node 1 misses on Q W A B C D E R; Node 2 later misses on T A B C D E Y — the same stream A B C D E recurs on another node.
Address correlation + stream locality = temporal correlation

Memory Level Parallelism
- Streams create MLP for dependent misses
- Not possible with larger windows / runahead
- Temporal streaming breaks dependence chains
[Figure: Baseline — the CPU must wait to follow pointers A → B → C serially. Temporal streaming — A, B, C are fetched in parallel.]
For example, N dependent pointer-chasing misses cost roughly N × memory latency serially, but only about one miss latency plus transfer time when streamed in parallel.

Temporal Streaming
① Record: Node 1 records its coherence miss sequence: Miss A, Miss B, Miss C, Miss D.
② Locate: Node 2 misses on A and sends Req. A to the directory; the directory fills A and forwards Locate A to Node 1, which answers Node 2 with the stream B, C, D.
③ Stream: Node 2 fetches B, C, D ahead of use, so its subsequent accesses hit (Hit B, Hit C).

Temporal Streaming Engine — ① Record
Coherence Miss Order Buffer (CMOB):
- ~1.5MB circular buffer per node, held in local memory
- Addresses only
- Coalesced accesses
[Figure: CPU fills E from the cache; the CMOB in local memory holds the ordered miss addresses Q W A B C D E R T Y]
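A sketch of the CMOB as a circular address log, in Python; the capacity, method names, and returned pointer are illustrative assumptions, not the actual hardware interface.

```python
class CMOB:
    """Sketch of a Coherence Miss Order Buffer: a fixed-capacity circular
    log of coherence-miss addresses kept in a node's local memory."""

    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.count = 0                      # total appends so far

    def append(self, addr):
        slot = self.count % len(self.buf)   # wrap around when full
        self.buf[slot] = addr
        self.count += 1
        return slot                         # pointer to register at the directory

    def read_stream(self, slot, n):
        # The n addresses recorded after `slot`, in order: a candidate stream.
        return [self.buf[(slot + 1 + i) % len(self.buf)] for i in range(n)]
```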

Temporal Streaming Engine — ② Locate
Annotate the directory:
- It already has coherence info for every block
- CMOB append → send pointer to the directory
- Coherence miss → forward a stream request
[Example directory entries: A — shared, Node 4 @ CMOB[23]; B — modified, Node 11 @ CMOB[401]]
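A sketch of the annotated directory (field and message names are assumptions): each entry keeps, besides coherence state, a pointer to the CMOB location of the block's most recent append, and a coherence miss forwards a stream request to the recording node.

```python
class Directory:
    """Sketch: directory entries annotated with (node, CMOB slot) pointers."""

    def __init__(self):
        self.entries = {}   # block address -> {"state", "cmob_node", "cmob_slot"}

    def on_cmob_append(self, addr, node, slot):
        # A node appended addr to its CMOB and sent us the pointer.
        e = self.entries.setdefault(addr, {"state": "shared"})
        e["cmob_node"], e["cmob_slot"] = node, slot

    def on_coherence_miss(self, addr, requester):
        e = self.entries.get(addr, {})
        if "cmob_node" in e:
            # Forward a stream request: the recording node replays the
            # addresses that followed addr for the requester.
            return ("stream_request", e["cmob_node"], e["cmob_slot"], requester)
        return ("plain_fill", requester)
```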

Temporal Streaming Engine — ③ Stream
Fetch data to match the use rate:
- Addresses wait in a FIFO stream queue
- Data is fetched into a streamed value buffer (~32 entries)
[Figure: Node i streams {A, B, C, ...}; the stream queue holds F E D C B while A's data sits in the streamed value buffer next to the CPU and L1]
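A sketch of the stream step, pacing fetches against consumption; the buffer size comes from the slide, while the fetch callback and class names are illustrative assumptions.

```python
from collections import deque

class StreamEngine:
    """Sketch: FIFO stream queue feeding a small streamed value buffer,
    refilled as the CPU consumes entries so data arrives at the use rate."""

    def __init__(self, fetch, buffer_size=32):
        self.queue = deque()        # predicted addresses, in stream order
        self.svb = {}               # streamed value buffer: addr -> data
        self.fetch = fetch          # callback: fetch addr from memory/other CPU
        self.buffer_size = buffer_size

    def start_stream(self, addrs):
        self.queue.extend(addrs)
        self._refill()

    def _refill(self):
        while self.queue and len(self.svb) < self.buffer_size:
            addr = self.queue.popleft()
            self.svb[addr] = self.fetch(addr)

    def on_l1_miss(self, addr):
        if addr in self.svb:        # stream hit: no memory round-trip
            data = self.svb.pop(addr)
            self._refill()          # keep the buffer topped up
            return data
        return None                 # fall back to an ordinary miss
```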

Practical HW Mechanisms
Streams are recorded/followed in order:
- FIFO stream queues
- ~32-entry streamed value buffer
- Coalesced cache-block-size CMOB appends
Predicts many misses from one request:
- More lookahead
- Allows off-chip stream storage
Leverages the existing directory lookup

Methodology: Infrastructure
SimFlex [SIGMETRICS'04]:
- Statistical sampling → uArch simulation in minutes
- Full-system MP simulation (boots Linux & Solaris)
- Uni, CMP, and DSM timing models
- Real server software (e.g., DB2 & Oracle)
- Component-based; FPGA board interface for hybrid simulation/prototyping
- Publicly available at http://www.ece.cmu.edu/~simflex

Methodology: Benchmarks & Parameters
Benchmark applications:
- Scientific: em3d, moldyn, ocean
- OLTP (TPC-C 3.0, 100 WH): IBM DB2 7.2, Oracle 10g
- SPECweb99 (16K connections): Apache 2.0, Zeus 4.3
Model parameters:
- 16 4GHz SPARC CPUs
- 8-wide OoO; 8-stage pipeline
- 256-entry ROB/LSQ
- 64K L1, 8MB L2
- TSO w/ speculation

TSE Coverage Comparison
[Figure: coverage of coherence misses for stride, GHB, and TSE prefetchers]
TSE outperforms stride and GHB prefetchers for coherence misses

Stream Lengths
- Commercial: short streams; low base MLP (1.2-1.3)
- Scientific: long streams; high base MLP (1.6-6.6)
Temporal streaming addresses both cases

TSE Performance Impact
[Figure: time breakdown and speedup with 95% confidence intervals for em3d, moldyn, ocean, Apache, DB2, Oracle, Zeus]
- TSE eliminates 25%-95% of coherent read stalls
- 6% to 230% performance improvement

TSE Conclusions
Temporal streaming:
- Intuition: recent coherence miss sequences recur
- Impact: eliminates 50-100% of coherence misses
Temporal Streaming Engine:
- Intuition: in-order streams enable practical HW
- Impact: performance improvements of 7%-230% in scientific apps. and 6%-21% in commercial Web & OLTP apps.

Outline
- Big Picture
- Example Streaming Techniques
  - Temporal Shared-Memory Streaming
  - Last-Touch Correlated Data Streaming
- Summary

Enhancing Lookahead
Observation [Mendelson, Wood&Hill]: few live sets
- A block is used until its last "hit"; data reuse → high hit rate
- ~80% of cache frames are dead!
[Figure: L1 contents at times T1 and T2 — mostly dead sets, few live sets]
Exploit for lookahead:
- Predict the last "touch" prior to "death"
- Evict, predict, and fetch the next line

How Much Lookahead?
[Figure: histogram of frame deadtimes in cycles, with the L2 and DRAM latencies marked]
Frame deadtimes far exceed L2 and DRAM latencies, so a fetch launched at the predicted last touch arrives before the block is needed: predicting last touches can eliminate all of the miss latency!

Dead-Block Prediction [ISCA'00 & '01]
A per-frame trace of the memory accesses to a block predicts repetitive last-touch events.
Accesses to a block frame:
- PC0: load/store A0 (hit)
- PC1: load/store A1 (miss) — first touch
- PC3: load/store A1 (hit)
- PC3: load/store A1 (hit) — last touch
- PC5: load/store A3 (miss)
Trace = A1 ⊕ (PC1, PC3, PC3)
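A sketch of the per-frame trace and its signature; the next slide names truncated addition as the encoding, so the hash below sums the block address and access PCs modulo 2^k. The 16-bit width and the names are assumptions for illustration.

```python
SIG_BITS = 16

def signature(block_addr, pcs):
    """Truncated addition: sum the block address and access PCs,
    keeping only the low SIG_BITS bits."""
    s = block_addr
    for pc in pcs:
        s = (s + pc) & ((1 << SIG_BITS) - 1)
    return s

class BlockTrace:
    """Running access trace for one cache frame, started at the fill."""

    def __init__(self, block_addr, first_pc):
        self.block_addr = block_addr
        self.pcs = [first_pc]       # PC of the miss that filled the frame

    def touch(self, pc):
        self.pcs.append(pc)         # every subsequent hit extends the trace

    def current_signature(self):
        return signature(self.block_addr, self.pcs)

# The slide's example: a frame filled by PC1 and hit twice by PC3 carries
# the trace (A1; PC1, PC3, PC3), hashed into one signature for lookup.
```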

Dead-Block Prefetcher (DBCP)
History & correlation tables:
- History table (HT) ~ L1 tag array
- Correlation table ~ memory footprint
- Encoding: truncated addition
- Two-bit saturating counters
[Figure: on the current access, the history-table entry (A1; PC1, PC3, PC3) hashes into a correlation-table entry holding A3 with counter 11 — evict A1, fetch A3]
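A sketch of the two tables, reusing BlockTrace from the previous sketch. The prediction threshold, training on eviction, and method names are assumptions about one plausible organization, not the paper's exact design.

```python
class DBCP:
    """Sketch: history table of live traces + correlation table mapping a
    last-touch signature to the next block, guarded by 2-bit counters."""

    def __init__(self):
        self.history = {}   # block addr -> BlockTrace (one per cached frame)
        self.corr = {}      # signature  -> [predicted next addr, 2-bit counter]

    def on_access(self, pc, addr):
        if addr not in self.history:               # first touch (miss fill)
            self.history[addr] = BlockTrace(addr, pc)
        else:
            self.history[addr].touch(pc)
        sig = self.history[addr].current_signature()
        entry = self.corr.get(sig)
        if entry and entry[1] >= 2:                # confident last-touch match
            return ("last_touch", addr, entry[0])  # evict addr, fetch entry[0]
        return None

    def train_on_eviction(self, addr, next_miss_addr):
        # The trace held at eviction was the last touch; correlate it with
        # the block whose miss displaced this one.
        trace = self.history.pop(addr, None)
        if trace is None:
            return
        entry = self.corr.setdefault(trace.current_signature(),
                                     [next_miss_addr, 0])
        if entry[0] == next_miss_addr:
            entry[1] = min(3, entry[1] + 1)        # saturate upward
        elif entry[1] > 0:
            entry[1] -= 1                          # lose confidence
        else:
            entry[0] = next_miss_addr              # replace a dead prediction
```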

DBCP Coverage with Unlimited Table Storage
[Figure: per-benchmark L1 miss coverage with unlimited tables]
- High average L1 miss coverage
- Low misprediction rate (2-bit counters)

Impractical On-Chip Storage Size
DBCP needs over 150MB of tables to achieve its full potential!

Our Observation: Signatures are Temporally Correlated
Signatures need not reside on chip:
- Last-touch sequences recur, much as cache miss sequences recur [Chilimbi'02]
- Often due to large structure traversals
- Last-touch order ~ cache miss order (off by at most the L1 cache capacity)
Key implications:
- Can record last touches in miss order
- Store & stream signatures from off-chip

Last-Touch Correlated Data Streaming (LT-CORDS)
Streaming signatures on chip:
- Keep all signatures, in sequence order, in off-chip DRAM
- Retain sequence "heads" on chip; a "head" signals a stream fetch
- Small (~200KB) on-chip stream cache
- Tolerates order mismatch
- Lookahead for stream startup
DBCP coverage with moderate on-chip storage!
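A sketch of this storage split; the sequence layout, chunk size, and names are assumptions. Full signature sequences live in DRAM, only their heads and a small stream cache stay on chip, and a hit on a head triggers streaming the sequence body.

```python
class LTCords:
    """Sketch: on-chip heads + stream cache over off-chip signature sequences."""

    def __init__(self, dram_sequences, chunk=64):
        self.dram = dram_sequences      # signature sequences in last-touch order
        self.heads = {seq[0]: i         # on-chip: first signature of each sequence
                      for i, seq in enumerate(dram_sequences) if seq}
        self.stream_cache = set()       # small on-chip cache (~200KB in the talk)
        self.chunk = chunk

    def lookup(self, sig):
        """Is this signature available on chip (so a last touch can fire)?"""
        if sig in self.stream_cache:    # already streamed in
            return True
        if sig in self.heads:           # head cue: stream in the sequence body
            body = self.dram[self.heads[sig]][1:]
            self.stream_cache.update(body[:self.chunk])
            return True
        return False
```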

DBCP Mechanisms
[Figure: core with L1/L2 and DRAM; the history table and all signatures (160MB) sit in a random-access on-chip table]
All signatures in a random-access on-chip table

What LT-CORDS Does
- Signatures are stored off-chip, and only in order
- The "head" acts as the cue for the "stream"
- Only a subset is needed at a time
[Figure: core with L1/L2 and history table on chip; signatures in DRAM]

LT-CORDS Mechanisms
[Figure: core with L1/L2; on chip: history table, stream cache with heads (10K) and signatures (200K); full signature sequences in DRAM]
On-chip storage independent of footprint

Methodology
- SimpleScalar CPU model with Alpha ISA
- SPEC CPU2000 & Olden benchmarks (apps. with significant memory stalls)
- 8-wide out-of-order processor
- 2-cycle L1, 16-cycle L2, 180-cycle DRAM
- FU latencies similar to Alpha EV8
- 64KB 2-way L1D, 1MB 8-way L2
- LT-CORDS with 214KB on-chip storage

LT-CORDS vs. DBCP Coverage
[Figure: coverage across Olden, SPECint, and SPECfp benchmarks]
LT-CORDS matches the coverage of infinite-table DBCP

LT-CORDS Speedup
[Figure: per-benchmark speedups]
LT-CORDS hides a large fraction of memory latency

LT-CORDS Conclusions
Intuition: signatures are temporally correlated
- Cache miss & last-touch sequences recur
- Miss order ~ last-touch order
Impact: eliminates 75% of all misses
- Retains DBCP coverage, lookahead, and accuracy
- On-chip storage independent of footprint
- 2x fewer memory stalls than the best prior work

For more information Visit our website: http://www.ece.cmu.edu/CALCM