Memory State Compressors for Gigascale Checkpoint/Restore
Andreas Moshovos (© Moshovos)


Slide 2: Gigascale Checkpoint/Restore
- Several potential uses:
  - Debugging
  - Runtime checking
  - Reliability
  - Gigascale speculation
[Figure: an instruction stream with a checkpoint taken many instructions before a restore trigger]

Slide 3: Key Issues & This Study
- Track and restore memory state
- I/O?
- This work: memory state compression
- Goals:
  - Minimize on-chip resources
  - Minimize performance impact
- Contributions:
  - Used value prediction to simplify the compression hardware
  - Fast, simple, and inexpensive
  - Beneficial whether used alone or combined with an existing compressor

Slide 4: Outline
- Gigascale checkpoint/restore
- Compressor architecture: challenges
- Value-prediction-based compressors
- Evaluation

Slide 5: Our Approach to Gigascale C/R (GCR)
- Checkpoint: the memory blocks that were written into
- Current memory state + checkpoint = previous memory state
- Checkpoints can be large (Mbytes), and we may want many of them
[Figure: after a checkpoint begins, each memory block is checkpointed on its first write; on a restore trigger, all checkpointed memory blocks are restored]
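The checkpoint-on-first-write scheme on this slide can be sketched as follows. This is an illustrative software model only (the class and field names are mine, not the paper's); memory is modeled as a dict of block address to contents.

```python
class Checkpoint:
    """Logs each memory block's old contents on its first write after
    the checkpoint begins; restore() puts every saved block back, so
    current memory state + checkpoint = previous memory state."""

    def __init__(self, memory):
        self.memory = memory   # live memory image: addr -> value
        self.saved = {}        # addr -> contents at checkpoint time

    def write(self, addr, value):
        # Log the block only on the *first* write since the checkpoint began.
        if addr not in self.saved:
            self.saved[addr] = self.memory.get(addr)
        self.memory[addr] = value

    def restore(self):
        # Undo every logged write, restoring the checkpointed state.
        for addr, old in self.saved.items():
            if old is None:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old
        self.saved.clear()

mem = {0x100: 1, 0x140: 2}
cp = Checkpoint(mem)
cp.write(0x100, 99)   # first write: old value 1 is logged
cp.write(0x100, 42)   # second write to same block: nothing new logged
cp.restore()
print(mem[0x100])     # -> 1
```

Only the first write to a block pays the logging cost, which is why checkpoint size grows with the number of distinct blocks written rather than with the total number of stores.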

Slide 6: Checkpoint Storage Requirements
[Figure: maximum checkpoint size in bytes (1K to 32M) versus checkpoint interval in instructions (up to 1G)]

Slide 7: Architecture of a GCR Compressor
[Figure: data flows from the L1 data cache through an in-buffer, compressor, and alignment network to an out-buffer and main memory; buffer size trades resources against performance]
- Previous work: dictionary-based compressor
  - Relatively slow, complex alignment, on the order of 10K transistors
  - 64K in-buffer, ~3.7% average slowdown

Slide 8: Our Compression Architecture
- Standalone: similar compression with fewer resources
- In combination with a dictionary compressor: fewer resources (smaller in-buffer), better compression, and better performance
[Figure: a VP compressor stage with simple alignment is placed between the L1 data cache and the in-buffer; the dictionary compressor stage is optional]

Slide 9: Value-Predictor-Based Compression
[Figure: each input-stream value is checked against a value predictor; a correctly predicted value is replaced by a single flag bit in the output stream, while a mispredicted value is emitted as the opposite flag followed by the value itself]
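A minimal sketch of this idea, using a last-outcome predictor (the predictor variant named on the next slides). The flag encoding here (1 = predicted, 0 = mispredicted plus the literal value) is an assumption for illustration; the key property is that compressor and decompressor run the same predictor in lockstep, so predicted values need no literal bits at all.

```python
def vp_compress(values):
    """Replace each correctly predicted value with a 1 flag; emit a
    (0, value) pair for each misprediction. Last-outcome predictor:
    the prediction is simply the previous value seen."""
    last = None                 # predictor state
    out = []
    for v in values:
        if v == last:
            out.append(1)       # predicted: flag bit only
        else:
            out.append((0, v))  # mispredicted: flag plus literal value
        last = v                # predictor always updates
    return out

def vp_decompress(stream):
    """Regenerate the value stream by running the same predictor."""
    last = None
    values = []
    for item in stream:
        v = last if item == 1 else item[1]
        values.append(v)
        last = v
    return values

data = [0, 0, 0, 22, 22, 7]
packed = vp_compress(data)      # [(0, 0), 1, 1, (0, 22), 1, (0, 7)]
assert vp_decompress(packed) == data
```

Runs of repeated values, which are common in memory data, collapse to one flag bit each, which is what makes such a simple predictor useful as a compressor front end.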

Slide 10: Example
[Figure: worked example of the value predictor compressing a stream of values (0, 22, ...) over time]

Slide 11: Block VP-Based Compressor
- Shown is a last-outcome predictor
- Studied others (four combinations per word)
[Figure: a cache block of 16 words, plus its address, feeds single-entry per-word predictors; the output is a one-word header of per-word prediction bits followed by only the mispredicted words, with half-word alignment]
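The block-level organization can be sketched as below, again assuming the last-outcome variant: one single-entry predictor per word position, a one-word header with one prediction bit per word, and only the mispredicted words stored verbatim. The alignment machinery and the address-indexed predictor variants are omitted; this models the format, not the hardware.

```python
WORDS_PER_BLOCK = 16

def compress_block(block, predictors):
    """Return (header, literals): header bit i set means word i was
    correctly predicted; literals holds the mispredicted words in order."""
    header, literals = 0, []
    for i, word in enumerate(block):
        if word == predictors[i]:
            header |= 1 << i          # predicted: one header bit suffices
        else:
            literals.append(word)     # mispredicted: word stored verbatim
        predictors[i] = word          # last-outcome: always update
    return header, literals

def decompress_block(header, literals, predictors):
    """Rebuild the block from the header and the verbatim words,
    running an identical copy of the per-word predictors."""
    it = iter(literals)
    block = []
    for i in range(WORDS_PER_BLOCK):
        word = predictors[i] if header & (1 << i) else next(it)
        predictors[i] = word
        block.append(word)
    return block

comp_state = [0] * WORDS_PER_BLOCK    # compressor's predictor table
decomp_state = [0] * WORDS_PER_BLOCK  # decompressor's mirror copy
block = [0] * 12 + [5, 6, 0, 0]
header, literals = compress_block(block, comp_state)
assert decompress_block(header, literals, decomp_state) == block
assert literals == [5, 6]             # only two words stored verbatim
```

Because each word position has its own single-entry predictor, all 16 predictions happen independently and in parallel, which is what keeps the hardware fast and the alignment simple compared with a dictionary scheme.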

Slide 12: Evaluation
- Compression rates, compared with LZW
- Performance, as a function of in-buffer size

Slide 13: Methodology
- SimpleScalar v3
- SPEC CPU 2000 with reference inputs
- Ignore the first checkpoint to avoid artificially skewing the results
- Simulated up to 80 billion instructions (compression rates) and 5 billion instructions (performance)
- 8-way out-of-order superscalar
- 64K L1D and L1I caches, 1M unified L2

Slide 14: Compression Rate vs. LZW
[Figure: compression rates compared with LZW for a 256M-instruction checkpoint interval]

Slide 15: Performance Degradation
- LZW + 64K buffer: ~3.7% slowdown
- LZW + last-outcome (LO) predictor + 1K buffer: ~1.6% slowdown

Slide 16: Summary
- Memory state compression for gigascale C/R, with many potential applications
- Used simple value-prediction compressors: few resources, low complexity, fast
- Can be used alone
- Can be combined with dictionary-based compressors: reduced on-chip buffering, better performance
- Future direction: main memory compression?