Avoiding Initialization Misses to the Heap

Presentation transcript:

Avoiding Initialization Misses to the Heap
Jarrod Lewis, Bryan Black, and Mikko H. Lipasti
Department of Electrical and Computer Engineering, University of Wisconsin-Madison
Intel Labs
http://www.ece.wisc.edu/~pharm

Motivation
Memory bandwidth is expensive
  Shouldn’t waste it on useless traffic
  Can be put to better use: multithreading, prefetching, MLP, etc.
Search and destroy useless traffic
Focus of this talk: heap initialization
  Detect and optimize initialization of newly allocated memory
  23% of misses in a 2MB cache transfer invalid data
April 6, 2019 Avoiding Initialization Misses to the Heap – Mikko Lipasti

Dynamically Allocated Memory
Heap space states: Unallocated (invalid), Allocated (invalid), Allocated (valid)
  malloc(): Unallocated -> Allocated, invalid
  initializing store: Allocated, invalid -> Allocated, valid
  load or store: stays Allocated, valid
  free(): either Allocated state -> Unallocated
Invalid memory need not be transferred
Provide interface that expresses this directly?
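The state diagram on this slide can be sketched as a small transition table. This is only an illustration: the event names and the `needs_transfer` helper are hypothetical, and a load from allocated-but-invalid memory is treated as undefined because the diagram shows no such edge.

```python
from enum import Enum

class State(Enum):
    UNALLOCATED_INVALID = "unallocated/invalid"
    ALLOCATED_INVALID = "allocated/invalid"
    ALLOCATED_VALID = "allocated/valid"

# (state, event) -> next state, following the edges named on the slide.
TRANSITIONS = {
    (State.UNALLOCATED_INVALID, "malloc"): State.ALLOCATED_INVALID,
    (State.ALLOCATED_INVALID, "store"): State.ALLOCATED_VALID,  # initializing store
    (State.ALLOCATED_INVALID, "free"): State.UNALLOCATED_INVALID,
    (State.ALLOCATED_VALID, "load"): State.ALLOCATED_VALID,
    (State.ALLOCATED_VALID, "store"): State.ALLOCATED_VALID,
    (State.ALLOCATED_VALID, "free"): State.UNALLOCATED_INVALID,
}

def step(state, event):
    """Return the next state, or raise if the diagram has no such edge."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not defined in state {state.name}")

def needs_transfer(state, event):
    """Hypothetical helper: only valid data has to be fetched on a miss."""
    return state is State.ALLOCATED_VALID and event in ("load", "store")
```

Under this reading, a store that misses while the block is still allocated/invalid is an initializing store, and `needs_transfer` returns False for it.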

Talk Outline
  Motivation
  Analysis of Heap Behavior
  Detecting Initializing Writes
  Performance Analysis
  Conclusions

Allocation Analysis
Two main modes
  Single dominant allocation (up to 100MB), or
  Numerous moderate allocations
Initialization of allocations
  88% initialized with store miss
  Little temporal reuse of free’d memory
Phase behavior
  Start of program often dominates
  Even SPEC has counterexamples (gcc, vortex)

Cache Miss Behavior
Init stores cause up to 60% of misses (avg 23%)
These are 35% of all compulsory misses

Talk Outline
  Motivation
  Analysis of Heap Behavior
  Detecting Initializing Writes
  Performance Analysis
  Conclusions

Detecting Initializing Writes
Annotate malloc()
  Record base, size in allocation range cache
Key questions
  What is working set?
  How are ranges represented?
    Valid bits? Not scalable for 100MB allocation
    Base + bound
  How are ranges updated on writes?
    Split vs. truncate
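A minimal sketch of what a base + bound allocation range cache with a truncate policy could look like. The class and method names, the LRU replacement, and the rounding to whole 64-byte blocks are assumptions made for illustration; the slide only names the ideas.

```python
from collections import OrderedDict

BLOCK = 64  # assumed cache block size in bytes

class AllocationRangeCache:
    """Small LRU table of [base, bound) ranges allocated but not yet written."""

    def __init__(self, entries=8):
        self.entries = entries
        self.ranges = OrderedDict()  # base -> bound, least recently used first

    def on_malloc(self, base, size):
        # Track only cache blocks that lie entirely inside the allocation.
        lo = -(-base // BLOCK) * BLOCK        # round start up to a block boundary
        hi = (base + size) // BLOCK * BLOCK   # round end down to a block boundary
        if lo < hi:
            self.ranges[lo] = hi
            self.ranges.move_to_end(lo)
            if len(self.ranges) > self.entries:
                self.ranges.popitem(last=False)  # evict the LRU range

    def on_store_miss(self, addr):
        """Return True if the missing block is untouched and need not be fetched."""
        blk = addr // BLOCK * BLOCK
        for lo, hi in list(self.ranges.items()):
            if lo <= blk < hi:
                del self.ranges[lo]
                # Truncate: keep only the part above the written block.
                # A split policy would also keep [lo, blk) as a second entry.
                if blk + BLOCK < hi:
                    self.ranges[blk + BLOCK] = hi
                return True
        return False
```

Truncation stays conservative: blocks dropped from tracking are simply fetched as usual. For scale, one valid bit per 64-byte block of a 100MB allocation would need 100 x 2^20 / 64 = 1,638,400 bits (200KB), whereas base + bound needs two addresses per range.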

Allocation Working Set
4-8 entries sufficient, except parser needs 64

Sequential Initialization Tracking
[Figure: blocks A-F over time; pattern 1: Sequential; scheme 1: Forward Sweep; legend: Allocated-Invalid, Initialized, Unknown]
Forward sweep captures 90%+ except bzip, gzip, perl
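One way to picture forward sweep is a single base pointer per range that only moves upward. The sketch below is an assumption about the scheme, not a detail recovered from the figure: blocks A-F are numbered 0-5, any first write at or above the pointer counts as recognized, and blocks skipped over become "unknown".

```python
def forward_sweep(num_blocks, writes):
    """Count writes recognized as initializing by a single forward pointer."""
    base = 0        # lowest block still assumed allocated-invalid
    captured = 0
    for blk in writes:
        if base <= blk < num_blocks:
            captured += 1
            base = blk + 1  # blocks in [old base, blk) become unknown
    return captured

print(forward_sweep(6, [0, 1, 2, 3, 4, 5]))  # sequential A..F -> 6
print(forward_sweep(6, [5, 4, 3, 2, 1, 0]))  # reverse order -> 1
```

Under these assumptions a sequential initializer is fully captured, while any order that jumps ahead loses the blocks it skipped.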

Alternating Initialization Tracking
[Figure: blocks A-F over time; pattern 2: Alternating; scheme 2: Bidirectional Sweep; legend: Allocated-Invalid, Initialized, Unknown]
Bidirectional captures 90%+ of perl
Doesn’t help bzip or gzip
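A bidirectional sweep can be sketched as a range that shrinks from either end. The transcript does not spell out the alternating order, so the example assumes writes alternate between the two ends (A, F, B, E, C, D). The function name and the "truncate from the nearer end" rule are likewise assumptions.

```python
def bidirectional_sweep(num_blocks, writes):
    """Count writes recognized as initializing when both ends can be truncated."""
    lo, hi = 0, num_blocks  # blocks in [lo, hi) are still assumed uninitialized
    captured = 0
    for blk in writes:
        if lo <= blk < hi:
            captured += 1
            # Truncate from whichever end drops fewer still-tracked blocks.
            if blk - lo <= hi - 1 - blk:
                lo = blk + 1
            else:
                hi = blk
    return captured

print(bidirectional_sweep(6, [0, 5, 1, 4, 2, 3]))  # assumed alternating order -> 6
```

With the same assumed order, a tracker that can only advance a single lower pointer would recognize just the first two writes (A and F), which is the gap the second pointer closes.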

Striding Initialization Tracking
[Figure: blocks A-F initialized in the order A C E B D F; pattern 3: Striding; scheme 3: Interleaving; legend: Allocated-Invalid, Initialized, Unknown]
Interleaving captures 90%+ of gzip
Still only 60% of bzip
  bzip has a large allocation with random initialization
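The striding order A C E B D F suggests tracking several interleaved sub-ranges at once. The sketch below is one plausible reading, not the scheme as implemented in the talk: the range is split into `ways` lanes (a hypothetical parameter), and each lane gets its own forward pointer.

```python
def interleaved_sweep(num_blocks, writes, ways=2):
    """Count writes recognized as initializing with one forward pointer per lane."""
    # next_blk[i] is the lowest untouched block in lane i (blocks i, i+ways, ...).
    next_blk = list(range(ways))
    captured = 0
    for blk in writes:
        lane = blk % ways
        if next_blk[lane] <= blk < num_blocks:
            captured += 1
            next_blk[lane] = blk + ways
    return captured

stride_order = [0, 2, 4, 1, 3, 5]  # A C E B D F
print(interleaved_sweep(6, stride_order, ways=2))  # -> 6
print(interleaved_sweep(6, stride_order, ways=1))  # single pointer -> 4
```

With `ways=1` the function degenerates to a single forward pointer, which under this model misses B and D. A random initialization order defeats any fixed number of lanes, which is consistent with the bzip remark on the slide.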

Talk Outline
  Motivation
  Analysis of Heap Behavior
  Detecting Initializing Writes
  Performance Analysis
  Conclusions

PharmSim Overview
[Diagram labels: PharmSim (OOO Core, Gigaplane); Block; Simple; SimOS-PPC (AIX 4.3.1, Disk driver, E’net driver); Ethernet]
Device simulation, etc. from SimOS-PPC [IBM ARL]
PharmSim replaces functional simulators
  Full OOO core model, values in rename registers
  Supports priv. mode, MMU, TLB, exceptions, interrupts, barriers, flushes, etc.
Lead developer: Trey Cain (thanks Trey!)

Operating System Effects
Widely accepted for SPECINT: safe to ignore O/S paths
Most popular tool (SimpleScalar)
  Intercepts system calls
  Emulates on host, updates “flat” memory
  Returns “magically” with cache contents intact
We have found that [CAECW2002]:
  Omitting system references leads to dramatic error (5.8x L2 miss rate, 100% IPC in worst case)
  Specifically, AIX page fault handler eliminates many initializing write misses
Had we not used PharmSim?
  Dramatically overstated performance benefit

AIX Page Installation
Heap manager calls sbrk
Malloc returns block < 4KB
Program writes to block
First reference causes page fault
AIX installs entire page using dcbz
[Diagram: data segment with Unallocated and Allocated/Valid regions]
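The effect of page installation on initializing stores can be illustrated with a toy model. Everything here is a simplifying assumption: 4KB pages, 64-byte blocks, an unbounded cache (so pollution is not modeled), and a hypothetical `count_fetches` function standing in for the memory system.

```python
PAGE = 4096   # assumed page size in bytes
BLOCK = 64    # assumed cache block size in bytes

def count_fetches(store_addrs, page_install):
    """Count blocks fetched from memory for a stream of stores to fresh heap."""
    cached = set()   # block numbers present in an unbounded cache
    mapped = set()   # page numbers already installed by the OS
    fetches = 0
    for addr in store_addrs:
        page, blk = addr // PAGE, addr // BLOCK
        if page not in mapped:
            mapped.add(page)  # first reference takes a page fault
            if page_install:
                # The handler zeroes every block of the page in the cache,
                # in the spirit of issuing dcbz for each block.
                first = page * (PAGE // BLOCK)
                cached.update(range(first, first + PAGE // BLOCK))
        if blk not in cached:
            fetches += 1
            cached.add(blk)
    return fetches

stores = list(range(0, 2 * PAGE, 8))  # initialize two fresh pages, 8 bytes at a time
print(count_fetches(stores, page_install=False))  # -> 128
print(count_fetches(stores, page_install=True))   # -> 0
```

In this model a simulator that skips the page fault handler would charge 128 store-miss fetches that the installed page never incurs.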

Block vs. Page Installation
Page installation is practically free as part of page fault
Shortcomings of page installation
  Pollutes cache
  Not scalable to superpages (AIX v5.1)
  Does not work for heap reuse
Our short simulations don’t show this benefit
  I.e. high overlap between initializing writes and first reference to extended data segment

Integrating ARC

Speedup
Very aggressive core model
  Still can’t tolerate all store miss latency
Block mode slightly better than page mode
  Cache pollution, less coverage

Program Phase Behavior
Only benefits initialization program phase
Some programs initialize throughout execution

Conclusions
Initializing writes
  Cause 23% of all misses in 2MB L2
  Avoid miss with block or page mode install
  Up to 41% performance improvement
  Subject to initialization:computation ratio
Tracking allocation ranges
  Working set very small (4-8 entries; 64 for parser)
  Forward/bidirectional/interleaved sweep enables range truncation

Acknowledgments
Originated as course project: Gordie Bell, Trey Cain, Kevin Lepak
PharmSim infrastructure
  Lead developer: Trey Cain
Financial and equipment support
  IBM and Intel Corp
  National Science Foundation
  University of Wisconsin

Questions?

Backup Slides

Invalid Memory Traffic
  Real data traffic that transfers invalid data
Initializing Store
  Initial write to a storage location that contains invalid data

Allocation Analysis
Single dominant allocation vs. numerous moderate allocations

Initialization of Heap
88% initialized by store miss
Relatively little temporal reuse of freed memory

PharmSim Pipeline
[Diagram labels: Decode, Execute, Commit, Mem, Fetch, Translate]
Substantially similar to IBM Power4
  Some instructions “cracked” (1:2 expansion)
  Others (e.g. lmw) microcode stream
Mem stage
  Interface to 2-level cache model
  Sun Gigaplane XB snoopy MP coherence
  Caches contain values, must remain coherent
No cheating!
  No “flat” memory model for reference/redirect

Machine Model
Unrealistically aggressive model to devalue the impact of store misses
  8-wide, 6-stage pipeline
  8K entry combining predictor
  128 RUU, 64 LSQ entries, 64 write buffers
  256KB 4-way associative L1D cache
  64KB 2-way associative L1I
  2MB 4-way associative L2 unified cache
  All cache blocks are 64 bytes
  L2 latency is 10 cycles
  Memory latency is 70 cycles
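As a quick check on the cache geometry listed above, the number of sets follows from capacity / (associativity x block size). The helper name is hypothetical, and KB and MB are assumed to be powers of two.

```python
def num_sets(capacity_bytes, ways, block_bytes=64):
    """Sets = capacity / (associativity * block size)."""
    return capacity_bytes // (ways * block_bytes)

KB, MB = 1024, 1024 * 1024
caches = {
    "L1D": num_sets(256 * KB, 4),
    "L1I": num_sets(64 * KB, 2),
    "L2": num_sets(2 * MB, 4),
}
print(caches)  # {'L1D': 1024, 'L1I': 512, 'L2': 8192}
```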