Slide 2: On the Interaction Between Commercial Workloads and Memory Systems in High-Performance Servers
Fredrik Dahlgren, Magnus Karlsson, and Jim Nilsson
In collaboration with Sun Microsystems and Ericsson Research
Per Stenström
Department of Computer Engineering, Chalmers, Göteborg, Sweden
http://www.ce.chalmers.se/~pers
Slide 3: Motivation
– Database applications dominate (32%)
– Yet, the major focus is on scientific/engineering applications (16%)
Slide 4: Project Objective
– Design principles for high-performance memory systems for emerging applications
– Systems considered:
  – high-performance compute nodes
  – SMP and DSM systems built out of them
– Applications considered:
  – decision support and on-line transaction processing
  – emerging applications: computer graphics, video/sound coding/decoding, handwriting recognition, ...
Slide 5: Outline
– Experimental platform
– Memory system issues studied:
  – working set size in DSS workloads
  – prefetch approaches for pointer-intensive workloads (such as in OLTP)
  – coherence issues in OLTP workloads
– Concluding remarks
Slide 6: Experimental Platform
– Single- and multiprocessor system models
– The platform enables:
  – analysis of commercial workloads
  – analysis of OS effects
  – tracking architectural events to the OS or application level
[Figure: simulated system model: a SPARC V8 CPU with cache ($) and memory (M), plus devices (interrupt, TTY, SCSI, Ethernet), running the operating system (Linux) and the application]
Slide 7: Outline
– Experimental platform
– Memory system issues studied:
  – working set size in DSS workloads
  – prefetch approaches for pointer-intensive workloads (such as in OLTP)
  – coherence issues in OLTP workloads
– Concluding remarks
Slide 8: Decision-Support Systems (DSS)
– Compile a list of matching entries in several database relations
– Will moderately sized caches suffice for huge databases?
[Figure: query-execution plan as a tree of scan and join nodes at levels 1, 2, ..., i-1, i]
Slide 9: Our Findings
– MWS: footprint of instructions and private data needed to access a single tuple
  – typically small (< 1 Mbyte) and not affected by database size
– DWS: footprint of database data (tuples) accessed across consecutive invocations of the same scan node
  – typically small impact (~0.1%) on the overall miss rate
[Figure: cache miss rate versus cache size, with knees at MWS, DWS_1, DWS_2, ..., DWS_i]
Slide 10: Methodological Approach
– Challenges:
  – not feasible to simulate huge databases
  – need source code: we used PostgreSQL and MySQL
– Approach: an analytical model using
  – parameters that describe the query
  – parameters measured on downscaled query executions
  – system parameters
Slide 11: Footprints and Reuse Characteristics in DSS
– MWS: instructions, private data, and metadata
  – can be measured on a downscaled simulation
– DWS: all tuples accessed at lower levels
  – can be computed from the query composition and the probability of a match
[Figure: the scan/join tree (levels 1, 2, ..., i-1, i) annotated with the footprint per tuple access at each level: MWS and DWS_1, MWS and DWS_2, ..., MWS and DWS_i-1, MWS and DWS_i]
Slide 12: Overview of the Analytical Model
Goal: predict miss rate versus cache size for fully associative caches with an LRU replacement policy on single-processor systems
– Number of cold misses: size of footprint / block size
  – |MWS| is measured
  – |DWS_i| is computed from parameters describing the query (sizes of relations, probability of matching a search criterion, index versus sequential scan, etc.)
– Number of capacity misses for a tuple access at level i (cache size C):
  – CM_0 * (1 - (C - C_0) / (|MWS| - C_0))   if C_0 < C < |MWS|
  – size of tuple / block size               if |MWS| <= C < |MWS| + |DWS_i|
– Number of accesses per tuple: measured
– Total number of misses and accesses: computed
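As an illustration of how the pieces of the model fit together, here is a minimal sketch (ours, not the authors' code; the names and example parameter values are hypothetical, and the capacity-miss expression follows the reconstruction above):

    #include <stdio.h>

    /* Hypothetical parameters for one scan level i. */
    typedef struct {
        double mws;        /* |MWS| in bytes, measured on a downscaled run      */
        double dws;        /* |DWS_i| in bytes, computed from the query         */
        double c0;         /* cache size below which no reuse is captured       */
        double cm0;        /* capacity misses per tuple access at cache size c0 */
        double tuple_size; /* bytes per tuple                                   */
        double block_size; /* cache block size in bytes                         */
    } level_params;

    /* Capacity misses per tuple access at level i, piecewise in cache size. */
    static double capacity_misses(const level_params *p, double cache_size)
    {
        if (cache_size <= p->c0)
            return p->cm0;                 /* cache too small: no reuse captured */
        if (cache_size < p->mws)
            return p->cm0 * (1.0 - (cache_size - p->c0) / (p->mws - p->c0));
        if (cache_size < p->mws + p->dws)
            return p->tuple_size / p->block_size;
        return 0.0;                        /* MWS and DWS_i both fit in the cache */
    }

    /* Cold misses are simply footprint / block size. */
    static double cold_misses(const level_params *p)
    {
        return (p->mws + p->dws) / p->block_size;
    }

    int main(void)
    {
        level_params q = { 1 << 20, 16 << 20, 8 << 10, 50.0, 200.0, 64.0 };
        printf("cold misses at this level: %.0f\n", cold_misses(&q));
        for (double c = 16 << 10; c <= (32 << 20); c *= 2)
            printf("cache %8.0f KB: capacity misses/tuple = %.2f\n",
                   c / 1024.0, capacity_misses(&q, c));
        return 0;
    }

Summing these per-level miss counts, weighted by the measured number of accesses per tuple, gives the total misses and accesses from which the miss rate is computed.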
Slide 13: Model Validation
Goals:
– prediction accuracy for queries with different compositions
  – Q3, Q6, and Q10 from TPC-D
– prediction accuracy when scaling up the database
  – parameters measured at 5 Mbytes used to predict 200-Mbyte databases
– robustness across database engines
  – two engines: PostgreSQL and MySQL
[Figure: miss ratio versus cache size (Kbytes) for Q3 on PostgreSQL; Q3 has 3 levels: 1 sequential scan, 2 index scans, and 2 nested-loop joins]
Slide 14: Model Predictions: Miss Rates for Huge Databases
– The miss-rate component from instructions, private data, and metadata decays rapidly (by 128 Kbytes)
– The miss-rate component from database data is small
– What's in the tail?
Slide 15: Outline
– Experimental platform
– Memory system issues studied:
  – working set size in DSS workloads
  – prefetch approaches for pointer-intensive workloads (such as in OLTP)
  – coherence issues in OLTP workloads
– Concluding remarks
Slide 16: Cache Issues for Linked Data Structures
– Pointer-chasing shows up in many interesting applications:
  – 35% of the misses in OLTP (TPC-B)
  – 32% of the misses in an expert system
  – 21% of the misses in Raytrace
– Traversal of lists may exhibit poor temporal locality
– It results in chains of data-dependent loads, called pointer-chasing (see the sketch below)
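As a brief illustration (ours, not from the slides), the address of each load depends on the value returned by the previous one, so cache misses along the chain cannot be overlapped:

    #include <stddef.h>

    struct node {
        int          key;
        struct node *next;
    };

    /* Each iteration loads p->next at an address that is only known once the
     * previous load has completed; a miss per node therefore serializes and
     * exposes the full memory latency on every step of the traversal.        */
    int list_sum(const struct node *p)
    {
        int sum = 0;
        while (p != NULL) {
            sum += p->key;
            p = p->next;   /* data-dependent load: cannot issue before p is known */
        }
        return sum;
    }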
Slide 17: SW Prefetch Techniques to Attack Pointer-Chasing
– Greedy Prefetching (G)
  – falls short when the computation per node is less than the memory latency
– Jump Pointer Prefetching (J)
  – falls short for short lists or when the traversal is not known a priori
– Prefetch Arrays (P.S / P.H)
  – a generalization of G and J that addresses the above shortcomings
  – trades memory space and bandwidth for more latency tolerance
(A sketch of the three schemes follows below.)
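A minimal sketch of the three schemes for a list traversal (our reconstruction, assuming GCC/Clang's __builtin_prefetch; field names such as jump and prefetch_array are hypothetical):

    #include <stddef.h>

    #define PREFETCH_DIST 4   /* how many nodes ahead a jump pointer points */

    struct node {
        int          key;
        struct node *next;
        struct node *jump;    /* hypothetical jump pointer, set up at insert time */
    };

    struct list {
        struct node *head;
        struct node *prefetch_array[PREFETCH_DIST];  /* first few nodes of the list */
    };

    /* Greedy prefetching (G): prefetch the immediate successor; this only hides
     * the latency if the work per node is long enough.                          */
    void traverse_greedy(struct node *p)
    {
        for (; p != NULL; p = p->next) {
            __builtin_prefetch(p->next);
            /* ... work on p ... */
        }
    }

    /* Jump pointer prefetching (J): prefetch PREFETCH_DIST nodes ahead, but the
     * first PREFETCH_DIST nodes of the list are never covered.                  */
    void traverse_jump(struct node *p)
    {
        for (; p != NULL; p = p->next) {
            if (p->jump != NULL)
                __builtin_prefetch(p->jump);
            /* ... work on p ... */
        }
    }

    /* Software prefetch arrays (P.S): prefetch the first nodes via an array kept
     * with the list head, then fall back to jump pointers for the rest.         */
    void traverse_prefetch_array(struct list *l)
    {
        for (int i = 0; i < PREFETCH_DIST; i++)
            __builtin_prefetch(l->prefetch_array[i]);
        traverse_jump(l->head);
    }

The hardware variant (P.H) would presumably issue equivalent prefetches from a small hardware engine rather than with explicit prefetch instructions.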
Slide 18: Results: Hash Tables and Lists in Olden
[Figure: results for MST and HEALTH under schemes B, G, J, P.S, and P.H]
Prefetch arrays do better because:
– MST has short lists and little computation per node
– they prefetch data for the first nodes in HEALTH, unlike jump pointer prefetching
Slide 19: Results: Tree Traversals in OLTP and Olden
[Figure: results for DB.tree and Tree.add under schemes B, G, J, P.S, and P.H]
Hardware-based prefetch arrays do better because:
– the traversal path is not known in DB.tree (depth-first search)
– data for the first nodes is prefetched in Tree.add
Slide 20: Other Results in Brief
– Impact of longer memory latencies:
  – robust for lists
  – for trees, prefetch arrays may cause severe cache pollution
– Impact of memory bandwidth:
  – performance improvements are sustained at bandwidths typical of high-end servers (2.4 Gbytes/s)
  – prefetch arrays may suffer for trees; severe contention was observed on low-bandwidth systems (640 Mbytes/s)
– Node insertion and deletion for jump pointers and prefetch arrays:
  – results in instruction overhead (-), however
  – insertion/deletion is itself sped up by prefetching (+)
Slide 21: Outline
– Experimental platform
– Memory system issues studied:
  – working set size in DSS workloads
  – prefetch approaches for pointer-intensive workloads (such as in OLTP)
  – coherence issues in OLTP workloads
– Concluding remarks
Slide 22: Coherence Issues in OLTP
– Favorite protocol: write-invalidate
– Ownership overhead: invalidations cause write stalls and invalidation traffic
[Figure: a DSM system built from SMP nodes, each node containing processors with caches (P, $) and memories (M)]
Slide 23: Ownership Overhead in OLTP
– Simulation setup: CC-NUMA with 4 nodes; MySQL, TPC-B, 600 MB database
– 40% of all ownership transactions stem from load/store sequences
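To make "load/store sequence" concrete, here is a small hypothetical example of the access pattern meant (the code is ours, not from the study): under write-invalidate, the load first brings the block into the cache in shared state, and the store that follows then needs a separate ownership (upgrade) transaction.

    struct account {
        long balance;
    };

    /* A read-modify-write of the same cache block: the load of a->balance
     * fetches the block in shared state, and the store immediately afterwards
     * must issue an ownership (invalidate/upgrade) request before completing. */
    void credit(struct account *a, long amount)
    {
        long b = a->balance;      /* load: block enters the cache shared        */
        a->balance = b + amount;  /* store: separate ownership transaction      */
    }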
Slide 24: Techniques to Attack Ownership Overhead
– Dynamic detection of migratory sharing
  – detects two load/store sequences by different processors
  – covers only a subset of all load/store sequences (~40% in OLTP)
– Static detection of load/store sequences
  – compiler algorithms that tag a load followed by a store and bring the block into the cache in exclusive state
  – poses problems in TPC-B
Slide 25: New Protocol Extension
– Criterion: on a load miss from processor i followed by a global store from i, tag the block as Load/Store
Slide 26: Concluding Remarks
– Focus on DSS and OLTP has revealed challenges not exposed by traditional applications:
  – pointer-chasing
  – load/store optimizations
– Application scaling is not fully understood
  – our work on combining simulation with analytical modeling shows some promise