Thoughts on Shared Caches
Jeff Odom
University of Maryland

A Brief History of Time
First there was the single CPU
– Memory tuning a new field
– Large improvements possible
– Life is good
Then came multiple CPUs
– Rethink memory interactions
– Life is good (again)
Now there's multi-core on multi-CPU
– Rethink memory interactions (again)
– Life will be good (we hope)

SMP vs. CMP
Symmetric Multiprocessing (SMP)
– Single CPU core per chip
– All caches private to each CPU
– Communication via main memory
Chip Multiprocessing (CMP)
– Multiple CPU cores on one integrated circuit
– Private L1 caches
– Shared second-level and higher caches

CMP Features
Thread-level parallelism
– One thread per core
– Same as SMP
Shared higher-level caches
– Reduced latency
– Improved memory bandwidth
Non-homogeneous data decomposition
– Not all cores are created equal

CMP Challenges
New optimizations
– False sharing/private data copies
– Delaying reads until shared
Fewer locations to cache data
– More chance of data eviction in high-throughput computations
Hybrid SMP/CMP systems
– Connect multiple multi-core nodes
– Composite cache sharing scheme
– Cray XT4: 2 cores/chip, 2 chips/node

False Sharing
Occurs when two CPUs access different data structures on the same cache line
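The diagrams for the next slides are not reproduced in this transcript, but the access pattern itself is easy to show in code. Below is a minimal C/pthreads sketch (not from the talk) in which two threads update different fields of the same struct; the counters are logically distinct data structures, yet they almost certainly occupy one cache line, so every write invalidates the line in the other CPU's cache.

```c
#include <pthread.h>

/* Two counters that are distinct data structures but will usually
 * land on the same (typically 64-byte) cache line.                  */
struct counters {
    long a;   /* updated only by thread 1 */
    long b;   /* updated only by thread 2 */
} shared_counters;

static void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000; i++)
        shared_counters.a++;   /* invalidates the line in the other CPU's cache */
    return NULL;
}

static void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000; i++)
        shared_counters.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

Neither thread ever reads the other's counter, yet the line bounces between caches on every update; padding each counter onto its own line (as in the private-copies sketch later) removes the contention.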

False Sharing (SMP): diagram-only slides 7-14 (animation sequence, figures not included in the transcript)
False Sharing (CMP): diagram-only slides 15-22 (animation sequence, figures not included in the transcript)

False Sharing (SMP vs. CMP)
With private L2 (SMP), modification of co-resident data structures results in trips to main memory
In CMP, false sharing impact is limited by the shared L2
Latency from L1 to L2 is much less than from L2 to main memory

Maintaining Private Copies
Two threads modifying the same cache line will want to move data to their L1
Simultaneous reading/modification causes thrashing between L1's and L2
Keeping a copy of data in a separate cache line keeps data local to the processor
Updates to shared data occur less often
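A minimal sketch of the idea, assuming 64-byte cache lines (the talk does not fix a line size): each thread accumulates into a line-aligned private slot and folds its result into the shared total only once at the end, so the shared data is touched rarely.

```c
#include <pthread.h>
#include <stdalign.h>

#define CACHE_LINE 64   /* assumed line size */

/* One line-sized slot per thread: each thread's running sum lives on
 * its own cache line, so private updates never contend.              */
struct padded_sum {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

static alignas(CACHE_LINE) struct padded_sum local_sum[2];
static long global_sum;                        /* shared, touched rarely */
static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    int id = *(int *)arg;

    for (long i = 0; i < 100000000; i++)
        local_sum[id].value++;                 /* stays in this core's L1 */

    pthread_mutex_lock(&sum_lock);             /* fold into shared data once */
    global_sum += local_sum[id].value;
    pthread_mutex_unlock(&sum_lock);
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int ids[2] = {0, 1};
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return global_sum == 200000000 ? 0 : 1;
}
```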

Delaying Reads Until Shared
Often the results from one thread are pipelined to another
Typical signal-based sharing:
– Thread 1 (T1) accesses data, which is pulled into T1's L1
– T1 modifies the data
– T1 signals T2 that the data is ready
– T2 requests the data, forcing eviction from T1's L1 into the shared L2
– Data is now shared; the L1 line is not filled in, wasting space

Delaying Reads Until Shared
Optimized sharing:
– T1 pulls data into its L1 as before
– T1 modifies the data
– T1 waits until it has other data to fill the line with, then uses that to push the modified data into the shared L2
– T1 signals T2 that the data is ready
– T1 and T2 now share the data in the shared L2
Eviction is a side-effect of loading the line
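For reference, here is a C11 sketch of the baseline "signal-based sharing" pattern described above: a producer thread, a consumer thread, and an atomic flag as the signal. Names and sizes are illustrative, not from the talk. The optimized variant, where T1 deliberately lets other fills push the modified line into the shared L2 before signalling, depends on the cache's replacement and write-back behaviour and has no portable C expression, so only the baseline is shown.

```c
#include <pthread.h>
#include <stdatomic.h>

#define N 1024

static double results[N];   /* produced by T1, consumed by T2 */
static atomic_int ready;    /* the "signal" in signal-based sharing */

static void *producer(void *arg) {
    (void)arg;
    for (int i = 0; i < N; i++)
        results[i] = i * 0.5;   /* data is pulled into T1's L1 and modified */
    atomic_store_explicit(&ready, 1, memory_order_release);   /* signal T2 */
    return NULL;
}

static void *consumer(void *arg) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                       /* wait for the signal */
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += results[i];      /* reads force the lines out of T1's L1 */
    *(double *)arg = sum;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    double sum = 0.0;
    pthread_create(&t1, NULL, producer, NULL);
    pthread_create(&t2, NULL, consumer, &sum);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return sum > 0.0 ? 0 : 1;
}
```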

Hybrid Models
Most CMP systems will have SMP as well
– Large core density not feasible
– Want to balance processing with cache sizes
Different access patterns
– Co-resident cores act differently than cores on different nodes
– Results may differ depending on which processor pairs you get

Experimental Framework
Simics simulator
– Full system simulation
– Hot-swappable components
– Configurable memory system: reconfigurable cache hierarchy, roll-your-own coherency protocol
Simulated environment
– SunFire 6800, Solaris 10
– Single CPU board, 4 UltraSPARC IIi
– Uniform main memory access
– Similar to actual hardware on hand

Experimental Workload
NAS Parallel Benchmarks
– Well-known, standard applications
– Various data access patterns (conjugate gradient, multi-grid, etc.)
OpenMP-optimized
– Already converted from original serial versions
– MPI-based versions also available
Small (W) workloads
– Simulation framework slows down execution
– Will examine larger (A-C) versions to verify tool correctness
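For a sense of what the OpenMP versions of these benchmarks look like, here is a small worksharing loop of the kind such kernels are built from: a CSR sparse matrix-vector product, the core operation of conjugate gradient. This is an illustrative sketch, not NAS source code; the worksharing schedule determines how rows (and the data they touch) are split across the cores and their caches.

```c
#include <omp.h>

/* Illustrative CSR sparse matrix-vector multiply, y = A*x.
 * The OpenMP schedule decides which rows each thread owns, and
 * therefore which parts of x and y end up in which core's cache.   */
void spmv(int n, const int *rowptr, const int *colidx,
          const double *val, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[colidx[k]];
        y[i] = sum;
    }
}
```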

Workload Results
Some benchmarks show marked improvement (CG)…
…others show marginal improvement (FT)…
…still others show asymmetrical loads (BT)…
…and asymmetrical improvement (EP)

The Next Step
How to get data and tools for programmers to deal with this?
– Hardware
– Languages
– Analysis tools
Specialized hardware counters
– Which CPU forced eviction
– Are cores or nodes contending for data
– Coherency protocol diagnostics

The Next Step
CMP-aware parallel languages
– Language-based framework easier to perform automatic optimizations
– OpenMP, UPC likely candidates
– Specialized partitioning may be needed to leverage shared caches
    Implicit data partitioning
    Current languages distribute data uniformly
– May require extensions (hints) in the form of language directives
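As a rough illustration of what such hints might look like: the closest existing mechanism is OpenMP's schedule clause, which controls how iterations (and hence the data they touch) are grouped onto threads. A genuinely cache-aware partitioning directive would be an extension; it appears below only as a clearly hypothetical comment, not as real OpenMP syntax.

```c
#include <omp.h>

#define N 1000000
static double a[N], b[N];

void scale(double s)
{
    /* Today's languages distribute iterations uniformly; a chunked
     * static schedule is the closest existing "hint" for keeping
     * neighbouring data on threads that share a cache.              */
    #pragma omp parallel for schedule(static, 8192)
    for (int i = 0; i < N; i++)
        a[i] = s * b[i];

    /* A CMP-aware extension of the kind the slide envisions might
     * add an explicit (hypothetical) clause such as:
     *   #pragma omp parallel for partition(shared_cache)
     * Nothing like this exists in standard OpenMP today.            */
}
```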

The Next Step
Post-execution analysis tools
– Identify memory hotspots
– Provide hints on restructuring
    Blocking
    Execution interleaving
– Convert SMP-optimized code for use in CMP
– Dynamic instrumentation opportunities

Questions?