A Structure Layout Optimization for Multithreaded Programs Easwaran Raman, Princeton Robert Hundt, Google Sandya S. Mannarswamy, HP.


Outline
- Background
- Solution Outline
- Algorithm and Implementation
- Results
- Conclusion
3/13/2007 CGO 2007

Structure layout

struct S {              struct S {
  int a;                  int a;
  char X[1024];           int b;
  int b;                  char X[1024];
};                      };

For the access sequence ld s.a; ld s.b; st s.a (repeated), the first layout places a and b in different cache lines, so the load of s.b misses separately from s.a. The second layout packs a and b into the same line, so once s.a has been loaded, the load of s.b and the store to s.a hit in cache.

Multiprocessors: False Sharing
- Data is kept coherent across processor-local caches.
- Cache coherence protocols (states such as shared, exclusive, invalid, ...) operate at cache-line granularity.
- False sharing: unnecessary coherence cost incurred because data migrates at cache-line granularity.
- Example: fields f1 and f2 lie in the same cache line L. When P1 writes f1, the protocol invalidates the whole line in the other processors' caches, and with it f2, even though f2 itself is not shared.

Structure layout (two processors)

struct S {              struct S {
  int a;                  int a;
  char X[1024];           int b;
  int b;                  char X[1024];
};                      };

Now one processor repeatedly executes ld s.a while another executes st s.b. With the first layout, a and b sit in different cache lines, so after the initial misses each processor keeps its line and subsequent accesses hit. With the second layout, a and b share a line: every store to s.b invalidates the other processor's copy, turning its would-be hits on s.a into coherence misses.

Locality vs. False Sharing
- Tightly packed layouts: good locality, more false sharing.
- Loosely packed layouts: less false sharing, poor locality.
- Goal: increase locality and reduce false sharing simultaneously.

Solution Outline

struct S { int f1, f2; int f3, f4, f5; }

for (…) {
  … access f1 …
  … access f3 …
}

f1 and f3 are accessed together inside the loop, so for locality the layout should place them near each other.

Solution Outline (continued)

struct S { int f1, f2; int f3, f4, f5; }

T1: barrier; write f1
T2: barrier; read f…

T1 writes f1 while T2 concurrently reads another field, so those two fields should be kept in different cache lines. The resulting clusters are {f1, f4} and {f2, f3, f5}.

CycleGain

For all dynamic pairs of instructions (i1, i2):
  if i1 accesses f1 and i2 accesses f2 (or vice versa):
    if MemDistance(i1, i2) < T:
      CycleGain(f1, f2) += 1

MemDistance(i1, i2): the number of distinct memory addresses touched between i1 and i2.

CycleGain – In practice
Approximations:
- Use static instruction pairs.
- Consider only intra-procedural paths.
- Find paths within the same loop level.

If i1 and i2 belong to the same loop L:
  CycleGain(f1, i1, f2, i2) = Min(Freq(i1), Freq(i2))

CycleLoss
- Estimating the cycles lost to false sharing for a given layout is difficult, and insufficient: the cost must be predicted for candidate layouts, not just measured on the current one.
- Solution: compute a concurrent-execution profile and estimate false sharing from it.
- Relies on the Itanium performance counters.

Concurrency Profile
- Use Itanium's performance monitoring unit (PMU).
- Collect PC and ITC (interval time counter) values, giving each processor a sampled stream of (timestamp, basic block) pairs, e.g. P1: (1,B1), (5,B3), (12,B1), …
- Samples from different processors with nearby timestamps indicate basic blocks that execute concurrently; the counts form a BB-by-BB concurrency matrix.

CycleLoss
For every pair of fields f1 accessed in B1 and f2 accessed in B2:
  if one of the accesses is a write:
    CycleLoss(f1, f2) = k * Concurrency(f1, f2)

Clustering Algorithm
Separate RO (read-only) fields from RW (read-write) fields.
while RWF is not empty:
  seed = hottest field in RWF
  current_cluster = {seed}
  unassigned = RWF – {seed}
  while true:
    f = find_best_match()
    if f is NULL: exit loop
    add f to current_cluster
    remove f from unassigned
  add current_cluster to clusters
Assign each cluster to a cache line, adding padding as needed.

Clustering Algorithm: find_best_match()
best_match = NULL
best_weight = MIN
for every f1 in unassigned:
  weight = 0
  for every f2 in current_cluster:
    weight += w(f1, f2)
  if weight > best_weight:
    best_weight = weight
    best_match = f1
return best_match


Implementation
Tool flow: the source files are built into an executable, which Caliper runs to collect a PMU trace. Processing the trace yields field hotness and a concurrency profile; together with a basic-block-to-field map, the layout tool's analysis produces the new layout plus a layout rationale.

Experimental setup
- Target application: the HP-UX kernel. Key structures are heavily hand-optimized by kernel performance engineers.
- Profile runs: a 16-CPU Itanium2® machine.
- Measurement runs: an HP Superdome® with 128 Itanium2® CPUs (8 CPUs per cell, 4 cells per crossbar, 2 crossbars per backplane).
- Access latencies increase from cell-local to crossbar-local to inter-crossbar.

Experimental setup (continued)
- SPEC Software Development Environment Throughput (SDET) benchmark: runs multiple small processes and provides a throughput measure.
- 1 warmup run, 10 measured runs.
- Only a single structure's layout was modified in each run.
- Arithmetic mean of throughput computed after removing outliers.

Results


Conclusion
- A unified approach to locality and false sharing between structure fields.
- A new sampling technique to roughly estimate false sharing.
- Positive initial performance results on an important real-world application.

Thanks! Questions?