
Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers
Jack Sampson*, Rubén González†, Jean-François Collard¤, Norman P. Jouppi¤, Mike Schlansker¤, Brad Calder‡
*UCSD  †UPC Barcelona  ¤Hewlett-Packard Laboratories  ‡UCSD/Microsoft

Motivations
CMPs are not just small multiprocessors:
- Different computation/communication ratio
- Different shared resources
The inter-core fabric offers potential to support optimizations/acceleration:
- CMPs for vector, streaming workloads

Fine-grained Parallelism
CMPs in the role of vector processors:
- Software synchronization is still expensive
- Can target inner-loop parallelism
Barriers are a straightforward organizing tool:
- Opportunity for hardware acceleration
Faster barriers allow greater parallelism:
- 1.2x - 6.4x on 256-element vectors
- 3x - 12.2x on 1024-element vectors
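To make the inner-loop use case concrete, here is a minimal sketch of the kind of kernel the talk targets: Livermore loop 3 (an inner product), statically split across cores, with one barrier closing the parallel phase. The decomposition, the `partial` array, and the `barrier_arrive_and_wait` call (a thread-side sketch of it appears later in the talk) are illustrative assumptions, not code from the paper.

```c
/* Hypothetical fast-barrier call; sketched later in the
 * implementation section. */
extern void barrier_arrive_and_wait(void);

/* Livermore loop 3 (inner product), partitioned across ncores
 * threads; tid is this thread's index.  A reduction this small
 * only pays off if the barrier is cheap, which is exactly the
 * regime the talk targets. */
double inner_product(const double *x, const double *z, long n,
                     int tid, int ncores, double *partial)
{
    long chunk = n / ncores;
    long lo = (long)tid * chunk;
    long hi = (tid == ncores - 1) ? n : lo + chunk;

    double q = 0.0;
    for (long k = lo; k < hi; k++)
        q += z[k] * x[k];
    partial[tid] = q;

    barrier_arrive_and_wait();   /* all partial sums now visible */

    double sum = 0.0;            /* every thread computes the total */
    for (int t = 0; t < ncores; t++)
        sum += partial[t];
    return sum;
}
```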

Accelerating Barriers
Barrier filters: a new method for barrier synchronization.
- No dedicated networks
- No new instructions
- Changes only in the shared memory system
- A CMP-friendly design point
Competitive with a dedicated barrier network:
- Achieves 77%-95% of dedicated network performance

Outline
- Introduction
- Barrier Filter Overview
- Barrier Filter Implementation
- Results
- Summary

Observation and Intuition
Observations:
- Barriers need to stall forward progress
- There exist events that already stall processors
Co-opt and extend existing stall behavior:
- Cache misses; either the I-cache or the D-cache suffices

High-Level Barrier Behavior
A thread can be in one of three states:
1. Executing: perform work, enforce memory ordering, signal arrival at the barrier
2. Blocking: stall at the barrier until all threads arrive
3. Resuming: release from the barrier
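Concretely, the thread side of this protocol can be built entirely from existing instructions. The sketch below assumes an x86-like ISA, using `clflush` as the "invalidate a designated line" primitive and `mfence` for ordering; `my_signal_line` stands in for the per-thread address handed out by the OS (see the software interface later), and all names are hypothetical.

```c
#include <stdint.h>
#include <emmintrin.h>   /* _mm_mfence, _mm_clflush */

/* Per-thread arrival line for this barrier, assigned by the OS;
 * one designated cache line per thread (hypothetical name). */
extern volatile uint64_t *my_signal_line;

void barrier_arrive_and_wait(void)
{
    /* Executing: enforce memory ordering before signaling, so all
     * pre-barrier stores are visible to the other threads. */
    _mm_mfence();

    /* Signal arrival: invalidate the designated line.  The
     * invalidation propagates to the shared L2, where the filter
     * snoops it and records this thread's arrival. */
    _mm_clflush((const void *)my_signal_line);

    /* Blocking: re-reading the line misses in the L1; the filter
     * withholds the fill until every thread has arrived, so the
     * core stalls on an ordinary cache miss - no new instructions,
     * no dedicated network. */
    (void)*my_signal_line;

    /* Resuming: the fill was finally served, i.e. the barrier
     * released this thread. */
}
```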

Barrier Filter Example
CMP augmented with a filter: private L1s, shared banked L2.
Filter state (3 threads): arrived-counter = 0; threads A, B, C all EXECUTING.

Example: Memory Ordering
Before/after ordering for memory operations around the barrier: each thread executes a memory fence.
Filter state (3 threads): arrived-counter = 0; threads A, B, C all EXECUTING.

Example: Signaling Arrival
Communication with the filter: each thread invalidates a designated cache line.
Filter state (3 threads): arrived-counter = 0; threads A, B, C all EXECUTING.

Example: Signaling Arrival
The invalidation propagates to the shared L2 cache. The filter snoops the invalidation, checks the address for a match, and records the arrival.
Filter state: arrived-counter 0 → 1; thread A EXECUTING → BLOCKING.

Example: Signaling Arrival
The invalidation propagates to the shared L2 cache. The filter snoops the invalidation, checks the address for a match, and records the arrival.
Filter state: arrived-counter 1 → 2; thread C EXECUTING → BLOCKING.

Example: Stalling
Thread A attempts to fetch the invalidated data. The fill request is not satisfied: this is the thread-stalling mechanism.
Filter state: arrived-counter = 2; A BLOCKING, B EXECUTING, C BLOCKING.

Example: Release
The last thread signals arrival, triggering barrier release: the counter resets and the filter state for all threads switches.
Filter state: arrived-counter 2 → 0; threads A, B, C all RESUMING.

Example: Release
After release, new cache-fill requests are served normally, and the filter serves the pending cache fills it had withheld.
Filter state: arrived-counter = 0; threads A, B, C all RESUMING.

Outline
- Introduction
- Barrier Filter Overview
- Barrier Filter Implementation
- Results
- Summary

Software Interface
Communication requirements:
- Let the hardware know the number of threads
- Let the threads know their signal addresses
Barrier filters as a virtualized resource:
- Library interface
- Pure software fallback
User scenario:
1. The application calls the OS to create a barrier with a given number of threads
2. The OS allocates a barrier filter, relaying the address and thread count to it
3. The OS returns the address to the application
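The slides leave the library interface abstract; a plausible user-level shape for it, with entirely hypothetical names and types, might be:

```c
/* Hypothetical library view of the OS interface described above.
 * None of these names come from the paper. */
typedef struct {
    volatile unsigned long *signal_addr; /* per-thread arrival line */
    volatile unsigned long *exit_addr;   /* per-thread exit line    */
    int nthreads;
    int hw_backed;  /* 0 => the pure software fallback is in use */
} barrier_handle_t;

/* Asks the OS to create a barrier for nthreads.  The OS tries to
 * allocate a hardware barrier filter, relays the signal addresses
 * and thread count to it, and returns the addresses; if no filter
 * is free, the library falls back to an all-software barrier
 * behind the same interface. */
int barrier_create(barrier_handle_t *b, int nthreads);

/* Arrive and wait; dispatches to the filter protocol or to the
 * software fallback depending on hw_backed. */
void barrier_wait(barrier_handle_t *b);
```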

Barrier Filter Hardware
Additional hardware: an "address filter".
- In the controller for the shared memory level
- A state table and associated FSMs
- Snoops invalidations and fill requests for designated addresses
Makes use of existing instructions and the existing interconnect network.

Barrier Filter Internals
Each barrier filter supports one barrier:
- Barrier state
- Per-thread state and FSMs
Multiple barrier filters:
- One in each controller
- In banked caches, at a particular bank
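Reading the state table and per-thread FSMs described above into code, one filter entry might behave like the following simulator-style model. `MAX_THREADS`, `serve_fill`, and all field names are assumptions for illustration, not the paper's design.

```c
#define MAX_THREADS 16

/* Hook into the cache controller that actually returns the line
 * to the requesting core (hypothetical). */
extern void serve_fill(int tid);

enum tstate { EXECUTING, BLOCKING, RESUMING };

typedef struct {
    int nthreads;
    int arrived;                       /* arrived-counter        */
    enum tstate state[MAX_THREADS];    /* per-thread FSM         */
    int fill_pending[MAX_THREADS];     /* withheld fill requests */
} filter_t;

/* Snooped invalidation of thread tid's designated arrival line. */
void on_invalidate(filter_t *f, int tid)
{
    f->state[tid] = BLOCKING;
    if (++f->arrived == f->nthreads) { /* last arrival: release  */
        f->arrived = 0;
        for (int t = 0; t < f->nthreads; t++) {
            f->state[t] = RESUMING;
            if (f->fill_pending[t]) {  /* wake stalled cores     */
                f->fill_pending[t] = 0;
                serve_fill(t);
            }
        }
    }
}

/* Snooped fill request for thread tid's designated arrival line. */
void on_fill_request(filter_t *f, int tid)
{
    if (f->state[tid] == BLOCKING)
        f->fill_pending[tid] = 1;      /* withhold: thread stalls */
    else
        serve_fill(tid);               /* executing/resuming: ok  */
}
```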

Why have an exit address?
Needed for re-entry to barriers:
- When does a Resuming thread again become Executing?
- Additional fill requests may be issued
Delivery of the release is not a guarantee of receipt:
- Context switches
- Migration
- Cache eviction

Ping-Pong Optimization
Draws on sense-reversing barriers: entry and exit operations act as duals.
Two alternating arrival addresses:
- Each conveys exit from the other address's barrier
- Eliminates the explicit invalidate of the exit address
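On the thread side, this might look like the earlier arrival sketch with a per-thread sense bit selecting between the two lines. As before, the primitives are x86-style stand-ins and every name is hypothetical.

```c
#include <stdint.h>
#include <emmintrin.h>   /* _mm_mfence, _mm_clflush */

/* Two alternating per-thread arrival lines (hypothetical). */
extern volatile uint64_t *signal_line[2];

void barrier_pingpong(void)
{
    static _Thread_local int sense = 0;  /* which line this episode */

    _mm_mfence();
    /* Arriving on line `sense` for barrier k doubles as the exit
     * notification for barrier k-1, which used the twin line, so
     * no separate invalidate of an exit address is needed. */
    _mm_clflush((const void *)signal_line[sense]);
    (void)*signal_line[sense];           /* stall until release    */
    sense ^= 1;                          /* next barrier: the twin */
}
```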

Outline
- Introduction
- Barrier Filter Overview
- Barrier Filter Implementation
- Results
- Summary

Methodology
We used a modified version of SMTSIM.
We performed experiments using 7 different barrier implementations:
- Software: centralized, combining tree
- Hardware: filter barrier (4 variants), dedicated barrier network
We examined performance over a set of parallelizable kernels:
- Livermore loops 2, 3, and 6
- EEMBC kernels: autocorrelation and Viterbi
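The slides don't show the software implementations, but the "centralized" baseline is conventionally a sense-reversing counter barrier along these lines (a sketch using C11 atomics, not the paper's code); the combining-tree variant layers the same idea hierarchically to reduce contention on the shared counter.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int count;   /* threads arrived this episode */
    atomic_int sense;   /* flips each barrier episode   */
    int nthreads;
} sw_barrier_t;

void sw_barrier_wait(sw_barrier_t *b)
{
    int my_sense = !atomic_load(&b->sense);
    if (atomic_fetch_add(&b->count, 1) + 1 == b->nthreads) {
        /* Last arrival: reset the counter, then flip the sense
         * to release every spinning thread. */
        atomic_store(&b->count, 0);
        atomic_store(&b->sense, my_sense);
    } else {
        while (atomic_load(&b->sense) != my_sense)
            ;  /* spin on the shared flag: each release invalidates
                  the line in every waiter's cache, which is part of
                  why software barriers are expensive on CMPs */
    }
}
```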

Benchmark Selection
Barriers are seen as heavyweight operations and are infrequently executed in most workloads.
Example: Ocean from SPLASH-2 spends 4% of its time in barriers on a simulated 16-core CMP.
Barriers will be used more frequently on CMPs.

Latency Micro-benchmark
Average time of barrier execution in isolation, with #threads = #cores.

Latency Micro-benchmark
Notable effects due to bus saturation; the barrier filter scales well up until this point.

Latency Micro-benchmark
Filter latency is closer to the dedicated network's than to software's, and still shows significant speedup versus software.

Autocorrelation Kernel
On a 16-core CMP:
- 7.98x speedup for the dedicated network
- 7.31x speedup for the best filter barrier
- 3.86x speedup for the best software barrier
There are significant speedup opportunities with fast barriers.

Viterbi Kernel
Not all applications can scale to an arbitrary number of cores: Viterbi performance is higher on 4 or 8 cores than on 16 cores.
(Figure: Viterbi on a 4-core CMP)

Livermore Loops
Serial/parallel crossover: the hardware barriers reach the crossover at a 4x smaller problem size.
(Figure: Livermore Loop 3 on a 16-core CMP)

Livermore Loops
Parallelism is reduced to avoid false sharing.
(Figure: Livermore Loop 3 on a 16-core CMP)

Result Summary
Fine-grained parallelism on CMPs: significant speedups are possible.
- 1.2x - 6.4x on 256-element vectors
- 3x - 12.2x on 1024-element vectors
- False sharing affects problem size/scaling
Faster barriers allow greater parallelism:
- Hardware approaches extend the range of worthwhile problem sizes
Barrier filters give competitive performance:
- 77% - 95% of dedicated network performance

Conclusions
Fast barriers can organize fine-grained data parallelism on a CMP.
CMPs can act in a vector-processor role, exploiting inner-loop parallelism.
Barrier filters are a CMP-oriented fast barrier.

(FIN) Questions?