Reactive Spin-locks: A Self-tuning Approach
Phuong Hoai Ha, Marina Papatriantafilou, Philippas Tsigas
I-SPAN ’05, Las Vegas, Dec. 7th–9th, 2005



Slide 2: Outline
- Mutual exclusion
  - Overhead
  - Available reactive spin-locks
- New reactive spin-lock
  - Model
  - Algorithm
  - Evaluation
- Conclusions

Slide 3: Mutual exclusion
Performance goals:
- Low latency
- Low contention
- …
[Figure labels: Entry section, Critical section, Exit section, Noncritical section; Lock released, Requests issued, Arbitration, Lock sent to winner]

Slide 4: Spin-lock categories
Arbitrating locks:
- Determine in advance who the next lock-holder is, e.g. ticket locks, queue locks.
- Advantages: prevent processors from causing bursts of network traffic and high contention on the lock.
Non-arbitrating locks:
- E.g. test-and-set locks.
- Advantages: exploit locality/cache; tolerate failures in the entry section.
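To make the two categories concrete, here is a minimal C11 sketch of one textbook lock from each: a ticket lock (arbitrating) and a test-and-set lock (non-arbitrating). The type and function names are illustrative assumptions, not taken from the slides, and neither lock uses any backoff.

```c
#include <assert.h>
#include <stdatomic.h>

/* Arbitrating: tickets fix the order of lock-holders in advance. */
typedef struct {
    atomic_uint next_ticket;  /* next ticket to hand out */
    atomic_uint now_serving;  /* ticket currently allowed into the CS */
} ticket_lock;

void ticket_acquire(ticket_lock *l) {
    unsigned my = atomic_fetch_add(&l->next_ticket, 1u);  /* take a ticket */
    while (atomic_load(&l->now_serving) != my) { /* spin until our turn */ }
}

void ticket_release(ticket_lock *l) {
    /* Releasing a lock nobody holds would be a caller bug. */
    assert(atomic_load(&l->now_serving) != atomic_load(&l->next_ticket));
    atomic_fetch_add(&l->now_serving, 1u);  /* pass the lock to the next ticket */
}

/* Non-arbitrating: whoever wins the atomic test-and-set gets the lock. */
typedef struct { atomic_flag held; } tas_lock;

void tas_acquire(tas_lock *l) {
    while (atomic_flag_test_and_set(&l->held)) { /* spin: lock was already set */ }
}

void tas_release(tas_lock *l) {
    atomic_flag_clear(&l->held);
}
```

In the ticket lock the successor is known before the release, so waiters do not race for the lock word. In the test-and-set lock every waiter races on release, which is what produces the traffic bursts mentioned above, but a processor that just released the lock can also reacquire it cheaply from its own cache.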

Slide 5: Arbitrating vs. non-arbitrating locks
[Figure: diagrams labelled "Interconnection Network"; not preserved in the transcript]

Slide 6: Available reactive spin-lock algorithms
Drawbacks: their reactive schemes rely on
- Fixed experimental thresholds
  - The thresholds frequently become inappropriate in variable and unpredictable environments like multiprogramming systems.
  - E.g. ticket locks with proportional backoff, test-and-test-and-set locks with exponential backoff
- Known probability distributions of some inputs
  - The assumption is usually not feasible.
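The dependence on fixed thresholds can be seen in a conventional test-and-test-and-set lock with exponential backoff. The sketch below is a generic textbook version in C11, not code from the presentation; `BACKOFF_MIN`, `BACKOFF_MAX` and the helper names are hypothetical, and the two constants stand in for the hand-tuned values that the slide criticises.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical hand-tuned thresholds: good values depend on machine and load. */
#define BACKOFF_MIN 16u
#define BACKOFF_MAX 4096u

typedef struct { atomic_bool held; } ttas_lock;

/* Burn roughly `iters` loop iterations. */
void spin_delay(unsigned iters) {
    for (volatile unsigned i = 0; i < iters; ++i) { }
}

/* Double the delay, capped at BACKOFF_MAX. */
unsigned next_backoff(unsigned delay) {
    return delay >= BACKOFF_MAX / 2u ? BACKOFF_MAX : delay * 2u;
}

void ttas_acquire(ttas_lock *l) {
    unsigned delay = BACKOFF_MIN;
    for (;;) {
        while (atomic_load(&l->held)) { /* test: spin on the cached copy */ }
        if (!atomic_exchange(&l->held, true))
            return;                     /* test-and-set succeeded */
        spin_delay(delay);              /* lost the race: back off */
        delay = next_backoff(delay);
    }
}

void ttas_release(ttas_lock *l) {
    assert(atomic_load(&l->held));      /* must be held when released */
    atomic_store(&l->held, false);
}
```

Nothing in this lock observes the actual load: if the number of contending threads changes, the same two constants are still used.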

Slide 7: New reactive spin-lock algorithm
Idea:
- A non-arbitrating lock with an adaptive, sensible backoff delay.
Advantages:
- Its reactive scheme is self-tuning: neither experimentally tuned thresholds nor probability distributions of inputs are needed.
- It combines the advantages of both arbitrating and non-arbitrating spin-lock categories: it can exploit locality as well as reduce contention on the lock.

Slide 8: Find sensible backoff delay
Need to optimize the trade-off between:
- Latency: the interval between a lock release and the subsequent lock acquisition
- Contention on the lock
This is an online problem.
Load on the lock → delay = ?

Slide 9: Reactive scheme
- Increase the delay only when the load on the lock is the highest so far.
- When increasing the delay, increase it just enough to keep the competitive ratio $c = P - \frac{P-1}{P^{1/(P-1)}}$.
Bounds for the load on the lock: $1 \le l_t \le P$
During a load-rising phase: (formula not preserved in the transcript)
Similar for a load-dropping phase.
In each load-rising/load-dropping phase, the reactive scheme is competitive with competitive ratio $c = \Theta(\ln P)$.
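The logarithmic growth of the ratio can be checked directly. The derivation below assumes the garbled expression on this slide reads $c = P - (P-1)/P^{1/(P-1)}$; it is a consistency check on that reading, not a step taken from the presentation.

```latex
\begin{align*}
c &= P - \frac{P-1}{P^{1/(P-1)}}
   = P - (P-1)\,e^{-\frac{\ln P}{P-1}} \\
  &= P - (P-1)\left(1 - \frac{\ln P}{P-1}
       + O\!\left(\frac{\ln^2 P}{(P-1)^2}\right)\right) \\
  &= 1 + \ln P + O\!\left(\frac{\ln^2 P}{P}\right)
\end{align*}
```

So under this reading $c$ grows like $\ln P$. For the smallest interesting case, $P = 2$, the exact value is $c = 2 - 1/2 = 1.5$.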

Slide 10: Algorithm
[Figure: diagrams labelled "Interconnection Network"; not preserved in the transcript]
The algorithm guarantees mutual exclusion and non-livelock. Its space complexity is log(P).

Slide 11: Evaluation
Benchmarks:
- Spark98 kernel: lmv
- SPLASH-2 suite: Volrend and Radiosity
Representatives:
- Arbitrating: ticket lock with (tuned) proportional backoff
- Non-arbitrating: test-and-test-and-set lock with (tuned) exponential backoff
System:
- A ccNUMA SGI Origin2000 with … MHz MIPS R10000 processors.

Slide 12: Experimental results

Slide 13: Experimental results (2)

Slide 14: Experimental results (3)

Slide 15: Conclusions
We have designed and implemented a new reactive spin-lock:
- It is self-tuning.
- It combines the advantages of both arbitrating and non-arbitrating locks.
- Its reactive scheme is competitive with $c = \Theta(\ln P)$.
→ The lock automatically adjusts its backoff delay sensibly, according to the load on the lock as well as the application.

Thanks for your attention!

Slide 17: Estimate delay bases
Fairness:
- A fair lock helps a parallel application gain performance, since the application threads can execute their non-critical sections in parallel.
- Definition: (formula not preserved in the transcript), where $n_i$ is the number of lock acquisitions of a processor in $\Delta t$ and $N$ is the number of processors.
Heuristic to estimate $base_l$: (formula not preserved in the transcript), where $a$, $b$ are system-documented constants and DoCS is the delay outside the critical section.

Slide 18: NUMA
Another parameter that makes the problem harder is NUMA:
- Latency varies considerably.
- E.g. ccNUMA SGI Origin2000

Slide 19: Model: an online problem
A sequence of loads on the lock unfolds on the fly. When observing a load, the algorithm must decide by how much its current backoff delay should be lengthened.
- If it increases the delay too soon, it wastes time on a long delay when the lock becomes available.
- If it does not increase the delay in time, it causes high contention on the lock.
→ It must increase the delay reasonably at high loads.
→ Goal: maximize $\sum_t \Delta delay_t \cdot load_t$, where $\sum_t \Delta delay_t \le P$.
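The objective can be made tangible with a small evaluator. The sketch below assumes the garbled goal on this slide means "maximize the sum over time of (delay increment × load), with the total increment bounded by P"; the function name and the `-1.0` error convention are hypothetical choices for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Score a sequence of delay increments against the observed loads.
 * Returns sum_t delta[t] * load[t], or -1.0 if the increments exceed
 * the total budget P (assumed constraint: sum_t delta[t] <= P). */
double weighted_delay(const double *delta, const double *load,
                      size_t n, double P) {
    assert(delta != NULL && load != NULL);
    double total = 0.0, objective = 0.0;
    for (size_t t = 0; t < n; ++t) {
        total += delta[t];                 /* budget used so far */
        objective += delta[t] * load[t];   /* increment weighted by load */
    }
    return total <= P ? objective : -1.0;
}
```

With loads 1, 3, 2, 4 and P = 4, spending the whole budget at the first (lowest) load scores 4, spreading it evenly scores 10, and spending it all at the peak scores 16. An online algorithm does not know in advance where the peak is, which is why the scheme is analysed by its competitive ratio.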

Slide 20: Algorithm
The algorithm guarantees mutual exclusion and non-livelock. Its space complexity is log(P).

LockType: (definition not preserved in the transcript)
Initial delay = L.counter × base_l

Acquire(Lock pL)
    L = FAA(pL.L, )
    if L.lock then
        delay = ComputeDelay(L)
        cond = 
        do
            sleep(delay)
            L = pL.L
            if L.lock then
                delay = ComputeDelay(L)
                continue
            cond = FAA(pL.L, )
        while cond.lock

Release(Lock pL)
    do
        L = pL.L
    while not CAS(pL.L, L, )
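The operands of the FAA and CAS calls did not survive in the transcript, so the authors' exact algorithm cannot be reproduced here. As a rough illustration of the general idea only (a non-arbitrating lock whose delay is a waiter counter times a base), here is a C11 sketch. The word layout, the use of fetch-or instead of the original FAA/CAS operands, and all names are assumptions; this is not the published algorithm and it does not implement its competitive delay-increase rule.

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumed layout: bit 0 is the lock bit, the remaining bits count waiters. */
#define HELD       1u
#define ONE_WAITER 2u

typedef struct { atomic_uint word; } counted_lock;

/* Delay grows with the number of waiters seen in the lock word. */
unsigned compute_delay(unsigned word, unsigned base) {
    return (word >> 1) * base;
}

/* Burn roughly `iters` loop iterations. */
void pause_for(unsigned iters) {
    for (volatile unsigned i = 0; i < iters; ++i) { }
}

void counted_acquire(counted_lock *l, unsigned base) {
    atomic_fetch_add(&l->word, ONE_WAITER);        /* announce ourselves */
    for (;;) {
        unsigned w = atomic_load(&l->word);
        /* Try the lock bit only when it looks free; fetch_or leaves the
         * counter bits untouched and returns the previous word. */
        if (!(w & HELD) && !(atomic_fetch_or(&l->word, HELD) & HELD))
            break;
        pause_for(compute_delay(w, base));         /* back off by load */
    }
    atomic_fetch_sub(&l->word, ONE_WAITER);        /* no longer waiting */
}

void counted_release(counted_lock *l) {
    assert(atomic_load(&l->word) & HELD);          /* must be held */
    atomic_fetch_and(&l->word, ~HELD);             /* clear only the lock bit */
}
```

Keeping the lock bit and the counter in one word means a waiter learns the current load with the same read it uses to test the lock, so estimating the load costs no extra memory traffic.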