Cristiana Amza University of Toronto. Once Upon a Time … … locks were painful for dynamic and complex applications …. e.g., Massively Multiplayer Games.

Presentation transcript:

Cristiana Amza University of Toronto

Once Upon a Time … locks were painful for dynamic and complex applications, e.g., Massively Multiplayer Games

Massively Multiplayer Games: support many concurrent players and a low update interval to players

So, game developers said … “Forget locks! We’ll use our secret sauce!”

State-of-the-art in Game Code. Ad-hoc parallelization: segments/shards, e.g., World of Warcraft, Ultima Online. Sequential code with admission control, e.g., Quake III.

Ad-hoc Partitioning (segments): countries, rooms

Artificial Admission Control: gateways, e.g., airports, doors

But, gamers said … ”We want to interact, and we hate lag !”

Problem with State-of-the-art: Flocking. Players move to one area, e.g., for quests, and overload the server hosting the hotspot.

So I said … Forget painful locks ! Transactional Memory will make game developers and players happy ! Story endorsed by Intel (fall of 2006).

Our Goals Parallelize server code into transactions Easy to thread any game Dynamic load balance of tx on any platform e.g., clusters, multi-cores, mobile devices … Beats locks any day !

Ideal solution: Contiguous world Seamless partition Players can “see” across partition boundaries Players can smoothly transfer Regardless of game map

Challenge: On Multi-core Inter-thread conflicts Mostly at the boundary

Roadmap The game Parallelization Using TM Compiler code transformations for TM Runtime TM design choices Dynamic load balancing of tx in game

Game Benchmark (SimMud). Players can move and interact. Interactions: player-object, player-player. Objects: food. Terrain is fixed and restricts movement.

Game Benchmark (SimMud). Actions: move, eat, fight. Quest: flocking of players to a spot on the game map.

Flocking in SimMud (figure: servers S1, S2, S3, S4 and the quest location)

Parallelization of Server Code (figure labels: Select, Rx, Process Requests, Form & Send Replies, Tx; Read-only phase, Read-Write phase)

Example: “Move” Request
Move() {
    region1->removePlayer( p );
    region2->addPlayer( p );
}

Parallelize Move Request. Insert “atomic” keyword in code; the compiler makes it a transaction. Ex:
#pragma omp critical / __tm_atomic
{
    region1->removePlayer( p );
    region2->addPlayer( p );
}

Ex: SimMud Data Structure
struct Region {
    int x, y;
    int width, height;
    set_t* players;
    set_t* objects;
};

Example Code for Action Move
void movePlayer( Player* p, int new_x, int new_y )
{
    Region* r_old = getRegion( p->x, p->y );
    Region* r_new = getRegion( new_x, new_y );
    if( isVacant_position( r_new, new_x, new_y ) )
    {
        set_remove( r_old->players, p );
        set_insert( r_new->players, p );
        p->x = new_x; p->y = new_y;
    }
}

Manual Transformations (Locks)
void movePlayer( Player* p, int new_x, int new_y )
{
    lock_player( p );
    Region* r_old = getRegion( p->x, p->y );
    Region* r_new = getRegion( new_x, new_y );
    lock_regions( r_old, r_new );
    if( isVacant_position( r_new, new_x, new_y ) )
    {
        set_remove( r_old->players, p );
        set_insert( r_new->players, p );
        p->x = new_x; p->y = new_y;
    }
    unlock_regions( r_old, r_new );
    unlock_player( p );
}
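
The slide leaves lock_regions() unspecified. One hazard it has to handle is two players moving in opposite directions between the same pair of regions. The following is a minimal C++ sketch of one way to take both region locks safely, assuming a per-region std::mutex. The Region layout, the function name movePlayerLocked, and the omission of the vacancy check and the per-player lock are all assumptions made for illustration, not SimMud code.

```cpp
#include <mutex>
#include <set>

struct Player { int x = 0, y = 0; };

struct Region {
    std::mutex lock;            // hypothetical per-region lock
    std::set<Player*> players;
};

// Move p from r_old to r_new while holding both region locks.
// The vacancy check and the per-player lock from the slide are omitted.
void movePlayerLocked(Player* p, Region& r_old, Region& r_new,
                      int new_x, int new_y) {
    if (&r_old == &r_new) {
        // Same region: lock once; locking a std::mutex twice is undefined.
        std::lock_guard<std::mutex> g(r_old.lock);
        p->x = new_x; p->y = new_y;
        return;
    }
    // std::lock acquires both mutexes with a deadlock-avoidance algorithm,
    // so two opposite moves cannot each hold one lock and wait for the other.
    std::lock(r_old.lock, r_new.lock);
    std::lock_guard<std::mutex> g_old(r_old.lock, std::adopt_lock);
    std::lock_guard<std::mutex> g_new(r_new.lock, std::adopt_lock);
    r_old.players.erase(p);
    r_new.players.insert(p);
    p->x = new_x; p->y = new_y;
}
```

Even in this small form, the programmer has to reason about lock identity and acquisition order, which is the burden the talk argues transactions remove.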

Manual Transformations (TM)
void movePlayer( Player* p, int new_x, int new_y )
{
    #pragma omp critical
    {
        Region* r_old = getRegion( p->x, p->y );
        Region* r_new = getRegion( new_x, new_y );
        if( isVacant_position( r_new, new_x, new_y ) )
        {
            set_remove( r_old->players, p );
            set_insert( r_new->players, p );
            p->x = new_x; p->y = new_y;
        }
    }
}

My Story: TM will make game developers and players happy! So far, the developers should be!

It Gets Worse for Locks. A move may impact objects within a bounding box (short-range or long-range). Locking all impacted objects requires searching for them first. (Figure: top view of world with short-range and long-range bounding boxes and objects.)

Area Node Tree (e.g., Quake III). Each region corresponds to a leaf. (Figure: top view of world.)

Area Node Tree (e.g., Quake III). Each region corresponds to a leaf. Lock all leaf nodes in the bounding box atomically. (Figure: top view of world, overlapping regions.)

Area Node Tree: Even Worse! Objects are linked to leaf nodes. If an object crosses a leaf boundary, it is linked to the parent node. (Figure: top view of world with non-overlapping regions, region leaves, object lists, and objects crossing a boundary.)

Area Node Tree: Even Worse! Need to lock parent nodes. False sharing. The whole tree may be locked. (Figure: top view of world with non-overlapping regions, region leaves, object lists, and objects crossing a boundary.)

My Story: TM will make game developers and players happy! Lock down a whole box/tree, vs. just read/write what you need in TM. Players should be happy too!

Compiler/Runtime TM Support. Compiler: automatic source transformations to tx. Runtime: track accesses, resolve conflicts between transactions, adapt to application pattern.

Manual Transformations (TM)
void movePlayer( Player* p, int new_x, int new_y )
{
    #pragma omp critical
    {
        Region* r_old = getRegion( p->x, p->y );
        Region* r_new = getRegion( new_x, new_y );
        if( isVacant_position( r_new, new_x, new_y ) )
        {
            set_remove( r_old->players, p );
            set_insert( r_new->players, p );
            p->x = new_x; p->y = new_y;
        }
    }
}

Automatic Transformations (TM)
void tm_movePlayer( tm_Player* p, int new_x, int new_y )
{
    Begin_transaction;
    tm_Region* r_old = tm_getRegion( p->x, p->y );
    tm_Region* r_new = tm_getRegion( new_x, new_y );
    if( tm_isVacant_position( r_new, new_x, new_y ) )
    {
        tm_set_remove( r_old->players, p );
        tm_set_insert( r_new->players, p );
        p->x = new_x; p->y = new_y;
    }
    Commit_transaction;
}

Automatic Transformations (TM)
struct tm_Region {
    tm_int x, y;
    tm_int width, height;
    tm_set_t* players;  // recursively re-type
    tm_set_t* objects;  // nested structures
};

Compiler TM code translation. #pragma is translated to transaction begin/end. Variables are re-typed: tm_shared<> or tm_private<>.

TM Runtime (libTM). Access tracking: tm_type<>, with operator overloading for intercepting reads and writes. Access granularity: basic-type level. Conflict detection and resolution: several design choices.
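
To illustrate access tracking through operator overloading, here is a minimal C++ sketch of a wrapper type whose conversion operator intercepts reads and whose assignment operator intercepts writes. The names TrackedVar, TxLog, and currentTx are hypothetical and chosen for this example; this is not libTM's actual tm_shared<> implementation, and it only logs accesses without detecting conflicts.

```cpp
#include <vector>

// Hypothetical per-thread log of the addresses a transaction touched.
struct TxLog {
    std::vector<const void*> reads;
    std::vector<const void*> writes;
};

inline TxLog& currentTx() {
    static thread_local TxLog log;
    return log;
}

// Wraps one basic-type value; every read and write goes through an operator.
template <typename T>
class TrackedVar {
public:
    TrackedVar(T v = T()) : value_(v) {}

    // Read interception: converting to T records a read.
    operator T() const {
        currentTx().reads.push_back(this);
        return value_;
    }

    // Write interception: assigning a T records a write.
    TrackedVar& operator=(const T& v) {
        currentTx().writes.push_back(this);
        value_ = v;
        return *this;
    }

    // Assigning from another TrackedVar is a read of o plus a write of *this.
    TrackedVar& operator=(const TrackedVar& o) {
        return *this = static_cast<T>(o);
    }

private:
    T value_;
};
```

With such a type, a statement like `p_x = new_x;` keeps its original source form while the runtime still sees the write, which is why re-typing variables is enough for the compiler pass.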

TM Conflict Resolution Choices. Pessimistic: reader/writer locks. Read Optimistic: only writer locks. Fully Optimistic: ~no locks. Adaptive.

Pessimistic. A transaction (tx) locks an object before use, waits for locks held by other tx, and releases all locks at the end.

Reader-writer locks, held between BEGIN and END of the transaction. A reader lock excludes writers. A writer lock excludes readers and writers.
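
The pessimistic rules above can be modeled in a few lines of C++. This is a single-threaded sketch of the compatibility rules only: where a real runtime would block and wait for a lock, the hypothetical PessimisticTx::read and write return false. The RWLock and PessimisticTx names are assumptions for illustration, and lock upgrades are not handled.

```cpp
#include <utility>
#include <vector>

// Logical lock state of one shared object (not a thread-safe lock).
struct RWLock {
    int readers = 0;
    bool writer = false;
};

class PessimisticTx {
public:
    // A reader lock is granted unless a writer holds the object.
    bool read(RWLock& l) {
        if (l.writer) return false;          // a real tx would wait here
        ++l.readers;
        held_.push_back(std::make_pair(&l, false));
        return true;
    }

    // A writer lock is granted only if nobody else holds the object.
    bool write(RWLock& l) {
        if (l.writer || l.readers > 0) return false;
        l.writer = true;
        held_.push_back(std::make_pair(&l, true));
        return true;
    }

    // All locks are released together at the end of the transaction.
    void end() {
        for (const auto& h : held_) {
            if (h.second) h.first->writer = false;
            else --h.first->readers;
        }
        held_.clear();
    }

private:
    std::vector<std::pair<RWLock*, bool>> held_;  // (lock, isWriter)
};
```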

Read Optimistic. Writers take locks, readers do not. A write invalidates (aborts) all readers. a) Encounter-time: at the write.
T1: BEGIN_TRANSACTION ... WRITE A ... COMMIT_TRANSACTION
T2: BEGIN_TRANSACTION READ A ... INVALID
T3: BEGIN_TRANSACTION ... READ A ... INVALID

Read Optimistic. Writers take locks, readers do not. A write invalidates (aborts) all readers. b) Commit-time: at commit.
T1: BEGIN_TRANSACTION ... WRITE A ... COMMIT_TRANSACTION
T2: BEGIN_TRANSACTION READ A ... COMMIT_TRANSACTION
T3: BEGIN_TRANSACTION ... READ A ... INVALID

Fully Optimistic. A write invalidates (aborts) all active readers, but multiple writers are supported. Commit-time: at commit.
T1: BEGIN_TRANSACTION ... WRITE A ... COMMIT_TRANSACTION
T2: BEGIN_TRANSACTION WRITE A ... COMMIT_TRANSACTION
T3: BEGIN_TRANSACTION ... READ A ... INVALID

Implementation Details. Meta-data kept with each tm_shared<> variable: lock, visible-readers set.
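
A rough C++ sketch of how a writer lock plus a visible-readers set could drive encounter-time invalidation in the read-optimistic scheme. The types Tx and VarMeta and the functions onRead and onWrite are hypothetical names for this example; the model is single-threaded and ignores the atomicity a real runtime would need when touching the meta-data.

```cpp
#include <set>

struct Tx {
    bool valid = true;   // cleared when another tx's write invalidates us
};

// Meta-data kept next to one shared variable.
struct VarMeta {
    Tx* writer = nullptr;     // writer lock owner, if any
    std::set<Tx*> readers;    // visible readers
};

// Readers take no lock; they only make themselves visible.
void onRead(Tx& t, VarMeta& m) {
    m.readers.insert(&t);
}

// Encounter-time: the write itself takes the writer lock and
// invalidates every other transaction that has read the variable.
bool onWrite(Tx& t, VarMeta& m) {
    if (m.writer != nullptr && m.writer != &t) return false;  // lock is held
    m.writer = &t;
    for (Tx* r : m.readers)
        if (r != &t) r->valid = false;
    m.readers.clear();
    return true;
}
```

In the commit-time variant, the invalidation loop would run when the writer commits rather than at the write, so a reader that commits first survives.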

Implementation Details. Validation of each read. Recoverability: undo-logging or write-buffering. Write-buffering keeps private thread data (needs to be searchable) and is necessary for fully optimistic.

Factors Determining Trade-offs. Conflict type: w-w conflicts favor fully optimistic. Conflict span: a long span causes a domino effect (no progress) for read optimistic.
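
The combination of write-buffering and read validation can be sketched as follows. Note the swapped-in technique: this example validates reads with per-variable version numbers at commit, which is an assumption made to keep the sketch short, not the visible-reader invalidation the slides describe. VersionedInt and OptimisticTx are hypothetical names, and the model is single-threaded.

```cpp
#include <map>

struct VersionedInt {
    int value = 0;
    unsigned version = 0;   // bumped on every committed write
};

class OptimisticTx {
public:
    int read(VersionedInt& v) {
        // The private write buffer must be searched first, so a tx
        // sees its own uncommitted writes.
        auto w = writes_.find(&v);
        if (w != writes_.end()) return w->second;
        reads_.emplace(&v, v.version);   // remember the version first seen
        return v.value;
    }

    // Writes stay in the private buffer until commit, so several
    // transactions can write the same variable concurrently.
    void write(VersionedInt& v, int x) { writes_[&v] = x; }

    // Commit-time validation: fail if anything read has changed since.
    // On failure there is nothing to undo, the buffer is simply dropped.
    bool commit() {
        bool ok = true;
        for (const auto& r : reads_)
            if (r.first->version != r.second) { ok = false; break; }
        if (ok) {
            for (const auto& w : writes_) {
                w.first->value = w.second;
                ++w.first->version;
            }
        }
        reads_.clear();
        writes_.clear();
        return ok;
    }

private:
    std::map<VersionedInt*, unsigned> reads_;
    std::map<VersionedInt*, int> writes_;
};
```

Replaying the fully optimistic slide with this sketch, two writers of A both commit, while a reader that saw the old value of A fails validation.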

Evaluation of Design Trade-offs No. of threads: 4

Roadmap The game Parallelization Using TM Compiler code transformations for TM Runtime TM design choices Dynamic load balancing of tx in game

Parallel Server Phase Types (figure labels: Select, Rx, Process Requests, Form & Send Replies, Tx; Read-only phase, Read-Write phase, Load balancing)

Dynamic Load Management. Region: grid unit. Dynamic load balancing: reassign regions from one server/thread to another.

Conflicts vs Load Management. Locality, fewer conflicts: keep adjacent regions on the same thread. Global reshuffle. Block partition.

Overload due to Quest

Reassign Load & Minimize Conflicts

Locality-Aware Load Balancing. SimMud game map with quest in upper left. Recorded dynamic load balancing.

Dynamic Load-balancing Algorithms. Lightest: shed regions to the lightest loaded thread. Spread: best load spread across all threads. Locality aware: keep nearby regions on the same thread.
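
As a concrete reading of the "Lightest" policy, here is a small C++ sketch that sheds regions from an overloaded thread to whichever thread is currently lightest. The load metric (players per region), the threshold parameter, and the names RegionLoad, threadLoads, and shedToLightest are assumptions for illustration; the slides do not give the algorithm's details.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct RegionLoad {
    int owner;     // thread currently serving this region
    int players;   // assumed load metric: players in the region
};

std::vector<int> threadLoads(const std::vector<RegionLoad>& regions,
                             int nThreads) {
    std::vector<int> load(nThreads, 0);
    for (const auto& r : regions) load[r.owner] += r.players;
    return load;
}

// Reassign regions of `overloaded` to the lightest thread until its load
// is at or below `threshold`. Returns the number of regions moved.
int shedToLightest(std::vector<RegionLoad>& regions, int nThreads,
                   int overloaded, int threshold) {
    assert(overloaded >= 0 && overloaded < nThreads);
    std::vector<int> load = threadLoads(regions, nThreads);
    int moved = 0;
    for (auto& r : regions) {
        if (load[overloaded] <= threshold) break;
        if (r.owner != overloaded) continue;
        int lightest = static_cast<int>(
            std::min_element(load.begin(), load.end()) - load.begin());
        if (lightest == overloaded) break;
        // Skip moves that would only push the overload to another thread.
        if (load[lightest] + r.players > threshold) continue;
        load[overloaded] -= r.players;
        load[lightest] += r.players;
        r.owner = lightest;
        ++moved;
    }
    return moved;
}
```

This policy ignores which regions are adjacent, so it can scatter neighboring regions across threads and raise border conflicts, which is the motivation for the locality-aware variant.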

Locality-aware (Quad-tree). Split task when: load > threshold. Reassign tasks: reduce conflicts. Can approximate!

Task Splitting (figure: tasks A through J, grouped as BCD, AEF, GHIJ)

Task Re-assignment. Assign tasks to reduce conflicts. Keep load < threshold. (Figure: threads T0, T1, T2.)
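
The split rule "split task when load > threshold" can be sketched as a recursive quad-tree subdivision of the region grid. This is an illustrative C++ sketch under stated assumptions: the grid side is a power of two, load is the number of players per grid cell, and the names Task, Grid, loadOf, and splitTasks are hypothetical. The reassignment step that places tasks on threads to reduce conflicts is not shown.

```cpp
#include <vector>

// Players per grid region, indexed as grid[y][x].
using Grid = std::vector<std::vector<int>>;

struct Task {
    int x, y, size;   // square block of regions with top-left corner (x, y)
    int load;         // total players inside the block
};

int loadOf(const Grid& g, int x, int y, int size) {
    int sum = 0;
    for (int j = y; j < y + size; ++j)
        for (int i = x; i < x + size; ++i)
            sum += g[j][i];
    return sum;
}

// Split a block into four quadrants while its load exceeds the threshold.
// Each resulting task is a contiguous block, so nearby regions stay together.
void splitTasks(const Grid& g, int x, int y, int size, int threshold,
                std::vector<Task>& out) {
    int load = loadOf(g, x, y, size);
    if (load <= threshold || size == 1) {
        out.push_back({x, y, size, load});
        return;
    }
    int h = size / 2;
    splitTasks(g, x,     y,     h, threshold, out);
    splitTasks(g, x + h, y,     h, threshold, out);
    splitTasks(g, x,     y + h, h, threshold, out);
    splitTasks(g, x + h, y + h, h, threshold, out);
}
```

With a quest hotspot in the upper-left corner, only that quadrant is subdivided further, while lightly loaded quadrants remain single tasks.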

Dynamic Load-balancing Algorithms. All algorithms implemented on: a cluster (single thread on each node); a multi-core (with multiple threads).

Results on Multi-core. Load balancing algorithms: Static, Lightest, Spread, Locality (Quad-tree). Metrics: number of clients per thread, border conflicts, client update latency.

Thread Load on Multi-core

Border Conflicts on Multi-core

Client update latency on Multi-core

Conclusion. Support for seamless world partitioning. Compiler & runtime parallelization support. Tx much simpler than locks. Locality-aware dynamic load balancing. Can apply in server clusters, P2P mobile environments and multi-cores.

I need your help. “When TM first beat locks” is a good story. I need a more sophisticated game to make the story happen!

Backup Slides

Client Update Latency on Cluster (figure: STATIC vs. LOCALITY, most loaded and least loaded server). All dynamic load balancing algorithms perform similarly.

Number of Player Migrations. Locality aware has the fewest migrations.

Average Execution Time / Request (when App changes access pattern)

Trade-offs. Private thread data: per-thread data copy overhead (-); search private data on read (-); no need to restore data on abort (+); allows multiple concurrent writers (+).

Trade-offs (contd). Private thread data: per-thread data copy overhead (-); search private data on read (-); no need to restore data on abort (+); allows multiple concurrent writers (+). Locks: aborts due to deadlock (-); no other aborts (+).

A WAN distributed server system Quest lasts during sec

TM code translation (cont.) Based on Omni OpenMP compiler

Average Execution Time / Request