
1 Concurrent Data Structures in Architectures with Limited Shared Memory Support
Ivan Walulya, Yiannis Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas
Distributed Computing and Systems, Chalmers University of Technology, Gothenburg, Sweden

2 Concurrent Data Structures
Parallel/concurrent programming:
– Share data among threads/processes in a uniform address space (shared memory)
Inter-process/thread communication and synchronization:
– Both a tool and a goal

3 Concurrent Data Structures: Implementations
Coarse-grained locking:
– Easy, but slow
Fine-grained locking:
– Fast and scalable, but error-prone (e.g. deadlocks)
Non-blocking:
– Built on atomic hardware primitives (e.g. TAS, CAS)
– Good progress guarantees (lock-/wait-freedom)
– Scalable
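As a brief illustration of the non-blocking style the slide refers to (not code from the talk), here is a minimal lock-free counter increment built on CAS, using C11 atomics as a stand-in for the hardware primitive:

```c
#include <stdatomic.h>

/* Minimal CAS-based, lock-free increment: retry until the
 * compare-and-swap installs our update. Illustrative only. */
void lockfree_increment(_Atomic unsigned *counter) {
    unsigned old = atomic_load(counter);
    while (!atomic_compare_exchange_weak(counter, &old, old + 1)) {
        /* 'old' was refreshed with the current value by the failed CAS; retry */
    }
}
```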

4 What’s happening in hardware?
Multi-cores → many-cores:
– The “cache coherency wall” [Kumar et al. 2011]
– A shared address space will not scale
– Universal atomic primitives (CAS, LL/SC) become harder to implement
Shared memory → message passing
(Figure: IA core with local cache, shared and local memory)

5 Networks on Chip (NoC)
– Short distances between cores
– Message passing model support
– Shared memory support
But:
– Cache coherency is eliminated
– Limited support for synchronization primitives
Can we have data structures that are fast, scalable, and offer good progress guarantees?

6 Outline
– Concurrent Data Structures
– Many-core architectures: Intel’s SCC
– Concurrent FIFO Queues
– Evaluation
– Conclusion

7 Single-chip Cloud Computer (SCC)
– Experimental processor by Intel
– 48 independent x86 cores arranged on 24 tiles
– A NoC connects all tiles
– One TestAndSet register per core

8 SCC: Architecture Overview
– Memory controllers connect the chip to private and shared off-chip main memory
– Message Passing Buffer (MPB), 16 KB per tile

9 Programming Challenges on the SCC
Message passing, but:
– The MPB is small for large data transfers
– Data replication is difficult
No universal atomic primitives (e.g. CAS), hence no wait-free implementations of arbitrary objects [Herlihy91]

10 Outline
– Concurrent Data Structures
– Many-core architectures: Intel’s SCC
– Concurrent FIFO Queues
– Evaluation
– Conclusion

11 Concurrent FIFO Queues
Main idea:
– Data are stored in shared off-chip memory
– Message passing is used for communication and coordination
Two design methodologies:
– Lock-based synchronization (2-lock Queue)
– Message-passing-based synchronization (MP-Queue, MP-Acks)

12 2-lock Queue
– Array-based, stored in shared off-chip memory (SHM)
– Head/Tail pointers kept in MPBs
– One lock per pointer [Michael&Scott96]
– TAS-based locks hosted on two of the cores (see the lock and layout sketch below)
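A minimal sketch of the lock and the queue layout described on this slide. All type and field names (tas_lock_t, queue_t, node_t, QUEUE_CAPACITY) are illustrative; on the SCC the locks would be backed by the per-core TAS registers and the pointers would live in the MPB, but here everything is plain C11 for readability:

```c
#include <stdatomic.h>

#define QUEUE_CAPACITY 1024

/* Test-and-set spinlock; atomic_flag stands in for the SCC's TAS register. */
typedef struct { atomic_flag locked; } tas_lock_t;

static void tas_lock_acquire(tas_lock_t *l) {
    while (atomic_flag_test_and_set(&l->locked)) { /* spin on the TAS register */ }
}
static void tas_lock_release(tas_lock_t *l) {
    atomic_flag_clear(&l->locked);
}

typedef struct {
    int data;                          /* payload, stored in off-chip SHM */
    int dirty;                         /* set once the payload has been written */
} node_t;

typedef struct {
    unsigned head, tail;               /* kept in the MPB in the real design */
    tas_lock_t head_lock, tail_lock;   /* one lock per pointer, on two cores */
    node_t nodes[QUEUE_CAPACITY];      /* array of nodes in SHM */
} queue_t;
```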

13 2-lock Queue: “Traditional” Enqueue Algorithm
– Acquire the tail lock
– Read and update the Tail pointer (MPB)
– Add the data (SHM)
– Release the lock
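A sketch of these steps, using the hypothetical queue_t from the previous block. Note that the slow write to off-chip SHM happens while the tail lock is still held (empty/full checks omitted):

```c
/* "Traditional" enqueue: payload written to SHM inside the critical section. */
void enqueue_traditional(queue_t *q, int item) {
    tas_lock_acquire(&q->tail_lock);          /* acquire lock */
    unsigned slot = q->tail;                  /* read Tail pointer (MPB) */
    q->tail = (slot + 1) % QUEUE_CAPACITY;    /* update Tail pointer (MPB) */
    q->nodes[slot].data = item;               /* add data (SHM), still under the lock */
    tas_lock_release(&q->tail_lock);          /* release lock */
}
```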

14 2-lock Queue: Optimized Enqueue Algorithm
– Acquire the tail lock
– Read and update the Tail pointer (MPB)
– Release the lock
– Add the data to the node in SHM
– Set the node’s memory flag to dirty
Why the flag? There is no cache coherency, so the dirty flag is what tells readers the data have actually been written.
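A sketch of the optimized variant with the same hypothetical queue_t: the slot is reserved under the lock, but the slow SHM write happens outside the critical section and is published through the dirty flag:

```c
/* Optimized enqueue: the lock protects only the Tail pointer update. */
void enqueue_optimized(queue_t *q, int item) {
    tas_lock_acquire(&q->tail_lock);          /* acquire lock */
    unsigned slot = q->tail;                  /* read Tail pointer (MPB) */
    q->tail = (slot + 1) % QUEUE_CAPACITY;    /* update Tail pointer (MPB) */
    tas_lock_release(&q->tail_lock);          /* release lock early */

    q->nodes[slot].data  = item;              /* add data to the node in SHM */
    q->nodes[slot].dirty = 1;                 /* set the memory flag to dirty */
}
```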

15 2-lock Queue: Dequeue Algorithm
– Acquire the head lock
– Read and update the Head pointer
– Release the lock
– Check the node’s flag
– Read the node data
What about progress? A dequeuer may have to wait on a node whose enqueuer has not yet flagged it.
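A matching dequeue sketch (same hypothetical queue_t; the empty check is simplified). The dequeuer moves the Head pointer under its own lock, then waits on the node’s dirty flag before reading:

```c
/* Dequeue for the optimized 2-lock queue; returns 0 if the queue looks empty. */
int dequeue(queue_t *q, int *out) {
    tas_lock_acquire(&q->head_lock);          /* acquire lock */
    if (q->head == q->tail) {                 /* queue appears empty (simplified) */
        tas_lock_release(&q->head_lock);
        return 0;
    }
    unsigned slot = q->head;                  /* read Head pointer */
    q->head = (slot + 1) % QUEUE_CAPACITY;    /* update Head pointer */
    tas_lock_release(&q->head_lock);          /* release lock */

    while (!q->nodes[slot].dirty) { }         /* check flag: wait for the enqueuer */
    *out = q->nodes[slot].data;               /* read node data from SHM */
    q->nodes[slot].dirty = 0;                 /* clear the flag so the slot can be reused */
    return 1;
}
```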

16 2-lock Queue: Implementation
(Figure: Head/Tail pointers in the MPB, data nodes in off-chip SHM)
Locks? On which tile(s) should they be placed?

17 Message-Passing-based Queue
– Data nodes in SHM
– Access is coordinated by a server core that keeps the Head/Tail pointers
– Enqueuers/dequeuers request access through dedicated slots in the MPB
– Successfully enqueued data are flagged with a dirty bit
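A sketch of the server side of this design. All names are illustrative: request_t models a client’s dedicated MPB slot, and mpb_request(core) would map to that slot’s address on the SCC; clients write requests, the server answers with an SHM node index:

```c
#define NUM_CORES      48
#define QUEUE_CAPACITY 1024

typedef enum { REQ_NONE, REQ_ENQ, REQ_DEQ, REQ_DONE } req_type_t;

typedef struct {
    volatile req_type_t type;   /* written by the client, answered by the server */
    volatile unsigned   slot;   /* SHM node index handed back by the server */
} request_t;

extern request_t *mpb_request(int core);   /* hypothetical: this core's MPB slot */

void server_loop(unsigned *head, unsigned *tail) {
    for (;;) {
        for (int core = 0; core < NUM_CORES; core++) {
            request_t *req = mpb_request(core);
            if (req->type == REQ_ENQ) {              /* hand out the next tail node */
                req->slot = *tail;
                *tail = (*tail + 1) % QUEUE_CAPACITY;
                req->type = REQ_DONE;                /* client then writes data + dirty flag */
            } else if (req->type == REQ_DEQ) {       /* hand out the head node */
                req->slot = *head;
                *head = (*head + 1) % QUEUE_CAPACITY;
                req->type = REQ_DONE;                /* client spins on that node's flag */
            }
        }
    }
}
```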

18 MP-Queue
(Figure: an enqueuer is handed the TAIL slot and then adds its data; a dequeuer is handed the HEAD slot and spins on that node’s flag)
What if the enqueue fails and the node is never flagged? “Pairwise blocking”: only the one dequeue waiting on that node blocks.

19 Adding Acknowledgements
– No more flags: the enqueuer sends an ACK when it is done
– The server maintains a private queue of pointers in SHM
– On ACK: the server adds the data location to its private queue
– On Dequeue: the server returns only ACKed locations
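A sketch of the MP-Acks server logic. The extended request type and the ACK fifo are illustrative additions to the previous sketch; the point is that dequeuers are only ever given ACKed locations, so they never spin on an unwritten node:

```c
typedef enum { AREQ_NONE, AREQ_ENQ, AREQ_ACK, AREQ_DEQ, AREQ_DONE, AREQ_EMPTY } areq_type_t;

typedef struct {
    volatile areq_type_t type;
    volatile unsigned    slot;
} areq_t;

typedef struct {                  /* server-private FIFO of ACKed slots, kept in SHM */
    unsigned items[1024];
    unsigned head, tail;
} ack_fifo_t;

static void server_handle(areq_t *req, unsigned *tail, ack_fifo_t *acked) {
    switch (req->type) {
    case AREQ_ENQ:                                /* reserve a fresh SHM node */
        req->slot = *tail;
        *tail = (*tail + 1) % 1024;
        req->type = AREQ_DONE;
        break;
    case AREQ_ACK:                                /* the enqueuer finished writing req->slot */
        acked->items[acked->tail] = req->slot;
        acked->tail = (acked->tail + 1) % 1024;
        req->type = AREQ_DONE;
        break;
    case AREQ_DEQ:                                /* return only ACKed locations */
        if (acked->head != acked->tail) {
            req->slot = acked->items[acked->head];
            acked->head = (acked->head + 1) % 1024;
            req->type = AREQ_DONE;
        } else {
            req->type = AREQ_EMPTY;               /* nothing fully enqueued yet */
        }
        break;
    default:
        break;
    }
}
```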

20 MP-Acks
(Figure: an enqueuer is handed the TAIL slot, writes its data, then ACKs; a dequeuer is handed an already-ACKed HEAD location)
No blocking between enqueues and dequeues.

21 Outline
– Concurrent Data Structures
– Many-core architectures: Intel’s SCC
– Concurrent FIFO Queues
– Evaluation
– Conclusion

22 Evaluation
Benchmark:
– Each core performs Enq/Deq operations at random
– High/Low contention
Questions:
– Performance? Scalability?
– Is it the same for all cores?

23 Measures
(Equations on the original slide: operations completed by core i; average operations per core)
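The exact formulas are not recoverable from this transcript. As a hedged reconstruction, assuming the slide compares each core’s operation count against the average over all n cores, the named quantities (and one plausible fairness ratio built from them) would be:

```latex
% Hedged reconstruction of the quantities named on the slide; the exact
% fairness definition used in the talk is not recoverable from the transcript.
\[
  o_i = \text{operations completed by core } i, \qquad
  \bar{o} = \frac{1}{n} \sum_{i=1}^{n} o_i \quad \text{(average operations per core)}
\]
\[
  \text{one plausible fairness ratio:}\quad
  \mathrm{fairness} = \frac{\min_{i} o_i}{\bar{o}}
\]
```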

24 Throughput – High Contention

25 Fairness – High Contention

26 Throughput vs. Lock Location

27 Throughput vs. Lock Location

28 Conclusion
Lock-based queue:
– High throughput
– Less fair
– Sensitive to lock locations and NoC performance
MP-based queues:
– Lower throughput
– Fairer
– Better liveness properties
– Promising scalability

29 Thank you! ivanw@chalmers.se ioaniko@chalmers.se

30 BACKUP SLIDES

31 Experimental Setup
– 533 MHz cores, 800 MHz mesh, 800 MHz DDR3
– Randomized Enq/Deq operations, high/low contention
– One thread per core
– 600 ms per execution, averaged over 12 runs

32 Concurrent FIFO Queues
Typical 2-lock queue [Michael&Scott96]

