DATA STRUCTURES OPTIMISATION FOR MANY-CORE SYSTEMS Matthew Freeman | Supervisor: Maciej Golebiewski CSIRO Vacation Scholar Program 2013-14.

Presentation transcript:

The Multi-core Age
Mobile Phone: 2-4 cores | PC: 4-16 cores | Intel Xeon Phi: 61 cores | CSIRO 'Bragg' Compute Cluster: 2048 cores

Programming for multi-cores
Divide the problem into machine instructions and spread their execution across the available CPU cores (Core 1 through Core 4).

Amdahl's Law
The maximum speedup depends on the percentage of the problem you can run in parallel. Relative to a single-core processor (1x speed): 50% parallel gives at most a 2x speedup, 75% gives 4x, 90% gives 10x, and 95% gives 20x.
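Those ceilings follow directly from Amdahl's formula: with parallel fraction p and n cores, the speedup is

```latex
S(n) = \frac{1}{(1-p) + \frac{p}{n}},
\qquad
S_{\max} = \lim_{n \to \infty} S(n) = \frac{1}{1-p}
```

For example, p = 0.95 gives S_max = 1/0.05 = 20, which is the 20x figure quoted on the slide.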

Data structures
Memory (data) is still a shared resource: a single-core computer has one CPU core accessing memory, but in a 4-core computer all four cores access the same memory.

Linked-list (Stack) Data Structure
A "node" holds data plus a link to the next data point. TOP marks the start of the structure; the link of the last node is EMPTY.

Add new item (Push)
We want to add a chunk of data (Data B) to a structure that currently holds Data A, with TOP pointing at Data A.

Add new item (Push)
Steps for new Data B: 1) Find the start of the structure (TOP).

Add new item (Push)
2) Link Data B into the structure (point its link at the current TOP).

Add new item (Push)
3) Update TOP so it points at Data B.

Resulting structure
Like stacking dinner plates: you only need to keep track of where TOP is to access the rest of the structure.

What happens in multi-core systems?
Two threads try to operate on the stack structure: Thread 1 starts a push at time T, Thread 2 at time T + 1 nanosecond. Because each of the three steps takes time to complete, the two operations can overlap and errors occur.

What happens in multi-core systems?
This causes interleaving of the steps:
Thread 1 reads TOP (step 1)
Thread 2 reads TOP (step 1)
Thread 1 sets the next pointer (step 2)
Thread 2 sets the next pointer (step 2)
Thread 1 updates TOP (step 3)
Thread 2 updates TOP (step 3)

Stack failure
Thread 1's Data B and Thread 2's Data C both got linked to Data A, but TOP ends up pointing only at Data C. Data B is lost forever because it is not linked from TOP any more.

How do we fix this?
Use "data locks" to protect the three steps: one thread at a time is granted access to the stack, completes its operation, and releases the lock. This is the standard approach for multithreaded structures.

Locks
+ Easy to use: the fix adds just two lines of code (get lock; steps 1, 2, 3; release lock).
- Slow: only one thread at a time can hold the lock, so the protected section becomes sequential code, exactly the code that cannot run in parallel under Amdahl's Law. Analogy: merging highway traffic into a single lane.

Lock-free
A newer method: the lock-free data structure. A special low-level instruction, called Compare-Exchange, lets the critical update of the three steps complete as a single atomic computer instruction, removing the need for locks.

Lock-free
Downside: writing lock-free code is difficult (hence the project). The Compare-Exchange operation forms the base for writing lock-free code. The project implements specifications taken from research papers.

Lock-free: results
Implemented a range of lock-free optimisations for the stack using open coding standards (C++, OpenMP). Benchmarked on an Intel Xeon Phi 61-core processor. The lock-free structure performed about 2x better than the locked version for pure stack operations.

Summary
Amdahl's Law shows that it is important to optimise the sequential sections of code. Shared data structures are often sequential bottlenecks, and implementing lock-free data structures reduced this bottleneck.