Presentation transcript:

Revisiting a slide from the syllabus: CS 525 will cover
– Parallel and distributed computing architectures
  – Shared memory processors
  – Distributed memory processors
  – Multi-core processors
– Parallel programming
  – Shared memory programming (OpenMP)
  – Distributed memory programming (MPI)
  – Thread-based programming
– Parallel algorithms
  – Sorting
  – Matrix-vector multiplication
  – Graph algorithms
– Applications
  – Computational science and engineering
  – High-performance computing
This lecture hopes to touch on each of the highlighted aspects via one running example.

Intel Nehalem: an example of a multi-core shared-memory architecture
A few of the features:
– 2 sockets, 4 cores per socket, 2 hyper-threads per core → 16 threads in total
– Processor speed: 2.5 GHz
– 24 GB total memory
– Cache: L1 (32 KB, on-core), L2 (2x6 MB, on-core), L3 (8 MB, shared)
– Memory: virtually globally shared (from the programmer's point of view)
– Multithreading: simultaneous (multiple instructions from ready threads may be executed in a given cycle)
(A block diagram accompanied this slide.)
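As an illustrative aside (not from the original slides), a program can ask the runtime how many logical processors it sees; on the two-socket Nehalem node described above both calls in this minimal C++ sketch should report 16. It assumes a compiler with OpenMP support (e.g. built with g++ -fopenmp).

#include <cstdio>
#include <thread>
#include <omp.h>

int main() {
    // Hardware threads visible to the C++ runtime (2 sockets x 4 cores x 2 hyper-threads = 16 here).
    std::printf("hardware_concurrency: %u\n", std::thread::hardware_concurrency());
    // Logical processors visible to the OpenMP runtime.
    std::printf("omp_get_num_procs:    %d\n", omp_get_num_procs());
    return 0;
}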

Overview of Programming Models
– Programming models provide support for expressing concurrency and synchronization
– Process-based models assume that all data associated with a process is private by default, unless otherwise specified
– Lightweight processes and threads assume that all memory is global
  – A thread is a single stream of control in the flow of a program
– Directive-based programming models extend the threaded model by facilitating the creation and synchronization of threads
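To make the "all memory is global" point concrete, here is a minimal sketch (not from the slides) in C++ with OpenMP, the directive-based model introduced below: the heap-allocated array is shared and visible to every thread, while variables declared inside the parallel loop are private to each thread.

#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    std::vector<int> data(8, 0);        // shared: one copy, visible to all threads

    #pragma omp parallel for
    for (int i = 0; i < 8; ++i) {
        int tid = omp_get_thread_num(); // private: each thread has its own copy
        data[i] = tid;                  // threads write disjoint elements, so no race
    }

    for (int i = 0; i < 8; ++i)
        std::printf("data[%d] written by thread %d\n", i, data[i]);
    return 0;
}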

Advantages of Multithreading (threaded programming)
– Threads provide software portability
– Multithreading enables latency hiding
– Multithreading takes scheduling and load-balancing burdens away from programmers
– Multithreaded programs are significantly easier to write than message-passing programs
  – Becoming increasingly widespread in use due to the abundance of multicore platforms

Shared Address Space Programming APIs
The POSIX Threads (Pthreads) API
– Has emerged as the standard threads API
– Low-level primitives (relatively difficult to work with)
OpenMP
– A directive-based API for programming shared address space platforms (has become a standard)
– Used with Fortran, C, and C++
– Directives provide support for concurrency, synchronization, and data handling without having to explicitly manipulate threads
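To illustrate the difference in level of abstraction (this comparison is not from the slides), here is the same parallel array sum written with both APIs: the Pthreads version manages thread creation, argument passing, and joining by hand, while the OpenMP version needs a single directive. A hedged sketch; the thread count and array size are arbitrary choices.

#include <cstdio>
#include <pthread.h>
#include <omp.h>

const int N = 1000000;
const int NTHREADS = 4;
static double a[N];

struct Range { int begin, end; double partial; };

// Pthreads worker: each thread sums its own slice into its own struct.
void* sum_worker(void* arg) {
    Range* r = static_cast<Range*>(arg);
    r->partial = 0.0;
    for (int i = r->begin; i < r->end; ++i) r->partial += a[i];
    return nullptr;
}

int main() {
    for (int i = 0; i < N; ++i) a[i] = 1.0;

    // --- Pthreads version: explicit thread management ---
    pthread_t threads[NTHREADS];
    Range ranges[NTHREADS];
    int chunk = N / NTHREADS;
    for (int t = 0; t < NTHREADS; ++t) {
        ranges[t] = { t * chunk, (t == NTHREADS - 1) ? N : (t + 1) * chunk, 0.0 };
        pthread_create(&threads[t], nullptr, sum_worker, &ranges[t]);
    }
    double sum_pthreads = 0.0;
    for (int t = 0; t < NTHREADS; ++t) {
        pthread_join(threads[t], nullptr);
        sum_pthreads += ranges[t].partial;
    }

    // --- OpenMP version: one directive does the same work ---
    double sum_omp = 0.0;
    #pragma omp parallel for reduction(+ : sum_omp)
    for (int i = 0; i < N; ++i) sum_omp += a[i];

    std::printf("pthreads: %.0f  openmp: %.0f\n", sum_pthreads, sum_omp);
    return 0;
}

Both versions can be built with, for example, g++ -fopenmp -pthread sum.cpp.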

Parallelizing Graph Algorithms
Challenges:
– Runtime is dominated by memory latency rather than processor speed
– Little work is done while visiting a vertex or an edge
  – There is little computation with which to hide memory access costs
– Access patterns are determined only at runtime
  – Prefetching techniques are inapplicable
– Data locality is poor
  – It is difficult to obtain good memory system performance
For these reasons, parallel performance
– on distributed memory machines is often poor
– on shared memory machines is often better
We consider here graph coloring as an example of a graph algorithm to parallelize on shared memory machines

Graph coloring
– Graph coloring is an assignment of colors (positive integers) to the vertices of a graph such that adjacent vertices get different colors
– The objective is to find a coloring with the least number of colors
– Examples of applications:
  – Concurrency discovery in parallel computing (illustrated in a figure on the original slide)
  – Sparse derivative computation
  – Frequency assignment
  – Register allocation, etc.
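A small worked example (not from the slides): the C++ sketch below checks that a candidate coloring is proper, i.e. that no edge joins two vertices of the same color. The graph and coloring are made up for illustration; a 5-cycle, being an odd cycle, needs 3 colors.

#include <cstdio>
#include <vector>
#include <utility>

// True if no edge connects two vertices that received the same color.
bool is_proper_coloring(const std::vector<std::pair<int, int>>& edges,
                        const std::vector<int>& color) {
    for (const auto& e : edges)
        if (color[e.first] == color[e.second]) return false;
    return true;
}

int main() {
    // A 5-cycle: 0-1-2-3-4-0.
    std::vector<std::pair<int, int>> edges = {{0,1},{1,2},{2,3},{3,4},{4,0}};
    std::vector<int> color = {1, 2, 1, 2, 3};   // a proper 3-coloring
    std::printf("proper coloring: %s\n", is_proper_coloring(edges, color) ? "yes" : "no");
    return 0;
}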

A greedy algorithm for coloring
– Graph coloring is NP-hard to solve optimally (and even to approximate)
– The GREEDY algorithm (given as pseudocode on the original slide) produces very good solutions in practice
– Complexity of GREEDY: O(|E|), thanks to the way the array forbiddenColors is used
– color is a vertex-indexed array that stores the color of each vertex
– forbiddenColors is a color-indexed array used to mark the colors that are impermissible for the vertex currently being colored
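Since the pseudocode itself did not survive in the transcript, here is a hedged C++ sketch of the standard greedy (first-fit) scheme the slide describes, using color and forbiddenColors exactly as explained above. Marking forbiddenColors[c] with the current vertex id avoids resetting the array between vertices, so the total work stays linear in the size of the graph.

#include <cstdio>
#include <vector>

// Greedy first-fit coloring (a sketch of the scheme described on the slide).
// color[v]           : color of vertex v (0 = not yet colored)
// forbiddenColors[c] : id of the last vertex for which color c was marked forbidden;
//                      using the vertex id as the mark means the array never needs
//                      to be cleared between vertices.
std::vector<int> greedy_coloring(const std::vector<std::vector<int>>& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> color(n, 0);
    std::vector<int> forbiddenColors(n + 1, -1);

    for (int v = 0; v < n; ++v) {
        // Mark the colors already used by neighbors of v as forbidden for v.
        for (int w : adj[v])
            if (color[w] != 0) forbiddenColors[color[w]] = v;
        // Assign v the smallest color that is not forbidden for it.
        int c = 1;
        while (forbiddenColors[c] == v) ++c;
        color[v] = c;
    }
    return color;
}

int main() {
    // Example: a triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
    std::vector<std::vector<int>> adj = {{1, 2}, {0, 2}, {0, 1, 3}, {2}};
    std::vector<int> color = greedy_coloring(adj);
    for (int v = 0; v < static_cast<int>(color.size()); ++v)
        std::printf("vertex %d -> color %d\n", v, color[v]);
    return 0;
}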

Parallelizing Greedy Coloring
Desired goal: parallelize GREEDY such that
– the parallel runtime is roughly O(|E|/p) when p processors (threads) are used
– the number of colors used is nearly the same as in the serial case
This is difficult to achieve, since GREEDY is inherently sequential
Challenge: come up with a way to create concurrency in a nontrivial way

A potentially “generic” parallelization technique
“Standard” partitioning:
1. Break up the given problem into p independent subproblems of almost equal size
2. Solve the p subproblems concurrently
– The main work lies in the decomposition step, which is often no easier than solving the original problem
“Relaxed” partitioning:
1. Break up the problem into p subproblems of almost equal size that are not necessarily entirely independent
2. Solve the p subproblems concurrently
3. Detect inconsistencies among the solutions concurrently
4. Resolve any inconsistencies
– This approach can succeed if the resolution in step 4 involves only local adjustments

“Relaxed partitioning” applied to parallelizing Greedy coloring
Speculation and iteration:
– Color as many vertices as possible concurrently, tentatively tolerating potential conflicts; detect and resolve conflicts afterwards (iteratively)

Parallel Coloring on Shared Memory Platforms (using Speculation and Iteration)
The slides present this as Algorithm 2 (the pseudocode is not reproduced in this transcript); its lines 4 and 9 can each be parallelized using the OpenMP directive #pragma omp parallel for
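Since Algorithm 2 itself is not in the transcript, the following is a hedged C++/OpenMP sketch of the speculate-and-iterate scheme the slides describe: tentatively color all uncolored vertices in parallel, detect conflicting vertices in parallel, and repeat on the conflicted set. The two OpenMP-annotated loops correspond to the parallelizable loops the slide refers to; details such as breaking ties by vertex id are assumptions of this sketch, not taken from the slide.

#include <cstdio>
#include <vector>
#include <omp.h>

// Speculative, iterative parallel greedy coloring (a sketch, not the slide's exact Algorithm 2).
// adj: adjacency lists; returns a proper coloring with colors starting at 1.
std::vector<int> parallel_coloring(const std::vector<std::vector<int>>& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> color(n, 0);

    // Initially every vertex needs a color.
    std::vector<int> worklist(n);
    for (int v = 0; v < n; ++v) worklist[v] = v;

    while (!worklist.empty()) {
        // Phase 1: tentatively color the worklist vertices in parallel.
        #pragma omp parallel for schedule(dynamic, 64)
        for (int k = 0; k < (int)worklist.size(); ++k) {
            int v = worklist[k];
            std::vector<char> forbidden(adj[v].size() + 2, 0);   // thread-private scratch
            for (int w : adj[v]) {
                // Reading a neighbor's tentative color is a race we deliberately tolerate:
                // any resulting conflict is caught in phase 2.
                int c = color[w];
                if (c > 0 && c < (int)forbidden.size()) forbidden[c] = 1;
            }
            int c = 1;
            while (forbidden[c]) ++c;      // smallest color not seen in the neighborhood
            color[v] = c;
        }

        // Phase 2: detect conflicts (adjacent vertices with equal colors) in parallel.
        std::vector<int> conflicted;
        #pragma omp parallel
        {
            std::vector<int> local;        // per-thread buffer of conflicted vertices
            #pragma omp for nowait
            for (int k = 0; k < (int)worklist.size(); ++k) {
                int v = worklist[k];
                for (int w : adj[v])
                    if (color[w] == color[v] && v < w) {  // tie-break: lower id gets recolored
                        local.push_back(v);
                        break;
                    }
            }
            #pragma omp critical
            conflicted.insert(conflicted.end(), local.begin(), local.end());
        }

        // The conflicted vertices are recolored in the next round.
        worklist.swap(conflicted);
    }
    return color;
}

int main() {
    // Small example: a 5-cycle (needs 3 colors).
    std::vector<std::vector<int>> adj = {{1,4},{0,2},{1,3},{2,4},{3,0}};
    std::vector<int> color = parallel_coloring(adj);
    for (int v = 0; v < 5; ++v) std::printf("vertex %d -> color %d\n", v, color[v]);
    return 0;
}

In each conflicting pair only the lower-numbered vertex is re-queued, so the worklist shrinks every round and the process terminates with a proper coloring, typically using about as many colors as serial GREEDY.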

Sample experimental results of Algorithm 2 (Iterative) on Nehalem: I
Graph (RMAT-G): 16.7M vertices; 133.1M edges; degree: avg = 16, max = 1,278, variance = 416
(The performance charts appeared as figures on the original slide.)

Sample experimental results of Algorithm 2 (Iterative) on Nehalem: II
Graph (RMAT-B): 16.7M vertices; 133.7M edges; degree: avg = 16, max = 38,143, variance = 8,086
(The performance charts appeared as figures on the original slide.)

Recap
– Saw an example of a multi-core architecture: Intel Nehalem
– Took a bird's-eye view of parallel programming models, with OpenMP highlighted
– Saw an example of the design of a graph algorithm: graph coloring
– Saw some experimental results