Parallel building blocks

Introduction to Parallel Computing, University of Oregon, IPCC


Map
Elementwise mapping between the input and the output: the parallel version of the serial for loop. Each input element is passed through an elemental function to produce the corresponding output element.

Map: SAXPY (Scaled Vector Addition)
y = a*x + y, a basic BLAS routine. Every output element y[i] = a*x[i] + y[i] depends only on position i, so SAXPY is a pure map. (The slide works the example a = 4 applied elementwise to 12-element vectors.)
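
As a minimal sketch of the map pattern, a CUDA SAXPY kernel (the kernel name and launch geometry are illustrative; d_x and d_y stand for device arrays of length n):

```cuda
// SAXPY as a map: each thread independently computes one output element.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard: the grid may overshoot n
        y[i] = a * x[i] + y[i];
}

// Launch with one thread per element:
// saxpy<<<(n + 255) / 256, 256>>>(n, 4.0f, d_x, d_y);
```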

Reduce
Combination of the input elements with an associative binary function, e.g. min, max, or add.

Reduce: partitioned reduction
The input is split into partitions; each partition is reduced independently in parallel, and the partial results are then combined into the final value.
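
A sketch of one partition's reduction in CUDA, assuming + as the associative operator: each block reduces its partition with a binary tree in shared memory and writes one partial sum; a second pass (or the host) combines the partials.

```cuda
// Launch with sharedMem = blockDim.x * sizeof(float); blockDim.x a power of two.
__global__ void reduce_sum(const float *in, float *partial, int n)
{
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;        // pad with the identity of +
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = s[0];         // one partial result per partition
}
```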

Scan
Cumulative reduction of the input: every output element is a partial reduction of a prefix of the input. The scan is either exclusive (output i covers inputs 0..i-1) or inclusive (output i covers inputs 0..i). For example, the inclusive +-scan of [3, 1, 7, 0] is [3, 4, 11, 11]; the exclusive scan is [0, 3, 4, 11].

Scan: work-efficient implementation
Blelloch (1990), based on a binary tree. Two phases:
- Up sweep: partial sums are computed from the leaves to the root; the root ends up holding the final sum.
- Down sweep: the partial results are distributed from the root back to the leaves; for an exclusive scan the root element is first set to zero.

Scan: up sweep and down sweep
(The slide illustrates both phases on a binary tree.)
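
A single-block sketch of the Blelloch scan, assuming n is a power of two that fits in shared memory, launched with n/2 threads and n * sizeof(float) bytes of shared memory. A multi-block scan adds a pass that scans the per-block totals.

```cuda
// Work-efficient exclusive scan (Blelloch 1990) over one block's data.
__global__ void blelloch_scan(float *data, int n)
{
    extern __shared__ float s[];
    int tid = threadIdx.x;                 // launched with n/2 threads
    s[2 * tid]     = data[2 * tid];        // each thread loads two elements
    s[2 * tid + 1] = data[2 * tid + 1];

    // Up sweep: partial sums from the leaves to the root.
    for (int stride = 1; stride < n; stride *= 2) {
        __syncthreads();
        int i = (tid + 1) * 2 * stride - 1;    // right child of each subtree
        if (i < n)
            s[i] += s[i - stride];
    }

    if (tid == 0)
        s[n - 1] = 0.0f;   // exclusive scan: the root element is set to zero

    // Down sweep: distribute partial sums from the root to the leaves.
    for (int stride = n / 2; stride > 0; stride /= 2) {
        __syncthreads();
        int i = (tid + 1) * 2 * stride - 1;
        if (i < n) {
            float left = s[i - stride];    // swap left child up, push sum down
            s[i - stride] = s[i];
            s[i] += left;
        }
    }
    __syncthreads();
    data[2 * tid]     = s[2 * tid];
    data[2 * tid + 1] = s[2 * tid + 1];
}
```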

Gather and scatter
- Gather: for every output element, an index defines which input element is read.
- Scatter: for every input element, an index defines which output element is written.
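
Both patterns are one-line kernels in CUDA; a sketch (array names are illustrative):

```cuda
// Gather: out[i] = in[idx[i]] -- the index selects the input element.
__global__ void gather(float *out, const float *in, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[idx[i]];
}

// Scatter: out[idx[i]] = in[i] -- the index selects the output element.
// If idx contains duplicates, the write order is undefined (a data race).
__global__ void scatter(float *out, const float *in, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[idx[i]] = in[i];
}
```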

Compact
Conditional selection of the input elements; the result is contiguous. Built as map, scan, map: a map evaluates the predicate, a scan turns the flags into output positions, and a final map scatters the selected elements (sketched below).
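
A sketch of compact as the map-scan-map pipeline, with "keep positive values" as an assumed example predicate; the scan step can reuse the Blelloch kernel above or thrust::exclusive_scan.

```cuda
// Step 1 (map): flag the elements that satisfy the predicate.
__global__ void flag_keep(const float *in, int *flags, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        flags[i] = (in[i] > 0.0f) ? 1 : 0;   // example predicate
}

// Step 2 (scan): an exclusive scan of flags yields, for every kept
// element, its position in the contiguous output.

// Step 3 (map): scatter the kept elements to their scanned positions.
__global__ void scatter_kept(const float *in, const int *flags,
                             const int *pos, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        out[pos[i]] = in[i];
}
```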

Sparse matrix vector multiplication
A sparse matrix contains mostly zeroes, so the multiplication uses a compressed representation. Compressed Sparse Row (CSR) stores three arrays:
- Value: the nonzero elements, row by row
- Column: the column index of each nonzero
- Row Ptr: the offset of each row's first nonzero within Value
For example, the matrix [[3 0 1], [0 0 0], [0 2 0]] is stored as Value = [3, 1, 2], Column = [0, 2, 1], Row Ptr = [0, 2, 2, 3].

Sparse matrix vector multiplication, built from the primitives above:
- Row Ptr marks where each row begins in Value (the segment heads).
- The Column array gathers the matching vector entries.
- Piecewise (elementwise) multiplication: Value times the gathered vector entries.
- Inclusive segmented scan over the products: the last element of each segment is that row's result.
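
For reference, here is the plain one-thread-per-row CSR kernel, which computes the same result without the segmented-scan formulation used on the slides (a sketch; array names follow the slide):

```cuda
// Baseline CSR SpMV: thread r computes y[r] = dot(row r of A, x).
__global__ void spmv_csr(int rows, const int *rowPtr, const int *column,
                         const float *value, const float *x, float *y)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows) {
        float sum = 0.0f;
        for (int k = rowPtr[r]; k < rowPtr[r + 1]; ++k)
            sum += value[k] * x[column[k]];  // gather x through the column index
        y[r] = sum;
    }
}
```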

Sparse matrix vector multiplication: segmented scan
A segmented scan is a conditional scan: the running reduction restarts at segment boundaries. The conditional predicate is stored in a distinct head array (1 marks the first element of a segment). The slide contrasts a plain inclusive scan with the inclusive segmented scan driven by the head array.
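
To pin down the semantics, a sequential reference for the inclusive segmented scan (parallel versions are typically derived from an ordinary scan over (flag, value) pairs); a sketch:

```cuda
// Inclusive segmented scan, sequential reference: the running sum
// restarts wherever the head array marks a new segment.
void segmented_scan_ref(const float *in, const int *head, float *out, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (head[i])       // head[i] == 1: first element of a segment
            sum = 0.0f;
        sum += in[i];
        out[i] = sum;      // inclusive: out[i] includes in[i]
    }
}
```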

Assignment
- Parallel compact
- Sparse matrix vector multiplication