Cilk and Writing Code for Hardware

Cilk and Writing Code for Hardware Jyothi Krishna V S Nov 8, 2017

Cilk (Intel Cilk Plus)
Cilk is a C-based language for programming dynamic multithreaded applications on shared-memory multiprocessors. Cilk's provably good runtime system automatically manages low-level aspects of parallel execution, including protocols, load balancing, and scheduling.
Why is Cilk a good starting point?
Cilk extends the C language with just a handful of keywords: cilk_spawn, cilk_sync, cilk_for.
Cilk is processor-oblivious.
Cilk supports speculative parallelism.
Cool supplementary tools: CilkView and CilkScreen.

Fibonacci

C code:

    int fib(int n) {
        if (n < 2) return n;
        else {
            int x, y;
            x = fib(n-1);
            y = fib(n-2);
            return x + y;
        }
    }

Cilk code:

    int fib(int n) {
        if (n < 2) return n;
        else {
            int x, y;
            /* cilk_spawn: a new thread is created. The child Cilk procedure
               can execute in parallel with the parent caller. */
            x = cilk_spawn fib(n-1);
            y = cilk_spawn fib(n-2);
            /* cilk_sync: wait until all spawned children have returned. */
            cilk_sync;
            return x + y;
        }
    }

Consider n = 4 ...
[Figure: execution graph of the Cilk fib code above for n = 4. Node labels: 4, 3, 2, 2, 1, 1, 1. Legend: initial thread, spawn edge, continue edge, return edge, exit.]

CilkView: A Measure of Parallelism
Used for performance analysis of a Cilk program.
Estimates the performance of your code for different numbers of processors.
Gives estimates of: spawns, maximum parallelism, instructions executed, etc.

CilkView Output

    Work :                                  15,450 instructions
    Span :                                  3,826 instructions
    Burdened span :                         263,826 instructions
    Parallelism :                           4.04
    Burdened parallelism :                  0.06
    Number of spawns/syncs:                 48
    Average instructions / strand :         106
    Strands along span :                    12
    Average instructions / strand on span : 318
    Total number of atomic instructions :   54
    Frame count :                           102
    Entries to parallel region :            2

    Speedup Estimate
      2 processors:   0.07 - 2.00
      4 processors:   0.05 - 4.00
      8 processors:   0.04 - 4.04
      16 processors:  0.04 - 4.04
      32 processors:  0.04 - 4.04
      64 processors:  0.03 - 4.04
      128 processors: 0.03 - 4.04
      256 processors: 0.03 - 4.04

ArraySum

    #include <stdio.h>
    #include <cilk/cilk.h>

    int compute(int i) {
        return i;
    }

    int main() {
        int n = 100;
        int total = 0;
        /* cilk_for: spawns n threads. */
        cilk_for (int i = 1; i <= n; ++i) {
            total += compute(i);   /* race on total */
        }                          /* implicit cilk_sync */
        printf("Total is %d", total);
    }

Does this program give the correct answer, 5050? NOT always! There is a race on total. A potential race does not always happen at runtime, so it does not produce a WRONG answer every time 😓

Reduction
Defining a reduction operation on a variable.

    #include <stdio.h>
    #include <cilk/cilk.h>
    #include <cilk/reducer_opadd.h>

    int compute(int i) {
        return i;
    }

    int main() {
        int n = 100;
        cilk::reducer_opadd<int> total;
        cilk_for (int i = 1; i <= n; ++i) {
            total += compute(i);
        }
        printf("Total is %d", total.get_value());
    }

Reductions can be performed only for associative operations such as add, XOR, max, and min.

Cilkscreen: Race Detection
Dynamic race detection software.
Race detection is a VERY HARD problem. CilkScreen does not guarantee a 100% race-free program.

    ~/cilk/bin/cilkscreen ./a.out

Output when no race is found:

    No errors found by Cilkscreen

Output when a race is found:

    Race condition on location 0x1aba310
      write access at 0x401329: (~/test/matrix_multiply.cilk:51,__cilk_loop_d_002+0x51)
      read access at 0x401325: (~/test/matrix_multiply.cilk:51, …..

Cilk Plus Resources
Home page: https://www.cilkplus.org/
The runtime is integrated with gcc from version 4.9 onwards. If you have gcc 4.9 or above, just use the -fcilkplus flag.
CilkScreen: https://software.intel.com/en-us/articles/an-introduction-to-the-cilk-screen-race-detector
CilkView: https://www.cilkplus.org/tutorial-cilk-tools
Load balancing, work stealing.
Also look into the cactus stack: the Cilk memory model.

Writing Code for Hardware
Understanding and coding to the underlying hardware can improve your parallel performance.
Amdahl's law: Speedup S = 1 / ((1 - p) + p/s), where p is the fraction of the program that is parallelized and s is the speedup of that fraction.
Actual speedup = 1 / ((1 - p) + p/s + T(s) + S(s))
T(s): thread/task creation cost
S(s): synchronization cost
Priority order:
Correctness
Parallelism / scalability
Reduction in communication / synchronization
Load balance among threads
Other optimizations

Correctness
Correctness is subject to the requirement.

Race free but can be incorrect:

    x = 1; y = 0;
    Thread 1: atomic(y = 1); atomic(x = 1);
    Thread 2: if (x == 1) { atomic(y = 0); }
    ----------------------------
    assert(x || y);

Racy but correct:

    Thread 1: x = 3; y = 5;
    Thread 2: x = 1; y = 4;
    ----------------------------
    assert(x < y);

Improve Performance
Choice of algorithm (the s factor in Amdahl's law)
  SSSP: Dijkstra, Bellman-Ford (scalable)
  Sequential Dijkstra performs very well, but Bellman-Ford is nicely parallelizable.
Choice of hardware: CUDA (GPU) / Cilk (CPU)
  GPU: scratch memory, SIMD, larger number of threads. Avoid branches and small kernels. High T(s), moderate S(s).
  CPU: false sharing, MIMD, faster threads. Avoid false sharing. Moderate T(s), moderate S(s).

Example 1: CPU False-Sharing Effect
Consider a cache line of 8 words.
For N = 100,000, Intel Core i7, 8 threads:

    int a[N];
    int i;
    <parallel for>
    for (i = 0; i < N; i++) {
        a[i] = i + 2;
    }

Example 2: Effect of Branches in GPU
Kernel code, N = 1200000. Copy cost is not included.

Without branch:

    __global__ void addk(int* dA, int* dB, int* dC) {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        dC[id] = dA[id] + dB[id];
    }

With branch (neighbouring threads take different paths):

    if (id % 2 == 0)
        dC[id] = dA[id] + dB[id];
    else
        dC[id] = dA[id] - dB[id];