CISC 879 : Software Support for Multicore Architectures John Cavazos Dept of Computer & Information Sciences University of Delaware www.cis.udel.edu/~cavazos/cisc879.

Presentation transcript:

CISC 879 : Software Support for Multicore Architectures
John Cavazos, Dept of Computer & Information Sciences, University of Delaware
Lecture 10: Patterns for Parallel Programming III

Lecture 10: Overview
- Cell B.E. Clarification
- Design Patterns for Parallel Programs
  - Finding Concurrency
  - Algorithmic Structure
    - Organize by Tasks
    - Organize by Data
  - Supporting Structures

LS-LS DMA transfer (PPU)

    #include <stdio.h>
    #include <pthread.h>
    #include <libspe2.h>

    /* N, struct thread_args and my_spe_thread are defined elsewhere
       (not shown on the slide). */

    int main()
    {
        pthread_t pts[N];
        spe_context_ptr_t spe[N];
        struct thread_args t_args[N];
        int i;
        spe_program_handle_t *program;

        /* Load one SPU image and start it on N SPE contexts, one thread each. */
        program = spe_image_open("../spu/hello");
        for (i = 0; i < N; i++) {
            spe[i] = spe_context_create(0, NULL);
            spe_program_load(spe[i], program);
            t_args[i].spe = spe[i];
            t_args[i].spuid = i;
            pthread_create(&pts[i], NULL, &my_spe_thread, &t_args[i]);
        }

        /* Effective address at which SPE 1's local store is mapped. */
        void *ls = spe_ls_area_get(spe[1]);
        unsigned int mbox_data = (unsigned int)ls;
        printf("mbox_data %x\n", mbox_data);

        int rc;
        /* Send that address to SPE 0 through its inbound mailbox. */
        rc = spe_in_mbox_write(spe[0], &mbox_data, 1, SPE_MBOX_ALL_BLOCKING);
        /* Block until SPE 0 reports through its outbound interrupt mailbox. */
        rc = spe_out_intr_mbox_read(spe[0], &mbox_data, 1, SPE_MBOX_ALL_BLOCKING);

        /* Release every SPE; each one is waiting on its inbound mailbox. */
        for (i = 0; i < N; i++) {
            rc = spe_in_mbox_write(spe[i], &mbox_data, 1, SPE_MBOX_ALL_BLOCKING);
        }

        for (i = 0; i < N; i++) {
            pthread_join(pts[i], NULL);
        }
        spe_image_close(program);
        for (i = 0; i < N; i++) {
            spe_context_destroy(spe[i]);
        }
        return 0;
    }

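LS-LS DMA transfer (SPU)

    #include <stdio.h>
    #include <sys/time.h>
    #include <spu_mfcio.h>

    /* spuid and tv are declared elsewhere (not shown on the slide). */

    int main()
    {
        gettimeofday(&tv, NULL);
        printf("spu %lld; t.tv_usec %ld\n", spuid, tv.tv_usec);

        if (spuid == 0) {
            unsigned int ea;
            unsigned int tag = 0;
            unsigned int mask = 1;   /* bit 0 selects tag group 0 */

            /* Effective address of the target local store, sent by the PPU. */
            ea = spu_read_in_mbox();
            printf("ea = %p\n", (void*)ea);

            /* Every SPE runs the same image, so &tv is also the offset of tv
               inside the target SPE's local store. */
            mfc_put(&tv, ea + (unsigned int)&tv, sizeof(tv), tag, 1, 0);

            /* Wait for the DMA in this tag group to complete. */
            mfc_write_tag_mask(mask);
            mfc_read_tag_status_all();

            /* Tell the PPU that the transfer is done. */
            spu_write_out_intr_mbox(0);
        }

        /* Wait for the PPU's release message before printing tv again. */
        spu_read_in_mbox();
        printf("spu %lld; tv.tv_usec = %ld\n", spuid, tv.tv_usec);
        return 0;
    }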

LS-LS Output

    -bash-3.2$ ./a.out
    spu 0; t.tv_usec =
    spu 1; t.tv_usec =
    spu 2; t.tv_usec =
    spu 3; t.tv_usec =
    mbox_data f
    ea = 0xf
    spu 0; tv.tv_usec =
    spu 1; tv.tv_usec =
    spu 2; tv.tv_usec =
    spu 3; tv.tv_usec =

Organize by Data
- Operations on a core data structure
  - Geometric Decomposition
  - Recursive Data

Geometric Decomposition
- Arrays and other linear structures
- Divide into contiguous substructures
- Example: matrix multiply
- A data-centric algorithm on a linear data structure (array) implies geometric decomposition
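As a minimal sketch of the matrix multiply example, the code below splits the rows of the result into contiguous blocks, one per worker. The names `block_range`, `matmul_rows`, and `matmul_blocked` are hypothetical and do not come from the lecture. The driver runs the blocks one after another; it only stands in for workers that would run concurrently.

```c
#include <assert.h>

/* Hypothetical helper: rows [*begin, *end) owned by worker w of nworkers.
   Blocks are contiguous and differ in size by at most one row. */
static void block_range(int n, int nworkers, int w, int *begin, int *end)
{
    int base = n / nworkers;
    int extra = n % nworkers;
    *begin = w * base + (w < extra ? w : extra);
    *end = *begin + base + (w < extra ? 1 : 0);
}

/* Compute rows [row_begin, row_end) of C = A * B for n x n row-major
   matrices. Each call writes a disjoint block of C, so calls for
   different blocks do not conflict. */
static void matmul_rows(const double *A, const double *B, double *C,
                        int n, int row_begin, int row_end)
{
    for (int i = row_begin; i < row_end; i++) {
        for (int j = 0; j < n; j++) {
            double acc = 0.0;
            for (int k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
    }
}

/* Sequential stand-in for nworkers workers, one block of rows each. */
static void matmul_blocked(const double *A, const double *B, double *C,
                           int n, int nworkers)
{
    assert(n > 0 && nworkers > 0);
    for (int w = 0; w < nworkers; w++) {
        int begin, end;
        block_range(n, nworkers, w, &begin, &end);
        matmul_rows(A, B, C, n, begin, end);
    }
}
```

Because every block reads A and B but writes only its own rows of C, the loop over `w` is the part that could be handed to threads or SPEs.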

Recursive Data
- Lists, trees, and graphs
- Structures where you would use divide-and-conquer
- It may seem that you can only move sequentially through the data structure
- But there are ways to expose concurrency

Recursive Data Example
- Find the Root: given a forest of directed trees, find the root of the tree containing each node
- Parallel approach: for each node, find its successor's successor
- Repeat until no changes
- O(log n) parallel steps vs O(n) steps for a sequential walk
Slide Source: Dr. Rabbah, IBM, MIT Course IAP 2007
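The successor's-successor step can be sketched as follows. This is an illustration under assumptions, not code from the lecture: the forest is stored as an array `succ` where a root points to itself, and the function name `find_roots` is hypothetical. The inner loop is written sequentially, but its iterations are independent, which is where the concurrency lies.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* succ[i] is the successor (parent) of node i; a root points to itself.
   On return, succ[i] is the root of the tree containing node i.
   Returns the number of rounds in which some pointer changed,
   or -1 if memory allocation fails. */
static int find_roots(int *succ, int n)
{
    assert(n > 0);
    int *next = malloc((size_t)n * sizeof *next);
    if (next == NULL)
        return -1;

    int rounds = 0;
    int changed = 1;
    while (changed) {
        changed = 0;
        /* Every iteration reads the old array and writes only next[i],
           so all n iterations could run at the same time. */
        for (int i = 0; i < n; i++) {
            next[i] = succ[succ[i]];
            if (next[i] != succ[i])
                changed = 1;
        }
        memcpy(succ, next, (size_t)n * sizeof *succ);
        if (changed)
            rounds++;
    }
    free(next);
    return rounds;
}
```

For a chain of 8 nodes the pointers reach the root after 3 rounds, since each round doubles the distance a pointer spans. Note that the total work grows to O(n log n); the gain is in the number of steps, not in the number of operations.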

Organize by Flow of Data
- Regular: Pipeline
- Irregular: Event-Based Coordination

Organize by Flow of Data
- Computation can be viewed as a flow of data going through a sequence of stages
- Pipeline: one-way, predictable communication
- Event-Based Coordination: unrestricted, unpredictable communication

Pipeline performance
- Concurrency is limited by pipeline depth
- Balance computation and communication (architecture dependent)
- Stages should be equally computationally intensive
  - The slowest stage creates a bottleneck
  - Combine lightly loaded stages or decompose heavily loaded stages
- Time to fill and drain the pipe should be small
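A rough cost model makes these points concrete. The sketch below assumes an idealized pipeline that the slides do not spell out: each stage runs on its own processor, every item takes the same fixed time in a given stage, and communication is free. Under those assumptions the first item needs the sum of all stage times (the fill), and each later item completes one slowest-stage time after the previous one. The function names are hypothetical.

```c
#include <assert.h>

/* Time to push nitems through a linear pipeline whose stage s takes
   stage[s] time units per item: fill time plus one slowest-stage
   interval for each remaining item. */
static double pipeline_time(const double *stage, int nstages, long nitems)
{
    assert(nstages > 0 && nitems > 0);
    double sum = 0.0;
    double slowest = 0.0;
    for (int s = 0; s < nstages; s++) {
        sum += stage[s];
        if (stage[s] > slowest)
            slowest = stage[s];
    }
    return sum + (double)(nitems - 1) * slowest;
}

/* Speedup over running every stage of every item on one processor. */
static double pipeline_speedup(const double *stage, int nstages, long nitems)
{
    double sum = 0.0;
    for (int s = 0; s < nstages; s++)
        sum += stage[s];
    return ((double)nitems * sum) / pipeline_time(stage, nstages, nitems);
}
```

With stage times {1, 2, 1} and 10 items the model gives 4 + 9 * 2 = 22 time units against 40 sequentially, a speedup below 2 on three processors because the middle stage is the bottleneck. With four balanced stages the speedup approaches, but never reaches, the pipeline depth of 4.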

Supporting Structures
- Single Program Multiple Data (SPMD)
- Loop Parallelism
- Master/Worker
- Fork/Join

SPMD Pattern
- Create a single program that runs on each processor
  - Initialize
  - Obtain a unique identifier
  - Run the same program on each processor
    - Identifier and input data can differentiate behavior
  - Distribute data (if any)
  - Finalize
Slide Source: Dr. Rabbah, IBM, MIT Course IAP 2007
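The shape of an SPMD program can be sketched in a few lines. In this illustration the single program is the function `spmd_partial_sum`, and the identifier `rank` decides which elements it touches. Both names are hypothetical, and the cyclic data distribution is one choice among several. The driver `spmd_run` calls each rank in turn only to keep the sketch self-contained; on a real machine each rank would be a separate thread, process, or SPE.

```c
#include <assert.h>

/* The one program every processor runs. rank is this processor's unique
   identifier in [0, nprocs). Element i belongs to rank i % nprocs. */
static long spmd_partial_sum(int rank, int nprocs, const int *data, int n)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    long partial = 0;
    for (int i = rank; i < n; i += nprocs)
        partial += data[i];
    return partial;
}

/* Sequential stand-in for launching nprocs copies and combining
   their partial results. */
static long spmd_run(int nprocs, const int *data, int n)
{
    long total = 0;
    for (int rank = 0; rank < nprocs; rank++)
        total += spmd_partial_sum(rank, nprocs, data, n);
    return total;
}
```

Every rank executes identical code; only the value of `rank` changes which slice of the input it reads.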

SPMD Challenges
- Split data correctly
- Correctly combine results
- Achieve even work distribution
- If programs require dynamic load balancing, another pattern may be more suitable (Job Queue)
Slide Source: Dr. Rabbah, IBM, MIT Course IAP 2007

Loop Parallelism Pattern
- Many programs are expressed as iterative constructs
- Programming models like OpenMP provide pragmas to automatically assign loop iterations to processors
Slide Source: Dr. Rabbah, IBM, MIT Course IAP 2007
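A minimal OpenMP sketch of this pattern follows. The functions `saxpy` and `dot` are illustrative examples chosen here, not taken from the lecture. With GCC the pragmas take effect when compiling with `-fopenmp`; without that flag they are ignored and the loops run serially with the same result.

```c
#include <assert.h>

/* y[i] = a * x[i] + y[i]. Iterations are independent, so the pragma
   lets the runtime divide them among threads. */
static void saxpy(int n, float a, const float *x, float *y)
{
    assert(n >= 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Dot product. All iterations update sum, so the reduction clause gives
   each thread a private copy and combines the copies at the end. */
static double dot(int n, const double *x, const double *y)
{
    assert(n >= 0);
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

The first loop needs no coordination at all. The second shows the usual catch: a loop that accumulates into one variable is only safe to parallelize once that dependence is declared as a reduction.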

Master/Worker Pattern
- Relevant where tasks have no dependencies
  - Embarrassingly parallel
- The difficulty is determining when the entire problem is complete
Slide Source: Dr. Rabbah, IBM, MIT Course IAP 2007
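To illustrate why a master handing out independent tasks balances load well, the sketch below compares two schedules in a simple cost model. It is a simulation under assumptions, not a threaded implementation: task costs are known numbers, `static_makespan` gives each worker a fixed contiguous chunk of tasks, and `dynamic_makespan` gives the next task in the queue to whichever worker becomes idle first. All names and the `MAX_WORKERS` limit are hypothetical.

```c
#include <assert.h>

#define MAX_WORKERS 64

/* Finish time when tasks are split up front into equal-count
   contiguous chunks, one chunk per worker. */
static long static_makespan(const long *cost, int ntasks, int nworkers)
{
    assert(nworkers > 0);
    int chunk = (ntasks + nworkers - 1) / nworkers;
    long makespan = 0;
    for (int w = 0; w < nworkers; w++) {
        long load = 0;
        for (int t = w * chunk; t < ntasks && t < (w + 1) * chunk; t++)
            load += cost[t];
        if (load > makespan)
            makespan = load;
    }
    return makespan;
}

/* Finish time when a master hands the next queued task to the worker
   that becomes idle first (ties go to the lowest worker index). */
static long dynamic_makespan(const long *cost, int ntasks, int nworkers)
{
    assert(nworkers > 0 && nworkers <= MAX_WORKERS);
    long busy_until[MAX_WORKERS] = {0};
    for (int t = 0; t < ntasks; t++) {
        int idle = 0;
        for (int w = 1; w < nworkers; w++)
            if (busy_until[w] < busy_until[idle])
                idle = w;
        busy_until[idle] += cost[t];
    }
    long makespan = 0;
    for (int w = 0; w < nworkers; w++)
        if (busy_until[w] > makespan)
            makespan = busy_until[w];
    return makespan;
}
```

For task costs {8, 1, 1, 1, 1, 1, 1, 2} on two workers, the static split finishes at time 11 while the dynamic queue finishes at time 8, because the second worker absorbs the short tasks while the first is busy with the long one. In this model the whole problem is complete once the queue is empty and every worker has passed its `busy_until` time.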

Fork/Join Pattern
- Parent creates new tasks (fork), then waits until they complete (join)
- Tasks are created dynamically
  - Tasks can create more tasks
- Tasks are managed according to their relationships
Slide Source: Dr. Rabbah, IBM, MIT Course IAP 2007
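One way to express fork/join in C is with OpenMP tasks, sketched below as a recursive array sum. This is an illustrative choice rather than the lecture's method, and the names `fj_sum`, `parallel_sum`, and `CUTOFF` are hypothetical. Each call forks two child tasks and joins them at `taskwait`, and the children fork further tasks in turn. Without `-fopenmp` the pragmas are ignored and the same code runs as an ordinary serial recursion.

```c
#include <assert.h>

#define CUTOFF 1024  /* below this size, sum serially instead of forking */

/* Sum a[lo..hi) by forking two child tasks and joining them. */
static long fj_sum(const int *a, int lo, int hi)
{
    if (hi - lo <= CUTOFF) {
        long s = 0;
        for (int i = lo; i < hi; i++)
            s += a[i];
        return s;
    }
    int mid = lo + (hi - lo) / 2;
    long left = 0;
    long right = 0;

    #pragma omp task shared(left)
    left = fj_sum(a, lo, mid);      /* fork: child may fork again */

    #pragma omp task shared(right)
    right = fj_sum(a, mid, hi);     /* fork */

    #pragma omp taskwait            /* join: wait for both children */
    return left + right;
}

/* Create the thread team, then let one thread start the root task. */
static long parallel_sum(const int *a, int n)
{
    assert(n >= 0);
    long total = 0;
    #pragma omp parallel
    #pragma omp single
    total = fj_sum(a, 0, n);
    return total;
}
```

The parent's wait at `taskwait` covers only its own two children, which matches the point that tasks are managed according to their parent-child relationships.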