

HParC language

Background
Shared memory level – Multiple separated shared memory spaces
Message passing level 1 – Fast level of k separate message passing segments
Message passing level 2 – Slow level of message passing

Proposed solution
The target architecture is a grid composed of MCs.
Scoping rules that clearly separate MP constructs and SM constructs.
An extensive operation set to support serialization and deserialization of complex messages, so that a complex message can be sent in one operation.

Scoping rules problems

Nesting rules problem
Do we allow an MP_PF to access shared variables defined in an outer parallel construct? If so, how do we support coherent views in the caches of different MC machines?
Can a message queue variable defined in an outer MP_PF be used by an inner MP_PF when an SM_PF separates the two MP_PFs? (See the sketch below.)
Can a shared variable defined in an outer SM_PF be used by an inner SM_PF when an MP_PF separates the two SM_PFs?
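
A minimal sketch of the second question above, written in the HParC-like syntax of the code example further below. This is not taken from the slides; the loop bounds and the inner queue name R are illustrative assumptions.

    mparfor (int i = 0; i < N; i++) using Q {        /* outer MP_PF declares queue Q */
        parfor (int j = 0; j < M; j++) {             /* an SM_PF separates the two MP levels */
            mparfor (int k = 0; k < P; k++) using R {
                /* May this inner MP_PF still send to or receive from Q,
                   even though an SM_PF lies between it and the MP_PF
                   that declared Q? */
            }
        }
    }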

HParC parallel constructs
mparfor/mparblock
– Threads cannot access shared memory (i.e., cannot reference nonlocal variables); they can only send and receive messages.
– Threads are evenly distributed among all machines in the system.
parfor/parblock
– Threads can access shared memory and use message passing.
– Threads are distributed among the machines of only one MC out of the set of clusters constituting the target architecture.

Code example in HParC

    #define N number_of_MCs
    mparfor (int i = 0; i < N; i++) using Q {
        int A[N], sum = 0;
        if (i < N-1) {                         /* producer threads: fill A in parallel ... */
            parfor (int j = 0; j < N; j++)
                A[j] = f(i, j);
            Q[N-1] = A;                        /* ... and send the whole array to thread N-1 */
        } else {                               /* consumer thread N-1 */
            int z = rand() % N;
            parfor (int j = 0; j < z; j++) {   /* z pool threads share the summing work */
                message m;
                int s = 0, k;
                for (int t = 0; t < N/z; t++) {
                    m = Q[N-1];                /* receive one array from the queue */
                    for (k = 0; k < N; k++)
                        s += m[k];
                }
                faa(&sum, s);                  /* fetch-and-add into the shared sum */
            }
        }
    }

OpenMP enhanced with MPI
The parallel directives of OpenMP enable parallel constructs similar to those available in HParC:
– atomically executed basic arithmetic expressions and synchronization primitives;
– various types of shared variables that help adjust shared memory usage.
MPI is a language-independent communications protocol. It supports point-to-point and collective communication. It is a framework that provides an extensive API, not a language extension.
Combining these two programming styles in a single program is not an easy task:
– MPI constructs are intended to be used only in thread-wide scope.
– The dynamic joining of new threads to the MPI realm is not straightforward.
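
A minimal sketch of what combining the two styles looks like in C (not part of the slides; the work distribution and names such as local_sum are illustrative). Each MPI process sums its share of the work with an OpenMP parallel loop and an atomic update, then the partial sums are combined with MPI_Reduce.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided;
        /* request thread support before mixing OpenMP threads with MPI */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        long local_sum = 0;
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++) {
            long term = (long)rank * 1000 + i;   /* this process's share of the work */
            #pragma omp atomic                   /* atomically executed arithmetic update */
            local_sum += term;
        }

        long global_sum = 0;
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("global sum = %ld\n", global_sum);

        MPI_Finalize();
        return 0;
    }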

Comparison with HParC
The code is far more complex and less readable.
MPI usage demands a lot of supporting directives:
– Communication procedures demand low-level address information, binding the application to the hardware architecture.
– Lines 11, 12 and 14 are replaced in HParC by the simple declaration "using Ql".
– The parfor of HParC implies natural synchronization, so there is no need for lines 8 and 24.
– The declaration of communication groups performed by different threads is asymmetric (lines 28-29), while in HParC the message passing queue is part of the parallel construct and is accessed in a symmetrical manner.

PGAS languages
Partitioned Global Address Space languages assume that distinct areas of memory, local to each machine, are logically assembled into a global memory address space.
Remote DMA operations are used to simulate shared memory among distinct MC computers.
They have no scoping rules that impose locality on shared variables.
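
A sketch of the PGAS idea using standard MPI one-sided (RMA) operations in C (not from the slides; the window layout and ring-neighbour pattern are illustrative assumptions). Each rank exposes one local integer as its partition of the global space, and a remote write is performed with MPI_Put, much like the RDMA traffic a PGAS runtime would issue.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int local = -1;                        /* the partition owned by this rank */
        MPI_Win win;
        MPI_Win_create(&local, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        int value = rank;                      /* value to deposit at the right neighbour */
        int target = (rank + 1) % size;

        MPI_Win_fence(0, win);                 /* open the access epoch */
        MPI_Put(&value, 1, MPI_INT, target, 0, 1, MPI_INT, win);  /* remote write */
        MPI_Win_fence(0, win);                 /* close the epoch: local is now visible */

        printf("rank %d received %d\n", rank, local);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }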

Comparison with X10
X10 is a Java-based language.
X10 can be used to program a cluster of MCs, allowing a fully separate shared memory scope at each MC.
Instead of nested MP levels it uses a global, separate shared memory area.
It has a foreach construct similar to the SM_PF, allowing full nesting, including declaration of local variables that are shared among the threads spawned by inner foreach constructs.

The X10 code vs. HParC

    ateach ((i) in Dist.makeUnique()) {                      // i iterates over all places
        if (here != Place.FIRST_PLACE) {
            val A = Rail.make[Int](N, (i:Int) => 0);         // creates a local array at places 2..n
            finish foreach (j) in [0..N-1] A[j] = f(i, j);   // create a thread to fill in each A
            at (Place.FIRST_PLACE) atomic queue.add[A];      // go to Place 1 and place A into the queue
        } else {                                             // at Place.FIRST_PLACE
            shared sum = 0;
            finish foreach (k) in [0..K-1] {                 // create k pool threads to sum up the A's
                while (true) {
                    var A;
                    when ((nProcessed == N-1) || (!queue.isEmpty())) {  // all processed or queue non-empty
                        if (nProcessed == N-1) break;
                        A = queue.remove() as ValRail[Int];  // copy the A array
                        nProcessed++;
                    }
                    var s = 0;
                    for ((i) in 0..N-1) s += A[i];
                    atomic sum += s;
                }
            }
        }
    }

Comparison with HParC
We have to spawn an additional thread in order to update the global shared queue.
The update command implicitly copies the potentially big array A and sends it through the network.