 The PFunc Implementation of NAS Parallel Benchmarks. Presenter: Shashi Kumar Nanjaiah Advisor: Dr. Chung E Wang Department of Computer Science California.

Slides:



Advertisements
Similar presentations
Christian Delbe1 Christian Delbé OASIS Team INRIA -- CNRS - I3S -- Univ. of Nice Sophia-Antipolis November Automatic Fault Tolerance in ProActive.
Advertisements

Department of Computer Science and Engineering University of Washington Brian N. Bershad, Stefan Savage, Przemyslaw Pardyak, Emin Gun Sirer, Marc E. Fiuczynski,
Threads. Readings r Silberschatz et al : Chapter 4.
PFunc: Modern Task Parallelism For Modern High Performance Computing Prabhanjan Kambadur, Open Systems Lab, Indiana University.
Lecture 2c: Benchmarks. Benchmarking Benchmark is a program that is run on a computer to measure its performance and compare it with other machines Best.
Silberschatz, Galvin and Gagne ©2013 Operating System Concepts Essentials – 2 nd Edition Chapter 4: Threads.
1 Presentation at the 4 th PMEO-PDS Workshop Benchmark Measurements of Current UPC Platforms Zhang Zhang and Steve Seidel Michigan Technological University.
MPI in uClinux on Microblaze Neelima Balakrishnan Khang Tran 05/01/2006.
Computer Architecture II 1 Computer architecture II Programming: POSIX Threads OpenMP.
PARALLEL PROCESSING The NAS Parallel Benchmarks Daniel Gross Chen Haiout.
1 School of Computing Science Simon Fraser University CMPT 300: Operating Systems I Ch 4: Threads Dr. Mohamed Hefeeda.
Tile Reduction: the first step towards tile aware parallelization in OpenMP Ge Gan Department of Electrical and Computer Engineering Univ. of Delaware.
Threads. Processes and Threads  Two characteristics of “processes” as considered so far: Unit of resource allocation Unit of dispatch  Characteristics.
Using JetBench to Evaluate the Efficiency of Multiprocessor Support for Parallel Processing HaiTao Mei and Andy Wellings Department of Computer Science.
A Source-to-Source OpenACC compiler for CUDA Akihiro Tabuchi †1 Masahiro Nakao †2 Mitsuhisa Sato †1 †1. Graduate School of Systems and Information Engineering,
Team Badass.  Dennis M. Ritchie  1967 He became an employee at Bell Labs  Mid 1960s BCPL was developed by Martin Richards for the Multics Project 
1 ProActive performance evaluation with NAS benchmarks and optimization of OO SPMD Brian AmedroVladimir Bodnartchouk.
What is R By: Wase Siddiqui. Introduction R is a programming language which is used for statistical computing and graphics. “R is a language and environment.
OMPi: A portable C compiler for OpenMP V2.0 Elias Leontiadis George Tzoumas Vassilios V. Dimakopoulos University of Ioannina.
OpenMP in a Heterogeneous World Ayodunni Aribuki Advisor: Dr. Barbara Chapman HPCTools Group University of Houston.
Prabhanjan Kambadur, Amol Ghoting, Anshul Gupta and Andrew Lumsdaine. International Conference on Parallel Computing (ParCO),2009 Extending Task Parallelism.
1 Developing Native Device for MPJ Express Advisor: Dr. Aamir Shafi Co-advisor: Ms Samin Khaliq.
An Effective Dynamic Scheduling Runtime and Tuning System for Heterogeneous Multi and Many-Core Desktop Platforms Authous: Al’ecio P. D. Binotto, Carlos.
9/13/20151 Threads ICS 240: Operating Systems –William Albritton Information and Computer Sciences Department at Leeward Community College –Original slides.
Chapter 4: Threads. 4.2CSCI 380 Operating Systems Chapter 4: Threads Overview Multithreading Models Threading Issues Pthreads Windows XP Threads Linux.
Chapter 4: Threads. 4.2 Silberschatz, Galvin and Gagne ©2005 Operating System Concepts Threads A thread (or lightweight process) is a basic unit of CPU.
Modifying Floating-Point Precision with Binary Instrumentation Michael Lam University of Maryland, College Park Jeff Hollingsworth, Advisor.
How Java becomes agile riding Rhino Xavier Casellato VP of engineering, Liligo.com.
Processes and Threads CS550 Operating Systems. Processes and Threads These exist only at execution time They have fast state changes -> in memory and.
Frequent Itemset Mining on Graphics Processors, Fang et al., DaMoN Turbo-charging Vertical Mining of Large Databases, Shenoy et al., MOD NVIDIA.
Performance Model for Parallel Matrix Multiplication with Dryad: Dataflow Graph Runtime Hui Li School of Informatics and Computing Indiana University 11/1/2012.
Microsoft Excel By Tom Osti. What is Excel? Microsoft Excel (full name Microsoft Office Excel) is a spreadsheet-application written and distributed by.
1CPSD Software Infrastructure for Application Development Laxmikant Kale David Padua Computer Science Department.
Source: Operating System Concepts by Silberschatz, Galvin and Gagne.
JAVA AND MATRIX COMPUTATION
Chapter 4: Threads. 2 Overview Multithreading Models Threading Issues Pthreads Windows XP Threads Linux Threads.
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University IWPSE 2003 Program.
Objective-C Drew Cheng CS420. Overview True superset of ANSI C Object-Oriented methods from SmallTalk There is no formal written standard for the language.
Multithreaded Programing. Outline Overview of threads Threads Multithreaded Models  Many-to-One  One-to-One  Many-to-Many Thread Libraries  Pthread.
12/22/ Thread Model for Realizing Concurrency B. Ramamurthy.
Experiences with Co-array Fortran on Hardware Shared Memory Platforms Yuri DotsenkoCristian Coarfa John Mellor-CrummeyDaniel Chavarria-Miranda Rice University,
Generic Compressed Matrix Insertion P ETER G OTTSCHLING – S MART S OFT /TUD D AG L INDBO – K UNGLIGA T EKNISKA H ÖGSKOLAN SmartSoft – TU Dresden
Java FilesOops - Mistake Java lingoSyntax
Published in ACM SIGPLAN, 2010 Heidi Pan MassachusettsInstitute of Technology Benjamin Hindman UC Berkeley Krste Asanovi´c UC Berkeley 1.
Saurav Karmakar. Chapter 4: Threads  Overview  Multithreading Models  Thread Libraries  Threading Issues  Operating System Examples  Windows XP.
EU-Russia Call Dr. Panagiotis Tsarchopoulos Computing Systems ICT Programme European Commission.
B ERKELEY P AR L AB 1 Lithe: Enabling Efficient Composition of Parallel Libraries Heidi Pan, Benjamin Hindman, Krste Asanović HotPar  Berkeley, CA  March.
Silberschatz, Galvin and Gagne ©2009Operating System Concepts – 8 th Edition Chapter 4: Threads.
Parallel OpenFOAM CFD Performance Studies Student: Adi Farshteindiker Advisors: Dr. Guy Tel-Zur,Prof. Shlomi Dolev The Department of Computer Science Faculty.
A thread is a basic unit of CPU utilization within a process Each thread has its own – thread ID – program counter – register set – stack It shares the.
cs612/2002sp/projects/ CS612 Term Projects cs612/2002sp/projects/
Chapter 4 – Thread Concepts
Chapter 4: Threads Modified by Dr. Neerja Mhaskar for CS 3SH3.
Threads Some of these slides were originally made by Dr. Roger deBry. They include text, figures, and information from this class’s textbook, Operating.
Chapter 4: Multithreaded Programming
Concurrent Dynamic Memory Coalescing on GoblinCore-64 Architecture
Chapter 4: Threads.
CMPS 5433 Programming Models
Prabhanjan Kambadur, Open Systems Lab, Indiana University
Chapter 4 – Thread Concepts
Chapter 4: Threads.
Chapter 4: Multithreaded Programming
Fei Cai Shaogang Wu Longbing Zhang and Zhimin Tang
Chapter 4: Threads.
Intel® Parallel Studio and Advisor
Chapter 4: Threads.
Department of Computer Science University of California, Santa Barbara
Tutorial: The Programming Interface
Chapter 4: Threads & Concurrency
Department of Computer Science University of California, Santa Barbara
Presentation transcript:

 The PFunc Implementation of NAS Parallel Benchmarks. Presenter: Shashi Kumar Nanjaiah Advisor: Dr. Chung E Wang Department of Computer Science California State University, Sacramento

Overview. The goal of this project is to prove the efficacy of task parallelism in PFunc to parallelize industry-standard benchmark computation kernels and applications on shared-memory.  Introduce PFunc, a new tool for task parallelism.  New features and extensions.  Fibonacci example.  Introduce NAS parallel benchmarks.  Briefly explain the 7 benchmarks.

Background PFunc - A new tool for task parallelism.  Extends existing task parallel feature-set.  Cilk, Threading Building Blocks, Fortran M, etc.  Portable.  Linux, OS X, AIX and Windows.  Customizable.  Generic and generic programming techniques.  No runtime penalty.  C and C++ APIs.  Released under Eclipse Public License v1.0. 

Example: Parallelizing Fibonacci numbers. typedef struct {int n; int fib_n;} fib_t; void fibonacci (void arg) { fib_t* fib_arg = (fib_t*) arg; if (0 == fib_arg-> n || 1 == fib_arg-> n{ fib_arg-> fib_n = fib_arg-> n; }else{ pfunc_cilk_task_t fib_task; fib_t fib_n_1 = {(fib_arg-> n) - 1, 0}; fib_t fib_n_2 = {(fib_arg-> n) – 2, 0}; pfunc_cilk_task_init (&fib_task); pfunc_cilk_spawn_c (fib_task, /* Handle to the task* / NULL, /* Attribute -- use default */ NULL, /* Group -- use default */ fibonacci, /* Function to execute */ &fib_n_1); /* Argument */ fibonacci (&fib_n_2); pfunc_cilk_wait (fib_task); pfunc_cilk_task_clear (&fib_task); fib_arg-> fib_n = fib_n_1.fib_n + fib_n_2.fib_n; }}

Fibonacci: task creation overhead. Fibonacci number 37 (2 36 ≈ 69 billion tasks).  2x faster than TBB!  Only 2x slower than Cilk.  But provides more flexibility!  Fibonacci is the worst case behavior.  Library-based rather than a custom compiler. ThreadsCilk Time (secs)PFunc/CilkTBB/CilkPfunc/TBB

NAS Parallel Benchmarks.  Stands for NASA Advanced Supercomputing.  Help to evaluate performance of parallel tools and machines.  Consist of 5 kernels and 3 pseudo applications.  Taken mostly from Computational Fluid Dynamics (CFD).  Originally written in Fortran, but C versions are available.   NPB OpenMP-C v2.3.  Base code taken from Omni group’s implementation.

NAS Parallel Benchmarks. BenchmarkExplanation Embarrassingly Parallel (EP)Gaussian random varieties. Marsaglia polar method. Multigrid (MG)3-dimensional discrete Poisson equation. Conjugate Gradient (CG)Iterative solver for linear systems. Symmetric positive-definite matrices. Integer sort (IS)Bucket sort. LU Solver (LU)Lower-upped symmetric Gauss-Seidel. System of nonlinear equations. Pentadiagonal solver (SP)System of nonlinear equations. Block tridiagonal solver (BT)System of nonlinear equations.

Conclusion.  Modify data-parallel NPB OpenMP-C version to task parallel version.  Compare against original NPB OpenMP-C version.  For problem sizes in classes A, B and C.