CS 420/CSE 402/ECE 492 Introduction to Parallel Programming for Scientists and Engineers Fall 2012 Department of Computer Science University of Illinois at Urbana-Champaign

Topics covered: parallel algorithms, parallel programming languages, and parallel programming techniques, with a focus on tuning programs for performance. The course will build on your knowledge of algorithms, data structures, and programming. This is an advanced course in Computer Science.

Why parallel programming for scientists and engineers? Science and engineering computations are often lengthy, and parallel machines have more computational power than their sequential counterparts. Faster computing means faster science and design; with fixed resources, it means better science and engineering. Yesterday, top-of-the-line machines were parallel; today, parallelism is the norm for all classes of machines, from mobile devices to the fastest machines.

CS420/CSE402/ECE492 Developed to fill a need in the computational sciences and engineering program. CS majors can also benefit from this course. However, there is a parallel programming course for CS majors that will be offered in the Spring semester.

Course organization
Course website: https://agora.cs.illinois.edu/display/cs420fa10/Home
Instructor: David Padua, 4227 SC, padua@uiuc.edu, 3-4223. Office hours: Wednesdays 1:30-2:30 pm.
TA: Osman Sarood, sarood1@illinois.edu
Grading: 6 Machine Problems (MPs) 40%; homeworks not graded; midterm (Wednesday, October 10) 30%; final (comprehensive, 8 am Friday, December 14) 30%.
Graduate students registered for 4 credits must complete additional work (associated with each MP).

MPs will use several programming models. The common language will be C with extensions. Target machines will (tentatively) be those in the Intel(R) Manycore Testing Lab.

MP Plan
MP#   Assign Date   Due Date   Grade Date
MP1   9/7           9/17       10/1
MP2   9/26          10/8       -
MP3   10/5          10/19      -
MP4   10/10         11/2       -
MP5   -             11/16      -
MP6   11/12         12/3       -
MP7   11/30         12/12      -

Textbook G. Hager and G. Wellein. Introduction to High Performance Computing for Scientists and Engineers. CRC Press

Specific topics covered: introduction, scalar optimizations, memory optimizations, vector algorithms, vector programming in SSE, shared-memory programming in OpenMP, distributed-memory programming in MPI, and, if time allows, miscellaneous topics (compilers and parallelism, performance monitoring, debugging).

Parallel computing

An active subdiscipline: the history of computing is intertwined with parallelism, and parallelism has become an extremely active subdiscipline within Computer Science.

What makes parallelism so important? One reason is its impact on performance: for a long time it was the technology of high-end machines, and today it is the most important driver of performance for all classes of machines.

Parallelism in hardware: parallelism is pervasive and appears at all levels. Within a processor it appears in basic operations, multiple functional units, pipelining, and SIMD units; beyond the processor it appears in multiprocessors. The levels combine for a multiplicative effect on performance.

Parallelism in hardware (adders): adders can be serial, parallel, or highly parallel.

Carry lookahead logic
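
To make the idea concrete, here is a minimal C sketch of 4-bit carry-lookahead logic (my own illustration; the slides only show the circuit). Each carry is computed directly from generate/propagate signals, so all carries can be evaluated in parallel instead of rippling from bit to bit.

    #include <stdint.h>
    #include <stdio.h>

    /* 4-bit carry-lookahead addition.
       g_i = a_i & b_i (generate), p_i = a_i ^ b_i (propagate).
       Every carry below depends only on the inputs and c0, not on the
       previous carry, which is what makes the adder parallel. */
    static uint8_t add4_lookahead(uint8_t a, uint8_t b, unsigned c0) {
        unsigned g0 = (a >> 0 & 1) & (b >> 0 & 1), p0 = (a >> 0 & 1) ^ (b >> 0 & 1);
        unsigned g1 = (a >> 1 & 1) & (b >> 1 & 1), p1 = (a >> 1 & 1) ^ (b >> 1 & 1);
        unsigned g2 = (a >> 2 & 1) & (b >> 2 & 1), p2 = (a >> 2 & 1) ^ (b >> 2 & 1);
        unsigned g3 = (a >> 3 & 1) & (b >> 3 & 1), p3 = (a >> 3 & 1) ^ (b >> 3 & 1);
        unsigned c1 = g0 | (p0 & c0);
        unsigned c2 = g1 | (p1 & g0) | (p1 & p0 & c0);
        unsigned c3 = g2 | (p2 & g1) | (p2 & p1 & g0) | (p2 & p1 & p0 & c0);
        unsigned c4 = g3 | (p3 & g2) | (p3 & p2 & g1) | (p3 & p2 & p1 & g0)
                         | (p3 & p2 & p1 & p0 & c0);
        unsigned sum = (p0 ^ c0) | ((p1 ^ c1) << 1) | ((p2 ^ c2) << 2)
                         | ((p3 ^ c3) << 3) | (c4 << 4);
        return (uint8_t)sum;
    }

    int main(void) {
        printf("%u\n", add4_lookahead(9, 7, 0));   /* prints 16 */
        return 0;
    }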

Parallelism in hardware (scalar vs SIMD array operations). Source loop:

    for (i=0; i<n; i++) c[i] = a[i] + b[i];

Scalar code, executed n times:

    ld  r1, addr1
    ld  r2, addr2
    add r3, r1, r2
    st  r3, addr3

SIMD code, executed n/4 times (four 32-bit elements per vector register):

    ldv  vr1, addr1
    ldv  vr2, addr2
    addv vr3, vr1, vr2
    stv  vr3, addr3

[Figure: register file with 32-bit lanes feeding the adder.]
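
For comparison, a minimal sketch of the same loop written in C with SSE intrinsics (my own illustration, not from the slides). It assumes n is a multiple of 4 and uses unaligned loads and stores to keep it short.

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* c[i] = a[i] + b[i], four single-precision elements per instruction.
       Assumes n is a multiple of 4; a real version would add a scalar
       clean-up loop for the leftover elements. */
    void vadd_sse(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats from a */
            __m128 vb = _mm_loadu_ps(&b[i]);   /* load 4 floats from b */
            __m128 vc = _mm_add_ps(va, vb);    /* 4 additions at once */
            _mm_storeu_ps(&c[i], vc);          /* store 4 results to c */
        }
    }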

Parallelism in hardware (Multiprocessors) Multiprocessing is the characteristic that is most evident in clients and high-end machines.

Clients: Intel microprocessor performance Knights Ferry MIC co-processor (Graph from Markus Püschel, ETH)

High-end machines: Top 500 number 1

Research and development in parallelism has produced impressive achievements in hardware and software, but numerous challenges remain: in hardware (machine design, heterogeneity, power), in applications, and in software (determinacy, portability across machine classes, automatic optimization).

Issues in applications

Applications at the high end: numerous applications have been developed in a wide range of areas, including science, engineering, search engines, and experimental AI. Tuning for performance requires expertise. And although additional computing power is expected to help advances in science and engineering, it is not that simple:

More computational power is only part of the story “increase in computing power will need to be accompanied by changes in code architecture to improve the scalability, … and by the recalibration of model physics and overall forecast performance in response to increased spatial resolution” * “…there will be an increased need to work toward balanced systems with components that are relatively similar in their parallelizability and scalability”.* Parallelism is an enabling technology but much more is needed. *National Research Council: The potential impact of high-end capability computing on four illustrative fields of science and engineering. 2008

Applications for clients and mobile devices: a few cores can be justified to support execution of multiple applications, but beyond that, what app will drive the need for increased parallelism? New machines will improve performance by adding cores; in the new business model, software scalability is therefore needed to make new machines desirable. That requires an app that must be executed locally and requires increasing amounts of computation. Today, many applications ship computations to servers (e.g., Apple's Siri). Is that the future, or will bandwidth limitations force local computation?

Issues in libraries

Library routines offer easy access to parallelism and are already available in some libraries (e.g., Intel's MKL). They keep the same conventional programming style: parallel programs would look identical to today's programs, with parallelism encapsulated in library routines. But libraries are not always easy to use (data structures) and hence are not always used; locality across invocations is an issue; and, in fact, composing library calls for performance is not effective today.
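
To make the point concrete, here is a minimal sketch (my own illustration; the slides show no code) of parallelism hidden behind a library call. The call site is ordinary sequential C; a threaded BLAS such as MKL may execute the routine in parallel internally.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cblas.h>   /* standard CBLAS interface; MKL provides these routines as well */

    int main(void) {
        int n = 1000000;
        double *x = malloc(n * sizeof *x);
        double *y = malloc(n * sizeof *y);
        if (!x || !y) return 1;
        for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

        /* y = 3.0 * x + y.  Whether this runs on one core or many is the
           library's decision, invisible at the call site. */
        cblas_daxpy(n, 3.0, x, 1, y, 1);

        printf("y[0] = %f\n", y[0]);   /* expect 5.0 */
        free(x); free(y);
        return 0;
    }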

Implicit parallelism

Objective: compiling conventional code. The goal goes back to Illiac IV times: “The ILLIAC IV Fortran compiler's Parallelism Analyzer and Synthesizer (mnemonicized as the Paralyzer) detects computations in Fortran DO loops which can be performed in parallel.” (*) (*) David L. Presberg. 1975. The Paralyzer: Ivtran's Parallelism Analyzer and Synthesizer. In Proceedings of the Conference on Programming Languages and Compilers for Parallel and Vector Machines. ACM, New York, NY, USA, 9-16.

Benefits: the same conventional programming style (parallel programs would look identical to today's programs, with parallelism extracted by the compiler), machine independence, and compiler optimization of the program; an additional benefit is support for legacy codes. Much work has been done in this area over the past 40 years, mainly at universities; it was pioneered at Illinois in the 1970s.

The technology: dependence analysis is the foundation. It computes relations between statement instances; these relations are used to transform programs for locality (tiling), parallelism (vectorization, parallelization), communication (message aggregation), reliability (automatic checkpoints), power, and more.

The technology, example of use of dependence. Consider the loop

    for (i=1; i<n; i++) {
      for (j=1; j<n; j++) {
        a[i][j] = a[i][j-1] + a[i-1][j];
    }}

a nested loop with a single statement S1. In the trace on the next slides, each line corresponds to a statement instance of one j iteration.

The technology, example of use of dependence. Compute dependences (part 1): unroll the statement instances for the first iterations of the loop.

    i=1:  a[1][1] = a[1][0] + a[0][1]
          a[1][2] = a[1][1] + a[0][2]
          a[1][3] = a[1][2] + a[0][3]
          a[1][4] = a[1][3] + a[0][4]
    i=2:  a[2][1] = a[2][0] + a[1][1]
          a[2][2] = a[2][1] + a[1][2]
          a[2][3] = a[2][2] + a[1][3]
          a[2][4] = a[2][3] + a[1][4]

Each line is a statement instance of one j iteration (j = 1, 2, 3, 4).

The technology, example of use of dependence. Compute dependences (part 2): match each value written by one instance with the instances that read it. The value a[i][j] produced at iteration (i, j) is read at (i, j+1) through a[i][j-1] and at (i+1, j) through a[i-1][j]; for example, a[1][1] is written at (1,1) and read at (1,2) and (2,1).
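
A compact way to record what the analysis finds, written as comments on the loop (my own annotation, not from the slides):

    /* Dependences of a[i][j] = a[i][j-1] + a[i-1][j]:
         a[i][j-1] was written one j-iteration earlier  -> distance vector (0,1)
         a[i-1][j] was written one i-iteration earlier  -> distance vector (1,0)
       Both distances are lexicographically positive, so neither the i loop
       nor the j loop can be run in parallel as written; however, all
       instances on an anti-diagonal i + j = constant are independent. */
    for (i=1; i<n; i++)
      for (j=1; j<n; j++)
        a[i][j] = a[i][j-1] + a[i-1][j];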

The technology, example of use of dependence. [Figure: the iteration space of the loop, points (i, j) for i, j = 1, 2, 3, 4, ..., starting at (1,1), with dependence arcs from (i, j-1) and (i-1, j) into (i, j).]

The technology, example of use of dependence. Step 3, find parallelism: in

    for (i=1; i<n; i++) {
      for (j=1; j<n; j++) {
        a[i][j] = a[i][j-1] + a[i-1][j];
    }}

every instance depends only on instances with a smaller i + j, so the instances on each anti-diagonal i + j = constant can all execute in parallel.

The technology, example of use of dependence. Transform the code: the nest

    for (i=1; i<n; i++) {
      for (j=1; j<n; j++) {
        a[i][j] = a[i][j-1] + a[i-1][j];
    }}

becomes a sequential loop over wavefronts k = i + j with a parallel (forall) loop across each wavefront:

    for (k=4; k<2*n; k++)
      forall (i=max(2,k-n):min(n,k-2))
        a[i][k-i] = ...
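
A compilable rendering of this wavefront transformation, sketched in C with OpenMP (my own version, not from the slides; the bounds are adjusted to match the 1-based C loop above, where i and j run from 1 to n-1):

    #include <omp.h>

    /* Wavefront execution of a[i][j] = a[i][j-1] + a[i-1][j].
       All iterations with the same k = i + j depend only on iterations
       with a smaller k, so each anti-diagonal can run in parallel. */
    void wavefront(int n, double a[n][n]) {
        for (int k = 2; k <= 2 * (n - 1); k++) {            /* sequential over wavefronts */
            int lo = (k - (n - 1) > 1) ? k - (n - 1) : 1;   /* max(1, k-(n-1)) */
            int hi = (k - 1 < n - 1) ? k - 1 : n - 1;       /* min(n-1, k-1)   */
            #pragma omp parallel for
            for (int i = lo; i <= hi; i++) {
                int j = k - i;
                a[i][j] = a[i][j-1] + a[i-1][j];
            }
        }
    }

Compiled with, for example, gcc -fopenmp, the inner loop distributes each anti-diagonal across threads while the outer k loop preserves the dependences.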

How well does it work? It depends on three factors: the accuracy of the dependence analysis, the set of transformations available to the compiler, and the sequence in which the transformations are applied.

How well does it work? Our focus here is on vectorization. Vectorization is important: vector extensions such as SSE and AltiVec are of great importance, offer easy parallelism, and will continue to evolve. It is also the area with the longest experience and the widest use: practically all compilers have a vectorization pass (parallelization is less popular), and it is easier than parallelization or locality optimization. Finally, auto-vectorization is the best way to access vector extensions in a portable manner; the alternatives are assembly language or machine-specific macros.
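
As a small illustration (mine, not from the slides) of relying on the compiler instead of machine-specific macros: the loop below is plain, portable C that a vectorizing compiler can turn into SSE, AVX, or AltiVec code. The restrict qualifiers assert that the arrays do not overlap, which removes the aliasing that would otherwise block vectorization.

    /* Portable C that an auto-vectorizer (e.g., gcc -O3, which enables
       vectorization) can compile to SIMD code with no intrinsics. */
    void saxpy(int n, float alpha,
               const float *restrict x, float *restrict y) {
        for (int i = 0; i < n; i++)
            y[i] = alpha * x[i] + y[i];
    }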

How well does it work ? Vectorizers - 2005 G. Ren, P. Wu, and D. Padua: An Empirical Study on the Vectorization of Multimedia Applications for Multimedia Extensions. IPDPS 2005

How well does it work ? Vectorizers - 2010 S. Maleki, Y. Gao, T. Wong, M. Garzarán, and D. Padua. An Evaluation of Vectorizing Compilers. International Conference on Parallel Architecture and Compilation Techniques. PACT 2011.

Going forward: it is a great success story, and practically all compilers today have a vectorization pass (and a parallelization pass). But research in this area stopped a few years back, even though all compilers do vectorization and it is a very desirable capability. Some researchers thought that the problem was impossible to solve. However, the work has not been as extensive nor as sustained as the work done in AI for chess or question answering, so there is no doubt that significant advances are possible.

What next? 3-10-2011: "Inventor, futurist predicts dawn of total artificial intelligence. Brooklyn, New York (VBS.TV) -- ...Computers will be able to improve their own source codes ... in ways we puny humans could never conceive."

Explicit parallelism

Accomplishments of the last decades in programming notation: much has been accomplished, and there are widely used parallel programming notations for distributed memory (SPMD/MPI) and for shared memory (pthreads/OpenMP/TBB/Cilk/ArBB).

Languages: OpenMP constitutes an important advance, but its most important contribution was to unify the syntax of the 1980s (Cray, Sequent, Alliant, Convex, IBM, …). MPI has been extraordinarily effective. Both have mainly been used for numerical computing, and both are widely considered “low level”.
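
To show what “low level” means in practice, here is a small sketch (my own, not from the slides) of the same reduction written in the two dominant notations: OpenMP for shared memory and SPMD-style MPI for distributed memory.

    #include <mpi.h>

    /* Shared memory: one process, the loop iterations split among threads. */
    double sum_openmp(const double *x, int n) {
        double s = 0.0;
        #pragma omp parallel for reduction(+:s)
        for (int i = 0; i < n; i++)
            s += x[i];
        return s;
    }

    /* Distributed memory (SPMD): every rank owns a block of the data and
       sums it locally; MPI_Reduce combines the partial sums on rank 0.
       The caller is responsible for MPI_Init/MPI_Finalize and for
       distributing the data. */
    double sum_mpi(const double *local_x, int local_n) {
        double local_s = 0.0, global_s = 0.0;
        for (int i = 0; i < local_n; i++)
            local_s += local_x[i];
        MPI_Reduce(&local_s, &global_s, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        return global_s;   /* meaningful only on rank 0 */
    }

Even for this tiny computation the programmer manages threads or ranks, data distribution, and the combining step explicitly; that bookkeeping is what makes these notations feel low level compared with the array notation discussed next.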

The future: higher-level notations. Libraries are a higher-level solution, but perhaps too high level; we want something at a lower level that can still be used to program in parallel. The solution is to use abstractions.

Array operations in MATLAB: an example of such abstractions is array operations. They are not only appropriate for parallelism but also better represent computations; in fact, the first uses of array operations, e.g. Iverson's APL (ca. 1960), do not seem to have been related to parallelism. Array operations are also powerful higher-level abstractions for sequential computing. Today, MATLAB is a good example of language extensions for vector operations.

Array operations in MATLAB. Matrix addition in scalar mode:

    for i=1:m,
      for j=1:l,
        c(i,j) = a(i,j) + b(i,j);
      end
    end

Matrix addition in array notation:

    c = a + b;