Presentation transcript: CS 6068 Parallel Computing, Fall 2015 (Prof. Fred Annexstein)

1 CS 6068 Parallel Computing, Fall 2015
Prof. Fred Annexstein (@proffreda, fred.annexstein@uc.edu)
Office Hours: before class or by appointment
Tel: 513-556-1807
Meeting: Mondays 6:00-8:50 PM, Baldwin 661

2 Lecture 1: Welcome
– Goals of this course
– Syllabus, policies, grading
– Blackboard resources
– Introduction/motivation
– History and trends in high-performance computing
– Scope of the problems in parallel computing
– Personal supercomputing and massive parallelism on your cluster, desktop, laptop, phone, watch

3 Course Goals Learning Outcomes: Students will learn the computational thinking and programming skills needed to achieve terascale computing performance, applicable in all science and engineering disciplines. Students will learn algorithmic design patterns for parallel computing, critical system and architectural design issues, and programming methods and analysis for parallel computing software.

4 Workload and Grading Policy The grading for the class will be based on 1 or 2 exams, 6-8 lab projects, and a final project presentation and report. Each student’s final grade will be a weighted average: Exams and Quizzes 30%, Homework and Labs 40%, Final Presentation and Final Report 30%. Late homework will be considered on a case-by-case basis.

5 Final Projects Students will be expected to work in small teams on a final project. The work for the project can be divided up between team members in any way that works best for you and the group. However, all individual work and outcomes must be clearly documented and self-assessed. Each team will make a presentation involving a demonstration. Also, each team will submit a final report documenting all accomplishments.

6 Course Schedule The course calendar is subject to change. Part I focuses on high-level functional approaches for multicore computers; Part II focuses on CUDA as a running example for heterogeneous cores; Part III focuses on message passing on cluster computers and parallel application areas. In Part II of the course we will make extensive use of the project materials made available through Udacity course CS 344 (Labs 1-6). Please sign up for this course and download all the available course materials.

7 Course Schedule
Week 1: Introduction to Parallel Computing
Week 2: Parallel Systems and Architectures
Week 3: Parallel Algorithmic Models
Week 4: Parallel Algorithmic Problems
Week 5: Parallel Programming and Performance
Week 6: Advanced Memory Models
Week 7: Parallel Scheduling
Week 8: Load Balancing, Mapping, and Parallel Scalability Analysis
Week 9: Parallel Program Development
Week 10: Parallel Applications in Message Passing Systems
Week 11: Alternative and Future Architectures and Methods
Weeks 12-14: Final Project Presentations

8 Course Platform and Materials
Udacity videos, CS 344 (7-week online course): https://www.udacity.com/course/cs344
Recommended textbooks:
– CUDA by Example (Addison-Wesley)
– An Introduction to Parallel Programming by Peter Pacheco
– Programming on Parallel Machines by Norm Matloff (free online)
Additional notes will be made available on Blackboard; use of the Blackboard forums is encouraged.
Lab equipment:
– Your own hardware: Linux, Mac, or PC with a CUDA-enabled GPU
– Ohio Supercomputer Center accounts
– CUDA-based Tesla server in my office

9 Welcome to the Jungle: Current Trends in Computing
Reading assignments:
1) http://www.eecs.berkeley.edu/BEARS/presentations/06/Patterson.ppt
2) http://herbsutter.com/welcome-to-the-jungle/
Patterson's 3 walls: Power Wall + Memory Wall + ILP Wall = Brick Wall.
Sutter: "desktops to 'smartphones' are being permanently transformed into heterogeneous supercomputer clusters." It is one trend, not three: multicore, heterogeneous cores, and HaaS cloud computing are not three separate trends, but aspects of a single trend, putting a personal heterogeneous supercomputer cluster on every desk, in every home, and in every pocket.

10 Supercomputer Performance: www.top500.org

11 Power Constrained Parallelism

14 Shared versus Distributed Memory in Multicore Processors
Shared memory – Ex: Intel Core 2 Duo/Quad
– One copy of data shared among many cores
– Atomicity, locking, and synchronization are essential for correctness
– Many scalability issues
Distributed memory – Ex: Cell
– Cores primarily access local memory
– Explicit data exchange between cores
– Data distribution and communication orchestration are essential for performance

15 Programming Shared Memory Processors
– Processors 1…n ask for X; there is only one place to look
– Communication through shared variables
– Race conditions possible
– Use synchronization to protect against conflicts
– Change how data is stored to minimize synchronization
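
For example, here is a minimal CUDA sketch of the "race conditions possible, use synchronization" point: many threads update one shared location, and atomicAdd is what keeps the updates correct. The kernel name, the counter variable, and the launch configuration are illustrative choices, not course code.

    #include <cstdio>

    // Every thread adds 1 to the same shared counter. A plain
    // "*counter += 1" would be a race: reads and writes from different
    // threads interleave and updates are lost. atomicAdd serializes
    // each update so the final value is correct.
    __global__ void count_safe(int *counter)
    {
        atomicAdd(counter, 1);
    }

    int main()
    {
        int *d_counter;
        int h_counter = 0;

        cudaMalloc(&d_counter, sizeof(int));
        cudaMemcpy(d_counter, &h_counter, sizeof(int), cudaMemcpyHostToDevice);

        count_safe<<<256, 256>>>(d_counter);   // 65,536 threads, one shared variable

        cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
        printf("counter = %d (expected 65536)\n", h_counter);

        cudaFree(d_counter);
        return 0;
    }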

16 Classic Examples of Parallelization
● Data parallelism
– Perform the same computation but operate on different data
● Control parallelism
– A single process can fork multiple concurrent threads
– Each thread encapsulates its own execution path
– Each thread has local state and shared resources
– Threads communicate through shared resources such as global memory
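
As a rough CUDA illustration of the two styles (the kernel names, array sizes, and stream setup below are assumptions for the sketch, not slide content): each kernel is data parallel, since every thread runs the same code on its own element, while launching two different kernels into separate streams gives a control-parallel flavor, two independent execution paths proceeding concurrently.

    #include <cstdio>

    // Data parallelism: each thread applies the same operation to a different element.
    __global__ void scale(float *a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] *= 2.0f;
    }

    __global__ void offset(float *b, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) b[i] += 1.0f;
    }

    int main()
    {
        const int n = 1 << 20;
        float *a, *b;
        cudaMalloc(&a, n * sizeof(float));
        cudaMalloc(&b, n * sizeof(float));
        cudaMemset(a, 0, n * sizeof(float));
        cudaMemset(b, 0, n * sizeof(float));

        // Control parallelism (loosely): two independent pieces of work,
        // launched into separate streams, may execute concurrently.
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);
        scale<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
        offset<<<(n + 255) / 256, 256, 0, s2>>>(b, n);
        cudaDeviceSynchronize();

        printf("both kernels finished\n");
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFree(a);
        cudaFree(b);
        return 0;
    }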

17 Language-Based Approaches to Concurrency and Parallelism Concurrency is concerned with managing access to shared state from different threads; parallelism is concerned with utilizing multiple processors/cores to improve the performance of a computation. Several languages, including Python and Clojure, have improved the state of concurrent programming with rich concurrency primitives, and have done the same for multicore parallel programming. Clojure's original parallel processing function, pmap, will soon be joined by pvmap and pvreduce, based on Doug Lea's Java Fork/Join Framework.

18 Functional Approaches to Concurrency Programming with threads, which offer multiple access to shared, mutable data, is "something that is not understood by application programmers, cannot be understood by application programmers, will never be understood by application programmers." – Tim Bray (co-inventor of XML). Concurrency and imperative programming with shared state are very hard to reason about. Functional languages, in which (nearly) all data is immutable, are a promising approach. Clojure is a Lisp that runs on the Java Virtual Machine and compiles to straight Java bytecode, which makes it extremely fast.

19 Experiments in Clojure and Python using OSC Resources
How to get started in Clojure (my experiments): https://compuzzle.wordpress.com/2015/04/07/multi-threaded-parallel-codes-in-clojure
How to get started in Python: http://ipython.org/ipython-doc/dev/parallel/

20 Heterogeneous Computing Using CUDA CUDA is a platform for heterogeneous computing that leverages NVIDIA's popular GPUs. The CUDA Toolkit includes many sample codes that can be easily adapted, and many applications today are built with CUDA support. Assignment: Create a computing platform that includes Clojure, IPython, and CUDA. Using your multicore or heterogeneous platform, produce a rendering of a compute-intensive graphic image.

21 Suggestions for Assignment #1
Mandelbrot fractal in CUDA: http://docs.nvidia.com/cuda/cuda-samples/#mandelbrot
Deepdream in IPython: https://github.com/google/deepdream/blob/master/dream.ipynb
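
A minimal starting point for the Mandelbrot suggestion, assuming you write your own kernel rather than adapting the NVIDIA sample: each thread computes one pixel and the host writes a grey-scale PGM image. The image size, iteration count, and region of the complex plane are arbitrary choices here.

    #include <cstdio>

    // Each thread computes one pixel: iterate z = z^2 + c and record how
    // quickly |z| escapes. The escape count becomes the pixel's grey level.
    __global__ void mandelbrot(unsigned char *img, int w, int h, int max_iter)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h) return;

        float cr = -2.0f + 3.0f * x / w;     // map pixel to the complex plane
        float ci = -1.5f + 3.0f * y / h;
        float zr = 0.0f, zi = 0.0f;
        int it = 0;
        while (zr * zr + zi * zi < 4.0f && it < max_iter) {
            float t = zr * zr - zi * zi + cr;
            zi = 2.0f * zr * zi + ci;
            zr = t;
            ++it;
        }
        img[y * w + x] = (unsigned char)(255 * it / max_iter);
    }

    int main()
    {
        const int w = 1024, h = 1024, max_iter = 256;
        unsigned char *d_img;
        cudaMalloc(&d_img, w * h);

        dim3 block(16, 16);
        dim3 grid((w + 15) / 16, (h + 15) / 16);
        mandelbrot<<<grid, block>>>(d_img, w, h, max_iter);

        // Copy the image back and write a binary PGM file viewable on the host.
        unsigned char *h_img = new unsigned char[w * h];
        cudaMemcpy(h_img, d_img, w * h, cudaMemcpyDeviceToHost);
        FILE *f = fopen("mandelbrot.pgm", "wb");
        fprintf(f, "P5\n%d %d\n255\n", w, h);
        fwrite(h_img, 1, w * h, f);
        fclose(f);

        cudaFree(d_img);
        delete[] h_img;
        return 0;
    }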

22 Example of Physical Reality Behind GPU Processing

23 CUDA – CPU/GPU Integrated host+device app C program
– Serial or modestly parallel parts in host C code
– Highly parallel parts in device SPMD kernel C code
Execution alternates between host and device:
Serial code (host) ... then launch parallel kernel (device), thread grid 1: KernelA<<<nBlk, nTid>>>(args);
Serial code (host) ... then launch parallel kernel (device), thread grid 2: KernelB<<<nBlk, nTid>>>(args);
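
A hedged, minimal sketch of that host/device split; the slide does not define KernelA's body, so here it simply doubles an array, and the launch configuration (4 blocks of 256 threads) is an arbitrary choice.

    #include <cstdio>

    // Device SPMD kernel: the "highly parallel part".
    __global__ void KernelA(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main()
    {
        const int n = 1024;
        float h_data[n];
        for (int i = 0; i < n; ++i) h_data[i] = (float)i;   // serial host code

        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

        // Launch a thread grid of 4 blocks x 256 threads.
        KernelA<<<4, 256>>>(d_data, n);

        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("h_data[10] = %f\n", h_data[10]);            // back to serial host code

        cudaFree(d_data);
        return 0;
    }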

24 Arrays of Parallel Threads
A CUDA kernel is executed by an array of threads (figure: threads 0-7, indexed by threadID)
– All threads run the same code (SPMD)
– Each thread has an ID that it uses to compute memory addresses and make control decisions:
float x = input[threadID];
float y = func(x);
output[threadID] = y;
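
Here is the fragment above completed into a compilable kernel; "func" is a placeholder in the slide, so it just squares its argument here. Host-side allocation and launch follow the same pattern as the sketch under slide 23.

    // "func" stands in for whatever per-element computation the kernel performs.
    __device__ float func(float x) { return x * x; }

    __global__ void apply_func(const float *input, float *output, int n)
    {
        // Each thread derives a unique global ID from its block and thread
        // indices and uses it to choose the element it is responsible for.
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadID < n) {
            float x = input[threadID];
            float y = func(x);
            output[threadID] = y;
        }
    }

    // Example launch: enough 256-thread blocks to cover n elements.
    // apply_func<<<(n + 255) / 256, 256>>>(d_input, d_output, n);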

25 Thread Blocks: Scalable Cooperation
Divide the monolithic thread array into multiple blocks
– Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization
– Threads in different blocks cannot cooperate
(Figure: Thread Block 0, Thread Block 1, …, Thread Block N-1, each with threads 0-7 running the same per-thread code:)
float x = input[threadID];
float y = func(x);
output[threadID] = y;
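
To make "cooperate via shared memory and barrier synchronization" concrete, here is a sketch of a per-block sum reduction; it assumes 256-thread blocks (a power of two), and the partial sums land in global memory because blocks cannot cooperate directly.

    // Threads within one block cooperate through on-chip __shared__ memory
    // and __syncthreads() barriers; blocks only communicate by writing
    // their partial results to global memory.
    __global__ void block_sum(const float *input, float *block_results, int n)
    {
        __shared__ float scratch[256];            // one slot per thread in the block

        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        scratch[tid] = (gid < n) ? input[gid] : 0.0f;
        __syncthreads();                          // wait until every load is done

        // Tree reduction inside the block.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                scratch[tid] += scratch[tid + stride];
            __syncthreads();                      // barrier after every step
        }

        if (tid == 0)
            block_results[blockIdx.x] = scratch[0];   // one partial sum per block
    }

    // Example launch: block_sum<<<(n + 255) / 256, 256>>>(d_input, d_partial, n);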

26 Learning CUDA with Udacity cs344 https://www.udacity.com/course/cs344 https://www.udacity.com/wiki/cs344 https://github.com/udacity/cs344

