SEC(R) 2008
Intel® Concurrent Collections for C++: a model for parallel programming
Nikolay Kurtov, Software and Services Group
October 23, 2008

Agenda
- Existing parallel programming models
- Key concepts, Blackscholes example
- Performance results

Parallel programming is important
- The number of multi-core machines is growing
- Developers want to fully exploit the architecture's capabilities
But parallel programming is hard:
- Users must reason about parallelism and thread synchronization
- Parallelism is embedded in serial languages, which invites data overwriting and arbitrary serialization
- Performance tuning depends on the platform

Parallel Programming Models
Improve the productivity of programming:
- Hide low-level details
- Provide high-level abstractions
The following models are very popular:
- OpenMP
- Cilk
- Intel® Threading Building Blocks

OpenMP
Perfect for data-parallel algorithms. The basics are easy to apply:

#pragma omp parallel for
for (int i = 0; i < N; i++)
    doSomething(i);

Advanced usage is complicated and error-prone, and OpenMP requires compiler support.

Cilk
The programmer identifies elements that can safely be executed in parallel, spawning tasks explicitly and synchronizing with barriers:

int fib(int n) {
    if (n < 2) return n;
    int x = cilk_spawn fib(n-1);
    int y = cilk_spawn fib(n-2);
    cilk_sync;
    return x + y;
}

Intel® Threading Building Blocks
- Implemented as a C++ library
- Requires solid knowledge of C++
- Provides excellent high-level abstractions
- Provides basic parallel algorithms:
  − parallel_for
  − parallel_sort
  − parallel_while
  − parallel_reduce
  − parallel_do
  − parallel_scan
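
For comparison with the OpenMP loop shown earlier, here is a minimal sketch (not from the original slides) of the same data-parallel loop written with TBB's classic parallel_for; it assumes doSomething is defined as before and leaves chunking to TBB's default partitioner.

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

void doSomething(int i);  // assumed to exist, as in the OpenMP example

// Function object that TBB applies to each subrange it carves out of [0, N)
struct DoSomethingBody {
    void operator()(const tbb::blocked_range<int>& r) const {
        for (int i = r.begin(); i != r.end(); ++i)
            doSomething(i);
    }
};

void runParallel(int N) {
    tbb::parallel_for(tbb::blocked_range<int>(0, N), DoSomethingBody());
}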

Existing models: summary
- The programmer explicitly expresses parallelism
- The algorithm is described imperatively
- Many low-level questions are solved by the programmer
- Good control over performance

Agenda
- Existing parallel programming models
- Key concepts, Blackscholes example
- Performance results

Ideal parallel programming model: separation of concerns
- The application problem (serial code, semantic correctness) belongs to the Domain Expert: a person with only domain knowledge and no tuning knowledge
- The Intel® Concurrent Collections side (architecture, actual parallelism, load balancing, distribution among processors) belongs to the Tuning Expert: a person, a runtime, or static analysis, with only tuning knowledge and no domain knowledge

How people think about their application
- What are the high-level operations?
- What are the chunks of data?
- What are the producer/consumer relationships?
- What are the inputs and outputs?
Blackscholes is a data-parallel application: it solves an equation independently for each set of parameters.
(Diagram: [Parameters] -> (Solve) -> [Result])

Intel® Concurrent Collections key concepts
- Step: a single high-level operation
- Item: a single data element
- Tag: an identifier of a step or an item
- Inputs/Outputs: items or tags produced or consumed by the environment

Textual Graph Representation

// Declarations
<SolveTags: int n>;
[OptionData* Parameters: int n];
[float Result: int n];

// Step prescription
<SolveTags> :: (Solve);

// Step execution
[Parameters] -> (Solve) -> [Result];

// Input from the environment:
// initialize all tags and data
env -> <SolveTags>, [Parameters];

// Output to the environment
[Result] -> env;

Graph definition translator
- Translates a graph definition into the declaration of a class
- The generated class contains properly named item collections, tag collections and step collections
- Generates a coding-hints file: a template for defining the steps
- Checks the correctness of the graph

class blackscholes_graph_t : public Graph_t {
public:
    ItemCollection_t Parameters;
    ItemCollection_t Result;
    TagCollection_t SolveTags;
    StepCollection_t SolveStepCollection;
    ...
};

Items
- Stored in item collections in the graph
- Put stores an item and associates it with a tag
- Get accesses an item by its tag
- Items are immutable

Tags
- Identifiers of steps
- Steps are prescribed by tags
- Put stores a tag and instantiates the prescribed steps
- The same tag is passed to each instantiated step

Specifying Computation

StepReturnValue_t Solve(
    blackscholes_graph_t& graph,
    const Tag_t& step_tag)
{
    OptionData* data = graph.Parameters.Get(step_tag);
    float result = solveEquation(data);
    graph.Result.Put(step_tag, result);
    return CNC_Success;
}

Using the graph in your C++ application

blackscholes_graph_t my_graph;
for (int i = 0; i < N; i++) {
    my_graph.SolveTags.Put(Tag_t(i));
    my_graph.Parameters.Put(Tag_t(i), data[i]);
}
my_graph.run();
for (int i = 0; i < N; i++) {
    float result = my_graph.Result.Get(Tag_t(i));
    std::cout << result << std::endl;
}

Steps Rescheduling
A step may begin execution before its input items are available. In that case it is rescheduled and restarted from the beginning when the corresponding item is added to the collection.
(Diagram: an image-processing graph with items [Image: k], [Block: i, j], [Result: i, j], tags <Image Tag: k>, <Block Tag: i, j>, and steps (Split: k), (Process: i, j))
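
As an illustration (not from the original slides), a Process step for this graph might look as follows, written in the style of the Solve step shown earlier; image_graph_t, BlockData and processBlock are hypothetical names.

StepReturnValue_t Process(image_graph_t& graph, const Tag_t& step_tag)
{
    // If [Block: i, j] has not been Put yet, this Get fails; the runtime
    // abandons the current execution and re-runs the step from the
    // beginning once the item arrives.
    BlockData* block = graph.Block.Get(step_tag);
    graph.Result.Put(step_tag, processBlock(block));
    return CNC_Success;
}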

Constraints required of the application
1. Steps have no side effects
2. Steps call Gets before any Puts
3. Steps call Gets before allocating any memory
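
To see why constraints 2 and 3 matter, here is the Solve step from the "Specifying Computation" slide again, annotated as a sketch: because every Get precedes any Put or allocation, a restart triggered by a missing item cannot publish a result twice or leak memory.

StepReturnValue_t Solve(blackscholes_graph_t& graph, const Tag_t& step_tag)
{
    // Gets come first: a missing item reschedules the step here, before
    // anything observable (a Put) or unreclaimable (an allocation) happens.
    OptionData* data = graph.Parameters.Get(step_tag);
    float result = solveEquation(data);  // no side effects (constraint 1)
    graph.Result.Put(step_tag, result);  // Puts only after all Gets succeed
    return CNC_Success;
}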

Benefits of using Intel® Concurrent Collections
- Improves programming productivity:
  − Only serial code
  − No knowledge of parallel technologies required
  − Determinism
  − Race-free execution
- Portability
- Scalability
- Expert-tuning system

Summary: how to write an application using Intel® Concurrent Collections
1. Draw the algorithm on a chalkboard
2. Define the data structures
3. Represent the algorithm in the textual notation
4. Implement the high-level operations in C++
5. Instantiate a graph and run it

Agenda
- Existing parallel programming models
- Key concepts, Blackscholes example
- Performance results

Blackscholes benchmark
- Solving a single set of parameters takes fewer than 500 CPU instructions
- Steps should therefore be grouped to reduce overhead and improve cache locality, as sketched below
- Automatic grain selection is an area for future research
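
One way to do such grouping by hand is sketched here, assuming a step is free to interpret its tag as the start of a block of parameter sets; GRAIN is a hypothetical value a tuning expert would choose.

const int GRAIN = 1000;  // assumed block size, tuned per platform

// Prescribe one Solve step per block of parameter sets instead of one per
// set; Parameters are still Put per element, exactly as shown earlier.
for (int i = 0; i < N; i += GRAIN)
    my_graph.SolveTags.Put(Tag_t(i));

// Inside Solve, the step would then loop over its whole block,
//     for (int j = tag; j < std::min(tag + GRAIN, N); j++) { Get, solve, Put }
// amortizing per-step scheduling overhead and improving cache locality.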

Dedup benchmark
- The algorithm is a pipeline
- The last pipeline stage is serial
- The "Steps Priorities" feature makes Dedup run 1.4 times faster

Possible model improvements
- Memory management
- Garbage collection
- Automatic grain selection
- Streaming data input

Getting More Information
Intel® Concurrent Collections for C/C++ on WhatIf.intel.com: intel-concurrent-collections-for-cc

Questions & Answers

Thank you!