Recognizing Potential Parallelism
Introduction to Parallel Programming, Part 1
Copyright © 2008 Intel Corporation.
What Is Parallel Computing?
An attempt to speed up the solution of a particular task by
  1. Dividing the task into sub-tasks
  2. Executing the sub-tasks simultaneously on multiple processors
Successful attempts require both
  1. An understanding of where parallelism can be effective
  2. Knowledge of how to design and implement good solutions
Clock Speeds Have Flattened Out
- Problems caused by higher speeds
  - Excessive power consumption
  - Heat dissipation
  - Current leakage
- Power consumption critical for mobile devices
- Mobile computing platforms increasingly important
  - Retail laptop sales now exceed desktop sales
  - Laptops may be 35% of PC market in 2007
Multi-core Architectures
- Potential performance = CPU speed × # of CPUs
- Strategy:
  - Limit CPU speed and sophistication
  - Put multiple CPUs (“cores”) on a single chip
- Potential performance the same
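The formula on this slide can be checked with a little arithmetic. The sketch below uses made-up clock speeds and core counts (they are illustrative assumptions, not figures from the course) to show how a slower dual-core chip can match the potential performance of a faster single-core chip.

```python
def potential_performance(clock_ghz: float, cores: int) -> float:
    """Potential performance as defined on the slide: CPU speed x number of CPUs."""
    return clock_ghz * cores

# Hypothetical figures chosen only for illustration.
single_fast = potential_performance(clock_ghz=4.0, cores=1)
dual_slow = potential_performance(clock_ghz=2.0, cores=2)

print(single_fast, dual_slow)  # 4.0 4.0
```

This is potential performance only: a program reaches it only if its work can actually be spread across both cores.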
Concurrency vs. Parallelism
- Concurrency: two or more threads are in progress at the same time
- Parallelism: two or more threads are executing at the same time
  - Multiple cores needed
[Figure: timelines of Thread 1 and Thread 2 for each case]
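A toy model may make the distinction easier to see. The sketch below does not start real threads; it only builds a timeline of which thread occupies a core at each time step. The function names and the idea of equal "time slices" are assumptions made for illustration.

```python
def one_core_schedule(threads, slices_each):
    """Concurrency on one core: threads take turns, one running per time step."""
    schedule = []
    for _ in range(slices_each):
        for name in threads:
            schedule.append([name])
    return schedule

def multi_core_schedule(threads, slices_each):
    """Parallelism with one core per thread: every thread runs in every time step."""
    return [list(threads) for _ in range(slices_each)]

threads = ["Thread 1", "Thread 2"]
print(len(one_core_schedule(threads, 3)))    # 6 time steps
print(len(multi_core_schedule(threads, 3)))  # 3 time steps
```

In both schedules the two threads are in progress over the same period, but only the multi-core schedule has them executing in the same time step.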
Improving Performance
- Use parallelism in order to improve turnaround or throughput
- Examples
  - Automobile assembly line
    - Each worker does an assigned function
  - Searching for pieces of Skylab
    - Divide up area to be searched
  - US Postal Service
    - Post office branches, mail sorters, delivery
Turnaround
- Complete single task in the smallest amount of time
- Example: Setting a dinner table
  - One to put down plates
  - One to fold and place napkins
  - One to place utensils
  - One to place glasses
Throughput
- Complete more tasks in a fixed amount of time
- Example: Setting up banquet tables
  - Multiple waiters each do separate tables
  - Specialized waiters for plates, glasses, utensils, etc.
Methodology
- Study problem, sequential program, or code segment
- Look for opportunities for parallelism
- Try to keep all processors busy doing useful work
Ways of Exploiting Parallelism
- Domain decomposition
- Task decomposition
Domain Decomposition
- First, decide how data elements should be divided among processors
- Second, decide which tasks each processor should be doing
- Example: Vector addition
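The vector addition example can be sketched as follows. This is a minimal illustration, not the course's implementation: the helper names `chunk_bounds` and `parallel_vector_add` are hypothetical, and a thread pool stands in for "processors". In standard CPython builds, threads generally do not run Python bytecode in parallel, so the code shows the structure of the decomposition rather than a guaranteed speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_bounds(n, workers):
    """Step 1: split range(n) into contiguous, nearly equal [start, stop) blocks."""
    base, extra = divmod(n, workers)
    bounds, start = [], 0
    for w in range(workers):
        stop = start + base + (1 if w < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

def parallel_vector_add(a, b, workers=4):
    """Step 2: every worker performs the same task (add) on its own block."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    result = [0] * len(a)

    def add_block(block):
        start, stop = block
        for i in range(start, stop):
            result[i] = a[i] + b[i]  # each index is written by exactly one worker

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any worker exception
        list(pool.map(add_block, chunk_bounds(len(a), workers)))
    return result

print(parallel_vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))
# [11, 22, 33, 44, 55]
```

Because the blocks do not overlap, no two workers ever touch the same element, which is what makes this data set a good fit for domain decomposition.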
Domain Decomposition
- Large data sets whose elements can be computed independently
- Divide data and associated computation among threads
- Example: Grading test papers
  - Multiple graders with same key
  - What if different keys are needed?
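The grading analogy maps onto code fairly directly. In the sketch below, the answer key, the papers, and the `grade` function are all invented for illustration; the only point is that every grader applies the same key to a different paper, and no paper's score depends on another's.

```python
from concurrent.futures import ThreadPoolExecutor

ANSWER_KEY = ["b", "d", "a", "c"]  # hypothetical key shared by every grader

def grade(paper):
    """Score one paper against the shared key; papers do not affect each other."""
    return sum(given == correct for given, correct in zip(paper, ANSWER_KEY))

papers = [
    ["b", "d", "a", "c"],
    ["b", "a", "a", "c"],
    ["c", "d", "b", "a"],
]

# Three graders, each handed papers from the same pile.
with ThreadPoolExecutor(max_workers=3) as graders:
    scores = list(graders.map(grade, papers))

print(scores)  # [4, 3, 1]
```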
Domain Decomposition
- Find the largest element of an array
[Figure: the array with Core 0, Core 1, Core 2, and Core 3]
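One common way to find the largest element with four cores is to give each core a slice of the array, let it find its local maximum, and then take the maximum of those results. The sketch below assumes that approach (the slide's animation is not recoverable, so this may differ in detail); `parallel_max` is a hypothetical name, and a thread pool stands in for the four cores.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_max(values, workers=4):
    """Find the largest element by combining per-chunk maxima."""
    if not values:
        raise ValueError("cannot take the maximum of an empty array")
    workers = min(workers, len(values))
    size = -(-len(values) // workers)  # ceiling division: elements per chunk
    chunks = [values[i:i + size] for i in range(0, len(values), size)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        local_maxima = list(pool.map(max, chunks))  # one local result per chunk

    return max(local_maxima)  # final sequential combine step

data = [3, 9, 2, 7, 11, 4, 8, 1, 6, 5, 10, 0]
print(parallel_max(data))  # 11
```

The final combine step is sequential, but it only looks at one value per worker, so it stays small no matter how large the array is.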
Task (Functional) Decomposition
- First, divide tasks among processors
- Second, decide which data elements are going to be accessed (read and/or written) by which processors
- Example: Event-handler for GUI
Task Decomposition
- Divide computation based on natural set of independent tasks
- Assign data for each task as needed
- Example: Paint-by-Numbers
  - Painting a single color is a single task
  - Number of tasks = number of colors
  - Two artists: one does even, other odd
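The paint-by-numbers example can be sketched in a few lines. The picture below (region names and color numbers) is invented for illustration, and `paint_colors` is a hypothetical helper; two workers play the two artists, one taking the even-numbered colors and the other the odd-numbered ones.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical picture: region name -> color number printed on the canvas.
regions = {"sky": 1, "sun": 2, "grass": 3, "tree": 4, "roof": 5, "door": 2}

def paint_colors(parity):
    """One artist: paint every region whose color number has the given parity."""
    return {
        name: f"color {number}"
        for name, number in regions.items()
        if number % 2 == parity
    }

# Two artists: one does the even colors (parity 0), the other the odd (parity 1).
with ThreadPoolExecutor(max_workers=2) as artists:
    even_part, odd_part = artists.map(paint_colors, [0, 1])

painted = {**even_part, **odd_part}
print(sorted(painted))  # every region appears exactly once
```

Here the work is divided by task (which colors to paint) first, and the data each artist touches follows from that choice.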
Task Decomposition
[Figure: tasks f(), s(), r(), q(), h(), g() with Core 0, Core 1, and Core 2]
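The figure for this slide is not recoverable from the text, so the sketch below invents a dependency structure among the six functions purely for illustration: `f` runs first, then `g`, `h`, and `q` can run at the same time on three workers, `r` combines the results of `g` and `h`, and `s` combines `r` and `q`. The function bodies are placeholders as well.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder tasks; the bodies and the dependencies between them are assumptions.
def f():
    return 1

def g(x):
    return x + 1

def h(x):
    return x * 10

def q(x):
    return x - 1

def r(x, y):
    return x + y

def s(x, y):
    return (x, y)

def run_tasks():
    with ThreadPoolExecutor(max_workers=3) as cores:
        f_out = cores.submit(f).result()       # stage 1: everything waits on f

        g_fut = cores.submit(g, f_out)         # stage 2: three independent tasks,
        h_fut = cores.submit(h, f_out)         # one per worker
        q_fut = cores.submit(q, f_out)

        r_out = cores.submit(r, g_fut.result(), h_fut.result()).result()  # needs g and h
        return cores.submit(s, r_out, q_fut.result()).result()            # needs r and q

print(run_tasks())  # (12, 0)
```

Only the middle stage keeps all three workers busy; the first and last stages are limited by the dependencies, which is typical of task decomposition.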
Recognizing Sequential Processes
- Time is inherently sequential
  - Dynamics and real-time, event-driven applications are often difficult to parallelize effectively
  - Many games fall into this category
- Iterative processes
  - The results of an iteration depend on the preceding iteration
  - Audio encoders fall into this category
- Pregnancy is inherently sequential
  - Adding more people will not shorten gestation
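The "iterative processes" case shows up in code as a loop whose body reads the value written by the previous pass. The update rule below is an arbitrary example chosen for illustration; any rule of the form "new value computed from old value" behaves the same way.

```python
def iterate(x0, steps):
    """Each pass needs the x produced by the pass before it."""
    x = x0
    for _ in range(steps):
        x = 0.5 * x + 1.0  # cannot start until the previous x is known
    return x

print(iterate(0.0, 3))  # 1.75
```

Splitting this loop's iterations across cores does not help as written, because iteration k cannot begin until iteration k-1 has finished. Compare this with vector addition, where every element can be computed without looking at any other.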
Summary
- Clock speeds will not increase dramatically
- Parallelism takes full advantage of multi-core processors
- Improve application turnaround or throughput
- Two methods for implementing parallelism
  - Domain Decomposition
  - Task Decomposition