Extending the Unified Parallel Processing Speedup Model


Extending the Unified Parallel Processing Speedup Model
Computer architectures take advantage of low-level parallelism: multiple pipelines. The next generations of integrated circuits will continue to support increasing numbers of transistors. How can the additional transistors be used efficiently? Answer: parallelism beyond multiple pipelines, by adding multiple processors or processing components in a single chip or single package. At each level of parallelism, performance suffers from the law of diminishing returns outlined by Amdahl. Incorporating multiple levels of parallelism results in higher overall performance and efficiency.

Presentation Content
A discussion of practical and theoretical parallel speedup, alternative methods, and the efficient use of hardware/processing resources in capturing speedup:
- Parallel speedup / Amdahl's Law, scaled speedup
- Pipelined processors
- Multiprocessors and multicomputers
- Multiple concurrent threads
- Multiple concurrent processes
- Multiple levels of parallelism with integrated chips/packages that combine microcontrollers with Digital Signal Processing (DSP) chips

Presentation Summary
- Architects and chip manufacturers are integrating additional levels of parallelism.
- Multiple levels of speedup achieve higher speedups and greater efficiencies than increasing hardware at a single parallel level.
- A balanced approach achieves about the same level of efficiency, in the cost of hardware resources allocated, in delivering parallel speedup at each level of parallelism.
- Numerous architectural approaches are possible, each with different trade-offs and performance returns.
- Current technology is integrating DSP processing with microcontroller functionality, achieving up to three levels of parallelism.

Classic Model of Parallel Processing
- Multiple processors available (4)
- A process can be divided into serial and parallel portions
- The parallel parts are executed concurrently
- Serial time: 10 time units; parallel time: 4 time units
- S: serial (non-parallel) portion
- A: all A parts can be executed concurrently
- B: all B parts can be executed concurrently
- All A parts must be completed prior to executing the B parts
An example parallel process of time 10: S A A A A B B B B S.
- Executed on a single processor: all ten parts in sequence, 10 time units.
- Executed in parallel on 4 processors: S, then the four A parts together, then the four B parts together, then S, for 4 time units.
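The classic model above can be checked with a small sketch. The task string, unit costs, and function names below are illustrative assumptions, not from the slides; each letter takes one time unit, serial units run one at a time, and a run of identical parallel units is spread across the processors.

```python
import math
from itertools import groupby

def serial_time(tasks):
    """Time on a single processor: one time unit per task."""
    return len(tasks)

def parallel_time(tasks, processors):
    """Time when identical adjacent parallel tasks run concurrently.

    'S' tasks are serial (one at a time); each run of identical
    parallel tasks ('A' or 'B') is spread over the processors.
    """
    total = 0
    for label, group in groupby(tasks):
        count = len(list(group))
        if label == 'S':
            total += count                          # serial units, one by one
        else:
            total += math.ceil(count / processors)  # concurrent units
    return total

tasks = list("SAAAABBBBS")
print(serial_time(tasks))        # 10 time units on one processor
print(parallel_time(tasks, 4))   # 4 time units on four processors
```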

Amdahl’s Law (Analytical Model)
- Analytical model of parallel speedup from the 1960s
- The parallel fraction (α) is run over n processors, taking α/n time
- The part that must be executed serially (1-α) gets no speedup
- Speedup = 1 / ((1-α) + α/n)
- Overall performance is limited by the fraction of the work that cannot be done in parallel (1-α): diminishing returns with increasing processors (n)
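Amdahl's Law is a one-line function; the sketch below (function name and example values are mine, not the slides') makes the diminishing returns concrete.

```python
def amdahl_speedup(alpha, n):
    """Amdahl's Law: speedup when the parallel fraction alpha of a workload
    runs on n processors and the serial fraction (1 - alpha) does not scale."""
    return 1.0 / ((1.0 - alpha) + alpha / n)

# Diminishing returns: with alpha = 0.9, going from 10 to 100 processors
# gains far less than the first factor of 10 did.
print(amdahl_speedup(0.9, 10))    # ≈ 5.26
print(amdahl_speedup(0.9, 100))   # ≈ 9.17, never above 1/(1-alpha) = 10
```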

Pipelined Processing
- Single processor enhanced with discrete stages
- Instructions “flow” through pipeline stages
- Parallel speedup with multiple instructions being executed (by parts) simultaneously
- Realized speedup is partly determined by the number of stages: 5 stages = at most 5 times faster
- Stages: F (instruction fetch), D (instruction decode), OF (operand fetch), EX (execute), WB (write back / result store)
- The processor clock/cycle is divided into sub-cycles; each stage takes one sub-cycle

Pipeline Performance
- Speedup is serial time (nS) over parallel time
- Performance is limited by the number of pipeline flushes due to jumps; speculative execution and branch prediction can minimize pipeline flushes
- Performance is also reduced by pipeline stalls (s), due to conflicts over bus access, data-not-ready delays, and other sources
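The slide's exact formula was lost in transcription, so the sketch below is a hedged approximation rather than the presenter's model: an S-stage pipeline executing n instructions in S + (n - 1) cycles, where each flush refills the pipeline and each stall costs one sub-cycle. Function name and penalty accounting are my assumptions.

```python
def pipeline_speedup(n, stages, flushes=0, stalls=0):
    """Approximate speedup of an S-stage pipeline over serial execution."""
    serial_cycles = n * stages                  # nS: one instruction at a time
    pipelined_cycles = stages + (n - 1)         # ideal pipelined execution
    pipelined_cycles += flushes * (stages - 1)  # each flush refills the pipe
    pipelined_cycles += stalls                  # each stall wastes a sub-cycle
    return serial_cycles / pipelined_cycles

print(pipeline_speedup(1000, 5))                          # ≈ 4.98, near the 5x bound
print(pipeline_speedup(1000, 5, flushes=100, stalls=50))  # penalties cut speedup
```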

Super-Scalar: Multiple Pipelines
- Concurrent execution of multiple sets of instructions
- Example: simultaneous execution of instructions through an integer pipeline while processing instructions through a floating-point pipeline
- Compiler: identifies and specifies separate instruction sets for concurrent execution through different pipes

Algorithm/Thread Level Parallelism
Example: algorithms to compute the Fast Fourier Transform (FFT), used in Digital Signal Processing (DSP):
- Many separate computations in parallel (high degree of parallelism)
- Large exchange of data: much communication between processors
- Fine-grained parallelism
- Communication time (latency) may be a consideration if multiple processors are combined on a board or motherboard
- A large communication load (fine-grained parallelism) can force the algorithm to become bandwidth-bound rather than computation-bound

Simple Algorithm/Thread Parallelism Model
- Parallel “threads of execution”: could be a separate process, or a multi-threaded process
- Each thread of execution obeys Amdahl’s parallel speedup model
- Multiple concurrently executing processes result in multiple serial components executing concurrently: another level of parallelism
Observe that the serial parts of Program 1 and Program 2 are now running in parallel with each other. Each program would take 6 time units on a uniprocessor, for a total workload serial time of 12. Each has a speedup of 1.5. The total speedup is 12/4 = 3, which is also the sum of the program speedups.

Multiprocess Speedup
- Concurrent execution of multiple processes
- Each process is limited by Amdahl’s parallel speedup
- Multiple concurrently executing processes result in multiple serial components executing concurrently: another level of parallelism
- Avoids Degree of Parallelism (DOP) speedup limitations
- Linear scaling up to machine limits of processors and memory: N × single-process speedup
Timing comparison for two S A A B B S programs: no speedup on a uniprocessor, 12 t; the two programs run one at a time, each internally parallelized, 8 t (speedup = 1.5); both processes run concurrently, 4 t (speedup = 3).

Algorithm/Thread Parallelism - Analytical Model
Multi-process/thread speedup, in two variants (one for similar and one for dissimilar concurrent processes), where:
- α = fraction of work that can be done in parallel
- n = number of processors
- N = number of concurrent (assumed similar, or in the second variant dissimilar) processes or threads
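The formulas on this slide were images lost in transcription. For the similar-process case, one plausible reconstruction, consistent with the 12/4 = 3 example two slides back, is N / ((1-α) + αN/n): N serial fractions overlap each other while the parallel fractions share the n processors. This is a hedged guess at the presenter's model, not a confirmed formula.

```python
def multiprocess_speedup(alpha, n, N):
    """Hedged reconstruction: N similar processes share n processors, so each
    process's parallel fraction alpha runs on n/N processors, while the N
    serial fractions execute concurrently with one another."""
    return N / ((1.0 - alpha) + alpha * N / n)

# The slides' example: each program is S A A B B S (alpha = 4/6).
# One program alone can only use 2 processors (DOP = 2): speedup 1.5.
print(multiprocess_speedup(4/6, 2, 1))  # 1.5
# Two programs together use all 4 processors: speedup 3.0.
print(multiprocess_speedup(4/6, 4, 2))  # 3.0
```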

(Simple) Unified Model with Scaled Speedup
Adds a scaling factor on parallel work, while holding serial work constant:
- k1 = scaling factor on the parallel portion
- α = fraction of work that can be done in parallel
- n = number of processors
- N = number of concurrent (assumed dissimilar) processes or threads
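This slide's formula was also an image lost in transcription. The sketch below is an assumed form only: the parallel work is multiplied by k1 while serial work stays fixed, layered on a reconstructed N-process Amdahl model. Every term here is my reading, not the presenter's published model.

```python
def scaled_speedup(alpha, n, N, k1):
    """Assumed scaled-speedup form: serial work (1 - alpha) held constant,
    parallel work scaled by k1, N concurrent processes sharing n processors.
    Speedup = scaled serial time over scaled parallel time, times N."""
    scaled_serial_time = (1.0 - alpha) + k1 * alpha
    scaled_parallel_time = (1.0 - alpha) + k1 * alpha * N / n
    return N * scaled_serial_time / scaled_parallel_time

# With k1 = 1 this reduces to the unscaled multi-process model; as k1 grows,
# parallel work dominates and speedup approaches the processor count n.
print(scaled_speedup(0.9, 16, 4, 1))    # ≈ 12.3, unscaled baseline
print(scaled_speedup(0.9, 16, 4, 100))  # ≈ 15.9, approaching n = 16
```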

Capturing Multiple Levels of Parallelism
- Most parallelism suffers from diminishing returns, resulting in limited scalability.
- Allocating hardware resources across multiple levels of parallelism lets each level operate at the efficient end of its speedup curve.
- Manufacturers of microcontrollers are integrating multiple levels of parallelism on a single chip.

Trend in Microprocessor Architectures
Three levels of parallelism:
1. Intra-instruction parallelism: pipelines
2. Instruction-level parallelism: super-scalar (multiple pipelines)
3. Algorithm/thread parallelism: multiple processing elements; integrated DSP with microcontroller; enhanced microcontroller that does DSP; enhanced DSP processor that also functions as a microcontroller
Architectural variations:
- DSP and microcontroller cores on the same chip
- DSP that also acts as a microprocessor
- Microprocessor that also does DSP
- Multiprocessor
Each variation captures some speedup from all three levels, with varying amounts of speedup from each. Each parallel level operates more efficiently than if all hardware resources were allocated to a single parallel level.

More Levels of Parallelism Outside the Chip
- Multiple processors in a box: on a motherboard, or on a back-plane with daughter-boards
- Shared-memory multiprocessors: communication is through shared memory
- Clustered multiprocessors: another hierarchical level; processors are grouped into clusters; intra-cluster and inter-cluster connections can each be bus or network
- Distributed multicomputers: multiple computers loosely coupled through a network
- n-tiered architectures: modern client/server architectures

Speedup of Client-Server, 2-Tier Systems
- σ = workload balance, the fraction of the workload on the clients
- σ = 1 (100%): completely distributed
- σ = 0 (0%): completely centralized
- n clients, m servers
(Figure: n clients and m servers connected by a LAN and the Internet.)
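The speedup formula for this slide was an image lost in transcription. A hedged, assumed form consistent with the two boundary cases: the client fraction σ of the workload is spread across n clients and the remainder across m servers, so speedup is 1 / (σ/n + (1-σ)/m). The function name and example values are mine.

```python
def two_tier_speedup(sigma, n, m):
    """Assumed 2-tier model: fraction sigma of the workload is distributed
    across n clients; the remaining (1 - sigma) is handled by m servers."""
    return 1.0 / (sigma / n + (1.0 - sigma) / m)

print(two_tier_speedup(1.0, 100, 4))  # fully distributed: speedup = n = 100
print(two_tier_speedup(0.0, 100, 4))  # fully centralized: speedup = m = 4
print(two_tier_speedup(0.5, 100, 4))  # ≈ 7.7 for an even workload split
```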

Speedup of Client-Server, n-Tier Systems
- m1 level-1 machines (clients); m2 machines at tier 2, m3 at tier 3, etc.
- σ1 = workload balance, the fraction of the workload on the clients; σ2 = fraction of the workload on tier 2; σ3 = fraction on tier 3, etc.
(Figure: m1 clients connected by LAN and Internet to server tiers m2, m3, and m4, with a SAN.)

Hierarchy of Embedded Parallelism
1. N-tiered client-server distributed systems
2. Clustered multi-computers
3. Clustered multiprocessors
4. Multiple processors on a chip
5. Multiple processing elements
6. Multiple pipelines
7. Multiple stages per pipeline
Goals: a single analytical model that captures parallelism from all levels, and a simulator that allows exploration.

References
- K. Hoganson, “Alternative Mechanisms to Achieve Parallel Speedup”, First IEEE Online Symposium for Electronics Engineers, IEEE Society, August.
- K. Hoganson, “Mapping Parallel Application Communication Topology to Rhombic Overlapping-Cluster Multiprocessors”, accepted for publication, The Journal of Supercomputing, Vol. 17, No. 1, to appear 8/2000.
- K. Hoganson, “Workload Execution Strategies and Parallel Speedup on Clustered Computers”, accepted for publication, IEEE Transactions on Computers, Vol. 48, No. 11, November.
- Undergraduate Research Project: Unified Parallel System Modeling project, Directed Study, Summer-Fall 2000.