Auburn University http://www.eng.auburn.edu/~xqin COMP7330/7336 Advanced Parallel and Distributed Computing Parallel Programming Models Dr. Xiao Qin Auburn.

Slides:



Advertisements
Similar presentations
Multiprocessors— Large vs. Small Scale Multiprocessors— Large vs. Small Scale.
Advertisements

Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.
CS 213: Parallel Processing Architectures Laxmi Narayan Bhuyan Lecture3.
ECE669 L15: Mid-term Review March 25, 2004 ECE 669 Parallel Computer Architecture Lecture 15 Mid-term Review.
ECE669 L1: Course Introduction January 29, 2004 ECE 669 Parallel Computer Architecture Lecture 1 Course Introduction Prof. Russell Tessier Department of.
Multiprocessors ELEC 6200: Computer Architecture and Design Instructor : Agrawal Name: Nam.
ECE669 L2: Architectural Perspective February 3, 2004 ECE 669 Parallel Computer Architecture Lecture 2 Architectural Perspective.
1 Lecture 1: Parallel Architecture Intro Course organization:  ~5 lectures based on Culler-Singh textbook  ~5 lectures based on Larus-Rajwar textbook.
Introduction What is Parallel Algorithms? Why Parallel Algorithms? Evolution and Convergence of Parallel Algorithms Fundamental Design Issues.
CS 258 Parallel Computer Architecture Lecture 2 Convergence of Parallel Architectures January 25, 2002 Prof John D. Kubiatowicz
Multiprocessors CSE 471 Aut 011 Multiprocessors - Flynn’s Taxonomy (1966) Single Instruction stream, Single Data stream (SISD) –Conventional uniprocessor.
Parallel Processing Architectures Laxmi Narayan Bhuyan
Computer architecture II
0 Introduction to parallel processing Chapter 1 from Culler & Singh Winter 2007.
Computer System Architectures Computer System Software
Multiple Processor Systems. Multiprocessor Systems Continuous need for faster and powerful computers –shared memory model ( access nsec) –message passing.
Parallel Computing Laxmikant Kale
ECE 568: Modern Comp. Architectures and Intro to Parallel Processing Fall 2006 Ahmed Louri ECE Department.
Problem is to compute: f(latitude, longitude, elevation, time)  temperature, pressure, humidity, wind velocity Approach: –Discretize the.
ECE 569: High-Performance Computing: Architectures, Algorithms and Technologies Spring 2006 Ahmed Louri ECE Department.
Spring 2003CSE P5481 Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores message passing.
INEL6067 Technology ---> Limitations & Opportunities Wires -Area -Propagation speed Clock Power VLSI -I/O pin limitations -Chip area -Chip crossing delay.
1 Lecture 1: Parallel Architecture Intro Course organization:  ~18 parallel architecture lectures (based on text)  ~10 (recent) paper presentations 
3/12/2013Computer Engg, IIT(BHU)1 PARALLEL COMPUTERS- 2.
3/12/2013Computer Engg, IIT(BHU)1 INTRODUCTION-2.
Evolution and Convergence of Parallel Architectures Todd C. Mowry CS 495 January 17, 2002.
Diversity and Convergence of Parallel Architectures Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals.
CDA-5155 Computer Architecture Principles Fall 2000 Multiprocessor Architectures.
Background Computer System Architectures Computer System Software.
COMP7330/7336 Advanced Parallel and Distributed Computing Data Parallelism Dr. Xiao Qin Auburn University
Advanced Computer Architecture CS 704 Advanced Computer Architecture Lecture 34 Multiprocessors (Shared Memory Architectures) Prof. Dr. M. Ashraf Chughtai.
Group Members Hamza Zahid (131391) Fahad Nadeem khan Abdual Hannan AIR UNIVERSITY MULTAN CAMPUS.
Fundamental Design Issues
Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing Fundamental Design Issues Dr. Xiao Qin Auburn.
These slides are based on the book:
Convergence of Parallel Architectures
Auburn University COMP8330/7330/7336 Advanced Parallel and Distributed Computing Parallel Hardware Dr. Xiao Qin Auburn.
COMP 740: Computer Architecture and Implementation
CMSC 611: Advanced Computer Architecture
CS5102 High Performance Computer Systems Thread-Level Parallelism
Auburn University COMP8330/7330/7336 Advanced Parallel and Distributed Computing Interconnection Networks (Part 2) Dr.
Parallel Computers Definition: “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”
Course Outline Introduction in algorithms and applications
Constructing a system with multiple computers or processors
Morgan Kaufmann Publishers
CS775: Computer Architecture
CMSC 611: Advanced Computer Architecture
What is Parallel and Distributed computing?
Shared Memory Multiprocessors
Lecture 1: Parallel Architecture Intro
Chapter 17 Parallel Processing
Multiprocessor Architectures and Parallel Programs
Convergence of Parallel Architectures
Multiprocessors - Flynn’s taxonomy (1966)
Multiple Processor Systems
CS 213: Parallel Processing Architectures
Introduction to Multiprocessors
Parallel Processing Architectures
CS 258 Parallel Computer Architecture
Lecture 24: Memory, VM, Multiproc
Constructing a system with multiple computers or processors
Constructing a system with multiple computers or processors
Parallel Computer Architecture
Computer Evolution and Performance
Latency Tolerance: what to do when it just won’t go away
Chapter 4 Multiprocessors
An Overview of MIMD Architectures
Lecture 23: Virtual Memory, Multiprocessors
Database System Architectures
Prof. Onur Mutlu Carnegie Mellon University 9/12/2012
Presentation transcript:

Auburn University http://www.eng.auburn.edu/~xqin COMP7330/7336 Advanced Parallel and Distributed Computing Parallel Programming Models Dr. Xiao Qin Auburn University http://www.eng.auburn.edu/~xqin xqin@auburn.edu Lec02 Some slides are adopted from Dr. David E. Culler, U.C. Berkeley

Speedup Speedup (p processors) = For a fixed problem size (input data set), performance = 1/time Speedup fixed problem (p processors) = Performance (p processors) Performance (1 processor) Time (1 processor) Time (p processors)

Architectural Trends Greatest trend in VLSI generation is increase in parallelism Up to 1985: bit level parallelism: 4-bit -> 8 bit -> 16-bit slows after 32 bit adoption of 64-bit now under way, 128-bit far (not performance issue) great inflection point when 32-bit micro and cache fit on a chip Mid 80s to mid 90s: instruction level parallelism pipelining and simple instruction sets, + compiler advances (RISC) on-chip caches and functional units => superscalar execution greater sophistication: out of order execution, speculation, prediction to deal with control transfer and latency problems Next step: thread level parallelism

How far will ILP go? Infinite resources and fetch bandwidth, perfect branch prediction and renaming real caches and non-zero miss latencies

Threads Level Parallelism “on board” Proc Proc Proc Proc MEM No. of processors in fully configured commercial shared-memory systems Micro on a chip makes it natural to connect many to shared memory dominates server and enterprise market, moving down to desktop Faster processors began to saturate bus, then bus technology advanced today, range of sizes for bus-based systems, desktop to large servers

Memory Bandwidth

Bus Bandwidth Year 2002 250 MBps

Number of Cores

Commercial Computing scale deep web web Data marketplace and analytics Data-intensive HPC, cloud Semantic discovery Data marketplace and analytics Social media and networking Automate (discovery) Discover (intelligence) Transact Integrate Interact Relies on parallelism for high end Computational power determines scale of business that can be handled Databases, online-transaction processing, decision support, data mining, data warehousing ... TPC benchmarks (TPC-C order entry, TPC-D decision support) Explicit scaling criteria provided Size of enterprise scales with size of system Problem size not fixed as p increases. Throughput is performance measure (transactions per minute or tpm) Inform Publish time

Scientific Computing Demand Old data

Applications: Speech and Image Processing

Top Ten Largest Databases Ref: http://www.focus.com/fyi/operations/10-largest-databases-in-the-world/

Roadmap to Exascale (architectural trends)

Is better parallel arch enough? Transition to parallel computing has occurred for scientific and engineering computing In rapid progress in commercial computing Database and transactions as well as financial Usually smaller-scale, but large-scale systems also used Desktop also uses multithreaded programs, which are a lot like parallel programs Demand for improving throughput on sequential workloads Greatest use of small-scale multiprocessors Solid application demand exists and will increase AMBER molecular dynamics simulation program Starting point was vector code for Cray-1 145 MFLOP on Cray90, 406 for final version on 128-processor Paragon, 891 on 128-processor Cray T3D

Programming Model Conceptualization of the machine that programmer uses in coding applications How parts cooperate and coordinate their activities Specifies communication and synchronization operations Multiprogramming no communication or synch. at program level Shared address space like bulletin board Message passing like letters or phone calls, explicit point to point Data parallel: more regimented, global actions on data Implemented with shared address space or message passing

Shared Memory => Shared Addr. Space Bottom-up engineering factors Programming concepts Why its attactive.

Adding Processing Capacity Memory capacity increased by adding modules I/O by controllers and devices Add processors for processing! For higher-throughput multiprogramming, or parallel programs

Historical Development “Mainframe” approach Motivated by multiprogramming Extends crossbar used for Mem and I/O Processor cost-limited => crossbar Bandwidth scales with p High incremental cost use multistage instead “Minicomputer” approach Almost all microprocessor systems have bus Motivated by multiprogramming, TP Used heavily for parallel computing Called symmetric multiprocessor (SMP) Latency larger than for uniprocessor Bus is bandwidth bottleneck caching is key: coherence problem Low incremental cost

Shared Physical Memory Any processor can directly reference any memory location Any I/O controller - any memory Operating system can run on any processor, or all. OS uses shared memory to coordinate Communication occurs implicitly as result of loads and stores What about application processes?

Shared Virtual Address Space Process = address space plus thread of control Virtual-to-physical mapping can be established so that processes shared portions of address space. User-kernel or multiple processes Multiple threads of control on one address space. Popular approach to structuring OS’s Now standard application capability (ex: POSIX threads) Writes to shared address visible to other threads Natural extension of uniprocessors model conventional memory operations for communication special atomic operations for synchronization also load/stores

Structured Shared Address Space o r e P 1 2 n L a d p i v Virtual address spaces for a collection of processes communicating via shared addresses Machine physical address space Shared portion of address space Private portion Common physical addresses Add hoc parallelism used in system code Most parallel applications have structured SAS Same program on each processor shared variable X means the same thing to each thread

Engineering: Intel Pentium Pro Quad All coherence and multiprocessing glue in processor module Highly integrated, targeted at high volume Low latency and bandwidth 6/20/2018

Engineering: SUN Enterprise Proc + mem card - I/O card 16 cards of either type All memory accessed over bus, so symmetric Higher bandwidth, higher latency bus

Scaling Up Problem is interconnect: cost (crossbar) or bandwidth (bus) ° ° ° Network Network $ ° ° ° $ M M ° ° ° $ $ $ M $ P P P P P P “Dance hall” Distributed memory Problem is interconnect: cost (crossbar) or bandwidth (bus) Dance-hall: bandwidth still scalable, but lower cost than crossbar latencies to memory uniform, but uniformly large Distributed memory or non-uniform memory access (NUMA) Construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response) Caching shared (particularly nonlocal) data?

Engineering: Cray T3E Scale up to 1024 processors, 480MB/s links Memory controller generates request message for non-local references No hardware mechanism for coherence SGI Origin etc. provide this

Systolic Arrays SIMD Generic Architecture Message Passing Dataflow ° ° ° Network P $ Systolic Arrays SIMD Generic Architecture Message Passing Dataflow Shared Memory

Message Passing Architectures Complete computer as building block, including I/O Communication via explicit I/O operations Programming model direct access only to private address space (local memory), communication via explicit messages (send/receive) High-level block diagram Communication integration? Mem, I/O, LAN, Cluster Easier to build and scale than SAS Programming model more removed from basic hardware operations Library or OS intervention M ° ° ° Network P $

Message-Passing Abstraction Pr ocess P Q Addr ess Y X Send X, Q, t Receive , t Match Local pr addr ess space Send specifies buffer to be transmitted and receiving process Recv specifies sending process and application storage to receive into Memory to memory copy, but need to name processes Optional tag on send and matching rule on receive User process names local data and entities in process/tag space too In simplest form, the send/recv match achieves pairwise synch event Other variants too Many overheads: copying, buffer management, protection

Evolution of Message-Passing Machines Early machines: FIFO on each link HW close to prog. Model; synchronous ops topology central (hypercube algorithms) CalTech Cosmic Cube (Seitz, CACM Jan 95)

Diminishing Role of Topology Shift to general links DMA, enabling non-blocking ops Buffered by system at destination until recv Store&forward routing Diminishing role of topology Any-to-any pipelined routing node-network interface dominates communication time Simplifies programming Allows richer design space grids vs hypercubes Intel iPSC/1 -> iPSC/2 -> iPSC/860 H x (T0 + n/B) vs T0 + HD + n/B

Example Intel Paragon

Building on the mainstream: IBM SP-2 Made out of essentially complete RS6000 workstations Network interface integrated in I/O bus (bw limited by I/O bus)

Berkeley NOW 100 Sun Ultra2 workstations Inteligent network interface proc + mem Myrinet Network 160 MB/s per link 300 ns per hop

Toward Architectural Convergence Evolution and role of software have blurred boundary Send/recv supported on SAS machines via buffers Can construct global address space on MP (GA -> P | LA) Page-based (or finer-grained) shared virtual memory Hardware organization converging too Tighter NI integration even for MP (low-latency, high-bandwidth) Hardware SAS passes messages Even clusters of workstations/SMPs are parallel systems Emergence of fast system area networks (SAN) Programming models distinct, but organizations converging Nodes connected by general network and communication assists Implementations also converging, at least in high-end machines