Tuesday, September 04, 2006 "If you were plowing a field, what would you rather use, two strong oxen or 1024 chickens?" (Commenting on parallel architectures) - Seymour Cray, Founder of Cray Research

CS 524: High Performance Computing §Course URL §Folder on indus: \\indus\Common\cs524a06

Serial computing §Software is written to run on a single computer having a single Central Processing Unit (CPU). §A problem is broken into a discrete series of instructions. §Instructions are executed one after another. §Only one instruction may execute at any moment in time.
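To make this concrete, here is a minimal serial sketch in C (a hypothetical array-summing loop, not taken from the course material): one CPU executes one instruction stream, one step at a time.

#include <stdio.h>

int main(void) {
    double data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    double sum = 0.0;

    /* A single instruction stream on a single CPU: each iteration
       starts only after the previous one has finished. */
    for (int i = 0; i < 8; i++)
        sum += data[i];

    printf("sum = %f\n", sum);
    return 0;
}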

Parallel computing §Simultaneous use of multiple compute resources to solve a computational problem on multiple CPUs §A problem is broken into discrete parts that can be solved concurrently §Instructions from each part execute simultaneously on different CPUs
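A hedged sketch of the parallel counterpart, assuming OpenMP as the shared-memory API (the slide names no particular tool): the loop is broken into parts that different CPUs work on simultaneously, and the partial results are combined at the end.

#include <stdio.h>
#include <omp.h>

int main(void) {
    enum { N = 1000000 };
    static double data[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        data[i] = 1.0;

    /* The iteration space is split into discrete parts; each thread
       (typically one per CPU core) executes its part at the same time.
       The reduction clause combines the per-thread partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += data[i];

    printf("sum = %.0f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}

Compile with OpenMP enabled, e.g. gcc -fopenmp.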

Grand Challenge Problems §Traditionally, parallel computing has been considered to be "the high end of computing" l Motivated by numerical simulations of complex systems and "Grand Challenge Problems": §Global change §Fluid turbulence §Vehicle dynamics §Ocean circulation §Viscous fluid dynamics §Superconductor modeling §…

§Today, commercial applications are providing an equal or greater driving force in the development of faster computers. §These applications require the processing of large amounts of data. l Parallel databases, data mining l Oil exploration l Web search engines, web-based business services l Computer-aided diagnosis in medicine l Advanced graphics and virtual reality, particularly in the entertainment industry l …

Some reasons for using parallel computing: §Save time - wall clock time §Solve larger problems §Provide concurrency (do multiple things at the same time)

Some reasons for using parallel computing: §Save time - wall clock time §Solve larger problems §Provide concurrency (do multiple things at the same time) Other reasons might include: §Taking advantage of non-local resources - using available compute resources on a wide area network, or even the Internet when local compute resources are scarce. §Overcoming memory constraints - single computers have very finite memory resources. For large problems, using the memories of multiple computers may overcome this obstacle.

Parallel Architectures: Memory Parallelism §To increase performance: l Replicate computers. l Take advantage of commodity microprocessors. §The simplest and most useful way to classify modern parallel computers is by their memory model: §Shared memory §Distributed memory

Shared Memory §In the mid-1980s, when 32-bit microprocessors were first introduced, computers containing multiple microprocessors sharing a common memory became prevalent. §However, only a small number of processors can be supported by a single bus. §The system is limited by the bandwidth of the bus.

Shared Memory §Single address space visible to all CPUs §Data is available to all processors through load and store instructions §Multiple processors can operate independently but share the same memory resources.

UMA bus-based SMP architecture §One way to alleviate this problem is to add a cache to each CPU. §If most reads can be satisfied from the cache, there is less bus traffic and the system can support more CPUs. §Even so, a single bus limits a UMA multiprocessor to a few dozen CPUs.

UMA bus-based SMP architecture §CC-UMA: Cache-Coherent UMA. §Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update.

UMA multiprocessors using Crossbar switches §Non-blocking network. §Crosspoints grow as n². §1000 CPUs and 1000 memory modules require a million crosspoints. §Feasible only for medium-sized systems.

§In 1994 companies such as SGI, Digital, and Sun began selling SMP models in their workstation families.

NUMA multiprocessors §Non-uniform memory access (NUMA) l Does not require all memory access times to be the same. §CC-NUMA l Example: SGI Origin 3000 (128 processors; up to 1024 in a special configuration) §Also called distributed shared memory (DSM)

NUMA multiprocessors NUMA Multiprocessor Characteristics: 1. Single address space visible to all CPUs 2. Access to remote memory via LOAD and STORE instructions 3. Access to remote memory is slower than to local memory

Shared Memory §It is the programmer's responsibility to use synchronization constructs that ensure "correct" access to global memory.
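A small sketch of what can go wrong without such constructs, assuming OpenMP (hypothetical; the slide does not name an API): many threads update one shared counter, and only the atomic directive makes the result correct.

#include <stdio.h>

int main(void) {
    long counter = 0;

    /* Several threads perform read-modify-write on the same shared
       variable. Without the 'atomic' directive the updates interleave
       and some are lost; the hardware does not enforce correctness,
       the programmer must. */
    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++) {
        #pragma omp atomic
        counter++;
    }

    printf("counter = %ld\n", counter);  /* 1000000 only thanks to the atomic */
    return 0;
}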

Distributed Memory §Distributed memory or shared-nothing model. l Uses separate computers connected by a network §Typical programming model l Message passing l Emphasizes that the parallel computer is a collection of separate computers

Distributed Memory

§Memory addresses in one processor do not map to another processor, so there is no concept of a global address space across all processors. §The concept of cache coherency does not apply. §Distributed memory systems are the most common parallel computers l They are the easiest to assemble.

Distributed Memory §When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. §Synchronization between tasks is likewise the programmer's responsibility.
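A minimal message-passing sketch, assuming MPI (the de facto standard library for this model, though the slide does not name one): the programmer explicitly decides that rank 0 sends and rank 1 receives, and the matching receive doubles as the synchronization point.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double value = 0.0;
    if (rank == 0) {
        value = 3.14;
        /* No global address space: the data moves only because the
           programmer explicitly sends it. */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking receive: also serves as synchronization between tasks. */
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}

Run with at least two processes, e.g. mpirun -np 2 ./a.out.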

Distributed Memory §Intel Paragon and 512-processor Delta l Showed the viability of using a large number of processors §IBM SP2 l Commercial distributed memory systems ('94) l From 8 processors to the 8192-processor ASCI White system l Database systems were an important component of sales §Cray T3D and T3E systems l Special hardware for remote memory operations

§NUMA and Distributed Memory Systems pictures.

Distributed Memory §Clusters of workstations (NOWs) l Low cost and high performance of commodity workstations. §365 systems in the current TOP500 list are labelled as clusters.

§Late 1970s-1980s: Cray vector supercomputers §Initial improvements (clock rates, on-chip pipelined FPUs, on-chip cache size, memory hierarchies). §Multiprocessor architectures were adopted by both vector processor and microprocessor designs.

§Multiprocessor architectures were adopted by both vector processor and microprocessor designs, but at differing scales. l Cray X-MP (2, then 4 processors) l C90 (16 processors) l T90 (up to 32 processors) §Microprocessor-based supercomputers (MPPs) initially provided on the order of 100 processors, and then thousands.

§The trend towards MPPs is very pronounced. §Cray Research announced the T3D, based on the DEC Alpha microprocessor, in 1993. §MPPs continue to account for more than half of all installed high-performance computers worldwide.

High Performance Computers §~20 years ago: 1x10^6 floating point ops/sec (Mflop/s); scalar based §~10 years ago: 1x10^9 floating point ops/sec (Gflop/s); vector & shared memory computing §Today: 1x10^12 floating point ops/sec (Tflop/s); highly parallel, distributed processing, message passing, network based

§Parallel computing has made it possible for peak speeds of high end supercomputers to increase at a rate that exceeded Moore’s law.
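As a rough check on this claim (simple arithmetic, not from the slides): the progression above is a factor of 1,000 roughly every ten years, i.e. about 2^10 per decade, or a doubling of peak speed roughly every 12 months. Transistor counts under Moore's law double only about every 18-24 months, so the extra growth has come from using more processors in parallel.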

LINPACK Benchmark §Emphasis on dense linear algebra. §Evaluates a narrow aspect of system performance. §Has been available for a wide range of machines for a very long time.
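A hedged sketch of how the reported number is derived: LINPACK times the solution of a dense n-by-n linear system and divides the conventional operation count, 2/3*n^3 + 2*n^2, by the elapsed time. The problem size and timing below are made up for illustration.

#include <stdio.h>

int main(void) {
    double n = 10000.0;     /* matrix order (hypothetical) */
    double seconds = 95.0;  /* measured solve time (hypothetical) */

    /* Standard LINPACK operation count for solving Ax = b by
       LU factorization followed by back-substitution. */
    double flops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;
    double gflops = flops / seconds / 1e9;

    printf("Rate = %.1f Gflop/s\n", gflops);
    return 0;
}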

Earth Simulator: 35.86 TFlop/s, 5120 processors §Held the No. 1 position for five consecutive TOP500 lists before being replaced by BlueGene/L in Nov. 2004. §It is now No. 10.

2006 §BlueGene/L, Number 1 on the TOP500 list of supercomputers. §Located in the Terascale Simulation Facility at Lawrence Livermore National Laboratory. §BlueGene/L is optimized to run molecular dynamics applications. §It has also occupied the No. 1 slot for the last three TOP500 lists.

BlueGene/L Supercomputer §LINPACK performance of 280.6 TFlop/s. §IBM remains the dominant vendor of supercomputers (48.6% of the list). §Intel microprocessors are at the heart of 301 of the 500 systems.

July 26, 2006 §MDGrape-3 at RIKEN, Japan, clocked at one quadrillion calculations per second (1 petaflop/s).

§Parallel computing is here to stay! §Primary mechanism by which computer performance can keep up with predictions of Moore’s law.

Parallel computing can answer challenges to society. §Diseases §Hurricane tracks (storm prediction) §Environmental impact (metropolitan transportation systems) §…

SETI@home §Uses thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence. §When a participant's computer is idle, the software downloads a 300-kilobyte chunk of data for analysis. §Each client performs about 3 Tflop of computation over roughly 15 hours per chunk. §The results of this analysis are sent back to the SETI team and combined with those of thousands of other participants. §Largest distributed computation project in existence, averaging 40 Tflop/s.

Global Distributed Computing §Running on 500,000 PCs, ~1000 CPU Years per Day §485,821 CPU Years so far §Sophisticated Data & Signal Processing Analysis

World Community Grid §Projects that benefit humanity l Defeat Cancer Project l Project §Idle computer time is donated.

§Wide spectrum of parallel computers.

Google query attributes §150M queries/day (~2000/second) §3B documents in the index §Clusters of document servers for web pages. Data centers: §15,000 Linux systems in 6 data centers §15 TFlop/s and 1000 TB total capability §100 Mbit Ethernet switches per cabinet with a gigabit Ethernet uplink
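As a quick sanity check on these figures (simple arithmetic, not from the slides): 150 million queries spread over the 86,400 seconds in a day averages roughly 1,700 queries per second, consistent with the ~2,000/second figure, which presumably reflects peak rather than average load.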

Sony PlayStation 3 §IBM PowerPC technology §Clocked at 3.2 GHz; claimed to yield 2.18 teraflops. §Seven vector processing units.

von Neumann Architecture §A common machine model known as the von Neumann computer. §Uses the stored-program concept. The CPU executes a stored program that specifies a sequence of read and write operations on the memory.
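A toy sketch of the stored-program idea (a hypothetical two-instruction machine, purely illustrative): program and data share one memory, and the CPU repeatedly fetches the next instruction and carries out its reads and writes.

#include <stdio.h>

enum { HALT = 0, LOAD_ADD = 1, STORE = 2 };

int main(void) {
    /* Program and data live in the same memory, as in the stored-program
       concept. Each instruction is an opcode followed by an address. */
    int mem[16] = { LOAD_ADD, 10, LOAD_ADD, 11, STORE, 12, HALT, 0,
                    0, 0, 5, 7, 0, 0, 0, 0 };
    int pc = 0, acc = 0;

    for (;;) {                      /* the fetch-decode-execute cycle */
        int op = mem[pc++];         /* fetch the opcode */
        if (op == HALT) break;
        int addr = mem[pc++];       /* fetch the operand address */
        if (op == LOAD_ADD) acc += mem[addr];   /* read from memory */
        else if (op == STORE) mem[addr] = acc;  /* write to memory */
    }

    printf("mem[12] = %d\n", mem[12]);  /* 5 + 7 = 12 */
    return 0;
}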