Tuesday, September 04, 2006 "If you were plowing a field, what would you rather use, two strong oxen or 1024 chickens?" (Commenting on parallel architectures) - Seymour Cray, Founder of Cray Research
§Course URL §Folder on indus \\indus\Common\cs524a06 CS 524 : High Performance Computing
Serial computing §To be run on a single computer having a single Central Processing Unit (CPU); §A problem is broken into a discrete series of instructions. §Instructions are executed one after another. §Only one instruction may execute at any moment in time.
Parallel computing §Simultaneous use of multiple compute resources to solve a computational problem on multiple CPUs §A problem is broken into discrete parts that can be solved concurrently §Instructions from each part execute simultaneously on different CPUs
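The decomposition described above (break the problem into discrete parts, solve the parts concurrently, combine the results) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and Python threads are used only to show the split/solve/combine structure. In CPython, threads do not run Python bytecode simultaneously on different CPUs, so true parallel execution would need a process pool such as `concurrent.futures.ProcessPoolExecutor`.

```python
from concurrent.futures import ThreadPoolExecutor


def serial_sum_of_squares(values):
    """Serial version: one instruction stream, one element after another."""
    total = 0
    for v in values:
        total += v * v
    return total


def split(values, parts):
    """Break the problem into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(values), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def parallel_sum_of_squares(values, workers=4):
    """Solve each chunk as an independent task, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(serial_sum_of_squares, split(values, workers)))
    return sum(partials)


data = list(range(1000))
print(serial_sum_of_squares(data), parallel_sum_of_squares(data))
```

Both calls return the same value; only the way the work is organized differs.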
Grand Challenge Problems §Traditionally, parallel computing has been considered to be "the high end of computing" l Motivated by numerical simulations of complex systems and "Grand Challenge Problems": §Global change §Fluid turbulence §Vehicle dynamics §Ocean circulation §Viscous fluid dynamics §Superconductor modeling §…
§Today, commercial applications are providing an equal or greater driving force in the development of faster computers. §These applications require the processing of large amounts of data. l Parallel databases, data mining l Oil exploration l Web search engines, web based business services l Computer-aided diagnosis in medicine l Advanced graphics and virtual reality, particularly in the entertainment industry l …
Some reasons for using parallel computing: §Save time - wall clock time §Solve larger problems §Provide concurrency (do multiple things at the same time) Other reasons might include: §Taking advantage of non-local resources - using available compute resources on a wide area network, or even the Internet when local compute resources are scarce. §Overcoming memory constraints - single computers have very finite memory resources. For large problems, using the memories of multiple computers may overcome this obstacle.
Parallel Architectures: Memory Parallelism §To increase performance: l Replicate computers. l Can take advantage of commodity microprocessors The simplest and most useful way to classify modern parallel computers is by their memory model: §Shared memory §Distributed memory
Shared Memory §In the mid-1980s, when the 32-bit microprocessor was first introduced, computers containing multiple microprocessors sharing a common memory became prevalent. §However, only a small number of processors can be supported by a bus. §The system is limited by the bandwidth of the bus.
Shared Memory §Single address space visible to all CPUs §Data is available to all CPUs through load and store instructions §Multiple processors can operate independently but share the same memory resources.
UMA bus based SMP architecture §One way to alleviate this problem is to add a cache to each CPU. §Less bus traffic if most reads can be satisfied from the cache and system can support more CPUs. §Single bus limits UMA microprocessor to about CPUs.
UMA bus based SMP architecture §CC-UMA Cache Coherent UMA. §Cache coherent means if one processor updates a location in shared memory, all the other processors know about the update.
UMA multiprocessors using Crossbar switches §Non-blocking network. §Number of crosspoints grows as n^2. §1000 CPUs and 1000 memory modules require a million crosspoints. §Feasible only for medium-sized systems.
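The quadratic growth of a crossbar can be checked with a few lines of arithmetic. This is only a sketch: the function name `crosspoints` is hypothetical, and it assumes one crosspoint per CPU/memory-module pair.

```python
def crosspoints(n_cpus, n_memories):
    """One crosspoint switch is needed for every (CPU, memory module) pair."""
    return n_cpus * n_memories


# With as many memory modules as CPUs, the count grows as n^2.
for n in (10, 100, 1000):
    print(n, crosspoints(n, n))
```

For n = 1000 this gives the one million crosspoints quoted above, which is why the design is practical only for medium-sized systems.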
§In 1994 companies such as SGI, Digital, and Sun began selling SMP models in their workstation families.
NUMA multiprocessors Non-uniform memory access (NUMA) l Does not require all memory access times to be same. §CC-NUMA l SGI Origin 300 (128 processors, 1024 special configuration) §Also called Distributed shared memory (DSM)
NUMA multiprocessors NUMA Multiprocessor Characteristics 1. Single address space visible to all CPUs 2. Access to remote memory via LOAD and STORE commands 3. Access to remote memory is slower than access to local memory
Shared Memory §Programmer responsibility for synchronization constructs that ensure "correct" access of global memory.
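A minimal sketch of such a synchronization construct, using Python's standard `threading` module. The function name `run_counter` and the shared counter are illustrative assumptions. Every thread sees the same variable (a single address space), so the programmer must guard the read-modify-write with a lock.

```python
import threading


def run_counter(n_threads=4, increments=10_000):
    """Several threads update one shared counter; a lock keeps updates correct."""
    counter = {"value": 0}      # shared memory visible to all threads
    lock = threading.Lock()     # synchronization construct supplied by the programmer

    def worker():
        for _ in range(increments):
            # Without the lock, two threads could read the same old value
            # and one of the increments would be lost.
            with lock:
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]


print(run_counter())
```

With the lock in place the result is always `n_threads * increments`.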
Distributed Memory §Distributed memory or shared-nothing model. l Use separate computers connected by a network §Typical programming model l Message passing l Emphasizes that parallel computer is a collection of separate computers
Distributed Memory
§Memory addresses in one processor do not map to another processor, so there is no concept of a global address space across all processors. §The concept of cache coherency does not apply. §Distributed memory systems are the most common parallel computers l Easiest to assemble.
Distributed Memory §When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. §Synchronization between tasks is likewise the programmer's responsibility.
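The explicit communication described above can be sketched in a few lines. This is a model, not a real distributed program: threads stand in for separate computers, a `queue.Queue` stands in for the network, and the names `worker`, `root`, and `distributed_sum` are hypothetical. Each node touches only its own local data, and everything else travels through explicit send and receive calls.

```python
import queue
import threading


def worker(local_data, to_root):
    """A node with private memory: compute locally, then send one message."""
    to_root.put(sum(local_data))            # explicit send


def root(local_data, from_workers, n_workers, result):
    """The root node combines its own partial sum with the received ones."""
    total = sum(local_data)
    for _ in range(n_workers):
        total += from_workers.get()         # explicit, blocking receive
    result.append(total)


def distributed_sum(chunks):
    """chunks[0] lives on the root node; every other chunk lives on a worker."""
    channel = queue.Queue()
    result = []
    nodes = [threading.Thread(target=worker, args=(c, channel)) for c in chunks[1:]]
    nodes.append(
        threading.Thread(target=root, args=(chunks[0], channel, len(chunks) - 1, result))
    )
    for n in nodes:
        n.start()
    for n in nodes:
        n.join()
    return result[0]


print(distributed_sum([[1, 2], [3, 4], [5, 6]]))
```

The blocking receive also acts as the synchronization point: the root cannot finish until every worker has sent its message.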
Distributed Memory §Intel Paragon and 512-processor Delta l Showed the viability of using a large number of processors §IBM – SP2 l Commercial distributed memory systems (‘94) l 8 processors to 8192 processor ASCI White system l Database systems were an important component of sales §Cray T3D and T3E systems l Special hardware for remote memory operations
§NUMA and Distributed Memory Systems pictures.
Distributed Memory §Clusters or networks of workstations (NOWs) l Low cost and high performance of commodity workstations. §365 systems in the current TOP500 are labelled as clusters.
§Late 1970s-1980s, Cray vector supercomputers §Initial improvements (clock rates, on-chip pipelined FPUs, on-chip cache size, memory hierarchies).
§Multiprocessor architectures were adopted by both vector processor and microprocessor designs but with differing scales. l Cray X-MP (2 then 4 processors) l C90 (16 processors) l T94 (32 processors) §Microprocessor-based supercomputers (MPPs) initially provided 100 processors and then 1000s.
§Trend towards MPPs is very pronounced. §Cray Research announced T3D based on microprocessor in §MPPs continue to account for more than half of all installed high-performance computers worldwide.
High Performance Computers §~20 years ago: 1x10^6 Floating Point Ops/sec (Mflop/s); scalar based §~10 years ago: 1x10^9 Floating Point Ops/sec (Gflop/s); vector & shared memory computing §Today: 1x10^12 Floating Point Ops/sec (Tflop/s); highly parallel, distributed processing, message passing, network based
§Parallel computing has made it possible for peak speeds of high end supercomputers to increase at a rate that exceeded Moore’s law.
LINPACK Benchmark §Emphasis on dense linear algebra. §Evaluates a narrow aspect of system performance. §Available for a wide range of machines for a very long time.
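To make the idea concrete, here is a toy measurement in the spirit of LINPACK. It is not the benchmark code itself: the names `solve` and `benchmark` are hypothetical, the solver is plain Gaussian elimination with partial pivoting in pure Python, and the rate uses the conventional operation count of roughly 2n^3/3 + 2n^2 for solving a dense n x n system.

```python
import random
import time


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for k in range(n):
        # Partial pivoting: move the largest remaining entry of column k to row k.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[p] = m[p], m[k]
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= f * m[k][j]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def benchmark(n=100, seed=0):
    """Time one dense solve; return (Mflop/s, max residual of a x - b)."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-1, 1) for _ in range(n)]
    t0 = time.perf_counter()
    x = solve(a, b)
    elapsed = time.perf_counter() - t0
    flops = 2.0 * n**3 / 3.0 + 2.0 * n**2
    residual = max(abs(sum(a[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))
    return flops / elapsed / 1e6, residual


mflops, residual = benchmark()
print(f"{mflops:.1f} Mflop/s, residual {residual:.2e}")
```

The number printed depends entirely on the machine and interpreter, and it measures only dense linear algebra, which is the "narrow aspect" noted above.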
Earth Simulator: TFlops, 5120 processors. Had held the No. 1 position for five consecutive TOP500 lists before being replaced by BlueGene/L in Nov It is now No. 10.
2006 §BlueGene/L, Number 1 on the TOP500 list of supercomputers. §Located in the Terascale Simulation Facility at Lawrence Livermore National Laboratory. §BlueGene/L is optimized to run molecular dynamics applications. §Also occupied No. 1 slot for last three TOP500 lists.
BlueGene/L Supercomputer LINPACK performance of TFlops/s. IBM remains dominant vendor of supercomputers (48.6% of list) Intel µP at the heart of 301 of 500 systems
July 26, 2006 §MDGrape-3 at Riken, Japan clocked at one quadrillion calculations per second (1 petaflops).
§Parallel computing is here to stay! §Primary mechanism by which computer performance can keep up with predictions of Moore’s law.
Parallel computing can answer challenges to society. §Diseases §Hurricane tracks (storm prediction) §Environmental impact (metropolitan transportation systems) §…
Uses thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence. When a participant's computer is idle, the software downloads a 300 kilobyte chunk of data for analysis. Performs about 3 Tflops for each client in 15 hours. The results of this analysis are sent back to the SETI team and combined with those of thousands of other participants. Largest distributed computation project in existence, averaging 40 Tflop/s.
Global Distributed Computing §Running on 500,000 PCs, ~1000 CPU Years per Day §485,821 CPU Years so far §Sophisticated Data & Signal Processing Analysis
World Community Grid §Projects that benefit humanity l Defeat Cancer Project l Project §Idle computer time is donated.
§Wide spectrum of parallel computers.
Google query attributes §150M queries/day (2000/second) §3B documents in the index §Clusters of document servers for web pages. Data centers §15,000 Linux systems in 6 data centers §15 TFlop/s and 1000 TB total capability §100 MB Ethernet switches/cabinet with gigabit Ethernet uplink
Sony PlayStation 3 §IBM PowerPC technology §Clocked at 3.2GHz – claimed to yield 2.18 Teraflops. §Seven vector processing units.
von Neumann Architecture §A common machine model known as the von Neumann computer. §Uses the stored-program concept. The CPU executes a stored program that specifies a sequence of read and write operations on the memory.
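The stored-program concept can be illustrated with a toy machine. This is a sketch under stated assumptions: the four-instruction set (`LOAD`, `ADD`, `STORE`, `HALT`), the single accumulator, and the function name `run` are invented for illustration and do not correspond to any real CPU. The point is that instructions and data sit in the same memory, and the CPU repeatedly fetches an instruction, then reads or writes memory as that instruction directs.

```python
def run(memory):
    """Fetch-decode-execute loop over one memory that holds both code and data."""
    pc = 0      # program counter: address of the next instruction
    acc = 0     # accumulator register
    while True:
        op, addr = memory[pc]           # fetch: itself a read from memory
        pc += 1
        if op == "LOAD":
            acc = memory[addr]          # read a data word
        elif op == "ADD":
            acc += memory[addr]         # read a data word and add it
        elif op == "STORE":
            memory[addr] = acc          # write a data word
        elif op == "HALT":
            return memory
        else:
            raise ValueError(f"unknown instruction {op!r}")


program = [
    ("LOAD", 4),    # address 0: acc = memory[4]
    ("ADD", 5),     # address 1: acc += memory[5]
    ("STORE", 6),   # address 2: memory[6] = acc
    ("HALT", 0),    # address 3: stop
    2,              # address 4: data
    3,              # address 5: data
    0,              # address 6: result goes here
]

print(run(list(program))[6])
```

Only one instruction executes at a time, which matches the serial computing model described at the start of the lecture.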