
1 Science on Supercomputers: Pushing the (back of) the envelope Jeffrey P. Gardner Pittsburgh Supercomputing Center Carnegie Mellon University University of Pittsburgh

2 Outline History (the past) Characteristics of scientific codes Scientific computing, supercomputers, and the Good Old Days Reality (the present) Is there anything “super” about computers anymore? Why “network” means more net work on your part. Fantasy (the future) Strategies for turning a huge pile of processors into something scientists can actually use.

3 A (very brief) Introduction of Scientific Computing

4 Properties of “interesting” scientific datasets Very large datasets where the calculation is “tightly-coupled”.

5 Example Science Application: Cosmology Cosmological “N-Body” simulation 100,000,000 particles 1 TB of RAM 100 million light years To resolve the gravitational force on any single particle requires the entire dataset “read-only” coupling
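
To make the “read-only” coupling concrete, here is a minimal direct-summation gravity kernel in C. It is purely illustrative (not the simulation code discussed later): computing the acceleration on one particle requires reading every other particle in the dataset, and the gravitational constant and softening are simplified.

```c
#include <math.h>

typedef struct { double x, y, z, mass; } Particle;

/* Direct-sum gravitational acceleration on particle i (G = 1).
 * Every other particle must be read, which is the "read-only"
 * coupling described above: the whole dataset is touched even
 * though nothing is written back. */
void gravity_on(const Particle *p, long n, long i, double a[3])
{
    a[0] = a[1] = a[2] = 0.0;
    for (long j = 0; j < n; j++) {
        if (j == i) continue;
        double dx = p[j].x - p[i].x;
        double dy = p[j].y - p[i].y;
        double dz = p[j].z - p[i].z;
        double r2 = dx*dx + dy*dy + dz*dz + 1e-12;  /* softening avoids r = 0 */
        double inv_r3 = 1.0 / (r2 * sqrt(r2));
        a[0] += p[j].mass * dx * inv_r3;
        a[1] += p[j].mass * dy * inv_r3;
        a[2] += p[j].mass * dz * inv_r3;
    }
}
```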

6 Example Science Application: Cosmology Cosmological “N-Body” simulation 100,000,000 particles 1 TB of RAM 100 million light years To resolve the hydrodynamic forces requires information exchange between particles “read-write” coupling

7 Scientific Computing Transaction processing (1): A transaction is an information processing operation that cannot be subdivided into smaller operations. Each transaction must succeed or fail as a complete unit; it cannot remain in an intermediate state. (2) Functional definition: A transaction is any computational task: 1. That cannot be easily subdivided, because the overhead in doing so would exceed the time required for the non-divided form to complete. 2. Where any further subdivisions cannot be written in such a way that they are independent of one another. [(1) term borrowed (and generalized with apologies) from database management; (2) from Wikipedia]

8 Scientific Computing Functional definition: A transaction is any computational task: 1. That cannot be easily subdivided because the overhead in doing so would exceed the time required for the non-divided form to complete. Cosmological “N-Body” simulation: 100,000,000 particles, 1 TB of RAM. To resolve the gravitational force on any single particle requires the entire dataset: “read-only” coupling.

9 Scientific Computing Functional definition: A transaction is any computational task: 2. Where any further subdivisions cannot be written in such a way that they are independent of one another. Cosmological “N-Body” simulation: 100,000,000 particles, 1 TB of RAM. To resolve the hydrodynamic forces requires information exchange between particles: “read-write” coupling.

10 Scientific Computing In most business and web applications: A single CPU usually processes many transactions per second Transaction sizes are typically small

11 Scientific Computing In many science applications: A single transaction can take CPU hours, days, or years Transaction sizes can be extremely large

12 What Made Computers “Super”? Since the transaction must be memory-resident in order not to be I/O bound, the next bottleneck is memory. The original supercomputers differed from “ordinary” computers in their memory bandwidth and latency characteristics.

13 The “Golden Age” of Supercomputing 1976-1982: The Cray-1 is the most powerful computer in the world. The Cray-1 is a vector platform, i.e. it performs the same operation on many contiguous memory elements in one clock tick. Its memory subsystem was optimized to feed data to the processor at its maximum flop rate.
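
As a small illustration (not taken from any Cray code), a loop of this shape is what “vectorizes” well: the same multiply-add applied to contiguous memory elements, streamed from memory at full bandwidth.

```c
/* SAXPY-style loop: one multiply-add applied to contiguous elements.
 * A vector machine issues this as a handful of vector instructions
 * rather than n scalar iterations. */
void saxpy(long n, float a, const float *x, float *y)
{
    for (long i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```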

14 The “Golden Age” of Supercomputing 1985-1989: The Cray-2 is the most powerful computer in the world. The Cray-2 is also a vector platform.

15 Scientists Liked Supercomputers. They were simple to program! 1. They were serial machines. 2. “Caches? We don’t need no stinkin’ caches!” Scalar machines had no memory latency; this is as close as you get to an ideal computer. Vector machines offered substantial performance increases over scalar machines if you could “vectorize” your code.

16 “Triumph” of the Masses In the 1990s, commercial off-the-shelf (COTS) technology became so cheap, it was no longer cost-effective to produce fully-custom hardware

17 “Triumph” of the Masses Instead of producing faster processors with faster memory, supercomputer companies built machines with lots of processors in them. (Pictured: a single-processor Cray-2 and a 1024-processor Cray (CRI) T3D.)

18 “Triumph” of the Masses These were known as massively parallel platforms, or MPPs. (Pictured: a single-processor Cray-2 and a 1024-processor Cray T3D.)

19 “Triumph” of the Masses(?) (Pictured: a single-processor Cray-2, the world’s fastest computer in 1989, and a 1024-processor Cray T3D, the world’s fastest computer (almost) in 1994.)

20 Part II: The Present Why “network” means more net work on your part

21 The “Social Impact” of MPPs The transition from serial supercomputers to MPPs actually resulted in far fewer scientists using supercomputers. MPPs are really hard to program! Developing scientific applications for MPPs became an area of study in its own right: High Performance Computing (HPC)

22 Characteristics of HPC Codes Large dataset: data must be distributed across many compute nodes. The CPU/MPP memory hierarchy (approximate access costs): processor registers; L1 cache, ~2 cycles; L2 cache, ~10 cycles; main memory, ~100 cycles; off-processor memory, ~300,000 cycles! (Figure: an N-body cosmology simulation domain-decomposed across processors 0-8.)

23 What makes computers “super” anymore? Cray T3D in 1994: Cray-built interconnect fabric. PSC “Terascale Computing System” (TCS) in 2000: custom interconnect fabric by Quadrics. PSC Cray XT3 in 2006: Cray-built interconnect fabric.

24 What makes computers “super” anymore? I would propose the following definition: a “supercomputer” differs from “a pile of workstations” in that a supercomputer is optimized to spread a single large transaction across many, many processors. In practice, this means that the network interconnect fabric is the principal bottleneck.

25 What makes computers “super” anymore? Google’s 30-acre campus in The Dalles, Oregon

26 Review: Hallmarks of Computing 1956: FORTRAN heralded as the world’s first “high-level” language. 1966: Seymour Cray develops the CDC 6600, the first “supercomputer”. 1972: Seymour Cray founds Cray Research Inc. (CRI). 1976: Cray-1 marks the beginning of the Golden Age of supercomputing. 1986: Pittsburgh Supercomputing Center is founded. 1989: Cray-2 marks the end of the Golden Age of supercomputing. 1990s: MPPs are born (e.g. CM5, T3D, KSR1, etc.). 1998: Google Inc. is founded. 20??: Google achieves world domination; scientists still program in a “high-level” language they call FORTRAN.

27 Review: HPC High-Performance Computing (HPC) refers to a type of computation whereby a single, large transaction is spread across 100s to 1000s of processors. In general, this kind of computation is sensitive to network bandwidth and latency. Therefore, most modern-day “supercomputers” seek to maximize interconnect bandwidth and minimize interconnect latency within economic limits.

28 Gasoline: N-Body Treecode (order N log N; the naïve algorithm is order N²). Began development in 1994…and continues to this day. Per-PE kd-tree (a subset of the Binary Space Partitioning tree).
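
A rough sketch of the treecode idea, with invented field names rather than Gasoline’s actual data structures: distant groups of particles are collapsed into a single cell, and a Barnes-Hut-style opening test decides when that approximation is acceptable, which is what cuts the naïve O(N²) sum down to O(N log N).

```c
/* Illustrative kd-tree cell for an O(N log N) treecode.
 * Field names are hypothetical, not Gasoline's actual layout. */
typedef struct Cell {
    double center[3];           /* center of mass of particles in this cell */
    double mass;                /* total mass of the cell                    */
    double size;                /* linear extent of the cell                 */
    struct Cell *left, *right;  /* NULL for leaf cells                       */
    long   first, count;        /* particle range covered by a leaf          */
} Cell;

/* Barnes-Hut-style opening test: if the cell subtends a small enough
 * angle as seen from the evaluation point, its particles are replaced
 * by a single monopole; otherwise the children are visited. */
static int must_open(const Cell *c, const double pos[3], double theta)
{
    double dx = c->center[0] - pos[0];
    double dy = c->center[1] - pos[1];
    double dz = c->center[2] - pos[2];
    double r2 = dx*dx + dy*dy + dz*dz;
    return c->size * c->size > theta * theta * r2;
}
```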

29 Example HPC Application: Cosmological N-Body Simulation

30 Cosmological N-Body Simulation PROBLEM: Everything in the Universe attracts everything else. The dataset is far too large to replicate in every PE’s memory. Difficult to parallelize.

31 Cosmological N-Body Simulation PROBLEM: Everything in the Universe attracts everything else. The dataset is far too large to replicate in every PE’s memory. Difficult to parallelize: only 1 in 3000 memory fetches can result in an off-processor message being sent!

32 Characteristics of HPC Codes Large dataset: data must be distributed across many compute nodes. The MPP memory hierarchy (approximate access costs): processor registers; L1 cache, ~2 cycles; L2 cache, ~10 cycles; main memory, ~100 cycles; off-processor memory, ~300,000 cycles! (Figure: an N-body cosmology simulation domain-decomposed across processors 0-8.)

33 Features Advanced interprocessor data caching Application data is organized into cache-lines Read cache: Requests for off-PE data result in fetching of “cache line” Cache line is stored locally and used for future requests Write cache: Updates to off-PE data are processed locally, then flushed to remote thread when necessary < 1 in 100,000 off-PE requests actually result in communication.
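
A hedged sketch of how such a software read cache might look. The names and the direct-mapped layout are invented for illustration, and fetch_remote_line stands in for whatever machine-dependent call actually moves the data; only a cache miss reaches the network, which is why fewer than 1 in 100,000 off-PE requests result in communication.

```c
#include <stddef.h>

/* Hypothetical software read cache for off-PE data (illustrative only,
 * not the actual MDL implementation).  Assumes elements of at most
 * 64 bytes so a line fits in the fixed buffer below. */
#define LINE_ELEMS 64
#define NUM_LINES  4096

typedef struct {
    int  valid, pe;
    long base;                      /* first element index held in the line */
    char data[LINE_ELEMS * 64];     /* locally stored copy of the line      */
} CacheLine;

static CacheLine cache[NUM_LINES];

/* Stand-in for the machine-dependent communication call (e.g. an MPI
 * or SHMEM get); not a real library function. */
extern void fetch_remote_line(int pe, long base, void *buf, size_t bytes);

const void *cached_read(int pe, long index, size_t elem_size)
{
    long base = index - (index % LINE_ELEMS);
    CacheLine *line = &cache[(pe * 31 + base / LINE_ELEMS) % NUM_LINES];
    if (!line->valid || line->pe != pe || line->base != base) {
        /* miss: this is the only path that generates communication */
        fetch_remote_line(pe, base, line->data, LINE_ELEMS * elem_size);
        line->valid = 1; line->pe = pe; line->base = base;
    }
    return line->data + (index - base) * elem_size;
}
```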

34 Features Load Balancing The amount of work each particle required for step t is tracked. This information is used to distribute work evenly amongst processors for step t+1
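
One way the per-particle cost accounting could drive the redistribution, sketched with invented names: costs measured at step t are summed and the particle list is cut so each processor receives roughly equal total cost at step t+1. Real codes cut along spatial domain boundaries rather than a flat list; this only shows the accounting.

```c
/* Illustrative cost-based load balancing.  split[] has nproc+1 entries:
 * processor p owns particles [split[p], split[p+1]). */
typedef struct { long id; double cost; } Work;

void assign_splits(const Work *w, long n, int nproc, long *split)
{
    double total = 0.0, acc = 0.0;
    for (long i = 0; i < n; i++) total += w[i].cost;
    double target = total / nproc;   /* equal share of measured work */

    int p = 0;
    split[0] = 0;
    for (long i = 0; i < n && p < nproc - 1; i++) {
        acc += w[i].cost;
        if (acc >= target * (p + 1))
            split[++p] = i + 1;      /* next processor's range starts here */
    }
    while (p < nproc) split[++p] = n;  /* close any remaining ranges */
}
```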

35 Performance 85% linearity on 512 PEs with pure MPI (Cray XT3) 92% linearity on 512 PEs with one-sided comms (Cray T3E Shmem) 92% linearity on 2048 PEs on Cray XT3 for optimal problem size (>100,000 particles per processor)

36 Features Portability Interprocessor communication occurs by high-level requests to the “Machine-Dependent Layer” (MDL), only 800 lines of code per architecture. The MDL is rewritten to take advantage of each parallel architecture (e.g. one-sided communication): MPI-1, POSIX Threads, SHMEM, Quadrics, & more. (Figure: parallel GASOLINE threads communicating through their MDL layers.)
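
The MDL interface might be imagined along these lines. The declarations below are illustrative guesses at the kind of high-level requests involved, not the actual MDL API; only the code behind them changes from one architecture to the next.

```c
#include <stddef.h>

/* Hypothetical sketch of a Machine-Dependent Layer (MDL) interface.
 * The application only ever calls these requests; the implementation
 * behind them is rewritten per architecture (MPI-1, POSIX threads,
 * SHMEM, Quadrics, ...).  Names are invented for illustration. */
typedef struct mdl *MDL;

MDL   mdl_init(int *argc, char ***argv);   /* start all parallel threads   */
void  mdl_finish(MDL mdl);
int   mdl_self(MDL mdl);                   /* my processing element number */
int   mdl_threads(MDL mdl);                /* total number of PEs          */

/* Register a local array so that remote PEs can cache lines from it,
 * then fetch individual elements through the read cache. */
void  mdl_cache_open(MDL mdl, int cid, void *base, size_t elem, long n);
void *mdl_cache_fetch(MDL mdl, int cid, int pe, long index);
void  mdl_cache_close(MDL mdl, int cid);
```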

37 Applications Galaxy Formation (10 million particles)

38 Applications Solar System Planet Formation (1 million particles)

39 Applications Asteroid Collisions (2000 particles)

40 Applications Piles of Sand (?!) (~1000 particles)

41 Summary N-Body simulations are difficult to parallelize: gravity says everything interacts with everything else. GASOLINE achieves high scalability by using several beneficial concepts: interprocessor data caching for both reads and writes, maximal exploitation of any parallel architecture, and load balancing on a per-particle basis. GASOLINE has proved useful for a wide range of applications that simulate particle interactions. Its flexible client-server architecture aids in porting to new science domains.

42 Part III: The Future Turning a huge pile of processors into something that scientists can actually use.

43 How to turn simulation output into scientific knowledge Using 300 processors (circa 1996): Step 1: Run simulation. Step 2: Analyze simulation on a workstation. Step 3: Extract meaningful scientific knowledge. (Happy scientist.)

44 How to turn simulation output into scientific knowledge Using 1000 processors (circa 2000): Step 1: Run simulation. Step 2: Analyze simulation on a server. Step 3: Extract meaningful scientific knowledge. (Happy scientist.)

45 How to turn simulation output into scientific knowledge Using 2000+ processors (circa 2005): Step 1: Run simulation. Step 2: Analyze simulation on ??? (Unhappy scientist.)

46 How to turn simulation output into scientific knowledge Using 100,000 processors? (circa 2012): Step 1: Run simulation. Step 2: Analyze simulation on ??? The NSF has announced that it will provide $200 million to build and operate a petaflop machine by 2012.

47 Turning TeraFlops into Scientific Understanding Problem: The size of simulations is no longer limited by the scalability of the simulation code, but by the scientists’ inability to process the resultant data.

48 Turning TeraFlops into Scientific Understanding As MPPs increase in processor count, analysis tools must also run on MPPs! PROBLEM: 1. Scientists usually write their own analysis programs. 2. Parallel programs are hard to write! The HPC world is dominated by simulations: code is often reused for many years by many people, so you can afford to spend lots of time writing it. Example: Gasoline required 10 FTE-years of development!

49 Turning TeraFlops into Scientific Understanding Data analysis implies rapidly changing scientific inquiries and much less code reuse. Data analysis requires rapid algorithm development! We need to rethink how we as scientists interact with our data!

50 A Solution(?): N tropy Scientists tend to write their own code, so give them something that makes that easier. Build a framework that is sophisticated enough to take care of all of the parallel bits for you, yet flexible enough to be used for a large variety of data analysis applications.

51 N tropy: A framework for multiprocessor development GOAL: Minimize development time for parallel applications. GOAL: Enable scientists with no parallel programming background (or time to learn) to still implement their algorithms in parallel by writing only serial code. GOAL: Provide seamless scalability from single processor machines to MPPs…potentially even several MPPs in a computational Grid. GOAL: Do not restrict inquiry space.

52 Methodology Limited data structures: astronomy deals with point-like data in an N-dimensional parameter space, and the most efficient methods on this kind of data use trees. Limited methods: analysis methods perform a limited number of fundamental operations on these data structures.

53 N tropy Design GASOLINE already provides a number of advanced services GASOLINE benefits to keep: Flexible client-server scheduling architecture Threads respond to service requests issued by master. To do a new task, simply add a new service. Portability Interprocessor communication occurs by high-level requests to “Machine-Dependent Layer” (MDL) which is rewritten to take advantage of each parallel architecture. Advanced interprocessor data caching < 1 in 100,000 off-PE requests actually result in communication.

54 N tropy Design Dynamic load balancing (available now): workload and processor domain boundaries can be dynamically reallocated as computation progresses. Data pre-fetching (to be implemented): predict and request off-PE data that will be needed for upcoming tree nodes.

55 N tropy Design Computing across grid nodes is much more difficult than computing between nodes on a tightly-coupled parallel machine: network latencies between grid resources are 1000 times higher than between nodes on a single parallel machine. Nodes on far grid resources must be treated differently than the processor next door: data mirroring or aggressive prefetching, plus sophisticated workload management and synchronization.

56 N tropy Features By using N tropy you will get a lot of features “for free”: Tree objects and methods Highly optimized and flexible Automatic parallelization and scalability You only write serial bits of code! Portability Interprocessor communication occurs by high-level requests to “Machine-Dependent Layer” (MDL) which is rewritten to take advantage of each parallel architecture. MPI, ccNUMA, Cray XT3, Quadrics Elan (PSC TCS), SGI Altix
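
To illustrate “you only write serial bits of code”, here is a hypothetical pair-counting callback: the user writes the serial test, and a framework entry point (the name is invented, not the actual N tropy API) handles tree traversal, off-PE fetches, and the cross-PE reduction.

```c
/* Hypothetical illustration of the "write only serial code" idea.
 * Type and function names are invented for illustration. */
typedef struct { double r_min, r_max; long pair_count; } PairQuery;

/* Serial user code: decide whether a particle pair falls in the bin. */
static void count_pair(void *query, const double *p1, const double *p2)
{
    PairQuery *q = query;
    double dx = p1[0]-p2[0], dy = p1[1]-p2[1], dz = p1[2]-p2[2];
    double r2 = dx*dx + dy*dy + dz*dz;
    if (r2 >= q->r_min*q->r_min && r2 < q->r_max*q->r_max)
        q->pair_count++;
}

/* Framework-supplied entry point (hypothetical): walks the distributed
 * kd-tree, calls the serial callback for candidate pairs, and sums
 * pair_count across all PEs before returning. */
extern long framework_pair_search(void *query, double search_radius,
                                  void (*visit)(void *, const double *,
                                                const double *));
```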

57 N tropy Features By using N tropy you will get a lot of features “for free”: Collectives AllToAll, AllGather, AllReduce, etc. Automatic reduction variables All of your routines can return scalars to be reduced across all processors Timers 4 automatic N tropy timers 10 custom timers Automatic communication and I/O statistics Quickly identify bottlenecks
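
Underneath, an “automatic reduction variable” amounts to a standard collective. Below is a plain MPI example of the kind of call a framework would issue on the user’s behalf; it is shown for illustration, not as N tropy’s internals.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = (double)rank;   /* stand-in for a per-PE partial result */
    double total = 0.0;

    /* Combine every PE's scalar into one value, visible on all PEs. */
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d PEs = %g\n", size, total);
    MPI_Finalize();
    return 0;
}
```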

58 Serial Performance N tropy vs. an existing serial n-point correlation function calculator: N tropy is 6 to 30 times faster in serial! Conclusions: 1. Not only does it take much less time to write an application using N tropy, 2. your application may run faster than if you wrote it from scratch!

59 Performance 10 million particles Spatial 3-Point 3->4 Mpc This problem is substantially harder than gravity! 3 FTE months of development time!

60 N tropy “Meaningful” Benchmarks The purpose of this framework is to minimize development time! Development time for: 1. N-point correlation function calculator 3 months 2. Friends-of-Friends group finder 3 weeks 3. N-body gravity code 1 day!* *(OK, I cheated a bit and used existing serial N-body code fragments)

61 N tropy Conceptual Schematic (schematic): a Computational Steering Layer (C, C++, Python, Fortran?) and a Web Service Layer (at least from Python; VO, WSDL?, SOAP?) sit on top of the framework (“black box”), which provides domain decomposition/tree building, tree traversal, parallel I/O, tree services, collectives, and dynamic workload management. The user supplies serial collective staging and processing routines, serial I/O routines, tree traversal routines, and the tree and particle data.

62 Summary Prehistoric times: FORTRAN is heralded as the first “high-level” language. Ancient times: Scientists run on serial supercomputers. Scientists write many programs for them. Scientists are happy. Early 1990s: MPPs are born. Scientists scratch their heads and figure out how to parallelize their algorithms. Mid 1990s: Scientists start writing scalable code for MPPs. After much effort, scientists are kind of happy again. Early 2000s: Scientists no longer run their simulations on the biggest MPPs because they cannot analyze the output. Scientists are seriously bummed. 20??: Google achieves world domination; scientists still program in a “high-level” language they call FORTRAN.

63 Summary N tropy is an attempt to allow scientists to rapidly develop their analysis codes for a multiprocessor environment. Our results so far show that it is worthwhile to invest time developing individual frameworks that are: 1. serially optimized, 2. scalable, and 3. flexible enough to be customized to many different applications, even applications that you do not currently envision. Is this a solution for the 100,000-processor world of tomorrow??

64 Pittsburgh Supercomputing Center Founded in 1986 Joint venture between Carnegie Mellon University, University of Pittsburgh, and Westinghouse Electric Co. Funded by several federal agencies as well as private industries. Main source of support is National Science Foundation, Office of Cyberinfrastructure

65 Pittsburgh Supercomputing Center PSC is the third largest NSF-sponsored supercomputing center, BUT we provide over 60% of the computer time used by NSF research, AND PSC is the only academic supercomputing center in the U.S. to have had the most powerful supercomputer in the world (for unclassified research).

66 Pittsburgh Supercomputing Center GOAL: To use cutting edge computer technology to do science that would not otherwise be possible

67 Conclusions Most data analysis in astronomy is done using trees as the fundamental data structure. Most operations on these tree structures are functionally identical. Based on our studies so far, it appears feasible to construct a general purpose multiprocessor framework that users can rapidly customize to their needs.

68 Cosmological N-Body Simulation Timings: Time required for 1 floating point operation: 0.25 ns. Time required for 1 memory fetch: ~10 ns (40 floats). Time required for 1 off-processor fetch: ~10 µs (40,000 floats). Lesson: Only 1 in 1000 memory fetches can result in network activity!

69 The very first “Super Computer” 1929: New York World newspaper coins the term “super computer” when talking about a giant tabulator custom-built by IBM for Columbia University

70 Review: Hallmarks of Computing 1956: FORTRAN heralded as the world’s first “high-level” language. 1966: Seymour Cray develops the CDC 6600, the first “supercomputer”. 1972: Seymour Cray founds Cray Research Inc. (CRI). 1976: Cray-1 marks the beginning of the Golden Age of supercomputing. 1986: Pittsburgh Supercomputing Center is founded. 1989: Cray-2 marks the end of the Golden Age of supercomputing; Seymour Cray leaves CRI and founds Cray Computer Corp. (CCC). 1990s: MPPs are born (e.g. CM5, T3D, KSR1, etc.). 1995: Cray Computer Corporation (CCC) goes bankrupt. 1996: Cray Research Inc. acquired by SGI. 1998: Google Inc. is founded. 20??: Google achieves world domination; scientists still program in a “high-level” language they call FORTRAN.

71 The T3D MPP 1024 DEC Alpha processors (COTS). 128 MB of RAM per processor (COTS). Cray custom-built network fabric ($$$). (Pictured: a 1024-processor Cray T3D in 1994.)

72 General characteristics of MPPs COTS processors, COTS memory subsystem, Linux-based kernel, custom networking. Why?? Custom networking in MPPs has replaced the custom memory systems of vector machines. (Pictured: the 2068-processor Cray XT3 at PSC in 2006.)

73 Example Science Applications: Weather Prediction Looking for Tornados (credits: PSC, Center for Analysis and Prediction of Storms)

74 Reasons for being sensitive to communication latency A given processor (PE) may “touch” a very large subsample of the total dataset Example: self-gravitating system PEs must exchange information many times during a single transaction Example: along domain boundaries of a fluid calculation

75 Features Flexible client-server scheduling architecture: threads respond to service requests issued by the master. To do a new task, simply add a new service. Computational steering involves trivial serial programming.
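
A minimal sketch of the client-server scheme, with invented names for the communication calls: worker PEs block in a handler loop, the master sends (service id, input) requests, and adding a new analysis task is just one more entry in the service table.

```c
/* Sketch of client-server service dispatch (names hypothetical). */
#define MAX_SERVICES 64

typedef void (*ServiceFn)(void *in, int nin, void *out, int *nout);

static ServiceFn service_table[MAX_SERVICES];

void add_service(int sid, ServiceFn fn) { service_table[sid] = fn; }

/* Stand-ins for the machine-dependent communication calls;
 * these are not real library functions. */
extern int  recv_request(void *in, int *nin);       /* returns service id */
extern void send_reply(const void *out, int nout);

void worker_loop(void)
{
    char in[4096], out[4096];
    int nin, nout, sid;
    for (;;) {
        sid = recv_request(in, &nin);
        if (sid < 0) break;                  /* master says: shut down */
        nout = 0;
        service_table[sid](in, nin, out, &nout);
        send_reply(out, nout);
    }
}
```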

76 Design Gasoline functional layout: Computational Steering Layer (serial; executes on the master processor only); Parallel Management Layer (coordinates execution and data distribution among processors); Serial Layer containing the Gravity Calculator and Hydro Calculator (executes “independently” on all processors); Machine-Dependent Layer (MDL) (interprocessor communication).

77 Cosmological N-Body Simulation SCIENCE: Simulate how structure in the Universe forms from initial linear density fluctuations: 1. Linear fluctuations in the early Universe are supplied by cosmological theory. 2. Calculate the non-linear final states of these fluctuations. 3. See if these look anything like the real Universe. 4. No? Go to step 1.

