Computing Environment

The computing environment is rapidly evolving. You need to know not only the methods, but also:
- How and when to apply them
- Which computers to use
- What type of code to write
- What CPU time and memory requirements your jobs will have
- What tools (e.g., visualization software) to use to analyze the data
Definitions – Clock Cycles

A computer chip operates at discrete intervals called clock cycles, measured either as a period in nanoseconds (ns) or as a frequency in megahertz (MHz):
- 500 MHz (Pentium III) -> 2 ns
- 100 MHz (Cray J90) -> 10 ns
- It may take 4 clocks to do one multiplication
- It may take 30 clocks to start a procedure
- It may take 2 clocks to access memory
- MHz is not the only measure of speed
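As a quick check of the numbers above, the clock period is simply the reciprocal of the clock frequency:

    T = 1 / f
    T = 1 / (500 x 10^6 Hz) = 2 x 10^-9 s = 2 ns   (Pentium III)
    T = 1 / (100 x 10^6 Hz) = 10 x 10^-9 s = 10 ns (Cray J90)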
Definitions – FLOPS

Floating-point Operations Per Second; Mflops = million FLOPS. A good measure of code performance: typically one add is one flop, and one multiplication is also one flop.
- Cray J90 peak = 200 Mflops; most codes achieve only about 1/3 of peak
- Cray T90 peak = 3.2 Gflops
- Earth Simulator (NEC SX-5) = 8 Gflops
- Fastest workstation processor (DEC Alpha) ~ 1 Gflops
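A minimal sketch of how one might estimate the sustained Mflops of a loop. This assumes the Fortran 95 cpu_time intrinsic is available; the arrays, loop, and flop count are illustrative, not a standard benchmark:

      program mflops
c     Rough estimate of sustained Mflops for a simple loop.
      implicit none
      integer n, i
      parameter (n = 1000000)
      real a(n), b(n), c(n), t1, t2, rate
      do 10 i = 1, n
         a(i) = 1.0
         b(i) = 2.0
 10   continue
      call cpu_time(t1)
      do 20 i = 1, n
c        one multiply + one add = 2 flops per iteration
         c(i) = a(i)*b(i) + a(i)
 20   continue
      call cpu_time(t2)
      rate = (2.0*n) / (t2 - t1) / 1.0e6
c     printing c(1) keeps the compiler from deleting the loop
      print *, 'approx. Mflops =', rate, ' c(1) =', c(1)
      end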
MIPS

Millions of Instructions Per Second: also a measure of computer speed, used mostly in the old days.
Bandwidth

The speed at which data flow across a network or wire:
- 56K modem = 56 kilobits/sec
- T1 link = 1.5 Mbits/sec
- T3 link = 45 Mbits/sec
- FDDI = 100 Mbits/sec
- Fibre Channel = 800 Mbits/sec
- 100BaseT (Fast) Ethernet = 100 Mbits/sec
- Brain system = 3 Gbits/sec
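Bandwidth translates directly into transfer time. For example, moving a 1-gigabyte file over 100 Mbits/sec Fast Ethernet takes roughly

    time = size / bandwidth = (8 x 10^9 bits) / (10^8 bits/sec) = 80 seconds

ignoring protocol overhead, which in practice lowers the effective rate.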
Hardware Evolution
- Mainframe computers
- Supercomputers
- Workstations
- Microcomputers / personal computers
- Desktop supercomputers
- Workstation super clusters
- Handheld, palmtop, calculators, et al.
Types of Processors
- Scalar (serial): one operation per clock cycle
- Vector: multiple (tens to hundreds of) operations per clock cycle, typically achieved at the loop level, where the instructions are the same or similar for each loop index (see the loop sketch below)
- Superscalar: several instructions per clock cycle
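A minimal sketch of the kind of loop a vector processor pipelines well: every iteration performs the same independent operation, so the hardware can stream through many elements per clock (the array names are illustrative):

      do 10 i = 1, n
c        same operation every iteration, no dependence
c        between iterations: vectorizable
         y(i) = a*x(i) + y(i)
 10   continue

By contrast, a loop with a dependence between iterations, e.g. y(i) = y(i-1) + x(i), cannot be vectorized this way.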
Types of Computer Systems
- Single-processor scalar (e.g., ENIAC, IBM 704, IBM PC)
- Single-processor vector (CDC 7600, Cray-1)
- Multi-processor vector (e.g., Cray XMP, Cray C90, Cray J90, NEC SX-5)
- Single-processor superscalar (IBM RS/6000, such as Bluesky)
- Multi-processor scalar (e.g., multi-processor Pentium PC)
- Multi-processor superscalar (e.g., DEC Alpha based Cray T3E, RS/6000 based IBM SP-2, SGI Origin 2000)
- Clusters of the above (e.g., Linux clusters; Earth Simulator, a cluster of multiple vector-processor nodes)
Memory Architectures

Shared Memory Systems
- Memory can be accessed and addressed uniformly by all processors
- Fast/expensive CPU, memory, and networks
- Easy to use
- Difficult to scale to many (> 32) processors

Distributed Memory Systems
- Each processor has its own memory; others can access its memory only via network communications
- Often off-the-shelf components, therefore low cost
- Hard to use: explicit user specification of communications often needed (see the message-passing sketch below)
- Single CPU slow; not suitable for inherently serial codes
- High scalability: the largest current system has nearly 10K processors
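On distributed-memory systems the programmer typically moves data between processors explicitly, most commonly with MPI. A minimal sketch (the buffer size and message tag are arbitrary choices for illustration):

      program exchange
c     Each processor owns its memory; data moves via messages.
      include 'mpif.h'
      integer ierr, rank, i, status(MPI_STATUS_SIZE)
      real buf(100)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      if (rank .eq. 0) then
         do 10 i = 1, 100
            buf(i) = real(i)
 10      continue
c        processor 0 sends its local array to processor 1
         call MPI_SEND(buf, 100, MPI_REAL, 1, 99,
     &                 MPI_COMM_WORLD, ierr)
      else if (rank .eq. 1) then
c        processor 1 cannot address rank 0's memory directly;
c        it must receive the data over the network
         call MPI_RECV(buf, 100, MPI_REAL, 0, 99,
     &                 MPI_COMM_WORLD, status, ierr)
      end if
      call MPI_FINALIZE(ierr)
      end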
Memory Architectures

Multi-level memory (cache and main memory) architectures:
- Cache: fast and expensive memory
- Typical L1 cache size in current-day microprocessors: ~32 KB
- L2 size: ~256 KB to 8 MB
- Main memory: a few MB to many GB
- Try to reuse the contents of cache as much as possible before they are replaced by new data or instructions (see the blocking sketch below)
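One common way to reuse cache contents is loop blocking (tiling). A minimal sketch for a matrix transpose, where the unblocked version accesses one of the arrays with stride n and wastes cache lines; the block size nb is an illustrative assumption that in practice is tuned to the cache size:

      parameter (n = 2048, nb = 64)
      real a(n,n), b(n,n)
c     work on nb x nb tiles so both the tile of a and the
c     tile of b stay in cache while they are being used
      do 40 jj = 1, n, nb
         do 30 ii = 1, n, nb
            do 20 j = jj, min(jj+nb-1, n)
               do 10 i = ii, min(ii+nb-1, n)
                  b(j,i) = a(i,j)
 10            continue
 20         continue
 30      continue
 40   continue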
Issues with Parallel Computing

Load balance / synchronization
- Try to give an equal amount of work to each processor
- Try to give processors that finish first more work to do (load rebalancing)
- The goal is to keep all processors as busy as possible

Communication / locality
- Inter-processor communication is typically the biggest overhead on MPP platforms, because the network is slow relative to CPU speed
- Try to keep data access local
- E.g., a 2nd-order finite difference requires data at 3 points; a 4th-order finite difference requires data at 5 points (see the stencil sketch below)
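The locality point can be made concrete with the standard centered-difference stencils for df/dx (textbook formulas, shown as Fortran fragments; the array and variable names are illustrative):

c     2nd-order centered difference: 3-point stencil
      dfdx(i) = (f(i+1) - f(i-1)) / (2.0*dx)
c     4th-order centered difference: 5-point stencil
      dfdx(i) = (8.0*(f(i+1) - f(i-1)) - (f(i+2) - f(i-2)))
     &          / (12.0*dx)

On a distributed-memory machine the 4th-order stencil needs two halo points from each neighboring processor instead of one, so the higher accuracy costs more communication.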
A Few Simple Rules for Writing Efficient Code

- Use multiplies instead of divides whenever possible (see the sketch below)
- Make the innermost loop the longest

  Slower loop:
      do 100 i = 1, 1000
         do 10 j = 1, 10
            a(i,j) = ...
 10      continue
100   continue

  Faster loop:
      do 100 j = 1, 10
         do 10 i = 1, 1000
            a(i,j) = ...
 10      continue
100   continue

- For a short loop like DO i = 1, 3, write out the associated expressions explicitly, since the startup cost may be very high
- Avoid complicated logic (IFs) inside DO loops
- Avoid subroutine and function calls inside DO loops
- Vectorizable code typically also runs faster on RISC-based superscalar processors
- Keep it simple.
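A minimal sketch of the multiply-instead-of-divide rule: a divide typically costs several times as many clocks as a multiply, so hoist the reciprocal out of the loop (variable names are illustrative):

c     slower: one divide per iteration
      do 10 i = 1, n
         a(i) = b(i) / c
 10   continue

c     faster: one divide total, then multiplies
      cinv = 1.0 / c
      do 20 i = 1, n
         a(i) = b(i) * cinv
 20   continue

Note the two forms can differ in the last bit due to rounding, so some compilers only make this transformation when aggressive optimization flags allow it.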
Transition in Computing Architectures

[Figure: chart of major NCAR SCD computers from the 1960s onward, along with the sustained gigaflops (billions of floating-point calculations per second) attained by the SCD machines from 1986 to the end of the fiscal year. Arrows at right denote the machines that will be operating at the start of FY00. The division is aiming to bring its collective computing power to 100 Gflops by the end of FY00, 200 Gflops in FY01, and 1 teraflop by FY03.]