Computing Environment
The computing environment is rapidly evolving. You need to know not only the numerical methods, but also:
- How and when to apply them
- Which computers to use
- What type of code to write
- What CPU time and memory your jobs will need
- What tools (e.g., visualization software) to use to analyze the output data
In short, how to make the most effective use of available computing resources. For time-critical, real-time applications such as numerical weather prediction (NWP), you want to choose and implement your numerical algorithms to obtain the most accurate solution on the best computer platform available.
Typical PC chipset/motherboard components
Definitions – Clock Cycles, Clock Speed
The clock rate is the fundamental rate, in cycles per second (measured in hertz), at which a computer performs its most basic operations, such as adding two numbers or transferring a value from one processor register to another. Clock rate is usually quoted in megahertz or gigahertz; the corresponding cycle time is often quoted in nanoseconds (ns). Clock rate/clock speed is basically the 'heart beat' rate of a CPU.
Different chips on the motherboard may have different clock rates; usually, when referring to a computer, the term "clock rate" refers to the speed of the CPU. The clock rate of a CPU is normally determined by the frequency of an oscillator crystal.
The first commercial PC, the Altair (by MITS), used an Intel 8080 CPU with a clock rate of 2 MHz. The original IBM PC (1981) had a clock rate of 4.77 MHz (4,770,000 cycles/second). In 1995, Intel's Pentium chip ran at 100 MHz (100 million cycles/second), and in 2002, an Intel Pentium 4 model was introduced as the first CPU with a clock rate of 3 GHz (three billion cycles/second). Today's desktop PC or smartphone CPUs typically run at 1-3 GHz, but they usually contain multiple CPU cores.
Significance of Clock Speed
Clock speed is not everything:
- A CPU may take several clock cycles to do one multiplication, as early-generation CPUs often did
- A CPU may perform many operations per clock cycle, as most of today's CPUs do
- Memory access also takes time, not just computation
MHz is not the only measure of CPU speed; different CPUs at the same MHz often differ in speed. E.g., an older-generation Intel CPU used in PCs is generally slower than a newer-generation CPU at the same clock speed.
The clock rate of the computer's front-side bus (the bus carrying data between the CPU and the memory controller hub), the clock rate of the RAM, the bandwidth of the CPU's memory bus, and the amount of Level 1, Level 2 and Level 3 cache also affect a computer's speed. Other factors include video card speed (for visualization) and disk drive speed (for I/O-intensive jobs).
Newer processors often have more cores, larger cache memory, and the ability to do more processing per clock cycle. The A10 chip used in the iPhone 7 is as fast as Intel i7 CPUs used in some laptops (see https://www.theverge.com/2016/9/16/12939310/iphone-7-a10-fusion-processor-apple-intel-future).
FLOPS (FLoating-point Operations Per Second)
FLOPS is a measure of computer performance, especially in fields of scientific computing that make heavy use of floating-point calculations. One can calculate theoretical peak FLOPS using this equation:

peak FLOPS = (number of cores) × (clock rate in cycles/second) × (FLOPs per clock cycle)

If a microprocessor can do 4 FLOPs per clock cycle, then a single-core 2.5 GHz processor has a theoretical peak performance of 10 billion FLOPS = 10 GFLOPS.
This equation ignores limits imposed by memory bandwidth and all other sources of overhead, so in the real world you will never get actual performance anywhere near what it predicts. A true, reliable FLOPS rating is therefore not determined by theoretical calculations such as this one; instead, it is measured by benchmarks of actual performance/throughput. The Specbench rating (http://www.specbench.org) is based on real-world problems.
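As a sanity check of the equation above, here is a minimal sketch in Python using the example numbers from the text:

```python
# Theoretical peak FLOPS = cores x clock rate (Hz) x FLOPs per clock cycle.
# Example values from the text: a single-core 2.5 GHz CPU doing 4 FLOPs/cycle.

def peak_flops(cores, clock_hz, flops_per_cycle):
    """Upper bound on floating-point throughput; ignores memory bandwidth
    and all other overheads, so real codes achieve far less."""
    return cores * clock_hz * flops_per_cycle

print(peak_flops(1, 2.5e9, 4) / 1e9, "GFLOPS")  # -> 10.0 GFLOPS
```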
Definitions – FLOPS
Floating-point Operations Per Second:
- Megaflops – million (10^6) FLOPS
- Gigaflops – billion (10^9) FLOPS
- Teraflops – trillion (10^12) FLOPS
- Petaflops – quadrillion (10^15) FLOPS
- Exaflops – quintillion (10^18) FLOPS
The fastest computer systems as of today can achieve tens of petaflops (see http://www.top500.org). FLOPS is a good measure of code performance – typically one addition is one FLOP, and one multiplication is also one FLOP. The fastest US-made vector supercomputer CPU, the Cray T90, had a peak of 3.2 Gflops. The theoretical peak speed of all the processors (a total of over 10,000 CPU cores) on the current OSCER supercomputer Schooner is about 345 Teraflops. See http://www.oscer.ou.edu/hardsoft_dell_cluster_sandybridge_boomer.php. See http://www.specbench.org for the latest benchmarks of processors on real-world problems; Specbench numbers are relative.
Memory Architectures
Multi-level memory (cache and main memory) architectures:
- Cache – fast and expensive memory. Typical L1 cache size in current-day microprocessors is ~64 KB; L2 size is ~256 KB to 16 MB (the Pentium 4 has 512 KB to 2 MB). Many newer processors have the L2 cache on the CPU die as well.
- Main memory – a few MB to many GB.
- Disk, CD-ROM, etc. – the slowest forms of memory, but still direct access.
- Tapes – even slower, serial access only.
Try to reuse the content of the cache as much as possible before it is replaced by new data or instructions.
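The advice to reuse cache contents can be seen directly in code. Below is a minimal sketch, assuming NumPy is available; the array size is an arbitrary example value. Traversing a row-major array along its memory layout reuses each loaded cache line fully, while traversing across it does not.

```python
import time
import numpy as np

n = 4000
a = np.zeros((n, n), dtype=np.float64)  # stored row-major (C order)

# Row-wise traversal walks memory contiguously: each cache line loaded
# from main memory is fully used before it is evicted.
t0 = time.perf_counter()
s = 0.0
for i in range(n):
    s += a[i, :].sum()
row_time = time.perf_counter() - t0

# Column-wise traversal jumps n*8 bytes between consecutive elements, so
# most of each cache line is wasted and the miss rate is far higher.
t0 = time.perf_counter()
s = 0.0
for j in range(n):
    s += a[:, j].sum()
col_time = time.perf_counter() - t0

print(f"row-wise: {row_time:.3f}s   column-wise: {col_time:.3f}s")
```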
Register
A special high-speed storage area within the CPU. All data must be represented in a register before it can be processed. For example, if two numbers are to be multiplied, both numbers must be in registers, and the result is also placed in a register. The number of registers that a CPU has and the size of each register (number of bits) help determine the power and speed of a CPU. For example, a 32-bit CPU is one in which each register is 32 bits wide; therefore, each CPU instruction can manipulate 32 bits of data. Usually, the movement of data in and out of registers is completely transparent to users, and even to programmers; only assembly language programs can manipulate registers directly. In high-level languages, the compiler is responsible for translating high-level operations into low-level operations that access registers.
Cache
Pronounced "cash", a special high-speed storage mechanism. Two types of caching are commonly used in PCs: memory caching and disk caching.
A memory cache, sometimes called a cache store or RAM cache, is a portion of memory made of high-speed static RAM (SRAM) instead of the slower and cheaper dynamic RAM (DRAM) used for main memory. Memory caching is effective because most programs access the same data or instructions over and over; by keeping as much of this information as possible in SRAM, the computer avoids accessing the slower DRAM. Some memory caches are built into the architecture of microprocessors; such internal caches are often called Level 1 (L1) caches. Most modern PCs also come with external cache memory (often also located on the CPU die), called Level 2 (L2) caches, which sit between the CPU and the DRAM. Like L1 caches, L2 caches are composed of SRAM but are much larger – Intel Core 2 processors have 2-4 MB caches. Some CPUs even have Level 3 caches. (Note that most latest-generation CPUs put L2 caches on the CPU die, running at the same clock rate as the CPU.)
Disk caching works on the same principle as memory caching, but instead of using high-speed SRAM, a disk cache uses conventional main memory. The most recently accessed data from the disk (as well as adjacent sectors) is stored in a memory buffer. When a program needs to access data from the disk, it first checks the disk cache to see if the data is there. Disk caching can dramatically improve the performance of applications, because accessing a byte of data in RAM can be thousands of times faster than accessing a byte on a hard disk. Today's hard drives often have 8-16 MB memory caches.
When data is found in the cache, it is called a cache hit (versus a cache miss), and the effectiveness of a cache is judged by its hit rate. Many cache systems use a technique known as smart caching, in which the system can recognize certain types of frequently used data. The strategies for determining which information should be kept in the cache constitute some of the more interesting problems in computer science.
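To make the idea of hit rate concrete, here is a toy simulation of an LRU (least-recently-used) cache in Python. The cache size, access pattern, and LRU policy are illustrative assumptions, not a model of any specific hardware.

```python
from collections import OrderedDict

def hit_rate(accesses, cache_size):
    """Simulate a tiny LRU cache and report its hit rate."""
    cache = OrderedDict()   # keys = cached addresses, oldest first
    hits = 0
    for addr in accesses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)       # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > cache_size:   # evict the least recently used
                cache.popitem(last=False)
    return hits / len(accesses)

# Programs mostly re-touch the same few addresses (locality of reference),
# so even a small cache achieves a high hit rate on a looping pattern.
pattern = [0, 1, 2, 3] * 1000 + list(range(4000))
print(f"hit rate: {hit_rate(pattern, cache_size=8):.1%}")
```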
Storage and Memory
Both refer to storage areas in the computer. The term memory identifies data storage that usually comes in the form of chips, while the word storage is used for memory that exists on tapes or disks. However, the difference is blurring, as solid-state drives (SSDs) store data in chips. Moreover, the term memory is usually used as shorthand for physical memory, which refers to the actual chips capable of holding data. Every computer comes with a certain amount of physical memory, usually referred to as main memory or RAM. Today's computers also come with hard disk drives that are used to store permanent data. The content of memory is usually lost when the computer is powered down (unless it is flash memory, as in your thumb drives or SD cards), but data on hard drives are not. The earliest PCs did not have hard drives – they stored data on, and booted from, floppy disks.
What's an Instruction?
- Load a value from a specific address in main memory into a specific register
- Store a value from a specific register into a specific address in main memory
- Add two specific registers together and put their sum in a specific register – or subtract, multiply, divide, square root, etc.
- Determine whether two registers both contain nonzero values ("AND")
- Jump from one sequence of instructions to another (branch)
- … and so on
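To see how such instructions combine into a computation, here is a toy register machine in Python. The mnemonics, register count, and instruction format are invented for illustration and do not correspond to any real instruction set.

```python
# A toy machine: 4 registers, a small "main memory", and a few of the
# instruction types listed above. The mnemonics are made up, not a real ISA.
regs = [0] * 4
memory = [10, 32, 0, 0]

program = [
    ("LOAD",  0, 0),     # R0 <- memory[0]
    ("LOAD",  1, 1),     # R1 <- memory[1]
    ("ADD",   2, 0, 1),  # R2 <- R0 + R1
    ("STORE", 2, 2),     # memory[2] <- R2
]

for instr in program:
    op = instr[0]
    if op == "LOAD":          # memory -> register
        _, r, addr = instr
        regs[r] = memory[addr]
    elif op == "STORE":       # register -> memory
        _, r, addr = instr
        memory[addr] = regs[r]
    elif op == "ADD":         # register + register -> register
        _, rd, ra, rb = instr
        regs[rd] = regs[ra] + regs[rb]

print(memory[2])  # -> 42
```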
Memory or Data Unit
A bit is a binary digit, taking a value of either 0 or 1. For example, the binary number 10010111 is 8 bits long, i.e., one byte on modern PCs (1 byte = 8 bits). Binary digits are a basic unit of information storage and communication in digital computing and digital information theory. Single-precision floating-point numbers on most computers consist of 32 bits, or 4 bytes.
Quantities of bits:
- kilobit (kb) – 10^3
- megabit (Mb) – 10^6
- gigabit (Gb) – 10^9
- terabit (Tb) – 10^12
- petabit (Pb) – 10^15
- exabit (Eb) – 10^18
- zettabit (Zb) – 10^21
- yottabit (Yb) – 10^24
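A quick worked example of this arithmetic: how much memory a single-precision model field occupies. The grid dimensions below are made-up example values.

```python
# One single-precision value = 32 bits = 4 bytes.
nx, ny, nz = 1000, 1000, 50    # hypothetical 3D model grid (example values)
bytes_per_value = 32 // 8      # 32-bit single precision -> 4 bytes

total_bytes = nx * ny * nz * bytes_per_value
print(f"{total_bytes / 1e9:.1f} GB per field")  # -> 0.2 GB per field
```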
Bandwidth
The speed at which data flow across a network or wire:
- 56K modem = 56 kbits/sec
- Fibre Channel = 800 Mbits/sec
- 100BaseT (fast) Ethernet = 100 Mbits/sec
- Gigabit Ethernet = 1000 Mbits/sec max
- Wireless 802.11g = 54 Mbits/sec max
- Wireless 802.11n = 248 Mbits/sec max
- Wireless 802.11ac = 1200 Mbits/sec max
- 4G wireless = up to 1 Gbit/sec
- USB 1.1 = 12 Mbits/sec, USB 2.0 = 480 Mbits/sec, USB 3.0 = 5 Gbits/sec
- Brain system = 3 Gbits/sec
Remember 1 byte = 8 bits. When you see kb/second, find out whether b means bit or byte!
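The bit-versus-byte distinction matters whenever you estimate transfer times. A minimal sketch, using the 100BaseT figure from the list above and an arbitrary 1 GB file size:

```python
# Transfer time = data size / bandwidth -- but watch bits vs bytes!
# A "100 Mb/s" link moves 100 million *bits* per second, i.e. 12.5 MB/s.
file_bytes = 1e9              # a 1 GB file (example value)
link_bits_per_sec = 100e6     # 100BaseT Ethernet, from the list above

seconds = file_bytes * 8 / link_bits_per_sec
print(f"{seconds:.0f} s")     # -> 80 s, not 10 s
```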
Central Processing Unit
Also called CPU or processor: the "brain". Parts:
- Control unit: figures out what to do next – e.g., whether to load data from memory, to add two values together, to store data into memory, or to decide which of two possible actions to perform (branching)
- Arithmetic/logic unit: performs calculations – e.g., adding, multiplying, checking whether two values are equal
- Registers: where data reside that are directly processed and stored by the CPU
Modern CPUs usually also contain on-die Level-1 and Level-2 caches, and sometimes a Level-3 cache too – very fast forms of memory.
Hardware Evolution
- Mainframe computers
- Vector supercomputers
- Workstations
- Microcomputers
- Personal computers
- Desktop supercomputers
- Workstation super clusters
- Supercomputer clusters
- Portables
- Handheld, palmtop, cell phones, smart phones, wearables, etc.
Types of Processors
- Scalar (serial): one operation per clock cycle.
- Vector: multiple operations per clock cycle, typically achieved at the loop level where the instructions are the same or similar for each loop index (SIMD) – e.g., multiplying two vectors of equal length element by element, as in the dot product of two vectors (see the sketch below).
- Superscalar (most of today's microprocessors): several operations per clock cycle.
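Here is a NumPy illustration (assuming NumPy is available) of the vector idea: the same multiply expressed once over whole arrays rather than per loop index, which compiled libraries and hardware can map onto SIMD instructions.

```python
import numpy as np

n = 1_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# Scalar style: one multiply per loop iteration, executed serially.
c_scalar = [a[i] * b[i] for i in range(n)]

# Vector style: the same multiply expressed over whole arrays; NumPy runs
# it in compiled code, where the hardware can apply SIMD instructions.
c_vector = a * b

# Dot product: element-wise multiply followed by a sum.
dot = np.dot(a, b)
```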
Types of Computer Systems
- Single-processor scalar (e.g., ENIAC, IBM 704, traditional IBM PC and Mac)
- Single-processor vector (CDC 7600, Cray-1)
- Multi-processor vector (e.g., Cray XMP, C90, J90, T90, NEC SX-5, SX-6)
- Single-processor superscalar (most desktop PCs 10 years ago)
- Multi-processor scalar (early multi-processor workstations)
- Multi-processor superscalar (e.g., Xeon workstations and multi-core PCs)
- Clusters of the above (e.g., the OSCER Linux cluster Schooner – essentially a cluster of Xeon Linux workstations, built from nodes with dual 10-core 2.3 GHz Intel Xeon "Haswell" E5-2650v3 CPUs)
ENIAC – World's first electronic computer (1946-1955)
The world's first electronic digital computer, the Electronic Numerical Integrator and Computer (ENIAC), was developed by Army Ordnance to compute World War II ballistic firing tables; it was completed in 1946. ENIAC's thirty separate units, plus power supply and forced-air cooling, weighed over thirty tons. It had 19,000 vacuum tubes, 1,500 relays, and hundreds of thousands of resistors, capacitors, and inductors, and consumed almost 200 kilowatts of electrical power. See https://en.wikipedia.org/wiki/ENIAC
ENIAC – World’s first electronic computer (1946-1955) From http://ftp.arl.army.mil/ftp/historic-computers
Top 10 Supercomputers as of June 2017 (top500.org)
NSF Track-2 supercomputers
NICS/U. of Tennessee, Kraken (~66K cores; ~100K cores by 10/2009): http://www.nics.tennessee.edu/computing-resources/kraken
From http://top500.org: As of June 2013 (and still as of July 2015), Tianhe-2, a supercomputer developed by China's National University of Defense Technology, was the world's No. 1 system, with a performance of 33.86 petaflop/s on the Linpack benchmark. Tianhe-2 has 16,000 nodes, each with two Intel Xeon IvyBridge processors and three Xeon Phi processors, for a combined total of 3,120,000 computing cores.
At No. 2 is Titan, a Cray XK7 system installed at the Department of Energy's (DOE) Oak Ridge National Laboratory. Titan, the top system in the United States and one of the most energy-efficient systems on the list, achieved 17.59 petaflop/s on the Linpack benchmark.
Top 10 Supercomputers as of July 2015
Memory Architectures
Shared Memory Parallel (SMP) systems:
- Memory can be accessed and addressed uniformly by all processors with no user intervention
- Usually use fast/expensive CPUs, memory, and networks
- Easier to use
- Difficult to scale to many (> 32) processors
- Emerging multi-core processors will surely share memory, and in some cases even cache
Distributed Memory Parallel (DMP) systems:
- Each processor has its own memory; others can access that memory only via network communications (see the sketch below)
- Often built from off-the-shelf components, therefore low cost
- Harder to use; explicit user specification of communications is often needed. Not suitable for inherently serial codes
- High scalability – the largest current systems have many thousands of processors (massively parallel systems)
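A minimal distributed-memory sketch, assuming the mpi4py package is installed; the data values and message tag are arbitrary. Each process owns its memory and shares data only through explicit messages.

```python
# Minimal distributed-memory example (assumes the mpi4py package).
# Each process has its own memory; data move only via explicit messages.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's ID within the communicator

if rank == 0:
    data = {"u": [1.0, 2.0, 3.0]}      # lives only in rank 0's memory
    comm.send(data, dest=1, tag=11)    # explicit communication to rank 1
elif rank == 1:
    data = comm.recv(source=0, tag=11)
    print("rank 1 received:", data)
```

Run with, e.g., `mpiexec -n 2 python demo.py`. On a shared-memory (SMP) system, by contrast, both processors could simply address the same data with no explicit communication.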
OSCER IBM Regatta p690 Specs (early 2000s)
- 32 POWER4 superscalar processors running at 1.1 GHz
- 32 GB shared memory; distributed-memory parallelization is also supported via message passing (MPI)
- 3 levels of cache: 32 KB L1 cache per CPU, 1.41 MB L2 cache shared by the two CPUs on the same die (dual-core processors), and 128 MB L3 cache shared by eight processors
- The peak performance of each processor is 4.4 GFLOPS (4 FLOPs per cycle)
IBM p-Series server (old Sooner) architecture – POWER4 CPU chip (from http://www.redbooks.ibm.com/pubs/pdfs/redbooks/sg247041.pdf)
The main components of the POWER4 chip are shown in Figure 2-1 (page 7) of the redbook. The POWER4 chip has a maximum of two microprocessors, each of which is a fully functional 64-bit implementation of the PowerPC AS Architecture specification. Also on the chip is a unified second-level cache, shared by both microprocessors through a core interface unit (CIU). The L2 cache is physically divided into three equal-sized parts, each having an L2 cache controller. The CIU connects each of the three L2 controllers to each processor through separate 32-byte-wide data reload and instruction reload ports. Each microprocessor also has an 8-byte-wide store port to the CIU that in turn is used to store data through the appropriate L2 controller. Each processor also has an associated non-cacheable unit (NCU), responsible for handling instruction-serializing functions and performing any non-cacheable operations in the storage hierarchy; logically, these are part of the L2 cache.
To improve performance by reducing the latency to memory, the directory for the level 3 cache (L3 cache) and its controller are also located on the POWER4 chip (while the actual L3 arrays are located on the L3 MLD module). Additionally, for I/O device communication, the GX bus controller and the two associated 4-byte-wide GX buses, one on chip and one off chip, are on the chip as well.
Each POWER4 chip contains a fabric controller that provides master control of the network of buses. These buses connect together the on-chip L2 controllers, the L3, other POWER4 chips, and other POWER4 modules, and also perform snooping and coherency duties. The fabric controller directs a point-to-point network between each of the four chips on the MCM, made up of unidirectional 16-byte-wide buses running at half the processor frequency; it also controls the 8-byte buses, likewise operating at half the processor speed, connecting each chip to a corresponding chip on a neighboring MCM, the unidirectional 16-byte-wide buses (running at 3:1 in the pSeries 690 Model 681) between the POWER4 chip and the L3 cache, and the buses to the NCU and GX controller.
IBM p-Series server (Sooner) architecture (from http://www.redbooks.ibm.com/pubs/pdfs/redbooks/sg247041.pdf)
IBM p-Series server (Sooner) architecture
OSCER Linux Cluster Schooner Specs – http://ou
- Most of the nodes contain dual Intel Xeon "Haswell" E5-2650v3 10-core 2.3 GHz CPUs with 32 GB RAM
- Each CPU has 10 cores, a 25 MB L3 cache, and 10 x 256 KB L2 caches
- Each general compute node has 32 GB memory
- Interconnect: InfiniBand (primary) and 1000 Mbps Ethernet (backup)
- Total theoretical peak speed is about 345 TeraFlops
- System: Red Hat Enterprise Linux