Multiclustered and Multithreaded Architecture

1 Multiclustered and Multithreaded Architecture

2 Multithreading
Multithreading is the ability of a CPU to run multiple processes or threads at the same time, with proper support from the computer’s operating system.
It is a major way of increasing a system’s throughput, and therefore its performance.
It differs from multiprocessing (another throughput-increasing method) in that all threads share the same set of resources.
The two are often used together: multithreading optimizes utilization of a single core, while multiprocessing runs multiple cores in concert with each other.

3 Advantages
If one thread stalls, the other threads can continue to use the resources it leaves unused.
CPU resources that would otherwise have been idle are put to use.
If multiple threads work on the same data, sharing the same cache can lead to better cache usage and easier data synchronization.

4 Disadvantages
Threads can interfere with each other when sharing hardware resources.
Performance gains vary from system to system.
Hand-crafted assembly programs can actually see performance degradation.
Software support is required at both the operating system and application level for multithreading to work properly.

5 Types
Temporal Multithreading, with two main sub-categories that differ in granularity:
  Coarse-Grained
  Fine-Grained (Interleaving)
Simultaneous Multithreading
The two types differ in how many threads can be at a given pipeline stage during a cycle:
  Temporal: allows only one thread per execution cycle.
  Simultaneous: allows more than one thread per execution cycle.

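6 Coarse-Grained Architecture
When a thread is stalled due to some event, the CPU switches to a different hardware context.
The CPU keeps running one thread until it hits such a stall (typically a long-latency event such as a cache miss), rather than switching on every cycle.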
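The switching rule can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions that are not in the slides: each instruction is represented by the number of cycles before its thread can issue again (1 means no stall), the function name `coarse_grained` is hypothetical, and the pipeline refill cost that a real coarse-grained switch pays is ignored.

```python
def coarse_grained(threads):
    """Toy model of coarse-grained multithreading.

    `threads` maps a thread name to a list of latencies, one per instruction:
    after issuing an instruction with latency N, the thread cannot issue
    again for N cycles. Returns the per-cycle trace ("-" means idle core).
    Assumption: switching threads is free (no pipeline refill penalty).
    """
    names = list(threads)
    pc = {n: 0 for n in names}        # next instruction index per thread
    ready_at = {n: 0 for n in names}  # first cycle the thread may issue again
    trace = []
    current = 0                       # index of the thread that owns the core
    cycle = 0
    while any(pc[n] < len(threads[n]) for n in names):
        # Prefer the current thread; only look at others if it cannot issue.
        order = names[current:] + names[:current]
        runnable = [n for n in order
                    if pc[n] < len(threads[n]) and ready_at[n] <= cycle]
        if runnable:
            name = runnable[0]
            current = names.index(name)
            ready_at[name] = cycle + threads[name][pc[name]]
            pc[name] += 1
            trace.append(name)
        else:
            trace.append("-")
        cycle += 1
    return trace


# Thread A hits a 4-cycle stall on its third instruction.
print("".join(coarse_grained({"A": [1, 1, 4, 1]})))                  # AAA---A
print("".join(coarse_grained({"A": [1, 1, 4, 1], "B": [1, 1, 1]})))  # AAABBBA
```

With only thread A, the core idles for three cycles during the stall; with thread B available, those cycles are filled with B's work and A resumes once its stall resolves.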

7 Fine-Grained Architecture
Also called cycle-by-cycle interleaving.
One core has separate sets of registers to manage multiple threads.
The core can make a context switch from one thread to another on every cycle.
When the current thread is idle during a long cache miss, the core is still able to run another thread.
Control and data dependency latencies are tolerated by overlapping them with useful work from other threads.
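Cycle-by-cycle interleaving can be sketched with a small round-robin simulation. The model is an assumption made for illustration, not something taken from the slides: each instruction is given as the number of cycles before its thread can issue again, and the function name `fine_grained` is hypothetical.

```python
def fine_grained(threads):
    """Toy model of fine-grained (interleaved) multithreading.

    `threads` maps a thread name to a list of latencies, one per instruction:
    after issuing an instruction with latency N, the thread cannot issue
    again for N cycles. Each cycle the core moves on to the next ready
    thread in round-robin order. Returns the per-cycle trace ("-" = idle).
    """
    names = list(threads)
    pc = {n: 0 for n in names}        # next instruction index per thread
    ready_at = {n: 0 for n in names}  # first cycle the thread may issue again
    trace = []
    nxt = 0                           # round-robin starting position
    cycle = 0
    while any(pc[n] < len(threads[n]) for n in names):
        order = names[nxt:] + names[:nxt]
        runnable = [n for n in order
                    if pc[n] < len(threads[n]) and ready_at[n] <= cycle]
        if runnable:
            name = runnable[0]
            ready_at[name] = cycle + threads[name][pc[name]]
            pc[name] += 1
            nxt = (names.index(name) + 1) % len(names)  # rotate every cycle
            trace.append(name)
        else:
            trace.append("-")
        cycle += 1
    return trace


# A's first instruction has a 3-cycle dependency latency.
print("".join(fine_grained({"A": [3, 1]})))                            # A--A
print("".join(fine_grained({"A": [3, 1], "B": [1, 1], "C": [1, 1]})))  # ABCABC
```

Run alone, thread A leaves two idle cycles; interleaved with B and C, its latency is completely overlapped with their work.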

9 Simultaneous Multithreading (SMT)
Used to increase the efficiency of superscalar CPUs.
Initially developed for use in IBM’s supercomputer project during the 1960s.
Allows multiple threads to issue instructions in the same CPU cycle.
Can be enabled without major changes to a processor’s architecture. The main additions are:
  The ability to accept instructions from multiple threads.
  A larger-than-normal register file to accommodate the data from the extra threads.
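The benefit on a superscalar core can be illustrated with a rough issue-slot model. Everything in it is an assumption made for illustration: each thread is described by a hypothetical pair `(instruction_count, ilp)`, where `ilp` is how many independent instructions the thread can supply per cycle (must be at least 1), and `width` is the number of issue slots per cycle. The function names are not from the slides.

```python
def cycles_temporal(threads, width):
    """Cycles needed when only ONE thread may issue per cycle (round-robin)."""
    remaining = [count for count, _ in threads]
    n = len(threads)
    cycles = 0
    turn = 0
    while any(remaining):
        while remaining[turn % n] == 0:   # skip threads that have finished
            turn += 1
        i = turn % n
        remaining[i] -= min(remaining[i], threads[i][1], width)
        cycles += 1
        turn += 1
    return cycles


def cycles_smt(threads, width):
    """Cycles needed when several threads may share the issue slots of a cycle."""
    remaining = [count for count, _ in threads]
    n = len(threads)
    cycles = 0
    start = 0
    while any(remaining):
        free = width
        for k in range(n):                # fill free slots thread by thread
            i = (start + k) % n
            issued = min(remaining[i], threads[i][1], free)
            remaining[i] -= issued
            free -= issued
        start = (start + 1) % n           # rotate priority for fairness
        cycles += 1
    return cycles


# Two threads of 8 instructions each, each able to supply 2 per cycle,
# on a 4-wide core.
workload = [(8, 2), (8, 2)]
print(cycles_temporal(workload, width=4))  # 8
print(cycles_smt(workload, width=4))       # 4
```

In this example a single thread can fill only half of the four issue slots, so letting both threads issue in the same cycle halves the cycle count. Real processors see smaller gains because threads also compete for caches and execution units, which this sketch does not model.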

11 Simultaneous Multithreading (Cont.)
Advantages:
  Increased processor performance (varies, see below).
  Increased power efficiency.
  Hides much of the memory latency, since other threads can keep issuing instructions while one thread waits on memory.
Disadvantages:
  Can actually decrease performance, depending on the processor architecture, if there are resource bottlenecks.
  Makes software development more difficult: testing is needed to determine whether the application benefits or suffers from the feature, followed by logic to turn it off if necessary.
  Potential security issues with shared resources.

12 Multithreading architecture summary

13 How do we increase computing power?
Increasing Performance: A farmer seeks to increase the performance of his ox and plow. Should the farmer try to breed a stronger ox?

15 How do we increase computing power?
Increasing Performance: Or should the farmer use more oxen yoked together?

16 How do we increase computing power?
Increasing Performance:
Processors have become faster, smaller, and denser in transistors, but these advances will diminish quickly while production costs increase rapidly.
Limitations of increasing processor performance:
  Transistor density is limited by electromagnetic interference and heat.
  The performance gained per unit of added cost diminishes when compared with adding more processors.

17 Cluster Computing What is a cluster?
Commodity computers using customized operating systems, connected by network interconnects, managed by an application

18 Cluster Computing What is cluster computing used for?
Distributed computing: a network of computers that communicate with each other to achieve a common goal.
A job to be processed is split into tasks, and the tasks are processed by individual computers, or nodes.
Amdahl’s Law: the section of an algorithm that must be executed serially limits the speedup that can be achieved through distributed computing.
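Amdahl's Law is usually written as speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of nodes. The following sketch evaluates it; the function name and the 90% example figure are chosen here for illustration and do not come from the slides.

```python
def amdahl_speedup(parallel_fraction, nodes):
    """Speedup predicted by Amdahl's Law.

    parallel_fraction: share of the job (0.0 to 1.0) that can be split
    across nodes; the rest must run serially.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


# Illustrative job in which 90% of the work can be parallelized.
for nodes in (1, 10, 100, 1000):
    print(nodes, round(amdahl_speedup(0.9, nodes), 2))
```

However many nodes are added, the speedup in this example stays below 1 / (1 - 0.9) = 10, because the serial 10% of the job is never shortened.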

19 Multicluster Architectures
Grid Computing:
Loosely coupled and geographically dispersed clusters.
Generally used in scientific research by institutions.
Utilizes thousands to hundreds of thousands of processor cores spread across many institutions.
Connected via a Storage Area Network (SAN).

20 Multicluster Architectures
Grid Computing: Tommy Minyard, TACC

21 Multicluster Architectures
Grid Computing Limitations:
Suitable for computationally intensive jobs, but ill-equipped for handling and transferring large amounts of data.
The SAN becomes a bottleneck when large amounts of data must be transferred to multiple clusters.

22 Multicluster Architectures
Supercomputers and High Performance Computing (HPC): Highly tuned computer clusters using commodity processors, with customized network interconnects and operating systems

23 Multicluster Architectures
Supercomputers and High Performance Computing (HPC):
FLOPS: floating-point operations per second.
Currently, the fastest supercomputers operate at peta-scale: quadrillions of FLOPS, or 1,000,000,000,000,000 (10^15).

24 Multicluster Architectures
China’s Supercomputer Sunway TaihuLight: 93 petaFLOPS (2016) = 93,000,000,000,000,000 FLOPS
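To give the figure some scale, the arithmetic can be sketched as below. The 93 petaFLOPS value is from the slide; the workload size and the 100 gigaFLOPS comparison machine are hypothetical numbers picked only for illustration, and the calculation assumes the machine sustains its quoted rate.

```python
PETA = 10**15
GIGA = 10**9

TAIHULIGHT_FLOPS = 93 * PETA       # figure quoted on the slide (2016)
COMPARISON_FLOPS = 100 * GIGA      # hypothetical single machine, for scale only


def seconds_to_complete(operations, flops):
    """Time to finish a workload, assuming the quoted rate is sustained."""
    return operations / flops


workload = 10**18                  # hypothetical job: 10^18 floating-point operations
print(seconds_to_complete(workload, TAIHULIGHT_FLOPS))           # about 10.75 seconds
print(seconds_to_complete(workload, COMPARISON_FLOPS) / 86400)   # about 115.7 days
```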

25 Multicluster Architectures
Hadoop Clusters for Big Data:
Data locality: data is stored locally on the nodes themselves, which makes access very fast.
Unlike grid architectures, there is no bottleneck in data transfer over a SAN.
Unlike an RDBMS, a Hadoop cluster streams through data at the disk transfer rate, rather than using point queries at the slower disk “seek” rate.
2008: 1 TB sorted in 209 seconds using 900 nodes.
2009: 100 TB sorted in 173 minutes using 3400 nodes.
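The two sort results can be turned into throughput figures with a short calculation. This is only a back-of-the-envelope sketch: it assumes decimal terabytes (10^12 bytes), and the resulting rates are end-to-end sort rates (reading, shuffling, and writing), not raw disk speeds. The function names are hypothetical.

```python
TB = 10**12  # assumption: decimal terabyte


def aggregate_gb_per_s(total_bytes, seconds):
    """Cluster-wide sort throughput in GB/s."""
    return total_bytes / seconds / 10**9


def per_node_mb_per_s(total_bytes, seconds, nodes):
    """Average sort throughput per node in MB/s."""
    return total_bytes / seconds / nodes / 10**6


runs = {
    "2008": (1 * TB, 209, 900),           # 1 TB in 209 seconds on 900 nodes
    "2009": (100 * TB, 173 * 60, 3400),   # 100 TB in 173 minutes on 3400 nodes
}
for year, (size, seconds, nodes) in runs.items():
    print(year,
          round(aggregate_gb_per_s(size, seconds), 2), "GB/s total,",
          round(per_node_mb_per_s(size, seconds, nodes), 2), "MB/s per node")
```

Under these assumptions the aggregate rate roughly doubles between the two runs (from about 4.8 GB/s to about 9.6 GB/s) while the data set grows a hundredfold.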

26 Multicluster Architectures
Common Hadoop cluster networking scheme:
Higher latency between racks.
Store data locally.

27 Multicluster Architectures
Hadoop Clusters for Big Data: Fault tolerance
The large number of parts increases the likelihood of hardware failure in the system.
Hardware redundancy: data and task outputs are replicated; three copies are made.
Error detection: the large quantity of data transferred increases the likelihood of data corruption in the system.
  CRC-32 (cyclic redundancy check) is used to detect corrupted data.
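How replication and CRC-32 work together can be sketched with Python's standard `zlib.crc32`. This is not Hadoop's actual code: the helper names (`store_block`, `is_intact`, `read_first_intact`) and the in-memory list of replicas are hypothetical stand-ins for illustration.

```python
import zlib


def store_block(data):
    """Return the block together with its CRC-32 checksum."""
    return data, zlib.crc32(data)


def is_intact(data, checksum):
    """Recompute the CRC-32 and compare it with the stored checksum."""
    return zlib.crc32(data) == checksum


def read_first_intact(replicas):
    """Return the data of the first replica whose checksum still matches."""
    for data, checksum in replicas:
        if is_intact(data, checksum):
            return data
    raise IOError("all replicas are corrupt")


block = b"block of job output"
replicas = [store_block(block) for _ in range(3)]   # three copies, as on the slide

# Simulate corruption of the first copy: the data changes, the checksum does not.
_, checksum = replicas[0]
replicas[0] = (b"blocK of job output", checksum)

print(is_intact(*replicas[0]))                 # False: corruption detected
print(read_first_intact(replicas) == block)    # True: served from another copy
```

The checksum only detects corruption; recovery comes from the redundant copies, which is why the two mechanisms appear together.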

