Tera MTA (Multi-Threaded Architecture) Thriveni Movva (CMPS 5433)

2 Presentation Contents
• Evolution of Tera MTA
• Design goals of Tera MTA
• Tera MTA architecture
• Interconnection network
• Applications
• Advantages & drawbacks
• Current MTA status

3 Evolution of Tera MTA
• 1987: Tera Computer Company was established by Burton Smith in Washington, USA
• 1988: Software development starts
• 1991: Hardware development starts
• 1997: First MTA-1 shipment to SDSC (San Diego Supercomputer Center)

4 Tera MTA: Design Goals
• To solve the two major problems then facing high-performance parallel computers: scalability and programmability
• To be suitable for very high-speed implementations
• To be applicable to a wide spectrum of problems
• To ease compiler implementation
• To overcome the von Neumann bottleneck (the limited bandwidth between processor and memory)

5 About Tera MTA
The Tera MTA is a high-performance system with:
• scalar multithreaded processors, with synchronization among threads
• uniform-access shared memory, i.e., all data are accessible with equal ease (no locality, no data cache, no data mapping)
• simple programming
• zero-cost context switching
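The slides attribute this near-uniform access partly to randomized memory mapping. As a rough illustration only, the Python sketch below contrasts conventional modulo interleaving with a hashed mapping of word addresses onto memory banks. The bank count (64) and the mixing function (a SplitMix64-style finalizer) are arbitrary assumptions chosen for the example, not Tera's actual scheme, and all function names are hypothetical.

```python
# Sketch: conventional modulo interleaving vs. a hashed (pseudo-randomized)
# address-to-bank mapping. NUM_BANKS and the mixing function are
# illustrative assumptions, not the MTA's real parameters.

NUM_BANKS = 64  # hypothetical number of memory banks
MASK64 = (1 << 64) - 1


def bank_modulo(addr: int, banks: int = NUM_BANKS) -> int:
    """Conventional interleaving: consecutive words go to consecutive banks."""
    return addr % banks


def mix64(x: int) -> int:
    """SplitMix64 finalizer, used here as a stand-in address scrambler."""
    x &= MASK64
    x ^= x >> 30
    x = (x * 0xBF58476D1CE4E5B9) & MASK64
    x ^= x >> 27
    x = (x * 0x94D049BB133111EB) & MASK64
    x ^= x >> 31
    return x


def bank_hashed(addr: int, banks: int = NUM_BANKS) -> int:
    """Scramble the address before choosing a bank."""
    return mix64(addr) % banks


def banks_touched(mapping, stride: int, count: int = 1024) -> int:
    """Number of distinct banks hit by `count` accesses at a fixed stride."""
    return len({mapping(i * stride) for i in range(count)})


if __name__ == "__main__":
    for stride in (1, 8, 64):
        print(f"stride {stride:2}: modulo -> {banks_touched(bank_modulo, stride)} banks, "
              f"hashed -> {banks_touched(bank_hashed, stride)} banks")
```

With modulo interleaving, a stride equal to the bank count sends every access to the same bank, whereas the hashed mapping spreads the same accesses over most or all banks. Avoiding that kind of hot spot is what address randomization is meant to achieve.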

6 About Multi-Threaded Architecture (MTA)
• Uses a technique called multithreading that lets multiprocessors share memory without using caches
• Because these machines can run thousands of threads at once, each processor almost always has work to do and does not sit idle waiting for slow memory accesses
• Multithreading allows each processor to switch thread contexts between execution cycles, so the processor stays busy
• Whenever a processor starts a slow memory or I/O instruction, rather than waiting many cycles for the stalled instruction to complete, it executes its next instruction from a different thread, using that thread's registers
• Each processor has many copies of the program and pipeline control registers, one copy for each execution thread it can support

7 Tera MTA Overview
• Up to 256 processors, each running at 260 MHz
• Up to 128 active threads per processor
• Up to 256 I/O processors
• Peak performance of 256 GFlop/s
• Processors and memory modules populate a sparse 3-D torus interconnection network
• 4096 interconnection network nodes
• Flat, shared main memory ranging from 16 to 512 GB
• Cost: $5 million to $40 million

8 A View of the Tera Multiprocessor

9 Key Architecture Details
• Each MTA processor has 128 “streams”, each of which is a set of hardware (including 32 registers and a program counter) devoted to running a single thread of control
• The processor executes instructions from the streams that are not blocked, in fair round-robin fashion
• A stream can issue an instruction at most once every 21 cycles (the length of the instruction pipeline), so at least 21 ready threads are required to keep a processor fully busy
• The processor makes a context switch on each cycle, choosing the next instruction from one of the streams that is ready to execute
• A ‘rich’ interconnection network, together with multithreading, is intended to hide the delays caused by references to data in memory, provided enough threads are ready
• Randomized memory mapping and a highly interconnected network provide near-uniform access time from any processor to any memory location
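The 21-thread requirement follows from a simple issue model. The Python sketch below assumes that a stream, once it issues, may not issue again for 21 cycles, and that the processor picks the next eligible stream in round-robin order each cycle. It deliberately ignores memory stalls and other real pipeline details; the function and parameter names are hypothetical.

```python
# Sketch of a barrel-style issue model: each cycle the processor issues one
# instruction from the next eligible stream (round-robin), and a stream that
# has just issued must wait `pipeline_depth` cycles before issuing again.
# Memory stalls and other real MTA pipeline details are ignored.


def issue_rate(ready_streams: int, pipeline_depth: int = 21, cycles: int = 21_000) -> float:
    """Fraction of cycles on which some stream issues an instruction."""
    if ready_streams <= 0:
        return 0.0
    next_ok = [0] * ready_streams  # earliest cycle at which each stream may issue
    start = 0                      # round-robin starting point
    issued = 0
    for cycle in range(cycles):
        for offset in range(ready_streams):
            s = (start + offset) % ready_streams
            if next_ok[s] <= cycle:
                next_ok[s] = cycle + pipeline_depth
                issued += 1
                start = (s + 1) % ready_streams
                break
    return issued / cycles


if __name__ == "__main__":
    for n in (1, 7, 20, 21, 32):
        print(f"{n:3} ready streams -> utilization {issue_rate(n):.3f}")
```

In this model, 7 ready streams keep the processor busy on one cycle in three, and any count from 21 upward keeps it busy on every cycle, which matches the slide's claim.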

10 Key Architecture Details
• Hardware multithreading is used to tolerate high memory latency, typically on the order of 150 clock cycles
• Expected benefits of the MTA include high processor utilization, near-linear scalability, and reduced programming effort, especially compared with distributed-memory machines that use explicit message passing
• The current MTA interconnection network is a 3-D toroidal mesh

11 Tera MTA’s Interconnection Network
• The interconnection network is a three-dimensional, sparsely populated torus of pipelined packet-switching nodes, each of which is linked to some of its neighbors
• Each link can carry a packet containing source and destination addresses, an operation, and 64 data bits, in both directions simultaneously on every clock tick
• Some of the nodes are also linked to resources, i.e., processors, data memory units, I/O processors, and I/O cache units
• Instead of placing the processors on one side of the network and the memories on the other, the resources are distributed more or less uniformly throughout the network

12 Tera MTA’s Interconnection Network
• The interconnection network of a 256-processor Tera system contains 4096 nodes arranged in a 16 × 16 × 16 toroidal mesh
• As the Tera architecture scales to larger numbers of processors $p$, the number of network nodes grows as $p^{3/2}$ rather than as the $p \log p$ associated with the more commonly used multistage networks. For example, a 1024-processor system would have 32,768 nodes
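The node counts above can be checked with a few lines of arithmetic. The Python sketch below assumes the node count is exactly $p^{3/2}$, compares it with $p \log_2 p$ (constant factors ignored), and computes hop distances in a $k \times k \times k$ torus. The hop calculation assumes every node links to both neighbors in each dimension; since the real network is sparsely populated, actual routes may be longer. Function names are hypothetical.

```python
import math


def mta_network_nodes(p: int) -> int:
    """Node count under the p**(3/2) growth rule quoted on the slide."""
    return round(p ** 1.5)


def multistage_nodes(p: int) -> float:
    """p log2 p, the growth typical of multistage networks (constants ignored)."""
    return p * math.log2(p)


def torus_side(nodes: int) -> int:
    """Side length k of a k x k x k torus holding `nodes` nodes."""
    k = round(nodes ** (1 / 3))
    if k ** 3 != nodes:
        raise ValueError(f"{nodes} nodes do not form a cube")
    return k


def torus_hops(a: tuple, b: tuple, k: int) -> int:
    """Minimal hop count between nodes a and b in a k-ary 3-D torus,
    assuming every node links to both neighbours in each dimension."""
    hops = 0
    for x, y in zip(a, b):
        d = abs(x - y) % k
        hops += min(d, k - d)  # wrap-around links make the short way round available
    return hops


if __name__ == "__main__":
    for p in (256, 1024):
        n = mta_network_nodes(p)
        k = torus_side(n)
        diameter = 3 * (k // 2)
        print(f"p={p}: {n} nodes ({k}x{k}x{k}), "
              f"p log2 p = {multistage_nodes(p):.0f}, diameter {diameter} hops")
```

Under these assumptions, the 256-processor system's 4096 nodes form a 16-per-side cube whose farthest nodes are 24 hops apart, and quadrupling the processor count to 1024 gives a 32-per-side cube of 32,768 nodes.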

13 Multithreading on one processor (figure labeled “Unused streams”)

14 Multithreading on multiple processors

15 Latency Tolerance in Tera MTA
• The latency incurred by memory references is hidden by multithreading
• Because there may be up to 128 instruction streams (threads) per processor, and each stream can issue up to 8 memory references without waiting for the preceding ones to complete, up to 128 × 8 = 1024 references can be outstanding, so a latency of up to 1024 cycles can be tolerated
• This lookahead lets a processor sustain peak performance with fewer threads than it would otherwise need
• Each instruction can carry three operations, a memory (M), an arithmetic (A), and a control (C) operation, which execute simultaneously
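The 1024-cycle figure is the product of streams and per-stream lookahead. The Python sketch below checks it with a deliberately simplified model, which is an assumption rather than Tera's exact hardware: every instruction is a memory reference that completes a fixed number of cycles after it issues, a stream may have at most `lookahead` references in flight, and the processor issues at most one instruction per cycle, choosing ready streams round-robin. Function names are hypothetical.

```python
from collections import deque

# Simplified latency-tolerance model (an assumption, not Tera's exact
# hardware): every instruction is a memory reference completing `latency`
# cycles after issue; each stream may have at most `lookahead` references in
# flight; one instruction issues per cycle from the next ready stream.


def max_tolerable_latency(streams: int, lookahead: int) -> int:
    """Outstanding references the processor can keep in flight."""
    return streams * lookahead


def issue_utilization(streams: int, lookahead: int, latency: int, cycles: int = 10_000) -> float:
    """Fraction of cycles on which a memory reference is issued."""
    in_flight = [deque() for _ in range(streams)]  # completion cycles per stream
    start = 0
    issued = 0
    for cycle in range(cycles):
        for offset in range(streams):
            s = (start + offset) % streams
            pending = in_flight[s]
            while pending and pending[0] <= cycle:  # retire completed references
                pending.popleft()
            if len(pending) < lookahead:
                pending.append(cycle + latency)
                issued += 1
                start = (s + 1) % streams
                break
    return issued / cycles


if __name__ == "__main__":
    print(max_tolerable_latency(128, 8))  # 1024, the figure on the slide
    for streams, latency in ((128, 1024), (128, 2048), (16, 1024)):
        u = issue_utilization(streams, 8, latency)
        print(f"{streams} streams, latency {latency}: utilization {u:.3f}")
```

In this model, 128 streams with 8 references each keep the processor fully busy at a 1024-cycle latency. With only 16 streams, utilization falls to roughly 16 × 8 / 1024, about one cycle in eight.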

16 The Tera Idea: Higher investment in hardware yields improved utilization and reduced software overhead

17 Tera MTA Applications
• PULSE 3D, used for simulating real-time heartbeats to better treat heart disease
• MSC Software’s NASTRAN, a structural analysis code used extensively in the automobile and aerospace industries
• Livermore Software's LS-DYNA, which can simulate physical events such as car crashes and metal stamping
• GAUSSIAN 98, a computational chemistry application used in molecular modeling
• MPIRE (Massively Parallel Interactive Rendering Environment), a graphics and animation application that visualizes complex phenomena
• Also used in seismic analysis, national security, and weather forecasting

18 Advantages of Tera MTA
• Tera MTA uses multiple hardware contexts to hide latency
• Tera machines perform a context switch every clock cycle
• Both pipeline latency and memory latency are hidden in the Tera approach
• Thread creation is very cheap
• With 128 contexts per processor, each holding 32 registers, a large register file (4096 registers) is provided and divided finely among threads
• As long as user programs expose plenty of parallelism to hide latency and the compiler supports it well, performance is potentially very high
• Users get the benefits of Tera's architecture with minimal changes to their application code

19 Drawbacks of Tera MTA
• Performance is poor for programs with limited parallelism, because single-thread performance is guaranteed to be low (a single stream can issue at most one instruction every 21 cycles)
• A large number of contexts demands many registers and other hardware resources, which in turn implies higher cost and complexity
• The design's limited emphasis on reducing latency (e.g., through caching) requires plenty of slack parallelism to hide latency, as well as high memory bandwidth; both raise the cost of building the machine
• Bandwidth (not latency) limits practical MTA system size, and large MTA systems will need expensive memory networks

20 Tera MTA: Tools
Tera provides two tools, Traceview and Canal, that let the programmer see:
• how the compiler has multithreaded a program
• how effectively the program actually uses the hardware

21 Customers
• San Diego Supercomputer Center (SDSC)
• Logicon (for the Naval Research Laboratory)
• Tera Computer Company

22 Tera MTA Macro Architecture

23 Problems Solved Using Tera MTA
• Irregular memory access patterns
• Synchronization among threads
• Load balancing
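The slides list synchronization among threads but do not say how it is achieved. The MTA is generally described as tagging every memory word with a full/empty bit, so that a synchronized load can wait until a word is full and leave it empty, and a synchronized store can wait until it is empty and leave it full. The Python sketch below models that idea in software with a condition variable. The class and method names (`FullEmptyWord`, `write_ef`, `read_fe`) are hypothetical, and a lock-based model says nothing about what the operation costs in hardware.

```python
import threading

# Software model of a memory word tagged with a full/empty bit.
# Method names are hypothetical; on the real MTA this behaviour is provided
# by the hardware on each memory word, not by a lock.


class FullEmptyWord:
    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def write_ef(self, value) -> None:
        """Wait until the word is empty, store `value`, then mark it full."""
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_fe(self):
        """Wait until the word is full, load its value, then mark it empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            value = self._value
            self._full = False
            self._cond.notify_all()
            return value


def produce_consume(n: int) -> list:
    """One producer and one consumer pass n values through a single word."""
    word = FullEmptyWord()
    received = []

    def producer() -> None:
        for i in range(n):
            word.write_ef(i)

    def consumer() -> None:
        for _ in range(n):
            received.append(word.read_fe())

    workers = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return received


if __name__ == "__main__":
    print(produce_consume(5))  # [0, 1, 2, 3, 4]
```

Because the single word acts as a one-slot buffer, the consumer receives every value exactly once and in order, without any separate lock or flag variable in the program itself.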

24 Current Industry Status: Cray Inc. (ex-Tera)
Cray Inc. (Nasdaq NM: CRAY)
• Established: April 1, 2000 (Tera Computer + Cray Research)
• HQ: Seattle, WA, USA
• Products: supercomputers (vector, microprocessor, multithreaded)
• Market: government, industry, academic research
Cray Research:
• 1972: Established by Seymour Cray in Minnesota, USA
• 1976: First Cray-1 shipment to Los Alamos
• 1980s: Follow-on products Cray X-MP, Cray Y-MP, Cray-2
• 1990s: More follow-on products Cray C90, Cray J90, Cray T3D, Cray T90, Cray T3E, Cray SV1
• 1996: Merged with Silicon Graphics (SGI)
Tera Computer:
• 1987: Established by Burton Smith in Washington, USA
• 1988: Software development starts
• 1991: Hardware development starts
• 1997: First MTA-1 shipment to SDSC (San Diego Supercomputer Center)
• 2000: Purchased the Cray business unit from SGI

25 Cray Inc. (2000–present; result of the merger between Tera Computer and Cray Research)
• Cray SX-6
• Cray MTA-2
• Cray SV1
• Cray Red Storm
• Cray X1
• Cray XD1

26 Cray MTA-2, Multi-threaded Architecture
• 128 virtual processors in a CPU module
• Zero-overhead thread switching
• Up to 1 TB of scalable shared memory

27 Cray MTA-2 Overview (figure: Cray MTA-2 multithreaded system)

28 Unique Capability of Cray MTA
Visualization of a nebula using the MPIRE application on a Cray MTA system


