Tera MTA (Multi-Threaded Architecture) Thriveni Movva (CMPS 5433)

Presentation Contents
• Evolution of Tera MTA
• Design goals of Tera MTA
• Tera MTA architecture
• Interconnection network
• Applications
• Advantages and drawbacks
• Current MTA status

Evolution of Tera MTA
• 1987: Tera Computer Company was established by Burton Smith in Washington, USA
• 1988: Software development starts
• 1991: Hardware development starts
• 1997: First MTA-1 shipment to SDSC (San Diego Supercomputer Center)

Tera MTA: Design Goals
• To solve the two major problems then faced by high-performance parallel computers: scalability and programmability
• To be suitable for very high-speed implementations
• To be applicable to a wide spectrum of problems
• To ease compiler implementation
• To overcome the von Neumann bottleneck (the limited rate at which data can move between processor and memory)

About Tera MTA
The Tera MTA is a high-performance system with:
• Scalar multithreaded processors, with synchronization among threads
• Uniform-access shared memory, i.e. all data is accessible with equal ease (no locality, no caches, no mapping)
• Simple programming
• Zero-cost context switching

About Multi-Threading Architecture (MTA)
• Uses multithreading, a technique that lets multiprocessors share memory without using caches
• These computers can have thousands of processors that stay almost constantly busy, because no processor sits idle waiting for a slow memory access
• Multithreading allows each processor to switch thread contexts between execution cycles, so the processor stays busy
• Whenever a thread starts a slow memory or I/O instruction, the processor does not wait tens of cycles for the stalled instruction to complete; it executes the next instruction from a different thread, using a different set of registers
• Each processor has many copies of the programming and pipeline control registers, one copy for each execution thread that it can support

Tera MTA Overview
• Up to 256 processors, each running at 260 MHz
• Up to 128 active threads per processor
• Up to 256 I/O processors
• Peak performance of 256 GFlop/s
• Processors and memory modules populate a sparse 3D torus interconnection network
• 4096 interconnection network nodes
• Flat, shared main memory ranging from 16 to 512 GB
• Cost: $5 million to $40 million

A View of the Tera Multiprocessor

Key Architecture Details
• Each MTA processor has 128 “streams”, each of which is hardware (including 32 registers and a program counter) devoted to running a single thread of control
• The processor executes instructions from the streams that are not blocked, in fair round-robin fashion
• A stream can issue an instruction only once every 21 cycles (the length of the instruction pipeline), so at least 21 ready threads are required to keep a processor fully busy
• The processor makes a context switch on every cycle, choosing the next instruction from one of the streams that is ready to execute
• The ‘rich’ interconnection network is designed so that potential delays caused by references to data in memory are hidden
• Randomized memory mapping and a highly connected network provide near-uniform access time from any processor to any memory location
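The relationship between the number of ready streams and processor utilization can be illustrated with a small simulation. This is a minimal sketch, not a model of the real issue logic: it assumes every stream is always ready apart from the 21-cycle pipeline constraint stated above, and the function name `utilization` and its parameters are made up for this example.

```python
def utilization(ready_streams: int, pipeline_depth: int = 21,
                cycles: int = 21_000) -> float:
    """Fraction of cycles in which the processor issues an instruction.

    Assumption: a stream that issued at cycle t may not issue again
    before cycle t + pipeline_depth; memory stalls are not modeled.
    """
    # next_ready[i] is the first cycle at which stream i may issue again.
    next_ready = [0] * ready_streams
    issued = 0
    cursor = 0  # round-robin pointer: where the search starts each cycle
    for cycle in range(cycles):
        for offset in range(ready_streams):
            s = (cursor + offset) % ready_streams
            if next_ready[s] <= cycle:
                next_ready[s] = cycle + pipeline_depth
                issued += 1
                cursor = (s + 1) % ready_streams
                break  # at most one instruction issues per cycle
    return issued / cycles


if __name__ == "__main__":
    for n in (1, 7, 21, 128):
        print(f"{n:3d} streams -> utilization {utilization(n):.3f}")
```

Under these assumptions utilization grows roughly as n/21 until 21 streams are ready and stays at 1.0 beyond that, which is the reason a single thread runs slowly on this design.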

Key Architecture Details (continued)
• Hardware multithreading is used to tolerate high memory latencies, typically on the order of 150 clock cycles
• Expected benefits of the MTA include high processor utilization, near-linear scalability, and reduced programming effort, especially compared with distributed-memory machines that use explicit message passing
• The current MTA interconnection network is a 3D toroidal mesh

Tera MTA’s Interconnection Network
• The interconnection network is a three-dimensional, sparsely populated torus of pipelined packet-switching nodes, each of which is linked to some of its neighbors
• Each link can transport a packet containing source and destination addresses, an operation, and 64 data bits, in both directions simultaneously, on every clock tick
• Some of the nodes are also linked to resources, i.e. processors, data memory units, I/O processors, and I/O cache units
• Instead of locating the processors on one side of the network and the memories on the other, the resources are distributed more or less uniformly throughout the network
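To give a feel for distances in a toroidal mesh, the sketch below computes the minimal hop count between two nodes when every dimension wraps around. It assumes a fully populated torus with all neighbor links present; since the real network is sparse and links only some neighbors, the result is best read as a lower bound. The function name `torus_hops` and the default side length of 16 (matching the 16 × 16 × 16 configuration of the 256-processor system) are choices made for this example.

```python
from typing import Sequence


def torus_hops(a: Sequence[int], b: Sequence[int], dim: int = 16) -> int:
    """Minimal hop count between nodes a and b on a torus of side `dim`.

    Assumption: every node is linked to both neighbors in every
    dimension, so each coordinate can be traversed in either direction.
    """
    hops = 0
    for x, y in zip(a, b):
        forward = (y - x) % dim    # hops going "up" with wraparound
        backward = (x - y) % dim   # hops going "down" with wraparound
        hops += min(forward, backward)
    return hops


if __name__ == "__main__":
    print(torus_hops((0, 0, 0), (15, 0, 0)))  # neighbors via wraparound
    print(torus_hops((0, 0, 0), (8, 8, 8)))   # farthest node from the origin
```

With a side of 16 the worst case under this assumption is 8 hops per dimension, or 24 hops in total.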

Tera MTA’s Interconnection Network (continued)
• The interconnection network of one 256-processor Tera system contains 4096 nodes arranged in a 16 × 16 × 16 toroidal mesh
• As the Tera architecture scales to larger numbers of processors p, the number of network nodes grows as p^(3/2) rather than as the p log p associated with the more commonly used multistage networks. For example, a 1024-processor system would have 32,768 nodes
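The node counts quoted above follow directly from the p^(3/2) rule, as the short calculation below shows. It is only an arithmetic sketch: the function names are made up for this example, and the p log p figure uses a base-2 logarithm with constant factors ignored, which is an assumption about how that growth term would be counted.

```python
import math


def torus_nodes(p: int) -> int:
    """Network nodes for p processors under the p^(3/2) rule."""
    return round(p ** 1.5)


def torus_side(p: int) -> int:
    """Side length of the cubic torus that holds torus_nodes(p) nodes."""
    return round(torus_nodes(p) ** (1 / 3))


def p_log_p(p: int) -> int:
    """The p * log2(p) growth term, constant factors ignored (assumption)."""
    return round(p * math.log2(p))


if __name__ == "__main__":
    for p in (256, 1024):
        side = torus_side(p)
        print(f"p={p}: {torus_nodes(p)} torus nodes "
              f"({side} x {side} x {side}), p*log2(p) = {p_log_p(p)}")
```

For p = 256 this reproduces the 4096-node, 16 × 16 × 16 mesh, and for p = 1024 it gives 32,768 nodes arranged as 32 × 32 × 32.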

[Figure: Multithreading on one processor; label: Unused streams]

Multithreading on multiple processors

Latency Tolerance in Tera MTA
• The latency incurred by memory references is hidden by multithreading
• With up to 128 instruction streams (threads) per processor, each able to issue up to 8 memory references without waiting for the preceding ones to complete, a latency of up to 128 × 8 = 1024 cycles can be tolerated
• This lookahead allows threads to achieve peak performance
• Each instruction can specify three operations that execute simultaneously: a memory operation (M), an arithmetic operation (A), and a control operation (C)
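The 1024-cycle figure is simply the number of memory references that can be in flight per processor. The sketch below restates that arithmetic and inverts it to estimate how many streams would be needed to cover a given latency. It assumes the simple model used on the slide (one reference issued per cycle, latency fully covered once enough references are in flight); the function names are made up for this example, and the 150-cycle value is the typical memory latency quoted earlier in the deck.

```python
import math


def tolerated_latency(streams: int = 128, outstanding_refs: int = 8) -> int:
    """Cycles of memory latency covered when every stream keeps
    `outstanding_refs` references in flight (slide's simple model)."""
    return streams * outstanding_refs


def streams_needed(latency_cycles: int, outstanding_refs: int) -> int:
    """Streams required to cover `latency_cycles` under the same model."""
    return math.ceil(latency_cycles / outstanding_refs)


if __name__ == "__main__":
    print("max tolerated latency:", tolerated_latency(), "cycles")
    for refs in (1, 2, 4, 8):
        print(f"{refs} outstanding refs per stream -> "
              f"{streams_needed(150, refs)} streams for 150 cycles")
```

Under this model, a 150-cycle latency needs 150 streams if each thread waits on every reference, but only 19 streams if each can keep 8 references outstanding, which is why lookahead matters for programs with limited parallelism.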

The Tera Idea: Higher investment in hardware yields improved utilization and reduces software overhead

Tera MTA Applications
• PULSE 3D, used for simulating real-time heartbeats to better treat heart diseases
• MSC Software’s NASTRAN, a structural analysis code used extensively by the automobile and aerospace industries
• Livermore Software’s LS-DYNA, which can simulate physical occurrences such as car crashes and metal stamping
• GAUSSIAN 98, a computational chemistry application used in molecular modeling
• MPIRE (Massively Parallel Interactive Rendering Environment), a powerful graphics and animation application that visualizes complex phenomena
• Also used in seismic analysis, national security, and weather forecasting

Advantages of Tera MTA
• Tera MTA uses multiple contexts to hide latency
• Tera machines perform a context switch every clock cycle
• Both pipeline latency and memory latency are hidden in the Tera approach
• Thread creation is very cheap
• With 128 contexts per processor, a large register file (128 streams × 32 registers = 4096 registers) is divided finely among the threads
• As long as user programs contain plenty of parallelism to hide latency and there is plenty of compiler support, the performance is potentially very high
• The advantages of Tera’s architecture are available to users with minimal changes to their application code

Drawbacks of Tera MTA
• Performance is poor when parallelism is limited; in particular, single-thread performance is low by design, since one stream can issue at most one instruction every 21 cycles
• A large number of contexts demands many registers and other hardware resources, which in turn implies higher cost and complexity
• Because the design puts little emphasis on latency reduction and caching, it needs a great deal of slack parallelism to hide latency and a great deal of memory bandwidth; both raise the cost of building the machine
• Bandwidth (not latency) limits practical MTA system size, and large MTA systems will have expensive memory networks

Tera MTA: Tools
Tera provides two tools, Traceview and Canal, that let the programmer see:
• How the compiler has multithreaded a program
• How effectively the program actually uses the hardware

Customers
• San Diego Supercomputer Center (SDSC)
• Logicon, under a Naval Research Laboratory contract
• Tera Computer Company

Tera MTA Macro Architecture

Problems Solved Using Tera MTA
• Irregular memory access patterns
• Synchronization among threads
• Load balancing

Current Industry Status: Cray Inc. (ex-Tera)
Cray Inc. (Nasdaq NM: CRAY)
• Established: April 1, 2000 (Tera Computer + Cray Research)
• Headquarters: Seattle, WA, USA
• Products: supercomputers (vector, microprocessor, multithreaded)
• Market: government, industry, academic research
Cray Research
• 1972: Established by Seymour Cray in Minnesota, USA
• 1976: First Cray-1 shipment to Los Alamos
• 1980s: Ships follow-on products Cray XMP, Cray YMP, Cray-2
• 1990s: More follow-on products Cray C90, Cray J90, Cray T3D, Cray T90, Cray T3E, Cray SV1
• 1996: Merged with Silicon Graphics (SGI)
Tera Computer
• 1987: Established by Burton Smith in Washington, USA
• 1988: Software development starts
• 1991: Hardware development starts
• 1997: First MTA-1 shipment to SDSC (San Diego Supercomputer Center)
• 2000: Purchased Cray business unit from SGI

Cray Inc. (2000–present; result of the merger between Tera Computer Company and Cray Research)
• Cray SX-6
• Cray MTA-2
• Cray SV1
• Cray Red Storm
• Cray X1
• Cray XD1

Cray MTA-2, Multi-Threaded Architecture
• 128 virtual processors in a CPU module
• Zero-overhead thread switching
• Up to 1 TB of scalable shared memory

Cray MTA-2 Overview [Figure: multithreaded system, Cray MTA-2]

Unique Capability of Cray MTA: visualization of a nebula using the MPIRE application on a Cray MTA system

