Multithreaded Systems for Emerging High-Performance Applications
2008 DOE Summer School in Multiscale Mathematics and High-Performance Computing
Daniel G. Chavarría
Outline
  ☞ High-performance computing systems
  Processor architecture
  Memory subsystems
  Multithreaded processors
  Cray XMT multithreaded system
  Programming models & environments (Cray XMT)
  Examples
High-Performance Computing Systems
  Nowadays, HPC systems are parallel computing systems
    Consisting of hundreds of processors (or more)
    Connected by high-bandwidth, low-latency networks
    Collections of PCs connected by Ethernet are not HPC systems
  The basic building block is a node: a server-like computer (a few processor sockets, memory, network interconnect cards, possibly I/O devices)
  Nodes are parallel computers on their own: they usually contain >= 2 processor sockets with multiple cores per processor
  HPC systems have many applications in scientific and engineering areas: physics, chemistry, materials design, mechanical design
HPC Systems (cont.)
  Two basic kinds of HPC systems:
    Distributed memory systems
    Shared memory systems
  Distributed memory HPC systems:
    The typical HPC system; processors only have direct access to local memory on their node
    Remote memory on other nodes must be accessed indirectly, via a library call
    Can scale to tens and hundreds of thousands of processors (Blue Gene/P @ LLNL and others)
  Shared memory HPC systems:
    Processors have direct access to local memory on the node and to remote memory on other nodes; speed of access may vary
    More difficult to scale beyond a few thousand processors (Columbia SGI Altix @ NASA)
HPC Systems (cont.)
  [Diagram: Distributed Memory HPC System (high-level view): each node contains CPUs, memory, and a NIC; remote memory is reached by indirect access over the network]
  Most HPC systems fit in this category
  Programming these systems is difficult: access to remote data usually requires message exchange or coordination between nodes (see the sketch below)
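As an illustration of what "indirect access via a library call" looks like in practice, here is a minimal sketch using MPI, a message-passing library commonly used on distributed memory systems. MPI is not named on the slide; it is used here only as a representative example, and the buffer contents are arbitrary.

    /* Minimal sketch: rank 0 sends a block of doubles to rank 1, which must
       receive it explicitly; there is no direct load from remote memory.
       Run with at least two ranks, e.g. mpirun -np 2 ./a.out */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double buf[4] = {1.0, 2.0, 3.0, 4.0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);      /* explicit send */
        } else if (rank == 1) {
            MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                             /* explicit receive */
            printf("rank 1 received %g ... %g\n", buf[0], buf[3]);
        }

        MPI_Finalize();
        return 0;
    }

On a shared memory system such as the Cray XMT, the same data could simply be read through a pointer, which is what makes the programming model easier (at the cost of the synchronization issues discussed on the next slide).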
HPC Systems (cont.)
  [Diagram: Shared Memory HPC System (high-level view): nodes with CPUs, memory, and NICs; any processor has direct access to memory on any node]
  Examples: SGI Altix, HP Superdome, SunFire servers, Cray XMT
  Shared memory helps a lot in easing parallel programming; however, synchronizing accesses to shared data becomes challenging
Processor Architecture
  [Diagram: modern dual-core CPU (>= 2 GHz): execution units and L1 caches per core, shared L2 and L3 caches, memory interface; relative access speeds on the order of 1x (core/L1), 1/10x (L2), 1/50x (L3), and 1/500x (main memory)]
Processor Architecture (cont.)
  The Memory Wall problem: processor speed has been growing much faster than memory speed
  Approximate load latencies: L1 cache 1-2 cycles, L2 cache 10 cycles, L3 cache 50 cycles, main memory 500 cycles
  W. Wulf and S. McKee, "Hitting the Memory Wall: Implications of the Obvious", ACM SIGARCH Computer Architecture News, 1995
Outline
  High-performance computing systems
  Processor architecture
  Memory subsystems
  ☞ Multithreaded processors
  Cray XMT multithreaded system
  Programming models & environments (Cray XMT)
  Examples
Multithreaded Processors
  Commodity memory is slow and custom memory is very expensive: what can be done about it?
  Idea: cover the latency of memory loads with other (useful) computation
  OK, how do we do this?
    Use multiple execution contexts on the same processor and switch between them when issuing load operations
    Execution contexts correspond to threads
  Examples: Cray ThreadStorm processors, Sun Niagara 1 & 2 processors, Intel Hyper-Threading
  (A back-of-the-envelope estimate of how much concurrency this requires follows below.)
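A rough, hedged estimate of how much concurrency is needed to hide memory latency, using the roughly 500-cycle main-memory latency from the memory-wall slide (the one-reference-per-cycle issue rate is an illustrative assumption, not a number from the slides):

    in-flight references needed ≈ memory latency × desired issue rate
                                ≈ 500 cycles × 1 reference/cycle
                                = 500 outstanding references

This is why latency-tolerant designs provide many hardware thread contexts per processor (128 streams on the ThreadStorm, described later): with enough threads, each able to keep references in flight while the others execute, the processor rarely sits idle waiting on memory.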
Multithreaded Processors (cont.)
  [Diagram: execution units shared by hardware thread contexts T0 ... T5]
  Each thread has its own independent instruction stream (program counter)
  Each thread has its own independent register set
Cray XMT multithreaded system
  ThreadStorm processors run at 500 MHz
  128 hardware thread contexts, each with its own set of 32 registers
  No data cache
  128 KB, 4-way associative data buffer on the memory side
  Extra bits in each 64-bit memory word: full/empty bits for synchronization
  Memory is hashed at 64-byte granularity, i.e., logically contiguous 64-byte blocks are mapped to non-contiguous physical locations
  Global shared memory
Cray XMT multithreaded system (cont.)
  3rd-generation multithreaded system from Cray
  Infrastructure is based on the XT3/4, scalable up to 8192 processors
    SeaStar network, torus topology, service and I/O nodes
  Compute nodes contain 4 ThreadStorm multithreaded processors instead of 4 AMD Opteron processors
  Hybrid execution capabilities: code can run on ThreadStorm processors in collaboration with code running on Opteron processors

  Example topology                                      6x12x8      11x12x8     11x12x16    22x12x16    14x24x24
  Processors                                            576         1056        2112        4224        8064
  Memory capacity                                       9 TB        16.5 TB     33 TB       66 TB       126 TB
  Sustainable remote memory reference rate
    (per processor)                                     60 MW/s     60 MW/s     45 MW/s     33 MW/s     30 MW/s
  Sustainable remote memory reference rate
    (aggregate)                                         34.6 GW/s   63.4 GW/s   95.0 GW/s   139.4 GW/s  241.9 GW/s
  Relative size                                         1.0         1.8         3.7         7.3         14.0
  Relative performance                                  1.0         1.8         2.8         4.0         7.0
Cray XMT multithreaded system (cont.)
  [Diagram: system partitioning. Compute Partition: ThreadStorm nodes running MTX (BSD-based). Service & I/O Partition: Linux on specialized nodes (Login PEs, I/O Server PEs, Network Server PEs, FS Metadata Server PEs, Database Server PEs), with PCI-X connections to Fiber Channel RAID controllers and a 10 GigE network.]
Cray XMT multithreaded system (cont.)
  [Diagram: XMT compute blade: four Cray SeaStar2™ interconnect chips, 4 DIMM slots, L0 RAS computer, redundant VRMs]
Outline
  High-performance computing systems
  Processor architecture
  Memory subsystems
  Multithreaded processors
  Cray XMT multithreaded system
  ☞ Programming models & environments (Cray XMT)
  Examples
Programming Models & Environments
  Compute nodes run MTX, a multithreaded UNIX; service nodes run Linux
  I/O calls are offloaded to the service nodes
  Service nodes are used for compilation

  Operating systems: Linux on service and I/O nodes, MTX on compute nodes, syscall offload for I/O
  Run-time system: job launch, node allocator, logarithmic loader
  Programming environment: Cray MTA compiler (C, C++), debugger (mdb), performance tools (canal, traceview)
  High-performance file systems: Lustre
  System management and administration: accounting, Red Storm Management System (RSMS), graphical user interface
Programming Models & Environments (cont.)
  Cray MTA C/C++ Compiler
    Parallelizing compiler technology enabled by XMT hardware features:
      Quasi-uniform memory latency
      No data caches
      Hardware operations for fine-grained synchronization and atomic updates
    Support for data-parallel compilation
      Focus on loops operating on equivalent data sets
    User-provided pragmas (directives) to help out the compiler
    Support for parallelizing reductions: sum, product, min, max
    Support for parallelizing linear recurrences in loops, e.g.:

      do i = 1, n
         x(i) = x(i - 1) + y
      end do
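The slides do not spell out why a loop-carried recurrence like this one is parallelizable, so here is one added step of reasoning: this particular first-order recurrence has a closed form, so every element can be computed independently of the others:

    x(i) = x(i-1) + y   =>   x(i) = x(0) + i*y,   for i = 1, ..., n

More general linear recurrences do not collapse this neatly, but they can still be evaluated in parallel with scan-style techniques, which is presumably what the compiler's recurrence support builds on.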
Examples
  Vector Dot Product:

    double s, a[n], b[n];
    ...
    s = 0.0;
    for (i = 0; i < n; i++)
        s += a[i] * b[i];

  The compiler automatically recognizes and parallelizes the reduction, using a combination of per-thread accumulation followed by atomic updates to the global accumulator (a conceptual sketch of this transformation follows below)
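A conceptual, hand-written sketch of roughly what that transformation looks like. This is not the compiler's actual output: NUM_STREAMS and the strided work distribution are illustrative assumptions, and the final combine is shown as a plain serial loop rather than the hardware atomic update the XMT would use.

    #define NUM_STREAMS 100          /* illustrative stream count, not a hardware constant */

    double dot_product(const double *a, const double *b, int n)
    {
        double partial[NUM_STREAMS] = {0.0};   /* one private accumulator per stream */
        double s = 0.0;
        int t, i;

        for (t = 0; t < NUM_STREAMS; t++)          /* each t stands for one hardware stream */
            for (i = t; i < n; i += NUM_STREAMS)   /* strided slice of the iteration space */
                partial[t] += a[i] * b[i];

        for (t = 0; t < NUM_STREAMS; t++)          /* combine step; atomic on the real machine */
            s += partial[t];

        return s;
    }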
Examples (cont.)
  Sparse Matrix-Vector Product: C (n × 1) = A (n × m) * B (m × 1)
  Store A in packed row form:
    A[nz], where nz is the number of non-zeros
    cols[nz] stores the column index of each non-zero
    rows[n+1] stores the start index of each row in A (with rows[n] = nz)

    #pragma mta use 100 streams
    #pragma mta assert no dependence
    for (i = 0; i < n; i++) {
        int j;
        double sum = 0.0;
        for (j = rows[i]; j < rows[i+1]; j++)
            sum += A[j] * B[cols[j]];
        C[i] = sum;
    }

  (A small sketch showing how the A, cols, and rows arrays are built follows below.)
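To make the packed-row layout concrete, here is a hedged sketch (not from the slides) that builds the three arrays from a dense n × m matrix D stored in row-major order; the function and variable names are illustrative.

    #include <stdlib.h>

    void dense_to_csr(const double *D, int n, int m,
                      double **A_out, int **cols_out, int **rows_out)
    {
        int i, j, nz = 0;

        /* Count non-zeros first so the packed arrays can be sized exactly. */
        for (i = 0; i < n; i++)
            for (j = 0; j < m; j++)
                if (D[i * m + j] != 0.0)
                    nz++;

        double *A    = malloc(nz * sizeof *A);         /* non-zero values           */
        int    *cols = malloc(nz * sizeof *cols);      /* column index of each one  */
        int    *rows = malloc((n + 1) * sizeof *rows); /* row start offsets         */

        int k = 0;
        for (i = 0; i < n; i++) {
            rows[i] = k;                               /* row i starts at A[k]      */
            for (j = 0; j < m; j++)
                if (D[i * m + j] != 0.0) {
                    A[k] = D[i * m + j];
                    cols[k] = j;
                    k++;
                }
        }
        rows[n] = nz;                                  /* sentinel: one past the last row */

        *A_out = A; *cols_out = cols; *rows_out = rows;
    }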
Examples (cont.)
  Parallel Prefix Sum
    Input vector A of length N
    Result vector R of length N: R[i] = A[0] + A[1] + ... + A[i]
  Sequential code:

    sum = 0;
    for (i = 0; i < n; i++) {
        sum += A[i];
        R[i] = sum;
    }

  Difficult to parallelize, since the value at position i depends on the sum of all the previous values...
Examples (cont.)
  Parallel Prefix Sum: parallelization idea
    Add pairs in parallel
    Update intermediate positions with the pairwise sums
Examples (cont.)
  Parallel Prefix Sum
  Actual code:

    int sum[n];
    for (i = 0; i < n; i++) {                /* parallel loop: initialize sum[i] = A[i]      */
        sum[i] = A[i];                       /* so that R matches the sequential definition  */
        R[i] = sum[i];
    }
    for (step = 1; step < n; step *= 2) {    /* sequential: logarithmic number of iterations */
        for (i = step; i < n; i++)           /* parallel loop */
            R[i] = sum[i - step] + sum[i];
        for (i = 0; i < n; i++)              /* parallel loop */
            sum[i] = R[i];
    }

  The outer step loop runs a logarithmic number of times sequentially; the loops inside it (and the initialization loop) are parallel (see the annotated sketch below).
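A hedged sketch of how the parallel loops above could be annotated for the XMT compiler, reusing the #pragma mta assert no dependence directive shown in the sparse matrix-vector example. The slides do not say whether the compiler needs the hint here; the fragment continues with the same variables as the code above.

    for (step = 1; step < n; step *= 2) {       /* sequential: runs O(log n) times            */
        #pragma mta assert no dependence
        for (i = step; i < n; i++)              /* updates are independent within one step    */
            R[i] = sum[i - step] + sum[i];

        #pragma mta assert no dependence
        for (i = 0; i < n; i++)                 /* plain parallel copy                        */
            sum[i] = R[i];
    }

Note that the doubling scheme performs O(n log n) additions in total, versus O(n) for the sequential loop; it pays off only because those additions are spread across many concurrent streams.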
Examples (cont.)
  Parallel Tree Population
    PDTree: "Partial Dimensions" Tree
    Counts combinations of values for different variables
PDTree
  PDTree is implemented using a multiple-type, recursive tree structure
    The root node is an array of ValueNodes (counts for different value instances of the root variables)
    Interior and leaf nodes are linked lists of ValueNodes
  Inserting a record at the top level involves just incrementing the counter of the corresponding ValueNode
    The XMT's int_fetch_add() atomic operation is used to increment counters
  Inserting a record at other levels requires traversing a linked list to find the right ValueNode
    If the ValueNode does not exist, it must be appended to the end of the list
  Inserting at other levels when the node does not exist is tricky
    To ensure safety, the end pointer of the list must be locked
    The readfe() and writeef() XMT operations are used to create critical sections, taking advantage of the full/empty bits on each memory word (a hedged sketch follows below)
  As the data analysis progresses, the probability of conflicts between threads becomes lower
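A hedged sketch of the two XMT idioms named above; this is not the author's code. The ValueNode layout, the make_value_node() allocator, and the exact intrinsic prototypes are assumptions for illustration: the slide only names int_fetch_add(), readfe(), and writeef(), whose declarations come from the XMT system headers.

    typedef struct ValueNode {
        int               value;   /* value instance for this variable  */
        int               count;   /* number of occurrences seen so far */
        struct ValueNode *next;    /* next node in the linked list      */
    } ValueNode;

    /* Case 1: the node already exists; just bump its counter atomically. */
    void count_existing(ValueNode *node)
    {
        int_fetch_add(&node->count, 1);       /* atomic increment, no lock needed */
    }

    /* Case 2: the node may need to be appended at the end of the list.
       The tail's 'next' word is used as a lock via its full/empty bit. */
    void append_if_missing(ValueNode *tail, int value)
    {
        /* readfe(): wait until the word is full, read it, leave it empty.
           This acts as acquiring a lock on the end pointer. */
        ValueNode *nxt = (ValueNode *) readfe(&tail->next);

        if (nxt == NULL) {
            /* Still the end of the list: link in a new node; writeef()
               stores the pointer and sets the word full again (lock release). */
            ValueNode *n = make_value_node(value);   /* hypothetical allocator */
            writeef(&tail->next, n);
        } else {
            /* Another thread appended first: release the word unchanged and
               continue the traversal from the node it installed. */
            writeef(&tail->next, nxt);
        }
    }

This mirrors the race described on the next slide: whichever thread's readfe() succeeds first appends its node, and the other thread finds itself holding a pointer that is no longer the end of the list and keeps traversing.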
PDTree (cont.)
  [Diagram: two threads racing to append to the same list. Before: T1 and T2 both try to grab the end pointer after the nodes vi = j (count) and vi = k (count). After: T1 succeeded and inserted a new node vi = m (count), so T2 now holds a lock on a pointer that is no longer the end of the list.]
Questions
Thanks for your attention.