1
Altix 4700
2
ccNUMA Architecture
Physically distributed memory with a globally shared address space
3
Altix HLRB II – Phase 2
19 partitions with 9728 cores in total
Each partition has 256 Itanium dual-core processors, i.e., 512 cores
–Clock rate 1.6 GHz
–4 flops per cycle per core
–12.8 GFlop/s per processor (6.4 GFlop/s per core)
13 high-bandwidth partitions
–Blades with 1 processor (2 cores) and 4 GB memory
–Front-side bus 533 MHz (8.5 GB/s)
6 high-density partitions
–Blades with 2 processors (4 cores) and 4 GB memory
–Same memory bandwidth per blade
Peak performance: 62.3 TFlop/s (6.4 GFlop/s per core)
Memory: 39 TB
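As a quick consistency check, the peak figure follows directly from the per-core numbers quoted above: 9728 cores × 4 flops/cycle × 1.6 GHz ≈ 62.3 TFlop/s, and 6.4 GFlop/s per core × 2 cores gives the 12.8 GFlop/s per processor.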
4
Memory Hierarchy
L1D: 16 KB, 1 cycle latency, 25.6 GB/s bandwidth, 64-byte cache lines
L2D: 256 KB, 6 cycles, 51 GB/s, 128-byte cache lines
L3: 9 MB, 14 cycles, 51 GB/s, 128-byte cache lines
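Latencies of this kind are typically measured with a dependent-load (pointer-chasing) loop. The sketch below is a generic micro-benchmark, not part of the slides; the working-set size, stride, and iteration count are illustrative values only.

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Minimal pointer-chasing latency probe: each load depends on the previous
 * one, so the loop time divided by the iteration count approximates the
 * average load latency for a working set of SIZE bytes.  A real benchmark
 * would randomize the chain to defeat hardware prefetching. */
#define SIZE   (8 * 1024 * 1024)   /* working set larger than the 9 MB L3 would target memory */
#define STRIDE 128                 /* one L2/L3 cache line */
#define N      (SIZE / STRIDE)
#define ITERS  (10 * 1000 * 1000)

int main(void)
{
    char *buf = malloc(SIZE);
    void **p;
    long i;

    /* Build a circular chain of pointers, one per cache line. */
    for (i = 0; i < N; i++)
        *(void **)(buf + i * STRIDE) = buf + ((i + 1) % N) * STRIDE;

    p = (void **)buf;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERS; i++)
        p = *p;                    /* dependent load */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* print the chased pointer so the compiler cannot drop the loop */
    printf("avg load latency: %.1f ns (p=%p)\n", ns / ITERS, (void *)p);
    free(buf);
    return 0;
}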
5
Interconnect
NUMAlink 4
2 links per blade
Each link: 2 × 3.2 GB/s bandwidth
MPI latency: 1–5 µs
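Taken together, and assuming both links and both directions are counted, this gives each blade up to 2 × 2 × 3.2 GB/s = 12.8 GB/s of aggregate NUMAlink bandwidth.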
6
Disks
Direct-attached disks (temporary large files): 600 TB, 40 GB/s bandwidth
Network-attached disks (home directories): 60 TB, 800 MB/s bandwidth
7
Environment Footprint: 24 m x 12 m Weight: 103 metric tons Electrical power: ~1 MW
8
NUMAlink Building Block
[Diagram: building block with a level-1 NUMAlink 4 router connecting 8 cores (high-bandwidth) or 16 cores (high-density) and attaching the SAN switch, 10 GE, and PCI/FC I/O]
9
Blades and Rack
10
Interconnection in a Partition
11
Interconnection of Partitions
[Figure legend: gray squares: one partition with 512 cores (L: login, B: batch); lines: 2 NUMAlink 4 planes with 16 cables each; each cable: 2 × 3.2 GB/s]
12
Interactive Partition
Login: 32 cores for compile & test
Interactive batch jobs: 476 cores managed by PBS
–daytime interactive usage
–small-scale and nighttime batch processing
–single partition only
High-density blades: 4 cores share one blade's memory
[Diagram: distribution of the 512 cores of the interactive partition over login, OS, and interactive batch blades]
13
18 Batch Partitions
Batch jobs: 510 (508) cores managed by PBS
–large-scale parallel jobs
–single- or multi-partition jobs
5 partitions with high-density blades
13 partitions with high-bandwidth blades
[Diagram: blade and core layout of a batch partition, including the cores reserved for the OS]
14
Bandwidth
15
Coherence Implementation
SHUB2 supports up to 8192 SHUBs (32768 cores)
Coherence domain: up to 1024 SHUBs (4096 cores)
–SGI term: "sharing mode"
–Directory with one bit per SHUB
–Multiple shared copies are supported
Accesses from other coherence domains
–SGI term: "exclusive sharing mode"
–Always translated into exclusive accesses
–Only a single copy is supported
–Directory stores the address of the owning SHUB (13 bits)
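The two directory formats follow from the sizes above: within a coherence domain a presence bit vector needs 1024 bits (one per SHUB) and can track many sharers, whereas across domains a 13-bit pointer is enough to name any of the 2^13 = 8192 SHUBs, which is why only a single remote copy can be tracked there.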
16
SHMEM Latency Model for Altix
SHMEM get latency is the sum of:
–80 nsec for the function call
–260 nsec for memory latency
–340 nsec for the first hop
–60 nsec per hop
–20 nsec per meter of NUMAlink cable
Example for a 64-processor system: max hops is 4, max total cable length is 4 m.
Total SHMEM get latency: 80 + 260 + 340 + 4×60 + 4×20 = 1000 nsec
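As a sketch only (the function name and parameters below are mine, not from the slides), the model can be written down directly in C and reproduces the 64-processor example:

#include <stdio.h>

/* SHMEM get latency model from the slide, in nanoseconds.
 * hops: number of NUMAlink hops, cable_m: total cable length in meters. */
static double shmem_get_latency_ns(int hops, double cable_m)
{
    return 80.0            /* function call        */
         + 260.0           /* memory latency       */
         + 340.0           /* first hop            */
         + 60.0 * hops     /* per-hop cost         */
         + 20.0 * cable_m; /* per meter of cable   */
}

int main(void)
{
    /* 64-processor example from the slide: 4 hops, 4 m of cable */
    printf("%.0f nsec\n", shmem_get_latency_ns(4, 4.0));   /* prints 1000 */
    return 0;
}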
17
Parallel Programming Models
Intra-host, within one Linux image (512 cores): OpenMP, Pthreads, MPI, SHMEM, global segments
Intra-coherency domain (4096 cores) and across the entire Altix system: MPI, SHMEM, global segments
18
Barrier Synchronization
Frequent in OpenMP, SHMEM, and MPI one-sided operations (MPI_Win_fence)
Tree-based implementation using multiple fetch-op variables to minimize contention on the SHUBs
Uncached loads are used to reduce NUMAlink traffic
[Diagram: fetch-op variable located at a HUB, polled by CPUs across the router]
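The slides give no code, but the idea of threads arriving at and spinning on a fetch-op variable can be illustrated with a generic sense-reversing barrier built on C11 atomics. This is a stand-in sketch: atomic_fetch_add plays the role of the SHUB's hardware fetch-op, and the tree structure and uncached polling loads of the real implementation are omitted.

#include <stdatomic.h>
#include <stdbool.h>

/* Minimal centralized sense-reversing barrier (illustration only,
 * not SGI's tree-based fetch-op implementation). */
typedef struct {
    atomic_int  count;     /* fetch-op variable: arrivals so far        */
    atomic_bool sense;     /* flipped when the last thread arrives      */
    int nthreads;
} barrier_t;

void barrier_init(barrier_t *b, int nthreads)
{
    atomic_init(&b->count, 0);
    atomic_init(&b->sense, false);
    b->nthreads = nthreads;
}

void barrier_wait(barrier_t *b, bool *local_sense)
{
    *local_sense = !*local_sense;                  /* sense for this episode */
    if (atomic_fetch_add(&b->count, 1) == b->nthreads - 1) {
        atomic_store(&b->count, 0);                /* last arrival resets    */
        atomic_store(&b->sense, *local_sense);     /* and releases the rest  */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                                      /* spin on polling load   */
    }
}

Each thread keeps its own local_sense flag, initialized to false, and passes it on every call; the per-thread flag is what lets the same barrier object be reused across episodes.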
19
Programming Models
OpenMP within a Linux image
MPI
SHMEM
Shared segments (System V and Global Shared Memory)
20
SHMEM
Can be used for MPI programs where all processes execute the same code.
Enables remote memory access within and across partitions.
Static data and symmetric heap data (allocated with shmalloc or shpalloc) are remotely accessible.
Info: man intro_shmem
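The slides' example on the next slide uses static (and therefore symmetric) data; as a hedged sketch of the symmetric-heap alternative mentioned above, the same put could target memory from shmalloc. The header name and start-up call assume the classic SGI mpp/shmem.h interface.

#include <stdio.h>
#include <mpp/shmem.h>   /* classic SGI SHMEM interface */

int main(void)
{
    long source[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };

    start_pes(0);                                 /* initialize the PEs */
    /* shmalloc is collective: every PE gets a buffer at the same
     * symmetric-heap offset, so remote PEs can address it. */
    long *target = (long *)shmalloc(10 * sizeof(long));

    if (_my_pe() == 0)
        shmem_long_put(target, source, 10, 1);    /* put into PE 1's buffer */
    shmem_barrier_all();

    if (_my_pe() == 1)
        printf("target[0] on PE %d is %ld\n", _my_pe(), target[0]);

    shfree(target);
    return 0;
}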
21
Example
#include <stdio.h>
#include <mpi.h>
#include <mpp/shmem.h>

int main(int argc, char **argv)
{
    long source[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
    static long target[10];          /* static data is symmetric */
    int myrank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {
        /* put 10 elements into target on PE 1 */
        shmem_long_put(target, source, 10, 1);
    }
    shmem_barrier_all();             /* sync sender and receiver */

    if (myrank == 1)
        printf("target[0] on PE %d is %ld\n", myrank, target[0]);

    MPI_Finalize();
    return 0;
}
22
Global Shared Memory Programming
Allocation of a shared memory segment via the collective GSM_alloc.
Similar to memory-mapped files or System V shared segments, but those are limited to a single OS instance; a GSM segment can be distributed across partitions.
–GSM_ROUNDROBIN: pages are distributed round-robin across the processes
–GSM_SINGLERANK: places all pages near a single process
–GSM_CUSTOM_ROUNDROBIN: each process specifies how many pages should be placed in its memory
Data structures can be placed in this memory segment and accessed from all processes with normal load and store instructions.
23
Example
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <gsm.h>                 /* SGI GSM interface; exact header name assumed */

#define ARRAY_LEN 1024

int main(int argc, char **argv)
{
    int rank, i, rc;
    int placement = GSM_ROUNDROBIN;
    int flags = 0;
    size_t size = ARRAY_LEN * sizeof(int);
    int *shared_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Collective allocation of the cross-partition shared segment */
    rc = GSM_Alloc(size, placement, flags, MPI_COMM_WORLD, &shared_buf);

    /* Have one rank initialize the shared memory region */
    if (rank == 0) {
        for (i = 0; i < ARRAY_LEN; i++)
            shared_buf[i] = i;
    }
    MPI_Barrier(MPI_COMM_WORLD);

    /* Have every rank verify it can read from the shared memory */
    for (i = 0; i < ARRAY_LEN; i++) {
        if (shared_buf[i] != i) {
            printf("ERROR!! element %d = %d\n", i, shared_buf[i]);
            printf("Rank %d - FAILED shared memory test.\n", rank);
            exit(1);
        }
    }

    MPI_Finalize();
    return 0;
}
24
Summary
Altix 4700 is a ccNUMA system with >60 TFlop/s peak performance
MPI messages are sent with a two-copy or single-copy protocol
Hierarchical coherence implementation
–intra-node
–coherence domain
–across coherence domains
Programming models: OpenMP, MPI, SHMEM, GSM
25
The Compute Cube of LRZ