OGO 2.1 SGI Origin 2000 Robert van Liere CWI, Amsterdam TU/e, Eindhoven 11 September 2001
unite.sara.nl SGI Origin 2000 Located at SARA in Amsterdam Hardware configuration: –128 MIPS R MHz –64 Gbyte main memory –1 Tbyte disk storage – Mbits –1 1 Gbit
Contents Architecture –Overview –Module interconnect –Memory hierarchies Programming –Parallel models –Data placement Pros and cons
Overview - Features 64-bit RISC microprocessors Large main memory “Scalable” in CPU, memory and I/O Shared memory programming model
Overview - Applications Worldwide: +/ systems –~ 50 with >128 CPUs –~ 100 with CPUs –~ 500 with CPUs Compute serving: many CPUs and memory Database serving: many disks Web serving: many I/O
System architecture – 1 CPU CPU + cache One system bus Memory I/O (network + disk) Cached data
System architecture – N CPU Symmetric multiprocessing (SMP) Multi-CPU + caches One shared bus Memory I/O
N CPU – cache coherency Problem: –Inconsistent cached data Solution: –Snooping –Broadcasting Not scalable: all coherency traffic crosses the one shared bus
Architecture – Origin 2000 Node board 2 CPU + cache Memory Directory HUB I/O
Origin 2000 Interconnect Node boards Routers –Six ports
Interconnect Topology
Sample Topologies
128 Topology
Virtual Memory One CPU, multiple programs Page Paging disk Page replacement
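To make "page replacement" concrete, here is a minimal sketch that counts page faults under a FIFO policy. FIFO is chosen only because it is short; the slides do not say which policy IRIX uses. The names `fifo_page_faults` and `MAX_FRAMES` are hypothetical.

```c
#include <stddef.h>

#define MAX_FRAMES 16   /* illustrative upper bound, not a system limit */

/* Count page faults for a reference string under FIFO replacement.
   Returns -1 if nframes is outside 1..MAX_FRAMES. */
int fifo_page_faults(const int *refs, size_t nrefs, int nframes)
{
    int frames[MAX_FRAMES];
    int used = 0;     /* frames currently holding a page */
    int oldest = 0;   /* index of the frame that was loaded longest ago */
    int faults = 0;

    if (nframes < 1 || nframes > MAX_FRAMES)
        return -1;

    for (size_t i = 0; i < nrefs; i++) {
        int hit = 0;
        for (int f = 0; f < used; f++) {
            if (frames[f] == refs[i]) {
                hit = 1;
                break;
            }
        }
        if (hit)
            continue;

        faults++;                      /* page must be fetched from the paging disk */
        if (used < nframes) {
            frames[used++] = refs[i];  /* a free frame is still available */
        } else {
            frames[oldest] = refs[i];  /* evict the oldest resident page */
            oldest = (oldest + 1) % nframes;
        }
    }
    return faults;
}
```

For the reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 this gives 9 faults with three frames and 10 with four, so adding memory does not always reduce faults under FIFO.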
O2000 Virtual Memory Multiple CPUs, multiple programs Non-Uniform Memory Access (NUMA) Efficient programs: –Minimize data movement –Keep data “close” to the CPU that uses it
Latencies and Bandwidth
Application performance Scientific computing –LU, ocean, barnes, radiosity Linear speedup –More CPUs -> proportionally more performance
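"Linear speedup" can be checked from measured run times. The sketch below uses the usual definitions, speedup S(p) = T(1) / T(p) and efficiency E(p) = S(p) / p. The function names are hypothetical, and the times are whatever the reader measures (for example with timex).

```c
/* Speedup S(p) = T(1) / T(p); returns 0.0 for non-positive times. */
double speedup(double t_serial, double t_parallel)
{
    if (t_serial <= 0.0 || t_parallel <= 0.0)
        return 0.0;
    return t_serial / t_parallel;
}

/* Parallel efficiency E(p) = S(p) / p; 1.0 corresponds to linear speedup. */
double efficiency(double t_serial, double t_parallel, int ncpus)
{
    if (ncpus < 1)
        return 0.0;
    return speedup(t_serial, t_parallel) / ncpus;
}
```

A run that takes 120 s on one CPU and 15 s on eight has speedup 8 and efficiency 1.0; if it took 30 s instead, efficiency would drop to 0.5.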
Programming support IRIX operating system Parallel programming –C source level with compiler pragmas –POSIX threads –UNIX processes Data placement –dplace, dlook, dprof Profiling –timex, ssrun
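As an illustration of "C source level with compiler pragmas", here is a minimal sketch using an OpenMP pragma. The slide does not name a pragma dialect, so OpenMP is an assumption here, and the function name `dot` is hypothetical.

```c
#include <stddef.h>

/* Dot product with the loop iterations shared among threads.
   If the compiler is run without OpenMP support, the pragma is ignored
   and the loop simply runs serially with the same result. */
double dot(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    long i;

    /* reduction(+:sum) gives each thread a private partial sum
       and adds the partial sums together at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < (long)n; i++)
        sum += x[i] * y[i];

    return sum;
}
```

The source stays an ordinary sequential C loop; parallelism is switched on by a compiler option (for example `-fopenmp` with GCC).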
Parallel Programs Functional Decomposition –Decompose the problem into different tasks Domain Decomposition –Partition the problem’s data structure Consider –Mapping tasks/parts onto CPUs –Coordinating work and communication among CPUs
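Domain decomposition often starts by giving each CPU a contiguous block of the data. The sketch below shows one common way to do that, assuming a one-dimensional array; `range_t` and `block_range` are hypothetical names.

```c
#include <stddef.h>

/* Half-open index range [begin, end). */
typedef struct {
    size_t begin;
    size_t end;
} range_t;

/* Split n items into nparts contiguous blocks whose sizes differ by at
   most one. Returns the empty range {0, 0} for invalid arguments. */
range_t block_range(size_t n, size_t nparts, size_t part)
{
    range_t r = {0, 0};

    if (nparts == 0 || part >= nparts)
        return r;

    size_t base = n / nparts;
    size_t extra = n % nparts;   /* the first `extra` parts get one more item */

    r.begin = part * base + (part < extra ? part : extra);
    r.end = r.begin + base + (part < extra ? 1 : 0);
    return r;
}
```

With 10 items and 4 parts the blocks are [0, 3), [3, 6), [6, 8) and [8, 10). Contiguous blocks also tend to keep each CPU's data together in memory, which fits the "data close to CPU" advice.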
Task Decomposition Decompose problem Determine dependencies
Task Decomposition Map tasks onto threads Compare: –Sequential case –Parallel case
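The sequential and parallel cases can be compared with a small POSIX threads sketch. It assumes independent tasks (each sums one slice of an array); `task_t`, `run_task`, `sum_sequential`, `sum_parallel` and `NTASKS` are hypothetical names, and one thread per task is only the simplest possible mapping.

```c
#include <pthread.h>
#include <stddef.h>

#define NTASKS 4   /* illustrative task count */

typedef struct {
    const int *data;
    size_t begin, end;   /* half-open slice [begin, end) */
    long result;
} task_t;

/* One task: sum data[begin, end) into the task's own result field. */
static void *run_task(void *arg)
{
    task_t *t = arg;
    t->result = 0;
    for (size_t i = t->begin; i < t->end; i++)
        t->result += t->data[i];
    return NULL;
}

/* Sequential case: the whole array is one task on the calling thread. */
long sum_sequential(const int *data, size_t n)
{
    task_t t = { data, 0, n, 0 };
    run_task(&t);
    return t.result;
}

/* Parallel case: one thread per task; results are combined after join. */
long sum_parallel(const int *data, size_t n)
{
    pthread_t tid[NTASKS];
    task_t task[NTASKS];
    int created[NTASKS];
    long total = 0;

    for (int k = 0; k < NTASKS; k++) {
        task[k].data = data;
        task[k].begin = n * k / NTASKS;
        task[k].end = n * (k + 1) / NTASKS;
        task[k].result = 0;
        created[k] = (pthread_create(&tid[k], NULL, run_task, &task[k]) == 0);
        if (!created[k])
            run_task(&task[k]);   /* fall back to running the task inline */
    }
    for (int k = 0; k < NTASKS; k++) {
        if (created[k])
            pthread_join(tid[k], NULL);   /* wait before reading the result */
        total += task[k].result;
    }
    return total;
}
```

Each thread writes only to its own `task_t`, so the only dependency between tasks is the final combination step after the joins.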
Efficient programs Use many CPUs –Measure speedups Avoid: –Excessive data dependencies –Excessive cache misses –Excessive inter-node communication
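One common source of excessive cache misses is traversing an array against its memory layout. The sketch below sums the same matrix in two loop orders; the sizes and function names are hypothetical, and the actual timing difference depends on the machine.

```c
#include <stddef.h>

#define ROWS 256
#define COLS 256

/* Row by row: consecutive accesses are adjacent in memory, because C
   stores two-dimensional arrays in row-major order. */
double sum_row_major(double a[ROWS][COLS])
{
    double s = 0.0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            s += a[i][j];
    return s;
}

/* Column by column: consecutive accesses are COLS * sizeof(double) bytes
   apart, so each access typically lands on a different cache line. */
double sum_col_major(double a[ROWS][COLS])
{
    double s = 0.0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            s += a[i][j];
    return s;
}
```

Both functions do the same arithmetic on the same elements; only the access order differs, so timing them against each other isolates the effect of memory access pattern.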
Pros vs Cons Pros: –Multi-processor (128 CPUs) –Large memory (64 Gbyte) –Shared memory programming Cons: –Slow integer CPU –Performance penalty for data dependencies and off-board memory