1
The Tianhe-1A Supercomputer System and Some Applications in NSCC-TJ
For the CAS2K11 Workshop
Associate Prof. Zhang Lilun, NUDT
2
Background: National Supercomputer Center in Tianjin (NSCC-TJ)
Located in Tianjin; the national HPC platform for northern China
– Jointly sponsored by the Chinese Ministry of Science and Technology and the Tianjin Binhai New Area
– Public information infrastructure: provides high performance computing service to all of China
– Open platform: accelerates education, the economy and industry
3
Background: National University of Defense Technology (NUDT)
Located in Changsha, Hunan Province, in south-central China
8 colleges/schools:
– Aerospace and Material Engineering
– Science
– Mechanics Engineering and Automation
– Electronic Science and Engineering
– Information System and Management
– Computer Science
– Photo-Electronic Science and Engineering
– Social Sciences
4
Outline
– Tianhe-1A System
– Applications: general applications; meteorological applications
– Summary
5
Tianhe-1A System
6
Design Motivation
High productivity system
– High performance
– High bandwidth and low communication latency
– High throughput and large capacity I/O
– Maintainability and usability
Brief history of the system implementation
– Sep. 2009: CPU and GPU hybrid architecture validated by the TH-1 system
– Aug. 2010: enhanced system installed in NSCC-TJ
– Sep.–Oct. 2010: debugging and performance testing
– Nov. 2010: put into public service
7
From chip to the whole TH-1A system (diagram): chips (FT-1000, Intel X5670, NVIDIA M2050) → quad-CPU blade and twin-GPU blade → compute node rack → cabinet → TH-1A system, with the TH-Net interconnect and on-line storage
8
Specification
– Processors: 14,336 Intel CPUs + 7,168 NVIDIA GPUs + 2,048 FT-1000 CPUs
– Peak performance: 4.7 PFlops; Linpack: 2.57 PFlops
– Interconnect: proprietary high-speed interconnection network TH-Net
– Memory: 262 TB in total
– Storage: global shared parallel storage system, 2 PB
– Cabinets: 140 in total (112 compute, 8 service, 14 storage, 6 communication)
– Power consumption: 4.04 MW (635.15 MFlops/W)
– Cooling: water cooling system
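The 635.15 MFlops/W figure appears to be based on the Linpack result rather than the peak; a quick check of the arithmetic (mine, assuming the unrounded Linpack number of about 2.566 PFlops):

```latex
\frac{2.566\ \mathrm{PFlops}}{4.04\ \mathrm{MW}} \approx 635.15\ \mathrm{MFlops/W}
```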
9
Node and Network
Four types of node:
– Compute node: 7,168 in total
  CPU: 2 Xeon X5670, 140.64 GFlops
  GPU: NVIDIA Tesla M2050, 515 GFlops, connected to the CPUs by PCI-E
  655.64 GFlops and 32 GB of memory per node
– Service node: 1,024 in total
  2 eight-core domestic CPUs (FT-1000, SoC, 1.0 GHz, peak performance 8 GFlops)
  32 GB of memory per node
  Used for login, compilation, and applications that need throughput computing
– I/O management node
– I/O storage node
Two networks:
– The high-speed communication network (TH-Net)
– The maintenance and diagnostic network
10
High-speed network TH-Net
Hierarchical fat-tree structure
– 1st layer: 16 nodes connected by a 16-port switch board; the main boards of the nodes connect to the switch board via a backplane
– 2nd layer: all first-layer parts connect to eleven 384-port switches, linked with QSFP optical fibers
11
High-speed network
High-radix router ASIC (NRC):
– High-radix tile NoC
– Efficient communication protocol
– 16 symmetrical interconnect interfaces
– Throughput of a single NRC: 2.56 Tbps
Network interface ASIC (NIC):
– Hosts a 16-lane PCIe 2.0 interface
– Connects to the NRC via 8 × 10 Gbps links
(Diagram link labels: 16 × 5 Gbps, 64 × 266 MHz, 8 × 10 Gbps)
12
High-speed network
Performance:
– Point-to-point bidirectional bandwidth: 160 Gbps; latency: 22 ns
– Switch: 384 ports; switching density: 61.44 Tbps
– System: collective bandwidth 1,146.88 Tbps; bisection bandwidth 307.2 Tbps
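These figures are internally consistent; a quick check (my arithmetic, using the 160 Gbps bidirectional link bandwidth and the node and port counts above):

```latex
384\ \mathrm{ports} \times 160\ \mathrm{Gbps} = 61.44\ \mathrm{Tbps}
\qquad
7168\ \mathrm{nodes} \times 160\ \mathrm{Gbps} = 1146.88\ \mathrm{Tbps}
```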
13
TH-1A software stack
14
Operating system: Kylin Linux
– Custom compute node kernel
– Provides virtual running environments: isolated running environments for different users; custom software package installation
– QoS support
– Power-aware computing
15
GLEX (Galaxy Express) communication system
High bandwidth, low latency, scalable
– Latency: 1.57 μs; bandwidth: 6,340 MB/s
User-level API
– Reduced communication protocol
– Provides non-blocking operations to overlap communication and computation
Kernel-level API
– For other kernel modules: TCP/IP driver, Lustre LNET kernel module
GPU-Direct support
– Zero-copy RDMA for CUDA pinned memory
– Improves Linpack performance by more than 5%
MPICH2-GLEX
– MPI-2.2 standard compliance (see the illustration below)
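The GLEX user-level API itself is proprietary and not shown in the slides; as a hedged illustration only, the same overlap of communication and computation is available to applications through standard non-blocking MPI calls on top of MPICH2-GLEX (buffer names and sizes below are made up for the example):

```c
/* Illustrative only: overlap communication and computation with
 * non-blocking MPI, as supported by MPICH2-GLEX over GLEX.
 * Build with: mpicc -O2 overlap.c -o overlap                    */
#include <mpi.h>
#include <stdlib.h>

#define N (1 << 20)

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *halo_send = malloc(N * sizeof(double));
    double *halo_recv = malloc(N * sizeof(double));
    double *interior  = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) { halo_send[i] = rank; interior[i] = i; }

    int right = (rank + 1) % size, left = (rank - 1 + size) % size;
    MPI_Request reqs[2];

    /* Post the halo exchange first ...                            */
    MPI_Irecv(halo_recv, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(halo_send, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... then compute on interior data while the network works.  */
    for (int i = 0; i < N; i++) interior[i] *= 2.0;

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* complete before touching halo_recv */

    free(halo_send); free(halo_recv); free(interior);
    MPI_Finalize();
    return 0;
}
```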
16
Global parallel file system
Object storage architecture (Lustre-based)
Performance: collective bandwidth > 150 GB/s
– Optimized file system protocol over the proprietary interconnection network
– Fine-grained distributed file lock mechanism
– Optimized file cache policy
Reliability enhancements
– Fault tolerance in the network protocol
– Data object placement
– Soft RAID
Capacity: 2 PB
17
Compiler system
Heterogeneous programming framework
– Accelerates the development of large-scale, complex applications
– Uses the computing power of the CPUs and GPUs efficiently
Features
– Hybrid programming model: MPI + OpenMP/Pthreads + CUDA/OpenCL
– Inter-node homogeneous parallel programming: for users, hides parallel programming details
– Intra-node heterogeneous parallel computing: for computer experts, hides GPU programming
18
Hybrid cooperating programming model
System architecture
– Intra-node: hybrid architecture (CPU + GPU)
– Inter-node: symmetric architecture
Multi-level parallelism (see the sketch below)
– Level 1: MPI processes across nodes
– Level 2: OpenMP threads executed on CPU cores
– Level 3: some OpenMP threads invoke GPU kernels, which execute on CUDA cores
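A compact sketch of the three levels described above (mine, not the actual TH-1A framework; build with nvcc, OpenMP enabled, and link against MPI):

```c
/* Level 1: MPI ranks across nodes; level 2: OpenMP threads on CPU cores;
 * level 3: one thread launches a CUDA kernel on the node's GPU.          */
#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void scale_gpu(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);                        /* level 1: one rank per node */
    const int n = 1 << 20;
    float *h = (float*)malloc(n * sizeof(float));  /* GPU portion (host copy)    */
    float *c = (float*)malloc(n * sizeof(float));  /* CPU portion                */
    for (int i = 0; i < n; i++) { h[i] = 1.0f; c[i] = 1.0f; }
    float *d;  cudaMalloc(&d, n * sizeof(float));

    #pragma omp parallel num_threads(12)           /* level 2: 2 x 6-core X5670  */
    {
        int tid = omp_get_thread_num(), nthr = omp_get_num_threads();
        if (tid == 0) {                            /* level 3: GPU-driving thread */
            cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
            scale_gpu<<<(n + 255) / 256, 256>>>(d, n);
            cudaDeviceSynchronize();
        } else {                                   /* other threads keep the CPU busy */
            int chunk = n / (nthr - 1);
            int start = (tid - 1) * chunk;
            int end   = (tid == nthr - 1) ? n : start + chunk;
            for (int i = start; i < end; i++) c[i] *= 2.0f;
        }
    }

    cudaFree(d); free(h); free(c);
    MPI_Finalize();
    return 0;
}
```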
19
Programming environment
Virtual running environments
– Provide services on demand
Parallel toolkit
– Based on Eclipse
– Editor, debugger, profiler
Workflow support
– QoS negotiation
– Reservation
20
Applications
21
Selected HPC users on TH-1A: more than 100 groups
22
Profile of resource usage
23
Application cases on Tianhe-1A
General (non-meteorological) applications
– Petroleum exploration
– Bio-medical research
– New energy research
– New material research
– Engineering design, simulation and analysis
– …
Meteorological applications
– Shallow water atmosphere model (CPU only)
– GRAPES model (CPU only)
– RRTM-LW radiation (CPU/GPU)
24
General case (CPU only)
Petroleum seismic data processing on Tianhe-1A
– GeoEast-Lightning single/double-way wave prestack depth migration software
– Using 85,860 cores (7,155 nodes)
– 24.6 TB of data processed in 16 hours (1,050 km²)
(Figures: one-way wave depth migration profile; reverse-time depth migration depth slices; reverse-time depth migration profile)
From BGP Inc., China National Petroleum Corporation
25
General case (CPU only)
Thermal convection in the Earth's outer core
– Using 24,576 cores
– Parallel efficiency 87%
– 60 billion unknowns
– From Dr. Chao Yang, Chinese Academy of Sciences
Long-timescale molecular dynamics simulations of protein structure and function
– 30,000-atom system: 384 cores, 400 ns/day
– 400,000-atom system: 2,048 cores, 110 ns/day
– Near future: 1-million all-atom systems on tens of thousands of cores at hundreds of ns/day
– From the Shanghai Institute of Materia Medica
26
General case (CPU + GPU)
FreeBSD MD5 crypt cracker, brute-force attack
– Passwords checked per second on a single node: 50 thousand without GPU, 250 thousand with GPU
– Linearly scalable over the whole system (186,368 cores)
– Passwords checked per second on Tianhe-1A: 1.8 billion
– Expected time for a successful attack on Tianhe-1A (alpha and digits): key length 6: 16 s; 7: 16.7 min; 8: 17.3 h; 9: 44.5 days
Direct numerical simulation of turbulent flow
– GPU-accelerated FFT solver (PKUFFT)
– Taylor microscale Reynolds number up to 1,164
– Grid resolution up to 8192³
– 7,168 nodes, > 3.2 million CUDA cores (> 100,000 GPU cores)
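The expected-attack-time figures above are consistent with the 1.8 billion checks per second rate, assuming "alpha and digits" means a 62-character alphabet and a successful attack is expected after searching half the key space (my arithmetic, not from the slides):

```latex
E[T] \approx \frac{62^{L}}{2 \times 1.8 \times 10^{9}\ \mathrm{s^{-1}}}
\;\Rightarrow\;
L=6:\ \approx 16\ \mathrm{s},\quad
L=7:\ \approx 16\ \mathrm{min},\quad
L=8:\ \approx 17\ \mathrm{h},\quad
L=9:\ \approx 44\ \mathrm{d}
```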
27
General case (CPU + GPU)
High-speed collision system
– Force calculation is accelerated by the GPU
– 21.9x speedup on a single GPU compared to a single CPU core
– Excellent strong scalability up to 4,096 nodes (106,496 CPU/GPU cores) for problems with up to 11.16 billion atoms on Tianhe-1A
Trans-scale simulation of the silicon deposition process
– Scalable bond-order potential (BOP) for molecular dynamics simulation of crystalline silicon
– 1.17 PFlops in SP plus 92.1 TFlops in DP on 7,168 GPUs and 86,016 CPU cores
– 1.87 PFlops in single precision (SP) on 7,168 GPUs
28
Meteorological case (CPU only) – by the Chinese Academy of Sciences
29
Meteorological case: GRAPES (CPU only)
GRAPES: Global/Regional Assimilation and PrEdiction System
– Co-supported by the Chinese Ministry of Science and Technology and the CMA
– Goal: develop a new generation of unified NWP model
  Research and operation
  Global and regional
  Hydrostatic and non-hydrostatic dynamic core
  Large-scale and meso-scale weather forecast and climate prediction
From Prof. Zhiyan Jin, China Meteorological Administration
30
Meteorological case: GRAPES
GRAPES model program structure (diagram)
31
GRAPES global non-hydrostatic model: forecast test
– Horizontal resolution: 0.15°
– Vertical levels: 36
– Time step: 450 s
– Forecast length: 6 hours (48 steps) and 10 days
– Size of input data: ~10 GB
– Number of CPU cores (2D mesh partition parallelism, set up as sketched below): 256 (32×8), 512 (32×16), 1,024 (32×32), 2,048 (32×64), 4,096 (32×128)
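The slides only list the process-mesh shapes; a minimal sketch (mine, not GRAPES source) of setting up such a 2D mesh partition with an MPI Cartesian topology:

```c
/* Minimal 2D mesh partition sketch (illustrative; not GRAPES code).
 * Run with e.g. mpirun -np 256 ./mesh for the 32 x 8 configuration. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int dims[2]    = {32, size / 32};   /* matches the listed shapes 32x8 ... 32x128 */
    int periods[2] = {1, 0};            /* assumed: periodic in longitude only       */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

    int cart_rank, coords[2];
    MPI_Comm_rank(cart, &cart_rank);
    MPI_Cart_coords(cart, cart_rank, 2, coords);

    if (cart_rank == 0)
        printf("process mesh: %d x %d\n", dims[0], dims[1]);
    /* each rank now owns the horizontal sub-domain at (coords[0], coords[1]) */

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```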
32
Meteorological case: GRAPES
GRAPES wall-clock time of the different model parts (chart; time in seconds)
33
Meteorological case: GRAPES
GRAPES speedup of the different model parts (chart); computation = dynamics (dyn) + physics (phy)
34
Meteorological case: GRAPES
GRAPES: ratio of the main functions in the computation (chart)
35
Meteorological case: GRAPES
GRAPES wall-clock time and speedup for the 10-day forecast (chart, 512 to 4,096 CPU cores)
– Linearly scalable up to 2,048 CPU cores
36
Meteorological case: RRTM (CPU + GPU)
RRTM_LW long-wave radiation physics
– One of the five WRF parallel benchmarks
– Also used in the GRAPES model
– An embarrassingly parallel problem
– Needs improved computational efficiency
Questions to consider when using CPU + GPU
– Which hybrid programming mode: MPI/CUDA (MC) or MPI/OpenMP/CUDA (MOC)?
– How to balance the workload of the CPUs and the GPU within a single node?
From Dr. Fengshun Lu, NUDT
37
Meteorological case: RRTM
MPI + CUDA (MC)
– Multiple MPI processes per node
– Many small data copies over PCIe to the GPU
38
Meteorological case: RRTM
MPI + OpenMP/CUDA (MOC), sketched below
– A single MPI process per node
– Multiple OpenMP threads
– The master thread drives the GPU
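A minimal sketch of the MOC pattern, assuming placeholder column-wise radiation routines (rrtm_lw_gpu and rrtm_lw_cpu are hypothetical stand-ins, not the real RRTM_LW code), with the GPU share set by a ratio alpha as on the following slides:

```c
/* MOC sketch: one MPI process per node, an OpenMP team inside it,
 * thread 0 owns the GPU, the other threads take the CPU share.     */
#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void rrtm_lw_gpu(float *cols, int ncols) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < ncols) cols[i] += 1.0f;                 /* stand-in for the radiation physics */
}

static void rrtm_lw_cpu(float *cols, int ncols) {
    for (int i = 0; i < ncols; i++) cols[i] += 1.0f;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);                         /* single MPI process per node       */
    const int   ncols = 1 << 16;                    /* atmospheric columns on this node  */
    const float alpha = 0.5f;                       /* GPU share of the workload         */
    const int   ngpu  = (int)(alpha * ncols);

    float *cols = (float*)calloc(ncols, sizeof(float));
    float *d_cols;  cudaMalloc(&d_cols, ngpu * sizeof(float));

    #pragma omp parallel
    {
        int tid = omp_get_thread_num(), nthr = omp_get_num_threads();
        if (tid == 0) {
            /* one large transfer and one launch, instead of many small MC copies */
            cudaMemcpy(d_cols, cols, ngpu * sizeof(float), cudaMemcpyHostToDevice);
            rrtm_lw_gpu<<<(ngpu + 255) / 256, 256>>>(d_cols, ngpu);
            cudaDeviceSynchronize();
            cudaMemcpy(cols, d_cols, ngpu * sizeof(float), cudaMemcpyDeviceToHost);
        } else if (nthr > 1) {
            /* the remaining threads split the CPU share of the columns */
            int ncpu  = ncols - ngpu;
            int chunk = (ncpu + nthr - 2) / (nthr - 1);
            int start = ngpu + (tid - 1) * chunk;
            int end   = start + chunk;  if (end > ncols) end = ncols;
            if (start < ncols) rrtm_lw_cpu(&cols[start], end - start);
        }
    }

    cudaFree(d_cols);  free(cols);
    MPI_Finalize();
    return 0;
}
```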
39
Meteorological case: RRTM
Balancing the workload between the CPUs and the GPU
– M n-core CPUs linked to one GPU within a single node
– α: ratio of the load assigned to the GPU to the total load
– P_CPU: computing capacity of each CPU core
– P_GPU: computing capacity of a single GPU
– W: workload
– T: wall-clock time of the CPU/GPU computation for workload W
– T_CPU (T_GPU): wall-clock time of the workload handled by the CPU (GPU)
On Tianhe-1A: M = 2, n = 6
From Dr. Fengshun Lu, NUDT
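The balance expression itself did not survive this transcript; reconstructed from the definitions above (my derivation, hedged), requiring the GPU part and the CPU part to finish at the same time gives the predicted ratio:

```latex
T_{GPU} = \frac{\alpha W}{P_{GPU}}, \qquad
T_{CPU} = \frac{(1-\alpha)\,W}{M\,n\,P_{CPU}}, \qquad
T_{GPU} = T_{CPU}
\;\Rightarrow\;
\alpha = \frac{P_{GPU}}{M\,n\,P_{CPU} + P_{GPU}} = \frac{S}{M\,n + S}
```

with S = P_GPU / P_CPU (defined on the next slide); for Tianhe-1A (M = 2, n = 6) this reduces to α = S / (12 + S).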
40
Meteorological case: RRTM
S: the speedup of the GPU over a single CPU core
Is the predicted ratio useful?
41
Meteorological case: RRTM
Validation of the workload distribution scheme: small scale
– em_real test case of WRF 2.1.1, 42 × 42 × 27 grid
– Predicted ratio: 0.485; measured ratio: 0.376; difference: 0.109
42
Meteorological case: RRTM
Validation of the workload distribution scheme: medium scale
– em_real test case of WRF 2.1.1, 73 × 60 × 27 grid
– Predicted ratio: 0.597; measured ratio: 0.5; difference: 0.097
43
Meteorological case: RRTM
Validation of the workload distribution scheme: large scale
– em_real test case of WRF 2.1.1, 84 × 84 × 27 grid
– Predicted ratio: 0.532; measured ratio: 0.601; difference: 0.068
44
Meteorological case: RRTM
Wall-clock time of RRTM_LW (2.5 km resolution WPB test case):
– 256 nodes: CPU 7.153 s, CPU/GPU 3.012 s, speedup 2.375
– 512 nodes: CPU 3.819 s, CPU/GPU 1.705 s, speedup 2.240
– 768 nodes: CPU 2.895 s, CPU/GPU 1.471 s, speedup 1.968
– 1,024 nodes: CPU 2.031 s, CPU/GPU 0.978 s, speedup 2.077
Hybrid computing scales linearly up to 1,024 nodes
45
Summary: changes in parallel application scale
– CPU only
  1,024 cores: 47% (< 12% in 2009)
  4,096 cores: 22% (< 5% in 2009)
  4 applications scale to the whole system (> 80,000 cores)
– CPU + GPU
  6 applications now; the largest scale to the whole system (> 80,000 CPU cores + 100,000 GPU cores)
  Tens of applications on the way
46
Summary
Our principles: practicality and usability
– Mature technology, correctness and functionality
– Optimization techniques to improve performance, scalability and reliability
Still to do
– Encourage more CPU + GPU applications
– A more usable programming framework
– Continue to optimize the performance and resiliency of the system based on user feedback
– Extensive collaboration
47
Thank you