1
Perspective on Extreme Scale Computing in China Depei Qian Sino-German Joint Software Institute (JSI) Beihang University Co-design 2013, Guilin, Oct. 29, 2013
2
Outline Related R&D programs in China HPC system development Application service environment Applications
3
Related R&D programs in China
4
HPC-related R&D under NSFC: NSFC key initiative "Basic algorithms for high performance scientific computing and computable modeling", 2011-2018, 180 million RMB. Topics: basic algorithms and their highly efficient implementation; computable modeling; verification by solving domain problems.
5
HPC-related R&D under the 863 program. Three key projects in the last 12 years: High performance computer and core software (2002-2005); High productivity computer and Grid service environment (2006-2010); High productivity computer and application environment (2011-2016). Three major projects: Multicore/many-core programming support (2012-2015); High performance parallel algorithms and parallel coupler development for earth system study (2010-2013); HPC software support for earth system modeling (2010-2013).
6
HPC-related R&D under the 973 program: high performance scientific computing; large scale scientific computing; aggregation and coordination mechanisms in virtual computing environments; highly efficient and trustworthy virtual computing environments.
7
There is no national long-term R&D program on extreme scale computing; coordination between the different programs is needed.
8
Shift of 863 program emphasis 1987: Intelligent computers, following the 5th generation computer program in Japan 1990: from intelligent computers to high performance parallel computers 1999: from individual HPC system to the national HPC environment 2006: from high performance computers to high productivity computers
9
History of HPC development under the 863 program. 1990: parallel computers identified as a priority topic of the 863 program; National Intelligent Computer R&D Center established. 1993: Dawning 1, 640 MIPS, SMP. 1995: Dawning 1000, 2.5 GFlops, MPP; Dawning company established in 1995. 1996: Dawning 1000A, cluster system, the first product-oriented system of Dawning. 1998: Dawning 2000, 100 GFlops, cluster.
10
History of HPC development under 863 program 2000: Dawning 3000, 400GFlops, cluster, First system commercialized 2002: Lenovo DeepComp 1800, 1TFlops, cluster Lenovo entered the HPC market 2003: Lenovo DeepComp 6800, 5.3TFlops, cluster 2004: Dawning 4000A, 11.2TFlops
11
History of HPC development under the 863 program. 2008: Lenovo DeepComp 7000, 150 TFlops, heterogeneous cluster; Dawning 5000A, 230 TFlops, cluster. 2010: Dawning 6000, 3 PFlops, heterogeneous CPU+GPU system; TH-1A, 4.7 PFlops, heterogeneous CPU+GPU. 2011: Sunway BlueLight, 1 PFlops + 100 TFlops, based on domestic processors. 2013: TH-2, heterogeneous system with CPU+MIC.
12
863 key projects on HPC and Grid, 2002-2010. "High performance computer and core software": 4-year project, May 2002 to Dec. 2005; 100 million Yuan funding from the MOST; more than 2× associated funding from local governments, application organizations, and industry; major outcome: China National Grid (CNGrid). "High productivity computer and Grid service environment": period 2006-2010 (extended to now); 940 million Yuan from the MOST and more than 1 billion Yuan matching money from other sources.
13
Current 863 key project: "High productivity computer and application environment", 2011-2015 (2016), 1.3 billion Yuan investment secured. Goals: develop leading-level high performance computers; transform CNGrid into an application service environment; develop parallel applications in selected areas.
14
Projects launched The first round of projects launched in 2011 High productivity computer (1) 100PF by the end of 2015 HPC applications (6) Fusion simulation Simulation for aircraft design Drug discovery Digital media Structural mechanics for large machinery Simulation of electro-magnetic environment Parallel programming framework (1) Application service environment will be supported in the second round Emphasis on application service support Technologies for new mode of operation
15
HPC system development
16
Major challenges: power consumption, performance obtained by applications, programmability, resilience. Major obstacles: the memory wall, the power wall, the I/O wall, ...
17
Power consumption is the limiting factor in implementing extreme scale computers: it is impossible to increase performance by expanding system scale alone; cooling the system is difficult and affects its reliability; and energy cost is a heavy burden that prevents acceptance of extreme scale computers by end users.
18
Performance obtained by applications: the systems are installed at general-purpose computing centers, serving a large population of users and supporting a wide range of applications. Linpack is not everything; the systems need to be efficient for both general-purpose and special-purpose computing, and to support both compute-intensive and data-intensive applications.
19
Programmability: must handle concurrency/locality, heterogeneity of the system, and porting of legacy programs, while lowering the skill requirement for application developers.
20
Resilience: extreme scale systems have a very short MTBF but must operate continuously for long periods; the system must self-heal/recover from hardware faults and failures, and must detect and tolerate errors in software.
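To make "very short MTBF" concrete, here is a back-of-the-envelope estimate under assumed numbers (a 5-year per-node MTBF and a node count near the 100,000-processor ceiling cited later in the talk); neither figure is from the slides, but the scaling argument is the standard one: with independent failures, system MTBF shrinks roughly as node MTBF divided by node count.

```python
# Illustrative MTBF estimate (inputs are assumptions, not figures from the talk).
node_mtbf_hours = 5 * 365 * 24      # assume each node fails about once every 5 years
node_count = 100_000                # roughly the processor-count ceiling cited later

system_mtbf_hours = node_mtbf_hours / node_count
print(f"System MTBF ~ {system_mtbf_hours * 60:.0f} minutes")  # ~26 minutes
```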
21
Constrained design principle: we must set strong constraints on the implementation of extreme scale systems. Power efficiency: 50 GF/W by 2020, 5 GF/W in 2015. System scale: <100,000 processors, <200 cabinets. Cost: <300 million dollars (or <2 billion Yuan). We can only design and implement an extreme scale system within those constraints.
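A minimal sketch of what these efficiency targets imply for the power budget, assuming the machine sizes mentioned elsewhere in the talk (100 PF around 2015, exascale later); it is simple division, but it shows both targets point at roughly the same ~20 MW envelope.

```python
# What the stated efficiency targets imply for the power budget.
def power_mw(peak_gflops, gflops_per_watt):
    """Peak performance divided by efficiency, reported in megawatts."""
    return peak_gflops / gflops_per_watt / 1e6

print(power_mw(1e9, 50))   # 1 EFlops at the 2020 target of 50 GF/W -> 20.0 MW
print(power_mw(1e8, 5))    # 100 PFlops at the 2015 target of 5 GF/W -> 20.0 MW
```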
22
How to address the challenges? Architectural support Technology innovation Hardware and software coordination
23
Architectural support: using the most appropriate architecture to achieve the goal, making trade-offs between performance, power consumption, programmability, resilience, and cost. Hybrid architecture (TH-1A & TH-2): general purpose plus high density computing (GPU or MIC). HPP architecture (Dawning 6000/Loongson): enables different processors to co-exist, supports a global address space and multiple levels of parallelism. Multi-conformation and multi-scale adaptive architecture (SW/BL): a cluster implemented with Intel processors to support commercial software, a homogeneous system implemented with domestic multicore processors for compute-intensive applications, and support for parallelism at different levels.
24
Classification of current major architectures, using "homogeneous/heterogeneous" (referring to the ISA) and "CPU only/CPU+accelerator":
Homogeneous, CPU only: Sequoia, K computer, Sunway/BL
Homogeneous, CPU+Acc: Stampede, TH-2
Heterogeneous, CPU only: Dawning 6000/HPP (AMD+Loongson)
Heterogeneous, CPU+Acc: TH-1A, Dawning 6000/Nebulae, Tsubame 2.0
25
Comparison of different architectures (power / performance / programmability-productivity / resilience):
Homogeneous, CPU only: power poor/fair; performance good/excellent; programmability/productivity good/good; resilience varies
Heterogeneous, CPU only: power poor; performance good; programmability/productivity fair/fair; resilience varies
Homogeneous, CPU+Acc: power fair; performance good/excellent; programmability/productivity good/poor?; resilience varies
Heterogeneous, CPU+Acc: power good; performance good/excellent; programmability/productivity fair/poor?; resilience varies
26
TH-1A architecture: a hybrid system architecture consisting of a compute sub-system (CPU+GPU nodes), a service sub-system (operation nodes), communication networks, a storage sub-system (MDS and OSS nodes), and a monitoring and diagnosis sub-system.
27
Dawning/Loongson HPP (Hyper Parallel Processing) architecture: hyper nodes composed of AMD and Loongson processors; separation of OS and application processors; multiple interconnects; hardware global synchronization.
28
Sunway BlueLight architecture, main features: SW1600 CPU, 16 cores, 975~1100 MHz, 124.8~140.8 GFlops; fat-tree based interconnection, QDR 4×10 Gbps high speed serial transmission between nodes, MPI message latency of 2 μs; SWCC/C++/Fortran/UPC/MPICC/scientific library; storage of 2 PB, theoretical I/O bandwidth 200 GB/s, IOR ~60 GB/s.
29
Technology innovations: innovation at different levels (device, component, system). New processor architectures: heterogeneous many-core, accelerators, reconfigurable. Addressing the memory wall: new memory devices, 3D stacking, new cache architectures. High performance interconnect: all-optical networks, silicon photonics. High density system design. Low power design.
30
SW1600 processor features:
CPU: SW1600
Release time: Aug. 2010
Processor cores: 16
Peak performance: 140.8 GFlops @ 1.1 GHz
Clock frequency: 0.975~1.1 GHz
Process generation: 65 nm
Power: 35~70 W
A general-purpose multi-core processor; power efficient, achieving 2.0 GFlops/W. The next generation processor is under development.
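A quick consistency check on the figures above; the 8 flops/cycle/core value is inferred from the quoted numbers rather than stated on the slide.

```python
# Consistency check on the SW1600 figures.
cores = 16
clock_ghz = 1.1
flops_per_cycle = 8                      # assumed double-precision flops per core per cycle

peak_gflops = cores * clock_ghz * flops_per_cycle
print(round(peak_gflops, 1))             # 140.8 GFlops, matching the quoted peak
print(round(peak_gflops / 70, 2))        # ~2.0 GFlops/W at the 70 W upper power bound
```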
31
FT-1500 CPU: SPARC V9, 16 cores, 4 SIMD, 40 nm, 1.8 GHz; performance 144 GFlops; typical power ~65 W.
32
Heterogeneous compute node (TH-2): similar ISA, different ALU. 2 Intel Ivy Bridge CPUs + 3 Intel Xeon Phi cards; 16 registered ECC DDR3 DIMMs, 64 GB; 3 PCI-E 3.0 x16 slots; PDP communication port; dual Gigabit LAN; peak performance 3.432 TFlops. (Node block diagram: the CPUs connected by QPI, a PCH over DMI, x16 PCIE links to the MIC cards with GDDR5 memory, GE, IPMB/CPLD management, and the PDP communication port.)
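A sketch reconstructing the quoted 3.432 TFlops node peak; the per-device peaks below are assumptions about the parts used (12-core 2.2 GHz Ivy Bridge CPUs and 57-core 1.1 GHz Xeon Phi cards), since the slide only gives the totals.

```python
# Reconstructing the TH-2 node peak from assumed per-device peaks.
cpu_gflops = 12 * 2.2 * 8            # 211.2 GFlops per CPU (8 DP flops/cycle/core)
phi_gflops = 57 * 1.1 * 16           # 1003.2 GFlops per Xeon Phi (16 DP flops/cycle/core)

node_peak = 2 * cpu_gflops + 3 * phi_gflops
print(round(node_peak / 1000, 3))    # 3.432 TFlops, matching the slide
```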
33
Interconnection network (TH-2) Fat-tree topology using 13 576-port top level switches Optical-electronic hybrid transport tech. Proprietary network protocol
34
Interconnection network (TH-2). High radix router ASIC (NRC): feature size 90 nm, die size 17.16 mm × 17.16 mm, FC-PBGA package with 2577 pins, throughput of a single NRC 2.56 Tbps. Network interface ASIC (NIC): same feature size and package type, die size 10.76 mm × 10.76 mm, 675 pins, PCI-E Gen2 x16.
35
High density system design (SW/BL). Computing node: the basic element, one processor plus memory. Node complex: high density assembly of 2 computing nodes plus a network interface. Supernode: 256 nodes (processors) with a tightly coupled interconnect. Cabinet: 1024 computing nodes (4 supernodes). Packaging hierarchy: multi/many-core processor, computing node, node complex, supernode, system.
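The packaging hierarchy above expressed as a quick sanity check, using only the counts given on the slide.

```python
# SW/BL packaging hierarchy, as described on the slide.
nodes_per_complex = 2            # node complex = 2 computing nodes + network interface
nodes_per_supernode = 256        # tightly coupled supernode
supernodes_per_cabinet = 4

print(supernodes_per_cabinet * nodes_per_supernode)   # 1024 nodes per cabinet
print(nodes_per_supernode // nodes_per_complex)       # 128 node complexes per supernode
```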
36
Low power design at different levels: low power processors, low power interconnect, highly efficient cooling, highly efficient power supply. Low power management: fine-grain real-time power consumption monitoring, system status sensing, multi-layer power consumption control. Low power programming: should it become a default system capability like debugging and tuning? Code power consumption modeling, sampling code power consumption in the same way as code performance, and feeding the results back to programming.
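A minimal sketch of the "sample code power consumption like code performance" idea: wrap a code region, read an energy counter before and after, and report joules alongside wall-clock time. The read_energy_joules() hook is hypothetical; on a real system it would be bound to whatever counter the platform's power monitor exposes.

```python
import time
from contextlib import contextmanager

def read_energy_joules():
    # Placeholder: return the machine's cumulative energy counter in joules.
    raise NotImplementedError("bind this to the platform's power-monitoring interface")

@contextmanager
def power_profile(region_name):
    # Sample energy and time around a code region and report average power.
    e0, t0 = read_energy_joules(), time.time()
    try:
        yield
    finally:
        e1, t1 = read_energy_joules(), time.time()
        print(f"{region_name}: {e1 - e0:.1f} J over {t1 - t0:.3f} s "
              f"({(e1 - e0) / max(t1 - t0, 1e-9):.1f} W average)")

# Usage (once read_energy_joules is implemented for the target machine):
# with power_profile("solver kernel"):
#     run_solver()
```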
37
Power supply (SW/BL): DC UPS, conversion efficiency 77%, highly reliable, with associated power monitoring.
38
Efficient cooling (TH-2): close-coupled chilled water cooling; customized liquid cooling units (LCUs) with high cooling capacity (80 kW); the city cooling system supplies cooling water to the LCUs.
39
Efficient Cooling (SW/BL) Water cooling to the board (node complex) Energy-saving Environment-friendly High room temperature Low noise
40
HW/SW coordination: using a combination of hardware and software technologies to address the technical issues, achieving performance while maintaining flexibility. Compilation support, parallel programming frameworks, performance tools. HW/SW coordinated reliability measures: user-level checkpointing, redundancy-based reliability measures.
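A minimal sketch of user-level checkpointing as listed above: the application periodically serializes its own state and, on restart, resumes from the most recent checkpoint. The file name and state layout are illustrative only, not part of any system described in the talk.

```python
import os
import pickle

CHECKPOINT_FILE = "app_state.ckpt"

def save_checkpoint(state):
    # Write to a temp file first, then atomically rename so a crash never leaves a torn file.
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)

def load_checkpoint(default):
    # Resume from the last checkpoint if one exists, otherwise start fresh.
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "rb") as f:
            return pickle.load(f)
    return default

state = load_checkpoint({"step": 0, "data": None})
for step in range(state["step"], 1000):
    state = {"step": step + 1, "data": state["data"]}   # stand-in for the real computation
    if step % 100 == 0:
        save_checkpoint(state)
```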
41
Software stack of TH-2
42
Compiler for many-core, features: support for C, Fortran and SIMD extensions; libc for the computing kernel; support for the storage hierarchy; a programming model for many-core acceleration; collaborative cache data prefetch; instruction prefetch optimization; static/dynamic instruction scheduling optimization.
43
Basic math library for many-core: built on the many-core structure, comprising a basic function library, a SIMD-extended function library, and a Fortran function library. Technical features: standard function call interface, customized optimization, support for accuracy analysis.
44
Parallel OS, technical features: unified architecture for heterogeneous many-cores, low overhead virtualization, highly efficient resource management.
45
Parallel application development platform: covering program development, testing, tuning, parallelization and code translation; a collaborative tuning framework; tools for parallelism analysis and parallelization; integrated translation tools for multiple source codes.
46
Parallel programming framework: hides the complexity of programming millions of cores; integrates highly efficient implementations of fast parallel algorithms; provides efficient data structures and solver libraries; supports software engineering concepts for code extensibility.
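To illustrate the division of labor such a framework aims for (this is not the actual API of any of the frameworks named on the next slides), here is a toy sketch: the user supplies only a local stencil kernel, while decomposition, halo exchange and iteration live in framework code.

```python
import numpy as np

class StructuredMeshFramework:
    """Toy 1-D structured-mesh driver standing in for the real infrastructure."""

    def __init__(self, n_cells, halo=1):
        self.u = np.zeros(n_cells + 2 * halo)
        self.halo = halo

    def exchange_halos(self):
        # A real framework would do an MPI halo exchange here; this toy version
        # just applies periodic boundaries on a single process.
        h = self.halo
        self.u[:h] = self.u[-2 * h:-h]
        self.u[-h:] = self.u[h:2 * h]

    def advance(self, kernel, steps):
        for _ in range(steps):
            self.exchange_halos()
            self.u = kernel(self.u)

# User code: only the numerical kernel, no communication or partitioning logic.
def smooth(u):
    out = u.copy()
    out[1:-1] = 0.25 * u[:-2] + 0.5 * u[1:-1] + 0.25 * u[2:]
    return out

mesh = StructuredMeshFramework(n_cells=64)
mesh.u[32] = 1.0
mesh.advance(smooth, steps=10)
```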
47
High Performance Computing Applications Infrastructure (materials, climate, nuclear energy, ...): connects supercomputers and applications through middleware as machines grow from peta-scale to 100 PFlops (roughly 100 times), and addresses the programming wall: think parallel, write sequential.
48
Infrastructure: four types of computing: structured mesh, unstructured mesh, combinatory geometry, and finite element.
JASMIN: J Adaptive Structured Meshes applications INfrastructure (parallel adaptive structured mesh support software framework)
JAUMIN: J Adaptive Unstructured Meshes applications INfrastructure (parallel adaptive unstructured mesh support software framework)
JCOGIN: J mesh-free COmbinatory Geometry INfrastructure (parallel 3D mesh-free combinatorial geometry computing support software framework)
PHG: Parallel Hierarchical Grid infrastructure (parallel adaptive finite element computing software platform)
49
Reliability design: high-quality components and strict screening tests; water cooling to prolong component lifetime; high density assembly to reduce wire length and improve data transfer reliability; multiple error correction codes to deal with transient errors; redundant design for memory, computing nodes, networks, I/O, power supply, and water cooling.
50
Hardware monitoring (SW/BL) Basis for reliability, availability, maintainability of the system Monitor major components Maintenance Diagnosis Dedicated management network
51
High availability (SW/BL): SW/HW coordinated multi-level fault-tolerant architecture; local fault suppression, fault isolation, faulty component replacement, and fault recovery.
52
Delivered system: TH-1A (Tianhe means "Galaxy" in Chinese). Hybrid CPU & GPU architecture; peak performance 4.7 PF; Linpack 2.57 PF; power consumption 4.04 MW.
Processors: 14,336 Xeon CPUs + 7,168 NVIDIA GPUs + 2,048 FT CPUs
Memory: 262 TB in total
Interconnect: proprietary high-speed interconnect network
Storage: global shared parallel storage system, 2 PB
Racks: 120 compute racks + 14 storage racks + 6 communication racks
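A sketch reconstructing the 4.7 PF peak from the processor counts above; the per-device peaks are assumptions about the parts used (2.93 GHz 6-core Xeon X5670 CPUs and Tesla M2050 GPUs), and the FT service CPUs are left out of the peak.

```python
# Reconstructing the TH-1A peak from assumed per-device peaks.
xeon_gflops = 6 * 2.93 * 4          # 70.32 GFlops per CPU (4 DP flops/cycle/core)
gpu_gflops = 515.2                   # assumed double-precision peak per GPU

peak_pflops = (14336 * xeon_gflops + 7168 * gpu_gflops) / 1e6
print(round(peak_pflops, 2))         # ~4.70 PFlops, matching the quoted peak
```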
53
Delivered system: Dawning 6000, a hybrid system. Service unit (Nebulae): 3 PF peak performance, 1.27 PF Linpack performance, 2.6 MW. Computing unit: an experiment in using the Loongson processor.
54
Delivered system: Sunway BlueLight. Installed in September 2011 at the National Supercomputing Center in Jinan; implemented completely with the domestic 16-core ShenWei SW1600 processor; 8,704 ShenWei processors in total; peak performance 1.07 PFlops (with 8,196 processors); Linpack performance 796 TFlops (with 8,196 processors); power consumption 1,074 kW during Linpack execution.
55
Delivered system: TH-2
56
TH-2 specifications Hybrid Architecture Xeon CPU & Xeon Phi
57
Application service environment
58
China National Grid (CNGrid) 14 sites: SCCAS (Beijing, major site), SSC (Shanghai, major site), NSC-TJ (Tianjin), NSC-SZ (Shenzhen), NSC-JN (Jinan), Tsinghua University (Beijing), IAPCM (Beijing), USTC (Hefei), XJTU (Xi'an), SIAT (Shenzhen), HKU (Hong Kong), SDU (Jinan), HUST (Wuhan), GSCC (Lanzhou). The CNGrid Operation Center is based at SCCAS.
59
CNGrid sites (CPU/GPU computing power, storage):
SCCAS: 157 TF / 300 TF, 1.4 PB
SSC: 200 TF, 600 TB
NSC-TJ: 1 PF / 3.7 PF, 2 PB
NSC-SZ: 716 TF / 1.3 PF, 9.2 PB
NSC-JN: 1.1 PF, 2 PB
THU: 104 TF / 64 TF, 1 PB
IAPCM: 40 TF, 80 TB
USTC: 10 TF, 50 TB
XJTU: 5 TF, 50 TB
SIAT: 30 TF / 200 TF, 1 PB
HKU: 23 TF / 7.7 TF, 130 TB
SDU: 10 TF, 50 TB
HUST: 3 TF, 22 TB
GSCC: 13 TF / 28 TF, 40 TB
60
CNGrid GOS architecture, layered components: Grid portal, Gsh+CLI, GSML Workshop and Grid apps; core, system and application level services; Axis handlers for message level security; Tomcat (5.0.28) + Axis (1.2 rc2); J2SE (1.4.2_07, 1.5.0_07); PC server (Grid server); OS (Linux/Unix/Windows).
61
CNGrid GOS deployment: deployed at 11 sites and in some application Grids; supports heterogeneous HPCs (Galaxy, Dawning, DeepComp); supports multiple platforms (Unix, Linux, Windows); uses public network connections with only the HTTP port enabled; flexible clients: web browser, special client, GSML client.
62
CNGrid: Resources 14 sites >3PF aggregated computing power >15PB storage
63
CNGrid: services and users. More than 450 services and more than 2,800 users, including the Commercial Aircraft Corporation of China, Baosteel, the automobile industry, institutes of CAS, universities, ...
64
CNGrid applications: supporting more than 700 projects from the 973, 863, NSFC, CAS Innovative, and Engineering programs.
65
Application villages: supporting domain applications such as industrial product design optimization, new drug discovery, and digital media. Introducing the cloud computing concept: CNGrid as IaaS and partially PaaS, application villages as SaaS and partially PaaS. Building up business models for HPC applications.
66
Applications
67
CNGrid applications
68
Grid applications: drug discovery; weather forecasting; scientific data Grid and its application in research; water resource information system; Grid-enabled railway freight information system; Grid for Chinese medicine database applications; HPC and Grid for the aerospace industry (AviGrid); national forestry project planning, monitoring and evaluation.
69
HPC applications: computational chemistry; computational astronomy; parallel programs for large fluid machinery design; fusion ignition simulation; parallel algorithms for bio and pharmacy applications; parallel algorithms for weather forecasting based on GRAPES; 10,000+ core scale simulation for aircraft design; seismic imaging for oil exploration; parallel algorithm libraries for PetaFlops systems.
70
China's status in the related fields: significant progress in developing HPC systems and the HPC service environment, but a lack of long-term strategic study and planning, and still far behind in many aspects. Lack of core technologies: processors, memory, interconnect, system software, algorithms, ... Especially weak in applications: multi-disciplinary research is needed, and there is a shortage of cross-disciplinary talent. Sustainable development is crucial: there is no regular budget for e-Infrastructure, and funding always has to be competed for against other disciplines.
71
Pursuing international cooperation: we wish to cooperate with the international HPC community. Joint work on grand challenge problems: climate change, new energy, environmental protection, disaster mitigation. Jointly addressing the challenges towards extreme scale systems: low power system design and implementation, performance obtained by applications, heterogeneous system programming, resilience of large systems.
72
Thank you!