CFD ON REMOTE CLUSTERS Carlo Pettinelli CD-adapco

Introduction to CD-adapco
Company names over time: 1987 CD and adapco; 2001 CD adapco Group; 2004 CD-adapco. Product milestones: STAR-CD v1, STAR-CD v3, STAR-CCM+ and STAR-CD v4.

Introduction to CD-adapco
- Company dedicated to the development, support and sale of CCM software solutions and consultancy
- Date of foundation: 1980
- Users: > 6,000
- Income: > 80 M Euro; 400 employees
- Largest CFD company in Japan and Germany
- World leader in the automotive market
- Technology Partner in F1: Renault
- Technology Partner in America's Cup: BMW-Oracle Racing, Luna Rossa

CD-adapco HW/SW resources
[Organization chart] Peter S. MacDonald, President; branches: Sales & Support, Marketing, Technology, Direct Services with Clients; Support, Consultancy and Sales are organized per country (Italy, Germany, ...).

CD-adapco PRODUCT OVERVIEW CFD environment STAR-CD STAR-CCM+

STAR-CD OVERVIEW
- CFD solver and GUI use FORTRAN 77; the latest version, STAR-CD 4, uses FORTRAN 90
- GUI written in Xmotif/OpenGL (requires an X server)
- GUI and solver communicate through files
- Mature code, wide spectrum of physical models
- Solver and GUI programmability

STAR-CD OVERVIEW: multiple windows, multiple files (screenshot)

STAR-CD PARALLEL OVERVIEW 1/2
PARALLEL STAR-CD IS EQUIVALENT TO SERIAL, BUT:
- the geometry MUST be decomposed before running
- the solution MUST be merged before post-processing
STAR-CD IS BASED ON MPI MESSAGE PASSING. SUPPORTED PROTOCOLS:
- MPICH
- Scali MPI
- LAM
- SCore
- RapidArray MPI
- MPICH-GM
- HP-MPI
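As an illustration of the MPI message-passing pattern behind a domain-decomposed run (a minimal sketch, not CD-adapco code; the array size and field values are arbitrary), the following mpi4py example performs the halo exchange that neighbouring sub-domains repeat every iteration:

```python
# Minimal 1-D halo-exchange sketch (illustrative only, not STAR-CD source code).
# Each rank owns a slab of cells plus one ghost cell on either side; neighbouring
# ranks swap boundary values, which is the per-iteration communication pattern
# of a domain-decomposed CFD solver built on MPI.
# Run with e.g.:  mpirun -np 4 python halo_demo.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n_local = 100                       # cells owned by this rank (arbitrary)
u = np.zeros(n_local + 2)           # +2 ghost cells
u[1:-1] = rank                      # dummy field value

left  = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# Exchange halos: my first interior cell goes to the left neighbour while my
# right ghost cell is filled from the right neighbour, then the mirror exchange.
comm.Sendrecv(sendbuf=u[1:2],   dest=left,  recvbuf=u[-1:], source=right)
comm.Sendrecv(sendbuf=u[-2:-1], dest=right, recvbuf=u[0:1], source=left)

if rank == 0:
    print("halo exchange completed on", size, "ranks")
```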

STAR-CD PARALLEL OVERVIEW 2/2
SUPPORTED QUEUEING SYSTEMS:
- LSF
- OpenPBS
- SGE
IMPLEMENTATION IN A CLUSTER ENVIRONMENT:
- STAR-PnP + user scripting
- STARNET: direct interface to queueing systems
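As a hedged example of how a run might be handed to one of the queueing systems above, this sketch writes a PBS-style job script and submits it with qsub. The #PBS directives and the qsub call are standard PBS usage; the queue name, resource request and the solver launch line are placeholders, since the real STAR-CD/STARNET invocation depends on the site installation.

```python
# Illustrative job-submission sketch for a PBS/OpenPBS-style queue (not the
# STARNET interface itself). Queue name, node request and the solver command
# are placeholders to be replaced with site-specific values.
import subprocess
import textwrap

job_script = textwrap.dedent("""\
    #!/bin/sh
    #PBS -N starcd_run
    #PBS -l nodes=4:ppn=2
    #PBS -q batch
    cd $PBS_O_WORKDIR
    # Placeholder solver launch: substitute the actual STAR-CD command line here.
    echo "would launch the decomposed case on $(wc -l < $PBS_NODEFILE) cores"
    """)

with open("run.pbs", "w") as f:
    f.write(job_script)

# Hand the script to the queueing system.
subprocess.run(["qsub", "run.pbs"], check=True)
```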

STAR-CD: PARALLEL PERFORMANCE
Two examples from the official test suite:
- Large data set: External Aero; linear solver: SCALAR-CGS; solution method: steady state; mesh: 6 million cells, hybrid
- Small data set: Engine Block; linear solver: SCALAR-CGS; solution method: steady state; mesh: 160K cells, hexahedral

Large data set: CRAY XD1- AMD Opteron RapidArray interconnect

HP RX1620 – Intel Itanium2 Infiniband

Effect of network HW on Opteron Cluster

HPC Case Study: Dell 2004 study
Major factors influencing HPC performance:
- cache size
- interconnect latency and bandwidth
- CPU speed
- memory latency and bandwidth
- filesystem structure (NFS / local)
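To make the interconnect factors concrete, here is a toy timing model (all numbers are invented for illustration and are not taken from the Dell study) showing how per-message latency and halo bandwidth limit speed-up once the per-process compute share becomes small:

```python
# Toy scaling model: elapsed time = compute share + communication cost.
# Latency, bandwidth, message counts and halo sizes are invented values,
# chosen only to show the qualitative trend discussed in the slides.
def elapsed(p, t_serial=3600.0, iters=1000, msgs_per_iter=8,
            halo_bytes=2e5, latency=5e-6, bandwidth=1e9):
    compute = t_serial / p                                   # perfect split of work
    comm = iters * msgs_per_iter * (latency + halo_bytes / bandwidth)
    return compute + comm

for p in (1, 2, 4, 8, 16, 32):
    print(f"{p:3d} processes: speed-up {elapsed(1) / elapsed(p):5.1f}")
```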

Effect of network HW
- Performance improvement of Myrinet compared to Gigabit Ethernet
- Multiple processes conflict over node memory and the network card
- The effect is worse for larger models

Price-performance considerations
- 1 PPN (one process per node) solutions are more expensive than 2 PPN (A-class dataset, Engine-block dataset)
- Highest price-performance obtained with: dual-CPU nodes; low-latency, high-bandwidth interconnect
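A rough sketch of how such a price-performance figure can be computed. The node prices, elapsed times and the assumed contention penalty for 2 PPN are invented for illustration; they are not figures from the presentation.

```python
# Toy price-performance comparison: jobs per hour per unit of hardware cost.
# All prices and timings below are invented placeholders.
def price_performance(n_nodes, node_price, elapsed_s):
    return 3600.0 / elapsed_s / (n_nodes * node_price)   # higher is better

# 8 parallel domains on 8 single-CPU nodes (1 PPN) vs 4 dual-CPU nodes (2 PPN),
# assuming memory/NIC contention costs the 2 PPN run ~10 % in elapsed time.
print("1 PPN:", price_performance(n_nodes=8, node_price=4000, elapsed_s=1000))
print("2 PPN:", price_performance(n_nodes=4, node_price=4500, elapsed_s=1100))
```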

STAR-CCM+ OVERVIEW
- Written in C++ and Java
- GUI and solver communication is client-server
- Extreme ease of use
- Powerful integrated mesher
- "Young" code, list of physical models growing rapidly

STAR-CCM+ OVERVIEW: single window, single file (screenshot)

STAR-CCM+ PARALLEL OVERVIEW 1/2
SUPPORTED NETWORK PROTOCOLS:
- Ethernet
- Myrinet GM / MX
- InfiniBand (Voltaire, Mellanox, Silverstorm)
- SGI shared memory
- Quadrics Elan interconnect
- Cray RapidArray
SUPPORTED QUEUEING SYSTEMS:
- OpenPBS
- LSF
- LoadLeveler
- SGE

STAR-CCM+ PARALLEL OVERVIEW 2/2
- The GUI can connect to and interact with the parallel solver
- Post-processing can also be done during the simulation
[Diagram] The workstation hosts the GUI (client); the cluster runs the CONTROLLER (server) plus WORKER-NODE-1 ... WORKER-NODE-N
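The client-server split between the GUI and the controller can be illustrated with a minimal TCP sketch. This only stands in for the pattern; the actual STAR-CCM+ protocol, port numbers and message format are not public and are not reproduced here.

```python
# Minimal client-server sketch of the GUI (client) / controller (server) pattern.
# The port number and the "GET residuals" request are invented placeholders.
import socket
import threading

HOST, PORT = "127.0.0.1", 47827          # arbitrary example port

srv = socket.socket()
srv.bind((HOST, PORT))
srv.listen(1)

def controller():
    """Stands in for the parallel solver's controller (server) process."""
    conn, _ = srv.accept()
    with conn:
        request = conn.recv(64)          # e.g. b"GET residuals"
        if request.startswith(b"GET"):
            conn.sendall(b"iteration=120 continuity=1.3e-4")
    srv.close()

threading.Thread(target=controller, daemon=True).start()

# The "GUI" side: connect to the running solver, query its state, disconnect.
with socket.socket() as cli:
    cli.connect((HOST, PORT))
    cli.sendall(b"GET residuals")
    print(cli.recv(64).decode())
```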

STAR-CCM+ PARALLEL PERFORMANCE Use of dual-core CPUs Effect of interconnect speed

EFFECT OF SWITCHING FROM SINGLE- TO DUAL-CORE CPUS AND USING ALL CORES
Using dual-core CPUs and all available cores, the speed-up curve (based on the serial run) is shifted to the left with respect to single-core runs. Dual-core runs remain convenient in any case, at least in the range from 1 to 24.

COMPARISON AT EQUAL # OF DOMAINS
- 4 parallel domains: single core = 2 nodes / 4 CPUs; dual core = 1 node / 2 CPUs (-0.6 % elapsed time)
- 8 parallel domains: single core = 4 nodes / 8 CPUs; dual core = 2 nodes / 4 CPUs (+3 %)
- 16 parallel domains: single core = 8 nodes / 16 CPUs; dual core = 4 nodes / 8 CPUs (+3 %)
Using the same number of parallel CFD domains, the overhead of using dual core instead of single core is around 3 % of the elapsed time, while using half the number of nodes and half the number of CPUs.

COMPARISON AT EQUAL # OF NODES AND CPUS
- 1 node: single core = 2 CFD domains; dual core = 4 domains (46 % time reduction)
- 2 nodes: single core = 4 domains; dual core = 8 domains (43 %)
- 4 nodes: single core = 8 domains; dual core = 16 domains (41 %)
- 8 nodes: single core = 16 domains; dual core = 32 domains (18 %)
- 12 nodes: single core = 24 domains; dual core = 48 domains (14 %)
Using the same number of nodes (and CPUs), the advantage of using dual core is very similar to that of doubling the number of CPUs used.

EFFECT OF INTERCONNECT

WHY REMOTE?
- Lack of local HW/SW resources
- Uneven workload
- Outsourcing of SW/HW maintenance
- Huge cases
- High number of cases
- Pay per use

CLUSTER ACCESSIBILITY
[Chart] Access options (SHELL, PORTALS, CLIENT-SERVER, REMOTE DESKTOP) compared on two axes: functionality (from solver only to GUI & solver) and speed of use.

LOCAL CLIENT-SERVER
[Diagram: client and server inside the same firewall, Internet outside] Client-server operation through rsh/ssh within the same "firewall environment" (inside the same company).

REMOTE CLIENT-SERVER
[Diagram: client and server in separate firewall environments across the Internet] Client-server operation cannot be used across separate environments: it is impossible to predict which "return" ports to restrict, and the information is not encrypted.

REMOTE DESKTOP
[Diagram: client and cluster connected across the Internet through an SSH tunnel]
- Use a desktop remotization tool (e.g. RealVNC)
- SSH tunnelling can be managed by both firewalls
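A sketch of the tunnelled setup described above, driven from Python for convenience. The host names, the user account and VNC display :1 (port 5901) are placeholders; only the standard ssh -N -L port-forwarding options are assumed.

```python
# Forward the VNC port of a cluster desktop session through one encrypted SSH
# tunnel, so both firewalls only need to allow SSH. Host names are placeholders.
import subprocess

tunnel = subprocess.Popen([
    "ssh", "-N",                        # no remote command, forwarding only
    "-L", "5901:cluster-head:5901",     # local 5901 -> head node VNC display :1
    "user@cluster-gateway",
])

# Point a VNC viewer (e.g. RealVNC) at localhost:5901, then close the tunnel:
# tunnel.terminate()
```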

REMOTE CLIENT-SERVER: OPEN ISSUES
- CONNECTION DROP "ROBUSTNESS": the server should not be killed when the connection goes down; at present it is.
- USE ONLY ONE ENCRYPTED PORT: this would allow SSH tunnelling and extra-company, firewall-controlled client-server operation.
- CLIENT CONNECTION FOR BATCH JOBS: currently the client can connect only to runs that were started by a GUI client.

FUTURE DEVELOPMENTS
- MPI ERROR HANDLING: efficient and robust error handling for a large number of CPUs; the user should not pay a high price for hardware failures.
- DYNAMIC LOAD BALANCING: capability to redistribute the workload depending on the status of the run and of the nodes (a prerequisite for adaptive meshing); a toy sketch of the idea follows below.
- PARALLEL MESHING: use the power of clusters to speed up meshing times.
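To clarify the dynamic load-balancing idea, here is a toy sketch (not CD-adapco's planned algorithm): cells are redistributed among ranks in proportion to the throughput each rank achieved over the last measured interval.

```python
# Toy dynamic load-balancing step: give each rank a cell count proportional to
# its measured throughput, so slower or busier nodes receive fewer cells.
def rebalance(cells_per_rank, seconds_per_rank):
    total_cells = sum(cells_per_rank)
    throughput = [c / t for c, t in zip(cells_per_rank, seconds_per_rank)]
    scale = total_cells / sum(throughput)
    return [round(tp * scale) for tp in throughput]

# Example: rank 1 sits on a slower or busier node and currently lags behind.
print(rebalance(cells_per_rank=[500_000, 500_000, 500_000],
                seconds_per_rank=[10.0, 14.0, 10.5]))
```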

CONCLUSIONS - QUESTIONS
- CD-adapco is actively supporting and developing state-of-the-art HPC solutions
- Available to support remote-computing services based on our software products
- Open to cooperation to implement new remote-computing-friendly features
ANY QUESTION IS WELCOME!

IBM p5 - Power 5 DDR 1 memory

SGI Altix- Itanium2

SUN X2100- AMD Opteron Gigabit Ethernet

Effect of NFS / local filesystem
- Performance improvement of local over NFS
- NFS and local filesystems give similar performance up to 16 CPUs
- NFS can decrease performance by up to 50 % at 32 CPUs

Effect of single vs dual processor nodes
- Performance degradation when using two CPUs per node
- Multiple processes conflict over node memory and the network card
- The effect is worse for larger models

Conclusions
- The larger the problem being solved, the less the need for a low-latency, high-bandwidth interconnect, as long as each processor spends more time computing than communicating.
- A low-latency interconnect becomes very important if a small data set must be solved in parallel using more than four processors, because each processor does very little computation.
- Performance characteristics of an application are highly dependent on the size and nature of the problem being solved.
- With up to 16 processors, SMP affects performance by less than 20 percent if the data set is large and uses memory extensively. With more than 16 processors, when communication becomes a large contributor to overall processing time, the SMP use of a shared NIC can degrade performance. Use of a low-latency, high-bandwidth interconnect can help reduce contention for the shared resource.
- Beyond 16 processors, the choice of file system for output files can make a significant difference in performance, up to 60 percent.

Test Description
CFD test case: external aerodynamics (Fiat Stilo with flat underbody); 1.8 million polyhedral cells + prism layers; 100 iterations (not including the first iteration)
Software: STAR-CCM+ version 2.01.702 x86_64 intel8.1 (public beta)
Hardware: 12-node AMD Opteron 275 (dual-core) cluster; 2 CPUs per node, 4 cores per node; Gigabit + Myrinet interconnect (only Gigabit used for the test); Linux SUSE Enterprise 9 (kernel 2.6.5-7.191-smp)

Test environment configuration
Application: STAR-CD v3150A.012
Compiler: Intel Fortran and C++ Compilers 7.1
Middleware: MPICH 1.2.4 and MPICH-GM 1.2.5..10
Operating system: Red Hat Enterprise Linux AS 2.1, kernel 2.4.18-e.37smp
Protocol: TCP/IP, GM-2
Interconnect: Gigabit Ethernet, Myrinet
Platform: Dell PowerEdge 3250 servers in a 32-node cluster; each node has 2 Itanium2 1.3 GHz CPUs with 3 MB L2 cache