
1 CFD ON REMOTE CLUSTERS Carlo Pettinelli CD-adapco

2 Introduction to CD-adapco
Company timeline: 1987 - CD and adapco (STAR-CD v1); 2001 - CD adapco Group (STAR-CD v3); 2004 - CD-adapco (STAR-CCM+ and STAR-CD v4)

3 Introduction to CD-adapco
Company dedicated to the development, support and sale of CCM software solutions and consultancy
Date of foundation: 1980
Users: > 6,000
Revenue: > 80 M Euro; 400 employees
Largest CFD company in Japan and Germany
World leader in the automotive market
Technology Partner in F1: Renault
Technology Partner in America's Cup: BMW-Oracle Racing, Luna Rossa

4 CD-adapco HW/SW resources
Organisation chart: Peter S. MacDonald, President; divisions for SALES & SUPPORT, MARKETING and TECHNOLOGY; direct services with clients (Support, Consultancy, Sales), each organised by country (Italy, Germany, ...)

5 CD-adapco PRODUCT OVERVIEW
CFD environment: STAR-CD, STAR-CCM+

6 STAR-CD OVERVIEW
CFD solver and GUI written in FORTRAN 77; the latest version, STAR-CD 4, uses FORTRAN 90
GUI written in X/Motif and OpenGL (requires an X server)
GUI and solver communicate by files
Mature code, wide spectrum of physical models
Solver and GUI programmability

7 STAR-CD OVERVIEW
Multiple windows, multiple files

8 STAR-CD PARALLEL OVERVIEW 1/2
PARALLEL STAR-CD IS EQUIVALENT TO SERIAL, BUT:
- geometry MUST be decomposed before running
- solution MUST be merged before post-processing
STAR-CD IS BASED ON MPI MESSAGE PASSING (a generic sketch of the decompose / solve / merge pattern follows below)
SUPPORTED PROTOCOLS:
- MPICH
- Scali MPI
- LAM
- SCore
- RapidArray MPI
- MPICH-GM
- HP-MPI
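The decompose / run-in-parallel / merge workflow can be pictured with a generic MPI sketch. This is not CD-adapco code: it is a minimal illustration using mpi4py and NumPy, and the "local solve" is a placeholder operation.

```python
# Generic decompose -> solve -> merge sketch (illustration only, not STAR-CD).
# Run with, for example:  mpiexec -n 4 python decompose_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

N = 1_000_000                                 # total cell count (illustrative)
if rank == 0:
    field = np.random.rand(N)                 # stand-in for the global CFD field
    chunks = np.array_split(field, nprocs)    # "decomposition" before running
else:
    chunks = None

local = comm.scatter(chunks, root=0)          # each rank receives its sub-domain
local = 0.5 * local                           # placeholder for the local solve

pieces = comm.gather(local, root=0)           # "merge" before post-processing
if rank == 0:
    merged = np.concatenate(pieces)
    print("merged field size:", merged.size)
```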

9 STAR-CD PARALLEL OVERVIEW 2/2
SUPPORTED QUEUEING SYSTEMS: LSF, OpenPBS, SGE
IMPLEMENTATION IN A CLUSTER ENVIRONMENT:
- STAR-PnP + user scripting (see the sketch below)
- STARNET: direct interface to the queueing systems
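As an example of the "user scripting" route, the sketch below writes an OpenPBS job script and submits it with qsub. This is an assumption-laden illustration: the case directory, script name and solver launch line are placeholders, not the actual STAR-PnP or STARNET interface.

```python
# Minimal sketch: generate and submit an OpenPBS job for a parallel run.
# The solver command is a placeholder to be replaced by the site-specific launcher.
import subprocess
from pathlib import Path

def submit_pbs_job(case_dir: str, nodes: int = 4, ppn: int = 2,
                   walltime: str = "12:00:00") -> str:
    """Write a PBS script in case_dir, submit it and return the job id."""
    script = f"""#!/bin/sh
#PBS -N starcd_run
#PBS -l nodes={nodes}:ppn={ppn}
#PBS -l walltime={walltime}
cd {case_dir}
# Placeholder solver invocation -- replace with the local STAR-CD launch command
./run_solver -np {nodes * ppn}
"""
    script_path = Path(case_dir) / "job.pbs"
    script_path.write_text(script)
    out = subprocess.run(["qsub", str(script_path)],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()          # qsub prints the job id on stdout

if __name__ == "__main__":
    print("submitted:", submit_pbs_job("/scratch/case01"))   # hypothetical path
```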

10 STAR-CD : PARALLEL PERFORMANCE
Two examples from the official test suite:
Large data set: External Aero - linear solver SCALAR-CGS, steady-state solution, 6-million-cell hybrid mesh
Small data set: Engine Block - linear solver SCALAR-CGS, steady-state solution, 160K-cell hexahedral mesh

11 Large data set: CRAY XD1 - AMD Opteron
RapidArray interconnect

12 HP RX1620 - Intel Itanium2, InfiniBand

13 Effect of network HW on Opteron Cluster

14 HPC Case Study : Dell 2004 study
Major factors influencing HPC performance:
- CACHE SIZE
- INTERCONNECTION LATENCY AND BANDWIDTH
- CPU SPEED
- MEMORY LATENCY AND BANDWIDTH
- FILESYSTEM STRUCTURE (NFS / local)

15 Effect of network HW
Performance improvement of Myrinet compared to Gigabit Ethernet
Multiple processes conflict over node memory and the network card
The effect is worse for larger models

16 Price-performance considerations
1 PPN (one process per node) solutions are more expensive than 2 PPN (A-class dataset, Engine-block dataset)
Highest price-performance obtained with:
- DUAL-CPU NODES
- LOW-LATENCY, HIGH-BANDWIDTH INTERCONNECT

17 STAR-CCM+ OVERVIEW
Written in C++ and Java
GUI and solver communication is client-server
Extremely easy to use
Powerful integrated mesher
"Young" code; the list of physical models is growing rapidly

18 STAR-CCM+ OVERVIEW
Single window, single file

19 STAR-CCM+ PARALLEL OVERVIEW 1/2
SUPPORTED NETWORK PROTOCOLS:
- ETHERNET
- MYRINET GM / MX
- INFINIBAND (Voltaire, Mellanox, SilverStorm)
- SGI SHARED MEMORY
- QUADRICS ELAN INTERCONNECT
- CRAY RAPIDARRAY
SUPPORTED QUEUEING SYSTEMS: OpenPBS, LSF, LoadLeveler, SGE

20 STAR-CCM+ PARALLEL OVERVIEW 2/2
The GUI can connect to and interact with the parallel solver (a minimal illustration follows below)
Post-processing can also be done during the simulation
Architecture: the GUI (client) runs on a workstation and connects to the CONTROLLER (server) on the cluster, which drives WORKER-NODE-1, WORKER-NODE-2, ..., WORKER-NODE-N
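The client-server interaction sketched above can be mimicked with a toy example: a "controller" keeps iterating while serving live status to a client that attaches mid-run. This is purely illustrative and does not reflect STAR-CCM+'s actual protocol; the host, port and residual values are invented.

```python
# Toy controller/client sketch: attach to a running "solver" and read live status.
import socket, threading, time

HOST, PORT = "127.0.0.1", 47827          # arbitrary example values

def controller():
    """Pretend solver loop that also streams progress to a connected client."""
    latest = {"iteration": 0, "residual": 1.0}

    def serve():
        with socket.create_server((HOST, PORT)) as srv:
            conn, _ = srv.accept()                       # the GUI/client attaches here
            with conn:
                for _ in range(5):
                    msg = f"iter={latest['iteration']} res={latest['residual']:.3e}\n"
                    conn.sendall(msg.encode())
                    time.sleep(0.2)

    threading.Thread(target=serve, daemon=True).start()
    for it in range(1, 51):                              # stand-in for the parallel solve
        latest.update(iteration=it, residual=1.0 / it)
        time.sleep(0.05)

def client():
    """Stand-in for the GUI: connect while the run is in progress and read status."""
    time.sleep(0.5)
    with socket.create_connection((HOST, PORT)) as s:
        while data := s.recv(4096):
            print(data.decode(), end="")

threading.Thread(target=client).start()
controller()
```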

21 STAR-CCM+ PARALLEL PERFORMANCE
Use of dual-core CPUs
Effect of interconnect speed

22 EFFECT OF SWITCHING FROM SINGLE TO DUAL CORE CPUS AND USING ALL CORES
Using dual-core CPUs and all available cores, the speed-up curve (based on the serial run) is shifted to the left with respect to single-core runs. In any case it is convenient to run on dual cores, at least in the range from 1 to 24.
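For reference, speed-up is computed as S(p) = T_serial / T_p and parallel efficiency as E(p) = S(p) / p. The short sketch below shows the calculation; the elapsed times are invented placeholders, not the benchmark data discussed here.

```python
# Speed-up and efficiency from elapsed times (placeholder numbers).
def speedup_table(t_serial: float, timings: dict) -> None:
    print(f"{'cores':>5} {'elapsed [s]':>12} {'speed-up':>9} {'efficiency':>11}")
    for p, t_p in sorted(timings.items()):
        s = t_serial / t_p
        print(f"{p:>5} {t_p:>12.1f} {s:>9.2f} {s / p:>10.1%}")

speedup_table(t_serial=4800.0,
              timings={2: 2500.0, 4: 1300.0, 8: 700.0, 16: 390.0, 24: 290.0})
```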

23 COMPARISON AT EQUAL # of DOMAINS
Comparison at the same number of STAR-CCM+ parallel domains (4, 8 and 16), single-core vs dual-core nodes: for example, 4 domains require 2 single-core nodes (4 CPUs) but only 1 dual-core node. The elapsed-time differences of dual core vs single core are -0.6%, +3% and +3% respectively.
Using the same number of parallel CFD domains, the overhead of using dual core instead of single core is around 3% of the elapsed time, while using half the number of nodes and half the number of CPUs.

24 COMPARISON AT EQUAL # of NODES and CPUS
Comparison at the same number of physical cluster nodes (1, 2, 4, 8 and 12): single-core runs use 2 CFD domains per node, dual-core runs use 4 (e.g. 12 nodes give 24 vs 48 domains). The elapsed-time reductions for dual core are 46%, 43%, 41%, 18% and 14% respectively.
Using the same number of nodes, the advantage of using dual core is very similar to doubling the number of CPUs used.

25 EFFECT OF INTERCONNECT

26 WHY REMOTE?
- LACK OF LOCAL HW/SW RESOURCES
- UNEVEN WORKLOAD
- OUTSOURCE SW/HW MAINTENANCE
- HUGE CASES
- HIGH NUMBER OF CASES
- PAY PER USE

27 CLUSTER ACCESSIBILITY
Chart comparing cluster access options in terms of FUNCTIONALITY and SPEED OF USE: shell (solver only), remote desktop (GUI & solver), client-server, portals

28 LOCAL CLIENT-SERVER
Client-server operation through rsh/ssh within the same "firewall environment" (inside the same company)

29 REMOTE CLIENT-SERVER
Client-server operation cannot be used across separate environments: the firewalls cannot predict which "return" ports to restrict, and the information is not encrypted

30 REMOTE DESKTOP
Use a remote desktop tool (e.g. RealVNC)
SSH tunnelling can be managed by both firewalls (see the sketch below)
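A minimal sketch of the tunnelled setup, assuming a standard OpenSSH client on the local machine: forward a local port to the VNC display on the remote cluster so the viewer connects through the encrypted tunnel. Host name, user and port numbers are example values.

```python
# Open an SSH tunnel for VNC traffic, then point the VNC viewer at localhost.
import subprocess

def open_vnc_tunnel(user: str, gateway: str,
                    local_port: int = 5901, remote_port: int = 5901):
    """Start `ssh -N -L` so VNC traffic travels inside the encrypted tunnel."""
    cmd = [
        "ssh", "-N",                                    # no remote command, forwarding only
        "-L", f"{local_port}:localhost:{remote_port}",  # local port -> remote VNC display
        f"{user}@{gateway}",
    ]
    return subprocess.Popen(cmd)                        # keep the tunnel running in background

if __name__ == "__main__":
    tunnel = open_vnc_tunnel("user", "cluster.example.com")   # example host
    print("Connect the VNC viewer to localhost:5901")
    # call tunnel.terminate() when the session is finished
```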

31 REMOTE CLIENT SERVER: OPEN ISSUES
CONNECTION DROP "ROBUSTNESS": the server should not be killed when the connection goes down, as currently happens.
USE ONLY ONE ENCRYPTED PORT: this would allow SSH tunnelling and firewall-controlled client-server operation across company boundaries.
CLIENT CONNECTION FOR BATCH JOBS: currently the client can connect only to runs that were started by a GUI client.

32 FUTURE DEVELOPMENTS
MPI ERROR HANDLING: efficient and robust error handling for large numbers of CPUs; the user should not pay a high price for hardware failures
DYNAMIC LOAD BALANCING: capability to redistribute the workload depending on the status of the run and of the nodes (prerequisite for adaptive meshing)
PARALLEL MESHING: use the power of clusters to speed up meshing times

33 CONCLUSIONS - QUESTIONS
CD-adapco is actively supporting and developing state-of-the-art HPC solutions
We are available to support remote-computing services based on our software products
We are open to cooperation to implement new remote-computing-friendly features
ANY QUESTION IS WELCOME!

34 IBM p - POWER5, DDR1 memory

35 SGI Altix - Itanium2

36 SUN X - AMD Opteron, Gigabit Ethernet

37 Effect of NFS / local system
Performance improvement of local filesystem over NFS
NFS and local filesystems give similar performance up to 16 CPUs
NFS can decrease performance by up to 50% at 32 CPUs

38 Effect of single vs dual processor node
Performance degradation from using two CPUs per node
Multiple processes conflict over node memory and the network card
The effect is worse for larger models

39 Conclusions
- The larger the problem being solved, the less the need for a low-latency, high-bandwidth interconnect, as long as each processor spends more time computing than communicating.
- A low-latency interconnect becomes very important if a small data set must be solved in parallel using more than four processors, because each processor does very little computation.
- Performance characteristics of an application are highly dependent on the size and nature of the problem being solved.
- With up to 16 processors, SMP affects performance by less than 20 percent if the data set is large and uses memory extensively. With more than 16 processors, when communication becomes a large contributor to overall processing time, the SMP use of a shared NIC can degrade performance. A low-latency, high-bandwidth interconnect can help reduce contention for the shared resource.
- Beyond 16 processors, the choice of file system for output files can make a significant difference in performance, up to 60 percent.

40 Test Description
CFD test case: external aerodynamics (Fiat Stilo with flat underbody), 1.8 million polyhedral cells + prism layers, 100 iterations (not including the first iteration)
Software: STAR-CCM+ version x86_64 intel8.1 (public beta)
Hardware: 12-node AMD Opteron 275 (dual-core) cluster, 2 CPUs per node, 4 cores per node; Gigabit + Myrinet interconnect (only Gigabit used for this test); Linux SUSE Enterprise 9 (SMP kernel)

41 Test environment configuration
Application: STAR-CD v3150A.012
Compiler: Intel Fortran and C++ Compilers 7.1
Middleware: MPICH and MPICH-GM
Operating system: Red Hat Enterprise Linux AS 2.1, kernel e.37smp
Protocol: TCP/IP, GM-2
Interconnect: Gigabit Ethernet, Myrinet
Platform: Dell PowerEdge 3250 servers in a 32-node cluster; each node: 2 Itanium2 1.3 GHz CPUs, 3 MB L2 cache

