A Teraflop Linux Cluster for Lattice Gauge Simulations in India
N.D. Hari Dass, Institute of Mathematical Sciences, Chennai



Indian Lattice Community
IMSc (Chennai): Sharatchandra, Anishetty and Hari Dass.
IISc (Bangalore): Apoorva Patel.
TIFR (Mumbai): Rajiv Gavai, Sourendu Gupta.
SINP (Kolkata): Asit De, Harindranath.
HRI (Allahabad): S. Naik.
SNBOSE (Kolkata): Manu Mathur.
The community is small but very active and well recognised. So far its research has been mostly theoretical or based on small-scale simulations, except within international collaborations.

At the International Lattice Symposium held in Bangalore in 2000, the Indian lattice community decided to change this situation:
Form the Indian Lattice Gauge Theory Initiative (ILGTI).
Develop suitable infrastructure at different institutions for collective use.
Launch new collaborations that would make the best use of such infrastructure.
At IMSc we have finished integrating a 288-CPU Xeon Linux cluster. At TIFR a Cray X1 with 16 CPUs has been acquired. At SINP plans are under way to acquire substantial computing resources.

Compute Nodes and Interconnect
After a lot of deliberation it was decided that the compute nodes would be dual Intel Xeon 2.4 GHz servers, with the motherboard and 1U rack-mountable chassis developed by Supermicro. For the interconnect the choice was the SCI technology developed by Dolphinics of Norway.

Interconnect Technologies
Design space (distance, bandwidth, latency) for different technologies and application areas: from WAN/LAN networking (ATM, Ethernet) through cluster interconnects (Myrinet, cLan, InfiniBand, Dolphin SCI technology) and I/O (SCSI, Fibre Channel, PCI, RapidIO) to memory/processor buses (HyperTransport, proprietary buses). Cluster interconnect requirements sit between the network and bus regimes.

PCI-SCI Adapter Card
SCI adapters (64-bit, 66 MHz):
– PCI/SCI adapter (D336)
– Single-slot card with 3 LCs (link controllers), i.e. 3 dimensions in 1 slot
– EZ-Dock plug-up module
– Supports 3 SCI ring connections
– Used for WulfKit 3D clusters
– WulfKit product code D236

Theoretical scalability with a 66 MHz/64-bit PCI bus (aggregate bandwidth in GB/s). Courtesy of Scali NA.

High-performance interconnect: torus topology, IEEE/ANSI standard SCI, 667 MB/s per segment per ring, shared address space, with a channel bonding option. Maintenance and LAN interconnect: 100 Mbit/s Ethernet. Courtesy of Scali NA.

Scali's MPI Fault Tolerance
A 2D or 3D torus topology offers more routing options. With the XYZ routing algorithm, if node 33 fails, the nodes on node 33's ringlets become unavailable and the cluster is fractured under the current routing setting.

SCAMPI Fault Tolerance (contd.)
Scali's advanced routing algorithm, from the "Turn Model" family of routing algorithms, allows all nodes but the failed one to be utilised as one big partition.
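To make the routing discussion concrete, the sketch below is a minimal illustration (not Scali's implementation) of dimension-ordered XYZ routing on a 6x6x4 torus: a packet first travels along the X ring, then Y, then Z, taking the shorter way around each ring. The route printed in main is an arbitrary example. Scali's turn-model algorithm relaxes this strict dimension order so that traffic can steer around a failed node.

#include <stdio.h>

#define DX 6
#define DY 6
#define DZ 4

/* Signed step along a ring of the given size, choosing the shorter direction. */
static int step(int from, int to, int size)
{
    int fwd = (to - from + size) % size;
    if (fwd == 0) return 0;
    return (fwd <= size / 2) ? +1 : -1;
}

/* Dimension-ordered routing: resolve X completely, then Y, then Z. */
static void route(int x, int y, int z, int tx, int ty, int tz)
{
    int s;
    printf("(%d,%d,%d)", x, y, z);
    while ((s = step(x, tx, DX)) != 0) { x = (x + s + DX) % DX; printf(" -> (%d,%d,%d)", x, y, z); }
    while ((s = step(y, ty, DY)) != 0) { y = (y + s + DY) % DY; printf(" -> (%d,%d,%d)", x, y, z); }
    while ((s = step(z, tz, DZ)) != 0) { z = (z + s + DZ) % DZ; printf(" -> (%d,%d,%d)", x, y, z); }
    printf("\n");
}

int main(void)
{
    route(0, 0, 0, 4, 5, 2);   /* example route across the 6x6x4 torus */
    return 0;
}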

It was decided to build the cluster in stages, with a 9-node pilot cluster as the first stage. Actual QCD codes as well as extensive benchmarks were run on it.

Integration starts on 17 Nov 2003

KABRU in Final Form

Kabru Configuration
Number of nodes: 144
Nodes: dual Intel Xeon 2.4 GHz
Motherboard: Supermicro X5DPA-GG
Chipset: Intel E7501, 533 MHz FSB
Memory: 266 MHz ECC DDRAM, 2 GB/node on 120 nodes and more on the remaining 24
Interconnect: Dolphin 3D SCI
OS: Red Hat Linux v8.0
MPI: Scali MPI

Physical Characteristics
1U rack-mountable servers. The cluster is housed in six 42U racks, each holding 24 nodes. The nodes are connected in a 6x6x4 3D torus topology. The entire system fits in a small 400 sq ft hall.
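As an illustration of how 144 nodes fill a 6x6x4 torus, the sketch below (a hypothetical node numbering, not the actual Kabru node map) converts a node index into torus coordinates and lists its six nearest neighbours with periodic wrap-around, the communication pattern a lattice-QCD halo exchange uses.

#include <stdio.h>

enum { NX = 6, NY = 6, NZ = 4 };   /* 6 x 6 x 4 = 144 nodes */

static void node_to_xyz(int n, int *x, int *y, int *z)
{
    *x = n % NX;
    *y = (n / NX) % NY;
    *z = n / (NX * NY);
}

/* Wrap coordinates periodically and convert back to a node index. */
static int xyz_to_node(int x, int y, int z)
{
    x = (x % NX + NX) % NX;
    y = (y % NY + NY) % NY;
    z = (z % NZ + NZ) % NZ;
    return z * NX * NY + y * NX + x;
}

int main(void)
{
    int n = 33, x, y, z;
    node_to_xyz(n, &x, &y, &z);
    printf("node %d sits at (%d,%d,%d)\n", n, x, y, z);
    printf("x neighbours: %d %d\n", xyz_to_node(x - 1, y, z), xyz_to_node(x + 1, y, z));
    printf("y neighbours: %d %d\n", xyz_to_node(x, y - 1, z), xyz_to_node(x, y + 1, z));
    printf("z neighbours: %d %d\n", xyz_to_node(x, y, z - 1), xyz_to_node(x, y, z + 1));
    return 0;
}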

Communication Characteristics
With the PCI slot at 33 MHz, the highest sustained bandwidth between nodes is 165 MB/s at a packet size of 16 MB; between processors on the same node it is 864 MB/s at a packet size of 98 KB. With the PCI slot at 66 MHz these figures double. The lowest latency between nodes is 3.8 μs; between processors on the same node it is 0.7 μs.
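Figures of this kind are typically obtained with a two-node ping-pong test; the sketch below is a minimal MPI version of such a test, not the benchmark actually used on Kabru. The 16 MB message size matches the bandwidth measurement quoted above; rerunning with a few-byte message gives an estimate of the latency.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, reps = 100;
    int bytes = 1 << 24;                       /* 16 MB message */
    char *buf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(bytes);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        double one_way = (t1 - t0) / (2.0 * reps);   /* seconds per one-way transfer */
        printf("bandwidth: %.1f MB/s, one-way time: %.2f us\n",
               bytes / one_way / 1e6, one_way * 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}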

HPL Benchmarks
The best performance, using GotoBLAS and Intel's dgemm, was 959 GFlops on all 144 nodes. Theoretical peak: 1382.4 GFlops; efficiency: 70%. With 80 nodes the best performance was 537 GFlops; between 80 and 144 nodes the scaling is nearly 98.5%.
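The quoted peak and efficiency follow from simple counting: each 2.4 GHz Xeon of that generation retires two double-precision flops per cycle through SSE2, giving 4.8 GFlops per CPU and 9.6 GFlops per dual-CPU node. A small check:

#include <stdio.h>

int main(void)
{
    int nodes = 144;
    double cpus_per_node = 2.0, ghz = 2.4, flops_per_cycle = 2.0;    /* SSE2 double precision */
    double peak = nodes * cpus_per_node * ghz * flops_per_cycle;     /* in GFlops */
    double measured = 959.0;                                         /* HPL result on 144 nodes */
    printf("theoretical peak = %.1f GFlops\n", peak);                /* 1382.4 */
    printf("HPL efficiency  = %.1f %%\n", 100.0 * measured / peak);  /* about 69 percent */
    return 0;
}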

MILC Benchmarks
Numerous QCD codes, with and without dynamical quarks, have been run. We independently developed SSE2 assembly code for a double-precision implementation of the MILC codes. For the ks_imp_dyn1 code we obtained 70% scaling going from 2 to 128 nodes with 1 process per node, and 74% going from 1 to 64 nodes with 2 processes per node. These runs were for 32x32x32x48 lattices in single precision.
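As a flavour of what such SSE2 code does (an illustrative intrinsics sketch, not the assembly actually written for MILC), one packs a double-precision complex number into a single 128-bit register and performs the complex multiply that dominates SU(3) matrix arithmetic:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

/* Multiply two double-precision complex numbers held as [re, im] in __m128d. */
static __m128d cmul(__m128d a, __m128d b)
{
    __m128d ar    = _mm_unpacklo_pd(a, a);    /* [ar, ar] */
    __m128d ai    = _mm_unpackhi_pd(a, a);    /* [ai, ai] */
    __m128d bswap = _mm_shuffle_pd(b, b, 1);  /* [bi, br] */
    __m128d sign  = _mm_set_pd(1.0, -1.0);    /* negate the real lane only */
    /* result = [ar*br - ai*bi, ar*bi + ai*br] */
    return _mm_add_pd(_mm_mul_pd(ar, b), _mm_mul_pd(_mm_mul_pd(ai, bswap), sign));
}

int main(void)
{
    __m128d a = _mm_set_pd(2.0, 1.0);   /* 1 + 2i  (_mm_set_pd takes [im, re]) */
    __m128d b = _mm_set_pd(4.0, 3.0);   /* 3 + 4i */
    double r[2];
    _mm_storeu_pd(r, cmul(a, b));
    printf("(1+2i)(3+4i) = %g + %gi\n", r[0], r[1]);   /* -5 + 10i */
    return 0;
}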

MILC Benchmarks (contd.)
For 64^4 lattices in single precision the scaling was close to 86%. For double-precision runs on 32^4 lattices the scaling was close to 80% as the number of nodes was increased from 4 to 64. For pure-gauge simulations in double precision on 32^4 lattices the scaling was 78.5% going from 2 to 128 nodes.
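The scaling percentages above follow the usual strong-scaling definition: for a fixed lattice, the efficiency in going from n1 to n2 nodes is the observed speedup divided by the ideal speedup n2/n1. A small worked example with made-up run times (not the measured ones):

#include <stdio.h>

/* Strong-scaling efficiency: (t1 * n1) / (t2 * n2). */
static double efficiency(int n1, double t1, int n2, double t2)
{
    return (t1 * n1) / (t2 * n2);
}

int main(void)
{
    /* Placeholder times: 1000 s on 2 nodes and 22.3 s on 128 nodes give ~70%. */
    printf("scaling efficiency = %.0f %%\n", 100.0 * efficiency(2, 1000.0, 128, 22.3));
    return 0;
}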

Physics Planned on Kabru
Very accurate simulations in pure gauge theory (with Pushan Majumdar) using the Lüscher-Weisz multihit algorithm. A novel parallel code for both Wilson-loop and Polyakov-loop correlators has been developed, and preliminary runs have been carried out on lattices up to 32^4. 64^4 simulations in double precision require about 200 GB of memory.
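For orientation, the counting behind memory estimates of this kind goes as follows (an illustrative sketch only): an SU(3) gauge configuration stores four link matrices per site, each a 3x3 complex matrix, i.e. 18 doubles. That alone is close to 10 GB at 64^4; the remainder of the roughly 200 GB quoted above presumably comes from the additional fields and intermediate averages the measurement keeps in memory, which are not counted here.

#include <stdio.h>

int main(void)
{
    long L = 64;
    long sites = L * L * L * L;             /* 64^4 lattice sites */
    double bytes = (double)sites * 4        /* link matrices per site        */
                                 * 18       /* real numbers per SU(3) matrix */
                                 * 8;       /* bytes per double              */
    printf("64^4 gauge field: %.2f GB per copy\n", bytes / 1e9);   /* ~9.66 GB */
    return 0;
}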

Physics on Kabru (contd.)
Using the same multihit algorithm, we have a long-term plan to carry out very accurate measurements of Wilson loops in various representations, as well as of their correlation functions, to gain a better understanding of confinement. We also plan to study string breaking in the presence of dynamical quarks; we propose to use scalar quarks to bypass the problems of dynamical fermions. With Sourendu Gupta (TIFR) we are carrying out preliminary simulations of the sound velocity in finite-temperature QCD.

Why KABRU?