1
A Teraflop Linux Cluster for Lattice Gauge Simulations in India
N.D. Hari Dass, Institute of Mathematical Sciences, Chennai
2
Indian Lattice Community. IMSc (Chennai): Sharatchandra, Anishetty and Hari Dass. IISc (Bangalore): Apoorva Patel. TIFR (Mumbai): Rajiv Gavai, Sourendu Gupta. SINP (Kolkata): Asit De, Harindranath. HRI (Allahabad): S. Naik. SNBose (Kolkata): Manu Mathur. The community is small but very active and well recognised. So far its research has been mostly theoretical or confined to small-scale simulations, except through international collaborations.
3
At the International Lattice Symposium held in Bangalore in 2000, the Indian Lattice Community decided to change this situation: form the Indian Lattice Gauge Theory Initiative (ILGTI); develop suitable infrastructure at different institutions for collective use; launch new collaborations that would make the best use of such infrastructure. At IMSc we have finished integrating a 288-CPU Xeon Linux cluster. At TIFR a Cray X1 with 16 CPUs has been acquired. At SINP plans are under way to acquire substantial computing resources.
4
Compute Nodes and Interconnect. After a lot of deliberation it was decided that the compute nodes would be dual Intel Xeon @ 2.4 GHz, with the motherboard and 1U rackmountable chassis developed by Supermicro. For the interconnect the choice was the SCI technology developed by Dolphinics of Norway.
5
Interconnect Technologies. [Figure: design space for different technologies, plotting bandwidth and latency against distance, from WAN/LAN technologies (ATM, Ethernet, Myrinet, cLan) through cluster interconnects (InfiniBand, FibreChannel, Dolphin SCI) to I/O and memory/processor busses (SCSI, PCI, RapidIO, HyperTransport, proprietary busses); cluster interconnect requirements sit between the LAN and bus regimes.]
6
PCI-SCI Adapter Card: one slot, three dimensions. SCI adapters (64-bit, 66 MHz): PCI/SCI adapter (D336); single-slot card with 3 link controllers (LCs); EZ-Dock plug-up module; supports 3 SCI ring connections; used for WulfKit 3D clusters; WulfKit product code D236. [Block diagram: PCI bus, PSB bridge and three SCI link controllers.]
7
Theoretical Scalability with 66 MHz / 64-bit PCI Bus. [Figure: aggregate bandwidth in GBytes/s.] Courtesy of Scali NA.
8
System Interconnects. High-performance interconnect: torus topology, IEEE/ANSI std. 1596 SCI, 667 MBytes/s per segment per ring, shared address space, with a channel-bonding option. Maintenance and LAN interconnect: 100 Mbit/s Ethernet. Courtesy of Scali NA.
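At the application level a torus like this is typically exposed through MPI's Cartesian-communicator interface. The following is only a minimal sketch using standard MPI calls (dimensions chosen to match Kabru's 6x6x4 node torus), not the cluster's actual production setup:

    /* sketch: map MPI ranks onto a 6x6x4 periodic (torus) grid and find
       the nearest neighbours along each dimension; assumes 144 ranks. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, coords[3], xdown, xup;
        int dims[3]    = {6, 6, 4};   /* 6x6x4 node torus      */
        int periods[3] = {1, 1, 1};   /* wrap around each ring */
        MPI_Comm torus;

        MPI_Init(&argc, &argv);
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &torus);
        MPI_Comm_rank(torus, &rank);
        MPI_Cart_coords(torus, rank, 3, coords);
        MPI_Cart_shift(torus, 0, 1, &xdown, &xup);   /* neighbours along x */

        printf("rank %d at (%d,%d,%d): x-neighbours %d and %d\n",
               rank, coords[0], coords[1], coords[2], xdown, xup);

        MPI_Comm_free(&torus);
        MPI_Finalize();
        return 0;
    }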
9
Scali's MPI Fault Tolerance. A 2D or 3D torus topology gives more routing options. With the XYZ routing algorithm: if node 33 fails, the nodes on 33's ringlets become unavailable, and the cluster is fractured with the current routing setting. [Figure: 4x4 grid of nodes 11-44 showing the fractured partitions.] Courtesy of Scali NA.
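Dimension-ordered (XYZ) routing resolves a route one coordinate at a time, which is why one failed node can cut off whole ringlets. Purely as an illustration of the idea (not Scali's implementation), a hedged sketch of the next-hop choice:

    /* Sketch of dimension-ordered (XYZ) routing on a torus.
       cur and dst are 3D node coordinates, dims the torus size.
       X is corrected first, then Y, then Z, taking the shorter way
       around each ring.  Fills hop with the next node and returns 1,
       or returns 0 if cur already equals dst. */
    static int xyz_next_hop(const int cur[3], const int dst[3],
                            const int dims[3], int hop[3])
    {
        for (int d = 0; d < 3; d++)
            hop[d] = cur[d];

        for (int d = 0; d < 3; d++) {
            if (cur[d] == dst[d])
                continue;                        /* dimension already done */
            int fwd  = (dst[d] - cur[d] + dims[d]) % dims[d];
            int step = (fwd <= dims[d] - fwd) ? +1 : -1;
            hop[d]   = (cur[d] + step + dims[d]) % dims[d];
            return 1;                            /* hop along lowest dim   */
        }
        return 0;                                /* already at destination */
    }

Because the dimension order is fixed, traffic cannot detour around a failed node, which is what fractures the cluster; the Turn Model routing on the next slide relaxes the allowed turns so that routes can go around it.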
10
ScaMPI Fault Tolerance (contd.). Scali's advanced routing algorithm, from the "Turn Model" family of routing algorithms: all nodes but the failed one can be utilised as one big partition. [Figure: the same 4x4 node grid rerouted around the failed node 33.] Courtesy of Scali NA.
11
It was decided to build the cluster in stages, with a 9-node pilot cluster as the first stage. Actual QCD codes as well as extensive benchmarks were run on it.
12
Integration starts on 17 Nov 2003
13
KABRU in Final Form
15
Kabru Configuration. Number of nodes: 144. Nodes: dual Intel Xeon @ 2.4 GHz. Motherboard: Supermicro X5DPA-GG. Chipset: E7501, 533 MHz FSB. Memory: 266 MHz ECC DDRAM, 2 GB/node on 120 nodes and 4 GB/node on 24 nodes. Interconnect: Dolphin 3D SCI. OS: Red Hat Linux v8.0, with Scali MPI.
16
Physical Characteristics. 1U rackmountable servers. Cluster housed in six 42U racks, each holding 24 nodes. Nodes connected in a 6x6x4 3D torus topology. The entire system sits in a small 400 sq ft hall.
17
Communication Characteristics. With the PCI slot at 33 MHz the highest sustained bandwidth between nodes is 165 MB/s, reached at a packet size of 16 MB; between processors on the same node it is 864 MB/s at a packet size of 98 KB. With the PCI slot at 66 MHz these figures double. The lowest latency between nodes is 3.8 microseconds; between processors on the same node it is 0.7 microseconds.
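Figures like these are typically obtained with a simple MPI ping-pong test. A minimal sketch using only standard MPI calls (not the benchmark actually run on Kabru):

    /* ping-pong between ranks 0 and 1: rank 0 sends a buffer, rank 1 echoes
       it back; half the round-trip time gives the latency (small messages)
       and the bandwidth (large messages). */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NITER 1000

    int main(int argc, char **argv)
    {
        int rank;
        long nbytes = (argc > 1) ? atol(argv[1]) : 8;   /* message size */
        char *buf = malloc(nbytes);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < NITER; i++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double half_rt = (MPI_Wtime() - t0) / (2.0 * NITER);

        if (rank == 0)
            printf("%ld bytes: latency %.2f us, bandwidth %.1f MB/s\n",
                   nbytes, half_rt * 1e6, nbytes / half_rt / 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }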
18
HPL Benchmarks. The best performance with GOTO BLAS and dgemm from Intel was 959 GFlops on all 144 nodes (problem size 183000). Theoretical peak: 1382.4 GFlops; efficiency: 70%. With 80 nodes the best performance was 537 GFlops. Between 80 and 144 nodes the scaling is nearly 98.5%.
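The quoted peak and efficiency follow from the node count and clock speed, assuming 2 double-precision flops per cycle per Xeon (which is what the 1382.4 GFlops figure implies); a quick arithmetic check:

    /* theoretical peak and HPL efficiency for 144 dual-Xeon 2.4 GHz nodes */
    #include <stdio.h>

    int main(void)
    {
        double nodes = 144, cpus_per_node = 2, ghz = 2.4;
        double flops_per_cycle = 2.0;               /* SSE2 double precision */
        double peak = nodes * cpus_per_node * ghz * flops_per_cycle;
        double hpl  = 959.0;                        /* measured on 144 nodes */

        printf("peak       = %.1f GFlops\n", peak);           /* 1382.4    */
        printf("efficiency = %.1f %%\n", 100.0 * hpl / peak); /* about 69.4 */
        return 0;
    }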
19
MILC Benchmarks. Numerous QCD codes with and without dynamical quarks have been run. We independently developed SSE2 assembly codes for a double-precision implementation of the MILC codes. For the ks_imp_dyn1 codes we obtained 70% scaling in going from 2 to 128 nodes with 1 process per node, and 74% in going from 1 to 64 nodes with 2 processes per node. These runs were for 32x32x32x48 lattices in single precision.
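The production kernels are hand-written assembly; purely as an illustration of the kind of operation they vectorise, here is a hedged SSE2-intrinsics sketch of a single double-precision complex multiply, the building block of the SU(3) matrix-vector products in the MILC codes:

    /* one double-precision complex multiply c = a*b with SSE2 intrinsics:
       an __m128d register holds (real, imag) of one complex number. */
    #include <emmintrin.h>

    typedef struct { double real, imag; } dcomplex;

    static void cmul_sse2(const dcomplex *a, const dcomplex *b, dcomplex *c)
    {
        __m128d x  = _mm_loadu_pd(&a->real);        /* [a.re, a.im]           */
        __m128d y  = _mm_loadu_pd(&b->real);        /* [b.re, b.im]           */
        __m128d yr = _mm_unpacklo_pd(y, y);         /* [b.re, b.re]           */
        __m128d yi = _mm_unpackhi_pd(y, y);         /* [b.im, b.im]           */
        __m128d t1 = _mm_mul_pd(x, yr);             /* [a.re*b.re, a.im*b.re] */
        __m128d xs = _mm_shuffle_pd(x, x, 1);       /* [a.im, a.re]           */
        __m128d t2 = _mm_mul_pd(xs, yi);            /* [a.im*b.im, a.re*b.im] */
        t2 = _mm_mul_pd(t2, _mm_set_pd(1.0, -1.0)); /* negate low element     */
        _mm_storeu_pd(&c->real, _mm_add_pd(t1, t2));/* [re, im] of a*b        */
    }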
20
MILC Benchmarks (contd.). For 64^4 lattices in single precision the scaling was close to 86%. For double-precision runs on 32^4 lattices the scaling was close to 80% as the number of nodes was increased from 4 to 64. For pure-gauge simulations with double precision on 32^4 lattices the scaling was 78.5% in going from 2 to 128 nodes.
21
Physics Planned on Kabru. Very accurate simulations in pure gauge theory (with Pushan Majumdar) using the Luscher-Weisz multihit algorithm. A novel parallel code for both Wilson-loop and Polyakov-loop correlators has been developed, and preliminary runs have been carried out on lattices up to 32^4. It requires 200 GB of memory for 64^4 simulations in double precision.
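For orientation, the gauge field alone on a 64^4 double-precision lattice is already of order 10 GB; a back-of-envelope count is below (the remainder of the quoted 200 GB presumably goes into the intermediate quantities stored by the algorithm, whose details are not given here):

    /* back-of-envelope: storage for one double-precision SU(3) gauge field
       on a 64^4 lattice (4 links per site, each a 3x3 complex matrix). */
    #include <stdio.h>

    int main(void)
    {
        double sites   = 64.0 * 64.0 * 64.0 * 64.0;    /* 64^4 lattice sites */
        double links   = 4.0;                          /* one per direction  */
        double doubles = 18.0;                         /* 3x3 complex matrix */
        double bytes   = sites * links * doubles * 8.0;
        printf("gauge field: %.1f GB\n", bytes / 1e9); /* about 9.7 GB       */
        return 0;
    }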
22
Physics on Kabru (contd.). Using the same multihit algorithm we have a long-term plan to carry out very accurate measurements of Wilson loops in various representations, as well as of their correlation functions, to get a better understanding of confinement. We also plan to study string breaking in the presence of dynamical quarks, and propose to use scalar quarks to bypass the problems of dynamical fermions. With Sourendu Gupta (TIFR) we are carrying out preliminary simulations of the sound velocity in finite-temperature QCD.
23
Why KABRU?