1
Worldwide Fast File Replication on Grid Datafarm
Osamu Tatebe (1), Youhei Morita (2), Satoshi Matsuoka (3), Noriyuki Soda (4), Satoshi Sekiguchi (1)
(1) Grid Technology Research Center, AIST; (2) High Energy Accelerator Research Organization (KEK); (3) Tokyo Institute of Technology; (4) Software Research Associates, Inc.
CHEP03, San Diego, CA, March 2003
2
ATLAS/Grid Datafarm project: CERN LHC Experiment
[Figure: the LHC ring (26.7 km perimeter) at CERN with the ALICE, LHCb, and ATLAS detectors; the ATLAS detector is 40 m x 20 m and weighs 7,000 tons]
~2000 physicists from 35 countries
Collaboration between KEK, AIST, Titech, and ICEPP, U Tokyo
3
Petascale Data-intensive Computing Requirements
- Peta/Exabyte-scale files
- Scalable parallel I/O throughput: > 100 GB/s, hopefully > 1 TB/s, within a system and between systems
- Scalable computational power: > 10 TFLOPS, hopefully > 100 TFLOPS
- Efficient global sharing with group-oriented authentication and access control
- Resource management and scheduling
- System monitoring and administration
- Fault tolerance / dynamic re-configuration
- Global computing environment
4
Grid Datafarm: Cluster-of-cluster Filesystem with Data Parallel Support
- Cluster-of-cluster filesystem on the Grid
  - File replicas among clusters for fault tolerance and load balancing
  - Extension of a striping cluster filesystem: arbitrary file block length; filesystem node = compute node + I/O node; each node has large, fast local disks
- Parallel I/O, parallel file transfer, and more
- Extreme I/O bandwidth, > TB/s: exploit data access locality with file affinity scheduling and the local file view
- Fault tolerance - file recovery: write-once files can be re-generated using a command history and re-computation
[1] O. Tatebe, et al., "Grid Datafarm Architecture for Petascale Data Intensive Computing," Proc. of CCGrid 2002, Berlin, May 2002. Available at http://datafarm.apgrid.org/
5
Distributed disks across the clusters form a single Gfarm file system
- Each cluster generates the corresponding part of the data
- The data are replicated for fault tolerance and load balancing (bandwidth challenge!)
- The analysis process is executed on the node that has the data
[Figure: disks at Baltimore, Tsukuba, Indiana, San Diego, and Tokyo forming a single Gfarm file system]
6
Extreme I/O bandwidth support example: gfgrep - parallel grep
% gfrun -G gfarm:input gfgrep -o gfarm:output regexp gfarm:input
Each scheduled gfgrep process does, in effect:
  open("gfarm:input", &f1); create("gfarm:output", &f2);
  set_view_local(f1); set_view_local(f2);
  /* grep regexp over the local fragment */
  close(f1); close(f2);
[Figure: file affinity scheduling - the gfmd metadata server runs gfgrep on the hosts holding the fragments input.1-input.5 (Host1.ch-Host3.ch at CERN.CH, Host4.jp-Host5.jp at KEK.JP); each host writes the corresponding fragment output.1-output.5 locally]
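The open/create/set_view_local calls above are the slide's schematic notation, not the exact Gfarm C library signatures. A minimal sketch of the per-node gfgrep worker under that assumption (the gf_* type, prototypes, and the gf_read_line helper are hypothetical stand-ins):

/* Per-node sketch of the gfgrep worker using the local file view.
 * The gf_* declarations mirror the schematic open/create/set_view_local
 * calls from the slide; they are hypothetical stand-ins, not the real
 * Gfarm C API. */
#include <stddef.h>
#include <string.h>

typedef struct gf_file *gf_file;                       /* hypothetical file handle   */
int   gf_open(const char *gfarm_url, gf_file *f);      /* open("gfarm:...") on slide */
int   gf_create(const char *gfarm_url, gf_file *f);
int   gf_set_view_local(gf_file f);                    /* restrict to local fragment */
char *gf_read_line(gf_file f, char *buf, size_t len);  /* hypothetical line reader   */
int   gf_write(gf_file f, const void *buf, size_t len);
int   gf_close(gf_file f);

/* gfrun schedules this on every node that holds a fragment of gfarm:input
 * (file affinity scheduling); each process greps only its own fragment and
 * writes matching lines to its own local fragment of gfarm:output. */
int gfgrep_fragment(const char *pattern)
{
    gf_file in, out;
    char line[4096];

    gf_open("gfarm:input", &in);
    gf_create("gfarm:output", &out);
    gf_set_view_local(in);
    gf_set_view_local(out);

    while (gf_read_line(in, line, sizeof(line)) != NULL)
        if (strstr(line, pattern) != NULL)
            gf_write(out, line, strlen(line));

    gf_close(in);
    gf_close(out);
    return 0;
}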
7
Application: FADS/Goofy
- Monte Carlo simulation framework with Geant4 (C++)
- FADS/Goofy: Framework for ATLAS/Autonomous Detector Simulation / Geant4-based Object-oriented Folly
  http://atlas.web.cern.ch/Atlas/GROUPS/SOFTWARE/OO/domains/simulation/
- Modular I/O package selection: Objectivity/DB and/or ROOT I/O on top of the Gfarm filesystem, with good scalability
- CPU-intensive event simulation with high-speed file replication and/or distribution
Refer to Y. Morita's talk, Category 3, March 25: http://chep03.ucsd.edu/files/407.pdf
8
Presto III Gfarm Development Cluster (Prototype) at Titech
- 256 nodes / 512 processors, dual AthlonMP 1900+ (1.6 GHz); Rpeak 1.6 TFLOPS
- AMD 760MP chipset
- Full Myrinet 2000 network
- 100 TB of storage for storage-intensive / DataGrid applications
- June 2002: 47th on the Top 500 at 716 GFLOPS; 2nd fastest PC cluster at the time
- Collaboration with AMD, Bestsystems Co., Tyan, Appro, and Myricom
- Direct 1 Gbps connection to the HEP WAN on SuperSINET
- June 2003: CPU upgrade from 1.6 GHz to 2 GHz planned
9
Performance Evaluation - 64-node Presto III Gfarm Development Cluster (Prototype)
Parallel I/O (file affinity scheduling, local file view), 64 nodes, 640 GB of data:
- 1742 MB/s on writes, 1974 MB/s on reads
- open("gfarm:f", &f); set_view_local(f); write(f, buf, len); close(f);
Parallel file replication over Myrinet 2000 interconnects, 10 GB per fragment:
- 443 MB/s (= 3.7 Gbps) using 23 parallel streams
- 180 MB/s with 7 parallel streams
[Figures: breakdown of bandwidth on each node [MB/s]; Gfarm parallel copy bandwidth [MB/s] vs. number of nodes (fragments) for Seagate ST380021A and Maxtor 33073H3 disks]
[1] O. Tatebe, et al., "Grid Datafarm Architecture for Petascale Data Intensive Computing," Proc. of CCGrid 2002, Berlin, May 2002
10
Design of AIST Gfarm Cluster I
Cluster node (high density and high performance):
- 1U, dual 2.8 GHz Xeon, GbE
- 800 GB RAID with four 3.5" 200 GB HDDs + 3ware RAID controller
- 97 MB/s on writes, 130 MB/s on reads
80-node experimental cluster (operational from Feb 2003):
- Force10 E600 + KVM switch + keyboard + LCD
- 70 TB of RAID in total, with 384 IDE disks
- 7.7 GB/s on writes, 9.8 GB/s on reads for a 1.7 TB file
- 1.7 GB/s (= 14.5 Gbps) on file replication of a 640 GB file with 32 streams
- WAN emulation nodes with NistNet allow the cluster to be operated as up to five emulated clusters
11
Performance Evaluation - 64-node AIST Gfarm Cluster I
Parallel I/O (file affinity scheduling, local file view), 64 nodes, 1.25 TB of data:
- 6180 MB/s on writes, 7695 MB/s on reads
- open("gfarm:f", &f); set_view_local(f); write(f, buf, len); close(f);
Parallel file replication over Gigabit Ethernet, 20 GB per fragment:
- 1726 MB/s (= 14.5 Gbps) using 32 parallel streams
- 54 MB/s (= 452 Mbps) per node
[Figure: breakdown of bandwidth on each node [MB/s]]
12
Network and cluster configuration for SC2002 Bandwidth Challenge
[Figure: the Grid Cluster Federation booth at SC2002 (Baltimore) connects to SCinet at 10 GE through a Force10 E1200; US sites SDSC and Indiana Univ. (via the Indianapolis GigaPoP NOC) connect over OC-12; the trans-Pacific paths run over APAN/TransPAC to the Tokyo NOC via StarLight and PNWG - an OC-12 POS northern route and an OC-12 ATM southern route shaped to 271 Mbps; in Japan, SuperSINET (1 Gbps) and Tsukuba WAN (20 Mbps / 1 Gbps links) connect KEK, Titech, AIST, and ICEPP, with a 1 Gbps NII-ESnet HEP PVC to the US ESnet]
- Total bandwidth from/to the SC2002 booth: 2.137 Gbps
- Total disk capacity: 18 TB; disk I/O bandwidth: 6 GB/s
- Peak CPU performance: 962 GFlops
13
Network and cluster configuration
- SC2002 booth (Baltimore): 12-node AIST Gfarm cluster connected with GbE; connects to SCinet at 10 GE through a Force10 E1200
  - Performance in the LAN: network bandwidth 930 Mbps, file transfer bandwidth 75 MB/s (= 629 Mbps)
- GTRC, AIST (Tsukuba, Japan): 7-node AIST Gfarm cluster of the same configuration; connects to Tokyo XP with GbE via Tsukuba WAN and Maffin
- Indiana Univ.: 15-node PC cluster connected with Fast Ethernet; connects to the Indianapolis GigaPoP with OC-12
- SDSC (San Diego): 8-node PC cluster connected with GbE; connects to the outside with OC-12
- TransPAC northern and southern routes
  - The northern route is the default; the southern route is selected by static routing for 3 nodes each at the SC booth and AIST
  - RTT between AIST and the SC booth: north 199 ms, south 222 ms
  - The southern route is shaped to 271 Mbps
[Figure: network configuration at the SC2002 booth - 12 PC nodes on GbE behind the E1200, which uplinks to SCinet at 10 GE (OC-192)]
14
Challenging points of TCP-based file transfer
- Large latency and high bandwidth (a long fat network, LFN)
  - Big socket size for a large congestion window (see the bandwidth-delay product estimate below)
  - Fast window-size recovery after packet loss: High Speed TCP (Internet-Draft by Sally Floyd)
  - Network striping
- Packet loss due to real congestion: transfer control
- Poor disk I/O performance [AIST Gfarm cluster]
  - 3ware RAID with four 3.5" HDDs on each node: over 115 MB/s (~ 1 Gbps network bandwidth)
  - Network striping vs. disk striping access: # streams, stripe size
- Limited number of nodes: need to achieve maximum file transfer performance
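As a rough check on the socket-size requirement, the bandwidth-delay product of one OC-12 path at the RTT measured between AIST and the SC booth (about 622 Mbps at roughly 200 ms) is

\[ \mathrm{BDP} \approx \frac{622 \times 10^{6}\ \mathrm{bit/s} \times 0.2\ \mathrm{s}}{8\ \mathrm{bit/byte}} \approx 15.5\ \mathrm{MB}, \]

so a default-sized socket buffer cannot keep the pipe full; this is why the runs below use 8 MB socket buffers and two or more parallel streams.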
15
High Speed TCP Performance Evaluation (1): US -> Japan (TransPAC North)
1 node to 1 node, 2 streams, 8 MB socket buffer, TransPAC North (OC-12 POS)
[Figure: bandwidth vs. time (sec)]
Bandwidth recovered very quickly after packet loss; High Speed TCP performs very well.
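The 8 MB socket buffer has to be requested explicitly on each stream before connecting. A minimal sketch using the standard setsockopt(2) calls (connect and transfer logic omitted; the kernel's per-socket buffer limits are assumed to have been raised to allow 8 MB):

/* Create a TCP socket with large send/receive buffers for a long fat
 * network path.  8 MB matches the buffer size quoted on the slide; the
 * system-wide maximums (e.g. net.core.wmem_max / rmem_max on Linux) are
 * assumed to have been raised beforehand. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

int make_lfn_socket(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    int bufsize = 8 * 1024 * 1024;          /* 8 MB per direction */

    if (s < 0)
        return -1;
    if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize)) < 0)
        perror("SO_SNDBUF");
    if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize)) < 0)
        perror("SO_RCVBUF");
    /* connect() and transfer one file fragment per stream ... */
    return s;
}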
16
High Speed TCP Performance Evaluation (2): Japan -> US (TransPAC North)
1 node to 1 node, 2 streams, 8 MB socket buffer, TransPAC North (OC-12 POS)
Traffic is slightly heavier from Japan to the US; High Speed TCP still performs very well.
17
High Speed TCP Performance Evaluation (3): Japan -> US (TransPAC South)
1 stream, 8 MB socket buffer, TransPAC South (OC-12 ATM, shaped to 271 Mbps)
5-sec peak: 251 Mbps; 10-min average: 85.9 Mbps
Critical packet-loss problem when the traffic rate is high.
18
High Speed TCP Performance Evaluation (4): Japan -> US (TransPAC South)
2 streams rate-limited to 100 Mbps each, 8 MB socket buffer, TransPAC South (OC-12 ATM, shaped to 271 Mbps)
5-sec peak: 190.9 Mbps; 10-min average: 166.1 Mbps
Rate control performs well.
19
High Speed TCP Performance Evaluation (5): Japan -> US (TransPAC South)
3 streams rate-limited to 100 Mbps each, 8 MB socket buffer, TransPAC South (OC-12 ATM, shaped to 271 Mbps)
5-sec peak: 278.8 Mbps; 10-min average: 160.0 Mbps
20
High Speed TCP bandwidth from US to Japan via APAN/TransPAC
Northern route: 2 nodes; southern route: 3 nodes (100 Mbps each)
753 Mbps in total (10-sec average); theoretical peak: 622 + 271 = 893 Mbps
[Figure: 10-sec and 5-min average bandwidth on the northern and southern routes]
21
Parameters of US-Japan file transfer

Parameter                  | Northern route | Southern route
Socket buffer size         | 610 KB         | 250 KB
Traffic control per stream | 50 Mbps        | 28.5 Mbps
# streams per node pair    | 16 streams     | 8 streams
# nodes                    | 3 hosts        | 1 host
Stripe unit size: 128 KB

# node pairs | # streams         | 10-sec average BW | Transfer time (sec) | Average BW
1 (N1)       | 16 (N16x1)        |                   | 113.0               | 152 Mbps
2 (N2)       | 32 (N16x2)        | 419 Mbps          | 115.9               | 297 Mbps
3 (N3)       | 48 (N16x3)        | 593 Mbps          | 139.6               | 369 Mbps
4 (N3 S1)    | 56 (N16x3 + S8x1) | 741 Mbps          | 150.0               | 458 Mbps
22
File replication between US and Japan
Using 4 nodes each in the US and Japan, we achieved 741 Mbps (10-sec average bandwidth, out of 893 Mbps) for file transfer!
[Figure: 10-sec average bandwidth]
23
File replication performance between one node at the SC booth and other US sites
34.9 MB/s (= 293 Mbps)
RTT: SC-Indiana 30 ms, SC-SDSC 86 ms
24
Parameters for the SC2002 bandwidth challenge

Outgoing traffic:
# nodes in Baltimore | Remote site  | # nodes | # streams/node | Socket buffer size, rate limit | Measured BW (1-2 min avg)
3                    | SDSC         | 5       | 1              | 1 MB                           | > 60 MB/s
2                    | Indiana      | 8       | 1              | 1 MB                           | 56.8 MB/s
3                    | AIST N       | 3       | 16             | 610 KB, 50 Mbps                | 44.0 MB/s
1                    | AIST S       | 1       | 16             | 346 KB, 28.5 Mbps              | 10.6 MB/s
9                    | SD, In, AIST | 5, 8, 4 | -              | -                              | > 171 MB/s (= > 1.43 Gbps)

Incoming traffic:
# nodes in Baltimore | Remote site  | # nodes | # streams/node | Socket buffer size, rate limit | Measured BW (1-2 min avg)
1                    | SDSC         | 3       | 7              | 7 MB                           | 23.1 MB/s
1*                   | Indiana      | 4       | 1              | 1 MB                           | 34.9 MB/s
1*                   | AIST N       | 1       | 1              | 610 KB, 50 Mbps                | ? MB/s
1                    | AIST S       | 1       | 1              | 346 KB                         | ? MB/s
3                    | SD, In, AIST | 3, 4, 2 | -              | -                              | > 58 MB/s (= > 487 Mbps)
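The Gbps/Mbps equivalents quoted for the aggregate rows appear to treat 1 MB/s as \(2^{20}\) bytes per second:

\[ 171\ \mathrm{MB/s} \times 2^{20}\ \tfrac{\mathrm{B}}{\mathrm{MB}} \times 8\ \tfrac{\mathrm{bit}}{\mathrm{B}} \approx 1.43\ \mathrm{Gbps}, \qquad 58\ \mathrm{MB/s} \times 2^{20}\ \tfrac{\mathrm{B}}{\mathrm{MB}} \times 8\ \tfrac{\mathrm{bit}}{\mathrm{B}} \approx 487\ \mathrm{Mbps}. \]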
25
SC02 Bandwidth Challenge Result
We achieved 2.286 Gbps using 12 nodes! (outgoing 1.691 Gbps, incoming 0.595 Gbps)
[Figure: 10-sec, 1-sec, and 0.1-sec average bandwidth during the challenge]
26
Summary
- A petascale data-intensive computing wave is coming; key technologies: Grid and cluster
- Grid Datafarm is an architecture for online > 10 PB storage and > TB/s I/O bandwidth, efficient sharing on the Grid, and fault tolerance
- Initial performance evaluation shows scalable performance:
  - 1742 MB/s on writes and 1974 MB/s on reads on 64 cluster nodes of Presto III
  - 443 MB/s using 23 parallel streams on Presto III
  - 7.7 GB/s on writes and 9.8 GB/s on reads of a 1.7 TB file on 80 cluster nodes of AIST Gfarm Cluster I
  - 1.7 GB/s (= 14.5 Gbps) on file replication of a 640 GB file with 32 parallel streams on AIST Gfarm Cluster I
  - Metaserver overhead is negligible
- Gfarm file replication achieved 2.286 Gbps at the SC2002 bandwidth challenge, and 741 Mbps out of 893 Mbps between the US and Japan!
datafarm@apgrid.org
http://datafarm.apgrid.org/
27
Special thanks to
- Rick McMullen, John Hicks (Indiana Univ., PRAGMA)
- Phillip Papadopoulos (SDSC, PRAGMA)
- Hisashi Eguchi (Maffin)
- Kazunori Konishi, Yoshinori Kitatsuji, Ayumu Kubota (APAN)
- Chris Robb (Indiana Univ., Abilene)
- Force10 Networks, Inc.
- METI, Network Computing Project