Worldwide Fast File Replication on Grid Datafarm
Osamu Tatebe (1), Youhei Morita (2), Satoshi Matsuoka (3), Noriyuki Soda (4), Satoshi Sekiguchi (1)
(1) Grid Technology Research Center, AIST
(2) High Energy Accelerator Research Organization (KEK)
(3) Tokyo Institute of Technology
(4) Software Research Associates, Inc.
CHEP03, San Diego, CA, March 2003

ATLAS/Grid Datafarm project: CERN LHC Experiment
[Figure: the LHC ring and its detectors, including the ALICE and LHCb detectors]
- ATLAS detector: 40 m x 20 m, 7000 tons
- LHC perimeter: 26.7 km
- ~2000 physicists from 35 countries
- Collaboration between KEK, AIST, Titech, and ICEPP, U Tokyo

Petascale Data-intensive Computing Requirements
- Peta/Exabyte-scale files
- Scalable parallel I/O throughput: > 100 GB/s, hopefully > 1 TB/s, within a system and between systems
- Scalable computational power: > 10 TFLOPS, hopefully > 100 TFLOPS
- Efficient global sharing with group-oriented authentication and access control
- Resource management and scheduling
- System monitoring and administration
- Fault tolerance / dynamic re-configuration
- Global computing environment

Grid Datafarm: Cluster-of-cluster Filesystem with Data Parallel Support
- Cluster-of-cluster filesystem on the Grid
  - File replicas among clusters for fault tolerance and load balancing
  - Extension of a striping cluster filesystem: arbitrary file block length
  - Filesystem node = compute node + I/O node; each node has large, fast local disks
- Parallel I/O, parallel file transfer, and more
- Extreme I/O bandwidth, > TB/s
  - Exploit data access locality: file affinity scheduling and local file view
- Fault tolerance – file recovery
  - Write-once files can be re-generated using a command history and re-computation
[1] O. Tatebe, et al., Grid Datafarm Architecture for Petascale Data Intensive Computing, Proc. of CCGrid 2002, Berlin, May 2002

Distributed disks across the clusters form a single Gfarm file system
- Each cluster generates the corresponding part of the data
- The data are replicated for fault tolerance and load balancing (bandwidth challenge!)
- The analysis process is executed on the node that has the data
[Figure: clusters in Baltimore, Tsukuba, Indiana, San Diego, and Tokyo forming one file system]

Extreme I/O bandwidth support example: gfgrep – parallel grep

    % gfrun -G gfarm:input gfgrep -o gfarm:output regexp gfarm:input

[Figure: file affinity scheduling. gfarm:input is stored as fragments input.1 to input.5 on Host1.ch, Host2.ch, Host3.ch (CERN.CH) and Host4.jp, Host5.jp (KEK.JP); the gfmd metadata server schedules one gfgrep process on each node holding a fragment, and each process runs open("gfarm:input", &f1); create("gfarm:output", &f2); set_view_local(f1); set_view_local(f2); ... grep regexp ... close(f1); close(f2), producing local output fragments output.1 to output.5.]
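The pattern behind gfgrep is that each filesystem node scans only the input fragment stored on its own disk and writes matching lines to a local output fragment, so no file data crosses the network. Below is a minimal, self-contained C sketch of that per-node worker; the spool paths and fragment naming are hypothetical, and the real gfgrep goes through the Gfarm parallel I/O calls shown in the figure above (open on a gfarm: URL followed by set_view_local) rather than plain POSIX stdio.

    /* Per-node worker sketch for a gfgrep-style parallel grep.
     * Hypothetical local fragment layout: /gfarm/spool/input.<idx>
     * and /gfarm/spool/output.<idx>; the real tool resolves fragments
     * through the gfmd metadata server and the Gfarm I/O library. */
    #include <stdio.h>
    #include <regex.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <regexp> <fragment-index>\n", argv[0]);
            return 2;
        }

        regex_t re;
        if (regcomp(&re, argv[1], REG_EXTENDED | REG_NOSUB) != 0) {
            fprintf(stderr, "bad regexp\n");
            return 2;
        }

        char in_path[256], out_path[256];
        snprintf(in_path,  sizeof in_path,  "/gfarm/spool/input.%s",  argv[2]);
        snprintf(out_path, sizeof out_path, "/gfarm/spool/output.%s", argv[2]);

        FILE *in  = fopen(in_path,  "r");   /* local input fragment  */
        FILE *out = fopen(out_path, "w");   /* local output fragment */
        if (in == NULL || out == NULL) {
            perror("fopen");
            return 2;
        }

        /* Scan the local fragment line by line; only matching lines are
         * written, so the only cross-node traffic is job dispatch. */
        char line[8192];
        while (fgets(line, sizeof line, in) != NULL) {
            if (regexec(&re, line, 0, NULL, 0) == 0)
                fputs(line, out);
        }

        fclose(in);
        fclose(out);
        regfree(&re);
        return 0;
    }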

Application: FADS/Goofy
- Monte Carlo simulation framework with Geant4 (C++)
- FADS/Goofy: Framework for ATLAS/Autonomous Detector Simulation / Geant4-based Object-oriented Folly
- Modular I/O package selection: Objectivity/DB and/or ROOT I/O on top of the Gfarm filesystem, with good scalability
- CPU-intensive event simulation with high-speed file replication and/or distribution
Refer to Y. Morita's talk, Category 3, March 25

Presto III Gfarm Development Cluster (Prototype) at Titech
- Dual AthlonMP (1.6 GHz), 256 nodes / 512 processors, Rpeak 1.6 TeraFlops
- AMD 760MP chipset
- Full Myrinet 2K network
- 100 Terabyte storage for storage-intensive / DataGrid apps
- June Top 500, 716 GFlops: 2nd fastest PC cluster at the time
- Collaboration with AMD, Bestsystems Co., Tyan, Appro, Myricom
- Direct 1 Gbps connection to the HEP WAN network on SuperSINET
- June 2003: CPU upgrade from 1.6 GHz to 2 GHz planned

Performance Evaluation – 64-node Presto III Gfarm Development Cluster (Prototype)
- Parallel I/O (file affinity scheduling, local file view), 64 nodes, 640 GB file:
  - 1742 MB/s on writes
  - 1974 MB/s on reads
  - Per-node access pattern: open("gfarm:f", &f); set_view_local(f); write(f, buf, len); close(f);
- Parallel file replication (Myrinet 2000 interconnect, 10 GB per fragment):
  - 443 MB/s (= 3.7 Gbps) using 23 parallel streams
[Charts: breakdown of per-node bandwidth [MB/s] on Presto III (64 nodes, 640 GB data size); Gfarm parallel copy bandwidth [MB/s] vs. number of nodes (fragments), Seagate ST380021A and Maxtor 33073H3 disks, annotated "180 MB/s, 7 parallel streams"]
[1] O. Tatebe, et al., Grid Datafarm Architecture for Petascale Data Intensive Computing, Proc. of CCGrid 2002, Berlin, May 2002
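The aggregate numbers above come from each of the 64 nodes writing (or reading) its own roughly 10 GB fragment through the local file view and summing the per-node bandwidths. Below is a rough, self-contained analogue in plain C that times a local sequential write; the spool path is hypothetical, and the actual benchmark uses the Gfarm API calls shown in the snippet above rather than POSIX open(2).

    /* Sketch of a per-node sequential write bandwidth measurement.
     * Writes total_bytes to a (hypothetical) local spool file in 1 MB
     * chunks and reports MB/s; the real benchmark goes through the
     * Gfarm parallel I/O API with the local file view. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <time.h>

    int main(void)
    {
        const size_t chunk = 1 << 20;                 /* 1 MB chunks    */
        const long long total_bytes = 1LL << 30;      /* 1 GB for demo  */
        char *buf = malloc(chunk);
        if (buf == NULL) return 1;
        memset(buf, 'x', chunk);

        int fd = open("/gfarm/spool/bench.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        for (long long written = 0; written < total_bytes; written += chunk) {
            if (write(fd, buf, chunk) != (ssize_t)chunk) { perror("write"); return 1; }
        }
        fsync(fd);                                    /* include the flush to disk */
        close(fd);

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.1f MB/s\n", (total_bytes / 1e6) / sec);

        free(buf);
        return 0;
    }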

Design of AIST Gfarm Cluster I
- Cluster node (high density and high performance): 1U, dual 2.8 GHz Xeon, GbE
  - 800 GB RAID with 200 GB HDDs + 3ware RAID controller
  - 97 MB/s on writes, 130 MB/s on reads
- 80-node experimental cluster (operational from Feb 2003)
  - Force10 E600 + KVM switch + keyboard + LCD
  - 70 TB RAID in total, with 384 IDE disks
  - 7.7 GB/s on writes, 9.8 GB/s on reads for a 1.7 TB file
  - 1.7 GB/s (= 14.5 Gbps) on file replication of a 640 GB file with 32 streams
- WAN emulation nodes with NistNet can make the cluster behave as up to five separate clusters

Performance Evaluation – 64-node AIST Gfarm Cluster I
- Parallel I/O (file affinity scheduling, local file view), 64 nodes, 1.25 TB file:
  - 6180 MB/s on writes
  - 7695 MB/s on reads
  - Per-node access pattern: open("gfarm:f", &f); set_view_local(f); write(f, buf, len); close(f);
- Parallel file replication (Gigabit Ethernet, 20 GB per fragment):
  - 1726 MB/s (= 14.5 Gbps) using 32 parallel streams
  - 54 MB/s (= 452 Mbps) per node
[Chart: breakdown of per-node bandwidth [MB/s] on the AIST Gfarm cluster (64 nodes, 1.25 TB data size)]

Network and cluster configuration for the SC2002 Bandwidth Challenge
[Network diagram: the Grid Cluster Federation booth at SC2002 (Baltimore) connects to SCinet at 10 GE via a Force10 E1200; US sites are Indiana Univ. (via the Indianapolis GigaPoP), SDSC, StarLight, PNWG, and ESnet over OC-12 links; the Japan side (KEK, Titech, AIST, ICEPP) connects over SuperSINET (1 Gbps), the NII-ESnet HEP PVC (1 Gbps), Tsukuba WAN, and APAN/TransPAC to the Tokyo NOC, reaching the US over an OC-12 POS northern route and an OC-12 ATM southern route shaped to 271 Mbps.]
- Total bandwidth from/to the SC2002 booth: Gbps
- Total disk capacity: 18 TB; disk I/O bandwidth: 6 GB/s
- Peak CPU performance: 962 GFlops

Network and cluster configuration
- SC2002 booth (Baltimore): 12-node AIST Gfarm cluster connected with GbE; connects to SCinet at 10 GE using a Force10 E1200
  - LAN performance: network bandwidth 930 Mbps, file transfer bandwidth 75 MB/s (= 629 Mbps)
- GTRC, AIST (Tsukuba, Japan): the same type of AIST Gfarm cluster, 7 nodes; connects to Tokyo XP with GbE via Tsukuba WAN and Maffin
- Indiana Univ.: 15-node PC cluster connected with Fast Ethernet; connects to the Indianapolis GigaPoP with OC-12
- SDSC (San Diego): 8-node PC cluster connected with GbE; connects outside with OC-12
- TransPAC north and south routes
  - North route is the default; the south route is used, via static routing, for 3 nodes each at the SC booth and AIST
  - RTT between AIST and the SC booth: north 199 ms, south 222 ms
  - The south route is shaped to 271 Mbps
[Diagram: network configuration at the SC2002 booth, with PC nodes on GbE behind the E1200 and a 10 GE uplink to SCinet]

Challenging points of TCP-based file transfer
- Large latency, high bandwidth (aka a long fat network, LFN)
  - Big socket size for a large congestion window (see the sketch below)
  - Fast window-size recovery after packet loss: High Speed TCP (Internet Draft by Sally Floyd)
  - Network striping
- Packet loss due to real congestion
  - Transfer (rate) control
- Poor disk I/O performance [AIST Gfarm cluster]
  - 3ware RAID with HDDs on each node: over 115 MB/s (~ 1 Gbps network bandwidth)
  - Network striping vs. disk striping access: # streams, stripe size
- Limited number of nodes
  - Need to achieve maximum file transfer performance per node
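The "big socket size" point is concrete: with roughly 200 ms RTT between AIST and the SC booth, a single stream needs a window of at least bandwidth times RTT to keep the pipe full, which is why the tests below use 8 MB socket buffers. A minimal sketch of requesting such buffers on a TCP socket follows (the kernel may clamp the request to its configured maximums, e.g. net.core.wmem_max / rmem_max on Linux):

    /* Sketch: request large TCP socket buffers for a long fat network.
     * An 8 MB buffer per stream matches the configuration used in the
     * TransPAC tests below. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int make_lfn_socket(void)
    {
        int sock = socket(AF_INET, SOCK_STREAM, 0);
        if (sock < 0) { perror("socket"); return -1; }

        int bufsize = 8 * 1024 * 1024;   /* 8 MB, as in the WAN experiments */
        if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof bufsize) < 0)
            perror("SO_SNDBUF");
        if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof bufsize) < 0)
            perror("SO_RCVBUF");

        /* Report what the kernel actually granted (on Linux the reported
         * value is typically doubled to account for bookkeeping overhead). */
        int granted;
        socklen_t len = sizeof granted;
        if (getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &granted, &len) == 0)
            printf("send buffer: %d bytes\n", granted);

        return sock;
    }

    int main(void)
    {
        int s = make_lfn_socket();
        return s < 0 ? 1 : 0;
    }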

High Speed TCP Performance Evaluation (1): US -> Japan (TransPAC North)
- 1 node to 1 node, 2 streams, 8 MB socket buffer
- TransPAC North, OC-12 POS
[Chart: bandwidth vs. time (sec)]
- Bandwidth recovered very soon. High Speed TCP performs very well!

High Speed TCP Performance Evaluation (2): Japan -> US (TransPAC North)
- 1 node to 1 node, 2 streams, 8 MB socket buffer
- TransPAC North, OC-12 POS
- Traffic is slightly heavy from Japan to the US. High Speed TCP performs very well.

High Speed TCP Performance Evaluation (3): Japan -> US (TransPAC South)
- 1 stream, 8 MB socket buffer
- TransPAC South, OC-12 ATM with 271 Mbps shaping
- 5-sec peak: 251 Mbps; 10-min average: 85.9 Mbps
- Critical packet loss problem when the traffic rate is high

High Speed TCP Performance Evaluation (4): Japan -> US (TransPAC South)
- 2 x 100 Mbps streams, 8 MB socket buffer
- TransPAC South, OC-12 ATM with 271 Mbps shaping
- 5-sec peak: Mbps; 10-min average: Mbps
- Rate control performs well (a possible implementation is sketched below)
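The slides do not say how the per-stream rate limit was enforced; one common approach is simple sender-side pacing, sketched here as a self-contained C example under that assumption (real transfers may instead use token buckets or kernel-level traffic shaping).

    /* Sketch: pace a sender to a target rate (e.g. 100 Mbps per stream)
     * by sleeping whenever the bytes sent so far run ahead of the
     * wall-clock budget. */
    #include <stdint.h>
    #include <time.h>
    #include <unistd.h>

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    /* Call after each chunk is sent; blocks until the average rate
     * drops back to target_bps. */
    static void pace(uint64_t bytes_sent, double start_time, double target_bps)
    {
        double elapsed = now_sec() - start_time;
        double budget  = bytes_sent * 8.0 / target_bps;  /* seconds allowed */
        if (budget > elapsed)
            usleep((useconds_t)((budget - elapsed) * 1e6));
    }

    int main(void)
    {
        const double target_bps = 100e6;     /* 100 Mbps, as in the tests */
        const uint64_t chunk = 64 * 1024;    /* pretend each send moves 64 KB */
        double start = now_sec();
        uint64_t sent = 0;

        for (int i = 0; i < 1000; i++) {     /* stand-in for the real send loop */
            /* ... send(sock, buf, chunk, 0) would go here ... */
            sent += chunk;
            pace(sent, start, target_bps);
        }
        return 0;
    }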

High Speed TCP Performance Evaluation (5): Japan -> US (TransPAC South)
- 3 x 100 Mbps streams, 8 MB socket buffer
- TransPAC South, OC-12 ATM with 271 Mbps shaping
- 5-sec peak: Mbps; 10-min average: Mbps

High Speed TCP bandwidth from US to Japan via APAN/TransPAC
- Northern route: 2 nodes
- Southern route: 3 nodes (100 Mbps each)
- Total: 753 Mbps (10-sec average), against a theoretical peak of 893 Mbps
[Charts: 10-sec and 5-min average bandwidth for the northern and southern routes]

Parameters of US-Japan file transfer

  Parameter                  | Northern route | Southern route
  ---------------------------+----------------+---------------
  Socket buffer size         | 610 KB         | 250 KB
  Traffic control per stream | 50 Mbps        | 28.5 Mbps
  # streams per node pair    | 16 streams     | 8 streams
  # nodes                    | 3 hosts        | 1 host

  Stripe unit size: 128 KB

  # node pairs | # streams          | 10-sec average BW | Transfer time (sec) | Average BW
  -------------+--------------------+-------------------+---------------------+-----------
  1 (N1)       | 16 (N16x1)         | Mbps              |                     |
  2 (N2)       | 32 (N16x2)         | 419 Mbps          |                     | Mbps
  3 (N3)       | 48 (N16x3)         | 593 Mbps          |                     | Mbps
  4 (N3 S1)    | 56 (N16x3 + S8x1)  | 741 Mbps          |                     | Mbps
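As a rough consistency check of these parameters (assuming the nominal OC-12 rate of 622 Mbps on the northern route, which is not stated on the slide, together with the 271 Mbps shaping on the southern route):

    \begin{align*}
      \text{North: } & 3 \times 16 \times 50\ \text{Mbps} = 2400\ \text{Mbps offered, capped by the 622 Mbps link}\\
      \text{South: } & 1 \times 8 \times 28.5\ \text{Mbps} = 228\ \text{Mbps offered, under the 271 Mbps shaping}\\
      \text{Path: }  & 622 + 271 = 893\ \text{Mbps}, \qquad 741 / 893 \approx 83\%\ \text{achieved}
    \end{align*}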

File replication between US and Japan
Using 4 nodes each in the US and Japan, we achieved 741 Mbps for file transfer! (out of 893 Mbps, 10-sec average bandwidth)
[Chart: 10-sec average bandwidth]

File replication performance between one node at the SC booth and other US sites
- 34.9 MB/s (= 293 Mbps)
- RTT: SC-Indiana 30 ms, SC-SDSC 86 ms

Parameters for the SC2002 bandwidth challenge

Outgoing traffic (total > 171 MB/s = > 1.43 Gbps):

  # nodes in Baltimore | Remote site         | # nodes | # streams/node | Socket buffer size, rate limit | Measured BW (1-2 min avg)
  ---------------------+---------------------+---------+----------------+--------------------------------+--------------------------
  3                    | SDSC                | 5       | 1              | 1 MB                           | > 60 MB/s
  2                    | Indiana             | 8       | 1              | 1 MB                           | 56.8 MB/s
  3                    | AIST (N)            |         |                | KB, 50 Mbps                    | 44.0 MB/s
  1                    | AIST (S)            |         |                | KB, 28.5 Mbps                  | 10.6 MB/s
  9                    | SDSC, Indiana, AIST | 5, 8, 4 | -              |                                | > 171 MB/s

Incoming traffic (total > 58 MB/s = > 487 Mbps):

  # nodes in Baltimore | Remote site         | # nodes | # streams/node | Socket buffer size, rate limit | Measured BW (1-2 min avg)
  ---------------------+---------------------+---------+----------------+--------------------------------+--------------------------
  1                    | SDSC                | 3       | 7              | 7 MB                           | 23.1 MB/s
  1*                   | Indiana             | 4       | 1              | 1 MB                           | 34.9 MB/s
  1*                   | AIST (N)            |         |                | KB, 50 Mbps                    | ? MB/s
  1                    | AIST (S)            |         |                | KB                             | ? MB/s
  3                    | SDSC, Indiana, AIST | 3, 4, 2 | -              |                                | > 58 MB/s

SC02 Bandwidth Challenge Result
We achieved Gbps using 12 nodes! (outgoing Gbps, incoming Gbps)
[Charts: 10-sec, 1-sec, and 0.1-sec average bandwidth]

Summary
- Petascale data-intensive computing wave; key technology: Grid and cluster
- Grid Datafarm is an architecture for:
  - Online > 10 PB storage, > TB/s I/O bandwidth
  - Efficient sharing on the Grid
  - Fault tolerance
- Initial performance evaluation shows scalable performance:
  - 1742 MB/s and 1974 MB/s on writes and reads on 64 cluster nodes of Presto III
  - 443 MB/s using 23 parallel streams on Presto III
  - 7.7 GB/s and 9.8 GB/s on writes and reads of a 1.7 TB file on 80 cluster nodes of AIST Gfarm I
  - 1.7 GB/s (= 14.5 Gbps) on file replication of a 640 GB file with 32 parallel streams on AIST Gfarm I
  - Metaserver overhead is negligible
- Gfarm file replication achieves Gbps at the SC2002 bandwidth challenge, and 741 Mbps out of 893 Mbps between US and Japan!

Special thanks to
- Rick McMullen, John Hicks (Indiana Univ, PRAGMA)
- Phillip Papadopoulos (SDSC, PRAGMA)
- Hisashi Eguchi (Maffin)
- Kazunori Konishi, Yoshinori Kitatsuji, Ayumu Kubota (APAN)
- Chris Robb (Indiana Univ, Abilene)
- Force10 Networks, Inc.
- METI, Network Computing project