
1 TeraGrid Data Transfer
Joint EGEE and OSG Workshop on Data Handling in Production Grids
June 25, 2007 - Monterey, CA
Derek Simmel, dsimmel@psc.edu, Pittsburgh Supercomputing Center

2 TeraGrid Data Transfer Topics
–TeraGrid Network (June 2007)
–TeraGrid Data Kits
–GridFTP
–HPN-SSH
–WAN Filesystems: Lustre-WAN and GPFS-WAN
–Advanced Solutions: Scheduled Data Jobs (DMOVER); Getting data to/from MPPs (PDIO)

3 TeraGrid Network

4 network.teragrid.org

5 Performance Monitoring

6 TeraGrid Data Kits
Data Movement
–GridFTP, HPN-SSH
–TeraGrid Globus deployment includes VDT-contributed improvements to the Globus toolkit
Data Management
–SRB support
WAN Filesystems
–Development: GPFS-WAN, Lustre-WAN
–Future: pNFS with GPFS-WAN & Lustre-WAN client modules

7 TeraGrid GridFTP service
Standard target names
–gridftp.{system}.{site}.teragrid.org
Sets of striped servers
–Most sites have deployed multiple (4~12) data stripe servers per (HPC) system
–Mix of 10GbE and 1GbE deployments
–Multiple data stripe services started on 10GbE GridFTP data transfer servers
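A minimal sketch of a third-party transfer between two such endpoints, assuming host names that follow the gridftp.{system}.{site}.teragrid.org pattern above; the hosts, paths, and flag values are illustrative, not actual TeraGrid endpoints:

# striped, parallel third-party transfer between two GridFTP endpoints
globus-url-copy -stripe -p 4 -tcp-bs 8388608 \
    gsiftp://gridftp.bigben.psc.teragrid.org/scratch/mydata/run01.dat \
    gsiftp://gridftp.dtf.ncsa.teragrid.org/gpfs/ux123456/run01.dat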

8 speedpage.teragrid.org

9 GridFTP Observations
Specify server configuration parameters in an external file (-c option)
–Allows updates to the configuration on the fly between invocations of the GridFTP server
–Facilitates custom setups for dedicated user jobs
Make the server block size parameter match the default (parallel) filesystem block size for the filesystem visible to the GridFTP server
–How to accommodate user-configurable filesystem block sizing (e.g. Lustre)? Don't know yet…
-vb is still broken
–Calculate throughput using time as a wrapper instead (example below)
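A sketch of the time-as-a-wrapper measurement suggested above; the endpoint, path, and file size are illustrative:

# time the whole transfer, then divide bytes moved by elapsed seconds
time globus-url-copy -p 4 -tcp-bs 8388608 \
    file:///scratch/mydata/run01.dat \
    gsiftp://gridftp.bigben.psc.teragrid.org/scratch/run01.dat
# e.g. 10 GB moved in 25 s of "real" time is roughly 400 MB/s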

10 GridFTP server configuration
Recommended striping for TeraGrid sites:
–10GbE: 4- or 6-way striping per interface (not more, since most 10GbE interfaces are limited by the PCI-X bus)
–1GbE: 1 stripe each
Factors:
–The TeraGrid network is uncongested, so multiple stripes/flows are not necessary to mitigate congestion-related loss
–Mix of 10GbE and 1GbE: striping is determined by the receiving server's configuration - 8x1GbE -> 2x10GbE = 2 stripes unless the latter are configured with multiple stripes each

11 globus-url-copy -tcp-bs TCP Buffer Size
–The goal is to make the buffer large enough to hold as many bytes as can typically be in flight between the source and target
Too small: you waste time waiting at the source for responses from the target when you could have been sending data
Too big: you waste time retransmitting packets that were dropped at the target because it ran out of buffer space and/or could not process them fast enough
–The TeraGrid tgcp tool uses TCP buffer size values calculated from measurements between TeraGrid sites over the TeraGrid network
–Autotuning kernels/OSs: Linux kernel 2.6.9 or later, Microsoft Windows Vista; superior performance observed at TACC and ORNL on systems with autotuning enabled
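A sketch of sizing -tcp-bs from the bandwidth-delay product; the link speed, RTT, endpoint, and path are illustrative values, not measured TeraGrid numbers:

# a 1 Gb/s path with a 40 ms RTT can hold about
#   1,000,000,000 b/s * 0.040 s / 8 = 5,000,000 bytes in flight
globus-url-copy -tcp-bs 5000000 -p 4 \
    file:///scratch/mydata/run01.dat \
    gsiftp://gridftp.bigben.psc.teragrid.org/scratch/run01.dat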

12 Other Performance Factors
Other TCP implementation factors that will affect network performance (a tuning sketch follows this list):
–RFC 1323 TCP extensions for high performance
Window scaling - you're limited to a 64KB window maximum without this
Timestamps - protection against wrapped sequence numbers in high-capacity networks
–RFC 2018 SACK (Selective ACK support)
The receiver sends ACKs with information about which packets it has seen, allowing the sender to resend only the missing packets, thus reducing retransmissions
–RFC 1191 Path MTU discovery
Packet sizes should be maximized for the network; MTU = 9000 bytes on the TeraGrid network (mix of 1Gb & 10Gb interfaces)
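A minimal sketch of checking/enabling these features on a Linux end host; the sysctl names are standard Linux 2.6 settings, and the interface name is illustrative:

sysctl -w net.ipv4.tcp_window_scaling=1   # RFC 1323 window scaling
sysctl -w net.ipv4.tcp_timestamps=1       # RFC 1323 timestamps
sysctl -w net.ipv4.tcp_sack=1             # RFC 2018 selective ACK
ip link set dev eth0 mtu 9000             # jumbo frames to match the network MTU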

13 Additional Tuning Resources
TCP tuning guide:
–http://www.psc.edu/networking/projects/tcptune/
Autotuning:
–Jeff Semke, Jamshid Mahdavi, Matt Mathis - 1998 autotuning paper: http://www.psc.edu/networking/ftp/papers/autotune_sigcomm98.ps
–Dunigan Oak Ridge autotuning: http://www.csm.ornl.gov/~dunigan/netperf/auto.html

14 GridFTP 4.1.2 Dev Release
The TeraGrid GIG Data Transfer team is investigating new GridFTP features
–"pipeline" mode to more efficiently transfer large numbers of files
–Automatic data stripe server failure recovery
–sshftp:// - transfers to/from SSH servers
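A hedged sketch of how the sshftp:// transport might be invoked, assuming the dev-release client accepts it like any other URL scheme; the host and paths are illustrative:

globus-url-copy file:///scratch/mydata/run01.dat \
    sshftp://user@login.example.teragrid.org/home/user/run01.dat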

15 HPN-SSH
So what's wrong with SSH?
–Standard SSH is slow in wide area networks
–Internal bottlenecks prevent SSH from using all of the network you have
What is HPN-SSH?
–A set of patches to greatly improve the network performance of OpenSSH
Where do I get it?
–http://www.psc.edu/networking/projects/hpn-ssh
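A hedged sketch of building OpenSSH with the HPN patch set; the OpenSSH version, patch file name, and install prefix are assumptions for illustration, not the exact artifacts distributed at the URL above:

tar xzf openssh-4.6p1.tar.gz
cd openssh-4.6p1
patch -p1 < ../openssh-4.6p1-hpn.diff      # apply the HPN patch set
./configure --prefix=/usr/local/hpn-ssh && make && make install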

16 (Current) Standard SSH

17 The Real Problem with SSH
It is *NOT* the encryption process!
–If it were:
Faster computers would give faster throughput - which doesn't happen.
Transfer rates would be constant in local and wide area networks - which they aren't. In fact, transfer rates seem dependent on RTT: the farther away, the slower the transfer.
Any time rates are strongly linked to RTT, it implies a receive buffer problem.

18 What's the Big Deal?
Receive buffers are used to regulate the data rate of TCP
–The receive buffer determines how much data can be unacknowledged at any one point; the sender will only send that much data *until* it gets an ACK
–If your buffer is set to 64KB, the sender can only send 64KB per round trip, no matter how fast the network actually is

19 How Bad Can It Be? Pretty bad - let's say you have a 64KB receive buffer:
RTT     Link       BDP      Utilization
100ms   10Mbs      125KB    50%
100ms   100Mbs     1.25MB   5%
100ms   1000Mbs    12.5MB   0.5%
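The utilization column is just the 64KB buffer divided by the path's bandwidth-delay product; a quick check of the first row of the table above:

# BDP of a 10 Mb/s, 100 ms path: 10,000,000 b/s * 0.1 s / 8 = 125,000 bytes (~125KB)
# utilization with a 64KB buffer: 65,536 / 125,000 ≈ 0.52, i.e. roughly 50%
echo $(( 10000000 / 8 / 10 ))   # prints the BDP in bytes: 125000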

20 SSH is RWIN Limited
Analysis of the code reveals:
–SSH protocol v2 is multiplexed: multiple channels over one TCP connection
–It must implement a flow control mechanism per channel, essentially the same as the TCP receive window
–This application-level RWIN is effectively set to 64KB, so the real connection RWIN is MIN(TCPrwin, SSHrwin)
–Thus TPUTmax = 64KB/RTT

21 Solving the Problem
Use getsockopt() to get TCP(rwin) and dynamically set SSH(rwin)
–Performed several times throughout the transfer to handle autotuning kernels
Results in 10x to 50x faster throughput on a well-tuned system, depending on the cipher used

22 HPN-SSH versus SSH

23 HPN-SSH Advantages
Users already know how to use scp
–Keys, ~/.ssh/config file preferences & shortcuts
Speed is roughly comparable to single-stripe GridFTP and Kerberized FTP
Use existing authentication infrastructure
–GSISSH now includes the HPN patches
–Do both GSI and Kerberos authn with MechGlue
Can be used with other applications - rsync, svn, SFTP, ssh port forwarding, etc. (usage sketch below)
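A brief usage sketch with an HPN-patched OpenSSH installed alongside the system ssh; the install prefix, host name, and paths are illustrative:

export PATH=/usr/local/hpn-ssh/bin:$PATH
scp bigrun.tar login.bigben.psc.teragrid.org:/scratch/mydata/        # same scp syntax users already know
rsync -av -e /usr/local/hpn-ssh/bin/ssh results/ login.bigben.psc.teragrid.org:results/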

24 HPN-SSH Issues
Users are accustomed to using scp/sftp to transfer files to/from login nodes
–Now that HPN-scp can be a bandwidth hog like GridFTP, interactive login nodes are no longer the best place for it
–3rd-party transfer: scp a:file b:file2 = (ssh a; scp a:file b:file2)
It may be tricky to configure hosts on which you don't want to give interactive ssh access

25 TeraGrid HPN-SSH service
Currently available with the default SSH service on many HPC login hosts
–Login hosts running current GSISSH include the HPN-SSH patches
Likely to move HPN-SSH to dedicated data service nodes
–e.g. existing GridFTP data server pools

26 WAN Filesystems
A common filesystem (or at least the transparent semblance of one) is one of the enhancements most commonly requested by TeraGrid users
WAN Filesystems on TeraGrid:
–Lustre-WAN
–GPFS-WAN

27 TeraGrid Lustre-WAN
Active TeraGrid sites include PSC, Indiana Univ., and ORNL; NCAR to be added soon
We've seen good performance across the TeraGrid network
–As high as 977MB/s for a single client over 10GbE
2 active Lustre-WAN filesystems (PSC & IU)
Currently experimenting with an alpha version of Lustre that supports Kerberos authentication, encryption (metadata, data) & UID mapping
–Uses some of the NFSv4 infrastructure built by UMICH

28 TeraGrid GPFS-WAN
700TB GPFS-WAN filesystem housed at the San Diego Supercomputer Center
Currently mounted across the TeraGrid network at SDSC, NCSA, ANL and PSC
Divided into three categories:
–Collections: 150TB
–Projects: 475TB - user projects apply for space
–Scratch: 75TB (purged periodically)
Note: GPFS-WAN filesystems are not backed up

29 Advanced Solutions
TeraGrid staff actively work on custom solutions to meet the needs of the NSF user community
Examples:
–DMOVER
–Parallel Direct I/O (PDIO)

30 Scheduled Data Jobs
Traditional HPC batch jobs waste CPU allocations staging data in and out
–Why be charged CPU hours for thousands of CPUs sitting idle while data is moved?
Goals:
–Schedule data movement as its own separate job
–Exploit opportunities for parallelism to reduce transfer time
–Co-schedule with HPC application jobs as needed
Approach: create a "canned" job that users can run to instantiate a file transfer service

31 DMOVER
Designed for use on lemieux.psc.edu to allow data movement in/out of /scratch
Data is relayed from lemieux.psc.edu compute nodes via the interconnect to Access Gateway nodes on the WAN
A portable DMOVER edition is currently under development for use on other TeraGrid platforms

32 DMOVER job script example

#PBS -l rmsnodes=4:4
#PBS -l agw_nodes=4

# root of the file(s)/directory(s) to transfer (a convenience)
export SrcDirRoot=$SCRATCH/mydata/

# path to the target sources, relative to SrcDirRoot (wildcards allowed)
export SrcRelPath="*.dat"

# destination host name(s) (one or more, used round-robin; no spaces in the list)
export DestHost="tg-c001.sdsc.teragrid.org,tg-c002.sdsc.teragrid.org,tg-c003.sdsc.teragrid.org,tg-c004.sdsc.teragrid.org"

# root of the file(s)/directory(s) at the other side (destination path)
export DestDirRoot=/gpfs/ux123456/mydata/

# run the process manager
/scratcha1/dmover/dmover_process_manager.pl "$SrcDirRoot" "$SrcRelPath" "$DestHost" "$DestDirRoot" "$RMS_NODES"

33 DMOVER Process Manager Perl Script

for ($i=0; $i<=$#file; $i++){
    # pick host IDs, unless we just got them from wait()
    if ($i<$nStreams){
        $shostID = $i % $ENV{'RMS_NODES'};
        $dhostID = $i % ($#host+1);
    }
    $dest = $host[$dhostID];    # resolve the destination host for this stream
    # command to launch the transfer agent
    $cmd = "prun -N 1 -n 1 -B `offset2base $shostID` $DMOVERHOME/dmover_transfer.sh $SrcDirRoot $file[$i] $dest $DestDirRoot $shostID";
    $child = fork();
    if ($child){
        $cid{$child}[0] = $shostID;
        $cid{$child}[1] = $dhostID;
    }
    if (!$child){
        $ret = system($cmd);
        exit($ret);             # child exits once its transfer completes
    }
    # keep the number of streams constant
    if ($nStreams<=$i+1){
        $pid = wait;
        # re-use whichever source host just finished...
        $shostID = $cid{$pid}[0];
        # re-use whichever remote host just finished...
        $dhostID = $cid{$pid}[1];
        delete($cid{$pid});
    }
}
# wait for the remaining children to finish
while (-1 != wait){ sleep(1); }

34 DMOVER Transfer Agent

export X509_USER_PROXY=$HOME/.proxy
export GLOBUS_LOCATION=/usr/local/globus/globus-2.4.3
export GLOBUS_HOSTNAME=`/bin/hostname -s`.psc.edu
. $GLOBUS_LOCATION/etc/globus-user-env.sh

# set up Qsockets
. $DMOVERHOME/agw_setup.sh $5

SrcDirRoot=$1
SrcRelPath=$2
DestHost=$3
DestDirRoot=$4

args="-tcp-bs 8388608"
cmd="$GLOBUS_LOCATION/bin/globus-url-copy $args file://$SrcDirRoot/$SrcRelPath gsiftp://$DestHost/$DestDirRoot/$SrcRelPath"
echo `/bin/hostname -s` : $cmd
time agw_run $cmd

35 What about MPPs?

36 Getting Data to/from MPPs
Massively-Parallel Processing (MPP) systems (e.g., bigben.psc.teragrid.org - Cray XT3) do not have per-node connectivity to WANs
–Nodes only run a microkernel (e.g. Cray Catamount)
Users need a way to:
–Stream data into and out from a running application
–Steer their running application
Approach:
–Dedicate nodes in a running job to data I/O
–Relay data in/out of the system via a proxy service on a dedicated WAN I/O node

37 Portals Direct I/O (PDIO)
Remote Virtual File System middleware
–The user calls pdio_write() on a compute node*
–Data is routed to the external network and written on any remote host on the WAN in real time (while your simulation is running)
–Useful for live demos, interactive steering, remote post-processing, checkpointing, etc.
–A new development beta simply hooks standard POSIX file I/O
No need for users to customize source code - just relink with the PDIO library (relink sketch below)
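A hedged sketch of the relink-only workflow described above; the library name (-lpdio), install path, and compiler wrapper are assumptions for illustration, not the documented PDIO build procedure:

# rebuild the unmodified application against the PDIO library on the XT3
cc -o ppm_sim ppm_sim.o -L/usr/local/pdio/lib -lpdio
qsub ppm_job.pbs    # run as usual; writes are relayed off the machine while the job runs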

38 Before PDIO (diagram): the PPM simulation runs on Cray XT3 "BigBen" compute nodes and writes only through the machine's I/O nodes; the render/visualization cluster at the remote site is disconnected from the running computation across the ETF WAN, so there is no steering.

39 After PDIO (diagram): a pdiod service on the I/O nodes handles compute-node I/O, Portals-to-TCP routing, and WAN filesystem virtualization; data flows across the ETF WAN to a receiver feeding the render/steering viz server at the remote site, enabling real-time remote control and I/O for the running PPM simulation.

40 pdio_write() Performance: Pittsburgh, PA to Tampa, FL (6% less than local)

41 Scientific Applications & User Communities using PDIO
–Hercules: Earthquake Modeling - J. Bielak, et al.
–PPM: Solar Turbulence - P. Woodward, et al.
–Nektar: Arterial Blood Flow - G. Karniadakis, et al.

42 Acknowledgements
–Chris Rapier, PSC: HPN-SSH research, development and presentation materials
–Kathy Benninger, PSC: network analysis of TeraGrid GridFTP server behavior, transfer performance and TCP tuning recommendations
–Doug Balog, PSC; Steve Simms, Indiana U.: Lustre-WAN
–Nathan Stone, PSC: DMOVER, Parallel Direct I/O

