USATLAS Network/Storage and Load Testing
Jay Packard, Dantong Yu
Brookhaven National Lab
Outline
- USATLAS network/storage infrastructures: the platform for performing load tests.
- Load test motivation and goals.
- Load test status overview.
- Critical components in load testing: control and monitoring; network monitoring and weather maps.
- Detailed plots for single- vs. multiple-host load tests.
- Problems.
- Proposed solutions: network research and its role in dynamic Layer 2 circuits between BNL and US ATLAS Tier 2 sites.
BNL 20 Gig-E Architecture
- Based on the Cisco 6513.
- 20 Gbps LAN for the LHCOPN; 20 Gbps for production IP.
- Full redundancy: can survive the failure of any network switch.
- No firewall for the LHCOPN, as shown by the green lines in the diagram.
- Two firewalls for all other IP networks: the Cisco Firewall Services Module (FWSM), a line card plugged into the Cisco chassis with 5 x 1 Gbps capacity, allows outgoing connections.
dCache and Network Integration
[Diagram: logical connections among the HPSS mass storage system (20 Gb/s), dCache SRM and core servers, gridftp doors (8 nodes, 8 x 1 Gb/s), the 2 x 10 Gb/s WAN and BNL LHCOPN VLAN, Tier 1 VLANs, write pools, the farm pool (434 nodes / 360 TB), the T0 export pool (>= 30 nodes), a new farm pool (80 nodes, 360 TB raw), Thumpers (30 nodes, 720 TB raw), ESnet, load-testing hosts, and the new Panda and Panda DB; FTS-controlled and srmcp paths are indicated.]
5 Tier 2 Network Example: ATLAS Great Lakes Tier 2
Need More Details of Tier 2 Network/Storage Infrastructures
We hope to see, in the site reports, architectural maps from each Tier 2 describing how the Tier 2 network is integrated with the production and testing storage systems.
Goal
- Develop a toolkit for testing and viewing I/O performance at various middleware layers (network, gridftp, FTS) in order to isolate problems.
- Single-host transfer optimization at each layer: 120 MB/s is the target for memory-to-memory transfers and for high-performance storage; 40 MB/s is the target for disk transfers to a regular worker node.
- Multi-host transfer optimization for sites with 10 Gbps connectivity. Starting point: sustained 200 MB/s disk-to-disk transfer for 10 minutes between the Tier 1 and each Tier 2 (Rob Gardner). Then increase the disk-to-disk transfer rate to 400 MB/s.
- For sites with a 1 Gbps bottleneck, the goal is to max out the network capacity (a link-rate sanity check follows below).
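As a rough sanity check (illustrative arithmetic, decimal units, protocol overhead ignored), these targets line up with the underlying link rates:

```latex
1~\text{Gb/s} = \tfrac{1000}{8}~\text{MB/s} = 125~\text{MB/s}
  \;\Rightarrow\; \text{120 MB/s memory-to-memory is essentially 1 GbE line rate,}
\qquad
10~\text{Gb/s} = 1250~\text{MB/s}
  \;\Rightarrow\; \text{200--400 MB/s disk-to-disk uses roughly 16--32\% of a 10 Gbps uplink.}
```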
Status Overview
- A MonALISA control application has been developed for specifying single-host transfers: protocol, duration, size, stream range, TCP buffer range, etc. It is currently run only by Jay Packard at BNL, but may eventually be run by multiple administrators at other sites within the MonALISA framework.
- A MonALISA monitoring plugin has been developed to display current results as graphs. They are available in the MonALISA client (http://monalisa.cacr.caltech.edu/ml_client/MonaLisa.jnlp) and will soon be available on a web page.
Status Overview (continued)
- Single-host tests have been running for the past few months. Types:
  - Network memory to memory (using iperf)
  - Gridftp memory to memory (using globus-url-copy)
  - Gridftp memory to disk (using globus-url-copy)
  - Gridftp disk to disk (using globus-url-copy)
- At least one host at each site has been TCP tuned, which has shown dramatic improvements in the graphs at some sites (e.g. 5 MB/s to 100 MB/s for iperf tests); typical kernel settings are sketched below.
  - If a Tier 2 has 10 Gbps connectivity, there is a significant improvement for a single TCP stream, from 50 Mbps to close to 1 Gbps (IU, UC, BU, UMich).
  - If a Tier 2 has a 1 Gbps bottleneck, network performance can be improved with multiple TCP streams; simply tuning the TCP buffer size cannot improve single-stream performance because of bandwidth competition.
- Problems discovered: a dirty fiber, CRC errors on a network interface, and moderate TCP buffer sizes; details can be found in Shawn's talk.
- Coordinating between Michigan and BNL (Hiro Ito, Shawn McKee, Robert Gardner, Jay Packard) to measure and optimize total throughput using FTS disk-to-disk. We are trying to leverage high-performance storage (the Thumper at BNL and the Dell NAS at Michigan) to achieve our goal.
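The slides do not list the exact kernel parameters that were changed; the following is a minimal sketch of the kind of Linux sysctl settings such TCP tuning typically involves, with illustrative (not site-specific) values:

```
# /etc/sysctl.conf -- illustrative TCP tuning for wide-area transfers
net.core.rmem_max = 16777216              # max receive socket buffer (16 MB)
net.core.wmem_max = 16777216              # max send socket buffer (16 MB)
net.ipv4.tcp_rmem = 4096 87380 16777216   # min / default / max TCP receive buffer
net.ipv4.tcp_wmem = 4096 65536 16777216   # min / default / max TCP send buffer
```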
MonALISA Control Application
- Our Java class implements MonALISA's AppInt interface as a plug-in (currently about 900 lines of code). It does the following:
  - Generates and prepares source files for disk-to-disk transfers
  - Starts the remote iperf server and the local iperf client, using globus-job-run remotely and ssh locally
  - Runs iperf or gridftp for a period of time and collects the output
  - Parses the output for average and maximum throughput
  - Generates output understood by the monitoring plugin
  - Cleans up destination files
  - Stops the iperf servers
- The code is flexible to account for heterogeneous sites (e.g., killing iperf is done differently on a managed-fork gatekeeper; one site runs BWCTL instead of iperf). This flexibility requires frequently watching the application's output and augmenting the code to handle many circumstances. A simplified sketch of the core measurement step appears below.
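The plug-in code itself is not reproduced in the slides; the following is a hedged, self-contained sketch (not the real AppInt implementation) of the core step described above: run an iperf client against a remote server for a fixed period, parse the reported throughput, and emit log lines in the spirit of what the monitoring plugin reads. The host name, test parameters, and output format details are illustrative.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Simplified stand-in for the load-test control step: run an iperf client
 * for a fixed duration, parse its throughput reports, and print result
 * lines in the (time, site, module, host, statistic, value) style used by
 * the monitoring plugin.  Not the real MonALISA AppInt plug-in.
 */
public class IperfProbe {

    // Matches iperf throughput reports such as "... 941 Mbits/sec"
    private static final Pattern RATE =
            Pattern.compile("([0-9.]+)\\s+([KMG])bits/sec");

    public static void main(String[] args) throws Exception {
        String remoteHost = args.length > 0 ? args[0] : "dct00.usatlas.bnl.gov"; // illustrative
        int seconds = 120;    // 2-minute test window, as in the slides
        int streams = 4;      // parallel TCP streams
        String window = "8M"; // requested TCP buffer size

        // Assumes an iperf server is already listening on the remote host
        // (the real application starts it via globus-job-run).
        ProcessBuilder pb = new ProcessBuilder(
                "iperf", "-c", remoteHost, "-t", String.valueOf(seconds),
                "-P", String.valueOf(streams), "-w", window, "-i", "10");
        pb.redirectErrorStream(true);
        Process p = pb.start();

        List<Double> samplesMbps = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                Matcher m = RATE.matcher(line);
                if (m.find()) {
                    double value = Double.parseDouble(m.group(1));
                    String unit = m.group(2);
                    double mbps = unit.equals("G") ? value * 1000
                                : unit.equals("K") ? value / 1000 : value;
                    samplesMbps.add(mbps);
                }
            }
        }
        p.waitFor();

        // Average and maximum throughput, converted from Mbits/s to MB/s.
        double avg = samplesMbps.stream().mapToDouble(d -> d).average().orElse(0) / 8.0;
        double max = samplesMbps.stream().mapToDouble(d -> d).max().orElse(0) / 8.0;

        // Example output lines in the spirit of the monitoring plugin's format.
        System.out.printf("%d, SITE, Loadtest, local->%s, network_m2m_avg, %.2f%n",
                System.currentTimeMillis(), remoteHost, avg);
        System.out.printf("%d, SITE, Loadtest, local->%s, network_m2m_max, %.2f%n",
                System.currentTimeMillis(), remoteHost, max);
    }
}
```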
MonALISA Control Application (continued)
- Generates the average and maximum throughput over a 2-minute interval; this interval is needed for the throughput to "ramp up".
- Sample configuration for gridftp memory-to-disk (a note on the buffer values follows below):
  command=gridftp_m2d
  startHours=4,16
  envScript=/opt/OSG_060/setup.sh
  fileSizeKB=5000000
  streams=1, 2, 4, 8, 12
  repetitions=1
  repetitionDelaySec=1
  numSrcHosts=1
  timeOutSec=120
  tcpBufferBytes=4000000, 8000000
  hosts=dct00.usatlas.bnl.gov, atlas-g01.bu.edu/data5/dq2-cache/test/, atlas.bu.edu/data5/dq2-cache/test/, umfs02.grid.umich.edu/atlas/data08/dq2/test/, umfs05.aglt2.org/atlas/data16/dq2/test/, dq2.aglt2.org/atlas/data15/mucal/test/, iut2-dc1.iu.edu/pnfs/iu.edu/data/test/, uct2-dc1.uchicago.edu/pnfs/uchicago.edu/data/ddm1/test/, gk01.swt2.uta.edu/ifs1/dq2_test/storageA/test/, tier2-02.ochep.ou.edu/ibrix/data/dq2-cache/test/, ouhep00.nhn.ou.edu/raid2/dq2-cache/test/, osgserv04.slac.stanford.edu/xrootd/atlas/dq2/tmp/
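As a side note not on the slide, the 4 MB and 8 MB tcpBufferBytes values are about what the bandwidth-delay product suggests for a fully used 1 Gbps path; the round-trip times below are illustrative assumptions, not measured BNL-to-Tier-2 values:

```latex
\text{BDP} = \text{bandwidth} \times \text{RTT}:
\qquad 125~\text{MB/s} \times 32~\text{ms} \approx 4~\text{MB},
\qquad 125~\text{MB/s} \times 64~\text{ms} \approx 8~\text{MB}.
```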
MonALISA Monitoring Application
- A Java class that implements MonALISA's MonitoringModule interface. Much simpler than the control application (only about 180 lines of code).
- Parses the log file produced by the control application, whose format is (time, site name, module, host, statistic, value), e.g. (a parsing sketch follows below):
  1195623346000, BNL_ITB_Test1, Loadtest, bnl->uta(dct00->ndt), network_m2m_avg_01s_08m, 6.42
  (01s = 1 stream, 08m = TCP buffer size of 8 MB)
- The data are pulled by the MonALISA server, which displays graphs on demand.
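A hedged sketch of how such a log line could be split into fields and the statistic name decoded; the class name and variable names are illustrative, not the plugin's actual types:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative parser for one load-test log line; not the actual MonitoringModule code. */
public class LoadTestLineParser {

    // e.g. statistic "network_m2m_avg_01s_08m": 01s = 1 stream, 08m = 8 MB TCP buffer
    private static final Pattern STAT =
            Pattern.compile("(\\w+_(?:avg|max))_(\\d+)s_(\\d+)m");

    public static void main(String[] args) {
        String line = "1195623346000, BNL_ITB_Test1, Loadtest, "
                    + "bnl->uta(dct00->ndt), network_m2m_avg_01s_08m, 6.42";

        String[] f = line.split(",\\s*");   // time, site, module, host, statistic, value
        long   time      = Long.parseLong(f[0]);
        String site      = f[1];
        String hostPair  = f[3];
        String statistic = f[4];
        double mbPerSec  = Double.parseDouble(f[5]);

        Matcher m = STAT.matcher(statistic);
        if (m.matches()) {
            String metric   = m.group(1);                    // e.g. network_m2m_avg
            int streams     = Integer.parseInt(m.group(2));  // number of TCP streams
            int tcpBufferMB = Integer.parseInt(m.group(3));  // TCP buffer size in MB
            System.out.printf("%s %s %s: %.2f MB/s (%d stream(s), %d MB TCP buffer) at %d%n",
                    site, hostPair, metric, mbPerSec, streams, tcpBufferMB, time);
        }
    }
}
```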
Single-host Tests
There are too many graphs to show them all, but two key graphs are shown here. For one stream:
Single-host Tests (continued)
For 12 streams (notice the disk-to-disk improvement):
Multi-host Tests
- Using FTS to perform tests from BNL to Michigan initially, and then to the other Tier 2 sites.
- The goal is a sustained 200 MB/s disk-to-disk transfer for 10 minutes from the Tier 1 to each Tier 2; this can be in addition to existing traffic.
- Trying to find the optimum number of streams and TCP buffer size by finding the optimum for a single-host transfer between two high-performance machines.
- Disk-to-disk, one-stream performance from BNL's Thumper to Michigan's Dell NAS is low, 2 MB/s, whereas iperf memory-to-memory with one stream gives 66 MB/s between the same hosts (Nov 21, 2007). Should this be higher for one stream?
- Found that more streams give higher throughput, but we cannot use too many, especially with a large TCP buffer size, or applications will crash.
- Disk-to-disk throughput is currently so low that a larger TCP buffer does not matter. A rough back-of-the-envelope on what the 200 MB/s goal implies is given below.
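For illustration (not stated on the slide), the 10-minute goal implies the following data volume, and the 40 MB/s per-node disk target from the Goal slide gives a rough count of the concurrent, well-tuned door/pool pairs needed:

```latex
200~\text{MB/s} \times 600~\text{s} = 120{,}000~\text{MB} \approx 120~\text{GB in 10 minutes},
\qquad
\frac{200~\text{MB/s}}{40~\text{MB/s per disk node}} = 5~\text{concurrent pairs}.
```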
Multi-host Tests and Monitoring
- Monitoring uses netflow graphs rather than MonALISA, available at http://netmon.usatlas.bnl.gov/netflow/tier2.html.
- Some sites will likely require the addition of more storage pools and doors, each TCP tuned, to achieve the goal.
Problems
- Getting reliable testing results amidst existing traffic.
  - Each test runs for a couple of minutes and produces several samples, so hopefully a window exists when the traffic is low, during which the maximum is attained.
  - The applications could be changed to output the maximum of the last few tests (tricky to implement).
  - Use dedicated network circuits: TeraPaths.
- Disk-to-disk bottleneck.
  - Not sure whether the problem is the hardware or the storage software (e.g. dCache, Xrootd).
  - FUSE (Filesystem in Userspace), or an in-memory filesystem, could help isolate storage-software degradation; Bonnie could help isolate hardware degradation (a crude throughput check is sketched below). Is there anyone who could offer disk-performance expertise?
  - Discussed in Shawn McKee's presentation, "Optimizing USATLAS Data Transfers".
- Progress is happening slowly due to a lack of in-depth coordination, scheduling difficulties, and a lack of manpower (Jay is at roughly 1/3 FTE).
  - There is too much on the agenda at the Computing Integration and Operations meeting to allow for in-depth coordination.
  - Ideas for improvement
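Bonnie is the natural tool here; purely as an illustration of the kind of quick check it automates, the following hedged sketch times a large sequential write outside the storage software, to help tell raw-disk limits apart from dCache/Xrootd overhead. The path and file size are placeholders.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Crude sequential-write throughput check; an illustration, not a Bonnie replacement. */
public class DiskWriteCheck {
    public static void main(String[] args) throws IOException {
        Path target = Paths.get(args.length > 0 ? args[0] : "/tmp/disk-write-check.dat"); // placeholder path
        long totalBytes = 2L * 1024 * 1024 * 1024;   // 2 GiB; use more than RAM to defeat the page cache
        byte[] chunk = new byte[8 * 1024 * 1024];    // 8 MiB write chunks

        long start = System.nanoTime();
        try (FileOutputStream out = new FileOutputStream(target.toFile())) {
            for (long written = 0; written < totalBytes; written += chunk.length) {
                out.write(chunk);
            }
            out.getFD().sync();                      // force data to disk before stopping the clock
        }
        double seconds = (System.nanoTime() - start) / 1e9;

        System.out.printf("Wrote %.1f MB in %.1f s -> %.1f MB/s%n",
                totalBytes / 1e6, seconds, totalBytes / 1e6 / seconds);
        Files.deleteIfExists(target);                // clean up the test file
    }
}
```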
TeraPaths and Its Role in Improving Network Connectivity between BNL and US ATLAS Tier 2 Sites
- The problem: support efficient, reliable, predictable peta-scale data movement over modern high-speed networks.
  - Multiple data flows with varying priority.
  - Default "best effort" network behavior can cause performance and service-disruption problems.
- Solution: enhance network functionality with QoS features to allow prioritization and protection of data flows.
  - Treat the network as a valuable resource.
  - Schedule network usage (how much bandwidth, and when).
  - Techniques: DiffServ (DSCP), PBR, MPLS tunnels, dynamic circuits (VLANs); a small DSCP-marking example follows below.
- Collaboration with ESnet (OSCARS) and Internet2 (DRAGON) to dynamically create end-to-end paths and dynamically forward traffic into those paths.
- Software is being deployed to US ATLAS Tier 2 sites.
  - Option 1, Layer 3: MPLS tunnels (UMich and SLAC).
  - Option 2, Layer 2: VLANs (BU, UMich; demonstrated at SC'07).
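The slides name DiffServ/DSCP only as a technique; as a hedged illustration of what DSCP marking looks like from the application side, Java lets a sender request a DSCP value on its socket (whether routers honor or rewrite it depends entirely on the TeraPaths/site network configuration). The host and port below are placeholders, and this is not TeraPaths code.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Illustrative DSCP marking of an application's TCP flow; not TeraPaths code. */
public class DscpMarkedTransfer {
    public static void main(String[] args) throws IOException {
        int dscpEF = 46;                 // Expedited Forwarding, a common high-priority class
        int tosByte = dscpEF << 2;       // DSCP occupies the upper 6 bits of the IP TOS byte

        try (Socket socket = new Socket()) {
            // Request the traffic class before connecting; the network may ignore or rewrite it.
            socket.setTrafficClass(tosByte);
            socket.connect(new InetSocketAddress("gridftp.example.org", 2811), 5000); // placeholders
            socket.getOutputStream().write("marked transfer payload".getBytes());
        }
    }
}
```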
Northeast Tier 2 Dynamic Network Links
Questions?