Slide 1: End-to-End Performance Tuning and Best Practices
Moderator: Charlie McMahon, Tulane University
Panelists: Jan Cheetham, University of Wisconsin-Madison; Chris Rapier, Pittsburgh Supercomputing Center; Paul Gessler, University of Idaho; Maureen Dougherty, USC
Wednesday, September 29, 2015
Slide 2: Paul Gessler, Professor & Director, Northwest Knowledge Network, University of Idaho
Slide 3: Enabling 10 Gbps connections to the Idaho Regional Optical Network
– UI Moscow campus network core
– Northwest Knowledge Network and DMZ
– DOE's Idaho National Lab
– Idaho Institute for Biological and Evolutionary Studies
Implemented perfSONAR monitoring over these connections
Slide 4: (image only; no text captured)
Slide 5: (image only; no text captured)
Slide 6: Jan Cheetham, Research and Instructional Technologies Consultant, University of Wisconsin-Madison
Slide 7: University of Wisconsin Campus Network
(Network diagram: HEP, Biotech, IceCube, SSEC, Engineering, LOCI, WID, WEI, CHTC; campus network distribution; Science DMZ; Internet2 Innovation Network at 100G; perfSONAR)
Slide 8: Diagnosing Network Issues
perfSONAR helps uncover problems with:
– TCP window size issues on transfers to San Diego (see the throughput sketch below)
– An optical fiber cut affecting a latency-sensitive link between SSEC and NOAA
– A line card failure resulting in dropped packets on a research partner's (WID) LAN
– Transfers from internal data stores to distributed computing resources (HTCondor pools)
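A rough way to see why the first item matters: a single TCP stream cannot exceed its window divided by the round-trip time, no matter how much link capacity is available. A minimal sketch in Python, with hypothetical window and RTT values rather than measurements from the panel:

def window_limited_throughput_gbps(window_bytes: float, rtt_ms: float) -> float:
    """Upper bound on single-stream TCP throughput when the window is the
    limiting factor: throughput <= window / RTT."""
    rtt_s = rtt_ms / 1000.0
    return window_bytes * 8 / rtt_s / 1e9

# Hypothetical example: a 4 MiB window over a ~60 ms path to San Diego caps a
# single stream near 0.56 Gbps, regardless of a 10G or 100G link underneath.
print(f"{window_limited_throughput_gbps(4 * 1024**2, 60):.2f} Gbps")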
Slide 9: Dealing with Firewalls
– Tension: high-performance research computing can't use a firewall, but the campus security baseline for research computing requires being behind one
– Upgrade the firewall to a high-speed backplane to allow 10G throughput to campus, in preparation for the campus network upgrade
– Plan to use SDN to shunt some traffic (identified uses within our security policy)
Slide 10: Challenges
– 100 GE line card failure (buffer overflow being investigated)
– Separating spiky research traffic from the rest of campus network traffic
– Distributed campus: getting the word out so everyone can take advantage
– Limitations of researchers' internal network environments
– Storage bottlenecks
Slide 11: Chris Rapier, Senior Research Programmer, Pittsburgh Supercomputing Center
Slide 12: XSight & Web10G
Goal: use the metrics provided by Web10G to improve workflows through early identification of pathological flows.
– A distributed set of Web10G-enabled listeners on Data Transfer Nodes across multiple domains
– Gather data on all flows of interest and collate it in a centralized database
– Analyze the data to find marginal and failing flows (a hedged example heuristic follows below)
– Provide the NOC with actionable data in near real time
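The slide does not say what marks a flow as marginal or failing. As one hedged illustration of the kind of rule an analysis pass could apply, the sketch below flags flows by retransmission ratio and achieved throughput; the field names, thresholds, and sample values are hypothetical, not XSight's actual rules or Web10G's exact variable names:

from dataclasses import dataclass

@dataclass
class FlowSample:
    """Simplified per-flow counters of the kind a TCP instrumentation layer
    such as Web10G can expose (field names are illustrative)."""
    flow_id: str
    bytes_sent: int
    segs_out: int
    segs_retrans: int
    duration_s: float

def classify(sample: FlowSample,
             retrans_threshold: float = 0.01,
             min_gbps: float = 1.0) -> str:
    """Flag flows with heavy retransmission or low achieved throughput."""
    retrans_ratio = sample.segs_retrans / max(sample.segs_out, 1)
    gbps = sample.bytes_sent * 8 / max(sample.duration_s, 1e-9) / 1e9
    if retrans_ratio > retrans_threshold:
        return "failing"    # persistent loss; candidate for NOC attention
    if gbps < min_gbps:
        return "marginal"   # completing, but far below the expected rate
    return "healthy"

# Hypothetical flow: 10 GB in 60 s with 0.2% retransmissions (about 1.3 Gbps).
print(classify(FlowSample("dtn1->dtn2", 10 * 10**9, 7_000_000, 14_000, 60.0)))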
Slide 13: Implementation
– Listener: a C application that periodically polls all TCP flows and applies a rule set
– Database: InfluxDB, a time-series database (a sketch of writing flow samples to it follows below)
– Analysis engine: currently applies a heuristic approach; development of models is in progress
– UI: web-based logical map that lets engineers drill down to failing flows and display the collected metrics
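XSight's listener is a C application, but the shape of the pipeline (sample a flow, tag it, write it to a central InfluxDB) can be sketched with the influxdb-python client for InfluxDB 1.x. The host, database, measurement, and field names here are assumptions for illustration, not the project's schema:

from datetime import datetime, timezone
from influxdb import InfluxDBClient  # InfluxDB 1.x client library

# Hypothetical central collection point; the real deployment spans domains.
client = InfluxDBClient(host="collector.example.org", port=8086,
                        database="xsight_flows")

def report_flow(src: str, dst: str, bytes_sent: int,
                segs_out: int, segs_retrans: int) -> None:
    """Write one per-flow sample as a point in a time-series measurement."""
    point = {
        "measurement": "tcp_flow",         # illustrative measurement name
        "tags": {"src": src, "dst": dst},  # indexed dimensions
        "time": datetime.now(timezone.utc).isoformat(),
        "fields": {
            "bytes_sent": bytes_sent,
            "segs_out": segs_out,
            "segs_retrans": segs_retrans,
        },
    }
    client.write_points([point])

# Example: one polling interval's worth of counters for a single flow.
report_flow("dtn1.site-a.edu", "dtn4.site-b.edu", 2_500_000_000, 1_750_000, 120)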
Slide 14: Results
– Analysis engine and UI are still in development
– Looking for partners for listener deployment (including NOCs)
– Six months left under the EAGER grant; will be seeking to renew the grant
Slide 15: Maureen Dougherty, Director, Center for High-Performance Computing, USC
Slide 16: Trojan Express Network II
Goal: develop a next-generation research network in parallel with the production network to address increasing research data transfer demands
– Leverage the existing 100G Science DMZ
– Instead of expensive routers, use cheaper high-end network switches
– Use OpenFlow running on a server to control the switch (a minimal controller sketch follows below)
– perfSONAR systems for metrics and monitoring
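The slide says only that OpenFlow runs on a server to control the switch. One common way to do that in this era was a Python controller framework such as Ryu (an assumption here, not something the slide names). The sketch below installs simple cross-connect rules on an OpenFlow 1.3 switch when it connects; the port numbers are hypothetical placeholders, and this illustrates the approach rather than USC's actual controller:

from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import CONFIG_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3

class ScienceDMZForwarder(app_manager.RyuApp):
    """Push static forwarding rules to the research-network switch."""
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPSwitchFeatures, CONFIG_DISPATCHER)
    def switch_features_handler(self, ev):
        datapath = ev.msg.datapath
        ofproto = datapath.ofproto
        parser = datapath.ofproto_parser
        # Cross-connect a DTN-facing port and a WAN-facing port
        # (ports 1 and 2 are placeholders).
        for in_port, out_port in ((1, 2), (2, 1)):
            match = parser.OFPMatch(in_port=in_port)
            actions = [parser.OFPActionOutput(out_port)]
            inst = [parser.OFPInstructionActions(
                ofproto.OFPIT_APPLY_ACTIONS, actions)]
            mod = parser.OFPFlowMod(datapath=datapath, priority=10,
                                    match=match, instructions=inst)
            datapath.send_msg(mod)

A controller like this would be started with ryu-manager, with the switch configured to point at the server's address.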
Slide 17: Trojan Express Network Buildout
Slide 18: Collaborative Bandwidth Tests
– 72.5 ms round trip between USC and Clemson over a 100 Gbps shared link
– 12-machine OrangeFS cluster at USC, each machine directly connected to a Brocade switch at 10 Gbps
– 12 clients at Clemson
– USC ran nuttcp sessions between pairs of USC and Clemson hosts (see the driver sketch below)
– Clemson ran file copies to the USC OrangeFS cluster
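The pairwise nuttcp runs can be driven from a short script. Everything below is an assumption for illustration: the host names are placeholders, nuttcp servers (nuttcp -S) are presumed to already be listening on the receiving side, and a real run would add the window and duration options used for the test (check the local nuttcp man page rather than trusting this sketch):

import subprocess

# Placeholder host lists; one nuttcp client per USC/Clemson pair.
usc_hosts = [f"dtn{i}.usc.example.edu" for i in range(1, 13)]
clemson_hosts = [f"dtn{i}.clemson.example.edu" for i in range(1, 13)]

def run_pairwise_tests() -> None:
    """Launch one nuttcp client per host pair in parallel and print each
    pair's output; aggregate throughput is the sum across pairs."""
    procs = []
    for src, dst in zip(clemson_hosts, usc_hosts):
        # Launch the client on src over ssh, pointed at the server on dst.
        cmd = ["ssh", src, "nuttcp", dst]
        procs.append((src, dst,
                      subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)))
    for src, dst, proc in procs:
        out, _ = proc.communicate()
        print(f"{src} -> {dst}: {out.strip()}")

if __name__ == "__main__":
    run_pairwise_tests()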
Slide 19: Linux Network Configuration
Bandwidth-delay product: 72.5 ms x 10 Gbit/s = 90,625,000 bytes (about 90 MB; see the calculator sketch below)
net.core.rmem_max = 96468992
net.core.wmem_max = 96468992
net.ipv4.tcp_rmem = 4096 87380 96468992
net.ipv4.tcp_wmem = 4096 65536 96468992
net.ipv4.tcp_congestion_control = yeah
Jumbo frames enabled (MTU 9000)
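The buffer ceilings above follow from the bandwidth-delay product. A minimal sketch of the arithmetic, checking the slide's numbers:

def bdp_bytes(rtt_ms: float, rate_gbps: float) -> int:
    """Bandwidth-delay product: bytes in flight needed to keep the path full."""
    return int(rate_gbps * 1e9 / 8 * rtt_ms / 1000)

bdp = bdp_bytes(72.5, 10)   # the USC-Clemson path from the slide
slide_limit = 96468992      # the slide's rmem/wmem ceiling (exactly 92 MiB)
print(f"BDP: {bdp} bytes (about {bdp / 1e6:.1f} MB)")   # 90,625,000 bytes
print(f"Slide limit: {slide_limit} bytes, comfortable headroom over the BDP")

The tcp_rmem and tcp_wmem settings use the same ceiling as their third (maximum) value, as on the slide.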
Slide 20: Nuttcp Bandwidth Test: peak transfer of 72 Gb/s with 9 nodes
Slide 21: Contact Information
Charlie McMahon, Tulane University: cpm@tulane.edu
Jan Cheetham, University of Wisconsin-Madison: jan.cheetham@wisc.edu
Chris Rapier, Pittsburgh Supercomputing Center: rapier@psc.edu
Paul Gessler, University of Idaho: paulg@uidaho.edu