Long Distance experiments of Data Reservoir system


1 Long Distance experiments of Data Reservoir system
Data Reservoir / GRAPE-DR project

2 List of experiment members
The University of Tokyo
Fujitsu Computer Technologies
WIDE project
Chelsio Communications
Cooperating networks (West to East):
  WIDE, APAN / JGN2, IEEAF / Tyco Telecommunications, Pacific Northwest Gigapop, CANARIE (CA*net 4), StarLight, Abilene, SURFnet, SCinet
The Universiteit van Amsterdam
CERN

3 Computing System for real Scientists
Fast CPU, huge memory and disks, good graphics
  - Cluster technology, DSM technology, graphics processors
Grid technology
  - Very fast remote file accesses
  - Global file systems, data-parallel file systems, replication facilities
Transparency to local computation
  - No complex middleware, and no or only small modifications to existing software
Real scientists are not computer scientists; computer scientists are not a workforce for real scientists

4 Objectives of Data Reservoir / GRAPE-DR(1)
Sharing scientific data between distant research institutes
  - Physics, astronomy, earth science, simulation data
High utilization of available bandwidth
  - Transferred file data rate > 90% of available bandwidth, including header overheads and initial negotiation overheads
OS and file system transparency
  - Storage-level data sharing (high-speed iSCSI protocol on stock TCP)
Fast single file transfer
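The ">90% including header overheads" target leaves little slack: Ethernet, IP, TCP, and iSCSI headers already consume several percent of the raw link rate before any retransmissions or negotiation. A minimal back-of-the-envelope sketch (header sizes are the usual textbook values; the iSCSI PDU data length is an assumed figure, not taken from the slides):

```python
# Rough estimate of the protocol overhead hidden in the ">90% of available
# bandwidth" target: file-data bytes delivered per bit on the wire when the
# data rides on iSCSI over TCP/IP over Ethernet.

MTU = 1500                 # standard Ethernet MTU (bytes)
ETH_OVERHEAD = 38          # preamble 8 + header 14 + FCS 4 + inter-frame gap 12
IP_HDR = 20
TCP_HDR = 32               # 20-byte header + 12 bytes of timestamp options
ISCSI_BHS = 48             # iSCSI basic header segment per data PDU
PDU_DATA = 256 * 1024      # assumed iSCSI data segment length per PDU

def payload_efficiency() -> float:
    tcp_payload_per_frame = MTU - IP_HDR - TCP_HDR
    wire_bytes_per_frame = MTU + ETH_OVERHEAD
    # One 48-byte iSCSI header per PDU is amortised over PDU_DATA bytes.
    iscsi_share = PDU_DATA / (PDU_DATA + ISCSI_BHS)
    return (tcp_payload_per_frame / wire_bytes_per_frame) * iscsi_share

if __name__ == "__main__":
    print(f"max file-data share of raw link rate: {payload_efficiency():.1%}")
```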

5 Objectives of Data Reservoir / GRAPE-DR(2)
GRAPE-DR: very high-speed processor, 2004 - 2008
  - Successor of the GRAPE-6 astronomical simulator
  - 2 PFLOPS on a 128-node cluster system
  - 1 GFLOPS / processor, 1024 processors / chip
Semi-general-purpose processing
  - N-body simulation, gridless fluid dynamics
  - Linear solvers, molecular dynamics
  - High-order database searching (genome, protein data, etc.)

6 Data analysis at University of Tokyo
Data-intensive scientific computation through global networks: data sources such as the X-ray astronomy satellite ASUKA, nuclear experiments, Nobeyama Radio Observatory (VLBI), the Belle experiments, the Digital Sky Survey, the SUBARU telescope, and GRAPE-6 simulations each feed a Data Reservoir. The Data Reservoirs are connected over a very high-speed network and provide distributed shared files with local accesses for data analysis at the University of Tokyo.

7 Basic Architecture
Two Data Reservoirs, each with cache disks and local file accesses, are connected over a high-latency, very high-bandwidth network. Data are distributed and shared between them (a DSM-like architecture) by disk-block-level, parallel, multi-stream transfer.

8 File accesses on Data Reservoir
Scientific detectors and user programs access the system through file servers (1st-level striping). The file servers reach disk servers over iSCSI through IP switches (2nd-level striping). Servers are IBM x345 (2.6 GHz x 2).
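To make the two-level striping concrete, the toy sketch below maps a logical byte offset first across file servers and then across disk servers. The stripe widths and block size are hypothetical; the slides do not specify the actual Data Reservoir layout at this level of detail.

```python
# Illustrative sketch of two-level striping: user-level blocks are first
# striped across file servers, and each file server stripes its share
# across the disk servers it reaches over iSCSI.

BLOCK_SIZE = 64 * 1024     # bytes per stripe unit (assumed)
N_FILE_SERVERS = 4         # 1st-level stripe width (assumed)
N_DISK_SERVERS = 4         # 2nd-level stripe width per file server (assumed)

def locate(byte_offset: int):
    """Map a byte offset in the logical volume to its physical location."""
    block = byte_offset // BLOCK_SIZE
    file_server = block % N_FILE_SERVERS            # 1st-level striping
    fs_local_block = block // N_FILE_SERVERS
    disk_server = fs_local_block % N_DISK_SERVERS   # 2nd-level striping
    disk_block = fs_local_block // N_DISK_SERVERS
    return file_server, disk_server, disk_block

if __name__ == "__main__":
    for off in (0, BLOCK_SIZE, 17 * BLOCK_SIZE):
        print(off, "->", locate(off))
```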

9 Global Data Transfer
For global data transfer, the disk servers exchange data directly by iSCSI bulk transfer across the global network, while scientific detectors and user programs keep accessing the file servers locally through the IP switches.

10 1st Generation Data Reservoir
Efficient use of network bandwidth (SC2002 technical paper)
Single file system image (by striping of iSCSI disks)
Separation of users' local file accesses from remote disk-to-disk replication
Low-level data sharing (disk-block level, iSCSI protocol)
26 servers for 570 Mbps data transfer
High sustained efficiency (approximately 93%)
Low TCP performance
Unstable TCP performance

11 Problems of BWC2002 experiments
Low TCP bandwidth due to packet losses
  - TCP congestion window size control: very slow recovery from the fast-recovery phase (>20 min)
Imbalance among parallel iSCSI streams
  - Packet scheduling by switches and routers
  - Users (and other network users) care only about the total behavior of the parallel TCP streams
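The very slow recovery is a direct consequence of standard TCP's additive increase on a long fat network: after a loss the window is halved and then grows by roughly one segment per round trip. A minimal sketch with assumed per-stream rate and RTT (illustrative values on the scale of this experiment) reproduces the ">20 min" order of magnitude:

```python
# After a single loss, Reno-style TCP halves its congestion window and then
# grows it by about one segment per RTT, so regaining the full window takes
# roughly (W/2) * RTT, where W is the window in segments.

MSS = 1460                  # bytes per segment
RTT = 0.2                   # seconds (trans-Pacific scale, assumed)
LINK_RATE = 1e9             # bits per second per stream (GbE NIC)

window_segments = LINK_RATE * RTT / (8 * MSS)     # segments needed to fill the pipe
recovery_seconds = (window_segments / 2) * RTT    # one segment per RTT, from W/2

print(f"full window  : {window_segments:.0f} segments")
print(f"recovery time: {recovery_seconds / 60:.1f} minutes after one loss")
```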

12 2nd Generation Data Reservoir
Inter-layer coordination for parallel TCP streams (SC2004 technical paper)
  - Transmission Rate Controlled TCP (TRC-TCP) for Long Fat pipe Networks
  - "Dulling Edges of Cooperative Parallel streams" (DECP): load balancing among parallel TCP streams (see the sketch below)
  - LFN tunneling NIC (Comet TCP)
10 times performance increase from 2002
  - 7.01 Gbps (disk to disk, TRC-TCP), 7.53 Gbps (Comet TCP)
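As a purely conceptual illustration of the load-balancing idea behind DECP (the actual algorithm is the one described in the SC2004 paper, not this code), the sketch below repeatedly pulls each parallel stream's target rate toward the bundle's mean, so no single stream sticks out and triggers losses:

```python
# Toy "dulling the edges" of a bundle of parallel streams: move outliers
# toward the group average so the bundle behaves like one smooth flow.

def dull_edges(rates_mbps, step=0.25):
    """Move each stream's target rate a fraction `step` toward the mean."""
    mean = sum(rates_mbps) / len(rates_mbps)
    return [r + step * (mean - r) for r in rates_mbps]

# Example: eight parallel iSCSI streams with one starved and one greedy flow.
rates = [450, 460, 455, 440, 120, 900, 450, 445]
for _ in range(5):
    rates = dull_edges(rates)
print([round(r) for r in rates])
```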

13 Network map: 24,000 km (15,000 miles) path
The route combines OC-48 x 3, GbE x 4, and OC-192 circuits (15,680 km / 9,800 miles segment), with a Juniper T320 router on the path.

14 Results
BWC2002: Tokyo → Baltimore, 10,800 km (6,700 miles)
  - Peak bandwidth (on network): … Mbps
  - Average file transfer bandwidth: … Mbps
  - Bandwidth-distance product: …,048 terabit-kilometers/second
BWC2003: Phoenix → Tokyo → Portland → Tokyo, 24,000 km (15,000 miles)
  - Average file transfer bandwidth: 7.5 Gbps
  - Bandwidth-distance product: approximately … petabit-kilometers/second
  - More than 25 times improvement over the BWC2002 performance (bandwidth-distance product)
  - Still low performance per TCP stream (460 Mbps)

15 Problems of BWC2003 experiments
Still low single-stream TCP performance
  - NICs are GbE
  - Software overheads
Large number of server and storage boxes needed for a 10 Gbps iSCSI disk service
  - Disk density
  - TCP single-stream performance
  - CPU usage for managing iSCSI and TCP

16 3rd Generation Data Reservoir
Hardware and software basis for 100 Gbps distributed data-sharing systems
  - 10 Gbps disk data transfer by a single Data Reservoir server
  - Transparent support for multiple file systems (detection of modified disk blocks)
  - Hardware (FPGA) implementation of inter-layer coordination mechanisms
  - 10 Gbps Long Fat pipe Network emulator and 10 Gbps data logger

17 Features of a single server Data Reservoir
Low-cost and high-bandwidth disk service
  - A server + two RAID cards + 50 disks
  - Small size, scalable to a 100 Gbps Data Reservoir (10 server racks)
  - Mother system for the GRAPE-DR simulation engine
  - Compatible with the existing Data Reservoir system
Single-stream TCP performance (must be more than 5 Gbps)
Automatic adaptation to network conditions
  - Maximize multi-stream TCP performance
  - Load balancing among parallel TCP streams

18 Utilization of 10Gbps network
A single-box 10 Gbps Data Reservoir server
  - Quad Opteron server with multiple PCI-X buses (prototype: Sun V40z server)
  - Two Chelsio T110 TCP offload NICs
  - Disk arrays for the necessary disk bandwidth
  - Data Reservoir software (iSCSI daemon, disk driver, data transfer manager)
Configuration: a quad Opteron server (Sun V40z, Linux 2.6.6) with two Chelsio T110 TCP NICs (10GBASE-SR) on separate PCI-X buses connected to a 10G Ethernet switch, plus Ultra320 SCSI adaptors on further PCI-X buses for the disk arrays, all driven by the Data Reservoir software.

19 Single stream TCP performance
Importance of a hardware-supported "inter-layer optimization" mechanism
  - Comet TCP hardware (TCP tunneling over a non-TCP protocol)
  - Chelsio T110 NIC (pacing by a network processor)
  - Dual-ported FPGA-based NIC (under development at the University of Tokyo)
Standard TCP, standard Ethernet frame size
Major problems
  - Conflict between long-RTT and short-RTT flows
  - Detection of optimal parameters
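The pacing named here is implemented in NIC hardware. As a rough software analogue only (not the mechanism used in these experiments), modern Linux kernels can cap a socket's transmit rate with the SO_MAX_PACING_RATE socket option; the endpoint address in the sketch is hypothetical:

```python
# Sketch of sender-side pacing via a per-socket rate cap. This smooths
# transmissions instead of releasing window-sized bursts at line rate,
# which is the effect the NIC-based pacing above achieves in hardware.

import socket

SO_MAX_PACING_RATE = getattr(socket, "SO_MAX_PACING_RATE", 47)  # 47 on Linux

def open_paced_connection(host: str, port: int, rate_bytes_per_sec: int) -> socket.socket:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the kernel to pace this socket's transmissions to the given rate.
    s.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, rate_bytes_per_sec)
    s.connect((host, port))
    return s

# Hypothetical usage: pace a bulk-transfer socket to roughly 5 Gbps.
# conn = open_paced_connection("198.51.100.10", 5001, 625_000_000)
```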

20 Tokyo-CERN experiment (Oct.2004)
CERN - Amsterdam - Chicago - Seattle - Tokyo (SURFnet, CA*net 4, IEEAF/Tyco, WIDE)
  - 18,500 km WAN PHY connection
Performance results
  - 7.21 Gbps (TCP payload), standard Ethernet frame size, iperf
  - 7.53 Gbps (TCP payload), 8K jumbo frames, iperf
  - 8.8 Gbps disk-to-disk performance (9 servers, 36 disks, 36 parallel TCP streams)

21 Network topology of CERN-Tokyo experiment
Diagram: network topology of the CERN-Tokyo experiment. The path runs from the Data Reservoir at the University of Tokyo (Tokyo T-LEX) via Seattle (Pacific Northwest Gigapop), Vancouver, Minneapolis, Chicago (StarLight), and Amsterdam (NetherLight) to the Data Reservoir at CERN (Geneva), over the WIDE / IEEAF, CA*net 4, and SURFnet networks on a 10GBASE-LW connection. Each end has dual Opteron 248 servers (2.2 GHz, 1 GB memory, Linux 2.6.x) with Chelsio T110 NICs on Fujitsu XG800 12-port switches, plus IBM x345 servers (dual Intel Xeon 2.4 GHz, 2 GB memory, Linux 2.4.x / 2.6.x) on GbE behind Foundry NetIron 40G, Foundry BI MG8, Foundry FESX448, and Extreme Summit 400 switches.

22 Difficulty on CERN-Tokyo Experiment
Detecting SONET-level errors
  - Synchronization errors
  - Low-rate packet losses
Lack of tools for debugging the network
  - Loopback at layer 1 is the only useful method
  - No tools for performance bugs
Very long latency

23 SC2004 bandwidth challenge (Nov.2004)
CERN - Amsterdam - Chicago - Seattle - Tokyo - Chicago - Pittsburgh (SURFnet, CA*net 4, IEEAF/Tyco, WIDE, JGN2, Abilene)
  - 31,248 km (RTT 433 ms); 20,675 km by the shorter-distance LSR rules
Performance results
  - 7.21 Gbps (TCP payload), standard Ethernet frame size, iperf
  - 1.6 Gbps disk-to-disk performance (1 server, 8 disks, 8 parallel TCP streams)

24 Pittsburgh-Tokyo-CERN Single stream TCP
Diagram: the 31,248 km single-stream TCP connection between the Data Reservoir booth at SC2004 (Pittsburgh) and the Data Reservoir project at CERN (Geneva) through Tokyo, with a 433 ms RTT. The route crosses SCinet, Abilene, APAN/JGN II, IEEAF/Tyco/WIDE, CANARIE, and SURFnet, passing Chicago (StarLight), Minneapolis, Calgary, Vancouver, Seattle (Pacific Northwest Gigapop), Tokyo (T-LEX), Amsterdam (NetherLight, University of Amsterdam), and CERN. End hosts are dual Opteron 248 servers (2.2 GHz, 1 GB memory, Linux 2.6.x) with Chelsio T110 NICs on Fujitsu XG800 switches; the path mixes routers and L3 switches (Juniper T320/T640, Procket, Foundry NetIron 40G, Foundry BI MG8, Force10), L3 switches used as L2, and L1/L2 equipment (ONS 15454, HDXc).

25 Pittsburgh-Tokyo-CERN Single stream TCP speed measurement
System
  - Server configuration: CPU: dual AMD Opteron 248, 2.2 GHz; motherboard: RioWorks HDAMA rev. E; memory: 1 GB (Corsair Twinx CMX512RE-3200LLPT x 2, PC3200, CL=2); OS: Linux kernel 2.6.6
  - Disk: Seagate IDE 80 GB, 7200 rpm (disk speed not essential for performance)
  - Network interface card: Chelsio T110 (10GBASE-SR) TCP offload engine (TOE), PCI-X/133 MHz I/O bus connection, TOE function ON
Technical points
  - Traditional tuning and optimization methods: congestion window size, buffer size, queue sizes, etc.; Ethernet frame size (standard frame, jumbo frame)
  - Tuning and optimization by inter-layer coordination (Univ. of Tokyo): fine-grained pacing by the offload engine, optimization of the slow-start phase, utilization of flow control
Performance of single-stream TCP on the 31,248 km circuit
  - 7.21 Gbps (TCP payload), standard Ethernet frames ⇒ 224,985 terabit meters / second
  - 80% more than the previous Internet Land Speed Record
Figure: observed bandwidth on the network (at Tokyo T-LEX) and map of the network used in the experiment (WIDE, APAN/JGN II, IEEAF/Tyco/WIDE, CANARIE, Abilene, SURFnet; Pittsburgh, Chicago, Minneapolis, Calgary, Vancouver, Seattle, Tokyo, Amsterdam, Geneva).
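A quick sanity check of the slide's own figures shows what a single 7.21 Gbps stream over 433 ms demands from the sender (hundreds of megabytes of in-flight data) and roughly reproduces the quoted bandwidth-distance product; the small difference from 224,985 comes from rounding in the payload rate:

```python
# Bandwidth-delay product and Land Speed Record metric for the SC2004 run,
# computed from the numbers quoted on this slide.

payload_rate_bps = 7.21e9        # TCP payload rate reported at SC2004
rtt_s = 0.433                    # round-trip time Pittsburgh-Tokyo-CERN
distance_m = 31_248e3            # circuit length in metres

bdp_bytes = payload_rate_bps * rtt_s / 8      # data in flight to fill the pipe
lsr_metric = payload_rate_bps * distance_m    # bit * metre / second

print(f"bandwidth-delay product : {bdp_bytes / 2**20:.0f} MiB of in-flight data")
print(f"bandwidth-distance prod.: {lsr_metric / 1e12:,.0f} Tbit*m/s")
```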

26 Difficulty on SC2004 LSR challenge
Methods for performance debugging
  - Dividing the path (CERN to Tokyo, Tokyo to Pittsburgh)
Jumbo frame performance (unknown reason)
Stability of the network
  - QoS problem in the Procket 8800 routers

27 Year-end experiments
Targets
  - > 30,000 km LSR distance
  - L3 switching at Chicago and Amsterdam
Period of the experiment
  - 12/20 - 1/3: holiday season, when the public research networks are relatively idle
System configuration
  - A pair of Opteron servers with Chelsio T110 NICs (at N-Otemachi)
  - Another pair of Opteron servers with Chelsio T110 NICs for competing-traffic generation
  - ClearSight 10 Gbps packet analyzer for packet capturing

28 Single stream TCP – Tokyo – Chicago – Amsterdam – NY – Chicago - Tokyo
Diagram: the round-the-world single-stream TCP path Tokyo - Chicago - Amsterdam - New York - Chicago - Tokyo. From Tokyo (T-LEX, University of Tokyo, WIDE) the circuit runs over IEEAF/Tyco and CANARIE (Vancouver, Calgary, Minneapolis) to Chicago (StarLight), over SURFnet to Amsterdam (NetherLight, University of Amsterdam), back across the Atlantic on SURFnet to New York (MANLAN), over Abilene to Chicago, and returns to Tokyo over APAN/JGN and TransPAC/WIDE. Opteron servers with Chelsio T110 NICs and a ClearSight 10 Gbps capture system sit at the Tokyo end behind a Fujitsu XG800 switch; the path mixes WAN PHY, 10GE, and OC-192 circuits through equipment including ONS 15454 and OME 6500 boxes, HDXc cross-connects, Foundry NetIron 40G, Force10 E1200/E600, Cisco 6509 and 12416, Procket 8812/8801, and Juniper T640 routers.

29 Figure 1a. Network topology (v0.06, Dec 21, 2004, data-reservoir@adm.s.u-tokyo.ac.jp)
Detailed topology of the Tokyo/Otemachi - Seattle - Vancouver - Chicago - Amsterdam loop: the Tokyo end (Opteron servers with Chelsio T110 NICs, a ClearSight 10GE analyzer, a Fujitsu XG1200, and a tap) connects through T-LEX and IEEAF to Seattle (UW, Pacific Northwest), over CA*net 4 to StarLight in Chicago, and over SURFnet to the University of Amsterdam, with the return path through MANLAN (NYC), Abilene, and APAN/TransPAC/JGN II. Links are a mix of 10GE-SR/LR/LW and OC-192, carrying VLANs A (911), B (10), and D (914) across these "operational and production-oriented high-performance research and educations networks".

30 Figure 2. Network connection
Map of the network used in the experiment: Tokyo - Seattle - Vancouver - Calgary - Minneapolis - Chicago - Amsterdam, plus the NYC return path, over CA*net 4 (CANARIE), IEEAF/Tyco/WIDE, SURFnet, APAN/JGN2, and Abilene. The figure distinguishes routers or L3 switches from L1/L2 switches along the path.

31 Difficulty on Tokyo-AMS-Tokyo experiment
Packet loss and flow control
  - About 1 loss per several minutes with flow control off; about 100 losses per several minutes with flow control on
  - Flow control at the edge switch does not work properly
Differences in network behavior by direction
  - Artificial bursty behavior caused by switches and routers
  - Outgoing packets from the NIC are paced properly
Jumbo frame packet losses
  - Packet losses begin at MTU 4500; an NI40G and Force10 E1200/E600 are on the path
Lack of peak network flow information
  - 5-minute average bandwidth is useless for diagnosing TCP traffic
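The sensitivity to even "one loss every few minutes" can be estimated with the well-known Mathis et al. approximation for loss-limited TCP throughput; the RTT and target rate below are assumed round numbers for this loop, so the result is an estimate, not a measurement:

```python
# Turn the Mathis approximation BW ~ (MSS/RTT) * 1.22/sqrt(p) around to ask
# how rare random losses must be for a single standard TCP stream to hold a
# given rate on a very long path.

MSS_BITS = 1460 * 8
RTT = 0.5                      # seconds, assumed for the round-the-world loop
TARGET_BPS = 7e9               # single-stream goal

p = (1.22 * MSS_BITS / (RTT * TARGET_BPS)) ** 2      # tolerable loss probability
packets_per_s = TARGET_BPS / MSS_BITS
seconds_between_losses = 1 / (p * packets_per_s)

print(f"tolerable loss rate      : {p:.2e}")
print(f"i.e. at most one loss per {seconds_between_losses / 3600:.0f} hours")
```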

32 Network Traffic on routers and switches
Traffic graphs for the submitted run, taken from the StarLight Force10 E1200, the University of Amsterdam Force10 E600, the Abilene T640 (NYCM to CHIN), and the TransPAC Procket 8801.

33 History of single-stream IPv4 Internet Land Speed Record
Chart: distance-bandwidth product (Tbit m/s, log scale from 1,000 to 1,000,000) of single-stream IPv4 Internet Land Speed Records versus year (2000 - 2007). The 2004/11/9 record by the Data Reservoir and WIDE projects and the 2004/12/25 experimental result by the Data Reservoir project, WIDE project, SURFnet, and CANARIE are marked.

34 Setting for instrumentation
Capture packets at both the source end and the destination end
  - Packet filtering by source/destination IP address
  - Capture of up to 512 MB
  - Precise packet counting was very useful
Setup: a tap between the source server and the Fujitsu XG800 10G L2 switch feeds the ClearSight 10 Gbps capture system; the destination server and the network under test sit on the other side of the switch.
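The essential bookkeeping is simple: count the packets of the flow under test at the source-side and destination-side capture points and compare. A minimal sketch, assuming each capture box can export a packet count for the filtered src/dst IP pair (the example numbers are made up):

```python
# Compare per-endpoint packet counts for one filtered flow to localize loss.

def report(counts_at_source: int, counts_at_destination: int) -> str:
    lost = counts_at_source - counts_at_destination
    if lost == 0:
        return "no loss between the taps"
    if lost > 0:
        return (f"{lost} packets lost in the network "
                f"({lost / counts_at_source:.2e} loss rate)")
    return f"{-lost} more packets seen at the destination (duplicates or filter mismatch)"

# Example on the order of a few minutes of 10 Gbps traffic (made-up numbers):
print(report(counts_at_source=120_000_000, counts_at_destination=119_999_897))
```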

35 Packet count example

36 Traffic monitor example

37 Possible arrangement of devices
1U loss-less packet capturing box developed at the University of Tokyo
  - Two 10GBASE-LR interfaces
  - Lossless capture of up to 1 GB
  - Precise time stamping
Packet-generating server
  - 3U Opteron server + 10 Gbps Ethernet adaptor
  - Pushes the network at more than 5 Gbps of TCP/UDP traffic
Taps and a tap-selecting switch
  - Currently, port mirroring is not good enough for instrumentation
  - Capture backbone traffic as well as server in/out traffic

38 2 port packet capturing box
Originally designed for emulating a 10 Gbps network
  - Two 10GBASE-LR interfaces
  - 0 to 1000 ms artificial latency (very long FIFO queue)
  - Error rate 0 to 0.1% (0.001% steps)
Lossless packet capturing
  - Utilizes the FIFO queue capability
  - Captured data uploadable through USB
  - Cheaper than ClearSight or Agilent Technologies analyzers
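A toy software model of what the box does in its emulator role, with parameter ranges taken from this slide (the real implementation is FPGA hardware, and its internal design is not described here):

```python
# Each frame is delayed by a configurable latency (a long FIFO) and dropped
# with a configurable probability, emulating a lossy long fat network.

import collections
import random

class LFNEmulator:
    def __init__(self, latency_ms: float, loss_percent: float):
        assert 0 <= latency_ms <= 1000 and 0 <= loss_percent <= 0.1
        self.latency = latency_ms / 1000.0
        self.loss = loss_percent / 100.0
        self.fifo = collections.deque()          # (release_time, frame)

    def ingress(self, now: float, frame: bytes) -> None:
        if random.random() >= self.loss:         # drop at the configured rate
            self.fifo.append((now + self.latency, frame))

    def egress(self, now: float) -> list:
        out = []
        while self.fifo and self.fifo[0][0] <= now:
            out.append(self.fifo.popleft()[1])
        return out

# Example: 250 ms one-way latency, 0.001% frame loss.
emu = LFNEmulator(latency_ms=250, loss_percent=0.001)
emu.ingress(0.0, b"frame-1")
print(emu.egress(0.1), emu.egress(0.3))
```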

39 Setting for instrumentation
Capture packets at both the source end and the destination end, with packet filtering by source/destination IP address.
Setup: a tap on the link between the backbone router or switch and the network under test feeds a server with a 10G Ethernet adaptor and the 2-port packet capture box.

40 More expensive plan
Capture packets at both the source end and the destination end, with packet filtering by source/destination IP address.
Setup: taps on both sides of the backbone router or switch, each feeding its own 2-port packet capture box, plus a server with a 10G Ethernet adaptor, so traffic entering and leaving the network under test is captured independently.

41 Summary
Single-stream TCP
  - We removed the TCP-related difficulties; now I/O bus bandwidth is the bottleneck
  - Cheap and simple servers can enjoy a 10 Gbps network
Lack of methodology in high-performance network debugging
  - 3 days of debugging (working overnight) for 1 day of stable time usable for measurements
  - The network seems to "feel fatigue": some trouble always happens; we need something more effective
Detailed issues
  - Flow control (and QoS)
  - Buffer sizes and policies
  - Optical-level settings

42 Thanks

