1
Data Reservoir: Utilization of Multi-Gigabit Backbone Network for Data-Intensive Research
Mary Inaba, Makoto Nakamura, Kei Hiraki (University of Tokyo)
AWOCA 2003
2
Today's Topics
- New infrastructure for data-intensive scientific research
- Problems of using the Internet
3
One day, I was surprised
A professor (Dept. of Astronomy) said: "The network is for e-mail and paper exchange. FedEx is for REAL data exchange." (They use DLT tapes and airplanes.)
4
Huge Data Producers
- AKEBONO satellite
- Radio telescope at Nobeyama
- SUBARU telescope
- KAMIOKANDE (Nobel Prize)
- High-energy accelerators
A lot of data suggests a lot of scientific truth, via computation. Now we can compute: data-intensive research.
5
Huge Data Transfer (inquiries to professors)
Current state: data transfer by DLT tape, EVERY WEEK.
Expected data sizes in a few years:
- 10 GB/day of satellite data
- 50 GB/day from the high-energy accelerator
- a 50 PB tape archive for the Earth Simulator
Observatories are shared by many researchers, hence the data NEEDS to be brought back to each lab somehow. Does the network help? (See the quick arithmetic below.)
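A back-of-the-envelope check of that question. This is a minimal sketch with my own arithmetic; the link speeds are illustrative, only the daily volumes come from the slide:

```python
# How long do the quoted daily volumes take to move over a network link?
volumes_gb = {"satellite data": 10, "accelerator data": 50}   # GB per day
links_gbps = {"Fast Ethernet": 0.1, "Gigabit Ethernet": 1.0}  # line rates

for name, gb in volumes_gb.items():
    for link, gbps in links_gbps.items():
        seconds = gb * 8 / gbps      # GB -> gigabits, divided by Gbit/s
        print(f"{name:16s} ({gb:2d} GB) over {link:16s}: {seconds:6.0f} s")
```

Even 50 GB fits in under 7 minutes at a full gigabit, so the network can in principle replace the weekly DLT shipments; the rest of the talk is about why actually reaching such rates over long distances is hard.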
6
Super-SINET backbone
- Started January 2002: a network for universities and institutes
- A combination of a 10 Gbps ordinary line and several 1 Gbps project lines (physics, genome, Grid, etc.)
- Optical cross-connect; sites include Hokkaido Univ., Tohoku Univ., KEK, Tsukuba Univ., Univ. of Tokyo, NAO, NII, Titech, Waseda, ISAS, Nagoya Univ., Okazaki Labs, Kyoto Univ., Doshisha Univ., Osaka Univ., Kyushu Univ.
7
Currently
It is not easy to transfer HUGE data over long distances while fully utilizing the bandwidth, because:
- TCP/IP is what everyone uses, and for TCP/IP, latency is the problem.
- Disk I/O speed (~50 MB/s) is far below multi-gigabit line rates (see the sketch below).
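A quick sketch of the disk-side arithmetic (my own numbers, using the 50 MB/s figure from the slide): a single disk cannot fill a multi-gigabit pipe, which is why the Data Reservoir design described later stripes data across many disks and disk servers.

```python
import math

# How many ~50 MB/s disks does it take to keep a fast link busy?
disk_mbps = 50 * 8                       # 50 MB/s is about 400 Mbit/s
for link_gbps in (1, 10):
    disks = math.ceil(link_gbps * 1000 / disk_mbps)
    print(f"{link_gbps:2d} Gbps link needs at least {disks} disks")
# -> 1 Gbps needs 3 disks; 10 Gbps needs 25
```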
8
Recall HISTORY
Infrastructure for scientific research projects: utilization of the top-end computing systems of the time, from the birth of the electronic computer onward.
① Numerical computation ⇒ tables, equations
② Supercomputing (vector) ⇒ simulation
③ Servers ⇒ databases, data mining, genome
④ Internet ⇒ information exchange, documentation
⑤ Very high-speed networks (today)
Scientific researchers always utilize top-end systems. [Timeline: EDSAC, CDC-6600, CRAY-1, SUN Fire 15000, 10G switch]
9
Frontier of Information Processing
A new transition period in the balance of computing systems:
- very high-speed networks
- large-scale disk storage
These form a new infrastructure for cluster computers and data-intensive research.
[Diagram: CPU (GFLOPS), memory (GB), network interface (Gbps), local disks, remote disks]
10
Research Projects with Data Reservoir
11
Data Reservoir: Basic Architecture
Two Data Reservoir sites are connected by a high-latency, very-high-bandwidth network and behave as distributed shared files (a DSM-like architecture). Each site serves local file accesses from its cache disks; between sites, data moves as physically addressed, parallel, multi-stream transfers.
12
Data-intensive scientific computation through Super-SINET
[Diagram: Data Reservoirs at the data sources (Belle experiments, CERN, the X-ray astronomy satellite ASUKA, the SUBARU telescope, Nobeyama Radio Observatory (VLBI), nuclear experiments, the Digital Sky Survey) provide distributed shared files over the very high-speed network; local accesses feed data analysis at the University of Tokyo.]
13
Design Policy
- Modification of the disk handler under the VFS layer
- Direct access to the raw device for efficient data transfer
- Multi-level striping for scalability
- Use of the iSCSI protocol: local file accesses through the LAN, global disk transfer through the WAN
- Single file image, file-system transparency
[Stack diagram: application / file system / md (RAID) driver / SCSI drivers (sd, sg, st; mid and low level) / iSCSI driver and iSCSI daemon / data-server disks]
14
File accesses on Data Reservoir
[Diagram: scientific detectors and user programs access file servers through an IP switch; file servers stripe data across disk servers (1st-level striping), and each disk server stripes across its own disks (2nd-level striping); disks are accessed by iSCSI.]
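A minimal sketch of how two-level striping can map a logical block onto (disk server, disk, offset). The round-robin layout and the `locate` helper below are my illustration of the idea, not the actual Data Reservoir mapping:

```python
# Two-level striping: the 1st level spreads blocks across disk servers,
# the 2nd level spreads each server's share across its local disks.
def locate(block: int, n_servers: int, disks_per_server: int):
    server = block % n_servers                         # 1st-level stripe
    disk = (block // n_servers) % disks_per_server     # 2nd-level stripe
    offset = block // (n_servers * disks_per_server)   # position on disk
    return server, disk, offset

# Example with 4 disk servers of 8 disks each (one of the 1*4*8 setups):
for b in range(10):
    server, disk, offset = locate(b, n_servers=4, disks_per_server=8)
    print(f"block {b} -> server {server}, disk {disk}, offset {offset}")
```

Consecutive blocks hit all servers in turn, so one large transfer keeps every disk server, and every disk behind it, busy at once.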
15
User's View
[Same diagram: the user program simply sees files; the IP switches, file servers, two-level striping, and iSCSI disk accesses underneath are hidden behind the single file image.]
16
Global Data Transfer
[Diagram: scientific detectors and user programs sit behind the file servers; for site-to-site transfer, the disk servers exchange data directly as iSCSI bulk transfers over the global network.]
17
Implementation (file server)
[Stack: application / system call / NFS / EXT2 / Linux RAID (md) / sd and sg drivers / iSCSI driver / TCP and UDP / IP / network]
18
Implementation (disk server)
[Stack: application layer / system call / iSCSI daemon / iSCSI driver / dr driver / sg driver / SCSI driver / data striping onto the disks; TCP / IP toward the network]
19
Performance Evaluation of Data Reservoir
1. Local experiment: 1 Gbps model (basic performance).
2. 40 km experiment: 1 Gbps model, Univ. of Tokyo ⇔ ISAS.
3. 1600 km experiment: 1 Gbps model, 26 ms latency (Tokyo ⇔ Kyoto ⇔ Osaka ⇔ Sendai ⇔ Tokyo), on a high-quality network (Super-SINET Grid project lines).
4. US-Japan experiments: 1 Gbps model, Univ. of Tokyo ⇔ Fujitsu Lab America (Maryland, USA) and Univ. of Tokyo ⇔ SCinet (Maryland, USA).
5. 10 Gbps experiments, comparing different switch configurations:
   - Extreme Summit 7i, trunked 8 Gigabit Ethernets
   - RiverStone RS16000, trunked 8 and 12 1000BASE-SX
   - Foundry BigIron, 10GBASE-LR modules
   - Extreme BlackDiamond, trunked 8 1000BASE-SX
   - Foundry BigIron, trunked 2 10GBASE-LR
   Trunking 8 Gigabit Ethernets makes the bottleneck 8 Gbps.
20
Performance Comparison with ftp (40 km)
- ftp: optimal performance (minimum disk-head movement)
- iSCSI: queued operation
iSCSI transfer is 55% faster than ftp on a single TCP stream.
21
1600 km experiment system: 870 Mbps file-transfer bandwidth
Univ. of Tokyo (Cisco 6509)
↓ 1G Ethernet (Super-SINET)
Kyoto Univ. (Extreme BlackDiamond)
↓ 1G Ethernet (Super-SINET)
Osaka Univ. (Cisco 3508)
↓ 1G Ethernet (Super-SINET)
Tohoku Univ. (jumper fiber)
↓ 1G Ethernet (Super-SINET)
Univ. of Tokyo (Extreme Summit 7i)
22
Network for the 1600 km experiments
[Map: Univ. of Tokyo, Kyoto Univ., Osaka Univ., Tohoku Univ. (Sendai), connected by GbE lines of roughly 550, 300, 250, and 1000 miles]
- Grid project networks of Super-SINET
- One-way latency: 26 ms
23
Transfer speed in the 1600 km experiment
System configuration (file servers * disk servers * disks/disk server) vs. transfer rate:

  1*4*8      870 Mbps
  1*4*(2+2)  828 Mbps
  1*4*4      812 Mbps
  1*2*8      737 Mbps
  1*2*(2+2)  700 Mbps
  1*2*4      707 Mbps
  1*1*8      499 Mbps
  1*1*(2+2)  478 Mbps
  1*1*4      493 Mbps

Maximum bandwidth measured by SmartBits = 970 Mbps; header overhead ~5%.
24
10 Gbps Experiment
Local connection of two 10 Gbps models, via 10GBASE-LR or 8 to 12 trunked 1000BASE-SX.
24 disk servers + 6 file servers:
- Dell 1650: two 1.26 GHz Pentium III CPUs, 1 GB memory, ServerSet III HE-SL chipset
- NetGear GE NICs
Switches: Extreme Summit 7i (trunking), Extreme BlackDiamond 6808, Foundry BigIron (10GBASE-LR), RiverStone RS-16000.
Result: 11.7 Gbps transfer bandwidth.
25
Performance of the 10 Gbps Model
- 300 GB file transfer (iSCSI streams)
- ~5% header overhead due to TCP/IP and iSCSI
- ~7% performance loss due to trunking and uneven use of disk servers
- 100 GB file transfer in 2 minutes
26
US-Japan Experiments at the SC2002 Bandwidth Challenge
92% usage of the bandwidth using TCP/IP.
27
Brief Explanation of TCP/IP
28
User's View: TCP is a PIPE
[Diagram: input data "abcde" enters the TCP byte stream, crosses the Internet, and is output as the same data in the same order.]
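A minimal demonstration of the pipe view (my sketch; a local Unix stream-socket pair stands in for a real TCP connection across the Internet):

```python
import socket

# "TCP is a pipe": whatever bytes go in one end come out the other,
# same data, same order. (Unix-only: socketpair of stream sockets.)
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
a.sendall(b"abcde")                 # input data
data = b.recv(5)                    # output
assert data == b"abcde"             # same data, in the same order
print(data.decode())                # -> abcde
a.close()
b.close()
```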
29
TCP's View
To present that byte stream, TCP itself must:
- check that all the data has arrived;
- re-order data that arrives in the wrong order;
- ask for a re-send when data is missing;
- control the sending speed.
30
TCP's Features
- Keeps data until an acknowledgement (ACK) arrives.
- Performs speed control (congestion control) without knowing the state of the routers.
- Uses a buffer (window): when an ACK arrives from the receiver, new data moves into the buffer.
- Shrinks the buffer (window) when congestion is suspected.
31
Window Size and Throughput
Roughly speaking,

    Throughput = Window Size / RTT

where RTT is the round-trip time. Hence a longer RTT needs a larger window size for the same throughput, as the worked example below shows.
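Plugging in the RTTs that appear in this talk gives a feel for the window sizes involved (my own arithmetic, using the slide's formula):

```python
# Window needed to sustain a given throughput: Window = Throughput * RTT.
def window_bytes(throughput_gbps: float, rtt_ms: float) -> float:
    return throughput_gbps * 1e9 / 8 * (rtt_ms / 1e3)   # bytes in flight

print(window_bytes(1.0, 26))    # ~3.3 MB for 1 Gbps at 26 ms (1600 km)
print(window_bytes(1.0, 200))   # ~25 MB for 1 Gbps at 200 ms (US-Japan)
```

A classic TCP window of at most 64 KB (without window scaling) is orders of magnitude too small for these paths.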
32
Congestion Control: AIMD (Additive Increase, Multiplicative Decrease)
[Graph: window size vs. time; the window is doubled for every ACK in the start phase (slow start), then enters the AIMD phase.]
Once congestion has occurred, accelerate again only gradually; when congestion is suspected, slow down rapidly.
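A minimal sketch of this window evolution (idealized, one update per RTT; the threshold and loss times are my assumptions, not measurements):

```python
# Slow start doubles the window each RTT; congestion avoidance adds one
# segment per RTT (additive increase); a loss halves it (multiplicative
# decrease).
def cwnd_trace(rtts, ssthresh=64, losses=frozenset()):
    cwnd, trace = 1, []
    for t in range(rtts):
        trace.append(cwnd)
        if t in losses:
            cwnd = max(1, cwnd // 2)    # multiplicative decrease on loss
            ssthresh = cwnd
        elif cwnd < ssthresh:
            cwnd *= 2                   # slow start: double every RTT
        else:
            cwnd += 1                   # additive increase: +1 segment/RTT
    return trace                        # window (in segments), one per RTT

print(cwnd_trace(12))
# -> [1, 2, 4, 8, 16, 32, 64, 65, 66, 67, 68, 69]
```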
33
Another Problem
Call a network with long latency and wide bandwidth an LFN (Long Fat Pipe Network). An LFN needs a large window; but since each window increment is triggered by an ACK, the speed of increase is also SLOW. (The LFN suffers under AIMD.)
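How slow is slow? A rough estimate for the US-Japan path (my own arithmetic; 1460-byte segments assumed):

```python
# In congestion avoidance the window grows by ~1 segment per RTT, so
# opening a full LFN window takes thousands of round trips.
mss = 1460                        # bytes per segment (typical for Ethernet)
rtt = 0.2                         # 200 ms US-Japan round trip
window = 1e9 / 8 * rtt            # bytes in flight for 1 Gbps
segments = window / mss           # ~17,000 segments
print(f"{segments:.0f} RTTs = {segments * rtt / 60:.0f} minutes "
      "to open the full window at +1 segment per RTT")
# -> roughly an hour of additive increase for a single 1 Gbps stream
```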
34
Network Environment
[Diagram of the US-Japan path.] The bottleneck is about 600 Mbps; note that 600 Mbps < 1 Gbps.
35
92% using TCP/IP is good, but we still have a PROBLEM
Several streams only make progress after the other streams finish.
36
Fastest and slowest streams in the worst case
[Graph: sequence number vs. time.] The slowest stream is 3 times slower than the fastest, and even after the other streams finish, its throughput does not recover.
37
Hand-made Tools
- DR Gigabit Network Analyzer: needs accurate timestamps (100 ns accuracy) and must dump full packets; on Gigabit Ethernet a packet can arrive every 12 μs.
- Comet Delay and Drop: a pseudo Long Fat Pipe Network (LFN).
38
Programmable NIC (Network Interface Card)
39
DR Giga Analyzer
40
Comet Delay and Drop
41
Unstable Throughput
In our long-distance data transfer experiments, throughput varied from 8 Mbps to 120 Mbps when using a Gigabit Ethernet interface.
42
Fast Ethernet is very stable
43
Analysis of a single stream
Number of packets, with 200 ms RTT.
44
Packet Distribution
[Graph: number of packets per ms vs. time (s)]
45
Packet Distribution of Fast Ethernet
[Graph: number of packets per ms vs. time (s)]
46
Gigabit Ethernet interface vs. Fast Ethernet interface
Even at the same "20 Mbps", the behavior of 20 Mbps through a Gigabit Ethernet interface and 20 Mbps through a Fast Ethernet interface is completely different: Gigabit Ethernet is very bursty, and routers may not like this.
47
Two Problems
1. Once packets are sent in a burst, the router sometimes cannot cope (unlucky streams are slow, lucky streams are fast), especially when the bottleneck is below a gigabit.
2. More than 80% of the time, the sender does not send anything.
48
Problem of Implementation
At 1 Gbps, assuming 1500-byte Ethernet packets, one packet should be sent every 12 μs. The UNIX kernel timer, on the other hand, ticks only every 10 ms.
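Checking that arithmetic, and what it implies for software pacing (my own sketch; 1500-byte frames and a 10 ms timer tick assumed, as on the slide):

```python
# Per-packet spacing at 1 Gbps, vs. what a 10 ms kernel timer can pace.
frame_bits = 1500 * 8
spacing = frame_bits / 1e9                   # seconds between packets
per_tick = 10e-3 / spacing                   # packets per 10 ms timer tick
print(f"{spacing * 1e6:.0f} us between packets; "
      f"~{per_tick:.0f} packets accumulate per timer tick")
# -> 12 us spacing; pacing from the 10 ms timer means bursts of ~833
#    packets, which is exactly the burstiness the routers dislike.
```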
49
IPG (Inter-Packet Gap)
The transmitter is always on; when no packet is being sent, it transmits the idle state. Each frame is followed by an IPG of at least 12 bytes (IEEE 802.3). The IPG is tunable through the e1000 driver (8 to 1023 bytes).
50
IPG tuning for short distance

                    IPG 8 bytes   IPG 1023 bytes
  Fast Ethernet     94.1 Mbps     56.7 Mbps
  Gigabit Ethernet  941 Mbps      567 Mbps

Assuming a 1500-byte Ethernet frame, 1508 : 2523 is approximately 567 : 941, so the measurements match the theory. (Gigabit Ethernet is already perfectly tuned for short-distance data transfer.)
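Reproducing the table's ratio (my own sketch; assumes throughput scales as frame_size / (frame_size + IPG) for 1500-byte frames, as the slide does):

```python
# Stretching the inter-packet gap from 8 B to 1023 B scales throughput
# by 1508/2523, matching the measured 941 -> 567 Mbps drop.
frame = 1500
for media, measured_8 in (("Fast Ethernet", 94.1), ("Gigabit Ethernet", 941)):
    predicted = measured_8 * (frame + 8) / (frame + 1023)
    print(f"{media:16s}: predicted {predicted:5.1f} Mbps with IPG 1023 B")
# -> 56.2 and 562 Mbps, close to the measured 56.7 and 567 Mbps
```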
51
IPG tuning for Long Distance
52
Max, min, average, and standard deviation of throughput
[Graph; the Fast Ethernet case is included for comparison.]
53
Some patterns of throughput change
54
Detail (Slow Start Phase)
55
Packet Distribution
56
But...
These are like ad-hoc patches. What is the essential problem?
57
One big problem
No good MODEL exists. The old-style models, such as M/M/1 queueing theory with a Poisson packet distribution, do not work well; experiment says they do not fit. Currently, simulation and trials on a real network are the only ways to check. (There is no theoretical background.)
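Why Poisson-based models fit poorly can be illustrated with a toy comparison (my own sketch, not the talk's data): packets spread independently over time vs. the same average rate emitted in kernel-tick bursts, as measured earlier.

```python
import random
from collections import Counter

random.seed(0)
rate = 20000                     # packets per second, same for both sources

# Poisson-like source: each packet lands in a random millisecond bin.
poisson = Counter(int(random.random() * 1000) for _ in range(rate))

# Bursty source: the whole 10 ms quota sent back-to-back at each timer tick.
bursty = Counter()
for tick in range(100):                  # 100 ticks of 10 ms in one second
    bursty[tick * 10] += rate // 100     # 200 packets in one millisecond

print(max(poisson.values()), max(bursty.values()))   # e.g. ~40 vs 200
```

Both sources average 20 packets/ms, but the bursty one's peak per-millisecond load is several times higher, so a queueing model calibrated only to the average badly mispredicts router behavior.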
58
What is the difference from the telephone network?
AUTONOMY
59
In the telephone network, the telephone company knows, manages, and controls the whole network. End nodes do not have to do heavy jobs such as congestion control.
60
Current Trend(?)
Analyze the NETWORK using game theory: Nash equilibria.