Realization and Utilization of High-BW TCP on Real Applications
Kei Hiraki
Data Reservoir / GRAPE-DR Project
The University of Tokyo
Computing Systems for Real Scientists
Fast CPUs, huge memory and disks, good graphics
– Cluster technology, DSM technology, graphics processors
– Grid technology
Very fast remote file accesses
– Global file systems, data-parallel file systems, replication facilities
Transparency to local computation
– No complex middleware, and only small (or no) modifications to existing software
Real scientists are not computer scientists
Computer scientists are not a workforce for real scientists
Objectives of Data Reservoir / GRAPE-DR (1)
Sharing scientific data between distant research institutes
– Physics, astronomy, earth science, simulation data
Very high-speed single-file transfer over a Long Fat pipe Network
– > 10 Gbps, > 20,000 km, > 400 ms RTT
High utilization of available bandwidth
– Transferred file data rate > 90% of available bandwidth, including header overheads and initial negotiation overheads (see the rough calculation below)
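To put the > 90% target in perspective, a bandwidth-delay product estimate with the figures above (illustrative only) shows how much data a single TCP stream must keep in flight:

    # Bandwidth-delay product for the target path (figures from this slide).
    bandwidth_bps = 10e9      # > 10 Gbps
    rtt_s = 0.400             # > 400 ms round-trip time
    mss_bytes = 1460          # TCP payload of a standard 1500-byte Ethernet MTU

    bdp_bytes = bandwidth_bps / 8 * rtt_s
    print(f"BDP  ≈ {bdp_bytes / 2**20:.0f} MiB in flight")    # ≈ 477 MiB
    print(f"cwnd ≈ {bdp_bytes / mss_bytes:,.0f} segments")    # ≈ 342,000 segments

A single loss on a window this large is what makes the recovery problems discussed later so severe.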
Objectives of Data Reservoir / GRAPE-DR (2)
GRAPE-DR: very high-speed attached processor for a server
– 2004 – 2008
– Successor of the GRAPE-6 astronomical simulator
2 PFLOPS on a 128-node cluster system
– 1 GFLOPS / processor
– 1024 processors / chip
– 8 chips / PCI card
– 2 PCI cards / server
– 2 M processors / system
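As a sanity check, the peak figure follows directly from the per-level numbers above:

    # Back-of-the-envelope peak performance from the figures on this slide.
    flops_per_processor = 1e9            # 1 GFLOPS / processor
    processors = 1024 * 8 * 2 * 128      # per chip * chips/card * cards/server * 128 nodes
    print(f"{processors:,} processors")                             # 2,097,152 ≈ 2 M
    print(f"{processors * flops_per_processor / 1e15:.1f} PFLOPS")  # ≈ 2.1 PFLOPS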
[Diagram] Data-intensive scientific computation through global networks: Data Reservoirs provide local accesses and distributed shared files, linking data sources (GRAPE-6, Belle experiments, the X-ray astronomy satellite ASUKA, the SUBARU Telescope, Nobeyama Radio Observatory (VLBI), nuclear experiments, the Digital Sky Survey) over a very high-speed network to data analysis at the University of Tokyo.
[Diagram] Basic architecture: two Data Reservoirs (cache disks with local file accesses) exchange distributed shared data over a high-latency, very high-bandwidth network using disk-block-level, parallel, multi-stream transfer (DSM-like architecture).
[Diagram] File accesses on Data Reservoir: scientific detectors and user programs reach file servers through an IP switch; file servers stripe data over disk servers (1st-level and 2nd-level striping), with disk access by iSCSI. Disk servers are IBM x345 (2.6 GHz x 2).
[Diagram] Global data transfer: scientific detectors and user programs access file servers via an IP switch; disk servers perform iSCSI bulk transfer across the global network.
Problems found in the 1st-generation Data Reservoir
Low TCP bandwidth due to packet losses
– TCP congestion window size control
– Very slow recovery after the fast-recovery phase (> 20 min; see the rough estimate below)
Imbalance among parallel iSCSI streams
– Packet scheduling by switches and routers
– The user and other network users care only about the aggregate behavior of the parallel TCP streams
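The slow recovery is what standard congestion avoidance predicts on a long fat pipe: the window grows by roughly one segment per RTT, so rebuilding a large window takes thousands of RTTs. A rough estimate with numbers close to the experiments on the following slides (assumed, not measured here):

    # Congestion-avoidance recovery estimate (illustrative figures).
    rate_bps = 500e6          # ~500 Mbps per stream
    rtt_s = 0.200             # ~200 ms RTT (see the packet-rate slide below)
    mss_bits = 1460 * 8       # TCP payload bits per 1500-byte frame

    cwnd_full = rate_bps * rtt_s / mss_bits    # ≈ 8,600 segments needed for full rate
    # Growing back at ~1 segment per RTT from a near-empty window:
    print(f"recovery ≈ {cwnd_full * rtt_s / 60:.0f} minutes")    # ≈ 29 minutes

This is the same order as the 28-minute recovery reported a few slides later.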
[Plot] Fast Ethernet vs. GbE: iperf over 30 seconds; minimum and average throughput are higher on Fast Ethernet than on GbE.
Packet Transmission Rate
Bursty behavior
– Transmission within 20 ms of a 200 ms RTT
– Idle for the remaining 180 ms
Packet losses occurred
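The burst arises because the whole window is emitted at GbE line rate; a quick check with the numbers above also suggests why Fast Ethernet beat GbE on the previous slide:

    # Window emitted at GbE line rate in a 20 ms burst, then idle for the rest of the RTT.
    line_rate_bps = 1e9       # GbE sender
    burst_s = 0.020           # observed transmission period
    rtt_s = 0.200             # observed RTT

    window_bytes = line_rate_bps / 8 * burst_s
    avg_rate_bps = window_bytes * 8 / rtt_s
    print(f"window  ≈ {window_bytes / 1e6:.1f} MB")    # ≈ 2.5 MB
    print(f"average ≈ {avg_rate_bps / 1e6:.0f} Mbps")  # ≈ 100 Mbps
    # Fast Ethernet delivers the same ~100 Mbps average while naturally pacing the
    # packets over the whole RTT, so it suffers no burst losses.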
Packet Spacing
Ideal story
– Transmit one packet every RTT/cwnd
– 24 μs interval for 500 Mbps (MTU 1500 B)
– Too high a load for a software-only implementation
– Low overhead because it is used only during the slow-start phase
[Diagram] Packets evenly spaced at RTT/cwnd intervals across one RTT.
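A minimal sketch of the interval computation behind the 24 μs figure (pacing one packet every RTT/cwnd is equivalent to one packet every packet_size / target_rate):

    # Inter-packet interval for paced transmission (figures from this slide).
    target_rate_bps = 500e6
    mtu_bytes = 1500

    interval_s = mtu_bytes * 8 / target_rate_bps
    print(f"interval = {interval_s * 1e6:.0f} μs")    # 24 μs
    # Hitting a 24 μs schedule purely in software is expensive, which is why the
    # slide limits its use to the slow-start phase.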
Example Case: IPG of 8
Successful fast retransmit
– Smooth transition to congestion avoidance
– Congestion avoidance takes 28 minutes to recover to 550 Mbps
Best Case: 1023 B IPG
Behaves like the Fast Ethernet case
– Proper transmission rate
Spurious retransmits due to reordering
Imbalance within Parallel TCP Streams
Imbalance among parallel iSCSI streams
– Packet scheduling by switches and routers
– Meaningless unfairness among the parallel streams
– The user and other network users care only about the aggregate behavior of the parallel TCP streams
Our approach (sketched below)
– Keep Σ cwnd_i constant, for TCP usage that stays fair to other users
– Balance the individual cwnd_i by communicating between the parallel TCP streams
[Plots] Bandwidth vs. time for the parallel streams.
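A minimal sketch of the balancing idea, assuming the per-stream congestion windows can be read and rewritten (hypothetical values; the real mechanism coordinates the kernel TCP stacks of the parallel streams):

    def rebalance(cwnds):
        """Even out per-stream windows while keeping the aggregate Σ cwnd_i constant."""
        total = sum(cwnds)                 # unchanged, so other users see the same load
        n = len(cwnds)
        share, rest = divmod(total, n)
        return [share + (1 if i < rest else 0) for i in range(n)]

    # Example: scheduling artifacts have skewed 8 parallel iSCSI streams.
    print(rebalance([1200, 300, 900, 100, 1500, 200, 800, 1000]))   # -> eight windows of 750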
3rd-Generation Data Reservoir
Hardware and software basis for 100 Gbps distributed data-sharing systems
10 Gbps disk data transfer by a single Data Reservoir server
Transparent support for multiple filesystems (detection of modified disk blocks; see the sketch below)
Hardware (FPGA) implementation of inter-layer coordination mechanisms
10 Gbps Long Fat pipe Network emulator and 10 Gbps data logger
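A minimal sketch of filesystem-transparent change detection at the disk-block level, assuming per-block checksums kept from the previous transfer (hypothetical block size and interface, not the actual Data Reservoir code):

    import hashlib

    BLOCK_SIZE = 64 * 1024   # hypothetical block granularity

    def modified_blocks(device_path, previous_digests):
        """Yield (index, digest) for blocks whose content changed since the last scan."""
        with open(device_path, "rb") as dev:
            index = 0
            while True:
                block = dev.read(BLOCK_SIZE)
                if not block:
                    break
                digest = hashlib.sha1(block).digest()
                if previous_digests.get(index) != digest:
                    yield index, digest
                index += 1

Because the comparison happens below the filesystem, any filesystem laid out on the volume is supported without modification.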
Utilization of a 10 Gbps Network
A single-box 10 Gbps Data Reservoir server
– Quad Opteron server with multiple PCI-X buses (prototype: Sun V40z server)
– Two Chelsio T110 TCP-offloading NICs
– Disk arrays for the necessary disk bandwidth
– Data Reservoir software (iSCSI daemon, disk driver, data transfer manager)
[Diagram] Server internals: two Chelsio T110 NICs and Ultra320 SCSI adaptors on separate PCI-X buses of the quad-Opteron Sun V40z running Linux and the Data Reservoir software, connected by 10GBASE-SR to a 10G Ethernet switch.
Tokyo-CERN Experiment (Oct. 2004)
CERN - Amsterdam - Chicago - Seattle - Tokyo
– SURFnet – CA*net 4 – IEEAF/Tyco – WIDE
– 18,500 km WAN PHY connection
Performance results (see the overhead check below)
– 7.21 Gbps (TCP payload), standard Ethernet frame size, iperf
– 7.53 Gbps (TCP payload), 8 KB jumbo frames, iperf
– 8.8 Gbps disk-to-disk performance: 9 servers, 36 disks, 36 parallel TCP streams
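The difference between the standard-frame and jumbo-frame iperf numbers is roughly what per-packet header overhead predicts; a back-of-the-envelope check (assuming plain IPv4/TCP headers and standard Ethernet framing):

    # Maximum TCP payload rate on 10GbE: per frame, 38 B of Ethernet overhead
    # (preamble 8 + header 14 + FCS 4 + inter-frame gap 12) plus 40 B of IPv4+TCP headers.
    def max_payload_gbps(mtu_bytes, line_gbps=10):
        return line_gbps * (mtu_bytes - 40) / (mtu_bytes + 38)

    for mtu in (1500, 8192):
        print(f"MTU {mtu}: ≤ {max_payload_gbps(mtu):.2f} Gbps payload")
    # ≈ 9.49 Gbps and ≈ 9.91 Gbps; the measured 7.2–7.5 Gbps sits well below both,
    # pointing at the end systems (I/O bus) rather than framing overhead.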
[Map] Tokyo-CERN network connection used in the experiment: Tokyo - Seattle - Vancouver - Calgary - Minneapolis - Chicago - Amsterdam - Geneva over IEEAF, CANARIE CA*net 4, and SURFnet; end systems and L1/L2 switches marked.
[Diagram] Network topology of the CERN-Tokyo experiment: Data Reservoirs at the University of Tokyo and at CERN (Geneva), each with IBM x345 servers (dual Intel Xeon 2.4 GHz, 2 GB memory, Linux) and Opteron servers (dual Opteron 248 2.2 GHz, 1 GB memory, Chelsio T110 NIC) behind GbE and 10 GbE switches (Fujitsu XG800, Foundry BI MG8, Foundry FEX x448, Foundry NetIron 40G, Extreme Summit 400); path Tokyo (T-LEX) - Seattle (Pacific Northwest Gigapop) - Vancouver - Minneapolis - Chicago (StarLight) - Amsterdam (NetherLight) - CERN over WIDE/IEEAF, CA*net 4, and SURFnet, 10GBASE-LW.
LSR (Land Speed Record) Experiments
Target
– > 30,000 km LSR distance
– L3 switching at Chicago and Amsterdam
– Period of the experiment: 12/20 – 1/3, the holiday season, when public research networks are mostly idle
System configuration
– A pair of Opteron servers with Chelsio T110 NICs (at N-Otemachi)
– Another pair of Opteron servers with Chelsio T110 NICs for competing-traffic generation
– ClearSight 10 Gbps packet analyzer for packet capturing
[Map] Network used in the LSR experiment: Tokyo - Seattle - Vancouver - Calgary - Minneapolis - Chicago - Amsterdam, returning via New York, over IEEAF/Tyco/WIDE, CANARIE CA*net 4, SURFnet, APAN/JGN2, and Abilene; routers/L3 switches and L1/L2 switches marked.
[Diagram] Single-stream TCP route: Tokyo - Chicago - Amsterdam - New York - Chicago - Tokyo. Opteron servers with Chelsio T110 NICs and a ClearSight 10 Gbps capture unit at Tokyo (T-LEX, Fujitsu XG800); outbound over IEEAF/Tyco/WIDE and CANARIE CA*net 4 via Seattle (Pacific Northwest Gigapop), Vancouver, Calgary, and Minneapolis to Chicago StarLight (Force10 E1200), then SURFnet OC-192 to Amsterdam NetherLight (University of Amsterdam, Force10 E600); return via New York MANLAN, Abilene (T640), APAN/JGN2 TransPAC (Procket 8801/8812), and WIDE back to the University of Tokyo. OC-192 / WAN PHY links; other equipment includes OME 6550, ONS, HDXc, and Cisco 6509.
[Plots] Network traffic observed on routers and switches during the submitted run: StarLight Force10 E1200, University of Amsterdam Force10 E600, Abilene T640 (NYCM to CHIN), and TransPAC Procket 8801.
Summary
Single-stream TCP
– We removed the TCP-related difficulties
– The I/O bus bandwidth is now the bottleneck
– Cheap, simple servers can enjoy a 10 Gbps network
Lack of methodology for high-performance network debugging
– 3 days of debugging (working overnight)
– 1 day of stable operation (usable for measurements)
– Networks seem to fatigue: some trouble always happens
– We need something more effective
Detailed issues
– Flow control (and QoS)
– Buffer size and policy
– Optical-level settings
[Photos] Systems used in the long-distance TCP experiments at CERN, Pittsburgh, and Tokyo.
Efficient and Effective Utilization of the High-speed Internet
Efficient and effective utilization of a 10 Gbps network is still very difficult
PHY, MAC, data link, and switches
– 10 Gbps is ready to use
Network interface adaptor
– 8 Gbps is ready to use, 10 Gbps within several months
– Proper offloading and RDMA implementation
I/O bus of a server
– 20 Gbps is necessary to drive a 10 Gbps network
Drivers and operating system
– Too many interrupts, buffer memory management
File system
– Slow NFS service
– Consistency problems
Difficulty in a 10 Gbps Data Reservoir
Disk-to-disk single-stream TCP data transfer
– High CPU utilization (performance limited by the CPU)
  Too many context switches
  Too many interrupts from the network adaptor (> 30,000/s)
  Data copies from buffer to buffer
– I/O bus bottleneck (rough figures below)
  PCI-X: maximum 7.6 Gbps data transfer
  Waiting for PCI-X 266 or PCI Express x8/x16 NICs
– Disk performance
  Performance limit of the RAID adaptor
  Number of disks needed for data transfer (> 40 disks required)
File system
– High-bandwidth file service is harder than data sharing
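Two of the figures above are easy to sanity-check (rough numbers, assuming 1500-byte frames and a 64-bit / 133 MHz PCI-X bus):

    # Packet rate at 10 Gbps and raw PCI-X 133 bandwidth.
    frame_on_wire = 1538                      # 1500 B MTU + Ethernet framing overhead
    packets_per_s = 10e9 / 8 / frame_on_wire
    print(f"≈ {packets_per_s:,.0f} packets/s")        # ≈ 813,000/s at line rate;
    # even with interrupt coalescing this leaves tens of thousands of interrupts/s.

    pcix_bps = 64 * 133e6                     # 64-bit PCI-X bus at 133 MHz
    print(f"PCI-X raw ≈ {pcix_bps / 1e9:.1f} Gbps")   # ≈ 8.5 Gbps raw, ≈ 7.6 Gbps
    # effective after bus protocol overhead, matching the figure above.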
High-speed IP Networks in Supercomputing (GRAPE-DR Project)
World's fastest computing system
– 2 PFLOPS in 2008 (performance on actual application programs)
Construction of a general-purpose, massively parallel architecture
– Low power consumption at PFLOPS-range performance
– An MPP architecture more general-purpose than vector architectures
Use of commodity networks for the interconnect
– 10 Gbps optical network (2008) + MEMS switches
– 100 Gbps optical network (2010)
[Chart] Peak performance (FLOPS) versus year for parallel-processor systems and processor chips: Earth Simulator at 40 TFLOPS, GRAPE-DR target at 2 PFLOPS, KEISOKU supercomputer at 10 PFLOPS.
GRAPE-DR Architecture
Massively parallel processor
– Pipelined connection of a large number of PEs
SIMASD (Single Instruction on Multiple And Shared Data)
– All instructions operate on data in local memory and shared memory
– An extension of the vector architecture (illustrated below)
Issues
– Compiler for the SIMASD architecture (currently under development: flat-C)
[Diagram] Chip organization: 512 PEs, each with an integer ALU, a floating-point ALU, and local memory, connected by an on-chip network to on-chip shared memory and the outside world.
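A toy illustration of the SIMASD idea described above, assuming each PE holds data in its local memory and every PE sees the same shared operand per step (purely illustrative, not the GRAPE-DR instruction set):

    # SIMASD: every PE executes the same instruction on its own local data
    # combined with a value broadcast from the on-chip shared memory.
    def simasd_step(local_data, shared_value, op):
        """One lock-step instruction across all PEs."""
        return [op(x, shared_value) for x in local_data]

    local = [1.0, 2.0, 3.0, 4.0]     # one element per PE's local memory
    shared = 0.5                     # operand read from shared memory
    print(simasd_step(local, shared, lambda x, s: x * s))   # [0.5, 1.0, 1.5, 2.0]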
Hierarchical Construction of GRAPE-DR
– 512 PE / chip: 512 GFLOPS / chip
– 2K PE / PCI board: 2 TFLOPS / PCI board
– 8K PE / server: 8 TFLOPS / server
– 1M PE / node: 1 PFLOPS / node
– 2M PE / system: 2 PFLOPS / system
Network Architecture inside a GRAPE-DR System
[Diagram] AMD-based servers (memory bus, optical interface) connect through a 100 Gbps outside IP network, a MEMS-based optical switch, and a highly functional router to an IP storage system (iSCSI servers); an adaptive compiler and a total-system conductor perform dynamic optimization.
Kei Hiraki University of Tokyo Fujitsu Computer Technologies, LTD