Realization and Utilization of high-BW TCP on Real Applications
Kei Hiraki
Data Reservoir / GRAPE-DR project, The University of Tokyo

Computing Systems for Real Scientists
Fast CPUs, huge memory and disks, good graphics
 – Cluster technology, DSM technology, graphics processors
 – Grid technology
Very fast remote file access
 – Global file systems, data-parallel file systems, replication facilities
Transparency to local computation
 – No complex middleware; no (or only small) modifications to existing software
Real scientists are not computer scientists
Computer scientists are not a workforce for real scientists

Objectives of Data Reservoir / GRAPE-DR (1)
Sharing scientific data between distant research institutes
 – Physics, astronomy, earth science, simulation data
Very high-speed single-file transfer on Long Fat pipe Networks
 – > 10 Gbps, > 20,000 km, > 400 ms RTT
High utilization of the available bandwidth
 – Transferred file data rate > 90% of available bandwidth, including header overheads and initial negotiation overheads
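For reference on how close "> 90% of available bandwidth" sits to the ceiling, here is a rough upper bound on TCP payload rate over 10 Gigabit Ethernet with a standard 1500-byte MTU; the per-frame overheads in the comments are standard values assumed for this sketch, not figures from the slides:

    # Rough upper bound on TCP goodput over 10GbE with a 1500-byte MTU.
    # Assumed per-frame overheads: 12 B inter-frame gap, 8 B preamble,
    # 14 B Ethernet header, 4 B FCS, 20 B IPv4 header, 20 B TCP header.
    LINK_RATE = 10e9                      # bits/s
    MTU = 1500                            # IP packet size in bytes
    wire_overhead = 12 + 8 + 14 + 4       # bytes on the wire outside the IP packet
    payload = MTU - 20 - 20               # TCP payload bytes per packet
    efficiency = payload / (MTU + wire_overhead)
    print(f"max TCP payload rate: {efficiency * LINK_RATE / 1e9:.2f} Gbps "
          f"({efficiency:.1%} of the link rate)")
    # -> about 9.49 Gbps (94.9%), so a 90% goodput target leaves only a few
    #    percent of headroom below the theoretical ceiling.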

Objectives of Data Reservoir / GRAPE-DR (2)
GRAPE-DR: very high-speed attached processor for a server
 – 2004 – 2008
 – Successor of the GRAPE-6 astronomical simulator
2 PFLOPS on a 128-node cluster system
 – 1 GFLOPS / processor
 – 1024 processors / chip
 – 8 chips / PCI card
 – 2 PCI cards / server
 – 2 M processors / system
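As a quick sanity check, multiplying out the per-level counts on this slide reproduces the headline numbers:

    # Multiply out the per-level counts from the slide.
    GFLOPS_PER_PROCESSOR = 1
    PROCESSORS_PER_CHIP = 1024
    CHIPS_PER_CARD = 8
    CARDS_PER_SERVER = 2
    SERVERS = 128                          # 128-node cluster system

    processors = PROCESSORS_PER_CHIP * CHIPS_PER_CARD * CARDS_PER_SERVER * SERVERS
    print(f"{processors:,} processors")                             # 2,097,152 ~ 2 M
    print(f"{processors * GFLOPS_PER_PROCESSOR / 1e6:.1f} PFLOPS")  # ~2.1 PFLOPS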

Data-intensive scientific computation through global networks
[Diagram: scientific data sources (GRAPE-6, Belle experiments, X-ray astronomy satellite ASUKA, SUBARU Telescope, Nobeyama Radio Observatory (VLBI), nuclear experiments, Digital Sky Survey) feed Data Reservoirs over a very high-speed network; data analysis at the University of Tokyo uses local accesses to the distributed shared files.]

Basic Architecture
[Diagram: two Data Reservoirs connected by a high-latency, very-high-bandwidth network; distributed shared data (DSM-like architecture) held on cache disks; local file accesses at each end; disk-block-level parallel and multi-stream transfer between the reservoirs.]
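A minimal sketch of the disk-block-level, multi-stream transfer idea: consecutive disk blocks are dealt round-robin onto parallel streams, so one logical transfer becomes several independent TCP/iSCSI connections. The stream count and the assignment rule below are illustrative assumptions, not the actual Data Reservoir parameters:

    # Illustrative only: stripe a run of disk blocks across parallel streams.
    N_STREAMS = 8                 # assumed number of parallel iSCSI/TCP streams

    def assign_blocks(first_block, n_blocks, n_streams=N_STREAMS):
        """Return, per stream, the list of disk-block numbers it will carry."""
        per_stream = [[] for _ in range(n_streams)]
        for b in range(first_block, first_block + n_blocks):
            per_stream[b % n_streams].append(b)     # round-robin placement
        return per_stream

    for i, blocks in enumerate(assign_blocks(0, 32)):
        print(f"stream {i}: blocks {blocks}")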

File accesses on Data Reservoir
[Diagram: scientific detectors and user programs reach file servers through an IP switch; file servers stripe data over disk servers (1st-level and 2nd-level striping); disks are accessed by iSCSI; servers are IBM x345 (2 x 2.6 GHz).]

Global Data Transfer
[Diagram: scientific detectors, user programs, file servers, and IP switches at each site; disk servers perform iSCSI bulk transfer across the global network.]

Problems found in the 1st-generation Data Reservoir
Low TCP bandwidth due to packet losses
 – TCP congestion window size control
 – Very slow recovery from the fast recovery phase (> 20 min)
Unbalance among parallel iSCSI streams
 – Packet scheduling by switches and routers
 – The user and other network users care only about the total behavior of the parallel TCP streams

Fast Ethernet vs. GbE
iperf over 30 seconds; minimum and average throughput: Fast Ethernet > GbE
[Chart: throughput over time for FE and GbE]

Packet Transmission Rate
Bursty behavior
 – Transmission within 20 ms against an RTT of 200 ms
 – Idle for the remaining 180 ms
Packet loss occurred

Packet Spacing
Ideal story
 – Transmit one packet every RTT/cwnd
 – 24 μs interval for 500 Mbps (MTU 1500 B)
 – Too high a load for a software-only implementation
 – Low overhead in practice, because pacing is mainly needed in the slow start phase
[Diagram: packets evenly spaced by RTT/cwnd across one RTT]
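The figures on this slide follow directly from the arithmetic below: filling a 200 ms pipe at 500 Mbps with 1500-byte packets means sending one packet roughly every 24 μs, which is the same as spacing packets by RTT/cwnd instead of emitting the window as a burst:

    # Packet-spacing arithmetic for the slide's example.
    RATE = 500e6                  # target sending rate, bits/s
    MTU = 1500                    # bytes per packet
    RTT = 0.2                     # seconds (200 ms)

    interval = MTU * 8 / RATE                  # time between packet departures
    cwnd = RATE * RTT / (MTU * 8)              # window needed to fill the pipe
    print(f"inter-packet interval: {interval * 1e6:.0f} us")              # ~24 us
    print(f"cwnd to sustain 500 Mbps at 200 ms RTT: {cwnd:.0f} packets")  # ~8333
    print(f"RTT / cwnd = {RTT / cwnd * 1e6:.0f} us")                      # ~24 us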

Example case: IPG of 8
Success of Fast Retransmit
 – Smooth transition to Congestion Avoidance
 – Congestion Avoidance takes 28 minutes to recover to 550 Mbps
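The 28-minute figure is consistent with standard congestion avoidance growing the window by one segment per RTT; a back-of-envelope estimate, assuming 1500-byte segments and the 200 ms RTT from the earlier slides:

    # Time for congestion avoidance (+1 segment per RTT) to rebuild the
    # window needed for 550 Mbps at a 200 ms RTT.
    TARGET = 550e6                # bits/s
    MTU = 1500                    # bytes
    RTT = 0.2                     # seconds

    target_cwnd = TARGET * RTT / (MTU * 8)        # ~9167 segments
    print(f"target cwnd: {target_cwnd:.0f} segments")
    print(f"recovery from cwnd/2: {target_cwnd / 2 * RTT / 60:.0f} min")  # ~15 min
    print(f"recovery from ~0:     {target_cwnd * RTT / 60:.0f} min")      # ~31 min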

Best case: IPG of 1023 B
Behaves like the Fast Ethernet case
 – Proper transmission rate
Spurious retransmits due to reordering

Unbalance within parallel TCP streams
Unbalance among parallel iSCSI streams
 – Packet scheduling by switches and routers
 – Meaningless unfairness among the parallel streams
 – The user and other network users care only about the total behavior of the parallel TCP streams
Our approach
 – Keep Σ cwnd_i constant for fair TCP network usage toward other users
 – Balance the individual cwnd_i by communicating between the parallel TCP streams (sketched below)
[Charts: bandwidth over time, unbalanced vs. balanced streams]
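A minimal sketch of the balancing idea, under the assumption that windows are shifted gradually toward the group mean while the aggregate Σ cwnd_i is held constant; the update rule and names here are illustrative, not the actual implementation:

    # Illustrative only: rebalance per-stream congestion windows while
    # keeping their sum constant, so the aggregate stays fair to other users.
    def rebalance(cwnds, alpha=0.1):
        """Move each window a fraction alpha toward the mean, preserving the total."""
        total = sum(cwnds)
        mean = total / len(cwnds)
        new = [c + alpha * (mean - c) for c in cwnds]
        new[-1] += total - sum(new)       # absorb rounding drift in the last stream
        return new

    cwnds = [120.0, 10.0, 95.0, 15.0]     # imbalanced parallel streams
    for _ in range(3):
        cwnds = rebalance(cwnds)
        print([round(c, 1) for c in cwnds], "sum =", round(sum(cwnds), 1))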

3rd Generation Data Reservoir
Hardware and software basis for 100 Gbps distributed data-sharing systems
10 Gbps disk data transfer by a single Data Reservoir server
Transparent support for multiple filesystems (detection of modified disk blocks)
Hardware (FPGA) implementation of inter-layer coordination mechanisms
10 Gbps Long Fat pipe Network emulator and 10 Gbps data logger
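One filesystem-independent way to detect modified disk blocks is to hash fixed-size blocks of the underlying device and compare the digests against the previous scan. The sketch below illustrates that idea only; the block size and the hashing approach are assumptions, not necessarily the mechanism used in the 3rd-generation system:

    # Illustrative only: detect changed blocks by comparing per-block digests.
    import hashlib

    BLOCK_SIZE = 1 << 20          # assumed 1 MiB block granularity

    def block_digests(path):
        """Map block number -> SHA-1 digest for a device or image file."""
        digests = {}
        with open(path, "rb") as dev:
            block_no = 0
            while True:
                data = dev.read(BLOCK_SIZE)
                if not data:
                    break
                digests[block_no] = hashlib.sha1(data).digest()
                block_no += 1
        return digests

    def modified_blocks(old, new):
        """Block numbers whose contents changed since the previous scan."""
        return sorted(b for b, d in new.items() if old.get(b) != d)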

Utilization of a 10 Gbps network
A single-box 10 Gbps Data Reservoir server
 – Quad Opteron server with multiple PCI-X buses (prototype: SUN V40z server)
 – Two Chelsio T110 TCP-offloading NICs
 – Disk arrays for the necessary disk bandwidth
 – Data Reservoir software (iSCSI daemon, disk driver, data transfer manager)
[Diagram: quad Opteron server (SUN V40z, Linux) with Chelsio T110 NICs and Ultra320 SCSI adaptors on separate PCI-X buses, connected to a 10G Ethernet switch over 10GBASE-SR.]

Tokyo-CERN experiment (Oct. 2004)
CERN - Amsterdam - Chicago - Seattle - Tokyo
 – SURFnet – CA*net 4 – IEEAF/Tyco – WIDE
 – 18,500 km WAN PHY connection
Performance results
 – 7.21 Gbps (TCP payload), standard Ethernet frame size, iperf
 – 7.53 Gbps (TCP payload), 8K jumbo frames, iperf
 – 8.8 Gbps disk-to-disk: 9 servers, 36 disks, 36 parallel TCP streams

Network used in the Tokyo-CERN experiment
[Map: Tokyo - Seattle - Vancouver - Calgary - Minneapolis - Chicago - Amsterdam - Geneva over IEEAF, CANARIE (CA*net 4), and SURFnet; end systems and L1/L2 switches marked.]

Network topology of the CERN-Tokyo experiment
Path: Tokyo - Seattle - Vancouver - Minneapolis - Chicago - Amsterdam - CERN (Geneva), via T-LEX, Pacific Northwest Gigapop, StarLight, and NetherLight, over WIDE/IEEAF, CA*net 4, and SURFnet (10GBASE-LW).
Data Reservoir at the Univ. of Tokyo and at CERN (Geneva), each with:
 – IBM x345 servers (dual Intel Xeon 2.4 GHz, 2 GB memory; Linux 2.4.x on No. 1, Linux on Nos. 2-7), GbE
 – Opteron servers (dual Opteron 248, 2.2 GHz, 1 GB memory, Linux) with Chelsio T110 NICs
 – Switches used: Fujitsu XG800, Foundry BI MG8, Foundry FEX x448, Foundry NetIron 40G, Extreme Summit 400

LSR experiments
Target
 – > 30,000 km LSR distance
 – L3 switching at Chicago and Amsterdam
 – Period of the experiment: 12/20 – 1/3, the holiday season, when the public research networks are relatively idle
System configuration
 – A pair of Opteron servers with Chelsio T110 NICs (at N-Otemachi)
 – Another pair of Opteron servers with Chelsio T110 NICs for generating competing traffic
 – ClearSight 10 Gbps packet analyzer for packet capturing

Network used in the LSR experiment
[Map (Figure 2): Tokyo - Seattle - Vancouver - Calgary - Minneapolis - Chicago - Amsterdam, plus New York, over IEEAF/Tyco/WIDE, CANARIE (CA*net 4), SURFnet, APAN/JGN2, and Abilene; routers/L3 switches and L1/L2 switches marked.]

Single-stream TCP path: Tokyo – Chicago – Amsterdam – NY – Chicago – Tokyo
[Diagram: Opteron servers with Chelsio T110 NICs and a ClearSight 10 Gbps capture unit at the University of Tokyo (T-LEX, Fujitsu XG800); the path runs over OC-192 / WAN PHY circuits through Seattle (Pacific Northwest Gigapop), Vancouver, Calgary, Minneapolis, Chicago (StarLight, Force10 E1200), New York (MANLAN), and Amsterdam (NetherLight, University of Amsterdam Force10 E600) on the IEEAF/Tyco/WIDE, CANARIE, SURFnet, Abilene, and APAN/JGN2 TransPAC networks; equipment along the way includes OME 6550, ONS, and HDXc optical nodes, T640, Procket 8801/8812, and Cisco 6509 routers and switches, and a Foundry NetIron 40G.]

Network traffic on routers and switches during the submitted run
[Traffic graphs: StarLight Force10 E1200, University of Amsterdam Force10 E600, Abilene T640 (NYCM to CHIN), TransPAC Procket 8801.]

Summary
Single-stream TCP
 – We removed the TCP-related difficulties
 – Now I/O bus bandwidth is the bottleneck
 – Cheap and simple servers can enjoy a 10 Gbps network
Lack of methodology in high-performance network debugging
 – 3 days of debugging (working overnight)
 – 1 day of stable operation (usable for measurements)
 – The network seems to "feel fatigue": some trouble always happens
 – We need something more effective
Detailed issues
 – Flow control (and QoS)
 – Buffer size and policy
 – Optical-level settings

Systems used in long-distance TCP experiments: CERN, Pittsburgh, Tokyo

Efficient and effective utilization of the high-speed Internet
Efficient and effective utilization of a 10 Gbps network is still very difficult:
PHY, MAC, data link, and switches
 – 10 Gbps is ready to use
Network interface adaptor
 – 8 Gbps is ready to use, 10 Gbps in several months
 – Needs proper offloading and an RDMA implementation
I/O bus of a server
 – 20 Gbps is necessary to drive a 10 Gbps network (disk-to-network data crosses the I/O bus twice: storage to memory, then memory to NIC)
Drivers, operating system
 – Too many interrupts, buffer memory management
File system
 – Slow NFS service
 – Consistency problems

Difficulty in the 10 Gbps Data Reservoir
Disk-to-disk single-stream TCP data transfer
 – High CPU utilization (performance limited by the CPU)
   Too many context switches
   Too many interrupts from the network adaptor (> 30,000/s)
   Data copies from buffer to buffer
 – I/O bus bottleneck
   PCI-X: maximum 7.6 Gbps of data transfer
   Waiting for PCI-X 266 or PCI Express x8/x16 NICs
 – Disk performance
   Performance limit of the RAID adaptor
   Number of disks needed for the transfer (> 40 disks are required)
File system
 – High-bandwidth file service is more difficult than data sharing
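A rough version of the disk and bus arithmetic behind these bullets; the per-disk rate of about 30 MB/s sustained behind the RAID adaptors is an assumption, while the 7.6 Gbps PCI-X figure is the one quoted above:

    # Back-of-envelope for the 10 Gbps disk-to-disk case.
    TARGET = 10e9                      # bits/s of network payload
    PER_DISK = 30e6 * 8                # assumed sustained rate per disk, bits/s
    PCI_X = 7.6e9                      # effective PCI-X throughput from the slide

    print(f"disks needed: {TARGET / PER_DISK:.0f}")        # ~42, i.e. "> 40 disks"
    print(f"share of one PCI-X bus for the NIC alone: {TARGET / PCI_X:.2f}")
    # -> 1.32: a single PCI-X bus cannot even carry the NIC traffic, hence the
    #    multiple PCI-X buses and the wait for PCI-X 266 / PCI Express NICs.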

High-speed IP networks in supercomputing (GRAPE-DR project)
World's fastest computing system
 – 2 PFLOPS in 2008 (performance on actual application programs)
Construction of a general-purpose massively parallel architecture
 – Low power consumption at PFLOPS-range performance
 – An MPP architecture more general-purpose than vector architectures
Use of commodity networks for the interconnect
 – 10 Gbps optical network (2008) + MEMS switches
 – 100 Gbps optical network (2010)

[Chart: peak performance (FLOPS, axis from 1K up to 1Y) of parallel processors and processor chips by year; Earth Simulator at 40 TFLOPS, GRAPE-DR target performance of 2 PFLOPS, KEISOKU supercomputer at 10 PFLOPS.]

GRAPE-DR architecture
Massively parallel processor
 – 512 PEs per chip, each with an integer ALU, a floating-point ALU, and local memory
 – PEs connected to on-chip shared memory and, through an on-chip network, to the outside world
 – Pipelined connection of a large number of PEs
SIMASD (Single Instruction on Multiple and Shared Data)
 – All instructions operate on data in local memory and shared memory
 – An extension of vector architecture
Issues
 – Compiler for the SIMASD architecture (currently under development: flat-C)

Hierarchical construction of GRAPE-DR
[Diagram: processing elements plus memory at each level of the hierarchy]
 – 512 PE / chip = 512 GFLOPS / chip
 – 2K PE / PCI board = 2 TFLOPS / PCI board
 – 8K PE / server = 8 TFLOPS / server
 – 1M PE / node = 1 PFLOPS / node
 – 2M PE / system = 2 PFLOPS / system

Network architecture inside a GRAPE-DR system
[Diagram: AMD-based servers (memory, memory bus) with an adaptive compiler and optical interfaces; a MEMS-based optical switch interconnects the servers; highly functional routers connect to the outside IP network at 100 Gbps; iSCSI servers provide the IP storage system; a total-system conductor performs dynamic optimization.]

Fujitsu Computer Technologies, LTD