HENP Grids and Networks: Global Virtual Organizations
Optical Networks and Grids: Meeting the Advanced Network Needs of Science
Harvey B. Newman, Caltech, May 7, 2002
http://l3www.cern.ch/~newman/HENPGridsNets_I2Mem.ppt
Computing Challenges: Petabytes, Petaflops, Global VOs
è Geographical dispersion: of people and resources
è Complexity: the detector and the LHC environment
è Scale: Tens of Petabytes per year of data
Physicists at 250+ Institutes in 60+ Countries
Major challenges associated with:
Communication and collaboration at a distance
Managing globally distributed computing & data resources
Cooperative software development and physics analysis
New Forms of Distributed Systems: Data Grids
Four LHC Experiments: The Petabyte to Exabyte Challenge
ATLAS, CMS, ALICE, LHCb
Higgs + New particles; Quark-Gluon Plasma; CP Violation
Data stored: ~40 Petabytes/Year and UP; CPU: 0.30 Petaflops and UP
0.1 to 1 Exabyte (1 EB = 10^18 Bytes), from ~2007 to ~2012(?), for the LHC Experiments
10^9 events/sec; selectivity: 1 in 10^13 (1 person in a thousand world populations)
LHC: Higgs Decay into 4 muons (Tracker only); 1000X the LEP Data Rate
LHC Data Grid Hierarchy (diagram)
Experiment / Online System feeds Tier 0 at ~PByte/sec (raw)
Tier 0 +1 (CERN): 700k SI95; ~1 PB Disk; Tape Robot
Tier 1 centers (FNAL: 200k SI95, 600 TB; IN2P3 Center; INFN Center; RAL Center): linked at ~2.5 Gbps
Tier 2 centers: ~2.5 Gbps
Tier 3: Institutes (~0.25 TIPS each), 0.1-1 Gbps; Physics data cache
Tier 4: Workstations, ~MBytes/sec
Physicists work on analysis "channels"; each institute has ~10 physicists working on one or more channels
CERN/Outside Resource Ratio ~1:2; Tier0 : (Sum of Tier1) : (Sum of Tier2) ~ 1:1:1
Next Generation Networks for Experiments: Goals and Needs
u Providing rapid access to event samples, subsets and analyzed physics results from massive data stores
è From Petabytes by 2002, ~100 Petabytes by 2007, to ~1 Exabyte by ~2012
u Advanced integrated applications, such as Data Grids, rely on seamless operation of our LANs and WANs
è With reliable, monitored, quantifiable high performance
u Providing analyzed results with rapid turnaround, by coordinating and managing the LIMITED computing, data handling and NETWORK resources effectively
u Enabling rapid access to the data and the collaboration
è Across an ensemble of networks of varying capability
Large data samples explored and analyzed by thousands of globally dispersed scientists, in hundreds of teams
Baseline BW for the US-CERN Link: HENP Transatlantic WG (DOE+NSF)
US-CERN Link: 622 Mbps this month
DataTAG 2.5 Gbps Research Link in Summer 2002; 10 Gbps Research Link by approx. mid-2003
Transoceanic networking integrated with the Abilene, TeraGrid, regional nets and continental network infrastructures in the US, Europe, Asia, South America
Baseline evolution typical of major HENP links
Transatlantic Net WG (HN, L. Price): Bandwidth Requirements [*]
[*] Installed BW; maximum link occupancy of 50% assumed
Links Required to US Labs and Transatlantic [*]
[*] Maximum link occupancy of 50% assumed; OC3 = 155 Mbps; OC12 = 622 Mbps; OC48 = 2.5 Gbps; OC192 = 10 Gbps
AMS-IX Internet Exchange Throughput: Accelerating Growth in Europe (NL)
Monthly Traffic: 2X growth from 8/00 - 3/01; 2X growth from 8/ /01
(Chart: hourly traffic on 3/22/02; scale 2.0 / 4.0 / 6.0 Gbps)
Emerging Data Grid User Communities
u NSF Network for Earthquake Engineering Simulation (NEES)
è Integrated instrumentation, collaboration, simulation
u Grid Physics Network (GriPhyN)
è ATLAS, CMS, LIGO, SDSS
u Access Grid; VRVS: supporting group-based collaboration
And:
u Genomics, Proteomics, ...
u The Earth System Grid and EOSDIS
u Federating Brain Data
u Computed MicroTomography ...
è Virtual Observatories
Upcoming Grid Challenges: Secure Workflow Management and Optimization
u Maintaining a Global View of Resources and System State
è Coherent end-to-end System Monitoring
è Adaptive Learning: new paradigms for execution optimization (eventually automated)
u Workflow Management, Balancing Policy Versus Moment-to-moment Capability to Complete Tasks
è Balance High Levels of Usage of Limited Resources Against Better Turnaround Times for Priority Jobs
è Matching Resource Usage to Policy Over the Long Term
è Goal-Oriented Algorithms; Steering Requests According to (Yet to be Developed) Metrics
u Handling User-Grid Interactions: Guidelines; Agents
u Building Higher Level Services, and an Integrated Scalable (Agent-Based) User Environment for the Above
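One way to make the "policy versus moment-to-moment capability" balance concrete is to rank waiting tasks by a blend of (a) how far their group currently sits below its long-term policy share and (b) how quickly the Grid could turn the task around right now. The sketch below is purely illustrative: the groups, shares, weights and the scoring metric itself stand in for the "yet to be developed" metrics the slide refers to.

```python
# Illustrative goal-oriented ranking: balance long-term policy shares
# against the current capability to complete a task quickly.
from dataclasses import dataclass

@dataclass
class Task:
    group: str
    est_runtime_h: float      # estimated hours on currently free resources

policy_share = {"higgs": 0.5, "qcd": 0.3, "calib": 0.2}   # long-term targets
recent_usage = {"higgs": 0.6, "qcd": 0.2, "calib": 0.2}   # observed fractions

def score(task: Task, w_policy: float = 0.7, w_turnaround: float = 0.3) -> float:
    deficit = policy_share[task.group] - recent_usage[task.group]  # >0: under-served
    turnaround = 1.0 / (1.0 + task.est_runtime_h)                  # quicker is better
    return w_policy * deficit + w_turnaround * turnaround

queue = [Task("higgs", 2.0), Task("qcd", 12.0), Task("calib", 0.5)]
for t in sorted(queue, key=score, reverse=True):
    print(f"{t.group:6s} score={score(t):+.3f}")
```

Tuning w_policy against w_turnaround is exactly the policy-versus-capability trade the slide describes.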
Application Architecture: Interfacing to the Grid (layers)
(Physicists') Application Codes
Experiments' Software Framework Layer: Modular and Grid-aware, i.e. an architecture able to interact effectively with the lower layers
Grid Applications Layer (parameters and algorithms that govern system operations): Policy and priority metrics; Workflow evaluation metrics; Task-Site Coupling proximity metrics
Global End-to-End System Services Layer: Monitoring and tracking of component performance; Workflow monitoring and evaluation mechanisms; Error recovery and redirection mechanisms; System self-monitoring, evaluation and optimization mechanisms
COJAC: CMS ORCA Java Analysis Component: Java3D Objectivity JNI Web Services Demonstrated Caltech-Rio de Janeiro (Feb.) and Chile
Modeling and Simulation: the MONARC System (I. Legrand)
SIMULATION of Complex Distributed Systems for LHC
The simulation program developed within MONARC (Models Of Networked Analysis At Regional Centers) uses a process-oriented approach for discrete event simulation, and provides a realistic modelling tool for large scale distributed systems.
MONARC SONN: 3 Regional Centres Learning to Export Jobs (Day 9)
Centres: CERN (30 CPUs), CALTECH (25 CPUs), NUST (20 CPUs)
Inter-centre links: 1 MB/s at 150 ms RTT; 1.2 MB/s at 150 ms RTT; 0.8 MB/s at 200 ms RTT
Day = 9; per-centre values shown: 0.73, 0.66, 0.83
By I. Legrand
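The SONN shown above learns where to export jobs from observed performance. A drastically simplified, non-neural caricature of the same idea is sketched below, using the CPU counts and link figures from the slide; the scoring rule, the 200 MB job size and the learning rate are illustrative choices, not MONARC's actual algorithm.

```python
# Toy "learn where to export jobs": prefer centres with spare CPUs and
# cheap links, and nudge per-centre weights toward observed performance.

centres = {
    # name: (CPUs, link MB/s, RTT s)  -- figures from the slide
    "CERN":    (30, 1.0, 0.150),
    "CALTECH": (25, 1.2, 0.150),
    "NUST":    (20, 0.8, 0.200),
}
weights = {name: 1.0 for name in centres}     # learned preference per centre
LEARNING_RATE = 0.1                           # illustrative value

def export_score(name: str, job_mb: float) -> float:
    cpus, mbps, rtt = centres[name]
    transfer_s = job_mb / mbps + rtt          # rough data-shipping cost
    return weights[name] * cpus / transfer_s  # capacity per unit of delay

def update(name: str, observed_efficiency: float) -> None:
    """Pull the centre's weight toward the efficiency it actually delivered."""
    weights[name] += LEARNING_RATE * (observed_efficiency - weights[name])

best = max(centres, key=lambda n: export_score(n, job_mb=200.0))
update(best, observed_efficiency=0.73)        # e.g. a value like those on the slide
print("export next job to:", best, "weights:", weights)
```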
Agent-Based Distributed JINI Services: CIT/Romania/Pakistan
u Includes "Station Servers" (static) that host mobile "Dynamic Services"
u Servers are interconnected dynamically to form a fabric in which mobile agents travel, with a payload of physics analysis tasks
u Prototype is highly flexible and robust against network outages
u Adaptable to Web services: OGSA; and many platforms
u The design and studies with this prototype use the MONARC Simulator, and build on the SONN studies
(Diagram: Station Servers, Lookup Services, Proxy Exchange, Registration, Service Listener, Lookup Discovery Service, Remote Notification)
TeraGrid (NCSA, ANL, SDSC, Caltech): A Preview of the Grid Hierarchy and Networks of the LHC Era
(Map: NCSA/UIUC, ANL, UIC, Multiple Carrier Hubs, Starlight / NW Univ, Ill Inst of Tech, Univ of Chicago, Indianapolis (Abilene NOC), I-WIRE, Caltech, San Diego; Abilene: Chicago, Indianapolis, Urbana)
DTF Backplane: 4 X 10 Gbps
Links: OC-48 (2.5 Gb/s, Abilene); Multiple 10 GbE (Qwest); Multiple 10 GbE (I-WIRE Dark Fiber)
Source: Charlie Catlett, Argonne
DataTAG Project
(Map: NL SURFnet, UK SuperJANET4, ABILENE, ESNET, CALREN, IT GARR-B, GEANT, Fr Renater, New York, STAR-TAP, STARLIGHT, GENEVA; Wavelength Triangle)
u EU-Solicited Project. CERN, PPARC (UK), Amsterdam (NL), and INFN (IT); and US (DOE/NSF: UIC, NWU and Caltech) partners
u Main Aims:
è Ensure maximum interoperability between US and EU Grid Projects
è Transatlantic Testbed for advanced network research
u 2.5 Gbps Wavelength Triangle 7/02 (10 Gbps Triangle in 2003)
RNP Brazil (to 20 Mbps) FIU Miami/So. America (to 80 Mbps)
A Short List: Revolutions in Information Technology (2002-7)
u Managed Global Data Grids (As Above)
u Scalable Data-Intensive Metro and Long Haul Network Technologies
è DWDM: 10 Gbps then 40 Gbps per wavelength; 1 to 10 Terabits/sec per fiber
è 10 Gigabit Ethernet; 10GbE / 10 Gbps LAN/WAN integration
è Metro buildout and Optical Cross Connects
è Dynamic Provisioning: dynamic path building; "Lambda Grids"
u Defeating the "Last Mile" Problem (Wireless; or Ethernet in the First Mile)
è 3G and 4G Wireless Broadband (from ca. 2003); and/or Fixed Wireless "Hotspots"
è Fiber to the Home
è Community-Owned Networks
Key Network Issues & Challenges
u Net Infrastructure Requirements for High Throughput
è Packet loss must be ~Zero; i.e. no "Commodity" networks
è Need to track down uncongested packet loss
è No local infrastructure bottlenecks
è Multiple Gigabit Ethernet "clear paths" between selected host pairs are needed now; 10 Gbps Ethernet paths by 2003 or 2004
è TCP/IP stack configuration and tuning is absolutely required: large windows; possibly multiple streams
è New concepts of fair use must then be developed
è Careful router, server, client and interface configuration
è Sufficient CPU, I/O and NIC throughput
è End-to-end monitoring and tracking of performance
è Close collaboration with local and "regional" network staffs
TCP Does Not Scale to the 1-10 Gbps Range
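The "large windows, possibly multiple streams" requirement follows from the bandwidth-delay product: a TCP sender must be allowed roughly bandwidth x RTT bytes in flight to keep the pipe full. A minimal sketch, using the 170 ms transatlantic RTT quoted later in these slides and illustrative link speeds; the Linux sysctl names in the closing comment are the standard window-limit tunables, mentioned as a pointer rather than a verified recipe.

```python
# Sketch: size TCP windows from the bandwidth-delay product (BDP).
# Link parameters are illustrative, not a measured configuration.

def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes that must be 'in flight' to keep the pipe full."""
    return bandwidth_bps * rtt_s / 8.0

for name, bw, rtt in [
    ("622 Mbps US-CERN, 170 ms", 622e6, 0.170),
    ("2.5 Gbps research lambda, 170 ms", 2.5e9, 0.170),
    ("10 Gbps (2003+), 170 ms", 10e9, 0.170),
]:
    window = bdp_bytes(bw, rtt)
    print(f"{name}: window ~ {window / 1e6:.1f} MBytes "
          f"(or ~{window / 1e6 / 16:.1f} MB each over 16 parallel streams)")

# On Linux the window ceiling is governed by sysctls such as
#   net.core.rmem_max / net.core.wmem_max and net.ipv4.tcp_rmem / tcp_wmem,
# which must be raised well above their defaults to reach these values.
```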
True End to End Experience
r User perception
r Application
r Operating system
r Host IP stack
r Host network card
r Local Area Network
r Campus backbone network
r Campus link to regional network/GigaPoP
r GigaPoP link to Internet2 national backbones
r International connections
(Diagram: EYEBALL ... APPLICATION ... STACK ... JACK ... NETWORK ...)
Internet2 HENP WG [*]
u Mission: To help ensure that the required
è National and international network infrastructures (end-to-end),
è Standardized tools and facilities for high performance and end-to-end monitoring and tracking, and
è Collaborative systems
u are developed and deployed in a timely manner, and used effectively to meet the needs of the US LHC and other major HENP programs, as well as the at-large scientific community
è To carry out these developments in a way that is broadly applicable across many fields
u Formed an Internet2 WG as a suitable framework: Oct.
u [*] Co-Chairs: S. McKee (Michigan), H. Newman (Caltech); Sec'y J. Williams (Indiana)
u Website: also see the Internet2 End-to-end Initiative:
A Short List: Coming Revolutions in Information Technology
u Storage Virtualization
è Grid-enabled Storage Resource Middleware (SRM)
è iSCSI (Internet Small Computer System Interface); integrated with 10 GbE Global File Systems
u Internet Information Software Technologies
è Global Information "Broadcast" Architecture
E.g. the Multipoint Information Distribution Protocol
è Programmable Coordinated Agent Architectures
E.g. Mobile Agent Reactive Spaces (MARS) by Cabri et al., University of Modena
u The "Data Grid" - Human Interface
è Interactive monitoring and control of Grid resources
By authorized groups and individuals
By Autonomous Agents
HENP Major Links: Bandwidth Roadmap (Scenario) in Gbps
HENP Scenario Limitations: Technologies and Costs u Router Technology and Costs (Ports and Backplane) u Computer CPU, Disk and I/O Channel Speeds to Send and Receive Data u Link Costs: Unless Dark Fiber (?) u MultiGigabit Transmission Protocols End-to-End u “100 GbE” Ethernet (or something else) by ~2006: for LANs to match WAN speeds
HENP Lambda Grids: Fibers for Physics
u Problem: Extract "Small" Data Subsets of 1 to 100 Terabytes from 1 to 1000 Petabyte Data Stores
u Survivability of the HENP Global Grid System, with hundreds of such transactions per day (circa 2007), requires that each transaction be completed in a relatively short time
u Example: Take 800 secs to complete the transaction. Then:
Transaction Size (TB)    Net Throughput (Gbps)
1                        10
10                       100
100                      1000 (Capacity of Fiber Today)
u Summary: Providing switching of 10 Gbps wavelengths within 2-3 years, and Terabit switching within 5-7 years, would enable "Petascale Grids with Terabyte transactions", as required to fully realize the discovery potential of major HENP programs, as well as other data-intensive fields.
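The table is straightforward arithmetic on the 800-second transaction model; a small sketch reproducing it, assuming the decimal convention 1 TB = 10^12 bytes (which is what makes 1 TB in 800 s come out to 10 Gbps):

```python
# Reproduce the Lambda-Grid transaction table: throughput needed to move
# a data subset of a given size in a fixed 800-second transaction window.

TRANSACTION_TIME_S = 800.0          # target completion time from the slide
TB = 1e12                           # bytes per Terabyte (decimal convention)

for size_tb in (1, 10, 100):
    bits = size_tb * TB * 8
    gbps = bits / TRANSACTION_TIME_S / 1e9
    print(f"{size_tb:>4} TB in {TRANSACTION_TIME_S:.0f} s -> {gbps:,.0f} Gbps")

# 1 TB -> 10 Gbps, 10 TB -> 100 Gbps, 100 TB -> 1000 Gbps (~ one fiber today)
```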
VRVS: 10614 Hosts; 6003 Registered Users in 60 Countries; 41 (7 I2) Reflectors; Annual Growth 2 to 3X
Networks, Grids and HENP
u Next generation 10 Gbps network backbones are almost here: in the US, Europe and Japan
è First stages arriving, starting now
u Major transoceanic links at 2.5 - 10 Gbps in 2002-2003
u Network improvements are especially needed in Southeast Europe, South America, and some other regions:
è Romania, Brazil; India, Pakistan, China; Africa
u Removing regional and last-mile bottlenecks and compromises in network quality: these are now all on the critical path
u Getting high (reliable; Grid) application performance across networks means:
è End-to-end monitoring; a coherent approach
è Getting high performance (TCP) toolkits in users' hands
è Working in concert with AMPATH, Internet E2E, the I2 HENP WG, DataTAG; the Grid projects and the GGF
Some Extra Slides Follow
The Large Hadron Collider (2007-)
u The next-generation particle collider
è The largest superconductor installation in the world
u Bunch-bunch collisions at 40 MHz, each generating ~20 interactions
è Only one in a trillion may lead to a major physics discovery
u Real-time data filtering: Petabytes per second to Gigabytes per second
u Accumulated data of many Petabytes/Year
Large data samples explored and analyzed by thousands of globally dispersed scientists, in hundreds of teams
HENP Related Data Grid Projects
u Projects
è PPDG I (USA, DOE): $2M
è GriPhyN (USA, NSF): $11.9M + $1.6M
è EU DataGrid (EU, EC): €10M
è PPDG II (CP) (USA, DOE): $9.5M
è iVDGL (USA, NSF): $13.7M + $2M
è DataTAG (EU, EC): €4M
è GridPP (UK, PPARC): >$15M
è LCG (Ph1) (CERN MS): 30 MCHF
u Many other projects of interest to HENP
è Initiatives in US, UK, Italy, France, NL, Germany, Japan, ...
è US and EU networking initiatives: AMPATH, I2, DataTAG
è US Distributed Terascale Facility: ($53M, 12 TeraFlops, 40 Gb/s network)
Beyond Traditional Architectures: Mobile Agents
"Agents are objects with rules and legs" -- D. Taylor
Mobile Agents: (Semi-)Autonomous, Goal-Driven, Adaptive
è Execute Asynchronously
è Reduce Network Load: Local Conversations
è Overcome Network Latency; Some Outages
è Adaptive: Robust, Fault Tolerant
è Naturally Heterogeneous
è Extensible Concept: Coordinated Agent Architectures
(Diagram: Application - Service - Agent)
Globally Scalable Monitoring Service (I. Legrand)
(Diagram: Farm Monitor; Client (other service); Lookup Service; Registration; Discovery; Proxy; RC Monitor Service)
u Component Factory
u GUI marshaling
u Code Transport
u RMI data access
Push & Pull; rsh & ssh scripts; snmp
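The registration / discovery / push-and-pull structure in the diagram can be mocked up in a few lines. This is a single-process caricature of the pattern for illustration only, not the actual Java/JINI + RMI implementation by I. Legrand; all class and method names here are invented.

```python
# Single-process caricature of the monitoring pattern on the slide:
# farm monitors register with a lookup service; a client discovers them,
# pulls current values, and can also subscribe to pushed updates.

class LookupService:
    def __init__(self):
        self._registry = {}                     # farm name -> monitor proxy
    def register(self, name, monitor):
        self._registry[name] = monitor          # "Registration"
    def discover(self):
        return dict(self._registry)             # "Discovery"

class FarmMonitor:
    def __init__(self, name):
        self.name, self._metrics, self._listeners = name, {}, []
    def push(self, key, value):                 # e.g. fed by snmp / rsh scripts
        self._metrics[key] = value
        for listener in self._listeners:        # push model: notify subscribers
            listener(self.name, key, value)
    def pull(self):                             # pull model: client polls
        return dict(self._metrics)
    def subscribe(self, listener):
        self._listeners.append(listener)

lookup = LookupService()
cern = FarmMonitor("CERN-farm")
lookup.register("CERN-farm", cern)
cern.subscribe(lambda farm, k, v: print(f"[push] {farm}: {k}={v}"))
cern.push("load", 0.42)
for name, mon in lookup.discover().items():
    print(f"[pull] {name}: {mon.pull()}")
```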
National R&E Network Example - Germany: DFN
Transatlantic Connectivity (map: STM-4, STM-16 links)
u 2 X OC12 Now: NY-Hamburg and NY-Frankfurt
u ESNet peering at 34 Mbps
u Upgrade to 2 X OC48 expected in Q
u Direct peering to Abilene and Canarie expected
u UCAID will add (?) another 2 OC48's; proposing a Global Terabit Research Network (GTRN)
u FSU connections via satellite: Yerevan, Minsk, Almaty, Baikal
è Speeds of kbps
u SILK Project (2002): NATO funding
è Links to Caucasus and Central Asia (8 Countries)
è Currently kbps
è Propose VSAT for X BW: NATO + State Funding
National Research Networks in Japan
u SuperSINET
è Started operation January 4, 2002
è Support for 5 important areas: HEP, Genetics, Nano-Technology, Space/Astronomy, GRIDs
è Provides 10 wavelengths (λ's):
r 10 Gbps IP connection
r Direct intersite GbE links
r Some connections to 10 GbE in JFY2002
u HEPnet-J: will be re-constructed with MPLS-VPN in SuperSINET
u Proposal: Two TransPacific 2.5 Gbps Wavelengths, and a Japan-CERN Grid Testbed by ~2003
(Map: Tokyo, Osaka, Nagoya; Osaka U, Kyoto U, ICR Kyoto-U, Nagoya U, NIFS, NIG, KEK, Tohoku U, IMS, U-Tokyo, NAO, NII Hitot., NII Chiba, ISAS; IP / WDM path, IP router, OXC)
ICFA SCIC Meeting March 9 at CERN: Updates from Members u Abilene Upgrade from 2.5 to 10 Gbps è Additional scheduled lambdas planned for targeted applications u US-CERN è Upgrade On Track: 2 X 155 to 622 Mbps in April; Move to STARLIGHT è 2.5G Research Lambda by this Summer: STARLIGHT-CERN è 2.5G Triangle between STARLIGHT (US), SURFNet (NL), CERN u SLAC + IN2P3 (BaBar) è Getting 100 Mbps over 155 Mbps CERN-US Link è 50 Mbps Over RENATER 155 Mbps Link, Limited by ESnet è 600 Mbps Throughput is BaBar Target for this Year u FNAL è Expect ESnet Upgrade to 622 Mbps this Month è Plans for dark fiber to STARLIGHT, could be done in ~6 Months; Railway and Electric Co. providers considered
ICFA SCIC: A&R Backbone and International Link Progress
u GEANT Pan-European Backbone
è Now interconnects 31 countries
è Includes many trunks at 2.5 and 10 Gbps
u UK
è 2.5 Gbps NY-London, with 622 Mbps to ESnet and Abilene
u SuperSINET (Japan): 10 Gbps IP and 10 Gbps Wavelength
è Upgrade to two 0.6 Gbps links, to Chicago and Seattle
è Plan upgrade to 2 X 2.5 Gbps connection to the US West Coast by 2003
u CA*net4 (Canada): Interconnect customer-owned dark fiber nets across Canada at 10 Gbps, starting July 2002
è "Lambda-Grids" by ~
u GWIN (Germany): Connection to Abilene to 2 X 2.5 Gbps in 2002
u Russia
è Start 10 Mbps link to CERN and ~90 Mbps to the US Now
Throughput quality improvements: TCP BW < MSS / (RTT * sqrt(loss)) [*]
[*] See "Macroscopic Behavior of the TCP Congestion Avoidance Algorithm," Mathis, Semke, Mahdavi, Ott, Computer Communication Review 27(3), 7/1997
(Chart annotations: 80% improvement/year, a factor of 10 in 4 years; China showing recent improvement; Eastern Europe keeping up)
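As a worked example of the Mathis et al. bound quoted above (the MSS, RTT and loss-rate values are illustrative, with the RTT chosen to match the 170 ms transatlantic path discussed on the next slide):

```python
# Mathis et al. bound: achievable TCP throughput <~ MSS / (RTT * sqrt(loss)).
# Illustrative numbers only; shows why ~zero loss is required at Gbps scale.
from math import sqrt

def mathis_bw_bps(mss_bytes: float, rtt_s: float, loss: float) -> float:
    return (mss_bytes * 8) / (rtt_s * sqrt(loss))

MSS = 1460        # bytes, standard Ethernet MSS
RTT = 0.170       # seconds, CERN-Caltech scale

for loss in (1e-3, 1e-5, 1e-7):
    print(f"loss {loss:.0e}: <= {mathis_bw_bps(MSS, RTT, loss) / 1e6:8.1f} Mbps")

# loss 1e-3 -> ~2 Mbps; 1e-5 -> ~22 Mbps; 1e-7 -> ~217 Mbps:
# a single standard-MSS stream needs extraordinarily low loss (or a larger
# MSS / many parallel streams) to approach the Gbps range over 170 ms paths.
```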
Maximizing US-CERN TCP Throughput (S. Ravot, Caltech)
TCP Protocol Study: Limits
u We determined precisely
è The parameters which limit the throughput over a high-BW, long delay (170 msec) network
è How to avoid intrinsic limits and unnecessary packet loss
Methods Used to Improve TCP
u Linux kernel programming in order to tune TCP parameters
u We modified the TCP algorithm
u A Linux patch will soon be available
Result: the current state of the art for reproducible throughput
u 125 Mbps between CERN and Caltech
u 190 Mbps (one stream) between CERN and Chicago, shared on the 2 X 155 Mbps links
Status: ready for tests at higher BW (622 Mbps) this month
Congestion window behavior of a TCP connection over the transatlantic line:
1) A packet is lost
2) Fast Recovery (temporary state to repair the loss)
3) Back to slow start (Fast Recovery could not repair the loss; the loss is detected by timeout => go back to slow start, cwnd = 2 MSS); then a new loss
Losses occur when the cwnd is larger than 3.5 MByte
Reproducible 125 Mbps between CERN and Caltech/CACR
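To make the annotated 1) -> 2) -> 3) sequence concrete, here is a deliberately simplified Reno-style congestion-window model. It is not the Linux kernel code modified in this study, just a toy that assumes loss strikes whenever cwnd exceeds ~3.5 MByte, as observed on the slide; with those numbers the long-run average window (about 0.75 of the loss threshold) corresponds to roughly the 125 Mbps reproducible throughput quoted above.

```python
# Toy Reno-like cwnd evolution, to illustrate the annotated slide:
# slow start doubles cwnd each RTT, congestion avoidance adds one MSS
# per RTT, a loss halves cwnd (fast recovery), a timeout resets it.

MSS = 1460                    # bytes
SSTHRESH0 = 64 * 1024         # initial slow-start threshold (illustrative)
LOSS_CWND = 3.5e6             # losses observed above ~3.5 MByte (slide)

cwnd, ssthresh = 2 * MSS, SSTHRESH0
for rtt_round in range(5000):
    if cwnd < ssthresh:
        cwnd *= 2                         # slow start
    else:
        cwnd += MSS                       # congestion avoidance (AIMD "AI")
    if cwnd > LOSS_CWND:                  # 1) a packet is lost
        ssthresh = cwnd / 2
        cwnd = ssthresh                   # 2) fast recovery: halve the window
        # 3) if recovery fails and a timeout fires instead:
        # cwnd = 2 * MSS                  # back to slow start (worst case)

print(f"steady-state cwnd oscillates between ~{LOSS_CWND/2/1e6:.2f} and "
      f"{LOSS_CWND/1e6:.2f} MByte, i.e. mean ~{0.75*LOSS_CWND*8/0.170/1e6:.0f} "
      f"Mbps at 170 ms RTT")
```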
Rapid Advances of Nat'l Backbones: Next Generation Abilene
u Abilene partnership with Qwest extended through 2006
u Backbone to be upgraded to 10 Gbps in phases, to be completed by October 2003
è GigaPoP upgrade started in February 2002
u Capability for flexible provisioning in support of future experimentation in optical networking
è In a multi-wavelength (multi-λ) infrastructure
US CMS TeraGrid Seamless Prototype
u Caltech/Wisconsin Condor/NCSA Production
u Simple Job Launch from Caltech
è Authentication using the Globus Security Infrastructure (GSI)
è Resources identified using the Globus Information Infrastructure (GIS)
u CMSIM Jobs (Batches of 100, Hours, 100 GB Output) sent to the Wisconsin Condor Flock using Condor-G
è Output files automatically stored in NCSA Unitree (Gridftp)
u ORCA Phase: Read-in and Process Jobs at NCSA
è Output files automatically stored in NCSA Unitree
u Future: Multiple CMS Sites; Storage in Caltech HPSS also, using GDMP (with LBNL's HRM)
u Animated Flow Diagram of the DTF Prototype:
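For flavor, the "simple job launch ... using Condor-G" step might look roughly like the sketch below, which writes a Condor-G style submit description and hands it to condor_submit. The universe = globus / globusscheduler attributes follow my recollection of Condor-G usage of that era, and every hostname, script and file name is a placeholder; treat this as an illustration of the workflow, not the actual US CMS production scripts.

```python
# Sketch only: generate a Condor-G submit description for a CMSIM batch and
# submit it.  Hostnames, paths and attribute details are illustrative.
import subprocess
import tempfile

submit_description = """\
universe        = globus
globusscheduler = condor.example.wisc.edu/jobmanager-condor
executable      = run_cmsim.sh
arguments       = --events 500 --batch 042
output          = cmsim_042.out
error           = cmsim_042.err
log             = cmsim_042.log
queue
"""

with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
    f.write(submit_description)
    submit_file = f.name

# GSI authentication (grid-proxy-init) is assumed to have been done already;
# output staging to NCSA Unitree via gridftp would follow in the real chain.
subprocess.run(["condor_submit", submit_file], check=True)
```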
Building Petascale Global Grids: Implications for Society
Meeting the challenges of Petabyte-to-Exabyte Grids, and Gigabit-to-Terabit Networks, will transform research in science and engineering
These developments will create the first truly global virtual organizations (GVO)
If these developments are successful, this could lead to profound advances in industry, commerce and society at large
è By changing the relationship between people and "persistent" information in their daily lives
è Within the next five to ten years