1
Named Data Networking in Science Applications
Christos Papadopoulos, Colorado State University
CALTECH, Nov 20, 2017
Supported by NSF #, #, and #
2
Science Applications
Scientific apps generate tremendous amounts of data
Current climate CMIP5 dataset: 3.5 PB, projected into the exabytes with CMIP6
High Energy Physics (HEP) ATLAS: 1 PB/s raw, filters to 4 PB/yr
Data typically distributed to repositories
Sciences develop custom software for dataset discovery, publishing, and retrieval, e.g. ESGF, xrootd, etc.
How to locate data? Availability? No standard naming scheme
3
ESGF (Climate Data)
Globally distributed, P2P system to distribute CMIP5 (and other) climate data
Performs publishing, discovery, user authentication, data retrieval
User makes selection on portal, receives a collection of “wget” scripts
4
ESGF Server Locations
User specifies which ESGF server to connect to in order to receive data
5
ESGF Data Retrieval Today
/CMIP5/output/MOHC/HadCM3/decadal1990/day/atmos/tas/r3i2p1/tas_Amon_HADCM3_historical_r1i1p1_….nc
6
CMIP5 Retrieval with NDN
/CMIP5/output/MOHC/HadCM3/decadal1990/day/atmos/tas/r3i2p1/tas_Amon_HADCM3_historical_r1i1p1_….nc
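The slides carry no code, so here is a minimal sketch of the idea: in NDN the consumer asks for data by its hierarchical name, segment by segment, without ever naming a host. The in-memory "network", segment size, and helper names below are hypothetical, not ESGF or NDN library code.

```python
# Hypothetical sketch: fetching a named CMIP5 object as a sequence of NDN segments.
# The in-memory "network" stands in for whatever NDN node answers the Interests.

DRS_NAME = "/CMIP5/output/MOHC/HadCM3/decadal1990/day/atmos/tas/r3i2p1"
SEGMENT_SIZE = 8  # tiny, for illustration only

# Pretend some repository published the dataset under its DRS-style name.
published = {DRS_NAME: b"temperature-at-surface, daily means, ..."}

def segments(name):
    """Split a published object into (segment-name, bytes) pairs."""
    data = published[name]
    for i in range(0, len(data), SEGMENT_SIZE):
        yield f"{name}/seg={i // SEGMENT_SIZE}", data[i:i + SEGMENT_SIZE]

network = dict(segments(DRS_NAME))

def express_interest(name):
    """Stand-in for sending an NDN Interest and receiving a Data packet."""
    return network.get(name)

def fetch(name):
    """The consumer only knows the name; it never specifies a server or URL."""
    chunks, seg = [], 0
    while (data := express_interest(f"{name}/seg={seg}")) is not None:
        chunks.append(data)
        seg += 1
    return b"".join(chunks)

print(fetch(DRS_NAME))
```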
7
NDN Content Publishing
Example using real object names
8
Data Request
Interests for Jan 30-31 go to server1
Interests for Feb go to server2
Data dynamically extracted from file
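A rough sketch of how name prefixes alone can steer these requests via longest-prefix match; the prefixes and face labels below are invented for illustration and are not the actual testbed configuration.

```python
# Hypothetical FIB: longest-prefix match on hierarchical names decides the outgoing face.
FIB = {
    "/CMIP5/temperature/1990/01": "server1",
    "/CMIP5/temperature/1990/02": "server2",
    "/CMIP5": "default-route",
}

def forward(interest_name):
    """Return the face for the longest FIB prefix matching the Interest name."""
    parts = interest_name.strip("/").split("/")
    for i in range(len(parts), 0, -1):
        prefix = "/" + "/".join(parts[:i])
        if prefix in FIB:
            return FIB[prefix]
    return None

print(forward("/CMIP5/temperature/1990/01/30"))   # server1 (Jan 30)
print(forward("/CMIP5/temperature/1990/02/01"))   # server2 (Feb)
print(forward("/CMIP5/precipitation/1990/01/30")) # default-route
```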
9
Name Translation
Translator maps a filename (and file contents) to an NDN name using a mapping schema plus user-defined components
Example filename: spcesm-ctrl.pop.h.1891-01.nc
Resulting NDN name: /coupled/control/CMMAP/r3i1p1/spcesm-ctrl/pop/1M/…
Schema: /Activity/subactivity/organization/ensemble/experiment/model/granularity/start_time
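A minimal translator sketch, assuming the filename carries the experiment, model, granularity, and start time while the remaining schema components come from user-supplied metadata; the parsing rules and the "h" to "1M" granularity mapping are guesses for illustration, not the project's actual translator.

```python
# Hypothetical filename-to-NDN-name translator following the schema
# /Activity/subactivity/organization/ensemble/experiment/model/granularity/start_time

SCHEMA = ["activity", "subactivity", "organization", "ensemble",
          "experiment", "model", "granularity", "start_time"]

def translate(filename, user_components):
    """Combine user-supplied components with fields parsed from the filename."""
    # e.g. "spcesm-ctrl.pop.h.1891-01.nc" -> experiment, model, granularity, start_time
    experiment, model, granularity, start_time = filename.removesuffix(".nc").split(".")
    fields = dict(user_components,
                  experiment=experiment, model=model,
                  granularity={"h": "1M"}.get(granularity, granularity),  # assumed mapping
                  start_time=start_time)
    return "/" + "/".join(fields[c] for c in SCHEMA)

print(translate("spcesm-ctrl.pop.h.1891-01.nc",
                {"activity": "coupled", "subactivity": "control",
                 "organization": "CMMAP", "ensemble": "r3i1p1"}))
# -> /coupled/control/CMMAP/r3i1p1/spcesm-ctrl/pop/1M/1891-01
```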
10
Naming Scientific Datasets in NDN
NDN’s only requirement is hierarchical names. Many scientific communities have already adopted such naming structures.
CMIP5 Data Reference Syntax (DRS):
/CMIP5/output/MOHC/HadCM3/decadal1990/day/atmos/tas/r3i2p1/tas_Amon_HADCM3_historical_r1i1p1_….nc
HEP naming structure:
/store/mc/fall13/BprimeBprime_M_3000/GEN-SIM/POSTLS162_v1-v2/10000/<UUID.root>
11
First Step – Build a Catalog
Create a shared resource – a distributed, synchronized catalog of names over NDN
Provide common operations such as publishing, discovery, access control
Catalog only deals with name management, not dataset retrieval
Platform for further research and experimentation
Research questions: namespace construction, distributed publishing, key management, UI design, failover, etc.
Functional services such as subsetting
Mapping of name-based routing to tunneling services (VPN, OSCARS, MPLS)
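As a rough sketch of the catalog's name-management role (publish plus prefix-based discovery, no dataset retrieval); the class and method names are invented for this example and are not the catalog's real API.

```python
# Hypothetical in-memory name catalog: it stores published names and answers
# prefix queries, but never touches the datasets themselves.

class NameCatalog:
    def __init__(self):
        self.names = set()

    def publish(self, name):
        """Register a dataset name (in the real system this change would be
        propagated to the other catalog nodes via NDN sync)."""
        self.names.add(name)

    def query(self, prefix):
        """Discovery: return all published names under a hierarchical prefix."""
        return sorted(n for n in self.names if n == prefix or n.startswith(prefix + "/"))

catalog = NameCatalog()
catalog.publish("/CMIP5/output/MOHC/HadCM3/decadal1990/day/atmos/tas/r3i2p1")
catalog.publish("/CMIP5/output/MOHC/HadCM3/decadal1990/day/atmos/pr/r3i2p1")
print(catalog.query("/CMIP5/output/MOHC/HadCM3/decadal1990/day/atmos"))
```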
12
NDN Catalog
(1) Publisher publishes dataset names to a catalog node
(2) Catalog nodes sync changes over NDN
(3) Consumer queries the catalog for dataset names
(4) Consumer retrieves data from data storage
13
Forwarding Strategies
Same topology as the catalog figure: catalog nodes, publisher, consumer, and data storage connected over NDN
14
Science NDN Testbed
NSF CC-NIE campus infrastructure award
10G testbed (courtesy of ESnet, UCAR, and CSU Research LAN)
Currently ~50TB of CMIP5, ~20TB of HEP data
15
Request Aggregation, Caching, and Forwarding Strategies for Improving Large Climate Data Distribution with NDN: A Case Study
Susmit Shannigrahi, Chengyu Fan, Christos Papadopoulos
ICN 2017
16
Analyzing CMIP5 Activity
CMIP5 is a 3.3PB archive of climate data, distributed around the world and made available to the community through ESGF (~25 nodes); the next run, CMIP6, is estimated to reach exabytes
We looked at one ESGF server log collected at LLNL, covering approximately three years of requests
Using clients’ IP addresses, we created a network and geographic map of all clients
Goal: run a trace-driven simulation in NDN and quantify gains
17
User and Client Statistics
Unique users (usernames): 5,692
Unique clients (IP addresses): 9,266
Unique ASNs: 911
Client IP addresses
18
Data Statistics
Log duration: ~3 years
Total number of requests: 18.5M
Number of partial or completed downloads: 5.7M
Number of files requested: 1.8M
File size (95th percentile): 1.3GB
Total data requested: 1,844TB
Two out of three requests are duplicates
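To make the duplicate figure concrete, a small sketch of one plausible way to count duplicates from such a log; the (client, file) tuple format and the definition of "duplicate" (any repeat request for a file already seen) are assumptions, not the actual ESGF log layout or the paper's exact methodology.

```python
# Hypothetical duplicate counting: a request counts as a duplicate if the same
# file has been requested before (by any client), i.e. it could in principle
# have been answered from a cache rather than the origin server.

requests = [  # (client, file) pairs standing in for parsed log entries
    ("10.0.0.1", "tas_day_HadCM3_r3i2p1.nc"),
    ("10.0.0.2", "tas_day_HadCM3_r3i2p1.nc"),   # duplicate
    ("10.0.0.1", "pr_day_HadCM3_r3i2p1.nc"),
    ("10.0.0.3", "tas_day_HadCM3_r3i2p1.nc"),   # duplicate
]

seen, duplicates = set(), 0
for _, filename in requests:
    if filename in seen:
        duplicates += 1
    else:
        seen.add(filename)

print(f"{duplicates}/{len(requests)} requests are duplicates")
```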
19
Request Distribution
Some files are very popular
Candidates for aggregation and caching
Can be served from a nearer replica
20
Partial Transfers
Three distinct categories of clients according to partial transfers
Partial transfers waste bandwidth and server resources
Requests are often temporally close; aggregation and caching should help
21
Partial Transfers and Duplicate Requests
All three categories contribute to duplicate requests Successful as well as partial transfers are repeated
22
Simulation Setup
ndnSIM was not able to handle the entire log
Ignore all zero-byte transfers
Reduce the number of events: randomly pick 7 weeks from the log and choose the clients responsible for ~95% of traffic
Generate topology using reverse traceroute from server to clients
Import topology into ndnSIM, replay requests in real time
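A sketch of the trace-reduction step under stated assumptions: the record fields and the 95%-of-bytes cutoff logic below are illustrative, not the exact preprocessing code used in the paper.

```python
# Hypothetical trace reduction: drop zero-byte transfers, then keep the smallest
# set of clients that together account for ~95% of the transferred bytes.

records = [  # (client, bytes_transferred) stand-ins for parsed log lines
    ("c1", 900), ("c2", 50), ("c3", 0), ("c4", 40), ("c5", 10),
]

records = [r for r in records if r[1] > 0]          # ignore zero-byte transfers

per_client = {}
for client, nbytes in records:
    per_client[client] = per_client.get(client, 0) + nbytes

total = sum(per_client.values())
kept, covered = set(), 0
for client, nbytes in sorted(per_client.items(), key=lambda kv: kv[1], reverse=True):
    if covered / total >= 0.95:                      # stop once ~95% is covered
        break
    kept.add(client)
    covered += nbytes

reduced = [r for r in records if r[0] in kept]
print(kept, f"{covered / total:.0%} of traffic", len(reduced), "records kept")
```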
23
Simulation Setup Randomly sampled seven weeks
No loss in generality – other weeks show similar traffic volume and number of duplicate requests
24
Result 1: Interest Aggregation
Large reduction in duplicate Interests
No effect when there were no duplicates
Interest aggregation very useful in suppressing duplicate spikes
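Interest aggregation is a core NDN forwarding mechanism: while an Interest for a name is pending in the PIT, later Interests for the same name are not forwarded upstream again. A toy sketch of that behavior follows; the names and counters are illustrative only.

```python
# Toy Pending Interest Table (PIT): a second Interest for a name that is already
# pending is aggregated (recorded, but not forwarded upstream again).

pit = {}            # name -> number of downstream requesters waiting
forwarded = 0       # Interests actually sent upstream

def on_interest(name):
    global forwarded
    if name in pit:
        pit[name] += 1          # aggregated: duplicate suppressed
    else:
        pit[name] = 1
        forwarded += 1          # only the first Interest goes upstream

def on_data(name):
    pit.pop(name, None)         # one Data packet satisfies all aggregated requesters

for name in ["/CMIP5/a", "/CMIP5/a", "/CMIP5/b", "/CMIP5/a"]:
    on_interest(name)

print(f"received 4 Interests, forwarded {forwarded} upstream")  # forwarded 2
```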
25
Result 2: Caching - How Big?
Small caches are useful – even a 1GB cache provided significant benefit
A linear increase in cache size does not proportionally decrease traffic
95th-percentile inter-arrival time for duplicate requests = 400 seconds
Caching everything on a 10G link for 400s = 500GB
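The back-of-the-envelope sizing works out as follows, using the numbers from the slide:

```python
# A cache that holds everything seen on a saturated 10 Gbit/s link over the
# 400 s duplicate inter-arrival window needs roughly:
link_rate_bps = 10e9          # 10 Gbit/s
window_s = 400                # seconds
cache_bytes = link_rate_bps / 8 * window_s
print(f"{cache_bytes / 1e9:.0f} GB")   # -> 500 GB
```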
26
Result 3: Where to Cache?
Requests are highly localized
Request paths do not overlap too much
Caching at the edge provides significant benefits
In limited cases, network-wide caches provide more benefits
27
Result 4: Nearest-Replica
Simulated six ESGF servers with fully replicated datasets at real locations
NDN strategy measures path delay and sends requests to the nearest replica
96% of original requests were diverted to the two nearest replicas; the original server received only 0.03% of requests
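A rough sketch of such a delay-measuring strategy; the face names, delay numbers, smoothing factor, and probing rule are invented for illustration and are considerably simpler than a real NFD forwarding strategy.

```python
# Hypothetical nearest-replica strategy: keep a smoothed delay estimate per face
# (i.e., per replica) and forward each Interest to the face with the lowest estimate.

import random

measured_rtt_ms = {"replica-A": 25, "replica-B": 80, "origin": 200}  # made-up delays
estimates = {face: 100.0 for face in measured_rtt_ms}                # start neutral
ALPHA = 0.5                                                          # smoothing factor

def forward_one():
    if random.random() < 0.05:                        # occasionally probe other faces
        face = random.choice(list(estimates))
    else:
        face = min(estimates, key=estimates.get)      # pick lowest estimated delay
    rtt = measured_rtt_ms[face] * random.uniform(0.9, 1.1)
    estimates[face] = (1 - ALPHA) * estimates[face] + ALPHA * rtt
    return face

choices = [forward_one() for _ in range(1000)]
print({face: choices.count(face) for face in estimates})  # most land on replica-A
```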
28
Result 5: Latency Reduction
Nearer-replica strategy also reduces client-side latency
RTT for client-3 reduced from 200ms to 25ms
29
Conclusions
NDN is a great match for CMIP5 data distribution
Naming translation had a few bumps because of inconsistent original names
Found a few CMIP5 naming bugs along the way
Catalog implementation over NDN was very easy: publishing (with the help of NDN sync), discovery, authentication
NDN caching very beneficial in simulation
NDN strategies shown to work on testbed
30
For More Info
christos@colostate.edu
http://named-data.net