Named Data Networking in Science Applications


1 Named Data Networking in Science Applications
Christos Papadopoulos, Colorado State University. CALTECH, Nov 20, 2017. Supported by NSF # , # and #

2 Science Applications
Scientific apps generate tremendous amounts of data.
Climate: the current CMIP5 dataset is 3.5 PB, projected into the exabytes with CMIP6.
High Energy Physics (HEP): ATLAS produces 1 PB/s raw, filtered down to 4 PB/yr.
Data is typically distributed to repositories.
Sciences develop custom software for dataset discovery, publishing, and retrieval (e.g. ESGF, xrootd).
Open questions: How to locate data? Availability? There is no standard naming scheme.

3 ESGF (Climate Data)
A globally distributed, P2P system to distribute CMIP5 (and other) climate data.
It performs publishing, discovery, user authentication, and data retrieval.
The user makes a selection on a portal and receives a collection of “wget” scripts.

4 ESGF Server Locations
The user specifies which ESGF server to connect to in order to receive the data.

5 ESGF Data Retrieval Today
(Figure: retrieval of an example object, ...adal1990/day/atmos/tas/r3i2p1/tas_Amon_HADCM3_historical_r1i1p1_....nc, via a wget script from the chosen ESGF server)

6 CMIP5 Retrieval with NDN
(Figure: the same object, ...adal1990/day/atmos/tas/r3i2p1/tas_Amon_HADCM3_historical_r1i1p1_....nc, requested by its hierarchical name over NDN)

7 NDN Content Publishing
Example using real object names

8 Data Request
Interests for Jan 30-31 go to server 1; Interests for Feb go to server 2. The data is dynamically extracted from the file.
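As a rough illustration of the “data dynamically extracted from file” step, the sketch below assumes a hypothetical name layout whose last component is the requested day, and uses an in-memory stand-in for the netCDF file; none of the names or helpers come from the actual publisher code.

```python
from datetime import date

# Hypothetical NDN name layout: /CMIP5/.../tas/<YYYY-MM-DD>,
# where the last component names the day of data being requested.
def parse_request_day(name: str) -> date:
    return date.fromisoformat(name.rstrip("/").split("/")[-1])

# Stand-in for the contents of a netCDF file: one record per day.
DATASET = {
    date(2017, 1, 30): [281.2, 282.0, 280.7],
    date(2017, 1, 31): [281.9, 282.4, 281.1],
}

def serve_interest(name: str):
    """Return only the requested day's records (would become an NDN Data packet)."""
    return DATASET.get(parse_request_day(name))

print(serve_interest("/CMIP5/output/MOHC/HadCM3/tas/2017-01-30"))
```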

9 Name Translation
A translator maps each filename (and, where needed, the contents of the file) to an NDN name, using a filename-to-NDN-name mapping schema plus user-defined components.
Example filename: spcesm-ctrl.pop.h.1891-01.nc
Mapping schema: /activity/subactivity/organization/ensemble/experiment/model/granularity/start_time
Example name components (in schema order): /coupled /control /CMMAP /r3i1p1 /spcesm-ctrl /pop /1M /...
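A minimal sketch of such a translator, assuming the schema and example components shown above; the regular expression, the split between filename-derived and user-defined components, and their exact placement are illustrative guesses, not the project's actual translator.

```python
import re

# Mapping schema from the slide (order of NDN name components).
SCHEMA = ("activity", "subactivity", "organization", "ensemble",
          "experiment", "model", "granularity", "start_time")

# User-defined components that the filename alone does not carry (assumed values).
USER_DEFINED = {
    "activity": "coupled",
    "subactivity": "control",
    "organization": "CMMAP",
    "ensemble": "r3i1p1",
    "granularity": "1M",
}

# e.g. "spcesm-ctrl.pop.h.1891-01.nc" -> experiment, model, start_time
FILENAME_RE = re.compile(
    r"^(?P<experiment>[\w-]+)\.(?P<model>\w+)\.h\.(?P<start_time>[\d-]+)\.nc$")

def to_ndn_name(filename: str) -> str:
    fields = dict(USER_DEFINED)
    fields.update(FILENAME_RE.match(filename).groupdict())
    return "/" + "/".join(fields[c] for c in SCHEMA)

print(to_ndn_name("spcesm-ctrl.pop.h.1891-01.nc"))
# -> /coupled/control/CMMAP/r3i1p1/spcesm-ctrl/pop/1M/1891-01
```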

10 Naming Scientific Datasets in NDN
NDN’s only requirement is hierarchical names, and many scientific communities have already adopted such naming structures.
CMIP5 Data Reference System (DRS): /CMIP5/output/MOHC/HadCM3/decadal1990/day/atmos/tas/r3i2p1/tas_Amon_HADCM3_historical_r1i1p1_....nc
HEP naming structure: /store/mc/fall13/BprimeBprime_M_3000/GEN-SIM/POSTLS162_v1-v2/10000/<UUID.root>

11 First Step – Build a Catalog
Create a shared resource: a distributed, synchronized catalog of names over NDN.
Provide common operations such as publishing, discovery, and access control.
The catalog only deals with name management, not dataset retrieval.
It is a platform for further research and experimentation. Research questions include: namespace construction, distributed publishing, key management, UI design, and failover; functional services such as subsetting; and the mapping of name-based routing to tunneling services (VPN, OSCARS, MPLS).

12 NDN Catalog
(Figure: a publisher, three catalog nodes, data storage, and a consumer, connected over NDN)
(1) The publisher publishes dataset names to a catalog node. (2) The catalog nodes sync changes among themselves. (3) The consumer queries the catalog for dataset names. (4) The consumer retrieves the data from data storage.
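A toy, in-memory model of this four-step division of labor, just to make it concrete; the class and method names are invented for illustration, and in the real system the catalog nodes synchronize over NDN sync rather than direct calls.

```python
class CatalogNode:
    """Toy catalog node: it stores dataset names only, never the data itself."""

    def __init__(self):
        self.names = set()
        self.peers = []

    # (1) A publisher pushes the names of newly published datasets.
    def publish(self, dataset_names):
        self.names.update(dataset_names)
        self.sync()

    # (2) Changes propagate to the other catalog nodes (NDN sync in reality).
    def sync(self):
        for peer in self.peers:
            peer.names |= self.names

    # (3) A consumer queries for names under a prefix.
    def query(self, prefix):
        return sorted(n for n in self.names if n.startswith(prefix))

# (4) Retrieval of the data itself goes to the data storage, not the catalog.
node1, node2 = CatalogNode(), CatalogNode()
node1.peers, node2.peers = [node2], [node1]
node1.publish(["/CMIP5/output/MOHC/HadCM3/decadal1990/day/atmos/tas/r3i2p1"])
print(node2.query("/CMIP5/output/MOHC"))
```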

13 Forwarding Strategies
(Figure: the same publisher / catalog node / consumer topology, used here to discuss forwarding strategies)

14 Science NDN Testbed
NSF CC-NIE campus infrastructure award.
10G testbed (courtesy of ESnet, UCAR, and the CSU Research LAN).
Currently holds ~50TB of CMIP5 and ~20TB of HEP data.

15 Distribution with NDN: A Case Study
“Request Aggregation, Caching, and Forwarding Strategies for Improving Large Climate Data Distribution with NDN: A Case Study”, Susmit Shannigrahi, Chengyu Fan, Christos Papadopoulos, ICN 2017.

16 Analyzing CMIP5 Activity
We looked at one ESGF server log collected at LLNL: approximately three years of requests.
CMIP5 is a 3.3 PB archive of climate data, distributed around the world and made available to the community through ESGF (~25 nodes); the next run, CMIP6, is estimated to reach into the exabytes.
Using the clients’ IP addresses, we created a network and geographic map of all clients.
Goal: run a trace-driven simulation in NDN and quantify the gains.

17 User and Client Statistics
Unique users (usernames): 5692
Unique clients (IP addresses): 9266
Unique ASNs: 911
(Figure: map of client IP addresses)

18 Data Statistics
Log duration: approximately three years
Total number of requests: 18.5M
Partial or completed downloads: 5.7M
Files requested: 1.8M
File size (95th percentile): 1.3GB
Total data requested: 1,844TB
Two out of three requests are duplicates.
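A sketch of how the duplicate share could be computed from such a log, assuming a “duplicate” means a request for a file that appeared earlier in the log; the column names are hypothetical.

```python
import csv
from collections import Counter

def duplicate_fraction(log_path: str) -> float:
    """Fraction of requests whose file was already requested earlier in the log."""
    seen = Counter()
    total = duplicates = 0
    with open(log_path) as f:
        for row in csv.DictReader(f):      # hypothetical column: "filename"
            total += 1
            if seen[row["filename"]] > 0:
                duplicates += 1
            seen[row["filename"]] += 1
    return duplicates / total if total else 0.0

# e.g. duplicate_fraction("esgf_access_log.csv") -> ~0.66 for this dataset
```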

19 Request Distribution
Some files are very popular; they are candidates for aggregation and caching, and can be served from a nearer replica.

20 Partial Transfers
Clients fall into three distinct categories according to their partial transfers.
Partial transfers waste bandwidth and server resources.
The requests are often temporally close, so aggregation and caching should help.

21 Partial Transfers and Duplicate Requests
All three categories contribute to duplicate requests; successful as well as partial transfers are repeated.

22 Simulation Setup
ndnSIM was not able to handle the entire log, so we reduced the number of events (see the sketch below):
ignore all zero-byte transfers;
randomly pick 7 weeks from the log;
choose the clients responsible for ~95% of the traffic.
We generated the topology using reverse traceroute from the server to the clients, imported it into ndnSIM, and replayed the requests in real time.
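A sketch of that log reduction, assuming the log has been loaded into a pandas DataFrame with hypothetical timestamp, client, and bytes columns; the real trace preparation for ndnSIM may differ.

```python
import random
import pandas as pd

ONE_WEEK = pd.Timedelta(weeks=1)

def prepare_trace(log: pd.DataFrame, weeks: int = 7, traffic_share: float = 0.95):
    # Ignore all zero-byte transfers.
    log = log[log["bytes"] > 0]

    # Randomly pick `weeks` one-week windows from the log's time span.
    start, end = log["timestamp"].min(), log["timestamp"].max()
    picks = [start + random.random() * (end - start - ONE_WEEK) for _ in range(weeks)]
    windows = pd.concat(
        log[(log["timestamp"] >= p) & (log["timestamp"] < p + ONE_WEEK)] for p in picks
    )

    # Keep the heaviest clients responsible for ~traffic_share of the bytes.
    per_client = windows.groupby("client")["bytes"].sum().sort_values(ascending=False)
    cumulative = per_client.cumsum() / per_client.sum()
    heavy = cumulative[cumulative <= traffic_share].index
    return windows[windows["client"].isin(heavy)]
```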

23 Simulation Setup
We randomly sampled seven weeks. This loses no generality: other weeks show similar traffic volumes and numbers of duplicate requests.

24 Result 1: Interest Aggregation
Interest aggregation gives a large reduction in duplicate Interests and is very useful in suppressing duplicate spikes; it has no effect when there are no duplicates.
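A minimal model of what Interest aggregation does at a router: Interests for a name that is already pending are collapsed into a single upstream request via the Pending Interest Table (PIT). This is a simplification with invented method names, not NFD's actual PIT code.

```python
class Router:
    """Simplified PIT: one upstream Interest per name, however many consumers ask."""

    def __init__(self, upstream):
        self.upstream = upstream   # callable that sends one Interest upstream
        self.pit = {}              # name -> list of waiting downstream faces

    def on_interest(self, name, face):
        if name in self.pit:
            # Duplicate Interest: record the extra face, send nothing upstream.
            self.pit[name].append(face)
            return
        self.pit[name] = [face]
        self.upstream(name)        # exactly one upstream Interest per name

    def on_data(self, name, data):
        # One returning Data packet satisfies every aggregated request.
        for face in self.pit.pop(name, []):
            face.deliver(name, data)
```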

25 Result 2: Caching - How Big?
Small caches are useful: even a 1GB cache provided significant benefit. A linear increase in cache size does not proportionally decrease traffic. The 95th-percentile inter-arrival time for duplicate requests is 400 seconds, so caching everything arriving on a 10G link for 400s requires about 500GB.
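The 500GB figure is simply the link rate times the duplicate inter-arrival window:

```python
link_rate_bps = 10e9      # 10 Gb/s link
window_s = 400            # 95th-percentile duplicate inter-arrival time (seconds)
cache_bytes = link_rate_bps * window_s / 8
print(cache_bytes / 1e9, "GB")   # -> 500.0 GB
```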

26 Result 3: Where to Cache?
Requests are highly localized and request paths do not overlap much, so caching at the edge provides significant benefits; only in limited cases do network-wide caches provide more benefit.

27 Result 4: Nearest-Replica
We simulated six ESGF servers with fully replicated datasets at their real locations. The NDN forwarding strategy measures path delay and sends requests to the nearest replica. 96% of the original requests were diverted to the two nearest replicas; the original server received only 0.03% of the requests.
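A sketch of the nearest-replica idea: keep a delay estimate per outgoing face and forward each Interest on the face with the lowest measured delay. The structure and names are illustrative, not the ndnSIM strategy code.

```python
import time

class NearestReplicaStrategy:
    """Forward each Interest toward the replica with the lowest measured delay."""

    def __init__(self, faces):
        self.faces = faces                               # candidate faces toward replicas
        self.delay = {f: float("inf") for f in faces}    # smoothed delay per face
        self.sent_at = {}

    def forward(self, name):
        best = min(self.faces, key=lambda f: self.delay[f])
        self.sent_at[(name, best)] = time.monotonic()
        best.send_interest(name)                         # invented face method
        return best

    def on_data(self, name, face, alpha=0.125):
        # Update an EWMA of the path delay with the newly measured sample.
        sample = time.monotonic() - self.sent_at.pop((name, face))
        old = self.delay[face]
        self.delay[face] = sample if old == float("inf") else (1 - alpha) * old + alpha * sample
```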

28 Result 5: Latency Reduction
The nearest-replica strategy also reduces client-side latency: the RTT for client-3 dropped from 200ms to 25ms.

29 Conclusions NDN is a great match for CMIP5 data distribution
Naming translation had a few bumps because of inconsistent original names, and we found a few CMIP5 naming bugs along the way.
The catalog implementation over NDN was very easy: publishing (with the help of NDN sync), discovery, and authentication.
NDN caching is very beneficial in simulation, and NDN strategies were shown to work on the testbed.

30 For More Info christos@colostate.edu http://named-data.net

