Optimizing End-User Data Delivery Using Storage Virtualization Sudharshan Vazhkudai Oak Ridge National Laboratory Ohio State University Systems Group Seminar October 20th, 2006 Columbus, Ohio
Outline: Problem space: Client-side caching; Storage Virtualization: FreeLoader Desktop Storage Cache; A Virtual cache: Prefix caching; End on a funny note!!
Problem Domain Data Deluge Experimental facilities: SNS, LHC (PBs/yr) Observatories: sky surveys, world-wide telescopes Simulations from NLCF end-stations Internet archives: NIH GenBank (serves 100 gigabases of sequence data) Typical user access traits on large scientific data Download remote datasets using favorite tools: FTP, GridFTP, hsi, wget Shared interest among groups of researchers A Bioinformatics group collectively analyzes and visualizes a sequence database for a few days: Locality of interest! Oftentimes, users discard the original datasets after interest dissipates
So, what’s the problem with this story? Wide-area data movement is full of pitfalls Server bottlenecks, BW/latency fluctuations GridFTP-like tuned tools not widely available Popular Internet repositories still served through modest transfer tools! User applications are often latency intolerant e.g., real-time viz rendering of a TerraServer map from Microsoft on ORNL’s tiled display! Why can’t we address this with the current storage landscape? Shared storage: Limited quotas Dedicated storage: SAN storage is a non-trivial expense! (4TB disk array ~ $40K) Local storage: Usually not enough for such large datasets Archive in mass storage for future accesses: High latency Upshot Retrieval rates significantly lower than local I/O or LAN throughput
Is there a silver lining at all? (Desktop Traits) Desktop Capabilities better than ever before Space usage to Available storage ratio is significantly low in academic and industry settings Increasing numbers of workstations online most of the time At ORNL-CSMD, ~ 600 machines are estimated to be online at any given time At NCSU, > 90% availability of 500 machines Well-connected, secure LAN settings A high-speed LAN connection can stream data faster than local disk I/O
Storage Virtualization? Can we use novel storage abstractions to provide: More storage than locally available Better performance than local or remote I/O A seamless architecture for accessing and storing transient data
Desktop Storage Scavenging as a means to virtualize I/O access FreeLoader Imagine Condor for storage Harness the collective storage potential of desktop workstations ~ Harnessing idle CPU cycles Increased throughput due to striping Split large datasets into pieces, Morsels, and stripe them across desktops Scientific data trends Usually write-once-read-many Remote copy held elsewhere Primarily sequential accesses Data trends + LAN-Desktop Traits + user access patterns make collaborative caches using storage scavenging a viable alternative!
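The striping idea on this slide can be illustrated with a minimal sketch. It assumes a simple round-robin placement and a 1 MB morsel size (the chunk size quoted later in the talk); the function name `stripe_morsels` and the donor names are hypothetical and are not part of FreeLoader.

```python
def stripe_morsels(dataset_size, morsel_size, donors):
    """Assign morsel indices to donors in round-robin order (assumed policy)."""
    # Ceiling division: a trailing partial morsel still needs a home.
    num_morsels = -(-dataset_size // morsel_size)
    placement = {donor: [] for donor in donors}
    for index in range(num_morsels):
        placement[donors[index % len(donors)]].append(index)
    return placement


MB = 1 << 20
# Hypothetical example: a 10 MB dataset striped over three desktops.
layout = stripe_morsels(10 * MB, 1 * MB, ["desk-a", "desk-b", "desk-c"])
for donor, morsels in layout.items():
    print(donor, morsels)
```

With a layout like this, a client can read consecutive morsels from different desktops in parallel, which is where the throughput gain from striping comes from.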
Old wine in a new bottle…? Key strategies derived from “best practices” across a broad range of storage paradigms… Desktop Storage Scavenging from P2P systems Striping, parallel I/O from parallel file systems Caching from cooperative Web caching And, applied to scientific data management for Access locality, aggregating I/O, network bandwidth and data sharing Posing new challenges and opportunities: heterogeneity, striping, volatility, donor impact, cache management and availability
FreeLoader Environment
FreeLoader Architecture Lightweight UDP Scavenger device: metadata bitmaps, morsel organization Morsel service layer Monitoring and Impact control Global free space management Metadata management Soft-state registrations Data placement Cache management Profiling
Testbed and Experiment setup FreeLoader installed in a user’s HPC setting GridFTP access to NFS GridFTP access to PVFS hsi access to HPSS Cold data from tapes Hot data from disk caches wget access to Internet archive
Comparing FreeLoader with other storage systems
Optimizing access to the cache: Client Access-pattern Aware Striping Uploading client likely to access more frequently So, let’s try to optimize data placement for him! Overlap network I/O with local I/O What is the optimal local:remote data ratio? Model
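The slide does not spell out the model, so the following is only an assumed one, offered as a sketch: if the locally placed part is read at `local_rate` while the rest streams from remote donors at `remote_rate`, and the two fully overlap, the access time is the slower of the two and is minimized when both finish together. The function names and the example rates are hypothetical.

```python
def local_fraction(local_rate, remote_rate):
    """Fraction of the dataset to place locally so both streams finish together.

    Assumed model: x * S / local_rate == (1 - x) * S / remote_rate.
    """
    return local_rate / (local_rate + remote_rate)


def access_time(size, fraction_local, local_rate, remote_rate):
    """Time to read the dataset when local and remote I/O fully overlap."""
    local_time = size * fraction_local / local_rate
    remote_time = size * (1 - fraction_local) / remote_rate
    return max(local_time, remote_time)


# Hypothetical rates in MB/s and a hypothetical 900 MB dataset.
x = local_fraction(30.0, 60.0)
print(x, access_time(900.0, x, 30.0, 60.0))
print(access_time(900.0, 0.0, 30.0, 60.0))  # everything remote, for comparison
```

Under this assumed model the local:remote ratio is simply `local_rate : remote_rate`; a real deployment would also have to account for contention between the two streams.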
Philosophizing… What the scavenged storage “is not”: Not a file system, not a replacement for high-end storage Not intended for wide-area resource integration What it “is”: Low-cost, best-effort storage cache for scientific data sources Intended to facilitate Transient access to large, read-only datasets Data sharing within administrative domain To be used in conjunction with higher-end storage systems
Towards a “virtual cache” Scientific data caches typically host complete datasets Not always feasible in our environment since: Desktop workstations can fail or space contributions can be withdrawn leaving partial datasets Not enough space in the cache to host the new dataset in entirety Cache evictions can leave partial copies of datasets Can we host partial copies of datasets and yet serve client accesses to the entire dataset? ~ FileSystem-BufferCache:Disk :: FreeLoader:RemoteDataSource
The Prefix Caching Problem: Impedance Matching on Steroids!! HTTP Prefix Caching Multimedia, streaming data delivery BitTorrent P2P System: leechers can download and yet serve Benefits Bootstrapping the download process Store more datasets Allows for efficient cache management Oh…, those scientific data trends again (how convenient…) Immutable data, Remote source copy, Primarily sequential accesses Challenges Clients should be oblivious to dataset being partially available Performance hit? How much of the prefix of a dataset to cache? So, client accesses can progress seamlessly Online patching issues Client access to remote patching I/O mismatch Wide-area download vagaries
Virtual Cache Architecture Capability-based resource aggregation Persistent storage & BW-only donors Client serving: parallel get Remote patching using URIs Better cache management Stripe entirely when space available When eviction is needed, only stripe a prefix of the dataset Victims based on LRU: Evict chunks from the tail until a prefix Entire datasets evicted only after all such tails are evicted
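The eviction order described on this slide can be sketched in two passes: first trim tails in LRU order until only a prefix remains, then evict whole datasets only if that was not enough. This is a minimal illustration, not the FreeLoader implementation; the `CachedDataset` record, its fields, and the chunk counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CachedDataset:
    name: str
    cached_chunks: int   # chunks held; always the prefix [0, cached_chunks)
    prefix_chunks: int   # smallest prefix worth keeping (assumed to be given)
    last_access: float   # timestamp of the most recent client access


def evict(datasets, chunks_needed):
    """Free chunks_needed chunks; return a list of (name, chunks_freed)."""
    actions = []
    lru_order = sorted(datasets, key=lambda d: d.last_access)
    # Pass 1: evict chunks from the tail of each dataset, least recent first.
    for d in lru_order:
        if chunks_needed <= 0:
            break
        freed = min(max(d.cached_chunks - d.prefix_chunks, 0), chunks_needed)
        if freed:
            d.cached_chunks -= freed
            chunks_needed -= freed
            actions.append((d.name, freed))
    # Pass 2: only after all tails are gone, evict entire datasets.
    for d in lru_order:
        if chunks_needed <= 0:
            break
        if d.cached_chunks:
            actions.append((d.name, d.cached_chunks))
            chunks_needed -= d.cached_chunks
            d.cached_chunks = 0
    return actions


cache = [CachedDataset("A", 100, 40, 1.0), CachedDataset("B", 50, 20, 2.0)]
print(evict(cache, 70))   # trims tails only
print(evict(cache, 60))   # tails exhausted, so a whole dataset goes
```

Keeping every surviving copy as a prefix is what lets sequential client reads start from the cache while the suffix is patched from the remote source.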
Prefix Size Prediction
Goal: Eliminate client-perceived delay in data access
What is an optimal prefix size to hide the cost of suffix patching?
Prefix size depends on: dataset size, $S$; in-cache data access rate by the client, $R_{client}$; suffix patching rate, $R_{patch}$; initial latency in suffix patching, $L$
The time the client takes to read the whole dataset from the cache must cover the time needed to patch the suffix: $S / R_{client} = L + (S - S_{prefix}) / R_{patch}$
Thus, $S_{prefix} = S (1 - R_{patch} / R_{client}) + L R_{patch}$
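The prefix size formula translates directly into a small helper. The sketch below assumes sizes in MB, rates in MB/s, and latency in seconds; the function name and the 2500 MB dataset size are hypothetical, while the rates and latency in the example are the Tera-PSC values from the model verification slide. Capping the result at the dataset size is an addition of this sketch, and it assumes the patching rate does not exceed the client rate.

```python
def prefix_size(dataset_size, client_rate, patch_rate, patch_latency):
    """S_prefix = S * (1 - R_patch / R_client) + L * R_patch, capped at S."""
    size = dataset_size * (1 - patch_rate / client_rate) + patch_latency * patch_rate
    # A prefix can never exceed the dataset itself (cap added by this sketch).
    return min(size, dataset_size)


# Hypothetical 2500 MB dataset; rates and latency from the Tera-PSC column.
S = 2500.0
s_prefix = prefix_size(S, client_rate=52.2, patch_rate=10.8, patch_latency=3.9)
print(round(s_prefix / S, 2))  # prefix ratio, about 0.81 for these inputs
```

The slower the remote source is relative to the cache, the closer the required prefix gets to the full dataset.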
Collective Download Why? Wide-area transfer reasons: Storage systems and protocols for HEC are tuned for bulk transfers (GridFTP, HSI) Wide-area transfer pitfalls: high latency, connection establishment cost Client’s local-area cache access reasons: Client accesses to the cache use a smaller stripe size (e.g., 1MB chunks in FreeLoader) Finer granularity for better client access rates Can we derive from collective I/O in parallel I/O?
Collective Download Implementation Patching nodes perform bulk, remote I/O; ~ 256MB per request Reducing multiple authentication costs per dataset Automated interactive session with “Expect” for single sign on FreeLoader patching framework instrumented with Expect Protocol needs to allow sessions (GridFTP, HSI) Need to reconcile the mismatch in client access stripe size and the bulk, remote I/O request size Shuffling Patching nodes, p, redistribute the downloaded chunks among themselves according to the client’s striping policy Redistribution will enable a round-robin client access Each patching node redistributes (p – 1)/p of the downloaded data Shuffling accomplished in memory to motivate BW-only donors Thus, client serving, collective download and shuffling are all overlapped
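The shuffling step can be made concrete with a small sketch. It assumes that patching node `i` downloads the `i`-th contiguous bulk block, that blocks are divided into fixed-size chunks, and that the client's striping policy is round-robin by chunk index; the function names are hypothetical. With 256 chunks per block (256 MB requests, 1 MB chunks), each node keeps 1/p of its block and sends the remaining (p - 1)/p to its peers.

```python
def shuffle_plan(num_nodes, chunks_per_block):
    """Return plan[src][dst]: chunk indices that src holds and dst should serve."""
    plan = {src: {dst: [] for dst in range(num_nodes)} for src in range(num_nodes)}
    for src in range(num_nodes):
        first = src * chunks_per_block          # src downloaded one bulk block
        for chunk in range(first, first + chunks_per_block):
            owner = chunk % num_nodes           # round-robin client striping
            plan[src][owner].append(chunk)
    return plan


def redistributed_fraction(plan, src, chunks_per_block):
    """Share of src's downloaded block that must move to other nodes."""
    sent = sum(len(chunks) for dst, chunks in plan[src].items() if dst != src)
    return sent / chunks_per_block


plan = shuffle_plan(num_nodes=4, chunks_per_block=256)
print(redistributed_fraction(plan, 0, 256))  # (p - 1) / p for p = 4
```

After the exchange, node `k` holds exactly the chunks whose index is congruent to `k` modulo p, so the client can fetch consecutive chunks from consecutive nodes.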
Testbed and Experiment setup UberFTP stateful client to GridFTP servers at TeraGrid-PSC and TeraGrid-ORNL HSI access to HPSS Cold data from tapes FreeLoader patching framework deployed in this setting
Collective Download Performance (PW=10; I/O=256M)
Data source   Download   Download + Shuffle   Client access
HPSS          13.6       12.3 (-9.6%)         11.7 (-4.9%)
Tera-ORNL     79.7       75.1 (-5.8%)         74.7 (-1.3%)
Tera-PSC      21.9       20.2 (-7.8%)         20 (-1.0%)
Prefix Size Model Verification
Data sources      HPSS-ORNL   Tera-ORNL   Tera-PSC
Rclient (MB/s)    52.2 (same for all sources)
Rpatch (MB/s)     7.6         42          10.8
L (s)             31.4        3           3.9
Predicted ratio   95%         24.6%       81%
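As a worked check on these numbers, dividing the prefix size formula by the dataset size splits the predicted ratio into a rate-only term and a latency term. The arithmetic below uses only the table values and assumes the single client rate of 52.2 MB/s applies to all three sources; the dataset size is not listed here, so the latency term is left in terms of S.

```latex
\frac{S_{prefix}}{S} = \left(1 - \frac{R_{patch}}{R_{client}}\right) + \frac{L\,R_{patch}}{S}

\text{HPSS-ORNL:}\quad 1 - \frac{7.6}{52.2} \approx 85.4\%, \qquad L\,R_{patch} = 31.4 \times 7.6 \approx 238.6\ \text{MB}

\text{Tera-ORNL:}\quad 1 - \frac{42}{52.2} \approx 19.5\%, \qquad L\,R_{patch} = 3 \times 42 = 126\ \text{MB}

\text{Tera-PSC:}\quad 1 - \frac{10.8}{52.2} \approx 79.3\%, \qquad L\,R_{patch} = 3.9 \times 10.8 \approx 42.1\ \text{MB}
```

The gap between each rate-only term and the predicted ratio in the table (95%, 24.6%, 81%) is the latency term, which is largest for HPSS because of its long start-up latency.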
Impact of Prefix Caching on Cache Hit rate
Jefferson Lab Asynchronous Storage Manager (JASMine)
No of days               19.1
No of accesses           4000
No of unique datasets    1686
Tera-ORNL will see improvements around 0.2 and 0.4 curve (308% and 176% for 20% and 40% prefix ratio)
Tera-PSC sees up to 76% improvement in hit rate with 80% prefix ratio
Let me philosophize again… Novel storage abstractions as a means to: Provide performance impedance matching Overlap remote I/O, cache I/O and local I/O into a seamless “data pathway” Provide rich resource aggregation models Provide a low-cost, best-effort architecture for “transient” data A combination of best practices from: parallel I/O, P2P scavenging, cooperative caching, HTTP multimedia streaming; brought to bear on “scientific data caching” Intermediate data cache exploits this area
Let me advertise… http://www.csm.ornl.gov/~vazhkuda/Storage.html Email: vazhkudaiss@ornl.gov Collaborator: Xiaosong Ma (NCSU) Funding: DOE ORNL LDRD (Terascale & Petascale initiatives) Interested in joining our team? Full time positions and summer internships available
More slides Some performance numbers Impact studies
Striping Parameters
Client-side Filters
Computation Impact
Network Activity Test
Disk-intensive Task
Impact Control