Slide 1: FreeLoader: Borrowing Desktop Resources for Large Transient Data
Vincent Freeh (1), Xiaosong Ma (1,2), Stephen Scott (2), Jonathan Strickland (1), Nandan Tammineedi (1), Sudharshan Vazhkudai (2)
(1) North Carolina State University  (2) Oak Ridge National Laboratory
September 2004
Slide 2: Roadmap
- Motivation
- FreeLoader architecture
- Design choices
- Results
- Future work
Slide 3: Motivation: Data Avalanche
- More data to process: science, industry, government
- Example: scientific data
  - Better instruments
  - More simulation power
  - Higher resolution
(Pictures: space telescope; P&E gene sequencer, from http://www.genome.uci.edu/; picture courtesy Jim Gray, SLAC Data Management Workshop)
Slide 4: Data Acquisition and Storage
- Data acquisition, reduction, analysis, visualization, storage
[Diagram: a data acquisition system connected over a high-speed network to remote storage, supercomputers, local users, and remote users with local computing and storage; metadata and raw data flow between them]
Slide 5: Remote Data Sources
- Data serving at supercomputing sites
  - Shared file systems: GPFS
  - Archiving systems: HPSS
- Data centers
- Expensive, high-end solutions with guaranteed capacity and access rates
- Tools used for access
  - FTP, GridFTP
  - Grid file systems
  - Customized data migration programs
  - Web browsers
Slide 6: User Perspective
- End users typically process data locally
  - Convenience and control
  - Better CPU/memory configurations
- Problem 1: local space is needed to hold the data
- Problem 2: getting data from remote sources is slow
  - Central point of failure
  - High contention for the resource; multiple incoming requests hurt availability
- Dataset characteristics
  - Write-once, read-many access patterns
  - Raw data often discarded
  - Shared interest in the same data among groups
  - Primary copy archived elsewhere
- (cf. Squirrel, a P2P web cache)
Slide 7: Harnessing Idle Disk Storage
- Harnessing the storage of individual workstations is analogous to harnessing idle CPU cycles
- LAN environments: desktops with 100 Mbps or Gbps connectivity
- Increasing hard disk capacities
  - An increasing share of total capacity is unused: 50% and upwards
  - Even with contributions much smaller than the available space, the aggregate storage is impressive
- Increasing numbers of workstations are online most of the time
- Benefits: access locality, aggregate I/O and network bandwidth, data sharing
Slide 8: Use Cases
FreeLoader storage cloud as a:
- Cache
- Local, client-side scratch space
- Intermediate hop
- Grid replica
Slide 9: Intended Role of FreeLoader
What the scavenged storage is not:
- Not a replacement for high-end storage
- Not a file system
- Not intended for integrating resources at wide-area scale
- Does not emphasize replica discovery, routing protocols, and consistency the way P2P storage systems do
What it is:
- A low-cost, best-effort alternative to remote high-end storage
- Intended to facilitate transient access to large, read-only datasets and data sharing within an administrative domain
- To be used in conjunction with higher-end storage systems
Slide 10: FreeLoader Architecture
[Diagram: Grid data access tools sit on top of a Management Layer (data placement, replication, Grid awareness, metadata management), which coordinates a Storage Layer of pools (A, m, n); pools register with the manager and provide morsel access, data integrity, and non-invasiveness]
Slide 11: Storage Layer
- Donors/benefactors:
  - Morsels as the unit of contribution
  - Basic morsel operations: new(), free(), get(), put(), ...
  - Space reclaim: user withdrawal / space shrinkage
  - Data integrity through checksums
  - Performance history kept per benefactor
- Pools:
  - Benefactor registrations (soft state)
  - Dataset distributions
  - Proximity and performance characteristics
[Diagram: morsels of datasets 1..n distributed and replicated across the benefactors of a pool]
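To make the morsel abstraction concrete, here is a minimal sketch of what a benefactor-side morsel store with these operations could look like. The class, the 1 MB morsel size, and the SHA-1 checksums are illustrative assumptions, not the actual FreeLoader implementation.

    # Minimal sketch of a benefactor-side morsel store (illustrative only).
    import hashlib, os

    MORSEL_SIZE = 1 << 20   # 1 MB, matching the morsel size used in the experiments

    class MorselStore:
        def __init__(self, directory, contributed_bytes):
            self.dir = directory
            self.capacity = contributed_bytes // MORSEL_SIZE   # morsels the donor contributes
            self.free_ids = set(range(self.capacity))
            self.checksums = {}                                # morsel id -> SHA-1 digest

        def new(self):
            """Allocate an empty morsel and return its id."""
            return self.free_ids.pop()

        def free(self, mid):
            """Return a morsel to the free list (space reclaim / user withdrawal)."""
            self.checksums.pop(mid, None)
            self.free_ids.add(mid)

        def put(self, mid, data):
            """Store morsel data and remember its checksum for integrity checks."""
            assert len(data) <= MORSEL_SIZE
            with open(os.path.join(self.dir, str(mid)), "wb") as f:
                f.write(data)
            self.checksums[mid] = hashlib.sha1(data).hexdigest()

        def get(self, mid):
            """Read a morsel back and verify it against the stored checksum."""
            with open(os.path.join(self.dir, str(mid)), "rb") as f:
                data = f.read()
            if hashlib.sha1(data).hexdigest() != self.checksums[mid]:
                raise IOError("morsel %d failed integrity check" % mid)
            return data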
Slide 12: Management Layer
- Manager:
  - Pool registrations
  - Metadata: datasets-to-pools, pools-to-benefactors, etc.
- Availability: Redundant Array of Replicated Morsels
  - Minimum replication factor for morsels
  - Where to replicate? Which morsel replica to choose?
- Clients are oblivious to metadata layout; all metadata requests are sent to the manager
- Cache replacement policy
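As an illustration of the metadata maps the manager keeps and how a replica might be chosen using per-benefactor performance history, a small sketch follows. The data structures and the throughput-based choice are assumptions for exposition, not the slide's stated policy.

    # Sketch of the manager's metadata maps and a simple replica choice.
    class Manager:
        def __init__(self):
            self.dataset_to_pools = {}      # dataset name -> list of pool ids
            self.pool_to_benefactors = {}   # pool id -> list of benefactor ids
            self.replicas = {}              # (dataset, morsel index) -> [benefactor ids]
            self.throughput = {}            # benefactor id -> observed MB/s

        def choose_replica(self, dataset, morsel_idx):
            """Pick the replica held by the benefactor with the best observed throughput."""
            candidates = self.replicas[(dataset, morsel_idx)]
            return max(candidates, key=lambda b: self.throughput.get(b, 0.0))

        def needs_replication(self, dataset, morsel_idx, min_factor=2):
            """Check whether a morsel has fallen below the minimum replication factor."""
            return len(self.replicas[(dataset, morsel_idx)]) < min_factor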
Slide 13: Dataset Striping
- Stripe datasets across benefactors
  - The morsel doubles as the basic unit of striping
  - The manager decides the allocation of data blocks to morsels across benefactors
- Multiple benefits
  - Higher aggregate access bandwidth
  - Lower impact per benefactor
  - Load balancing
- Greedy algorithm to make the best use of available space
- Stripe width and stripe size can be varied as striping parameters
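The greedy, space-aware placement described on this slide could be sketched as below; the function and its parameters are hypothetical and only illustrate assigning each stripe to the benefactors with the most free space.

    # Sketch of greedy striping: each stripe of `stripe_width` benefactors receives
    # `stripe_size` consecutive morsels, and the benefactors with the most free space
    # are preferred. The real allocation policy may differ.
    def stripe_dataset(num_morsels, free_space, stripe_width, stripe_size=1):
        """free_space: dict benefactor -> free morsels. Returns morsel index -> benefactor."""
        placement = {}
        m = 0
        while m < num_morsels:
            # Greedily pick the stripe_width benefactors with the most space left.
            targets = sorted(free_space, key=free_space.get, reverse=True)[:stripe_width]
            for b in targets:
                for _ in range(stripe_size):
                    if m >= num_morsels:
                        return placement
                    placement[m] = b
                    free_space[b] -= 1
                    m += 1
        return placement

For example, stripe_dataset(8, {"b1": 100, "b2": 100, "b3": 50}, stripe_width=2) spreads the eight morsels across the two least-loaded benefactors, one morsel per benefactor per stripe.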
Slide 14: Client Interface
- Obtains metadata from the manager
- Performs gets or puts directly to the benefactors
- All control messages are exchanged via UDP; all data transfers use TCP
- Morsel requests are sent to benefactors in parallel; the striping strategy ensures the returned blocks are contiguous
- Efficient buffering strategy
  - Buffer pool of size (stripe size + 1) * stripe width
  - Double-buffering scheme allows network transfers and disk I/O to proceed in parallel
  - Once the pool fills up, buffer contents are flushed to disk
  - Waiting for filled buffers to form contiguous blocks before writing reduces disk seeks
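A minimal sketch of this buffering idea, assuming a pool of (stripe size + 1) * stripe width one-morsel buffers and one large sequential flush when the pool fills; fetch_morsel() stands in for the parallel TCP retrieval, and all names are assumptions rather than the real client code.

    # Sketch of the client-side buffering: accumulate morsels until the pool is
    # full, then write them to disk as one contiguous block.
    MORSEL_SIZE = 1 << 20

    def retrieve(dataset_morsels, stripe_size, stripe_width, out_path, fetch_morsel):
        pool_bytes = (stripe_size + 1) * stripe_width * MORSEL_SIZE
        buffered = []
        with open(out_path, "wb") as out:
            for mid, benefactor in dataset_morsels:          # morsels arrive in stripe order
                buffered.append(fetch_morsel(benefactor, mid))
                if sum(len(b) for b in buffered) >= pool_bytes:
                    out.write(b"".join(buffered))            # one large sequential write
                    buffered.clear()
            if buffered:                                     # flush whatever remains
                out.write(b"".join(buffered))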
Slide 15: Current Status
[Diagram: the application calls the client through an I/O interface — reserve(), cancel(), store(), retrieve(), delete(), open(), close(), read(), write(); benefactor daemons on donor OSes expose the morsel operations new(), free(), get(), put()]
- (A) services (UDP): dataset creation/deletion, space reservation
- (B) services (UDP/TCP): dataset retrieval, hints
- (C) services (UDP): registration; benefactor alerts, warnings, and alarms to the manager
- (D) services (UDP/TCP): dataset store, morsel requests
- Simple data striping implemented
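As a hypothetical illustration of how the client-facing calls named on this slide might fit together, a short usage sketch follows; the client object and call signatures are assumptions, not the prototype's actual API.

    # Illustrative use of the client-side I/O interface (signatures assumed).
    def cache_dataset(client, name, local_file, size):
        handle = client.reserve(name, size)        # ask the manager for space
        try:
            client.store(handle, local_file)       # stripe the data out to benefactors
        except IOError:
            client.cancel(handle)                  # give the reservation back on failure
            raise

    def read_dataset(client, name, out_file):
        client.retrieve(name, out_file)            # manager returns the morsel map;
                                                   # data then flows directly from benefactors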
Slide 16: Results: Experiment Setup
- FreeLoader prototype running at ORNL
- Client box
  - AMD Athlon 700 MHz, 400 MB memory
  - Gig-E card
  - Linux 2.4.20-8
- Benefactors
  - Group of heterogeneous Linux workstations
  - Contributing 7 GB-30 GB each
  - 100 Mb cards
Slide 17: Data Sources
- Local GPFS
  - Attached to ORNL supercomputers
  - Accessed through GridFTP (1 MB TCP buffer, 4 parallel streams)
- Local HPSS
  - Accessed through the HSI client, highly optimized
  - Hot: data in disk cache, no tape unloading
  - Cold: data purged; retrievals done at large intervals
- Remote NFS
  - At the NCSU HPC center
  - Accessed through GridFTP (1 MB TCP buffer, 4 parallel streams)
- FreeLoader
  - 1 MB morsel size for all experiments
  - Varying configurations
Slide 18: Testbed [testbed diagram]
Slide 19: Best-of-class performance comparisons [chart; y-axis: throughput (MB/s)]
Slide 20: Effect of stripe width variation (stripe size = 1 morsel) [chart]
Slide 21: Effect of stripe width variation (stripe size = 8 morsels) [chart]
Slide 22: Effect of stripe size variation (stripe width = 4 benefactors) [chart]
Slide 23: Impact Tests
- How uncomfortable do the donors feel when running:
  - CPU-intensive tasks?
  - Disk-intensive tasks?
  - Network-intensive tasks?
- A set of tests at NCSU
  - Benefactor performs local tasks while a client retrieves datasets at a given rate
  - The rate is varied to study the impact on the user
  - Pentium 4, 512 MB memory, 100 Mbps connectivity
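One way such a controlled retrieval rate could be implemented on the client side is sketched below; the target rate parameter and fetch_morsel() are illustrative stand-ins, not the actual measurement harness used in these tests.

    # Sketch of client-side rate throttling: pace morsel requests so that a
    # chosen MB/s target is not exceeded.
    import time

    MORSEL_SIZE = 1 << 20

    def throttled_retrieve(morsels, fetch_morsel, target_mbps):
        interval = MORSEL_SIZE / (target_mbps * 1e6)    # seconds per morsel at the target rate
        for mid, benefactor in morsels:
            start = time.time()
            fetch_morsel(benefactor, mid)
            elapsed = time.time() - start
            if elapsed < interval:
                time.sleep(interval - elapsed)          # sleep to respect the cap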
Slide 24: CPU-intensive and mixed tasks [chart; y-axis: time (s)]
Slide 25: Network-intensive task [chart; y-axis: normalized download time]
Slide 26: Disk-intensive task [chart; y-axis: throughput (MB/s)]
Slide 27: Sample Application: formatdb
- Subset of the basic file APIs implemented
- formatdb (NCBI): part of the BLAST toolkit; preprocesses a biological sequence database to create a set of sequence and index files
- The raw database is an ideal candidate for caching on FreeLoader
- formatdb is not the ideal application for FreeLoader
- formatdb runtime:

    Source       Local   NFS   1 benefactor   2 benefactors   4 benefactors
    Time (sec)   598     585   599            563             556
Slide 28: Significant Results
Slide 29: Significant Results (continued)
- 2x and 4x speedups with respect to GPFS and HPSS
- Management overhead is minimal
- 14% worst-case performance hit for CPU-intensive tasks; <= 25% for network-intensive tasks
- formatdb tests the upper bound of FreeLoader's internal overhead
  - With 1 benefactor: same as local disk, 2% slower than NFS
  - With 4 benefactors: 5% faster than NFS
- Roughly 10 MB/s performance gain for each benefactor added, until saturation
Slide 30: Conclusions
- The goal is to saturate the client side; striping helps achieve this
- Low-cost commodity parts; harnesses idle disk bandwidth
- Low impact on donors, controlled by throttling the request rate
- Better availability and more suitable for large transient datasets than a regular file system
Slide 31: In-Progress and Future Work
- In progress
  - Windows support
- Future
  - Complete pool structure and registration
  - Intelligent data distribution, service profiling
  - Benefactor impact control, self-configuration
  - Naming and replication
  - Grid awareness
- Potential extensions
  - Harnessing local storage at cluster nodes?
  - Complementing commercial storage servers?
Slide 32: Further Information
http://www.csm.ornl.gov/~vazhkuda/Morsels/