1
Khalil Amiri*, David Petrou, Gregory R. Ganger* and Garth A. Gibson "Dynamic Function Placement for Data-intensive Cluster Computing," Proceedings of the USENIX Annual Technical Conference, San Diego, CA, June 2000. (http://www.pdl.cs.cmu.edu/Publications/publications.html)
2
Function Placement for Data-Intensive Cluster Computing
Data-intensive applications that filter, mine, sort, or manipulate large data sets
–Spread data-parallel computations across source/sink servers
–Exploit the servers' computational resources
–Reduce network bandwidth
3
Programming model and runtime system
Compose data-intensive applications from explicitly migratable, functionally independent components
–Application and filesystem are represented as a graph of communicating mobile objects
–Mobile objects provide explicit methods that checkpoint and restore their state during migration
–The graph is rooted at storage servers by non-migratable storage objects, which provide persistent storage
–The graph is anchored at clients by a non-migratable console object, which contains the part of the application that must remain at the node where the application is started
4
ABACUS runtime system
Migration and location-transparent invocation component (binding manager)
–Creates location-transparent references to mobile objects
–Redirects method invocations in the face of object migrations
–Each machine's binding manager notifies the local resource manager of each procedure call to, and return from, a mobile object
Resource monitoring and management component (resource manager)
–Uses these notifications to collect statistics about bytes moved between objects and the resources used by objects
–Monitors the load on the local processor and the costs of moving data to and from storage servers
–Server-side resource managers collect statistics from client-side resource managers
–Employs analytic models to estimate the performance advantages that might accrue from alternate placements
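As a rough illustration of the interposition described above, the sketch below shows a binding-manager-style wrapper that times an invocation and reports bytes moved and elapsed time to a resource manager. All names and the reporting interface are assumptions for illustration, not the ABACUS API.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical resource manager that accumulates per-edge statistics
// reported by the binding manager on every call and return.
struct ResourceManager {
    void NoteCall(int caller_id, int callee_id, std::size_t bytes_in) { /* record */ }
    void NoteReturn(int caller_id, int callee_id, std::size_t bytes_out,
                    std::chrono::microseconds elapsed) { /* record */ }
};

// Wrap an invocation of a mobile-object method so the runtime sees the
// call, the return, the bytes crossing the edge, and the elapsed time.
template <typename Object, typename Method, typename Arg>
auto InvokeThroughBindingManager(ResourceManager& rm, int caller, int callee,
                                 Object& obj, Method m, const Arg& arg) {
    rm.NoteCall(caller, callee, sizeof(arg));          // bytes flowing "down"
    auto start = std::chrono::steady_clock::now();
    auto result = (obj.*m)(arg);                       // local call or RPC stub
    auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::steady_clock::now() - start);
    rm.NoteReturn(caller, callee, sizeof(result), elapsed);  // bytes flowing "up"
    return result;
}
```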
5
Programming Model: Mobile Objects
–Written in C++
–Required to implement a few methods that let the runtime system create instances and migrate them
–Medium granularity: each performs a self-contained, data-intensive processing step such as parity computation, caching, searching, or aggregation
–Has private state that is not accessible to outside objects except via its exported interface
–Responsible for saving its private state, including the state of all embedded objects, when the Checkpoint() method is called by ABACUS
–Responsible for restoring that state, including creating and initializing all embedded objects, when the runtime system invokes the restore() method after migration to a new node
–Checkpoint and restore go to/from an external file or memory
See Figure 1
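A minimal sketch of the mobile-object contract described on this slide. Only the checkpoint/restore responsibilities come from the paper; the class names, the Buffer type, and the example filter are illustrative assumptions.

```cpp
#include <string>
#include <vector>

using Buffer = std::vector<char>;   // opaque byte buffer for saved state

class MobileObject {
public:
    virtual ~MobileObject() = default;

    // Called by the runtime before migration: serialize all private state,
    // including embedded objects, into an external file or memory buffer.
    virtual void Checkpoint(Buffer& out) = 0;

    // Called by the runtime on the new node after migration: recreate and
    // initialize embedded objects, then reload the saved state.
    virtual void Restore(const Buffer& in) = 0;
};

// Example: a medium-granularity, data-intensive filter object.
class FilterObject : public MobileObject {
    std::string pattern_;           // private state, reachable only via methods
public:
    explicit FilterObject(std::string pattern) : pattern_(std::move(pattern)) {}

    void Checkpoint(Buffer& out) override {
        out.assign(pattern_.begin(), pattern_.end());   // save the pattern
        // a real object would also encode counters and embedded objects
    }
    void Restore(const Buffer& in) override {
        pattern_.assign(in.begin(), in.end());          // reload the pattern
    }
};
```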
6
Storage Servers
Provide local storage objects that export a flat-file interface
–Storage objects are accessible only at the server that hosts them and never migrate
–Migratable objects lie between the storage objects and the console object
–Applications can declare other objects to be non-migratable; for example, the filesystem can declare an object that implements write-ahead logging as non-migratable
7
Iterative Processing Model
–Synchronous invocations start at the top-level console object and propagate down the object graph
–The amount of data moved per invocation is an application-specific number of records, rather than the entire file or data set
–ABACUS accumulates statistics on return from method invocations to make object migration decisions
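The loop below sketches this iterative model, assuming a hypothetical NextBatch() method on graph nodes: the console repeatedly pulls a bounded batch of records down the graph rather than the whole data set.

```cpp
#include <cstddef>
#include <vector>

struct Record { /* application-defined fields */ };

class GraphNode {
public:
    virtual ~GraphNode() = default;
    // Synchronous invocation that propagates down the graph: returns up to
    // `max_records`, and fewer (or zero) once the storage object is exhausted.
    virtual std::vector<Record> NextBatch(std::size_t max_records) = 0;
};

void ConsoleLoop(GraphNode& top_of_graph) {
    const std::size_t kBatch = 1024;          // application-specific batch size
    for (;;) {
        auto records = top_of_graph.NextBatch(kBatch);
        if (records.empty()) break;           // end of the data set
        // consume records at the console; the runtime records bytes moved
        // on each return, which later drives migration decisions
    }
}
```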
8
Object-based distributed filesystem
Filesystems are composed of explicitly migratable objects (Figure 2)
–RAID
–Caching
–Application-specific functionality
Coherent file and directory abstractions atop the flat file space exported by the base storage objects
–A file is a stack of objects supporting the services bound to that file; files whose data cannot be lost include a RAID object
–When a file is opened, the top-most object is instantiated first, then the lower-level objects
–Supports inter-client file and directory sharing
–Allows both file data and directory data to be cached and manipulated at trusted clients
–AFS-style callbacks provide cache coherence
–A timestamp-ordering protocol ensures that updates performed at a client are consistent before being committed at the server
9
Virtual File System Interface (VFS)
–The Virtual File System (VFS) interface hides the implementation-dependent parts of the file system
–BSD implemented VFS for NFS; its aim is to dispatch requests to different filesystems
–Manages kernel-level file abstractions in one format for all file systems
–Receives system-call requests from user level (e.g. write, open, stat, link)
–Interacts with a specific file system based on mount-point traversal
–Receives requests from other parts of the kernel, mostly from memory management
(http://bukharin.hiof.no/fag/iad22099/innhold/lab/lab3/nfsnis_slides/text13.htm)
10
Virtual File System Interface (VFS)
–Microsoft Windows has VFS-type interfaces
–Functions of the VFS: separate file-system-generic operations from their implementation; enable transparent access to a variety of different local file systems
–At the VFS interface, files are represented as v-nodes, network-wide unique numerical designators for files
–A v-node contains a pointer to its parent file system and to the file system that is mounted over it
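A rough sketch of the v-node idea as described here, with illustrative field names rather than the actual BSD layout:

```cpp
#include <cstdint>

struct FileSystem;   // a mounted file system instance (forward declaration)

struct VNode {
    std::uint64_t fid;             // network-wide unique file designator
    FileSystem*   vfs;             // file system this v-node belongs to
    FileSystem*   mounted_here;    // file system mounted over this v-node, if any
    // per-file operations (open, read, write, ...) are dispatched through vfs
};
```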
11
File Graph
A file's graph provides:
–A VFS interface to applications
–A cache object, which keeps an index of that file's blocks in the shared cache maintained by the ABACUS filesystem
–An optional RAID 5 object, which stripes data and maintains parity for individual files across a set of storage servers
–One or more storage objects
–A RAID isolation/atomicity object anchored at the storage servers, which intercepts reads and writes to the base storage object and verifies the consistency of updates before committing them
–The Linux ext2 filesystem or CMU's NASD prototype can be used as the backing store
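A hedged sketch of how such a per-file object stack might be assembled at open time, following the layering above; all class names are assumptions.

```cpp
#include <memory>

struct Layer { std::shared_ptr<Layer> below; virtual ~Layer() = default; };
struct StorageObject   : Layer {};  // base flat-file store, never migrates
struct IsolationObject : Layer {};  // checks update consistency before commit
struct Raid5Object     : Layer {};  // optional: stripes data + parity per file
struct CacheObject     : Layer {};  // indexes this file's blocks in the shared cache
struct VfsObject       : Layer {};  // top-most object, seen by the application

std::shared_ptr<Layer> OpenFileGraph(bool raid_protected) {
    auto storage   = std::make_shared<StorageObject>();
    auto isolation = std::make_shared<IsolationObject>();
    isolation->below = storage;                   // both anchored at the server
    std::shared_ptr<Layer> mid = isolation;
    if (raid_protected) {                         // only for files that need RAID
        auto raid = std::make_shared<Raid5Object>();
        raid->below = mid;
        mid = raid;
    }
    auto cache = std::make_shared<CacheObject>();
    cache->below = mid;
    auto vfs = std::make_shared<VfsObject>();     // top-most object instantiated
    vfs->below = cache;                           // on open, lower layers follow
    return vfs;
}
```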
12
Directory Graph
–Directory object: provides POSIX-like directory calls and caches directory entries
–Isolation/atomicity object: specialized to directory semantics for performance reasons
–Storage object
13
Accessing the ABACUS filesystem
–Applications that include ABACUS objects directly append per-file object subgraphs onto their application object graphs
–The filesystem can also be mounted as a standard file system via VFS-layer redirection
–Legacy applications can then use filesystem objects that adaptively migrate below them, although the legacy applications themselves do not migrate
14
Object-based applications
–Data-intensive applications are decomposed into objects that search, aggregate, or data mine
–Applications are formulated to iterate over their input data and operate on the data one buffer at a time
–The filtering component is encapsulated in a C++ object with checkpoint and restore methods
–Applications instantiate mobile objects by making a request to the ABACUS runtime system
–ABACUS allocates and returns to the caller a network-wide unique run-time identifier (rid), which acts as a layer of indirection
–Per-node hash tables map each rid to a (node, object_reference_within_node) pair
–Data is passed by procedure call when objects are in the same address space, and by RPC when calls cross machines
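The sketch below illustrates the rid indirection described above: a per-node table maps each rid to its current (node, local reference) pair and dispatches either a local call or an RPC. Types and names are assumptions.

```cpp
#include <cstdint>
#include <unordered_map>

using Rid    = std::uint64_t;      // network-wide unique run-time identifier
using NodeId = std::uint32_t;

struct Location {
    NodeId node;                   // node currently hosting the object
    void*  local_ref;              // valid only when node == this node
};

class LocationTable {
    std::unordered_map<Rid, Location> table_;
    NodeId self_;
public:
    explicit LocationTable(NodeId self) : self_(self) {}

    void Update(Rid rid, Location loc) { table_[rid] = loc; }

    // Dispatch an invocation: procedure call if co-resident, RPC otherwise.
    template <typename LocalCall, typename RemoteCall>
    void Invoke(Rid rid, LocalCall local, RemoteCall remote) {
        const Location& loc = table_.at(rid);   // stale entries resolved via home node
        if (loc.node == self_) local(loc.local_ref);
        else                   remote(loc.node, rid);
    }
};
```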
15
Object Migration
Migrating an object from a source to a target node:
–The binding manager blocks new calls to the migrating object
–The binding manager waits until all active invocations of the migrating object have drained
–The object is locally checkpointed by invoking its Checkpoint() method; the state is written into a buffer or memory
–Location tables at the source, target, and home node are updated
–Invocations are unblocked and redirected to the proper node via the updated location table
–Per-node hash tables may contain stale data; if an object cannot be located, up-to-date information can be found at the object's home node
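A pseudocode-style sketch of this migration sequence; every helper below is a no-op stand-in for a runtime internal, not a real ABACUS call.

```cpp
#include <cstdint>
#include <vector>

using Rid    = std::uint64_t;
using NodeId = std::uint32_t;
using Buffer = std::vector<char>;

// Hypothetical runtime internals, stubbed out for illustration.
void   BlockNewCalls(Rid) {}
void   WaitForActiveCallsToDrain(Rid) {}
void   CheckpointLocally(Rid, Buffer&) {}                  // invokes Checkpoint()
void   ShipStateAndRestore(Rid, NodeId, const Buffer&) {}  // restore runs on target
NodeId HomeNodeOf(Rid) { return 0; }
void   UpdateLocationTables(Rid, NodeId, NodeId, NodeId) {}
void   UnblockAndRedirectCalls(Rid) {}

void MigrateObject(Rid rid, NodeId source, NodeId target) {
    BlockNewCalls(rid);                       // 1. block new invocations
    WaitForActiveCallsToDrain(rid);           // 2. drain active invocations
    Buffer state;
    CheckpointLocally(rid, state);            // 3. checkpoint into a buffer
    ShipStateAndRestore(rid, target, state);  // 4. recreate the object on the target
    UpdateLocationTables(rid, source, target, HomeNodeOf(rid));  // 5. update tables
    UnblockAndRedirectCalls(rid);             // 6. redirect queued calls
}
```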
16
Resource Monitoring
Tracks memory consumption, instructions executed per byte, and stall time
–Since the runtime system redirects calls to objects, it is in a position to collect all necessary statistics
–Monitoring code is interposed on each mobile-object procedure call and return
–The number of bytes transferred is recorded in a timed data-flow graph
–Moving averages are kept of the bytes moved between every pair of communicating objects in the graph
–The runtime system tracks dynamic memory allocation using wrappers around each memory-allocation routine
–Linux interval timers or the Pentium cycle counter are used to count the instructions executed by objects
–The amount of time a thread is blocked is tracked by having the kernel update the blocking timers of any threads in the queue marked as blocked
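One plausible way to keep the per-edge moving averages mentioned above is an exponentially weighted moving average per (caller, callee) pair, as sketched below; the smoothing constant and names are assumptions.

```cpp
#include <cstddef>
#include <map>
#include <utility>

class DataFlowStats {
    std::map<std::pair<int, int>, double> ewma_bytes_;  // (caller, callee) -> average
    static constexpr double kAlpha = 0.25;              // assumed smoothing factor
public:
    // Called on every interposed call/return with the bytes just transferred.
    void Record(int caller, int callee, std::size_t bytes) {
        double& avg = ewma_bytes_[{caller, callee}];
        avg = kAlpha * static_cast<double>(bytes) + (1.0 - kAlpha) * avg;
    }
    double AverageBytes(int caller, int callee) const {
        auto it = ewma_bytes_.find({caller, callee});
        return it == ewma_bytes_.end() ? 0.0 : it->second;
    }
};
```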
17
Dynamic Placement
Net benefit estimation:
–The server-side resource manager collects per-object measurements
–It also receives statistics about client processor speed and current load
–Given the data-flow graph between objects and the latency of the client-server link, an analytic model estimates the change in stall time if an object changes location
–The model also estimates the change in execution time for the other objects executing at the target node
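A back-of-the-envelope sketch of the kind of net-benefit comparison such a model could make. The paper's analytic models are more detailed; every parameter here is an assumption.

```cpp
struct PlacementEstimate {
    double bytes_per_sec_over_link;  // measured flow that would stop crossing the link
    double link_bandwidth;           // bytes/sec of the client-server link
    double migration_cost_sec;       // checkpoint + transfer + restore time
    double added_load_penalty_sec;   // estimated slowdown of objects already at the target
};

// Positive result => the move is estimated to save time over the window.
double NetBenefitSeconds(const PlacementEstimate& e, double window_sec) {
    double stall_saved = window_sec * e.bytes_per_sec_over_link / e.link_bandwidth;
    return stall_saved - e.migration_cost_sec - e.added_load_penalty_sec;
}
```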
18
Example – Placement of the RAID Object
Figure 3:
–Client A runs the RAID object at the client; Client B runs the RAID object at the storage device
–If the LAN is slow, it is better to run the RAID object at the storage device
Figure 4:
–Two clients write 4MB files sequentially
–Stripe size is 5 (4 data + parity); stripe unit is 32KB
–In the LAN case, it is better to execute the RAID object on the server
–In the SAN case, running the RAID object locally at the client is 1.3X faster
–In the degraded-read case, client-based RAID wins (due to the computational cost of reconstruction)
19
Placement of a filter
–Vary the filter's selectivity and CPU consumption
–Highly selective filters are better run on the server
–It is possible to arrange things so that the filter should run on the client, e.g. when the filter is computationally expensive
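A minimal sketch of the trade-off behind these observations, under assumed terms: moving the filter to the server saves wire bytes in proportion to how much it discards, but pays the filter's CPU cost on the (shared, possibly slower) server.

```cpp
// Positive result => placing the filter at the server looks better over the
// measurement window. All parameters are illustrative assumptions.
double FilterBenefitAtServer(double bytes_in,        // bytes read per window
                             double pass_fraction,   // bytes_out / bytes_in; small for highly selective filters
                             double link_bandwidth,  // bytes/sec on the client-server link
                             double cycles_per_byte, // filter CPU cost
                             double server_cps,      // cycles/sec available at the server
                             double client_cps) {    // cycles/sec available at the client
    double network_saved = bytes_in * (1.0 - pass_fraction) / link_bandwidth;
    double extra_cpu = bytes_in * cycles_per_byte * (1.0 / server_cps - 1.0 / client_cps);
    return network_saved - extra_cpu;
}
```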
20
David F. Nagle, Gregory R. Ganger, Jeff Butler, Garth Gibson, and Chris Sabol "Network Support for Network-Attached Storage", Hot Interconnects 1999, August 18 - 20, 1999, Stanford University, Stanford, California.
21
Network Support for Network-Attached Storage
Enable scalable storage systems in ways that minimize the file-manager bottleneck
–A homogeneous network of trusted clients issuing unchecked commands to shared storage gives poor security and integrity (anybody can read or write anything!)
–NetSCSI: minimal changes to the hardware and software of SCSI disks; NetSCSI disks send data directly to clients; cryptographic hashes and encryption, verified by the NetSCSI disks, provide integrity and privacy; but the file manager is still involved in each storage access, translating namespaces and setting up a third-party transfer on each request
22
Network-Attached Secure Disks (NASD)
–The NASD architecture provides a command interface that reduces the number of client-storage interactions
–Data-intensive operations go directly to the disk; less common policy-making operations (e.g. namespace and access control) go to the file manager (see Figure 1 for the scheme)
–NASD drives map and authorize requests for disk sectors
–A time-limited capability provided by the file manager allows access to a given file's map and contents; the storage-mapping metadata is maintained by the drive
–Smart drives can exploit knowledge of their own resources to optimize data layout, read-ahead, and cache management
–Each NASD drive exports a "namespace" describing file objects, which can generalize to banks of striped files
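A hedged sketch of the fields a time-limited capability might carry; the layout and MAC scheme are assumptions, not the NASD wire format.

```cpp
#include <array>
#include <cstdint>

struct NasdCapability {
    std::uint64_t object_id;           // file object this capability covers
    std::uint32_t rights;              // e.g. read / write / get-attributes bits
    std::uint64_t expiry_time;         // capability becomes invalid after this time
    std::array<std::uint8_t, 16> mac;  // keyed hash computed by the file manager,
                                       // checked by the drive on each request
};
```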
23
NASD Implementation
–Ported AFS and NFS to use the NASD interface
–Implemented a striped version of NFS on top of the interface
–In the NASD/AFS and NASD/NFS filesystems, frequent data-moving operations occur directly between the client and the NASD drive
–NFS: stateless server with weak cache consistency; based on the client's RPC opcode, RPC destination addresses are modified to deliver requests to the NASD drive
–The AFS port had to maintain AFS's sequential consistency guarantees; it used the ability of NASD capabilities to be revoked based on expiration time or object attributes
24
Performance Comparisons
Compare NASD/AFS and NASD/NFS vs. Server-Attached Disk (SAD) AFS and NFS
–Workload: computing frequent sets over 300MB of sales transactions
–Measure maximum bandwidth with a given number of disks or NASD drives and up to 10 clients
–NASD: bandwidth of a client reading from a single NASD file striped across n drives scales linearly
–NFS: performance of all clients reading from a single file striped across n disks on the server bottlenecks near 20 MB/s
–NFS-parallel: each client reads from a separate file on an independent disk through the same server; bottlenecks near 22.5 MB/s
25
Network Support
–Non-cached reads and writes can be serviced by modest hardware
–Requests that hit in the cache need much higher bandwidth; a lot of time is spent in the network stack
–Need to deliver scalable bandwidth: deal efficiently with small messages, don't spend too much time crossing OS layers, and don't copy data too many times
26
Reducing Data Copies
–Integrate buffering/caching into the OS: effective where caching plays a central role
–Direct user-level access to the network: suited to high-bandwidth applications
–NASD was layered on top of the VI Architecture (VIA) interface, which integrates user-level network interface access with protection mechanisms (send/receive/remote DMA)
–Commercial VIA implementations are available that deliver full network bandwidth while consuming less than 10% of the CPU's cycles, with support from hardware and software vendors
27
Integrating VIA with NASD
–NASD software runs in kernel mode
–The drive must support an external VIA interface and its semantics, which can result in hundreds of connections and a lot of RAM
–Read interface: the client preposts a set of buffers matching the read request size, issues the NASD read command, and the drive returns the data
–Writes are analogous, but bursts require a large amount of preposted buffer space
–With VIA's remote DMA, clients send a write command with a pointer to data stored in the client's pinned memory; the drive uses a VIA RDMA command to pull the data out of the client's memory, treating the client's RAM as an extended buffer/cache
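A sketch of the RDMA write path described above, seen from the client's side. The via_* helpers and the command layout are hypothetical stand-ins, not the real VIA (VIPL) API.

```cpp
#include <cstddef>
#include <cstdint>

struct ViaConnection;                               // opaque client<->drive VI handle
struct MemHandle { std::uint64_t addr; std::uint32_t key; };

// Hypothetical wrappers (no-op stubs here) standing in for registration,
// send, and completion on a VI; the real VIPL calls differ.
MemHandle via_register(ViaConnection&, const void*, std::size_t) { return {}; }
void      via_send(ViaConnection&, const void*, std::size_t) {}
void      via_wait_for_reply(ViaConnection&, void*, std::size_t) {}

struct NasdWriteCommand {                           // illustrative command layout
    std::uint64_t object_id, offset, length;
    MemHandle     source;                           // where the drive RDMA-reads from
};

void NasdWrite(ViaConnection& vi, std::uint64_t obj, std::uint64_t off,
               const void* data, std::size_t len) {
    MemHandle h = via_register(vi, data, len);      // pin + register the client buffer
    NasdWriteCommand cmd{obj, off, len, h};
    via_send(vi, &cmd, sizeof cmd);                 // small command message only
    // The drive pulls the payload with an RDMA read at its own pace, treating
    // the client's registered memory as an extended buffer/cache, then replies.
    std::uint32_t status = 0;
    via_wait_for_reply(vi, &status, sizeof status);
}
```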
28
Network Striping and Incast
File striping across multiple storage devices:
–Networks offer poor support for incast (many-to-one) traffic
–The client should receive equal bandwidth from each source
–The result is poor performance (Figure 4)