1 AFS vs YFS

2
- Use large numbers of small file servers
- Use many small partitions per file server
- Restrict the number of processors to 1 or 2
- Limit the network bandwidth to 1 Gbit/s
- Avoid workloads requiring:
  - Multiple clients creating or removing entries in a single directory
  - Multiple clients writing to or reading from a single file
  - More clients than file server worker threads accessing a single volume
  - Applications requiring features that AFS does not offer: byte-range locking, extended attributes, per-file ACLs, etc.

3
- Deployed isolation file servers and complex monitoring to detect hot volumes and quarantine them
- Developed complex workarounds including vicep-access, OSD, and OOB
- Segregated RW and RO access into separate cells and constructed their own volume management systems to "vos release" volumes from the RW cell to RO cells
- Used the AFS name space for some tasks and other "high performance" file systems for others: NFSv3, NFSv4, Lustre, GPFS, Panasas, and others

4
- Additional servers cost money
  - US$6,800 per year according to Cornell University
  - Including hardware depreciation, support contracts, maintenance, power and cooling, and staff time
- Increased complexity for end users
- Multiple backup strategies

5
- Maintain the data and the name space
- Fix the performance problems
- Enhance the functionality to match Apple/Microsoft first-class file systems
- Improve security
- Save money

6
- What are the bottlenecks in AFS and why do they exist?
- What can be done to maximize the performance of an AFS file server?
- How scalable is a YFS file server?

7
- File server throughput is bound by the amount of data the listener thread can read from the network during any time period
- As Simon Wilkinson likes to say: "There are only two things wrong with AFS RX: the protocol and the implementation."

8
- Incorrect Round Trip Time (RTT) calculations
- Incorrect Retransmission Timeout (RTO) implementation
- Window size vs congested networks
  - Broken window management makes congested networks worse
- Soft ACKs and hard ACKs
  - Twice as many ACKs as necessary
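For reference, the standard RTT smoothing and retransmission-timeout calculation that TCP uses (RFC 6298) looks roughly like the sketch below. It is shown only as a reference point for what a correct calculation looks like; it is not the RX code, and the variable names and bounds are illustrative.

    /* Minimal sketch of RFC 6298 style RTT smoothing and RTO computation,
     * shown as a reference point only; this is not the RX implementation.
     * Times are in seconds. */
    #include <math.h>
    #include <stdio.h>

    struct rtt_state {
        double srtt;        /* smoothed round-trip time */
        double rttvar;      /* round-trip time variance */
        double rto;         /* retransmission timeout */
        int    have_sample; /* set after the first measurement */
    };

    static void rtt_sample(struct rtt_state *s, double r)
    {
        if (!s->have_sample) {
            s->srtt = r;
            s->rttvar = r / 2.0;
            s->have_sample = 1;
        } else {
            /* beta = 1/4, alpha = 1/8, as recommended by RFC 6298 */
            s->rttvar = 0.75 * s->rttvar + 0.25 * fabs(s->srtt - r);
            s->srtt   = 0.875 * s->srtt + 0.125 * r;
        }
        s->rto = s->srtt + 4.0 * s->rttvar;
        if (s->rto < 1.0)  s->rto = 1.0;    /* RFC 6298 minimum */
        if (s->rto > 60.0) s->rto = 60.0;   /* common upper bound */
    }

    int main(void)
    {
        struct rtt_state s = { 0.0, 0.0, 1.0, 0 };  /* initial RTO before any samples */
        double samples[] = { 0.120, 0.100, 0.300, 0.110 };
        for (int i = 0; i < 4; i++) {
            rtt_sample(&s, samples[i]);
            printf("sample=%.3f srtt=%.3f rttvar=%.3f rto=%.3f\n",
                   samples[i], s.srtt, s.rttvar, s.rto);
        }
        return 0;
    }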

9
- Lock contention
  - 20% of runtime spent waiting for locks
- UDP context switching
  - Every packet processed on a different CPU
  - Cache line invalidation

10
- For the full details, see http://tinyurl.com/p8c8yqs

11
- Lightweight processes (LWP) is the cooperative threading model used by the original AFS implementation
- Only one thread can execute at a time
- Threads yield voluntarily or when blocking for I/O
- Data access is implicitly protected by single execution
- Because a thread cannot lose the CPU except at a yield, everything it does between yields is atomic with respect to other threads. In other words:
  - Acquire + Release + Yield == Never Acquire
  - Acquire A + Acquire B == Acquire B + Acquire A
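A toy illustration of that property (this is not the LWP package itself; the "scheduler" and names below are invented for the example): because only one thread can run between yield points, shared data can be updated without any mutex.

    /* Toy model of cooperative scheduling: each "thread" runs until it
     * returns (its yield point), so its updates to shared data can never
     * interleave with another thread's.  Invented purely for illustration. */
    #include <stdio.h>

    static int shared_counter = 0;      /* no lock needed */

    /* Each step is a read-modify-write that is atomic with respect to the
     * other thread, simply because nothing else runs until it returns. */
    static void thread_a_step(void) { shared_counter += 1; }
    static void thread_b_step(void) { shared_counter += 10; }

    int main(void)
    {
        /* Trivial round-robin "scheduler": exactly one thread runs at a
         * time, and it only gives up control at its yield point. */
        for (int i = 0; i < 5; i++) {
            thread_a_step();            /* runs, then yields */
            thread_b_step();            /* runs, then yields */
        }
        printf("shared_counter = %d\n", shared_counter);   /* always 55 */
        return 0;
    }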

12
- When converting a cooperatively threaded application to pthreads, it is faster to add global locks protecting the data structures that are accessed across I/O than to redesign the data structures and the workflow
- AFS 3.4 added pthread file servers by adding a minimal number of global locks to each package
- AFS 3.6 added finer-grained, but still global, locks

13
- AFS file servers must acquire many mutexes during the processing of each RPC (* = global)
- RX
  - peer_hash*, conn_hash*, peer, conn_call, conn_data, stats*, free_packet_queue*, free_call_queue*, event_queue*, and more
- viced
  - H* [host table, callbacks]
  - FS* [stats]
  - VOL* [volume metadata]
  - VNODE [file/dir]

14
- Threads are scheduled onto a processor and must give up their time slice whenever a required lock is unavailable
- When there are multiple processors, a thread may run on a different processor each time it is scheduled
- Any data not in that processor's cache, or that has been invalidated, must be fetched. Locks are represented as data in memory whose state changes when acquired and released
- Two side effects of global locks:
  - Only one thread at a time can make progress
  - Multiple processor cores hurt performance

15
- An AFS file server promises its clients that, for a fixed period of time, it will notify them if the metadata or data state of an accessed object changes
- For read-write volumes, one callback promise per file object
- For read-only volumes, one callback promise per volume regardless of how many file objects are accessed
- Today, many file servers are deployed with callback tables containing millions of entries

16
- The host table and the hash tables for looking up host entries by IP address and UUID are protected by a single global lock
- Host entries have their own locks. To avoid hard deadlocks, locking an entry requires dropping the global lock, obtaining the entry lock, and then re-acquiring the global lock
- Soft deadlocks occur when multiple threads are blocked on an entry lock while the thread holding it is blocked waiting for the global lock
- Lock contention occurs multiple times for each new RX connection and each time a call is scheduled
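A minimal sketch of that lock-ordering pattern, using invented names rather than the actual viced symbols:

    /* Sketch of the "drop global, take entry, retake global" ordering used
     * to avoid hard deadlocks when a per-entry lock must be acquired while a
     * global table lock is held.  Names are invented for illustration. */
    #include <pthread.h>

    static pthread_mutex_t host_table_lock = PTHREAD_MUTEX_INITIALIZER; /* global */

    struct host {
        pthread_mutex_t lock;   /* per-entry lock */
        int refcount;
        /* ... address, UUID, callback state ... */
    };

    /* Caller holds host_table_lock and has found 'h' in a hash table. */
    static void lock_host_entry(struct host *h)
    {
        h->refcount++;                          /* keep 'h' alive across the gap */
        pthread_mutex_unlock(&host_table_lock); /* 1. drop the global lock       */
        pthread_mutex_lock(&h->lock);           /* 2. obtain the entry lock      */
        pthread_mutex_lock(&host_table_lock);   /* 3. re-acquire the global lock */
        /* While both locks were dropped another thread may have altered the
         * table, so the caller must revalidate whatever it looked up. */
    }

    int main(void)
    {
        struct host h = { PTHREAD_MUTEX_INITIALIZER, 0 };

        pthread_mutex_lock(&host_table_lock);   /* look the entry up ...   */
        lock_host_entry(&h);                    /* ... then lock it safely */
        pthread_mutex_unlock(&h.lock);
        pthread_mutex_unlock(&host_table_lock);
        return 0;
    }

Every thread that needs an entry pays extra global-lock acquisitions on this path, which is one source of the contention described above.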

17
- The callback table is protected by the same global lock as the host table
- Each new or updated callback promise requires exclusive access to the table
- Notifying registered clients of state changes (breaking callbacks) requires exclusive access
- Garbage collection of expired callbacks (at 5 minute intervals) requires exclusive access
- Exceeding the callback table limit requires exclusive access for immediate garbage collection and premature callback notification

18
- The larger the callback table, the longer exclusive access is held for garbage collection and callback breaks
- While exclusive access is held, no new calls can be scheduled and no existing calls can be completed

19
- Increasing the worker thread pool permits additional calls to be scheduled instead of blocking in the RX wait queue
- The primary benefit of scheduling is that locks provide a filtering mechanism to decide which calls can make progress; calls on the RX wait queue can never make progress if the thread pool is exhausted
- The downside of a larger thread pool is increased lock contention and more CPU time wasted on thread scheduling

20
- Start with the "large" configuration: -L
- Make the thread pool as large as possible
  - For 1.4, -p 128
  - For 1.6, -p 256
- Set the number of directory buffers to twice the thread count: -b 512

21
- Volume cache larger than the total volume count: -vc
- Small vnode cache (files): -s
- Large vnode cache (directories): -l
- If volumes are very large, these may require higher multiples

22
- The callback table must be large enough to avoid thrashing: -cb
  - The -cb value * 72 bytes should not exceed 10% of the machine's physical memory
- Use xstat_fs_test's "-collId 3 -once" options to monitor the "GetSomeSpaces" value; if it is non-zero, increase the -cb value
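As a worked example of that sizing rule (the memory figure is illustrative): on a server with 16 GB of RAM, 10% is about 1.6 GB, and 1.6 GB / 72 bytes per entry allows roughly 22 million callback entries, so a value such as -cb 20000000 would stay within the guideline.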

23
- UDP receive buffer
  - Must be large enough to receive all packets for in-process calls: -udpsize 16777216
  - Won't take effect unless the OS is configured to match
- UDP send buffer
  - -sendsize 2097152 (2^21) unless the client chunk size is larger
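On Linux, for example, the kernel caps socket buffer sizes at net.core.rmem_max / net.core.wmem_max, so those limits must be raised to at least the requested values before -udpsize and -sendsize can take effect. A sketch only; the file name and values are illustrative:

    # /etc/sysctl.d/90-afs-fileserver.conf -- illustrative Linux settings that
    # raise the kernel socket-buffer ceilings so the fileserver's
    # -udpsize 16777216 and -sendsize 2097152 requests can be honored.
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216

    # Apply without a reboot:  sysctl --system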

24
- The AFS protocol does not expose the last access time to clients
- Nor does the AFS file server make use of it
- Turn off last access time updates to avoid large amounts of unnecessary disk I/O unrelated to serving the needs of clients
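On Linux this is commonly done by mounting the vice partitions with the noatime option; a sketch, with the device name and filesystem type as placeholders:

    # /etc/fstab entry for a vice partition (device and filesystem type are
    # illustrative).  noatime suppresses last-access-time updates on reads.
    /dev/sdb1   /vicepa   ext4   defaults,noatime   0  2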

25
- Syncing data to disk is very expensive. If you trust your UPS and have a good battery-backed caching storage adapter, we recommend reducing the frequency of sync operations
- For 1.6.5, the new option -sync onclose
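Putting the options from the last few slides together, an OpenAFS 1.6.5 (or later) fileserver command line might look roughly like this. The numeric values for the volume, vnode, and callback caches are placeholders that must be sized to the server's volumes and memory; this is a sketch, not a recommendation:

    # Illustrative only -- all numeric cache sizes are placeholders.
    /usr/afs/bin/fileserver -L -p 256 -b 512 \
        -vc 4000 -s 200000 -l 100000 -cb 1500000 \
        -udpsize 16777216 -sendsize 2097152 \
        -sync onclose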

26
- YFS file servers experience much less contention between threads
- RPCs take less time to complete
  - Store operations do not block simultaneous Fetch requests
- One YFS file server can replace at least 30 AFS file servers
  - Max in-flight RPCs per AFS server = 240
  - Max in-flight RPCs per YFS server = 16,000 (dynamic)
  - 30 AFS servers * 240 = 7,200 in-flight RPCs, still well under one YFS server's 16,000

27 Up to 8.2 Gbit/s per listener thread

28
- SLAC has experienced file server meltdowns for years. A large number of file servers are deployed to permit distribution of load and isolation of volume accesses by users
- One YFS file server satisfied 500 client nodes for nearly 24 hours without noticeable delays
  - 1 Gbit/s NIC, 8 processor cores, 6 Gbit/s local RAID disk
  - 800 operations per second
  - 55 MB/s FetchData
  - 5 MB/s StoreData

29
- 2038 safe
- 100 ns timestamps
- 2^64 volumes
- 2^96 vnodes per volume
- 2^64 max quota / volume / partition size
- Per-file ACLs
- Volume security policies
  - Maximum ACL / wire privacy
- Servers do not run as "root"
- Linux O_DIRECT
- Mandatory locking
- IPv6 network stack

30
- RXGK
  - GSS-API authentication
  - AES-256/SHA-1 wire privacy
  - File server wire security policies
    - File servers cannot serve volumes with stronger required policies
- Combined identity tokens
  - Keyed cache managers / machine IDs
  - Maximum volume ACL prevents data leaks

