Conquest: Preparing for Life After Disks. An-I Andy Wang, Geoff Kuenning, Peter Reiher, Gerald Popek.

1 Conquest: Preparing for Life After Disks. An-I Andy Wang, Geoff Kuenning, Peter Reiher, Gerald Popek

2 Conquest Overview: File systems are optimized for disks, which leads to performance problems and complexity. Now we have tons of inexpensive RAM. What can we do with that RAM?

3 Conquest Approach: Combine disk and persistent RAM (e.g., battery-backed RAM) in a novel way. Simplification: more than 20% fewer semicolons than ext2, reiserfs, and SGI XFS. Performance (under popular benchmarks): 24% to 1900% faster than LRU disk caching.

4 Outline of the Talk: Motivation; Conquest design (high level); Conquest components; performance evaluation; conclusion.

5 Motivation: Most file systems are built for disks. Problems with the disk assumption: performance and complexity.

6 Hardware Evolution [figure: accesses per second (log scale, 1 kHz to 1 GHz) versus year (1990 to 2000). CPU and memory improve about 50% per year; disk improves about 15% per year. The widening gap is annotated with 10^5 and 10^6, glossed as (1 sec : 6 days) and (1 sec : 3 months).]

7 Inside Pandora’s Box [figure: disk arm and disk platters]. Access time = seek time (disk arm) + rotational delay (disk platter) + transfer time.
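
To make the access-time equation concrete, the sketch below plugs in illustrative numbers. The 8 ms average seek, 7,200 RPM spindle speed, and 40 MB/s transfer rate are assumptions chosen for the example, not figures from the talk, and the average rotational delay is taken to be half a revolution.

```python
def disk_access_time_ms(seek_ms, rpm, transfer_mb_per_s, request_bytes):
    """Access time = seek time + rotational delay + transfer time, in ms."""
    # On average the target sector is half a revolution away from the head.
    rotational_ms = 0.5 * 60_000.0 / rpm
    # Time to move the requested bytes at the sustained media rate.
    transfer_ms = request_bytes / (transfer_mb_per_s * 1_000_000) * 1000.0
    return seek_ms + rotational_ms + transfer_ms


# Hypothetical drive: 8 ms seek, 7,200 RPM, 40 MB/s, one 4 KB request.
total = disk_access_time_ms(
    seek_ms=8.0, rpm=7200, transfer_mb_per_s=40.0, request_bytes=4096
)
print(f"{total:.2f} ms per random 4 KB access")
```

With these assumed parameters the transfer term is about 0.1 ms out of roughly 12.3 ms, so the mechanical terms (arm and platter) dominate small random accesses.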

8 Disk Optimization Methods: disk arm scheduling; grouping information on disk; disk readahead; buffered writes; disk caching; data mirroring; hardware parallelism.

9 Complexity Bytes: synchronization; predictive readahead; cache replacement; elevator algorithm; data clustering; data consistency; asynchronous write.

10 Storage Media Alternatives [Caceres et al., 1993; Hillyer et al., 1996; Qualstar 1998; Tanisys 1999; Micron Semiconductor Products 2000; Quantum 2000] [figure: accesses/sec (log scale) versus $/MB (log scale); plotted labels: tape, disk, flash memory, battery-backed DRAM, persistent RAM, magnetic RAM?, (write once)]

11 Price Trend of Persistent RAM [Grochowski 2000] [figure: $/MB (log scale, 10^-2 to 10^2) versus year (1995 to 2005) for paper/film, 3.5" HDD, 2.5" HDD, 1" HDD, and persistent RAM; annotations: booming of digital photography; 4 to 10 GB of persistent RAM]

12 Old Order; New World: Disk will stay around (cost, capacity, power, heat). RAM is a viable storage alternative (PDAs, digital cameras, MP3 players). More architectural changes are coming due to RAM: a big assumption change from disk; rethink data structures, interfaces, and applications.

13 Getting a Fresh Start: What does it take to design and build a system that assumes ample persistent RAM as the primary storage medium?

14 Conquest Design: Design and build a disk/persistent-RAM hybrid file system. Deliver all file system services from memory, with the exception of high-capacity storage. Two separate data paths to memory and disk. Benefits: simplicity and performance.

15 Simplicity: Remove disk-related complexities for most files. Make things simpler for disk as well. Less complexity: fewer bugs; easier maintenance; shorter data paths.

16 Performance: Overall: all management performed in memory. Memory data path: no disk-related overhead. Disk data path: faster speed due to simpler access models.

17 Conquest Components: media management; metadata representation; directory service; allocation service; persistence support; resiliency support.

18 User Access Patterns [Irlam 1993; Douceur et al., 1999; Roselli et al., 2000]: Small files take little space (10%) but represent most accesses (90%). Large files take most space and see mostly sequential accesses. Not characteristic of database applications.

19 Files Stored in Persistent RAM: Small files (< 1 MB): no seek time or rotational delays; fast byte-level accesses; contiguous allocation. Metadata: fast synchronous update; no dual representations. Executables and shared libraries: in-place execution.
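
The placement rule on this slide can be sketched as a small decision function. This is a minimal illustration, not Conquest's kernel code: the function name, flags, and return strings are hypothetical. The slide writes the threshold as "< 1MB" while the benchmark slides treat 1-MB files as in-core, so including exactly 1 MB in the RAM case is an assumption.

```python
SMALL_FILE_LIMIT = 1 << 20  # 1 MB threshold named on the slide


def storage_medium(size_bytes, is_metadata=False, is_executable=False):
    """Pick a medium for an object under the slide's placement rule."""
    # Metadata, executables, and shared libraries always live in persistent RAM.
    if is_metadata or is_executable:
        return "persistent_ram"
    # Assumption: a file of exactly 1 MB counts as small (see lead-in).
    if size_bytes <= SMALL_FILE_LIMIT:
        return "persistent_ram"
    # Everything else is a large file and goes to the disk data path.
    return "disk"


print(storage_medium(4096))
print(storage_medium(100 * (1 << 20)))
```

Because the decision depends only on file type and size, it never needs access history, which is one reason the memory path can drop cache-replacement logic.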

20 Memory Data Path of Conquest [figure: conventional file systems (storage requests, IO buffer management, IO buffer, persistence support, disk management, disk) versus the Conquest memory data path (storage requests, persistence support, battery-backed RAM for small file and metadata storage)]

21 Large-File-Only Disk Storage [Devlinux.com 2000]: Allocate in big chunks: lower access overhead; reduced management overhead. No fragmentation management. No tricks for small files (storing data in metadata). No elaborate data structures (wrapping a balanced tree onto disk cylinders).

22 Sequential-Access Large Files: Sequential disk accesses: near-raw bandwidth; well-defined readahead semantics. Read-mostly: little synchronization overhead (between memory and disk).

23 Disk Data Path of Conquest [figure: conventional file systems (storage requests, IO buffer management, IO buffer, persistence support, disk management, disk) versus the Conquest disk data path (storage requests, IO buffer management, IO buffer, disk management, disk holding the large-file-only file system), alongside battery-backed RAM for small file and metadata storage]

24 Random-Access Large Files: Random access? Common definition: nonsequential access. A typical movie has 150 scene changes; MP3 stores the title at the end of the file. Near-sequential access? This simplifies large-file metadata representation significantly.

25 Logical File Representation [figure: file name(s); i-node; file attributes; data]

26 Physical File Representation [figure: file name(s); i-node; file attributes; data locations; data blocks]

27 Ext2 Data Representation [figure: the i-node holds 12 data block locations plus index block locations; each index block holds further data block locations or index block locations]

28 Disadvantages of the Ext2 Design: Designed for disk storage. Optimization for small files makes things complex. Random-access data structure for large files that are accessed mostly sequentially. Data access time depends on the byte position in a file. Maximum file size is limited.
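
The dependence of access time on byte position follows from ext2's block map: the first 12 blocks are reached directly from the i-node, and later blocks go through one, two, or three index blocks. The sketch below counts how many index blocks must be traversed for a given offset. The 4 KB block size and 4-byte block pointers are assumed defaults, and the function name is hypothetical.

```python
DIRECT_BLOCKS = 12  # block locations stored directly in the i-node


def indirection_levels(byte_offset, block_size=4096, ptr_size=4):
    """Return how many index blocks are read to find the block at byte_offset."""
    ptrs_per_block = block_size // ptr_size
    block = byte_offset // block_size
    if block < DIRECT_BLOCKS:
        return 0
    block -= DIRECT_BLOCKS
    # Each further level covers ptrs_per_block times more blocks than the last.
    for level in (1, 2, 3):
        span = ptrs_per_block ** level
        if block < span:
            return level
        block -= span
    raise ValueError("offset exceeds the maximum ext2 file size for this geometry")


for offset in (0, 48 * 1024, 5 * 1024 * 1024):
    print(offset, indirection_levels(offset))
```

The final `ValueError` branch also illustrates the last bullet: a three-level map caps the largest representable file.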

29 Conquest Representation: Persistent RAM: Hash(file name) = location of data; Offset(location of data). Disk storage: per-file, doubly linked list of disk block segments (stored in persistent RAM).

30 Advantages of the Conquest Design: Direct data access for in-core files. Worst case: sequential memory search for random disk locations. Maximum file size limited by physical storage.
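
The large-file representation and its worst case can be sketched as follows. This is an illustrative model only: the class names, the byte-granularity segments, and the `locate` method are assumptions, not Conquest's actual structures. Each file keeps a doubly linked list of contiguous disk segments in memory, and finding a random offset means walking that list.

```python
class Segment:
    """One contiguous run of disk space belonging to a file."""

    def __init__(self, disk_start, length):
        self.disk_start = disk_start  # first disk byte address of the run
        self.length = length          # run length in bytes
        self.prev = None
        self.next = None


class LargeFile:
    """Per-file doubly linked list of segments, held in (persistent) memory."""

    def __init__(self):
        self.head = None
        self.tail = None

    def append(self, disk_start, length):
        seg = Segment(disk_start, length)
        if self.tail is None:
            self.head = self.tail = seg
        else:
            seg.prev = self.tail
            self.tail.next = seg
            self.tail = seg

    def locate(self, offset):
        """Map a file offset to a disk address by a sequential in-memory walk."""
        seg = self.head
        while seg is not None:
            if offset < seg.length:
                return seg.disk_start + offset
            offset -= seg.length
            seg = seg.next
        raise ValueError("offset past end of file")


f = LargeFile()
f.append(disk_start=1000, length=100)
f.append(disk_start=5000, length=50)
print(f.locate(120))
```

The walk is linear in the number of segments, but it touches only memory, and big-chunk allocation keeps the list short.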

31 Directory Service Requirements: fast sequential traversal (e.g., ls); fast random lookup (e.g., locate file x); hard links (apply multiple names to data).

32 First Design: A doubly hashed table for each directory. Conserves space. Problems: dynamic resizing of directories; need to handle the current file position (important for rm -fr).

33 Second Design [Fagin et al., 1979]: A variant of extensible hash table for each directory. An old data structure fits nicely. [figure: hash table slots holding hash-key | name entries, e.g., 0100 | file1, 1001 | file2, 0011 | dir1, 1110 | file2_hardlink, with unused slots marked empty]
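
For readers unfamiliar with the data structure, here is a minimal sketch of textbook extendible (extensible) hashing in the style of Fagin et al., indexed by the low bits of the hash. It is not Conquest's variant: the class names, the bucket capacity of 2, and the use of `zlib.crc32` as the hash are assumptions for illustration, and names whose full hashes collide beyond the bucket capacity are not handled.

```python
import zlib


def _name_hash(name):
    # Assumed hash function for the sketch; any well-mixed hash would do.
    return zlib.crc32(name.encode("utf-8"))


class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.entries = {}  # file name -> metadata ID


class ExtendibleDirectory:
    def __init__(self, bucket_capacity=2, hash_fn=_name_hash):
        self.capacity = bucket_capacity
        self.hash_fn = hash_fn
        self.global_depth = 0
        self.slots = [Bucket(0)]  # 2**global_depth slots, possibly shared

    def _slot(self, name):
        # Index by the lowest global_depth bits of the hash.
        return self.hash_fn(name) & ((1 << self.global_depth) - 1)

    def lookup(self, name):
        return self.slots[self._slot(name)].entries.get(name)

    def insert(self, name, metadata_id):
        while True:
            bucket = self.slots[self._slot(name)]
            if name in bucket.entries or len(bucket.entries) < self.capacity:
                bucket.entries[name] = metadata_id
                return
            self._split(bucket)

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # Double the slot table; the new half aliases the existing buckets.
            self.slots = self.slots + self.slots
            self.global_depth += 1
        bit = 1 << bucket.local_depth
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        # Move entries whose newly significant hash bit is set.
        for name in list(bucket.entries):
            if self.hash_fn(name) & bit:
                sibling.entries[name] = bucket.entries.pop(name)
        for i, slot in enumerate(self.slots):
            if slot is bucket and (i & bit):
                self.slots[i] = sibling

    def names(self):
        """Sequential traversal (as for ls): visit each distinct bucket once."""
        seen = set()
        for bucket in self.slots:
            if id(bucket) not in seen:
                seen.add(id(bucket))
                yield from bucket.entries


d = ExtendibleDirectory()
for i in range(40):
    d.insert(f"file{i}", i)
d.insert("file2_hardlink", 2)  # a hard link: a second name for the same ID
print(d.lookup("file2"), d.lookup("file2_hardlink"))
```

Growth splits one bucket at a time, so a directory can be resized without rehashing every entry.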

34 Additional Engineering Details: Popular hash functions randomize lower bits. Dynamic file positioning. Need to handle collisions. Memory overhead and complexity tradeoffs.

35 Metadata Allocation Requirements: Keep track of usage status of metadata entries. Avoid duplicate allocation with unique IDs. Fast retrieval of metadata with a given ID. [figure: ID 1 | free; ID 2 | in use; ID 3 | free; ID 4 | free; ID 5 | in use; ID 6 | free]

36 Existing Memory Allocation Services: Keep track of unallocated memory. No duplicate allocation of physical addresses. Hmm… [figure: ADDR 0xe000000 | free; ADDR 0xe000038 | in use; ADDR 0xe000070 | free; ADDR 0xe0000A8 | free; ADDR 0xe0000E0 | free; ADDR 0xe000118 | in use]

37 Conquest Metadata Management: Metadata = memory allocated by memory manager. Metadata ID = physical address of metadata. [figure: the ID table (ID 1 to ID 6, each free or in use) shown beside the address table (ADDR 0xe000000 to 0xe000118, each free or in use); the memory manager supplies usage status, unique IDs, and fast retrieval]
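
The idea that an address can double as a metadata ID can be modeled in a few lines. Python cannot hand out physical addresses, so this sketch simulates them: the base address 0xE000000 and the 0x38-byte entry size are read off the slide's figure, while the class and method names are hypothetical. In the real system the ID would simply be dereferenced; here retrieval is the equivalent constant-time arithmetic.

```python
class MetadataSlab:
    """Toy fixed-size allocator whose returned address is the metadata ID."""

    def __init__(self, base_addr=0xE000000, entry_size=0x38, entries=6):
        self.base_addr = base_addr
        self.entry_size = entry_size
        self.in_use = [False] * entries  # usage status per entry
        self.records = [None] * entries

    def allocate(self, record):
        index = self.in_use.index(False)  # raises ValueError when full
        self.in_use[index] = True
        self.records[index] = record
        # The simulated address is unique while the entry is allocated.
        return self.base_addr + index * self.entry_size

    def _index(self, metadata_id):
        index, remainder = divmod(metadata_id - self.base_addr, self.entry_size)
        if remainder or not 0 <= index < len(self.in_use) or not self.in_use[index]:
            raise KeyError(hex(metadata_id))
        return index

    def lookup(self, metadata_id):
        # Constant-time retrieval: no separate ID-to-location table is needed.
        return self.records[self._index(metadata_id)]

    def free(self, metadata_id):
        index = self._index(metadata_id)
        self.in_use[index] = False
        self.records[index] = None


slab = MetadataSlab()
first = slab.allocate({"name": "a"})
second = slab.allocate({"name": "b"})
print(hex(first), hex(second), slab.lookup(second))
```

One allocator thus covers all three requirements from the earlier slide: usage tracking, unique IDs, and fast retrieval.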

38 Persistence Support: Restore file system states after a reboot: data; metadata; memory manager (keeps track of metadata allocation).

39 Linux Memory Manager (1): Page allocator maintains individual pages. [figure: page allocator]

40 Linux Memory Manager (2): Zone allocator allocates memory in power-of-two sizes. [figure: page allocator, zone allocator]

41 Linux Memory Manager (3): Slab allocator groups allocations by size to reduce internal memory fragmentation. [figure: page allocator, zone allocator, slab allocator]

42 Linux Memory Manager (4): Difficult to restore the persistent states: three layers of pointer-rich mappings; mixing of persistent and temporary allocations. [figure: page allocator, slab allocator, zone allocator]

43 Conquest Persistence: Create memory zones with their own instantiations of memory managers. [figure: page allocator, slab allocator, zone allocator]

44 Conquest Persistence: Encapsulate all pointers within each zone: pointers can survive reboots; no serialization and deserialization. Swapping and paging: disabled for Conquest memory zones; enabled for non-Conquest zones.

45 Resiliency Support: Instantaneous metadata commit. No fsck (ad hoc metadata consistency check). Built-in checkpointing. Pointer-switch commit semantics. [figure: pointer]
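
The slide gives no detail on pointer-switch commit, so the following is only a generic sketch of the pattern the name suggests: prepare a new version off to the side, then publish it by switching a single reference. The class and method names are hypothetical and nothing here is taken from Conquest's code.

```python
class PointerSwitchedRecord:
    """Readers follow one reference; an update replaces it in a single step."""

    def __init__(self, initial):
        self._current = dict(initial)

    def read(self):
        return self._current

    def update(self, changes):
        # Build the new version in a shadow copy; a failure here leaves
        # the committed version untouched.
        shadow = dict(self._current)
        shadow.update(changes)
        # The commit point: one reference assignment switches versions.
        self._current = shadow


record = PointerSwitchedRecord({"size": 1, "mode": 0o644})
before = record.read()
record.update({"size": 2})
print(before["size"], record.read()["size"])
```

Either the old version or the new one is visible at any moment, which is why no after-the-fact consistency scan is needed in this model.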

46 Implementation Status: Kernel module under Linux 2.4.2. Fully functional and POSIX compliant. Modified memory manager to support Conquest persistence. Need to overcome BIOS limitations for distribution. Looking for licensing opportunities.

47 Performance Evaluation: Architectural simplification: feature count. Performance improvement: memory-only workload; memory and disk workload.

48 Conventional Data Path: buffer allocation management; buffer garbage collection; data caching; metadata caching; predictive readahead; write behind; cache replacement; metadata allocation; metadata placement; metadata translation; disk layout; fragmentation management. [figure: conventional file systems (storage requests, IO buffer management, IO buffer, persistence support, disk management, disk)]

49 Memory Path of Conquest: buffer allocation management; buffer garbage collection; data caching; metadata caching; predictive readahead; write behind; cache replacement; metadata allocation; metadata placement; metadata translation; disk layout; fragmentation management; memory manager encapsulation. [figure: Conquest memory data path (storage requests, persistence support, battery-backed RAM for small file and metadata storage)]

50 Disk Path of Conquest: buffer allocation management; buffer garbage collection; data caching; metadata caching; predictive readahead; write behind; cache replacement; metadata allocation; metadata placement; metadata translation; disk layout; fragmentation management. [figure: Conquest disk data path (storage requests, IO buffer management, IO buffer, disk management, disk holding the large-file-only file system), alongside battery-backed RAM for small file and metadata storage]

51 PostMark Benchmark (1) [Katcher 1997; Sweeney et al., 1996; Card et al., 1999; Namesys 2002]: ISP workload (emails, web-based transactions). 40 to 250 MB working set with 2 GB physical RAM. Conquest is comparable to ramfs and at least 24% faster than the LRU disk cache.

52 PostMark Benchmark (2): 10,000 files, 80 MB to 3.5 GB working set with 2 GB physical RAM [figure regions: working set <= RAM; working set > RAM]. When both memory and disk components are exercised, Conquest can be several times faster than ext2fs, reiserfs, and SGI XFS.

53 PostMark Benchmark (3): 10,000 files, 80 MB to 3.5 GB working set with 2 GB physical RAM. When working set > RAM, Conquest is 1.4 to 2 times faster than ext2fs, reiserfs, and SGI XFS.

54 Sprite LFS Microbenchmarks (1): Small-file benchmark; operates on 10,000 1-KB files in three phases.

55 Sprite LFS Microbenchmarks (2): Modified large-file microbenchmark: 10 1-MB files (Conquest in-core files).

56 Sprite LFS Microbenchmarks (3): Modified large-file microbenchmark: 10 1.01-MB files (Conquest on-disk files).

57 Sprite LFS Microbenchmarks (4): Large-file microbenchmark: 40 100-MB files (Conquest on-disk files).

58 History’s Mystery: Puzzling microbenchmark numbers… Geoffrey Kuenning: “If Conquest is slower than ext2, I will toss you off of the balcony…”

59 With me hanging off a balcony… Original large-file microbenchmark: 1-MB file (Conquest in-core file).

60 Odd Microbenchmark Numbers: Why are random reads slower than sequential reads?

61 Odd Microbenchmark Numbers: Why are RAM-based file systems slower than disk-based file systems?

62 A Series of Hypotheses: Warm-up effect? Maybe. Why do RAM-based systems warm up slower? Bad initial states? No. Pentium III streaming IO option? No.

63 Effects of Cache Footprint Sizes [figure: two panels, large cache footprint and small cache footprint. Each panel shows writing a file sequentially and then reading the same file sequentially, marking the cache footprint, the file end, and where cached content is flushed before it is read.]
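
The footprint effect can be reproduced with a toy cache model. The sketch below assumes a fully associative LRU cache counted in blocks, which is a simplification and not a model of the Pentium III L2 cache; the function name and sizes are hypothetical. A file is written sequentially and then read back in a chosen order, and the function counts read hits.

```python
from collections import OrderedDict


def read_hits(file_blocks, cache_blocks, read_order):
    """Write blocks 0..file_blocks-1 in order, then count hits for read_order."""
    cache = OrderedDict()  # keys ordered from least to most recently used

    def touch(block):
        hit = block in cache
        if hit:
            cache.move_to_end(block)
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used block
        return hit

    for block in range(file_blocks):
        touch(block)  # sequential write leaves the file's tail in the cache
    return sum(touch(block) for block in read_order)


# File larger than the cache: each sequential read evicts a block just
# before it would have been reused.
print(read_hits(8, 4, range(8)))
# File fits in the cache: every read hits.
print(read_hits(8, 8, range(8)))
# Reading the tail first reuses what the write left behind.
print(read_hits(8, 4, reversed(range(8))))
```

In this model a sequential read after a sequential write gets no reuse once the file exceeds the cache, while an access order that visits recently written blocks first does, which is consistent with the talk's observation that random reads can beat sequential reads through cache reuse.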

64 Sprite LFS Microbenchmarks: Modified large-file microbenchmark: 10 1-MB files (Conquest in-core files). Random accesses are faster than sequential accesses due to cache reuse.

65 Sprite LFS Microbenchmarks (2): Modified large-file microbenchmark: 10 128-KB files (Conquest in-core files). Random accesses are slower than sequential accesses due to the extra lseek.

66 Lessons Learned: Faster than LRU caching, unexpected: heavyweight disk handling; severe penalty for accessing memory content. Matching user access patterns to storage media offers considerable simplification and better performance: not an automatic result; need careful design.

67 More Lessons Learned: Effects of L2 caching become highly visible in memory workloads (modern workloads). Cannot blindly apply existing disk-based microbenchmarks to measure memory performance of file systems. Need to consider states of L2 cache and memory behaviors at each stage of microbenchmarking.

68 Additional Lessons Learned: Don’t discuss your performance numbers next to a balcony…unless…

69 Related Work (1) [McKusick et al., 1990; Ganger et al., 2000; Roselli et al., 2000; Seltzer et al., 2000]: Disk caching: assumption of scarce memory; complex mechanisms to maintain consistency, especially with the presence of metadata. RAM drives and RAM file systems: not meant to be persistent; use disk-related mechanisms; limitations on storage capacity.

70 Related Work (2) [Riedel 1998; ZDNet 1999]: Disk emulators: RAM storage accessed through SCSI interface. Ad hoc approaches: manual transferring of files to and from ramfs (capacity limitation); background daemon to stage RAM files to a disk (semantic and name space problems).

71 Going Beyond Conquest (1): Matching usage patterns with heterogeneous machines in the distributed domain: specialized tasks for machines within a cluster; preferably self-organizing and self-evolving. State-rich computing: caching of runtime data structures; similar to /tmp.

72 Going Beyond Conquest (2): Separate storage of metadata from data: association of metadata with data of different fidelity; opportunity for hierarchical replication across devices with different calibers. Benchmarking memory performance of file systems: developing new memory benchmarks.

73 Contributions: Demonstrated the feasibility of disk-memory hybrid file systems. Showed performance does not preclude simplicity. Pinpointed cache-related problems with modern benchmarks. Opened doors to many exciting areas of research.

74 Conclusion: Conquest demonstrates how rethinking changes in underlying assumptions can lead to significant architectural and performance improvements. Radical changes in hardware, applications, and user expectations in the past decade should lead us to rethink other aspects of operating systems as well.

75 Questions... Conquest: http://lasr.cs.ucla.edu/conquest Andy Wang: awang@cs.ucla.edu

