1 Deciding When to Forget in the Elephant File System University of British Columbia: Douglas S. Santry, Michael J. Feeley, Norman C. Hutchinson, Ross W. Carton, and Jacob Ofir Hewlett-Packard Laboratories: Alistair C. Veitch December 1999 Presented by: David Allen May 31st, 2005
2 Elephant File System: Overview Undo and Long-Term History File system that helps to protect data by keeping histories of file and directory changes. User Control Gives control over retention policies to the user. Can be applied at the file level. Storage Reclamation Separates storage reclamation from file operations such as write and delete. Cleaner runs in background to reclaim storage and support the retention policy.
3 Elephant file system: Why User Failures There is already good protection from network, system and media failures. Now we need to protect from user mistakes. rm *.o is not the same as rm * o
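The difference between the two commands comes from how the shell expands each argument before rm ever runs. The following is a minimal sketch of that expansion, not a real shell: the helper name `expand` and the file names are made up for illustration, and it assumes the common shell behaviour of passing an unmatched pattern through literally.

```python
import fnmatch

def expand(args, names):
    """Expand shell-style arguments against a list of file names.

    Simplified model: '*' does not match hidden files, and an argument
    that matches nothing is passed through unchanged.
    """
    expanded = []
    for arg in args:
        hits = sorted(
            n for n in names
            if not n.startswith(".") and fnmatch.fnmatchcase(n, arg)
        )
        expanded.extend(hits if hits else [arg])
    return expanded

# Hypothetical directory contents.
files = ["main.c", "main.o", "util.c", "util.o", "notes.txt"]

intended = expand(["*.o"], files)      # what the user meant: rm *.o
mistyped = expand(["*", "o"], files)   # what was typed:      rm * o

print("rm *.o removes:", intended)
print("rm * o removes:", mistyped)
```

With the stray space, the first argument matches every visible file in the directory, so the source files go along with the object files.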
4 Elephant file system: Why Cheap Disk Space Single inexpensive disks were approaching 50GB at the time of the paper in 1999. Now, in 2005, they are approaching 500GB. They are projected to reach 2TB by 2010.
5 Elephant file system: Why Cheap Disk Space In addition to high-end disk capacity increasing 10x in 6 years, the price has dropped by more than a factor of 10.
6 Elephant file system: Why Cheap Disk Space Other types of media have grown as well: 8GB compact flash, 6GB micro drives. (Useful for that 16.7MP Canon camera with 42MB images.)
7 Elephant file system: Why Capacity Large disk capacities. Constant human productivity. Only a relatively small set of files needs protection. It makes sense to support revision histories on files and directories.
8 Elephant file system: Change Change in pattern of use. Does this paper stand up to changes in disk usage? Explosion of large files from still and video digital cameras, mp3 CD rips, and divx DVD rips. I have 17.8GB of pictures and video from one trip, which I need to prune and edit to a final form. How would people in the class use this system?
9 Elephant file system: Policies Keep One (no versioning) Just like the FFS. File changes can overwrite existing data, and are permanent.
10 Elephant file system: Policies Keep All (complete versioning) Like revision control systems. Entire history is maintained.
11 Elephant file system: Policies Keep Safe (undo protection) Keeps recent changes for a specified undo period.
12 Elephant file system: Policies Keep Landmarks (long-term history) In addition to Keep Safe protection, retain important file versions.
13 Elephant file system: Policies Application Defined (user specified) Custom policy implemented at the user level.
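The four built-in policies can be read as different answers to one question the cleaner asks: which versions of a file must survive? The sketch below is one interpretation under stated assumptions, not the paper's implementation. The `Version` class, the `landmark` flag (set by hand here, although landmarks can also be chosen automatically), the policy strings, and the function name `versions_to_keep` are all hypothetical. Keep Safe is modelled as keeping every version that was the visible one at some moment inside the undo period.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Version:
    timestamp: float        # when this version was written
    landmark: bool = False  # hypothetical flag: marked as important

def versions_to_keep(versions, policy, now, undo_period=0.0):
    """Return the versions a cleaner would retain under a policy."""
    ordered = sorted(versions, key=lambda v: v.timestamp)
    if policy == "keep_all":
        return ordered                 # complete history
    if policy == "keep_one":
        return ordered[-1:]            # only the current version
    if policy not in ("keep_safe", "keep_landmarks"):
        raise ValueError(f"unknown policy: {policy}")

    cutoff = now - undo_period
    kept = []
    for i, v in enumerate(ordered):
        is_current = i == len(ordered) - 1
        # A version stays visible until its successor is written, so it is
        # needed for undo if that successor falls inside the undo period.
        visible_in_window = is_current or ordered[i + 1].timestamp > cutoff
        is_landmark = policy == "keep_landmarks" and v.landmark
        if visible_in_window or is_landmark:
            kept.append(v)
    return kept

history = [Version(0, landmark=True), Version(10), Version(20),
           Version(95), Version(99)]
for policy in ("keep_one", "keep_all", "keep_safe", "keep_landmarks"):
    kept = versions_to_keep(history, policy, now=100, undo_period=10)
    print(policy, [v.timestamp for v in kept])
```

An application-defined policy would replace this function with user-level code that makes the same keep-or-discard decision.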
14 Elephant file system: Features for Comparison User Control Only retains history on user selected files, with user selected policies. Custom policies can be created. Landmarks can be user specified. Automation Implemented within the file system. Revisions are maintained automatically as the files are used. Landmarks can be determined automatically. Cleaning is done in the background.
15 Elephant file system: Features for Comparison Granularity Every file and directory change can be kept. Full or partial long term histories can be maintained. Files can be grouped to maintain consistency for landmarking. Versioning on files is done at the block level. Access Specific version can be specified with a file and date pair. Only the current version can be written to. Most recent revision is fastest, but all versions can be accessed relatively quickly. Only a single version exists at a time.
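Access by a file and date pair amounts to finding the newest version written at or before the requested time. The sketch below illustrates that lookup with a binary search. It is an assumption-laden toy: the class `VersionedFile` and its methods are hypothetical, whole contents are stored per version rather than individual blocks, and timestamps are plain numbers.

```python
import bisect

class VersionedFile:
    """Toy model: a linear list of versions, newest last."""

    def __init__(self):
        self._times = []     # write times, strictly increasing
        self._contents = []  # contents of each version

    def write(self, timestamp, data):
        # Only the current version can be written, so history is
        # append-only and never branches.
        if self._times and timestamp <= self._times[-1]:
            raise ValueError("writes must be newer than the current version")
        self._times.append(timestamp)
        self._contents.append(data)

    def read(self, at=None):
        """Return the current version, or the one visible at time `at`."""
        if not self._times:
            raise FileNotFoundError("file has no versions")
        if at is None:
            return self._contents[-1]
        i = bisect.bisect_right(self._times, at) - 1
        if i < 0:
            raise FileNotFoundError("file did not exist at that time")
        return self._contents[i]

f = VersionedFile()
f.write(10, b"first draft")
f.write(20, b"second draft")
print(f.read())        # current version
print(f.read(at=15))   # version visible at time 15
```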
16 Elephant file system: Features for Comparison Storage Files with no versions are stored as efficiently as files without versioning. Revisions to inodes are stored in an inode log, which uses full blocks and is much larger than a single inode. Directories are stored as name histories.
17 Elephant file system vs. the Trash Can User Control Users manually empty the trash can. This causes files to have different levels of protection based on when they were deleted and when the trash can was emptied. Automation Files are automatically moved to the trash can on delete. Granularity Very coarse-grained. Only protects files against accidental deletion. Only until the trash can is emptied. No directory protection. Access Files can be retrieved from the trash can, but the user needs to determine where to put them. Storage Copy of entire file is kept in the trash can.
18 Elephant file system vs. Backups User Control Typically no control over system backups. Users can manually copy files. Automation System backups are usually automatic. Granularity Very coarse over time. No fine-grained revisioning. No protection between backups. Typically limited by backup retention policy (number of tapes). Access System backups are usually very expensive to retrieve. User manual backups are usually closer, but not always convenient. Storage Usually full or differential copies of the data.
19 Elephant file system vs. Checkpoints User Control Typically no user control over checkpoints. Automation Checkpoints are usually automatic. Granularity Very coarse over time. No fine-grained revisioning. No protection between checkpoints. Typically limited by checkpoint retention policy (space). Access Typically on-line, easy to get to. Storage Efficient. Copy-on-write policy maintains changes to file system after the checkpoint.
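Storing a directory as a name history means an entry is never erased on delete. Instead, each name records when it appeared and when it went away, so the directory can be listed as of any time. The sketch below shows the idea under assumptions: the `NameEntry` and `Directory` classes, their field names, and the numeric timestamps are hypothetical and do not reflect the on-disk layout.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NameEntry:
    name: str
    inode: int
    created: float
    deleted: Optional[float] = None  # None while the name is still live

class Directory:
    """Toy directory that keeps every name it has ever held."""

    def __init__(self):
        self.entries = []

    def create(self, name, inode, now):
        if self.lookup(name, now) is not None:
            raise FileExistsError(name)
        self.entries.append(NameEntry(name, inode, now))

    def unlink(self, name, now):
        # Deleting only stamps the entry; the history stays in place.
        for e in self.entries:
            if e.name == name and e.deleted is None:
                e.deleted = now
                return
        raise FileNotFoundError(name)

    def lookup(self, name, at):
        """Return the inode bound to `name` at time `at`, or None."""
        for e in self.entries:
            alive = e.created <= at and (e.deleted is None or at < e.deleted)
            if e.name == name and alive:
                return e.inode
        return None

    def listing(self, at):
        return sorted(e.name for e in self.entries
                      if e.created <= at and (e.deleted is None or at < e.deleted))

d = Directory()
d.create("a.txt", inode=7, now=10)
d.unlink("a.txt", now=20)
d.create("a.txt", inode=9, now=30)   # same name, new file
print(d.lookup("a.txt", at=15), d.lookup("a.txt", at=25), d.lookup("a.txt", at=35))
```

Because a deleted name keeps its entry, undoing a delete or browsing an old directory state needs no separate trash area.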
20 Elephant file system vs. Revision Control System User Control Only retains history on user selected files, but usually best to use revision control on all files in a directory. No policies to select, entire history is retained. File groups can be "tagged" to establish a consistent version. (Like landmarks and grouping.) Automation No automation. Usually a set of command line tools that are initiated by the user. Checkout, commit... Granularity Medium granularity. Only committed changes are kept. All versions are retained. Often it is difficult or impossible to remove old versions. Typically revision control does not include directories. (CVS) Often renaming or moving files will break file histories. (CVS, SourceSafe)
21 Elephant file system vs. Revision Control System Access Files can be accessed by name and version. Only most recent files can be modified. Older versions can be branched. Branches can be merged. Multiple branches (versions) can exist at a time. Storage Text files are usually stored efficiently as differentials. Access is fast for recent versions and slow for old versions. Binary file storage is usually inefficient, full copies.
22 Elephant file system: Summary Most files don't need versioning so impact is low. Performance is very close to a system with no versioning. Storage cost of metadata is high in the prototype implementation. Disk capacity has increased as predicted in this paper, but so has the need for capacity due to digital music and imaging. Usage patterns have also changed for the same reasons. Does this system still make as much sense in the face of these changes? Definitely!
23 References "Deciding When to Forget in the Elephant File System." D. S. Santry, M. J. Feeley, N. C. Hutchinson, A. C. Veitch, R. W. Carton, and J. Ofir, In Proceedings of the Seventeenth ACM Symposium on Operating Systems Principles, December 12-15, 1999, Charleston, SC, pp Historic disk capacity and price data: Current media capacities and prices: