Long tails and Archive systems
Elliot Jaffe
FDIS 2005

Archive Metrics
What
–Distribution of file sizes
–Distribution of occupied storage
–How are files accessed
Why
–System architecture
–Scaling for access
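
The first two metrics above can be sketched as a pair of histograms over power-of-two size buckets: one counting files, one summing the bytes they occupy. This is a minimal illustration, not the method used in the talk; the function name `size_histograms` and the bucketing scheme are assumptions.

```python
from collections import Counter

def size_histograms(sizes):
    """Return (file counts, occupied bytes) per power-of-two size bucket.

    Bucket k holds sizes in [2**(k-1), 2**k) bytes; bucket 0 holds empty files.
    The bucketing scheme is an illustrative assumption.
    """
    counts, occupied = Counter(), Counter()
    for size in sizes:
        bucket = int(size).bit_length()
        counts[bucket] += 1
        occupied[bucket] += size
    return counts, occupied

# Hypothetical sizes in bytes, for illustration only.
counts, occupied = size_histograms([0, 1, 500, 1024, 1500, 2_000_000])
for bucket in sorted(counts):
    print(bucket, counts[bucket], occupied[bucket])
```

Comparing the two histograms shows where the files are versus where the bytes are, which is the contrast the later slides rely on.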

File size studies
UFS93 (1993)
–12 million files
–UNIX only
–Avg. file size is 2 KB
–90% of storage in 11% of files
HUJI (2005)
–4 million files
–UNIX + Windows
–Avg. file size is 8 KB
–90% of storage in 5.5% of files
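
A figure such as "90% of storage in 11% of files" can be computed from a list of file sizes by sorting largest first and accumulating bytes until the target share is reached. The sketch below assumes that definition; the function name and the sample sizes are hypothetical and are not data from either study.

```python
def file_fraction_for_storage_share(sizes, share=0.90):
    """Smallest fraction of files (largest first) that holds `share` of all bytes."""
    ordered = sorted(sizes, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    target = share * total
    running = 0
    for count, size in enumerate(ordered, start=1):
        running += size
        if running >= target:
            return count / len(ordered)
    return 1.0

# Hypothetical sizes: one large file and nine tiny ones.
print(file_fraction_for_storage_share([1000] + [1] * 9))
```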

What’s Changed
Then
–JAWS, NOW
–Online was expensive
–Offline tape storage
Now
–Central File Servers
–Digital Libraries
–Online is cheap
–No offline storage
–XML
–Multimedia

Empirical Data

Questions
–What is the future of these distributions?
–Are the changes power-law extensions of the tails, so that the 10/90 and 20/80 rules no longer hold and are the wrong way to think about these distributions?
–Are the changes driven by unpredictable external factors?
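
One way to probe the power-law question is to fit a tail index to the largest files and see what concentration rule it implies. The sketch below assumes an ideal Pareto tail above a chosen cutoff `x_min`, which the talk does not claim; it uses the standard maximum-likelihood (Hill) estimate and the Pareto relation that the largest fraction p of items holds a share p^(1 - 1/a) of the total when a > 1. Function names are illustrative.

```python
import math

def tail_index(sizes, x_min):
    """Maximum-likelihood (Hill) estimate of the Pareto tail index a for sizes >= x_min."""
    tail = [s for s in sizes if s >= x_min]
    log_sum = sum(math.log(s / x_min) for s in tail)
    if log_sum == 0:
        raise ValueError("need at least one size strictly above x_min")
    return len(tail) / log_sum

def top_share(p, a):
    """Share of total bytes held by the largest fraction p of files.

    Valid only for an ideal Pareto tail with a > 1 (finite mean).
    """
    if a <= 1:
        raise ValueError("the mean is infinite for a <= 1; no fixed share exists")
    return p ** (1 - 1 / a)

# The classic 20/80 rule corresponds to a = log(5) / log(4), roughly 1.16.
a_8020 = math.log(5) / math.log(4)
print(round(top_share(0.20, a_8020), 3))
```

As the fitted index approaches 1, the share held by any fixed top fraction approaches 100%, so a single 10/90 or 20/80 figure says little about the distribution as a whole.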

The Long Tail
Chris Anderson (2004)
–The long tail of a distribution has tremendous mass and creates new market opportunities
–Amazon, Netflix, Wikipedia

Today’s landscape
[Figure: NOW, File Servers, Sarbanes Oxley, and Digital Libraries plotted against Storage Capacity and Access Frequency]

Next Steps
Collecting data from large storage systems
–File Sizes, Created, Last Modified, Last Access, Frequency of Reads
Goal: New architectures for Digital libraries
–Focus on Operations
–Store large and small files differently
–Store very rarely accessed files in slow-access storage
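
The data collection and placement ideas above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the planned system: `FileRecord`, `scan`, `choose_tier`, and the 1 MiB and 365-day thresholds are all hypothetical. Note that `os.stat` does not expose creation time portably or read frequency at all (the latter would need access logs), and last-access times can be unreliable on file systems mounted with access-time updates reduced or disabled.

```python
import os
import stat
from dataclasses import dataclass

@dataclass
class FileRecord:
    path: str
    size: int        # bytes
    modified: float  # seconds since the epoch
    accessed: float  # seconds since the epoch

def scan(root):
    """Yield a FileRecord for every regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path, follow_symlinks=False)
            except OSError:
                continue  # file vanished or is unreadable
            if stat.S_ISREG(st.st_mode):
                yield FileRecord(path, st.st_size, st.st_mtime, st.st_atime)

def choose_tier(rec, now, large_bytes=1 << 20, cold_days=365):
    """Pick a storage tier; both thresholds are illustrative assumptions."""
    idle_days = (now - rec.accessed) / 86400
    if idle_days > cold_days:
        return "slow"  # very rarely accessed: slow-access storage
    return "large" if rec.size >= large_bytes else "small"
```

A typical use would be to iterate over `scan(root)` for some directory `root` and tally the bytes that `choose_tier(rec, time.time())` assigns to each tier.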