Storage Systems for Managing Voluminous Data
CS455: Introduction to Distributed Systems
Meghana Santhapur, Pratyusha Reddy Degapudi, Rasika Warade
Department of Computer Science

WHY IS THIS PROBLEM IMPORTANT?
- The Big Data universe is beginning to explode
- Large volumes of data must be stored and managed efficiently
- A storage system has to be selected for each particular use case
- So where do we see this explosion?
Source: http://www.nbnnews.com/NBN/issues/2011-12-05/Sales+and+Marketing/index.html

PROBLEM CHARACTERIZATION

              April 2009                     Current
Total         15 billion photos              65 billion photos
              60 billion images              260 billion images
              1.5 PB                         20 PB
Upload rate   220 million photos per week    1 billion photos per week
              25 TB                          60 TB
Server rate   550,000 images per sec         1 million images per sec

Photo storage systems: Block storage? File storage? Object storage?
Source: Facebook Engineering research group

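The figures on this slide can be sanity-checked with some quick arithmetic. The sketch below is only illustrative: it assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes), and the names `snapshots` and `summarize` are made up for this example. It derives the number of stored images per photo, the average size of a stored image, and the average volume uploaded per photo.

```python
# Back-of-the-envelope arithmetic on the April 2009 and Current figures.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 PB = 10**15 bytes).
TB = 10**12
PB = 10**15

# name -> (photos, images, bytes stored, photos uploaded per week, bytes uploaded per week)
snapshots = {
    "April 2009": (15e9, 60e9, 1.5 * PB, 220e6, 25 * TB),
    "Current": (65e9, 260e9, 20 * PB, 1e9, 60 * TB),
}


def summarize(photos, images, stored_bytes, weekly_photos, weekly_bytes):
    """Derive per-photo and per-image averages from the raw totals."""
    return {
        "images_per_photo": images / photos,
        "avg_kb_per_image": stored_bytes / images / 1e3,
        "avg_kb_per_uploaded_photo": weekly_bytes / weekly_photos / 1e3,
    }


summary = {name: summarize(*row) for name, row in snapshots.items()}

for name, stats in summary.items():
    print(name, {key: round(value, 1) for key, value in stats.items()})
```

Both snapshots work out to four stored images per photo, and the average stored image grows from 25 KB to roughly 77 KB between the two columns.
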
TRADE-OFF SPACE FOR SOLUTIONS
- Thousands of files in each directory: more than 10 disk operations
- Directory size reduced to hundreds of images: still 3 disk operations
- File handles cached in the photo servers
- File handles of every image kept in memcache
- Caches and disk operations did not decrease, so metadata overhead remained
- Hence: OBJECT STORAGE

DOMINANT APPROACHES
Can you find a needle in a haystack?
[Figure: Facebook Haystack store layout]
- It maintains an in-core (in-memory) index for all photos
- This eliminates unnecessary metadata
Source: Facebook Engineering research group

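The idea of an in-core index over one large store file can be sketched in a few lines. This is a toy illustration, not Facebook's actual on-disk format: the class name `TinyHaystack`, the 12-byte record header, and the use of an in-memory `io.BytesIO` in place of a real file are all assumptions made for the example.

```python
import io
import struct


class TinyHaystack:
    """Toy append-only store: many photos in one file plus an in-memory index."""

    # Hypothetical record header: 8-byte photo id, 4-byte payload size.
    HEADER = struct.Struct("<QI")

    def __init__(self, backing_file):
        self.f = backing_file
        self.index = {}  # photo_id -> (payload offset, payload size)

    def put(self, photo_id, data):
        # Append the record at the end of the single store file.
        self.f.seek(0, io.SEEK_END)
        offset = self.f.tell()
        self.f.write(self.HEADER.pack(photo_id, len(data)))
        self.f.write(data)
        # Remember where the payload starts so reads need no directory or inode lookup.
        self.index[photo_id] = (offset + self.HEADER.size, len(data))

    def get(self, photo_id):
        offset, size = self.index[photo_id]  # in-memory lookup only
        self.f.seek(offset)
        return self.f.read(size)  # one read of the photo payload


store = TinyHaystack(io.BytesIO())
store.put(1, b"jpeg-bytes-of-photo-1")
store.put(2, b"jpeg-bytes-of-photo-2")
print(store.get(2))
```

Because the offset and size of every photo live in memory, a read touches the store file once instead of walking per-file metadata first.
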
DOMINANT APPROACHES (Contd…)
MObStor/DORA
- DORA provides the backend service to MObStor
- Eliminates metadata from object storage
- Uses data locators
HBase
- Runs on top of the Hadoop framework
- Vector data is converted to Well-Known Binary (WKB) and Well-Known Text (WKT)
- Can handle spatial images

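To make the Well-Known Text and Well-Known Binary encodings concrete, here is a minimal sketch for a single 2D point. The function names are hypothetical, only little-endian point geometries are handled, and how the encoded values are laid out in HBase rows is not covered by the slide, so it is left out.

```python
import struct


def point_to_wkt(x, y):
    """Well-Known Text for a 2D point."""
    return f"POINT ({x} {y})"


def point_to_wkb(x, y):
    """Well-Known Binary for a 2D point.

    Layout: 1 byte for byte order (1 = little-endian),
    4 bytes for geometry type (1 = Point), then x and y as 64-bit floats.
    """
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(blob):
    """Decode a little-endian WKB point back into (x, y)."""
    byte_order, geometry_type, x, y = struct.unpack("<BIdd", blob)
    if byte_order != 1 or geometry_type != 1:
        raise ValueError("only little-endian WKB points are supported in this sketch")
    return x, y


wkt = point_to_wkt(-105.08, 40.57)
wkb = point_to_wkb(-105.08, 40.57)
print(wkt)
print(len(wkb), "bytes:", wkb.hex())
print(wkb_to_point(wkb))
```

The binary form is fixed-size and cheap to parse, while the text form is human-readable, which is one reason both encodings are commonly kept side by side.
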
INSIGHTS GLEANED
Object Storage
- Ignores the file system and puts everything in a bucket
- Much faster than file systems
- Performance does not degrade as the cluster grows
Haystack Store
- The old NFS infrastructure was replaced because of excessive file system metadata
- Allows multiple photos to be stored in a single file with less metadata
MObStor/DORA
- Supports high request rates for operational storage
- Adds features for doing object storage on cheaper systems

WHAT THE PROBLEM SPACE WOULD LOOK LIKE IN THE FUTURE

TRADE-OFF SPACE AND SOLUTIONS IN THE FUTURE
- Currently, object storage is the smartest solution for voluminous data
- Scaling less expensively suggests open-source software
- Storage systems may support both object and block storage, or possibly file storage as well
- Metadata can be reduced further by using a dynamic data structure to locate the servers responsible for data
- What is the world's largest possible haystack?
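
One data structure that fits the idea of locating servers without per-object metadata is a consistent-hash ring. The slide does not name a specific structure, so this is an assumed illustration; the class `HashRing`, the server names, and the choice of 64 virtual points per server are all hypothetical.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring: object id -> responsible server, no lookup table per object."""

    def __init__(self, servers=(), replicas=64):
        self.replicas = replicas  # virtual points per server, smooths the distribution
        self._points = []         # sorted ring positions
        self._owner = {}          # ring position -> server name
        for server in servers:
            self.add(server)

    def add(self, server):
        for i in range(self.replicas):
            point = _hash(f"{server}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = server

    def locate(self, object_id):
        if not self._points:
            raise LookupError("ring has no servers")
        # First ring position clockwise from the object's hash, wrapping around.
        i = bisect.bisect_right(self._points, _hash(object_id)) % len(self._points)
        return self._owner[self._points[i]]


ring = HashRing(["store-a", "store-b", "store-c"])
keys = [f"photo-{n}" for n in range(1000)]
before = {key: ring.locate(key) for key in keys}

ring.add("store-d")  # grow the cluster
after = {key: ring.locate(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
print(f"{len(moved)} of {len(keys)} objects moved, all to the new server:",
      all(after[key] == "store-d" for key in moved))
```

The location of any object is computed from its id alone, and adding a server relocates only the objects that the new server takes over.
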