Finding a needle in Haystack: Facebook's Photo Storage

Presentation transcript:

Finding a needle in Haystack: Facebook's Photo Storage
Shakthi Bachala

Outline: Scenario, Goal, Problem, Previous Approach, Current Approach, Evaluation, Advantages, Critique, Conclusion

Scenario: growth from April 2009 to October 2011

                 April 2009                        October 2011
Total            15 billion photos                 65 billion photos
                 (x4 sizes = 60 billion images)    (x4 sizes = 260 billion images)
                 1.5 petabytes of data             20 petabytes of data
Upload rate      220 million photos/week           1 billion photos/week
                 (25 terabytes of data)            (60 terabytes of data)
Serving rate     550,000 images/sec                1 million images/sec

Goal:
- High throughput and low latency
- Fault-tolerant
- Cost-effective
- Simple

Previous Approach: Typical Design for Photo Sharing

Previous Approach: NFS-Based Design for Photo Sharing at Facebook

Previous Approach – NFS-Based Design
The traditional file-system architecture performs poorly under Facebook's kind of workload. The CDN effectively serves the hottest photos (profile pictures and recently uploaded photos), but Facebook also generates many requests for less popular, "long tail" images, and these are not handled by the CDN. A typical website sees roughly a 99% CDN hit rate; Facebook's was only around 80%.

Long Tail Issue

Previous Approach (cont.)
Problems with the NFS-based approach:
- Wasted storage capacity due to the large per-file metadata, since each image is stored as its own file
- A large number of disk operations per read, caused by large directories containing thousands of files
Restructuring from large directories to small ones brought the disk operations per image read down from approximately 10 to about 2.5-3.0.

Current Approach – Haystack Architecture

Current Approach – Haystack Components
The main components of the Haystack architecture are:
- Haystack Directory
- Haystack Cache
- Haystack Store

Current Approach – Haystack Directory
The main tasks of the Directory are:
- Map logical volumes to physical volumes: each logical volume is backed by 3 physical volumes on 3 different nodes
- Load balancing: writes are balanced across logical volumes, reads across physical volumes (any of the 3 stores)
- Caching strategy: decide whether a photo request should be handled by the CDN or by the Cache
- URL generation: http://<CDN>/<Cache>/<Node>/<Logical volume id, Image id>
- Identify logical volumes that are read-only, either for operational reasons or because they have reached their storage capacity
A minimal sketch of the mapping and URL generation appears below.
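Here is a minimal sketch of the Directory's two jobs described above; all names (LOGICAL_TO_PHYSICAL, url_for_read, the example hostnames) are hypothetical illustrations, not Facebook's actual API.

import random

# Hypothetical in-memory view of the Directory's mapping; in production
# this metadata lives in a replicated database behind the web servers.
LOGICAL_TO_PHYSICAL = {
    7: ["store-a", "store-b", "store-c"],  # logical volume 7 -> 3 physical replicas
}
WRITABLE_VOLUMES = [7]                     # read-only volumes are excluded here

def volume_for_write() -> int:
    """Balance writes across the logical volumes that are still writable."""
    return random.choice(WRITABLE_VOLUMES)

def url_for_read(logical_volume: int, photo_id: int, via_cdn: bool) -> str:
    """Build a read URL of the form the slide shows:
    http://<CDN>/<Cache>/<Node>/<Logical volume id, Image id>"""
    node = random.choice(LOGICAL_TO_PHYSICAL[logical_volume])  # balance reads
    tail = f"cache.example.com/{node}/{logical_volume},{photo_id}"
    return f"http://cdn.example.com/{tail}" if via_cdn else f"http://{tail}"

print(url_for_read(7, 42, via_cdn=True))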

Current Approach – Haystack Cache
The Cache receives HTTP requests for photos from browsers or CDNs. It is organized as a distributed hash table, with the photo id as the key for locating cached data. If the photo id misses in the Cache, the Cache fetches the photo from a Store machine and returns it to the browser or CDN, depending on where the request came from.

Current Approach – Haystack Cache
A photo is cached only if it satisfies the following two conditions:
- The request comes directly from a user rather than from a CDN. Facebook's experience with the NFS-based design showed that post-CDN caching is ineffective: a request that misses in the CDN is unlikely to hit in an internal cache.
- The photo was fetched from a write-enabled Store machine. Photos are accessed most heavily soon after they are uploaded, and file systems generally work better when doing either reads or writes, but not both.
A toy sketch of this admission logic appears below.
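A toy sketch of the Cache's lookup path and the two admission rules above; the class and its fields are assumptions made for illustration.

class HaystackCache:
    """Toy stand-in for the distributed hash table: photo id -> bytes."""
    def __init__(self):
        self.table: dict[int, bytes] = {}

    def get(self, photo_id: int, from_cdn: bool, volume_writable: bool,
            fetch_from_store) -> bytes:
        if photo_id in self.table:            # cache hit
            return self.table[photo_id]
        data = fetch_from_store(photo_id)     # miss: go to the Store
        # Rule 1: a request that already missed in the CDN is unlikely
        # to hit here, so only cache direct user requests.
        # Rule 2: only cache photos on write-enabled volumes, sheltering
        # them from mixed read/write traffic while they are hottest.
        if (not from_cdn) and volume_writable:
            self.table[photo_id] = data
        return data

cache = HaystackCache()
print(cache.get(42, from_cdn=False, volume_writable=True,
                fetch_from_store=lambda pid: b"...jpeg bytes..."))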

Current Approach – Haystack Cache Hit Rate

Current Approach: Haystack Store
Replaces the storage and photo-server layers of the NFS-based design with the following structure:

Current Approach: Haystack Store
- Storage: 12 x 1 TB SATA drives, RAID-6
- Filesystem: a single XFS filesystem of approximately 10 TB
- Haystack: a log-structured, append-only object store containing needles as the object abstraction
- 100 haystacks per node, each 100 GB in size

Current Approach: Haystack Store File
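The store file consists of a superblock followed by a sequence of needles. As a rough illustration, here is a minimal Python packing of a needle; the field widths are assumptions, and several real fields (magic numbers, alternate key, padding) are omitted for brevity.

import struct
import zlib

# Simplified needle layout (widths are illustrative, not the production
# format): cookie, key, flags, data size, then the payload and a CRC32.
NEEDLE_HEADER = struct.Struct("<QQBI")   # cookie, key, flags, size

def pack_needle(cookie: int, key: int, data: bytes, flags: int = 0) -> bytes:
    header = NEEDLE_HEADER.pack(cookie, key, flags, len(data))
    footer = struct.pack("<I", zlib.crc32(data))   # detect corruption on read
    return header + data + footer

def unpack_needle(buf: bytes) -> tuple[int, int, int, bytes]:
    cookie, key, flags, size = NEEDLE_HEADER.unpack_from(buf)
    data = buf[NEEDLE_HEADER.size:NEEDLE_HEADER.size + size]
    crc, = struct.unpack_from("<I", buf, NEEDLE_HEADER.size + size)
    assert zlib.crc32(data) == crc, "corrupted needle"
    return cookie, key, flags, data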

Current Approach: Operations in Haystack
Photo read:
- Look up the offset/size of the image in the in-core index
- Read the data (approximately 1 disk operation)
Photo write:
- Asynchronously append images one by one to the haystack file, moving to the next haystack file when the current one becomes full
- Asynchronously append index records to the index file, flushing it when too many index records are dirty
- Update the in-core index
A toy volume illustrating the read and write paths appears below.
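A toy append-only volume showing both paths: a write is one sequential append plus an in-core index update, and a read is one in-memory index lookup plus roughly one disk operation. This is a sketch, not the production code; the asynchronous index-file append is shown separately after the next slide.

import os

class HaystackVolume:
    """Toy append-only store file plus in-core index: key -> (offset, size)."""
    def __init__(self, path: str):
        self.f = open(path, "a+b")
        self.index: dict[int, tuple[int, int]] = {}

    def write(self, key: int, needle: bytes) -> int:
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(needle)                       # one sequential append
        self.f.flush()
        self.index[key] = (offset, len(needle))    # update in-core index
        return offset

    def read(self, key: int) -> bytes:
        offset, size = self.index[key]             # memory lookup: no disk I/O
        self.f.seek(offset)
        return self.f.read(size)                   # ~1 disk operation

vol = HaystackVolume("volume_00001.dat")
vol.write(42, b"...jpeg bytes...")
print(vol.read(42))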

Current Approach: Operations in Haystack
Photo delete:
- Look up the offset of the image in the in-core index
- Mark the needle's flags field as DELETED
- Update the in-core index
Index file: provides the minimum metadata needed to locate each needle in the Haystack store; each record is a subset of the needle's header metadata.
Sketches of the delete path and the index record appear below.
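Sketches of the delete path and of an index record, continuing the toy HaystackVolume above; the DELETED flag value and the record layout are illustrative assumptions.

import struct

DELETED = 0x1   # illustrative flag bit

def delete(vol: "HaystackVolume", key: int) -> None:
    """Mark the photo deleted; its bytes stay on disk until compaction."""
    offset, size = vol.index[key]
    # The real store also rewrites the needle's flags field in place;
    # here we only record the deletion in the in-core index.
    vol.index[key] = (0, size)        # offset 0 doubles as the tombstone

# Index record: the minimum metadata needed to find a needle again,
# a subset of the needle header (key, offset, size).
INDEX_RECORD = struct.Struct("<QQI")

def append_index_record(index_file, key: int, offset: int, size: int) -> None:
    # Appended asynchronously; after a crash the store rebuilds the
    # missing tail of the index by scanning the volume file itself.
    index_file.write(INDEX_RECORD.pack(key, offset, size))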

Current Approach: Haystack Index File

Haystack-Based Design – Photo Upload

Haystack-Based Design – Photo Download

Current Approach: Operations in Haystack
Filesystem: Haystack uses XFS, an extent-based file system, which has two main advantages:
- The block maps of several contiguous large files can be small enough to be stored in main memory
- XFS provides efficient file preallocation, mitigating fragmentation and reining in how large block maps can grow
A sketch of preallocating a store file appears below.
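A sketch of the preallocation the slide refers to, using fallocate(2) via Python's os.posix_fallocate (Linux/Unix only); the file name echoes the store configuration above but is illustrative.

import os

def preallocate(path: str, size: int) -> None:
    """Reserve the file's space up front so an extent-based filesystem
    like XFS can keep its block map small and fragmentation low."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.posix_fallocate(fd, 0, size)   # fallocate(2) under the hood
    finally:
        os.close(fd)

# e.g. one 100 GB haystack file, per the store configuration above:
# preallocate("haystack_00001.dat", 100 * 1024**3)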

Current Approach: Haystack Optimization
Compaction: an infrequent online operation that creates a copy of a haystack while skipping duplicate and deleted photos. The pattern of deletes is skewed relative to photo views: young photos are much more likely to be deleted, and over the past year about 25% of photos were deleted.
A toy compaction pass appears below.
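A toy compaction pass over the HaystackVolume sketched earlier: copy live needles into a fresh file, skipping tombstoned (deleted) entries, and return the rebuilt index. This is an illustration under the same assumptions as the earlier sketches.

def compact(vol: "HaystackVolume", out_path: str) -> dict[int, tuple[int, int]]:
    """Online copy that drops deleted needles; duplicates are already
    collapsed because the in-core index keeps one entry per key."""
    new_index: dict[int, tuple[int, int]] = {}
    with open(out_path, "wb") as dst:
        for key, (offset, size) in vol.index.items():
            if offset == 0:               # tombstone: photo was deleted
                continue
            vol.f.seek(offset)
            new_index[key] = (dst.tell(), size)
            dst.write(vol.f.read(size))   # live needle copied sequentially
    return new_index                      # swapped in atomically in a real store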

Current Approach: Haystack Optimization
Saving more memory: the following two techniques reduced the store machines' main-memory footprint by about 20%:
- Eliminate the in-memory representation of the deleted flag by setting the in-core offset to 0 for deleted photos
- Do not keep cookie values in main memory; instead, check the supplied cookie after reading the needle from disk
A sketch combining both tricks appears below.
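A sketch combining both memory tricks, reusing the toy needle format and volume from the earlier sketches (unpack_needle and HaystackVolume are the illustrative helpers defined above): deletion is encoded as offset 0, so no per-photo flag lives in RAM, and the cookie is verified only after the needle is read from disk, so no per-photo cookie lives in RAM either.

def read_checked(vol: "HaystackVolume", key: int, supplied_cookie: int) -> bytes:
    offset, size = vol.index[key]
    if offset == 0:
        raise KeyError("photo deleted")        # tombstone: no flag byte in RAM
    cookie, _key, _flags, data = unpack_needle(vol.read(key))
    if cookie != supplied_cookie:              # checked after the disk read,
        raise PermissionError("bad cookie")    # so cookies need no RAM at all
    return data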

Current Approach: Haystack Optimization
Batch uploads: disks perform better with large sequential writes than with small random writes, so Facebook batches uploads whenever possible. Many users upload entire albums rather than single pictures, which provides a natural opportunity to batch the writes (see the sketch below).
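A toy batching path, assuming the HaystackVolume sketch above: buffer an album's needles and issue them as one large append, so the disk sees a single sequential write instead of many small ones.

import os

def batch_upload(vol: "HaystackVolume", album: dict[int, bytes]) -> None:
    vol.f.seek(0, os.SEEK_END)
    base = vol.f.tell()
    buf = bytearray()
    for key, needle in album.items():
        vol.index[key] = (base + len(buf), len(needle))
        buf += needle
    vol.f.write(buf)    # one large sequential write for the whole album
    vol.f.flush()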

Evaluation – Data

Evaluation – Production Workload

Advantages
- Simple design
- Fewer disk operations, achieved by reducing the average metadata per photo
- Robust enough to handle a very large amount of data
- Fault tolerant

Critique
This approach seems very Facebook-specific. Any other critiques?

Conclusion
Facebook built a simple but robust storage system for its photos that accommodates the long tail of photo requests, which previous approaches could not serve efficiently.

References
D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel. Finding a needle in Haystack: Facebook's photo storage. OSDI 2010. http://www.usenix.org/event/osdi10/tech/
http://www.cs.uiuc.edu/class/sp11/cs525/sched.htm