Finding a Needle in Haystack: Facebook's Photo Storage

Presentation transcript:

Finding a Needle in Haystack: Facebook's Photo Storage
Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, Peter Vajgel
Presented by: Pallavi Kasula

Photos @ Facebook
- April 2009: 15 billion photos (60 billion images, 1.5 petabytes); upload rate 220 million photos/week (25 terabytes); serving rate 550,000 images/sec
- October 2010: 65 billion photos (260 billion images, 20 petabytes); upload rate 1 billion photos/week (60 terabytes); serving rate 1 million images/sec
- August 2012: upload rate 2 billion photos/week*
*http://news.cnet.com/8301-1023_3-57498531-93/facebook-processes-more-than-500-tb-of-data-daily/

Goals
- High throughput and low latency
- Fault-tolerant
- Cost-effective
- Simple

Typical Design

NFS-based Design

NFS-based Design
- Typical website: small working set, infrequent access to old content, ~99% CDN hit rate
- Facebook: large working set, frequent access to old content, ~80% CDN hit rate

NFS-based Design
- Metadata bottleneck: each image is stored as its own file, and the large metadata size severely limits the metadata hit ratio
- Image read performance:
  - ~10 IOPS per image read (large directories, thousands of files)
  - 3 IOPS per image read (smaller directories, hundreds of files)
  - 2.5 IOPS per image read (with a file handle cache)

NFS-based Design

Haystack-based Design
Photo URL format: http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>
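
The transcript only gives the URL template, so here is a minimal sketch of how such a URL could be assembled; the class and field names (PhotoURL, cdn_host, cache_host, ...) are illustrative, not Haystack's actual implementation.

```python
# Minimal sketch of building a Haystack photo URL from its components.
# Names are hypothetical; the template is
# http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>.

from dataclasses import dataclass

@dataclass
class PhotoURL:
    cdn_host: str
    cache_host: str
    machine_id: int
    logical_volume: int
    photo_id: int

    def render(self) -> str:
        return (f"http://{self.cdn_host}/{self.cache_host}/"
                f"{self.machine_id}/{self.logical_volume},{self.photo_id}")

print(PhotoURL("cdn.example.com", "cache42", 7, 113, 9876543210).render())
# -> http://cdn.example.com/cache42/7/113,9876543210
```

When the Directory decides a request should bypass the CDN, the <CDN> portion is omitted and the browser contacts the Cache directly.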

Store Organization
- A Store machine's capacity is organized into physical volumes, e.g. 10 TB of capacity split into 100 physical volumes of 100 GB each.
- Physical volumes on different machines are grouped into logical volumes; a photo stored on a logical volume is written to all of its physical volumes.

Photo upload

Haystack Directory

Haystack Directory
- Maintains the logical-to-physical volume mapping
- Load balancing: writes across logical volumes, reads across physical volumes
- Decides whether a request is handled by the CDN or by the Cache
- Identifies and marks logical volumes as read-only
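
A minimal sketch of what the Directory tracks, assuming a toy in-memory model; the class and method names (Directory, pick_for_write, ...) are hypothetical, not the paper's API.

```python
# Illustrative model of the Haystack Directory: it maps logical volumes to
# their physical volume replicas and balances writes and reads across them.

import random

class Directory:
    def __init__(self):
        # logical volume id -> list of (machine_id, physical_volume_id) replicas
        self.logical_to_physical = {}
        self.writable = set()      # logical volumes still accepting writes

    def add_logical_volume(self, lvid, replicas, writable=True):
        self.logical_to_physical[lvid] = list(replicas)
        if writable:
            self.writable.add(lvid)

    def pick_for_write(self):
        # Balance writes across write-enabled logical volumes.
        return random.choice(sorted(self.writable))

    def pick_for_read(self, lvid):
        # Balance reads across the physical volume replicas.
        return random.choice(self.logical_to_physical[lvid])

    def mark_read_only(self, lvid):
        self.writable.discard(lvid)

d = Directory()
d.add_logical_volume(113, [(1, "v1"), (2, "v7"), (3, "v3")])
print(d.pick_for_write(), d.pick_for_read(113))
```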

Haystack Cache

Haystack Cache
- Distributed hash table with the photo id as key
- Receives HTTP requests from CDNs and browsers
- Caches a photo only if both of the following conditions are met:
  - The request comes directly from a browser, not from a CDN
  - The photo is fetched from a write-enabled Store machine
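
A sketch of the Cache's admission rule from the two conditions above; the function and parameter names are hypothetical, and the plain dict stands in for the distributed hash table.

```python
# A photo is cached only when (1) the request came straight from a browser
# (a CDN would cache it itself) and (2) the serving Store machine is still
# write-enabled (recently written photos are the ones read most often).

def should_cache(request_from_cdn: bool, store_is_write_enabled: bool) -> bool:
    return (not request_from_cdn) and store_is_write_enabled

cache = {}  # photo_id -> photo bytes, standing in for the distributed hash table

def handle_read(photo_id, fetch_from_store, request_from_cdn):
    # fetch_from_store is a hypothetical callable returning
    # (photo bytes, whether the Store machine is write-enabled).
    if photo_id in cache:
        return cache[photo_id]
    data, store_is_write_enabled = fetch_from_store(photo_id)
    if should_cache(request_from_cdn, store_is_write_enabled):
        cache[photo_id] = data
    return data
```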

Haystack Store

Haystack Store
- Each Store machine manages multiple physical volumes
- Each physical volume is represented as a single large file, saved as "/hay/haystack_<logical volume id>"
- Keeps an in-memory mapping from photo ids to filesystem metadata (file, offset, size, etc.)
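
A minimal sketch of that in-memory mapping; the record and field names (NeedleInfo, flags, offset, size) are illustrative, chosen to match the metadata the slide lists.

```python
# For each photo (needle) the Store keeps only enough metadata to locate it
# with a single disk read: flags, byte offset in the volume file, and size.

from dataclasses import dataclass

@dataclass
class NeedleInfo:
    flags: int    # e.g. a deleted marker
    offset: int   # byte offset of the needle in the volume file
    size: int     # size of the needle in bytes

# One mapping per physical volume: (key, alternate key) -> NeedleInfo
volume_index: dict[tuple[int, int], NeedleInfo] = {}

volume_index[(9876543210, 1)] = NeedleInfo(flags=0, offset=4096, size=68_123)
```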

Layout of Haystack Store file
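
The slide's figure is not reproduced in the transcript. As a rough stand-in, the sketch below serializes a needle with the fields the paper describes (header magic number, cookie, key, alternate key, flags, size, data, footer magic number, data checksum, padding); the field widths and magic values here are illustrative, not Facebook's constants.

```python
# Rough sketch of a needle's on-disk layout (field widths illustrative).
import struct
import zlib

HEADER_MAGIC = 0xFACE0001   # placeholder values
FOOTER_MAGIC = 0xFACE0002

def pack_needle(cookie: int, key: int, alt_key: int, flags: int, data: bytes) -> bytes:
    header = struct.pack(">IQQIIQ", HEADER_MAGIC, cookie, key, alt_key, flags, len(data))
    footer = struct.pack(">II", FOOTER_MAGIC, zlib.crc32(data))
    needle = header + data + footer
    # Pad to an 8-byte boundary so needles stay aligned in the volume file.
    needle += b"\x00" * (-len(needle) % 8)
    return needle
```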

Haystack Store Read
- The Cache machine supplies the logical volume id, key, alternate key, and cookie to the Store machine
- The Store machine looks up the relevant metadata in its in-memory mapping
- Seeks to the appropriate offset in the volume file and reads the entire needle
- Verifies the cookie and the integrity of the data
- Returns the photo data to the Cache
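
A sketch of that read path; the helper names are hypothetical and the needle parsing is simplified (compare the layout sketch above). The point is that one in-memory lookup yields the offset and size, so the photo is served with a single disk read.

```python
import zlib

DELETED_FLAG = 0x1

def unpack_needle(raw: bytes):
    # Simplified stand-in for real needle parsing: an 8-byte cookie,
    # a 4-byte checksum, then the photo bytes.
    cookie = int.from_bytes(raw[:8], "big")
    checksum = int.from_bytes(raw[8:12], "big")
    return cookie, checksum, raw[12:]

def read_photo(volume_file, volume_index, key, alt_key, cookie):
    info = volume_index.get((key, alt_key))        # NeedleInfo-style record
    if info is None or (info.flags & DELETED_FLAG):
        return None                                # unknown or deleted photo
    volume_file.seek(info.offset)
    stored_cookie, checksum, data = unpack_needle(volume_file.read(info.size))
    if stored_cookie != cookie:
        return None                                # cookie mismatch: reject guessed URLs
    if zlib.crc32(data) != checksum:
        return None                                # corrupted needle
    return data
```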

Haystack Store Write
- The web server provides the logical volume id, key, alternate key, cookie, and photo data to the Store machines
- Each Store machine synchronously appends the needle to its physical volume file
- Updates its in-memory mapping as needed
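
A sketch of the write path with hypothetical names: the volume file is append-only, and the in-memory mapping is updated to point at the new offset and size.

```python
import os
from types import SimpleNamespace

def write_photo(volume_file, volume_index, key, alt_key, needle_bytes):
    volume_file.seek(0, os.SEEK_END)               # append-only volume file
    offset = volume_file.tell()
    volume_file.write(needle_bytes)
    volume_file.flush()
    os.fsync(volume_file.fileno())                 # standing in for "synchronously append"
    # Record mirrors the NeedleInfo mapping sketched above.
    volume_index[(key, alt_key)] = SimpleNamespace(
        flags=0, offset=offset, size=len(needle_bytes))
    return offset
```

Because modifications are also appends, a re-uploaded photo simply gets a needle at a higher offset, and the needle with the highest offset is treated as the current version.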

Haystack Store Delete
- The Store machine sets the delete flag in both the in-memory mapping and the volume file
- The space occupied by deleted needles is not reclaimed immediately; compaction reclaims it
- This matters because about 25% of photos get deleted in a given year
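
A sketch of delete and compaction with hypothetical names and I/O helpers: delete only flips a flag, while compaction later copies live needles into a fresh volume file, skipping deleted (and superseded duplicate) ones, and then swaps the files.

```python
DELETED_FLAG = 0x1

def delete_photo(volume_index, key, alt_key):
    info = volume_index.get((key, alt_key))
    if info is not None:
        info.flags |= DELETED_FLAG     # the flag is also set in the needle on disk (not shown)

def compact(volume_index, read_needle, append_needle):
    # read_needle / append_needle are hypothetical I/O helpers for the old
    # and new volume files.
    new_index = {}
    for (key, alt_key), info in volume_index.items():
        if info.flags & DELETED_FLAG:
            continue                                        # drop deleted needles
        needle = read_needle(info)                          # copy the live needle ...
        new_index[(key, alt_key)] = append_needle(needle)   # ... into the new file
    return new_index                                        # swapped in atomically afterwards
```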

Layout of Haystack Index file

Haystack Index File
- Provides the minimal metadata required to locate each needle in the store file
- Main purpose: allow quick loading of the needle metadata into memory without traversing the much larger Haystack store file
- The index is usually less than 1% of the size of the store file
- Problem: the index is updated asynchronously, so it can be stale in two ways: recently written needles may have no index record, and index records are not updated when needles are deleted
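
A sketch of rebuilding the in-memory mapping from an index file, using a hypothetical fixed-width record format (key, alternate key, flags, offset, size); the field widths are illustrative. Reconciling orphaned needles at the tail of the store file and stale delete flags is handled separately and not shown.

```python
import struct

INDEX_RECORD = struct.Struct(">QIIQI")   # key, alt_key, flags, offset, size (widths illustrative)

def load_index(index_path):
    mapping = {}
    with open(index_path, "rb") as f:
        while chunk := f.read(INDEX_RECORD.size):
            if len(chunk) < INDEX_RECORD.size:
                break                                   # truncated trailing record
            key, alt_key, flags, offset, size = INDEX_RECORD.unpack(chunk)
            mapping[(key, alt_key)] = (flags, offset, size)
    return mapping
```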

File System
- The Store uses XFS, an extent-based file system
- Advantages:
  - Block maps for several contiguous large files are small enough to be kept in main memory
  - Efficient file preallocation, which mitigates fragmentation
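
As a general illustration of preallocation (not Facebook's code), a volume file can be reserved at its full size up front so the file system allocates large contiguous extents; the path and size below are made up.

```python
import os

def preallocate_volume(path: str, size_bytes: int) -> None:
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.posix_fallocate(fd, 0, size_bytes)   # Unix-only; reserves the extents on disk
    finally:
        os.close(fd)

# e.g. a 100 GB physical volume file
# preallocate_volume("/hay/haystack_113", 100 * 1024**3)
```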

Failure Recovery
- Pitchfork: a background task that periodically checks the health of each Store machine
- If a health check fails, the logical volumes on that Store are marked read-only
- Underlying failures are addressed manually, with operations such as bulk syncs
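
A sketch of a pitchfork-style checker with hypothetical names: probe each Store machine periodically and, on failure, mark its logical volumes read-only in the Directory, leaving the actual repair to an operator.

```python
import time

def pitchfork(stores, directory, probe, interval_s=30, rounds=1):
    # probe is a hypothetical callable, e.g. testing connectivity and a trial read.
    for _ in range(rounds):                 # in production this would loop forever
        for store in stores:
            if not probe(store):
                for lvid in store.logical_volumes:
                    directory.mark_read_only(lvid)
        time.sleep(interval_s)
```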

Optimizations
- Compaction: reclaims space held by deleted and duplicate needles
- Saving more memory:
  - The in-memory delete flag is replaced by setting the offset to 0
  - The cookie value is not kept in memory
- Batch uploads (large sequential writes perform better than many small writes)
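
A sketch of the memory optimization with hypothetical names: the in-memory entry keeps only offset and size, an offset of 0 marks a deleted photo (the volume file begins with a superblock, so 0 is never a valid needle offset), and the cookie is read from disk rather than kept in memory.

```python
# (key, alternate key) -> (offset, size); offset 0 means "deleted"
compact_index: dict[tuple[int, int], tuple[int, int]] = {}

def mark_deleted(key, alt_key):
    entry = compact_index.get((key, alt_key))
    if entry is not None:
        compact_index[(key, alt_key)] = (0, entry[1])

def is_deleted(key, alt_key):
    offset, _ = compact_index.get((key, alt_key), (0, 0))
    return offset == 0
```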

Evaluation
- Characterize the photo requests seen by Facebook
- Effectiveness of the Directory and the Cache
- Store performance

Photo Requests

Traffic Volume

Haystack Directory

Haystack Cache hit rate

Benchmark Performance

Production Data

Production Data

Conclusion
- Haystack is a simple and effective storage system
- Optimized for random reads
- Built on cheap commodity storage
- Fault-tolerant
- Incrementally scalable

Q&A Thanks