Finding a needle in Haystack: Facebook’s photo storage OSDI 2010


Finding a needle in Haystack: Facebook’s photo storage (OSDI 2010)
Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, Peter Vajgel (Facebook.com)
An overview presented in CSE-291 (Storage Systems), Spring 2017, by Gregory Kesden

Overview
- Facebook is the world's largest photo-sharing site. As of 2010:
  - 260 billion images, 20 petabytes of data
  - 1 billion photos (60 terabytes) uploaded each week
  - Over 1 million images per second served at peak
- News feed and albums account for 98% of photo requests
- Images are written once, read often, never modified, and rarely deleted

Why Not Use a Traditional Filesystem?
- Photos don't need most filesystem metadata (directory tree, owner, group, etc.)
- That metadata wastes space and, more importantly, slows down every access that has to read, check, and use it; it turned out to be the bottleneck
- The per-access cost seems small, but it is multiplied an enormous number of times
- Think about the steps in a read: map the name to an inode (name cache, directory files), read the inode, then read the data (see the sketch below)
- The latency is in the disk accesses themselves, not in the tiny data transfer
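A minimal sketch of that conventional read path, using a hypothetical local POSIX path; each commented step can cost a disk access when the corresponding cache misses:

    def read_photo_posix(path: str) -> bytes:
        # 1. Path resolution: each component is looked up in its parent directory to
        #    find an inode (name-cache misses turn into directory reads from disk).
        # 2. The inode is read to locate the file's data blocks (another potential disk access).
        # 3. Finally the data blocks themselves are read.
        with open(path, "rb") as f:
            return f.read()

    # photo = read_photo_posix("/photos/album123/photo456.jpg")   # hypothetical path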

Haystack Goals
- High throughput, low latency
- Fault tolerance
- Cost-effective (it is hugely scaled, after all)
- Simple (a simple design matures to robustness quickly and keeps operational cost low)

Typical Design

“Original” Facebook NFS-based Design

Lessons Learned from "Original"
- CDNs serve the "hottest" photos, e.g. profile pictures, but don't help with the "long tail" of requests for older photos that a site of this volume generates
  - That tail is a significant amount of traffic, and it hits the backing store
  - There are too many candidate photos, each requested too rarely, to keep them in any in-memory cache, not just the CDN
- Surprising complexity: directories holding thousands of images ran into metadata inefficiencies in the NAS, roughly 10 disk accesses per image
  - Even after optimizing to hundreds of images per directory, a read still took 3 accesses: directory metadata, inode, file data

Why Go Custom?
- The NFS-based design needed a better RAM-to-disk ratio to cache effectively
- That ratio was unachievable: it would have required far too much RAM
- Instead, the demand for RAM had to be reduced by shrinking the metadata itself

Reality and Goal
- Reality: can't keep all files in memory, or even enough of them to cover the long tail
- Achievable goal: shrink the metadata so it fits in memory
- Result: 1 disk access per photo, and that access fetches the photo itself, not metadata (see the back-of-envelope estimate below)
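A rough back-of-envelope estimate of why this works, using the paper's figure of roughly 10 bytes of in-memory metadata per photo; the average photo size and per-machine disk capacity below are assumptions chosen only for illustration:

    METADATA_BYTES_PER_PHOTO = 10           # approximate in-memory cost per needle (from the paper)
    AVG_PHOTO_BYTES = 80 * 1024             # assumed average stored size per image (~80 KB)
    DISK_PER_MACHINE_BYTES = 10 * 10**12    # assumed ~10 TB of photo storage per store machine

    photos_per_machine = DISK_PER_MACHINE_BYTES // AVG_PHOTO_BYTES
    ram_for_index = photos_per_machine * METADATA_BYTES_PER_PHOTO
    print(f"{photos_per_machine:,} photos need {ram_for_index / 2**30:.1f} GiB of index RAM")
    # Roughly a gigabyte of RAM indexes many terabytes of photos, which is why every
    # read can go straight to the right volume offset with a single disk access.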

Haystack’s Design

Photo URL
- http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>
- The URL itself specifies the steps for retrieving the photo:
  - The CDN looks up <Logical volume, Photo>. If it hits, great. If not, the CDN strips the <CDN> component and asks the Cache.
  - If the Cache hits, great. If not, the Cache strips the <Cache> component and asks the back-end Haystack Store machine.
- If the request does not go through a CDN, it simply starts at the Cache step (see the sketch below)
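A minimal sketch of how such a URL could be taken apart as the request moves down the pipeline; the example URL format and field encodings are placeholders, not Facebook's actual identifiers:

    from typing import NamedTuple

    class PhotoURL(NamedTuple):
        cdn: str
        cache: str
        machine_id: str
        logical_volume: int
        photo_key: int

    def parse_photo_url(url: str) -> PhotoURL:
        # e.g. "http://cdn.example.com/cache7/machine42/volume_101/photo_9999" (hypothetical)
        _, _, cdn, cache, machine, volume, photo = url.split("/")
        return PhotoURL(cdn, cache, machine,
                        int(volume.split("_")[1]), int(photo.split("_")[1]))

    def strip_cdn(u: PhotoURL) -> tuple:
        # On a CDN miss, the <CDN> component is dropped and the rest goes to the Cache.
        return (u.cache, u.machine_id, u.logical_volume, u.photo_key)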

Photo Upload
- The request goes to a web server
- The web server requests a write-enabled logical volume from the Haystack Directory
- The web server assigns a unique ID to the photo and uploads it to each physical volume associated with that logical volume (a sketch of this flow follows)
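A sketch of the web server's side of an upload; the directory argument is assumed to behave like the Directory sketch after the next slide, and store_write is a hypothetical stand-in for the real RPC to a store machine:

    import itertools

    _next_photo_id = itertools.count(1)        # stand-in for the real unique-id assignment

    def store_write(machine: str, logical_id: int, photo_id: int,
                    cookie: int, data: bytes) -> None:
        # Hypothetical RPC to one store machine; the store-side append is sketched later.
        pass

    def upload_photo(directory, data: bytes, cookie: int) -> tuple[int, int]:
        logical_id = directory.pick_write_volume()                  # 1. get a write-enabled logical volume
        photo_id = next(_next_photo_id)                             # 2. assign a unique photo id
        for machine in directory.logical_to_physical[logical_id]:   # 3. write every physical replica
            store_write(machine, logical_id, photo_id, cookie, data)
        return logical_id, photo_id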

Haystack Directory
Functions:
- Maps each logical volume to its physical volumes
- Load-balances writes across logical volumes and reads across physical volumes
- Determines whether a request should be handled by the CDN or by the Cache
- Marks volumes read-only, when full or for operational reasons (machine-level granularity)
- Removes failed physical volumes and replaces them with new ones
Implemented as a replicated database fronted by memcache (a mapping sketch follows)
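A toy sketch of the Directory's core mapping and volume selection; the data layout and method names are assumptions, not the real schema:

    import random

    class Directory:
        def __init__(self) -> None:
            # logical volume id -> physical volumes, each named "machine:volume" (made-up format)
            self.logical_to_physical: dict[int, list[str]] = {}
            self.write_enabled: set[int] = set()

        def pick_write_volume(self) -> int:
            # Load-balance writes by choosing among the write-enabled logical volumes.
            return random.choice(sorted(self.write_enabled))

        def pick_read_replica(self, logical_id: int) -> str:
            # Load-balance reads across the physical replicas of one logical volume.
            return random.choice(self.logical_to_physical[logical_id])

        def mark_read_only(self, logical_id: int) -> None:
            # Called when the volume fills up or for operational reasons.
            self.write_enabled.discard(logical_id)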

Haystack Cache
- A distributed hash table with the photo ID as the key
- Caches a photo iff:
  - The request comes from an end user, not from a CDN (the CDN is much bigger than the Cache; a miss there is unlikely to hit in the smaller Cache)
  - The photo was read from a write-enabled volume (volumes perform better when doing only reads or only writes rather than a mix, so sheltering reads lets those volumes focus on writes; once a volume is full there are no more writes and no need to shelter)
- Newly uploaded files could be pushed proactively into the Cache (see the sketch below)
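A minimal sketch of that caching rule; the parameter names are placeholders:

    def should_cache(request_from_cdn: bool, volume_write_enabled: bool) -> bool:
        # Cache only direct (non-CDN) requests: if the much larger CDN already missed,
        # the smaller Cache is unlikely to help those requests anyway.
        # Cache only photos on write-enabled volumes: sheltering their reads
        # lets those volumes concentrate on (sequential) writes.
        return (not request_from_cdn) and volume_write_enabled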

Haystack Store
- To serve a read, the Store needs the logical volume id and the offset (and size) of the photo
- Given the photo id, this lookup must be quick, with no disk operations
- Keeps open file descriptors for each physical volume (a preloaded fd cache)
- Keeps an in-memory mapping from photo ids to filesystem metadata (file, offset, size)
- A needle represents one file stored within Haystack
- The in-memory mapping goes from <photo id, type (size)> to <flags, size, offset> (see the sketch below)
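A minimal sketch of that in-memory index; the field names mirror the slide, while the alternate key standing in for "type (size)" and the deleted-flag bit are assumptions:

    from dataclasses import dataclass

    @dataclass
    class NeedleMeta:
        flags: int     # e.g. a deleted bit
        size: int      # size of the photo data in bytes
        offset: int    # byte offset of the needle within the volume file

    DELETED_FLAG = 0x1                        # assumed bit layout

    # One dict per physical volume: (photo id, alternate key) -> metadata.
    # The alternate key encodes the stored "type (size)", e.g. thumbnail vs. large.
    volume_index: dict[tuple[int, int], NeedleMeta] = {}

    def lookup(photo_id: int, alt_key: int) -> NeedleMeta | None:
        # Pure in-memory lookup: no disk access is needed to find where the photo lives.
        meta = volume_index.get((photo_id, alt_key))
        if meta is None or meta.flags & DELETED_FLAG:
            return None
        return meta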

Needle and Store File
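The figure on this slide shows the on-disk layout. A rough sketch of one needle record, with field names following the paper's description; the exact field widths, magic values, and 8-byte alignment used here are assumptions:

    import struct
    import zlib

    # One needle: [header magic | cookie | key | alt key | flags | size] [data] [footer magic | checksum] [padding]
    NEEDLE_HEADER = struct.Struct("<IQQIBQ")                 # assumed widths, little-endian
    HEADER_MAGIC, FOOTER_MAGIC = 0xFACEB00C, 0xC00BECAF      # made-up magic numbers

    def pack_needle(cookie: int, key: int, alt_key: int, flags: int, data: bytes) -> bytes:
        header = NEEDLE_HEADER.pack(HEADER_MAGIC, cookie, key, alt_key, flags, len(data))
        footer = struct.pack("<II", FOOTER_MAGIC, zlib.crc32(data))
        body = header + data + footer
        return body + b"\x00" * (-len(body) % 8)             # pad each needle to an 8-byte boundary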

Reads from Store
- The Cache machine requests <logical volume id, key, alternate key, cookie>
  - The cookie is a random number assigned and maintained by the Directory at upload time; it prevents brute-force lookups by guessing photo ids
- The Store machine looks this up in its in-memory metadata and checks that the photo is not deleted
- It seeks to the offset in the volume file and reads the entire needle from disk
- It verifies the cookie and the data integrity, then returns the photo to the Cache (see the sketch below)
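A sketch of that read path, reusing the hypothetical in-memory index and needle layout from the two sketches above:

    def read_photo(volume_file, photo_id: int, alt_key: int, cookie: int) -> bytes:
        meta = lookup(photo_id, alt_key)               # in-memory only, as in the Store sketch
        if meta is None:
            raise FileNotFoundError("unknown or deleted photo")
        volume_file.seek(meta.offset)                  # one seek...
        needle = volume_file.read(NEEDLE_HEADER.size + meta.size + 16)   # ...one sequential read
        magic, stored_cookie, key, stored_alt, flags, size = NEEDLE_HEADER.unpack_from(needle)
        if magic != HEADER_MAGIC or stored_cookie != cookie:
            raise PermissionError("cookie mismatch")   # foils brute-force guessing of photo ids
        # A real store would also verify the footer checksum before returning the data.
        return needle[NEEDLE_HEADER.size:NEEDLE_HEADER.size + size]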

Photo Write
- The web server provides <logical volume id, key, type (size), cookie, data> to the store machines (all of the machines associated with that logical volume)
- Each store machine synchronously appends the needle to its physical volume and updates its in-memory mapping
- The append-only design is what keeps writes fast: they are large and sequential
- If a file is "updated", e.g. rotated, the existing needle can't be modified; a new needle must be appended. If multiple needles exist for the same key, the one at the greatest offset wins. (If the update lands on a different logical volume, the Directory updates its mapping.) See the sketch below.
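A sketch of the store-side append, continuing the hypothetical layout and index from the earlier sketches:

    def write_photo(volume_file, photo_id: int, alt_key: int, cookie: int, data: bytes) -> None:
        volume_file.seek(0, 2)                         # 2 = SEEK_END: all writes are appends
        offset = volume_file.tell()
        volume_file.write(pack_needle(cookie, photo_id, alt_key, flags=0, data=data))
        volume_file.flush()                            # the real store syncs before acknowledging
        # Update the in-memory index last; a later needle for the same key simply
        # overwrites this entry, which is how "greatest offset wins" falls out.
        volume_index[(photo_id, alt_key)] = NeedleMeta(flags=0, size=len(data), offset=offset)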

Delete
- Deletion just sets a delete flag, in the in-memory mapping and in the needle in the volume file (long live those party photos!)
- The space is reclaimed later, during compaction

Index File
- An asynchronously updated checkpoint of the in-memory data structures, for use in the event of a reboot
- The index could be reconstructed from the volume files alone, but far too much data would have to be crunched
- Because updates are asynchronous, the index file can be missing recent needles and/or delete flags
- Upon reboot (see the sketch below):
  - Load the checkpoint
  - Find the last needle it covers
  - Add the needles after that point by scanning the volume file, then bring the checkpoint up to date
- Store machines re-verify the deleted flag after reading a needle from storage, in case the index file was stale
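A sketch of that recovery scan, continuing the hypothetical needle layout from the earlier sketches; the checkpoint here is just a dict of the same (photo id, alternate key) entries:

    def rebuild_index(volume_file, checkpoint: dict) -> dict:
        # Rebuild the in-memory index after a reboot from a possibly stale checkpoint.
        index = dict(checkpoint)
        offset = max((m.offset + NEEDLE_HEADER.size + m.size + 8 for m in index.values()), default=0)
        offset += -offset % 8                          # resume at the next 8-byte needle boundary
        volume_file.seek(0, 2)
        end = volume_file.tell()
        while offset < end:
            volume_file.seek(offset)
            header = volume_file.read(NEEDLE_HEADER.size)
            if len(header) < NEEDLE_HEADER.size:
                break                                  # torn tail after a crash: stop scanning
            magic, cookie, key, alt_key, flags, size = NEEDLE_HEADER.unpack(header)
            if magic != HEADER_MAGIC:
                break
            index[(key, alt_key)] = NeedleMeta(flags=flags, size=size, offset=offset)
            offset += NEEDLE_HEADER.size + size + 8    # header + data + footer...
            offset += -offset % 8                      # ...rounded up to the next needle
        return index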

Host Filesystem
- Store machines should use a filesystem that requires little memory to perform random seeks within a large file
- E.g., compact blockmaps rather than B-trees for the logical-to-physical block mapping (Haystack uses XFS)

Recovering From Failure
- Failure detection: proactively test the Stores. Is the machine connected? Is each volume available? Is each volume readable?
- On failure: mark the logical volumes on that store read-only and fix the underlying problem
- In the worst case, copy the data over from other replicas (slow: on the order of hours); a health-check sketch follows
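A simplified sketch of such health checks; the split between a remote connectivity probe and a local volume probe, and all names here, are assumptions:

    import socket

    def store_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
        # Remote connectivity check run by the periodic health checker.
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def volumes_readable(volume_paths: list) -> bool:
        # Run on the store machine itself: is each volume file present and readable?
        try:
            for path in volume_paths:
                with open(path, "rb") as f:
                    f.read(4096)                       # small probe read
            return True
        except OSError:
            return False

    # On any failed check, the checker would ask the Directory to mark the affected
    # logical volumes read-only until the machine is repaired or replaced.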

Optimizations
- Compaction (sketched below)
  - Copies live needles into a new file, ignoring deleted ones, then locks and atomically swaps the files
  - Worthwhile: about 25% of photos are deleted over the course of a year, and deletions skew toward recent photos
- Batch uploads
  - Such as when whole albums are uploaded
  - Improves performance via large, sequential writes
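A sketch of compaction under the same hypothetical needle layout as above; locking and the index handoff are only hinted at:

    import os

    def compact_volume(old_path: str, new_path: str, index: dict) -> dict:
        # Copy only live needles into a fresh volume file, building a fresh index as we go.
        # The index already keeps only the latest offset per key, so duplicates are skipped too.
        new_index: dict = {}
        with open(old_path, "rb") as old, open(new_path, "wb") as new:
            for (photo_id, alt_key), meta in sorted(index.items(), key=lambda kv: kv[1].offset):
                if meta.flags & DELETED_FLAG:          # skip deleted photos: this reclaims their space
                    continue
                total = NEEDLE_HEADER.size + meta.size + 8   # header + data + footer
                total += -total % 8                          # include alignment padding
                old.seek(meta.offset)
                new_offset = new.tell()
                new.write(old.read(total))
                new_index[(photo_id, alt_key)] = NeedleMeta(flags=0, size=meta.size, offset=new_offset)
        # The real store locks the volume briefly and swaps the files atomically.
        os.replace(new_path, old_path)
        return new_index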

Cache Hit Rate

Why Did We Talk About Haystack?
- It has all the fun bits: classical browser / web server / Directory / CDN / Cache / Store layering
- It is a simple, easy example, by design (simple means it matures to robustness fast)
- It optimizes for its use cases
- It illustrates the memory vs. disk trade-off, with a focus on fitting metadata into memory
- It surfaces real-world storage concerns