Beyond the File System – Designing Large Scale File Storage and Serving – Cal Henderson (Web 2.0 Expo, 17 April)

Hello!

Big file systems? Too vague! What is a file system? What constitutes big? Some requirements would be nice

1. Scalable – Looking at storage and serving infrastructures

2. Reliable – Looking at redundancy, failure rates, on the fly changes

3. Cheap – Looking at upfront costs, TCO and lifetimes

Four buckets: Storage, Serving, BCP, Cost

Storage

The storage stack, from the file protocol down to the hardware: File protocol (NFS, CIFS, SMB), File system (ext, reiserFS, NTFS), Block protocol (SCSI, SATA, FC), RAID (Mirrors, Stripes), Hardware (Disks and stuff)

Hardware overview The storage scale, from the lower end to the higher end: Internal, DAS, SAN, NAS

Internal storage A disk in a computer –SCSI, IDE, SATA 4 disks in 1U is common, 8 for half depth boxes

DAS Direct attached storage Disk shelf, connected by SCSI/SATA HP MSA30 – 14 disks in 3U

SAN Storage Area Network Dumb disk shelves Clients connect via a ‘fabric’ Fibre Channel, iSCSI, Infiniband –Low level protocols

NAS Network Attached Storage Intelligent disk shelf Clients connect via a network NFS, SMB, CIFS –High level protocols

Of course, it’s more confusing than that

Meet the LUN Logical Unit Number A slice of storage space Originally for addressing a single drive: –c1t2d3 –Controller, Target, Disk (Slice) Now means a virtual partition/volume –LVM, Logical Volume Management

NAS vs SAN With a SAN, a single host (initiator) owns a single LUN/volume With NAS, multiple hosts own a single LUN/volume NAS head – NAS access to a SAN

SAN Advantages Virtualization within a SAN offers some nice features: Real-time LUN replication Transparent backup SAN booting for host replacement

Some Practical Examples There are a lot of vendors Configurations vary Prices vary wildly Let’s look at a couple –Ones I happen to have experience with –Not an endorsement ;)

NetApp Filers Heads and shelves, up to 500TB in 6 Cabs FC SAN with 1 or 2 NAS heads

Isilon IQ 2U Nodes, 3-96 nodes/cluster, TB FC/InfiniBand SAN with NAS head on each node

Scaling Vertical vs Horizontal

Vertical scaling Get a bigger box Bigger disk(s) More disks Limited by current tech – size of each disk and total number in appliance

Horizontal scaling Buy more boxes Add more servers/appliances Scales forever* *sort of

Storage scaling approaches Four common models: Huge FS Physical nodes Virtual nodes Chunked space

Huge FS Create one giant volume with growing space –Sun’s ZFS –Isilon IQ Expandable on-the-fly? Upper limits –Always limited somewhere

Huge FS Pluses –Simple from the application side –Logically simple –Low administrative overhead Minuses –All your eggs in one basket –Hard to expand –Has an upper limit

Physical nodes Application handles distribution to multiple physical nodes –Disks, Boxes, Appliances, whatever One ‘volume’ per node Each node acts by itself Expandable on-the-fly – add more nodes Scales forever

Physical Nodes Pluses –Limitless expansion –Easy to expand –Unlikely to all fail at once Minuses –Many ‘mounts’ to manage –More administration

Virtual nodes Application handles distribution to multiple virtual volumes, contained on multiple physical nodes Multiple volumes per node Flexible Expandable on-the-fly – add more nodes Scales forever

Virtual Nodes Pluses –Limitless expansion –Easy to expand –Unlikely to all fail at once –Addressing is logical, not physical –Flexible volume sizing, consolidation Minuses –Many ‘mounts’ to manage –More administration
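
To make the physical- and virtual-node models concrete, here is a minimal Python sketch of application-side placement; the host names, volume count and the hash-based placement policy are illustrative assumptions, not anything from the talk. The useful property is that files are addressed by (volume, key), so a volume can move to another host without changing any file addresses. Real systems more often pick a volume at write time (for example the emptiest one) and record it in metadata, as the Flickr example later does.

```python
import hashlib

# Hypothetical mapping of virtual volumes to physical storage hosts.
# Volumes can be moved between hosts later without changing file
# addresses, because files are addressed by (volume, key), not by host.
VOLUME_HOST = {
    1: "store01.example.com",
    2: "store01.example.com",
    3: "store02.example.com",
    4: "store02.example.com",
}

def volume_for_key(key: str, volume_count: int = 4) -> int:
    """Pick a virtual volume for a file (here: a stable hash of its key)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return (int(digest, 16) % volume_count) + 1

def url_for(key: str) -> str:
    """Resolve a stored file to a URL via the volume -> host mapping."""
    volume = volume_for_key(key)
    host = VOLUME_HOST[volume]
    return f"http://{host}/vol{volume}/{key}"

if __name__ == "__main__":
    print(url_for("photos/1234/5678.jpg"))
```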

Chunked space Storage layer writes parts of files to different physical nodes Like RAID striping, but at a higher level High performance for large files –read multiple parts simultaneously

Chunked space Pluses –High performance –Limitless size Minuses –Conceptually complex –Can be hard to expand on the fly –Can’t manually poke it
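
A sketch of the chunked-space idea under assumed parameters (a 64 MB chunk size and simple round-robin placement): the storage layer splits a file into fixed-size chunks and spreads them across nodes, which is what makes parallel reads of large files possible.

```python
from typing import BinaryIO, Iterator, List, Tuple

CHUNK_SIZE = 64 * 1024 * 1024  # assumed 64 MB chunks
NODES = ["node01", "node02", "node03", "node04"]  # hypothetical storage nodes

def split_into_chunks(stream: BinaryIO) -> Iterator[bytes]:
    """Yield fixed-size chunks from a file-like object."""
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        yield chunk

def place_chunks(stream: BinaryIO) -> List[Tuple[int, str]]:
    """Assign each chunk index to a node, round-robin (a stand-in for a
    real allocator that would balance by free space and load)."""
    placement = []
    for index, chunk in enumerate(split_into_chunks(stream)):
        node = NODES[index % len(NODES)]
        # In a real system the chunk bytes would be written to `node` here,
        # and the (file, index) -> node mapping recorded in metadata.
        placement.append((index, node))
    return placement
```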

Real Life Case Studies

GFS – Google File System Developed by … Google Proprietary Everything we know about it is based on talks they’ve given Designed to store huge files for fast access

GFS – Google File System Single ‘Master’ node holds metadata –SPF – Shadow master allows warm swap Grid of ‘chunkservers’ –64bit filenames –64 MB file chunks

GFS – Google File System [architecture diagram: a single Master node plus chunkservers holding replicated chunks such as 1(a), 1(b) and 2(a)]

GFS – Google File System Client reads metadata from master then file parts from multiple chunkservers Designed for big files (>100MB) Master server allocates access leases Replication is automatic and self repairing –Synchronously for atomicity

GFS – Google File System Reading is fast (parallelizable) –But requires a lease Master server is required for all reads and writes
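
The read path described above, as a schematic sketch. The `master.lookup` and `read_chunk` calls are hypothetical stand-ins for whatever RPC interface GFS actually uses; only the shape of the flow (metadata from the master, chunk data fetched from the chunkservers in parallel) comes from the slides.

```python
from concurrent.futures import ThreadPoolExecutor

def read_file(master, path: str) -> bytes:
    """Schematic GFS-style read: ask the master where the chunks live,
    then fetch the chunks directly from the chunkservers in parallel."""
    # Hypothetical call: returns [(chunk_handle, [replica_objects]), ...]
    chunk_locations = master.lookup(path)

    def fetch(entry):
        handle, replicas = entry
        # Read from the first replica; a real client would pick the
        # closest replica and fall back to others on failure.
        return replicas[0].read_chunk(handle)

    with ThreadPoolExecutor() as pool:
        chunks = list(pool.map(fetch, chunk_locations))
    return b"".join(chunks)
```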

MogileFS – OMG Files Developed by Danga / SixApart Open source Designed for scalable web app storage

MogileFS – OMG Files Single metadata store (MySQL) –MySQL Cluster avoids SPF Multiple ‘tracker’ nodes locate files Multiple ‘storage’ nodes store files

MogileFS – OMG Files [architecture diagram: tracker nodes backed by MySQL, in front of the storage nodes]

MogileFS – OMG Files Replication of file ‘classes’ happens transparently Storage nodes are not mirrored – replication is piecemeal Reading and writing go through trackers, but are performed directly upon storage nodes
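
A schematic of the tracker-mediated flow the slides describe: ask a tracker where a key lives, then fetch it straight from a storage node over HTTP. The `tracker.get_paths` call is a placeholder for the real MogileFS tracker protocol, not its actual client API.

```python
import random
import urllib.request

def fetch_via_tracker(tracker, key: str) -> bytes:
    """Schematic MogileFS-style read: the tracker returns one or more
    HTTP paths on storage nodes; the client then reads a node directly."""
    # Placeholder call: in reality the tracker speaks its own protocol
    # and consults the MySQL metadata store to locate replicas.
    paths = tracker.get_paths(key)
    url = random.choice(paths)
    with urllib.request.urlopen(url) as response:
        return response.read()
```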

Flickr File System Developed by Flickr Proprietary Designed for very large scalable web app storage

Flickr File System No metadata store –Deal with it yourself Multiple ‘StorageMaster’ nodes Multiple storage nodes with virtual volumes

Flickr File System [architecture diagram: StorageMaster (SM) nodes in front of the storage nodes]

Flickr File System Metadata stored by app –Just a virtual volume number –App chooses a path Virtual nodes are mirrored –Locally and remotely Reading is done directly from nodes

Flickr File System StorageMaster nodes only used for write operations Reading and writing can scale separately
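
A sketch of the write path these slides describe: get a virtual volume from a StorageMaster, write to its mirrors, and keep only the volume number and an app-chosen path in the application's own database. Every name and call here is illustrative; the real system is proprietary.

```python
def store_file(storage_master, db, photo_id: int, data: bytes) -> None:
    """Schematic Flickr-style write: the app keeps the metadata itself."""
    # Hypothetical call: returns a virtual volume number and the nodes
    # (local + remote mirror) currently hosting that volume.
    volume, nodes = storage_master.allocate_volume()

    path = f"{photo_id % 1000:03d}/{photo_id}.jpg"  # the app chooses the path
    for node in nodes:
        node.write(volume, path, data)  # placeholder storage-node call

    # Only the volume number and path go into the app's database;
    # later reads go straight to a node hosting that volume.
    db.execute(
        "UPDATE photos SET volume = %s, path = %s WHERE id = %s",
        (volume, path, photo_id),
    )
```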

Amazon S3 A big disk in the sky Multiple ‘buckets’ Files have user-defined keys Data + metadata

Amazon S3 [diagram: your servers pushing files to Amazon]

Amazon S3 [diagram: your servers push files to Amazon, which serves them to users]

The cost Fixed price, by the GB Store: $0.15 per GB per month Serve: $0.20 per GB

The cost [chart: S3 serving cost]

The cost [chart: S3 serving cost vs. regular bandwidth cost]

End costs ~$2k to store 1TB for a year ~$63 a month for 1Mbps ~$65k a month for 1Gbps
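
A quick check of those ballpark figures at the quoted prices ($0.15 per GB per month to store, $0.20 per GB to serve), assuming a 30-day month and ignoring request fees:

```python
STORE_PER_GB_MONTH = 0.15   # USD, the S3 storage price quoted above
SERVE_PER_GB = 0.20         # USD, the S3 transfer price quoted above
SECONDS_PER_MONTH = 60 * 60 * 24 * 30

def yearly_storage_cost(terabytes: float) -> float:
    return terabytes * 1000 * STORE_PER_GB_MONTH * 12

def monthly_serving_cost(megabits_per_second: float) -> float:
    gb_per_month = megabits_per_second / 8 / 1000 * SECONDS_PER_MONTH
    return gb_per_month * SERVE_PER_GB

print(yearly_storage_cost(1))      # ~1800  -> "~$2k to store 1TB for a year"
print(monthly_serving_cost(1))     # ~65    -> "~$63 a month for 1Mbps"
print(monthly_serving_cost(1000))  # ~65000 -> "~$65k a month for 1Gbps"
```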

Serving

Serving files Serving files is easy! [diagram: Apache -> Disk]

Serving files Scaling is harder [diagram: Apache -> Disk, Apache -> Disk, Apache -> Disk]

Serving files This doesn’t scale well Primary storage is expensive –And takes a lot of space In many systems, we only access a small number of files most of the time

Caching Insert caches between the storage and serving nodes Cache frequently accessed content to reduce reads on the storage nodes Software (Squid, mod_cache) Hardware (Netcache, Cacheflow)

Why it works Keep a smaller working set Use faster hardware –Lots of RAM –SCSI –Outer edge of disks (ZCAV) Use more duplicates –Cheaper, since they’re smaller

Two models Layer 4 –‘Simple’ balanced cache –Objects in multiple caches –Good for few objects requested many times Layer 7 –URL-balanced cache –Objects in a single cache –Good for many objects requested a few times
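
The layer 7 model depends on the balancer hashing the URL so that each object lives in exactly one cache. A minimal sketch of that routing decision (the cache host list is made up; a production setup would likely use consistent hashing so that adding a cache does not reshuffle everything):

```python
import hashlib

CACHES = ["cache01", "cache02", "cache03"]  # hypothetical cache hosts

def cache_for_url(url: str) -> str:
    """Layer 7 routing: hash the URL so a given object always goes to
    the same cache, instead of being duplicated across all of them."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return CACHES[int.from_bytes(digest, "big") % len(CACHES)]

# Layer 4 balancing, by contrast, picks a cache per connection
# (e.g. round-robin), so hot objects end up in every cache.
```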

Replacement policies LRU – Least recently used GDSF – Greedy dual size frequency LFUDA – Least frequently used with dynamic aging All have advantages and disadvantages Performance varies greatly with each
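
As a baseline for comparing those policies, a minimal LRU cache; GDSF and LFUDA additionally weight entries by size, retrieval cost and an aging factor, which is why their behaviour differs so much on web traffic.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal least-recently-used cache: evicts the entry that has gone
    unused the longest once capacity is exceeded."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()  # key -> cached bytes, in usage order

    def get(self, key: str):
        if key not in self.items:
            return None
        self.items.move_to_end(key)        # mark as recently used
        return self.items[key]

    def put(self, key: str, value: bytes) -> None:
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used
```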

Cache Churn How long do objects typically stay in cache? If it gets too short, we’re doing badly –But it depends on your traffic profile Make the cached object store larger

Problems Caching has some problems: –Invalidation is hard –Replacement is dumb (even LFUDA) Avoiding caching makes your life (somewhat) easier

CDN – Content Delivery Network Akamai, Savvis, Mirror Image Internet, etc Caches operated by other people –Already in-place –In lots of places GSLB/DNS balancing

Edge networks [diagram: a single origin]

Edge networks [diagram: the origin with caches pushed out to the edge]

CDN Models Simple model –You push content to them, they serve it Reverse proxy model –You publish content on an origin, they proxy and cache it

CDN Invalidation You don’t control the caches –Just like those awful ISP ones Once something is cached by a CDN, assume it can never change –Nothing can be deleted –Nothing can be modified

Versioning When you start to cache things, you need to care about versioning –Invalidation & Expiry –Naming & Sync

Cache Invalidation If you control the caches, invalidation is possible But remember ISP and client caches Remove deleted content explicitly –Avoid users finding old content –Save cache space

Cache versioning Simple rule of thumb: –If an item is modified, change its name (URL) This can be independent of the file system!

Virtual versioning Database indicates version 3 of file Web app writes version number into URL (example.com/foo_3.jpg) Request comes through cache and is cached with the versioned URL (cached: foo_3.jpg) mod_rewrite converts versioned URL to path (foo_3.jpg -> foo.jpg)
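
A sketch of that flow in Python: the app builds versioned URLs from the version stored in the database, and the serving layer strips the version back off to find the file on disk. The talk does this step with a mod_rewrite rule; the regular expression below just illustrates the same rewrite, it is not Flickr's actual configuration.

```python
import re

def versioned_url(base: str, name: str, version: int) -> str:
    """What the web app emits: example.com/foo_3.jpg for version 3 of foo.jpg."""
    stem, ext = name.rsplit(".", 1)
    return f"http://{base}/{stem}_{version}.{ext}"

VERSION_RE = re.compile(r"^(?P<stem>.+)_\d+\.(?P<ext>[^.]+)$")

def path_for(versioned_name: str) -> str:
    """What the rewrite layer does: foo_3.jpg -> foo.jpg on disk.
    (In Apache this would be a mod_rewrite rule applying the same regex.)"""
    match = VERSION_RE.match(versioned_name)
    if not match:
        return versioned_name
    return f"{match.group('stem')}.{match.group('ext')}"

assert versioned_url("example.com", "foo.jpg", 3) == "http://example.com/foo_3.jpg"
assert path_for("foo_3.jpg") == "foo.jpg"
```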

Authentication Authentication inline layer –Apache / perlbal Authentication sideline –ICP (CARP/HTCP) Authentication by URL –FlickrFS

Auth layer Authenticator sits between client and storage Typically built into the cache software [diagram: cache with an inline authenticator, origin behind]

Auth sideline Authenticator sits beside the cache Lightweight protocol used for authenticator [diagram: cache consulting a sideline authenticator, origin behind]

Auth by URL Someone else performs authentication and gives URLs to client (typically the web app) URLs hold the ‘keys’ for accessing files [diagram: web server, cache, origin]
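
One common way to implement "auth by URL" is for the web app to sign the URL and an expiry time with a secret shared with the serving layer, so caches and storage nodes can verify access without calling back into the app. A minimal HMAC sketch with a made-up secret and URL layout (not FlickrFS's actual scheme):

```python
import hashlib
import hmac
import time

SECRET = b"not-a-real-secret"  # shared between the web app and the edge

def sign(path: str, expires: int) -> str:
    msg = f"{path}:{expires}".encode("utf-8")
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def signed_url(host: str, path: str, ttl: int = 300) -> str:
    """Web app side: hand the client a URL that carries its own credentials."""
    expires = int(time.time()) + ttl
    return f"http://{host}{path}?expires={expires}&sig={sign(path, expires)}"

def is_valid(path: str, expires: int, sig: str) -> bool:
    """Cache/storage side: check expiry and signature, no app callback needed."""
    if expires < time.time():
        return False
    return hmac.compare_digest(sig, sign(path, expires))
```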

BCP

Business Continuity Planning How can I deal with the unexpected? –The core of BCP Redundancy Replication

Reality On a long enough timescale, anything that can fail, will fail Of course, everything can fail True reliability comes only through redundancy

Reality Define your own SLAs How long can you afford to be down? How manual is the recovery process? How far can you roll back? How many $node boxes can fail at once?

Failure scenarios Disk failure Storage array failure Storage head failure Fabric failure Metadata node failure Power outage Routing outage

Reliable by design RAID avoids disk failures, but not head or fabric failures Duplicated nodes avoid host and fabric failures, but not routing or power failures Dual-colo avoids routing and power failures, but may need duplication too

Tend to all points in the stack Going dual-colo: great Taking a whole colo offline because of a single failed disk: bad We need a combination of these

Recovery times BCP is not just about continuing when things fail How can we restore after they come back? Host and colo level syncing –replication queuing Host and colo level rebuilding

Reliable Reads & Writes Reliable reads are easy –2 or more copies of files Reliable writes are harder –Write 2 copies at once –But what do we do when we can’t write to one?

Dual writes Queue up data to be written –Where? –Needs itself to be reliable Queue up journal of changes –And then read data from the disk whose write succeeded Duplicate whole volume after failure –Slow!
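
A sketch of the dual-write-plus-journal idea above: write both copies, and when one side fails, record the miss in a reliable queue so a repair process can re-copy the data later. The `journal` and node objects are placeholders for whatever queueing and storage APIs are actually in use.

```python
def dual_write(nodes, journal, key: str, data: bytes) -> None:
    """Write to both mirrors; journal any copy that could not be written
    so it can be replayed once the failed node comes back."""
    failures = []
    for node in nodes:  # expected to be exactly two mirrors
        try:
            node.write(key, data)  # placeholder storage call
        except IOError:
            failures.append(node)

    if len(failures) == len(nodes):
        raise IOError("no copy of %r was written" % key)

    for node in failures:
        # The journal itself has to be reliable, which is the hard part.
        journal.enqueue({"node": node.name, "key": key})

def replay(journal, nodes_by_name, source) -> None:
    """Repair process: re-copy journalled keys from the surviving mirror."""
    for entry in journal.drain():
        data = source.read(entry["key"])
        nodes_by_name[entry["node"]].write(entry["key"], data)
```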

Cost

Judging cost Per GB? Per GB upfront and per year Not as simple as you’d hope –How about an example

Hardware costs (cost of hardware) / (usable GB) – single cost

Power costs (cost of power per year) / (usable GB) – recurring cost

Power costs (power installation cost) / (usable GB) – single cost

Space costs (cost per U x U’s needed, inc network) / (usable GB) – recurring cost

Network costs (cost of network gear) / (usable GB) – single cost

Misc costs (support contracts + spare disks + bus adaptors + cables) / (usable GB) – single & recurring costs

Human costs (admin cost per node x node count) / (usable GB) – recurring cost
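
Pulling the per-GB formulas above together into one sketch; every input is, of course, a number you have to supply yourself:

```python
def upfront_cost_per_gb(hardware, power_install, network_gear, misc_oneoff,
                        usable_gb):
    """One-off costs divided by usable capacity."""
    return (hardware + power_install + network_gear + misc_oneoff) / usable_gb

def recurring_cost_per_gb_per_year(power_per_year, cost_per_u, rack_units,
                                   misc_recurring, admin_cost_per_node,
                                   node_count, usable_gb):
    """Recurring costs (power, space, support, people) divided by capacity."""
    space = cost_per_u * rack_units            # U's needed, including network
    people = admin_cost_per_node * node_count
    return (power_per_year + space + misc_recurring + people) / usable_gb
```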

TCO Total cost of ownership in two parts –Upfront –Ongoing Architecture plays a huge part in costing –Don’t get tied to hardware –Allow heterogeneity –Move with the market

(fin)

Photo credits flickr.com/photos/ebright/ / flickr.com/photos/thomashawk/ / flickr.com/photos/tom-carden/ / flickr.com/photos/sillydog/ / flickr.com/photos/foreversouls/ / flickr.com/photos/julianb/324897/ flickr.com/photos/primejunta/ / flickr.com/photos/whatknot/ / flickr.com/photos/dcjohn/ /

You can find these slides online: iamcal.com/talks/