Beyond the File System: Designing Large Scale File Storage and Serving
Cal Henderson
Web 2.0 Expo, 17 April 2007
Hello!

Big file systems?
– Too vague!
– What is a file system?
– What constitutes big?
– Some requirements would be nice
1. Scalable
– Looking at storage and serving infrastructures

2. Reliable
– Looking at redundancy, failure rates, on-the-fly changes

3. Cheap
– Looking at upfront costs, TCO and lifetimes

Four buckets
– Storage
– Serving
– BCP
– Cost
Storage

The storage stack
– File protocol: NFS, CIFS, SMB
– File system: ext, reiserFS, NTFS
– Block protocol: SCSI, SATA, FC
– RAID: mirrors, stripes
– Hardware: disks and stuff

Hardware overview
The storage scale, from lower to higher:
Internal → DAS → SAN → NAS
Internal storage
– A disk in a computer: SCSI, IDE, SATA
– 4 disks in 1U is common
– 8 for half-depth boxes

DAS – Direct Attached Storage
– Disk shelf, connected by SCSI/SATA
– HP MSA30: 14 disks in 3U

SAN – Storage Area Network
– Dumb disk shelves
– Clients connect via a 'fabric'
– Fibre Channel, iSCSI, InfiniBand: low-level protocols

NAS – Network Attached Storage
– Intelligent disk shelf
– Clients connect via a network
– NFS, SMB, CIFS: high-level protocols

Of course, it's more confusing than that
Meet the LUN
– Logical Unit Number
– A slice of storage space
– Originally for addressing a single drive: c1t2d3 (Controller, Target, Disk/Slice)
– Now means a virtual partition/volume (LVM – Logical Volume Management)

NAS vs SAN
– With a SAN, a single host (initiator) owns a single LUN/volume
– With NAS, multiple hosts share a single LUN/volume
– NAS head: NAS access to a SAN

SAN advantages
Virtualization within a SAN offers some nice features:
– Real-time LUN replication
– Transparent backup
– SAN booting for host replacement
Some practical examples
– There are a lot of vendors, configurations vary, prices vary wildly
– Let's look at a couple I happen to have experience with (not an endorsement ;)

NetApp Filers
– Heads and shelves, up to 500TB in 6 cabinets
– FC SAN with 1 or 2 NAS heads

Isilon IQ
– 2U nodes, 3–96 nodes per cluster, 6–600 TB
– FC/InfiniBand SAN with a NAS head on each node
Scaling
– Vertical vs horizontal

Vertical scaling
– Get a bigger box
– Bigger disk(s), more disks
– Limited by current tech: size of each disk and total number in the appliance

Horizontal scaling
– Buy more boxes
– Add more servers/appliances
– Scales forever* (*sort of)

Storage scaling approaches
Four common models:
– Huge FS
– Physical nodes
– Virtual nodes
– Chunked space
Huge FS
– Create one giant volume with growing space (e.g. Sun's ZFS, Isilon IQ)
– Expandable on the fly?
– Upper limits: always limited somewhere

Huge FS
Pluses:
– Simple from the application side
– Logically simple
– Low administrative overhead
Minuses:
– All your eggs in one basket
– Hard to expand
– Has an upper limit
Physical nodes
– Application handles distribution to multiple physical nodes (disks, boxes, appliances, whatever)
– One 'volume' per node
– Each node acts by itself
– Expandable on the fly: add more nodes
– Scales forever
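(Not from the talk: a minimal Python sketch of the application-side distribution just described, assuming a hash of the file key picks the node; the node names are invented.)

    import hashlib

    # Hypothetical list of storage nodes the application knows about.
    NODES = ["store01", "store02", "store03", "store04"]

    def node_for(file_key: str) -> str:
        """Map a file key to one physical node by hashing.

        Simple and stateless, but adding a node changes most mappings,
        which is where the virtual-node indirection on the next slides helps.
        """
        digest = hashlib.md5(file_key.encode("utf-8")).hexdigest()
        return NODES[int(digest, 16) % len(NODES)]

    print(node_for("photos/1234/original.jpg"))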
Physical nodes
Pluses:
– Limitless expansion
– Easy to expand
– Unlikely to all fail at once
Minuses:
– Many 'mounts' to manage
– More administration
Virtual nodes
– Application handles distribution to multiple virtual volumes, contained on multiple physical nodes
– Multiple volumes per node
– Flexible
– Expandable on the fly: add more nodes
– Scales forever
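(Again not from the talk: the same idea with one level of indirection, as a sketch. The volume map and node names are made up; the point is that files hash to a volume number that never changes, while volumes can move between nodes.)

    import hashlib

    # Hypothetical map of virtual volumes to the physical node currently
    # hosting each one; many more volumes than nodes keeps moves cheap.
    VOLUME_MAP = {
        0: "store01", 1: "store02", 2: "store03",
        3: "store01", 4: "store02", 5: "store03",
    }

    def volume_for(file_key: str) -> int:
        """Hash a file key to a virtual volume number (stable forever)."""
        digest = hashlib.md5(file_key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(VOLUME_MAP)

    def node_for(file_key: str) -> str:
        """Resolve the volume to whichever node holds it right now."""
        return VOLUME_MAP[volume_for(file_key)]

    # Rebalancing is just reassigning a volume in the map and copying its
    # files; the application-visible address (the volume number) never changes.
    print(volume_for("photos/1234/original.jpg"), node_for("photos/1234/original.jpg"))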
Virtual nodes
Pluses:
– Limitless expansion
– Easy to expand
– Unlikely to all fail at once
– Addressing is logical, not physical
– Flexible volume sizing and consolidation
Minuses:
– Many 'mounts' to manage
– More administration
Chunked space
– Storage layer writes parts of files to different physical nodes
– Like RAID striping, but at a higher level
– High performance for large files: read multiple parts simultaneously (a striping sketch follows)
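(A minimal sketch, not from the talk, of what chunked space means: file bytes striped across nodes in fixed-size pieces so parts can be read in parallel. The chunk size and node names are arbitrary.)

    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, an arbitrary choice for this sketch
    NODES = ["store01", "store02", "store03"]

    def plan_chunks(file_size: int):
        """Yield (chunk_index, node, offset, length) for a striped file."""
        offset = 0
        index = 0
        while offset < file_size:
            length = min(CHUNK_SIZE, file_size - offset)
            yield index, NODES[index % len(NODES)], offset, length
            offset += length
            index += 1

    # A 200 MB file lands on three nodes and can be read back in parallel.
    for chunk in plan_chunks(200 * 1024 * 1024):
        print(chunk)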
Chunked space
Pluses:
– High performance
– Limitless size
Minuses:
– Conceptually complex
– Can be hard to expand on the fly
– Can't manually poke it
Real Life Case Studies

GFS – Google File System
– Developed by … Google
– Proprietary
– Everything we know about it is based on talks they've given
– Designed to store huge files for fast access

GFS – Google File System
– Single 'master' node holds metadata
– SPF (single point of failure) – a shadow master allows warm swap
– Grid of 'chunkservers'
– 64-bit filenames
– 64 MB file chunks
GFS – Google File System
[Diagram: the master node coordinating chunkservers, with chunks such as 1(a), 1(b) and 2(a) replicated across them]
GFS – Google File System
– Client reads metadata from the master, then file parts from multiple chunkservers
– Designed for big files (>100MB)
– Master server allocates access leases
– Replication is automatic and self-repairing, done synchronously for atomicity
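(GFS is proprietary, so this is only an illustrative sketch of the read path as described above, with tiny in-memory stand-ins for the master's metadata and the chunkservers; none of these names are Google's.)

    CHUNK = 64  # stand-in chunk size; the real GFS uses 64 MB chunks

    # Stand-ins for the master's metadata and the chunkservers' storage,
    # just enough to make the read path below executable.
    master_metadata = {("bigfile", 0): ("handle-0", ["cs1", "cs2"]),
                       ("bigfile", 1): ("handle-1", ["cs2", "cs3"])}
    chunkservers = {"cs1": {"handle-0": b"a" * CHUNK},
                    "cs2": {"handle-0": b"a" * CHUNK, "handle-1": b"b" * CHUNK},
                    "cs3": {"handle-1": b"b" * CHUNK}}

    def read(filename, offset, length):
        """Metadata from the master, then data straight from the chunkservers."""
        out = bytearray()
        while length > 0:
            handle, replicas = master_metadata[(filename, offset // CHUNK)]
            chunk_off = offset % CHUNK
            part = chunkservers[replicas[0]][handle][chunk_off:chunk_off + length]
            out += part
            offset += len(part)
            length -= len(part)
        return bytes(out)

    print(read("bigfile", CHUNK - 5, 10))  # b'aaaaabbbbb' – the read spans two chunks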
GFS – Google File System
– Reading is fast (parallelizable), but requires a lease
– Master server is required for all reads and writes
MogileFS – OMG Files
– Developed by Danga / SixApart
– Open source
– Designed for scalable web app storage

MogileFS – OMG Files
– Single metadata store (MySQL); MySQL Cluster avoids the SPF
– Multiple 'tracker' nodes locate files
– Multiple 'storage' nodes store files

MogileFS – OMG Files
[Diagram: trackers talking to the MySQL metadata store and to the storage nodes]
MogileFS – OMG Files
– Replication of file 'classes' happens transparently
– Storage nodes are not mirrored; replication is piecemeal
– Reading and writing go through the trackers, but are performed directly against the storage nodes
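(A hedged sketch of the tracker/storage split described above, not the real MogileFS client API: the tracker lookup and storage-node fetch are stand-ins, and the key and URL formats are invented.)

    # Invented stand-ins: the tracker's index of replica URLs per key, and the
    # storage nodes' contents keyed by URL (one replica is "down" / missing).
    tracker_index = {
        "photo:1234": ["http://stor01/dev1/0/000/123.fid",
                       "http://stor03/dev7/0/000/123.fid"],
    }
    storage_nodes = {"http://stor03/dev7/0/000/123.fid": b"...jpeg bytes..."}

    def get_paths(key):
        """Ask a tracker where a key lives (a dict lookup in this sketch)."""
        return tracker_index[key]

    def fetch(key):
        """Read directly from a storage node, trying each replica in turn."""
        for url in get_paths(key):
            if url in storage_nodes:          # stands in for an HTTP GET
                return storage_nodes[url]
        raise IOError("no reachable replica for %s" % key)

    print(len(fetch("photo:1234")))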
Flickr File System
– Developed by Flickr
– Proprietary
– Designed for very large scalable web app storage

Flickr File System
– No metadata store: deal with it yourself
– Multiple 'StorageMaster' nodes
– Multiple storage nodes with virtual volumes

Flickr File System
[Diagram: StorageMaster (SM) nodes in front of the storage nodes]
Flickr File System
– Metadata stored by the app: just a virtual volume number; the app chooses a path
– Virtual nodes are mirrored, locally and remotely
– Reading is done directly from the nodes

Flickr File System
– StorageMaster nodes are only used for write operations
– Reading and writing can scale separately
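(FlickrFS is proprietary; this is only a guess at what "metadata stored by the app" might look like in practice. The row layout, volume map and hostnames are invented.)

    # Invented stand-ins: the app's own metadata row and a map of virtual
    # volumes to the node pair (local + remote mirror) currently holding them.
    photo_row = {"id": 1234, "volume": 17, "path": "12/34/1234_original.jpg"}
    volume_map = {17: ["node-a.example.com", "node-b.example.com"]}

    def read_url(row):
        """Reads go straight to a storage node; no StorageMaster involved."""
        host = volume_map[row["volume"]][0]        # or pick the closest mirror
        return "http://%s/vol%d/%s" % (host, row["volume"], row["path"])

    def write(row, data):
        """Writes would first ask a StorageMaster for a volume with free space,
        then push the file to every mirror of that volume (sketched, not real)."""
        for host in volume_map[row["volume"]]:
            print("PUT http://%s/vol%d/%s (%d bytes)" % (host, row["volume"], row["path"], len(data)))

    print(read_url(photo_row))
    write(photo_row, b"...jpeg bytes...")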
Amazon S3
– A big disk in the sky
– Multiple 'buckets'
– Files have user-defined keys
– Data + metadata

Amazon S3
[Diagram: your servers pushing files into Amazon]

Amazon S3
[Diagram: your servers pushing files into Amazon, which serves them to users directly]
The cost
– Fixed price, by the GB
– Store: $0.15 per GB per month
– Serve: $0.20 per GB
The cost
[Graph: S3 cost over time]

The cost
[Graph: S3 versus regular bandwidth cost]

End costs
– ~$2k to store 1TB for a year
– ~$63 a month for 1Mbps
– ~$65k a month for 1Gbps
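(The arithmetic behind these figures, using the 2007 prices quoted two slides back and treating 1 TB as 1000 GB over a 30-day month; the slide's ~$63 presumably rounds slightly differently.)

    STORE_PER_GB_MONTH = 0.15   # 2007 S3 storage price quoted on the slide
    SERVE_PER_GB = 0.20         # 2007 S3 transfer price quoted on the slide

    # Storing 1 TB for a year (treating 1 TB as 1000 GB):
    print(1000 * STORE_PER_GB_MONTH * 12)          # -> 1800.0, i.e. ~$2k

    # Serving a constant 1 Mbps for a 30-day month:
    gb_per_month = 1e6 / 8 * 86400 * 30 / 1e9      # bits/s -> ~324 GB/month
    print(gb_per_month * SERVE_PER_GB)             # -> ~$65/month; 1 Gbps is ~$65k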
Serving

Serving files
Serving files is easy!
[Diagram: Apache reading from a disk]

Serving files
Scaling is harder
[Diagram: many Apache + disk pairs]

Serving files
– This doesn't scale well
– Primary storage is expensive, and takes a lot of space
– In many systems, we only access a small number of files most of the time
Caching
– Insert caches between the storage and serving nodes
– Cache frequently accessed content to reduce reads on the storage nodes
– Software (Squid, mod_cache)
– Hardware (NetCache, CacheFlow)

Why it works
– Keep a smaller working set
– Use faster hardware: lots of RAM, SCSI, outer edge of disks (ZCAV)
– Use more duplicates: cheaper, since they're smaller
Two models
Layer 4:
– 'Simple' balanced cache
– Objects live in multiple caches
– Good for a few objects requested many times
Layer 7:
– URL-balanced cache
– Objects live in a single cache
– Good for many objects requested a few times
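(A toy sketch of the difference, assuming a made-up pool of four caches: layer 4 balances per connection, layer 7 balances per URL.)

    import hashlib
    import itertools

    CACHES = ["cache01", "cache02", "cache03", "cache04"]
    round_robin = itertools.cycle(CACHES)

    def l4_pick() -> str:
        """Layer 4: balance per connection (round-robin here), so any object can
        end up in any cache and hot objects get duplicated across all of them."""
        return next(round_robin)

    def l7_pick(url: str) -> str:
        """Layer 7: balance on the URL, so each object lives in exactly one cache
        and the combined cache space holds many more distinct objects."""
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        return CACHES[int(digest, 16) % len(CACHES)]

    print(l4_pick(), l7_pick("http://example.com/photos/1234_3.jpg"))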
Replacement policies
– LRU: least recently used
– GDSF: greedy dual-size frequency
– LFUDA: least frequently used with dynamic aging
– All have advantages and disadvantages
– Performance varies greatly between them
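(For concreteness, a minimal LRU cache, the simplest of the three policies; GDSF and LFUDA additionally weigh object size and access frequency, which this sketch ignores.)

    from collections import OrderedDict

    class LRUCache:
        """Least-recently-used eviction in a few lines."""

        def __init__(self, capacity=3):
            self.capacity = capacity
            self.items = OrderedDict()

        def get(self, key):
            if key not in self.items:
                return None
            self.items.move_to_end(key)          # mark as most recently used
            return self.items[key]

        def put(self, key, value):
            self.items[key] = value
            self.items.move_to_end(key)
            if len(self.items) > self.capacity:
                self.items.popitem(last=False)   # evict the least recently used

    cache = LRUCache(capacity=2)
    cache.put("a.jpg", b"...")
    cache.put("b.jpg", b"...")
    cache.get("a.jpg")
    cache.put("c.jpg", b"...")                   # evicts b.jpg, not a.jpg
    print(list(cache.items))                     # ['a.jpg', 'c.jpg']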
Cache churn
– How long do objects typically stay in cache?
– If it gets too short, we're doing badly, but it depends on your traffic profile
– Make the cached object store larger

Problems
Caching has some problems:
– Invalidation is hard
– Replacement is dumb (even LFUDA)
Avoiding caching makes your life (somewhat) easier

CDN – Content Delivery Network
– Akamai, Savvis, Mirror Image Internet, etc.
– Caches operated by other people: already in place, in lots of places
– GSLB/DNS balancing
Edge networks
[Diagram: all requests hitting a single origin]

Edge networks
[Diagram: requests hitting nearby edge caches, which fetch from the origin]

CDN models
– Simple model: you push content to them, they serve it
– Reverse proxy model: you publish content on an origin, they proxy and cache it

CDN invalidation
– You don't control the caches (just like those awful ISP ones)
– Once something is cached by a CDN, assume it can never change
– Nothing can be deleted
– Nothing can be modified
Versioning
When you start to cache things, you need to care about versioning:
– Invalidation & expiry
– Naming & sync

Cache invalidation
– If you control the caches, invalidation is possible
– But remember ISP and client caches
– Remove deleted content explicitly: avoid users finding old content, save cache space

Cache versioning
Simple rule of thumb:
– If an item is modified, change its name (URL)
– This can be independent of the file system!
Virtual versioning
– Database indicates version 3 of the file
– Web app writes the version number into the URL: example.com/foo_3.jpg
– Request comes through the cache and is cached with the versioned URL: foo_3.jpg
– mod_rewrite converts the versioned URL back to the real path: foo_3.jpg -> foo.jpg
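(A sketch of both halves of the scheme above: the app bakes the version into the URL, and a regex stands in for the mod_rewrite rule that maps it back to the real file. The helper names are made up.)

    import re

    def versioned_url(base, filename, version):
        """App side: bake the current version number into the public URL."""
        name, ext = filename.rsplit(".", 1)
        return "http://%s/%s_%d.%s" % (base, name, version, ext)

    def rewrite_to_path(request_path):
        """Origin side: the same job mod_rewrite does, dropping the version so
        every versioned URL maps back onto the one real file on disk."""
        return re.sub(r"_\d+(\.\w+)$", r"\1", request_path)

    print(versioned_url("example.com", "foo.jpg", 3))   # http://example.com/foo_3.jpg
    print(rewrite_to_path("/foo_3.jpg"))                # /foo.jpg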
Authentication
– Authentication inline layer: Apache / Perlbal
– Authentication sideline: ICP (CARP/HTCP)
– Authentication by URL: FlickrFS

Auth layer
– Authenticator sits between the client and storage
– Typically built into the cache software
[Diagram: cache consulting an inline authenticator in front of the origin]

Auth sideline
– Authenticator sits beside the cache
– A lightweight protocol is used to talk to the authenticator
[Diagram: cache consulting an authenticator beside it, then fetching from the origin]

Auth by URL
– Someone else performs authentication and gives URLs to the client (typically the web app)
– URLs hold the 'keys' for accessing files
[Diagram: web server handing out URLs; the client fetches through the cache from the origin]
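(The slides don't say how FlickrFS builds its URLs, so this is just one common way to do auth-by-URL: the web app signs an expiring URL with a shared secret, and the cache or origin only verifies the signature. Everything here is invented for illustration.)

    import hashlib
    import hmac
    import time

    SECRET = b"shared-between-app-and-origin"   # invented for this sketch

    def signed_url(path, ttl=3600):
        """Web app: hand the client a URL that carries its own access 'key'."""
        expires = int(time.time()) + ttl
        msg = ("%s:%d" % (path, expires)).encode("utf-8")
        sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
        return "%s?expires=%d&sig=%s" % (path, expires, sig)

    def check(path, expires, sig):
        """Cache/origin: no user database needed, just recompute and compare."""
        msg = ("%s:%d" % (path, expires)).encode("utf-8")
        good = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
        return expires > time.time() and hmac.compare_digest(good, sig)

    print(signed_url("/photos/1234_3.jpg"))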
BCP

Business Continuity Planning
– How can I deal with the unexpected? That's the core of BCP
– Redundancy
– Replication

Reality
– On a long enough timescale, anything that can fail, will fail
– Of course, everything can fail
– True reliability comes only through redundancy

Reality
– Define your own SLAs
– How long can you afford to be down?
– How manual is the recovery process?
– How far can you roll back?
– How many $node boxes can fail at once?
Failure scenarios
– Disk failure
– Storage array failure
– Storage head failure
– Fabric failure
– Metadata node failure
– Power outage
– Routing outage

Reliable by design
– RAID avoids disk failures, but not head or fabric failures
– Duplicated nodes avoid host and fabric failures, but not routing or power failures
– Dual-colo avoids routing and power failures, but may need duplication too

Tend to all points in the stack
– Going dual-colo: great
– Taking a whole colo offline because of a single failed disk: bad
– We need a combination of these

Recovery times
– BCP is not just about continuing when things fail
– How can we restore after they come back?
– Host- and colo-level syncing: replication queuing
– Host- and colo-level rebuilding
Reliable reads & writes
– Reliable reads are easy: 2 or more copies of each file
– Reliable writes are harder: write 2 copies at once, but what do we do when we can't write to one?

Dual writes
– Queue up the data to be written – but where? The queue itself needs to be reliable
– Queue up a journal of changes, and then read data from the disk whose write succeeded
– Duplicate the whole volume after failure – slow!
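(A sketch of the dual-write-plus-journal idea above; the node objects and the in-memory queue are placeholders, and as the slide notes the real journal must itself be reliable.)

    import queue

    class Node:
        """Stand-in storage node that can be marked down to simulate a failure."""
        def __init__(self, name, up=True):
            self.name, self.up, self.files = name, up, {}
        def write(self, key, data):
            if not self.up:
                raise IOError("%s is down" % self.name)
            self.files[key] = data

    # The replay journal has to be reliable itself; an in-memory queue is only
    # a placeholder for something durable.
    journal = queue.Queue()

    def dual_write(key, data, primary, mirror):
        """Write both copies; if one side fails, journal the work for later replay."""
        ok = 0
        for node in (primary, mirror):
            try:
                node.write(key, data)
                ok += 1
            except IOError:
                journal.put((node.name, key))
        if ok == 0:
            raise IOError("both writes failed for %s" % key)

    a, b = Node("store01"), Node("store02", up=False)
    dual_write("photos/1234.jpg", b"...", a, b)
    print(journal.get())     # ('store02', 'photos/1234.jpg') waits to be replayed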
Cost

Judging cost
– Per GB?
– Per GB upfront, plus per GB per year
– Not as simple as you'd hope – how about an example?
Hardware costs (single cost)
– Cost of hardware ÷ usable GB

Power costs (recurring cost)
– Cost of power per year ÷ usable GB

Power costs (single cost)
– Power installation cost ÷ usable GB

Space costs (recurring cost)
– (Cost per U × U's needed, including network) ÷ usable GB

Network costs (single cost)
– Cost of network gear ÷ usable GB

Misc costs (single & recurring costs)
– (Support contracts + spare disks + bus adaptors + cables) ÷ usable GB

Human costs (recurring cost)
– (Admin cost per node × node count) ÷ usable GB
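(The per-slide fractions above, combined into one throwaway calculation; every number is a placeholder, the point is keeping upfront and recurring costs per usable GB separate.)

    # All figures are placeholders; the structure mirrors the fractions above.
    usable_gb = 48 * 750 * 0.75            # e.g. 48 x 750 GB disks, less RAID overhead

    upfront = {
        "hardware": 60000,
        "power_install": 2000,
        "network_gear": 8000,
        "misc (HBAs, cables, spares)": 4000,
    }
    recurring_per_year = {
        "power": 3500,
        "space": 6 * 1800,                  # U's needed (inc. network) x cost per U
        "support_contracts": 5000,
        "admin": 0.25 * 80000,              # admin cost per node x node count
    }

    print("upfront  $/GB: %.3f" % (sum(upfront.values()) / usable_gb))
    print("per-year $/GB: %.3f" % (sum(recurring_per_year.values()) / usable_gb))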
TCO
– Total cost of ownership has two parts: upfront and ongoing
– Architecture plays a huge part in costing: don't get tied to hardware, allow heterogeneity, move with the market
(fin)
Photo credits
– flickr.com/photos/ebright/260823954/
– flickr.com/photos/thomashawk/243477905/
– flickr.com/photos/tom-carden/116315962/
– flickr.com/photos/sillydog/287354869/
– flickr.com/photos/foreversouls/131972916/
– flickr.com/photos/julianb/324897/
– flickr.com/photos/primejunta/140957047/
– flickr.com/photos/whatknot/28973703/
– flickr.com/photos/dcjohn/85504455/
You can find these slides online: iamcal.com/talks/