Scalable Data Management @ Facebook
Srinivas Narayanan
11/13/09
Scale
So what scale are we really talking about?
Over 300 million active users
>200 billion monthly page views
>3.9 trillion feed actions processed per day
100 million search queries per day
More than 2^32 photos; 2 billion pieces of content per week; 6 billion minutes on site per day
Over 1 million developers in 180 countries
#2 site on the Internet (time on site)
It is exciting and also humbling to be able to serve 300M users and so many operations. If Facebook were a country, we would be the 3rd largest after China and India. Our users upload an average of 500GB of structured data per day, in addition to about 2TB of photos and videos; the daily uploads would exceed 350 feature-length movies. As we will see during the talk, this kind of load is highly non-trivial. The challenge is to build systems that can support this scale and keep the site running 24 hours a day, each and every day.
Growth Rate
[Chart: active users reaching 300M in 2009]
People log in more than once a month :)
It took over four years to reach 100 million users, a level we achieved in September of 2008, and only one more year to reach 300 million. Technology adoption is accelerating: radio took 38 years to reach 50M users, TV took 13 years, computers took 4 years; Facebook took only 3 years to reach 50M active users.
Why is this important? The rate of growth matters because you have very little time to change things. You just re-built something yesterday and yet it no longer works today. Designing for exponential growth is really hard: it is hard to predict what will go wrong or where the next set of bottlenecks will be.
Social Networks
So what makes the work at Facebook challenging?
The social graph links everything
People are only one dimension of the social graph. Social applications link people to many types of data: photos, videos, music, blog posts, groups, events, organizations, and even other applications.
Scaling Social Networks
Much harder than for typical websites, where typically only 1-2% of users are online at a time (easy to cache the data) and partitioning and scaling are relatively easy. What do you do when everything is interconnected?
Facebook and social networks in general have a unique problem. The data is so connected that the only reasonable way to store it is essentially over a uniformly distributed set of data providers. It is difficult to segment our data in any meaningful way to reside on the same disks without duplicating it everywhere. It is also so frequently accessed that we simply can't hit the database for each access.
[Diagram: dozens of friend objects, each annotated with name, status, privacy, and a profile photo or video thumbnail]
Consequence of a Click: when I click on a friend's photo, for example, a lot of things happen. The data is retrieved from the database and checked in real time for privacy, visibility, and other rules. Each time you click on a photo, status, name, friends, or friends of friends, this process takes place.
System Architecture
Architecture
Load Balancer (assigns a web server)
Web Server (PHP assembles data)
Memcache (fast, simple)
Database (slow, persistent)
This picture is somewhat inaccurate: memcache and the database actually sit off to the side, and there are lots of other services.
Memcache
Simple in-memory hash table supporting get, set, delete, multiget, and multiset. Not a write-through cache.
Pros and cons: it is the database shield, with low latency and very high request rates, but it can be easy to corrupt and is inefficient for very small items.
In many ways this is the heart, or core, of the site: 120 million queries per second, the equivalent of typing out over 50 volumes of the Encyclopedia Britannica in a tenth of a second.
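Below is a minimal sketch of the read and write paths this slide implies, in Python against the python-memcached client. The key scheme and the db_lookup/db_update stand-ins are illustrative, not Facebook's actual code; the point is only that memcache is a demand-filled look-aside cache, not write-through.

import memcache

mc = memcache.Client(["127.0.0.1:11211"])  # address is illustrative

def db_lookup(user_id):
    return {"id": user_id, "name": "example"}   # stand-in for a MySQL SELECT

def db_update(user_id, fields):
    pass                                        # stand-in for a MySQL UPDATE

def get_user(user_id):
    key = "user:%d" % user_id
    user = mc.get(key)
    if user is None:                 # miss: fall through to the database
        user = db_lookup(user_id)
        mc.set(key, user)            # populate so the next read is a hit
    return user

def set_user(user_id, fields):
    db_update(user_id, fields)       # the write goes to MySQL first...
    mc.delete("user:%d" % user_id)   # ...then we invalidate (not write-through)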
Memcache Optimization
Multithreading and efficient protocol code: 50k req/s
Polling network drivers: 150k req/s
Breaking up the stats lock: 200k req/s
Batching packet handling: 250k req/s
Breaking up the cache lock: future work
Network Incast
[Diagram sequence: a PHP client sends many small get requests through a switch to several memcache servers; the servers all answer at once with many big data packets; the flood overruns the switch's buffers and replies are dropped.]
The fix: implement flow control on the client over multiple UDP connections, with aggressive timeouts, so we stop blowing up memcache past a threshold.
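A sketch of that client-side flow control, assuming a hypothetical udp_get() exchange (the real client multiplexes requests over several UDP connections): gets go out in small windows so that one window's replies fit in the switch buffer, and a reply that misses the aggressive deadline is simply treated as a cache miss.

WINDOW = 32      # max keys in flight per round; would be tuned empirically
TIMEOUT = 0.05   # aggressive timeout: a dropped reply becomes a miss

def udp_get(server, keys, timeout):
    # stand-in for one UDP request/response exchange; replies that miss
    # the deadline are dropped and reported as misses
    return {}

def windowed_multiget(server, keys):
    results = {}
    for i in range(0, len(keys), WINDOW):    # limit concurrent replies
        results.update(udp_get(server, keys[i:i + WINDOW], TIMEOUT))
    return results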
Memcache Clustering
Many small objects per server, or many servers per large object?
If objects are small, round trips dominate, so you want objects clustered. If objects are large, transfer time dominates, so you want objects distributed. In a web application you will almost always be dealing with small objects, and you can get into a situation where adding machines doesn't help scaling. (See the model sketched after the examples below.)
Memcache Clustering
[Diagram sequence: a PHP client fetching 10 objects from one memcache server takes 1 round trip; split 5 and 5 across two servers, 2 round trips total; spread across three servers, 3 round trips total. One round trip per server.]
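A back-of-the-envelope model of the tradeoff in the examples above, with assumed constants (0.5 ms in-datacenter round trips, roughly 1 Gbit/s of usable bandwidth) and the slides' one-round-trip-per-server cost:

RTT = 0.0005        # seconds per round trip (assumption)
BANDWIDTH = 125e6   # bytes per second, ~1 Gbit/s (assumption)

def fetch_time(servers, total_bytes):
    # one round trip per server, transfer parallelized across servers
    return servers * RTT + (total_bytes / servers) / BANDWIDTH

# 10 KB of small objects: round trips dominate, one server wins
print(fetch_time(1, 10e3), fetch_time(4, 10e3))   # ~0.6 ms vs ~2.0 ms
# a 10 MB object: transfer dominates, striping across four servers wins
print(fetch_time(1, 10e6), fetch_time(4, 10e6))   # ~80 ms vs ~22 ms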
Memcache Pool Optimization Currently a manual process Replication for obvious hot data sets Interesting problem: Optimize the allocation based on access patterns
Vertical Partitioning of Object Types
[Diagram: specialized replica 1 and specialized replica 2, each holding shards 1 and 2, next to a general pool with wide fanout across shards 1, 2, 3, ..., n]
MySQL
MySQL has played a role from the beginning: thousands of MySQL servers in two datacenters.
Today, our user database cluster is a large pool of independent MySQL servers. We have chosen a shared-nothing architecture to achieve both scalability and fault isolation. A battleship vs. an army of foot soldiers.
MySQL Usage
Pretty solid transactional persistent store
Logical migration of data is difficult: logical-to-physical db mapping
Rarely use advanced query features
Performance: database resources are precious; web tier CPU is relatively cheap
Distributed data: no joins! (see the sketch below)
Sound administrative model
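With no cross-server joins, the join moves into the (cheap) web tier: fetch the ids from one shard, then fan out for the rows. A sketch under a toy sharding scheme; shard_for() and query() are illustrative stand-ins, not real APIs.

def shard_for(user_id, num_shards=4):
    return user_id % num_shards      # toy logical-to-physical mapping

def query(shard, sql, args):
    return []                        # stand-in for a MySQL query on one shard

def friends_of(user_id):
    # step 1: the friend ids live on the user's own shard
    ids = query(shard_for(user_id),
                "SELECT friend_id FROM friends WHERE user_id = %s", [user_id])
    # step 2: group ids by shard, fetch each group with one query, and
    # stitch the rows together in the web tier instead of the database
    by_shard = {}
    for fid in ids:
        by_shard.setdefault(shard_for(fid), []).append(fid)
    rows = []
    for shard, fids in by_shard.items():
        rows += query(shard, "SELECT * FROM user WHERE id IN %s", [fids])
    return rows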
MySQL is better because it is Open Source: we can enhance or extend the database as we see fit, when we see fit.
Facebook extended MySQL to support distributed cache invalidation for memcache:
INSERT table_foo (a,b,c) VALUES (1,2,3) MEMCACHE_DIRTY key1,key2,...
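One plausible consumer of that annotation, sketched below: a process tailing the MySQL replication stream deletes the listed keys from its local memcache tier only after the write has landed there, so the cache cannot resurrect pre-write data. The statement parsing here is illustrative, not the real implementation.

import memcache

local_mc = memcache.Client(["127.0.0.1:11211"])   # this datacenter's tier

def on_replicated_statement(sql):
    # the tailer applies the statement first, then invalidates the keys
    if " MEMCACHE_DIRTY " in sql:
        key_list = sql.split(" MEMCACHE_DIRTY ", 1)[1]
        for key in key_list.split(","):
            local_mc.delete(key.strip())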
Scaling across datacenters
[Diagram: the West Coast has SF web + memcache and SC web + memcache + MySQL; the East Coast has VA web + memcache + MySQL; a memcache proxy sits on each side, and MySQL replication flows between the coasts.]
Other Interesting Issues
Application-level batching and parallelization
Super-hot data items
Cache-key versioning with continuous availability
Photos
Photos + Social Graph = Awesome!
Photos: Scale
20 billion photos, each stored at 4 sizes = 80 billion images. Printed, they would wrap around the world more than 10 times!
Over 40M new photos per day; 600K photos served per second.
Photos Scaling - The easy wins
Upload tier: handles uploads, scales images, stores them on NFS
Serving tier: images served from NFS via HTTP
However, file systems are not good at supporting large numbers of files: the metadata is too large to fit in memory, causing too many I/Os for each file read. We are limited by I/O, not storage density.
Easy wins: a CDN; Cachr (an HTTP server plus caching); an NFS file handle cache. Together these cut roughly 10 I/Os per image read down to about 3.
Photos: Haystack
Overlay file system with its index in memory: one I/O per read.
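A minimal haystack-style store, sketched in Python: blobs are appended to one big file and an in-memory index maps photo id to (offset, size), so serving a photo costs exactly one disk I/O instead of several metadata I/Os. This illustrates the idea only, not Haystack's actual on-disk format.

import os

class Haystack:
    def __init__(self, path):
        self.f = open(path, "a+b")
        self.index = {}                       # photo_id -> (offset, size)

    def write(self, photo_id, data):
        self.f.seek(0, os.SEEK_END)
        self.index[photo_id] = (self.f.tell(), len(data))
        self.f.write(data)                    # append-only data file
        self.f.flush()

    def read(self, photo_id):
        offset, size = self.index[photo_id]   # memory lookup, no disk I/O
        self.f.seek(offset)
        return self.f.read(size)              # the one I/O per read

store = Haystack("/tmp/haystack.dat")
store.write(42, b"...jpeg bytes...")
print(store.read(42))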
Data Warehousing
Data: How much?
200 GB per day in March 2008
2+ TB (compressed) of raw data per day in April 2009
4+ TB (compressed) of raw data per day today
The Data Age
User services are free or low cost
Consumer behavior is hard to predict
Data and analysis are critical
More data beats better algorithms
Deficiencies of existing technologies
Analysis/storage on proprietary systems is too expensive
Closed systems are hard to extend
Hadoop & Hive
Hadoop
Pros: superior availability, scalability, and manageability despite lower single-node performance; an open system; scalable costs
Cons: programmability and metadata. Map-reduce is hard to program (users know SQL/bash/Python/Perl), and data needs to be published in well-known schemas
Hive
A system for managing and querying structured data, built on top of Hadoop
Components: Map-Reduce for execution, HDFS for storage, metadata in an RDBMS
Hive: New Technology, Familiar Interface
hive> select key, count(1) from kv1 where key > 100 group by key;
vs.
$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar \
    -input /user/hive/warehouse/kv1 \
    -mapper map.sh -reducer reducer.sh \
    -file /tmp/map.sh -file /tmp/reducer.sh \
    -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*
Hive: Sample Applications
Reporting: e.g., daily/weekly aggregations of impression/click counts, measures of user engagement
Ad hoc analysis: e.g., how many group admins, broken down by state/country
Machine learning (assembling training data): e.g., ad optimization, user engagement as a function of user attributes
Lots more
Hive: Server Infrastructure
4800 cores; storage capacity of 5.5 petabytes at 12 TB per node (roughly 460 nodes)
Two-level network topology: 1 Gbit/sec from node to rack switch, 4 Gbit/sec from rack switch to the top level
Hive & Hadoop: Usage Stats
4 TB of compressed new data added per day
135 TB of compressed data scanned per day
7500+ Hive jobs per day; 80K compute hours per day
200 people run jobs on Hadoop/Hive; analysts (non-engineers) use Hadoop through Hive
95% of jobs are Hive jobs
Hive: Technical Overview
Hive: Open and Extensible
Query your own formats and types with your own serializers/deserializers
Extend the SQL functionality through user-defined functions
Do any non-SQL transformation through the TRANSFORM operator, which sends data from Hive to any user program or script (see the sketch below)
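As an illustration of TRANSFORM, here is a toy Python script that Hive could stream rows through: rows arrive on stdin as tab-separated text, and whatever the script prints becomes the output rows. The query in the comment and the bucketing logic are made up for the example.

#!/usr/bin/env python
# might be invoked with something like:
#   SELECT TRANSFORM(key, value) USING 'python bucket.py' AS (bucket, value)
#   FROM kv1;
import sys

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    bucket = int(key) // 100          # any non-SQL logic can go here
    print("%d\t%s" % (bucket, value))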
Hive: Smarter Execution Plans
Map-side joins
Predicate pushdown
Partition pruning
Hash-based aggregations
Parallel execution of operator trees
Intelligent scheduling
Hive: Possible Future Optimizations
Pipelining?
Finer operator control (controlling sorts)
Cost-based optimizations?
HBase
Spikes: The Username Launch
The numbers presented earlier should give you a sense of the scale Facebook operates at and the challenges it can throw up. Facebook requires a giant infrastructure and also a very diverse array of components. The cool thing is that whatever field you're into, we have the opportunity for you to dive into it and be right at the forefront. And luckily, Facebook is also a place that is really a continuation of your studies: broad and challenging.
System Design
The database tier cannot handle the load, so: a dedicated memcache tier for assigned usernames, where a miss means the name is available. This avoids database hits altogether. Blacklists: bucketize, with a local-tier cache timeout.
Performance of the availability checker was one of the most critical parts of the system. This wasn't a big issue when we were going down the auction path, but it became critical once we decided to do first-come-first-served. Generating suggestions adds more load (maybe 6-10 or more checks per page load), and the usual caching doesn't help much since there will be a lot of misses; the DB tier cannot handle the load.
Constant refreshing was a concern, so we added a countdown to incentivize users to just watch; upon countdown completion we don't auto-refresh, we show a continue button. We also disabled the chat bar, used a lite include stack, and didn't pull in the user's set of pages unnecessarily. Tradeoffs between UX (extra clicks) and performance.
Username Memcache Tier
A parallel pool in each data center, 8 nodes per pool. Writes are replicated to all nodes; reads can go to any node (hashed by uid).
[Diagram: PHP client in front of username memcache nodes UN0, UN1, ..., UN7]
Notes: each key gets hashed to a number 0-7 (based on uid?), but these are not db-backed keys, so sets need to be replicated.
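A sketch of that tier in Python, with illustrative addresses and key scheme: a claim is written to all eight nodes, a read goes to one node picked from the uid, and a miss means the username is still available.

import memcache

NODES = [memcache.Client(["10.0.1.%d:11211" % i]) for i in range(8)]

def record_assignment(username, uid):
    for node in NODES:                 # sets are replicated to every node
        node.set("un:" + username, uid)

def is_available(username, uid):
    node = NODES[uid % len(NODES)]     # reads are spread by uid
    return node.get("un:" + username) is None   # miss => available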
Write Optimization
Hashout store: a distributed key-value store (MySQL backed) with lockless (optimistic) concurrency control.
The hashout store holds the mapping from username to <alias, fbid> (which carries some more data). Optimistic concurrency means no locks are obtained, and since conflict rates are low, it is a win. (A sketch follows.)
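What lockless (optimistic) concurrency control amounts to here, sketched: no lock is taken before the write; a uniqueness constraint arbitrates the race, and the rare loser just sees a conflict. The table name and execute() helper are illustrative.

class DuplicateKeyError(Exception):
    pass

def execute(sql, *args):
    pass   # stand-in for a statement against the MySQL-backed hashout store

def try_claim(username, fbid):
    try:
        # attempt the insert with no lock; a unique key on `name` decides
        execute("INSERT INTO hashout (name, fbid) VALUES (%s, %s)",
                username, fbid)
        return True
    except DuplicateKeyError:
        return False   # cheap to lose, because conflicts are rare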
Fault Tolerance
Memcache nodes can go down: always check another node on a miss, and replay from a log file (Scribe) to recover.
Memcache sets are not guaranteed to succeed: self-correcting code writes to memcache again if we detect the conflict during database writes.
One of the issues we worried about with non-db-backed keys, with memcache as the system of record, was nodes going down. The replicated pool infrastructure gives some redundancy (on a miss it lets us check another node), and we can replay Scribe logs to recover from more than one failure (this actually happened!). And since memcache sets are not guaranteed to succeed, a user could see a username as available when it isn't; the write will still be rejected, since the database will complain, but that is a bad experience. So if we find that a username is taken during the db-write stage, we call set again to write it to the username memcache tier, and future users are prevented from seeing the same problem.
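Putting the pieces together, a sketch of that self-correcting write path, reusing try_claim() and record_assignment() from the sketches above; db_owner_of() is a hypothetical helper that looks up the real owner.

def db_owner_of(username):
    return 0   # stand-in: fetch the winning uid from the hashout store

def assign(username, uid):
    if try_claim(username, uid):                  # the DB is the arbiter
        record_assignment(username, uid)
        return True
    # memcache said available but the DB disagreed: repair the cache so
    # future users don't see the same stale answer
    record_assignment(username, db_owner_of(username))
    return False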
Nuclear Options
Newsfeed: reduce the number of stories; turn off scrolling and highlights
Profile: make the info tab the default
Chat: reduce the buddy-list refresh rate; turn it off!
Even though we designed the system to be highly performant and tested it under a lot of load, there was a chance that if a really, really large number of people showed up at launch time, our system might not be able to handle it. So we thought about ways to reduce the load caused by other features on the site to provide extra capacity for the username launch. We affectionately called these the "nuclear options," since they were a last resort for handling the enormous load.
How much load?
200k usernames in 3 minutes; 1M in the first hour; 50M in the first month. We were prepared for over 10x that.
The actual load was high, but not anywhere near what we were prepared for. This shows how careful design, planning, and testing can lead to a successful launch.
Some interesting problems The spike!
Some interesting problems
Graph models and languages: low-latency fast access; slightly more expressive queries; consistency and staleness can be a bit loose; analysis over large data sets; privacy as part of the model
Fat data pipes: push enormous volumes of data to several third-party applications (e.g., the entire newsfeed to search partners), with controllable QoS
Some interesting problems (contd.)
Search relevance
Storage systems
Middle-tier (cache) optimization
Application data access language
Questions? Sum it up.