Scalable Data Management@facebook Srinivas Narayanan 11/13/09.


1 Scalable Data Management@facebook
Srinivas Narayanan 11/13/09

2 Scale So what scale are we really talking about?

3 >200 billion monthly page views
Over 300 million active users
>3.9 trillion feed actions processed per day
>200 billion monthly page views
100 million search queries per day
Over 1 million developers in 180 countries
#2 site on the Internet (time on site)
More than 232 photos…
2 billion pieces of content per week
6 billion minutes per day
Exciting and also humbling to be able to serve 300M users and so many operations. If Facebook were a country, it would be the third largest after China and India. Our users upload an average of 500GB of structured data per day, in addition to about 2TB of photos and videos; the data uploaded daily would exceed 350 feature-length movies. As we will see during the talk, this kind of load is highly non-trivial. The challenge is to build systems that can support this scale and keep the site running 24 hours a day, each and every day.

4 Growth Rate Active Users 300M 2009
People log in more than once a month :)
It took over four years to reach 100 million users, a level we achieved in September of 2008, and only one more year to reach 300 million!
Technology adoption is accelerating: radio took 38 years to reach 50M users, TV took 13 years, computers took 4 years. Facebook took only 3 years to reach 50M active users.
Why is this important? The rate of growth matters because you have very little time to change things. You just re-built something yesterday, and yet it no longer works today! Designing for exponential growth is really hard: it is hard to predict what will go wrong or where the next set of bottlenecks will be.

5 Social Networks So what makes the work at Facebook challenging?

6 The social graph links everything
People are only one dimension of the social graph. Social applications link people to many types of data: photos, videos, music, blog posts, groups, events, organizations, and even other applications.

7 Scaling Social Networks
Much harder than typical websites, where:
Typically only 1-2% of users are online at a time: easy to cache the data
Partitioning & scaling are relatively easy
What do you do when everything is interconnected?
Facebook and social networks in general have a unique problem. The data is so connected that the only reasonable way to store it is essentially over a uniformly distributed set of data providers. It is difficult to segment our data in any meaningful way to reside on the same disks without duplicating it everywhere. It is also so frequently accessed that we simply can't hit the database for each access.

8 name, status, privacy, profile photo
[Diagram: dozens of graph nodes, each labeled "name, status, privacy, profile photo" or "name, status, privacy, video thumbnail"]
[Consequence of a Click -- II] When I click on a friend's photo, for example, a lot of things happen. The data is retrieved from the database and checked in real time for privacy, visibility, and other rules. Each time you click on a photo, status, name, friends, or friends of friends, this process takes place.

9 System Architecture

10 Architecture Load Balancer (assigns a web server)
Web Server (PHP assembles data)
Memcache (fast, simple)
Database (slow, persistent)
Kinda inaccurate: memcache and db are on the side, and there are lots of other services...

11 Memcache Simple in-memory hash table
Supports get, set, delete, multiget, multiset
Not a write-through cache
Pros and cons: the database shield! Low latency and very high request rates, but can be easy to corrupt and is inefficient for very small items.
In many ways, this is the heart or core of the site: 120 million queries/second! That is the equivalent of typing out over 50 volumes of the Encyclopedia Britannica in 1/10th of a second.
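A minimal in-process sketch of the interface described above: a plain hash table with get/set/delete/multiget. This is only an illustration of the semantics, not Facebook's client or the memcached wire protocol; note that because it is not write-through, the caller must invalidate keys when the database changes.

```python
class MiniCache:
    """Toy stand-in for a memcache node: a simple in-memory hash table."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        # A miss returns None; the caller falls back to the database.
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

    def multiget(self, keys):
        # One batched call instead of N separate round trips.
        return {k: self._data[k] for k in keys if k in self._data}


cache = MiniCache()
cache.set("user:42:name", "Alice")
cache.set("user:42:status", "online")
# The missing key is simply absent from the result:
print(cache.multiget(["user:42:name", "user:42:status", "user:42:photo"]))
```

The key names here are hypothetical; the point is that multiget amortizes network round trips across many small objects, which matters later in the clustering discussion.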

12 Memcache Optimization
Multithreading and efficient protocol code: 50k req/s
Polling network drivers: 150k req/s
Breaking up the stats lock: 200k req/s
Batching packet handling: 250k req/s
Breaking up the cache lock: future

13 Network Incast Memcache Memcache Memcache Memcache Switch
Many Small Get Requests PHP Client

14 Network Incast Memcache Memcache Memcache Memcache Switch
Many big data packets PHP Client

15 Network Incast Memcache Memcache Memcache Memcache Switch PHP Client

16 Network Incast Memcache Memcache Memcache Memcache Switch
- Implement flow control on the client over multiple UDP connections
- Aggressive timeouts
- Avoid blowing up memcache past a threshold
PHP Client
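A hedged sketch of the client-side mitigation the slide describes: rather than firing every small get at once (so that all the replies converge on the switch port simultaneously and overflow its buffer), the client limits how many requests are outstanding per round. The window size and the stand-in fetch function are illustrative assumptions, not Facebook's actual values.

```python
def windowed_multiget(fetch, keys, window=16):
    """Issue gets in bounded windows to avoid incast collapse.

    fetch(batch) -> dict mapping key -> value for that batch.
    """
    results = {}
    for i in range(0, len(keys), window):
        # Only `window` replies can be in flight at once, so the burst
        # of response packets arriving at the client's switch port is bounded.
        results.update(fetch(keys[i : i + window]))
    return results


# Usage with a hypothetical in-process store standing in for the servers:
store = {f"k{i}": i for i in range(100)}
out = windowed_multiget(lambda batch: {k: store[k] for k in batch}, list(store))
print(len(out))  # 100
```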

17 Memcache Clustering Many small objects per server
Many servers per large object
If objects are small, round trips dominate, so you want objects clustered. If objects are large, transfer time dominates, so you want objects distributed. In a web application you will almost always be dealing with small objects, and you can get into a situation where adding machines doesn't help scaling.

18 Memcache Clustering Memcache 10 Objects PHP Client

19 Memcache Clustering Memcache Memcache PHP Client 5 Objects 5 Objects
2 round trips total, 1 round trip per server

20 Memcache Clustering Memcache Memcache Memcache PHP Client 4 Objects
3 round trips total, 1 round trip per server
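The tradeoff in the clustering slides can be put as back-of-envelope arithmetic: fetching N objects spread over S servers in parallel costs roughly one round-trip time plus the per-server transfer time. For small objects the round trip dominates, so extra servers add requests without cutting latency; for large objects the transfer dominates and spreading helps. The numbers below are illustrative assumptions, not measurements.

```python
def fetch_latency_ms(n_objects, n_servers, rtt_ms, transfer_ms_per_object):
    """Idealized latency of one parallel multiget across n_servers."""
    objects_per_server = n_objects / n_servers
    # One round trip (all servers queried in parallel) + slowest transfer.
    return rtt_ms + objects_per_server * transfer_ms_per_object


# 10 small objects (0.01 ms each): going from 1 to 5 servers barely helps.
small_1 = fetch_latency_ms(10, 1, rtt_ms=0.5, transfer_ms_per_object=0.01)
small_5 = fetch_latency_ms(10, 5, rtt_ms=0.5, transfer_ms_per_object=0.01)
# 10 large objects (5 ms each): spreading across 5 servers helps a lot.
large_1 = fetch_latency_ms(10, 1, rtt_ms=0.5, transfer_ms_per_object=5.0)
large_5 = fetch_latency_ms(10, 5, rtt_ms=0.5, transfer_ms_per_object=5.0)
print(small_1, small_5, large_1, large_5)
```

This is why the slide warns that with small objects, adding machines doesn't help scaling: the fixed round-trip cost is replicated per server while the savings shrink.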

21 Memcache Pool Optimization
Currently a manual process Replication for obvious hot data sets Interesting problem: Optimize the allocation based on access patterns

22 Vertical Partitioning of Object Types
Specialized Replica 1 (Shard 1, Shard 2)
Specialized Replica 2 (Shard 1, Shard 2)
General pool with wide fanout (Shard 1, Shard 2, Shard 3, ..., Shard n)

23 MySQL has played a role from the beginning
Thousands of MySQL servers in two datacenters
Scribe
Today, our user database cluster is a large pool of independent MySQL servers. We have chosen a shared-nothing architecture, both to achieve scalability and for fault isolation. Battleship vs. army of foot soldiers.

24 MySQL Usage Pretty solid transactional persistent store
Logical migration of data is difficult
Logical-physical db mapping
Rarely use advanced query features
Performance: database resources are precious; web tier CPU is relatively cheap
Distributed data: no joins!
Sound administrative model

25 MySQL is better because it is Open Source
We can enhance or extend the database
...as we see fit
...when we see fit
Facebook extended MySQL to support distributed cache invalidation for memcache:
INSERT table_foo (a,b,c) VALUES (1,2,3) MEMCACHE_DIRTY key1,key2,...
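The slide shows the extended SQL syntax; the sketch below illustrates, on the application side, what coupling the write with its MEMCACHE_DIRTY key list accomplishes. The function and the dict-based stand-ins for the database and cache are hypothetical; the real extension carries the dirty keys with the statement so replicas can invalidate their local memcache tiers too.

```python
def insert_with_dirty(db, cache, table, row, dirty_keys):
    """Apply an INSERT and invalidate the cache keys named alongside it."""
    db.setdefault(table, []).append(row)  # the INSERT ... VALUES part
    for key in dirty_keys:                # the MEMCACHE_DIRTY key1,key2,... part
        # Delete rather than update: the next read takes a miss and
        # refills the cache from the database, so it can never go stale.
        cache.pop(key, None)


db, cache = {}, {"user:1:foo": "stale value"}
insert_with_dirty(db, cache, "table_foo", (1, 2, 3), ["user:1:foo"])
print(db, cache)
```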

26 Scaling across datacenters
West Coast: SF Web, SF Memcache, SC Web, SC Memcache, SC MySQL, Memcache Proxy
East Coast: VA Web, VA Memcache, VA MySQL, Memcache Proxy
MySQL replication between datacenters

27 Other Interesting Issues
Application level batching and parallelization Super hot data items Cachekey versioning with continuous availability

28 Photos

29 Photos + Social Graph = Awesome!

30 Photos: Scale 20 billion photos x4 = 80 billion
Would wrap around the world more than 10 times!
Over 40M new photos per day
600K photos / second

31 Photos Scaling - The easy wins
Upload tier: handles uploads, scales images, stores on NFS
Serving tier: images served from NFS via HTTP
However...
File systems are not good at supporting large numbers of files
Metadata is too large to fit in memory, causing too many I/Os for each file read
Limited by I/O, not storage density
Easy wins:
CDN
Cachr (HTTP server + caching)
NFS file handle cache
10 I/Os per read reduced to 3 I/Os with the easy wins

32 Photos: Haystack
Overlay file system
Index in memory
One I/O per read
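A hedged sketch of the Haystack idea on the slide: keep every photo's (offset, size) in an in-memory index over one large store file, so a read costs exactly one I/O and never touches filesystem metadata. The real system adds checksums, deletion flags, and compaction; this toy (with a BytesIO standing in for the on-disk store file) shows only the lookup path.

```python
import io


class MiniHaystack:
    def __init__(self):
        self.store = io.BytesIO()  # stands in for the single large store file
        self.index = {}            # photo_id -> (offset, size), kept in memory

    def write(self, photo_id, data):
        offset = self.store.seek(0, io.SEEK_END)  # append to the store file
        self.store.write(data)
        self.index[photo_id] = (offset, len(data))

    def read(self, photo_id):
        # The in-memory index replaces the metadata I/Os a filesystem
        # would need; only one read of the data itself remains.
        offset, size = self.index[photo_id]
        self.store.seek(offset)
        return self.store.read(size)


h = MiniHaystack()
h.write(1, b"jpeg-bytes-1")
h.write(2, b"jpeg-bytes-2")
print(h.read(1))  # b'jpeg-bytes-1'
```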

33 Data Warehousing

34 Data: How much? 200GB per day in March 2008
2+ TB (compressed) raw data per day in April 2009
4+ TB (compressed) raw data per day today

35 The Data Age Free or low cost of user services
Consumer behavior hard to predict Data and analysis are critical More data beats better algorithms

36 Deficiencies of existing technologies
Analysis/storage on proprietary systems too expensive Closed systems are hard to extend

37 Hadoop & Hive

38 Hadoop Superior availability/scalability/manageability despite lower single-node performance
Open system
Scalable costs
Cons: programmability and metadata
Map-reduce is hard to program (users know SQL/bash/python/perl)
Need to publish data in well-known schemas

39 Hive A system for managing and querying structured data built on top of Hadoop
Components:
Map-Reduce for execution
HDFS for storage
Metadata in an RDBMS

40 Hive: New Technology, Familiar Interface
hive> select key, count(1) from kv1 where key > 100 group by key;
vs.
$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*

41 Hive: Sample Applications
Reporting: e.g., daily/weekly aggregations of impression/click counts; measures of user engagement
Ad hoc analysis: e.g., how many group admins, broken down by state/country
Machine learning (assembling training data): e.g., ad optimization, modeling user engagement as a function of user attributes
Lots more

42 Hive: Server Infrastructure
4800 cores, storage capacity of 5.5 petabytes, 12 TB per node
Two-level network topology:
1 Gbit/sec from node to rack switch
4 Gbit/sec from rack switch to top-level switch

43 Hive & Hadoop: Usage Stats
4 TB of compressed new data added per day
135 TB of compressed data scanned per day
7500+ Hive jobs per day
80K compute hours per day
200 people run jobs on Hadoop/Hive
Analysts (non-engineers) use Hadoop through Hive
95% of jobs are Hive jobs

44 Hive: Technical Overview

45 Hive: Open and Extensible
Query your own formats and types with your own serializers/deserializers
Extend the SQL functionality through user-defined functions
Do any non-SQL transformations through the TRANSFORM operator, which sends data from Hive to any user program/script
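A hedged sketch of a script usable with Hive's TRANSFORM operator: Hive streams rows to the script's stdin as tab-separated lines and reads tab-separated lines back from stdout. The table, column layout, and transformation here (lower-casing a URL column) are hypothetical examples.

```python
import sys


def transform_line(line):
    """Lower-case the url column of one tab-separated input row."""
    user_id, url = line.rstrip("\n").split("\t")
    return f"{user_id}\t{url.lower()}"


def main():
    # When invoked by Hive, rows arrive on stdin and leave on stdout.
    for line in sys.stdin:
        print(transform_line(line))


# Demo on one row (in production, main() would loop over sys.stdin):
print(transform_line("7\tHTTP://Example.com/Index.HTML"))
```

In Hive it would be invoked roughly as `SELECT TRANSFORM(user_id, url) USING 'python transform.py' AS (user_id, url) FROM clicks;` (table and script names assumed for illustration).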

46 Hive: Smarter Execution Plans
Map-side Joins Predicate Pushdown Partition Pruning Hash based Aggregations Parallel execution of operator trees Intelligent Scheduling

47 Hive: Possible Future Optimizations
Pipelining? Finer operator control (controlling sorts) Cost based optimizations? HBase

48 Spikes: The Username Launch
The numbers presented earlier should give you a sense of the scale Facebook operates at and the challenges that scale poses. Facebook requires a giant infrastructure and also a very diverse array of components. The cool thing is that for every class you're into, we have the opportunity for you to dive into that field and be right at the forefront. And luckily, Facebook is also a place that is really a continuation of our studies: broad and challenging.

49 System Design Database tier cannot handle the load
Dedicated memcache tier for assigned usernames
Miss => available
Avoid database hits altogether
Blacklists: bucketize, local-tier cache timeout
Performance of the availability checker was one of the most critical parts of the system. This wasn't a big issue when we were going down the auction path, but it became critical once we decided to do first-come-first-served. Generating suggestions adds more load (perhaps 6-10 or more checks per page load). Usual caching doesn't help a lot, since there will be a lot of misses, and the DB tier cannot handle the load.
Constant refreshing was a concern, so we added a countdown to incentivize users to just watch. Upon countdown completion, don't auto-refresh: show a continue button. Disabled the chat bar. Lite include stack. Don't pull in your set of pages unnecessarily. Tradeoffs between UX (extra clicks) and performance.

50 Username Memcache Tier
Parallel pool in each data center
Writes replicated to all nodes
8 nodes per pool (UN0, UN1, ..., UN7)
Reads can go to any node (hashed by uid)
PHP Client
Replicated memcache tier: each key gets hashed to a number 0-7 (based on uid). But these are not db-backed keys, so sets need to be replicated.
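A sketch of the replicated pool described above: writes fan out to every node, and reads hash the uid to a single node. The 8-node count comes from the slide; the modulo hash and dict-based nodes are illustrative assumptions, not the production implementation.

```python
class ReplicatedPool:
    """Toy model of the username memcache tier: replicated sets, hashed gets."""

    def __init__(self, n_nodes=8):
        self.nodes = [{} for _ in range(n_nodes)]  # UN0 .. UN7

    def set(self, key, value):
        # Not db-backed, so every node must receive the write.
        for node in self.nodes:
            node[key] = value

    def get(self, key, uid):
        # Any single node can answer; pick one by hashing the uid.
        node = self.nodes[uid % len(self.nodes)]
        return node.get(key)


pool = ReplicatedPool()
pool.set("username:alice", "taken")
print(pool.get("username:alice", uid=12345))  # 'taken'
print(pool.get("username:bob", uid=3))        # None => available
```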

51 Write Optimization Hashout store
Distributed key-value store (MySQL backed)
Lockless (optimistic) concurrency control
The Hashout store holds the mapping from username to <alias fbid> (which has some more data). Optimistic concurrency means no locks are obtained; since conflict rates are low, it is a win.
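A hedged sketch of lockless (optimistic) concurrency control as the slide describes it: read the current version, do the work, then commit only if the version is unchanged, retrying on conflict. The store layout and function names are illustrative, not the actual Hashout schema.

```python
class OptimisticStore:
    def __init__(self):
        self._rows = {}  # key -> (version, value)

    def read(self, key):
        return self._rows.get(key, (0, None))

    def commit(self, key, expected_version, value):
        """Write only if nobody else committed since we read (no locks held)."""
        version, _ = self._rows.get(key, (0, None))
        if version != expected_version:
            return False  # conflict: the caller retries
        self._rows[key] = (version + 1, value)
        return True


def claim_username(store, username, fbid, retries=3):
    for _ in range(retries):
        version, owner = store.read(username)
        if owner is not None:
            return False  # already taken
        if store.commit(username, version, fbid):
            return True   # committed without ever locking
    return False


store = OptimisticStore()
print(claim_username(store, "alice", 42))  # True: claimed
print(claim_username(store, "alice", 99))  # False: already taken
```

Because username conflicts are rare, almost every commit succeeds on the first try, which is why the lockless scheme wins over pessimistic locking here.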

52 Fault Tolerance Memcache nodes can go down
Always check another node on a miss
Replay from a log file (Scribe)
Memcache sets are not guaranteed to succeed
Self-correcting code: write to memcache again if we detect the problem during db writes
One of the issues we worried about with non-db-backed keys, and with memcache as the system of record, was that memcache nodes can go down. Replicated pools provide some redundancy: on a miss, the replicated-pool infrastructure lets us check another node. We can replay Scribe logs to recover from more than one failure (this actually happened!). Memcache sets that fail mean a bad user experience: the write will not succeed, since the database will complain, but this is still not ideal. So if we find that a username is taken during the db-write stage, we call set again to write those usernames to the username memcache tier, preventing future users from seeing the same problem.

53 Nuclear Options Newsfeed Reduce number of stories
Turn off scrolling, highlights
Profile: make the info tab the default
Chat: reduce the buddy list refresh rate, or turn it off!
Even though we designed the system to be highly performant and tested it under a lot of load, there was a chance that if a really, really large number of people showed up at launch time, our system might not be able to handle it. So we thought about ways to reduce the load caused by other features on the site to provide extra capacity for the username launch. We affectionately called these the "nuclear options", since they were a last resort for handling the enormous load.

54 How much load? 200k in 3 min 1M in 1 hour 50M in first month
Prepared for over 10x!
The actual load was high, but not anywhere near what we were prepared for. This shows how careful design, planning, and testing can lead to a successful launch.

55 Some interesting problems
The spike!

56 Some interesting problems
Graph models and languages: low-latency fast access; slightly more expressive queries; consistency/staleness can be a bit loose; analysis over large data sets; privacy as part of the model
Fat data pipes: push enormous volumes of data to several third-party applications (e.g., the entire newsfeed to search partners), with controllable QoS

57 Some interesting problems (contd.)
Search relevance Storage systems Middle tier (cache) optimization Application data access language

58 Questions? Sum it up.

