Scalable Data Management @ Facebook
Srinivas Narayanan
11/13/09
Scale
So what scale are we really talking about?
Over 300 million active users
>200 billion monthly page views
>3.9 trillion feed actions processed per day
100 million search queries per day
More than 2^32 photos; 2 billion pieces of content per week; 6 billion minutes on site per day
Over 1 million developers in 180 countries
#2 site on the Internet (time on site)
It is exciting and also humbling to be able to serve 300M users and so many operations. If Facebook were a country, we would be the 3rd largest after China and India. Our users upload an average of 500GB of structured data per day, in addition to about 2TB of photos and videos; the daily uploads would exceed 350 feature-length movies. As we will see during the talk, this kind of load is highly non-trivial. The challenge is to build systems that can support this scale and keep the site running 24 hours a day, each and every day.
Growth Rate
[Chart: active users reaching 300M in 2009]
People log in more than once a month :)
It took over four years to reach 100 million users, a level we achieved in September of 2008, and only one more year to reach 300 million. Technology adoption is accelerating: radio took 38 years to reach 50M users, TV took 13 years, computers took 4 years; Facebook took only 3 years to reach 50M active users.
Why is this important? The rate of growth matters because you have very little time to change things. You just re-built something yesterday and yet it no longer works today. Designing for exponential growth is really hard: it is hard to predict what will go wrong or where the next set of bottlenecks will be.
Social Networks
So what makes the work at Facebook challenging?
The social graph links everything
People are only one dimension of the social graph. Social applications link people to many types of data: photos, videos, music, blog posts, groups, events, organizations, and even other applications.
Scaling Social Networks
Much harder than for typical websites, where typically only 1-2% of users are online at a time (easy to cache the data) and partitioning and scaling are relatively easy. What do you do when everything is interconnected?
Facebook and social networks in general have a unique problem. The data is so connected that the only reasonable way to store it is essentially over a uniformly distributed set of data providers. It is difficult to segment our data in any meaningful way to reside on the same disks without duplicating it everywhere. It is also so frequently accessed that we simply can't hit the database for each access.
[Diagram: dozens of friend objects, each annotated with name, status, privacy, and a profile photo or video thumbnail]
Consequence of a Click: when I click on a friend's photo, for example, a lot of things happen. The data is retrieved from the database and checked in real time for privacy, visibility, and other rules. Each time you click on a photo, status, name, friends, or friends of friends, this process takes place.
System Architecture
Architecture
Load Balancer (assigns a web server)
Web Server (PHP assembles data)
Memcache (fast, simple)
Database (slow, persistent)
This picture is somewhat inaccurate: memcache and the database actually sit off to the side, and there are lots of other services.
Memcache
Simple in-memory hash table supporting get, set, delete, multiget, and multiset. Not a write-through cache.
Pros and cons: it is the database shield, with low latency and very high request rates, but it can be easy to corrupt and is inefficient for very small items.
In many ways this is the heart, or core, of the site: 120 million queries per second, the equivalent of typing out over 50 volumes of the Encyclopedia Britannica in a tenth of a second.
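Below is a minimal sketch of the read and write paths this slide implies, in Python against the python-memcached client. The key scheme and the db_lookup/db_update stand-ins are illustrative, not Facebook's actual code; the point is only that memcache is a demand-filled look-aside cache, not write-through.

import memcache

mc = memcache.Client(["127.0.0.1:11211"])  # address is illustrative

def db_lookup(user_id):
    return {"id": user_id, "name": "example"}   # stand-in for a MySQL SELECT

def db_update(user_id, fields):
    pass                                        # stand-in for a MySQL UPDATE

def get_user(user_id):
    key = "user:%d" % user_id
    user = mc.get(key)
    if user is None:                 # miss: fall through to the database
        user = db_lookup(user_id)
        mc.set(key, user)            # populate so the next read is a hit
    return user

def set_user(user_id, fields):
    db_update(user_id, fields)       # the write goes to MySQL first...
    mc.delete("user:%d" % user_id)   # ...then we invalidate (not write-through)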
Memcache Optimization
Multithreading and efficient protocol code: 50k req/s
Polling network drivers: 150k req/s
Breaking up the stats lock: 200k req/s
Batching packet handling: 250k req/s
Breaking up the cache lock: future work
Network Incast
[Diagram sequence: a PHP client sends many small get requests through a switch to several memcache servers; the servers all answer at once with many big data packets; the flood overruns the switch's buffers and replies are dropped.]
The fix: implement flow control on the client over multiple UDP connections, with aggressive timeouts, so we stop blowing up memcache past a threshold.
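A sketch of that client-side flow control, assuming a hypothetical udp_get() exchange (the real client multiplexes requests over several UDP connections): gets go out in small windows so that one window's replies fit in the switch buffer, and a reply that misses the aggressive deadline is simply treated as a cache miss.

WINDOW = 32      # max keys in flight per round; would be tuned empirically
TIMEOUT = 0.05   # aggressive timeout: a dropped reply becomes a miss

def udp_get(server, keys, timeout):
    # stand-in for one UDP request/response exchange; replies that miss
    # the deadline are dropped and reported as misses
    return {}

def windowed_multiget(server, keys):
    results = {}
    for i in range(0, len(keys), WINDOW):    # limit concurrent replies
        results.update(udp_get(server, keys[i:i + WINDOW], TIMEOUT))
    return results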
Memcache Clustering
Many small objects per server, or many servers per large object?
If objects are small, round trips dominate, so you want objects clustered. If objects are large, transfer time dominates, so you want objects distributed. In a web application you will almost always be dealing with small objects, and you can get into a situation where adding machines doesn't help scaling. (See the model sketched after the examples below.)
Memcache Clustering
[Diagram sequence: a PHP client fetching 10 objects from one memcache server takes 1 round trip; split 5 and 5 across two servers, 2 round trips total; spread across three servers, 3 round trips total. One round trip per server.]
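A back-of-the-envelope model of the tradeoff in the examples above, with assumed constants (0.5 ms in-datacenter round trips, roughly 1 Gbit/s of usable bandwidth) and the slides' one-round-trip-per-server cost:

RTT = 0.0005        # seconds per round trip (assumption)
BANDWIDTH = 125e6   # bytes per second, ~1 Gbit/s (assumption)

def fetch_time(servers, total_bytes):
    # one round trip per server, transfer parallelized across servers
    return servers * RTT + (total_bytes / servers) / BANDWIDTH

# 10 KB of small objects: round trips dominate, one server wins
print(fetch_time(1, 10e3), fetch_time(4, 10e3))   # ~0.6 ms vs ~2.0 ms
# a 10 MB object: transfer dominates, striping across four servers wins
print(fetch_time(1, 10e6), fetch_time(4, 10e6))   # ~80 ms vs ~22 ms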
Memcache Pool Optimization Currently a manual process Replication for obvious hot data sets Interesting problem: Optimize the allocation based on access patterns
Vertical Partitioning of Object Types
[Diagram: specialized replica 1 and specialized replica 2, each holding shards 1 and 2, next to a general pool with wide fanout across shards 1, 2, 3, ..., n]
MySQL
MySQL has played a role from the beginning: thousands of MySQL servers in two datacenters.
Today, our user database cluster is a large pool of independent MySQL servers. We have chosen a shared-nothing architecture to achieve both scalability and fault isolation. A battleship vs. an army of foot soldiers.
MySQL Usage
Pretty solid transactional persistent store
Logical migration of data is difficult: logical-to-physical db mapping
Rarely use advanced query features
Performance: database resources are precious; web tier CPU is relatively cheap
Distributed data: no joins! (see the sketch below)
Sound administrative model
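With no cross-server joins, the join moves into the (cheap) web tier: fetch the ids from one shard, then fan out for the rows. A sketch under a toy sharding scheme; shard_for() and query() are illustrative stand-ins, not real APIs.

def shard_for(user_id, num_shards=4):
    return user_id % num_shards      # toy logical-to-physical mapping

def query(shard, sql, args):
    return []                        # stand-in for a MySQL query on one shard

def friends_of(user_id):
    # step 1: the friend ids live on the user's own shard
    ids = query(shard_for(user_id),
                "SELECT friend_id FROM friends WHERE user_id = %s", [user_id])
    # step 2: group ids by shard, fetch each group with one query, and
    # stitch the rows together in the web tier instead of the database
    by_shard = {}
    for fid in ids:
        by_shard.setdefault(shard_for(fid), []).append(fid)
    rows = []
    for shard, fids in by_shard.items():
        rows += query(shard, "SELECT * FROM user WHERE id IN %s", [fids])
    return rows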
MySQL is better because it is Open Source: we can enhance or extend the database as we see fit, when we see fit.
Facebook extended MySQL to support distributed cache invalidation for memcache:
INSERT table_foo (a,b,c) VALUES (1,2,3) MEMCACHE_DIRTY key1,key2,...
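One plausible consumer of that annotation, sketched below: a process tailing the MySQL replication stream deletes the listed keys from its local memcache tier only after the write has landed there, so the cache cannot resurrect pre-write data. The statement parsing here is illustrative, not the real implementation.

import memcache

local_mc = memcache.Client(["127.0.0.1:11211"])   # this datacenter's tier

def on_replicated_statement(sql):
    # the tailer applies the statement first, then invalidates the keys
    if " MEMCACHE_DIRTY " in sql:
        key_list = sql.split(" MEMCACHE_DIRTY ", 1)[1]
        for key in key_list.split(","):
            local_mc.delete(key.strip())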
Scaling across datacenters
[Diagram: the West Coast has SF web + memcache and SC web + memcache + MySQL; the East Coast has VA web + memcache + MySQL; a memcache proxy sits on each side, and MySQL replication flows between the coasts.]
Other Interesting Issues
Application-level batching and parallelization
Super-hot data items
Cache-key versioning with continuous availability
Photos
Photos + Social Graph = Awesome!
Photos: Scale
20 billion photos, each stored at 4 sizes = 80 billion images. Printed, they would wrap around the world more than 10 times!
Over 40M new photos per day; 600K photos served per second.
Photos Scaling - The easy wins
Upload tier: handles uploads, scales images, stores them on NFS
Serving tier: images served from NFS via HTTP
However, file systems are not good at supporting large numbers of files: the metadata is too large to fit in memory, causing too many I/Os for each file read. We are limited by I/O, not storage density.
Easy wins: a CDN; Cachr (an HTTP server plus caching); an NFS file handle cache. Together these cut roughly 10 I/Os per image read down to about 3.
Photos: Haystack
Overlay file system with its index in memory: one I/O per read.
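A minimal haystack-style store, sketched in Python: blobs are appended to one big file and an in-memory index maps photo id to (offset, size), so serving a photo costs exactly one disk I/O instead of several metadata I/Os. This illustrates the idea only, not Haystack's actual on-disk format.

import os

class Haystack:
    def __init__(self, path):
        self.f = open(path, "a+b")
        self.index = {}                       # photo_id -> (offset, size)

    def write(self, photo_id, data):
        self.f.seek(0, os.SEEK_END)
        self.index[photo_id] = (self.f.tell(), len(data))
        self.f.write(data)                    # append-only data file
        self.f.flush()

    def read(self, photo_id):
        offset, size = self.index[photo_id]   # memory lookup, no disk I/O
        self.f.seek(offset)
        return self.f.read(size)              # the one I/O per read

store = Haystack("/tmp/haystack.dat")
store.write(42, b"...jpeg bytes...")
print(store.read(42))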
Data Warehousing
Data: How much?
200 GB per day in March 2008
2+ TB (compressed) of raw data per day in April 2009
4+ TB (compressed) of raw data per day today
The Data Age
User services are free or low cost
Consumer behavior is hard to predict
Data and analysis are critical
More data beats better algorithms
Deficiencies of existing technologies
Analysis/storage on proprietary systems is too expensive
Closed systems are hard to extend
Hadoop & Hive
Hadoop
Pros: superior availability, scalability, and manageability despite lower single-node performance; an open system; scalable costs
Cons: programmability and metadata. Map-reduce is hard to program (users know SQL/bash/Python/Perl), and data needs to be published in well-known schemas
Hive
A system for managing and querying structured data, built on top of Hadoop
Components: Map-Reduce for execution, HDFS for storage, metadata in an RDBMS
Hive: New Technology, Familiar Interface
hive> select key, count(1) from kv1 where key > 100 group by key;
vs.
$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar \
    -input /user/hive/warehouse/kv1 \
    -mapper map.sh -reducer reducer.sh \
    -file /tmp/map.sh -file /tmp/reducer.sh \
    -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*
Hive: Sample Applications
Reporting: e.g., daily/weekly aggregations of impression/click counts, measures of user engagement
Ad hoc analysis: e.g., how many group admins, broken down by state/country
Machine learning (assembling training data): e.g., ad optimization, user engagement as a function of user attributes
Lots more
Hive: Server Infrastructure
4800 cores; storage capacity of 5.5 petabytes at 12 TB per node (roughly 460 nodes)
Two-level network topology: 1 Gbit/sec from node to rack switch, 4 Gbit/sec from rack switch to the top level
Hive & Hadoop: Usage Stats
4 TB of compressed new data added per day
135 TB of compressed data scanned per day
7500+ Hive jobs per day; 80K compute hours per day
200 people run jobs on Hadoop/Hive; analysts (non-engineers) use Hadoop through Hive
95% of jobs are Hive jobs
Hive: Technical Overview
Hive: Open and Extensible
Query your own formats and types with your own serializers/deserializers
Extend the SQL functionality through user-defined functions
Do any non-SQL transformation through the TRANSFORM operator, which sends data from Hive to any user program or script (see the sketch below)
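As an illustration of TRANSFORM, here is a toy Python script that Hive could stream rows through: rows arrive on stdin as tab-separated text, and whatever the script prints becomes the output rows. The query in the comment and the bucketing logic are made up for the example.

#!/usr/bin/env python
# might be invoked with something like:
#   SELECT TRANSFORM(key, value) USING 'python bucket.py' AS (bucket, value)
#   FROM kv1;
import sys

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    bucket = int(key) // 100          # any non-SQL logic can go here
    print("%d\t%s" % (bucket, value))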
Hive: Smarter Execution Plans
Map-side joins
Predicate pushdown
Partition pruning
Hash-based aggregations
Parallel execution of operator trees
Intelligent scheduling
Hive: Possible Future Optimizations
Pipelining?
Finer operator control (controlling sorts)
Cost-based optimizations?
HBase
Spikes: The Username Launch
The numbers presented earlier should give you a sense of the scale Facebook operates at and the challenges it can throw up. Facebook requires a giant infrastructure and also a very diverse array of components. The cool thing is that whatever field you're into, we have the opportunity for you to dive into it and be right at the forefront. And luckily, Facebook is also a place that is really a continuation of your studies: broad and challenging.
System Design
The database tier cannot handle the load, so: a dedicated memcache tier for assigned usernames, where a miss means the name is available. This avoids database hits altogether. Blacklists: bucketize, with a local-tier cache timeout.
Performance of the availability checker was one of the most critical parts of the system. This wasn't a big issue when we were going down the auction path, but it became critical once we decided to do first-come-first-served. Generating suggestions adds more load (maybe 6-10 or more checks per page load), and the usual caching doesn't help much since there will be a lot of misses; the DB tier cannot handle the load.
Constant refreshing was a concern, so we added a countdown to incentivize users to just watch; upon countdown completion we don't auto-refresh, we show a continue button. We also disabled the chat bar, used a lite include stack, and didn't pull in the user's set of pages unnecessarily. Tradeoffs between UX (extra clicks) and performance.
Username Memcache Tier
A parallel pool in each data center, 8 nodes per pool. Writes are replicated to all nodes; reads can go to any node (hashed by uid).
[Diagram: PHP client in front of username memcache nodes UN0, UN1, ..., UN7]
Notes: each key gets hashed to a number 0-7 (based on uid?), but these are not db-backed keys, so sets need to be replicated.
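A sketch of that tier in Python, with illustrative addresses and key scheme: a claim is written to all eight nodes, a read goes to one node picked from the uid, and a miss means the username is still available.

import memcache

NODES = [memcache.Client(["10.0.1.%d:11211" % i]) for i in range(8)]

def record_assignment(username, uid):
    for node in NODES:                 # sets are replicated to every node
        node.set("un:" + username, uid)

def is_available(username, uid):
    node = NODES[uid % len(NODES)]     # reads are spread by uid
    return node.get("un:" + username) is None   # miss => available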
Write Optimization
Hashout store: a distributed key-value store (MySQL backed) with lockless (optimistic) concurrency control.
The hashout store holds the mapping from username to <alias, fbid> (which carries some more data). Optimistic concurrency means no locks are obtained, and since conflict rates are low, it is a win. (A sketch follows.)
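What lockless (optimistic) concurrency control amounts to here, sketched: no lock is taken before the write; a uniqueness constraint arbitrates the race, and the rare loser just sees a conflict. The table name and execute() helper are illustrative.

class DuplicateKeyError(Exception):
    pass

def execute(sql, *args):
    pass   # stand-in for a statement against the MySQL-backed hashout store

def try_claim(username, fbid):
    try:
        # attempt the insert with no lock; a unique key on `name` decides
        execute("INSERT INTO hashout (name, fbid) VALUES (%s, %s)",
                username, fbid)
        return True
    except DuplicateKeyError:
        return False   # cheap to lose, because conflicts are rare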
Fault Tolerance
Memcache nodes can go down: always check another node on a miss, and replay from a log file (Scribe) to recover.
Memcache sets are not guaranteed to succeed: self-correcting code writes to memcache again if we detect the conflict during database writes.
One of the issues we worried about with non-db-backed keys, with memcache as the system of record, was nodes going down. The replicated pool infrastructure gives some redundancy (on a miss it lets us check another node), and we can replay Scribe logs to recover from more than one failure (this actually happened!). And since memcache sets are not guaranteed to succeed, a user could see a username as available when it isn't; the write will still be rejected, since the database will complain, but that is a bad experience. So if we find that a username is taken during the db-write stage, we call set again to write it to the username memcache tier, and future users are prevented from seeing the same problem.
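Putting the pieces together, a sketch of that self-correcting write path, reusing try_claim() and record_assignment() from the sketches above; db_owner_of() is a hypothetical helper that looks up the real owner.

def db_owner_of(username):
    return 0   # stand-in: fetch the winning uid from the hashout store

def assign(username, uid):
    if try_claim(username, uid):                  # the DB is the arbiter
        record_assignment(username, uid)
        return True
    # memcache said available but the DB disagreed: repair the cache so
    # future users don't see the same stale answer
    record_assignment(username, db_owner_of(username))
    return False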
Nuclear Options
Newsfeed: reduce the number of stories; turn off scrolling and highlights
Profile: make the info tab the default
Chat: reduce the buddy-list refresh rate; turn it off!
Even though we designed the system to be highly performant and tested it under a lot of load, there was a chance that if a really, really large number of people showed up at launch time, our system might not be able to handle it. So we thought about ways to reduce the load caused by other features on the site to provide extra capacity for the username launch. We affectionately called these the "nuclear options," since they were a last resort for handling the enormous load.
How much load?
200k usernames in 3 minutes; 1M in the first hour; 50M in the first month. We were prepared for over 10x that.
The actual load was high, but not anywhere near what we were prepared for. This shows how careful design, planning, and testing can lead to a successful launch.
Some interesting problems The spike!
Some interesting problems
Graph models and languages: low-latency fast access; slightly more expressive queries; consistency and staleness can be a bit loose; analysis over large data sets; privacy as part of the model
Fat data pipes: push enormous volumes of data to several third-party applications (e.g., the entire newsfeed to search partners), with controllable QoS
Some interesting problems (contd.)
Search relevance
Storage systems
Middle-tier (cache) optimization
Application data access language
Questions? Sum it up.