Zookeeper at Facebook Vishal Kathuria.

Agenda
- Zookeeper use at Facebook
- Project Zeus – Goals
- Tao Design
- Tao Workload Simulator
- Early results of Zookeeper testing
- Zookeeper Improvements

Use Cases Inside Facebook
- HDFS
  - For location of the name node
  - Name node leader election (see the sketch below)
  - 75K temporary (permanent in future) clients
- HBase
  - For mapping of regions to region servers, location of the ROOT node
  - Region server failure detection and failover
  - After UDBs move to HBase, ~100K permanent clients
- Titan
  - Mapping of a user to a Prometheus web server within a cell
  - Leader election of the Prometheus web server
  - Future: selection of the HBase geo-cell
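
Several of these use cases boil down to the standard ZooKeeper leader-election recipe: each candidate creates an ephemeral sequential znode and the lowest sequence number wins. The sketch below is a minimal, generic version of that recipe in Java; the znode path "/election", the connect string, and the class name are placeholders, not the actual HDFS or HBase integration code.

```java
// Minimal sketch of ZooKeeper leader election with ephemeral sequential znodes.
// Paths and names here are hypothetical, chosen only for illustration.
import org.apache.zookeeper.*;
import java.util.Collections;
import java.util.List;

public class LeaderElection implements Watcher {
    private static final String ROOT = "/election";
    private final ZooKeeper zk;
    private String myNode;   // e.g. "n_0000000042"

    public LeaderElection(String connectString) throws Exception {
        zk = new ZooKeeper(connectString, 30_000, this);
    }

    public void volunteer() throws KeeperException, InterruptedException {
        if (zk.exists(ROOT, false) == null) {
            try {
                zk.create(ROOT, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } catch (KeeperException.NodeExistsException ignore) { /* raced with another client */ }
        }
        // Ephemeral: the znode disappears if this client's session dies, triggering failover.
        String path = zk.create(ROOT + "/n_", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        myNode = path.substring(path.lastIndexOf('/') + 1);
        checkLeadership();
    }

    private void checkLeadership() throws KeeperException, InterruptedException {
        List<String> children = zk.getChildren(ROOT, true);  // watch for membership changes
        Collections.sort(children);
        if (children.get(0).equals(myNode)) {
            System.out.println(myNode + " is the leader");
        } else {
            System.out.println(myNode + " is a follower; leader is " + children.get(0));
        }
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeChildrenChanged) {
            try {
                checkLeadership();   // re-evaluate whenever the member set changes
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
```

A production recipe would typically watch only the immediate predecessor znode rather than the whole child list, to avoid a herd effect when the leader changes.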

Use Cases (contd)
- Ads
  - Leader election
- Scribe
  - Leader election of Scribe aggregators
- Future customers
  - TAO sharding
  - MySQL
  - Search

Project Zeus
- "Make Zookeeper awesome"
  - Zookeeper works at Facebook scale
  - Zookeeper is one of the most reliable services at Facebook
- Solve pressing infrastructure problems using ZooKeeper
  - Shard Manager for Tao
  - Generic shard management capability in Tupperware
  - MySQL HA

Caveats
- Project is 5 weeks old
- Initial sharing of ideas with the community
- Ideas not yet vetted or proven through prototypes

Tao Design
- Shard Map
  - Based on ranges instead of consistent hashing (see the sketch below)
  - Stored in ZooKeeper
  - Accessed by clients using Aether
  - Populated by Eos
  - Dynamically updated based on load information
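
As a rough illustration of what "based on ranges instead of consistent hashing" means, the sketch below keeps the shard map as a sorted map from range start to owning server and resolves an id with a floor lookup. ShardMap and its method names are invented for this example; they are not TAO's or Aether's actual API.

```java
// Hypothetical range-based shard map: each entry maps the start of an id range
// to the server that owns it; an entry owns [start, nextStart).
import java.util.Map;
import java.util.TreeMap;

public class ShardMap {
    private final TreeMap<Long, String> ranges = new TreeMap<>();

    public void assign(long rangeStart, String server) {
        ranges.put(rangeStart, server);
    }

    public String lookup(long id) {
        Map.Entry<Long, String> e = ranges.floorEntry(id);   // largest start <= id
        if (e == null) {
            throw new IllegalStateException("id " + id + " precedes all ranges");
        }
        return e.getValue();
    }

    public static void main(String[] args) {
        ShardMap map = new ShardMap();
        map.assign(0L, "tao-server-1");
        map.assign(1_000_000L, "tao-server-2");
        System.out.println(map.lookup(42L));          // tao-server-1
        System.out.println(map.lookup(2_500_000L));   // tao-server-2
    }
}
```

One appeal of explicit ranges is that the shard manager can split or reassign a hot range and publish a new map, which lines up with the "dynamically updated based on load information" point above.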

Tao Projected Workload
- Scale requirements for a single cluster
  - 24,000 web machines (read-only clients)
  - 6,000 Tao server machines (read/write clients)
- About 20 clusters site-wide
- Shard Map is 2-3 MB of data

Tao Workload Simulator
- Clients
  - Read the shard map of the local cluster after connecting
  - Put a watch on the shard map (see the sketch below)
  - Refresh the shard map after the watch fires
- Follower servers
  - These servers are clients of the leader servers
  - Also read their own shard map
- Leader servers
  - Read their own shard map and those of all of their followers
- Shard Manager (Eos)
  - Periodically updates the shard map
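
The client half of the simulator follows the usual ZooKeeper read-and-watch pattern: fetch the shard map znode, leave a watch, and re-fetch when the watch fires. Below is a minimal sketch, assuming a single shard-map znode at a hypothetical path /tao/shardmap.

```java
// Sketch of a shard-map client: read the znode, leave a watch, and re-read
// whenever the watch fires. The znode path is an assumption for illustration.
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class ShardMapClient implements Watcher {
    private static final String SHARD_MAP = "/tao/shardmap";
    private final ZooKeeper zk;
    private volatile byte[] shardMap;

    public ShardMapClient(String connectString) throws Exception {
        zk = new ZooKeeper(connectString, 30_000, this);
        refresh();
    }

    private void refresh() throws KeeperException, InterruptedException {
        Stat stat = new Stat();
        // Passing "this" as the watcher re-arms the one-shot watch on every read.
        shardMap = zk.getData(SHARD_MAP, this, stat);
        System.out.println("shard map version " + stat.getVersion()
                + ", " + shardMap.length + " bytes");
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                refresh();   // the shard manager published a new map; fetch it
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
```

Note that ZooKeeper's default znode size limit (jute.maxbuffer) is about 1 MB, so a 2-3 MB shard map would presumably need that limit raised or the map split across several znodes.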

Hardware
- ZooKeeper: 3 node ensemble, 8 cores, 8 GB RAM
- Clients: 20 node cluster, web-class machines, 12 GB RAM

Scenario - Steady State
- Using a Zookeeper ensemble per cluster model
- Assumptions
  - 40K connections
  - Small number of clients joining/leaving at any time
  - Rare updates to the shard map – once every 10 minutes
- Result
  - Zookeeper worked well in this scenario

Scenario - Cluster Power Up/Down
- Cluster powering up
  - 25K clients simultaneously trying to connect (see the sketch below)
  - Slow response time: it took some clients 560s to connect and get data
- Cluster powering down
  - 25K clients simultaneously disconnect
  - System temporarily unresponsive
    - The disconnect requests filled Zookeeper's queues
    - The system would not accept any more new connections or requests
    - After a short time, the disconnect requests were processed and the system became responsive again
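
The power-up case can be approximated with a simple connection-storm driver: start many client handles at once, record how long each takes to report SyncConnected, then issue a first read. The sketch below is one such driver; the ensemble address, client count per process, and the znode it reads are all placeholders (the test described above spread roughly 25K clients across a 20-node cluster, not a single JVM).

```java
// Rough sketch of a connection-storm driver: open many ZooKeeper handles
// concurrently and time how long each takes to reach SyncConnected.
import org.apache.zookeeper.*;
import java.util.concurrent.*;

public class ConnectStorm {
    public static void main(String[] args) throws Exception {
        final String connect = "zk1:2181,zk2:2181,zk3:2181"; // placeholder ensemble
        final int clients = 1_000;                           // per-process share of the storm
        ExecutorService pool = Executors.newFixedThreadPool(200);
        CountDownLatch done = new CountDownLatch(clients);

        for (int i = 0; i < clients; i++) {
            pool.submit(() -> {
                long start = System.nanoTime();
                try {
                    CountDownLatch connected = new CountDownLatch(1);
                    ZooKeeper zk = new ZooKeeper(connect, 30_000, e -> {
                        if (e.getState() == Watcher.Event.KeeperState.SyncConnected) {
                            connected.countDown();
                        }
                    });
                    connected.await();
                    long ms = (System.nanoTime() - start) / 1_000_000;
                    System.out.println("connected in " + ms + " ms");
                    zk.getData("/tao/shardmap", false, null);  // first read, placeholder znode
                } catch (Exception e) {
                    e.printStackTrace();
                } finally {
                    done.countDown();
                }
            });
        }
        done.await();
        pool.shutdown();
    }
}
```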

Scenario – Zookeeper Node Failure
- Rolling restart of ZooKeeper nodes
- Startup/shutdown of the entire cluster
  - With active clients
  - Without active clients
- Result
  - No corruptions or system hangs noticed so far

Zookeeper Design
- Client connect/disconnect is a persisted update involving all nodes
- Ping and connection timeout handling is done by the leader for all connections
- A single thread handles both connect requests and data requests
- Zookeeper is implemented as a single-threaded pipeline, so all reads are serialized (benchmark sketch below)
  - Low read throughput
  - Uses only 3 cores at full load
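
Because one thread owns the whole request pipeline, read throughput tops out well before the hardware does. A client-side benchmark along the lines below would make that visible by hammering getData() from many threads and reporting aggregate reads per second; the connect string, znode path, and thread/duration settings are arbitrary test parameters, not the setup used at Facebook.

```java
// Sketch of a client-side read-throughput benchmark against one znode.
import org.apache.zookeeper.ZooKeeper;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

public class ReadBenchmark {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181", 30_000, event -> { });
        final String path = "/bench/data";   // assumed to exist already
        final int threads = 32;
        final long durationMs = 10_000;
        AtomicLong reads = new AtomicLong();

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long end = System.currentTimeMillis() + durationMs;
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                try {
                    while (System.currentTimeMillis() < end) {
                        zk.getData(path, false, null);   // synchronous read
                        reads.incrementAndGet();
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(durationMs + 5_000, TimeUnit.MILLISECONDS);
        System.out.println("reads/sec: " + reads.get() * 1000 / durationMs);
        zk.close();
    }
}
```

A single ZooKeeper handle also funnels its requests through one connection, so a more faithful benchmark would spread the load across many client handles and machines, as the workload simulator does.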

Zookeeper Improvement Ideas
- Non-persisted sessions with local session tracking (see the sketch below)
  - Hacked a prototype to test the potential
  - Initial test runs are very encouraging
- Dedicated connection creation thread
  - Prototyped; test runs in progress
- Multiple threads for deserializing incoming requests
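
The first idea replaces the quorum-replicated session bookkeeping with purely local tracking on the server a client is attached to. The class below is only a conceptual illustration of that bookkeeping (touch on connect and ping, drop on disconnect, periodic expiry sweep); it is not ZooKeeper's actual session tracker or the Facebook prototype.

```java
// Conceptual sketch of local session tracking: connect, ping, and disconnect
// update only this server's table, with no leader round-trip or quorum write.
import java.util.concurrent.ConcurrentHashMap;

public class LocalSessionTracker {
    private final ConcurrentHashMap<Long, Long> expiry = new ConcurrentHashMap<>();
    private final long sessionTimeoutMs;

    public LocalSessionTracker(long sessionTimeoutMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    // Called on connect and on every ping.
    public void touch(long sessionId) {
        expiry.put(sessionId, System.currentTimeMillis() + sessionTimeoutMs);
    }

    // Called on disconnect; again purely local.
    public void close(long sessionId) {
        expiry.remove(sessionId);
    }

    // Periodic sweep run by the owning server to drop dead sessions.
    public void expireStale() {
        long now = System.currentTimeMillis();
        expiry.entrySet().removeIf(e -> e.getValue() < now);
    }
}
```

The trade-off is that a purely local session is not visible to the rest of the ensemble, so anything that depends on global session state (ephemeral znodes, for example) still needs a globally persisted session; a similar "local sessions" feature later appeared upstream in ZooKeeper 3.5.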

Zookeeper Improvement Ideas (contd)
- Dedicated parallel pipeline for read-only clients