(Private) Cloud Computing with Mesos at Twitter Benjamin

Slides:



Advertisements
Similar presentations
Cloud Service Models and Performance Ang Li 09/13/2010.
Advertisements

EHarmony in Cloud Subtitle Brian Ko. eHarmony Online subscription-based matchmaking service Available in United States, Canada, Australia and United Kingdom.
Cloud Computing Imranul Hoque. Today’s Cloud Computing.
INTRODUCTION TO CLOUD COMPUTING CS 595 LECTURE 6 2/13/2015.
Spark: Cluster Computing with Working Sets
Resource Management with YARN: YARN Past, Present and Future
 Need for a new processing platform (BigData)  Origin of Hadoop  What is Hadoop & what it is not ?  Hadoop architecture  Hadoop components (Common/HDFS/MapReduce)
Mesos A Platform for Fine-Grained Resource Sharing in Data Centers Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy.
Mesos A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy.
Mesos A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy.
Platform as a Service (PaaS)
Next Generation of Apache Hadoop MapReduce Arun C. Murthy - Hortonworks Founder and Architect Formerly Architect, MapReduce.
Plan Introduction What is Cloud Computing?
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
A Platform for Fine-Grained Resource Sharing in the Data Center
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
Cloud Computing Source:
A Brief Overview by Aditya Dutt March 18 th ’ Aditya Inc.
Cloud Computing for the Enterprise November 18th, This work is licensed under a Creative Commons.
Cloud Computing Saneel Bidaye uni-slb2181. What is Cloud Computing? Cloud Computing refers to both the applications delivered as services over the Internet.
MapReduce April 2012 Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, …
Cloud Computing. What is Cloud Computing? Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable.
Cloud Computing 1. Outline  Introduction  Evolution  Cloud architecture  Map reduce operation  Platform 2.
Map Reduce for data-intensive computing (Some of the content is adapted from the original authors’ talk at OSDI 04)
Software Architecture
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
EXPOSE GOOGLE APP ENGINE AS TASKTRACKER NODES AND DATA NODES.
Introduction to Apache Hadoop Zibo Wang. Introduction  What is Apache Hadoop?  Apache Hadoop is a software framework which provides open source libraries.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
f ACT s  Data intensive applications with Petabytes of data  Web pages billion web pages x 20KB = 400+ terabytes  One computer can read
Cloud Computing Dave Elliman 11/10/2015G53ELC 1. Source: NY Times (6/14/2006) The datacenter is the computer!
Mesos A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy.
Presented by: Katie Woods and Jordan Howell. * Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of.
Enterprise Cloud Computing
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters Nanyang Technological University Shanjiang Tang, Bu-Sung Lee, Bingsheng.
Web Technologies Lecture 13 Introduction to cloud computing.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
A Platform for Fine-Grained Resource Sharing in the Data Center
Information Systems in Organizations 5.2 Cloud Computing.
{ Tanya Chaturvedi MBA(ISM) Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
1 TCS Confidential. 2 Objective : In this session we will be able to learn:  What is Cloud Computing?  Characteristics  Cloud Flavors  Cloud Deployment.
Next Generation of Apache Hadoop MapReduce Owen
Cloud Computing ENG. YOUSSEF ABDELHAKIM. Agenda :  The definitions of Cloud Computing.  Examples of Cloud Computing.  Which companies are using Cloud.
KAASHIV INFOTECH – A SOFTWARE CUM RESEARCH COMPANY IN ELECTRONICS, ELECTRICAL, CIVIL AND MECHANICAL AREAS
© 2012 Eucalyptus Systems, Inc. Cloud Computing Introduction Eucalyptus Education Services 2.
By: Joel Dominic and Carroll Wongchote 4/18/2012.
1 Student Date Time Wei Li Nov 30, 2015 Monday 9:00-9:25am Shubbhi Taneja Nov 30, 2015 Monday9:25-9:50am Rodrigo Sanandan Dec 2, 2015 Wednesday9:00-9:25am.
BIG DATA/ Hadoop Interview Questions.
What is Cloud Computing 1. Cloud computing is a service that helps you to perform the tasks over the Internet. The users can access resources as they.
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
Amazon Web Services. Amazon Web Services (AWS) - robust, scalable and affordable infrastructure for cloud computing. This session is about:
Agenda  What is Cloud Computing?  Milestone of Cloud Computing  Common Attributes of Cloud Computing  Cloud Service Layers  Cloud Implementation.
Containers as a Service with Docker to Extend an Open Platform
Scalable containers with Apache Mesos and DC/OS
Introduction to Distributed Platforms
Cloud computing-The Future Technologies
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
Apache Hadoop YARN: Yet Another Resource Manager
INDIGO – DataCloud PaaS
Overview Introduction VPS Understanding VPS Architecture
湖南大学-信息科学与工程学院-计算机与科学系
INFO 344 Web Tools And Development
Introduction to Apache
Introduction Apache Mesos is a type of open source software that is used to manage the computer clusters. This type of software has been developed by the.
Lecture 16 (Intro to MapReduce and Hadoop)
Cloud Computing Large-scale Resource Management
Big-Data Analytics with Azure HDInsight
Presentation transcript:

(Private) Cloud Computing with Mesos at Twitter Benjamin

what is cloud computing? self-service scalable economic elastic virtualized managed utility pay-as-you-go

what is cloud computing? “cloud” refers to large Internet services running on 10,000s of machines (Amazon, Google, Microsoft, etc) “cloud computing” refers to services by these companies that let external customers rent cycles and storage – Amazon EC2: virtual machines at 8.5¢/hour, billed hourly – Amazon S3: storage at 15¢/GB/month – Google AppEngine: free up to a certain quota – Windows Azure: higher-level than EC2, applications use API

what is cloud computing? cheap nodes, commodity networking self-service (use personal credit card) and pay-as- you-go virtualization – from co-location, to hosting providers running the web server, the database, etc and having you just FTP your files … now you do all that yourself again! economic incentives – provider: sell unused resources – customer: no upfront capital costs building data centers, buying servers, etc

“cloud computing” infinite scale …

“cloud computing” always available …

challenges in the cloud environment cheap nodes fail, especially when you have many – mean time between failures for 1 node = 3 years – mean time between failures for 1000 nodes = 1 day – solution: new programming models (especially those where you can efficiently “build-in” fault-tolerance) commodity network = low bandwidth – solution: push computation to the data

moving target infrastructure as a service (virtual machines)  software/platforms as a service why? programming with failures is hard managing lots of machines is hard

moving target infrastructure as a service (virtual machines)  software/platforms as a service why? programming with failures is hard managing lots of machines is hard

programming with failures is hard analogy: concurrency/parallelism – imagine programming with threads that randomly stop executing – can you reliably detect and differentiate failures? analogy: synchronization – imagine programming where communicating between threads might fail (or worse, take a very long time) – how might you change your code?

problem: distributed systems are hard

solution: abstractions (higher-level frameworks)

MapReduce Restricted data-parallel programming model for clusters (automatic fault- tolerance) Pioneered by Google – Processes 20 PB of data per day Popularized by Apache Hadoop project – Used by Yahoo!, Facebook, Twitter, …

beyond MapReduce many other frameworks follow MapReduce’s example of restricting the programming model for efficient execution on clusters – Dryad (Microsoft): general DAG of tasks – Pregel (Google): bulk synchronous processing – Percolator (Google): incremental computation – S4 (Yahoo!): streaming computation – Piccolo (NYU): shared in-memory state – DryadLINQ (Microsoft): language integration – Spark (Berkeley): resilient distributed datasets

everything else web servers (apache, nginx, etc) application servers (rails) databases and key-value stores (mysql, cassandra) caches (memcached) all our own twitter specific services …

managing lots of machines is hard getting efficient use of out a machine is non-trivial (even if you’re using virtual machines, you still want to get as much performance as possible) nginx Hadoo p nginx Hadoo p

managing lots of machines is hard getting efficient use of out a machine is non-trivial (even if you’re using virtual machines, you still want to get as much performance as possible) nginx Hadoo p

problem: lots of frameworks and services … how should we allocate resources (i.e., parts of a machine) to each?

idea: can we treat the datacenter as one big computer and multiplex applications and services across available machine resources?

solution: mesos common resource sharing layer – abstracts resources for frameworks nginx Hadoo p Mesos nginx Hadoo p uniprograming multiprograming

twitter and the cloud owns private datacenters (not a consumer) – commodity machines, commodity networks not selling excess capacity to third parties (not a provider) has lots of services (especially new ones) has lots of programmers wants to reduce CAPEX and OPEX

twitter and mesos use mesos to get cloud like properties from datacenter (private cloud) to enable “self-service” for engineers (but without virtual machines)

computation model: frameworks A framework (e.g., Hadoop, MPI) manages one or more jobs in a computer cluster A job consists of one or more tasks A task (e.g., map, reduce) is implemented by one or more processes running on a single machine 23 cluster Framework Scheduler (e.g., Job Tracker) Executor (e.g., Task Tracker) Executor (e.g., Task Traker) Executor (e.g., Task Tracker) Executor (e.g., Task Tracker) task 1 task 5 task 3 task 7task 4 task 2 task 6 Job 1: tasks 1, 2, 3, 4 Job 2: tasks 5, 6, 7

two-level scheduling Advantages: – Simple  easier to scale and make resilient – Easy to port existing frameworks, support new ones Disadvantages: – Distributed scheduling decision  not optimal 24 Mesos Master Organization policies Resource availability Framework scheduler Task schedule Fwk schedule Framework scheduler Framework Scheduler

resource offers Unit of allocation: resource offer – Vector of available resources on a node – E.g., node1:, node2: Master sends resource offers to frameworks Frameworks select which offers to accept and which tasks to run 25 Push task scheduling to frameworks

Hadoop JobTracker MPI JobTracker 8CPU, 8GB Hadoop Executor MPI executor task 1 8CPU, 16GB 16CPU, 16GB Hadoop Executor task 2 Allocation Module S1 S2 S3 S1 S2 S1 Mesos Architecture: Example 26 (S1:, S2: ) (task1:[S1: ] ; task2:[S2: ]) S1: S2: S3: Slaves continuously send status updates about resources Pluggable scheduler to pick framework to send an offer to Framework scheduler selects resources and provides tasks Framework executors launch tasks and may persist across tasks task 1: task 2: (S1:, S3: ) ([task1:S1:<4CPU,2GB ]) Slave S1 Slave S2 Slave S3 Mesos Master task1:

twitter applications/services if you build it … they will come let’s build a url shortner (t.co)!

development lifecycle 1. gather requirements 2. write a bullet-proof service (server) ‣ load test ‣ capacity plan ‣ allocate & configure machines ‣ package artifacts ‣ write deploy scripts ‣ setup monitoring ‣ other boring stuff (e.g., sarbanes-oxley) 3. resume reading timeline (waiting for machines to get allocated)

development lifecycle with mesos 1. gather requirements 2. write a bullet-proof service (server) ‣ load test ‣ capacity plan ‣ allocate & configure machines ‣ package artifacts ‣ write deploy configuration scripts ‣ setup monitoring ‣ other boring stuff (e.g., sarbanes-oxley) 3. resume reading timeline

t.co launch on mesos! CRUD via command line: $ scheduler create t_co t_co.mesos Creating job t_co OK (4 tasks pending for job t_co)

t.co launch on mesos! CRUD via command line: $ scheduler create t_co t_co.mesos Creating job t_co OK (4 tasks pending for job t_co) tasks represent shards

t.co cluster Scheduler Executor task 1 task 5 task 3 task 7task 4 task 2 task 6 $ scheduler create t_co t_co.mesos

t.co is it running? (“top” via a browser)

what it means for devs? write your service to be run anywhere in the cluster anticipate ‘kill -9’ treat local disk like /tmp

bad practices avoided machines fail; force programmers to focus on shared-nothing (stateless) service shards and clusters, not machines – hard-coded machine names (IPs) considered harmful – manually installed packages/files considered harmful – using the local filesystem for persistent data considered harmful

level of indirection #ftw nginx t.co Need replace server!

level of indirection #ftw nginx t.co Need replace server!

level of indirection #ftw example from operating systems?

isolation Executor task 1 task 5 what happens when task 5 executes: while (true) {}

isolation leverage linux kernel containers task 1 (t.co) task 2 (nginx) CPU RAM container 1container 2

software dependencies 1.package everything into a single artifact 2.download it when you run your task (might be a bit expensive for some services, working on next generation solution)

t.co + malware what if a user clicks a link that takes them some place bad? let’s check for malware!

t.co + malware a malware service already exists … but how do we use it? cluster Scheduler Executor task 1 task 5 task 3 task 1task 4 task 2 task 6

t.co + malware a malware service already exists … but how do we use it? cluster Scheduler Executor task 1 task 5 task 3 task 1task 4 task 2 task 6

t.co + malware a malware service already exists … but how do we use it? cluster Scheduler Executor task 1 task 5 task 3 task 1task 4 task 2 task 6 how do we name the malware service?

naming part 1 ‣ service discovery via ZooKeeper ‣ zookeeper.apache.org ‣ servers register, clients discover ‣ we have a Java library for this ‣ twitter.github.com/commons

naming part 2 ‣ naïve clients via proxy

naming PIDs /var/local/myapp/pid

t.co + malware okay, now for a redeploy! (CRUD) $ scheduler update t_co t_co.config Updating job t_co Restarting shards... Getting status... Failed Shards = []...

rolling updates … Updater {Job, Update Config} Restart Shards Healthy? Rollback Complete? Finish {Success, Failure} Yes No

datacenter operating system Mesos + Twitter specific scheduler + service proxy (naming) + updater + dependency manager datacenter operating system (private cloud)

Thanks!