Computing & Information Sciences, Kansas State University. Kansas State University Olathe Workshop on Big Data – August, 2014. KSU Laboratory for Knowledge Discovery in Databases.

Presentation transcript:

Big Data Workshop: Day 2, Part III – Hadoop Ecosystem & Tools
Kansas State University Olathe Workshop on Big Data – August, 2014
Kansas State University Olathe, Thursday 14 August 2014
William H. Hsu, Laboratory for Knowledge Discovery in Databases, Kansas State University
Computing & Information Sciences, Kansas State University
Acknowledgements
K-State Manhattan: Majed Alsadhan, Scott Finkeldei, Kyle Hudson, Surya Teja Kallumadi
K-State Olathe: Dr. Prema Arasu, Dana Reinert, Paige Adams, Cathy Danahy, Angela Cummins, Emily Surdez, Quentin New, Amy Burgess

Workshop Overview: Topics Covered
Day 1: Overview & Tutorial
- Survey of Big Data: Data, Tools, Methods, & Applications
- Tutorial on MapReduce Algorithms & Tools
Day 2: Hands-On Tutorial & Real-World Examples
- Hadoop Stack in Detail: Hive, Pig, Solr & Lucene, Mesos
- Other Tools & Platforms: Scala, Python
Day 3: Data Mining & Visualization
- More Tools: Spark, Machine Learning (Mahout & Oryx)
- Graphs (Neo4j), Data/Info Visualization (Tableau)

Workshop Overview: Goals
Day 1: Survey Real-World Applications & Methods
- Present Apache Hadoop Stack & Its Uses
- Introduce MapReduce using Hands-On Examples
Day 2: Delve into MapReduce Framework & Hadoop
- Understand Scalding, Python Streaming
- Go Over Basic Common Patterns, Dissect Code
Day 3: Review Tools & State of the Field
- Look at Data Mining & Visualization: Tasks, Methods
- Current Research and Development in Data Science

Review - Hadoop Stack: High-Level Overview (2011)
Figure © 2011, R. Kalakota

Clemens Neudecker KB National Library of the Netherlands SCAPE & OPF Hackathon Vienna, 2 dec 2013 What is Hadoop? Hadoop Driven Digital Preservation

Timeline
Dec 2004: Dean/Ghemawat (Google) MapReduce paper
2005: Doug Cutting and Mike Cafarella (Yahoo) create Hadoop, at first only to extend Nutch (the name is derived from Doug’s son’s toy elephant)
2006: Yahoo runs Hadoop on 5-20 nodes
This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐ (Grant Agreement number ).

Timeline
March 2008: Cloudera founded
July 2008: Hadoop wins TeraByte sort benchmark (1st time a Java program won this competition)
April 2009: Amazon introduces “Elastic MapReduce” as a service on S3/EC2
June 2011: Hortonworks founded

Timeline
27 Dec 2011: Apache Hadoop release
June 2012: Facebook claims “biggest Hadoop cluster”, totalling more than 100 PetaBytes in HDFS
2013: Yahoo runs Hadoop on 42,000 nodes, computing about 500,000 MapReduce jobs per day
15 Oct 2013: Apache Hadoop release (YARN)

Contributions (Cf.

“Core” Hadoop
Hadoop Common (formerly Hadoop Core)
Hadoop MapReduce
Hadoop YARN (MapReduce 2.0)
Hadoop Distributed File System (HDFS)

The wider Hadoop Ecosystem
Ambari, Zookeeper (managing & monitoring)
HBase, Cassandra (database)
Hive, Pig (data warehouse and query language)
Mahout (machine learning)
Chukwa, Avro, Oozie, Giraph, and many more

The wider Hadoop Ecosystem
collins-charles-zedlewski-cloudera

“Hadoop is a hammer”
“Hadoop is a hammer. Start by figuring out what house you’re gonna build.” (Alistair Croll)
“If all you have is a hammer, throw away everything that is not a nail!” (Jimmy Lin)

MapReduce in 41 words (including “library”)
Goal: count the number of books in the library.
Map: You count up shelf #1, I count up shelf #2. (The more people we get, the faster this part goes.)
Reduce: We all get together and add up our individual counts.
(Cf.
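The library analogy can be written out as a small program. The following is a minimal sketch, not Hadoop code: it assumes the whole library fits in memory, uses a local thread pool to stand in for the "people" counting shelves, and the names `count_shelf`, `add_counts`, `count_books` and the sample shelves are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_shelf(shelf):
    """Map step: one worker counts the books on one shelf."""
    return len(shelf)


def add_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def count_books(library, workers=2):
    """Count all books: map over the shelves in parallel, then reduce."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_shelf = list(pool.map(count_shelf, library))
    return reduce(add_counts, per_shelf, 0)


# Hypothetical library: each inner list is one shelf of book titles.
library = [
    ["Dune", "Emma"],
    ["Ulysses"],
    ["Walden", "Beloved", "Hamlet"],
]
total = count_books(library)
print(total)
```

Adding workers speeds up only the map step; the reduce step still has to see every partial count.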

MapReduce in a nutshell
[Figure: Task 1, Task 2 and Task 3 each produce output data, which is combined into an aggregated result. © Sven Schlarb]

MapReduce “v1” issues
JobTracker as a single point of failure
Deficiencies in scalability, memory consumption, threading model, reliability and performance (
Aim to support programming paradigms other than MapReduce (BSP)

MapReduce vs YARN (Cf.

When to use Hadoop?
Generally, whenever “standard tools” no longer work because of sheer data size (rule of thumb: if your data fits on a regular hard drive, you’re better off sticking to Python/SQL/Bash/etc.!)
Aggregation across large data sets: use the power of Reducers!
Large-scale ETL operations (extract, transform, load)

Reading
Tom White: Hadoop. The Definitive Guide (get 3rd ed. for extra YARN chapter)
YARN explained (really quite well):
Jimmy Lin: Text Processing with MapReduce:

Happy Hadooping!

Lucene/Solr Architecture
[Figure: Lucene/Solr architecture diagram. Labels: Request Handlers (/select, /spell, /admin); Update Handlers (XML, CSV, binary); Response Writers (XML, binary, JSON); Data Import Handler (SQL/RSS); Extracting Request Handler (PDF/Word); Apache Tika; Search Components (Query, Spelling, Faceting, Highlighting, Statistics, Debug, More like this, Clustering, Distributed Search); Update Processors (Signature, Logging, Indexing); Query Parsing; Filtering; Caching; Schema; Config; Index Replication; Core Search (IndexReader/Searcher); Indexing (IndexWriter); Text Analysis; Apache Lucene]

Lucene/Solr plugins
RequestHandlers: handle a request at a URL like /select
SearchComponents: part of a SearchHandler, a componentized request handler
- Includes Query, Facet, Highlight, Debug, Stats
- Distributed Search capable
UpdateHandlers: handle an indexing request
Update Processor Chains: per-handler componentized chain that handles updates
Query Parser plugins
- Mix and match query types in a single request
- Function plugins for Function Query
Text Analysis plugins: Analyzers, Tokenizers, TokenFilters
ResponseWriters: serialize & stream response to client

Lucene/Solr Query Plugin Architecture
[Figure: query plugin architecture, configured through schema.xml and solrconfig.xml]
QParsers shown: Lucene QParser, DisMax QParser, Function QParser, Function Range Q, XML QParser, MyCustom QParser
Functions shown for Function QParser: sqrt, sum, pow, max, log, custom
Configuration fragments shown:
<parser name=“mycustom” …
<func name=“custom” class=…
// declaratively defines types
// and analyzers for fields
<field name=“title” type=“text1”
<field name=“cust1” class=…
Analyzer for “title”: Whitespace Tokenizer, CustomFilter, SynonymFilter, Porter Stemmer
Analyzer for “cust1” (potentially completely custom architecture not using tokenizer/filters)
Declarative Analysis per-field
- Tokenizer to split text
- TokenFilter to transform tokens
- Analyzer for completely custom
- Separate query / index analyzer
QParser plugins
- Support different query syntaxes
- Support different query execution
- Function Query supports pluggable custom functions
- Excellent support for nesting/mixing different query types in the same request.
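The per-field analysis idea (a tokenizer splits text, then token filters transform the tokens in order) can be illustrated outside Solr. This is a rough sketch in plain Python and does not use the Lucene or Solr APIs; the function names, the lowercase filter and the synonym table are assumptions chosen for the example.

```python
def whitespace_tokenizer(text):
    """Tokenizer: split raw text into tokens on whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """TokenFilter (illustrative, not on the slide): lowercase every token."""
    return [token.lower() for token in tokens]


def synonym_filter(synonyms):
    """Build a TokenFilter that replaces tokens found in a synonym table."""
    def apply(tokens):
        return [synonyms.get(token, token) for token in tokens]
    return apply


def make_analyzer(tokenizer, *filters):
    """Analyzer: run the tokenizer, then each filter in the given order."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze


# One analyzer per field, mirroring the "declarative analysis per-field" idea.
analyzers = {
    "title": make_analyzer(
        whitespace_tokenizer,
        lowercase_filter,
        synonym_filter({"tv": "television"}),  # hypothetical synonym table
    ),
}
tokens = analyzers["title"]("Big TV Sale")
print(tokens)
```

Filter order matters here: the synonym lookup only matches because lowercasing runs first.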

Lucene/Solr Request Plugins
[Figure: request flow from query to response. Request handlers: /select RequestHandler (Query Component, Facet Component, Highlight Component, Debug Component, Distributed Search); Request Handler (non-component based) at /admin/luke; Request Handler (custom) at /mypath. Additional plug-n-play search components: MoreLikeThis, Statistics, Terms, Spellcheck, TermVector, QueryElevation, Clustering, My Custom. Response writers: XML, XSLT, JSON, Binary, Custom. Sample response: {“response”={ “docs”={]

Lucene/Solr Indexing
[Figure: indexing paths into the Lucene index. HTTP POST: XML Update Handler (/update), CSV Update Handler (/update/csv), XML Update with custom processor chain (/update/xml), Extracting RequestHandler for PDF, Word, … (/update/extract). Pull: Data Import Handler (database pull from SQL DB, RSS pull from RSS feed, simple transforms). Update Processor Chain (per handler): Remove Duplicates processor, Logging processor, Custom Transform processor, Index processor. Text Index Analyzers, Lucene Index.]

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy Katz, Scott Shenker, Ion Stoica
University of California, Berkeley

Background
Rapid innovation in cluster computing frameworks
[Figure: framework names and logos, including Pig, Dryad, Pregel, Percolator, CIEL]

Problem
Rapid innovation in cluster computing frameworks
No single framework optimal for all applications
Want to run multiple frameworks in a single cluster
- …to maximize utilization
- …to share data between frameworks

Where We Want to Go
[Figure: Hadoop, Pregel and MPI on a shared cluster. Today: static partitioning. Mesos: dynamic sharing.]

Solution
Mesos is a common resource sharing layer over which diverse frameworks can run
[Figure: Hadoop, Pregel, … running on top of Mesos on shared nodes, compared with separate Hadoop nodes and Pregel nodes]

Other Benefits of Mesos
Run multiple instances of the same framework
- Isolate production and experimental jobs
- Run multiple versions of the framework concurrently
Build specialized frameworks targeting particular problem domains
- Better performance than general-purpose abstractions

Outline
Mesos Goals and Architecture
Implementation
Results
Related Work

Mesos Goals
High utilization of resources
Support diverse frameworks (current & future)
Scalability to 10,000’s of nodes
Reliability in face of failures
Resulting design: Small microkernel-like core that pushes scheduling logic to frameworks

Design Elements
Fine-grained sharing:
- Allocation at the level of tasks within a job
- Improves utilization, latency, and data locality
Resource offers:
- Simple, scalable application-controlled scheduling mechanism

Element 1: Fine-Grained Sharing
[Figure: Framework 1, Framework 2 and Framework 3 over a storage system (e.g. HDFS). Coarse-grained sharing (HPC) compared with fine-grained sharing (Mesos), where tasks of Fw. 1, Fw. 2 and Fw. 3 are interleaved across nodes.]
+ Improved utilization, responsiveness, data locality

Element 2: Resource Offers
Option: Global scheduler
- Frameworks express needs in a specification language, global scheduler matches them to resources
+ Can make optimal decisions
– Complex: language must support all framework needs
– Difficult to scale and to make robust
– Future frameworks may have unanticipated needs

Element 2: Resource Offers
Mesos: Resource offers
- Offer available resources to frameworks, let them pick which resources to use and which tasks to launch
+ Keeps Mesos simple, lets it support future frameworks
– Decentralized decisions might not be optimal

Mesos Architecture
[Figure: an MPI job with its MPI scheduler and a Hadoop job with its Hadoop scheduler talk to the Mesos master, which contains the allocation module. Mesos slaves run MPI executors, which run tasks. The master picks a framework to offer resources to and sends it a resource offer.]

Mesos Architecture
[Figure: same diagram as the previous slide, with the resource offer annotated.]
Resource offer = list of (node, availableResources)
E.g. { (node1, ), (node2, ) }

Mesos Architecture
[Figure: same diagram, extended with a Hadoop executor on a Mesos slave. Annotations: the master picks a framework to offer resources to; the framework scheduler performs framework-specific scheduling on the resource offer; the slave launches and isolates executors, which run tasks.]
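The offer cycle in these diagrams can be sketched as a toy simulation. This is not the Mesos implementation or its API: the classes `Task`, `Framework` and `Master`, the resource keys `cpus` and `mem_gb`, and the policy of offering to frameworks in list order are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    cpus: int
    mem_gb: int


@dataclass
class Framework:
    """Framework scheduler: decides which offered resources to use."""
    name: str
    pending: list = field(default_factory=list)

    def resource_offer(self, offer):
        """Pick tasks that fit; unused resources are implicitly declined.

        offer is a list of (node, {"cpus": int, "mem_gb": int}) pairs.
        Returns a list of (node, task) launches.
        """
        launches = []
        for node, free in offer:
            free = dict(free)  # local copy of what this node still has
            for task in list(self.pending):
                if task.cpus <= free["cpus"] and task.mem_gb <= free["mem_gb"]:
                    free["cpus"] -= task.cpus
                    free["mem_gb"] -= task.mem_gb
                    self.pending.remove(task)
                    launches.append((node, task))
        return launches


class Master:
    """Tracks free resources per node and hands out offers."""

    def __init__(self, nodes):
        self.free = {node: dict(res) for node, res in nodes.items()}

    def offer_round(self, frameworks):
        """Offer all free resources to each framework in turn (assumed policy)."""
        placed = []
        for framework in frameworks:
            offer = [(node, dict(res)) for node, res in self.free.items()]
            for node, task in framework.resource_offer(offer):
                self.free[node]["cpus"] -= task.cpus
                self.free[node]["mem_gb"] -= task.mem_gb
                placed.append((framework.name, node, task.name))
        return placed


master = Master({"node1": {"cpus": 4, "mem_gb": 8},
                 "node2": {"cpus": 2, "mem_gb": 4}})
hadoop = Framework("hadoop", [Task("map-0", 2, 4), Task("map-1", 2, 4)])
mpi = Framework("mpi", [Task("rank-0", 2, 4), Task("rank-1", 2, 4)])
placed = master.offer_round([hadoop, mpi])
print(placed)
```

The master never inspects a framework's task queue; it only learns which offered resources were used. That is the sense in which scheduling logic is pushed to the frameworks.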

Optimization: Filters
Let frameworks short-circuit rejection by providing a predicate on resources to be offered
- E.g. “nodes from list L” or “nodes with > 8 GB RAM”
- Could generalize to other hints as well
Ability to reject still ensures correctness when needs cannot be expressed using filters
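The two example filters on this slide can be expressed as predicates over (node, resources) pairs. The sketch below is illustrative only: `nodes_from_list`, `min_mem_gb` and `apply_filter` are hypothetical helpers, not Mesos functions, and the offer contents are made up.

```python
def nodes_from_list(allowed):
    """Filter: only nodes named in the given collection ("nodes from list L")."""
    return lambda node, resources: node in allowed


def min_mem_gb(threshold):
    """Filter: only nodes with more than `threshold` GB of RAM available."""
    return lambda node, resources: resources["mem_gb"] > threshold


def apply_filter(offer, predicate):
    """Drop the parts of an offer the framework would reject anyway."""
    return [(node, res) for node, res in offer if predicate(node, res)]


# Hypothetical offer: list of (node, availableResources).
offer = [
    ("node1", {"cpus": 4, "mem_gb": 16}),
    ("node2", {"cpus": 8, "mem_gb": 4}),
    ("node3", {"cpus": 2, "mem_gb": 32}),
]
big_memory = apply_filter(offer, min_mem_gb(8))
listed = apply_filter(offer, nodes_from_list({"node2", "node3"}))
print([node for node, _ in big_memory], [node for node, _ in listed])
```

A filter only saves round trips. A framework can still decline whatever gets through, which is why correctness does not depend on filters being expressive.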

Implementation

Implementation Stats
20,000 lines of C++
Master failover using ZooKeeper
Frameworks ported: Hadoop, MPI, Torque
New specialized framework: Spark, for iterative jobs (up to 20× faster than Hadoop)
Open source in Apache Incubator

Users
Twitter uses Mesos on > 100 nodes to run ~12 production services (mostly stream processing)
Berkeley machine learning researchers are running several algorithms at scale on Spark
Conviva is using Spark for data analytics
UCSF medical researchers are using Mesos to run Hadoop and eventually non-Hadoop apps

Results
- Utilization and performance vs static partitioning
- Framework placement goals: data locality
- Scalability
- Fault recovery

Dynamic Resource Sharing

Mesos vs Static Partitioning
Compared performance with statically partitioned cluster where each framework gets 25% of nodes

Framework             Speedup on Mesos
Facebook Hadoop Mix   1.14×
Large Hadoop Mix      2.10×
Spark                 1.26×
Torque / MPI          0.96×

Data Locality with Resource Offers
Ran 16 instances of Hadoop on a shared HDFS cluster
Used delay scheduling [EuroSys ’10] in Hadoop to get locality (wait a short time to acquire data-local nodes)
[Figure: chart annotated with 1.7×]
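The idea of waiting briefly for a data-local node can be sketched as follows. This is a simplified illustration, not Hadoop's scheduler: it assumes the wait is measured as a number of declined offers rather than elapsed time, and `delay_schedule` and its parameters are hypothetical names.

```python
def delay_schedule(offered_nodes, preferred, max_skips):
    """Choose a node for one task, preferring nodes that hold its data.

    offered_nodes: nodes offered one at a time, in order.
    preferred: set of nodes holding the task's input data.
    max_skips: number of non-local offers to decline before accepting any node.
    Returns (node, is_local), or (None, False) if the offers run out.
    """
    skips = 0
    for node in offered_nodes:
        if node in preferred:
            return node, True   # data-local node: accept at once
        if skips >= max_skips:
            return node, False  # waited long enough: give up on locality
        skips += 1              # decline and wait for a better offer
    return None, False


local_choice = delay_schedule(["a", "b", "c"], {"c"}, max_skips=2)
fallback_choice = delay_schedule(["a", "b", "c"], {"z"}, max_skips=1)
print(local_choice, fallback_choice)
```

A larger `max_skips` trades start-up latency for a better chance of locality. Setting it to zero accepts the first offer.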

Scalability
Mesos only performs inter-framework scheduling (e.g. fair sharing), which is easier than intra-framework scheduling
Result: Scaled to 50,000 emulated slaves, 200 frameworks, 100K tasks (30 s length)

Fault Tolerance
Mesos master has only soft state: list of currently running frameworks and tasks
Rebuild when frameworks and slaves re-register with new master after a failure
Result: fault detection and recovery in ~10 sec
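Because the master's state is soft, a replacement master can start empty and rebuild it from re-registrations. The sketch below illustrates that under assumptions: `SoftStateMaster` and its methods are hypothetical, and leader election (done with ZooKeeper according to the slides) is left out.

```python
class SoftStateMaster:
    """Holds only soft state: the running frameworks and tasks."""

    def __init__(self):
        self.frameworks = set()
        self.tasks = {}  # task_id -> (framework, slave)

    def register_framework(self, framework):
        """A framework scheduler announces itself to the master."""
        self.frameworks.add(framework)

    def register_slave(self, slave, running_tasks):
        """A slave reports the tasks it runs as {task_id: framework}."""
        for task_id, framework in running_tasks.items():
            self.tasks[task_id] = (framework, slave)


# State as known to the original master.
old_master = SoftStateMaster()
old_master.register_framework("hadoop")
old_master.register_slave("slave1", {"t1": "hadoop", "t2": "hadoop"})

# The old master fails. A new master starts empty and rebuilds its state
# as frameworks and slaves re-register with it.
new_master = SoftStateMaster()
new_master.register_framework("hadoop")
new_master.register_slave("slave1", {"t1": "hadoop", "t2": "hadoop"})
print(new_master.tasks == old_master.tasks)
```

Nothing has to be persisted by the master itself, since the authoritative information lives with the schedulers and slaves.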

Related Work
HPC schedulers (e.g. Torque, LSF, Sun Grid Engine)
- Coarse-grained sharing for inelastic jobs (e.g. MPI)
Virtual machine clouds
- Coarse-grained sharing similar to HPC
Condor
- Centralized scheduler based on matchmaking
Parallel work: Next-Generation Hadoop
- Redesign of Hadoop to have per-application masters
- Also aims to support non-MapReduce jobs
- Based on resource request language with locality prefs

Conclusion
Mesos shares clusters efficiently among diverse frameworks thanks to two design elements:
- Fine-grained sharing at the level of tasks
- Resource offers, a scalable mechanism for application-controlled scheduling
Enables co-existence of current frameworks and development of new specialized ones
In use at Twitter, UC Berkeley, Conviva and UCSF

Backup Slides

Framework Isolation
Mesos uses OS isolation mechanisms, such as Linux containers and Solaris projects
Containers currently support CPU, memory, IO and network bandwidth isolation
Not perfect, but much better than no isolation

Analysis
Resource offers work well when:
- Frameworks can scale up and down elastically
- Task durations are homogeneous
- Frameworks have many preferred nodes
These conditions hold in current data analytics frameworks (MapReduce, Dryad, …)
- Work divided into short tasks to facilitate load balancing and fault recovery
- Data replicated across multiple nodes

Revocation
Mesos allocation modules can revoke (kill) tasks to meet organizational SLOs
Framework given a grace period to clean up
“Guaranteed share” API lets frameworks avoid revocation by staying below a certain share

Mesos API
Scheduler Callbacks
- resourceOffer(offerId, offers)
- offerRescinded(offerId)
- statusUpdate(taskId, status)
- slaveLost(slaveId)
Scheduler Actions
- replyToOffer(offerId, tasks)
- setNeedsOffers(bool)
- setFilters(filters)
- getGuaranteedShare()
- killTask(taskId)
Executor Callbacks
- launchTask(taskDescriptor)
- killTask(taskId)
Executor Actions
- sendStatus(taskId, status)
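To show how the callbacks and actions fit together, here is a small mock scheduler. It borrows only the method names from the slide (kept in camelCase for that reason). The classes `RecordingDriver` and `OneTaskPerNodeScheduler`, the argument shapes and the status strings are assumptions, and none of this is the real Mesos client library.

```python
class RecordingDriver:
    """Hypothetical stand-in for the scheduler actions; it records replies."""

    def __init__(self):
        self.replies = []

    def replyToOffer(self, offer_id, tasks):
        self.replies.append((offer_id, tasks))


class OneTaskPerNodeScheduler:
    """Toy framework scheduler: launches one pending task per offered node."""

    def __init__(self, driver, pending):
        self.driver = driver
        self.pending = list(pending)
        self.placed = {}  # task_id -> node
        self.status = {}  # task_id -> last reported status

    def resourceOffer(self, offer_id, offers):
        """Callback: choose tasks for the offered nodes and reply."""
        tasks = []
        for node, _resources in offers:
            if not self.pending:
                break
            task_id = self.pending.pop(0)
            self.placed[task_id] = node
            tasks.append((task_id, node))
        self.driver.replyToOffer(offer_id, tasks)

    def statusUpdate(self, task_id, status):
        """Callback: remember the latest status of a task."""
        self.status[task_id] = status

    def slaveLost(self, slave_id):
        """Callback: requeue every task that was placed on the lost slave."""
        lost = [t for t, node in self.placed.items() if node == slave_id]
        for task_id in lost:
            del self.placed[task_id]
            self.pending.append(task_id)


driver = RecordingDriver()
scheduler = OneTaskPerNodeScheduler(driver, ["t1", "t2", "t3"])
scheduler.resourceOffer("o1", [("node1", {}), ("node2", {})])
scheduler.statusUpdate("t1", "RUNNING")
scheduler.slaveLost("node2")
print(driver.replies, scheduler.pending)
```

The scheduler reacts only to callbacks and never polls the master, which matches the offer-driven design described on the earlier slides.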

Hadoop Ecosystem
We covered these starting Day 1
Today (Day 2) & next week (Day 3) we cover more of these
Adapted from slide © 2013, M. Eltabakh, Worcester Polytechnic Institute