Computing & Information Sciences Kansas State University Kansas State University Olathe Workshop on Big Data – August, 2014 KSU Laboratory for Knowledge Discovery in Databases Kansas State University Olathe Thursday 14 August 2014 William H. Hsu Laboratory for Knowledge Discovery in Databases, Kansas State University Acknowledgements K-State Manhattan: Majed Alsadhan, Scott Finkeldei, Kyle Hudson, Surya Teja Kallumadi K-State Olathe: Dr. Prema Arasu, Dana Reinert, Paige Adams, Cathy Danahy, Angela Cummins, Emily Surdez, Quentin New, Amy Burgess Big Data Workshop: Day 2 Part III – Hadoop Ecosystem & Tools
Workshop Overview: Topics Covered
Day 1: Overview & Tutorial
- Survey of Big Data: Data, Tools, Methods, & Applications
- Tutorial on MapReduce Algorithms & Tools
Day 2: Hands-On Tutorial & Real-World Examples
- Hadoop Stack in Detail: Hive, Pig, Solr & Lucene, Mesos
- Other Tools & Platforms: Scala, Python
Day 3: Data Mining & Visualization
- More Tools: Spark, Machine Learning (Mahout & Oryx)
- Graphs (Neo4j), Data/Info Visualization (Tableau)
Workshop Overview: Goals
Day 1: Survey Real-World Applications & Methods
- Present Apache Hadoop Stack & Its Uses
- Introduce MapReduce using Hands-On Examples
Day 2: Delve into MapReduce Framework & Hadoop
- Understand Scalding, Python Streaming
- Go Over Basic Common Patterns, Dissect Code
Day 3: Review Tools & State of the Field
- Look at Data Mining & Visualization: Tasks, Methods
- Current Research and Development in Data Science
Review - Hadoop Stack: High-Level Overview (2011)
[Figure © 2011, R. Kalakota]
What is Hadoop?
Hadoop Driven Digital Preservation
Clemens Neudecker, KB National Library of the Netherlands
SCAPE & OPF Hackathon, Vienna, 2 Dec 2013
Timeline
- Dec 2004: Dean/Ghemawat (Google) MapReduce paper
- 2005: Doug Cutting and Mike Cafarella (Yahoo) create Hadoop, at first only to extend Nutch (the name is derived from Doug’s son’s toy elephant)
- 2006: Yahoo runs Hadoop on 5-20 nodes
This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐ (Grant Agreement number ).
Timeline
- March 2008: Cloudera founded
- July 2008: Hadoop wins TeraByte sort benchmark (first time a Java program won this competition)
- April 2009: Amazon introduces “Elastic MapReduce” as a service on S3/EC2
- June 2011: Hortonworks founded
Timeline
- 27 Dec 2011: Apache Hadoop release
- June 2012: Facebook claims “biggest Hadoop cluster”, totalling more than 100 PetaBytes in HDFS
- 2013: Yahoo runs Hadoop on 42,000 nodes, computing about 500,000 MapReduce jobs per day
- 15 Oct 2013: Apache Hadoop release (YARN)
Contributions (Cf.
“Core” Hadoop
- Hadoop Common (formerly Hadoop Core)
- Hadoop MapReduce
- Hadoop YARN (MapReduce 2.0)
- Hadoop Distributed File System (HDFS)
The wider Hadoop Ecosystem
- Ambari, ZooKeeper (managing & monitoring)
- HBase, Cassandra (database)
- Hive, Pig (data warehouse and query language)
- Mahout (machine learning)
- Chukwa, Avro, Oozie, Giraph, and many more
The wider Hadoop Ecosystem
[Figure; source fragment: collins-charles-zedlewski-cloudera]
“Hadoop is a hammer”
- “Hadoop is a hammer. Start by figuring out what house you’re gonna build.” (Alistair Croll)
- “If all you have is a hammer, throw away everything that is not a nail!” (Jimmy Lin)
MapReduce in 41 words (including “library”)
Goal: count the number of books in the library.
Map: You count up shelf #1, I count up shelf #2. (The more people we get, the faster this part goes.)
Reduce: We all get together and add up our individual counts.
(Cf.
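The library analogy can be sketched in a few lines of plain Python. This is only an illustration of the map and reduce steps on one machine, not Hadoop code; the shelf contents and the function names `map_shelf` and `reduce_counts` are made up for the example.

```python
from functools import reduce

def map_shelf(shelf):
    """Map step: one worker counts the books on a single shelf."""
    return len(shelf)

def reduce_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right

# Hypothetical library: each inner list is one shelf.
library = [
    ["Moby-Dick", "Emma"],
    ["Ulysses"],
    ["Dracula", "Walden", "Beloved"],
]

# Each shelf is counted independently, so this step could run in parallel.
per_shelf = list(map(map_shelf, library))

# The partial counts are then added up into a single total.
total = reduce(reduce_counts, per_shelf, 0)
print(per_shelf, total)
```

Because every shelf is counted independently, adding workers speeds up the map step, while the reduce step only has to combine one small number per shelf.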
MapReduce in a nutshell
[Figure © Sven Schlarb. Labels: Task 1, Task 2, Task 3; Output data; Aggregated Result]
MapReduce “v1” issues
- JobTracker as a single point of failure
- Deficiencies in scalability, memory consumption, threading model, reliability and performance
- Aim to support programming paradigms other than MapReduce (BSP)
MapReduce vs YARN (Cf.
When to use Hadoop?
- Generally, whenever “standard tools” no longer work because of sheer data size (rule of thumb: if your data fits on a regular hard drive, you’re better off sticking to Python/SQL/Bash/etc.!)
- Aggregation across large data sets: use the power of Reducers!
- Large-scale ETL operations (extract, transform, load)
Reading
- Tom White: Hadoop. The Definitive Guide (get 3rd ed. for extra YARN chapter)
- YARN explained (really quite well):
- Jimmy Lin: Text Processing with MapReduce:
Happy Hadooping!
Lucene/Solr Architecture
[Architecture diagram; labels recovered from the figure]
- Request Handlers: /select, /spell, /admin
- Update Handlers: XML, CSV, binary
- Response Writers: XML, Binary, JSON
- Data Import Handler (SQL/RSS); Extracting Request Handler (PDF/WORD), via Apache Tika
- Search Components: Query, Spelling, Faceting, Highlighting, Debug, Statistics, More like this, Distributed Search, Clustering
- Update Processors: Signature, Logging, Indexing
- Solr core services: Config, Schema, Caching, Faceting, Filtering, Search, Query Parsing, Highlighting, Index Replication
- Apache Lucene: Core Search (IndexReader/Searcher), Indexing (IndexWriter), Text Analysis (Analysis)
Lucene/Solr plugins
- RequestHandlers: handle a request at a URL like /select
- SearchComponents: part of a SearchHandler, a componentized request handler
  - Include Query, Facet, Highlight, Debug, Stats
  - Distributed Search capable
- UpdateHandlers: handle an indexing request
- Update Processor Chains: per-handler componentized chains that handle updates
- Query Parser plugins
  - Mix and match query types in a single request
  - Function plugins for Function Query
- Text Analysis plugins: Analyzers, Tokenizers, TokenFilters
- ResponseWriters: serialize & stream response to client
Lucene/Solr Query Plugin Architecture
[Diagram; labels recovered from the figure]
- solrconfig.xml: <parser name="mycustom" … and <func name="custom" class=…
- QParsers: Lucene QParser, DisMax QParser, XML QParser, Function QParser, Function Range Q, MyCustom QParser
- Functions: sqrt, sum, pow, max, log, custom
- schema.xml (declaratively defines types and analyzers for fields): <field name="title" type="text1" and <field name="cust1" class=…
- Analyzer for “title”: Whitespace Tokenizer, CustomFilter, SynonymFilter, Porter Stemmer
- Analyzer for “cust1” (potentially completely custom architecture not using tokenizer/filters)
Declarative Analysis per-field
- Tokenizer to split text
- TokenFilter to transform tokens
- Analyzer for completely custom
- Separate query / index analyzer
QParser plugins
- Support different query syntaxes
- Support different query execution
- Function Query supports pluggable custom functions
- Excellent support for nesting/mixing different query types in the same request
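The per-field analysis idea (a tokenizer splits the text, then each token filter transforms the token stream) can be sketched in plain Python. This is a toy model of the pattern only, not the Lucene/Solr API: every function name here is hypothetical, the synonym table is invented, and `toy_stem_filter` is a crude suffix stripper standing in for a real Porter stemmer.

```python
def whitespace_tokenizer(text):
    """Tokenizer: split the raw field text into tokens."""
    return text.split()

def lowercase_filter(tokens):
    """Token filter: normalize case."""
    return [t.lower() for t in tokens]

def synonym_filter(tokens, synonyms):
    """Token filter: replace tokens that have a configured synonym."""
    return [synonyms.get(t, t) for t in tokens]

def toy_stem_filter(tokens):
    """Token filter: strip a trailing 's' (NOT the Porter algorithm)."""
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

def make_analyzer(tokenizer, *filters):
    """Build an analyzer as one tokenizer followed by a chain of filters."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze

SYNONYMS = {"laptop": "notebook"}  # hypothetical synonym table

title_analyzer = make_analyzer(
    whitespace_tokenizer,
    lowercase_filter,
    lambda toks: synonym_filter(toks, SYNONYMS),
    toy_stem_filter,
)

tokens = title_analyzer("Fast Laptop reviews")
print(tokens)
```

Swapping or reordering the filters changes the index terms without touching the rest of the pipeline, which is the point of declaring the chain per field.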
Lucene/Solr Request Plugins
[Diagram; labels recovered from the figure]
- /select RequestHandler: Query Component, Facet Component, Highlight Component, Debug Component, Distributed Search
- Additional plug-n-play search components: MoreLikeThis, Statistics, Terms, Spellcheck, TermVector, QueryElevation, Clustering, My Custom
- Other request handlers: Request Handler (non-component based) /admin/luke; Request Handler (custom) /mypath
- Response writers: XML response writer, XSLT response writer, Binary response writer, JSON response writer, Custom response writer
- Query in, Response out: {“response”={ “docs”={
Lucene/Solr Indexing
[Diagram; labels recovered from the figure]
- XML Update Handler: /update
- CSV Update Handler: /update/csv
- XML Update with custom processor chain: /update/xml
- Extracting RequestHandler (PDF, Word, …): /update/extract
- Data Import Handler: Database pull, RSS pull, Simple transforms
- Sources: SQL DB and RSS feed (pull); PDF (HTTP POST)
- Update Processor Chain (per handler): Remove Duplicates processor, Logging processor, Custom Transform processor, Index processor
- Text Index Analyzers, Lucene Index
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy Katz, Scott Shenker, Ion Stoica
University of California, Berkeley
Background
Rapid innovation in cluster computing frameworks
[Figure labels: Pig, Dryad, Pregel, Percolator, CIEL]
Problem
- Rapid innovation in cluster computing frameworks
- No single framework optimal for all applications
- Want to run multiple frameworks in a single cluster
  - …to maximize utilization
  - …to share data between frameworks
Where We Want to Go
- Today: static partitioning
- Mesos: dynamic sharing
[Figure labels: Hadoop, Pregel, MPI, Shared cluster]
Solution
Mesos is a common resource sharing layer over which diverse frameworks can run
[Figure: frameworks (Hadoop, Pregel, …) running over Mesos across shared nodes, versus separate Hadoop and Pregel nodes]
Other Benefits of Mesos
- Run multiple instances of the same framework
  - Isolate production and experimental jobs
  - Run multiple versions of the framework concurrently
- Build specialized frameworks targeting particular problem domains
  - Better performance than general-purpose abstractions
Outline
- Mesos Goals and Architecture
- Implementation
- Results
- Related Work
Mesos Goals
- High utilization of resources
- Support diverse frameworks (current & future)
- Scalability to 10,000s of nodes
- Reliability in face of failures
Resulting design: Small microkernel-like core that pushes scheduling logic to frameworks
Design Elements
- Fine-grained sharing:
  - Allocation at the level of tasks within a job
  - Improves utilization, latency, and data locality
- Resource offers:
  - Simple, scalable application-controlled scheduling mechanism
Element 1: Fine-Grained Sharing
[Figure: Coarse-Grained Sharing (HPC), where Framework 1, Framework 2 and Framework 3 each hold a fixed block of nodes, versus Fine-Grained Sharing (Mesos), where tasks of Fw. 1, Fw. 2 and Fw. 3 are interleaved across all nodes; both sit on a Storage System (e.g. HDFS)]
+ Improved utilization, responsiveness, data locality
Element 2: Resource Offers
Option: Global scheduler
- Frameworks express needs in a specification language, global scheduler matches them to resources
+ Can make optimal decisions
– Complex: language must support all framework needs
– Difficult to scale and to make robust
– Future frameworks may have unanticipated needs
Element 2: Resource Offers
Mesos: Resource offers
- Offer available resources to frameworks, let them pick which resources to use and which tasks to launch
+ Keeps Mesos simple, lets it support future frameworks
– Decentralized decisions might not be optimal
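A minimal sketch of the resource-offer idea in Python, under loud assumptions: this is a toy simulation, not the Mesos API. The names `Offer`, `ToyFramework`, `resource_offer` and `offer_round`, the node sizes, and the "offer everything to each framework in turn" policy are all invented for illustration. The point it shows is that the master only offers resources, and each framework decides for itself what to launch and what to leave unused.

```python
from dataclasses import dataclass

@dataclass
class Offer:
    node: str
    cpus: int
    mem_gb: int

class ToyFramework:
    """A framework scheduler with fixed-size tasks (hypothetical)."""
    def __init__(self, name, task_cpus, task_mem_gb, pending_tasks):
        self.name = name
        self.task_cpus = task_cpus
        self.task_mem_gb = task_mem_gb
        self.pending = list(pending_tasks)

    def resource_offer(self, offers):
        """Pick which offered resources to use; the rest is implicitly declined."""
        launches = []
        for offer in offers:
            cpus, mem = offer.cpus, offer.mem_gb
            while self.pending and cpus >= self.task_cpus and mem >= self.task_mem_gb:
                launches.append((offer.node, self.pending.pop(0)))
                cpus -= self.task_cpus
                mem -= self.task_mem_gb
        return launches

def offer_round(free, frameworks):
    """Toy master: offer all currently free resources to each framework in turn."""
    placements = {}
    for fw in frameworks:
        offers = [Offer(n, c, m) for n, (c, m) in free.items() if c > 0 and m > 0]
        for node, task in fw.resource_offer(offers):
            c, m = free[node]
            free[node] = (c - fw.task_cpus, m - fw.task_mem_gb)
            placements[task] = node
    return placements

free = {"node1": (4, 8), "node2": (2, 4)}  # node -> (cpus, mem_gb)
hadoop = ToyFramework("hadoop", 1, 2, ["h1", "h2", "h3"])
mpi = ToyFramework("mpi", 2, 4, ["m1", "m2"])
placements = offer_round(free, [hadoop, mpi])
print(placements, free, mpi.pending)
```

Note that the master never needs to understand what a Hadoop or MPI task requires; that knowledge stays inside each framework, which is also why the combined placement is not guaranteed to be globally optimal.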
Mesos Architecture
[Diagram labels: MPI job, MPI scheduler; Hadoop job, Hadoop scheduler; Mesos master with Allocation module; two Mesos slaves, each with an MPI executor running tasks; Resource offer]
- Allocation module: pick framework to offer resources to
Mesos Architecture
[Diagram labels: MPI job, MPI scheduler; Hadoop job, Hadoop scheduler; Mesos master with Allocation module; two Mesos slaves, each with an MPI executor running tasks; Resource offer]
- Allocation module: pick framework to offer resources to
- Resource offer = list of (node, availableResources)
  E.g. { (node1, ), (node2, ) }
Mesos Architecture
[Diagram labels: MPI job, MPI scheduler; Hadoop job, Hadoop scheduler; Mesos master with Allocation module; Mesos slaves with MPI executor and Hadoop executor running tasks; Resource offer]
- Allocation module: pick framework to offer resources to
- Framework schedulers: framework-specific scheduling
- Mesos slave: launches and isolates executors
Optimization: Filters
- Let frameworks short-circuit rejection by providing a predicate on resources to be offered
  - E.g. “nodes from list L” or “nodes with > 8 GB RAM”
  - Could generalize to other hints as well
- Ability to reject still ensures correctness when needs cannot be expressed using filters
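The two example filters on this slide can be written as predicates over (node, resources) pairs. The sketch below is a hypothetical Python illustration, not the Mesos filter API: `nodes_from_list`, `min_ram_gb`, `filter_offers` and the sample node sizes are assumptions made for the example.

```python
def nodes_from_list(allowed):
    """Filter: only offer resources on nodes from list L."""
    allowed = set(allowed)
    return lambda node, res: node in allowed

def min_ram_gb(threshold):
    """Filter: only offer nodes with more than `threshold` GB of RAM."""
    return lambda node, res: res["mem_gb"] > threshold

def filter_offers(offers, predicate):
    """Master side: drop offers the framework would reject anyway."""
    return [(node, res) for node, res in offers if predicate(node, res)]

# Hypothetical free resources on three nodes.
offers = [
    ("node1", {"cpus": 4, "mem_gb": 16}),
    ("node2", {"cpus": 8, "mem_gb": 8}),
    ("node3", {"cpus": 2, "mem_gb": 32}),
]

big_memory = [node for node, _ in filter_offers(offers, min_ram_gb(8))]
listed = [node for node, _ in filter_offers(offers, nodes_from_list(["node2", "node3"]))]
print(big_memory, listed)
```

A filter only saves round trips; a framework whose needs cannot be expressed as a predicate can still inspect each offer and reject it.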
Implementation
Implementation Stats
- 20,000 lines of C++
- Master failover using ZooKeeper
- Frameworks ported: Hadoop, MPI, Torque
- New specialized framework: Spark, for iterative jobs (up to 20× faster than Hadoop)
- Open source in Apache Incubator
Users
- Twitter uses Mesos on > 100 nodes to run ~12 production services (mostly stream processing)
- Berkeley machine learning researchers are running several algorithms at scale on Spark
- Conviva is using Spark for data analytics
- UCSF medical researchers are using Mesos to run Hadoop and eventually non-Hadoop apps
Results
- Utilization and performance vs static partitioning
- Framework placement goals: data locality
- Scalability
- Fault recovery
Dynamic Resource Sharing
Mesos vs Static Partitioning
Compared performance with statically partitioned cluster where each framework gets 25% of nodes

Framework              Speedup on Mesos
Facebook Hadoop Mix    1.14×
Large Hadoop Mix       2.10×
Spark                  1.26×
Torque / MPI           0.96×
Data Locality with Resource Offers
- Ran 16 instances of Hadoop on a shared HDFS cluster
- Used delay scheduling [EuroSys ’10] in Hadoop to get locality (wait a short time to acquire data-local nodes)
[Chart annotation: 1.7×]
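The "wait a short time to acquire data-local nodes" rule can be sketched as a small decision function. This is a simplified illustration, not the Hadoop implementation: the class name `DelayScheduler` is hypothetical, and the wait is measured in skipped offers rather than wall-clock time so that the example is deterministic.

```python
class DelayScheduler:
    """Decline non-local offers for a bounded number of tries, then give up on locality."""

    def __init__(self, preferred_nodes, max_skips):
        self.preferred = set(preferred_nodes)  # nodes holding the job's input data
        self.max_skips = max_skips             # stand-in for the wait time
        self.skips = 0

    def accept(self, node):
        if node in self.preferred:
            self.skips = 0        # data-local launch: reset the wait
            return True
        if self.skips >= self.max_skips:
            self.skips = 0        # waited long enough: launch non-locally
            return True
        self.skips += 1           # keep waiting for a data-local node
        return False

scheduler = DelayScheduler(preferred_nodes={"node3"}, max_skips=2)
offered = ["node1", "node2", "node3", "node1", "node1", "node1"]
decisions = [scheduler.accept(node) for node in offered]
print(decisions)
```

With resource offers, this logic lives entirely inside the framework: it simply rejects offers for non-local nodes until its own waiting budget runs out.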
Scalability
- Mesos only performs inter-framework scheduling (e.g. fair sharing), which is easier than intra-framework scheduling
- Result: Scaled to 50,000 emulated slaves, 200 frameworks, 100K tasks (30 s each)
Fault Tolerance
- Mesos master has only soft state: list of currently running frameworks and tasks
- Rebuild when frameworks and slaves re-register with new master after a failure
- Result: fault detection and recovery in ~10 sec
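Soft state means a new master can start empty and rebuild its view from what the frameworks and slaves report. The sketch below is a toy Python model of that idea, not Mesos code: `ToyMaster`, its method names, and the sample slaves and tasks are assumptions for illustration, and leader election (done with ZooKeeper in Mesos) is left out.

```python
class ToyMaster:
    """Holds only soft state: running frameworks and tasks."""

    def __init__(self):
        self.frameworks = set()
        self.tasks = {}  # task_id -> (framework, slave_id)

    def register_framework(self, name):
        self.frameworks.add(name)

    def register_slave(self, slave_id, running_tasks):
        """A slave reports the tasks it is currently running."""
        for task_id, framework in running_tasks.items():
            self.tasks[task_id] = (framework, slave_id)

# State that survives a master failure because it lives on the slaves
# and in the framework schedulers (hypothetical sample data).
slaves = {
    "slave1": {"t1": "hadoop", "t2": "mpi"},
    "slave2": {"t3": "hadoop"},
}
framework_names = ["hadoop", "mpi"]

# After failover, the new master starts empty and everyone re-registers.
new_master = ToyMaster()
for name in framework_names:
    new_master.register_framework(name)
for slave_id, running in slaves.items():
    new_master.register_slave(slave_id, running)

print(sorted(new_master.frameworks), new_master.tasks)
```

Because nothing had to be read back from durable storage, recovery time is bounded by how quickly frameworks and slaves notice the new master and re-register.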
Related Work
- HPC schedulers (e.g. Torque, LSF, Sun Grid Engine)
  - Coarse-grained sharing for inelastic jobs (e.g. MPI)
- Virtual machine clouds
  - Coarse-grained sharing similar to HPC
- Condor
  - Centralized scheduler based on matchmaking
- Parallel work: Next-Generation Hadoop
  - Redesign of Hadoop to have per-application masters
  - Also aims to support non-MapReduce jobs
  - Based on resource request language with locality prefs
Conclusion
- Mesos shares clusters efficiently among diverse frameworks thanks to two design elements:
  - Fine-grained sharing at the level of tasks
  - Resource offers, a scalable mechanism for application-controlled scheduling
- Enables co-existence of current frameworks and development of new specialized ones
- In use at Twitter, UC Berkeley, Conviva and UCSF
Backup Slides
Framework Isolation
- Mesos uses OS isolation mechanisms, such as Linux containers and Solaris projects
- Containers currently support CPU, memory, IO and network bandwidth isolation
- Not perfect, but much better than no isolation
Analysis
- Resource offers work well when:
  - Frameworks can scale up and down elastically
  - Task durations are homogeneous
  - Frameworks have many preferred nodes
- These conditions hold in current data analytics frameworks (MapReduce, Dryad, …)
  - Work divided into short tasks to facilitate load balancing and fault recovery
  - Data replicated across multiple nodes
Revocation
- Mesos allocation modules can revoke (kill) tasks to meet organizational SLOs
- Framework given a grace period to clean up
- “Guaranteed share” API lets frameworks avoid revocation by staying below a certain share
Mesos API
Scheduler Callbacks
- resourceOffer(offerId, offers)
- offerRescinded(offerId)
- statusUpdate(taskId, status)
- slaveLost(slaveId)
Scheduler Actions
- replyToOffer(offerId, tasks)
- setNeedsOffers(bool)
- setFilters(filters)
- getGuaranteedShare()
- killTask(taskId)
Executor Callbacks
- launchTask(taskDescriptor)
- killTask(taskId)
Executor Actions
- sendStatus(taskId, status)
Hadoop Ecosystem
[Figure annotations]
- We covered these starting Day 1
- Today (Day 2) & next week (Day 3) we cover more of these
Adapted from slide © 2013, M. Eltabakh, Worcester Polytechnic Institute