Webinar: From Hadoop to Spark

Agenda: Introduction, Hadoop and Spark Comparison, From Hadoop to Spark

Presentation transcript:


Webinar Series 2015. Copyright © 2015 Elephant Scale LLC. All rights reserved.

Webinar Objectives
• Intro: what is Hadoop and what is Spark?
• Spark's capabilities and advantages vs. Hadoop
• From Hadoop to Spark: how to?

Introduction

Hadoop in 20 Seconds
• 'The' Big Data platform
• Very well field-tested
• Scales to petabytes of data
• MapReduce: batch-oriented compute

Hadoop Ecosystem
(Figure: Hadoop ecosystem components, grouped into batch and real time.)

Hadoop Ecosystem by Function
• HDFS: provides distributed storage
• MapReduce: provides distributed computing
• Pig: high-level MapReduce
• Hive: SQL layer over Hadoop
• HBase: NoSQL storage for real-time queries

Spark in 20 Seconds
• Fast and expressive cluster computing engine
• Compatible with Hadoop
• Came out of Berkeley AMP Lab
• Now an Apache project
• Version 1.3 just released (April 2015)
"First Big Data platform to integrate batch, streaming and interactive computations in a unified framework" (source: stratio.com)

Spark Ecosystem
• Spark Core
• Spark SQL: schema / SQL
• Spark Streaming: real time
• MLlib: machine learning
• GraphX: graph processing
• Cluster managers: standalone, YARN, Mesos

Hypo-meter

Spark Job Trends

Spark Benchmarks (source: stratio.com)

Spark Code / Activity (source: stratio.com)

Timeline: Hadoop & Spark

Hadoop and Spark Comparison

Hadoop vs. Spark

Comparison With Hadoop

Hadoop | Spark
Distributed storage + distributed compute | Distributed compute only
MapReduce framework | Generalized computation
Usually data on disk (HDFS) | On disk / in memory
Not ideal for iterative work | Great at iterative workloads (machine learning, etc.)
Batch process | Up to 10x faster for data on disk; up to 100x faster for data in memory
- | Compact code
- | Java, Python, Scala supported
- | Shell for ad hoc exploration

Hadoop + YARN: OS for Distributed Compute
• Applications: batch (MapReduce), streaming (Storm, S4), in-memory (Spark)
• Cluster management: YARN
• Storage: HDFS
(or at least, that's the idea)

Spark Is a Better Fit for Iterative Workloads

Spark Programming Model
• More generic than MapReduce
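To make "more generic than MapReduce" concrete, here is a minimal sketch in plain Python. It uses no Hadoop or Spark APIs, and every function name in it is hypothetical. The first half follows the fixed map, shuffle, reduce shape of a MapReduce job; the second half chains several steps freely, which is closer in spirit to how a Spark job composes transformations.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count_mapreduce(lines):
    """Fixed MapReduce shape: exactly one map, one shuffle, one reduce."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


def long_word_total(lines, min_len=5):
    """Freely chained steps: filter, flat-map, filter, map, then reduce."""
    words = (w.lower() for line in lines if line.strip() for w in line.split())
    lengths = (len(w) for w in words if len(w) >= min_len)
    return reduce(lambda a, b: a + b, lengths, 0)


# Hypothetical sample input, for illustration only.
sample = ["Spark runs on Hadoop", "Hadoop stores data", "Spark caches data"]
counts = word_count_mapreduce(sample)
total = long_word_total(sample)
```

In the MapReduce half, a pipeline needing more than one map/reduce pair would have to be split into several jobs; in the chained half, extra steps are just extra lines.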

Is Spark Replacing Hadoop?
• Spark runs on Hadoop / YARN
  - Complementary
• Spark's programming model is more flexible than MapReduce
• Spark is really great if the data fits in memory (a few hundred gigs)
• Spark is 'storage agnostic' (see next slide)

Spark & Pluggable Storage
Spark (compute engine) on top of pluggable storage: HDFS, Amazon S3, Cassandra, ???

Spark & Hadoop

Use case | Other | Spark
Batch processing | Hadoop's MapReduce (Java, Pig, Hive) | Spark RDDs (Java / Scala / Python)
SQL querying | Hadoop: Hive | Spark SQL
Stream processing / real-time processing | Storm, Kafka | Spark Streaming
Machine learning | Mahout | Spark MLlib
Real-time lookups | NoSQL (HBase, Cassandra, etc.) | No Spark component, but Spark can query data in NoSQL stores

Hadoop & Spark Future ???

Going from Hadoop to Spark

Why Move From Hadoop to Spark?
• Spark is 'easier' than Hadoop
• 'Friendlier' for data scientists / analysts
  - Interactive shell: fast development cycles, ad hoc exploration
• API supports multiple languages
  - Java, Scala, Python
• Great for small (gigs) to medium (100s of gigs) data

Spark: 'Unified' Stack
• Spark supports multiple programming models
  - MapReduce-style batch processing
  - Streaming / real-time processing
  - Querying via SQL
  - Machine learning
• All modules are tightly integrated
  - Facilitates rich applications
• Spark can be the only stack you need!
  - No need to run multiple clusters (Hadoop cluster, Storm cluster, etc.)

Migrating From Hadoop to Spark

Functionality | Hadoop | Spark
Distributed storage | HDFS | Cloud storage like Amazon S3, or NFS mounts
SQL querying | Hive | Spark SQL
ETL workflow | Pig | Spork (Pig on Spark); mix of Spark SQL
Machine learning | Mahout | MLlib
NoSQL DB | HBase | ???

Five Steps of Moving From Hadoop to Spark
1. Data size
2. File system
3. SQL
4. ETL
5. Machine learning

Data Size: "You Don't Have Big Data"

1) Data Size (T-shirt sizing)
(Figure: T-shirt sizes mapping data volume, from under a few G up to PB+, onto Spark and Hadoop. Image credit: blog.trumpi.co.za)

1) Data Size
• A lot of Spark adoption is at SMALL to MEDIUM scale
  - Good fit
  - Data might fit in memory!
  - Hadoop may be overkill
• Applications
  - Iterative workloads (machine learning, etc.)
  - Streaming
• Hadoop is still the preferred platform for TB+ data
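The sizing rule of thumb above can be sketched as a small helper. This is only an illustration: the function name, the 1 TB (1024 GB) cutoff and the memory check are assumptions made for this sketch, not figures prescribed by the webinar.

```python
TB_IN_GB = 1024  # assumed cutoff for "TB+" data


def suggest_platform(data_gb, cluster_memory_gb):
    """Return a rough platform suggestion for a data set of `data_gb` gigabytes.

    `cluster_memory_gb` is the total memory across the cluster; both the
    parameter and the thresholds are hypothetical.
    """
    if data_gb >= TB_IN_GB:
        # TB+ data: the slides still prefer Hadoop here.
        return "Hadoop"
    if data_gb <= cluster_memory_gb:
        # Small data that might fit in memory: cache it and iterate.
        return "Spark (cache in memory)"
    # Medium data that does not fit in memory: still within Spark's range.
    return "Spark"


# Example: 8 nodes with 15 GB each give 120 GB of cluster memory.
small = suggest_platform(50, 120)
medium = suggest_platform(300, 120)
large = suggest_platform(2048, 120)
```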

2) File System
• Hadoop = storage + compute
• Spark = compute only
• Spark needs a distributed FS
• File system choices for Spark
  - HDFS (Hadoop Distributed File System): reliable, good performance (data locality), field-tested for PB of data
  - S3 (Amazon): reliable cloud storage, huge scale
  - NFS (Network File System): shared FS across machines

Spark File Systems

File Systems For Spark

Property | HDFS | NFS | Amazon S3
Data locality | High (best) | Local enough | None (ok)
Throughput | High (best) | Medium (good) | Low (ok)
Latency | Low (best) | Low | High
Reliability | Very high (replicated) | Low | Very high
Cost | Varies | - | $30 / TB / month

File Systems Throughput Comparison
• Data: 10G+ (11.3 G)
• Each file: ~1+ G (x 10)
• 400 million records total
• Partition size: 128 M
• On HDFS & S3
• Cluster:
  - 8 nodes on Amazon m3.xlarge (4 CPU, 15 G memory, 40 G SSD)
  - Hadoop cluster: latest Hortonworks HDP v2.2
  - Spark: on the same 8 nodes, standalone, v1.2
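A quick back-of-the-envelope calculation shows what this setup implies for parallelism. The sketch below makes three assumptions that the slide does not state: 1 G is taken as 1024 M, the data is treated as one contiguous block rather than ten separate files (per-file splitting could yield a few more partitions), and one task runs per CPU core.

```python
import math

DATA_GB = 11.3            # total data size, from the slide
PARTITION_MB = 128        # partition size, from the slide
RECORDS = 400_000_000     # total records, from the slide
NODES = 8                 # cluster nodes, from the slide
CORES_PER_NODE = 4        # m3.xlarge CPU count, from the slide


def partition_count(data_gb, partition_mb):
    """Number of partitions needed to cover the data (assumes 1 G = 1024 M)."""
    return math.ceil(data_gb * 1024 / partition_mb)


partitions = partition_count(DATA_GB, PARTITION_MB)
records_per_partition = RECORDS // partitions
total_cores = NODES * CORES_PER_NODE
# Assumed: one task per core, so partitions are processed in "waves".
waves = math.ceil(partitions / total_cores)

print(partitions, records_per_partition, total_cores, waves)
```

Under these assumptions the data splits into roughly 90 partitions, which the 32 cores would process in about three waves.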

HDFS vs. S3 (lower is better)

HDFS vs. S3: Conclusions

HDFS | S3
Data locality → much higher throughput | Data is streamed → lower throughput
Need to maintain a Hadoop cluster | No Hadoop cluster to maintain → convenient
Large data sets (TB+) | Good use case: smallish data sets (a few gigs); load once, cache and re-use

3) SQL in Hadoop / Spark

Aspect | Hadoop | Spark
Engine | Hive | Spark SQL
Language | HiveQL | HiveQL; RDD programming in Java / Python / Scala
Scale | Petabytes | Terabytes ?
Interoperability | - | Can read Hive tables or standalone data
Formats | - | CSV, JSON, Parquet

Spark SQL vs. Hive
Fast on the same HDFS data!

4) ETL on Hadoop / Spark

Aspect | Hadoop | Spark
ETL tools | Pig, Cascading, Oozie | Native RDD programming (Scala, Java, Python)
Pig | High-level ETL workflow | Spork: Pig on Spark
Cascading | High level | Spark-scalding

4) ETL on Hadoop / Spark: Conclusions
• Try Spork or spark-scalding
  - Code re-use
  - Not re-writing from scratch
• Program RDDs directly
  - More flexible
  - Multiple language support: Scala / Java / Python
  - Simpler / faster in some cases
• Our experience of porting a financial application
  - Tresata vs. RDD

5) Machine Learning: Hadoop / Spark

Aspect | Hadoop | Spark
Tool | Mahout | MLlib
API | Java | Java / Scala / Python
Iterative algorithms | Slower | Very fast (in memory)
In-memory processing | No | Yes
- | Mahout runs on Hadoop or on Spark | New and young lib
Latest news! | Mahout only accepts new code that runs on Spark | Mahout & MLlib on Spark
Future? | Many opinions | -

Our Experience: Legal (eDiscovery)

FreeEed (Hadoop)
• Scalable document processing
• All Enron docs in 1 hour (50-node Hadoop)

3VEed (Storm, Spark)
• Allows dynamically adding data sources
  - Use case: more data discovered for the same lawsuit
• Allows real-time data processing
  - Use case: real-time s
• Provides much improved load balancing
  - Example: 10 GB PST mailbox
• Overall: a much better fit for modern data governance

Final Thoughts
• Already on Hadoop?
  - Try Spark side-by-side
  - Process some data in HDFS
  - Try Spark SQL for Hive tables
• Contemplating Hadoop?
  - Try Spark (standalone)
  - Choose NFS or S3 file system
• Take advantage of caching
  - Iterative loads
  - Spark job servers
  - Tachyon
• Build a new class of 'big / medium data' apps

Thanks!
Expert consulting & training in Big Data (now offering Spark training)

Spark Caching!
• Reading data from a remote FS (S3) can be slow
• For small / medium data (10 to 100s of GB), use caching
  - Pay the read penalty once
  - Cache
  - Then very high-speed computes (in memory)
  - Recommended for iterative workloads
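The "pay the read penalty once" pattern can be illustrated with a plain-Python analogy. This is not Spark code: the class, the loader and the data are all hypothetical stand-ins, and in Spark itself the corresponding step is calling `cache()` on an RDD before re-using it.

```python
import time


class CachedDataset:
    """Load data through a slow reader once, then serve it from memory."""

    def __init__(self, loader):
        self._loader = loader
        self._data = None
        self.loads = 0  # how many times the slow read actually ran

    def collect(self):
        if self._data is None:
            # First access: pay the read penalty and keep the result.
            self._data = list(self._loader())
            self.loads += 1
        return self._data


def slow_remote_read():
    """Stand-in for a slow remote read such as S3 (hypothetical data)."""
    time.sleep(0.05)
    return range(1, 1001)


dataset = CachedDataset(slow_remote_read)

# Iterative workload: several passes over the same data, one slow read.
results = [sum(x * k for x in dataset.collect()) for k in range(1, 4)]
```

Only the first pass waits on the slow reader; later passes work on the in-memory copy, which is why the approach pays off for iterative workloads.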

Caching Results
(Figure: benchmark results with cached data.)

Spark Caching
• Caching is pretty effective (small / medium data sets)
• Cached data cannot be shared across applications (each application executes in its own sandbox)

Sharing Cached Data
• 1) 'Spark job server'
  - Multiplexer
  - All requests are executed through the same 'context'
  - Provides a web-service interface
• 2) Tachyon
  - Distributed in-memory file system
  - Memory is the new disk!
  - Out of AMP Lab, Berkeley
  - Early stages (very promising)

Spark Job Server

Spark Job Server
• Open sourced from Ooyala
• 'Spark as a Service': simple REST interface to launch jobs
• Sub-second latency!
• Pre-load jars for even faster spin-up
• Share cached RDDs across requests (NamedRDD)
    App1: ctx.saveRDD("my cached rdd", rdd1)
    App2: RDD rdd2 = ctx.loadRDD("my cached rdd")
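The idea behind sharing named RDDs can be sketched in plain Python. This is an analogy only, not the Spark Job Server API: `SharedContext`, `save_rdd` and `load_rdd` are hypothetical names that mirror the slide's pseudocode, and ordinary lists stand in for RDDs.

```python
class SharedContext:
    """One long-lived context that keeps named data sets in memory."""

    def __init__(self):
        self._named = {}

    def save_rdd(self, name, data):
        # Register the data set under a name other requests can look up.
        self._named[name] = data

    def load_rdd(self, name):
        # Return the data set saved earlier; raises KeyError if missing.
        return self._named[name]


def app1(ctx):
    """First request: compute a data set and publish it by name."""
    rdd1 = [n * n for n in range(5)]
    ctx.save_rdd("my cached rdd", rdd1)


def app2(ctx):
    """Second request: re-use the published data set without recomputing."""
    rdd2 = ctx.load_rdd("my cached rdd")
    return sum(rdd2)


ctx = SharedContext()  # both requests go through the same context
app1(ctx)
total = app2(ctx)
```

Sharing works here only because both requests run against the same context object, which matches the slide's point that all requests are executed through the same 'context'.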

Tachyon + Spark

Next: New Big Data Applications With Spark

Big Data Applications: Now
• Analysis is done in batch mode (minutes / hours)
• Final results are stored in a real-time data store like Cassandra / HBase
• These results are displayed in a dashboard / web UI
• Doing interactive analysis ????
  - Need special BI tools

With Spark…
• Load the data set (gigabytes) from S3 and cache it (one time)
• Super fast (sub-second) queries on the data
• Response time: seconds (just like a web app!)

Lessons Learned
• Build sophisticated apps!
• Web response time (a few seconds)!
• In-depth analytics
  - Leverage existing libraries in Java / Scala / Python
• 'Data analytics as a service'

Ashish Shanker

Synerzip in a Nutshell
• Software product development partner for small/mid-sized technology companies
  - Exclusive focus on small/mid-sized technology companies, typically venture-backed companies in growth phase
  - By definition, all Synerzip work is the IP of its respective clients
  - Deep experience in full SDLC: design, dev, QA/testing, deployment
• Dedicated team of high-caliber software professionals for each client
  - Seamlessly extends the client's local team, offering full transparency
  - Stable teams with very low turnover
  - NOT just "staff augmentation", but provides full management support
• Actually reduces risk of development/delivery
  - Experienced team: uses appropriate level of engineering discipline
  - Practices Agile development: responsive yet disciplined
• Reduces cost: dual-site team, 50% cost advantage
• Offers long-term flexibility: allows (facilitates) taking the offshore team captive, aka the "BOT" option

Synerzip Clients

Join Us In Person
Agile Texas 2015 Tour
Presented by Hemant Elhence & Vinayak Joglekar

Next Webinar
7 Sins of Scrum and other Agile Anti-Patterns
Complimentary webinar: Tuesday, September 22, Noon CST
Presented by: Todd Little IHM

Ashish Shanker
Connect with linkedin.com/company/synerzip facebook.com/Synerzip