1
Why Spark on Hadoop Matters
Abstract: The Hadoop technology stack, from its roots in batch processing, has evolved at a rapid pace to address more real-time business needs. Spark, with its in-memory processing framework, provides a great complementary stack to Hadoop. Not surprisingly, the integration of the full Spark stack on Hadoop is showing tremendous promise for MapR customers. I will present some of these use cases and discuss how and when the integration of Spark and Hadoop delivers the best value for the end user.
MC Srivas, CTO and Founder, MapR Technologies
Apache Spark Summit - July 1, 2014
2
MapR Overview
Top Ranked; Exponential Growth; 500+ Customers; Cloud Leaders
3X bookings, Q1 '13 to Q1 '14
90% software licenses
80% of accounts expand 3X
< 1% lifetime churn
> $1B in incremental revenue generated by 1 customer
The MapR distribution for Hadoop is globally recognized as the technology leader. Forrester published a Wave for Big Data Hadoop Solutions in which it placed MapR as the highest-ranking product based on current offering as well as roadmap.
Cloud: MapR has been selected by two of the companies most experienced with MapReduce technology, which is a testament to the technology advantages of MapR's distribution. Amazon, through its Elastic MapReduce service (EMR), hosted over 2 million clusters in the past year. Amazon selected MapR to complement EMR as the only commercial Hadoop distribution being offered, sold and supported as a service by Amazon to its customers. Google, the pioneer of MapReduce and the company whose white paper on MapReduce inspired the creation of Hadoop, has also selected MapR and makes the MapR distribution available on Google Compute Engine.
3
Rapidly Evolving Landscape
[Diagram: APACHE HADOOP AND OSS ECOSYSTEM, layered on the MapR Data Platform]
Layers: EXECUTION ENGINES; DATA GOVERNANCE AND OPERATIONS
Categories: Batch; SQL; ML, Graph; NoSQL & Search; Streaming; Data Integration & Access; Security; Workflow & Data Governance; Provision; Management
Projects shown: Tez*, Spark, Drill*, Savannah*, Cascading, GraphX, Shark, Accumulo*, Storm*, Hue, Juju, Pig, MLlib, Impala, Solr, Spark Streaming, HttpFS, MR v1 & v2, Mahout, Hive, HBase, Flume, Knox*, Falcon*, Whirr, YARN, Sqoop, Sentry*, Oozie, ZooKeeper
* 2014 TIMELINE
4
The Complete Spark Stack on Hadoop
[Diagram: the same APACHE HADOOP AND OSS ECOSYSTEM chart as on the previous slide, with the same categories and projects on the MapR Data Platform.]
5
A Winning Combination
6
Spark Advantages
EASE OF DEVELOPMENT: Easier APIs; Python, Scala, Java
IN-MEMORY PERFORMANCE: RDDs; DAGs
COMBINE WORKFLOWS: Unify Processing; Shark, ML, Streaming, GraphX
7
Hadoop Advantages
UNLIMITED SCALE: Multiple data sources; Multiple applications; Multiple users
ENTERPRISE PLATFORM: Reliability; Multi-tenancy; Security
WIDE RANGE OF APPLICATIONS: Files; Databases; Semi-structured
8
The Combination of Spark on Hadoop
Operational Applications Augmented by In-Memory Performance
Hadoop: UNLIMITED SCALE; ENTERPRISE PLATFORM; WIDE RANGE OF APPLICATIONS
Spark: EASE OF DEVELOPMENT; IN-MEMORY PERFORMANCE; COMBINE WORKFLOWS
9
Case Studies
10
Industry Leading Ad-Targeting Platform
High-performance analytics over MapR M7 NoSQL
Load from M7 table into RDD to augment scoring in real time
Results fed back to M7 for other applications
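The load, score, write-back loop on this slide can be sketched in plain Python. This is only an illustration of the data flow, not Spark or M7 code: the dictionary `user_table` stands in for an M7 table, the in-memory copy stands in for an RDD, and the scoring rule (10 points per matched segment, capped at 100) and all names are hypothetical.

```python
# Hypothetical stand-in for an M7 table: row key -> column values.
user_table = {
    "u1": {"segments": ["sports", "travel"], "score": None},
    "u2": {"segments": ["autos"], "score": None},
}


def load_into_memory(table):
    """Copy rows out of the table once so scoring does not re-read storage."""
    return {key: dict(row) for key, row in table.items()}


def score_request(base_score, profile):
    """Augment a base ad score (0-100) with 10 points per segment, capped at 100.

    The rule is an assumption made up for this sketch.
    """
    return min(100, base_score + 10 * len(profile["segments"]))


def write_back(table, results):
    """Feed the computed scores back to the table for other applications."""
    for key, score in results.items():
        table[key]["score"] = score


cached = load_into_memory(user_table)
requests = [("u1", 50), ("u2", 95)]  # (user id, base score) pairs
results = {uid: score_request(base, cached[uid]) for uid, base in requests}
write_back(user_table, results)
```

In a real deployment the in-memory copy would be a distributed dataset and the write-back would go through the table's client API; the sketch only shows the order of the three steps.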
11
Leading Pharma Company: NextGen Genomics
Existing process takes several weeks to align chemical compounds with genes
ADAM on Spark allows realignment in a few hours
Geneticists can minimize engineering dependency
a. The customer is interested in the ADAM project, which runs on top of Spark, for next-gen genomics. There is a good whitepaper: search for "Apache Spark ADAM"; the GitHub page for ADAM links to the whitepaper. ADAM comes from AMPLab and is a genomics realignment tool. The crux for next-gen genomic medicine is that it lets you access and manage the alignment of data much more quickly.
b. The existing genomics pipeline takes many weeks to realign. Teams drill down, augment, and work through their chemical compounds; geneticists can come and test their compound against the alignment.
c. If the alignment does not cover the exact sections they need, they cannot focus, and they have to back out and zoom back in on the right set of sequences. That process takes 6 weeks: a geneticist works for 1 day, needs to shift the alignment, and waits 6 weeks to get the change. With ADAM it takes a matter of hours, and geneticists can do it themselves; otherwise it requires a whole team of HPC experts.
d. This is the case for almost all pharma companies; Novartis is the same.
e. ADAM is one tool in a bigger framework; there are several other use cases as well.
12
Cisco: Security Intelligence Operations
Sensor data lands in M7
Spark Streaming on M7 for first check on known threats
Data next processed on GraphX and Mahout
Results queried using SQL via Shark and Impala
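The "first check on known threats" step can be pictured as a micro-batch filter: events arrive in small batches, matches against a known-threat list are flagged immediately, and everything else is deferred to the deeper graph and machine-learning stages. The sketch below is plain Python rather than the Spark Streaming API, and the event fields, addresses, and the `KNOWN_THREATS` set are all hypothetical.

```python
from collections import Counter
from itertools import islice

# Hypothetical known-threat list; a real system would load this from a table.
KNOWN_THREATS = {"10.0.0.66", "203.0.113.9"}


def micro_batches(events, batch_size):
    """Yield consecutive batches of at most batch_size events."""
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def first_check(batch, known_threats):
    """Split one batch into known-threat hits and events for deeper analysis."""
    hits = [e for e in batch if e["src"] in known_threats]
    rest = [e for e in batch if e["src"] not in known_threats]
    return hits, rest


def run(events, batch_size=2):
    """Process all batches; return hit counts per source and deferred events."""
    hits_per_source = Counter()
    deferred = []
    for batch in micro_batches(events, batch_size):
        hits, rest = first_check(batch, KNOWN_THREATS)
        hits_per_source.update(e["src"] for e in hits)
        deferred.extend(rest)
    return hits_per_source, deferred


SAMPLE_EVENTS = [
    {"src": "10.0.0.66", "bytes": 120},
    {"src": "192.0.2.1", "bytes": 80},
    {"src": "203.0.113.9", "bytes": 40},
    {"src": "10.0.0.66", "bytes": 300},
    {"src": "192.0.2.7", "bytes": 10},
]

hits, deferred = run(SAMPLE_EVENTS)
```

Here `hits` counts flagged events per source address and `deferred` holds the events that would move on to the next stage of the pipeline.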
13
Insurance Giant: Addressing Health Care Regulations
Patient information in M7 combined with clinical records to compute re-admittance probability
Process uses Spark with transactional data in M7
Insurance options decided in real time on online portals
14
In Summary
15
Spark on Hadoop gains traction for real-time applications
16
Pick the Right Tool for the Job
17
MapR is Unbiased Open Source (a la Linux)
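Open source distribution is about providing choice
Linux includes MySQL, PostgreSQL and SQLite
Linux includes Apache httpd, nginx and Lighttpd
Comparison: MapR Distribution for Hadoop vs. Distribution C vs. Distribution H
Spark
  MapR Distribution for Hadoop: Spark (all of it) and Shark
  Distribution C: Spark only
  Distribution H: No
Interactive SQL
  MapR Distribution for Hadoop: Shark, Impala, Drill, Hive/Tez
  Distribution C: One option (Impala)
  Distribution H: One option (Hive/Tez)
Versions
  MapR Distribution for Hadoop: Hive 0.10, 0.11, 0.12, 0.13; Pig 0.11, 0.12; HBase 0.94, 0.98
  Distribution C: One version
  Distribution H: One version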
18
Thank you
Engage with us! @mapr, maprtech, mapr-technologies, MapR
maprtech