Making Spark Fly. Parviz Deyhim. http://bit.ly/sparkemr

Presentation transcript:

Making Spark Fly. Parviz Deyhim. http://bit.ly/sparkemr

Obsession: Scalable, Highly Available, and Secure

Amazon Elastic MapReduce

What is EMR? Hadoop as a service: a massively parallel, cost-effective MapReduce engine, integrated with AWS services and a variety of tools. EMR is a managed Hadoop offering that takes the burden of deploying and maintaining Hadoop clusters away from developers. It uses the Apache Hadoop MapReduce engine and integrates with a variety of different tools.

Integration: Amazon EMR integrates with HDFS, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, and AWS Data Pipeline, alongside data management and analytics languages.

Amazon EMR Concepts: Master Node, Core Nodes, Task Nodes

Core Nodes: run TaskTracker and DataNode (HDFS) and are very similar to traditional Hadoop slave nodes: they can process data with mappers and reducers and can also store data in HDFS via the DataNode. (Diagram: an Amazon EMR cluster with a master instance group holding the Master Node and a core instance group of HDFS nodes.)

Core Nodes: you can add core nodes to a running cluster for more CPU, more memory, and more HDFS space.

Core Nodes: you can't remove core nodes from a running cluster, because removing their DataNodes could corrupt HDFS.

Task Nodes: no HDFS; they provide compute resources only (CPU and memory).

Task Nodes: can be added and removed at any time.

Spark On Amazon EMR

Bootstrap Actions: the ability to run or install additional packages/software on EMR nodes. A bootstrap action is a simple bash script stored on S3; the script is executed during node/instance boot time, and on every node that gets added to the cluster (a sketch follows).
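One way to attach a bootstrap action to a cluster is through the AWS SDK for Java; the following is a minimal Scala sketch under that assumption, with the bucket, script path, log URI, and instance settings as hypothetical placeholders.

```scala
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient
import com.amazonaws.services.elasticmapreduce.model.{
  BootstrapActionConfig, JobFlowInstancesConfig, RunJobFlowRequest, ScriptBootstrapActionConfig
}

object BootstrapExample {
  def main(args: Array[String]): Unit = {
    val emr = new AmazonElasticMapReduceClient()

    // Hypothetical script location on S3; it runs on every node at boot time.
    val installSpark = new BootstrapActionConfig()
      .withName("Install Spark")
      .withScriptBootstrapAction(
        new ScriptBootstrapActionConfig().withPath("s3://my-bucket/install-spark.sh"))

    val request = new RunJobFlowRequest()
      .withName("Spark cluster")
      .withLogUri("s3://my-bucket/logs/")              // ship cluster logs to S3
      .withBootstrapActions(installSpark)
      .withInstances(new JobFlowInstancesConfig()
        .withInstanceCount(3)
        .withMasterInstanceType("m1.xlarge")
        .withSlaveInstanceType("m1.xlarge")
        .withKeepJobFlowAliveWhenNoSteps(true))

    val result = emr.runJobFlow(request)
    println(s"Started cluster ${result.getJobFlowId}")
  }
}
```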

Spark on Amazon EMR: a bootstrap action installs Spark on the EMR nodes. Currently on Spark 0.7.3, upgrading to 0.8 very soon. http://bit.ly/sparkemr

Why Spark on Amazon EMR? Deploy small and large Spark clusters in minutes. EMR handles node recovery in case of failures. Integration with the EC2 Spot Market, Amazon Redshift, AWS Data Pipeline, Amazon CloudWatch, and more.

Why Spark on Amazon EMR? Spark logs are shipped to S3 for debugging; you define the S3 bucket at cluster deploy time.

Spark on the EC2 Spot Market: bid on unused EC2 capacity. Spark is memory hungry; bid on large-memory instances at a fraction of the on-demand cost.

Spark on the EC2 Spot Market: a 1 TB memory cluster.

Instance type   # of nodes   On-demand cost   Spot cost   Cost/GB of memory using spot
m1.xlarge       63           $31/h            $4.41/h     0.44c/GB/h
cc2.8xlarge     16           $39/h            $4.64/h     0.46c/GB/h
m2.4xlarge      15           $24/h            $2.25/h     0.22c/GB/h

(For example, an m2.4xlarge has 68.4 GB of memory, so 15 nodes give roughly 1 TB; $2.25/h spread over about 1,026 GB works out to roughly 0.22 cents per GB per hour.)

Spark on the EC2 Spot Market: launch the initial Spark cluster with core nodes; their HDFS is used to store and checkpoint RDDs. (Diagram: master instance group plus a core instance group of HDFS nodes, 32 GB of memory.)

Spark on the EC2 Spot Market: add task nodes from the spot market to increase memory capacity. (Diagram: the cluster grows from 32 GB to 256 GB of memory.)
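A hedged Scala sketch of this step using the AWS SDK for Java, assuming you grow the cluster programmatically; the cluster ID, bid price, instance type, and count are placeholders.

```scala
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient
import com.amazonaws.services.elasticmapreduce.model.{AddInstanceGroupsRequest, InstanceGroupConfig}

object AddSpotTaskNodes {
  def main(args: Array[String]): Unit = {
    val emr = new AmazonElasticMapReduceClient()

    // Task nodes carry no HDFS, so they can be added (and later removed) freely.
    val spotTaskGroup = new InstanceGroupConfig()
      .withName("spark-task-spot")
      .withInstanceRole("TASK")          // TASK = compute only, no DataNode
      .withMarket("SPOT")                // bid on unused EC2 capacity
      .withBidPrice("0.25")              // placeholder bid, in USD per hour
      .withInstanceType("m2.4xlarge")
      .withInstanceCount(4)

    val request = new AddInstanceGroupsRequest()
      .withJobFlowId("j-XXXXXXXXXXXXX")  // placeholder cluster ID
      .withInstanceGroups(spotTaskGroup)

    emr.addInstanceGroups(request)
  }
}
```

Because the task group holds no HDFS blocks, the same group can later be shrunk or removed once the spot capacity is no longer needed.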

Spark on the EC2 Spot Market: create RDDs from HDFS or Amazon S3 with sc.textFile or sc.sequenceFile, then run your computation on the RDDs.
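A minimal, self-contained Scala sketch of this step; the master URL, S3 bucket, and the word-count computation are illustrative assumptions.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object ReadAndCompute {
  def main(args: Array[String]): Unit = {
    // Placeholder master URL and application name.
    val sc = new SparkContext("spark://master:7077", "emr-read-example")

    // Create an RDD from objects in S3 (or an HDFS path such as hdfs:///data/).
    val lines = sc.textFile("s3n://my-bucket/input/")

    // Run a computation on the RDD: a simple word count.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
  }
}
```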

Spark on the EC2 Spot Market: save the resulting RDDs to HDFS or S3 with rdd.saveAsSequenceFile or rdd.saveAsObjectFile.
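A matching Scala sketch for the save step; the tiny result RDD stands in for the output of the previous slide, and the output paths are placeholders.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SaveResults {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("spark://master:7077", "emr-save-example")

    // Placeholder result RDD; in practice this is the output of the computation step.
    val counts = sc.parallelize(Seq(("spark", 3), ("emr", 2)))

    // Pairs of Writable-convertible types can be written as a Hadoop SequenceFile ...
    counts.saveAsSequenceFile("hdfs:///output/wordcounts-seq")
    // ... and arbitrary objects can be written as Java-serialized blocks.
    counts.saveAsObjectFile("s3n://my-bucket/output/wordcounts-obj")
  }
}
```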

Spark on the EC2 Spot Market: shut down the task nodes when your job is done; the cluster falls back to the core nodes (32 GB of memory in the example).

Elastic Spark With Amazon EMR

Autoscaling Spark. (Diagrams: the Amazon EMR cluster starts at 32 GB of memory and scales out to 256 GB as task nodes are added.)

Elastic Spark: when to scale? Is your job CPU bound or memory intensive? It depends on the job, and for Spark jobs it's probably both. Use CPU and memory utilization metrics to decide when to scale.

Amazon EMR CloudWatch metrics: EMR integrates with CloudWatch and provides many metrics, for example load metrics, HDFS metrics, and S3 metrics.

Amazon EMR CloudWatch metrics.

Basics of CloudWatch Metrics: pick any CloudWatch metric, pick a threshold you'd like to be notified about if it's breached, and set up CloudWatch alarms based on your thresholds. Receive notifications in the form of email, SNS, or HTTP API calls.

Basics of CloudWatch Metrics: monitor with CloudWatch, receive an email notification, and take manual action such as adding more task nodes.

Basics of CloudWatch Metrics: monitor with CloudWatch and use HTTP API calls to take automated actions.

Spark Autoscaling Based on Load: set up a CloudWatch alarm on the EMR "TotalLoad" metric, receive an email/SNS/HTTP notification, and add more worker nodes by adding EMR task nodes (sketched below).
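A hedged Scala sketch of the alarm setup using the AWS SDK for Java; the alarm name, threshold, evaluation window, SNS topic ARN, and cluster ID are placeholders, and the AWS/ElasticMapReduce namespace with a JobFlowId dimension is assumed to be how the EMR metrics are addressed in CloudWatch.

```scala
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient
import com.amazonaws.services.cloudwatch.model.{ComparisonOperator, Dimension, PutMetricAlarmRequest, Statistic}

object TotalLoadAlarm {
  def main(args: Array[String]): Unit = {
    val cw = new AmazonCloudWatchClient()

    val alarm = new PutMetricAlarmRequest()
      .withAlarmName("spark-cluster-high-load")
      .withNamespace("AWS/ElasticMapReduce")            // EMR's CloudWatch namespace
      .withMetricName("TotalLoad")
      .withDimensions(new Dimension().withName("JobFlowId").withValue("j-XXXXXXXXXXXXX"))
      .withStatistic(Statistic.Average)
      .withPeriod(300)                                   // evaluate 5-minute averages
      .withEvaluationPeriods(2)
      .withThreshold(100.0)                              // placeholder load threshold
      .withComparisonOperator(ComparisonOperator.GreaterThanThreshold)
      // Placeholder SNS topic; the notification can drive an email to a human
      // or an automated action that adds task nodes.
      .withAlarmActions("arn:aws:sns:us-east-1:123456789012:add-task-nodes")

    cw.putMetricAlarm(alarm)
  }
}
```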

Spark Autoscaling Based on Memory: Spark needs memory, lots of it! How do you scale based on memory usage?

Spark Metrics: Spark 0.8 provides cluster metrics with a source-and-sink topology.

Spark Metrics: metric sources (metrics.properties):
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource

Spark Metrics: metric sinks (metrics.properties), in the org.apache.spark.metrics.sink package: ConsoleSink, JmxSink, CsvSink, GangliaSink.

Spark Metrics Spark Metric Sinks (metrics.properties): CloudwatchSink

Spark Metrics & CloudWatch

Spark Metrics & CloudWatch: monitor Spark metrics with CloudWatch, set up CloudWatch alarms, and get notified if any metric reaches your threshold (for example, if JvmHeapUsed > 20 GB). Receive the notification and take manual or automated actions.
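The CloudwatchSink mentioned above is a custom sink whose code isn't shown in the talk. Purely to illustrate the idea (this is not that sink), here is a short Scala sketch that pushes a JVM heap measurement to CloudWatch as a custom metric with the AWS SDK for Java; the namespace and metric name are assumptions.

```scala
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient
import com.amazonaws.services.cloudwatch.model.{MetricDatum, PutMetricDataRequest, StandardUnit}

object PushHeapMetric {
  def main(args: Array[String]): Unit = {
    val cw = new AmazonCloudWatchClient()

    // Sample the JVM heap currently in use on this node.
    val runtime = Runtime.getRuntime
    val heapUsedBytes = (runtime.totalMemory - runtime.freeMemory).toDouble

    val datum = new MetricDatum()
      .withMetricName("JvmHeapUsed")      // name used in the talk's example alarm
      .withUnit(StandardUnit.Bytes)
      .withValue(heapUsedBytes)

    // "Spark" here is a placeholder custom namespace.
    cw.putMetricData(new PutMetricDataRequest()
      .withNamespace("Spark")
      .withMetricData(datum))
  }
}
```

A CloudWatch alarm on such a metric can then trigger the same email/SNS/HTTP flow described above.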

Spark Streaming and Amazon Kinesis

Amazon Kinesis

Amazon Kinesis API:
CreateStream: creates a new data stream within the Kinesis service.
PutRecord: adds new records to a Kinesis stream.
DescribeStream: provides metadata about the stream, including name, status, shards, etc.
GetNextRecord: fetches the next record for processing by user business logic.
MergeShard / SplitShard: scales a stream down / up.
DeleteStream: deletes the stream.
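A small Scala sketch exercising a few of these operations through the AWS SDK for Java; the stream name, shard count, record payload, and partition key are placeholders.

```scala
import java.nio.ByteBuffer
import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.model.PutRecordRequest

object KinesisBasics {
  def main(args: Array[String]): Unit = {
    val kinesis = new AmazonKinesisClient()

    // CreateStream: a new data stream with two shards (placeholder sizing).
    kinesis.createStream("clickstream", 2)

    // PutRecord: append one record; the partition key decides the target shard.
    val put = new PutRecordRequest()
      .withStreamName("clickstream")
      .withPartitionKey("user-42")
      .withData(ByteBuffer.wrap("page=/home ts=1385000000".getBytes("UTF-8")))
    kinesis.putRecord(put)

    // DescribeStream: inspect status and shards.
    val description = kinesis.describeStream("clickstream")
    println(description.getStreamDescription.getStreamStatus)
  }
}
```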

Amazon Kinesis

Spark Streaming and Amazon Kinesis: the Spark Streaming Kinesis receiver extends NetworkReceiver, creating a single receiver per shard that reads from Kinesis.
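The actual receiver described in the talk extended Spark 0.8's NetworkReceiver class; the rough Scala sketch below instead uses the later, better-known Receiver API and a plain shard reader, purely to illustrate the one-receiver-per-shard idea, so the class, method names, and polling logic here are assumptions rather than the real implementation.

```scala
import java.nio.charset.StandardCharsets
import scala.collection.JavaConverters._

import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.model.{GetRecordsRequest, GetShardIteratorRequest}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// One receiver instance per shard, mirroring the "single receiver per shard" idea.
class SimpleKinesisReceiver(streamName: String, shardId: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK) {

  @volatile private var running = true

  override def onStart(): Unit = {
    new Thread("kinesis-shard-reader") {
      override def run(): Unit = receive()
    }.start()
  }

  override def onStop(): Unit = { running = false }

  private def receive(): Unit = {
    val kinesis = new AmazonKinesisClient()

    // Start reading at the oldest available record in this shard.
    var iterator = kinesis.getShardIterator(
      new GetShardIteratorRequest()
        .withStreamName(streamName)
        .withShardId(shardId)
        .withShardIteratorType("TRIM_HORIZON")).getShardIterator

    while (running && iterator != null) {
      val result = kinesis.getRecords(
        new GetRecordsRequest().withShardIterator(iterator).withLimit(100))

      result.getRecords.asScala.foreach { record =>
        // Copy the payload out of the ByteBuffer and hand it to Spark Streaming.
        val buf = record.getData
        val bytes = new Array[Byte](buf.remaining())
        buf.get(bytes)
        store(new String(bytes, StandardCharsets.UTF_8))
      }

      iterator = result.getNextShardIterator
      Thread.sleep(1000) // simple throttle between GetRecords calls
    }
  }
}
```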

Misc.: new AWS instances provide enhanced networking in VPC (higher PPS, less jitter) and great CPU power, making them well suited to Spark workloads such as serialization and shuffle.

What would you like to see for Spark from Amazon? Send feedback to: Parviz Deyhim, parvizd@amazon.com, @pdeyhim