1
Making Fly
Parviz Deyhim
2
Obsession: Scalable, Highly-available and Secure
4
Amazon Elastic MapReduce
5
What is EMR?
Map-Reduce engine; integrated with tools; Hadoop-as-a-service; massively parallel; cost-effective AWS wrapper; integrated with AWS services.
EMR is a managed Hadoop offering that takes the burden of deploying and maintaining Hadoop clusters away from developers. EMR uses the Apache Hadoop MapReduce engine and integrates with a variety of tools.
6
Integration
Diagram (built up step by step over slides 6 to 11): Amazon EMR integrates with HDFS, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift and Data Pipeline, together with data management tools and analytics languages.
12
Amazon EMR Concepts: Master Node, Core Nodes, Task Nodes
13
Core Nodes
Diagram: an Amazon EMR cluster with a master instance group (Master Node) and a core instance group whose nodes run DataNode (HDFS).
We'll start with core nodes. Core nodes run TaskTracker and DataNode, and are very similar to traditional Hadoop slave nodes: they process data with mappers and reducers, and they store data in HDFS through the DataNode.
14
Core Nodes
Can add core nodes: more CPU, more memory, more HDFS space.
15
Core Nodes
Can't remove core nodes: removing them risks HDFS corruption.
16
Task Nodes
No HDFS. Task nodes provide compute resources only: CPU and memory.
17
Task Nodes
Can add and remove task nodes.
18
Spark On Amazon EMR
19
Bootstrap Actions
Ability to run or install additional packages/software on EMR nodes.
A bootstrap action is a simple bash script stored on S3.
The script is executed at node/instance boot time.
The script is executed on every node that is added to the cluster.
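To make the idea concrete, here is a minimal sketch of how a bootstrap action could be described when a cluster is requested. The dictionary follows the shape that boto3's `run_job_flow` takes under `BootstrapActions`, as far as I know; the helper name, the bucket and the script name are hypothetical placeholders, not anything from this deck.

```python
def bootstrap_action(name, script_s3_path, args=None):
    """Describe one bootstrap action: a script on S3 that runs on every node at boot."""
    if not script_s3_path.startswith("s3://"):
        raise ValueError("bootstrap scripts are read from S3, expected an s3:// path")
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_s3_path,
            "Args": list(args or []),
        },
    }


# Hypothetical script location, for illustration only.
install_spark = bootstrap_action(
    "Install Spark", "s3://my-example-bucket/install-spark.sh"
)
```

A list of such dictionaries would be passed along with the rest of the cluster request, so the same script also runs on nodes added later.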
20
Spark on Amazon EMR: http://bit.ly/sparkemr
A bootstrap action installs Spark on EMR nodes.
Currently on Spark 0.7.3, upgrading to 0.8 very soon.
21
Why Spark on Amazon EMR?
Deploy small and large Spark clusters in minutes.
EMR handles node recovery in case of failures.
Integration with the EC2 Spot Market, Amazon Redshift, AWS Data Pipeline, Amazon CloudWatch, etc.
22
Why Spark on Amazon EMR?
Shipping Spark logs to S3 for debugging: define the S3 bucket at cluster deploy time.
23
Spark on EC2 Spot Market
Bid on unused EC2 capacity.
Spark is memory hungry: bid on large-memory instances at a fraction of the on-demand cost.
24
Spark on EC2 Spot Market: 1TB Memory Cluster
Instance Type   # of Nodes   On-demand Cost   Spot Cost   Cost/GB of Memory Using Spot
M1.xlarge       63           $31/h            $4.41/h     0.44c/GB/h
CC2.8xlarge     16           $39/h            $4.64/h     0.46c/GB/h
M2.4xlarge      15           $24/h            $2.25/h     0.22c/GB/h
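The last column of the 1TB table can be checked with a few lines of arithmetic. This sketch assumes the cost per GB is simply the hourly Spot cost of the whole cluster divided by 1000 GB; that reading is my inference from the numbers, not something the slide states. The savings calculation is an extra derived from the same rows.

```python
# Rows copied from the slide: (instance type, nodes, on-demand $/h, spot $/h).
CLUSTERS = [
    ("M1.xlarge", 63, 31.00, 4.41),
    ("CC2.8xlarge", 16, 39.00, 4.64),
    ("M2.4xlarge", 15, 24.00, 2.25),
]

CLUSTER_MEMORY_GB = 1000  # assumption: the "1TB" cluster is treated as 1000 GB


def cents_per_gb_hour(cluster_cost_per_hour, memory_gb=CLUSTER_MEMORY_GB):
    """Hourly cluster cost in dollars converted to cents per GB of memory."""
    return cluster_cost_per_hour * 100 / memory_gb


def spot_savings_percent(on_demand, spot):
    """How far the Spot price sits below the on-demand price, in percent."""
    return (1 - spot / on_demand) * 100


for name, nodes, on_demand, spot in CLUSTERS:
    print(
        f"{name:12s} {nodes:3d} nodes  "
        f"{cents_per_gb_hour(spot):.3f} c/GB/h on Spot, "
        f"{spot_savings_percent(on_demand, spot):.0f}% below on-demand"
    )
```

Under that assumption the computed values agree with the slide's column to within rounding.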
25
Spark on EC2 Spot Market
Launch the initial Spark cluster with core nodes. HDFS on the core nodes is used to store and checkpoint RDDs.
Diagram: Amazon EMR cluster with a master node and HDFS core nodes, 32GB memory.
26
Spark on EC2 Spot Market
Add task nodes in the Spot Market to increase memory capacity.
Diagram: the core nodes (HDFS, 32GB memory) are joined by Spot task nodes with 256GB memory.
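As a rough illustration, a Spot task group could be described like this. The keys follow the instance group shape that boto3's `add_instance_groups` accepts, to the best of my knowledge; the group name, instance type, node count and bid price are made-up values for the example, not recommendations.

```python
def spot_task_group(instance_type, instance_count, bid_price_usd):
    """Describe a task instance group that is bought on the Spot Market."""
    if instance_count < 1:
        raise ValueError("a new task group needs at least one instance")
    return {
        "Name": "spark-task-spot",          # hypothetical group name
        "InstanceRole": "TASK",             # task nodes: compute only, no HDFS
        "Market": "SPOT",
        "BidPrice": f"{bid_price_usd:.2f}",  # the bid is passed as a string
        "InstanceType": instance_type,
        "InstanceCount": instance_count,
    }


# Illustrative values only.
group = spot_task_group("m2.4xlarge", 4, 0.20)
```

Because task nodes hold no HDFS blocks, losing them to a Spot price change costs compute capacity but not stored data.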
27
Spark on EC2 Spot Market
Create RDDs from HDFS or Amazon S3 with sc.textFile or sc.sequenceFile, then run computation on the RDDs.
28
Spark on EC2 Spot Market
Save the resulting RDDs to HDFS or S3 with rdd.saveAsSequenceFile or rdd.saveAsObjectFile.
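The load, compute and save steps can be sketched in PySpark as follows. This is a sketch under assumptions: it takes an existing SparkContext `sc`, it targets a Spark release with PySpark support for `saveAsSequenceFile` (newer than the version named in this deck), and the word-count job itself is only a stand-in for a real computation. `saveAsObjectFile` belongs to the Scala/Java API; the closest PySpark call I know of is `saveAsPickleFile`.

```python
def tokenize(line):
    """Split one line of text into lowercase words."""
    return line.lower().split()


def word_counts(sc, input_path, output_path):
    """Count words in a text file on HDFS or S3 and save the (word, count) pairs.

    `sc` is an existing SparkContext; both paths are HDFS or S3 URIs.
    """
    lines = sc.textFile(input_path)              # create an RDD from HDFS or S3
    counts = (
        lines.flatMap(tokenize)                  # run computation on the RDD
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )
    counts.saveAsSequenceFile(output_path)       # save the resulting RDD
    return counts
```

Reading from S3 and writing the result back to S3 keeps the data independent of the cluster, which matters once task nodes come and go.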
29
Spark on EC2 Spot Market
Shut down the task nodes when your job is done.
Diagram: the cluster is back to the master node and the HDFS core nodes, 32GB memory.
30
Elastic Spark With Amazon EMR
31
Autoscaling Spark
Diagram: Amazon EMR cluster with a master node and HDFS core nodes, 32GB memory.
32
Autoscaling Spark
Diagram: the same Amazon EMR cluster after scaling out, labeled 32GB memory and 256GB memory.
33
Elastic Spark: When to Scale?
Depends on your job: is it CPU bound or memory intensive? Probably both for Spark jobs.
Use CPU/memory utilization metrics to decide when to scale.
34
Amazon EMR CloudWatch metrics
EMR integrates with CloudWatch and provides many metrics. Examples: load metrics, HDFS metrics, S3 metrics.
36
Basics on CloudWatch Metrics
Pick any CloudWatch metric.
Pick a threshold at which you would like to be notified if it is breached.
Set up CloudWatch alarms based on your thresholds.
Receive SNS notifications, for example as an HTTP API call.
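A threshold alarm can be pictured with a small model. The function below is a simplified illustration of the behavior, not CloudWatch's implementation: it reports ALARM only when the most recent datapoints all exceed the threshold. The function and parameter names are my own; the three state names are the ones CloudWatch alarms use.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Simplified threshold alarm: ALARM when the last N datapoints all breach."""
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"   # not enough history to decide yet
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"


# Hypothetical load readings, one per period.
readings = [3.0, 7.5, 9.1, 9.4]
state = alarm_state(readings, threshold=8.0, evaluation_periods=2)
```

Requiring several consecutive breaches keeps a single spike from triggering a scale-out.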
37
Basics on CloudWatch Metrics
Flow: monitor with CloudWatch, receive a notification, then take manual action such as adding more task nodes.
38
Basics on CloudWatch Metrics
Flow: monitor with CloudWatch, receive HTTP API calls, then take automated actions.
39
Spark Autoscaling Based on Load
Set up a CloudWatch alarm on the EMR "TotalLoad" metric.
Receive the SNS/HTTP notification.
Add more worker nodes by adding EMR task nodes.
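The automated path could look roughly like this: an HTTP endpoint receives the SNS message, checks that the alarm fired, and works out the new task node count. The step size, the cap and both function names are assumptions for illustration. The message fields (`Type`, `Message`, `NewStateValue`) follow the SNS and CloudWatch alarm notification format as I understand it, and the resize dictionary mirrors what boto3's `modify_instance_groups` takes.

```python
import json


def task_nodes_after_alarm(sns_body, current_count, step=2, max_count=20):
    """Return the task node count to use after one SNS HTTP notification."""
    envelope = json.loads(sns_body)
    if envelope.get("Type") != "Notification":
        return current_count                   # e.g. a subscription confirmation
    alarm = json.loads(envelope["Message"])    # alarm details arrive as a JSON string
    if alarm.get("NewStateValue") != "ALARM":
        return current_count                   # alarm cleared, nothing to add
    return min(current_count + step, max_count)


def resize_request(instance_group_id, new_count):
    """Build the resize payload for the task instance group."""
    return {
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id, "InstanceCount": new_count}
        ]
    }
```

The cap matters: without an upper bound, a persistently loaded cluster would keep growing on every notification.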
40
Spark Autoscaling Based on Memory
Spark needs memory. Lots of it! How do we scale based on memory usage?
41
Spark Metrics
Spark 0.8 provides cluster metrics.
Source and sink topology.
42
Spark Metrics
Spark metric sources (metrics.properties):
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
43
Spark Metrics
Spark metric sinks (metrics.properties):
Package: org.apache.spark.metrics.sink
Sinks: ConsoleSink, JmxSink, CsvSink, GangliaSink
44
Spark Metrics
Spark metric sinks (metrics.properties): CloudwatchSink
46
Spark Metrics & CloudWatch
Monitor Spark metrics with CloudWatch.
Set up CloudWatch alarms and get notified if any metric reaches your threshold. Example: JvmHeapUsed > 20G.
Receive the notification and take manual or automated action.
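To illustrate the JvmHeapUsed example, here is a sketch of the kind of datapoint a CloudWatch sink might publish and how the 20G threshold would be checked. The dictionary follows the `MetricData` entry shape of boto3's `put_metric_data` as far as I know; the `ExecutorId` dimension, the helper names, and reading "20G" as 20 GiB are all assumptions.

```python
GIB = 1024 ** 3  # assumption: "20G" on the slide is read as 20 GiB


def heap_datum(executor_id, heap_used_bytes):
    """Describe one JVM heap reading as a CloudWatch metric datapoint."""
    return {
        "MetricName": "JvmHeapUsed",
        "Dimensions": [{"Name": "ExecutorId", "Value": executor_id}],  # hypothetical dimension
        "Value": float(heap_used_bytes),
        "Unit": "Bytes",
    }


def breaches_threshold(datum, threshold_gib=20):
    """True when the reading is above the alarm threshold."""
    return datum["Value"] > threshold_gib * GIB


# Illustrative reading: 22 GiB of heap in use on one executor.
reading = heap_datum("executor-1", 22 * GIB)
over = breaches_threshold(reading)
```

Publishing in bytes and converting only at the threshold keeps the unit consistent with what the JVM reports.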
47
Spark Streaming and Amazon Kinesis
48
Amazon Kinesis
49
Amazon Kinesis
CreateStream: creates a new data stream within the Kinesis service.
PutRecord: adds new records to a Kinesis stream.
DescribeStream: provides metadata about the stream, including name, status, shards, etc.
GetNextRecord: fetches the next record for processing by user business logic.
MergeShard / SplitShard: scales the stream up/down.
DeleteStream: deletes the stream.
51
Spark Streaming and Amazon Kinesis
Spark Streaming Kinesis receiver: extends NetworkReceiver, creates a single receiver per shard, and reads from Kinesis.
52
Misc.
New AWS instances provide enhanced networking in VPC: higher PPS (packets per second) and less jitter.
Great CPU power.
Suitable for Spark: serialization and shuffle.
53
What Would You Like to See on Spark by Amazon?
Send feedback to: Parviz Deyhim, @pdeyhim