Published by Benjamin Horn. Modified over 9 years ago.
1
Spark and Scala Sheng QIAN 2015-06-17
2
The Berkeley Data Analytics Stack
3
The Goal of Spark
4
Comparison of Spark and Hadoop
6
Spark supports … Scala (best), Python (2.7.*), Java (…)
7
All based on RDD (Resilient Distributed Dataset)
- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
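The five properties above can be read as an interface. The following is a minimal sketch in plain Python, assuming a toy class named ToyRDD (hypothetical, not Spark's actual RDD class) and a small in-memory data set invented for illustration.

```python
class ToyRDD:
    """Toy stand-in mirroring the five RDD properties; not Spark's API."""

    def __init__(self, partitions, compute, dependencies=(),
                 partitioner=None, preferred_locations=None):
        self.partitions = list(partitions)      # 1. a list of partitions (here: indices)
        self.compute = compute                  # 2. a function computing one split
        self.dependencies = list(dependencies)  # 3. parent datasets this one depends on
        self.partitioner = partitioner          # 4. optional, for key-value data
        # 5. optional locality hints; default: no preference for any split
        self.preferred_locations = preferred_locations or (lambda split: [])


# Hypothetical two-partition source over an in-memory collection.
data = [[1, 2, 3], [4, 5]]
source = ToyRDD(partitions=range(len(data)),
                compute=lambda split: iter(data[split]))

# A derived dataset that doubles each element and records its parent.
doubled = ToyRDD(
    partitions=source.partitions,
    compute=lambda split: (x * 2 for x in source.compute(split)),
    dependencies=[source],
)

# Compute every split of the derived dataset, one partition at a time.
result = [x for p in doubled.partitions for x in doubled.compute(p)]
print(result)
```

Each split is computed independently from the parent's matching split, which is what lets the work be spread across machines.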
9
The process
1. File system (HDFS/HBase) / collections -> RDD
2. Transformation (delayed execution) * Faster than MapReduce (MR) due to this
3. Action (execution)
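The three steps can be sketched with a toy lazy dataset in plain Python. LazyDataset and square are hypothetical names, and this is an illustration of delayed execution under those assumptions, not Spark's implementation: transformations only record what to do, and the action runs the recorded steps in one pass.

```python
class LazyDataset:
    """Toy model: transformations are recorded, only an action executes them."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    def map(self, f):
        # Transformation: returns a new dataset, runs nothing.
        return LazyDataset(self._source, self._ops + (("map", f),))

    def filter(self, pred):
        # Transformation: returns a new dataset, runs nothing.
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def collect(self):
        # Action: chain the recorded steps as iterators and run them once.
        it = iter(self._source)
        for kind, fn in self._ops:
            it = map(fn, it) if kind == "map" else filter(fn, it)
        return list(it)


calls = []

def square(x):
    calls.append(x)  # record each invocation so laziness is observable
    return x * x

# Step 1: create from a collection. Step 2: transformations.
ds = LazyDataset([1, 2, 3, 4]).map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)

# Step 3: the action triggers execution.
result = ds.collect()
print(calls_before_action, result)
```

Because nothing runs until collect, the map and filter steps are chained into a single pass with no intermediate list written out between them.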
10
Transformations and actions
11
Fault tolerance: every RDD records the RDDs it depends on (its lineage), so a lost partition can be recomputed from its parents.
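A minimal sketch of recovery through recorded dependencies, in plain Python. Node is a hypothetical class, and the data loss is simulated by clearing a cached result; this illustrates the idea only and is not how Spark stores or recovers partitions.

```python
class Node:
    """Toy dataset that remembers its parents and how to rebuild itself."""

    def __init__(self, name, parents, compute):
        self.name = name
        self.parents = parents
        self.compute = compute
        self.cache = None  # materialized result, may be lost

    def get(self):
        # Recompute from parents whenever the materialized result is missing.
        if self.cache is None:
            inputs = [p.get() for p in self.parents]
            self.cache = self.compute(*inputs)
        return self.cache

    def lineage(self):
        # Names of all ancestors, oldest first, then this node.
        names = []
        for p in self.parents:
            names.extend(p.lineage())
        return names + [self.name]


base = Node("base", [], lambda: [1, 2, 3])
mapped = Node("mapped", [base], lambda xs: [x + 1 for x in xs])

first = mapped.get()
mapped.cache = None       # simulate losing the computed data
recovered = mapped.get()  # rebuilt from the recorded dependency on base
print(mapped.lineage(), recovered)
```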
12
Cluster Overview
14
Task Schedule
15
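DAG Scheduler
- Builds a DAG of stages and decides the best location for each task
- Records which RDDs or stage outputs have been materialized
- Passes each TaskSet to the lower-level TaskScheduler
- Resubmits stages whose shuffle output was lost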
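A rough sketch of stage construction in plain Python, assuming the usual rule that a new stage starts wherever an operation needs a shuffle. build_stages is a hypothetical helper, and it handles only a linear chain of operations, whereas a real DAG can branch and join.

```python
def build_stages(ops):
    """Split a linear chain of (name, needs_shuffle) operations into stages.

    A new stage begins at each operation that needs a shuffle; the
    operations before it are pipelined together in the previous stage.
    """
    stages, current = [], []
    for name, needs_shuffle in ops:
        if needs_shuffle and current:
            stages.append(current)
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages


# Hypothetical word-count-style chain; only reduceByKey needs a shuffle.
ops = [
    ("textFile", False),
    ("flatMap", False),
    ("map", False),
    ("reduceByKey", True),
    ("filter", False),
]
stages = build_stages(ops)
print(stages)
```

If the shuffle output of the first stage were lost, that whole stage would be the unit to resubmit.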
16
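Task Scheduler
- Submits each TaskSet (a group of tasks) to the cluster to run and reports the results
- Reports a fetch failed error when shuffle output is lost
- Retries straggler tasks on other nodes
- Maintains a TaskSetManager for each TaskSet (tracking locality and error information)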
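The retry-on-another-node idea can be sketched in plain Python. run_task_set, flaky_run, and the node names are hypothetical, and the sketch covers only retrying a failed task elsewhere while keeping per-task error records; it does not model locality or speculative execution of stragglers.

```python
def run_task_set(tasks, nodes, run, max_attempts=3):
    """Run each task, moving to the next node after a failure.

    Returns (results, errors), where errors maps a task to the list of
    (node, message) pairs recorded for its failed attempts.
    """
    results, errors = {}, {}
    for task in tasks:
        for attempt in range(max_attempts):
            node = nodes[attempt % len(nodes)]  # different node per attempt
            try:
                results[task] = run(task, node)
                break
            except RuntimeError as exc:
                errors.setdefault(task, []).append((node, str(exc)))
        else:
            # Every attempt failed: give up on the whole task set.
            raise RuntimeError(f"task {task} failed {max_attempts} times")
    return results, errors


def flaky_run(task, node):
    # Simulated failure: task 1 cannot run on node-a.
    if node == "node-a" and task == 1:
        raise RuntimeError("simulated failure")
    return f"task {task} ok on {node}"


results, errors = run_task_set([0, 1], ["node-a", "node-b"], flaky_run)
print(results, errors)
```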
17
Job Schedule
18
Job Optimization
19
Why Scala
- Based on the JVM
- FP + OO (functional and object-oriented programming)
20
Scala - Grammar On Evernote
21
Thank you