Spark and Scala Sheng QIAN 2015-06-17. The Berkeley Data Analytics Stack.

Presentation transcript:

1 Spark and Scala Sheng QIAN 2015-06-17

2 The Berkeley Data Analytics Stack

3 The Goal of Spark

4 Comparison of Spark and Hadoop

5

6 Spark supports … Scala (best), Python (2.7.*), Java (…)

7 All based on RDD (Resilient Distributed Dataset)
   - A list of partitions
   - A function for computing each split
   - A list of dependencies on other RDDs
   - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
   - Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
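The five properties listed on this slide can be pictured as the fields of a small class. The following is only a minimal sketch in plain Python, assuming a made-up `ToyRDD` class and in-memory `blocks`; it does not use Spark and is not how Spark defines its RDD class.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not Spark's actual class)."""

    def __init__(self, partitions, compute, dependencies=None,
                 partitioner=None, preferred_locations=None):
        # 1. A list of partitions (here just integer partition ids).
        self.partitions = partitions
        # 2. A function for computing each split, given its partition id.
        self.compute = compute
        # 3. A list of dependencies on other RDDs.
        self.dependencies = dependencies or []
        # 4. Optionally, a partitioner for key-value RDDs (key -> partition id).
        self.partitioner = partitioner
        # 5. Optionally, preferred locations for each split.
        self.preferred_locations = preferred_locations or (lambda split: [])


# Hypothetical source data: two partitions held in memory.
blocks = {0: [1, 2, 3], 1: [4, 5, 6]}
source = ToyRDD(
    partitions=[0, 1],
    compute=lambda split: iter(blocks[split]),
    preferred_locations=lambda split: ["host-%d" % split],
)

# A derived dataset: same partitions, computed from the parent it depends on.
doubled = ToyRDD(
    partitions=source.partitions,
    compute=lambda split: (x * 2 for x in source.compute(split)),
    dependencies=[source],
)

result = [list(doubled.compute(p)) for p in doubled.partitions]
```

Each partition of `doubled` is computed independently from the matching partition of `source`, which is the kind of one-to-one dependency that lets a partition be processed without touching the others.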

9 The process
   1. File system (HDFS/HBase) or collections → RDD
   2. Transformation (lazy execution) * one reason Spark is faster than MapReduce
   3. Action (triggers execution)
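The three steps can be illustrated without Spark. The following is a minimal sketch, assuming a hypothetical `LazyDataset` class: transformations only record what to do, and the action runs every recorded step in one pass over the data.

```python
class LazyDataset:
    """Toy model of lazy transformations (hypothetical; not Spark's API)."""

    def __init__(self, source, ops=()):
        self.source = source      # step 1: data from a collection
        self.ops = tuple(ops)     # recorded steps, not yet executed

    def map(self, fn):
        # Transformation: record the function and return a new dataset.
        return LazyDataset(self.source, self.ops + (("map", fn),))

    def filter(self, pred):
        # Transformation: record the predicate and return a new dataset.
        return LazyDataset(self.source, self.ops + (("filter", pred),))

    def collect(self):
        # Action: run all recorded steps in a single pass over the data.
        out = []
        for item in self.source:
            keep = True
            for kind, fn in self.ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []


def square(x):
    calls.append(x)       # record each invocation to show when work happens
    return x * x


data = LazyDataset(range(1, 6))                       # 1. collection -> dataset
pipeline = data.map(square).filter(lambda v: v % 2)   # 2. nothing runs yet
calls_before_action = len(calls)
result = pipeline.collect()                           # 3. action triggers execution
```

Because `map` and `filter` are applied element by element inside `collect`, no intermediate list is built between the two steps.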

10 Transformations and actions

11 Fault tolerance Every RDD records RDDs it depends on
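Recording dependencies means lost data can be rebuilt from its parents rather than restored from a replica. The following is a minimal sketch of that idea, assuming a hypothetical `LineageNode` class with a single parent per dataset; real RDDs track dependencies per partition.

```python
class LineageNode:
    """Toy dataset that remembers its parent and how it was derived."""

    def __init__(self, name, parent=None, fn=None, data=None):
        self.name = name
        self.parent = parent    # the dataset this one depends on
        self.fn = fn            # how to derive each element from the parent
        self.cache = data       # materialized data; may be lost

    def lineage(self):
        # Walk the recorded dependencies back to the original source.
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain

    def get(self):
        if self.cache is None:
            # Data is missing: recompute it from the parent.
            self.cache = [self.fn(x) for x in self.parent.get()]
        return self.cache


source = LineageNode("source", data=[1, 2, 3])
plus_one = LineageNode("plus_one", parent=source, fn=lambda x: x + 1)
times_ten = LineageNode("times_ten", parent=plus_one, fn=lambda x: x * 10)

first = times_ten.get()

# Simulate losing the computed data of both derived datasets.
plus_one.cache = None
times_ten.cache = None
recovered = times_ten.get()
```

After the simulated loss, `times_ten.get()` walks back through `plus_one` to `source` and produces the same values as the first computation.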

12 Cluster Overview

13

14 Task Schedule

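15 DAG Scheduler
   - Builds a DAG of stages and decides the best location for each task
   - Records which RDDs or stage outputs have been materialized
   - Passes each task set to the lower-level TaskScheduler
   - Resubmits stages whose shuffle output was lost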
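Stage construction can be pictured as cutting a job wherever a shuffle is needed. The following is a minimal sketch under strong simplifying assumptions: the job is a linear chain rather than a general DAG, and `split_into_stages` and the `needs_shuffle` flag are hypothetical names, not part of Spark.

```python
def split_into_stages(ops):
    """Group a linear chain of (name, needs_shuffle) operations into stages.

    A new stage starts at every operation that needs a shuffle.
    """
    stages = [[]]
    for name, needs_shuffle in ops:
        if needs_shuffle and stages[-1]:
            stages.append([])
        stages[-1].append(name)
    return stages


# Hypothetical job; the names mirror common Spark operations.
job = [
    ("textFile", False),
    ("flatMap", False),
    ("map", False),
    ("reduceByKey", True),
    ("mapValues", False),
    ("sortByKey", True),
]
stages = split_into_stages(job)
```

Operations that do not need a shuffle stay in the same stage as their predecessor, so they can run back to back on the same partition.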

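16 Task Scheduler
   - Submits each task set (a group of tasks) to the cluster to run and reports the results
   - Reports a fetch-failed error when shuffle output is lost
   - Retries straggler tasks on other nodes
   - Maintains a TaskSetManager for each TaskSet (tracks locality and error information)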
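The retry and locality behaviour can be sketched in a few lines. This is a toy model only: `run_task_set`, `flaky_run`, and the node names are hypothetical, and a failing node stands in for a straggler, whereas a real scheduler would also deal with slow tasks that have not failed.

```python
def run_task_set(tasks, nodes, run, max_attempts=3):
    """Run each task, preferring its local node and retrying elsewhere.

    tasks: dict of task id -> preferred node
    run:   callable (task_id, node) -> result, raising RuntimeError on failure
    Returns (results, failures): results per task and a log of failed attempts.
    """
    results, failures = {}, []
    for task_id, preferred in tasks.items():
        # Try the preferred (data-local) node first, then the others.
        candidates = [preferred] + [n for n in nodes if n != preferred]
        for node in candidates[:max_attempts]:
            try:
                results[task_id] = run(task_id, node)
                break
            except RuntimeError as err:
                # Keep error information, as a per-task-set manager would.
                failures.append((task_id, node, str(err)))
    return results, failures


def flaky_run(task_id, node):
    # Hypothetical worker: node "n1" always fails, to force a retry.
    if node == "n1":
        raise RuntimeError("node unavailable")
    return "task %d ok on %s" % (task_id, node)


nodes = ["n1", "n2", "n3"]
tasks = {0: "n1", 1: "n2"}
results, failures = run_task_set(tasks, nodes, flaky_run)
```

Task 0 prefers `n1`, fails there, and is retried on `n2`; task 1 runs on its preferred node at the first attempt.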

17 Job Schedule

18 Job Optimization

19 Why Scala: based on the JVM; FP + OO

20 Scala - Grammar On Evernote

21 Thank you

