Hadoop Daniel Hu

Scale-up vs Scale-out

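Parallel Computing
Decompose the task. The key is to eliminate dependencies between tasks. Then combine the results. Common cases:
◦ Each task produces a result, and these results must be combined to obtain the final result.
◦ Each task produces a result, but the results are independent of one another.
◦ Some tasks produce no result.
◦ Only one task produces the final result.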

Parallel Computing Architecture: Main Concepts
◦ Task generator
◦ Processor
◦ Result collector
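The three roles above (task generator, processor, result collector) can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the slides: the names `generate_tasks`, `process`, `collect`, and `sum_of_squares` are hypothetical, and the sum-of-squares workload is an assumption chosen because its tasks do not depend on one another.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_tasks(numbers, chunk_size):
    """Task generator: split the input into independent chunks."""
    for start in range(0, len(numbers), chunk_size):
        yield numbers[start:start + chunk_size]


def process(chunk):
    """Processor: compute a partial result for one chunk."""
    return sum(n * n for n in chunk)


def collect(partial_results):
    """Result collector: combine the partial results into the final result."""
    return sum(partial_results)


def sum_of_squares(numbers, chunk_size=3, workers=4):
    """Run the three roles together on a small pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is handed to whichever worker is free.
        partials = pool.map(process, generate_tasks(numbers, chunk_size))
        return collect(partials)


if __name__ == "__main__":
    print(sum_of_squares(list(range(10))))
```

Because no chunk depends on another, the chunks can be processed in any order and the collector only needs an associative combine step (here, addition).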

Master-Worker Pattern

◦ Random Workers
◦ Designated Workers

Data Storage and Analysis
Reading and writing data in parallel to or from multiple disks raises further problems:
◦ The first problem to solve is hardware failure.
◦ The second problem is that most analysis tasks need to be able to combine the data in some way.

RDBMS
Why can't we use databases with lots of disks to do large-scale batch analysis? Why is MapReduce needed?

RDBMS compared to MapReduce

Grid Computing
Grid computing works well for predominantly compute-intensive jobs. It becomes a problem when nodes need to access larger data volumes (hundreds of gigabytes, the point at which MapReduce really starts to shine), because network bandwidth becomes the bottleneck and compute nodes sit idle.

MapReduce
Parallelizing the processing by hand runs into several problems:
◦ Dividing the work into equal-size pieces isn't always easy or obvious.
◦ Combining the results from independent processes may require further processing.
◦ You are still limited by the processing capacity of a single machine.

MapReduce
MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function.
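The two phases can be illustrated with a small in-process simulation in Python. This is a sketch under stated assumptions and does not use the Hadoop API: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names, word counting is an assumed example workload, and the grouping step stands in for the shuffle that a real framework performs between the map and reduce phases.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map function: (offset, line) -> a (word, 1) pair for each word."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce function: (word, [1, 1, ...]) -> (word, total)."""
    yield word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical single-process driver; not part of any real framework."""
    # Map phase: apply map_fn to every input pair and group values by key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Reduce phase: apply reduce_fn to each key and its grouped values.
    results = []
    for key in sorted(groups):
        results.extend(reduce_fn(key, groups[key]))
    return results


if __name__ == "__main__":
    records = [(0, "the quick fox"), (14, "the lazy dog")]
    print(run_mapreduce(records, map_fn, reduce_fn))
```

Both functions take and return key-value pairs, which is what lets the driver stay generic: only `map_fn` and `reduce_fn` change from one job to the next.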
