Hadoop Daniel Hu
Scale-up vs Scale-out
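Parallel Computing Decompose the task. The key is to eliminate dependencies between tasks. Combine the results. ◦ Each task produces a result, and these results must then be combined into the final result. ◦ The results are independent of each other, but each task produces one. ◦ Some tasks produce no result. ◦ Only one task produces the final result.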
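The decompose-then-combine idea can be sketched in a few lines of Python. This is only an illustration of the first case in the list (every task produces a partial result that is merged at the end). The function names `split_into_chunks`, `partial_sum`, and `parallel_sum` are hypothetical, and a thread pool is used purely to show the structure, not to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Decompose: cut the input into n_chunks contiguous, independent slices."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def partial_sum(chunk):
    """One task: it reads only its own chunk, so tasks have no dependencies."""
    return sum(chunk)


def parallel_sum(items, n_workers=4):
    chunks = split_into_chunks(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    # Combine: merge the per-task results into the final result.
    return sum(partials)
```

Because no task reads another task's chunk, the tasks can run in any order; only the final `sum(partials)` step needs all of them to have finished.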
Parallel Computing Architecture: Key Concepts Task, Producer, Processor, Result, Collector
Master-Worker Pattern
Random Workers Designated Workers
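A minimal sketch of the master-worker pattern, assuming a shared task queue from which any idle worker takes the next task (which may roughly correspond to the "random workers" variant; that reading is an assumption, since the slide gives only the label). The master plays the producer and collector roles, and the worker threads are the processors. The name `run_master_worker` is hypothetical, and error handling is left out.

```python
import queue
import threading


def run_master_worker(tasks, process, n_workers=3):
    """Master produces tasks, workers process them, master collects results."""
    task_q = queue.Queue()
    result_q = queue.Queue()

    def worker():
        while True:
            item = task_q.get()
            if item is None:  # sentinel: no more work for this worker
                break
            task_id, payload = item
            result_q.put((task_id, process(payload)))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()

    # Producer: enqueue every task, then one sentinel per worker.
    for task_id, payload in enumerate(tasks):
        task_q.put((task_id, payload))
    for _ in threads:
        task_q.put(None)

    for t in threads:
        t.join()

    # Collector: drain the results and restore the original task order.
    results = {}
    while not result_q.empty():
        task_id, value = result_q.get()
        results[task_id] = value
    return [results[i] for i in range(len(tasks))]
```

For example, `run_master_worker([1, 2, 3, 4], lambda x: x * x)` squares each input. Results arrive in whatever order the workers finish, which is why each task carries an id that the collector uses to reorder them.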
Data Storage and Analysis There’s more to it than being able to read and write data in parallel to or from multiple disks. The first problem to solve is hardware failure. The second problem is that most analysis tasks need to be able to combine the data in some way.
RDBMS Why can’t we use databases with lots of disks to do large-scale batch analysis? Why is MapReduce needed?
RDBMS compared to MapReduce
Grid Computing Grid computing works well for predominantly compute-intensive jobs, but it becomes a problem when nodes need to access larger data volumes (hundreds of gigabytes, the point at which MapReduce really starts to shine): network bandwidth becomes the bottleneck, and compute nodes sit idle.
MapReduce ◦ Dividing the work into equal-size pieces isn’t always easy or obvious. ◦ Combining the results from independent processes can need further processing. ◦ You are still limited by the processing capacity of a single machine.
MapReduce MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function.
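The two phases can be illustrated with a small, single-process sketch in Python. This is not Hadoop's API: `run_mapreduce`, `word_count_map`, and `word_count_reduce` are hypothetical names, and word counting is just a convenient example. Between the two phases the intermediate pairs are grouped by key, which a real framework handles for the programmer.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map_fn over (key, value) records, group by key, then run reduce_fn."""
    # Map phase: each input pair yields zero or more intermediate pairs,
    # which are grouped by their key as they are produced.
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)
    # Reduce phase: each key and its list of values yields output pairs.
    output = []
    for out_key in sorted(grouped):
        output.extend(reduce_fn(out_key, grouped[out_key]))
    return output


def word_count_map(offset, line):
    """Map function: emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def word_count_reduce(word, counts):
    """Reduce function: add up the counts collected for one word."""
    yield word, sum(counts)


# Input pairs: (offset of the line, text of the line). Offsets are made up.
lines = [(0, "the quick brown fox"), (20, "the lazy dog"), (33, "The fox")]
counts = run_mapreduce(lines, word_count_map, word_count_reduce)
```

Here the programmer supplies only the two functions and chooses the key and value types (offset and line in, word and count out); the driver is the same for any other job.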