Introduction to Hadoop
Richard Holowczak
Baruch College
Problems of Scale
- As data size and processing complexity grow:
  - Contention for disks: disks have limited throughput
  - The number of processing cores per server/OS image is limited, so processing throughput is limited
- Reliability of distributed systems:
  - Tightly coupled distributed systems fall apart when one component (disk, network, CPU, etc.) fails
  - What happens to processing jobs when there is a failure?
- Rigid structure of distributed systems:
  - Consider our ETL processes: the target schema is fixed ahead of time
Hadoop
- A distributed data processing ecosystem that is scalable, reliable, and fault tolerant
- A collection of projects currently maintained under the Apache Foundation: hadoop.apache.org
  - Storage layer: Hadoop Distributed File System (HDFS)
  - Scheduling layer: Hadoop YARN
  - Execution layer: Hadoop MapReduce
- Plus many more projects built on top of this core
Hadoop Distributed File System (HDFS)
- Built on top of commodity hardware and operating systems: any functioning Linux (or Windows) system can be set up as a node
- Files are split into 64 MB blocks that are distributed and replicated across nodes; typically at least 3 copies of each block are made
- File I/O semantics are simplified (see the sketch below):
  - Write once: no notion of update
  - Read many times, as a stream: no random file I/O
- When a node fails, additional copies of its blocks are created on other nodes
- A special Name Node keeps track of how a file's blocks are stored across the different nodes
- Location designations: Node, Rack, Data Center
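These write-once, read-as-a-stream semantics map directly onto Hadoop's Java FileSystem API. Below is a minimal sketch, not from the slides: the path /user/demo/xyz.txt and the record contents are hypothetical, and a configured Hadoop client is assumed.

```java
// Minimal sketch of HDFS's write-once / read-as-stream semantics
// using the org.apache.hadoop.fs.FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml, hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/xyz.txt"); // hypothetical path

        // Write once: the file cannot be updated in place after it is closed.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("6/02/2011, Electronics, 3, 130\n");
        }

        // Read many times, as a stream.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

Note that the API offers no call for updating bytes in the middle of an existing file, which is exactly the simplification described above.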
HDFS Example 1 (diagram)
- Name Node metadata (File, Block → Node):
  - xyz.txt Block1 → Node 1, Node 2, Node 3
  - xyz.txt Block2 → Node 1, Node 3, Node 4
- Data nodes, connected over the network:
  - Node 1: xyz.txt Block1, Block2
  - Node 2: xyz.txt Block1
  - Node 3: xyz.txt Block1, Block2
  - Node 4: xyz.txt Block2
- Each block is stored on three different nodes
HDFS Example 2 (diagram)
- Same Name Node metadata and block placement as Example 1
- Node failure: Node 2, which holds a copy of xyz.txt Block1, goes down
- Block1 is now down to two live copies (Node 1 and Node 3)
HDFS Example 3 (diagram)
- The Name Node adds a new entry: xyz.txt Block1 → Node 4
- Node 4 now holds xyz.txt Block2 and Block1
- Blocks from the failed node are re-replicated on surviving nodes, restoring three copies of each block
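The re-replication itself is the Name Node's job, not the client's; client code can only observe or request a replication factor. A small sketch using the same FileSystem API, again assuming the hypothetical path from above:

```java
// Sketch: inspecting and requesting a block replication factor via the
// FileSystem API. HDFS itself re-creates missing replicas after a node fails.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/xyz.txt"); // hypothetical path

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Current replication: " + status.getReplication());

        // Request a target replication factor; the Name Node schedules the
        // extra copies (or deletions) asynchronously in the background.
        fs.setReplication(file, (short) 3);
    }
}
```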
Hadoop Execution Layer: MapReduce
- The processing architecture for Hadoop
- Processing functions are sent to the nodes where the data reside
- Map function: mainly concerned with parsing and filtering data
  - Collects instances of value V for each key K
  - Programmed by the developer
- Shuffle: merges the instances of { Ki, Vi } by key
  - Done automatically by MapReduce
- Reduce function: mainly concerned with summarizing data
  - Summarizes the set of values V for each key K
  - Programmed by the developer (a skeleton of both functions follows below)
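As a sketch of where the two developer-written functions fit, the skeleton below uses the org.apache.hadoop.mapreduce API; the class names and the Text/LongWritable type choices are illustrative, not from the slides.

```java
// Skeleton of the two developer-written MapReduce functions; the shuffle
// between them (grouping values by key) is performed by the framework.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Skeleton {
    // Map: parse/filter one input record, emit (K, V) pairs.
    public static class MyMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // ... parse 'line', form a key, then emit:
            ctx.write(new Text("someKey"), new LongWritable(1));
        }
    }

    // Reduce: summarize all values V that the shuffle grouped under key K.
    public static class MyReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable v : values) total += v.get();
            ctx.write(key, new LongWritable(total));
        }
    }
}
```

Everything between map and reduce (partitioning, sorting, and grouping the { Ki, Vi } pairs by key) is the shuffle that the framework performs automatically.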
Hadoop Scheduling Layer
- The Job Tracker writes out a plan for completing a job and then tracks its progress
- A job is broken up into independent tasks
- Each task is routed to a CPU close to its data: same node, then same rack, then a different rack
- Each node runs a Task Tracker that carries out the tasks needed to complete the job
- When a node fails, the Job Tracker automatically restarts its tasks on another node
- The scheduler may also distribute the same task to multiple nodes and keep the results from the node that finishes first (see the driver sketch below)
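The "same task to multiple nodes" behavior is called speculative execution, and a job driver can toggle it. A sketch, reusing the hypothetical Skeleton classes from above:

```java
// Sketch of a job driver. setSpeculativeExecution(true) lets the scheduler
// launch duplicate attempts of a slow task and keep the first to finish.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "demo-job");
        job.setJarByClass(Driver.class);
        job.setMapperClass(Skeleton.MyMapper.class);
        job.setReducerClass(Skeleton.MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setSpeculativeExecution(true); // duplicate slow tasks, keep fastest

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```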
MapReduce Example
- Goal: compare 2012 total sales with 2011 total sales, broken down by product category
- Data set: sales transaction records: Date, Product, ProductCategory, CustomerName, …, Quantity, Price
- Key: [ Year, ProductCategory ]
- Value: [ Price * Quantity ]
- Map function: for every record, form the key, then multiply Price * Quantity and assign the result to the value
- Shuffle: merge/sort all of the pairs on their common keys
- Reduce function: for each key K, sum up all of the associated values V (sketched below)
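A sketch of the Map and Reduce functions for this example. The field positions and the $-prefixed price format are assumptions based on the sample records in the diagram that follows; a real job would parse whatever layout the transaction file actually uses.

```java
// Sketch: total sales per (Year, ProductCategory).
// Key = "Year,ProductCategory", Value = Price * Quantity.
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SalesByCategory {
    public static class SalesMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Assumed layout: Date, Product, ProductCategory, ..., Quantity, Price
            String[] f = line.toString().split(",");
            String date = f[0].trim();
            String year = date.substring(date.lastIndexOf('/') + 1); // "6/02/2011" -> "2011"
            String category = f[2].trim();
            double quantity = Double.parseDouble(f[f.length - 2].trim());
            double price = Double.parseDouble(f[f.length - 1].trim().replace("$", ""));
            ctx.write(new Text(year + "," + category),
                      new DoubleWritable(quantity * price));
        }
    }

    public static class SalesReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context ctx)
                throws IOException, InterruptedException {
            double total = 0;
            for (DoubleWritable v : values) total += v.get();
            ctx.write(key, new DoubleWritable(total)); // e.g. "2011,Electronics" -> 515.0
        }
    }
}
```

Since each (Year, ProductCategory) total appears exactly once in the reducer output, comparing 2012 against 2011 is a simple final pass over that output.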
MapReduce Example (diagram)
- Name Node metadata as in the HDFS examples; the blocks of xyz.txt hold sales records:
  - Block1 (on Nodes 1, 2, 3): 6/02/2011, Electronics, …, 3, $130; 7/13/2011, Electronics, …, 1, $125; 7/14/2011, Kitchen, …, 1, $65
  - Block2 (on Nodes 1, 3): 3/15/2012, Outdoors, …, 4, $12; 8/16/2012, Outdoors, …, 1, $41
- Job Tracker plan for job J101 (Task → Node, Block): Ta → Node 1, Block1; Ta → Node 2, Block1; Tb → Node 3, Block2; …
- Task Trackers: Node 1 runs Ta, Tx, Ty; Node 2 runs Ta, Tz; Node 3 runs Tb, Tz
- Note that task Ta is sent to two nodes holding Block1, matching the speculative scheduling described above
Common MapReduce Domains
- Indexing documents or web pages
- Counting word frequencies
- Processing log files
- ETL processing
- Image archives
Common characteristics:
- Files/blocks can be processed independently and the results easily merged
- Scales with the number of nodes, the size of the data, and the number of CPUs
Additional Apache/Hadoop Projects
- HBase: large-table NoSQL database
- Hive: data warehousing infrastructure / SQL support
- Pig: data processing scripting over MapReduce
- Oozie: workflow scheduling
- Flume: distributed log file processing
- Mahout: machine learning libraries