CS525: Big Data Analytics
MapReduce Computing Paradigm & Apache Hadoop Open Source
Fall 2013
Elke A. Rundensteiner
Large-Scale Data Analytics
Many enterprises turn to the Hadoop computing paradigm for big data applications.
Database vs. Hadoop
Hadoop:
– Scalability (petabytes of data, thousands of machines)
– Flexibility in accepting all data formats (no schema)
– Efficient fault tolerance support
– Commodity inexpensive hardware
Database:
– Performance (indexing, tuning, data organization techniques)
– Advanced features: full query support, clever optimizers, views and security, data consistency, …
– Focus on read + write, concurrency, correctness, convenience, high-level access
What is Hadoop
Hadoop is a simple software framework for distributed processing of large datasets across huge clusters of commodity-hardware computers:
– Large datasets: terabytes or petabytes of data
– Large clusters: hundreds or thousands of nodes
Open-source implementation of Google's MapReduce
Simple programming model: MapReduce
Simple data model: flexible for any data
Hadoop Framework
Two main layers:
– Distributed file system (HDFS)
– Execution engine (MapReduce)
Hadoop is designed as a master-slave shared-nothing architecture
Hadoop Master/Slave Architecture
Hadoop is designed as a master-slave shared-nothing architecture
– Master node (single node)
– Many slave nodes
Key Ideas of Hadoop
Automatic parallelization & distribution
– Hidden from the end user
Fault tolerance and automatic recovery
– Failed nodes/tasks recover automatically
Simple programming abstraction
– Users provide two functions: “map” and “reduce”
Who Uses Hadoop?
Google: invented the MapReduce computing paradigm
Yahoo: developed Hadoop, the open-source implementation of MapReduce
Integrators: IBM, Microsoft, Oracle, Greenplum
Adopters: Facebook, Amazon, AOL, Netflix, LinkedIn
Many others …
Hadoop Architecture
– Master node (single node)
– Many slave nodes
– Distributed file system (HDFS)
– Execution engine (MapReduce)
Hadoop Distributed File System (HDFS)
Centralized namenode
- Maintains metadata info about files
Many datanodes (1000s)
- Store actual data
- Files are divided into blocks
- Each block is replicated N times (default = 3)
Figure: File F divided into blocks (64 MB)
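The block and replication arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration: the function names, the datanode names, and the round-robin placement are assumptions made for the example, and the real HDFS placement policy is more involved (it also considers rack topology).

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the block size shown on the slide
REPLICATION = 3                 # default replication factor from the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    n_blocks = math.ceil(file_size / block_size)
    return [
        (i * block_size, min(block_size, file_size - i * block_size))
        for i in range(n_blocks)
    ]


def place_replicas(n_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct datanodes, round-robin.

    Assumption: round-robin is used only for illustration; it is not
    the placement policy HDFS actually implements.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    placement = {}
    for block_id in range(n_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)]
            for r in range(replication)
        ]
    return placement


# Hypothetical 200 MB file stored on four hypothetical datanodes.
blocks = split_into_blocks(200 * 1024 * 1024)
layout = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

With these inputs the file occupies three full 64 MB blocks plus one 8 MB block, and every block ends up on three different datanodes.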
HDFS File System Properties
Large space: an HDFS instance may consist of thousands of server machines for storage
Replication: each data block is replicated
Failure: failure is the norm rather than the exception
Fault tolerance: automated detection of faults and recovery
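To make "automated detection of faults and recovery" concrete, here is a minimal sketch of how a namenode-like component might find under-replicated blocks after a datanode fails and pick new targets for the lost copies. The function names, the data layout, and the alphabetical choice of target nodes are assumptions for illustration, not HDFS's actual algorithm.

```python
def under_replicated(placement, live_nodes, replication=3):
    """Return {block_id: missing_count} for blocks with too few live replicas."""
    missing = {}
    for block_id, nodes in placement.items():
        alive = [n for n in nodes if n in live_nodes]
        if len(alive) < replication:
            missing[block_id] = replication - len(alive)
    return missing


def re_replicate(placement, live_nodes, replication=3):
    """Return a new placement in which lost replicas are copied to live nodes.

    If too few live nodes remain, a block simply stays under-replicated.
    """
    repaired = {}
    for block_id, nodes in placement.items():
        alive = [n for n in nodes if n in live_nodes]
        if not alive:
            raise RuntimeError(f"block {block_id} lost: no live replica")
        # Candidate targets: live nodes that do not already hold this block.
        candidates = sorted(n for n in live_nodes if n not in alive)
        needed = max(0, replication - len(alive))
        repaired[block_id] = alive + candidates[:needed]
    return repaired


# Hypothetical cluster state: dn2 has failed.
placement = {0: ["dn1", "dn2", "dn3"], 1: ["dn2", "dn3", "dn4"]}
live = {"dn1", "dn3", "dn4"}
missing = under_replicated(placement, live)   # {0: 1, 1: 1}
repaired = re_replicate(placement, live)
```

After the repair step both blocks again have three live replicas, which is the sense in which replication turns frequent node failures into a routine event.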
Map-Reduce Execution Engine (Example: Color Count)
Input blocks on HDFS
Map: produces (k, v), e.g. (color, 1)
Shuffle & sorting based on k
Reduce: consumes (k, [v]), e.g. (color, [1,1,1,1,1,1..]); produces (k’, v’), e.g. (color, 100)
Users only provide the “Map” and “Reduce” functions
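The color-count flow can be simulated on a single machine. The sketch below is not Hadoop code: the function names and the input blocks are made up for illustration, and the `shuffle` helper stands in for the work the framework does between the two user-provided functions.

```python
from collections import defaultdict


def map_color(block):
    """Map: emit (color, 1) for every record in one input block."""
    for color in block:
        yield (color, 1)


def shuffle(pairs):
    """Shuffle & sort: group values by key and return keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_color(color, counts):
    """Reduce: consume (color, [1, 1, ...]) and produce (color, total)."""
    return (color, sum(counts))


# Hypothetical input: three blocks, each a list of color records.
blocks = [
    ["red", "blue", "green"],
    ["blue", "blue", "red"],
    ["green", "red", "red"],
]

mapped = [pair for block in blocks for pair in map_color(block)]
result = [reduce_color(color, counts) for color, counts in shuffle(mapped)]
# result == [("blue", 3), ("green", 2), ("red", 4)]
```

Only `map_color` and `reduce_color` correspond to what a user would write; everything else is the framework's job.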
MapReduce Engine
Job Tracker is the master node (runs with the namenode)
– Receives the user’s job
– Decides on how many tasks will run (number of mappers)
– Decides on where to run each mapper (locality)
Figure: Node 1, Node 2, Node 3. This file has 5 blocks, so run 5 map tasks; run the task reading block “1” on Node 1 or 3.
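The two decisions listed above (one map task per block, placed where the block lives) can be sketched as a small greedy scheduler. This is a simplification under stated assumptions: the function name, the block layout, and the slot counts are hypothetical, and the real Job Tracker hands out tasks in response to Task Tracker heartbeats rather than in one pass.

```python
def schedule_map_tasks(block_locations, free_slots):
    """Assign one map task per block, preferring a node that stores the block.

    block_locations: {block_id: [nodes holding a replica]}
    free_slots:      {node: number of map slots still available}
    Returns {block_id: (node, is_local)}.
    """
    slots = dict(free_slots)          # do not mutate the caller's dict
    assignment = {}
    for block_id in sorted(block_locations):
        local = [n for n in block_locations[block_id] if slots.get(n, 0) > 0]
        if local:
            # Data-local: pick the replica holder with the most free slots.
            node = max(local, key=lambda n: slots[n])
            is_local = True
        else:
            # Fall back to any node with a free slot; the block is read remotely.
            remote = [n for n, s in slots.items() if s > 0]
            if not remote:
                raise RuntimeError("no free map slots left")
            node = max(remote, key=lambda n: slots[n])
            is_local = False
        slots[node] -= 1
        assignment[block_id] = (node, is_local)
    return assignment


# Hypothetical layout: 5 blocks on 3 nodes, block 1 stored on node1 and node3.
block_locations = {
    1: ["node1", "node3"],
    2: ["node1", "node2"],
    3: ["node2", "node3"],
    4: ["node1", "node3"],
    5: ["node2", "node3"],
}
assignment = schedule_map_tasks(
    block_locations, {"node1": 2, "node2": 2, "node3": 2}
)
```

Five blocks yield five map tasks, and with this layout every task runs on a node that already holds its block, so no input data crosses the network.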
MapReduce Engine
Task Tracker is the slave node (runs on each datanode)
– Receives the task from Job Tracker
– Runs task to completion (either map or reduce task)
– Communicates with Job Tracker to report its progress
Figure: 1 map-reduce job consists of 4 map tasks and 3 reduce tasks
About Key-Value Pairs
Developer provides Mapper and Reducer functions
Developer decides what is key and what is value
Developer must follow the key-value pair interface
Mappers:
– Consume <key, value> pairs
– Produce <key, value> pairs
Shuffling and Sorting:
– Groups all similar keys from all mappers, sorts them, and passes them to a certain reducer in the form of <key, <list of values>>
Reducers:
– Consume <key, <list of values>>
– Produce <key, value>
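How a key ends up at "a certain reducer" is usually decided by hashing the key modulo the number of reducers. The sketch below illustrates that idea; the function names and sample data are hypothetical, and CRC32 is used here only as a stand-in hash that stays stable across Python runs.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick the reducer for a key; equal keys always map to the same reducer.

    Assumption: CRC32 stands in for the key hash so results are stable
    across Python runs (the built-in hash() of str is randomized).
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_to_reducers(mapper_outputs, num_reducers):
    """Group <key, value> pairs from all mappers into per-reducer inputs.

    Returns one sorted list per reducer of (key, [values]) entries.
    """
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for key, value in pairs:
            buckets[partition(key, num_reducers)][key].append(value)
    return [sorted(bucket.items()) for bucket in buckets]


# Hypothetical output of two mappers.
mapper_outputs = [
    [("apple", 1), ("pear", 1), ("apple", 1)],
    [("pear", 1), ("fig", 1), ("apple", 1)],
]
reducer_inputs = shuffle_to_reducers(mapper_outputs, num_reducers=3)
```

Because the reducer index depends only on the key, all values for one key arrive at the same reducer no matter which mapper produced them.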
MapReduce Phases
Another Example: Word Count
Job: count occurrences of each word in a data set
Figure: Map Tasks, Reduce Tasks
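Word count is often written in the style used by Hadoop Streaming, where the mapper and reducer exchange tab-separated "key, value" text lines and the reducer receives its input already sorted by key. The sketch below follows that convention on one machine; the function names and sample text are made up, and `sorted()` stands in for the framework's shuffle and sort.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word in the input lines."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; input lines must already be sorted by key."""
    parsed = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical input; sorted() plays the role of shuffle & sort.
text = ["the quick brown fox", "the lazy dog", "The fox"]
counts = list(reducer(sorted(mapper(text))))
# counts == ["brown\t1", "dog\t1", "fox\t2", "lazy\t1", "quick\t1", "the\t3"]
```

The reducer never holds more than one word's counts at a time, which is why sorted input matters: it lets each key be processed as one contiguous run.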
Summary: Hadoop vs. Typical DB
Computing Model
- Distributed DBs: notion of transactions; transaction is the unit of work; ACID properties, concurrency control
- Hadoop: notion of jobs; job is the unit of work; no concurrency control
Data Model
- Distributed DBs: structured data with known schema; read/write mode
- Hadoop: any data format; read-only mode
Cost Model
- Distributed DBs: expensive servers
- Hadoop: cheap commodity machines
Fault Tolerance
- Distributed DBs: failures are rare; recovery mechanisms
- Hadoop: failures are common over thousands of machines; simple fault tolerance
Key Characteristics
- Distributed DBs: efficiency, power, optimizations
- Hadoop: scalability, flexibility, fault tolerance
Cloud Computing
A computing model where any computing infrastructure can run on the cloud
Hardware & software are provided as remote services
Elastic: grows and shrinks based on the user’s demand
Example: Amazon EC2