1
Lecture 4. HDFS, MapReduce Implementation, and Spark
COSC6376 Cloud Computing, Lecture 4: HDFS, MapReduce Implementation, and Spark
Instructor: Weidong Shi (Larry), PhD
Computer Science Department, University of Houston
2
Outline
GFS/HDFS
MapReduce Implementation
Spark
3
Distributed File System
4
Goals of HDFS
Very large distributed file system: 10K nodes, 100 million files, 10 PB
Assumes commodity hardware: files are replicated to handle hardware failure; detect failures and recover from them
Optimized for batch processing: data locations are exposed so that computations can move to where the data resides; provides very high aggregate bandwidth
5
Assumptions
Inexpensive components that often fail
Large files
Large streaming reads and small random reads
Large sequential writes
Multiple users append to the same file
High bandwidth is more important than low latency
8
Hadoop Cluster
9
Failure Trends in a Large Disk Drive Population
The data are broken down by the age a drive was when it failed.
10
Failure Trends in a Large Disk Drive Population
The data are broken down by the age a drive was when it failed.
13
Architecture
Chunks: file chunks and the locations of chunks (replicas)
Master server: a single master; keeps metadata; accepts requests on metadata; handles most management activities
Chunk servers: multiple servers; keep chunks of data; accept requests on chunk data
15
User Interface
Commands for HDFS users:
hadoop dfs -mkdir /foodir
hadoop dfs -cat /foodir/myfile.txt
hadoop dfs -rm /foodir/myfile.txt
Commands for the HDFS administrator:
hadoop dfsadmin -report
hadoop dfsadmin -decommission datanodename
Web interface
16
Design decisions
Single master: simplifies the design; single point of failure; limited number of files, since metadata is kept in memory
Large chunk size, e.g., 64 MB
Advantages: reduces client-master traffic; reduces network overhead (fewer network interactions); smaller chunk index
Disadvantage: does not favor small files
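As a rough back-of-the-envelope check (numbers assumed, not from the slide): with 64 MB chunks, 1 PB of data is about 16 million chunks, and at well under 100 bytes of metadata per chunk the master needs only on the order of 1-2 GB of memory for chunk metadata, which is why keeping all of it in memory is feasible.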
17
Master: metadata
Metadata is stored in memory
Namespaces: directories and their physical locations
Files, their chunks, and the chunk locations
Chunk locations are not stored persistently by the master; they are reported by the chunk servers
Operation log
18
Master Operations
All namespace operations: name lookup; create/remove directories and files, etc.
Manage chunk replicas: placement decisions; creating new chunks and replicas; balancing load across all chunkservers; garbage collection
19
Master: chunk replica placement
Goals: maximize reliability, availability, and bandwidth utilization
Physical location matters: cost is lowest within the same rack; “distance” is the number of network switches between nodes
In practice (Hadoop), with 3 replicas: two replicas are placed in the same rack and the third in another rack
Choice of chunkservers: prefer low average disk utilization and a limited number of recent writes, to distribute write traffic
20
Master: chunk replica placement
Re-replication: replicas can be lost for many reasons; re-replication is prioritized by low replica count, live files, and actively used chunks; new replicas follow the same placement principles
Rebalancing: replicas are redistributed periodically for better disk utilization and load balancing
21
Master: garbage collection
Lazy mechanism: deletion is marked immediately; resources are reclaimed later
Regular namespace scans: for deleted files, metadata is removed after three days (full deletion); orphaned chunks are reported to the chunkservers so they can delete them
Stale replicas are detected using chunk version numbers
22
Read/Write
File read: the HDFS client opens the file via DistributedFileSystem, which gets the block locations from the NameNode; the client then reads through an FSDataInputStream from the closest DataNode (and the 2nd closest if needed) and finally closes the stream.
File write: the client creates the file via DistributedFileSystem, which creates it at the NameNode; the client gets a list of 3 DataNodes and writes packets through an FSDataOutputStream to that pipeline, receiving ack packets; it then closes the stream and the NameNode marks the write complete.
If a data node crashes, the crashed node is removed, the current block receives a newer ID so that the partial data from the crashed node can be deleted later, and the NameNode allocates another node.
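The same read and write paths can be driven from the Hadoop FileSystem API. The following is a minimal sketch (not course code); the path is the example path from the user-interface slide, and the cluster settings are assumed to come from the default configuration files.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // DistributedFileSystem when fs.defaultFS is hdfs://
    Path p = new Path("/foodir/myfile.txt");       // example path from the slide above

    // Write: create() contacts the NameNode, then packets stream to a DataNode pipeline.
    try (FSDataOutputStream out = fs.create(p, true)) {
      out.writeUTF("hello hdfs");
    }

    // Read: open() fetches block locations from the NameNode,
    // then the stream reads from the closest DataNode.
    try (FSDataInputStream in = fs.open(p)) {
      System.out.println(in.readUTF());
    }
  }
}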
23
Consistency
It is expensive to maintain strict consistency
GFS uses a relaxed consistency model
Better support for appending and checkpointing
24
Fault Tolerance
High availability: chunk replication; master replication (inactive backup)
Fast recovery
Data integrity: checksumming with CRC32; each chunk is split into 64 KB units and a checksum is kept per unit; the checksum is updated incrementally when a unit is appended, to improve performance
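A minimal sketch of the checksumming idea described above, using only the standard java.util.zip.CRC32 class (illustrative, not GFS or HDFS source): the chunk is split into 64 KB units and one CRC32 is kept per unit, so an append only requires updating the checksum of the last, partial unit.

import java.util.zip.CRC32;

public class BlockChecksums {
  static final int UNIT = 64 * 1024;               // 64 KB checksum unit

  // One CRC32 value per 64 KB unit of the chunk; verified again on every read.
  static long[] checksum(byte[] chunk) {
    int units = (chunk.length + UNIT - 1) / UNIT;
    long[] sums = new long[units];
    for (int i = 0; i < units; i++) {
      CRC32 crc = new CRC32();
      int off = i * UNIT;
      int len = Math.min(UNIT, chunk.length - off);
      crc.update(chunk, off, len);
      sums[i] = crc.getValue();
    }
    return sums;
  }
}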
25
Cost
26
Discussion
Advantages: works well for large-data processing using cheap commodity servers
Tradeoffs: single-master design; optimized for workloads that mostly read and mostly append
Latest upgrades (GFS II): distributed masters; introduces the “cell”, a number of racks in the same data center; improved performance for random reads/writes
27
MapReduce Implementation
28
Execution: How is this distributed?
Partition input key/value pairs into chunks and run map() tasks in parallel
After all map()s are complete, consolidate all emitted values for each unique emitted key
Then partition the space of output map keys and run reduce() in parallel
If map() or reduce() fails, re-execute!
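To make the flow concrete, here is the standard word-count mapper and reducer written against the Hadoop mapreduce API (a generic textbook example, not code from this lecture): map() emits (word, 1), the framework groups the values by key, and reduce() sums them.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // map(): one call per input line; emits (word, 1) for every token.
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        ctx.write(word, ONE);
      }
    }
  }

  // reduce(): receives all values emitted for one key and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }
}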
29
MapReduce - Dataflow
30
Map Implementation
(Figure: a map process reads a chunk/block and writes output partitioned into key ranges k1..ki, ki+1..kj, ..., producing R parts in the map task's local storage.)
Map processes are allocated as close to the chunks as possible
One node can run a number of map processes, depending on the configuration
31
Reducer Implementation
(Figure: the outputs of mapper 1 through mapper n, each partitioned into the same key ranges k1..ki, ki+1..kj, ..., are pulled by the R reducers, producing R final output files stored in the user-designated directory.)
32
Execution
33
Execution: Initialization
Split the input file into 64 MB sections (GFS), read in parallel by multiple machines
Fork the program off onto multiple machines
One machine is the master
The master assigns idle machines to either map or reduce tasks
The master coordinates data communication between the map and reduce machines
34
Overview of Hadoop/MapReduce
35
Partition Function
Inputs to map tasks are created by contiguous splits of the input file
For reduce, we need to ensure that records with the same intermediate key end up at the same worker
The system uses a default partition function, e.g., hash(key) mod R
36
Partition Key-Value Pairs for Reduce
Hadoop uses a hash code as its default method to partition key-value pairs. The default hash code of a String object in Java (and hence in Hadoop) is hash = w1·31^(n-1) + w2·31^(n-2) + … + wn, where wn represents the nth character of the string. A partition function typically takes the hash code of the key modulo the number of reducers R to determine which reducer the key-value pair is sent to.
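Hadoop's default HashPartitioner implements exactly this rule; a hand-written equivalent (for illustration) would be:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Same rule as the default HashPartitioner: take the key's hash code,
// mask it to be non-negative, and reduce it modulo the number of reducers R.
public class HashModPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}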
37
Dataflow
Input and final output are stored on a distributed file system
The scheduler tries to schedule map tasks “close” to the physical storage location of the input data
Intermediate results are stored on the local FS of the map and reduce workers
Output is often the input to another MapReduce task
38
How Many Map and Reduce Tasks?
M map tasks, R reduce tasks
Rule of thumb: make M and R much larger than the number of nodes in the cluster
One DFS chunk per map task is common
This improves dynamic load balancing and speeds up recovery from worker failure
Usually R is smaller than M, because the output is spread across R files
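In Hadoop, M is not set directly: it follows from the number of input splits (one per HDFS block by default), while R is chosen by the developer. A small sketch; the job name and reducer count are purely illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobSizing {
  public static void main(String[] args) throws Exception {
    // M (map tasks) is derived from the input splits, one per HDFS block by default.
    Job job = Job.getInstance(new Configuration(), "wordcount");
    // R (reduce tasks) is set explicitly and is usually much smaller than M.
    job.setNumReduceTasks(32);
  }
}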
39
Misc. Remarks
40
Parallelism
map() functions run in parallel, creating different intermediate values from different input data sets
reduce() functions also run in parallel, each working on a different output key
All values are processed independently
Bottleneck: the reduce phase can’t start until the map phase is completely finished
41
Locality
The master program divides up tasks based on the location of the data: it tries to place map() tasks on the same machine as the physical file data, or at least in the same rack
map() task inputs are divided into 64 MB blocks in HDFS (by default): the same size as Google File System chunks
42
Job Processing
(Figure: a JobTracker coordinating TaskTrackers 0 through 5.)
The client submits a “wordcount” job, indicating the code and input files
The JobTracker breaks the input file into k chunks (in this case 6) and assigns work to the TaskTrackers
After map(), the TaskTrackers exchange map output to build the reduce() keyspace
The JobTracker breaks the reduce() keyspace into m chunks (in this case 6) and assigns work
reduce() output may go to HDFS
43
Fault Tolerance
Worker failure: if a map or reduce task fails, it is reassigned to another worker; if a node fails, its jobs are redone on other nodes (chunk replicas make this possible)
Master failure: log/checkpoint the master process and the GFS master server
Backup tasks for “stragglers”: the whole job is often delayed by a few workers for various reasons (network, disk I/O, node failure, …); as the job nears completion, the master starts multiple backup workers for each in-progress task; without backup tasks the sort example takes 44% longer than with them
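Hadoop's counterpart to Google's backup tasks is speculative execution, controlled through job configuration; a hedged sketch (property names as in Hadoop 2.x/3.x, where speculation is enabled by default):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class BackupTasks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Launch duplicate attempts for straggling tasks; the first attempt to finish wins.
    conf.setBoolean("mapreduce.map.speculative", true);
    conf.setBoolean("mapreduce.reduce.speculative", true);
    Job job = Job.getInstance(conf, "job-with-backup-tasks");
  }
}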
44
Experiment: Effect of Backup Tasks, Reliability (example: sorting)
45
MapReduce Chaining
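The slides do not show code for chaining, but the idea follows from the dataflow slide: the output directory of one job becomes the input directory of the next. A minimal sketch with hypothetical paths (mapper/reducer classes omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job first = Job.getInstance(conf, "stage-1");
    FileInputFormat.addInputPath(first, new Path("/input"));            // hypothetical path
    FileOutputFormat.setOutputPath(first, new Path("/tmp/stage1-out"));
    if (!first.waitForCompletion(true)) System.exit(1);

    // The second job reads the first job's output directory as its input.
    Job second = Job.getInstance(conf, "stage-2");
    FileInputFormat.addInputPath(second, new Path("/tmp/stage1-out"));
    FileOutputFormat.setOutputPath(second, new Path("/output"));        // hypothetical path
    System.exit(second.waitForCompletion(true) ? 0 : 1);
  }
}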
46
MapReduce Summary
47
MapReduce Conclusions
MapReduce has proven to be a useful abstraction Greatly simplifies large-scale computations at Google Functional programming paradigm can be applied to large-scale applications Fun to use: focus on problem, let the middleware deal with messy details
48
In the Real World
Applied to various domains at Google: machine learning, clustering, reports, web page processing, indexing, graph computation, …
Mahout library: scalable machine learning and data mining
Research projects
49
MapReduce: The Good
Built-in fault tolerance
Optimized IO path
Scalable
The developer focuses on map/reduce logic, not infrastructure
Simple? API
50
MapReduce: The Bad
Optimized for disk IO: it doesn’t leverage memory well; iterative algorithms go through the disk IO path again and again
Primitive API: developers have to build on a very simple abstraction (key/value in, key/value out); even basic things like joins require extensive code; the result is often many files that need to be combined appropriately
51
Hadoop Acceleration
52
FPGA Cluster
53
FPGA for MapReduce
54
Density and Energy
55
ARM Cluster 6 node, 6 TB Hadoop cluster on ARM
56
MapReduce Using GPUs Graphics Processing Unit (GPU) is a specialized processor that offloads 3D or 2D graphics rendering from the CPU GPUs’ highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms
57
GPGPU Programming Each block can have up to 512 threads that synchronize Millions of blocks can be issued
58
Programming Environment: CUDA
Compute Unified Device Architecture (CUDA)
A parallel computing architecture developed by NVIDIA
The computing engine in GPUs is made accessible to software developers through industry-standard programming languages
59
Mars: A MapReduce Framework on Graphics Processors
System Workflow and Configuration
60
Applications
61
Hardware
62
Mars: Performance
Performance speedup between Phoenix and the optimized Mars with the data size varied.
63
MapReduce Advantages/Disadvantages
Now it’s easy to program for many CPUs
Communication management is effectively gone
I/O scheduling is done for us
Fault tolerance, monitoring of machine failures, suddenly slow machines, etc. are handled
Can be much easier to design and program!
Can cascade several MapReduce tasks
But … it further restricts the solvable problems
It might be hard to express a problem in MapReduce
Data parallelism is key: we need to be able to break up a problem by data chunks
64
In-Memory MapReduce
65
In-Memory Data Grid
66
Speed of In-Memory Data Grid
67
Real-time
68
Motivation
Many important applications must process large streams of live data and provide results in near-real-time: social network trends, website statistics, intrusion detection systems, etc.
These applications require large clusters to handle the workloads and latencies of a few seconds
Tathagata Das. Spark Streaming: Large-scale near-real-time stream processing.
69
Available data source Twitter public status updates
Jagane Sundar. Realtime Sentiment Analysis Application Using Hadoop and HBase
70
Firehose “Firehose” is the name given to the massive, real-time stream of Tweets that flow from Twitter each day. Twitter provides access to this “firehose”, using a streaming technology called XMPP
71
Who are using it
72
Requirements Scalable to large clusters Second-scale latencies
Simple programming model
73
Apache Spark
76
Apache Spark Originally developed in UC Berkeley’s AMP Lab
Fully open sourced – now at Apache Software Foundation Commercial Vendor Developing/Supporting Apache Spark. MapR Technologies.
77
Spark: Easy and Fast Big Data
Easy to develop: rich APIs in Java, Scala, and Python; interactive shell; 2-5× less code
Fast to run: general execution graphs; in-memory storage
Apache Spark. MapR Technologies.
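As a taste of the API, here is the usual word-count example written with Spark's Java API (a sketch assuming Spark 2.x; the master URL and input path are illustrative). The same computation that needed a full MapReduce job fits in a few chained transformations, with the data held in memory between them.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<String> lines = sc.textFile("hdfs:///foodir/myfile.txt");    // illustrative path
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())   // split into words
          .mapToPair(w -> new Tuple2<>(w, 1))                              // (word, 1)
          .reduceByKey(Integer::sum);                                      // aggregate in memory
      counts.collect().forEach(t -> System.out.println(t._1 + "\t" + t._2));
    }
  }
}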