1
Lecture 4. HDFS, MapReduce Implementation, and Spark
COSC6376 Cloud Computing, Lecture 4: HDFS, MapReduce Implementation, and Spark
Instructor: Weidong Shi (Larry), PhD
Computer Science Department, University of Houston
2
Outline
GFS/HDFS
MapReduce Implementation
Spark
3
Distributed File System
4
Goals of HDFS
Very large distributed file system: 10K nodes, 100 million files, 10 PB
Assumes commodity hardware: files are replicated to handle hardware failure; detect failures and recover from them
Optimized for batch processing: data locations are exposed so that computations can move to where the data resides; provides very high aggregate bandwidth
5
Assumptions
Inexpensive components that often fail
Large files
Large streaming reads and small random reads
Large sequential writes
Multiple users append to the same file
High bandwidth is more important than low latency
8
Hadoop Cluster
9
Failure Trends in a Large Disk Drive Population
The data are broken down by the age a drive was when it failed.
10
Failure Trends in a Large Disk Drive Population
The data are broken down by the age a drive was when it failed.
13
Architecture
Chunks: file chunks and the locations of chunks (replicas)
Master server: a single master; keeps metadata; accepts requests on metadata; handles most management activities
Chunk servers: multiple servers; keep chunks of data; accept requests on chunk data
15
User Interface
Commands for HDFS users:
hadoop dfs -mkdir /foodir
hadoop dfs -cat /foodir/myfile.txt
hadoop dfs -rm /foodir/myfile.txt
Commands for the HDFS administrator:
hadoop dfsadmin -report
hadoop dfsadmin -decommission datanodename
Web interface
16
Design decisions
Single master: simplifies the design; single point of failure; limited number of files, since metadata is kept in memory
Large chunk size, e.g., 64 MB
Advantages: reduces client-master traffic; reduces network overhead (fewer network interactions); smaller chunk index
Disadvantage: does not favor small files
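As a rough back-of-the-envelope check (numbers assumed, not from the slide): with 64 MB chunks, 1 PB of data is about 16 million chunks, and at well under 100 bytes of metadata per chunk the master needs only on the order of 1-2 GB of memory for chunk metadata, which is why keeping all of it in memory is feasible.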
17
Master: metadata
Metadata is stored in memory
Namespaces: directories and their physical locations
Files, their chunks, and the chunk locations
Chunk locations are not stored persistently by the master; they are reported by the chunk servers
Operation log
18
Master Operations
All namespace operations: name lookup; create/remove directories and files, etc.
Manage chunk replicas: placement decisions; creating new chunks and replicas; balancing load across all chunkservers; garbage collection
19
Master: chunk replica placement
Goals: maximize reliability, availability, and bandwidth utilization
Physical location matters: cost is lowest within the same rack; “distance” is the number of network switches between nodes
In practice (Hadoop), with 3 replicas: two replicas are placed in the same rack and the third in another rack
Choice of chunkservers: prefer low average disk utilization and a limited number of recent writes, to distribute write traffic
20
Master: chunk replica placement
Re-replication: replicas can be lost for many reasons; re-replication is prioritized by low replica count, live files, and actively used chunks; new replicas follow the same placement principles
Rebalancing: replicas are redistributed periodically for better disk utilization and load balancing
21
Master: garbage collection
Lazy mechanism: deletion is marked immediately; resources are reclaimed later
Regular namespace scans: for deleted files, metadata is removed after three days (full deletion); orphaned chunks are reported to the chunkservers so they can delete them
Stale replicas are detected using chunk version numbers
22
Read/Write
File read: the HDFS client opens the file via DistributedFileSystem, which gets the block locations from the NameNode; the client then reads through an FSDataInputStream from the closest DataNode (and the 2nd closest if needed) and finally closes the stream.
File write: the client creates the file via DistributedFileSystem, which creates it at the NameNode; the client gets a list of 3 DataNodes and writes packets through an FSDataOutputStream to that pipeline, receiving ack packets; it then closes the stream and the NameNode marks the write complete.
If a data node crashes, the crashed node is removed, the current block receives a newer ID so that the partial data from the crashed node can be deleted later, and the NameNode allocates another node.
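The same read and write paths can be driven from the Hadoop FileSystem API. The following is a minimal sketch (not course code); the path is the example path from the user-interface slide, and the cluster settings are assumed to come from the default configuration files.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // DistributedFileSystem when fs.defaultFS is hdfs://
    Path p = new Path("/foodir/myfile.txt");       // example path from the slide above

    // Write: create() contacts the NameNode, then packets stream to a DataNode pipeline.
    try (FSDataOutputStream out = fs.create(p, true)) {
      out.writeUTF("hello hdfs");
    }

    // Read: open() fetches block locations from the NameNode,
    // then the stream reads from the closest DataNode.
    try (FSDataInputStream in = fs.open(p)) {
      System.out.println(in.readUTF());
    }
  }
}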
23
Consistency
It is expensive to maintain strict consistency
GFS uses a relaxed consistency model
Better support for appending and checkpointing
24
Fault Tolerance
High availability: chunk replication; master replication (inactive backup)
Fast recovery
Data integrity: checksumming with CRC32; each chunk is split into 64 KB units and a checksum is kept per unit; the checksum is updated incrementally when a unit is appended, to improve performance
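A minimal sketch of the checksumming idea described above, using only the standard java.util.zip.CRC32 class (illustrative, not GFS or HDFS source): the chunk is split into 64 KB units and one CRC32 is kept per unit, so an append only requires updating the checksum of the last, partial unit.

import java.util.zip.CRC32;

public class BlockChecksums {
  static final int UNIT = 64 * 1024;               // 64 KB checksum unit

  // One CRC32 value per 64 KB unit of the chunk; verified again on every read.
  static long[] checksum(byte[] chunk) {
    int units = (chunk.length + UNIT - 1) / UNIT;
    long[] sums = new long[units];
    for (int i = 0; i < units; i++) {
      CRC32 crc = new CRC32();
      int off = i * UNIT;
      int len = Math.min(UNIT, chunk.length - off);
      crc.update(chunk, off, len);
      sums[i] = crc.getValue();
    }
    return sums;
  }
}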
25
Cost
26
Discussion
Advantages: works well for large-data processing using cheap commodity servers
Tradeoffs: single-master design; optimized for workloads that mostly read and mostly append
Latest upgrades (GFS II): distributed masters; introduces the “cell”, a number of racks in the same data center; improved performance for random reads/writes
27
MapReduce Implementation
28
Execution: How is this distributed?
Partition input key/value pairs into chunks and run map() tasks in parallel
After all map()s are complete, consolidate all emitted values for each unique emitted key
Then partition the space of output map keys and run reduce() in parallel
If map() or reduce() fails, re-execute!
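To make the flow concrete, here is the standard word-count mapper and reducer written against the Hadoop mapreduce API (a generic textbook example, not code from this lecture): map() emits (word, 1), the framework groups the values by key, and reduce() sums them.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // map(): one call per input line; emits (word, 1) for every token.
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        ctx.write(word, ONE);
      }
    }
  }

  // reduce(): receives all values emitted for one key and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }
}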
29
MapReduce - Dataflow
30
Map Implementation
(Figure: a map process reads a chunk/block and writes output partitioned into key ranges k1..ki, ki+1..kj, ..., producing R parts in the map task's local storage.)
Map processes are allocated as close to the chunks as possible
One node can run a number of map processes, depending on the configuration
31
Reducer Implementation
(Figure: the outputs of mapper 1 through mapper n, each partitioned into the same key ranges k1..ki, ki+1..kj, ..., are pulled by the R reducers, producing R final output files stored in the user-designated directory.)
32
Execution
33
Execution: Initialization
Split the input file into 64 MB sections (GFS), read in parallel by multiple machines
Fork the program off onto multiple machines
One machine is the master
The master assigns idle machines to either map or reduce tasks
The master coordinates data communication between the map and reduce machines
34
Overview of Hadoop/MapReduce
35
Partition Function
Inputs to map tasks are created by contiguous splits of the input file
For reduce, we need to ensure that records with the same intermediate key end up at the same worker
The system uses a default partition function, e.g., hash(key) mod R
36
Partition Key-Value Pairs for Reduce
Hadoop uses a hash code as its default method to partition key-value pairs. The default hash code of a String object in Java (and hence in Hadoop) is hash = w1·31^(n-1) + w2·31^(n-2) + … + wn, where wn represents the nth character of the string. A partition function typically takes the hash code of the key modulo the number of reducers R to determine which reducer the key-value pair is sent to.
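Hadoop's default HashPartitioner implements exactly this rule; a hand-written equivalent (for illustration) would be:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Same rule as the default HashPartitioner: take the key's hash code,
// mask it to be non-negative, and reduce it modulo the number of reducers R.
public class HashModPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}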
37
Dataflow
Input and final output are stored on a distributed file system
The scheduler tries to schedule map tasks “close” to the physical storage location of the input data
Intermediate results are stored on the local FS of the map and reduce workers
Output is often the input to another MapReduce task
38
How Many Map and Reduce Tasks?
M map tasks, R reduce tasks
Rule of thumb: make M and R much larger than the number of nodes in the cluster
One DFS chunk per map task is common
This improves dynamic load balancing and speeds up recovery from worker failure
Usually R is smaller than M, because the output is spread across R files
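In Hadoop, M is not set directly: it follows from the number of input splits (one per HDFS block by default), while R is chosen by the developer. A small sketch; the job name and reducer count are purely illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobSizing {
  public static void main(String[] args) throws Exception {
    // M (map tasks) is derived from the input splits, one per HDFS block by default.
    Job job = Job.getInstance(new Configuration(), "wordcount");
    // R (reduce tasks) is set explicitly and is usually much smaller than M.
    job.setNumReduceTasks(32);
  }
}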
39
Misc. Remarks
40
Parallelism
map() functions run in parallel, creating different intermediate values from different input data sets
reduce() functions also run in parallel, each working on a different output key
All values are processed independently
Bottleneck: the reduce phase can’t start until the map phase is completely finished
41
Locality
The master program divides up tasks based on the location of the data: it tries to place map() tasks on the same machine as the physical file data, or at least in the same rack
map() task inputs are divided into 64 MB blocks in HDFS (by default): the same size as Google File System chunks
42
Job Processing
(Figure: a JobTracker coordinating TaskTrackers 0 through 5.)
The client submits a “wordcount” job, indicating the code and input files
The JobTracker breaks the input file into k chunks (in this case 6) and assigns work to the TaskTrackers
After map(), the TaskTrackers exchange map output to build the reduce() keyspace
The JobTracker breaks the reduce() keyspace into m chunks (in this case 6) and assigns work
reduce() output may go to HDFS
43
Fault Tolerance
Worker failure: if a map or reduce task fails, it is reassigned to another worker; if a node fails, its jobs are redone on other nodes (chunk replicas make this possible)
Master failure: log/checkpoint the master process and the GFS master server
Backup tasks for “stragglers”: the whole job is often delayed by a few workers for various reasons (network, disk I/O, node failure, …); as the job nears completion, the master starts multiple backup workers for each in-progress task; without backup tasks the sort example takes 44% longer than with them
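Hadoop's counterpart to Google's backup tasks is speculative execution, controlled through job configuration; a hedged sketch (property names as in Hadoop 2.x/3.x, where speculation is enabled by default):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class BackupTasks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Launch duplicate attempts for straggling tasks; the first attempt to finish wins.
    conf.setBoolean("mapreduce.map.speculative", true);
    conf.setBoolean("mapreduce.reduce.speculative", true);
    Job job = Job.getInstance(conf, "job-with-backup-tasks");
  }
}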
44
Experiment: Effect of Backup Tasks, Reliability (example: sorting)
45
MapReduce Chaining
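The slides do not show code for chaining, but the idea follows from the dataflow slide: the output directory of one job becomes the input directory of the next. A minimal sketch with hypothetical paths (mapper/reducer classes omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job first = Job.getInstance(conf, "stage-1");
    FileInputFormat.addInputPath(first, new Path("/input"));            // hypothetical path
    FileOutputFormat.setOutputPath(first, new Path("/tmp/stage1-out"));
    if (!first.waitForCompletion(true)) System.exit(1);

    // The second job reads the first job's output directory as its input.
    Job second = Job.getInstance(conf, "stage-2");
    FileInputFormat.addInputPath(second, new Path("/tmp/stage1-out"));
    FileOutputFormat.setOutputPath(second, new Path("/output"));        // hypothetical path
    System.exit(second.waitForCompletion(true) ? 0 : 1);
  }
}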
46
MapReduce Summary
47
MapReduce Conclusions
MapReduce has proven to be a useful abstraction Greatly simplifies large-scale computations at Google Functional programming paradigm can be applied to large-scale applications Fun to use: focus on problem, let the middleware deal with messy details
48
In the Real World
Applied to various domains at Google: machine learning, clustering, reports, web page processing, indexing, graph computation, …
Mahout library: scalable machine learning and data mining
Research projects
49
MapReduce: The Good
Built-in fault tolerance
Optimized IO path
Scalable
The developer focuses on map/reduce logic, not infrastructure
Simple? API
50
MapReduce: The Bad
Optimized for disk IO: it doesn’t leverage memory well; iterative algorithms go through the disk IO path again and again
Primitive API: developers have to build on a very simple abstraction (key/value in, key/value out); even basic things like joins require extensive code; the result is often many files that need to be combined appropriately
51
Hadoop Acceleration
52
FPGA Cluster
53
FPGA for MapReduce
54
Density and Energy
55
ARM Cluster 6 node, 6 TB Hadoop cluster on ARM
56
MapReduce Using GPUs Graphics Processing Unit (GPU) is a specialized processor that offloads 3D or 2D graphics rendering from the CPU GPUs’ highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms
57
GPGPU Programming Each block can have up to 512 threads that synchronize Millions of blocks can be issued
58
Programming Environment: CUDA
Compute Unified Device Architecture (CUDA)
A parallel computing architecture developed by NVIDIA
The computing engine in GPUs is made accessible to software developers through industry-standard programming languages
59
Mars: A MapReduce Framework on Graphics Processors
System Workflow and Configuration
60
Applications
61
Hardware
62
Mars: Performance
Performance speedup between Phoenix and the optimized Mars with the data size varied.
63
MapReduce Advantages/Disadvantages
Now it’s easy to program for many CPUs
Communication management is effectively gone
I/O scheduling is done for us
Fault tolerance, monitoring of machine failures, suddenly slow machines, etc. are handled
Can be much easier to design and program!
Can cascade several MapReduce tasks
But … it further restricts the solvable problems
It might be hard to express a problem in MapReduce
Data parallelism is key: we need to be able to break up a problem by data chunks
64
In-Memory MapReduce
65
In-Memory Data Grid
66
Speed of In-Memory Data Grid
67
Real-time
68
Motivation
Many important applications must process large streams of live data and provide results in near-real-time: social network trends, website statistics, intrusion detection systems, etc.
These applications require large clusters to handle the workloads and latencies of a few seconds
Tathagata Das. Spark Streaming: Large-scale near-real-time stream processing.
69
Available data source Twitter public status updates
Jagane Sundar. Realtime Sentiment Analysis Application Using Hadoop and HBase
70
Firehose “Firehose” is the name given to the massive, real-time stream of Tweets that flow from Twitter each day. Twitter provides access to this “firehose”, using a streaming technology called XMPP
71
Who are using it
72
Requirements Scalable to large clusters Second-scale latencies
Simple programming model
73
Apache Spark
76
Apache Spark Originally developed in UC Berkeley’s AMP Lab
Fully open sourced – now at Apache Software Foundation Commercial Vendor Developing/Supporting Apache Spark. MapR Technologies.
77
Spark: Easy and Fast Big Data
Easy to develop: rich APIs in Java, Scala, and Python; interactive shell; 2-5× less code
Fast to run: general execution graphs; in-memory storage
Apache Spark. MapR Technologies.
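As a taste of the API, here is the usual word-count example written with Spark's Java API (a sketch assuming Spark 2.x; the master URL and input path are illustrative). The same computation that needed a full MapReduce job fits in a few chained transformations, with the data held in memory between them.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<String> lines = sc.textFile("hdfs:///foodir/myfile.txt");    // illustrative path
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())   // split into words
          .mapToPair(w -> new Tuple2<>(w, 1))                              // (word, 1)
          .reduceByKey(Integer::sum);                                      // aggregate in memory
      counts.collect().forEach(t -> System.out.println(t._1 + "\t" + t._2));
    }
  }
}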