Hadoop Overview
CMSC 491 Hadoop-Based Distributed Computing, Spring 2015
Adam Shook
Agenda
- Why Hadoop?
- HDFS
- MapReduce
Why Hadoop?
The V's of Big Data!
- Volume: need to manage lots of data; social enterprises and the Industrial Internet generate huge amounts of it
- Velocity: data arrives at an extremely fast rate, and responses are often needed in milliseconds
- Variety: today's agile businesses require rapid responses to changing requirements; fixed schemas are too rigid, so 'emerging' schemas are needed, and data may not fit the traditional relational, transactional RDBMS model
- Veracity: lots of uncertainty in the data due to inconsistency, ambiguity, latency, and model approximations
Data Sources
- Social media
- Web logs
- Video
- Networks
- Sensors
- Transactions (banking, etc.)
- Text messaging
- Paper documents
Value in all of this!
- Fraud detection
- Predictive models
- Recommendations
- Analyzing threats
- Market analysis
- Others!
How do we extract value? Monolithic computing!
- Build bigger, faster computers!
- A limited solution: it is expensive and does not scale as data volume increases
Enter Distributed Computing
- Processing is distributed across many computers
- Distribute the workload to powerful compute nodes with some separate storage
New Class of Challenges
- Does not scale gracefully
- Hardware failure is common
- Sorting, combining, and analyzing data spread across thousands of machines is not easy
RDBMS is still alive!
- SQL is still very relevant, and many complex business rules map well to a relational model
- But there is much more that can be done by looking beyond relational at all of your data
- NoSQL can expand what you have, but it has its disadvantages
NoSQL Downsides
- Increased middle-tier complexity
- Constraints on query capability
- No standard query semantics
- Complex to set up and maintain
An Ideal Cluster
- Linear horizontal scalability
- Analytics run in isolation
- Simple API with multiple language support
- Available in spite of hardware failure
Enter Hadoop
- Hits these major requirements
- Two core pieces:
  - Distributed File System (HDFS)
  - Flexible analytic framework (MapReduce)
- Many ecosystem components expand on what core Hadoop offers
Scalability
- Near-linear horizontal scalability
- Clusters are built on commodity hardware
- Component failure is an expectation
Data Access
- Moving data from storage to a processor is expensive
- So store the data and process it on the same machines
- Process data intelligently by keeping the processing local to the data
Disk Performance
- Disk technology has made significant advancements, but a single disk is still too slow for big data volumes
- So take advantage of many disks in parallel
- 1 disk, 3 TB of data, 300 MB/s: 3 TB / 300 MB/s = 10,000 seconds, roughly 2.8 hours to read
- 1,000 disks, same data: ~10 seconds to read
- Distribution of data and co-location of processing make this a reality
Complex Processing Code
- The Hadoop framework abstracts away the complex distributed computing environment
- No synchronization code
- No networking code
- No I/O code
- The MapReduce developer focuses on the analysis
- A job runs the same on one node or on 4,000 nodes!
Fault Tolerance
- Failure is inevitable, and it is planned for
- Component failure is automatically detected and seamlessly handled
- The system continues to operate as expected, with minimal degradation
Hadoop History
- Spun off of Nutch, an open source web search engine
- Inspired by the Google whitepapers on GFS and MapReduce
- Nutch was re-architected, giving birth to Hadoop
- Hadoop is very mainstream today
Hadoop Versions
- Hadoop 2.x recently became stable
- Two Java APIs, referred to as the old and new APIs
- Two task management systems:
  - MapReduce v1: JobTracker and TaskTrackers
  - MapReduce v2: a YARN application; MapReduce runs in containers within the YARN framework
HDFS
Hadoop Distributed File System
- Inspired by the Google File System (GFS)
- High-performance file system for storing data
- Relatively simple centralized management
- Fault tolerance through data replication
- Optimized for MapReduce processing, exposing data locality
- Linearly scalable
Hadoop Distributed File System
- Uses commodity hardware
- Files are write-once, read-many
- Built for large streaming reads rather than random access
- Favors high throughput over low latency
- Holds a modest number of huge files
HDFS Architecture
- Splits large files into blocks
- Distributes and replicates the blocks across nodes
- Two key services: a master NameNode and many DataNodes
- A Backup/Checkpoint NameNode for HA
- The user interacts with it like a UNIX file system
NameNode
- The single master service for HDFS; historically a single point of failure
- Stores file-to-block-to-location mappings in a namespace
- All transactions are logged to disk
- Can recover from checkpoints of the namespace plus the transaction logs
NameNode Memory
- HDFS prefers fewer, larger files as a consequence of this metadata
- Consider 1 GB of data with a 64 MB block size and 3x replication
- Stored as one file: 1 name + (16 blocks * 3 replicas) = 48 block items, 49 items total
- Stored as 1,024 files of 1 MB each: 1,024 names + (1,024 blocks * 3 replicas) = 3,072 block items, 4,096 items total
- Each item occupies around 200 bytes of NameNode memory
Checkpoint Node (Secondary NameNode)
- Performs checkpoints of the namespace and transaction logs
- Not a hot backup!
- HDFS 2.0 introduced NameNode HA: an Active and a Standby NameNode service, coordinated via ZooKeeper
DataNode
- Stores blocks on local disk
- Sends frequent heartbeats to the NameNode
- Sends block reports to the NameNode
- Clients connect to DataNodes for I/O
How HDFS Works - Writes
- The client contacts the NameNode with a request to write some data
- The NameNode responds: okay, write it to these DataNodes
- The client connects to each DataNode and sequentially writes out the blocks, e.g. four blocks across four nodes, one block per node
How HDFS Works - Writes (continued)
- After the file is closed, the DataNodes copy the data blocks among themselves until each block has three replicas, all orchestrated by the NameNode
- In the event of a node failure, data can still be accessed on other nodes, and the NameNode will re-replicate the affected blocks elsewhere
How HDFS Works - Reads
- The client contacts the NameNode with a request to read some data
- The NameNode responds: you can find the blocks on these DataNodes
- The client connects to the DataNodes and sequentially reads the blocks
How HDFS Works - Failure
- If a DataNode fails during a read, the client simply connects to another node serving that block
HDFS Blocks
- Default block size is 64 MB; configurable, with 128 MB and 256 MB being common
- Default replication is three; also configurable, but rarely changed
- Blocks are stored as files on the DataNode's local file system
- A block file on its own cannot be associated with the HDFS file it belongs to
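A small sketch of overriding both settings from client code, assuming the Hadoop 2.x property names dfs.blocksize and dfs.replication (cluster-wide defaults normally live in hdfs-site.xml):

import org.apache.hadoop.conf.Configuration;

public class BlockSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // 128 MB blocks instead of the 64 MB default
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        // Keep the default replication factor of three
        conf.setInt("dfs.replication", 3);
        System.out.println(conf.get("dfs.blocksize"));
    }
}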
Replication Strategy
- The first copy is written to the same node as the client; if the client is not part of the cluster, the first copy goes to a random node
- The second copy is written to a node on a different rack
- The third copy is written to a different node on the same rack as the second copy
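A toy illustration of those three rules (the Node and PlacementSketch types are invented for this sketch, and it assumes the cluster spans at least two racks; the real logic lives in HDFS's default block placement policy):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class PlacementSketch {
    static class Node {
        final String host, rack;
        Node(String host, String rack) { this.host = host; this.rack = rack; }
    }

    static final Random RAND = new Random();

    // Pick a random node, optionally forcing or forbidding a rack
    static Node pick(List<Node> nodes, Node exclude, String mustBeRack, String mustNotBeRack) {
        List<Node> candidates = new ArrayList<>();
        for (Node n : nodes) {
            if (n == exclude) continue;
            if (mustBeRack != null && !n.rack.equals(mustBeRack)) continue;
            if (mustNotBeRack != null && n.rack.equals(mustNotBeRack)) continue;
            candidates.add(n);
        }
        return candidates.get(RAND.nextInt(candidates.size()));
    }

    // First replica on the writer (or a random node), second on a
    // different rack, third on the same rack as the second
    static List<Node> place(Node writer, List<Node> cluster) {
        Node first = (writer != null) ? writer : pick(cluster, null, null, null);
        Node second = pick(cluster, first, null, first.rack);
        Node third = pick(cluster, second, second.rack, null);
        return Arrays.asList(first, second, third);
    }
}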
Data Locality
- Key to achieving performance
- MapReduce tasks run as close to the data as possible
- Some terms: local, on-rack, off-rack
Data Corruption
- Checksums are used to ensure block integrity
- A checksum is calculated on read and compared against the checksum computed when the block was written
- Checksums are fast to calculate and space-efficient
- If the checksums differ, the client reads the block from another DataNode; the corrupted block is deleted and re-replicated from a non-corrupt copy
- Each DataNode also periodically runs a background thread that performs this checksum verification, catching corruption in files that aren't read very often
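A minimal sketch of the recompute-and-compare step, assuming a CRC-style checksum (HDFS keeps its per-chunk checksums in a sidecar metadata file; the method name here is invented for illustration):

import java.util.zip.CRC32;

public class ChecksumCheck {
    // Recompute the checksum over the block's bytes and compare it to
    // the value recorded when the block was written
    static boolean blockLooksValid(byte[] blockData, long storedChecksum) {
        CRC32 crc = new CRC32();
        crc.update(blockData);
        return crc.getValue() == storedChecksum;
    }

    public static void main(String[] args) {
        byte[] data = "some block bytes".getBytes();
        CRC32 crc = new CRC32();
        crc.update(data);
        long stored = crc.getValue();
        System.out.println(blockLooksValid(data, stored)); // true
    }
}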
Fault Tolerance
- If no heartbeat is received from a DataNode within a configurable time window (10 minutes by default), it is considered lost
- The NameNode will then:
  - Determine which blocks were on the lost node
  - Locate other DataNodes with valid copies
  - Instruct those DataNodes to replicate the blocks to other DataNodes
Interacting with HDFS
- The primary interfaces are the CLI and the Java API
- A web UI offers read-only access
- HFTP provides read-only access over HTTP/S
- WebHDFS provides a RESTful read/write interface
- FUSE allows HDFS to be mounted as a standard file system, typically so legacy apps can use HDFS data
- 'hdfs dfs -help' shows CLI usage
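As a quick illustration of the Java API route, a minimal sketch follows; the path is hypothetical, and the NameNode address (fs.defaultFS) is assumed to come from a core-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS (the NameNode address) from the classpath config
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) { // overwrite if present
            out.writeUTF("Hello, HDFS!");
        }
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}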
Hadoop MapReduce
MapReduce!
- A programming model for processing data
- Contains two phases:
  - Map: perform a map function on input key/value pairs
  - Reduce: perform a reduce function on key/value groups, where groups are created by sorting the map output
- Operations on key/value pairs open the door for very parallelizable algorithms
Hadoop MapReduce
- Automatic parallelization and distribution of tasks
- The framework handles scheduling tasks and retrying failed ones
- The framework also handles the Shuffle and Sort phase between the map and reduce tasks
- The developer can code many pieces of the puzzle, but need only focus on the task at hand rather than on managing where data comes from and where it goes
MRv1 and MRv2
- Both manage compute resources, jobs, and tasks
- The Job API is the same in both
- MRv1 is proven in production: JobTracker and TaskTrackers
- MRv2 is a new application built on YARN
- YARN is a generic platform for developing distributed applications
MRv1 - JobTracker
- Monitors job and task progress
- Issues task attempts to TaskTrackers
- Retries failed task attempts; four failed attempts of the same task = one failed job
- The default scheduler runs jobs in FIFO order; the CapacityScheduler or FairScheduler is preferred
- A single point of failure for MapReduce
MRv1 - TaskTrackers
- Run on the same nodes as DataNodes
- Send heartbeats and task reports to the JobTracker
- Have a configurable number of map and reduce slots
- Run map and reduce task attempts in separate JVMs
How MapReduce Works
- The client submits a job to the JobTracker for processing
- The JobTracker uses the job's input to determine where the blocks are located (through the NameNode) and distributes task attempts to the TaskTrackers
- The TaskTrackers coordinate the task attempts, and the JobTracker reports metrics
- Task output is written back to the DataNodes, where it is distributed and replicated as in normal HDFS operation
- Job statistics, not job output, are reported back to the client upon completion
How MapReduce Works - Failure
- If a task attempt or its node fails, the JobTracker assigns the task to a different node, and the job continues as above
MapReduce data flow: Map Input -> Map Output -> Reducer Input Groups -> Reducer Output
- Key/value pairs are used as the input and output of both phases
- A highly parallelizable paradigm, making it a very easy choice for data processing on a Hadoop cluster
Example -- Word Count
- Count the number of times each word is used in a body of text
- Uses TextInputFormat and TextOutputFormat

map(byte_offset, line):
    foreach word in line:
        emit(word, 1)

reduce(word, counts):
    sum = 0
    foreach count in counts:
        sum += count
    emit(word, sum)
Mapper Code

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line into tokens and emit (word, 1) for each one
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}
Shuffle and Sort
- Each mapper outputs its data to a single local file that is logically partitioned by key
- Each reducer copies its partition from every mapper over to its local machine: the 'shuffle'
- Each reducer then merge-sorts its partitions into a single sorted local file for processing
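Which reducer a given key is shuffled to is decided by the job's Partitioner. A minimal sketch of the default behavior is below (this mirrors what Hadoop's built-in HashPartitioner does): identical keys always hash to the same partition, which is what makes reducer-side grouping work.

import org.apache.hadoop.mapreduce.Partitioner;

public class HashPartitionerSketch<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the modulo result is non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}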
Reducer Code

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts for this word and emit the total
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
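The slides show only the mapper and reducer; for completeness, a sketch of a driver that wires the two together follows. The class name WordCount and the use of command-line arguments for the input and output paths are assumptions, not from the deck.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(IntSumReducer.class);
        // Key/value types emitted by the reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths come from the command line
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}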
Resources, Wrap-up, etc.
- http://hadoop.apache.org
- Supportive community: Hadoop-DC, Data Science MD, Baltimore Hadoop Users Group
- Plenty of resources available to learn more: book lists, blogs
- Easy to install on a VM and begin exploring