Hadoop Framework and Its Applications

1 Hadoop Framework and Its Applications
CSCI5570 Large Scale Data Processing Systems Instructor: James Cheng, CSE, CUHK Slide acknowledgement: modified from the slides by Shumo Chu

2 Outline Background Hadoop Framework Applications of Hadoop
What is Hadoop Hadoop Distributed File System (HDFS) MapReduce Mapper Reducer Combiner Custom Partitioner Applications of Hadoop HBase TF-IDF SSSP

3 Parallel processing is the trend
Moore’s Law: roughly stated, processing power doubles every two years. However, we will soon reach physical limitations. Worse: data grow at a much faster rate!

4 Single Node Architecture
Load-Process-Dump: load the data, process the data, dump the result to persistent storage. [Figure: a single node with CPU, memory, and disk; machine learning/statistics run in memory, “classical” data mining works from disk.] (J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets)

5 Motivation: Google Example
20+ billion web pages × 20KB = 400+ TB. Load: at a single computer’s disk read speed, it would take ~4 months just to read the web. Process: it would need a huge memory (>400 TB RAM). Dump: ~1,000 hard drives just to store the web. A standard architecture for such problems: a cluster of commodity Linux nodes connected by a commodity network (Ethernet). (J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets)

6 Cluster Architecture
2-10 Gbps backbone between racks; 1 Gbps between any pair of nodes in a rack. Each rack contains multiple nodes. [Figure: racks of commodity nodes (CPU, memory, disk), each rack with its own switch, connected by a backbone switch.] (J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets)

7 Drawbacks of traditional distributed frameworks
Examples: MPI (Message Passing Interface), PVM (Parallel Virtual Machine), Condor. Programming is complicated; data exchange requires synchronization; it is difficult to deal with partial system failure (not uncommon in a large cluster).

8 Why Hadoop? Reliability: the system automatically handles partial failures. Scalability: automatically scales to thousands or more computing nodes. Programmability: applications are written in high-level code; programmers do not need to worry about network programming, temporal dependencies, fault tolerance, etc.

9 Outline Background Hadoop Framework Applications of Hadoop
What is Hadoop Hadoop Distributed File System (HDFS) MapReduce Mapper Reducer Applications of Hadoop HBase TF-IDF SSSP

10 What is Hadoop? Hadoop is an open-source project by the Apache Software Foundation. The key concepts of Hadoop are based on papers published by Google in 2003 and 2004 (the Google File System and MapReduce). Hadoop matured and became popular due to commitment by many organizations: Google, Yahoo, Facebook, Cloudera, etc.

11 Hadoop Components Hadoop consists of two core components
The Hadoop Distributed File System (HDFS) and MapReduce. There are many other projects based on core Hadoop, referred to as the Hadoop Ecosystem: Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.

12 Hadoop: System Configuration

13 Outline Background Hadoop Framework Applications of Hadoop
What is Hadoop Hadoop Distributed File System (HDFS) MapReduce Mapper Reducer Combiner Custom Partitioner Applications of Hadoop HBase TF-IDF SSSP

14 Hadoop Component: HDFS
HDFS is responsible for storing data in the cluster. Data are split into blocks and distributed across multiple nodes; each block is typically 64MB or 128MB in size. Each block is replicated multiple times (the default is 3 replicas), and the replicas are stored on different nodes, for reliability and availability.

15 Hadoop Component: HDFS
NameNode: the centerpiece of an HDFS file system. It keeps the directory tree of all files and tracks where in the cluster the data of each file are kept, but does not store the data of these files itself. The NameNode is a single point of failure, hence a Secondary NameNode is used.

16 How files are stored: Example
NameNode holds metadata DataNodes hold the actual data blocks

17 How does HDFS work? When a client application wants to read a file: it communicates with the NameNode to determine which blocks make up the file and on which DataNodes those blocks reside; it then communicates directly with those DataNodes to read the data.
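A minimal Java sketch of this read path using the HDFS client API (the file path is illustrative, not from the slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);              // talks to the NameNode for metadata
    Path file = new Path("/user/demo/input.txt");      // illustrative path
    try (FSDataInputStream in = fs.open(file)) {       // block locations are resolved via the NameNode,
      IOUtils.copyBytes(in, System.out, 4096, false);  // data are streamed directly from the DataNodes
    }
  }
}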

18 Outline Background Hadoop Framework Applications of Hadoop
What is Hadoop Hadoop Distributed File System (HDFS) MapReduce Mapper Reducer Combiner Custom Partitioner Applications of Hadoop HBase TF-IDF SSSP

19 Hadoop component: MapReduce
MapReduce is a method for distributing a task across multiple nodes. Where possible, each node processes data stored on that node (for locality). It consists of two phases: Map and Reduce. Features: automatic parallelization, fault tolerance, and a clean abstraction for programmers (shielding them from the tedious and complicated work of distributed/parallel computing).

20 MapReduce: The Big Picture

21 MapReduce: The Mapper The mapper reads data in the form of key/value pairs It outputs zero or more key/value pairs

22 MapReduce: The Mapper (cont’d)
The mapper may use or completely ignore the input key. For example, a standard pattern is to read a line of a file at a time: the key is the byte offset in the file at which the line starts, and the value is the contents of the line itself; typically the key is considered irrelevant. The output must be in the form of key/value pairs.

23 Example Mapper: Upper Case Mapper
Turn the input into upper case. Here, ‘foo’ and ‘FOO’ are different keys; e.g., we could have ‘foo’ and ‘fOO’, which will both be converted into ‘FOO’.
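A minimal Java sketch of such a mapper (the class name and the choice of Text for all types are assumptions, not the slide’s original code):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: emits each key/value pair with both key and value upper-cased,
// so 'foo' and 'fOO' both come out as 'FOO'.
public class UpperCaseMapper extends Mapper<Text, Text, Text, Text> {
  @Override
  protected void map(Text key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(new Text(key.toString().toUpperCase()),
                  new Text(value.toString().toUpperCase()));
  }
}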

24 Example Mapper: Filter Mapper
Only output key/value pairs where the input value is a prime number:
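A minimal Java sketch of such a filter (the class name and the IntWritable value type are assumptions):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: only pairs whose value is a prime number are passed through.
public class FilterMapper extends Mapper<Text, IntWritable, Text, IntWritable> {

  private static boolean isPrime(int n) {
    if (n < 2) return false;
    for (int i = 2; (long) i * i <= n; i++) {
      if (n % i == 0) return false;
    }
    return true;
  }

  @Override
  protected void map(Text key, IntWritable value, Context context)
      throws IOException, InterruptedException {
    if (isPrime(value.get())) {
      context.write(key, value);   // non-prime values are silently dropped
    }
  }
}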

25 MapReduce: The Reducer
After the map phase is over, all the intermediate values for the same intermediate key are combined together into a list, and the list is sent to a reducer. There may be a single reducer or multiple reducers; all values associated with a particular intermediate key are guaranteed to go to the same reducer. This step is known as the “shuffle” step (done by the system, e.g., Hadoop, automatically via external sorting). Each reducer then computes on the intermediate keys and their value lists, and the results are written to HDFS.

26 Example Reducer: Sum Reducer
Sum up all the values associated with each intermediate key
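A minimal Java sketch of a sum reducer (the class name and Writable types are assumptions):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: for each intermediate key, add up all of its values and emit the total.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}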

27 Example Reducer: Identity Reducer
The Identity Reducer is very common; it simply emits the mappers’ output unchanged:
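A minimal Java sketch (the types are assumptions; this mirrors what Hadoop’s base Reducer does by default):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: emits every (key, value) pair unchanged.
public class IdentityReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    for (IntWritable v : values) {
      context.write(key, v);
    }
  }
}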

28 MapReduce Execution

29 MapReduce: Data Localization
Whenever possible, Hadoop will attempt to ensure that a map task on a node is working on a block of data stored locally in the node via HDFS If this is not possible, the map task will have to transfer the data across the network as it processes that data Once the map task has finished, the data are then transferred across the network to reducers Although a reducer may run on the same physical machines as the map tasks, there is no concept of data locality for reducers All mappers will, in general, have to communicate with all reducers

30 MapReduce: Is the shuffle step a bottleneck?
It appears that the shuffle phase is a bottleneck: no reducer can start until all mappers have finished. In practice, Hadoop will start to transfer data from mappers to reducers as individual mappers finish their work. This avoids a huge data transfer that could only start after the last mapper finishes.

31 MapReduce: Is a slow mapper a bottleneck?
It is possible for one map task to run more slowly than the others, perhaps due to faulty hardware or just a very slow machine. It would appear that this would create a bottleneck: no reducer can start until every mapper has finished. Hadoop uses speculative execution to mitigate this: if a mapper appears to be running significantly more slowly than the others, a new instance of the mapper will be started on another machine, operating on the same data. The result of whichever mapper finishes first will be used, and Hadoop will kill off the mapper that is still running. The same applies to a slow reducer.

32 MapReduce: The Combiner
Often, mappers produce a large amount of intermediate data that must be passed to the reducers, which can result in a lot of network traffic. It is often possible to specify a combiner: like a “mini-reduce”, it runs locally on a single mapper’s output, and the output from the combiner is sent to the reducers. The combiner and reducer code are often identical; technically, this works if the operation performed is commutative and associative.

33 MapReduce Example: Word Count
Count the number of occurrences of each word in a large amount of input data This is the “hello world” of MapReduce programming
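A minimal Java sketch of the word-count mapper; paired with a sum reducer like the one sketched earlier, it completes the job (the class name is illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: the input key is the byte offset of the line (ignored);
// for every word in the line, emit (word, 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}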

34 MapReduce Example: Word Count (cont’d)
Input to Mappers Output from Mappers

35 MapReduce Example: Word Count (cont’d)
Intermediate data sent to Reducers Final Output from Reducers:

36 Word Count with Combiner
Combiners reduce the amount of data sent to reducers. [Figure: intermediate data sent to reducers after a combiner that uses the same code as the reducer.] Combiners decrease the amount of network traffic required during the shuffle phase, and often also decrease the amount of work needed to be done by the reducers.
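In the driver, the combiner is enabled with a single call; a sketch assuming the WordCountMapper and SumReducer classes sketched earlier:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(SumReducer.class);   // reuse the reducer as a "mini-reduce" on each mapper's output
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Reusing SumReducer as the combiner is valid here because addition is commutative and associative.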

37 MapReduce: Custom Partitioners
Sometimes you will need to write your own partitioner (number of partitions = number of reducers). Example: you may want all keys in the same value range to go to the same reducer; the default partitioner is not sufficient in this case. Write your own partitioner (see the sketch below) and register it with: job.setPartitionerClass(MyPartitioner.class);
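A sketch of what MyPartitioner could look like for the range example above (the key type and RANGE_SIZE are assumptions):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch: keys whose integer value falls in the same fixed-size range
// are sent to the same reducer. Assumes non-negative keys.
public class MyPartitioner extends Partitioner<IntWritable, Text> {
  private static final int RANGE_SIZE = 1000;

  @Override
  public int getPartition(IntWritable key, Text value, int numPartitions) {
    int range = key.get() / RANGE_SIZE;
    return range % numPartitions;   // must return a value in [0, numPartitions)
  }
}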

38 Custom Partitioners (cont’d)
Custom partitioners are needed when performing a secondary sort Custom partitioners are also useful to avoid potential performance issues To avoid one reducer having to deal with many large lists of values

39 Outline Background Hadoop Framework Applications of Hadoop
What is Hadoop Hadoop Distributed File System (HDFS) MapReduce Mapper Reducer Combiner Custom Partitioner Applications of Hadoop HBase TF-IDF SSSP

40 HBase vs RDBMS

41 HBase Data as Input to MapReduce Jobs
Rows from an HBase table can be used as input to a MapReduce job Each row is treated as a single record MapReduce jobs can sort/search/index/query data in bulk

42 Data Mining – TF-IDF Term Frequency – Inverse Document Frequency (TF-IDF) answers the question “How important is this term in a document?” It is known as a term weighting function: it assigns a score (weight) to each term (word) in a document. It is very commonly used in text processing and search, and has many applications in data mining.

43 TF-IDF: Motivation Merely counting the number of occurrences of a word in a document is not a good enough measure of its relevance. If the word appears in many other documents, it is probably less relevant. Some words appear too frequently in all documents to be relevant; these are known as ‘stopwords’. TF-IDF considers both the frequency of a word in a given document and the number of documents which contain the word.

44 TF-IDF: Definition Term Frequency (TF)
Number of times a term appears in a document (i.e., the count). Inverse Document Frequency (IDF): N = total number of documents, n = number of documents that contain the term. TF-IDF = TF × IDF
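A common form of the IDF, using the variables above, is IDF = log(N / n), which gives TF-IDF = TF × log(N / n).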

45 Computing TF-IDF With MapReduce
Overview of algorithm: 3 MapReduce jobs. Job 1: compute term frequencies. Job 2: compute the number of documents each term appears in. Job 3: compute TF-IDF. Notations: tf = term frequency, n = number of documents a term appears in, N = total number of documents, docid = a unique id for each document.

46 Computing TF-IDF: Job 1 – Compute tf
Mapper Input: (docid, contents). For each term in the document, generate a (term, docid) pair, i.e., we have seen this term in this document once. Output: ((term, docid), 1). Reducer: sum the counts for each (term, docid) pair. Output: ((term, docid), tf), i.e., the term frequency of term in docid is tf. We can add a combiner, which will use the same code as the reducer.
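A Java sketch of Job 1, with the composite (term, docid) key encoded as a tab-separated Text (the class names and the encoding are assumptions):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TermFrequencyJob {

  // Mapper: for every term occurrence in a document, emit ((term, docid), 1).
  public static class TfMapper extends Mapper<Text, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text compositeKey = new Text();

    @Override
    protected void map(Text docid, Text contents, Context context)
        throws IOException, InterruptedException {
      StringTokenizer terms = new StringTokenizer(contents.toString());
      while (terms.hasMoreTokens()) {
        compositeKey.set(terms.nextToken() + "\t" + docid.toString());
        context.write(compositeKey, ONE);
      }
    }
  }

  // Reducer (also usable as a combiner): sum the 1's to get ((term, docid), tf).
  public static class TfReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int tf = 0;
      for (IntWritable v : values) {
        tf += v.get();
      }
      context.write(key, new IntWritable(tf));
    }
  }
}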

47 Computing TF-IDF: Job 2 – Compute n
Mapper Input: ((term, docid), tf) Output: (term, (docid, tf, 1)) Reducer Sum ‘1’s to compute n (number of documents containing term) Note: need to buffer (docid, tf) pairs while we are doing this (more later) For each (docid, tf) pair: Outputs ((term, docid), (tf, n))

48 Computing TF-IDF: Job 3 – Compute TF-IDF
Mapper Input: ((term, docid), (tf, n)) Assume N is known (easy to find) Output ((term, docid), TF × IDF) Reducer The identity function

49 Computing TF-IDF: Working At Scale
For Job 2, we need to buffer (docid, tf) pairs while summing ‘1’s (to compute n) Potential problem: pairs may not fit in memory! How many documents does the word “the” appear in? Possible solutions Ignore very-high-frequency words Write out intermediate data to a file Use another MapReduce pass

50 TF-IDF: Final Thoughts
Several small jobs add up to full algorithm Thinking in MapReduce often means decomposing a complex algorithm into a sequence of smaller jobs Beware of memory usage for large amounts of data! Any time when you need to buffer data, there’s a potential scalability bottleneck

51 Graph Algorithm: SSSP Single-Source Shortest Path (SSSP); the graph is usually represented as adjacency lists
Serial algorithm: Dijkstra’s algorithm, which is not suitable for parallelization. MapReduce algorithm: parallel breadth-first search.

52 Parallel Breadth-First Search
The algorithm, intuitively: the distance from the source to itself is 0; for all neighbors of the source, the distance is 1; in general, for a node n that is a neighbor of some already-reached node, the distance from the source to n is 1 + the minimum, over all neighbors v of n, of the distance from the source to v.

53 Parallel Breadth-First Search: Algorithm
Mapper: the input key is a node id and the input value is (d, adjacency list), where d is the distance from the source. Processing: for each node in the adjacency list, emit (node id, d + 1); if the distance to this node is d, then the distance to any of its neighbors is d + 1. Reducer: receives a node id and a list of distance values. Processing: selects the smallest distance value for that node.

54 PBFS: Pseudo-Code
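A minimal Java sketch of one PBFS iteration, assuming each node record is stored as a Text of the form "distance|comma-separated-adjacency-list" (this encoding and all class names are assumptions, not the slide’s original code):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ParallelBFS {
  private static final long INFINITY = Long.MAX_VALUE;   // distance of not-yet-reached nodes

  public static class BFSMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable nodeId, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\\|", 2);
      long d = Long.parseLong(parts[0]);
      String adjacency = parts.length > 1 ? parts[1] : "";

      // Re-emit the node's structure so the reducer can reconstruct the graph.
      context.write(nodeId, value);

      // Propagate candidate distances d + 1 only from nodes already reached.
      if (d != INFINITY && !adjacency.isEmpty()) {
        for (String neighbor : adjacency.split(",")) {
          context.write(new LongWritable(Long.parseLong(neighbor)),
                        new Text(String.valueOf(d + 1)));
        }
      }
    }
  }

  public static class BFSReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void reduce(LongWritable nodeId, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      long minDistance = INFINITY;
      String adjacency = "";
      for (Text v : values) {
        String s = v.toString();
        if (s.contains("|")) {                       // the node's structure record
          String[] parts = s.split("\\|", 2);
          minDistance = Math.min(minDistance, Long.parseLong(parts[0]));
          adjacency = parts.length > 1 ? parts[1] : "";
        } else {                                     // a candidate distance from a neighbor
          minDistance = Math.min(minDistance, Long.parseLong(s));
        }
      }
      context.write(nodeId, new Text(minDistance + "|" + adjacency));
    }
  }
}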

55 Iterations of PBFS A MapReduce job corresponds to one iteration of parallel breadth-first search Each iteration advances the ‘known frontier’ by one hop Iteration is accomplished by using the output from one job as the input to the next How many iterations are needed? Multiple iterations are needed to explore the entire graph As many as the diameter of the graph Graph diameters are surprisingly small, even for large graphs ‘Six degrees of separation’ Controlling iterations in Hadoop Use counters; when you reach a node, ‘count’ it At the end of each iteration, check the counters When you’ve reached all the nodes, you finish

