Apache Hadoop and Spark


Apache Hadoop and Spark Instructor: Bei Kang

Huge Amount of Data every Minute

Hadoop Hadoop is an open-source, scalable, and fault-tolerant framework written in Java. It efficiently processes large volumes of data on a cluster of commodity hardware. Hadoop is not only a storage system but a platform for both large-scale data storage and processing. Hadoop consists of three key parts – Hadoop Distributed File System (HDFS) – the storage layer of Hadoop. MapReduce – the data processing layer of Hadoop. YARN – the resource management layer of Hadoop.

Hadoop Hadoop Architecture – courtesy of Data Flair

Hadoop – HDFS System Hadoop Distributed File System (HDFS)

Hadoop – HDFS System NameNode and DataNodes HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
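As a minimal sketch of this division of labor, the snippet below drives the standard hdfs dfs command-line tool from Python (the paths and file names are hypothetical): namespace operations go to the NameNode, while file contents stream directly to and from the DataNodes.

import subprocess

def hdfs(*args):
    # Run an HDFS shell command; the client asks the NameNode for metadata,
    # then streams block data directly to/from the DataNodes.
    subprocess.run(["hdfs", "dfs", *args], check=True)

hdfs("-mkdir", "-p", "/user/demo")         # namespace operation (NameNode only)
hdfs("-put", "local.txt", "/user/demo/")   # file is split into blocks on DataNodes
hdfs("-cat", "/user/demo/local.txt")       # blocks are read back from the DataNodes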

Hadoop – HDFS System The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case. The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.

Hadoop – HDFS System Goals of HDFS Fault detection and recovery: Since HDFS runs on a large number of commodity hardware components, component failures are frequent, so HDFS needs mechanisms for quick, automatic fault detection and recovery. Huge datasets: HDFS should scale to hundreds of nodes per cluster to manage applications with huge datasets. Hardware at data: A requested task can be done efficiently when the computation takes place near the data; especially where huge datasets are involved, this reduces network traffic and increases throughput.
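Fault tolerance in HDFS rests largely on block replication: the dfs.replication setting (3 by default) controls how many DataNodes hold a copy of each block. A hedged sketch, reusing the hypothetical path from above:

import subprocess
# Raise the replication factor of one file to 3 and wait (-w) until it completes.
subprocess.run(["hdfs", "dfs", "-setrep", "-w", "3", "/user/demo/local.txt"], check=True)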

Hadoop – Map Reduce

Hadoop – Map Reduce Bear, Deer, River and Car Example – word count
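The flow in that example can be simulated in a few lines of plain Python (a local sketch of the paradigm, not the Hadoop API): map emits (word, 1) pairs, the shuffle groups them by key, and reduce sums the values per word.

from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(values) for word, values in groups.items()}

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
pairs = [p for line in lines for p in map_phase(line)]
print(reduce_phase(shuffle(pairs)))  # {'Deer': 2, 'Bear': 2, 'River': 2, 'Car': 3}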

Hadoop – Map Reduce Another example: the chart below shows the electrical consumption of an organization. It contains the monthly electrical consumption and the annual average for various years.

      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Avg
1979   23  23   2  43  24  25  26  26  26  26  25  26  25
1980   26  27  28  28  28  30  31  31  31  30  30  30  29
1981   31  32  32  32  33  34  35  36  36  34  34  34  34
1984   39  38  39  39  39  41  42  43  40  39  38  38  40
1985   38  39  39  39  39  41  41  41  00  40  39  39  45

Hadoop – Map Reduce The above data is saved as sample.txt and given as input. The input file looks as shown below.

1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45

What can we do to find the max or min values in the given data? One way is sketched below.
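A hedged sketch of the MapReduce logic for this question in plain Python (a local simulation, not the Hadoop API): the mapper emits a (year, maximum monthly reading) pair per line, and the reducer keeps the overall maximum per year.

def mapper(line):
    fields = line.split()
    # First field is the year; last field is the annual average, so drop it.
    year, readings = fields[0], [int(v) for v in fields[1:-1]]
    return (year, max(readings))

def reducer(year, values):
    # Reduce: keep the maximum across all values emitted for this year.
    return (year, max(values))

print(mapper("1979 23 23 2 43 24 25 26 26 26 26 25 26 25"))  # ('1979', 43)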

Apache Spark Apache Spark is an open-source, lightning-fast big data framework designed to improve computational speed. Spark can run on top of Hadoop, and it runs applications up to 100x faster in memory and 10x faster on disk than Hadoop MapReduce. Spark makes this possible by reducing the number of read/write cycles to disk and storing intermediate data in memory, whereas Hadoop MapReduce reads and writes from disk between stages, which slows down the processing speed. Spark supports Java, Scala, and Python.

Relation and Difference between Spark and Hadoop 1. Both Hadoop and Spark are frameworks for handling big data. 2. Hadoop focuses on HDFS data storage and MapReduce, whereas Spark focuses on in-memory computation. 3. Hadoop processes static data sets, whereas Spark can process either static or real-time/streaming data. 4. Spark can use Hadoop (HDFS) for file/data storage. 5. Disaster recovery – Hadoop handles failures better than Spark, since data on disk survives crashes that would lose in-memory state.

Spark – RDD Resilient Distributed Datasets: a read-only, partitioned collection of records spread across the cluster. Datasets can be cached in memory, with different storage levels available and fallback to disk possible.
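A minimal PySpark sketch of these properties (the local master, app name, and sizes here are assumptions):

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "rdd-demo")
rdd = sc.parallelize(range(1000000), 8)       # partitioned collection of records
rdd.persist(StorageLevel.MEMORY_AND_DISK)     # cache in memory, spill to disk if needed
print(rdd.count())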

Spark – RDD Operations Transformations build RDDs through deterministic operations on other RDDs; transformations include map, filter, and join. Evaluation is lazy: nothing is computed until an action requires it. Actions return a value or export data; actions include count, collect, and save, and each action triggers execution.
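A short sketch of lazy evaluation, reusing the SparkContext sc from the RDD sketch above:

nums = sc.parallelize([1, 2, 3, 4, 5])
evens = nums.filter(lambda n: n % 2 == 0)   # transformation: recorded, not executed
doubled = evens.map(lambda n: n * 2)        # transformation: still nothing computed
print(doubled.collect())                    # action: runs the whole pipeline -> [4, 8]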

Spark Example (Python)

# Estimate π (compute-intensive task).
# Pick random points in the unit square ((0, 0) to (1, 1));
# see how many fall inside the unit quarter-circle. The fraction should be π / 4.
# Note that the "parallelize" method creates an RDD.
from random import random
from operator import add
from pyspark import SparkContext

spark = SparkContext("local[*]", "pi")
NUM_SAMPLES = 1000000   # example value
partitions = 4          # example value

def sample(p):
    x, y = random(), random()
    return 1 if x*x + y*y < 1 else 0

count = spark.parallelize(range(0, NUM_SAMPLES), partitions).map(sample) \
             .reduce(add)
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))

Pi calculation

Spark Example – Word Count

import sys
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add)
output = counts.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))

Spark Example – Word Count flatMap: we take the RDD of lines and transform it into an RDD of words. map: we transform the RDD of words into an RDD of (word, 1) tuples; this is also called a key-value RDD. reduceByKey: for each key (word), we reduce all the values by summing them together.
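To try this example (assuming the script is saved as wordcount.py and a local input.txt exists), it can be submitted with: spark-submit wordcount.py input.txt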