About Hadoop  Hadoop was one of the first popular open-source big data technologies. It is a scalable, fault-tolerant system for processing large datasets across a cluster of commodity servers. Its core components are HDFS and YARN, with MapReduce as the processing engine.

What is HDFS?  HDFS is a file system that stores data reliably. It has two types of nodes: the NameNode, which stores metadata, and DataNodes, which store the actual data. HDFS is a block-structured file system: like Linux file systems, it splits a file into fixed-size blocks, also known as partitions or splits. The default block size is 128 MB.
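
To see how HDFS blocks surface in Spark, here is a minimal sketch, assuming a running SparkContext sc and a hypothetical file hdfs:///data/events.log; each HDFS block of the file normally becomes one partition of the resulting RDD.
val logs = sc.textFile("hdfs:///data/events.log")  // read a file stored in HDFS (hypothetical path)
println(logs.partitions.length)                    // roughly one partition per 128 MB block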

YARN  YARN is a distributed OS for the cluster, also called the cluster manager. It lets huge amounts of data be processed quickly and in parallel, and it can run different kinds of workloads at the same time, such as batch, streaming, and iterative jobs. It is a unified stack.

What is MapReduce?  MapReduce is a processing engine in Hadoop. It can process only batch data, i.e. bounded data. Internally it processes data disk to disk, so it is very slow and almost everything must be optimized manually. Different ecosystems such as Hive, Pig, and others run on top of it to process data.

Common data sources

Processing too slow

Data lost

HDFS is number one for storing data in parallel. There is no real competitor for storing data reliably, at scale, and at low cost. The problem is processing that data quickly: MapReduce is very slow. How do we overcome that?

Speed and durability are the two key factors.

Problem - Solution  Problem: disk-to-disk processing is very slow, so MapReduce takes a lot of time, and chaining framework after framework creates new processing problems. Solution: in-memory processing keeps everything in RAM, which makes processing very fast.

Why only Spark, and why not others?

10 times less code, 10 times faster.

Why switch to Spark? The key features of Spark include the following: • Easy to use (programmer friendly) • Fast (in-memory) • General-purpose • Scalable (processes data in parallel) • Optimized • Fault tolerant • Unified platform

Different types of workloads: Batch processing -- Hadoop; Streaming -- Storm; Iterative -- MLlib or GraphX; Interactive -- SQL/BI

Key entities: 1) driver program, 2) cluster manager, 3) worker nodes, 4) executors, 5) tasks

What is the driver program?  The Spark driver is the program that declares/defines the transformations and actions on RDDs and submits those requests to the master. The node where the driver program runs is called the driver node; it may be either inside or outside the cluster.
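
A minimal driver-program sketch (the app name and input path are hypothetical): the driver creates the SparkContext, declares transformations, and triggers them with an action.
import org.apache.spark.{SparkConf, SparkContext}

object ExampleDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("driver-example")  // hypothetical application name
    val sc = new SparkContext(conf)
    val lines = sc.textFile("hdfs:///data/input.txt")        // hypothetical input path
    val longLines = lines.filter(_.length > 80)              // transformation: declared only, not yet run
    println(longLines.count())                               // action: submits the job to the cluster
    sc.stop()
  }
}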

Cluster manager (YARN)  The cluster manager is the distributed OS of the cluster: it schedules tasks and allocates resources, granting RAM and CPUs to executors based on node manager requests.

Worker nodes / node manager  In Hadoop terminology a worker node is also called a node manager. It manages the executors on that node; if an executor exceeds its resource limits, the node manager kills it.

Tasks  A task is the smallest unit of work that Spark sends to an executor. It is executed by a thread in an executor on a worker node. Each task performs some computation and either returns a result to the driver program or writes output to storage such as HDFS or S3. Spark creates one task per data partition, and an executor runs one or more tasks concurrently. The degree of parallelism is determined by the number of partitions: more partitions mean more tasks processing data in parallel.
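
A small sketch of how partitions determine tasks (the numbers are arbitrary examples, assuming a running SparkContext sc):
val nums = sc.parallelize(1 to 1000000, 4)  // 4 partitions -> 4 tasks per stage
println(nums.partitions.length)             // 4
val wider = nums.repartition(8)             // 8 partitions -> up to 8 tasks can run in parallel
println(wider.partitions.length)            // 8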

Executors  Spark acquires executors on the nodes in the cluster; these are processes that run computations and store data for your application. Each executor has a fixed number of cores and a fixed amount of RAM for processing data. Executors are similar to containers, but additionally they support the in-memory concept.
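
One way to fix each executor's cores and RAM is through the Spark configuration; a hedged sketch with arbitrary example values (on YARN these settings translate into container requests):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("executor-sizing-example")  // hypothetical application name
  .set("spark.executor.instances", "4")   // how many executors to request
  .set("spark.executor.memory", "4g")     // RAM per executor
  .set("spark.executor.cores", "2")       // cores per executor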

Spark Job Submission on YARN

Abstraction  Each engine has a fundamental element it uses to process data: Hive -- table; Pig -- relation; SQL -- schema; Spark -- RDD (1.x), Dataset (2.x).

What is an RDD?  An RDD is a collection of data partitions. RDDs have a few key properties: they are immutable, fault tolerant, lazily evaluated, distributed, in-memory, and more. An RDD can hold either structured or unstructured data. Spark revolves around the concept of RDDs.
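
A tiny sketch of immutability, using made-up sample data: a transformation never changes the original RDD, it produces a new one.
val original = sc.parallelize(Seq(1, 2, 3))
val doubled = original.map(_ * 2)            // a new RDD; 'original' is left unchanged
println(original.collect().mkString(","))    // 1,2,3
println(doubled.collect().mkString(","))     // 2,4,6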

[Diagram: an RDD made of partitions (Partition1, Partition2, Partition3), each stored on a DataNode / NodeManager such as DN2/NM2 and DN3/NM3.]

How does an RDD distribute the data?

Ways to create RDDs  There are two ways to create RDDs: 1) Parallelizing an existing collection with the parallelize method, which converts a Scala object into an RDD, e.g. val data = Array(1, 2, 3, 4, 5); val distData = sc.parallelize(data). 2) Referencing a dataset in external storage with the textFile method, e.g. val distFile = sc.textFile("data.txt").

RDD Operations  RDDs support two types of operations: Transformations, which create a new dataset/RDD from an existing RDD and are computed lazily; and Actions, which return a value to the driver program after running a computation on the RDD.
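
For example, using the distFile RDD created above: map is a transformation (recorded lazily), while reduce is an action that actually triggers the computation.
val lineLengths = distFile.map(s => s.length)          // transformation: lazy, nothing is computed yet
val totalLength = lineLengths.reduce((a, b) => a + b)  // action: runs the computation and returns a value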

Different types of RDDs  Depending on the operation, each transformation produces a different concrete type of RDD. These types are mainly for identification and are most often used when debugging or testing an application; usually there is no need to think about them.

Transformations  A transformation is a lazy function: calling it does not perform any computation. The result of a transformation is always another RDD. It does not modify the existing RDD; it simply applies some logic and creates a new RDD from the existing one.

Actions  An action applies the logic defined by the transformations and computes a result. After performing an action on an RDD, the result is either returned to the driver program or written to a storage system.
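
Two common kinds of actions, sketched on the lineLengths RDD from the previous example (the output path is hypothetical): one returns data to the driver, the other writes to storage.
val firstTen = lineLengths.take(10)                  // action: returns the first 10 values to the driver
lineLengths.saveAsTextFile("hdfs:///out/lengths")    // action: writes the RDD to storage (hypothetical path)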

Why laziness?  It is not a good idea to touch RAM/HDFS at every step; that becomes a bottleneck and hurts performance. With lazy evaluation, Spark touches RAM/HDFS only once, when an action is called.

Cache vs. persist  In Spark, after processing, everything is cleaned up and no old computed RDDs are kept around. If you repeat the same steps with small modifications, as iterative algorithms do, Spark keeps going back to RAM/HDFS. That is not efficient, so Spark provides special functionality called cache and persist to store intermediate data according to the use case.

RDD storage levels
rdd.cache()                 // cache in memory using the default storage level (MEMORY_ONLY)
rdd.persist(STORAGE_LEVEL)  // cache at a specific level
// STORAGE_LEVEL: MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, DISK_ONLY
// rdd.cache() simply calls persist() with the default level

Difference between cache and persist  cache() stores in memory by default; internally it uses MEMORY_ONLY. persist() can store at any storage level, so for longer-term storage persist() is usually used. Remember that caching acts like a transformation: it is lazy.
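
A short sketch of how this looks in an iterative job, assuming a hypothetical dataset path; cache() keeps the default MEMORY_ONLY level, while persist() lets you choose a level such as MEMORY_AND_DISK.
import org.apache.spark.storage.StorageLevel

val ratings = sc.textFile("hdfs:///data/ratings.csv")  // hypothetical dataset
ratings.cache()                                        // lazy; equivalent to persist(StorageLevel.MEMORY_ONLY)
// ratings.persist(StorageLevel.MEMORY_AND_DISK)       // alternative: spill to disk if it does not fit in RAM

for (i <- 1 to 10) {
  // Each iteration reuses the cached data instead of re-reading it from HDFS.
  println(s"iteration $i, count = ${ratings.count()}")
}

ratings.unpersist()                                    // release the cached storage when done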