1 About Hadoop Hadoop was one of the first popular open-source big data technologies. It is a scalable, fault-tolerant system for processing large datasets across a cluster of commodity servers. Its core components are HDFS and YARN, with MapReduce as the processing engine.

2 What is HDFS HDFS is a file system that stores data reliably. It consists of two types of nodes, the NameNode and the DataNodes, which store the metadata and the actual data respectively. HDFS is a block-structured file system: just like Linux file systems, HDFS splits a file into fixed-size blocks, also known as partitions or splits. The default block size is 128 MB.
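
To make the block structure concrete, here is a minimal Scala sketch that inspects a file's block size and block locations through the Hadoop FileSystem API (the HDFS path is hypothetical, and it assumes the Hadoop client libraries and cluster configuration are on the classpath):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object BlockInfo {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration() // picks up core-site.xml / hdfs-site.xml
    val fs = FileSystem.get(conf)
    val status = fs.getFileStatus(new Path("/data/input.txt")) // hypothetical path
    println(s"Block size: ${status.getBlockSize} bytes") // 128 MB by default
    // Each block is replicated across DataNodes; list where the blocks live
    fs.getFileBlockLocations(status, 0, status.getLen).foreach { b =>
      println(s"offset=${b.getOffset} length=${b.getLength} hosts=${b.getHosts.mkString(",")}")
    }
  }
}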

3 YARN YARN is a distributed OS, also called a cluster manager, for processing huge amounts of data quickly and in parallel. It can run different types of workloads at the same time, such as batch, streaming, and iterative jobs. It is a unified stack.

4 What is MapReduce? MapReduce is a processing engine in Hadoop. It can process only batch data, meaning bounded data. Internally it processes disk to disk, so it is very slow and everything must be optimized manually. It allows different ecosystem tools, such as Hive, Pig, and more, to process the data.
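
To make the map/reduce model concrete, here is a plain Scala sketch of word count expressed as a map step and a reduce step (this illustrates the model only, not the Hadoop API; in real MapReduce each step reads from and writes to disk, which is the source of the slowness):

object WordCountModel {
  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or not to be", "to do or not to do") // stand-in for input splits
    // Map step: emit a (word, 1) pair for every word
    val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))
    // Shuffle + reduce step: group by word and sum the counts
    val reduced = mapped.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }
    reduced.foreach(println)
  }
}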

5 Common data sources

15 Processing too slow

16 Data loss

18 HDFS is No. 1 for storing data in parallel
There is no competitor for storing data reliably and scalably at low cost. The problem is processing the data quickly. How do we overcome that? The problem with MapReduce is that it is very, very slow. How do we resolve it?

24 Speed and durability are the two key factors

25 Problem - Solution Disk-to-disk processing is very, very slow, which is why MapReduce takes a lot of time, and bolting framework onto framework creates new processing problems. In-memory processing does everything in RAM, which makes processing very fast.

31 Why only Spark, why not others?

41 10 times less code, 10 times faster

42 Why did I switch to Spark? The key features of Spark include the following:
• Easy to use (programmer friendly)
• Fast (in-memory)
• General-purpose
• Scalable: processes the data in parallel
• Optimized
• Fault tolerant
• Unified platform
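
As an example of the brevity, a complete word count in Spark takes only a few lines (a minimal sketch; the input path is hypothetical and sc is an existing SparkContext):

val counts = sc.textFile("hdfs:///data/input.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _) // aggregation runs in memory where possible
counts.take(10).foreach(println)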

43 Different types of data
Batch processing -- Hadoop
Streaming -- Storm
Iterative -- MLlib or GraphX
Interactive -- SQL/BI

44 Key entities 1) driver program, 2) cluster manager, 3) worker nodes, 4) executors, 5) tasks

45 What is the Driver Program? The Spark driver is the program that declares/defines the transformations and actions on RDDs of data and submits such requests to the master. The node where the driver program runs is called the driver node; it may be either inside or outside the cluster.
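
A minimal sketch of a driver program (the object name is hypothetical): the driver builds the SparkSession, declares the transformations and actions, and stops when done:

import org.apache.spark.sql.SparkSession

object MyDriverApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("my-driver-app").getOrCreate()
    val sc = spark.sparkContext
    // Transformations are declared here (lazy)...
    val squares = sc.parallelize(1 to 100).map(n => n * n)
    // ...and the action triggers execution on the cluster
    println(s"sum of squares = ${squares.sum()}")
    spark.stop()
  }
}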

46 Cluster manager (YARN)
It is a distributed OS. It schedules the tasks and allocates the resources in the cluster, assigning RAM and CPUs to executors based on node manager requests.

47 Worker nodes / node managers
In Hadoop terminology a worker node is also called a node manager. It manages the executors; if an executor crosses its limits, the node manager kills it.

48 Tasks A task is the smallest unit of work that Spark sends to an executor. It is executed by a thread in an executor on a worker node. Each task performs some computation to either return a result to the driver program or write it to storage such as S3/HDFS. Spark creates a task per data partition. An executor runs one or more tasks concurrently. The amount of parallelism is determined by the number of partitions: more partitions mean more tasks processing data in parallel.
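
A small sketch of the partition-to-task relationship (assuming an existing SparkContext sc; the path is hypothetical):

val rdd = sc.textFile("hdfs:///data/input.txt")
println(rdd.getNumPartitions) // e.g. roughly one partition per HDFS block
// Repartitioning changes the parallelism: an action on this RDD launches
// 8 tasks, one per partition, spread across the executors
val repartitioned = rdd.repartition(8)
println(repartitioned.count())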

51 Executors Spark acquires executors on nodes in the cluster; these are processes that run computations and store data for your application. Each executor has a fixed number of cores and a fixed amount of RAM for processing data. Executors are almost similar to containers, but additionally they support the in-memory concept.

52 Spark Job Submission in YARN
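
A minimal sketch of configuring a job for YARN from code (it assumes HADOOP_CONF_DIR points at the cluster configuration; the app name and resource numbers are illustrative). In practice the same settings are often passed to spark-submit instead:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("yarn-example") // hypothetical app name
  .master("yarn") // YARN acts as the cluster manager
  .config("spark.executor.instances", "3") // illustrative sizing
  .config("spark.executor.memory", "4g")
  .config("spark.executor.cores", "2")
  .getOrCreate()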

60 Abstraction The fundamental element each framework uses to process the data:
Hive -- Table
Pig -- Relation
SQL -- Schema
Spark -- RDD (1.x), Dataset (2.x) ....
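
To see the two Spark abstractions side by side, a minimal sketch (assuming an existing SparkSession named spark):

import spark.implicits._

// RDD: the core 1.x abstraction, a distributed collection of partitions
val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))
// Dataset: the 2.x abstraction, typed and schema-aware so the optimizer can help
val ds = Seq(1, 2, 3, 4, 5).toDS()
println(s"rdd count = ${rdd.count()}, ds count = ${ds.count()}")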

61 What is an RDD? An RDD is a collection of data partitions. RDDs follow a few properties such as: immutable, fault tolerant, lazy, distributed, in-memory, and more. An RDD can hold structured or unstructured data. Spark revolves around the concept of RDDs.

62 Diagram: one RDD made of partitions (Partition1, Partition2, Partition3) distributed across DataNode/NodeManager hosts (DN2/NM2, DN3/NM3).

63 How does an RDD distribute the data?

65 Ways to create RDDs There are two ways to create RDDs:
1) Parallelizing an existing collection: the parallelize method converts a Scala object to an RDD.
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
2) Referencing a dataset in external storage: the textFile method.
val distFile = sc.textFile("data.txt")

66 RDD Operations RDDs support two types of operations:
Transformations: create a new dataset/RDD from an existing RDD; they are computed lazily.
Actions: return a value to the driver program after running a computation on the RDD.
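
A short sketch of the two operation types (assuming an existing SparkContext sc):

val numbers = sc.parallelize(1 to 10)
// Transformations: lazy, each returns a new RDD; nothing has executed yet
val doubled = numbers.map(_ * 2)
val bigOnes = doubled.filter(_ > 10)
// Action: triggers the whole computation and returns a value to the driver
println(bigOnes.collect().mkString(", "))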

75 Different types of RDDs Depending on the operation, each transformation generates a different concrete type of RDD. This is mostly just for identification purposes and is most often used to debug or test an application; usually you do not need to consider it.

77 Transformations A transformation is a lazy function: calling it does not perform any computation. The result of a transformation is another RDD. It does not modify the existing RDD; it just applies some logic/functionality and creates a new RDD from the old one.

78 Actions After transformations, applying logic/functionality to actually compute and obtain results is called an action. After performing an action on an RDD, the result is returned to the driver program or written to the storage system.

79 Why laziness? It is not a good idea to touch RAM/HDFS at every step; that is a bottleneck and hurts performance. With laziness, Spark touches RAM/HDFS only once, when we call an action.
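
Laziness is easy to see with a side effect inside a transformation; in this minimal sketch (assuming an existing SparkContext sc) nothing prints until the action runs:

val rdd = sc.parallelize(1 to 3).map { n =>
  println(s"processing $n") // executes only when an action runs the task
  n * 10
}
// No output yet: map() only recorded the computation
val total = rdd.sum() // the action triggers the map, and the prints appear
println(s"total = $total")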

81 Cache vs persist In Spark, after processing, everything is cleaned up and old processed RDDs are gone. If you repeat the same steps with small modifications, iterative algorithms especially, every run touches RAM/HDFS again. That is not a good idea, so Spark provides special functionality called cache and persist to store the data, chosen based on the use case.

82 RDD storage levels
rdd.cache() // cache in memory using the default storage level (MEMORY_ONLY)
rdd.persist(STORAGE_LEVEL) // cache at a specific level
// STORAGE_LEVEL options:
// MEMORY_ONLY
// MEMORY_ONLY_SER
// MEMORY_AND_DISK
// MEMORY_AND_DISK_SER
// DISK_ONLY
// Note: cache() internally just calls persist()

85 Difference between cache and persist
cache() by default stores in memory; internally it uses MEMORY_ONLY. persist() can store at any level, so to store data for a long time you usually use persist(). Remember that caching acts the same as a transformation: it is lazy.
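
A small usage sketch (assuming an existing SparkContext sc; the path is hypothetical):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///data/logs")
val errors = logs.filter(_.contains("ERROR"))

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY); use persist()
// with an explicit level when the data may not fit entirely in RAM
errors.persist(StorageLevel.MEMORY_AND_DISK)
println(errors.count()) // the first action materializes the cache
println(errors.count()) // subsequent actions reuse the cached partitions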
