Introduction to Hadoop and Spark


1 Introduction to Hadoop and Spark
Antonino Virgillito

2 Large-scale Computation
Traditional solutions for computing over large quantities of data relied mainly on processor power: complex processing is performed on data moved into memory, and the system scales only by adding power (more memory, a faster processor).
This works for relatively small to medium amounts of data but cannot keep up with larger datasets.
How to cope with today's indefinitely growing production of data, in the order of terabytes per day?

3 Distributed Computing
Multiple machines connected to each other and cooperating on a common job: a «cluster».
Challenges:
Complexity of coordination: all processes and data have to be kept synchronized with respect to the global system state
Failures
Data distribution

4 Hadoop
Open source platform for distributed processing of large datasets, based on work published by Google (the Google File System and MapReduce papers).
Functions:
Distribution of data and processing across machines
Management of the cluster
Simplified programming model: easy to write distributed algorithms

5 Hadoop scalability
Hadoop can reach massive scalability by exploiting a simple distribution architecture and coordination model.
Huge clusters can be built from (cheap) commodity hardware.
A cluster can easily scale up with little or no modification to the programs.

6 Hadoop Concepts
Applications are written in common high-level languages.
Inter-node communication is limited to the minimum.
Data is distributed in advance: bring the computation close to the data.
Data is replicated for availability and reliability.
Scalability and fault-tolerance.

7 Scalability and Fault-tolerance
Scalability principle: capacity can be increased by adding nodes to the cluster. Increasing load does not cause failures but, in the worst case, only a graceful degradation of performance.
Fault-tolerance: failures of nodes are considered inevitable and are coped with in the architecture of the platform. The system continues to function when a node fails: tasks are re-scheduled, data replication guarantees no data is lost, and the cluster is dynamically reconfigured when nodes join and leave.

8 Benefits of Hadoop
Previously impossible or impractical analyses made possible
Lower cost of hardware
Less time
Ask Bigger Questions

9 Hadoop Components
Core Components (HDFS and MapReduce), plus ecosystem tools: Hive, Pig, Sqoop, HBase, Flume, Mahout, Oozie

10 Hadoop Core Components
HDFS: Hadoop Distributed File System
Abstraction of a file system over a cluster
Stores large amounts of data by transparently spreading it over different machines
MapReduce
Simple programming model that enables parallel execution of data processing programs
Executes the work near the data it operates on
In a nutshell: HDFS places the data on the cluster and MapReduce does the processing work

11 Structure of a Hadoop Cluster
Group of machines working together to store and process data
Any number of "worker" nodes, running both the HDFS and MapReduce components
Two "master" nodes:
Name Node: manages HDFS
Job Tracker: manages MapReduce

12 Hadoop Principle
Hadoop is basically a middleware platform that manages a cluster of machines.
The core component is a distributed file system (HDFS): files in HDFS are split into blocks that are scattered over the cluster, which is seen as one big data set.
The cluster can grow indefinitely simply by adding new nodes.

13 The MapReduce Paradigm
Parallel processing paradigm: the programmer is unaware of parallelism.
Programs are structured into a two-phase execution: Map and Reduce.
Map: data elements are classified into categories.
Reduce: an algorithm is applied to all the elements of the same category.
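As an illustration only (not part of the original slides), the two phases can be sketched in plain Python with no Hadoop involved: the map phase classifies each element into a category (here, the word itself), and the reduce phase applies the same algorithm (a sum) to all elements of the same category. The input sentence is made up.

from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # classify every element: emit a (category, value) pair per word
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # apply the same algorithm (a sum) to all elements of each category
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(v for _, v in group))

pairs = map_phase(["a rose is a rose"])
print(list(reduce_phase(pairs)))  # [('a', 2), ('is', 1), ('rose', 2)]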

14 MapReduce Concepts
Automatic parallelization and distribution
Fault-tolerance
A clean abstraction for programmers: MapReduce abstracts all the 'housekeeping' away from the developer, who can simply concentrate on writing the Map and Reduce functions
MapReduce programs are usually written in Java (all of Hadoop is written in Java), but they can be written in any language using Hadoop Streaming, as in the sketch below
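A hedged sketch of the "any language" option: with Hadoop Streaming, map and reduce are plain scripts that read from standard input and write tab-separated key-value pairs to standard output. The word-count logic and the file names below are illustrative, not taken from the slides.

# mapper.py: emit "word<TAB>1" for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py: input arrives sorted by key, so counts per word can be summed in one pass
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(current_word + "\t" + str(total))
        current_word, total = word, 0
    total += int(count)
if current_word is not None:
    print(current_word + "\t" + str(total))

Scripts like these are submitted to the cluster through the Hadoop Streaming jar, which runs them as the mapper and reducer of the job.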

15 MapReduce and Hadoop
MapReduce is logically placed on top of HDFS.
[Figure: the MapReduce layer running on top of the HDFS layer within Hadoop]

16 MapReduce and Hadoop
MapReduce works on (big) files loaded on HDFS.
Each node in the cluster executes the MapReduce program in parallel, applying the map and reduce phases on the blocks it stores.
Output is written on HDFS.
Scalability principle: perform the computation where the data is.

17 Hive
Apache Hive is a high-level abstraction on top of MapReduce
Uses an SQL-like language called HiveQL
Generates MapReduce jobs that run on the Hadoop cluster
Originally developed by Facebook for data warehousing
Now an open-source Apache project
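Purely as an illustration (not part of the original slides), a HiveQL query could be submitted from a Python client. The sketch assumes the third-party PyHive package and a HiveServer2 instance reachable at hive-server:10000; the table and column names are made up.

from pyhive import hive  # assumed third-party client library, not covered in the slides

# connect to a (hypothetical) HiveServer2 instance
conn = hive.Connection(host="hive-server", port=10000, database="default")
cursor = conn.cursor()

# HiveQL looks like SQL; Hive turns the query into MapReduce jobs on the cluster
cursor.execute("SELECT country, COUNT(*) AS n FROM customers GROUP BY country")
for row in cursor.fetchall():
    print(row)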

18 Overview
HiveQL queries are transparently mapped into MapReduce jobs at runtime by the Hive execution engine, which also applies optimizations.
The jobs are then submitted to the Hadoop cluster.

19 Hive Tables
Hive works on the abstraction of a table, similar to a table in a relational database.
Main difference: a Hive table is simply a directory in HDFS, containing one or more files.
By default files are in text format, but different formats can be specified.
The structure and location of the tables are stored in a backing SQL database called the metastore, which is transparent to the user and can be any RDBMS, specified at configuration time.

20 Hive Tables
At query time, the metastore is consulted to check that the query is consistent with the tables it references.
The query itself operates on the actual data files stored in HDFS.

21 Hive Tables
By default, tables are stored in a warehouse directory on HDFS (default location: /user/hive/warehouse/<db>/<table>).
Each subdirectory of the warehouse directory is considered a database, and each subdirectory of a database directory is a table.
All files in a table directory are considered part of the table when querying, so they must have the same structure.

22 Pig
Tool for querying data on Hadoop clusters, widely used in the Hadoop world: Yahoo! estimates that 50% of the Hadoop workload on its 100,000-CPU clusters is generated by Pig scripts.
Allows writing data manipulation scripts in a high-level language called Pig Latin.
Interpreted language: scripts are translated into MapReduce jobs.
Mainly targeted at joins and aggregations.

23 Overview of Pig
Pig Latin: language for the definition of data flows
Grunt: interactive shell for typing and executing Pig Latin statements
Interpreter and execution engine: translates Pig Latin scripts into MapReduce jobs

24 RHadoop
Collection of packages that allows integration of R with HDFS and MapReduce: Hadoop provides the storage while R brings the processing.
Just a library: not a special run-time, not a different language, not a special-purpose language.
Incrementally port your code and use all R packages.
Requires R to be installed and configured on all nodes in the cluster.

25 RHadoop Packages
rhdfs: interface for reading and writing files from/to an HDFS cluster
rmr2: interface to MapReduce through R
rhbase: interface to HBase

26 rhdfs
Since Hadoop MapReduce programs take their input from HDFS and write their output to it, it is necessary to access HDFS from the R console.
With rhdfs, the R programmer can easily perform read and write operations on distributed data files: the package calls the HDFS API in the backend to operate on data sources stored on HDFS.

27 rmr2
rmr2 is an R interface that provides the Hadoop MapReduce facility inside the R environment.
The R programmer only needs to divide the application logic into the map and reduce phases and submit it with the rmr2 methods.
rmr2 then calls the Hadoop Streaming MapReduce API with several job parameters (such as input directory, output directory, mapper and reducer) to perform the R MapReduce job over the Hadoop cluster.

28 mapreduce
The mapreduce function takes as input a set of named parameters:
input: input path or variable
input.format: specification of the input format
output: output path or variable
map: map function
reduce: reduce function
The map and reduce functions present the usual interface: a call to keyval(k,v) inside the map and reduce functions is used to emit intermediate and output key-value pairs, respectively.

29 WordCount in R
library(rmr2)

wordcount = function(input, output = NULL, pattern = " ") {
  # map: split each line into words and emit a (word, 1) pair per word
  wc.map = function(., lines) {
    keyval(unlist(strsplit(x = lines, split = pattern)), 1)
  }
  # reduce: sum the counts emitted for each word
  wc.reduce = function(word, counts) {
    keyval(word, sum(counts))
  }
  mapreduce(
    input = input,
    output = output,
    input.format = "text",
    map = wc.map,
    reduce = wc.reduce,
    combine = TRUE)
}

30 Reading delimited data
# custom reader: parse tab-separated lines into a key (first field) and values (remaining fields)
tsv.reader = function(con, nrecs) {
  lines = readLines(con, 1)
  if (length(lines) == 0) NULL
  else {
    delim = strsplit(lines, split = "\t")
    keyval(sapply(delim, function(x) x[1]),
           sapply(delim, function(x) x[-1]))
  }
}

# count the frequency of the first value field
# (tsv.data and tsv.format are assumed to be defined elsewhere)
freq.counts = mapreduce(
  input = tsv.data,
  input.format = tsv.format,
  map = function(k, v) keyval(v[1,], 1),
  reduce = function(k, vv) keyval(k, sum(vv)))

31 Reading named columns
# custom reader: first field is the key, remaining fields become named columns of a data frame
tsv.reader = function(con, nrecs) {
  lines = readLines(con, 1)
  if (length(lines) == 0) NULL
  else {
    delim = strsplit(lines, split = "\t")
    keyval(sapply(delim, function(x) x[1]),
           data.frame(
             location = sapply(delim, function(x) x[2]),
             name     = sapply(delim, function(x) x[3]),
             value    = sapply(delim, function(x) x[4])))
  }
}

# filter rows by name and compute the mean of the log of their values
freq.counts = mapreduce(
  input = tsv.data,
  input.format = tsv.format,
  map = function(k, v) {
    filter = (v$name == "blarg")
    keyval(k[filter], log(as.numeric(v$value[filter])))
  },
  reduce = function(k, vv) keyval(k, mean(vv)))

32 Apache Spark
A general-purpose framework for big data processing.
It interfaces with many distributed file systems, such as HDFS (Hadoop Distributed File System), Amazon S3, Apache Cassandra and many others.
Up to 100 times faster than Hadoop MapReduce for in-memory computation.

33 Multilanguage API
You can write Spark applications in various languages: Java, Python, Scala, R.
In the context of this course we will consider Python.

34 Built-in Libraries

35 RDD - Resilient Distributed Dataset
The RDD is the core abstraction used by Spark to work on data.
An RDD is a collection of elements partitioned across the nodes of the cluster; Spark operates on the partitions in parallel.
An RDD can be created from a file on the Hadoop file system or from an existing collection in the driver program.
RDDs can be made persistent in memory.
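A minimal PySpark sketch of the two ways of obtaining an RDD and of making it persistent; the application name and the HDFS path are illustrative.

from pyspark import SparkContext

sc = SparkContext(appName="rdd-example")

# RDD from a file on the Hadoop file system (path is made up)
lines = sc.textFile("hdfs:///data/input.txt")

# RDD from an existing collection in the driver program
numbers = sc.parallelize([1, 2, 3, 4, 5])

# keep the RDD in memory across operations
lines.persist()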

36 Transformations
A transformation creates a new RDD from an existing one.
For example, map is a transformation that takes all elements of the dataset, passes them to a function and returns another RDD with the results:
resultRDD = originalRDD.map(myFunction)
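A few more transformations in PySpark, continuing the sketch above (same sc); transformations are lazy, so nothing is computed until an action is called.

rdd = sc.parallelize([1, 2, 3, 4, 5])

squares = rdd.map(lambda x: x * x)        # apply a function to every element
evens = rdd.filter(lambda x: x % 2 == 0)  # keep only the matching elements
words = sc.parallelize(["a b", "c"]).flatMap(lambda s: s.split())  # one element to many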

37 Actions
An action runs a computation on the RDD and returns a value to the driver program.
For example, reduce is an action: it aggregates all elements of the RDD using a function and returns the result to the driver program:
result = rdd.reduce(function)
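And a few common actions, again continuing the same sketch; each one triggers the computation and brings a value back to the driver program.

rdd = sc.parallelize([1, 2, 3, 4, 5])

total = rdd.reduce(lambda a, b: a + b)  # aggregate all elements -> 15
n = rdd.count()                         # number of elements -> 5
first = rdd.take(3)                     # first three elements -> [1, 2, 3]
everything = rdd.collect()              # bring the whole RDD back as a Python list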

38 SparkSQL and DataFrames
SparkSQL is the Spark module for structured data processing.
The DataFrame API is one of the ways to interact with SparkSQL.

39 DataFrames
A DataFrame is a distributed collection of data organized into named columns.
Similar to tables in relational databases.
Can be created from various sources: structured data files, Hive tables, external databases, CSV files, etc.
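A hedged PySpark sketch of creating a DataFrame from a CSV file and querying it through SparkSQL; the file path and the column names are made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-example").getOrCreate()

# read a CSV file into a DataFrame, inferring the column types from the data
df = spark.read.csv("hdfs:///data/people.csv", header=True, inferSchema=True)

# DataFrames can also be queried with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()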

40 Example operations on DataFrames
To show the content of the DataFrame: df.show()
To print the schema of the DataFrame: df.printSchema()
To select a column: df.select('columnName').show()
To filter by some parameter: df.filter(df['columnName'] > N).show()

