
Big Data Technology: Introduction to Hadoop


1 Big Data Technology: Introduction to Hadoop
Antonino Virgillito

2 Motivation
The main characteristic of Big Data is, well… to be "Big"
Intuitive definition: a size that "creates problems" when handled with ordinary tools and methods
However, the exact definition of "Big" is a moving target: where do we draw the line?
Big Data tools in IT were specifically tailored to handle the cases where common data handling tools fail for some reason
E.g. Google, Facebook…

3 Motivation
Large size that grows continuously and indefinitely
Difficult to define a storage size that will fit
Processing and querying huge data sets require a lot of memory and CPU
No matter how much you expand the technical specifications: if data is "Big" you eventually hit a limit…

4 Is Big Data Big in Official Statistics?
Do we really have to handle those massive dimensions? Think about the largest dataset you ever used…
Yes: the example of scanner data in Istat
Maybe: we should be ready when it happens
No: Big Data technology can still be useful for complex processing of «normal» data sets

5 Big Data Technology
Handling volume -> distributed platforms; the standard: Hadoop
Handling variety -> NoSQL databases

6 Hadoop
Open source platform for distributed processing of large data
Distributed: works on a cluster of servers
Functions: distribution of data and processing across machines; management of the cluster
Distribution is transparent to the programmer-analyst

7 Hadoop scalability
Hadoop reaches massive scalability by exploiting a simple distribution architecture and coordination model
Huge clusters can be built from (cheap) commodity hardware: a 1000-CPU machine would be much more expensive than 1000 single-CPU or 250 quad-core machines
Clusters can easily scale up with little or no modification to the programs

8 Hadoop Components
HDFS: Hadoop Distributed File System
Abstraction of a file system over a cluster
Stores large amounts of data by transparently spreading them over different machines
MapReduce
Simple programming model that enables parallel execution of data processing programs
Executes the work near the data it processes
In a nutshell: HDFS places the data on the cluster and MapReduce does the processing work

9 Hadoop Principle
Hadoop is basically a middleware platform that manages a cluster of machines
The core component is a distributed file system (HDFS)
Files in HDFS are split into blocks that are scattered over the cluster
The cluster can grow indefinitely simply by adding new nodes
[Figure: one big data set entering Hadoop HDFS and being split into blocks spread across the cluster]
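To make the block-scattering idea concrete, here is a toy Python sketch (it has nothing to do with Hadoop's real implementation; the block size, replication factor and node names are illustrative assumptions):

BLOCK_SIZE = 128 * 1024 * 1024   # bytes; HDFS blocks are typically 64-128 MB
REPLICATION = 3                  # HDFS keeps several replicas of each block

def place_blocks(file_size, nodes):
    # Split a file of file_size bytes into fixed-size blocks and assign
    # each block (plus its replicas) to nodes round-robin. Real HDFS
    # placement also accounts for racks, free space and load.
    n_blocks = -(-file_size // BLOCK_SIZE)   # ceiling division
    return {b: [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
            for b in range(n_blocks)}

print(place_blocks(400 * 1024 * 1024, ["node1", "node2", "node3", "node4"]))
# {0: ['node1', 'node2', 'node3'], 1: ['node2', 'node3', 'node4'], ...}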

10 MapReduce and Hadoop
MapReduce is logically placed on top of HDFS
[Figure: the Hadoop stack, with the MapReduce layer above the HDFS layer]

11 MapReduce and Hadoop
MR works on (big) files loaded on HDFS
Each node in the cluster executes the MR program in parallel, applying the map and reduce phases to the blocks it stores
Output is written to HDFS
Scalability principle: perform the computation where the data is
[Figure: an MR instance running on each node, on top of that node's HDFS blocks]

12 The MapReduce Paradigm
Parallel processing paradigm: the programmer is unaware of parallelism
Programs are structured into a two-phase execution
Map: data elements are classified into categories
Reduce: an algorithm is applied to all the elements of the same category
[Figure: elements mapped into three categories, then reduced to the counts x 4, x 5, x 3]
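As a minimal sketch of the paradigm (plain Python, not Hadoop's actual API), here is word counting expressed as a map phase, a grouping step, and a reduce phase:

from collections import defaultdict

def map_phase(line):
    # classify: emit (category, value) pairs; here the category is the word
    for word in line.split():
        yield word, 1

def reduce_phase(word, counts):
    # aggregate all the elements of the same category
    return word, sum(counts)

lines = ["big data big hadoop", "hadoop big"]
groups = defaultdict(list)
for line in lines:                        # in Hadoop, run in parallel per block
    for word, value in map_phase(line):
        groups[word].append(value)        # the framework groups by category
print([reduce_phase(w, c) for w, c in groups.items()])
# [('big', 3), ('data', 1), ('hadoop', 2)]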

13 Hadoop pros & cons
Good for:
Repetitive tasks on big size data
Not good for:
Replacing an RDBMS
Complex processing requiring various phases and/or iterations
Processing small to medium size data

14 Hadoop vs. RDBMS
Hadoop:
Is not transactional
Is not optimized for random access
Does not natively support data updates
Favors long-running, batch work
RDBMS:
Disk space is more expensive
Cannot scale indefinitely

15 Hadoop Distributions
Hadoop is an open source project promoted by the Apache Foundation
As such, it can be downloaded and used for free
However, all the configuration and maintenance of all the components must be done by the user, mainly with command-line tools
Software vendors provide Hadoop distributions that facilitate the use of the platform in various ways
Distributions are normally free, but support is paid for
Additional features: user interface, management console, installation tools

16 Common Hadoop Distributions
Hortonworks: completely open source; also has a Windows version. Used in: Big Data Sandbox
Cloudera: mostly standard Hadoop, but extended with proprietary components. Highlights: Cloudera Manager (console) and Impala (high-performance queries). Used in: Istat Big Data Platform

17 Tools for Data Analysis with Hadoop
[Figure: Pig, Hive and statistical software layered on top of Hadoop's MapReduce and HDFS]
Hive is treated only in the appendix

18 Hive
Hive is a SQL interface for Hadoop that facilitates queries of data on the file system and the analysis of large datasets stored in Hadoop
Hive provides a SQL-like language called HiveQL (well, it is SQL)
Due to its straightforward SQL-like interface, Hive is increasingly becoming the technology of choice for using Hadoop

19 Using Hive
Files in tabular format stored in HDFS can be represented as tables (sets of typed columns)
Tables are treated in the traditional way, like in a relational database
However, a query triggers one or more MapReduce jobs, so things can get slow…
All common SQL constructs can be used: joins, subqueries, functions
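The deck contains no Hive code, but as a minimal sketch of how such a query might be issued from Python, assuming the third-party PyHive client, default HiveServer2 settings, and an illustrative pages table:

from pyhive import hive   # third-party client, not part of the deck

conn = hive.Connection(host="localhost", port=10000)  # assumed defaults
cur = conn.cursor()
# Looks like ordinary SQL, but Hive compiles it into one or more
# MapReduce jobs behind the scenes, hence the batch-like latency.
cur.execute("""
    SELECT url, COUNT(*) AS views
    FROM pages
    GROUP BY url
    ORDER BY views DESC
    LIMIT 5
""")
print(cur.fetchall())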

20 Hive vs. RDBMS
Hive works on flat files and does not support indexes and transactions
Hive does not support updates and deletes: rows can only be added incrementally
A table is actually a directory in HDFS, so rows are inserted just by adding new files to the directory
In this sense, Hive works more as a data warehouse than as a DBMS

21 Pig
Tool for querying data on Hadoop clusters
Widely used in the Hadoop world: Yahoo! estimates that 50% of the Hadoop workload on its 100,000-CPU clusters is generated by Pig scripts
Lets you write data manipulation scripts in a high-level language called Pig Latin
Interpreted language: scripts are translated into MapReduce jobs
Mainly targeted at joins and aggregations

22 Pig: Motivations
Pig is another high-level interface to MapReduce
Scripts written in Pig Latin translate into MapReduce jobs
However, working in Pig is much simpler than writing native MapReduce programs

23 Pig Commands
Loading datasets from HDFS:
users = load 'Users.csv' using PigStorage(',') as (username: chararray, age: int);
pages = load 'Pages.csv' using PigStorage(',') as (username: chararray, url: chararray);

24 Pig Commands
Filtering data:
users_1825 = filter users by age >= 18 and age <= 25;

25 Pig Commands
Joining datasets:
joined = join users_1825 by username, pages by username;

26 Pig Commands
Grouping records:
grouped = group joined by url;
Creates a new dataset with elements named group and joined; there will be one record for each distinct url:
dump grouped;
( {(alice, 15), (bob, 18)})
( {(carol, 24), (alice, 14), (bob, 18)})

27 Pig Commands
Applying a function to the records of a dataset:
summed = foreach grouped generate group as url, COUNT(joined) as views;

28 Pig Commands
Sorting a dataset:
sorted = order summed by views desc;
Keeping only the first n rows:
top_5 = limit sorted 5;

29 Pig Commands
Writing a dataset to HDFS:
store top_5 into 'top5_sites.csv';

30 Word Count in Pig
A = load '/tmp/bible+shakes.nopunc';                              -- one line of text per record
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;  -- split each line into words
C = filter B by word matches '\\w+';                              -- keep only alphanumeric tokens
D = group C by word;                                              -- one group per distinct word
E = foreach D generate COUNT(C) as count, group as word;          -- count the words in each group
F = order E by count desc;                                        -- most frequent words first
store F into '/tmp/wc';                                           -- write the result to HDFS

31 Pig: User Defined Functions
There are times when Pig's built-in operators and functions will not suffice
Pig provides the ability to implement your own:
Filter. Ex: res = FILTER bag BY udfFilter(post);
Load function. Ex: res = LOAD 'file.txt' USING udfLoad();
Eval. Ex: res = FOREACH bag GENERATE udfEval($1);
Choice between several programming languages: Java, Python, JavaScript
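As a hedged illustration of the Python option, here is a minimal eval UDF sketch; the module name, function name and schema are invented for the example (Pig's Jython engine provides the outputSchema decorator when the script is registered):

# myudfs.py
@outputSchema("word:chararray")   # declares the Pig type of the return value
def normalize(s):
    # lower-case and trim a string field; pass nulls through unchanged
    return s.strip().lower() if s is not None else None

# In a Pig script it would then be registered and used roughly as:
#   REGISTER 'myudfs.py' USING jython AS myudfs;
#   res = FOREACH bag GENERATE myudfs.normalize($1);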

32 Hive vs. Pig
Hive:
Uses plain SQL, so it is straightforward to start with
Requires data to be in tabular format
Only allows single queries to be issued
Pig:
Requires learning a new language
Allows working on data with a free schema
Allows writing scripts with multiple processing steps
Both languages can be used for pre-processing and analysis

33 Interactive Querying in Hadoop
The response times of MapReduce are typically slow, which makes it unsuitable for interactive workloads
Hadoop distributions provide alternative solutions for querying data with low latency
Hortonworks: Hive-on-Tez
Cloudera: Impala
The idea is to bypass the MapReduce mechanism and avoid its high latency
Great advantage for aggregation queries
Plain Hive still makes sense for low-throughput data transformations

34 Using Hadoop from Statistical Software
R: packages rhdfs, rmr. Issue HDFS commands and write MapReduce jobs
SAS: SAS In-Memory Statistics; SAS/ACCESS makes data stored in Hadoop appear as native SAS datasets (uses the Hive interface)
SPSS: transparent integration with Hadoop data

35 RHadoop
Set of packages that allows integration of R with HDFS and MapReduce
Hadoop provides the storage while R brings the analysis
Just a library: not a special run-time, not a different language, not a special-purpose language
Incrementally port your code and use all packages
Requires R installed and configured on all nodes of the cluster

36 WordCount in R
wordcount = function(input, output = NULL, pattern = " ") {
  wc.map = function(., lines) {
    keyval(
      unlist(strsplit(x = lines, split = pattern)),
      1)}
  wc.reduce = function(word, counts) {
    keyval(word, sum(counts))}
  mapreduce(
    input = input,
    output = output,
    input.format = "text",
    map = wc.map,
    reduce = wc.reduce,
    combine = T)}
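Assuming the packages above are installed and Hadoop is configured on the cluster, the job might be launched roughly as follows (the paths are illustrative; from.dfs is the rmr helper that reads a result back into the R session):

library(rmr2)                                  # the deck's rmr package; named rmr2 in later releases
wordcount(input = "/tmp/bible+shakes.nopunc",  # same sample file as the Pig version
          output = "/tmp/wc-r")
results = from.dfs("/tmp/wc-r")                # read the (word, count) pairs back into R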

