Big Data Technology: Introduction to Hadoop

Big Data Technology: Introduction to Hadoop
Antonino Virgillito

Motivation
The main characteristic of Big Data is, mostly, to be… well… “Big”
Intuitive definition: a size that “creates problems” when handled with ordinary tools and methods
However, the exact definition of “Big” is a moving target: where do we draw the line?
Big Data tools in IT were specifically tailored to handle the cases where common data handling tools fail for some reason
E.g. Google, Facebook…

Motivation
Large size that grows continuously and indefinitely
Difficult to define a storage size that can fit it
Processing and querying huge data sets require a lot of memory and CPU
No matter how much you expand the technical specifications: if data is “Big” you eventually hit a ceiling…

Is Big Data Big in Official Statistics?
Do we really have to handle such massive volumes? Think about the largest dataset you ever used…
Yes: consider the example of scanner data in Istat
Maybe: we should be ready for when it happens
No: Big Data technology can still be useful for complex processing of “normal” data sets

Big Data Technology
Handling volume -> distributed platforms (the standard: Hadoop)
Handling variety -> NoSQL databases

Hadoop
Open source platform for distributed processing of large data sets
Distributed: works on a cluster of servers
Functions: distribution of data and processing across machines; management of the cluster
Distribution is transparent for the programmer-analyst

Hadoop scalability
Hadoop can reach massive scalability by exploiting a simple distribution architecture and coordination model
Huge clusters can be built out of (cheap) commodity hardware: a 1000-CPU machine would be much more expensive than 1000 single-CPU machines or 250 quad-core machines
Clusters can easily scale up with little or no modification to the programs

Hadoop Components
HDFS: Hadoop Distributed File System
An abstraction of a file system over a cluster
Stores large amounts of data by transparently spreading them over different machines
MapReduce
A simple programming model that enables the parallel execution of data processing programs
Executes the work on the data, near the data
In a nutshell: HDFS places the data on the cluster and MapReduce does the processing work

Hadoop Principle
Hadoop is basically a middleware platform that manages a cluster of machines
The core component is a distributed file system (HDFS)
Files in HDFS are split into blocks that are scattered over the cluster
The cluster can grow indefinitely simply by adding new nodes
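For example, with a typical block size of 128 MB, a 1 GB file is stored as 8 blocks, placed (and normally replicated) on different nodes of the cluster.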

MapReduce and Hadoop
MapReduce is logically placed on top of HDFS

MapReduce and Hadoop
MR works on (big) files loaded on HDFS
Each node in the cluster executes the MR program in parallel, applying the map and reduce phases to the blocks it stores
Output is written on HDFS
Scalability principle: perform the computation where the data is

The MapReduce Paradigm
Parallel processing paradigm: the programmer is unaware of the parallelism
Programs are structured into a two-phase execution: Map and Reduce
Map: data elements are classified into categories
Reduce: an algorithm is applied to all the elements of the same category
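A minimal sketch of the two phases, as a toy word count in plain R (no Hadoop involved; data and variable names are purely illustrative):

# Map phase: classify each data element into a category (here, the word itself),
# emitting one (key, value) pair per word with value 1
lines <- c("a rose is a rose", "a cat is not a rose")
words <- unlist(strsplit(lines, split = " "))
pairs <- data.frame(key = words, value = 1)

# Reduce phase: apply an algorithm (here, a sum) to all the elements of the same category
counts <- aggregate(value ~ key, data = pairs, FUN = sum)
print(counts)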

Hadoop pros & cons
Good for: repetitive tasks on big-size data
Not good for: replacing an RDBMS; complex processing requiring various phases and/or iterations; processing small to medium size data

Hadoop vs. RDBMS
Hadoop: is not transactional; is not optimized for random access; does not natively support data updates; favors long-running, batch work
RDBMS: disk space is more expensive; cannot scale indefinitely

Hadoop Distributions
Hadoop is an open source project promoted by the Apache Foundation
As such, it can be downloaded and used for free
However, the configuration and maintenance of all the components must be done by the user, mainly with command-line tools
Software vendors provide Hadoop distributions that facilitate the use of the platform in various ways
Distributions are normally free, but paid-for support is available
Additional features: user interface, management console, installation tools

Common Hadoop Distributions
Hortonworks: completely open source; also has a Windows version. Used in: Big Data Sandbox
Cloudera: mostly standard Hadoop, but extended with proprietary components. Highlights: Cloudera Manager (console) and Impala (high-performance querying). Used in: Istat Big Data Platform

Tools for Data Analysis with Hadoop
Pig, Hive and statistical software run on top of MapReduce and HDFS
Hive is treated only in the appendix

Hive
Hive is a SQL interface for Hadoop that facilitates queries of the data on the file system and the analysis of large datasets stored in Hadoop
Hive provides a SQL-like language called HiveQL (well, it is SQL)
Due to its straightforward SQL-like interface, Hive is increasingly becoming the technology of choice for using Hadoop

Using Hive
Files in tabular format stored in HDFS can be represented as tables: sets of typed columns
Tables are queried in the traditional way, like in a relational database
However, a query triggers one or more MapReduce jobs, so things can get slow…
All common SQL constructs can be used: joins, subqueries, functions
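A minimal sketch (table name, columns and HDFS path are hypothetical) of mapping a CSV file stored in HDFS to a table and querying it:

-- expose a CSV file in HDFS as a table of typed columns
CREATE EXTERNAL TABLE users (username STRING, age INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/users';

-- an ordinary SQL query; behind the scenes it runs as MapReduce jobs
SELECT age, COUNT(*) AS n_users
FROM users
GROUP BY age;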

Hive vs. RDBMS
Hive works on flat files and does not support indexes and transactions
Hive does not support updates and deletes: rows can only be added incrementally
A table is actually a directory in HDFS, so rows are inserted just by adding new files to the directory
In this sense, Hive works more as a data warehouse than as a DBMS
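For instance (the path is hypothetical), appending rows to the users table sketched above amounts to placing a new file in its directory:

-- moves the file into the table's HDFS directory; existing rows are untouched
LOAD DATA INPATH '/staging/new_users.csv' INTO TABLE users;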

Pig
Tool for querying data on Hadoop clusters
Widely used in the Hadoop world: Yahoo! estimates that 50% of the Hadoop workload on its 100,000-CPU clusters is generated by Pig scripts
Allows writing data manipulation scripts in a high-level language called Pig Latin
Interpreted language: scripts are translated into MapReduce jobs
Mainly targeted at joins and aggregations

Pig: Motivations
Pig is another high-level interface to MapReduce
Scripts written in Pig Latin translate into MapReduce jobs
However, working in Pig is much simpler than writing native MapReduce programs

Pig Commands
Loading datasets from HDFS:
users = load 'Users.csv' using PigStorage(',') as (username: chararray, age: int);
pages = load 'Pages.csv' using PigStorage(',') as (username: chararray, url: chararray);

Pig Commands
Filtering data:
users_1825 = filter users by age >= 18 and age <= 25;

Pig Commands
Joining datasets:
joined = join users_1825 by username, pages by username;

Pig Commands
Grouping records:
grouped = group joined by url;
Creates a new dataset with elements named group and joined; there will be one record for each distinct url:
dump grouped;
(www.twitter.com, {(alice, 21), (bob, 18)})
(www.facebook.com, {(carol, 24), (alice, 21), (bob, 18)})

Pig Commands
Applying a function to the records of a dataset:
summed = foreach grouped generate group as url, COUNT(joined) as views;

Pig Commands
Sorting a dataset:
sorted = order summed by views desc;
Keeping only the first n rows:
top_5 = limit sorted 5;

Pig Commands
Writing a dataset to HDFS:
store top_5 into 'top5_sites.csv';

Word Count in Pig
-- load the raw text; each record is one line
A = load '/tmp/bible+shakes.nopunc';
-- split each line into words, one word per record
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
-- keep only alphanumeric tokens
C = filter B by word matches '\\w+';
-- group identical words together
D = group C by word;
-- count the occurrences of each word
E = foreach D generate COUNT(C) as count, group as word;
-- sort by descending count
F = order E by count desc;
-- write the result to HDFS
store F into '/tmp/wc';

Pig: User Defined Functions
There are times when Pig’s built-in operators and functions will not suffice
Pig provides the ability to implement your own:
Filter, e.g.: res = FILTER bag BY udfFilter(post);
Load function, e.g.: res = load 'file.txt' using udfLoad();
Eval, e.g.: res = FOREACH bag GENERATE udfEval($1);
Choice between several programming languages: Java, Python, JavaScript
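A minimal sketch of an eval function written in Java (the package and class names are hypothetical); a UDF extends EvalFunc from the Pig API:

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical eval UDF that upper-cases its (string) argument
public class UPPER extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        return ((String) input.get(0)).toUpperCase();
    }
}

Once packaged into a jar, the function would be registered and invoked from a script, e.g.: register myudfs.jar; followed by upped = foreach users generate myudfs.UPPER(username);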

Hive vs. Pig
Hive: uses plain SQL, so it is straightforward to start with; requires data to be in tabular format; only allows single queries to be issued
Pig: requires learning a new language; allows working on data with a free schema; allows writing scripts with multiple processing steps
Both languages can be used for pre-processing and analysis

Interactive Querying in Hadoop
Response times of MapReduce are typically slow, which makes it unsuitable for interactive workloads
Hadoop distributions provide alternative solutions for querying data with low latency
Hortonworks: Hive-on-Tez
Cloudera: Impala
The idea is to bypass the MapReduce mechanism and avoid its high latency
Great advantage for aggregation queries
Plain Hive still makes sense for low-throughput data transformations

Using Hadoop from Statistical Software
R: the rhdfs and rmr packages issue HDFS commands and write MapReduce jobs
SAS: SAS In-Memory Statistics; SAS/ACCESS makes data stored in Hadoop appear as native SAS datasets (uses the Hive interface)
SPSS: transparent integration with Hadoop data

RHadoop
Set of packages that allows the integration of R with HDFS and MapReduce
Hadoop provides the storage while R brings the analysis
Just a library: not a special run-time, not a different language, not a special-purpose language
Incrementally port your code and use all your packages
Requires R to be installed and configured on all nodes of the cluster

WordCount in R
# word count with RHadoop's mapreduce() and keyval()
wordcount = function(input, output = NULL, pattern = " ") {
  # map: split each line into words and emit each word with a count of 1
  wc.map = function(., lines) {
    keyval(unlist(strsplit(x = lines, split = pattern)), 1)
  }
  # reduce: sum the counts collected for each word
  wc.reduce = function(word, counts) {
    keyval(word, sum(counts))
  }
  mapreduce(input = input, output = output,
            input.format = "text",
            map = wc.map, reduce = wc.reduce,
            combine = T)
}