An Open Source Project Commonly Used for Processing Big Data Sets

Presentation transcript:

Hadoop: An Open Source Project Commonly Used for Processing Big Data Sets
Two sources:
1) ACM webinar on Big Data with Hadoop, July 23, 2014.
2) Big Data and Its Technical Challenges, H. V. Jagadish et al., CACM, July 2014, Vol. 57, No. 7.
Copyright © 2014-2017 Curt Hill

Introduction
- An Apache open source project for distributed storage and distributed processing of large amounts of data
- Automatically replicates and distributes data over multiple nodes
- Executes jobs that process that data
  - Using MapReduce
- Tracks the progress and results of the multiple jobs
- Presumes that node failure is possible and needs to be handled
- A spreadsheet is roughly the limit of what can be analyzed by hand
- JSON = JavaScript Object Notation
- Metadata is data describing the format and purpose of data
Copyright © 2014-2017 Curt Hill

Ecosystem
- Original definition: a biological community of interacting organisms and their physical environment
- More recently: a complex network or interconnected system
- Any popular OS is an ecosystem
  - The Eclipse IDE is one also
  - Hadoop is as well
Copyright © 2014-2017 Curt Hill

Hadoop Pieces
- Common: utilities
- YARN: job scheduling and cluster management framework
- MapReduce: mechanism for parallel processing of large data sets
- HDFS: the Hadoop Distributed File System (a small usage sketch follows below)
Copyright © 2014-2017 Curt Hill
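As a concrete illustration of the HDFS piece, the sketch below uses Hadoop's Java FileSystem API to write a small file and read it back. The cluster address and path are assumptions made for this example; a real client would normally pick up fs.defaultFS from core-site.xml rather than setting it in code.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally supplied by core-site.xml on the classpath.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hello.txt");

        // Write a small file; HDFS replicates its blocks across DataNodes automatically.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back and copy its contents to stdout.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}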

MapReduce
- A software technique for processing large quantities of data over several processors
- Developed by Google
- Several steps (a minimal single-machine sketch follows below)
  - Map the data into key-value pairs
  - Shuffle the data onto the various nodes, grouping pairs by key
  - Reduce each group of values that share the same key
- Both the key and the data may be of any size and type
Copyright © 2014-2017 Curt Hill
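Before turning to the Hadoop API, the paradigm itself can be shown in a few lines of ordinary Java. This is an illustrative single-process sketch only; the sample text and class name are invented, and nothing here is distributed.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MiniMapReduce {
    public static void main(String[] args) {
        List<String> lines = List.of("to be or not to be", "brevity is the soul of wit");

        // Map: split each line into words, i.e. emit a (word, 1) pair per word.
        // Shuffle: group the pairs so all occurrences of a word land together.
        // Reduce: count (sum) the occurrences in each group.
        Map<String, Long> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))        // map
                .collect(Collectors.groupingBy(word -> word,               // shuffle/group
                                               Collectors.counting()));    // reduce

        counts.forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}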

Example
- The classic example is counting words in a very large collection of text
  - Consider Shakespeare's collected works
- The key would be the word itself
- The data could be as simple as the location of the word
  - Or as complicated as the play, act, scene, speaker, and line number
- If we use the latter, we may move from simple counts to much more complicated analysis
  - A mapper for the simple counting case is sketched below
Copyright © 2014-2017 Curt Hill
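Written against the Hadoop MapReduce Java API, the map script for simple word counting conventionally looks like the sketch below (class and field names are illustrative). It emits the word as the key and a count of 1 as the value; the richer play/act/scene record described above would simply use a different value type.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map step: for every word in a line of input, emit the pair (word, 1).
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // key = the word, value = a count of 1
        }
    }
}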

Workflow
- A typical system will chunk the input into pieces
- Each piece will be distributed to a machine
- A map script will be run on the pieces
  - On each node
- The shuffle or sort step will rearrange the pairs, directing each key to the proper node
- A reduce script will combine the rearranged mappings (see the reducer sketch below)
Copyright © 2014-2017 Curt Hill
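The reduce script that pairs with the mapper above could look like this sketch (again an illustrative name, using the same standard API). After the shuffle, all of the counts emitted for a given word arrive together, so the reducer only has to sum them.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce step: all values emitted for the same key (word) arrive together; sum them.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        result.set(sum);
        context.write(key, result);   // key = the word, value = its total count
    }
}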

MapReduce Picture Copyright © 2014-2017 Curt Hill

MapReduce vs. RDBMS

              RDBMS                       MapReduce
Data size     Gigabytes to terabytes      Petabytes to exabytes
Updates       Write many, read many       Write once, read many
Access type   Interactive and batch       Batch
Schema        Static                      Dynamic
Scaling       Worse than linear           Linear
Integrity     ACID (high)                 BASE (low)

Copyright © 2014-2017 Curt Hill

Hadoop Again
- Typically the map and reduce scripts are written in Java
  - Other languages are possible
- Each script may be written as if it were only to be executed on a single machine
- Hadoop handles the replication and task tracking (a driver sketch follows below)
Copyright © 2014-2017 Curt Hill
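A small driver program describes the job and hands it to Hadoop, which then splits the input, schedules the map and reduce tasks across the cluster, and re-runs any tasks that fail. This is a hedged sketch using the standard Job API; the class names refer to the illustrative mapper and reducer shown earlier, and the input and output paths come from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);  // optional local pre-aggregation
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input text in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The program is packaged into a jar and launched with the hadoop jar command; as the slide notes, each script is written as if it ran on a single machine.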

Scale Up Example
- Suppose that we have an RDBMS
  - Three servers that communicate with a SAN
  - Server to server via Ethernet
  - Server to SAN via fiber
- Very fast for what it can do
- Any number of problems can disable the whole thing
  - Communication between servers
  - Communication to the SAN
  - Disk failure in the SAN
Copyright © 2014-2017 Curt Hill

Scale Out Example
- Multiple servers
- Hadoop replicates (a configuration sketch follows below)
  - The data
  - The tasks accessing the data
- Any one or two failures may slow throughput, but the processing may still complete
- Because it lacks specialized, high-speed hardware, this will be slower than the previous setup
  - But perhaps more available
Copyright © 2014-2017 Curt Hill
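How much replication to use is just configuration. As an illustration (the value and path below are invented for the example), HDFS's dfs.replication property sets the default number of copies of each block, and a client can also change the replication of an existing file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");   // default replication for files this client creates

        FileSystem fs = FileSystem.get(conf);
        // Raise the replication of one existing file; the NameNode schedules the extra copies.
        fs.setReplication(new Path("/data/shakespeare.txt"), (short) 5);
        fs.close();
    }
}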

Apache Hadoop Projects
- Aside from the basic project, Apache has at least 11 projects in the Hadoop ecosystem
- Several scalable data stores
  - Cassandra, HBase, and Hive
- Several data-flow utilities
  - Pig is a high-level dataflow language
  - Tez is a data-flow programming framework
Copyright © 2014-2017 Curt Hill

Summary
- Hadoop is an open source system
- It replicates and distributes the data
- It uses map and reduce scripts to process the data
- It manages the clusters
Copyright © 2014-2017 Curt Hill