Charles Tappert Seidenberg School of CSIS, Pace University


Data Science and Big Data Analytics
Chapter 10: Advanced Analytics – Technology and Tools: MapReduce and Hadoop
Charles Tappert, Seidenberg School of CSIS, Pace University

Chapter Contents
10.1 Analytics for Unstructured Data
  10.1.1 Use Cases
  10.1.2 MapReduce
  10.1.3 Apache Hadoop
10.2 The Hadoop Ecosystem
  10.2.1 Pig
  10.2.2 Hive
  10.2.3 HBase
  10.2.4 Mahout
10.3 NoSQL
Summary

10.1 Analytics for Unstructured Data

This chapter presents some key technologies and tools related to the Apache Hadoop software library.
- Hadoop stores data in a distributed system
- Hadoop implements a simple programming paradigm known as MapReduce
- Tools of the Hadoop ecosystem: Pig, Hive, HBase, Mahout

10.1 Analytics for Unstructured Data
10.1.1 Use Cases

- IBM Watson – the Jeopardy!-playing machine. To educate Watson, Hadoop was utilized to process data sources such as encyclopedias, dictionaries, newswire feeds, literature, Wikipedia, etc.
- LinkedIn – a network of over 250 million users in 200 countries. Hadoop is used to process daily transaction logs, examine users' activities, feed extracted data back to production systems, restructure the data, and develop and test analytic models.
- Yahoo! – a large Hadoop deployment, used for search index creation and maintenance, webpage content optimization, spam filters, etc.

10.1 Analytics for Unstructured Data
10.1.2 MapReduce

The MapReduce paradigm breaks a large task into smaller tasks, runs the tasks in parallel, and consolidates the outputs of the individual tasks into the final output.
- Map: applies an operation to a piece of data and provides some intermediate output
- Reduce: consolidates the intermediate outputs from the map steps and provides the final output
Each step uses key/value pairs, denoted <key, value>, as input and output.

10.1 Analytics for Unstructured Data 10.1.2 MapReduce MapReduce word count example
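The word-count flow above can be sketched outside Hadoop as a small pure-Python simulation of the map, shuffle/sort, and reduce phases. The function names here are illustrative, not Hadoop APIs; in a real job each phase runs distributed across the cluster.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit an intermediate <word, 1> pair for every word seen
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle_sort(pairs):
    # Shuffle/sort: sort intermediate pairs by key and group them,
    # as Hadoop does between the map and reduce steps
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reduce_phase(grouped):
    # Reduce: consolidate each word's counts into a final <word, total> pair
    for word, group in grouped:
        yield (word, sum(count for _, count in group))

lines = ["the quick brown fox", "the lazy dog", "the end"]
counts = dict(reduce_phase(shuffle_sort(map_phase(lines))))
print(counts)  # 'the' maps to 3; every other word to 1
```

The same three-phase shape carries over directly to Hadoop, where the framework handles the sorting, grouping, and data movement between phases.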

10.1 Analytics for Unstructured Data
10.1.3 Apache Hadoop

MapReduce is a simple paradigm to understand but not easy to implement. Executing a MapReduce job requires:
- Jobs scheduled based on the system's workload
- Input data spread across a cluster of machines
- The map step spread across the distributed system
- Intermediate outputs collected and provided to the proper machines for the reduce step
- Final output made available to another user, another application, or another MapReduce job
The next few slides present an overview of the Hadoop environment.

10.1 Analytics for Unstructured Data
10.1.3 Apache Hadoop

Hadoop Distributed File System (HDFS): a file system that distributes data across a cluster to take advantage of the parallel processing of MapReduce. HDFS uses three Java daemons (background processes):
- NameNode – determines and tracks where the various blocks of data are stored
- DataNode – manages the data stored on each machine
- Secondary NameNode – performs some of the NameNode tasks to reduce the load on the NameNode
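As a rough illustration of the NameNode/DataNode division of labor, the toy sketch below splits a file into fixed-size blocks and records which (hypothetical) node holds each block. Real HDFS uses 64 MB or 128 MB blocks, replicates each block across nodes, and considers rack topology when placing them; this sketch omits all of that, and the class and node names are made up for illustration.

```python
BLOCK_SIZE = 16  # bytes, for illustration only; HDFS defaults are 64/128 MB

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # A file is stored as a sequence of fixed-size blocks
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

class ToyNameNode:
    # Tracks metadata only: which node holds each block of each file.
    # The blocks themselves live on the (toy) DataNodes.
    def __init__(self, nodes):
        self.nodes = nodes
        self.block_map = {}  # (filename, block index) -> node name

    def place(self, filename, blocks):
        for i, _ in enumerate(blocks):
            # Simple round-robin placement; real HDFS is replication-
            # and rack-aware
            self.block_map[(filename, i)] = self.nodes[i % len(self.nodes)]

data = b"x" * 40                       # a 40-byte "file"
blocks = split_into_blocks(data)       # three blocks: 16 + 16 + 8 bytes
nn = ToyNameNode(["datanode1", "datanode2", "datanode3"])
nn.place("file.txt", blocks)
```

The point of the sketch is the separation of concerns: clients ask the NameNode *where* data lives, then read the blocks from the DataNodes directly.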

10.1 Analytics for Unstructured Data
10.1.3 Apache Hadoop

Structuring a MapReduce Job in Hadoop: a typical MapReduce Java program has three classes:
- Driver – provides details such as input file locations, the names of the mapper and reducer classes, the location of the reduce class's output, etc.
- Mapper – provides the logic to process each data block
- Reducer – reduces the data provided by the mapper

10.1 Analytics for Unstructured Data 10.1.3 Apache Hadoop Structuring a MapReduce Job in Hadoop A file stored in HDFS

10.1 Analytics for Unstructured Data 10.1.3 Apache Hadoop Additional Considerations in Structuring a MapReduce Job Shuffle and Sort

10.1 Analytics for Unstructured Data 10.1.3 Apache Hadoop Using a combiner
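A combiner performs a local, map-side reduce so that fewer intermediate pairs cross the network during the shuffle. Assuming the word-count setting from earlier, the toy comparison below counts how many pairs a single mapper emits with and without a combiner (the function names are illustrative, not Hadoop APIs):

```python
from collections import Counter

def map_words(line):
    # Map: one <word, 1> pair per word occurrence
    return [(w, 1) for w in line.split()]

def combine(pairs):
    # Combiner: pre-aggregate this one mapper's output locally,
    # so each distinct word leaves the mapper only once
    return list(Counter(w for w, _ in pairs).items())

line = "to be or not to be"
raw = map_words(line)       # 6 intermediate pairs shuffled without a combiner
combined = combine(raw)     # 4 pairs shuffled with one: to, be, or, not
print(len(raw), len(combined))
```

A combiner is only safe when the reduce operation is associative and commutative (like summation here), because Hadoop may run it zero, one, or many times.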

10.1 Analytics for Unstructured Data 10.1.3 Apache Hadoop Using a custom partitioner
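The partitioner decides which reducer receives each intermediate key; Hadoop's default hashes the key modulo the number of reducers, and a custom partitioner overrides that routing. A minimal sketch of both, in plain Python rather than the Hadoop API:

```python
def default_partition(key, num_reducers):
    # Default behavior: hash the key, mod the reducer count, so all
    # pairs with the same key always reach the same reducer
    return hash(key) % num_reducers

def first_letter_partition(key, num_reducers):
    # A custom partitioner: route keys by first letter, e.g. to control
    # how output files are split across reduce tasks
    return (ord(key[0].lower()) - ord("a")) % num_reducers

pairs = [("apple", 1), ("banana", 1), ("avocado", 1)]
buckets = {}
for k, v in pairs:
    buckets.setdefault(first_letter_partition(k, 4), []).append((k, v))
# "apple" and "avocado" land on the same reducer; "banana" on another
```

Whatever the routing logic, the one invariant a partitioner must keep is that every pair with the same key goes to the same reducer.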

10.1 Analytics for Unstructured Data
10.1.3 Apache Hadoop

Developing and Executing a Hadoop MapReduce Program
- Common practice is to use an IDE tool such as Eclipse
- The MapReduce program consists of three Java files: the driver code, the map code, and the reduce code
- The Java code is compiled and stored in a JAR file, then executed against the specified HDFS input files

10.1 Analytics for Unstructured Data
10.1.3 Apache Hadoop

Yet Another Resource Negotiator (YARN)
- Hadoop continues to undergo development
- An important change was to separate the MapReduce functionality from the management of running jobs and of the distributed environment
- This rewrite is sometimes called MapReduce 2.0, or MRv2

10.2 The Hadoop Ecosystem

Tools have been developed to make Hadoop easier to use and to provide additional functionality and features:
- Pig – provides a high-level data-flow programming language
- Hive – provides SQL-like access
- HBase – provides real-time reads and writes
- Mahout – provides analytical tools

10.2 The Hadoop Ecosystem
10.2.1 Pig

Pig consists of:
- A data flow language called Pig Latin
- An environment to execute the Pig code
Example of Pig commands

10.2 The Hadoop Ecosystem 10.2.1 Pig – Built-in Pig Functions

10.2 The Hadoop Ecosystem 10.2.2 Hive The Hive language, Hive Query Language (HiveQL), resembles SQL rather than a scripting language Example Hive code

10.2 The Hadoop Ecosystem
10.2.3 HBase

Unlike Pig and Hive, which are intended for batch applications, HBase can provide real-time read and write access to huge datasets.
Example – choosing a shipping address at checkout

10.2 The Hadoop Ecosystem
10.2.4 Mahout

Mahout permits the application of analytical techniques within the Hadoop environment. Mahout provides Java code that implements the usual classification, clustering, and recommender/collaborative filtering algorithms.
Users can download Apache Hadoop directly from www.apache.org or use commercial packages; for example, the Pivotal company provides Pivotal HD Enterprise.

10.2 The Hadoop Ecosystem 10.2.4 Mahout Components of Pivotal HD Enterprise

10.3 NoSQL

NoSQL = Not only SQL (Structured Query Language)
Four major categories of NoSQL tools:
- Key/value stores – hold data (the value) accessed by a key
- Document stores – good when the value of the key/value pair is a file or document
- Column family stores – good for sparse datasets
- Graph stores – good for items and the relationships between them, e.g., social networks like Facebook and LinkedIn
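The simplest of the four categories, a key/value store, can be pictured as a dictionary whose values are opaque to the store: the only access path is the key. A toy in-memory sketch follows; real key/value stores add persistence, replication, and expiry, and the class and key names here are invented for illustration.

```python
class ToyKeyValueStore:
    # Values are opaque blobs to the store; lookups happen only by key
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = ToyKeyValueStore()
# A common convention is to encode structure into the key itself
store.put("user:42:cart", ["book", "pen"])
print(store.get("user:42:cart"))
```

The other three categories relax the "opaque value" constraint in different ways: document stores index inside the value, column family stores organize values into sparse columns, and graph stores make the relationships between keys first-class.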

10.3 NoSQL

Examples of NoSQL Data Stores