Processing Data using Amazon Elastic MapReduce and Apache Hive
Team Members: Frank Paladino, Aravind Yeluripiti

Project Goals
Creating an Amazon EC2 instance to process queries against large datasets using Hive
Choosing two public data sets available on Amazon
Running queries against these datasets
Comparing HiveQL and SQL

Definitions
Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
Amazon Elastic MapReduce is a web service that provides easy access to a Hadoop MapReduce cluster in the cloud.
Apache Hive is an open source data warehouse built on top of Hadoop that provides a SQL-like interface over MapReduce.

Starting an interactive Hive session
Register for an AWS account
Create a key pair in the EC2 console
Launch an EMR job flow to start an interactive Hive session using the key pair
Use the domain name and key pair to SSH into the master node of the Amazon EC2 cluster as the user "hadoop"
Start Hive
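Once the Hive prompt appears, a simple command confirms the session is working; this is a minimal sketch (not from the original slides), and on a fresh cluster the table list will simply be empty:
hive> SHOW TABLES;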

Data set 1: Disasters worldwide from 1900 to 2008
Available at: Amazon's public data sets
Info – Disaster data from 1900 – 2008, organized by start and end date, country (and sub-location), disaster type (and sub-type), disaster name, cost, and persons killed and affected by the disaster.
Details – Size: 451 kB (compressed), 1.5 MB (uncompressed)

Create table
CREATE EXTERNAL TABLE emdata (
  id string,          -- disaster identifier; assumed here because the queries below use count(distinct id), although it was missing from the original column listing
  start string,       -- start date
  ende string,        -- end date
  country string,
  locatione string,   -- sub-location within the country
  type string,
  subtype string,
  name string,
  killed string,
  cost string,
  affected string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://cmpt592-assignment5/input/';
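Before running the analysis queries, a quick sanity check confirms that the external table maps onto the underlying files; this is a minimal sketch (not from the original slides) and assumes the S3 location above is readable from the cluster:
SELECT country, type, killed FROM emdata LIMIT 5;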

Q1: Get the total number of disasters that occurred in each country
Query: SELECT count(distinct id), country FROM emdata GROUP BY country;

Q1: Output:

Q2: Get the number and type of disasters in a given country Query: SELECT count(distinct id), type FROM emdata WHERE country='Afghanistan' GROUP BY type;

Q2: Output:

Q3: Get the number, type, and subtype of disasters in a given country Query: SELECT count(distinct id), type, subtype FROM emdata WHERE country='Afghanistan' GROUP BY type, subtype;

Q3: Output:

Q4: Get the total casualties and type of disasters in a given country only when casualties > 100 Query: SELECT sum(killed), type FROM emdata WHERE country='Afghanistan' and killed>100 GROUP BY type;

Q4: Output:
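Note that the WHERE clause in Q4 filters individual disaster events with more than 100 deaths before summing. If the intent were instead to keep only disaster types whose summed casualties exceed 100, the filter would move to a HAVING clause on the aggregate; the following is a sketch of that variant (not from the original slides):
SELECT sum(killed), type FROM emdata WHERE country='Afghanistan' GROUP BY type HAVING sum(killed) > 100;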

Q5: Get the total casualties and name of country for a certain type of disaster when casualties > 500 Query: SELECT sum(killed), country FROM emdata WHERE type='Flood' and killed>500 GROUP BY country;

Q5: Output:

Q6: Get the total cost and name of country for a certain type of disaster when the cost exceeds a given threshold
Query: SELECT sum(cost), country FROM emdata WHERE type='Flood' and cost> GROUP BY country;

Q6: Output:

Data set 2: Google Books n-grams
Available at: s3://datasets.elasticmapreduce/ngrams/books/
Details – Size: 2.2 TB – Source: Google Books – Created On: January 5, :11 PM GMT – Last Updated: January 21, :12 AM GMT
Task: processing n-gram data using Amazon Elastic MapReduce and Apache Hive, calculating the top trending topics per decade

Data set 2: Google Books n-grams
N-grams are fixed size tuples of items. In this case the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words or characters.
The n-grams in this dataset were produced by passing a sliding window over the text of the books and outputting a record for each new token.
For example, the sentence "The yellow dog played fetch." would produce the following 2-grams: ["The", "yellow"] ["yellow", "dog"] ["dog", "played"] ["played", "fetch"] ["fetch", "."]
Or the following 3-grams: ["The", "yellow", "dog"] ["yellow", "dog", "played"] ["dog", "played", "fetch"] ["played", "fetch", "."]

Data set 2: Google Books n-grams
Two settings to efficiently process data from S3:
hive> set hive.base.inputformat=org.apache.hadoop.hive.ql.io.HiveInputFormat;
hive> set mapred.min.split.size=134217728;
These are used to tell Hive to use an InputFormat that will split a file into pieces for processing, and to tell it not to split them into pieces any smaller than 128 MB (134217728 bytes).

Creating input table
In order to process any data, first define the source of the data.
Data set used – English 1-grams
Rows: 472,764,897
Compressed Size: 4.8 GB

Creating input table
Statement to define the data:
CREATE EXTERNAL TABLE english_1grams (
  gram string,
  year int,
  occurrences bigint,
  pages bigint,
  books bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
LOCATION 's3://datasets.elasticmapreduce/ngrams/books/ /eng-all/1gram/';

Creating input table Output

Count the number of rows
Statement: SELECT count(*) FROM english_1grams;
Output

Normalizing the data
The data in its current form is very raw: it contains punctuation, numbers (typically years), and is case sensitive.
First we want to create a table to store the results of the normalization. In the process, we can also drop unnecessary columns.
Statement:
CREATE TABLE normalized (
  gram string,
  year int,
  occurrences bigint
);

Normalizing the data Output

Inserting data into normalized table
Read the raw data and insert it into the normalized table.
Statement:
INSERT OVERWRITE TABLE normalized
SELECT lower(gram), year, occurrences
FROM english_1grams
WHERE year >= 1890 AND gram REGEXP "^[A-Za-z+'-]+$";

Inserting data into normalized table Output

Finding word ratio by decade
More books are printed over time, so every word has a tendency to have more occurrences in later decades.
We only care about the relative usage of a word over time, so we want to ignore the change in the size of the corpus.
This can be done by finding the ratio of occurrences of the word to the total number of occurrences of all words in the same decade.

Finding word ratio by decade
Create a table to store this data.
Statement:
CREATE TABLE by_decade (
  gram string,
  decade int,
  ratio double
);
Calculate the total number of word occurrences by decade, then join this data with the normalized table in order to calculate the usage ratio.

Finding word ratio by decade
Statement:
INSERT OVERWRITE TABLE by_decade
SELECT a.gram, b.decade, sum(a.occurrences) / b.total
FROM normalized a
JOIN (
  SELECT substr(year, 0, 3) as decade, sum(occurrences) as total
  FROM normalized
  GROUP BY substr(year, 0, 3)
) b
ON substr(a.year, 0, 3) = b.decade
GROUP BY a.gram, b.decade, b.total;

Finding word ratio by decade Output
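With by_decade populated, a quick spot check of a single word's trajectory helps validate the ratios; this is a sketch (not from the original slides), and the word 'computer' is only an arbitrary example:
SELECT decade, ratio FROM by_decade WHERE gram = 'computer' ORDER BY decade;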

Calculating changes per decade
With a normalized dataset by decade we can get down to calculating changes by decade. This can be achieved by joining the dataset on itself.
We'll want to join rows where the n-grams are equal and the decade is off by one. This lets us compare ratios for a given n-gram from one decade to the next.

Calculating changes per decade
Statement:
SELECT
  a.gram as gram,
  a.decade as decade,
  a.ratio as ratio,
  a.ratio / b.ratio as increase
FROM by_decade a
JOIN by_decade b ON a.gram = b.gram and a.decade - 1 = b.decade
WHERE a.ratio > and a.decade >= 190
DISTRIBUTE BY decade
SORT BY decade ASC, increase DESC;
(Because decade holds the first three digits of the year, the condition a.decade >= 190 restricts results to the 1900s onward.)

Calculating changes per decade Output

Final output
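The trending-topics listing is only printed to the console. To keep the results, Hive can also write them to S3; the sketch below is not from the original slides and assumes a hypothetical output bucket and an illustrative ratio threshold:
INSERT OVERWRITE DIRECTORY 's3://my-output-bucket/trending-topics/'  -- hypothetical bucket
SELECT a.gram, a.decade, a.ratio, a.ratio / b.ratio as increase
FROM by_decade a
JOIN by_decade b ON a.gram = b.gram and a.decade - 1 = b.decade
WHERE a.ratio > 0.000001 and a.decade >= 190;  -- example threshold, chosen for illustration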

Hive (Hadoop) vs. SQL (RDBMS)
Hive (Hadoop)
– Intended for scaling to hundreds and thousands of machines.
– Optimized for full table scans; jobs incur substantial overhead in submission and scheduling.
– Best for data transformation, summarization, and analysis of large volumes of data, but not appropriate for applications requiring fast query response times.
– Read-based and therefore not appropriate for transaction processing requiring write operations.
SQL (RDBMS)
– Typically runs on a single large machine and does not provide support for executing map and reduce functions on the tables.
– RDBMS systems are best when referential integrity is required and frequent small updates are performed.
– Tables are indexed and cached so small amounts of data can be retrieved very quickly.

References
Disasters worldwide from 1900 – 2008 dataset
Finding trending topics using Google Books n-grams data and Apache Hive on Elastic MapReduce
Google Books Ngrams
Apache Hadoop
Amazon Elastic Compute Cloud
Getting Started Guide – URL: http://docs.aws.amazon.com/gettingstarted/latest/emr/getting-started-emr-tutorial.html

Acknowledgements Frank Paladino

Acknowledgements Dr Aparna Varde