Hive Big data for CSci 4707 students! Eric Atherton and Henry Hoang.


Using what we’ve learned with Big Data
So far, we have been working with small datasets.
What happens when we get to larger datasets? We are talking about the terabyte, maybe even the petabyte, scale!
Having a single computer process such a dataset is no longer plausible.
Hadoop to the rescue!

What is Hadoop?
Imagine you are hired to paint Bill Gates’ house.
You gather a bunch of friends to help, reducing the amount of work any one person has to do.
This technique is called MapReduce.
Hadoop: a Java-based framework that implements MapReduce.
But wait… Java? We’ve been using SQL so far.
Maybe we can’t use Hadoop given what we’ve learned so far...
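
The painting analogy can be made concrete with a small word count, the usual first MapReduce example. This is a minimal single-machine sketch in Python, not Hadoop's Java API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and the loop stands in for work that a real cluster would spread across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: combine all values for one key into a single result."""
    return word, sum(counts)


def word_count(documents):
    # On a cluster each document could be mapped on a different machine;
    # this sketch simply loops over them in one process.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hive runs on hadoop", "hadoop runs mapreduce"]))
```

Each friend painting one wall corresponds to one call of map_phase; collecting the results per key and summing them is the reduce step.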

Hive to the rescue!
A higher-level interface built on top of Hadoop
Solves some of the issues users were facing with Hadoop:
  A more user-friendly language (Hive’s Query Language: HiveQL)
  Allows users to plug in custom scripts
  Native support for complex data types
  Compiler optimizations

Hive Query Language (HiveQL)
Based on SQL
Quickly adopted by many users
  SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [HAVING having_condition]
  [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
  [LIMIT number];
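
To see how the SQL-style clauses in this grammar fit together, here is a small runnable sketch. It uses Python's built-in sqlite3 module as a stand-in, not Hive itself, so it only exercises the clauses the two share (WHERE, GROUP BY, HAVING, LIMIT); the Hive-specific CLUSTER BY, DISTRIBUTE BY, and SORT BY are not shown. The employees table, its columns, and the top_departments function are made up for illustration.

```python
import sqlite3


def top_departments(rows, min_salary, min_headcount, limit):
    """Count employees per department using WHERE, GROUP BY, HAVING, LIMIT."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)
    # ORDER BY is added only so the LIMIT result is deterministic here.
    query = """
        SELECT dept, COUNT(*) AS headcount
        FROM employees
        WHERE salary >= ?
        GROUP BY dept
        HAVING COUNT(*) >= ?
        ORDER BY headcount DESC, dept
        LIMIT ?
    """
    result = conn.execute(query, (min_salary, min_headcount, limit)).fetchall()
    conn.close()
    return result


SAMPLE_ROWS = [
    ("ann", "eng", 120), ("bob", "eng", 100), ("fay", "eng", 30),
    ("cat", "ops", 90), ("dan", "ops", 40), ("eve", "hr", 95),
]

if __name__ == "__main__":
    print(top_departments(SAMPLE_ROWS, min_salary=50, min_headcount=1, limit=2))
```

WHERE filters individual rows before grouping, while HAVING filters whole groups afterwards, which is why the two conditions appear in different places in the grammar.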

Scripts
Scripts can be written in any language.
Allows for a more sophisticated analysis of data.
  FROM (
    MAP doctext USING 'python wc_mapper.py' AS (word, count)
    FROM docs
    CLUSTER BY word
  ) a
  REDUCE word, count USING 'python wc_reduce.py';
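
The slide names wc_mapper.py and wc_reduce.py but does not show them. The following is a guess at what such scripts might contain, written as two plain functions so they can be tried without a cluster. It assumes Hive's streaming convention of passing rows to the script as tab-separated lines and reading tab-separated lines back, and it assumes CLUSTER BY word delivers equal words to the reducer on adjacent lines. The names map_lines and reduce_lines are hypothetical.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper sketch: each input line is one doctext value.

    Emits one 'word<TAB>1' line per word.
    """
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer sketch: input lines are 'word<TAB>count'.

    Relies on equal words arriving next to each other, which is what
    CLUSTER BY word is assumed to guarantee.
    """
    parsed = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for word, group in groupby(parsed, key=lambda cols: cols[0]):
        total = sum(int(cols[1]) for cols in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    docs = ["Hive on Hadoop\n", "hadoop\n"]
    # sorted() stands in for the clustering Hive would do between the steps.
    for out in reduce_lines(sorted(map_lines(docs))):
        print(out)
```

As standalone scripts, each function would be fed sys.stdin and its results printed to standard output, one line per row.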

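Complex Data Types
Data structures:
  ARRAY (an ordered list of elements of the same type)
  MAP (a set of key-value pairs)
  STRUCT (a record with named fields)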
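
For readers who know Python better than Hive, the three complex types map roughly onto familiar structures. This is only an analogy, not Hive code: the UserRow and Address classes and their fields are invented for illustration, and the comments show the corresponding HiveQL access syntax.

```python
from typing import Dict, List, NamedTuple


class Address(NamedTuple):
    """Stands in for a column of type STRUCT<city:STRING, zip_code:STRING>."""
    city: str
    zip_code: str


class UserRow(NamedTuple):
    """One hypothetical table row with a column of each complex type."""
    name: str
    emails: List[str]        # ARRAY<STRING>
    scores: Dict[str, int]   # MAP<STRING, INT>
    address: Address         # STRUCT


row = UserRow(
    name="ann",
    emails=["ann@example.com", "a@example.org"],
    scores={"math": 90, "art": 75},
    address=Address(city="Minneapolis", zip_code="55455"),
)

first_email = row.emails[0]     # HiveQL: emails[0]
math_score = row.scores["math"]  # HiveQL: scores['math']
city = row.address.city          # HiveQL: address.city

if __name__ == "__main__":
    print(first_email, math_score, city)
```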

Compiler Optimizations
Uses a DAG to determine the optimal execution plan
  Column pruning
  Predicate pushdown
  Partition pruning
  Map-side joins
  Join reordering
  User hints!
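
Of the optimizations listed, the map-side join is easy to picture in code. The idea, sketched here under simplifying assumptions, is that when one table is small enough to fit in memory, every mapper can hold a copy of it as a hash table and join while scanning the large table, so no shuffle or reduce step is needed. The function name and the tuple layout are hypothetical; this is not how Hive implements it internally.

```python
def map_side_join(big_rows, small_table, key_index=0):
    """Inner-join big_rows against small_table on the first column.

    small_table is loaded into an in-memory dict once; each big row is
    then matched with a single lookup, with no shuffle or reduce step.
    Assumes keys in small_table are unique.
    """
    lookup = {row[0]: row[1:] for row in small_table}
    for row in big_rows:
        match = lookup.get(row[key_index])
        if match is not None:
            yield row + match


if __name__ == "__main__":
    orders = [(1, "book"), (2, "pen"), (3, "lamp")]  # the "big" table
    users = [(1, "ann"), (2, "bob")]                 # the small table
    for joined in map_side_join(orders, users):
        print(joined)
```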

Some Pitfalls of Hive
Queries can be slow due to high latency, even with small datasets (e.g. a few hundred megabytes!).
Not designed for real-time queries or online transactions.
Follows the “write once, read many times” philosophy.
  Hive did not begin supporting INSERT, UPDATE, and DELETE until about 5 years after it was developed.
Is the data already loaded or not?
  DBMSs are substantially faster than MapReduce (MR) systems when data is preloaded.
A tradeoff between power and simplicity.

Questions?