Published by Alice Farmer. Modified over 8 years ago.
1
Hive
Big data for CSci 4707 students!
Eric Atherton and Henry Hoang
2
Using what we’ve learned with Big Data
So far, we have been working with small datasets
What happens when we get to larger datasets?
  We are talking about terabyte, maybe even petabyte, scale!
  Processing such a dataset on a single computer is no longer feasible
Hadoop to the rescue!
3
What is Hadoop?
Imagine you are hired to paint Bill Gates’ house
  You gather a bunch of friends so that no single person has to do all the work
  This divide-the-work-and-combine-the-results technique is called MapReduce
Hadoop
  A Java-based framework that implements MapReduce
But wait… Java? We’ve been using SQL so far
  Maybe we can’t use Hadoop given what we’ve learned so far...
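The painting analogy can be made concrete with the classic word-count example. The following is a minimal single-machine sketch in Python, not Hadoop's actual Java API: the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, and the loop stands in for work that a real cluster would spread across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped by a different "friend" (machine);
    # this sketch simply loops over them in one process.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is big", "hive makes big data easy"])
```

Here `counts` maps each word to its total across both documents. The map and reduce steps never share state, which is what lets a framework run them in parallel.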
4
Hive to the rescue!
A higher-level interface built on top of Hadoop
Solves some of the issues users were facing with Hadoop
  A more user-friendly language (Hive’s Query Language: HiveQL)
  Allows users to “plug in” custom scripts
  Native support for “complex data types”
  Compiler optimizations
5
Hive Query Language (HiveQL)
Based on SQL
Quickly adopted by many users

    SELECT [ALL | DISTINCT] select_expr, select_expr, ...
    FROM table_reference
    [WHERE where_condition]
    [GROUP BY col_list]
    [HAVING having_condition]
    [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
    [LIMIT number];
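Most of the clauses above are familiar from SQL; DISTRIBUTE BY, SORT BY, and CLUSTER BY are the Hive-specific ones. The Python sketch below simulates what they mean for rows travelling to reducers. It is an illustration only: the helper names are hypothetical, and `zlib.crc32` is a stand-in for whatever hash function Hive uses internally.

```python
import zlib


def distribute_by(rows, key, num_reducers):
    """Send each row to a reducer chosen by hashing its key column."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        # crc32 is an assumed stand-in hash; it is deterministic across runs.
        index = zlib.crc32(str(row[key]).encode("utf-8")) % num_reducers
        buckets[index].append(row)
    return buckets


def sort_by(buckets, key):
    """Sort rows within each reducer; no global order is implied."""
    return [sorted(bucket, key=lambda row: row[key]) for bucket in buckets]


def cluster_by(rows, key, num_reducers):
    """CLUSTER BY col behaves like DISTRIBUTE BY col followed by SORT BY col."""
    return sort_by(distribute_by(rows, key, num_reducers), key)


words = ["hive", "big", "data", "big", "hive", "sql"]
rows = [{"word": w, "count": 1} for w in words]
reducers = cluster_by(rows, "word", num_reducers=2)
```

All rows with the same `word` land on the same reducer and sit next to each other there, which is what a reduce script needs in order to aggregate them in one pass.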
6
Scripts
Scripts can be written in any language
Allows for a more sophisticated analysis of data

    FROM (
      MAP doctext USING 'python wc_mapper.py' AS (word, count)
      FROM docs
      CLUSTER BY word
    ) a
    REDUCE word, count USING 'python wc_reduce.py';
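The slides do not show the contents of `wc_mapper.py` and `wc_reduce.py`, so the following is only a guess at what such scripts might contain. Hive passes rows to a script on standard input as tab-separated text and reads tab-separated rows back from standard output. To keep the sketch testable, the logic is written as two hypothetical functions over iterables of lines; a real script would loop over `sys.stdin` and `print` each result.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper logic: one input row per document, one output row per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(lines):
    """Reducer logic: sum the counts for each run of identical words.

    Relies on CLUSTER BY word having placed equal words on adjacent rows.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(rows, key=lambda row: row[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# sorted() stands in for the CLUSTER BY step between the two scripts.
mapped = sorted(map_lines(["big data is big", "hive is big"]))
reduced = list(reduce_lines(mapped))
```

Because the reducer only compares neighbouring rows, it never needs to hold the whole dataset in memory, but it would give wrong totals if its input were not clustered by word.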
7
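Data structures
Complex Data Types
  ARRAY: an ordered list of elements of the same type
  MAP: a collection of key-value pairs
  STRUCT: a record with named fields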
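For readers more used to a general-purpose language, the three complex types correspond roughly to a list, a dictionary, and a record. The Python sketch below shows one row that uses all three, with the equivalent HiveQL access syntax in comments. The `Student` and `Address` types and all the values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Address:  # plays the role of STRUCT<city:STRING, zip_code:STRING>
    city: str
    zip_code: str


@dataclass
class Student:  # hypothetical table row
    name: str
    courses: List[str]        # ARRAY<STRING>
    grades: Dict[str, float]  # MAP<STRING, DOUBLE>
    address: Address          # STRUCT


row = Student(
    name="Pat",
    courses=["databases", "algorithms"],
    grades={"databases": 3.7},
    address=Address(city="Minneapolis", zip_code="55455"),
)

first_course = row.courses[0]    # HiveQL: courses[0]  (arrays are zero-indexed)
grade = row.grades["databases"]  # HiveQL: grades['databases']
city = row.address.city          # HiveQL: address.city
```

Storing such nested values in a single column avoids the extra tables and joins that a strictly flat relational schema would need.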
8
Compiler Optimizations
Uses a DAG to determine the optimal execution plan
  Column pruning
  Predicate pushdown
  Partition pruning
  Map-side joins
  Join reordering
User hints!
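As one example from the list above, a map-side join avoids the shuffle and reduce phases when one of the two tables is small enough to fit in memory. The Python sketch below illustrates the idea rather than Hive's implementation. The function and table names are hypothetical, and it assumes the join key is unique in the small table.

```python
def map_side_join(big_rows, small_rows, key):
    """Inner join performed entirely in the mapper, with no shuffle."""
    # The small table is loaded into an in-memory lookup once per mapper.
    lookup = {row[key]: row for row in small_rows}
    # The big table is streamed row by row and never fully loaded.
    for row in big_rows:
        match = lookup.get(row[key])
        if match is not None:  # unmatched rows are dropped (inner join)
            yield {**match, **row}


departments = [{"dept_id": 1, "dept": "CS"}, {"dept_id": 2, "dept": "Math"}]
enrollments = [
    {"student": "a", "dept_id": 1},
    {"student": "b", "dept_id": 2},
    {"student": "c", "dept_id": 3},  # no matching department
]
joined = list(map_side_join(enrollments, departments, "dept_id"))
```

Each mapper can produce its share of the joined output independently, so nothing has to be sorted or sent across the network to a reducer.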
9
Some Pitfalls of Hive
Queries can be slow because of high latency
  Even with small datasets (e.g. a few hundred megabytes!)
Not designed for real-time queries and online transactions
  Follows the “write once, read many times” philosophy
  It was not until ~5 years after Hive was developed that it began supporting INSERT, DELETE, and UPDATE
Is the data already loaded or not?
  DBMSs are substantially faster than MapReduce (MR) systems when data is preloaded
A tradeoff between power and simplicity
12
Questions?