
Data Analytics
김재윤, 이성민 (team leader), 이용현, 최찬희, 하승도

Contents Part I
1. Introduction
   - Data Analytics Cases
   - What is Data Analytics?
   - OLTP, OLAP
   - ROLAP
   - MOLAP
   - Column store
2. Applications & Data Management System
   - MATLAB
   - R
   - Impala
   - Splunk
   - HANA
3. Research Trend
   - Hyper (ICDE 2011): Combination of OLTP & OLAP
   - Starfish (CIDR 2011)
   - Crisis Informatics (ICSE 2011)

Contents Part II
4. Demo 1: Statistical Computing
   - MATLAB
   - R
5. Demo 2: Analytics on Hadoop
   - Pig
   - Hive
   - Impala
6. Demo 3: Real-time Analytics
   - Splunk

Overview
Columns: Statistical Analytics, Query Processing, Time-series Analytics, Data Visualization, Open Source
- MATLAB: O X O O X
- R: O X O O O
- Impala: O X O
- Splunk: X O O O X
- HANA: O X X

Demo 1: Statistical Computing
- MATLAB
- R

MATLAB
Engineering software that provides a numerical analytics environment
- Matrix manipulation
- Plotting of functions and data
- Implementation of algorithms
- Creation of user interfaces
- Interfaces with C, C++, Java, Fortran, and Python

MATLAB: Interface

MATLAB: Too slow to manage large data

MATLAB: Code example

Demo: Plot

Demo: Data Linking

Demo: Regression

Demo: Polynomial Fitting

R
Programming language for statistical computing and graphics
- Widely used among statisticians and data analysts
- Runs on Windows, Mac, and Linux
- Free to use
- Easily extensible through functions
- Provides statistical techniques
- Provides high-quality graphical techniques
- Many third-party libraries

R: R Language Example

R: RStudio Example

R: Vector Example

R: Matrix Example

R: Scatter Plot & Visualization Example

R: Plentiful Libraries Example

R: Heatmap Example

R: Line Graph Example

R: Linear Regression Example
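
The R code behind the regression demo is not part of this transcript. As a rough stand-in, the following minimal Python sketch fits a straight line by ordinary least squares, which is the calculation a simple linear regression performs. The function name `fit_line` and the sample data are hypothetical and are not taken from the demo.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Hypothetical sample points lying exactly on y = 2x + 1.
    slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
    print(slope, intercept)
```

The closed-form solution is used here because it needs no external libraries; a statistics package would normally also report residuals and confidence intervals.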

Demo 2: Analytics on Hadoop
- Pig
- Hive
- Impala

Analytics on Hadoop
1. Mapper and Reducer programs
   - Write Java programs to analyze data stored in HDFS
2. SQL-like queries
   - Write queries in a high-level, SQL-like language, as with Oracle or MySQL

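Analytics on Hadoop
1. Mapper and Reducer for word count

    import java.io.IOException;
    import java.util.regex.Pattern;

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class WordCount extends Configured implements Tool {

      public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new WordCount(), args);
        System.exit(res);
      }

      // Configure the job: args[0] is the input path, args[1] the output path.
      @Override
      public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "wordcount");
        job.setJarByClass(this.getClass());
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        return job.waitForCompletion(true) ? 0 : 1;
      }

      // Mapper: emit (word, 1) for every word in the input line.
      public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private static final Pattern WORD_BOUNDARY = Pattern.compile("\\s*\\b\\s*");

        @Override
        public void map(LongWritable offset, Text lineText, Context context)
            throws IOException, InterruptedException {
          String line = lineText.toString();
          for (String word : WORD_BOUNDARY.split(line)) {
            if (word.isEmpty()) {
              continue;
            }
            context.write(new Text(word), one);
          }
        }
      }

      // Reducer: sum the counts collected for each word.
      public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable count : counts) {
            sum += count.get();
          }
          context.write(word, new IntWritable(sum));
        }
      }
    }

Ref.: Cloudera Hadoop Tutorial [1]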
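
To make the mapper and reducer roles concrete without a Hadoop cluster, here is a minimal in-memory sketch in Python. It is not the Hadoop API: `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical names, and tokenizing with `\w+` is a simplification of the word-boundary pattern used in the Java code.

```python
import re
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in re.findall(r"\w+", line)]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce over a list of input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data analytics"]))
```

In a real job the map and reduce calls run on different nodes and the shuffle moves data across the network; the sketch only shows how the three phases hand data to each other.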

Analytics on Hadoop
2. SQL-like queries for word count

    CREATE TABLE doc (text STRING);

    LOAD DATA LOCAL INPATH '/home/Documents/sentiment/Wikipedia.txt'
    OVERWRITE INTO TABLE doc;

    -- Hive requires an alias (w) for a subquery in the FROM clause.
    SELECT word, COUNT(*)
    FROM (SELECT explode(split(text, ' ')) AS word FROM doc) w
    GROUP BY word;

SQL on Hadoop
1. Pig
   - Its SQL-like scripting language is called Pig Latin
   - Scripts are translated into MapReduce jobs automatically
2. Hive
   - Its SQL-like query language is called HiveQL (HQL)
   - Queries are also translated into MapReduce jobs
3. Impala
   - Supports most of HiveQL, plus additional statements
   - Uses its own distributed daemons (impalad) instead of MapReduce
All three let users write complex data transformations without knowing Java.

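SQL on Hadoop: Pig, Hive, and Impala compared
- Released year:
- Development language: Java (Pig, Hive); C++ (Impala)
- SQL: Pig Latin (Pig); HiveQL (Hive, Impala)
- Query processing: tuple-at-a-time on MapReduce (Pig, Hive); block-at-a-time on impalad (Impala)
- ODBC/JDBC: Yes
- Latency: High (Pig, Hive); Low (Impala)
- Suitable jobs: Batch (Pig, Hive); Real-time (Impala)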
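
The query-processing entry contrasts two execution models. The following Python sketch is a toy illustration of that difference only; it does not reflect how Pig, Hive, or Impala are actually implemented, and the function names and the `cost` and `qoy` fields are hypothetical.

```python
from itertools import islice


def tuple_at_a_time(rows, predicate):
    """Push each row through filter and aggregate one at a time."""
    total = 0
    for row in rows:
        if predicate(row):
            total += row["cost"]
    return total


def blocks(rows, block_size):
    """Split an iterable of rows into lists of at most block_size rows."""
    it = iter(rows)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block


def block_at_a_time(rows, predicate, block_size=1024):
    """Filter and aggregate a whole block of rows per operator call."""
    total = 0
    for block in blocks(rows, block_size):
        selected = [row["cost"] for row in block if predicate(row)]
        total += sum(selected)
    return total


if __name__ == "__main__":
    rows = [{"cost": i, "qoy": 1 + i % 4} for i in range(10)]
    first_half = lambda row: row["qoy"] < 3
    print(tuple_at_a_time(rows, first_half))
    print(block_at_a_time(rows, first_half, block_size=3))
```

Both functions return the same result; the block version amortizes per-call overhead across many rows, which is the usual argument for block-oriented engines.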

Benchmark: System Environment
- Cluster: 13 nodes (1 master + 12 slaves)
- CPU: Intel i5
- Memory: 32.0 GB per node
- HDD: 5.0 TB per node
- OS: Ubuntu
- Hadoop: 2.3.0
- Pig:
- Hive:
- Impala: 2.1.1

Benchmark: Data Set
- 1 GB of randomly generated sales transactions from TPC-DS

Tables:
- Store_Sales: Date_FK, Customer_FK, Item_FK, number, cost, whole_cost, tax
- Date_Dim: Date_FK, quarter, day, month, year
- Item: Item_FK, color, company
- Customer: Customer_FK, name, salutation, country

Benchmark Query 1: Average sales cost in the first half of the year (Hive & Impala)

    SELECT AVG(ss.ss_ext_wholesale_cost)
    FROM date_dim AS d, store_sales AS ss
    WHERE d.d_date_sk = ss.ss_sold_date_sk
      AND d.d_qoy < 3;

Query types: Aggregation, Join, Range, Point, Rank

Benchmark Query 1: Average sales cost in the first half of the year (Pig)

    ss = LOAD '/user/user01/store_sales.csv' USING PigStorage(',')
         AS (ss_sold_date_sk:chararray, …, ss_net_profit:int);
    d = LOAD '/user/user01/date_dim.csv' USING PigStorage(',')
        AS (d_date_sk:chararray, …, d_current_year:int);
    metadata = JOIN ss BY ss_sold_date_sk, d BY d_date_sk;
    result = FILTER metadata BY d_qoy < 3;
    grouped = GROUP result ALL;
    avg_sales = FOREACH grouped GENERATE AVG(result.ss_ext_wholesale_cost);
    STORE avg_sales INTO 'query1.txt';
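
Query 1 can be tried on a laptop before running it on the cluster. The sketch below uses Python's built-in `sqlite3` module with a tiny, hand-made sample; SQLite is only a stand-in for Hive or Impala, the rows are invented for illustration, and the tables carry just the columns the query touches.

```python
import sqlite3


def average_first_half_cost():
    """Run benchmark Query 1 against a tiny in-memory sample."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE date_dim (d_date_sk INTEGER PRIMARY KEY, d_qoy INTEGER);
        CREATE TABLE store_sales (ss_sold_date_sk INTEGER,
                                  ss_ext_wholesale_cost REAL);
        -- Hypothetical sample rows: dates 1 and 2 fall in quarters 1 and 2.
        INSERT INTO date_dim VALUES (1, 1), (2, 2), (3, 3);
        INSERT INTO store_sales VALUES (1, 10.0), (2, 20.0), (3, 90.0), (1, 30.0);
        """
    )
    query = """
        SELECT AVG(ss.ss_ext_wholesale_cost)
        FROM date_dim AS d, store_sales AS ss
        WHERE d.d_date_sk = ss.ss_sold_date_sk
          AND d.d_qoy < 3
    """
    (avg_cost,) = conn.execute(query).fetchone()
    conn.close()
    return avg_cost


if __name__ == "__main__":
    print(average_first_half_cost())
```

Only the three sales dated in quarters 1 and 2 (10.0, 20.0, 30.0) pass the range filter, so the join, filter, and aggregate steps can be checked by hand.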

Benchmark Query 2: Average sales cost on Sundays (Hive & Impala)

    SELECT AVG(s.ss_ext_wholesale_cost)
    FROM store_sales AS s, date_dim AS d
    WHERE d.d_date_sk = s.ss_sold_date_sk
      AND d.d_day_name LIKE 'Sunday';

Query types: Aggregation, Join, Range, Point, Rank

Benchmark Query 2: Average sales cost on Sundays (Pig)

    ss = LOAD '/user/user01/store_sales.csv' USING PigStorage(',')
         AS (ss_sold_date_sk:chararray, …, ss_net_profit:int);
    d = LOAD '/user/user01/date_dim.csv' USING PigStorage(',')
        AS (d_date_sk:chararray, …, d_current_year:int);
    metadata = JOIN ss BY ss_sold_date_sk, d BY d_date_sk;
    result = FILTER metadata BY d_day_name == 'Sunday';
    grouped = GROUP result ALL;
    avg_sales = FOREACH grouped GENERATE AVG(result.ss_ext_wholesale_cost);
    STORE avg_sales INTO 'query2.txt';

Benchmark Query 3: Bottom 20 customer birth countries by average sales cost on Sundays (Hive & Impala)

    SELECT c.c_birth_country, AVG(ss.ss_ext_wholesale_cost) AS avg_sales
    FROM store_sales AS ss, customer AS c, date_dim AS d
    WHERE c.c_customer_sk = ss.ss_customer_sk
      AND d.d_date_sk = ss.ss_sold_date_sk
      AND d.d_day_name LIKE 'Sunday'
      AND c.c_birth_country != ''
    GROUP BY c.c_birth_country
    ORDER BY avg_sales
    LIMIT 20;

Query types: Aggregation, Join, Range, Point, Rank

Benchmark Query 3: Bottom 20 customer birth countries by average sales cost on Sundays (Pig)

    ss = LOAD '/user/user01/store_sales.csv' USING PigStorage(',')
         AS (ss_sold_date_sk:chararray, …, ss_net_profit:int);
    d = LOAD '/user/user01/date_dim.csv' USING PigStorage(',')
        AS (d_date_sk:chararray, …, d_current_year:int);
    c = LOAD '/user/user01/customer.csv' USING PigStorage(',')
        AS (c_customer_sk:chararray, …, c_last_review_date:int);
    metadata = JOIN ss BY ss_sold_date_sk, d BY d_date_sk;
    -- Join the already-joined relation with customer so the date columns stay available.
    metadata2 = JOIN metadata BY ss_customer_sk, c BY c_customer_sk;
    result = FILTER metadata2 BY (d_day_name == 'Sunday') AND (c_birth_country != '');
    grouped = GROUP result BY c_birth_country;
    avg_table = FOREACH grouped GENERATE group AS c_birth_country,
                AVG(result.ss_ext_wholesale_cost) AS avg_sales;
    ordered = ORDER avg_table BY avg_sales;
    bottom20 = LIMIT ordered 20;
    STORE bottom20 INTO 'query3.txt';
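
The Pig script for Query 3 is a chain of dataflow steps: join, filter, group, average, order, limit. The Python sketch below walks through the same steps on in-memory lists of dictionaries so each stage can be inspected. The function name `bottom_countries` and the sample rows in the usage section are hypothetical; only the column names come from the queries in this deck.

```python
from collections import defaultdict


def bottom_countries(store_sales, customers, dates, limit=20):
    """Lowest average wholesale cost per birth country, Sunday sales only."""
    # JOIN: index both dimension tables by key (a simple hash join).
    date_by_sk = {d["d_date_sk"]: d for d in dates}
    cust_by_sk = {c["c_customer_sk"]: c for c in customers}

    costs = defaultdict(list)
    for ss in store_sales:
        d = date_by_sk.get(ss["ss_sold_date_sk"])
        c = cust_by_sk.get(ss["ss_customer_sk"])
        if d is None or c is None:
            continue  # inner join: drop sales without a matching row
        # FILTER: Sunday sales for customers with a known birth country.
        if d["d_day_name"] == "Sunday" and c["c_birth_country"] != "":
            # GROUP: collect costs per birth country.
            costs[c["c_birth_country"]].append(ss["ss_ext_wholesale_cost"])

    # AVG per group, then ORDER ascending and LIMIT.
    averages = [(country, sum(v) / len(v)) for country, v in costs.items()]
    averages.sort(key=lambda pair: pair[1])
    return averages[:limit]


if __name__ == "__main__":
    dates = [{"d_date_sk": 1, "d_day_name": "Sunday"},
             {"d_date_sk": 2, "d_day_name": "Monday"}]
    customers = [{"c_customer_sk": 1, "c_birth_country": "KOREA"},
                 {"c_customer_sk": 2, "c_birth_country": "JAPAN"},
                 {"c_customer_sk": 3, "c_birth_country": ""}]
    sales = [
        {"ss_sold_date_sk": 1, "ss_customer_sk": 1, "ss_ext_wholesale_cost": 10.0},
        {"ss_sold_date_sk": 1, "ss_customer_sk": 1, "ss_ext_wholesale_cost": 30.0},
        {"ss_sold_date_sk": 1, "ss_customer_sk": 2, "ss_ext_wholesale_cost": 5.0},
        {"ss_sold_date_sk": 2, "ss_customer_sk": 2, "ss_ext_wholesale_cost": 100.0},
        {"ss_sold_date_sk": 1, "ss_customer_sk": 3, "ss_ext_wholesale_cost": 7.0},
    ]
    print(bottom_countries(sales, customers, dates))
```

Joining the sales rows against both dimension tables before filtering mirrors the requirement that the filter needs columns from `date_dim` and `customer` at the same time.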

Benchmark: Our results

Benchmark: Results from Cloudera documentation

Demo 3: Real-time Analytics
- Splunk

Splunk
An engine for real-time machine data
- Collects, indexes, analyzes, and visualizes machine data to identify problems, patterns, risks, and opportunities, and to drive better decisions for IT and the business
Machine data (unstructured, no predefined schema)
- Logs, application queries, records (billing, call detail, events), clickstreams

Overview of Splunk
- Data indexing
- Search language

    search | command arguments | command arguments | …
    sourcetype=syslog [ search login error | return 1user ] error
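
The `search | command | command` shape means each command consumes the output of the previous one. The Python sketch below is a toy model of that piping idea only. It is not the Splunk API and does not reproduce the semantics of Splunk's search language; `search`, `count_by`, `head`, `pipeline`, and the sample events are all hypothetical.

```python
from collections import Counter
from functools import reduce


def search(events, *terms, **fields):
    """Keep events whose raw text contains every term and whose
    fields equal the given values."""
    return [
        e for e in events
        if all(t in e["_raw"] for t in terms)
        and all(e.get(k) == v for k, v in fields.items())
    ]


def count_by(field):
    """Build a command that counts events per value of one field."""
    def command(events):
        return Counter(e[field] for e in events).most_common()
    return command


def head(n):
    """Build a command that keeps only the first n results."""
    def command(rows):
        return rows[:n]
    return command


def pipeline(data, *commands):
    """Feed the output of each command into the next, like '|'."""
    return reduce(lambda acc, cmd: cmd(acc), commands, data)


if __name__ == "__main__":
    events = [
        {"sourcetype": "syslog", "user": "alice", "_raw": "login error for alice"},
        {"sourcetype": "syslog", "user": "bob", "_raw": "login error for bob"},
        {"sourcetype": "syslog", "user": "alice", "_raw": "disk error on host1"},
        {"sourcetype": "access", "user": "carol", "_raw": "GET /index.html 200"},
    ]
    hits = search(events, "error", sourcetype="syslog")
    print(pipeline(hits, count_by("user"), head(1)))
```

Modeling each command as a function from results to results keeps the pipeline composable, which is the property the pipe syntax expresses.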

Splunk demo (1): Simple commands using Windows application logs

Splunk demo (2): Foot traffic analytics using Cisco Meraki data

References
[1] Cloudera Hadoop Tutorial, H5/Hadoop-Tutorial/ht_wordount1_source.html
[2] Introduction to Hive
[3] SQL on Hadoop, Intelligent Data Systems Lab, Seoul National University
[4] TPC Benchmarks Standard Specification, version 1.3.1, Transaction Processing Performance Council