BigBench: Big Data Benchmark Proposal Ahmad Ghazal, Tilmann Rabl, Minqing Hu, Francois Raab, Meikel Poess, Alain Crolotte, Hans-Arno Jacobsen.

Slides:



Advertisements
Similar presentations
Web Mining.
Advertisements

Distributed Data Processing
The BigFrame Team Duke University, Hong Kong Polytechnic University, and HP Labs.
Kensington Oracle Edition: Open Discovery Workflow Meets Oracle 10g Professor Yike Guo.
Big Data: Analytics Platforms Donald Kossmann Systems Group, ETH Zurich 1.
Supercharging Analytics on Big Data Announcing MapReduce-ready Advanced Analytic Functions June 21 st
Cloud Computing Resource provisioning Keke Chen. Outline  For Web applications statistical Learning and automatic control for datacenters  For data.
© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Big Decision HPS Performance.
A Fast Growing Market. Interesting New Players Lyzasoft.
Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance TILMANN RABL, MEIKEL POESS, HANS- ARNO JACOBSEN, PATRICK.
Observation Pattern Theory Hypothesis What will happen? How can we make it happen? Predictive Analytics Prescriptive Analytics What happened? Why.
1 Data Warehousing. 2 Data Warehouse A data warehouse is a huge database that stores historical data Example: Store information about all sales of products.
Data Mining – Intro.
Rapid Development of Data Generators Using Meta Generators in PDGF Tilmann Rabl, Meikel Poess, Manuel Danisch, Hans-Arno Jacobsen DBTest 2013, June 24,
Lecture-8/ T. Nouf Almujally
PARALLEL DBMS VS MAP REDUCE “MapReduce and parallel DBMSs: friends or foes?” Stonebraker, Daniel Abadi, David J Dewitt et al.
UT DALLAS Erik Jonsson School of Engineering & Computer Science FEARLESS engineering Secure Data Storage and Retrieval in the Cloud Bhavani Thuraisingham,
Copyright © 2014 Pearson Education, Inc. 1 It's what you learn after you know it all that counts. John Wooden Key Terms and Review (Chapter 6) Enhancing.
Hive: A data warehouse on Hadoop Based on Facebook Team’s paperon Facebook Team’s paper 8/18/20151.
HADOOP ADMIN: Session -2
This presentation was scheduled to be delivered by Brian Mitchell, Lead Architect, Microsoft Big Data COE Follow him Contact him.
A Brief Overview by Aditya Dutt March 18 th ’ Aditya Inc.
Appliance-based architectures for high performance data intensive applications Session at Silicon India Rajgopal Kishore Vice President and Global Head.
MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG. MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG 10/10/2013 (c) Tilmann Rabl - msrg.org 2.
SC32 WG2 Metadata Standards Tutorial Metadata Registries and Big Data WG2 N1945 June 9, 2014 Beijing, China.
MapReduce VS Parallel DBMSs
MapReduce April 2012 Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, …
BigBench Discussion Tilmann Rabl, msrg.org, UofT Third Workshop on Big Data Benchmarking, Xi’an July 16, 2013.
EXTENDING BIGBENCH Chaitan Baru, Milind Bhandarkar, Carlo Curino, Manuel Danisch, Michael Frank, Bhaskar Gowda, Hans-Arno Jacobsen, Huang Jie, Dileep Kumar,
Distributed Indexing of Web Scale Datasets for the Cloud {ikons, eangelou, Computing Systems Laboratory School of Electrical.
Introduction to Hadoop and HDFS
HadoopDB Presenters: Serva rashidyan Somaie shahrokhi Aida parbale Spring 2012 azad university of sanandaj 1.
B REAKOUT S ESSION - B IG B ENCH - 3rd Workshop on Big Data Benchmarking July Xi‘an, China.
Copyright © 2013, Oracle and/or its affiliates. All rights reserved. 1.
Chapter 3 DECISION SUPPORT SYSTEMS CONCEPTS, METHODOLOGIES, AND TECHNOLOGIES: AN OVERVIEW Study sub-sections: , 3.12(p )
Data Mining By Dave Maung.
5-1 McGraw-Hill/Irwin Copyright © 2007 by The McGraw-Hill Companies, Inc. All rights reserved.
Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel.
Data Mining BY JEMINI ISLAM. Data Mining Outline: What is data mining? Why use data mining? How does data mining work The process of data mining Tools.
All about Revolution R Enterprise
Matthew Winter and Ned Shawa
PANEL SENIOR BIG DATA ARCHITECT BD-COE
Site Technology TOI Fest Q Celebration From Keyword-based Search to Semantic Search, How Big Data Enables That?
26/01/20161Gianluca Demartini Ranking Categories for Faceted Search Gianluca Demartini L3S Research Seminars Hannover, 09 June 2006.
CISC 849 : Applications in Fintech Namami Shukla Dept of Computer & Information Sciences University of Delaware iCARE : A Framework for Big Data Based.
Data Mining Concepts and Techniques Course Presentation by Ali A. Ali Department of Information Technology Institute of Graduate Studies and Research Alexandria.
Big Data Analytics with Excel Peter Myers Bitwise Solutions.
Big Data Yuan Xue CS 292 Special topics on.
Introducing Teradata Aster Discovery Platform Getting Started Ahsan Nabi Khan September 25 th, 2015.
Copyright © 2016 Pearson Education, Inc. Modern Database Management 12 th Edition Jeff Hoffer, Ramesh Venkataraman, Heikki Topi CHAPTER 11: BIG DATA AND.
BIG DATA. Big Data: A definition Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database.
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data Authored by Sameer Agarwal, et. al. Presented by Atul Sandur.
Redmond Protocols Plugfest 2016 Casey Karst PolyBase in SQL Server 2016.
Lecture-6 Bscshelp.com. Todays Lecture  Which Kinds of Applications Are Targeted?  Business intelligence  Search engines.
Integration of Oracle and Hadoop: hybrid databases affordable at scale
Big Data is a Big Deal!.
Integration of Oracle and Hadoop: hybrid databases affordable at scale
Sushant Ahuja, Cassio Cristovao, Sameep Mohta
Hadoop and Analytics at CERN IT
Data Mining Generally, (Sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it.
Introduction to R Programming with AzureML
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
Introduction to Spark.
Introduction to PIG, HIVE, HBASE & ZOOKEEPER
Overview of big data tools
Toolbox Benchmarking data session BDVe Meetup, Sofia May 15, 2018
Charles Tappert Seidenberg School of CSIS, Pace University
Copyright © JanBask Training. All rights reserved Get Started with Hadoop Hive HiveQL Languages.
Big Data.
Presentation transcript:

BigBench: Big Data Benchmark Proposal Ahmad Ghazal, Tilmann Rabl, Minqing Hu, Francois Raab, Meikel Poess, Alain Crolotte, Hans-Arno Jacobsen

2 End to end benchmark Based on a product retailer Focus on -Parallel DBMS -MR engines Initial work presented at 1 st WBDB, San Jose Collaboration with Industry & Academia -Teradata, University of Toronto, InfoSizing, Oracle Full spec at 3 rd WBDB, Xian, China BigBench

3 Data Model -Variety, Volume, Velocity Data Generator -PDGF for structured data -Enhancement : Semi-structured & Text generation Workload specification -Main driver: retail big data analytics -Covers : data source, declarative & procedural and machine learning algorithms. Metrics Evaluation -Done on Teradata Aster Outline

4 Big Data is not about size only 3 V’s -Variety Different types of data -Volume Huge size -Velocity Continuous updates Data Model

5 Data Model (Variety)

6 Volume -Based on scale factor -Similar to TPC-DS scaling -Weblogs & product reviews also scaled Velocity -Periodic refreshes for all data -Different velocity for different areas V structured < V unstructured < V semistructured -Queries run with refresh Data Model (continued)

7 "Parallel Data Generation Framework“ PDGF -For the structured part of model -Scale factor similar to TPC-DS Extensions to PDGF -Web logs -Product reviews Web logs: retail customers/guests visiting site -Web logs similar to apache web server logs -Coupled with structured data Product reviews: Customers and guest users -Algorithm based on Markov chain -Real data set sample input -Coupled with structured data Data Generator

8 Input -real review data -Category information -Rating -Review text Initial setup -Parse text Produce tokens Correlation between tokens - Build repository by category API for data generation -Integrated with PDGF -Based on category ID Data Generator TextGen

9 Workload Queries -30 queries -Specified in English -No required syntax Business functions (Adapted from McKinsey+) -Marketing Cross-selling, Customer micro-segmentation, Sentiment analysis, Enhancing multichannel consumer experiences -Merchandising Assortment optimization, Pricing optimization -Operations Performance transparency, Product return analysis -Supply chain Inventory management -Reporting (customers and products) Workload (continued)

10 Technical Functions Data source dimension -Structured -Semi-structured -Un-structured Processing type dimension -Declarative (SQL, HQL) -Procedural -Mix of both Analytic technique dimension -Statistical analysis: correlation analysis, time-series, regression -Data mining: classification, clustering, association mining, pattern analysis and text analysis -Simple reporting: ad hoc queries not covered above Workload (continued)

11 Workload (continued) Business Categories Query Breakdown Business categoryNumber of Queries Percentage Marketing1860% Merchandising517% Operations413% Supply chain27% New business models 13%

12 Workload (continued) Technical Dimensions Breakdown Query processing typeNumber of Queries Percentage Declarative1033% Procedural723% Declarative + Procedural1343%

13 Technical Dimensions Breakdown (continued) Data SourcesNumber of Queries Percentage Structured1860% Semi-structured723% Un-structured517% Analytic techniquesNumber of Queries Percentage Statistics analysis620% Data mining1757% Reporting827%

14 Metrics

15 BigBench proof of concept DBMS -Typically data loaded into tables -Possibly parsing weblogs to get schema -Reviews captured as VARCHAR or BLOB fields -Queries run using SQL + UDF MR engine -Data can be loaded on HDFS -MR, HQL, PigLatin can be used DBMS & MR engine -DBMS with Hadoop connectors -Data can be placed and split among both -Processing can also be split among two Evaluation

16 Teradata Aster -MPP database -Discovery platform -Supports SQL-MR System -Aster node cluster -200 GB of data Evaluation (continued)

17 Data generation -DSDGen produced structured part -PDGF+ produced semi-structured and un-structured Data loaded into tables -Parsed weblogs into Weblogs table -Product reviews table Queries -SQL-MR syntax Evaluation (continued)

18 Example query Product category affinity analysis -Computes the probability of browsing products from a category after customers viewed items from another category. -One form of market basket Business case -Marketing -cross-selling Type of source -Structured (web sales) Processing type -mix of declarative and procedural Analytics type -Data mining Evaluation (continued)

19 SELECT category_cd1 AS category1_cd, category_cd2 AS category2_cd, COUNT (*) AS cnt FROM basket_generator ( ON ( SELECT i. i_category_id AS category_cd, s. ws_bill_customer_sk AS customer_id FROM web_sales s INNER JOIN item i ON s. ws_item_sk = i_item_sk ) PARTITION BY customer_id BASKET_ITEM (' category_cd ') ITEM_SET_MAX (500) ) GROUP BY 1,2 order by 1,3,2; Evaluation (continued)

20 Evaluation (Sample Queries)

21 Evaluation (Processing Type)

22 End to end big data benchmark Data Model -Adding semi-structured and unstructured data -Different velocity for structured, semi-structured and unstructured -Volume based on scale factor Data Generation -Unstructured data using Markov chain model. -Enhancing PDGF for semi-structured and integrating it with unstructured Workload -30 queries based on McKinsey report -Covers processing type, data source and data mining algorithms Evaluation -Aster SQL-MR Conclusion

23 Add late binding to web logs -No schema/keys upfront Finalizing -Data generator -Metric specifications. -Downloadable kit More POC’s -Hadoop ecosystem like HIVE Future Work