Download presentation
Presentation is loading. Please wait.
Published byBrian Gernon Modified over 9 years ago
1
BigBench: Big Data Benchmark Proposal Ahmad Ghazal, Tilmann Rabl, Minqing Hu, Francois Raab, Meikel Poess, Alain Crolotte, Hans-Arno Jacobsen
2
2 End to end benchmark Based on a product retailer Focus on -Parallel DBMS -MR engines Initial work presented at 1 st WBDB, San Jose Collaboration with Industry & Academia -Teradata, University of Toronto, InfoSizing, Oracle Full spec at 3 rd WBDB, Xian, China BigBench
3
3 Data Model -Variety, Volume, Velocity Data Generator -PDGF for structured data -Enhancement : Semi-structured & Text generation Workload specification -Main driver: retail big data analytics -Covers : data source, declarative & procedural and machine learning algorithms. Metrics Evaluation -Done on Teradata Aster Outline
4
4 Big Data is not about size only 3 V’s -Variety Different types of data -Volume Huge size -Velocity Continuous updates Data Model
5
5 Data Model (Variety)
6
6 Volume -Based on scale factor -Similar to TPC-DS scaling -Weblogs & product reviews also scaled Velocity -Periodic refreshes for all data -Different velocity for different areas V structured < V unstructured < V semistructured -Queries run with refresh Data Model (continued)
7
7 "Parallel Data Generation Framework“ PDGF -For the structured part of model -Scale factor similar to TPC-DS Extensions to PDGF -Web logs -Product reviews Web logs: retail customers/guests visiting site -Web logs similar to apache web server logs -Coupled with structured data Product reviews: Customers and guest users -Algorithm based on Markov chain -Real data set sample input -Coupled with structured data Data Generator
8
8 Input -real review data -Category information -Rating -Review text Initial setup -Parse text Produce tokens Correlation between tokens - Build repository by category API for data generation -Integrated with PDGF -Based on category ID Data Generator TextGen
9
9 Workload Queries -30 queries -Specified in English -No required syntax Business functions (Adapted from McKinsey+) -Marketing Cross-selling, Customer micro-segmentation, Sentiment analysis, Enhancing multichannel consumer experiences -Merchandising Assortment optimization, Pricing optimization -Operations Performance transparency, Product return analysis -Supply chain Inventory management -Reporting (customers and products) Workload (continued)
10
10 Technical Functions Data source dimension -Structured -Semi-structured -Un-structured Processing type dimension -Declarative (SQL, HQL) -Procedural -Mix of both Analytic technique dimension -Statistical analysis: correlation analysis, time-series, regression -Data mining: classification, clustering, association mining, pattern analysis and text analysis -Simple reporting: ad hoc queries not covered above Workload (continued)
11
11 Workload (continued) Business Categories Query Breakdown Business categoryNumber of Queries Percentage Marketing1860% Merchandising517% Operations413% Supply chain27% New business models 13%
12
12 Workload (continued) Technical Dimensions Breakdown Query processing typeNumber of Queries Percentage Declarative1033% Procedural723% Declarative + Procedural1343%
13
13 Technical Dimensions Breakdown (continued) Data SourcesNumber of Queries Percentage Structured1860% Semi-structured723% Un-structured517% Analytic techniquesNumber of Queries Percentage Statistics analysis620% Data mining1757% Reporting827%
14
14 Metrics
15
15 BigBench proof of concept DBMS -Typically data loaded into tables -Possibly parsing weblogs to get schema -Reviews captured as VARCHAR or BLOB fields -Queries run using SQL + UDF MR engine -Data can be loaded on HDFS -MR, HQL, PigLatin can be used DBMS & MR engine -DBMS with Hadoop connectors -Data can be placed and split among both -Processing can also be split among two Evaluation
16
16 Teradata Aster -MPP database -Discovery platform -Supports SQL-MR System -Aster 5.0 -8 node cluster -200 GB of data Evaluation (continued)
17
17 Data generation -DSDGen produced structured part -PDGF+ produced semi-structured and un-structured Data loaded into tables -Parsed weblogs into Weblogs table -Product reviews table Queries -SQL-MR syntax Evaluation (continued)
18
18 Example query Product category affinity analysis -Computes the probability of browsing products from a category after customers viewed items from another category. -One form of market basket Business case -Marketing -cross-selling Type of source -Structured (web sales) Processing type -mix of declarative and procedural Analytics type -Data mining Evaluation (continued)
19
19 SELECT category_cd1 AS category1_cd, category_cd2 AS category2_cd, COUNT (*) AS cnt FROM basket_generator ( ON ( SELECT i. i_category_id AS category_cd, s. ws_bill_customer_sk AS customer_id FROM web_sales s INNER JOIN item i ON s. ws_item_sk = i_item_sk ) PARTITION BY customer_id BASKET_ITEM (' category_cd ') ITEM_SET_MAX (500) ) GROUP BY 1,2 order by 1,3,2; Evaluation (continued)
20
20 Evaluation (Sample Queries)
21
21 Evaluation (Processing Type)
22
22 End to end big data benchmark Data Model -Adding semi-structured and unstructured data -Different velocity for structured, semi-structured and unstructured -Volume based on scale factor Data Generation -Unstructured data using Markov chain model. -Enhancing PDGF for semi-structured and integrating it with unstructured Workload -30 queries based on McKinsey report -Covers processing type, data source and data mining algorithms Evaluation -Aster SQL-MR Conclusion
23
23 Add late binding to web logs -No schema/keys upfront Finalizing -Data generator -Metric specifications. -Downloadable kit More POC’s -Hadoop ecosystem like HIVE Future Work
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.