Developing a Characterization of Business Intelligence Workloads for Sizing New Database Systems Ted J. Wasserman (IBM Corp. / Queen’s University) Pat.

Presentation transcript:

Developing a Characterization of Business Intelligence Workloads for Sizing New Database Systems Ted J. Wasserman (IBM Corp. / Queen’s University) Pat Martin (Queen’s University) David B. Skillicorn (Queen’s University) Haider Rizvi (IBM Canada)

2 Outline Background and Motivation Workload Characterization Analysis Results Future work

3 What is Sizing?
Estimating the amount of physical computing resources (processor (CPU), disk, memory) needed to support a new workload.
Sizing relies on simplifying assumptions, extrapolations, estimations, projections, rules of thumb, and prior experience, because little information is available about the new application.
Sizing is more of an art than a science.

4 Our Sizing Approach
1) Collect the required high-level input data from the customer
2) Cross-check and verify the input data, making assumptions and estimates if needed
3) Determine the required system resource demands for each workload class and type
4) Aggregate the resource demands of the different workload types and classes to determine the overall requirements
5) Determine which hardware configurations will meet the required resource demands
6) Produce a ranked list of hardware configurations
* For more details, see: Wasserman, T.J., Martin, P., Rizvi, H. Sizing DB2 Servers for Business Intelligence Workloads. In Proc. of CASCON 2004, October 2004, Toronto, Canada.
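
Steps 4 to 6 above can be sketched as follows. This is only an illustration, not the authors' sizing tool: the `Demand` and `Config` types, the three resource fields, the use of a `cost` value as the ranking key, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Demand:
    """Resource demand (or capacity); the fields are illustrative assumptions."""
    cpu_cores: float
    memory_gb: float
    disk_mb_per_s: float

    def __add__(self, other):
        return Demand(self.cpu_cores + other.cpu_cores,
                      self.memory_gb + other.memory_gb,
                      self.disk_mb_per_s + other.disk_mb_per_s)


@dataclass(frozen=True)
class Config:
    name: str
    capacity: Demand
    cost: float  # hypothetical ranking key; the slides do not name one


def aggregate(demands):
    """Step 4: sum the per-class demands into one overall requirement."""
    total = Demand(0.0, 0.0, 0.0)
    for d in demands:
        total = total + d
    return total


def fits(config, need):
    """Step 5: a configuration qualifies if every capacity covers the need."""
    c = config.capacity
    return (c.cpu_cores >= need.cpu_cores
            and c.memory_gb >= need.memory_gb
            and c.disk_mb_per_s >= need.disk_mb_per_s)


def rank_configs(configs, need):
    """Step 6: keep qualifying configurations, cheapest first."""
    return sorted((c for c in configs if fits(c, need)), key=lambda c: c.cost)


# Made-up demands for two workload classes and three made-up configurations.
class_demands = {
    "moderate": Demand(4.0, 16.0, 200.0),
    "high": Demand(8.0, 32.0, 400.0),
}
configs = [
    Config("small", Demand(8.0, 32.0, 500.0), cost=1.0),
    Config("medium", Demand(16.0, 64.0, 800.0), cost=2.0),
    Config("large", Demand(32.0, 128.0, 1600.0), cost=4.0),
]

need = aggregate(class_demands.values())
ranked = rank_configs(configs, need)
print(need)
print([c.name for c in ranked])
```

A real sizing would presumably weight each class by its share of the query mix and its performance goal; simple addition is used here only to keep the sketch short.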

5 Motivation
Problem: How does a customer describe the workload of their new application?
- May not know the exact queries yet
- No production-level measurements available
- Only vague, high-level information
Solution: Study the characteristics of a proxy workload (TPC-H) and have the customer describe their workload in terms of the approximate performance goals and mix of the different classes of queries inherent in the proxy workload

6 Workload Characterization Analysis
Partition queries into a few general classes based on their resource usage
- The classes need to be kept simple so that the customer can understand and relate to them
Each class comprises queries that are similar to each other in resource usage and other characteristics

7 Data Collection
Data from 5 recent TPC-H benchmark power runs used (*pre-audited runs)
Hardware configurations and database scales varied across benchmarks
- Balanced system configurations were used
Data collected using standard OS monitoring tools at 5-second intervals for each query

8 Parameter Selection
Query parameters monitored and used in analysis:
- Query response time (seconds)
- Average (user) CPU utilization
- Average MB/second rate
- Average IO/second rate
- Size of largest n-way table join
The above set was sufficient for our analysis
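
A minimal sketch of how interval samples for one query might be reduced to the five parameters listed above. The function name, the tuple layout of a sample, and the sample values are assumptions made for illustration; the slides do not describe the monitoring output format.

```python
from statistics import mean


def summarize_query(response_time_s, samples, largest_join):
    """Build one feature record for a query.

    samples: one (user_cpu_pct, mb_per_s, io_per_s) tuple per monitoring
    interval (hypothetical layout). Response time and join size are passed
    in as measured values rather than derived from the samples.
    """
    return {
        "response_time_s": response_time_s,
        "avg_user_cpu_pct": mean(s[0] for s in samples),
        "avg_mb_per_s": mean(s[1] for s in samples),
        "avg_io_per_s": mean(s[2] for s in samples),
        "largest_join": largest_join,
    }


# Three made-up 5-second samples for a single query.
samples = [
    (80.0, 100.0, 1000.0),
    (60.0, 140.0, 1400.0),
    (70.0, 120.0, 1200.0),
]
features = summarize_query(response_time_s=14.2, samples=samples, largest_join=3)
print(features)
```

One such record per query, stacked row by row, gives the query-by-parameter matrix that the later normalization and decomposition steps operate on.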

9 Data Normalization
Data were normalized within each benchmark run
- Each benchmark data set was transformed to have a mean of 0 and a standard deviation of 1
The normalized query data from all benchmarks were then combined and used for analysis
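
The per-run normalization can be sketched as a column-wise z-score followed by stacking the runs. The helper name and the tiny example runs are made up, and the use of the population standard deviation is an assumption, since the slides do not say which estimator was used.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Scale each column of one run to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # population std. dev. (assumed)
    return [
        # A constant column has no spread; map it to 0.0 instead of dividing by 0.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Two made-up runs on very different scales (rows = queries, columns = parameters).
run_a = [[10.0, 50.0], [20.0, 70.0], [30.0, 90.0]]
run_b = [[100.0, 5.0], [300.0, 9.0]]

norm_a = zscore_columns(run_a)
norm_b = zscore_columns(run_b)

# Normalize within each run first, then combine the runs for analysis.
combined = norm_a + norm_b
print(combined)
```

Normalizing each run separately before combining removes differences in hardware and database scale, so that a query's values are expressed relative to the other queries in the same run.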

10 Partitioning Techniques
Goal: Partition the workload into classes or clusters so that objects within a cluster are similar to each other but dissimilar to objects in other clusters
Singular Value Decomposition (SVD) & SemiDiscrete Decomposition (SDD)
- Matrix decomposition techniques
- Unsupervised data mining
- Good at revealing underlying or ‘hidden’ factors in data typical of real-world processes

11 Singular Value Decomposition (SVD)
An n x m matrix A can be decomposed as $A = U S V^T$, where U is n x r, V is m x r, and S is an r x r diagonal matrix whose entries (the singular values) are non-negative and in non-increasing order. The columns of U and of V are orthonormal.
The singular values indicate how important each new dimension is in representing the structure of A

12 Singular Value Decomposition (2) Can be regarded as transforming the original space to new axes such that as much variation as possible is expressed along the first axis, as much as possible of what remains along the second, and so on
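
The decomposition described above can be tried out with NumPy's `numpy.linalg.svd`. This is a sketch on synthetic random data standing in for a 22-query by 5-parameter matrix; it is not the benchmark data, and keeping `k = 3` dimensions is an illustrative choice.

```python
import numpy as np

# Synthetic stand-in: 22 rows (queries) x 5 columns (parameters).
rng = np.random.default_rng(0)
A = rng.standard_normal((22, 5))

# Thin SVD: U is 22x5, s holds the 5 singular values, Vt is 5x5 (V transposed).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Share of the total sum of squares of A captured by each new axis.
explained = s**2 / np.sum(s**2)

# Coordinates of each row along the first k new axes, and the rank-k
# approximation of A rebuilt from them.
k = 3
coords = U[:, :k] * s[:k]
A_k = coords @ Vt[:k, :]

print("singular values:", np.round(s, 3))
print("share per axis:", np.round(explained, 3))
print("rank-3 approximation error:", np.linalg.norm(A - A_k))
```

Because the singular values come back in non-increasing order, the first entries of `explained` are the largest, which is the sense in which the first axis expresses as much variation as possible.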

13 SemiDiscrete Decomposition (SDD) The SDD of A is given by A = X D Y where X is n x k, D is a k x k diagonal matrix, and Y is k x m (for arbitrary k) The matrices X and Y have entries that are only -1, 0, or +1

14 SemiDiscrete Decomposition (2)
For any decomposition, the product of the ith column of X, the ith diagonal element of D and the ith row of Y is a matrix of the same shape as A
In SDD, each of these layer matrices describes a ‘bump’ in the data, a region (not necessarily contiguous) of large magnitude
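
The layer matrices can be illustrated with a small hand-made example. The matrices `X`, `d`, and `Y` below are invented for illustration (they are not the output of an SDD algorithm); the sketch only shows how each layer is formed and that the layers sum to the product X D Y.

```python
def layer(X, d, Y, i):
    """i-th layer: (i-th column of X) * d[i] * (i-th row of Y), as nested lists."""
    return [[d[i] * X[r][i] * Y[i][c] for c in range(len(Y[i]))]
            for r in range(len(X))]


def add(M, N):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(row_m, row_n)] for row_m, row_n in zip(M, N)]


# Invented factors with entries restricted to -1, 0, +1 (n = 3, k = 2, m = 3).
X = [[1, 0],
     [1, 1],
     [0, -1]]
d = [4.0, 2.0]          # diagonal of D
Y = [[1, 1, 0],
     [0, -1, 1]]

layer0 = layer(X, d, Y, 0)   # a block of height 4 in the top-left corner
layer1 = layer(X, d, Y, 1)   # a smaller +/-2 pattern in the lower right
total = add(layer0, layer1)  # equals the product X D Y

for row in total:
    print(row)
```

Each layer is nonzero only where both its column of X and its row of Y are nonzero, which is why a layer picks out a rectangular (possibly non-contiguous) region of rows and columns with a common magnitude.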

15 Analysis Results
[Figure showing four clusters labeled C1, C2, C3, and C4]

16 Results
Four clusters of queries
- Cluster 1 (“Moderate Complexity”): Q1, Q3, Q4, Q5, Q6, Q11, Q14, Q14, Q19
- Cluster 2 (“Simple Complexity”): Q2, Q20
- Cluster 3 (“High Complexity”): Q7, Q8, Q9, Q18, Q21
- Cluster 4 (“Trivial Complexity”): Q10, Q13, Q15, Q16, Q22

17 Results (2)
Queries appear to scale well across different system architectures and database sizes
Attempt to understand the meaning of the new “dimensions” of the SVD analysis
- U1: CPU-bound vs. IO-bound queries
- U2: Query response times
- U3: Sequential-IO-intensive vs. random-IO-intensive queries

18 Future Work
- Perform analysis on a larger set of data
- Use a more robust/representative workload
- Extend to other workload types (e.g. OLTP)

19 Fin Thank you.