Blazing Queries: Using an Open Source Database for High Performance Analytics July 2010.

Agenda
- Common Tuning Techniques: Why queries run slowly; Common Tuning Approaches; A Different Approach
- Infobright Overview: The Company; The Technology; Performance Results
- Getting Started

Why queries run slowly
- Too much data
- Too many users
- Poor query design

Common Tuning Approaches
- Indexing
- Partitioning
- More processors
- Summary tables
- Explain plans

A Different Approach
Infobright uses intelligence, not hardware, to drive query performance:
- Creates information about the data (metadata) automatically upon load
- Uses the metadata to eliminate or reduce the need to access data to respond to a query
- The less data that needs to be accessed, the faster the response
What this means to you:
- No need to partition data, create or maintain indexes, or tune for performance
- Ad hoc queries are as fast as static queries, so users have total flexibility
- Ad hoc queries that may take hours with other databases run in minutes; queries that take minutes with other databases run in seconds

Infobright: Economic Data Warehouse Choice
Infobright Innovation
- First commercial open source analytic database
- Knowledge Grid provides significant advantage over other columnar databases
- Fastest time-to-value, simplest administration
- Cool Vendor in Data Management and Integration 2009
- Partner of the Year 2009
Strong Momentum & Adoption
- Release 3.3.2 generally available
- > 120 customers in 10 countries
- > 40 partners on 6 continents
- A vibrant open source community: > 1 million visitors, 40,000 downloads, 7,500 community members
Note: this data changes frequently, so check for the latest version.
Infobright was founded in 2006 by a team of internationally recognized mathematicians and is run by data management experts. We recognized that the critical need for business intelligence, coupled with the dramatic growth of corporate data, was outpacing the IT resources and budget within many companies. Our mission is to enable companies of all sizes to implement, manage and afford a scalable analytic database as the foundation for analytic applications and data marts. Our technology does this by delivering a simple-to-use, scalable and low-cost solution that eliminates up to 90% of the work typically required when implementing these kinds of systems. We provide both a Community Edition and an Enterprise Edition of our technology to suit the needs of a diverse set of users.
The value proposition leverages the fact that our technology is easy to deploy and does not require a great deal of support and customization from IT. In addition, our software requires a very small hardware footprint in relation to the size of raw data we support, meaning that IT can quickly deliver a high-value application at lower cost than our competitors. Through our integration with MySQL, we provide a highly scalable analytic database solution that is ideal for MySQL users. We also leverage the broad connectivity that MySQL has to BI and ETL tools, so users can choose from a wide variety of products. Our strong partnership with Sun/MySQL resulted in their making an investment in Infobright this past September.
Infobright's products are in their third major release, each of which has added significant new functionality. We have grown rapidly, with customers and partners around the globe. We have been recognized by a diverse set of industry leaders for technology innovation. Our open source community is also growing rapidly, and our Community Edition download volume continues to grow.

Infobright Technology: Key Concepts
- Column orientation
- Data Packs and compression
- Knowledge Grid
- Optimizer
Our approach is to use the data within the database in a different way to alleviate problems in all of these areas. Our product, Infobright, is a scalable analytical database. What this means is that we've designed a product that provides fast answers to complex questions, that deals with massive amounts of information, and that does so without any additional burden on IT. Rather than using a brute-force approach with more hardware power to increase query performance, we use knowledge about the data itself to intelligently isolate the relevant information and return results more quickly. Users are no longer limited to pre-determined queries and reporting structures, and the efforts of IT are reduced to a minimum.
So how do we actually do this? Infobright is a column-oriented database, and as each column is read in, it is divided vertically into groups of 65,536 row elements, referred to as Data Packs. So each column of data is defined in terms of a stack of Data Packs. As each Data Pack is loaded into an Infobright table, we extract statistical information about its data, which we keep in the Knowledge Grid. The goal of the Knowledge Grid is to minimize the need to access data in order to resolve a query, since it knows from its metadata which Data Packs are relevant to a particular query and which are not. As Data Packs are created, they are compressed individually using an iterative series of compression algorithms, giving us our industry-leading overall compression ratios of anywhere from 10:1 to 40:1.
Data is read in without having to alter a business's existing data model, and there are no indexes to build or maintain. The Knowledge Grid is built automatically during the load and doesn't require any work on the part of the user. And because Knowledge Grid information is generated only relative to the data being loaded, our incremental load speeds are constant, regardless of the growing size of the database.
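To make the Data Pack idea concrete, here is a minimal Python sketch, not Infobright's implementation. It splits one column into packs of 65,536 values and records a small summary per pack. The function names, the summary fields and the sample salary values are all hypothetical, and the sketch assumes each new load starts a fresh pack.

```python
PACK_SIZE = 65_536  # rows per Data Pack, as stated in the talk


def split_into_packs(column, pack_size=PACK_SIZE):
    """Split one column's values into consecutive fixed-size packs."""
    return [column[i:i + pack_size] for i in range(0, len(column), pack_size)]


def pack_metadata(pack):
    """Summarize a single pack; only that pack's values are read."""
    return {"rows": len(pack), "min": min(pack), "max": max(pack)}


# Hypothetical column of 150,000 salary values.
salaries = [30_000 + (i * 37) % 50_000 for i in range(150_000)]
packs = split_into_packs(salaries)
grid = [pack_metadata(p) for p in packs]

# Incremental load (simplified): metadata for the new rows is computed from
# the new rows alone, so existing grid entries are never revisited.
new_rows = [40_000 + i % 1_000 for i in range(70_000)]
grid.extend(pack_metadata(p) for p in split_into_packs(new_rows))

print([entry["rows"] for entry in grid])
```

Because the per-pack summary depends only on the rows being loaded, the cost of building it does not grow with the size of the table, which is the property the notes attribute to incremental loads.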

1. Column vs. Row Orientation - Use Cases
[Diagram: the same table (columns ID, job, dept, city, #) shown as Row-Based Storage and as Column-Based Storage]
Row oriented works if...
- All the columns are needed
- Transactional processing is required
Column oriented works if...
- Only relevant columns are needed
- Reports are aggregates (sum, count, average, etc.)
Benefits
- Very efficient compression
- Faster results for analytical queries
Each method has its benefits depending on your use case. Row-oriented databases are better suited for transactional environments, such as a call centre where a customer's entire record is required when their profile is retrieved. Column-oriented databases are better suited for analytics, where only portions of each record are required. By grouping the data together by column, the database only needs to retrieve the columns that are relevant to the query, greatly reducing the overall I/O needed. By contrast, returning a specific 'record' would require retrieving information from each column store. Infobright is a column-oriented database built for high-speed, complex analytical queries that ask questions about the data, such as trends and aggregates, rather than questions that retrieve records from the data.
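The trade-off described above can be illustrated with a small Python sketch. The table contents and the salary column are made up for the example, and counting "values read" is only a rough stand-in for I/O.

```python
# Hypothetical employee table in row-based form: each record is kept together.
rows = [
    {"id": 1, "job": "Shipping", "dept": "Ops", "city": "TORONTO", "salary": 52_000},
    {"id": 2, "job": "Sales", "dept": "Rev", "city": "OTTAWA", "salary": 48_000},
    {"id": 3, "job": "Shipping", "dept": "Ops", "city": "TORONTO", "salary": 61_000},
    {"id": 4, "job": "Support", "dept": "Ops", "city": "CALGARY", "salary": 39_000},
]

# The same data in column-based form: one list per column.
columns = {name: [record[name] for record in rows] for name in rows[0]}


def average_from_rows(rows, column):
    """Aggregate over a row store: every field of every record is fetched."""
    values_read = 0
    total = 0
    for record in rows:
        values_read += len(record)
        total += record[column]
    return total / len(rows), values_read


def average_from_columns(columns, column):
    """Aggregate over a column store: only the needed column is fetched."""
    values = columns[column]
    return sum(values) / len(values), len(values)


def fetch_record(columns, index):
    """Rebuilding one full record must touch every column store."""
    return {name: values[index] for name, values in columns.items()}


print(average_from_rows(rows, "salary"))        # reads 20 values
print(average_from_columns(columns, "salary"))  # reads 4 values
print(fetch_record(columns, 0))
```

The aggregate touches a fifth of the values in the column layout, while fetching a whole record is the case where the row layout is the better fit.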

2. Data Packs and Compression
Data Packs
- Each Data Pack contains 65,536 data values
- Compression is applied to each individual Data Pack
- The compression algorithm varies depending on data type and distribution
Compression (patent-pending algorithms)
- Results vary depending on the distribution of data among Data Packs
- A typical overall compression ratio seen in the field is 10:1
- Some customers have seen results of 40:1 and higher
- For example, 1TB of raw data compressed 10 to 1 would only require 100GB of disk capacity
[Diagram: a column divided into 64K Data Packs, each compressed separately]
During the load process, each column of data is segmented into Data Packs of 64K, or 65,536, row elements. Then, for each Data Pack, Infobright applies multiple compression algorithms, multiple times if necessary, to achieve the highest compression ratio for that Data Pack. The compression algorithms we use are a combination of industry standards and internally developed, patent-pending Infobright algorithms, and are chosen based on the data within the column. The overall compression ratio for each Data Pack will depend on the data type and the repetitiveness of the data from one Data Pack to the next.
By addressing each Data Pack individually, we see that the compression varies within one column from one Data Pack to the next, and of course it varies from one column to the next as well. This means that the overall compression ratio for the whole table can be very high. Typical compression achieved by our customers is 10:1, but we frequently see up to 30:1 and 40:1 compression. Keep in mind that many databases with stated compression ratios of 5:1 and 10:1 will often add significantly to their footprint with indexes or projections, resulting in an overall storage requirement that can be equal to or greater than the original uncompressed data. Our compression method means significant savings in storage requirements, a lower total cost of ownership, and reduced I/O, since smaller volumes of data are being moved around.
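The following Python sketch shows why compressing each pack on its own gives ratios that depend on the data in that pack. It uses zlib purely as a stand-in, since Infobright's own algorithms are not described in this talk, and the two sample packs are invented. The last function reproduces the slide's storage arithmetic, taking 1TB as 1,000GB.

```python
import random
import zlib
from array import array

PACK_SIZE = 65_536  # values per Data Pack, as stated in the talk


def compress_pack(values):
    """Serialize one pack as 64-bit integers and compress it on its own."""
    raw = array("q", values).tobytes()
    packed = zlib.compress(raw, 9)  # zlib is a stand-in, not Infobright's codec
    return raw, packed


def ratio(raw, packed):
    """Compression ratio of one pack (raw bytes : compressed bytes)."""
    return len(raw) / len(packed)


def required_disk_gb(raw_gb, compression_ratio):
    """Disk needed for raw_gb of data at a given overall compression ratio."""
    return raw_gb / compression_ratio


# A highly repetitive pack (for example a year column) versus a pack of
# high-entropy values; both are hypothetical.
repetitive_pack = [2010] * PACK_SIZE
rng = random.Random(0)
varied_pack = [rng.getrandbits(62) for _ in range(PACK_SIZE)]

for label, pack in (("repetitive", repetitive_pack), ("varied", varied_pack)):
    raw, packed = compress_pack(pack)
    assert zlib.decompress(packed) == raw  # compression is lossless
    print(label, f"{ratio(raw, packed):.1f}:1")

print(required_disk_gb(1_000, 10), "GB for 1TB of raw data at 10:1")
```

The repetitive pack compresses by orders of magnitude while the varied pack barely shrinks, which is why the overall ratio for a table depends on how the data is distributed among its packs.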

3. The Knowledge Grid
- Knowledge Grid: information about the data; applies to the whole table
- Knowledge Nodes: built for each Data Pack during LOAD: Data Pack Node (DPN), Numerical Histogram, Character Map (CMAP)
- Knowledge Nodes answer the query directly, or identify only the relevant Data Packs, minimizing decompression
[Diagram: example INT and CHAR columns, each divided into Data Packs DP1 to DP6, with the Data Pack Node, Numerical Histogram and Character Map built for DP1]
The Knowledge Grid is a summary of statistical and aggregate information collected about each table as the data is loaded. It is information about the data. For each column, and for each Data Pack within that column, the Knowledge Grid information is collected automatically, and up to four different types of Knowledge Nodes are built, with no configuration or setup required in advance. Three of the Knowledge Nodes, called Data Pack Nodes, Numerical Histograms and Character Maps, are built for each Data Pack during the load; a fourth Knowledge Node, called a Pack-to-Pack Node, is built when a join query is run. Because they contain summary information about the data within the table, the Knowledge Nodes are used as the first step in resolving queries quickly and efficiently, either by answering the query directly or by identifying only the relevant Data Packs within a table and minimizing decompression.
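As a rough illustration of the three load-time Knowledge Nodes, the Python sketch below builds simplified versions of each for a single pack. The talk only names the nodes, so the exact fields, the bucket count of 8 and the function names are assumptions made for the example. Both lookups may report false positives but never false negatives, which is what makes them safe for ruling packs out.

```python
def data_pack_node(pack):
    """Simplified Data Pack Node: basic aggregates for one numeric pack."""
    return {"count": len(pack), "min": min(pack), "max": max(pack), "sum": sum(pack)}


def numerical_histogram(pack, buckets=8):
    """Mark which value intervals between min and max are occupied."""
    lo, hi = min(pack), max(pack)
    width = (hi - lo) / buckets or 1  # avoid a zero width when all values match
    present = [False] * buckets
    for value in pack:
        present[min(int((value - lo) / width), buckets - 1)] = True
    return {"lo": lo, "hi": hi, "width": width, "present": present}


def histogram_may_contain(hist, value):
    """False means the value is definitely absent from the pack."""
    if value < hist["lo"] or value > hist["hi"]:
        return False
    index = min(int((value - hist["lo"]) / hist["width"]), len(hist["present"]) - 1)
    return hist["present"][index]


def character_map(pack):
    """Record which characters occur at each position of the strings."""
    longest = max(len(text) for text in pack)
    positions = [set() for _ in range(longest)]
    for text in pack:
        for i, ch in enumerate(text):
            positions[i].add(ch)
    return positions


def cmap_may_contain(cmap, text):
    """False means no string in the pack can equal or start with text."""
    if len(text) > len(cmap):
        return False
    return all(ch in cmap[i] for i, ch in enumerate(text))


ages = [23, 25, 31, 58, 60, 64]          # hypothetical numeric pack
cities = ["TORONTO", "OTTAWA", "TOKYO"]  # hypothetical character pack

print(data_pack_node(ages))
print(histogram_may_contain(numerical_histogram(ages), 40))   # definitely absent
print(cmap_may_contain(character_map(cities), "BOSTON"))      # definitely absent
```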

4. Optimizer
Q: How are my sales doing this year?
- Query received
- Optimizer iterates on the Knowledge Grid
- Each pass eliminates Data Packs
- If any Data Packs are needed to resolve the query, only those are decompressed
[Diagram: the query passes through the Knowledge Grid and, where needed, the compressed Data Packs to produce the report; labels: Type I Result Set, Type II Result Set, 1%]
When a query comes in, it's run through the Infobright optimizer, which looks at the Knowledge Grid first to resolve the query. Because the Knowledge Grid stores aggregate information about the data from the Data Packs, the query can often be answered using only the Knowledge Grid, without having to look at the data itself. The Knowledge Grid also stores information about the range of values within each Data Pack, so in cases where more detail is required, the Knowledge Grid will narrow the field of search to only those Data Packs relevant to the query, and then only decompress and read these relevant Data Packs.
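A simplified Python sketch of this flow for the sales question might look like the following. It is an illustration under stated assumptions, not the actual optimizer: the pack layout, field names and sales figures are hypothetical, and a plain list of rows stands in for a compressed Data Pack.

```python
def build_pack(years, amounts):
    """Create one hypothetical pack with its load-time metadata."""
    return {
        "year_min": min(years),
        "year_max": max(years),
        "amount_sum": sum(amounts),
        "rows": list(zip(years, amounts)),  # stands in for the compressed pack
    }


def sales_for_year(packs, year):
    """Total sales for one year, opening a pack only when metadata is not enough."""
    total = 0
    decompressed = 0
    for pack in packs:
        if pack["year_max"] < year or pack["year_min"] > year:
            continue  # irrelevant: no row in this pack can match
        if pack["year_min"] == pack["year_max"] == year:
            total += pack["amount_sum"]  # every row matches: use the stored sum
            continue
        decompressed += 1  # suspect: some rows may match, so read the pack
        total += sum(amount for y, amount in pack["rows"] if y == year)
    return total, decompressed


packs = [
    build_pack([2008, 2008, 2009], [10, 20, 30]),
    build_pack([2009, 2010, 2010], [5, 7, 9]),
    build_pack([2010, 2010, 2010], [1, 2, 3]),
]

total, opened = sales_for_year(packs, 2010)
print(total, "in sales,", opened, "pack(s) decompressed")
```

Of the three packs, one is skipped outright, one is answered from its stored sum, and only the pack that straddles two years has to be read.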

How the Knowledge Grid Works
SELECT count(*) FROM employees WHERE salary > 50000 AND age < 65 AND job = 'Shipping' AND city = 'TORONTO';
[Diagram: columns salary, age, job and city, divided into Data Packs for rows 1 to 65,536, 65,537 to 131,072, 131,073 to …; each pack is marked Completely Irrelevant, Suspect, or All values match]
- Find the Data Packs with salary > 50000
- Find the Data Packs that contain age < 65
- Find the Data Packs that have job = 'Shipping'
- Find the Data Packs that have city = 'TORONTO'
- Now we eliminate all rows that have been flagged as irrelevant (all packs ignored)
- Finally we have identified the Data Pack that needs to be decompressed (only this pack will be decompressed)
Let's see how the Knowledge Grid is used to evaluate a simple query:
SELECT DISTINCT CITY FROM employees WHERE salary > 50000 AND age < 65 AND job = 'Shipping' AND city = 'TORONTO';
First, looking at the constraint salary > 50000, we can eliminate a number of Data Packs using the min-max values in the Data Pack Nodes: in this case, 3 Data Packs were found to have no values greater than 50000, and 1 Data Pack was found to have all values of salary > 50000. The constraint age < 65 flags 2 Data Packs as suspect and 2 Data Packs as having all values of age < 65. The constraint job = 'Shipping' flags 2 Data Packs as suspect. The constraint city = 'TORONTO' eliminates 2 more Data Packs and flags 2 as suspect. Now we eliminate all rows that have been flagged as irrelevant, and we have only 1 Data Pack left to decompress. Actually, if the query had been something like count(*), then no decompression would have been needed at all. The Knowledge Grid would have been able to answer the question directly.
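The elimination walk-through above can be sketched in Python as follows. This is a simplified model under stated assumptions rather than Infobright's algorithm: the per-pack metadata is invented, numeric columns use only min and max, and string columns use a set of distinct values as a stand-in for whatever the Character Map actually stores.

```python
IRRELEVANT, SUSPECT, ALL_MATCH = "irrelevant", "suspect", "all match"


def classify_gt(stats, threshold):
    """Classify one pack for the predicate column > threshold."""
    if stats["max"] <= threshold:
        return IRRELEVANT
    if stats["min"] > threshold:
        return ALL_MATCH
    return SUSPECT


def classify_lt(stats, threshold):
    """Classify one pack for the predicate column < threshold."""
    if stats["min"] >= threshold:
        return IRRELEVANT
    if stats["max"] < threshold:
        return ALL_MATCH
    return SUSPECT


def classify_eq(stats, value):
    """Classify one pack for the predicate column = value."""
    if value not in stats["values"]:
        return IRRELEVANT
    if stats["values"] == {value}:
        return ALL_MATCH
    return SUSPECT


def combine(verdicts):
    """AND the per-column verdicts for one range of rows."""
    if IRRELEVANT in verdicts:
        return IRRELEVANT
    if all(v == ALL_MATCH for v in verdicts):
        return ALL_MATCH
    return SUSPECT


def plan(row_ranges):
    """Verdict per row range for the example WHERE clause."""
    return [
        combine([
            classify_gt(r["salary"], 50_000),
            classify_lt(r["age"], 65),
            classify_eq(r["job"], "Shipping"),
            classify_eq(r["city"], "TORONTO"),
        ])
        for r in row_ranges
    ]


# Hypothetical metadata for four row ranges of the employees table.
row_ranges = [
    {"salary": {"min": 20_000, "max": 45_000}, "age": {"min": 19, "max": 63},
     "job": {"values": {"Shipping"}}, "city": {"values": {"TORONTO"}}},
    {"salary": {"min": 30_000, "max": 90_000}, "age": {"min": 20, "max": 60},
     "job": {"values": {"Shipping", "Sales"}}, "city": {"values": {"TORONTO", "OTTAWA"}}},
    {"salary": {"min": 60_000, "max": 80_000}, "age": {"min": 30, "max": 70},
     "job": {"values": {"Sales"}}, "city": {"values": {"TORONTO"}}},
    {"salary": {"min": 55_000, "max": 70_000}, "age": {"min": 25, "max": 50},
     "job": {"values": {"Shipping"}}, "city": {"values": {"VANCOUVER"}}},
]

verdicts = plan(row_ranges)
to_decompress = [i for i, v in enumerate(verdicts) if v == SUSPECT]
print(verdicts)
print("row ranges to decompress:", to_decompress)
```

In this model a row range needs decompression only when it is suspect. A range where every predicate is an "all match" can contribute its stored row count to a count(*) without being opened.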

Examples of Performance Statistics
- Fast query response with no tuning
- Fast and consistent data load speed as the database grows: up to 300GB/hour on a single server
Customer's Test | Row-based RDBMS | Infobright
Analytic queries | 2+ hours | < 10 seconds
Query (AND - Left Join) | 26.4 secs | .02 seconds
Oracle query set | 10 secs to 15 mins | 0.43 to 22 seconds
BI report | 7 hours | 17 seconds
Data load | 11 hours | 11 minutes
This data is all from customer testing, not Infobright's. Performance always depends on many factors, including the specific query, database size, datatype, hardware configuration, etc. So, mileage will vary!
"Infobright is 10 times faster than [Product X] when the SQL statement is more complex than a simple SELECT * FROM some_table. With some more complex SQL statements, Infobright proved to be more than 50 times faster than [Product X]." (from benchmark testing done by a leading BI vendor)

Real Life Example: Bango
Bango's Need
- Leader in mobile billing and mobile analytics services, SaaS model
- Received a contract with a large media provider: 150 million rows per month, 450GB per month on the existing SQL Server solution
- SQL Server could not support the required query performance
- Needed a database that could scale for much larger data sets, with fast query response
- Needed a fast-to-implement, low-maintenance, cost-effective solution
Infobright's Solution
- Reduced queries from minutes to seconds
- Reduced the size of one customer's database from 450GB to 10GB for one month of data
Query | SQL Server | Infobright
1 Month Report (5M events) | 11 min | 10 secs
1 Month Report (15M events) | 43 min | 23 secs
Complex Filter (10M events) | 29 min | 8 secs

Bear in Mind
The unique attributes of column orientation in Infobright are transparent to developers. The benefits are obvious and immediate to users.
- Infobright is a relational database
- Infobright observes and obeys SQL standards
- Infobright observes and obeys standards-based connectivity: design tools, development tools, administrative tools, query and reporting tools

Infobright Architected on MySQL
"The world's most popular open source database"
- Simple scalability path for MySQL users and OEMs
- No new management interface to learn
- Enables seamless connectivity to BI tools and MySQL drivers for ODBC, JDBC, C/C++, .NET, Perl, Python, PHP, Ruby, Tcl, etc.
Infobright leverages the connectors and interoperability of MySQL standards. We are built within the MySQL architecture. This integration with MySQL allows our solution to tie in seamlessly with any ETL and BI tool that is compatible with MySQL. For current MySQL users who are looking for a highly scalable analytic database, Infobright is ideal: it scales from hundreds of gigabytes to 50 terabytes and more, has the ease of use MySQL users expect, and uses the familiar MySQL administrative interface.

Infobright Development
When developing applications, you can use the standard set of connectors and APIs supplied by MySQL to interact with Infobright.
- Connectors: Connector/ODBC, Connector/NET, Connector/J, Connector/MXJ, Connector/C++, Connector/C
- APIs: C API, PHP API, Perl API, C++ API, Python API, Ruby APIs
Since Infobright supports most of the data types available for a typical MySQL database implementation, if you are migrating an existing data warehouse or BI application, you can simply drop your existing data model into Infobright. On our community web page we have contributed software that helps customers migrate to Infobright from either MySQL or other database platforms such as SQL Server. Data types that we are still working on include BLOBs and Auto Increment. Customers that have specific needs for these data types have found some success by including MyISAM database objects in their implementation.
Note: API calls are restricted to the functional support of the Brighthouse engine (e.g. mysql_stmt_insert_id).
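Before dropping an existing data model into Infobright, it can help to scan it for the two features the notes list as still in progress. The Python sketch below is one hypothetical way to do that; the helper names and the sample table are invented. The ENGINE=BRIGHTHOUSE clause is an assumption based on the engine name mentioned in the note and should be checked against the documentation for your release.

```python
def unsupported_features(columns):
    """Return (column, reason) pairs for features the notes list as in progress."""
    problems = []
    for name, sql_type, attributes in columns:
        if "BLOB" in sql_type.upper():
            problems.append((name, "BLOB type"))
        if "AUTO_INCREMENT" in attributes.upper():
            problems.append((name, "AUTO_INCREMENT"))
    return problems


def create_table_sql(table, columns, engine="BRIGHTHOUSE"):
    """Build a CREATE TABLE statement for the columns that passed the check.

    The engine name is an assumption taken from the 'Brighthouse engine'
    mentioned in the notes.
    """
    flagged = {name for name, _ in unsupported_features(columns)}
    kept = [(name, sql_type) for name, sql_type, _ in columns if name not in flagged]
    body = ",\n".join(f"  {name} {sql_type}" for name, sql_type in kept)
    return f"CREATE TABLE {table} (\n{body}\n) ENGINE={engine};"


# Hypothetical MySQL table definition: (name, type, extra attributes).
employees = [
    ("id", "INT", "AUTO_INCREMENT"),
    ("job", "VARCHAR(32)", ""),
    ("photo", "MEDIUMBLOB", ""),
    ("salary", "INT", ""),
]

print(unsupported_features(employees))
print(create_table_sql("employees", employees))
```

Flagged columns are candidates for the MyISAM workaround the notes mention, or for a redesign, for example replacing an auto-increment key with a value generated during ETL.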

Get Started
At infobright.org:
- Download ICE (Infobright Community Edition)
- Download an integrated virtual machine: ICE-Jaspersoft or ICE-Jaspersoft-Talend
- Join the forums and learn from the experts!
At infobright.com:
- Download a white paper from the Resource library
- Watch a product video
- Download a free trial of Infobright Enterprise Edition (IEE)