Data Virtualization Tutorial… Semijoin Optimization

Presentation transcript:

Data Virtualization Tutorial… Semijoin Optimization

Hello, and welcome to the Demoette series for Cisco Information Server, or CIS. In this Demoette, we discuss CIS’s Semijoin Query Optimization capability.

Agenda
- What is it and why does it matter?
- A basic demo
- Summary

Here is our agenda. We begin by defining semijoin optimization and outlining its importance for our customers. Next, we walk through a very basic demo of the semijoin optimization. Finally, we summarize the contents of this demoette.

Let’s begin by discussing what the semijoin optimization is, and why it’s important for our customers.

What is it? Semijoin Optimization
- One of several algorithms used by the CIS Query Engine
- Useful when:
  - Table A on Data Source X is small
  - Table B on Data Source Y is large
  - Tables A and B must be joined
- Select * from TableB where KEY_COL IN ('Key1', 'Key2', …, 'KeyZ')

The CIS query engine employs a number of very sophisticated optimization algorithms that enable efficient joins across disparate data sources. The semijoin is one of the most interesting and useful of these algorithms. Semijoins are useful when we want to join a small table from one physical data source with a large table from a different physical data source. Instead of fetching all rows from the large table, CIS takes the join keys from the small table and places them in a SQL IN clause, so that it fetches only those rows from the large table that are actually needed for the join.
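To make the idea concrete, here is a minimal sketch of the semijoin pattern in Python. It is not how CIS is implemented; it only imitates the behavior described above. Two in-memory SQLite databases stand in for the small and large data sources, and the function names, the sample rows, and the SalesOrderID column are assumptions made up for illustration.

```python
import sqlite3


def build_demo_sources():
    """Create two in-memory databases standing in for the two data sources.

    All table contents here are invented sample data.
    """
    small = sqlite3.connect(":memory:")  # plays the role of the small source
    small.execute("CREATE TABLE SalesPerson (SalesPersonID INTEGER)")
    small.executemany("INSERT INTO SalesPerson VALUES (?)",
                      [(274,), (275,), (276,)])

    large = sqlite3.connect(":memory:")  # plays the role of the large source
    large.execute("CREATE TABLE SalesOrderHeader "
                  "(SalesOrderID INTEGER, SalesPersonID INTEGER)")
    large.executemany("INSERT INTO SalesOrderHeader VALUES (?, ?)",
                      [(1, 274), (2, None), (3, 275),
                       (4, 999), (5, 274), (6, None)])
    return small, large


def semijoin_fetch(small, large):
    """Fetch from the large source only the rows whose key exists in the small one."""
    # Step 1: read the distinct join keys from the small table.
    keys = [row[0] for row in small.execute(
        "SELECT DISTINCT SalesPersonID FROM SalesPerson")]
    if not keys:
        return []  # nothing can match, so skip the large source entirely

    # Step 2: build an IN clause with one bind placeholder per key.
    placeholders = ", ".join("?" for _ in keys)
    sql = ("SELECT SalesOrderID, SalesPersonID FROM SalesOrderHeader "
           f"WHERE SalesPersonID IN ({placeholders}) "
           "ORDER BY SalesOrderID")

    # Step 3: only the matching rows leave the large source.
    return large.execute(sql, keys).fetchall()


small, large = build_demo_sources()
print(semijoin_fetch(small, large))  # [(1, 274), (3, 275), (5, 274)]
```

Orders with no sales person, or with a sales person that is not in the small table, are filtered out at the large source and never travel to the caller.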

Why does it matter? Semijoin Optimization
- Efficient use of physical data sources
- Efficient use of network
- Efficient join processing in CIS
- Demonstrates the power of Data Virtualization

The semijoin optimization is important to our customers for four reasons. First, it enables CIS to use underlying physical data sources in the most efficient manner. Queries against the large data source will use minimal resources, especially when the key columns are indexed. Second, the semijoin ensures that CIS is using the network as efficiently as possible. Unneeded rows are never moved from the physical data source to CIS. Third, the semijoin helps CIS perform its join logic as efficiently as possible, since unnecessary rows are not present. Finally, the semijoin algorithm helps customers and prospects understand the power of CIS. Prospective customers sometimes wonder if they can build their own ad hoc data virtualization capabilities. When they see the sophistication of the CIS semijoin, they appreciate the power that CIS offers.

Next, let’s walk through a very basic demo of the semijoin optimization.

Demo: Here is the business problem…
- CIS
- Sales People: 17 rows
- Sales Orders: 31,465 rows

Here is the business problem that we illustrate in this demo. Our data consumers need a join across two physical data sources. Sales representatives are stored in an Excel spreadsheet. There are only 17 sales people.

Sales Orders are stored in a SQL Server database, and there are about 31,000 of them. Not every order is tied to a sales representative. Of course, 31,000 rows does not qualify as a “big” table in a real production environment. However, the size difference between these two tables will serve for the purposes of this demo. Let’s see how the semijoin optimization can make this query more efficient.

Demo: Before you begin
- SQL Server 2012 AdventureWorks Database
- Excel Spreadsheet of Sales People
- Import .CAR file and re-bind data sources
- Gather full statistics on both data sources

Before you begin this demo, make sure you have installed all necessary resources. Install SQL Server 2012, and import the sample AdventureWorks database from Microsoft. Download the SalesPerson spreadsheet, and import the CAR file from the additional resources folder that accompanies this demo. Re-bind the data sources. Be sure to gather full statistics for both data sources, because the CIS query engine needs statistics in order to automatically choose the semijoin optimization. Complete setup instructions are found in the additional resources that accompany this demo.

Demo: Examine the SQL Server Data Source

Our CustomerOrders data source connects to a SQL Server 2012 database. We have introspected one table: SalesOrderHeader. We have also gathered statistics on this data source, so that CIS has all the information it needs to automatically choose the best join optimization for any given situation.

Demo: Examine the Excel Data Source

Our SalesPerson data source connects to an Excel spreadsheet. We have introspected one table: SalesPerson. Again, we gathered statistics on this data source, so that CIS has all the information it needs to automatically choose the best join optimization for any given situation.

Demo: Define a View that joins the data

Although it is not strictly necessary for this demo, we have defined Physical Views that map to the tables in the physical data sources. This is a best practice, because it decouples the data from the physical data source.

We then use these views to create a higher-level view. We join the views on the key field that represents the ID of the sales person.

Demo: Examine the execution plan

Now let’s take a look at the execution plan. Specifically, we want to look at the Join step. CIS knows that the left side of the join contains 17 rows. Therefore, the row count of the join will be somewhere between 0 and 31,465, which is the total number of rows on the right side of the join. Given these cardinalities, CIS will use a semijoin optimization at run time.

Let’s see how it works. We click Execute and Show Statistics. We want to look at the second Fetch node in the operation. This is the node that gets data from our larger table on SQL Server. Note that this Fetch returns only 3,806 rows. It does not send all 31,465 rows across the network to CIS.

Here is the reason why. CIS uses an IN clause to fetch only rows that match keys in the smaller table. Here is a better view of the IN clause:

WHERE [SalesOrderHeader].[SalesPersonID] IN (280.0,281.0,275.0,282.0,283.0,274.0,277.0,284.0,290.0,285.0,276.0,289.0,279.0,286.0,288.0,287.0,278.0)

It contains 17 values, which correspond to the key fields on our Excel data source. This ensures that we only retrieve rows from SQL Server that are guaranteed to be relevant for our Join operation.

The actual maximum number of values in the IN clause is a configurable setting in CIS. If the number of values needed exceeds this maximum, CIS can still perform a semijoin by using a technique called Partitioning. With Partitioning, CIS will generate multiple Fetches against the larger data source. Each Fetch will have an IN clause that contains a different subset of the required values.
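The Partitioning idea can be sketched in a few lines of Python. This is only an illustration of splitting a key list into several IN clauses, not the CIS implementation. The function names and the limit of 5 values per IN clause are assumptions chosen for the example; the 17 key values are the ones shown in the demo's WHERE clause.

```python
def partition_keys(keys, max_in_values):
    """Split the key list into consecutive chunks of at most max_in_values."""
    if max_in_values < 1:
        raise ValueError("max_in_values must be at least 1")
    return [keys[i:i + max_in_values]
            for i in range(0, len(keys), max_in_values)]


def partitioned_fetch_sql(keys, max_in_values,
                          table="SalesOrderHeader", column="SalesPersonID"):
    """Render one SELECT per chunk, each with its own IN clause.

    The SQL is rendered as text purely for display. Keys are coerced to int
    before formatting; real code should use bind parameters instead.
    """
    statements = []
    for chunk in partition_keys(keys, max_in_values):
        in_list = ", ".join(str(int(k)) for k in chunk)
        statements.append(
            f"SELECT * FROM {table} WHERE {column} IN ({in_list})")
    return statements


# The 17 sales person keys from the demo.
demo_keys = [280, 281, 275, 282, 283, 274, 277, 284, 290,
             285, 276, 289, 279, 286, 288, 287, 278]

# Hypothetical limit of 5 values per IN clause: 17 keys become 4 fetches.
for statement in partitioned_fetch_sql(demo_keys, max_in_values=5):
    print(statement)
```

Every key appears in exactly one chunk, so the union of the partitioned result sets is the same set of rows that a single large IN clause would return.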

Demo: Check the results

Finally, we can examine our results to see that we have indeed joined the data across the two physical data sources. Thanks to the semijoin optimization, we have been able to use our underlying data sources in the most efficient manner, especially if the larger data source has indexed the key fields we are using. Perhaps even more important, we have prevented unnecessary data from moving across the network between the physical data source and CIS. In addition, we have made CIS itself more efficient because it does not have to waste time discarding unwanted data. Our demo is complete.
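As a quick check on the network savings, the row counts reported in the demo can be compared directly. The snippet below uses only the two figures from the execution plan (31,465 total order rows and 3,806 fetched rows); the variable names are ours.

```python
TOTAL_ORDER_ROWS = 31_465  # rows in SalesOrderHeader, per the demo
FETCHED_ROWS = 3_806       # rows returned by the semijoin Fetch, per the demo

rows_avoided = TOTAL_ORDER_ROWS - FETCHED_ROWS
fraction_fetched = FETCHED_ROWS / TOTAL_ORDER_ROWS

print(f"Rows never sent over the network: {rows_avoided}")
print(f"Share of the large table fetched: {fraction_fetched:.1%}")
```

In this demo, roughly one row in eight crosses the network; the rest are filtered out at SQL Server.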

Let’s summarize what we have seen in this presentation.

Summary
- One of several algorithms used by the CIS Query Engine
- Useful when:
  - Table A on Data Source X is small
  - Table B on Data Source Y is large
  - Tables A and B must be joined
- Select * from TableB where KEY_COL IN ('Key1', 'Key2', …, 'KeyZ')
- Efficient use of physical data sources
- Efficient use of network
- Efficient join processing in CIS
- Demonstrates the power of Data Virtualization

The CIS query engine employs a number of very sophisticated optimization algorithms that enable efficient joins across disparate data sources. The semijoin is one of the most interesting and useful of these algorithms. Semijoins are useful when we want to join a small table from one physical data source with a large table from a different physical data source. Instead of fetching all rows from the large table, CIS uses the SQL IN clause to fetch only those rows that are actually needed for the join.

The semijoin optimization is important to our customers for four reasons. First, it enables CIS to use underlying physical data sources in the most efficient manner. Queries against the large data source will use minimal resources, especially when the key columns are indexed. Second, the semijoin ensures that CIS is using the network as efficiently as possible. Unneeded rows are never moved from the physical data source to CIS. Third, the semijoin helps CIS perform its join logic as efficiently as possible, since unnecessary rows are not present. Finally, the semijoin algorithm helps customers and prospects understand the power of CIS. Prospective customers sometimes wonder if they can build their own ad hoc data virtualization capabilities. When they see the sophistication of the CIS semijoin, they appreciate the power that CIS offers. Thank you.
