Data Virtualization Tutorial… Semijoin Optimization

Hello, and welcome to the Demoette series for Cisco Information Server, or CIS. In this Demoette, we discuss CIS’s Semijoin Query Optimization capability.

Agenda
What is it and why does it matter?
A basic demo
Summary

Here is our agenda. We begin by defining semijoin optimization and outlining its importance for our customers. Next we walk through a very basic demo of using the semijoin optimization. Finally, we summarize the contents of this demoette.

Let’s begin by discussing what the semijoin optimization is, and why it’s important for our customers.

What is it? Semijoin Optimization
One of several algorithms used by CIS Query Engine
Useful when: Table A on Data Source X is small; Table B on Data Source Y is large; Tables A and B must be joined
Select * from TableB where KEY_COL IN ('Key1', 'Key2', …, 'KeyZ')

The CIS query engine employs a number of very sophisticated optimization algorithms that enable efficient joins across disparate data sources. The semijoin is one of the most interesting and useful of these algorithms. Semijoins are useful when we want to join a small table from one physical data source with a large table from a different physical data source. Instead of fetching all rows from the large table, CIS uses the SQL IN clause to fetch only those rows that are actually needed for the join.
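The idea can be sketched in a few lines of Python. This is only an illustration of the general semijoin technique, not CIS code: two in-memory SQLite databases stand in for the two physical data sources, and the table names, column names, and sample rows are all assumptions made for the example.

```python
import sqlite3


def make_sources():
    """Build two separate in-memory databases standing in for two data sources."""
    small = sqlite3.connect(":memory:")
    small.execute("CREATE TABLE sales_person (person_id INTEGER, name TEXT)")
    small.executemany(
        "INSERT INTO sales_person VALUES (?, ?)",
        [(1, "Ann"), (2, "Bob"), (3, "Cy")],
    )

    large = sqlite3.connect(":memory:")
    large.execute("CREATE TABLE sales_order (order_id INTEGER, person_id INTEGER)")
    # 1,000 orders spread over ids 1..19; every 20th order has no sales person.
    large.executemany(
        "INSERT INTO sales_order VALUES (?, ?)",
        [(i, (i % 20) or None) for i in range(1000)],
    )
    return small, large


def semijoin(small, large):
    """Join the two sources while fetching only the matching rows of the large one."""
    # Step 1: read the (few) rows of the small table and collect its join keys.
    people = small.execute("SELECT person_id, name FROM sales_person").fetchall()
    keys = sorted({person_id for person_id, _ in people})
    if not keys:
        return [], 0

    # Step 2: push the keys down to the large source as an IN list.
    placeholders = ", ".join("?" for _ in keys)
    orders = large.execute(
        "SELECT order_id, person_id FROM sales_order "
        f"WHERE person_id IN ({placeholders})",
        keys,
    ).fetchall()

    # Step 3: finish the join locally on the reduced row set.
    names = dict(people)
    joined = [(order_id, names[person_id]) for order_id, person_id in orders]
    return joined, len(orders)


if __name__ == "__main__":
    small_db, large_db = make_sources()
    rows, fetched = semijoin(small_db, large_db)
    print(f"fetched {fetched} of 1000 order rows, produced {len(rows)} joined rows")
```

Only the rows whose key appears in the small table ever leave the large source; the remaining rows are filtered where they are stored.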

Why does it matter? Semijoin Optimization
Efficient use of physical data sources
Efficient use of network
Efficient join processing in CIS
Demonstrates the power of Data Virtualization

The semijoin optimization is important to our customers for four reasons. First, it enables CIS to use underlying physical data sources in the most efficient manner. Queries against the large data source will use minimal resources, especially when the key columns are indexed. Second, the semijoin ensures that CIS is using the network as efficiently as possible. Unneeded rows are never moved from the physical data source to CIS. Third, the semijoin helps CIS perform its join logic as efficiently as possible, since unnecessary rows are not present. Finally, the semijoin algorithm helps customers and prospects understand the power of CIS. Prospective customers sometimes wonder if they can build their own ad hoc data virtualization capabilities. When they see the sophistication of the CIS semijoin, they appreciate the power that CIS offers.

Next, let’s walk through a very basic demo of the semijoin optimization.

Demo: Here is the business problem…
CIS
Sales People: 17 rows
Sales Orders: 31,465 rows

Here is the business problem that we illustrate in this demo. Our data consumers need a join across two physical data sources. Sales representatives are stored in an Excel spreadsheet. There are only 17 sales people. <CLICK> Sales Orders are stored in a SQL Server database, and there are about 31,000 of them. Not every order is tied to a sales representative. Of course, 31,000 rows does not qualify as a “big” table in a real production environment. However, the size difference between these two tables will serve for the purposes of this demo. Let’s see how the semijoin optimization can make this query more efficient.

Demo: Before you begin
SQL Server 2012 AdventureWorks Database
Excel Spreadsheet of Sales People
Import .CAR file and re-bind data sources
Gather full statistics on both data sources

Before you begin this demo, make sure you have installed all necessary resources. Install SQL Server 2012, and import the sample AdventureWorks database from Microsoft. Download the SalesPerson spreadsheet, and import the CAR file from the additional resources folder that accompanies this demo. Re-bind the data sources. Be sure to gather full statistics for both data sources, because the CIS query engine needs statistics in order to automatically choose the semijoin optimization. Complete setup instructions are found in the additional resources that accompany this demo.
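To see why statistics matter, consider the toy heuristic below. It is not the CIS cost model, which this demo does not describe; the `TableStats` type, the `choose_join_strategy` function, the strategy names, and the `min_ratio` threshold are all hypothetical. The sketch only shows that an optimizer cannot prefer a semijoin unless it knows the row counts on both sides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TableStats:
    name: str
    row_count: Optional[int]  # None means statistics were never gathered


def choose_join_strategy(left: TableStats, right: TableStats,
                         min_ratio: float = 10.0) -> str:
    """Return "semijoin" only when statistics show one side is much smaller.

    Hypothetical rule for illustration: without row counts on both sides the
    size difference is unknown, so the function falls back to "default".
    """
    if left.row_count is None or right.row_count is None:
        return "default"
    small = min(left.row_count, right.row_count)
    large = max(left.row_count, right.row_count)
    if small > 0 and large / small >= min_ratio:
        return "semijoin"
    return "default"


if __name__ == "__main__":
    people = TableStats("SalesPerson", 17)
    orders = TableStats("SalesOrderHeader", 31_465)
    print(choose_join_strategy(people, orders))
    print(choose_join_strategy(people, TableStats("SalesOrderHeader", None)))
```

With the row counts used in this demo (17 and 31,465), the heuristic picks the semijoin; with a missing row count it cannot.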

Demo: Examine the SQL Server Data Source
Our CustomerOrders data source connects to a SQL Server 2012 database. <CLICK> We have introspected one table: SalesOrderHeader. <CLICK> We have also gathered statistics on this data source, so that CIS has all the information it needs to automatically choose the best join optimization for any given situation.

Demo: Examine the Excel Data Source
Our SalesPerson data source connects to an Excel spreadsheet. <CLICK> We have introspected one table: SalesPerson. <CLICK> Again, we gathered statistics on this data source, so that CIS has all the information it needs to automatically choose the best join optimization for any given situation.

Demo: Define a View that joins the data
Although it is not strictly necessary for this demo, we have defined Physical Views that map to the tables in the physical data sources. This is a best practice, because it decouples the data from the physical data source. <CLICK> We then use these views to create a higher-level view. We join the views on the key field that represents the ID of the sales person.

Demo: Examine the execution plan
Now let’s take a look at the execution plan. Specifically, we want to look at the Join step. <CLICK> CIS knows that the left side of the join contains 17 rows. <CLICK> Therefore, the row count of the join will be somewhere between 0 and 31,465, which is the total number of rows on the right side of the join. <CLICK> Given these cardinalities, CIS will use a semijoin optimization at run time. <CLICK>
Let’s see how it works. We click Execute and Show Statistics. <CLICK> We want to look at the second Fetch node in the operation. This is the node that gets data from our larger table on SQL Server. Note that this Fetch only returns 3,806 rows. It does not send the entire 31,465 rows across the network to CIS. <CLICK>
Here is the reason why. CIS uses an IN clause to fetch only rows that match keys in the smaller table. <CLICK> Here is a better view of the IN clause. It contains 17 values, which correspond to the key fields on our Excel data source. This ensures that we only retrieve rows from SQL Server that are guaranteed to be relevant for our Join operation.
The actual maximum number of values in the IN clause is a configurable setting in CIS. If the number of values needed exceeds this maximum, CIS can still perform a semijoin by using a technique called Partitioning. With Partitioning, CIS will generate multiple Fetches against the larger data source. Each Fetch will have an IN clause that contains a different subset of the required values.
WHERE [SalesOrderHeader].[SalesPersonID] IN (280.0,281.0,275.0,282.0,283.0,274.0,277.0,284.0,290.0,285.0,276.0,289.0,279.0,286.0,288.0,287.0,278.0)
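Partitioning can be illustrated with a short Python sketch. It shows the general idea of splitting a key list into several IN clauses and makes no claim about how CIS implements it. The helper names `partition_keys` and `build_fetches` are hypothetical, as is the limit of 5 values used here in place of the real configurable setting; the SQL text is a simplified stand-in for the Fetch that CIS generates.

```python
from typing import Iterator, List, Sequence


def partition_keys(keys: Sequence[int], max_in_values: int) -> Iterator[List[int]]:
    """Yield consecutive chunks of `keys`, each no longer than `max_in_values`."""
    if max_in_values < 1:
        raise ValueError("max_in_values must be at least 1")
    for start in range(0, len(keys), max_in_values):
        yield list(keys[start:start + max_in_values])


def build_fetches(keys: Sequence[int], max_in_values: int) -> List[str]:
    """Build one SELECT per chunk; together the IN lists cover every key once."""
    return [
        "SELECT * FROM SalesOrderHeader WHERE SalesPersonID IN ("
        + ", ".join(str(key) for key in chunk)
        + ")"
        for chunk in partition_keys(keys, max_in_values)
    ]


if __name__ == "__main__":
    # The 17 SalesPersonID values from the demo's IN clause are 274 through 290.
    sales_person_ids = list(range(274, 291))
    for statement in build_fetches(sales_person_ids, max_in_values=5):
        print(statement)
```

With a limit of 5, the 17 keys are split into four Fetches of 5, 5, 5, and 2 values. The union of their results is the same set of rows a single 17-value IN clause would return.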

Demo: Check the results
Finally, we can examine our results to see that we have indeed joined the data across the two physical data sources. Thanks to the semijoin optimization, we have been able to use our underlying data sources in the most efficient manner, especially if the larger data source has indexed the key fields we are using. Perhaps even more important, we have prevented unnecessary data from moving across the network between the physical data source and CIS. We have also made CIS itself more efficient because it does not have to waste time discarding unwanted data. Our demo is complete.
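The network saving in this demo can be quantified from the two row counts shown in the execution statistics. The short calculation below uses only those figures (31,465 rows in the table, 3,806 rows returned by the Fetch); the variable names are ours.

```python
total_rows = 31_465    # rows in SalesOrderHeader on SQL Server
fetched_rows = 3_806   # rows returned by the semijoin Fetch

# Rows that stayed on SQL Server instead of crossing the network.
rows_avoided = total_rows - fetched_rows
# Share of the large table that was actually transferred to CIS.
fraction_fetched = fetched_rows / total_rows

print(f"{rows_avoided:,} rows were never sent to CIS")
print(f"{fraction_fetched:.1%} of the large table crossed the network")
```

Roughly one row in eight was transferred; the rest never left SQL Server.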

Let’s summarize what we have seen in this presentation.

Summary
One of several algorithms used by CIS Query Engine
Useful when: Table A on Data Source X is small; Table B on Data Source Y is large; Tables A and B must be joined
Select * from TableB where KEY_COL IN ('Key1', 'Key2', …, 'KeyZ')
Efficient use of physical data sources
Efficient use of network
Efficient join processing in CIS
Demonstrates the power of Data Virtualization

The CIS query engine employs a number of very sophisticated optimization algorithms that enable efficient joins across disparate data sources. The semijoin is one of the most interesting and useful of these algorithms. Semijoins are useful when we want to join a small table from one physical data source with a large table from a different physical data source. Instead of fetching all rows from the large table, CIS uses the SQL IN clause to fetch only those rows that are actually needed for the join. <CLICK>
The semijoin optimization is important to our customers for four reasons. First, it enables CIS to use underlying physical data sources in the most efficient manner. Queries against the large data source will use minimal resources, especially when the key columns are indexed. Second, the semijoin ensures that CIS is using the network as efficiently as possible. Unneeded rows are never moved from the physical data source to CIS. Third, the semijoin helps CIS perform its join logic as efficiently as possible, since unnecessary rows are not present. Finally, the semijoin algorithm helps customers and prospects understand the power of CIS. Prospective customers sometimes wonder if they can build their own ad hoc data virtualization capabilities. When they see the sophistication of the CIS semijoin, they appreciate the power that CIS offers. Thank you.

TOMORROW starts here.
