Published by Kelly Underwood. Modified over 8 years ago.
1
High Performance Statistical Queries
2
Sponsors
3
Agenda
- Introduction
- Descriptive Statistics
- Linear dependencies
  - Continuous variables
  - Discrete variables
  - Discrete and continuous variables
- Definite Integration
- Moving Averages
4
Why Statistical Queries?
- SQL Server lacks support for statistics
- Statistics are useful in many cases
  - Ad-hoc analysis
  - Advanced reporting
  - Data overview
- Show the best of the T-SQL improvements in SQL Server 2012
5
Big Role: Data Scientist
- Data Scientist: The Sexiest Job of the 21st Century (Harvard Business Review)
- A high-ranking professional with the training and curiosity to make discoveries in the world of structured and big data
- A solid foundation, typically in computer science and applications, modeling, statistics, analytics and math
- A data scientist explores and examines data from multiple disparate sources
- Exploring, asking questions, doing what-if analysis, questioning existing assumptions and processes
- In short: programming and statistics!
6
Performance
- Main issue: complex calculations that need other statistics
  - E.g., StDev uses Avg in its formula
- Goal: calculate everything with a minimal number of passes through the data
- Additionally improve performance with:
  - (Covering) nonclustered indexes
  - Columnstore index
  - Sampling!
7
Minimizing Passes through the Data
- How to achieve the minimal number of passes through the data?
  - With the new SQL 2012 window functions
  - Rearrange formulas: use mathematical knowledge
  - Use creativity
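As an illustration of "rearrange formulas", here is a minimal Python sketch (not the presenter's T-SQL) that computes the mean, sample variance and standard deviation in a single pass. It relies on the identity SUM((x - mean)^2) = SUM(x^2) - n * mean^2. The function name and the sample data are assumptions made for this example.

```python
import math

def one_pass_stats(values):
    """Return (n, mean, sample variance, sample stdev) from one pass over values."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:              # the only pass through the data
        n += 1
        total += x
        total_sq += x * x
    mean = total / n
    # SUM((x - mean)^2) rearranged as SUM(x^2) - n * mean^2
    var = (total_sq - n * mean * mean) / (n - 1)
    return n, mean, var, math.sqrt(var)

# Hypothetical sample data
n, mean, var, stdev = one_pass_stats([2, 4, 4, 4, 5, 5, 7, 9])
print(n, mean, var, stdev)
```

The rearranged form needs only COUNT, SUM(x) and SUM(x^2), so it maps onto plain aggregates. It can lose precision when the mean is large relative to the spread, so treat it as a sketch rather than a numerically robust routine.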
8
Frequencies
- Frequency tables and graphs are the basic representation of discrete variables
- Value, absolute frequency, absolute percentage, cumulative frequency, cumulative percentage and histogram

Cars  AbsFreq  CumFreq  AbsPerc  CumPerc  Histogram
0        4238     4238       23       23  ***********************
1        4883     9121       26       49  **************************
2        6457    15578       35       84  ***********************************
3        1645    17223        9       93  *********
4        1261    18484        7      100  *******
9
Solution
- Pre-SQL 2012: calculate absolute numbers, then use a non-equi self-join for cumulative (running) numbers
- SQL 2012: calculate absolute numbers, then use aggregate functions with framing and order
- SQL 2012 with creativity: use window analytic functions
  - PERCENT_RANK calculates the relative rank of a row within a group of rows, as a fraction between 0 and 1
  - CUME_DIST calculates the cumulative distribution of a value
  - CUME_DIST - PERCENT_RANK for the last value in a group approximately equals the absolute percentage of that value
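The "absolute numbers, then a running total" approach can be sketched in Python as follows. This is only an illustration of the logic, not the T-SQL from the talk; the absolute frequencies are the ones shown on the Frequencies slide, and the function name is an assumption.

```python
from itertools import accumulate

# Absolute frequencies per number of cars, taken from the Frequencies slide.
abs_freq = {0: 4238, 1: 4883, 2: 6457, 3: 1645, 4: 1261}

def frequency_table(freq):
    """Build rows of (value, AbsFreq, CumFreq, AbsPerc, CumPerc, Histogram)."""
    values = sorted(freq)
    total = sum(freq.values())
    # Running total in value order, like an aggregate with order and framing.
    running = accumulate(freq[v] for v in values)
    rows = []
    for value, cum in zip(values, running):
        abs_perc = round(100 * freq[value] / total)
        cum_perc = round(100 * cum / total)
        rows.append((value, freq[value], cum, abs_perc, cum_perc, "*" * abs_perc))
    return rows

table = frequency_table(abs_freq)
for row in table:
    print(*row)
```

The running total plays the role of a windowed SUM ordered by value; the histogram is just the rounded absolute percentage rendered as asterisks.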
10
Centers
- Center of a distribution
  - The mode is the most common value in the distribution
  - The median is the value that splits the distribution into two halves
  - The arithmetic mean, or the average, is the most common measure for the center of the distribution
- Comparing mode, median and mean gives information about the skewness
11
Solution
- For mode and mean, use standard aggregate functions
  - For mode, also use TOP 1 WITH TIES
- Many solutions for median
  - SQL 2012: use the PERCENTILE_CONT or PERCENTILE_DISC window analytic functions with the DISTINCT operator
  - Note: faster solutions exist
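For comparison, the three centers can be computed with Python's standard `statistics` module. This is a small sketch on hypothetical data, not the T-SQL solution from the talk; `multimode` returns every tied mode, which is roughly what TOP 1 WITH TIES achieves.

```python
import statistics

# Hypothetical sample: number of cars owned per customer.
cars = [0, 1, 1, 2, 2, 7]

modes = statistics.multimode(cars)   # all most-common values, ties included
median = statistics.median(cars)     # middle value; averages the two middle rows
mean = statistics.fmean(cars)        # arithmetic mean

print(modes, median, mean)
```

Here the mean lies above the median, which hints at a tail to the right in this made-up sample.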
12
Spread
- Range = maximal value - minimal value
- Inter-Quartile Range (IQR) = upper quartile - lower quartile
- Degrees of freedom: only (n - 1) pieces of information help us calculate the spread
- Variance (Var) = (1 / (n - 1)) * SUM((X_i - Mean(X))^2)
- If the sample (n cases) is big, we can use n instead of n - 1 (variance for the population, VarP)
- Standard Deviation (StDev) = SQRT(Var)
- Relative Standard Deviation, or the Coefficient of Variation (CV) = StDev / Mean
13
Solution
- For range, variance and standard deviation, use standard aggregate functions
- Many solutions for IQR
  - SQL 2012: use the PERCENTILE_CONT or PERCENTILE_DISC window analytic functions with the DISTINCT operator
  - Note: faster solutions exist
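The spread measures above can be sketched in Python. The `percentile_cont` helper below is a hypothetical function written for this example; it interpolates linearly between neighboring rows, in the spirit of PERCENTILE_CONT, and the data is made up.

```python
import math
import statistics

def percentile_cont(values, p):
    """Linearly interpolated percentile for 0 <= p <= 1."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)      # fractional 0-based row position
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

# Hypothetical sample data
data = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(data) - min(data)
iqr = percentile_cont(data, 0.75) - percentile_cont(data, 0.25)
stdev = statistics.stdev(data)            # sample standard deviation, n - 1
cv = stdev / statistics.fmean(data)       # coefficient of variation

print(value_range, iqr, stdev, cv)
```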
14
Skewness and Kurtosis
- Skewness describes asymmetry in a random variable's probability distribution
- Kurtosis characterizes the relative peakedness or flatness of a distribution
15
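Solution
- Creativity: expand the 3rd and 4th powers of (current value - mean), so that only sums of powers of the values are needed
- Mathematics: multiplication is distributive over a sum, so constant factors such as the mean can be moved outside the SUM
- CLR aggregate functions can use the same algorithm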
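A minimal sketch of the expansion idea in Python: the sums of cubed and fourth-power deviations are rebuilt from the raw sums of x, x^2, x^3 and x^4, so the data is read only once. The population (n-denominator) definitions of skewness and excess kurtosis used here are an assumption for this example; the talk may use sample-corrected formulas.

```python
def skew_kurt_one_pass(values):
    """Population skewness and excess kurtosis from a single pass."""
    n = 0
    s1 = s2 = s3 = s4 = 0.0
    for x in values:                 # the only pass: accumulate power sums
        n += 1
        s1 += x
        s2 += x ** 2
        s3 += x ** 3
        s4 += x ** 4
    mu = s1 / n
    # Expanded SUM((x - mu)^k) for k = 2, 3, 4
    d2 = s2 - n * mu ** 2
    d3 = s3 - 3 * mu * s2 + 3 * mu ** 2 * s1 - n * mu ** 3
    d4 = s4 - 4 * mu * s3 + 6 * mu ** 2 * s2 - 4 * mu ** 3 * s1 + n * mu ** 4
    m2, m3, m4 = d2 / n, d3 / n, d4 / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skewness, excess_kurtosis

# Hypothetical sample data
skewness, kurtosis = skew_kurt_one_pass([2, 4, 4, 4, 5, 5, 7, 9])
print(skewness, kurtosis)
```

The same accumulate-then-finish shape is what a user-defined aggregate would need: per-row accumulation of the power sums, and one final calculation at the end.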
16
Linear Dependencies: Continuous Variables
- The deviation of the actual from the expected probabilities is the Covariance: CoVar(X,Y) = SUM((X_i - Mean(X)) * (Y_i - Mean(Y)) * P(X_i,Y_i))
- Divide the covariance by the product of the standard deviations of both variables to get the Correlation Coefficient: Correl = CoVar(X,Y) / (StDev(X) * StDev(Y))
- The squared correlation coefficient is the Coefficient of Determination: CD = SQUARE(Correl)
17
Solution
- SQL 2012: use window aggregate functions
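The covariance, correlation coefficient and coefficient of determination can all be derived from five running sums. The Python sketch below assumes equal probabilities P(X_i,Y_i) = 1/n and population standard deviations; the function name and data are hypothetical.

```python
import math

def linear_dependency(xs, ys):
    """Return (covariance, correlation, coefficient of determination) in one pass."""
    n = 0
    sx = sy = sxy = sxx = syy = 0.0
    for x, y in zip(xs, ys):         # the only pass through the data
        n += 1
        sx += x
        sy += y
        sxy += x * y
        sxx += x * x
        syy += y * y
    mean_x, mean_y = sx / n, sy / n
    covar = sxy / n - mean_x * mean_y            # P(X_i, Y_i) = 1 / n assumed
    stdev_x = math.sqrt(sxx / n - mean_x ** 2)   # population StDev
    stdev_y = math.sqrt(syy / n - mean_y ** 2)
    correl = covar / (stdev_x * stdev_y)
    return covar, correl, correl ** 2

# Hypothetical paired observations
covar, correl, cd = linear_dependency([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(covar, correl, cd)
```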
18
Contingency Tables
- Contingency tables do not rely on numeric values
- The Null Hypothesis: there is no relationship between row and column frequencies
- So there should be no difference between observed (O) and expected (E) frequencies

Observed
Gender  Married  Single  Total
F          4745    4388   9133
M          5266    4085   9351
Total     10011    8473  18484

Expected
Gender  Married  Single  Total
F          4946    4187   9133
M          5065    4286   9351
Total     10011    8473  18484
19
Linear Dependencies: Discrete Variables
- Chi-Squared formula: ChiSq = SUM((O - E)^2 / E), summed over all cells of the contingency table
- For the Chi-Squared distribution there are already prepared tables with critical points for different degrees of freedom and for a specific confidence level
- Degrees of freedom = the product of the degrees of freedom for columns and rows, that is (rows - 1) * (columns - 1)
20
Chi-Squared Critical Points
21
Solution
- Problem: calculate expected frequencies
- SQL 2012 and creativity: use window aggregate functions
- Read the statistical significance from a pre-prepared table
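The expected frequencies follow from the row and column totals: E = row total * column total / grand total. The Python sketch below applies this to the observed Gender by marital status counts from the Contingency Tables slide; the function name is an assumption and the code illustrates the arithmetic only, not the T-SQL from the talk.

```python
def chi_squared(observed):
    """Return (expected table, chi-squared value, degrees of freedom)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Expected frequency of a cell under the null hypothesis
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, chi2, dof

# Observed counts from the slide: rows F, M; columns Married, Single.
observed = [[4745, 4388], [5266, 4085]]
expected, chi2, dof = chi_squared(observed)
print([[round(e) for e in row] for row in expected], chi2, dof)
```

With one degree of freedom, the usual 95% critical point is about 3.84, so a Chi-Squared value this far above it would lead to rejecting the null hypothesis for this table.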
22
Calculating Statistical Significance
- Why read from a pre-prepared table? Build your own table!
- Calculate the values for a distribution with a definite integral over the distribution function
  - E.g., the Gaussian distribution function
- The standard normal distribution has mean 0 and StDev 1
23
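Solution
- Mathematics: trapezoidal formula for approximate definite integration of f(x) from a to b: (b - a) * (f(a) + f(b)) / 2
- For multiple points: sum the trapezoid areas between each pair of neighboring points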
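A short Python sketch of trapezoidal integration applied to the standard normal density. The step count and the integration limits are arbitrary choices for this example, and the function names are hypothetical.

```python
import math

def normal_pdf(x, mean=0.0, stdev=1.0):
    """Gaussian density; defaults give the standard normal distribution."""
    z = (x - mean) / stdev
    return math.exp(-0.5 * z * z) / (stdev * math.sqrt(2 * math.pi))

def trapezoid(f, a, b, steps=1000):
    """Approximate the integral of f from a to b with equally wide trapezoids."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))      # end points count half
    for i in range(1, steps):
        total += f(a + i * h)        # interior points count fully
    return total * h

# Area under the standard normal curve between -1.96 and 1.96
area = trapezoid(normal_pdf, -1.96, 1.96)
print(area)
```

The area between -1.96 and 1.96 comes out close to 0.95, the familiar two-sided 95% interval, which is a quick sanity check for a home-made table of critical points.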
24
Linear Dependencies: Continuous and Discrete Variables
- ANOVA tests the variance in means between groups
- Null Hypothesis: the only variance comes from variance within and not between samples
- Mean squared deviation between the a groups, calculated from each group mean and the total mean
25
One-Way ANOVA and F-Test
- Mean squared deviation within the a groups, with n_i cases in each group
- F ratio = mean squared deviation between groups / mean squared deviation within groups
  - The bigger the F ratio, the more confidently you can reject the Null Hypothesis
  - Use F tables for critical points
26
Solution (1)
- Mathematics: understand the ANOVA formula
- SQL 2012: use aggregate and ranking window functions
- Creativity:
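The one-way ANOVA arithmetic can be sketched in Python from per-group counts, sums and sums of squares. This uses the standard textbook definitions (between-group degrees of freedom a - 1, within-group degrees of freedom N - a) and hypothetical data; it is not the query from the talk.

```python
def one_way_anova(groups):
    """Return (F ratio, df between, df within) for a list of groups."""
    a = len(groups)
    n_total = 0
    sum_total = 0.0
    ss_within = 0.0
    group_stats = []
    for g in groups:
        n_i = len(g)
        mean_i = sum(g) / n_i
        # Within-group sum of squares: SUM(x^2) - n_i * mean_i^2
        ss_within += sum(x * x for x in g) - n_i * mean_i ** 2
        group_stats.append((n_i, mean_i))
        n_total += n_i
        sum_total += sum(g)
    grand_mean = sum_total / n_total
    ss_between = sum(n_i * (mean_i - grand_mean) ** 2 for n_i, mean_i in group_stats)
    ms_between = ss_between / (a - 1)
    ms_within = ss_within / (n_total - a)
    return ms_between / ms_within, a - 1, n_total - a

# Hypothetical measurements in three groups
f_ratio, df_between, df_within = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_ratio, df_between, df_within)
```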
27
Solution (2)
- How to get the statistical significance for the F value?
- Not from a table, not with definite integration
- .NET function Chart.DataManipulator.Statistics.FDistribution
- Unfortunately, it lives in the System.Windows.Forms.DataVisualization.Charting namespace, which is not supported by SQL CLR
- Creativity: use a console application + SQLCMD mode
28
Moving Averages
- Simple moving averages (SMA)
- Weighted moving averages (WMA)
- Exponential moving averages (EMA)
29
Solution (1)
- SMA: SQL 2012 aggregate window functions
- WMA: SQL 2012 aggregate and analytic window functions
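A Python sketch of both averages over a trailing window. The window size, the weights, and the choice to renormalize the weights while the window is still filling up are all assumptions made for this example.

```python
def simple_moving_average(values, window):
    """Average of the current row and up to (window - 1) preceding rows."""
    out = []
    for i in range(len(values)):
        frame = values[max(0, i - window + 1): i + 1]
        out.append(sum(frame) / len(frame))
    return out

def weighted_moving_average(values, weights):
    """weights[0] applies to the current row, weights[1] to the previous row, etc."""
    out = []
    for i in range(len(values)):
        pairs = [(values[i - k], w) for k, w in enumerate(weights) if i - k >= 0]
        # Renormalize so that incomplete windows at the start still average sensibly.
        out.append(sum(v * w for v, w in pairs) / sum(w for _, w in pairs))
    return out

# Hypothetical series
series = [10, 20, 30, 40, 50]
sma = simple_moving_average(series, 3)
wma = weighted_moving_average(series, [0.7, 0.2, 0.1])
print(sma)
print(wma)
```

The SMA frame corresponds to an aggregate over the current row and a fixed number of preceding rows; the WMA additionally needs access to individual preceding values, which is where offset functions come in.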
30
Solution (2)
- EMA: SQL 2012 aggregate and analytic window functions
- EMA: creativity and mathematics; transform the EMA formula to include values only, not the previous EMA
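The transformation can be illustrated in Python. The recursive form uses the previous EMA; the expanded form uses only the original values, weighted by powers of (1 - alpha). Seeding the EMA with the first value and the smoothing factor 0.5 are assumptions for this sketch.

```python
def ema_recursive(values, alpha):
    """EMA_t = alpha * x_t + (1 - alpha) * EMA_(t-1), seeded with the first value."""
    ema = [values[0]]
    for x in values[1:]:
        ema.append(alpha * x + (1 - alpha) * ema[-1])
    return ema

def ema_expanded(values, alpha):
    """Same EMA written with the values only, without the previous EMA."""
    out = []
    for t in range(len(values)):
        # The seed value decays by (1 - alpha) for every later row.
        total = (1 - alpha) ** t * values[0]
        for k in range(t):           # k = 0 is the current row
            total += alpha * (1 - alpha) ** k * values[t - k]
        out.append(total)
    return out

# Hypothetical series
series = [10, 20, 30, 40, 50]
recursive = ema_recursive(series, 0.5)
expanded = ema_expanded(series, 0.5)
print(recursive)
print(expanded)
```

Both functions return the same series; the expanded form trades the row-by-row dependency for a weighted sum over all earlier rows, which is the shape a windowed aggregate can express.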
31
Solution (3)
- Would be much simpler with access to the interim calculations when SQL Server does window calculations
  - Not exposed yet; future versions?
- Other possibilities
  - Recursive CTE
  - Good old cursor!
32
Q & A Thank you!
33
Please give us feedback: http://speakerscore.com/sqlsaturday376 Thank you!
34
Sponsors