
1 High Performance Statistical Queries

2 Sponsors

3 Agenda
• Introduction
• Descriptive Statistics
• Linear dependencies
  • Continuous variables
  • Discrete variables
  • Discrete and continuous variables
• Definite Integration
• Moving Averages

4 Why Statistical Queries?
• SQL Server lacks support for statistics
• Statistics are useful in many cases
  • Ad-hoc analysis
  • Advanced reporting
  • Data overview
• Show the best of the T-SQL improvements in SQL Server 2012

5 Big Role: Data Scientist
• "Data Scientist: The Sexiest Job of the 21st Century" (Harvard Business Review)
• A high-ranking professional with the training and curiosity to make discoveries in the world of structured and big data
• A solid foundation, typically in computer science and applications, modeling, statistics, analytics and math
• A data scientist explores and examines data from multiple disparate sources
• Exploring, asking questions, doing what-if analysis, questioning existing assumptions and processes
• In short: programming and statistics!

6 Performance
• Main issue: complex calculations that need other statistics
  • E.g., StDev uses Avg in its formula
• Goal: calculate everything with a minimal number of passes through the data
• Additionally, improve performance with:
  • (Covering) nonclustered indexes
  • Columnstore index
  • Sampling!

7 Minimizing Passes through the Data
• How to achieve the minimal number of passes through the data?
  • With the new SQL 2012 window functions
  • Rearrange formulas: use mathematical knowledge
  • Use creativity

8 Frequencies
• Frequency tables and graphs are the basic representation of discrete variables
• Value, absolute frequency, absolute percentage, cumulative frequency, cumulative percentage and histogram

Cars  AbsFreq  CumFreq  AbsPerc  CumPerc  Histogram
0        4238     4238       23       23  ***********************
1        4883     9121       26       49  **************************
2        6457    15578       35       84  ***********************************
3        1645    17223        9       93  *********
4        1261    18484        7      100  *******
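The columns of this table can be derived from the raw values with one running total. The talk targets T-SQL; the following is only a minimal Python sketch, and the function name `frequency_table` and the one-star-per-percent histogram are assumptions made for illustration.

```python
from collections import Counter
from itertools import accumulate

def frequency_table(values):
    """Return rows of (value, abs_freq, cum_freq, abs_perc, cum_perc, histogram)."""
    counts = Counter(values)
    keys = sorted(counts)
    total = sum(counts.values())
    # The running total over the sorted distinct values is the cumulative frequency.
    running = accumulate(counts[k] for k in keys)
    rows = []
    for key, cum in zip(keys, running):
        abs_perc = round(100 * counts[key] / total)
        cum_perc = round(100 * cum / total)
        rows.append((key, counts[key], cum, abs_perc, cum_perc, "*" * abs_perc))
    return rows

# Rebuild the slide's input from its absolute frequencies.
cars = [0] * 4238 + [1] * 4883 + [2] * 6457 + [3] * 1645 + [4] * 1261
for row in frequency_table(cars):
    print(*row)
```

Run on the slide's counts, the sketch reproduces the AbsFreq, CumFreq, AbsPerc and CumPerc columns shown above.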

9 Solution
• Pre-SQL 2012: calculate absolute numbers and then use a non-equi self-join for cumulative (running) numbers
• SQL 2012: calculate absolute numbers and then use aggregate functions with framing and ordering
• SQL 2012 with creativity: use window analytic functions
  • PERCENT_RANK calculates the relative rank of a row within a group of rows, as a fraction between 0 and 1
  • CUME_DIST calculates the cumulative distribution of a value
  • CUME_DIST - PERCENT_RANK for a value approximately equals the absolute percentage of that value (the match is close when the row count is large)
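To see why the CUME_DIST - PERCENT_RANK trick works, the two functions can be imitated outside the database. This Python sketch assumes the standard SQL definitions, PERCENT_RANK = (rank - 1) / (n - 1) and CUME_DIST = (rows with value <= current) / n; the helper names are hypothetical.

```python
from bisect import bisect_left, bisect_right

def percent_rank(sorted_values, x):
    """(rank - 1) / (n - 1), where rank - 1 is the number of rows strictly below x."""
    n = len(sorted_values)
    return bisect_left(sorted_values, x) / (n - 1)

def cume_dist(sorted_values, x):
    """Share of rows with a value less than or equal to x."""
    return bisect_right(sorted_values, x) / len(sorted_values)

cars = sorted([0] * 4238 + [1] * 4883 + [2] * 6457 + [3] * 1645 + [4] * 1261)
for value in range(5):
    diff = cume_dist(cars, value) - percent_rank(cars, value)
    print(value, round(100 * diff))
```

Because PERCENT_RANK divides by n - 1 and CUME_DIST by n, the difference is not exactly the absolute share, but with 18,484 rows the rounded percentages agree with the frequency table.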

10 Centers
• Center of a distribution
  • The mode is the most common value in the distribution
  • The median is the value that splits the distribution into two halves
  • The arithmetic mean, or average, is the most commonly used measure of the center of the distribution
• Comparing mode, median and mean gives information about the skewness

11 Solution
• For mode and mean, use standard aggregate functions
  • For mode, also use TOP 1 WITH TIES
• Many solutions for median
  • SQL 2012: use the PERCENTILE_CONT or PERCENTILE_DISC window analytic functions with the DISTINCT operator
  • Note: faster solutions exist
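The three centers can be compared on a small data set. This is a Python sketch rather than the T-SQL from the talk; `statistics.median` is used here as a stand-in for the interpolating behaviour of PERCENTILE_CONT(0.5) and `statistics.median_low` for PERCENTILE_DISC(0.5), which is an assumption about how the two map onto each other.

```python
import statistics

def centers(values):
    """Return (modes, median_cont, median_disc, mean) for a list of numbers."""
    modes = statistics.multimode(values)         # keeps all ties, like TOP 1 WITH TIES
    median_cont = statistics.median(values)      # interpolates between the two middle values
    median_disc = statistics.median_low(values)  # always an actual value from the data
    mean = statistics.fmean(values)
    return modes, median_cont, median_disc, mean

print(centers([1, 2, 2, 3, 3, 10]))
```

In this sample the mean (3.5) lies above both medians (2.5 and 2), which hints at a tail to the right.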

12 Spread
• Range = maximal value - minimal value
• Inter-Quartile Range (IQR) = upper quartile - lower quartile
• Degrees of freedom: only (n - 1) pieces of information help us calculate the spread
• Variance (Var) = (1 / (n - 1)) * SUM((X_i - Mean(X))^2)
  • If the sample (number of cases n) is big, then we can use n instead of n - 1 (variance for the population, VarP)
• Standard Deviation (StDev) = SQRT(Var)
• Relative Standard Deviation, or the Coefficient of Variation (CV) = StDev / Mean

13 Solution
• For range, variance and standard deviation, use standard aggregate functions
• Many solutions for IQR
  • SQL 2012: use the PERCENTILE_CONT or PERCENTILE_DISC window analytic functions with the DISTINCT operator
  • Note: faster solutions exist
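The "minimal number of passes" goal can be illustrated with the variance: rearranging SUM((X_i - Mean(X))^2) into SUM(X_i^2) - n * Mean(X)^2 lets one scan collect everything needed. The Python sketch below is an illustration under that rearrangement, not the talk's T-SQL; the function names are hypothetical, and the `inclusive` quantile method is assumed to correspond to linear interpolation between data points.

```python
import math
import statistics

def spread(values):
    """One pass for n, sum, sum of squares, min and max; then derive the measures."""
    n = total = total_sq = 0
    lo, hi = math.inf, -math.inf
    for x in values:
        n += 1
        total += x
        total_sq += x * x
        lo, hi = min(lo, x), max(hi, x)
    mean = total / n
    # Rearranged form of SUM((x - mean)^2) / (n - 1).
    var = (total_sq - n * mean * mean) / (n - 1)
    stdev = math.sqrt(var)
    return {"range": hi - lo, "var": var, "stdev": stdev, "cv": stdev / mean}

def iqr(values):
    """Upper quartile minus lower quartile, interpolating between data points."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

print(spread([2, 4, 4, 4, 5, 5, 7, 9]), iqr([1, 2, 3, 4, 5]))
```

The rearranged formula trades a second pass for some numerical robustness: when the mean is very large relative to the spread, subtracting two big sums can lose precision.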

14 Skewness and Kurtosis
• Skewness describes asymmetry in a random variable’s probability distribution
• Kurtosis characterizes the relative peakedness or flatness of a distribution

15 Solution
• Creativity: expand the 3rd and 4th powers of the difference between the current value and the mean
• Mathematics: summation is linear, so the sum of the expanded expression is the sum of its individual terms, each of which is a plain aggregate
• CLR aggregate functions can use the same algorithm
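The expansion idea can be sketched as follows. The slide's own formulas did not survive in this transcript, so this Python example assumes the population definitions (skewness = m3 / m2^1.5, excess kurtosis = m4 / m2^2 - 3, with m_k the k-th central moment); sample-corrected variants differ by factors that depend on n.

```python
def skewness_kurtosis(values):
    """Population skewness and excess kurtosis from power sums gathered in one pass."""
    n = s1 = s2 = s3 = s4 = 0
    for x in values:
        n += 1
        s1 += x
        s2 += x ** 2
        s3 += x ** 3
        s4 += x ** 4
    mean = s1 / n
    # Central moments from the binomial expansion of (x - mean)^k.
    m2 = s2 / n - mean ** 2
    m3 = s3 / n - 3 * mean * s2 / n + 2 * mean ** 3
    m4 = s4 / n - 4 * mean * s3 / n + 6 * mean ** 2 * s2 / n - 3 * mean ** 4
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

print(skewness_kurtosis([1, 2, 3, 4, 5]))
```

Only the four power sums and the row count are needed, which is why the same algorithm fits a single aggregate query or a CLR aggregate.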

16 Linear Dependencies: Continuous Variables
• The deviation of the actual from the expected probabilities is the Covariance: CoVar(X,Y) = SUM((X_i - Mean(X)) * (Y_i - Mean(Y)) * P(X_i,Y_i))
• Divide the covariance by the product of the standard deviations of both variables to get the Correlation Coefficient: Correl = CoVar(X,Y) / (StDev(X) * StDev(Y))
• The squared correlation coefficient is the Coefficient of Determination: CD = SQUARE(Correl)

17 Solution
• SQL 2012: use window aggregate functions
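As an illustration of the covariance and correlation formulas, here is a minimal Python sketch (not the talk's T-SQL). It assumes every pair is equally likely, so P(X_i,Y_i) = 1 / n, and it uses population standard deviations to stay consistent with that; the function name is hypothetical.

```python
import math

def linear_dependency(xs, ys):
    """Return (covariance, correlation, coefficient of determination) from one pass."""
    n = sx = sy = sxx = syy = sxy = 0
    for x, y in zip(xs, ys):
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    mean_x, mean_y = sx / n, sy / n
    covar = sxy / n - mean_x * mean_y           # SUM((x - mean_x) * (y - mean_y)) / n
    stdev_x = math.sqrt(sxx / n - mean_x ** 2)  # population standard deviations
    stdev_y = math.sqrt(syy / n - mean_y ** 2)
    correl = covar / (stdev_x * stdev_y)
    return covar, correl, correl ** 2

print(linear_dependency([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
```

A perfectly linear pair of columns, as in the usage line, gives a correlation of 1 (up to floating-point rounding) and therefore a coefficient of determination of 1.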

18 Contingency Tables
• Contingency tables do not rely on numeric values
• The Null Hypothesis: there is no relationship between row and column frequencies
  • So there should be no difference between observed (O) and expected (E) frequencies

Observed
Gender  Married  Single  Total
F          4745    4388   9133
M          5266    4085   9351
Total     10011    8473  18484

Expected
Gender  Married  Single  Total
F          4946    4187   9133
M          5065    4286   9351
Total     10011    8473  18484

19 Linear Dependencies: Discrete Variables
• Chi-Squared formula: Chi2 = SUM((O - E)^2 / E)
• For the Chi-Squared distribution, there are already prepared tables with critical points for different degrees of freedom and for a specific confidence level
• Degrees of freedom = the product of the degrees of freedom for columns and rows, i.e. (columns - 1) * (rows - 1)

20 Chi-Squared Critical Points

21 Solution
• Problem: calculate expected frequencies
• SQL 2012 and creativity: use window aggregate functions
• Read the statistical significance from a pre-prepared table
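The expected frequencies follow from the margins: each cell is row total * column total / grand total. The Python sketch below illustrates that step and the chi-squared sum on the Gender by marital status counts from the contingency-table slide; it is not the talk's window-function query, and the function name is hypothetical.

```python
def chi_squared(observed):
    """Return (expected, chi2, dof) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Under the null hypothesis each cell is row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, chi2, dof

# Observed counts: rows F and M, columns Married and Single.
expected, chi2, dof = chi_squared([[4745, 4388], [5266, 4085]])
print([[round(e) for e in row] for row in expected], chi2, dof)
```

The rounded expected values match the Expected table, and with 1 degree of freedom the statistic can be compared with the 95% critical point of roughly 3.84.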

22 Calculating Statistical Significance
• Why read from a pre-prepared table? Build your own table!
• Calculate the values for a distribution with a definite integral over the distribution function
  • E.g., the Gaussian distribution function
  • The standard normal distribution has mean 0 and StDev 1

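23 Solution
• Mathematics: trapezoidal formula for approximate definite integration of f(x) from a to b: (b - a) * (f(a) + f(b)) / 2
• For multiple points x_0 = a, x_1, ..., x_n = b with equal spacing h: (h / 2) * (f(x_0) + 2 * f(x_1) + ... + 2 * f(x_(n-1)) + f(x_n))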
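A small Python sketch of the trapezoidal rule applied to the Gaussian density shows how such a table of areas could be built. The function names and the default of 1000 steps are assumptions for the example; in T-SQL the same sum would run over a table of points.

```python
import math

def normal_pdf(x, mean=0.0, stdev=1.0):
    """Gaussian density function."""
    z = (x - mean) / stdev
    return math.exp(-0.5 * z * z) / (stdev * math.sqrt(2 * math.pi))

def trapezoid(f, a, b, steps=1000):
    """Approximate the integral of f over [a, b] with equal-width trapezoids."""
    h = (b - a) / steps
    inner = sum(f(a + i * h) for i in range(1, steps))
    # End points count once, inner points twice: h/2 * (f(a) + 2 * inner + f(b)).
    return h * ((f(a) + f(b)) / 2 + inner)

# Area under the standard normal curve between 0 and 1.96 (about 0.475).
print(round(trapezoid(normal_pdf, 0.0, 1.96), 4))
```

Doubling that area gives the familiar 95% of the distribution that lies within 1.96 standard deviations of the mean.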

24 Linear Dependencies: Continuous and Discrete Variables
• ANOVA tests the variance in means between groups
• Null Hypothesis: the only variance comes from variance within and not between samples
• Mean squared deviation between the a groups, with n_i cases in group i, Mean_i denoting the group mean and Mean denoting the total mean: MS_between = SUM(n_i * (Mean_i - Mean)^2) / (a - 1)

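25 One-Way ANOVA and F-Test
• Mean squared deviation within the a groups, with n_i cases in group i and Mean_i denoting the mean of group i: MS_within = (sum over all groups of SUM((X_ij - Mean_i)^2)) / SUM(n_i - 1)
• F ratio = mean squared deviation between groups / mean squared deviation within groups
  • The bigger the F ratio, the more confidently you can reject the Null Hypothesis
  • Use F tables for critical points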
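The F ratio for a one-way ANOVA can be sketched in a few lines of Python. This assumes the textbook definitions of the between-group and within-group mean squares, with (a - 1) and (N - a) degrees of freedom; the function name is hypothetical and the code is an illustration, not the talk's T-SQL.

```python
def one_way_anova(groups):
    """Return (ms_between, ms_within, f_ratio) for a list of groups of numbers."""
    a = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between groups: squared distance of each group mean from the total mean, weighted by n_i.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within groups: squared distance of each case from its own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (a - 1)
    ms_within = ss_within / (n_total - a)
    return ms_between, ms_within, ms_between / ms_within

print(one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```

For the three small groups in the usage line, the between-group mean square is 13 and the within-group mean square is 1, so F is 13 (up to floating-point rounding).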

26 Solution (1)
• Mathematics: understand the ANOVA formula
• SQL 2012: use aggregate and ranking window functions
• Creativity:

27 Solution (2)
• How to get the statistical significance of the F value?
  • Not from a table, not with definite integration
• .NET function Chart.DataManipulator.Statistics.FDistribution
  • Unfortunately, it is in the System.Windows.Forms.DataVisualization.Charting namespace
  • Not supported by SQL CLR
• Creativity: use console application + SQLCMD mode

28 Moving Averages
• Simple moving averages (SMA)
• Weighted moving averages (WMA)
• Exponential moving averages (EMA)

29 Solution (1)
• SMA: SQL 2012 aggregate window functions
• WMA: SQL 2012 aggregate and analytic window functions
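The first two averages can be sketched in Python to show what the window queries compute. The window size of 3 and the weights (0.7 for the current value, 0.3 for the previous one) are assumptions chosen for the example; the slides do not state the parameters used in the talk.

```python
def simple_moving_average(values, window=3):
    """Average of the current value and up to window - 1 preceding values."""
    out = []
    for i in range(len(values)):
        # Shorter frames at the start, like a ROWS ... PRECEDING window frame.
        frame = values[max(0, i - window + 1): i + 1]
        out.append(sum(frame) / len(frame))
    return out

def weighted_moving_average(values, weights=(0.7, 0.3)):
    """The first weight applies to the current value, the next to the preceding values."""
    out = []
    for i in range(len(values)):
        pairs = [(w, values[i - k]) for k, w in enumerate(weights) if i - k >= 0]
        # Renormalize when fewer preceding values exist than weights.
        out.append(sum(w * v for w, v in pairs) / sum(w for w, _ in pairs))
    return out

print(simple_moving_average([1, 2, 3, 4, 5]))
print(weighted_moving_average([10, 20, 30]))
```

The SMA needs only an aggregate over a frame, while the WMA needs access to individual preceding rows, which is where an analytic function such as LAG would come in.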

30 Solution (2)
• EMA: SQL 2012 aggregate and analytic window functions
• EMA: creativity and mathematics; transform the EMA formula to include values only, not the previous EMA
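The transformation can be checked numerically. The Python sketch below assumes the common definition EMA_1 = x_1 and EMA_t = alpha * x_t + (1 - alpha) * EMA_(t-1), with a smoothing factor alpha between 0 and 1; the slides do not give the exact formula, so both the seeding and the parameter name are assumptions.

```python
def ema_recursive(values, alpha):
    """Each EMA depends on the previous EMA."""
    out = [values[0]]
    for x in values[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out

def ema_expanded(values, alpha):
    """Same result written with the original values only, no previous EMA."""
    out = []
    for t in range(len(values)):
        # The first value decays by (1 - alpha) at every step.
        decayed_start = (1 - alpha) ** t * values[0]
        # Every later value enters with weight alpha * (1 - alpha)^age.
        weighted = sum(alpha * (1 - alpha) ** k * values[t - k] for k in range(t))
        out.append(decayed_start + weighted)
    return out

data = [10, 12, 11, 15, 14]
print(ema_recursive(data, 0.5))
print(ema_expanded(data, 0.5))
```

The expanded form is what a set-based query can evaluate, because each row's result is a weighted sum over the rows up to it rather than a reference to the previous result.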

31 Solution (3)
• Would be much simpler with access to the interim calculations when SQL Server does window calculations
  • Not exposed yet; future versions?
• Other possibilities
  • Recursive CTE
  • Good old cursor!

32 Q & A
• Thank you!

33 Please give feedback to us
• http://speakerscore.com/sqlsaturday376
• Thank you!

34 Sponsors

