High Performance Statistical Queries

Sponsors

Agenda  Introduction  Descriptive Statistics  Linear dependencies  Continuous variables  Discrete variables  Discrete and continuous variables  Definite Integration  Moving Averages

Why Statistical Queries?  SQL Server lacks support for statistics  Statistics useful in many cases  Ad-hoc analysis  Advanced reporting  Data overview  Show the best of the T-SQL improvements in SQL Server 2012

Big Role: Data Scientist  Data Scientist: The Sexiest Job of the 21st Century (Harvard Business Review)  A high-ranking professional with the training and curiosity to make discoveries in the world of structured and big data  A solid foundation typically in computer science and applications, modeling, statistics, analytics and math  A data scientist explores and examines data from multiple disparate sources  Exploring, asking questions, doing what-if analysis, questioning existing assumptions and processes  In short: programming and statistics!

Performance  Main issue: complex calculations that need other statistics  E.g., StDev uses Avg in formula  Goal: calculate everything with minimal number of passes through the data  Additionally improve performance with:  (Covering) nonclustered indexes  Columnstore index  Sampling!

Minimizing Passes through the Data  How to achieve the minimal number of passes through the data?  With the new SQL Server 2012 window functions  Rearrange formulas: use mathematical knowledge  Use creativity
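
As an illustration of rearranging a formula to save a pass, the sample variance can be computed from running sums instead of first computing the mean and then the squared deviations. This is a minimal Python sketch of the arithmetic only, not the presenter's T-SQL; the function name `one_pass_stats` is hypothetical.

```python
import math

def one_pass_stats(values):
    """Return (n, mean, sample variance, sample stdev) from a single pass."""
    n = 0
    sum_x = 0.0
    sum_x2 = 0.0
    for x in values:  # the only pass through the data
        n += 1
        sum_x += x
        sum_x2 += x * x
    mean = sum_x / n
    # SUM((x - mean)^2) rewritten as SUM(x^2) - n * mean^2
    var = (sum_x2 - n * mean * mean) / (n - 1)
    return n, mean, var, math.sqrt(var)
```

The rearranged form subtracts two large numbers, so it can lose precision when the mean is large relative to the spread; that trade-off is worth checking on real data.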

Frequencies  Frequency tables and graphs are the basic representation of discrete variables  Value, absolute frequency, absolute percentage, cumulative frequency, cumulative percentage and histogram  Example table columns: Cars, AbsFreq, CumFreq, AbsPerc, CumPerc, Histogram (a bar of asterisks per value)

Solution  Pre-SQL 2012: calculate absolute numbers and then use a non-equi self-join for cumulative (running) numbers  SQL 2012: calculate absolute numbers and then use aggregate functions with framing and order  SQL 2012 with creativity: use window analytic functions  PERCENT_RANK calculates the relative rank of a row within a group of rows in percent  CUME_DIST calculates the cumulative distribution of a value  CUME_DIST - PERCENT_RANK for the last value in a group approximately equals the absolute percent of the value (the gap shrinks as the number of rows grows)
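
The "absolute numbers first, then running totals" approach can be sketched outside the database as follows. This is a hedged Python illustration of the frequency-table columns named on the previous slide; `frequency_table` is a hypothetical name, and the one-asterisk-per-percent histogram scale is an assumption.

```python
from collections import Counter
from itertools import accumulate

def frequency_table(values):
    """Rows of (value, AbsFreq, CumFreq, AbsPerc, CumPerc, Histogram)."""
    counts = sorted(Counter(values).items())          # absolute frequencies
    total = len(values)
    running = accumulate(c for _, c in counts)        # running (cumulative) totals
    rows = []
    for (value, abs_freq), cum_freq in zip(counts, running):
        abs_perc = 100.0 * abs_freq / total
        cum_perc = 100.0 * cum_freq / total
        # Assumed scale: one asterisk per percentage point
        rows.append((value, abs_freq, cum_freq, abs_perc, cum_perc,
                     "*" * round(abs_perc)))
    return rows
```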

Centers  Center of a distribution  The mode is the most common value in the distribution  The median is the value that splits the distribution into two halves  The arithmetic mean or the average is the most common measure for the center of the distribution  Comparing mode, median and mean gives info about the skewness

Solution  For mode and mean, use standard aggregate functions  For mode, use also TOP 1 WITH TIES  Many solutions for median  SQL 2012: use PERCENTILE_CONT or PERCENTILE_DISC window analytical functions with DISTINCT operator  Note: faster solutions exist
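
The difference between the two median flavours can be sketched in Python. The assumption here is the usual semantics: PERCENTILE_CONT interpolates between the two middle rows, while PERCENTILE_DISC returns an actual value from the data (the smallest one whose cumulative distribution reaches 0.5). The function names are hypothetical.

```python
from collections import Counter

def modes(values):
    """All most-common values, like TOP 1 WITH TIES on the frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

def median_cont(values):
    """Interpolated median (the PERCENTILE_CONT(0.5) idea)."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def median_disc(values):
    """Median that is always an existing value (the PERCENTILE_DISC(0.5) idea)."""
    s = sorted(values)
    return s[(len(s) + 1) // 2 - 1]
```

For `[1, 2, 3, 4]` the continuous median falls between the two middle values, while the discrete one returns a value that exists in the data.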

Spread  Range = maximal - minimal value  Inter-Quartile Range (IQR) = upper quartile - lower quartile  Degrees of freedom: only (n - 1) pieces of information help us calculate the spread  Variance (Var) = (1 / (n - 1)) * SUM((X_i - Mean(X))^2)  If the sample (number of cases) is large, n can be used instead of n - 1 (variance for the population, VarP)  Standard Deviation (StDev) = SQRT(Var)  Relative Standard Deviation or Coefficient of Variation (CV) = StDev / Mean

Solution  For range, variance and standard deviation use standard aggregate functions  Many solutions for IQR  SQL 2012: use PERCENTILE_CONT or PERCENTILE_DISC window analytic functions with DISTINCT operator  Note: faster solutions exist
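
The spread measures from the previous slide can be sketched in Python as follows. The quartiles use linear interpolation between neighbouring rows, which is assumed to match the PERCENTILE_CONT idea; `percentile_cont` and `spread` are hypothetical names.

```python
import math
import statistics

def percentile_cont(values, p):
    """Percentile with linear interpolation between neighbouring rows."""
    s = sorted(values)
    pos = p * (len(s) - 1)            # zero-based fractional row position
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def spread(values):
    """Return (range, IQR, sample StDev, CV)."""
    value_range = max(values) - min(values)
    iqr = percentile_cont(values, 0.75) - percentile_cont(values, 0.25)
    stdev = statistics.stdev(values)  # uses n - 1 in the denominator
    cv = stdev / statistics.mean(values)
    return value_range, iqr, stdev, cv
```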

Skewness and Kurtosis  Skewness describes asymmetry in a random variable’s probability distribution  Kurtosis characterizes the relative peakedness or flatness of a distribution

Solution  Creativity: expand the subtraction of the mean from the current value to the 3rd and 4th power  Mathematics: multiplication is distributive over addition, so the expanded terms can be summed separately  CLR aggregate functions can use the same algorithm
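
The expansion can be sketched in Python: the sums of the 2nd, 3rd and 4th powers of the deviations are rebuilt from plain power sums that one pass can collect. The population definitions of skewness and excess kurtosis used here are an assumption (the slides do not show which variant the presenter used), and all names are hypothetical.

```python
def central_moment_sums(values):
    """One pass: return (n, mean, SUM(d^2), SUM(d^3), SUM(d^4)) with d = x - mean."""
    n = 0
    s1 = s2 = s3 = s4 = 0.0
    for x in values:
        n += 1
        s1 += x
        s2 += x ** 2
        s3 += x ** 3
        s4 += x ** 4
    mean = s1 / n
    # Binomial expansion of (x - mean)^k, summed term by term
    m2 = s2 - n * mean ** 2
    m3 = s3 - 3 * mean * s2 + 3 * mean ** 2 * s1 - n * mean ** 3
    m4 = s4 - 4 * mean * s3 + 6 * mean ** 2 * s2 - 4 * mean ** 3 * s1 + n * mean ** 4
    return n, mean, m2, m3, m4

def skewness_kurtosis(values):
    """Population skewness and excess kurtosis (assumed definitions)."""
    n, _, m2, m3, m4 = central_moment_sums(values)
    skew = (m3 / n) / (m2 / n) ** 1.5
    kurt = (m4 / n) / (m2 / n) ** 2 - 3
    return skew, kurt
```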

Linear Dependencies: Continuous Variables  The deviation of the actual from the expected probabilities is the Covariance: CoVar(X,Y) = SUM((X_i - Mean(X)) * (Y_i - Mean(Y)) * P(X_i,Y_i))  Divide the covariance by the product of the standard deviations of both variables to get the Correlation Coefficient: Correl = CoVar(X,Y) / (StDev(X) * StDev(Y))  Squared correlation coefficient is the Coefficient of Determination: CD = SQUARE(Correl)

Solution  SQL 2012: use window aggregate functions
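
A hedged Python sketch of the covariance, correlation coefficient and coefficient of determination follows. It assumes every pair is equally likely, so P(X_i,Y_i) = 1/n, and uses population standard deviations so that the three measures stay consistent; `linear_dependency` is a hypothetical name.

```python
import math

def linear_dependency(pairs):
    """One pass over (x, y) pairs: return (CoVar, Correl, CD)."""
    n = 0
    sx = sy = sxy = sxx = syy = 0.0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxy += x * y
        sxx += x * x
        syy += y * y
    mean_x, mean_y = sx / n, sy / n
    # SUM((x - mean_x) * (y - mean_y)) / n rewritten with plain sums
    covar = sxy / n - mean_x * mean_y
    stdev_x = math.sqrt(sxx / n - mean_x ** 2)
    stdev_y = math.sqrt(syy / n - mean_y ** 2)
    correl = covar / (stdev_x * stdev_y)
    return covar, correl, correl ** 2
```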

Contingency Tables  Contingency tables do not rely on numeric values  The Null Hypothesis: there is no relationship between row and column frequencies  So there should be no difference between observed (O) and expected (E) frequencies  Observed and Expected tables: rows Gender (F, M, Total), columns Married, Single, Total

Linear Dependencies: Discrete Variables  Chi-Squared formula: Chi-Squared = SUM((O - E)^2 / E)  For the Chi-Squared distribution there are already prepared tables with critical points for different degrees of freedom and for a specific confidence level  Degrees of freedom = the product of the degrees of freedom for columns and rows

Chi-Squared Critical Points

Solution  Problem: calculate expected frequencies  SQL 2012 and creativity: use window aggregate functions  Read the statistical significance from a pre-prepared table
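
The expected-frequency step can be sketched in Python: each expected cell is the row total times the column total divided by the grand total, which is what totals over row and column partitions would supply. The function name is hypothetical and the counts in the usage note are invented for illustration.

```python
def chi_squared(observed):
    """observed: list of rows of counts. Return (expected, chi2, degrees of freedom)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected frequency under the null hypothesis of no relationship
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, chi2, df
```

For an invented 2 x 2 table such as `[[10, 20], [30, 40]]`, the function returns one degree of freedom, and the statistic can then be compared with the critical point for the chosen confidence level.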

Calculating Statistical Significance  Why read from a pre-prepared table? Build your own table!  Calculate the values for a distribution with a definite integral over the distribution function  E.g., Gaussian distribution function  Standard normal distribution has mean 0 and StDev 1

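Solution  Mathematics: trapezoidal formula for approximate definite integration of f(x) from a to b: (b - a) * (f(a) + f(b)) / 2  For multiple points: (h / 2) * (f(x_0) + 2 * f(x_1) + ... + 2 * f(x_(n-1)) + f(x_n)), with h = (b - a) / n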
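
A minimal Python sketch of trapezoidal integration over the Gaussian density follows, as one way to build such a table of probabilities. The step count and the function names `normal_pdf` and `trapezoid` are assumptions.

```python
import math

def normal_pdf(x, mean=0.0, stdev=1.0):
    """Gaussian density; the defaults give the standard normal distribution."""
    z = (x - mean) / stdev
    return math.exp(-0.5 * z * z) / (stdev * math.sqrt(2 * math.pi))

def trapezoid(f, a, b, steps=1000):
    """Approximate the definite integral of f from a to b with equal-width trapezoids."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))       # end points count half
    for i in range(1, steps):
        total += f(a + i * h)         # interior points count fully
    return total * h
```

Integrating the standard normal density between -1.96 and 1.96 with this sketch gives a value close to 0.95, the familiar two-sided 95% interval.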

Linear Dependencies: Continuous and Discrete Variables  ANOVA tests the variance in means between groups  Null Hypothesis: the only variance comes from variance within and not between samples  Mean squared deviation between the a groups, computed from each group mean and the total mean

One-Way ANOVA and F-Test  Mean squared deviation within the a groups, with n_i cases in each group  F ratio = mean squared deviation between groups / mean squared deviation within groups  The bigger the F ratio, the more confidently you can reject the Null Hypothesis  Use F tables for critical points

Solution (1)  Mathematics: understand the ANOVA formula  SQL 2012: use aggregate and ranking window functions  Creativity:
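
The one-way ANOVA arithmetic can be sketched in Python using the standard textbook definitions, which are assumed to match the formulas lost from the slides: the between-groups sum of squares is divided by (a - 1) and the within-groups sum of squares by (N - a), with a groups and N cases in total. The function name is hypothetical.

```python
def one_way_anova(groups):
    """groups: list of lists of numbers. Return (MS between, MS within, F ratio)."""
    a = len(groups)                               # number of groups
    n_total = sum(len(g) for g in groups)
    total_mean = sum(sum(g) for g in groups) / n_total
    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        # n_i * (group mean - total mean)^2
        ss_between += len(g) * (group_mean - total_mean) ** 2
        # squared deviations of each case from its own group mean
        ss_within += sum((x - group_mean) ** 2 for x in g)
    ms_between = ss_between / (a - 1)
    ms_within = ss_within / (n_total - a)
    return ms_between, ms_within, ms_between / ms_within
```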

Solution (2)  How to get the statistical significance of the F value?  Not from a table, not with definite integration  .NET function Chart.DataManipulator.Statistics.FDistribution  Unfortunately, it is in the System.Windows.Forms.DataVisualization.Charting namespace  Not supported by SQL CLR  Creativity: use console application + SQLCMD mode

Moving Averages  Simple moving averages (SMA)  Weighted moving averages (WMA)  Exponential moving averages (EMA)

Solution (1)  SMA: SQL 2012 aggregate window functions  WMA: SQL 2012 aggregate and analytic window functions
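
Simple and weighted moving averages can be sketched in Python as follows. The three-row frame, the weights and the choice to return `None` until a full frame is available are assumptions, since the slides do not specify them; the function names are hypothetical.

```python
def simple_moving_average(values, window=3):
    """Average over the current row and up to window - 1 preceding rows."""
    out = []
    for i in range(len(values)):
        frame = values[max(0, i - window + 1): i + 1]
        out.append(sum(frame) / len(frame))
    return out

def weighted_moving_average(values, weights=(1, 2, 3)):
    """Weights run from the oldest to the newest row; incomplete frames yield None."""
    k = len(weights)
    weight_sum = sum(weights)
    out = []
    for i in range(len(values)):
        if i + 1 < k:
            out.append(None)
            continue
        frame = values[i - k + 1: i + 1]
        out.append(sum(w * x for w, x in zip(weights, frame)) / weight_sum)
    return out
```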

Solution (2)  EMA: SQL 2012 aggregate and analytic window functions  EMA: creativity and mathematics – transform EMA formula to include values only, not previous EMA
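
The transformation mentioned above can be sketched in Python. It assumes the common definition EMA_i = alpha * x_i + (1 - alpha) * EMA_(i-1), seeded with the first value; the slides do not show the exact formula, and the function names are hypothetical. The expanded form uses only the original values, each weighted by a power of (1 - alpha).

```python
def ema_recursive(values, alpha):
    """EMA defined through the previous EMA (needs row-by-row state)."""
    out = []
    prev = None
    for x in values:
        prev = x if prev is None else alpha * x + (1 - alpha) * prev
        out.append(prev)
    return out

def ema_expanded(values, alpha):
    """Same EMA expressed with values only, no reference to a previous EMA."""
    out = []
    for i in range(len(values)):
        # The seed value carries weight (1 - alpha)^i
        total = (1 - alpha) ** i * values[0]
        # The value k rows back carries weight alpha * (1 - alpha)^k
        for k in range(i):
            total += alpha * (1 - alpha) ** k * values[i - k]
        out.append(total)
    return out
```

Both functions return the same series up to rounding; the expanded form is the one that maps onto set-based window calculations, at the cost of more arithmetic per row.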

Solution (3)  Would be much simpler with access to the interim calculations when SQL Server does window calculations  Not exposed yet – future versions?  Other possibilities  Recursive CTE  Good old cursor!

Q & A  Thank you!

Please give feedback to us   Thank you!
