Statistics in WR: Lecture 1 Key Themes – Knowledge discovery in hydrology – Introduction to probability and statistics – Definition of random variables.

1 Statistics in WR: Lecture 1 Key Themes – Knowledge discovery in hydrology – Introduction to probability and statistics – Definition of random variables Reading: Helsel and Hirsch, Chapter 1

2 How is new knowledge discovered? After completing the Handbook of Hydrology in 1993, I asked myself the question: how is new knowledge discovered in hydrology? I concluded: – By deduction from existing knowledge – By experiment in a laboratory – By observation of the natural environment

3 Deduction – Isaac Newton (1687: three laws of motion and law of gravitation, http://en.wikipedia.org/wiki/Isaac_Newton). Deduction is the classical path of mathematical physics: – Given a set of axioms – Then by a logical process – Derive a new principle or equation. In hydrology, the St Venant equations for open channel flow and Richards' equation for unsaturated flow in soils were derived in this way.

4 Experiment – Louis Pasteur (foundations of scientific medicine: Pasteur showed that microorganisms cause disease & discovered vaccination, http://en.wikipedia.org/wiki/Louis_Pasteur). Experiment is the classical path of laboratory science: a simplified view of the natural world is replicated under controlled conditions. In hydrology, Darcy’s law for flow in a porous medium was found this way.

5 Observation – Charles Darwin (published Nov 24, 1859: most accessible book of great scientific imagination ever written). Observation is the direct viewing and characterization of patterns and phenomena in the natural environment. In hydrology, Horton discovered stream scaling laws by interpretation of stream maps.

6 Mean Annual Flow

7 Is there a relation between flow and water quality? Total Nitrogen in water

8 Are Annual Flows Correlated?

9 CE 397 Statistics in Water Resources, Lecture 2, 2009. David R. Maidment, Dept of Civil Engineering, University of Texas at Austin

10 Key Themes – Statistics: parametric and non-parametric approach – Data visualization – Distribution of data and the distribution of statistics of those data. Reading: Helsel and Hirsch p. 17-51 (Sections 2.1 to 2.3). Slides from Helsel and Hirsch (2002), “Techniques of water resources investigations of the USGS, Book 4, Chapter A3”.

11 Characteristics of Water Resources Data – Lower bound of zero – Presence of “outliers” – Positive skewness – Non-normal distribution of data – Data measured with thresholds (e.g. detection limits) – Seasonal and diurnal patterns – Autocorrelation: consecutive measurements are not independent – Dependence on other uncontrolled variables, e.g. chemical concentration is related to discharge

12 Normal Distribution. From Helsel and Hirsch (2002)

13 Lognormal Distribution. From Helsel and Hirsch (2002)

14 Method of Moments. From Helsel and Hirsch (2002)

15 Statistical measures – Location (central tendency): mean, median, geometric mean – Spread (dispersion): variance, standard deviation, interquartile range – Skewness (symmetry): coefficient of skewness – Kurtosis (flatness): coefficient of kurtosis
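The measures listed on this slide can be computed with the Python standard library. The following is only a sketch: the function name `describe` is hypothetical, and the skewness and kurtosis use simple moment ratios with divisor n, which may differ from the small-sample forms used in Helsel and Hirsch or in Excel.

```python
import statistics


def describe(values):
    """Return location, spread, skewness and kurtosis measures for a sample."""
    n = len(values)
    mean = statistics.fmean(values)
    # Quartiles from the default ("exclusive") method; other conventions exist.
    q1, _, q3 = statistics.quantiles(values, n=4)
    # Central moments with divisor n (assumption: simple moment-ratio form).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "mean": mean,
        "median": statistics.median(values),
        "geometric_mean": statistics.geometric_mean(values),  # needs values > 0
        "variance": statistics.variance(values),  # sample variance, divisor n - 1
        "stdev": statistics.stdev(values),
        "iqr": q3 - q1,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
    }
```

For a positively skewed sample such as `[1, 2, 4, 8, 16]`, the mean (6.2) exceeds the median (4), which is the pattern typical of water resources data.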

16 Histogram: Annual Streamflow for the Licking River at Catawba, Kentucky (03253500). From Helsel and Hirsch (2002)

17 Quantile Plot. From Helsel and Hirsch (2002)

18 Plotting positions – i = rank of the data value, with i = 1 the lowest – n = number of data values – p = cumulative probability or “quantile” of the data value (its percentile value)
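The plotting position formula itself did not survive in this transcript. As an illustration only, the sketch below uses the general form p = (i − a)/(n + 1 − 2a); the default a = 0.4 (the Cunnane formula) is an assumption about what the slide showed, and a = 0 gives the Weibull formula i/(n + 1). The function name is hypothetical.

```python
def plotting_positions(values, a=0.4):
    """Pair each sorted value with its plotting position p = (i - a) / (n + 1 - 2a).

    i is the rank (1 = lowest) and n is the number of data values.
    a = 0.4 is the Cunnane formula (assumed here); a = 0 is Weibull.
    Ties are not treated specially in this sketch.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [(x, (i - a) / (n + 1 - 2 * a)) for i, x in enumerate(ordered, start=1)]
```

With three values and a = 0, the positions are 0.25, 0.5 and 0.75, so no observation is assigned a cumulative probability of exactly 0 or 1.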

19 Normal Distribution Quantile Plot. From Helsel and Hirsch (2002)

20 Probability Plot with Normal Quantiles (Z values). From Helsel and Hirsch (2002) [figure axes: q versus z]
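A probability plot of this kind pairs each ordered observation with the standard normal quantile z of its plotting position. A minimal sketch, assuming Cunnane plotting positions (a = 0.4) and using a hypothetical function name:

```python
from statistics import NormalDist


def normal_quantile_pairs(values, a=0.4):
    """Return (z, x) pairs for a normal probability plot.

    z is the standard normal quantile of the plotting position
    p = (i - a) / (n + 1 - 2a); a = 0.4 (Cunnane) is an assumption.
    """
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    pairs = []
    for i, x in enumerate(ordered, start=1):
        p = (i - a) / (n + 1 - 2 * a)
        pairs.append((std_normal.inv_cdf(p), x))
    return pairs
```

If the data are close to normally distributed, the (z, x) points fall near a straight line; curvature suggests skewness.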

21 Annual Flows From HydroExcel: annual flows produced using Pivot Tables in Excel

23 CE 397 Statistics in Water Resources, Lecture 3, 2009. David R. Maidment, Dept of Civil Engineering, University of Texas at Austin

24 Key Themes – Using HydroExcel for accessing water resources data using web services – Descriptive statistics and histograms using Excel Analysis Toolpak. Reading: Chapter 11 of Applied Hydrology by Chow, Maidment and Mays

25 CE 397 Statistics in Water Resources, Lecture 4, 2009. David R. Maidment, Dept of Civil Engineering, University of Texas at Austin

26 Key Themes – Frequency and probability functions – Fitting methods – Typical distributions. Reading: Chapter 4 of Helsel and Hirsch, pp. 97-116, on Hypothesis tests

28 Method of Moments

29 Maximum Likelihood

30 CE 397 Statistics in Water Resources, Lecture 5, 2009. David R. Maidment, Dept of Civil Engineering, University of Texas at Austin

31 Key Themes – Using Excel to fit frequency and probability distributions – Chi Square test and probability plotting – Beginning hypothesis testing. Reading: Chapter 3 of Helsel and Hirsch, pp. 65-97, on Describing Uncertainty. Slides from Helsel and Hirsch Chap. 4

33 Statistics in Water Resources, Lecture 6 Key theme – T-distribution for distributions where standard deviation is unknown – Hypothesis testing – Comparing two sets of data to see if they are different Reading: Helsel and Hirsch, Chapter 6 Matched Pair Tests

34 Chi-Square Distribution http://en.wikipedia.org/wiki/Chi-square_distribution

35 t-, z and ChiSquare Source: http://en.wikipedia.org/wiki/Student's_t-distribution

36 Normal and t-distributions [figure legend: Normal; t-dist for ν = 1, 2, 3, 5, 10, 30]

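37 Standard Normal and Student's t – Standard normal z: if $X_1, \ldots, X_n$ are independently and normally distributed with mean μ and standard deviation σ, then $z = (\bar{X} - \mu)/(\sigma/\sqrt{n})$ is normally distributed with mean 0 and std dev 1 – Student’s t-distribution: applies to the case where the true standard deviation σ is unknown and is replaced by its sample estimate $S_n$, giving $t = (\bar{X} - \mu)/(S_n/\sqrt{n})$ with n − 1 degrees of freedom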
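The two statistics differ only in which standard deviation goes in the denominator. A minimal sketch with hypothetical function names:

```python
import math
import statistics


def z_statistic(sample, mu, sigma):
    """Standardized sample mean when the true standard deviation sigma is known."""
    n = len(sample)
    return (statistics.fmean(sample) - mu) / (sigma / math.sqrt(n))


def t_statistic(sample, mu):
    """Same statistic with sigma replaced by the sample estimate (divisor n - 1).

    Under normality it follows Student's t with n - 1 degrees of freedom.
    """
    n = len(sample)
    s = statistics.stdev(sample)
    return (statistics.fmean(sample) - mu) / (s / math.sqrt(n))
```

For the sample `[2, 4, 6, 8]` tested against μ = 4, a known σ = 2 gives z = 1.0, while the sample estimate S ≈ 2.58 gives t ≈ 0.775.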

38 The p-value is the probability of obtaining a test statistic at least as extreme as the one observed if the null hypothesis (H0) is true. If the p-value is very small (below a significance level α such as 0.05 or 0.025) then reject H0. If the p-value is larger than α then do not reject H0.

39 One-sided test

40 Two-sided test
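As a sketch of how one-sided and two-sided p-values differ, the function below computes both for a z statistic under the standard normal distribution. The function name and the `alternative` labels are hypothetical; a t statistic would need the t-distribution instead, which the Python standard library does not provide.

```python
from statistics import NormalDist


def p_value_z(z, alternative="two-sided"):
    """p-value of a z statistic under the standard normal null distribution."""
    cdf = NormalDist().cdf
    if alternative == "greater":    # one-sided: H1 says the mean is larger
        return 1.0 - cdf(z)
    if alternative == "less":       # one-sided: H1 says the mean is smaller
        return cdf(z)
    if alternative == "two-sided":  # both tails count as "extreme"
        return 2.0 * (1.0 - cdf(abs(z)))
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
```

A statistic of z = 1.96 gives a two-sided p-value of about 0.05, while the same α = 0.05 in a one-sided test corresponds to z of about 1.645.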

41 Statistics in WR: Lecture 7 Key Themes – Statistics for populations and samples – Suspended sediment sampling – Testing for differences in means and variances Reading: Helsel and Hirsch Chapter 8 Correlation

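42 Estimators of the Variance – Maximum likelihood estimate for the population variance: $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ – Unbiased estimate from a sample: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ http://en.wikipedia.org/wiki/Variance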

43 Bias in the Variance Common sense would suggest to apply the population formula to the sample as well. The reason that it is biased is that the sample mean is generally somewhat closer to the observations in the sample than the population mean is to these observations. This is so because the sample mean is by definition in the middle of the sample, while the population mean may even lie outside the sample. So the deviations from the sample mean will often be smaller than the deviations from the population mean, and so, if the same formula is applied to both, then this variance estimate will on average be somewhat smaller in the sample than in the population.
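The bias described above can be seen in a small Monte Carlo experiment. This sketch draws many small normal samples with a known variance and averages the two estimates; the function name, sample size, number of trials and seed are arbitrary choices for illustration.

```python
import random


def average_variance_estimates(n=5, trials=20000, sigma=2.0, seed=1):
    """Average the divisor-n and divisor-(n - 1) variance estimates over many samples."""
    rng = random.Random(seed)
    sum_divisor_n = 0.0
    sum_divisor_n_minus_1 = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(sample) / n
        # Squared deviations are taken about the sample mean, not the true mean.
        ss = sum((x - mean) ** 2 for x in sample)
        sum_divisor_n += ss / n
        sum_divisor_n_minus_1 += ss / (n - 1)
    return sum_divisor_n / trials, sum_divisor_n_minus_1 / trials


biased, unbiased = average_variance_estimates()
```

With σ = 2 the true variance is 4. The divisor-(n − 1) average should land close to 4, while the divisor-n average should land close to 4(n − 1)/n = 3.2 for n = 5.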

44 Suspended Sediment Sampling http://pubs.usgs.gov/sir/2005/5077/

45 T-test with same variances

46 T-test with different variances
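The formulas on these two slides did not survive in the transcript. The sketch below shows the standard textbook forms, a pooled-variance t statistic and the Welch statistic with Satterthwaite degrees of freedom, as an assumption about what the slides covered. Function names are hypothetical.

```python
import math
import statistics


def pooled_t(x, y):
    """Two-sample t statistic assuming equal variances; returns (t, df)."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    diff = statistics.fmean(x) - statistics.fmean(y)
    return diff / math.sqrt(pooled_var * (1 / n1 + 1 / n2)), n1 + n2 - 2


def welch_t(x, y):
    """Two-sample t statistic allowing different variances; returns (t, df)."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)
    se2 = v1 / n1 + v2 / n2
    diff = statistics.fmean(x) - statistics.fmean(y)
    # Welch-Satterthwaite approximation; df is generally not an integer.
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return diff / math.sqrt(se2), df
```

When the two samples have the same size, both functions return the same t value; only the degrees of freedom differ, with Welch's being smaller when the variances are unequal.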

47 Statistics in WR: Lecture 8 Key Themes – Replication in Monte Carlo experiments – Testing paired differences and analysis of variance – Correlation Reading: Helsel and Hirsch Chapter 9 Simple Regression

48 Statistics of Mean of Replicated Series

49 Patterns of data that all have correlation between x and y of 0.7

50 Monotonic nonlinear correlation Linear correlation Non-monotonic correlation

51 Statistics in WR: Lecture 9 Key Themes – Using SAS to compute cross-correlation between two data series – Using Excel to compute autocorrelation of a single data series – Correlation length and influence of data interval on that – Lagged Cross-correlation between rainfall and flow Reading: Helsel and Hirsch Chapter 12 Trend Analysis

52 Correlation – Correlation (or cross-correlation) measures the association between two sets of data (x, y) – Autocorrelation measures the correlation of a dataset with lagged or displaced values of itself (either in time or space), e.g. x(t) with x(t – L), where L is the lag time – Lagged cross-correlation measures the association between one series y(t) and lagged values of another series x(t – L)
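The three definitions can be sketched with one Pearson correlation function applied to suitably shifted series. Function names are hypothetical, and this version correlates only the overlapping segments; classical autocorrelation estimators use the overall mean and variance instead, so values can differ slightly for short series.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # fails if either series is constant


def lagged_cross_correlation(x, y, lag):
    """Correlation between y(t) and x(t - lag) for a lag of 0 or more steps."""
    if lag < 0 or lag > len(x) - 2:
        raise ValueError("lag must be between 0 and len(x) - 2")
    if lag == 0:
        return pearson(x, y)
    return pearson(x[:-lag], y[lag:])  # pairs x[t - lag] with y[t]


def autocorrelation(x, lag):
    """Correlation of a series with itself displaced by lag steps."""
    return lagged_cross_correlation(x, x, lag)
```

For example, if flow responds to rainfall one time step later, the lag-1 cross-correlation between rainfall x and flow y will be higher than the lag-0 value.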

53 Statistics in WR: Lecture 10 Key Themes – Trend analysis using Simple Linear Regression – Characterization of outliers – Multiple Linear Regression Reading: Helsel and Hirsch Chapter 11 Multiple Linear Regression Slides are from Helsel and Hirsch, Chapter 9

54 H&H p.222

55 H&H p.226 Regression Formulas

56 H&H p.227 Regression Formulas

57 Statistics in WR: Lecture 11 Key Themes – Simple Linear Regression – Derivation of the normal equations – Multiple Linear Regression Reading: Helsel and Hirsch Chapter 7 Comparing several independent groups Reading: Barnett, Environmental Statistics Chapter 10 Time series methods Slides are from Helsel and Hirsch, Chapter 9

58 Regression Assumptions

59 Formulas used in the derivation of the normal equations
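The slide's formulas are not preserved here. As a sketch of where the derivation leads, minimizing the sum of squared residuals of y = b0 + b1·x gives two normal equations (the residuals sum to zero, and the residuals times x sum to zero), whose solution is b1 = Sxy/Sxx and b0 = ȳ − b1·x̄. The function name below is hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope from the normal equations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx                      # b1 = Sxy / Sxx
    intercept = mean_y - slope * mean_x    # b0 = ybar - b1 * xbar
    return intercept, slope


# Check both normal equations on a small made-up data set.
xs, ys = [1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = fit_line(xs, ys)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(xs, ys)]
sum_residuals = sum(residuals)                                  # close to 0
sum_x_residuals = sum(xi * e for xi, e in zip(xs, residuals))   # close to 0
```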

60 (1a) Plot the Data: TDS vs LogQ

61 (2) Interpret Regression Statistics

62 A good set of Residuals

63 Multiple Linear Regression

64 Simple vs Complex regression models

65 F-distribution http://en.wikipedia.org/wiki/F-test “If U is a Chisquare random variable with m degrees of freedom, V is a Chisquare random variable with n degrees of freedom, and if U and V are independent, then the ratio [(U/m)/(V/n)] has an F-distribution with (m, n) degrees of freedom.” Haan, Statistical Methods in Hydrology, p.122. The values of the F-statistic are tabulated at: http://www.itl.nist.gov/div898/handbook/eda/section3/eda3673.htm

66 Statistics in WR: Lecture 12 Key Themes – Regression y|x and x|y – Adjusted R² – Time series and seasonal variations

67 R² and Adjusted R² – Excel SUMMARY OUTPUT
Regression Statistics: Multiple R = 0.950344; R Square = 0.903154 (0.903154347); Adjusted R Square = 0.898543 (0.89854265); Standard Error = 159033.1; Observations = 23
ANOVA (columns: df | SS | MS | F | Significance F)
Regression: 1 | 4.95309E+12 | | 195.8399 | 4.07E-12
Residual (error): 21 | 5.31122E+11 | 25291521454 | |
Total (y): 22 | 5.48421E+12 | | |
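The regression statistics in this output can be recovered from the ANOVA sums of squares. The sketch below uses the standard definitions R² = SS_regression/SS_total and adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1); the function name is hypothetical, and the input numbers are the ones shown on the slide.

```python
import math


def regression_summary(ss_regression, ss_residual, n_obs, n_predictors=1):
    """R-squared, adjusted R-squared, standard error and F from ANOVA sums of squares."""
    ss_total = ss_regression + ss_residual
    df_residual = n_obs - n_predictors - 1
    r2 = ss_regression / ss_total
    # Adjusted R-squared penalizes each additional predictor.
    adj_r2 = 1 - (1 - r2) * (n_obs - 1) / df_residual
    ms_residual = ss_residual / df_residual
    f_stat = (ss_regression / n_predictors) / ms_residual
    return {
        "r2": r2,
        "adj_r2": adj_r2,
        "std_error": math.sqrt(ms_residual),
        "f": f_stat,
    }


summary = regression_summary(4.95309e12, 5.31122e11, n_obs=23)
```

With one predictor and 23 observations, this reproduces the slide's R Square, Adjusted R Square, Standard Error and F to the precision shown.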

68 Time Series Trend: Tide Levels at San Diego http://tidesandcurrents.noaa.gov/sltrends/sltrends_station.shtml?stnid=9410170%20San%20Diego,%20CA

69 One harmonic

70 Five harmonics http://en.wikipedia.org/wiki/Fourier_series
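A seasonal or diurnal cycle can be approximated by the mean plus the first few Fourier harmonics. This is a minimal sketch, assuming an evenly spaced series that covers exactly one full cycle (for example 12 monthly means); function names are hypothetical and the harmonic number k must be less than half the series length.

```python
import math


def harmonic_coefficients(series, k):
    """Cosine and sine coefficients (a_k, b_k) of the k-th harmonic."""
    n = len(series)
    a = 2 / n * sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(series))
    b = 2 / n * sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(series))
    return a, b


def harmonic_fit(series, n_harmonics):
    """Reconstruct the series from its mean plus the first n_harmonics harmonics."""
    n = len(series)
    mean = sum(series) / n
    coeffs = [harmonic_coefficients(series, k) for k in range(1, n_harmonics + 1)]
    fitted = []
    for t in range(n):
        value = mean
        for k, (a, b) in enumerate(coeffs, start=1):
            angle = 2 * math.pi * k * t / n
            value += a * math.cos(angle) + b * math.sin(angle)
        fitted.append(value)
    return fitted
```

Adding harmonics lets the fitted curve follow sharper features of the cycle, at the cost of two more coefficients per harmonic.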

71 Statistics in WR: Lecture 13 Key Themes – ANOVA for sediment data – Fourier series for diurnal cycles – Fourier series for seasonal cycles

72 Analysis of Variance (ANOVA) Assumptions There are several variants (one factor, two factor, two factor with replication). We will deal just with One Factor ANOVA

73 Single Factor ANOVA

75 ANOVA Formulas

76 Single Factor ANOVA
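The ANOVA formulas are not preserved in this transcript. The sketch below uses the standard one-factor decomposition: variation between group means is compared with variation within groups, and the ratio of the two mean squares is the F statistic. The function name and the small data set in the note are made up for illustration; they are not the sediment data.

```python
def one_way_anova(groups):
    """Single factor ANOVA table entries for a list of groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: group means about the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group sum of squares: observations about their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "f": f_stat,
    }
```

For the groups `[1, 2, 3]`, `[2, 3, 4]` and `[5, 6, 7]`, the between-group sum of squares is 26 on 2 degrees of freedom and the within-group sum of squares is 6 on 6 degrees of freedom, giving F = 13. The F value is then compared with the tabulated F-distribution for (k − 1, N − k) degrees of freedom.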

77 Groups of Sediment Load Data (Ex3) – TWDB mean: 189,000 ton/yr – USGS1 mean: 218,000 ton/yr – USGS2 mean: 97,000 ton/yr – Overall mean: 183,000 ton/yr [figure axis labels: Zero, 480,000, 3.5 × 10^6, 5.5 × 10^6]

