K-Nearest Neighbor Resampling Technique (Weather Generation and Water Quality Applications) Balaji Rajagopalan Somkiat Apipattanavis & Erin Towler Department.


1 K-Nearest Neighbor Resampling Technique (Weather Generation and Water Quality Applications) Balaji Rajagopalan Somkiat Apipattanavis & Erin Towler Department of Civil, Environmental and Architectural Engineering University of Colorado Boulder, CO Denver Water February 2007

2 “Translation” of Climate Info
Users are most interested in sectoral outcomes (streamflows, crop yields, risk of disease X).
[Diagram: climate forecast / projection -> translation -> process models -> distribution of outcomes]

3 Translation
[Diagram: historical data -> synthetic series -> process model -> frequency distribution of outcomes]

4 Why Simulation?
- Limited historical data
  - Cannot capture the full range of variability
  - Selecting a (single or a set of) historical years from the record with equal chance: unconditional bootstrap, Index Sequential Method
- Need: a tool to generate ‘scenarios’ that capture the historical statistical properties
- Several statistical techniques are available (e.g., time series techniques, Monte Carlo techniques)
  - These are cumbersome and restrictive in their assumptions
- Re-sampling techniques are simple and robust
  - Unconditional and conditional bootstrap and the K-nearest neighbor (K-NN) bootstrap offer attractive alternatives

6 Re-sampling Techniques
- Drawing cards from a well-shuffled deck: unconditional bootstrap, Index Sequential Method
  - Selecting a (single or a set of) historical years from the record with equal chance
- Drawing cards from a biased deck: conditional bootstrap
  - Selecting a (single or a set of) historical years with unequal chance, e.g., selecting only El Nino years
- K-Nearest Neighbor bootstrap: “pattern matching”
  - Select the ‘K’ nearest neighbors (e.g., years) to the current ‘feature’
  - Select one of the K neighbors at random
  - Repeat to produce an ensemble
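The three K-NN steps listed on this slide can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `knn_bootstrap`, the dictionary layout, and the index values are assumptions made up for the example, and a single scalar feature stands in for whatever feature vector an application would use.

```python
import random

def knn_bootstrap(history, current_feature, k, n_draws, seed=None):
    """Resample years whose feature is closest to `current_feature`.

    history: dict mapping year -> scalar feature (hypothetical layout).
    Returns `n_draws` years drawn with replacement, with equal chance,
    from the k nearest neighbors.
    """
    rng = random.Random(seed)
    # Rank historical years by distance to the current feature.
    ranked = sorted(history, key=lambda yr: abs(history[yr] - current_feature))
    neighbors = ranked[:k]
    # Pick one neighbor at random; repeat to build an ensemble.
    return [rng.choice(neighbors) for _ in range(n_draws)]

# Made-up climate index values, for illustration only.
history = {1990: -1.2, 1991: 0.4, 1992: 1.5, 1993: 0.1, 1994: -0.3, 1995: 1.0}
ensemble = knn_bootstrap(history, current_feature=0.3, k=3, n_draws=10, seed=1)
```

With these made-up values the three nearest years to a feature of 0.3 are 1991, 1993 and 1994, so every member of `ensemble` is one of those years.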

7 Examples Ensemble Weather Generation –Scenario generation –Forecast Argentina - Pampas Region Water Quality Modeling (Boulder Water Utility)

8 Two Step Weather Generator
Step 1: precipitation state (dry = 0, wet = 1)
- Estimate the transition probabilities (wet to dry, etc.) of the order-1 Markov chain from historical data, for each month
- Generate the precipitation state time series using the Markov chain, e.g. 1 0 0 1 1 0 0 0 1 0 0 ...

Probability of dry and wet days:
  Dry day: 0.60 (p_d)    Wet day: 0.40 (p_w)

Transition probabilities (p_ij), from row state to column state:
              Dry day        Wet day
  Dry day     0.70 (p_dd)    0.30 (p_dw)
  Wet day     0.80 (p_wd)    0.20 (p_ww)

Step 2: K-NN resampling. Suppose we need a weather simulation for January 5th, and January 4th is a wet day.
- Get neighbors from a 7-day window (7*50) centered on January 4th
- Screen days using the precipitation state [(1,0), days in blue], i.e., the “potential neighbors”
- Calculate the distances between the weather variables of the current day's feature vector and the potential neighbors
- Select the K nearest neighbors and assign them weights
- Pick a day from the K-NN using the weight function, say Jan 1st 1953; the simulated weather for Jan 5th is then that of Jan 2nd 1953
- Repeat
[Figure: table of historical daily values, years in rows and days of January and February in columns]
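The first step of the generator, the dry/wet state sequence, can be sketched as follows. The transition probabilities are the ones shown on the slide; the function name `simulate_states` and the 31-day length are assumptions for illustration, and in practice a separate matrix would be estimated for each month.

```python
import random

# Transition probabilities from the slide: P[current][next], 0 = dry, 1 = wet.
P = {
    0: {0: 0.70, 1: 0.30},
    1: {0: 0.80, 1: 0.20},
}

def simulate_states(n_days, first_state, seed=None):
    """Generate a dry/wet (0/1) sequence from a first-order Markov chain."""
    rng = random.Random(seed)
    states = [first_state]
    for _ in range(n_days - 1):
        # Probability that tomorrow is wet, given today's state.
        p_wet = P[states[-1]][1]
        states.append(1 if rng.random() < p_wet else 0)
    return states

states = simulate_states(n_days=31, first_state=1, seed=42)
```

The resulting state series is what the second step uses to screen potential neighbors before the K-NN selection.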

9 Single Site Simulation Pergamino, Argentina –Daily weather variables 1931-2003 Precipitation Max. Temperature Min. Temperature 100 simulations of 73 year length (as length of record) Statistics of simulated and historical data are compared

10 Spell Properties Pergamino, Argentina

11 Wet and dry spell statistics

12 Moments (wet month - Jan)

13 Moments (dry month - July)

14 Conditional K-NN Re-sampling
Conditioned on the IRI seasonal forecast:
- Get the prediction (A:N:B = 40:35:25)
- Divide the historical (seasonal) totals into 3 tercile categories
- Bootstrap 40, 35 and 25 samples of historical years from the wet, normal and dry categories
- Apply the two-step weather generator on this sample
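A minimal sketch of the tercile-conditioned bootstrap described on this slide is given below. The function name, the data layout and the seasonal totals are hypothetical; the only values taken from the slide are the 40:35:25 forecast counts.

```python
import random

def conditional_bootstrap(seasonal_totals, forecast=(40, 35, 25), seed=None):
    """Draw historical years from tercile categories in forecast proportions.

    seasonal_totals: dict year -> seasonal precipitation total (made-up layout).
    forecast: (above, normal, below) sample counts, e.g. A:N:B = 40:35:25.
    """
    rng = random.Random(seed)
    ranked = sorted(seasonal_totals, key=seasonal_totals.get)  # driest first
    n = len(ranked)
    below = ranked[: n // 3]
    normal = ranked[n // 3 : 2 * n // 3]
    above = ranked[2 * n // 3 :]
    n_above, n_normal, n_below = forecast
    # Sample with replacement inside each category.
    sample = (
        rng.choices(above, k=n_above)
        + rng.choices(normal, k=n_normal)
        + rng.choices(below, k=n_below)
    )
    return sample, (above, normal, below)

# Thirty made-up seasonal totals with distinct values.
totals = {1950 + i: 200 + 7 * ((i * 37) % 30) for i in range(30)}
sample, (above, normal, below) = conditional_bootstrap(totals, seed=0)
```

The 100 sampled years would then be passed to the two-step weather generator in place of the full record.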

15 Conditional Weather Generation (results)

16 Multi-site extension
The same procedure as for a single site is used, but:
- Calculate the average time series: the “single site virtual weather data”
- Apply the two-step generator
- Select the weather at all the locations on the picked day to obtain the multi-site simulation
[Map: stations in the Pampas region, Argentina: Pergamino, Junin, Nueve de Julio]
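The multi-site idea can be illustrated with two small helpers: one builds the averaged "virtual" series, the other returns every station's weather on the day picked by the generator. The helper names and the rainfall numbers are assumptions for illustration only; the station names are those on the slide.

```python
def regional_average(station_series):
    """Average daily values across stations to form one 'virtual' series."""
    n_days = len(next(iter(station_series.values())))
    n_sites = len(station_series)
    return [
        sum(series[d] for series in station_series.values()) / n_sites
        for d in range(n_days)
    ]

def weather_on_day(station_series, day_index):
    """Return every station's value on the same picked day."""
    return {name: series[day_index] for name, series in station_series.items()}

# Made-up daily rainfall (mm) for three days at the three stations.
rain = {
    "Pergamino": [0.0, 12.0, 3.0],
    "Junin": [0.0, 9.0, 0.0],
    "Nueve de Julio": [1.5, 6.0, 0.0],
}
virtual = regional_average(rain)   # series fed to the two-step generator
picked = weather_on_day(rain, 1)   # all stations on the picked day
```

Because all stations are taken from the same historical day, the spatial pattern observed on that day carries over to the simulation.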

17 Wet and dry spell statistics: Pergamino, Argentina (multisite case)

18 Basic Distribution Properties

19 Spatial Correlation

20 Motivation
[Diagram: influent water quality (TOC, TSUVA, alkalinity, pH, turbidity, temperature) -> water treatment plant -> finished water quality]
Finished water must comply with a given regulation

21 Motivation
[Diagram: input distribution -> WTP -> output distribution, split into compliance and non-compliance]
Uncertainty helps us to understand the risk of non-compliance with a given regulation

22 Monitoring effort mandated by USEPA Large public water systems Water quality and operating data - Disinfection by-products (DBPs) and microorganisms to support rulemakings Most comprehensive view of large drinking water systems to date Data Set Information Collection Rule (ICR)

23 18 months (Jul. 1997 – Dec. 1998) 458 continental US locations Data Set ICR

24 Data Set Water Quality –Influent –Intermediate –Finished –Distribution system Chemical Additions ICR Database

25 Influent water quality has significant variability due to - climate - geology - water management practices Characterize Variability Source Water TOC TSUVA Alkalinity pH Turbidity Temperature Total Hardness

26 Examine influent water quality for surface waters (SWs) –Spatial variability –Temporal variability Focus on total organic carbon (TOC) –TOC is a precursor in formation of DBPs –Methods extend to other water quality parameters Variability

27 Spatial Variability Variability Local polynomial approach Find best K and P combination Contour estimates

28 Spatial Variability SW Average Annual TOC (mg/L) Variability

29 Spatial Variability Variability Similar spatial patterns found for finished water TOC (lower) and for distribution system DBPs: TTHM (total trihalomethanes) and HAA5 (five haloacetic acids)

30 Spatial Variability Variability Alkalinity Bromide Spatial patterns consistent with previous research for other influent water quality variables

31 Variability Temporal Variability [Figure: monthly influent TOC (mg/L), January to December, City of Boulder’s Betasso Water Treatment Plant (CO)]

32 Variability Temporal Variability Some locations exhibited seasonal trends, others did not Month to month variations should be considered

33 Inherent variability in water quality contributes to uncertainty How can we quantify uncertainty? Variability

34 Simulate “ensembles” of influent water quality (Monte Carlo) Quantify Uncertainty Observed data Ensembles

35 Normal Lognormal Fit a probability density function (pdf) to the data -Normal, Lognormal, etc. Simulate from pdf Quantify Traditional Method

36 Limitations - What if the pdf is not a good fit? - What if you don’t have enough data to make the pdf? ex. 18 months/location in ICR database Quantify

37 Skip fitting a pdf to the data Simulate by bootstrapping Randomly sample data with replacement Expand bootstrapping pool to include “similar” locations (nearest neighbors) What is limited in time is available in space Space-Time Bootstrapping Method Quantify
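A bare-bones sketch of the pooling idea on this slide follows: the site's own short record is combined with the records of "similar" locations, and values are drawn with replacement from the pool. The function name and the TOC values are hypothetical, and the draw here is uniform, with no weighting of nearer neighbors.

```python
import random

def space_time_bootstrap(own_obs, neighbor_obs, n_draws, seed=None):
    """Sample with replacement from a site's data pooled with its neighbors'.

    own_obs: list of observations at the site of interest.
    neighbor_obs: list of lists, one per nearest-neighbor location.
    """
    rng = random.Random(seed)
    pool = list(own_obs)
    for obs in neighbor_obs:
        pool.extend(obs)  # what is limited in time is borrowed from space
    return [rng.choice(pool) for _ in range(n_draws)]

# Made-up monthly TOC values (mg/L), for illustration only.
own = [2.1, 2.4, 3.0]
neighbors = [[1.9, 2.2], [2.8]]
draws = space_time_bootstrap(own, neighbors, n_draws=100, seed=0)
```

No probability density function is fitted at any point; the ensemble consists only of values that were actually observed.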

38 Find nearest neighbors (locations) in terms of a feature vector that includes variables of interest Feature vector includes: - Average Annual Concentration - Latitude - Longitude Quantify

39 Average annual concentration helps find neighbors that are similar but may not be geographically nearby. [Figure: average annual TOC (mg/L) for Ohio surface waters; locations that are geographically close, but not good “neighbors” for bootstrapping] Quantify

40 Sample monthly TOC values based on feature vector Conditional probability

41 Simulation Algorithm 1) User inputs their location and their average annual TOC concentration 2) The ICR database is queried for all eligible entries Quantify

42 Algorithm- cont. 3) Calculate distances, d, between the user vector x_user and the ICR vector x_ICR Quantify

43 Algorithm- cont. 3) Calculate distances using weighted Mahalanobis equation Quantify

44 Algorithm- cont. Quantify Without the weights (W) and the covariance matrix (S), the equation reduces to the Euclidean distance

45 Algorithm- cont. Quantify By including S, the covariance matrix, components of the feature vector do not have to be scaled (Davis 1986)
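The slide equations did not survive in this transcript. For orientation only, the standard (unweighted) Mahalanobis distance between the two feature vectors is written out below, with the notation x_user and x_ICR assumed from the algorithm step above. How the weights W enter the slide's version is not recoverable from the transcript, so they are omitted here.

```latex
% Standard Mahalanobis distance, with S the covariance matrix of the feature vectors
\[
d = \sqrt{(\mathbf{x}_{\mathrm{user}} - \mathbf{x}_{\mathrm{ICR}})^{\mathsf{T}}
          \, S^{-1} \,
          (\mathbf{x}_{\mathrm{user}} - \mathbf{x}_{\mathrm{ICR}})}
\]
% Setting S to the identity matrix gives the Euclidean distance
\[
d_{E} = \sqrt{(\mathbf{x}_{\mathrm{user}} - \mathbf{x}_{\mathrm{ICR}})^{\mathsf{T}}
              (\mathbf{x}_{\mathrm{user}} - \mathbf{x}_{\mathrm{ICR}})}
\]
```

Multiplying by the inverse covariance rescales each component by its variance and accounts for correlation between components, which is why concentration, latitude and longitude can be used together without scaling them first.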

46 Algorithm- cont. Quantify Weights are assigned as

47 Quantify Weights offer flexibility in neighbor selection [Figure: four panels, (a) to (d)]

48 4) Obtain observed monthly data for each nearest neighbor Algorithm- cont. Quantify

49 Algorithm- cont. 5) Bootstrap x_NN using a weight function, which increases the likelihood of picking nearer neighbors Quantify
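The weight function itself is not reproduced in this transcript. The sketch below assumes a 1/j kernel, a common choice in K-NN resampling in which the j-th nearest neighbor receives a weight proportional to 1/j; both that choice and the function names are assumptions, not necessarily what the slides used.

```python
import random

def knn_weights(k):
    """Probability of picking the j-th nearest neighbor: (1/j) / sum_i (1/i)."""
    total = sum(1.0 / i for i in range(1, k + 1))
    return [(1.0 / j) / total for j in range(1, k + 1)]

def pick_neighbor(sorted_neighbors, seed=None):
    """Pick one neighbor; `sorted_neighbors` must be ordered nearest first."""
    rng = random.Random(seed)
    weights = knn_weights(len(sorted_neighbors))
    return rng.choices(sorted_neighbors, weights=weights, k=1)[0]

weights = knn_weights(4)
choice = pick_neighbor(["site_a", "site_b", "site_c", "site_d"], seed=3)
```

Under this assumed kernel the nearest of four neighbors is picked with probability 0.48 and the farthest with probability 0.12, so closer neighbors dominate the ensemble without excluding the others.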

50 Apply algorithm to quantify uncertainty in influent TOC concentration City of Boulder’s Betasso Water Treatment Plant (CO) Boulder SWs only, N = 334 Quantify

51 Red dot is the Boulder plant being simulated Empty black dots are the “neighbors” to be bootstrapped Identify nearest neighbors - Include Boulder in pool for bootstrapping Quantify

52 Box plot of each monthly bootstrap ensemble (100 values) [Legend: median, 25th and 75th percentiles, 5th and 95th percentiles, outliers]
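The quantities shown in each box can be computed directly from an ensemble. The sketch below uses a linearly interpolated percentile, which is one of several common conventions and is an assumption here; the ensemble values are made up, and 101 of them are used rather than 100 so the percentiles fall exactly on data points.

```python
def percentile(values, q):
    """Linearly interpolated percentile (q in 0..100) of a list of numbers."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def box_summary(ensemble):
    """Percentiles drawn in the box plot: 5th, 25th, median, 75th, 95th."""
    return {q: percentile(ensemble, q) for q in (5, 25, 50, 75, 95)}

# Made-up ensemble: the values 0.0, 1.0, ..., 100.0.
ensemble = [float(v) for v in range(101)]
summary = box_summary(ensemble)
```

Values outside the 5th to 95th percentile range would be the ones plotted as outliers.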

53 Quantify Uncertainty quantified for Boulder. Simulates seasonal trends. Provides rich variety of uncertainty. [Figure: box plots of influent TOC (mg/L) by month (J to D) and annual (Ann)]

54 Quantify Overlay recent data. Simulations capture recent data. [Figure: box plots of influent TOC (mg/L) by month (J to D) and annual (Ann), with recent data overlaid]

55 Quantify Portable Across Locations [Figure: box plots of influent TOC (mg/L) by month (J to D) and annual (Ann), City of Birmingham’s Carson Filter Plant (AL)]

58 Quantify Applies to Other Variables [Figure: box plots of influent alkalinity (as mg/L CaCO3) by month (J to D) and annual (Ann), New Jersey American Water Swimming River Treatment Plant (NJ)]

61 Summary & Conclusions
- The K-NN resampling technique provides a simple and robust alternative for generating ‘scenarios’
  - Quantify uncertainty
  - Ensemble forecast
- Very general: can be easily applied to a variety of situations
  - Weather generation
  - Water quality
  - Streamflow (Colorado River Basin)

62 Can readily be extended to generate ‘scenarios’ under climate change or decadal variability modify the ‘feature vector’ to include the climate variability information Rajagopalan and Lall (1999); Yates et al. (2003), Apipattanavis et al. (2007) - all papers in Water Resources Research balajir@colorado.edu

63 Acknowledgements
AwwaRF project 3115 “Decision Tool to Help Utilities Develop Simultaneous Compliance Strategies”
Utilities:
- City of Boulder’s Betasso Water Treatment Plant (CO)
- City of Birmingham’s Carson Filter Plant (AL)
- New Jersey American Water Swimming River Treatment Plant (NJ)
- Greater Cincinnati (OH) Water Works Richard Miller Water Treatment Plant

64 Questions “It is better to be roughly right than precisely wrong.” -John Maynard Keynes (1883-1946)

