Published by Alexia Campbell. Modified over 9 years ago.
1
K-Nearest Neighbor Resampling Technique (Weather Generation and Water Quality Applications)
Balaji Rajagopalan, Somkiat Apipattanavis & Erin Towler
Department of Civil, Environmental and Architectural Engineering, University of Colorado, Boulder, CO
Denver Water, February 2007
2
“Translation” of Climate Info
- Users are most interested in sectoral outcomes (streamflows, crop yields, risk of disease X)
[Diagram: climate forecast / projection -> translation -> process models -> distribution of outcomes]
3
Translation
[Diagram: historical data -> synthetic series -> process model -> frequency distribution of outcomes]
4
Why Simulation?
- Limited historical data
  - Cannot capture the full range of variability
- Need: a tool to generate ‘scenarios’ that capture the historical statistical properties
- Several statistical techniques are available (e.g., time series techniques, Monte Carlo techniques)
  - These are cumbersome and restrictive in their assumptions
- Re-sampling techniques are simple and robust
  - The unconditional and conditional bootstrap and the K-nearest neighbor (K-NN) bootstrap offer attractive alternatives
6
Re-sampling Techniques
- Drawing cards from a well-shuffled deck: unconditional bootstrap, Index Sequential Method
  - Select a single historical year (or a set of years) from the record with equal chance
- Drawing cards from a biased deck: conditional bootstrap
  - Select a single historical year (or a set of years) with unequal chance, e.g., selecting only El Nino years
- K-Nearest Neighbor bootstrap: “pattern matching”
  - Select the ‘K’ nearest neighbors (e.g., years) to the current ‘feature’
  - Select one of the K neighbors at random
  - Repeat to produce an ensemble
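The three re-sampling schemes on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the feature values, the year labels and the K = sqrt(n) heuristic are assumptions introduced for the example.

```python
import math
import random

def unconditional_bootstrap(years, rng):
    """Well-shuffled deck: every historical year is equally likely."""
    return rng.choice(years)

def conditional_bootstrap(years, is_eligible, rng):
    """Biased deck: draw only from years that satisfy a condition."""
    pool = [y for y in years if is_eligible(y)]
    return rng.choice(pool)

def knn_bootstrap(current_feature, features, k, rng):
    """Pattern matching: draw one of the K years nearest to the current feature.

    features maps year -> scalar feature value (hypothetical example data).
    """
    ranked = sorted(features, key=lambda y: abs(features[y] - current_feature))
    return rng.choice(ranked[:k])

rng = random.Random(42)
# Hypothetical feature values (e.g. a seasonal climate index) for ten years.
features = {1991 + i: v for i, v in enumerate(
    [-1.2, 0.3, 1.5, -0.4, 0.9, 2.1, -0.8, 0.1, 1.1, -1.9])}
years = list(features)
k = round(math.sqrt(len(years)))  # assumed heuristic: K = sqrt(n)

# Repeat the draw to produce an ensemble of 100 resampled years.
ensemble = [knn_bootstrap(1.0, features, k, rng) for _ in range(100)]
```

With a current feature of 1.0, only the three years whose feature values are closest to 1.0 can appear in the ensemble, which is the "pattern matching" behavior the slide describes.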
7
Examples
- Ensemble weather generation (Pampas region, Argentina)
  - Scenario generation
  - Forecast
- Water quality modeling (Boulder water utility)
8
Two Step Weather Generator

Step 1: Generate the precipitation state time series
- Estimate the transition probabilities (wet to dry, etc.) of a first-order Markov chain from historical data, for each month
- Generate the precipitation state time series using the Markov chain, e.g. 1 0 0 1 1 0 0 0 1 0 0 ...

Probability of dry and wet days
            Dry day       Wet day
            0.60 (p_d)    0.40 (p_w)

Transition probabilities (p_ij)
            Dry day       Wet day
Dry day     0.70 (p_dd)   0.30 (p_dw)
Wet day     0.80 (p_wd)   0.20 (p_ww)

Step 2: Resample the weather with K-NN
Suppose we need the weather simulation for January 5th, and January 4th is a wet day
- Get neighbors from a 7-day window centered on January 4th (7 days x 50 years)
- Screen days using the precipitation state [(1,0), days in blue], i.e., “potential neighbors”
- Calculate the distances between the weather variables of the current day's feature vector and the potential neighbors
- Select the K nearest neighbors
- Assign them weights
- Pick a day from the K-NN using the weight function, say January 1st, 1953
- The simulated weather for January 5th is then that of the following day, January 2nd, 1953
- Repeat

[Table: historical daily precipitation, years (rows) by days of January and February (columns), with potential neighbors highlighted]
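The two steps can be sketched as follows for a single month. This is an illustrative sketch under stated assumptions, not the authors' code: the synthetic `history` record, the "any precipitation is wet" rule, the scaled Euclidean distance, the K = sqrt(n) heuristic and the 1/rank weight kernel are all assumptions, since the slide does not give the distance or weight formulas. The sketch also uses one set of transition probabilities rather than one per month.

```python
import math
import random
import statistics

def state(day):
    """1 for a wet day, 0 for a dry day (assumed rule: any precipitation is wet)."""
    return 1 if day[0] > 0.0 else 0

def transition_probs(history):
    """Step 1a: estimate P(wet tomorrow | state today) for a first-order chain."""
    counts = {0: [0, 0], 1: [0, 0]}
    for series in history:
        for today, tomorrow in zip(series, series[1:]):
            counts[state(today)][state(tomorrow)] += 1
    return {s: c[1] / max(1, sum(c)) for s, c in counts.items()}

def generate_states(p_wet, first_state, n_days, rng):
    """Step 1b: generate the wet/dry state series from the Markov chain."""
    states = [first_state]
    for _ in range(n_days - 1):
        states.append(1 if rng.random() < p_wet[states[-1]] else 0)
    return states

def knn_next_day(current, t, pair, history, scale, rng, half_window=3):
    """Step 2: resample the weather for day t+1 given the weather on day t."""
    for screen in (True, False):  # drop the state screen if nothing matches
        cands = []
        for series in history:
            lo = max(0, t - half_window)
            hi = min(len(series) - 2, t + half_window)
            for d in range(lo, hi + 1):
                if screen and (state(series[d]), state(series[d + 1])) != pair:
                    continue
                dist = math.sqrt(sum(((a - b) / s) ** 2
                                     for a, b, s in zip(current, series[d], scale)))
                cands.append((dist, series[d + 1]))  # keep the successor day
        if cands:
            break
    cands.sort(key=lambda c: c[0])
    k = max(1, round(math.sqrt(len(cands))))      # assumed heuristic K = sqrt(n)
    weights = [1.0 / j for j in range(1, k + 1)]  # assumed 1/rank kernel
    return rng.choices(cands[:k], weights=weights, k=1)[0][1]

def simulate(history, n_days, rng):
    p_wet = transition_probs(history)
    start = rng.choice(history)[0]  # first day drawn from the record
    states = generate_states(p_wet, state(start), n_days, rng)
    columns = zip(*[day for series in history for day in series])
    scale = [statistics.pstdev(col) or 1.0 for col in columns]
    sim = [start]
    for t in range(n_days - 1):
        pair = (states[t], states[t + 1])
        sim.append(knn_next_day(sim[-1], t, pair, history, scale, rng))
    return sim

# Hypothetical record: 30 years of January (precip mm, Tmax C, Tmin C).
rng = random.Random(1)
history = [[(rng.choice([0.0, 0.0, 5.0]) * rng.random(),
             28 + rng.gauss(0, 3), 15 + rng.gauss(0, 3)) for _ in range(31)]
           for _ in range(30)]
january = simulate(history, 31, rng)
```

Because every simulated day is a whole historical day, the cross-correlation between precipitation and temperature on that day is carried over without being modeled explicitly.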
9
Single Site Simulation
- Pergamino, Argentina
  - Daily weather variables, 1931-2003: precipitation, maximum temperature, minimum temperature
- 100 simulations, each 73 years long (the length of the record)
- Statistics of the simulated and historical data are compared
10
Spell Properties Pergamino, Argentina
11
Wet and Dry Spell Statistics
12
Moments (wet month - Jan)
13
Moments (dry month - July)
14
Conditional K-NN Re-sampling
- Conditioned on the IRI seasonal forecast
- Get the prediction as above-normal : normal : below-normal probabilities (A:N:B = 40:35:25)
- Divide the historical seasonal totals into 3 tercile categories
- Bootstrap 40, 35 and 25 samples of historical years from the wet, normal and dry categories, respectively
- Apply the two-step weather generator to this sample
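A minimal sketch of the forecast-conditioned bootstrap described on this slide. The seasonal totals and year labels are hypothetical, and the function names are introduced only for illustration; the 40:35:25 split is the example forecast from the slide.

```python
import random

def tercile_categories(seasonal_totals):
    """Split years into below-, near- and above-normal terciles of the seasonal total."""
    ranked = sorted(seasonal_totals, key=seasonal_totals.get)
    n = len(ranked)
    return {"below": ranked[: n // 3],
            "normal": ranked[n // 3 : n - n // 3],
            "above": ranked[n - n // 3 :]}

def conditional_sample(seasonal_totals, forecast, rng):
    """Draw years (with replacement) in proportion to the A:N:B forecast."""
    cats = tercile_categories(seasonal_totals)
    sample = []
    for category, count in forecast.items():
        sample += rng.choices(cats[category], k=count)
    return sample

# Hypothetical seasonal precipitation totals (mm) for 30 years.
totals = {1971 + i: 300 + 17 * ((i * 7) % 30) for i in range(30)}
forecast = {"above": 40, "normal": 35, "below": 25}  # A:N:B = 40:35:25
sample = conditional_sample(totals, forecast, random.Random(0))
```

The resulting list of 100 years would then serve as the historical pool passed to the two-step weather generator, so wet years are over-represented when the forecast leans wet.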
15
Conditional Weather Generation (results)
16
Multi-site extension
The same procedure as for a single site is used, but:
- Calculate the average time series across stations (“single site virtual weather data”)
- Apply the two-step generator to it
- Select the weather at all the locations on the picked day to obtain the multi-site simulation
[Map: stations in the Pampas region, Argentina: Pergamino, Junin, Nueve de Julio]
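The key idea, picking one historical day from the averaged series and then taking that same day at every station, can be sketched as below. The station values are synthetic, and `pick_day` is a simplified stand-in for the full two-step generator (a plain K-NN pick with an assumed 1/rank kernel and K = sqrt(n)).

```python
import math
import random

def virtual_series(stations):
    """Average the station series day by day ("single site virtual weather data")."""
    return [sum(vals) / len(vals) for vals in zip(*stations.values())]

def pick_day(virtual, current_value, rng):
    """Stand-in for the two-step generator: K-NN pick on the virtual series."""
    idx = sorted(range(len(virtual) - 1),
                 key=lambda d: abs(virtual[d] - current_value))
    k = max(1, round(math.sqrt(len(idx))))        # assumed heuristic K = sqrt(n)
    weights = [1.0 / j for j in range(1, k + 1)]  # assumed 1/rank kernel
    return rng.choices(idx[:k], weights=weights, k=1)[0] + 1  # successor day

def multisite_step(stations, current_value, rng):
    """Return the weather at every station on the single picked day."""
    day = pick_day(virtual_series(stations), current_value, rng)
    return {name: series[day] for name, series in stations.items()}

# Hypothetical daily maximum temperatures (C) at three stations over 60 days.
rng = random.Random(3)
base = [25 + rng.gauss(0, 4) for _ in range(60)]
stations = {name: [b + rng.gauss(0, 1) for b in base]
            for name in ("Pergamino", "Junin", "Nueve de Julio")}
simulated = multisite_step(stations, 27.0, rng)
```

Since all stations share the same picked day, the spatial correlation between sites is inherited from the record rather than estimated.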
17
Wet and Dry Spell Statistics: Pergamino, Argentina (Multisite Case)
18
Basic Distribution Properties
19
Spatial Correlation
20
Motivation
[Diagram: influent water quality (TOC, TSUVA, alkalinity, pH, turbidity, temperature) -> water treatment plant -> finished water quality]
- Finished water must comply with a given regulation
21
Motivation
[Diagram: input distribution -> WTP -> output distribution, divided into "comply" and "non-compliance"]
- Uncertainty helps us understand the risk of non-compliance with a given regulation
22
Data Set: Information Collection Rule (ICR)
- Monitoring effort mandated by USEPA
- Large public water systems
- Water quality and operating data on disinfection by-products (DBPs) and microorganisms, to support rulemakings
- Most comprehensive view of large drinking water systems to date
23
Data Set: ICR
- 18 months (Jul. 1997 to Dec. 1998)
- 458 continental US locations
24
Data Set: ICR Database
- Water quality
  - Influent
  - Intermediate
  - Finished
  - Distribution system
- Chemical additions
25
Characterize Variability
Influent water quality has significant variability due to
- climate
- geology
- water management practices
[Diagram: source water parameters: TOC, TSUVA, alkalinity, pH, turbidity, temperature, total hardness]
26
Variability
- Examine influent water quality for surface waters (SWs)
  - Spatial variability
  - Temporal variability
- Focus on total organic carbon (TOC)
  - TOC is a precursor in the formation of DBPs
  - Methods extend to other water quality parameters
27
Spatial Variability
- Local polynomial approach
- Find the best combination of K (number of neighbors) and P (polynomial order)
- Contour the estimates
28
Spatial Variability
[Figure: average annual TOC (mg/L) for surface waters (SW)]
29
Spatial Variability
Similar spatial patterns found for
- Finished water TOC (lower)
- Distribution system DBPs
  - TTHM (total trihalomethanes)
  - HAA5 (five haloacetic acids)
30
Spatial Variability
[Figure panels: alkalinity, bromide]
- Spatial patterns for other influent water quality variables are consistent with previous research
31
Temporal Variability
[Figure: monthly influent TOC (mg/L), January to December, City of Boulder’s Betasso Water Treatment Plant (CO)]
32
Temporal Variability
- Some locations exhibited seasonal trends, others did not
- Month-to-month variations should be considered
33
Variability
- Inherent variability in water quality contributes to uncertainty
- How can we quantify uncertainty?
34
Quantify Uncertainty
- Simulate “ensembles” of influent water quality (Monte Carlo)
[Figure: observed data and simulated ensembles]
35
Traditional Method
- Fit a probability density function (pdf) to the data: normal, lognormal, etc.
- Simulate from the pdf
[Figure panels: normal, lognormal]
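As a point of comparison for the bootstrap approach, the traditional method can be sketched with the standard library alone. The TOC values below are hypothetical, and the lognormal is fitted by the mean and standard deviation of the logs, which is one common choice rather than necessarily the one used in the talk.

```python
import math
import random
import statistics

def fit_lognormal(data):
    """Fit a lognormal pdf via the mean and standard deviation of log(data)."""
    logs = [math.log(x) for x in data]
    return statistics.mean(logs), statistics.stdev(logs)

def simulate_lognormal(mu, sigma, n, rng):
    """Simulate n values from the fitted lognormal pdf."""
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]

# Hypothetical monthly influent TOC observations (mg/L).
toc = [2.1, 1.8, 2.6, 3.4, 3.9, 3.1, 2.4, 2.0, 1.9, 2.2, 2.5, 2.3]
mu, sigma = fit_lognormal(toc)
ensemble = simulate_lognormal(mu, sigma, 100, random.Random(0))
```

The quality of such an ensemble depends entirely on how well the chosen pdf fits and on having enough data to estimate its parameters.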
36
Limitations
- What if the pdf is not a good fit?
- What if you don’t have enough data to fit the pdf? For example, the ICR database has only 18 months per location
37
Space-Time Bootstrapping Method
- Skip fitting a pdf to the data
- Simulate by bootstrapping: randomly sample the data with replacement
- Expand the bootstrapping pool to include “similar” locations (nearest neighbors)
- What is limited in time is available in space
38
Find nearest neighbors (locations) in terms of a feature vector that includes the variables of interest
Feature vector includes:
- Average annual concentration
- Latitude
- Longitude
39
Average annual concentration helps find neighbors that are similar but may not be geographically nearby
[Figure: average annual TOC (mg/L) for Ohio surface waters; some sites are geographically close but are not good “neighbors” for bootstrapping]
40
Sample monthly TOC values conditioned on the feature vector (conditional probability)
41
Simulation Algorithm
1) User inputs their location and their average annual TOC concentration
2) The ICR database is queried for all eligible entries
42
Algorithm (cont.)
3) Calculate the distances, d, between the user feature vector, x_user, and each ICR feature vector, x_ICR
43
Algorithm (cont.)
3) Calculate the distances using the weighted Mahalanobis equation
44
Algorithm (cont.)
- Without the weights (W) and the covariance matrix (S), this reduces to the Euclidean distance
45
Algorithm (cont.)
- Because the covariance matrix, S, is included, the components of the feature vector do not have to be scaled (Davis 1986)
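The equation itself did not survive in this transcript. One form consistent with the three statements above is sketched here; the exact placement of the weight matrix W is an assumption, chosen so that the quadratic form stays non-negative for a diagonal W with non-negative entries.

```latex
d = \sqrt{\left(\mathbf{x}_{\mathrm{user}} - \mathbf{x}_{\mathrm{ICR}}\right)^{\mathsf{T}}
          W^{1/2}\, S^{-1}\, W^{1/2}
          \left(\mathbf{x}_{\mathrm{user}} - \mathbf{x}_{\mathrm{ICR}}\right)}
```

Setting W = I gives the ordinary Mahalanobis distance, and setting both W = I and S = I gives the Euclidean distance, matching the reduction described above.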
46
Algorithm (cont.)
Weights are assigned as [equation not preserved in the transcript]
47
Weights offer flexibility in neighbor selection
[Figure: panels (a), (b), (c), (d)]
48
Algorithm (cont.)
4) Obtain the observed monthly data for each nearest neighbor
49
Algorithm (cont.)
5) Bootstrap x_NN using a weight function
- Increases the likelihood of picking nearer neighbors
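Steps 1 to 5 can be sketched end to end as follows. This is a simplified illustration with several labeled assumptions: the site records are synthetic, a scaled Euclidean distance is swapped in for the weighted Mahalanobis distance to keep the example dependency-free, K = sqrt(n) is an assumed heuristic, and the 1/rank weight kernel is an assumed stand-in for the weight function, which is not preserved in the transcript.

```python
import math
import random
import statistics

def nearest_neighbors(user, sites, k):
    """Steps 2-3: rank sites by scaled Euclidean distance to the user's vector."""
    cols = list(zip(*(s["feature"] for s in sites)))
    scale = [statistics.pstdev(c) or 1.0 for c in cols]

    def dist(site):
        return math.sqrt(sum(((u - v) / s) ** 2
                             for u, v, s in zip(user, site["feature"], scale)))

    return sorted(sites, key=dist)[:k]

def simulate_month(neighbors, month, n_sims, rng):
    """Steps 4-5: bootstrap monthly values, favoring nearer neighbors."""
    weights = [1.0 / j for j in range(1, len(neighbors) + 1)]  # assumed kernel
    sims = []
    for _ in range(n_sims):
        site = rng.choices(neighbors, weights=weights, k=1)[0]
        sims.append(rng.choice(site["monthly"][month]))
    return sims

# Hypothetical stand-in for the database: 25 sites, one value per month each.
rng = random.Random(7)
sites = []
for _ in range(25):
    toc = rng.uniform(1.0, 6.0)
    sites.append({
        "feature": (toc, rng.uniform(30, 45), rng.uniform(-110, -80)),
        "monthly": {m: [toc * rng.uniform(0.7, 1.3)] for m in range(1, 13)},
    })

# Step 1: hypothetical user input (avg annual TOC mg/L, latitude, longitude).
user = (2.5, 40.0, -105.3)
neighbors = nearest_neighbors(user, sites, k=round(math.sqrt(len(sites))))
ensemble = {m: simulate_month(neighbors, m, 100, rng) for m in range(1, 13)}
```

Each month's 100 simulated values are drawn only from observations at the nearest sites, which is how data that are limited in time are supplemented by data available in space.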
50
Apply the algorithm to quantify uncertainty in influent TOC concentration
- City of Boulder’s Betasso Water Treatment Plant (CO)
[Figure: map with Boulder marked; SWs only, N = 334]
51
Identify nearest neighbors
- The red dot is the Boulder plant being simulated
- The empty black dots are the “neighbors” to be bootstrapped
- Boulder is included in the pool for bootstrapping
52
Box plot of each monthly bootstrap ensemble (100 values)
[Figure legend: median; 25th and 75th percentiles; 5th and 95th percentiles; outliers]
53
Uncertainty quantified for Boulder
[Figure: box plots of influent TOC (mg/L) by month, January to December, plus annual]
- Simulates seasonal trends
- Provides a rich variety of uncertainty
54
Overlay recent data
[Figure: box plots of influent TOC (mg/L) by month, January to December, plus annual, with recent data overlaid]
- Simulations capture the recent data
55
Portable Across Locations
[Figure: box plots of influent TOC (mg/L) by month, January to December, plus annual, City of Birmingham’s Carson Filter Plant (AL)]
58
Applies to Other Variables
[Figure: box plots of influent alkalinity (as mg/L CaCO3) by month, January to December, plus annual, New Jersey American Water Swimming River Treatment Plant (NJ)]
61
Summary & Conclusions
- The K-NN resampling technique provides a simple and robust alternative for generating ‘scenarios’
  - Quantify uncertainty
  - Ensemble forecast
- Very general: can be easily applied to a variety of situations
  - Weather generation
  - Water quality
  - Streamflow (Colorado River Basin)
62
- Can readily be extended to generate ‘scenarios’ under climate change or decadal variability
  - Modify the ‘feature vector’ to include the climate variability information
- References: Rajagopalan and Lall (1999); Yates et al. (2003); Apipattanavis et al. (2007), all in Water Resources Research
- Contact: balajir@colorado.edu
63
Acknowledgements
- AwwaRF project 3115, “Decision Tool to Help Utilities Develop Simultaneous Compliance Strategies”
- Utilities
  - City of Boulder’s Betasso Water Treatment Plant (CO)
  - City of Birmingham’s Carson Filter Plant (AL)
  - New Jersey American Water Swimming River Treatment Plant (NJ)
  - Greater Cincinnati (OH) Water Works Richard Miller Water Treatment Plant
64
Questions
“It is better to be roughly right than precisely wrong.” - John Maynard Keynes (1883-1946)