Geog 458: Map Sources and Errors
Uncertainty
January 23, 2006
Outline
1. Defining uncertainty
2. How to calculate uncertainty?
   1) Nominal case: confusion matrix
   2) Interval/ratio case: RMSE
3. How to validate uncertainty?
   1) Internal validation: MAUP
   2) External validation: conflation
1. Defining uncertainty
- Definition of uncertainty
  - Discrepancy between reality and its representation
- Different kinds of uncertainty
  - Vagueness: the representation does not fit the essence of reality well (e.g. representing cities as a point layer, or soil with crisp boundaries); better human conceptualization is needed
  - Ambiguity: the representation is not universally agreed upon by users (e.g. place names, occupation classification, indicators of environmental health); standardization is needed
- Accuracy vs. precision
  - Accuracy: difference between true values and those recorded in the database
  - Precision: amount of detail present in the data
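The distinction between accuracy and precision can be illustrated with a small sketch. The reference value and the two recorded values below are hypothetical, chosen only to show that a value stored with many decimal places (high precision) can still be far from the truth (low accuracy), and vice versa.

```python
def absolute_error(recorded: float, true_value: float) -> float:
    """Accuracy proxy: how far the recorded value is from the true value."""
    return abs(recorded - true_value)


def decimal_places(stored_text: str) -> int:
    """Precision proxy: number of digits stored after the decimal point."""
    return len(stored_text.split(".")[1]) if "." in stored_text else 0


# Hypothetical reference latitude in decimal degrees (an assumption for this sketch).
TRUE_LAT = 47.6062

# Two hypothetical database records of the same location, kept as text
# so that the stored level of detail is visible.
record_a = "47.9138217"  # very detailed, but far from the reference value
record_b = "47.6"        # little detail, but close to the reference value

for label, stored in (("A", record_a), ("B", record_b)):
    error = absolute_error(float(stored), TRUE_LAT)
    print(f"record {label}: {decimal_places(stored)} decimals, error {error:.4f} degrees")
```

Record A is more precise but less accurate than record B, which is why the two terms should be reported separately.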
Questions
- What is your diagnosis among {uncertainty, precision, positional accuracy, attribute accuracy, vagueness, ambiguity}, and what is your prescription?
  - Longitude values in decimal degrees are stored as integers
  - Contour lines derived from a DEM do not line up well with the DRG
  - The map indicates that this road is bidirectional, but it turns out to be one-way
  - Implementing an intelligent geocoding system based on English prepositions (e.g. across, at, over) for international users
  - Is the boundary of Mt. Everest well delineated? Is this polygon boundary a good representation of Mt. Everest?
- Which is broadest? How would you communicate these errors in your data quality report?
2. Calculating accuracy
- Nominal case
  - Confusion matrix (a.k.a. misclassification matrix)
- Interval/ratio case
  - Root Mean Square Error (RMSE)
The confusion matrix is widely used to report attribute accuracy when the attribute is measured at a nominal scale. RMSE is widely used to report positional accuracy when position is measured at a numeric scale (e.g. x, y coordinates are metric).
Confusion Matrix
- Table 6.2 (p. 138): evaluating the classification of land parcels; there are five land use codes, A to E
- Rows and columns in the misclassification matrix
  - Each row corresponds to the class as recorded in the database
  - Each column corresponds to the class as recorded in the field
- Correctly classified vs. incorrectly classified
  - Diagonal entries represent agreement between database and field
  - Off-diagonal entries represent disagreement between database and field
- So how accurate would you say this data is?
  - Since 206 parcels (the sum of the diagonal entries) are correctly classified out of 304, the overall accuracy is 206/304 = 67.8%
Confusion matrix: exercise
- Let's say you decide to write a test report on the attribute accuracy of a land use map
- 100 reference points are selected to represent three classes in your data: 49 points from natural, 28 points from agricultural, and 23 points from urban land use
- Field checks resulted in 41 points confirmed to be natural, 21 points confirmed to be agricultural, and 19 points confirmed to be urban
- What is the overall accuracy of your data?
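One way to check the exercise is to arrange the counts as a confusion matrix and divide the diagonal sum by the total. The sketch below uses the row totals (49, 28, 23) and diagonal counts (41, 21, 19) from the exercise; the off-diagonal split is not given in the exercise and is an assumption made only so the rows add up. The function names are illustrative.

```python
def overall_accuracy(matrix):
    """Share of checked points whose database class matches the field class."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_accuracy(matrix):
    """For each database class (row), the share confirmed in the field."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


classes = ["natural", "agricultural", "urban"]

# Rows: class recorded in the database. Columns: class found in the field.
# Diagonal and row totals come from the exercise; the off-diagonal
# counts are hypothetical.
matrix = [
    [41, 5, 3],   # 49 points recorded as natural
    [4, 21, 3],   # 28 points recorded as agricultural
    [2, 2, 19],   # 23 points recorded as urban
]

print(f"overall accuracy: {overall_accuracy(matrix):.0%}")
for name, share in zip(classes, per_class_accuracy(matrix)):
    print(f"{name}: {share:.1%}")
```

Overall accuracy depends only on the diagonal and the total, so the assumed off-diagonal values do not affect it; they would matter for a per-column (field-based) breakdown.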
Root Mean Square Error
- RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(c_i - a_i)^2}, where c_i is the observed value, a_i is the true value, and n is the number of observations
- RMSE is the square root of the mean of the squared differences between each observed value (c_i) and its corresponding true value (a_i)
- Indicates how much the observed values deviate from the true values
- In the case of positional accuracy, a_i is derived from a source of higher accuracy
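A minimal sketch of the RMSE calculation, following the definition above with c_i as observed values and a_i as true values. The sample numbers are hypothetical and only serve to show the steps.

```python
import math


def rmse(observed, true_values):
    """Square root of the mean squared difference between c_i and a_i."""
    if len(observed) != len(true_values) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    squared_diffs = [(c - a) ** 2 for c, a in zip(observed, true_values)]
    return math.sqrt(sum(squared_diffs) / len(squared_diffs))


# Hypothetical values: differences are 1, 0, -2, 0, so the squared
# differences are 1, 0, 4, 0 and their mean is 1.25.
observed = [10.0, 12.0, 9.0, 11.0]
true_values = [9.0, 12.0, 11.0, 11.0]

print(round(rmse(observed, true_values), 3))
```

Because the differences are squared before averaging, a single large deviation raises RMSE more than several small ones of the same total size.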
RMSE: exercise
- Let's say you decide to write a test report on the positional accuracy of NHPN data
- You obtain data from sources with a higher positional accuracy, such as geodetic points
- 7 points (intersections) are selected to be compared to 7 corresponding control points
- Distances for 7 pairs are calculated as follows
- What is RMSE?
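For positional accuracy, the difference for each pair is the distance between a test point and its control point. The sketch below assumes planar (projected) x, y coordinates in the same units; the coordinates are hypothetical and are not the seven pairs from the exercise, whose distances are not reproduced here.

```python
import math


def pair_distances(test_points, control_points):
    """Planar distance between each test point and its control point."""
    return [
        math.hypot(tx - cx, ty - cy)
        for (tx, ty), (cx, cy) in zip(test_points, control_points)
    ]


def rmse_from_distances(distances):
    """RMSE when each distance is already the error for one pair."""
    return math.sqrt(sum(d ** 2 for d in distances) / len(distances))


# Hypothetical coordinates in metres (an assumption for this sketch).
test_points = [(100.0, 200.0), (250.0, 80.0), (40.0, 310.0)]
control_points = [(103.0, 204.0), (250.0, 80.0), (46.0, 302.0)]

distances = pair_distances(test_points, control_points)  # 5, 0 and 10 metres
print(distances, round(rmse_from_distances(distances), 2))
```

If the distances are already tabulated, as in the exercise, only `rmse_from_distances` is needed.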
3. Validating accuracy
- Internal validation
  - Examines the likely impacts of uncertainty upon operation results within GIS
  - What would be the effects of different data aggregation schemes on operation results?: MAUP
- External validation
  - Validates the accuracy of test data in reference to external data sources
  - How accurate is this data set relative to reference data?: Conflation
Modifiable Areal Unit Problem
- Quite simply, different aggregations yield different results
  - From Openshaw
- Because sometimes geography does not have a natural unit of analysis
  - Population, vegetation
  - Remember that a census unit is an artificial boundary drawn for the purpose of enumeration
  - Space is used as a sampling scheme
- Question of the optimal unit of analysis
  - Urban center boundary for analyzing urban activities
  - Metropolitan area for analyzing the spatial labor market
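The statement that different aggregations yield different results can be shown with a toy grid. The grid values and the two zoning schemes below are hypothetical: the same 16 cells are aggregated once into horizontal strips and once into vertical strips, and the zone means come out quite differently.

```python
from statistics import mean, pvariance

# Hypothetical 4 x 4 grid of cell values with a west-to-east gradient.
grid = [
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 2, 3, 4],
]

# Zoning scheme 1: each row of cells is one zone (horizontal strips).
row_zone_means = [mean(row) for row in grid]

# Zoning scheme 2: each column of cells is one zone (vertical strips).
column_zone_means = [mean(column) for column in zip(*grid)]

print("row zones:   ", row_zone_means, "variance", pvariance(row_zone_means))
print("column zones:", column_zone_means, "variance", pvariance(column_zone_means))
```

With horizontal strips the zones look identical, while with vertical strips the gradient is fully visible, even though the underlying cell values never changed.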
Conflation
- Describes the range of functions that attempt to overcome differences between datasets or merge their contents, as with rubber-sheeting
- Visual inspection of a spatial overlay of a TIGER file over GPS measurements
- Lab 2: working with data from different sources, conflating test data with data from an independent source (higher accuracy), visual inspection of positional accuracy, summarizing positional accuracy of test data with RMSE