1
Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data
Rebecca Buchheit
AIS Lab
2
Background
sporadic use of KDD techniques in civil infrastructure
relative youth of data mining research
difficult to systematically apply KDD process
KDD process tools (CRISP-DM) still under development
KDD process highly domain dependent
time consuming to teach data mining analysts domain knowledge
3
Research Objectives
develop a framework for systematically applying KDD process to civil infrastructure data analysis needs
–set of guidelines for inexperienced analysts
–checklist for more experienced analysts
describe intersection of KDD process characteristics and civil infrastructure
–what problems are well-suited to KDD?
–what characteristics are unique to infrastructure?
4
Summary
increased data collection => increased need to intelligently analyze data
KDD process as a “power tool” for analyzing data for high-level knowledge
civil infrastructure problems are well-suited to data mining but will need to apply entire KDD process to get good results
proposed framework will help researchers to systematically apply KDD process to their data analysis problems
5
Data Quality
What is it?
–in this talk, “accuracy”
–how close is the observed value to the true value?
–“ground truth” is rare
–look for anomalous patterns
Why is it important?
–poor quality data may taint analyses
–patterns of poor quality data may overwhelm data mining/machine learning algorithms
6
Mn/ROAD Data
weigh-in-motion data
–axle spacings and weights, speed, lane, error codes
derived quantities
–equivalent standard axle loads (ESALs)
–FHWA vehicle type
–gross vehicle weight
–total vehicle length
trucks only (type >= 4)
Jan 1 '98 to Dec 31 '00
about 3 million vehicles
courtesy Mn/ROAD
7
Sample Data
8
Overview of Approach
use statistical analysis and data mining algorithms to separate anomalies from normal data
–clustering
–regression
–physical constraints
–statistical properties
focus on differences between anomalies and normal data to help discover causation
9
Clustering
group data into “natural classes”
anomalies separated from normal data
used AutoClass clustering algorithm
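The idea of letting anomalies fall into their own small class can be sketched as follows. This is only an illustration: it uses plain 1-D k-means as a stand-in, not AutoClass (which fits a Bayesian mixture model), and the weights, the function names and the 10% share cutoff are all assumptions made for the example.

```python
def kmeans_1d(values, k=2, iterations=50):
    """Plain 1-D k-means (a stand-in, not AutoClass). Needs k >= 2.

    Returns (centers, labels).
    """
    lo, hi = min(values), max(values)
    # Deterministic start: spread the initial centers evenly across the range.
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


def small_clusters(labels, k, max_share=0.10):
    """Clusters holding at most max_share of the records are anomaly candidates."""
    n = len(labels)
    return [c for c in range(k) if 0 < labels.count(c) <= max_share * n]


# Hypothetical gross vehicle weights in tonnes: mostly trucks, plus a few
# implausibly light records.
weights = [30.0 + (i % 7) for i in range(95)] + [1.0, 1.2, 0.9, 1.1, 1.3]
centers, labels = kmeans_1d(weights, k=2)
suspect = small_clusters(labels, k=2)
```

Here the five light records end up alone in one cluster, which is then reported as suspect. Whether a small cluster is truly anomalous still needs domain review.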
10
Clustering Results
11
Regression
confidence interval of 95%
R-square (fit) = 0.923
if error > 15% then identify as anomaly
∑ESAL = (3.531 ± 0.176) ∑vehicles – (1.252 ± 0.099) ∑axles + (0.066 ± 0.003) ∑GVW – 139.000 ± 79.813
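A minimal sketch of the 15% rule, using the point estimates of the fitted coefficients above. Two things are assumed here because the slide does not state them: that "error" means the difference relative to the predicted value, and that the inputs are totals over one aggregation period. The sample totals are hypothetical.

```python
def predicted_esal(vehicles, axles, gvw):
    """Point estimate from the fitted coefficients (error terms dropped)."""
    return 3.531 * vehicles - 1.252 * axles + 0.066 * gvw - 139.000


def is_anomaly(observed_esal, vehicles, axles, gvw, tolerance=0.15):
    """Flag when observed ESALs differ from the prediction by more than tolerance.

    Assumption: the error is measured relative to the predicted value.
    """
    predicted = predicted_esal(vehicles, axles, gvw)
    if predicted == 0:
        return observed_esal != 0
    return abs(observed_esal - predicted) / abs(predicted) > tolerance


# Hypothetical totals for one period: the prediction is about 1024 ESALs.
normal_period = is_anomaly(1000.0, vehicles=1000, axles=4000, gvw=40000)
odd_period = is_anomaly(700.0, vehicles=1000, axles=4000, gvw=40000)
```

With these inputs the first period is within tolerance (about 2% off) and the second is flagged (about 32% off).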
12
Regression Results
13
Binary Constraints (1)
constraint                            # violations (3,068,384 total)
offscale hit error                    61,129 (1.99%)
significant weight difference error   11,107 (0.36%)
different axle counts error           69,521 (2.27%)
tailgating                            10,211 (0.33%)
speed >= 64.37 km/h                   51,114 (1.86%)
speed <= 128.74 km/h                  3,723 (0.12%)
14
Binary Constraints (2)
constraint                  # violations (3,068,384 total)
gross weight <= 45,359 kg   24,897 (0.81%)
length <= 22.86 m           79,454 (2.59%)
unknown vehicle type        190,191 (6.20%)
number of axles != 0        47 (0.00%)
number of axles <= 8        57,114 (1.86%)
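The numeric limits in the two tables can be checked per record with a few predicates. The sketch below covers only the constraints whose limits appear on the slides. The error-code and vehicle-type checks are left out because their encoding is not given, and the record field names are hypothetical.

```python
# Limits are taken from the slides; the record field names are hypothetical.
CONSTRAINTS = {
    "speed >= 64.37 km/h": lambda r: r["speed_kmh"] >= 64.37,
    "speed <= 128.74 km/h": lambda r: r["speed_kmh"] <= 128.74,
    "gross weight <= 45,359 kg": lambda r: r["gross_weight_kg"] <= 45359,
    "length <= 22.86 m": lambda r: r["length_m"] <= 22.86,
    "number of axles != 0": lambda r: r["axles"] != 0,
    "number of axles <= 8": lambda r: r["axles"] <= 8,
}


def violations(record):
    """Names of the constraints that this record fails."""
    return [name for name, holds in CONSTRAINTS.items() if not holds(record)]


def violation_counts(records):
    """Number of failing records per constraint."""
    counts = {name: 0 for name in CONSTRAINTS}
    for record in records:
        for name in violations(record):
            counts[name] += 1
    return counts


# Two made-up records: one plausible truck, one slow and over-length.
sample = [
    {"speed_kmh": 95.0, "gross_weight_kg": 30000, "length_m": 18.0, "axles": 5},
    {"speed_kmh": 40.0, "gross_weight_kg": 30000, "length_m": 30.0, "axles": 5},
]
counts = violation_counts(sample)
```

Keeping each constraint as a named predicate makes it easy to report counts in the same form as the tables.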
15
Constraint Interactions
c1                  c2                  % interactions
slow speed          length over limit   63.5%
length over limit   slow speed          45.7%
tailgating          unknown type        31.7%
high speed          unknown type        28.7%
overweight          diff axle counts    25.2%
tailgating          slow speed          21.1%
tailgating          length over limit   15.2%
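One plausible reading of "% interactions", assumed here because the slide does not define it, is the share of records violating c1 that also violate c2. That reading would explain why the slow speed / length over limit pair gives different percentages in the two directions. A sketch with made-up records, each represented as the set of constraints it violates:

```python
def interaction_pct(records, c1, c2):
    """Percent of records violating c1 that also violate c2.

    Each record is the set of constraint names it violates.
    Returns None when no record violates c1.
    """
    with_c1 = [r for r in records if c1 in r]
    if not with_c1:
        return None
    both = sum(1 for r in with_c1 if c2 in r)
    return 100.0 * both / len(with_c1)


# Hypothetical violation sets for seven records.
records = [
    {"slow speed", "length over limit"},
    {"slow speed", "length over limit"},
    {"slow speed"},
    {"length over limit"},
    {"length over limit"},
    {"length over limit"},
    set(),
]
slow_then_long = interaction_pct(records, "slow speed", "length over limit")
long_then_slow = interaction_pct(records, "length over limit", "slow speed")
```

In this toy data two of three slow records are also over length, while only two of five over-length records are slow, so the measure is asymmetric.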
16
Distribution Constraints
use a goodness-of-fit test to compare distributions from the same day of week
–length
–gross weight
–ESALs
–lane
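The slide does not say which goodness-of-fit test was used. As an illustration only, the sketch below uses a two-sample Kolmogorov-Smirnov test with the large-sample critical value to compare one day's values against a baseline for the same day of week. The vehicle lengths are made up, and a categorical quantity such as lane would need a different test (for example chi-square).

```python
import math


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step past every observation equal to x in both samples.
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_critical(n, m, alpha=0.05):
    """Large-sample critical value for the two-sample KS statistic."""
    c = math.sqrt(-math.log(alpha / 2) / 2)
    return c * math.sqrt((n + m) / (n * m))


def differs_from_baseline(day, baseline, alpha=0.05):
    """True when the day's sample is unlikely to share the baseline distribution."""
    return ks_statistic(day, baseline) > ks_critical(len(day), len(baseline), alpha)


# Hypothetical vehicle lengths in metres.
baseline = [15.0 + 0.1 * k for k in range(100)]
similar_day = [15.05 + 0.1 * k for k in range(100)]
shifted_day = [5.0 + 0.1 * k for k in range(100)]
```

With 100 values per sample the critical value is about 0.19, so `similar_day` passes and `shifted_day` is flagged.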
17
Anomaly Identification
identify days with higher than normal concentrations of binary constraint violations
identify days that are not likely to have come from the baseline distributions for length, ESALs, gross weight and lane
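What counts as a "higher than normal" concentration is not specified on the slide. One simple option, assumed here, is to flag days whose violation rate lies well above the median, using the median absolute deviation (MAD) as a robust measure of spread. The day labels, rates and the cutoff `k` are hypothetical.

```python
import statistics


def flag_days(daily_rates, k=3.0):
    """Days whose violation rate exceeds median + k * (scaled MAD).

    daily_rates maps a day label to the fraction of that day's records
    violating a constraint. If the MAD is zero, every day above the
    median is flagged.
    """
    rates = list(daily_rates.values())
    median = statistics.median(rates)
    mad = statistics.median([abs(r - median) for r in rates])
    # 1.4826 rescales the MAD to match a standard deviation for normal data.
    threshold = median + k * 1.4826 * mad
    return sorted(day for day, r in daily_rates.items() if r > threshold)


# Hypothetical daily violation rates for one constraint.
rates = {
    "day-01": 0.020,
    "day-02": 0.022,
    "day-03": 0.019,
    "day-04": 0.021,
    "day-05": 0.150,
    "day-06": 0.020,
    "day-07": 0.023,
}
flagged = flag_days(rates)
```

The median and MAD are used instead of the mean and standard deviation so that the anomalous days themselves do not inflate the threshold.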
18
Binary Constraints Results
19
Distribution Constraints Results
20
A Quick Refresher
used four different procedures to detect anomalies
–clustering
–regression
–binary (physical) constraints
–distribution constraints
next up
–what is causing the anomalies?
–can we fix them?
21
Gross Vehicle Weight
22
Lane
23
What Happened?
two vehicles traveling slowly and close together (tailgating) may be recorded as a single vehicle
lightweight vehicles are tailgating cars
–cars not supposed to be in database
–mis-classified because of tailgating
–this causes the “high” vehicle counts
very heavy vehicles are tailgating trucks
lane 1 (right-hand side) data is missing for all “low” vehicle count days
24
Can It Be Fixed? (1)
removed all tailgating cars
–lightweight
–short
–2 or 3 axles
–error code
“halved” all tailgating trucks
–very long
–very heavy
–more than 9 axles
–error code
25
Can It Be Fixed? (2)
inserted lane 1 vehicles from same time period in 2000
“shifted” days to make sure day of week was constant
–Tuesday Sept 8 1998 => Tuesday Sept 5 2000
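The day-of-week shift can be sketched as below. The rule used (take the same calendar day in the target year, then move at most three days to restore the weekday) is an assumption that reproduces the one example on the slide. The function name is hypothetical and the actual procedure may have differed.

```python
from datetime import date, timedelta


def shift_to_year(source, target_year):
    """Date near the same calendar day in target_year with the same weekday."""
    try:
        candidate = source.replace(year=target_year)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap target year; use Feb 28.
        candidate = source.replace(year=target_year, day=28)
    # Offset in the range -3..3 days that restores the original weekday.
    delta = (source.weekday() - candidate.weekday() + 3) % 7 - 3
    return candidate + timedelta(days=delta)


donor_day = shift_to_year(date(1998, 9, 8), 2000)
```

For the slide's example this maps Tuesday 8 September 1998 to Tuesday 5 September 2000, exactly 104 weeks later. Dates within three days of a year boundary can land in the adjacent year.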
26
Summary
statistical analysis and data mining algorithms can be used to detect systematic anomalies in data
–focus on differences between anomalies and normal data to help discover causation
–need domain knowledge to understand causation
27
Current Progress/Future Work
integrate algorithms into data quality assessment program == automation
–physical constraints
–distribution constraints
–other statistical characteristics of data
–clustering
–regression, neural networks
will support infrastructure-related data collection activities
use algorithms to identify and “clean” anomalies
28
Acknowledgements
Minnesota Department of Transportation, especially Maggi Chalkline
based upon work supported by the National Science Foundation, under Grant Numbers 9987871 and DGE 9553380