Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data Rebecca Buchheit AIS Lab.

Background
sporadic use of KDD techniques in civil infrastructure
relative youth of data mining research
difficult to systematically apply KDD process
KDD process tools (CRISP-DM) still under development
KDD process highly domain dependent
time consuming to teach data mining analysts domain knowledge

Research Objectives
develop a framework for systematically applying KDD process to civil infrastructure data analysis needs
  – set of guidelines for inexperienced analysts
  – checklist for more experienced analysts
describe intersection of KDD process characteristics and civil infrastructure
  – what problems are well-suited to KDD?
  – what characteristics are unique to infrastructure?

Summary
increased data collection => increased need to intelligently analyze data
KDD process as a “power tool” for analyzing data for high-level knowledge
civil infrastructure problems are well-suited to data mining but will need to apply entire KDD process to get good results
proposed framework will help researchers to systematically apply KDD process to their data analysis problems

Data Quality
What is it?
  – in this talk, “accuracy”
  – how close is the observed value to the true value?
  – “ground truth” is rare
  – look for anomalous patterns
Why is it important?
  – poor quality data may taint analyses
  – patterns of poor quality data may overwhelm data mining/machine learning algorithms

Mn/ROAD Data
weigh-in-motion data
  – axle spacings and weights, speed, lane, error codes
derived quantities
  – equivalent standard axle loads (ESALs)
  – FHWA vehicle type
  – gross vehicle weight
  – total vehicle length
trucks only (type >= 4)
Jan 1 ’98 to Dec 31 ’00
about 3 million vehicles
courtesy Mn/ROAD

Sample Data

Overview of Approach
use statistical analysis and data mining algorithms to separate anomalies from normal data
  – clustering
  – regression
  – physical constraints
  – statistical properties
focus on differences between anomalies and normal data to help discover causation

Clustering
group data into “natural classes”
anomalies separated from normal data
used Autoclass clustering algorithm

Clustering Results

Regression
confidence interval of 95%
R-square (fit) = 0.923
if error > 15% then identify as anomaly
∑ESAL = (3.531 ± 0.176) ∑vehicles – (1.252 ± 0.099) ∑axles + (0.066 ± 0.003) ∑GVW – (139.000 ± 79.813)
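The regression check can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: it uses only the point estimates from the slide (the ± terms are ignored), assumes the sums are taken over one aggregation period such as a day, and assumes that "error" means the relative difference between observed and predicted ESALs. The function names are hypothetical, and the units of GVW are whatever the original fit used, which the slide does not state.

```python
# Point estimates copied from the regression slide; the +/- terms are ignored.
COEF_VEHICLES = 3.531
COEF_AXLES = -1.252
COEF_GVW = 0.066
INTERCEPT = -139.000
ERROR_THRESHOLD = 0.15  # "if error > 15% then identify as anomaly"


def predicted_esal(n_vehicles, n_axles, total_gvw):
    """Predicted total ESALs for one aggregation period (assumed: one day)."""
    return (COEF_VEHICLES * n_vehicles
            + COEF_AXLES * n_axles
            + COEF_GVW * total_gvw
            + INTERCEPT)


def is_regression_anomaly(observed_esal, n_vehicles, n_axles, total_gvw):
    """Flag the period when the relative error exceeds the threshold.

    Assumption: "error" = |observed - predicted| / |predicted|.
    """
    predicted = predicted_esal(n_vehicles, n_axles, total_gvw)
    if predicted == 0:
        # Relative error is undefined; treat the period as suspicious.
        return True
    relative_error = abs(observed_esal - predicted) / abs(predicted)
    return relative_error > ERROR_THRESHOLD
```

A period whose observed ESAL total sits within 15% of the prediction passes; anything further away is marked for inspection.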

Regression Results

Binary Constraints (1)
constraint                            # violations (3,068,384 total)
offscale hit error                    61,129 (1.99%)
significant weight difference error   11,107 (0.36%)
different axle counts error           69,521 (2.27%)
tailgating                            10,211 (0.33%)
speed >= 64.37 km/h                   51,114 (1.86%)
speed <= 128.74 km/h                  3,723 (0.12%)

Binary Constraints (2)
constraint                   # violations (3,068,384 total)
gross weight <= 45,359 kg    24,897 (0.81%)
length <= 22.86 m            79,454 (2.59%)
unknown vehicle type         190,191 (6.20%)
number of axles != 0         47 (0.00%)
number of axles <= 8         57,114 (1.86%)
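The numeric constraints from the two tables can be expressed as simple predicates over a vehicle record. The sketch below is illustrative only: the record layout and field names (speed_kmh, gross_weight_kg, length_m, num_axles) are assumptions, and the sensor-reported conditions (error codes, tailgating, unknown vehicle type) are left out because the slides do not say how they are encoded.

```python
# Each entry maps a constraint (as worded on the slides) to a predicate
# that returns True when the record satisfies it.
CONSTRAINTS = {
    "speed >= 64.37 km/h": lambda v: v["speed_kmh"] >= 64.37,
    "speed <= 128.74 km/h": lambda v: v["speed_kmh"] <= 128.74,
    "gross weight <= 45,359 kg": lambda v: v["gross_weight_kg"] <= 45359,
    "length <= 22.86 m": lambda v: v["length_m"] <= 22.86,
    "number of axles != 0": lambda v: v["num_axles"] != 0,
    "number of axles <= 8": lambda v: v["num_axles"] <= 8,
}


def violated_constraints(vehicle):
    """Return the names of all constraints that this record violates."""
    return [name for name, holds in CONSTRAINTS.items() if not holds(vehicle)]


def count_violations(vehicles):
    """Tally violations per constraint across an iterable of records."""
    counts = {name: 0 for name in CONSTRAINTS}
    for vehicle in vehicles:
        for name in violated_constraints(vehicle):
            counts[name] += 1
    return counts
```

Keeping the constraints in one table makes it easy to report a per-constraint tally like the one on the slides and to add further checks later.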

Constraint Interactions
c1                  c2                  % interactions
slow speed          length over limit   63.5%
length over limit   slow speed          45.7%
tailgating          unknown type        31.7%
high speed          unknown type        28.7%
overweight          diff axle counts    25.2%
tailgating          slow speed          21.1%
tailgating          length over limit   15.2%

Distribution Constraints
use a goodness-of-fit test to compare distributions from the same day of week
  – length
  – gross weight
  – ESALs
  – lane
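The slide does not name the goodness-of-fit test that was used. As one possible illustration, the sketch below computes a two-sample Kolmogorov-Smirnov statistic in plain Python and compares it with the usual large-sample critical value. The function names are hypothetical, and the choice of test is an assumption rather than the method from the talk.

```python
import math
from bisect import bisect_right


def ks_statistic(baseline, sample):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical cumulative distribution functions."""
    a = sorted(baseline)
    b = sorted(sample)
    largest_gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def differs_from_baseline(baseline, sample, c_alpha=1.358):
    """Compare the statistic with the large-sample critical value
    c_alpha * sqrt((n + m) / (n * m)); c_alpha = 1.358 corresponds to a
    5% significance level."""
    n, m = len(baseline), len(sample)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(baseline, sample) > critical
```

Here baseline would hold, for example, the vehicle lengths from typical Tuesdays and sample the lengths from the Tuesday under test. The Kolmogorov-Smirnov test suits continuous quantities such as length, gross weight and ESALs; for a categorical quantity such as lane, a chi-square test on category counts would be the more natural choice.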

Anomaly Identification
identify days with higher than normal concentrations of binary constraint violations
identify days that are not likely to have come from the baseline distributions for length, ESALs, gross weight and lane
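The slide does not define "higher than normal". One simple reading, shown here purely as an assumption, is to flag a day when its violation rate lies more than k standard deviations above the mean daily rate. The function name, the input layout and the default k = 2 are all hypothetical.

```python
from statistics import mean, pstdev


def flag_high_violation_days(daily_counts, k=2.0):
    """Return the days whose violation rate is unusually high.

    daily_counts maps a day label to (n_violations, n_vehicles).
    Assumption: "higher than normal" means rate > mean + k * std,
    computed over all days that recorded at least one vehicle.
    """
    rates = {
        day: violations / vehicles
        for day, (violations, vehicles) in daily_counts.items()
        if vehicles > 0
    }
    if len(rates) < 2:
        return []  # not enough days to estimate what "normal" is
    threshold = mean(rates.values()) + k * pstdev(rates.values())
    return sorted(day for day, rate in rates.items() if rate > threshold)
```

Because extreme days inflate both the mean and the standard deviation, a median-based threshold may be more robust when many days are anomalous.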

Binary Constraints Results

Distribution Constraints Results

A Quick Refresher
used four different procedures to detect anomalies
  – clustering
  – regression
  – binary (physical) constraints
  – distribution constraints
next up
  – what is causing the anomalies?
  – can we fix them?

Gross Vehicle Weight

What Happened?
two vehicles traveling slowly and close together (tailgating) may be recorded as a single vehicle
lightweight vehicles are tailgating cars
  – cars not supposed to be in database
  – mis-classified because of tailgating
  – this causes the “high” vehicle counts
very heavy vehicles are tailgating trucks
lane 1 (right-hand side) data is missing for all “low” vehicle count days

Can It Be Fixed? (1)
removed all tailgating cars
  – lightweight
  – short
  – 2 or 3 axles
  – error code
“halved” all tailgating trucks
  – very long
  – very heavy
  – more than 9 axles
  – error code
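A rough sketch of this cleaning step follows. Several things here are assumptions, not details from the talk: the record field names, the numeric thresholds for "lightweight", "short", "very long" and "very heavy" (the truck defaults simply reuse the weight and length limits from the binary constraints), and the reading of "halved" as dividing weight, length and axle count by two.

```python
def fix_tailgating(vehicles,
                   car_max_weight_kg=4000.0,      # hypothetical threshold
                   car_max_length_m=8.0,          # hypothetical threshold
                   truck_min_weight_kg=45359.0,   # assumed: binary-constraint limit
                   truck_min_length_m=22.86):     # assumed: binary-constraint limit
    """Drop records that look like tailgating cars and halve records that
    look like two tailgating trucks merged into one."""
    cleaned = []
    for v in vehicles:
        if not v["tailgating_error"]:
            cleaned.append(v)
            continue
        looks_like_cars = (v["gross_weight_kg"] <= car_max_weight_kg
                           and v["length_m"] <= car_max_length_m
                           and v["num_axles"] in (2, 3))
        looks_like_trucks = (v["gross_weight_kg"] > truck_min_weight_kg
                             and v["length_m"] > truck_min_length_m
                             and v["num_axles"] > 9)
        if looks_like_cars:
            continue  # cars are not supposed to be in the database
        if looks_like_trucks:
            # Assumed meaning of "halved": split the merged measurements in two.
            v = {**v,
                 "gross_weight_kg": v["gross_weight_kg"] / 2,
                 "length_m": v["length_m"] / 2,
                 "num_axles": v["num_axles"] // 2}
        cleaned.append(v)
    return cleaned
```

Records carrying the tailgating error code that match neither profile are passed through unchanged so that nothing is altered without a stated rule.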

Can It Be Fixed? (2)
inserted lane 1 vehicles from same time period in 2000
“shifted” days to make sure day of week was constant
  – Tuesday Sept 8 1998 => Tuesday Sept 5 2000
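The day shift can be illustrated with a small helper that maps a date to the nearest date in another year falling on the same day of the week. The nearest-date rule is an assumption; the slide gives only the single example above, which this rule reproduces. The function name is hypothetical.

```python
from datetime import date, timedelta


def shift_to_same_weekday(day, target_year):
    """Return the date in target_year closest to the same month and day
    that falls on the same day of the week as `day`."""
    try:
        candidate = day.replace(year=target_year)
    except ValueError:
        # Feb 29 does not exist in the target year; start from Feb 28.
        candidate = date(target_year, 2, 28)
    # Days to move forward to land on the original weekday (0..6).
    offset = (day.weekday() - candidate.weekday()) % 7
    if offset > 3:
        offset -= 7  # moving backward is the shorter path
    return candidate + timedelta(days=offset)
```

For the example on the slide, shift_to_same_weekday(date(1998, 9, 8), 2000) gives date(2000, 9, 5), a Tuesday in both years.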

Summary
statistical analysis and data mining algorithms can be used to detect systematic anomalies in data
  – focus on differences between anomalies and normal data to help discover their causes
  – need domain knowledge to understand causation

Current Progress/Future Work
integrate algorithms into data quality assessment program (automation)
  – physical constraints
  – distribution constraints
  – other statistical characteristics of data
  – clustering
  – regression, neural networks
will support infrastructure-related data collection activities
use algorithms to identify and “clean” anomalies

Acknowledgements
Minnesota Department of Transportation, especially Maggi Chalkline
based upon work supported by the National Science Foundation, under Grant Numbers 9987871 and DGE 9553380
