Presentation is loading. Please wait.

Presentation is loading. Please wait.

PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte.

Similar presentations


Presentation on theme: "PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte."— Presentation transcript:

1 PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

2 Road Map Initial Data Cleaning (more cleaning later)  Set null values as appropriate  Set Data Quality Flags based on row-level data integrity violations Aggregate Calculations  Calculate 5-minute aggregates  Calculate 15-min, 1-hour aggregates Portal Output Calculations  Timeseries  Grouped Data  Performance

3 Road Map: Initial Data Cleaning Set NULL values Data Quality Flags

4 Set Null Values as Appropriate Case 1: No Data Received from Detector  ODOT indicates vol = occ = speed = -1, status = 0 means no data received from detector ** (status 0 is inhibited)  We set volume, occupancy and speed to NULL in this case Case 2: No Activity at Detector  ODOT indicates vol = occ = speed = 0, status = 0 means no activity at detector **  We set speed to NULL in this case Case 3: Disabled (?)  Also get readings: vol = occ = speed = -1, status = 1 (status 1 is disabled)  Similar to Case 1, except status = 1, so assume this does mean disabled  We set volume, occupancy and speed to NULL in this case  Should verify the meaning of vol = occ = speed = -1, status = 1 with ODOT ** as of 2-14-05; fields were different before No -999’s in data received live; there are -999s in 2001, 2002 data

5 Data Quality Flags Detect data inconsistencies and erroneous values Row-level/Reading-level flags  Flags are set for each (vol, occ, speed) reading we get  For each reading (row), a flag is set to indicate if that row fails a particular criteria Examples  Max Speed, set if speed > 100 mph  Max Occ, set if occ > 95% Adapted from TTI (see wiki for link)

6 Data Quality Flags – Single Value Flags Maximum Volume (DQ_MAXVOL)  Condition: volume > 17 (20-sec data)  Comments: Equivalent to 3060 vph; may be caused by motorcycles or lane changes for example Maximum occupancy (DQ_MAXOCC)  Condition: occupancy > 95%  Comments: detectors may be “stuck-on” Maximum speed (DQ_MAXSPD)  Condition: speed > 100 mph  Comments: based on empirical evidence Minimum speed (DQ_MINSPD)  Condition: speed < 5 mph  Comments: based on empirical evidence

7 Data Quality Flags – Multi-variate Consistency Multi-variate consistency #1 (DQ_MVC_S0VGT0)  Speed = 0 and Volume > 0  Indicates potential problem with: speed  Is this necessary? We already eliminate speed < 5 mph Multi-variate consistency #2 (DQ_MVC_SGT0V0)  Speed > 0 and Volume = 0  Indicates potential problem with: speed Multi-variate consistency #3 (DQ_MVC_OGT0V0)  Occupancy > 0 and Volume = 0  Note: alternative: set occupancy = 0  Indicates potential problem with: occupancy

8 Road Map: Aggregate Calculations 5-minute aggregates 15-min, 1-hr aggregates

9 Missing Values in 5-min Aggregates of Vol, Speed, Occ Raw Aggregates  Sum/avg data with no imputation, rescaling or cleaning Processed Aggregates  Two issues:  #1 Is there enough data to calculate an aggregate?  #2 If there is enough data, how to handle the fact that we are missing values?

10 Missing Values in 5-min Aggregates of Vol, Speed, Occ, Cont’d Issue #1: Determining if a time period has enough valid data to calculate aggregates for the time period  Aggregate vol, speed, occ set to null if less than X% of the readings are valid  Do sensitivity analysis to pick X (start with 75% - set 5-min aggregate to null if we have less than 75% of the data)

11 Bad Data Threshold Rob - In general, throughout the development of the aggregation process, we consistently encountered the issue of how to determine when there is enough data to create an aggregate vs. when an aggregate value should be declared null because of lack of data. For starters, we’ve chosen a threshold of 75%. If 75% or more of the 20- second readings in a 5-minute interval are valid, we calculate an aggregate for that 5-minute interval. The same issue arises in the following situations:  Calculating 5-min aggregates from 20-second data  Calculating 15-min aggregates from 5-min data  Calculating 1-hr aggregates from 15-min data  Grouped Data: grouping over multiple days (group by hour)  Grouped Data: aggregating 1-hr data to daily data for group by day of week and week of year Related issues:  Aggregating over lanes (how to handle the situation when data is missing from one lane, note that data may be missing from a lane for only a portion of a day) We wonder if studying this percentage (data sensitivity) would make a good paper topic.

12 #2 Handling Missing Values Two basic methods for dealing with missing values  1) Use preceding/following values (replace missing volume with preceding volume reading for that detector)  2) Aggregate values in a 5 minute range and scale based on # of readings Since we are aggregating, suggest method 2  We are aggregating, so interpolation may not be necessary  Think 2 will perform similarly to 1 in cases when only a few readings are missing, when lots of readings missing, should report that data is not available (null)  Method 2 is well-defined (simple to define and express/explain)  Method 2 is easier to implement than 1

13 5-min Aggregates (vol, spd, occ) Volume  For each detector, sum all valid volume readings in 5- minute period (eliminate any readings with non-zero flags) and rescale by # of missing readings Speed, Occupancy  For each detector, calculate the volume-weighted average of all valid readings in a 5-minute period  Assume 20-second readings are space-mean speed  Weight by volume (to be technically correct, we need to weight by density, but that requires estimating density)

14 5-Min Aggregate – Other Measures Vehicle Miles Traveled (VMT)  5-min VMT = 5-min volume * length of midpoint influence area of associated detector Vehicle Hours Traveled (VHT)  5-min VHT = 5-min volume * (length of midpoint influence area / 5-min space-mean speed) Travel Time  5-min travel time = length of midpoint influence area / (5-min space-mean speed)

15 5-min Aggregates – Other – Cont’d Total Delay (in veh-mins)  5-min delay = 5-min volume * (travel time – (influence area / free-flow speed))  Where: free-flow speed = min {assumed free-flow speed, actual 5-min speed}  Note: Free-flow speed is computed in this way in order to avoid negative delay Average Delay/Vehicle (in minutes)  5-min delay = Total Delay/5-min volume Should we report Average or Total Delay or both?

16 Summary: Which values/measures affect which calculations? 20-sec volume 20-sec speed 20-sec occupancy 5-min volume 5-min speed 5-min occupancy 5-min VMT 5-min Travel Time 5-min Delay 5-min VHT ** * May be needed, depends on method used ** Need 5-min speed to calculate free-flow speed *

17 Road Map: 15-min, 1-hr Aggregates Issues:  Calculated from 5-min aggregates or from 20- second data? We chose to use 5-min data  What should the threshold be for % of bad data? Chose to toss 15-min reading if we are missing more than one 5-min reading For now, toss 1 hr reading if we have less than 75% of 5-min readings (or less than 9 readings) Note the “bad data” threshold issue arises here, also see next page.

18 Aggregation over Larger Intervals 15-min aggregates are calculated from 5-min aggregates, set aggregate value to null if missing more than one 5- minute reading 1-hr aggregates are calculated from 5- min aggregates, set aggregate value to null if missing more than three 5- minute readings

19 Aggregation over Time, Cont’d Volume  Sum 5-min volumes Occupancy  Volume-weighted average of 5-min occupancies Speed  Volume-weighted average of 5-min speeds Vehicle Miles Traveled  sum of 5-min VMTs Vehicle Hours Traveled  sum of 5-min VHTs Travel Time  Volume-weighted average of 5-min travel times Delay  Total Delay: Sum of 5-min delays  Average Delay: Volume-weighted average of 5-min delays

20 Road Map: PORTAL Output We have described how to do the 5-min, 15-min and 1-hr aggregations For each page in PORTAL, analyze what calculations are needed and how those calculations should be done

21 Timeseries – Line Graphs Timeseries  Options: graph by station or lane, 7 basic measures, 5-min, 15-min, 1-hr granularity of input data  For Lane Graph These values are in aggregate table (calculation described in previous slides) Requires 5-min (15-min, 1-hr) aggregates of each value for each detector If data is missing, put blank space in the graph  For Station Graph: Requires aggregation over lanes Calculation described in subsequent slides If data is missing from one or more lanes, show blank space on graph

22 Timeseries – Contour Plots Contour Plot (Surface Plot?) for a Highway  Requires same calculations as for station plot on previous slide

23 Aggregation over Lanes Volume  Sum Occupancy  Volume-weighted average of by-lane occupancies Speed  Volume-weighted average of by-lane speeds Vehicle Miles Traveled  Sum over lanes Vehicle Hours Traveled  Sum over lanes Travel Time  Volume-weighted average of by-lane travel times Average Delay  Volume-weighted average of by-lane delays Total Delay  Sum of by-lane delays

24 Grouped Data – Line Plots Line Plots by Station/Detector  Group by: 5 minute (minute of day) Input is 5-minute aggregations Simple arithmetic average, min, max, and std dev of data from appropriate 5-min period Result shows variation between days  Group by: Hour of day Input is 1-hr aggregations As above, simple arithmetic average, min, max and std dev of data from appropriate 1-hr periods Result shows variation between days What to do if we are missing data? What if we are missing 50% of the values for 8:00-9:00, for example? We may be missing different %s for different hour intervals??

25 Grouped Data – Line Plots, Cont’d Line Plots by Station/Detector  Group by: Day of Week Input must be 1-hr aggregations Aggregate 1-hr data to daily data What if one or more hours in a day is/are missing?? (What if we are missing a peak hour?) Then simple arithmetic avg, min, max, std dev of aggregates Result shows variation within day of week  Group by: Week of Year Similar to day of week Note: “bad data threshold” issue again. Seems reasonable to rescale when going from 20-sec data to 5-min data, and again 5-min to 1-hr seems OK. But neither interpolating or rescaling seem really acceptable in this situation.

26 Aggregation to Day of Week Volume  Sum 1-hr volumes Occupancy  ?? Volume-weighted average of 5-min occupancies Speed  (  1-hr VMTs)/(  1-hr VHTs)  ensure for VMT and VHT - either are both null or neither are null Vehicle Miles Traveled   1-hr VMTs  Rescale??? Vehicle Hours Traveled   1-hr VHTs Travel Time  Volume-weighted average of 5-min travel times Delay  Total Delay: Sum of 5-min delays  Average Delay: Volume-weighted average of 5-min delays

27 WARNING: STOP HERE Following slides are under construction (hard hats required)

28 Grouped Data – Highway Plots Highway Plots  What does this do anyway???  Group by: 5 minute (minute of day) Requires input to be 5-minute aggregate data Need to group 5-minute periods from different days together Use appropriate aggregation-by-time method  Group by: Hour of day Input can be 5-min, 15-min or 1-hr For 1hr input need to group 1-hr periods from different days together  Use appropriate aggregation-by-time method 5-min, 15-min input – Should we restrict this to 1-hr input??  Just use appropriate aggregation-by-time method??

29 Grouped Data – Highway Plots, Cont’d Highway Plots  Group by: Day of Week Input is 5-min, 15-min, 1-hr Need to group 5-min data into 5-min groups, 15- min data into 15-min groups, etc. Use appropriate aggregation by time method??  Group by: Week of Year Same as Day of Week, I think

30 Grouped Data – All Highways All Highways Plots

31 Performance Reports MONTHLY FIGURES Avg TT Avg TS Avg VS FFTT FFTS 95th% TT 95th% TS 166.11 Mins 52.05 MPH 53.79 MPH 145.81 Mins 59.29 MPH 232.16 Mins 37.24 MPH VMT VHT % Cngst Hwy Len Days Used 211,298,304 4,118,289 15.09% 144.09 Miles 31 WEATHER NUMBER OF DAYS Rain ≥ 0.5" Rain ≥ 1.0" Frzn Precip Fog 8 2 0 1 TIME WINDOW FIGURES (primarily for daily report – BI, TTI are new) Avg TT Avg TS 95th% TT 95th% TS 166.11 Mins 52.05 MPH 232.16 Mins 37.24 MPH VMT VHT % Cngst TTI BI 211,298,304 4,118,289 15.09% 1.59 1.97

32 Future Flags Calibration Error (not in this presentation, see paper) Multi-variate consistency  Volume > n max when Occupancy = 0  This should be tested after correcting for calibration error Maximum reduction in speed  Speed n+1 > 0.45*Speed n Identical records  If more than 8 consecutive records (volume, occupancy, and speed) are identical

33 Extra Slides

34 Questions No vehicles present Speed = 0 when Volume = 0Set Speed = null Am not doing this because of the -1, -1, -1, and 0, 0, 0 tests At what point do we call a 5-minute interval invalid?? 5 missing readings? 8? 9? 10? Aggregation over space – we need something for vol and Aggregation over space – need to deal with missing data Need aggregation over time (i.e. combining data from different days together)

35 ODOT SWARM Flags 0 - inhibited 1 - disabled 2 - ok 3 - suspect 4 - softfail 5 - hardfail

36 Data Notes Status 0 Received Data As of Feb 14, 2005, ODOT reports vol -1, occ -1, speed -1, status 0, for detectors with no controller communications (no data received from detector), and vol 0, occ 0, speed 0, status 0 for detectors reporting no traffic data (no activity) (info from David Gaarsoe). See Data Fidelity and minus 1 and minus 999 for more info.Data Fidelityminus 1 and minus 999 Data Cleaning The nightly scripts "clean" the -1 and -999 data as follows:  Sets volume = 0, speed = NULL, occupancy = 0 where volume = -999  Sets volume = NULL, speed = NULL, occupancy = NULL, where volume = -1  I don’t belive we receive -999 data anymore (KT) Unknown  We seem to also get vol = 0, speed = 0, status = 2 (or other status), with occ > 0. What does this mean?  What was that weird thing I noted in the file on badger?

37 Steve H’s notes from emails from ODOT -999 When we received blocks of data from 2001 and 2002, the 20-second data included records where volume=-999. This indicated that no cars passed. As of July 14, 2004 when PORTAL went live, we no longer received volume= - 999. -999 are no longer an issue. -1 From July 14, 2004 to November 17, 2004, we were receiving records with volume=-1. In all cases, when volume=-1, the SWARM algorithm variable status=2, which is the code for "OK." As I understand it, this is supposed to represent a communication error. From November 18, 2004 to February 13, 2005, we did not receive any records where volume = -1. This is due to "Transport ATMS build 2.1, distributed on Nov 17, included many internal changes to traffic sensor and detector station data, (including station level data smoothing). Among other changes it started sending a 0 detector volume during Communications Failure with the Wapiti controllers, as noted by Robert Bertini." -David Gaarsoe During this time period, it is impossible to distinguish between communication errors and periods where no cars pass. Beginning on February 14, 2005, it appears that Transport ATMS build 2.3 was released. "Transport ATMS build 2.3 will... once again report a -1 vol, -1 occupancy, and -1 speed, Inhibited status (0) for Traffic Detectors with no controller communications, and a 0 vol, 0 occupancy, 0 speed, Inhibited status for detectors of reporting controllers with no traffic data (no activity)." -David Gaarsoe It appears that, beginning on February 14, 2005 we are now able to accurately distinguish between communication errors and periods where no vehicles pass a detector. http://wiki.its.pdx.edu/wiki/portal/moin.cgi/ODOT_-1_Email

38 Data Quality Flags - Implementation Queries or scripts analyze the data and set flags when criteria are met We will be able to tell which data values failed which criteria

39 Aggregation over Lanes/Corridors Volume  Over lanes: Sum  (average volume over detector stations doesn’t seem useful) Occupancy  Over lanes: Weighted average of by-lane occupancies by volumes  (average occupancy over detector stations doesn’t seem useful) Speed  Over lanes: Weighted average of by-lane speeds by volumes  Over detector stations: Weighted average of speeds by volumes (?)

40 Aggregation over Space, Cont’d Vehicle Miles Traveled  Sum over lanes or loop detector stations Vehicle Hours Traveled  Sum over lanes or loop detector stations Travel Time  Over lanes: Weighted average of travel time by volumes  Over loop detector stations: Sum of segment-wide travel times Delay  Over lanes: Weighted average of travel time by volumes  Over loop detector stations: Sum of segment-wide travel times


Download ppt "PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte."

Similar presentations


Ads by Google