Evaluation of UMD Object Tracking in Video
University of Maryland
VACE Phase I Evaluations
Multiple teams presented algorithms for various analysis tasks:
- Text detection and tracking
- Face detection and tracking
- People tracking
Evaluation was handled by UMD/LAMP and PSU:
- Penn State devised metrics and ran evaluations.
- UMD generated ground truth and implemented metrics.
- ViPER was adapted for the new evaluations.
Penn State Developed Metrics
Evaluations should provide a comprehensive, multifaceted view of the challenges of detection and tracking.
Tracking methodologies developed:
- Pixel-level frame analysis
- Object-level aggregation
PSU Frame Evaluations
Look at the results for each frame, one at a time.
For each frame, apply a set of evaluation metrics independent of the "identity" of each object (i.e., find the best match). These include:
- Object count precision and recall.
- Pixel precision and recall over all objects in the frame.
- Individual object pixel precision and recall measures.
(See the sketch below.)
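A minimal sketch of the per-frame measures, assuming ground truth and results for a frame are available as binary pixel masks (NumPy arrays) and object counts; the function names are illustrative, not part of ViPER.

```python
import numpy as np

def frame_pixel_precision_recall(gt_mask: np.ndarray, res_mask: np.ndarray):
    """Pixel precision/recall for one frame, ignoring object identity."""
    tp = np.logical_and(gt_mask, res_mask).sum()   # correctly detected pixels
    res_total = res_mask.sum()                     # all pixels the system reported
    gt_total = gt_mask.sum()                       # all ground-truth pixels
    precision = tp / res_total if res_total else 1.0
    recall = tp / gt_total if gt_total else 1.0
    return precision, recall

def frame_count_precision_recall(n_gt_objects: int, n_res_objects: int):
    """Object-count precision/recall for one frame (identity-agnostic)."""
    matched = min(n_gt_objects, n_res_objects)     # best case: every object matched
    precision = matched / n_res_objects if n_res_objects else 1.0
    recall = matched / n_gt_objects if n_gt_objects else 1.0
    return precision, recall
```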
PSU Frame Evaluation
PSU Object Aggregation for Tracking
Assume object matching has already been done (first-frame correspondence).
For the life of the object, aggregate some set of metrics:
- A set of distances for each frame.
- Average over the life of the object, etc.
(A sketch of this aggregation follows below.)
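A minimal sketch of the aggregation step, assuming tracks are dictionaries keyed by frame number and that the truth/result correspondence is already known; per_frame_distance is a hypothetical per-frame distance function, not one defined in the slides.

```python
def aggregate_over_life(gt_track, res_track, per_frame_distance):
    """Average a per-frame distance over the frames where both tracks exist."""
    common_frames = sorted(set(gt_track) & set(res_track))
    if not common_frames:
        return None  # the tracks never coexist, so no aggregate distance
    distances = [per_frame_distance(gt_track[f], res_track[f])
                 for f in common_frames]
    return sum(distances) / len(distances)
```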
But…
- Frame metrics throw away tracking information, since there is no frame-to-frame correspondence. They do not require unique labeling of objects to track, and confusion can occur when multiple objects are in the frame. Most participants did multi-frame detection, not tracking.
- Aggregated tracking metrics require a known matching. Even with the known matching, they do not handle tracking adequately, including things like confusion and occlusion.
- In both cases, the metrics simply sum over all frames; no unified metric across time and space exists.
UMD Maximal Optimal Matching
Compute a score for each possible object match, then find the optimal correspondence.
- One-to-one match: for each ground-truth object, find the result object such that the total cost over all possible correspondences is minimized.
- Multiple match: for each disjoint subset of ground-truth objects, find the disjoint subset of output objects that minimizes the total cost.
Compute the overall precision and recall. For S = size of the matching:
- Precision = S / size(candidates)
- Recall = S / size(targets)
The maximal one-to-one matching is currently found using the Hungarian algorithm, which has a running time of O(n^3). (See the sketch below.)
For the multiple matching, the algorithm uses heuristics, including assumptions about monotonicity and the triangle inequality, that make it work reasonably well for most data sets. (Basically, it takes the best one-to-one match and looks for lost objects, adding them one at a time; it tends to end up with a few clumps.)
Multiple matching has a similar formulation for precision and recall, but uses S_candidates and S_targets.
The user can specify the parameters for the metrics to be used.
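A minimal sketch of the one-to-one case, not the ViPER implementation itself: it assumes each target/candidate pair already has a distance (higher means a worse match) and uses SciPy's linear_sum_assignment, which implements the Hungarian algorithm. The max_distance cutoff is an illustrative parameter, not one from the slides.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def optimal_one_to_one(distance_matrix: np.ndarray, max_distance: float = 1.0):
    """Return matched (target, candidate) pairs plus overall precision and recall.

    Rows of distance_matrix are ground-truth targets, columns are result candidates.
    """
    rows, cols = linear_sum_assignment(distance_matrix)   # minimizes total cost
    # keep only pairs that are real matches, not "forced" pairings
    pairs = [(r, c) for r, c in zip(rows, cols)
             if distance_matrix[r, c] < max_distance]
    s = len(pairs)                                         # S = size of the matching
    n_targets, n_candidates = distance_matrix.shape
    precision = s / n_candidates if n_candidates else 1.0  # S / size(candidates)
    recall = s / n_targets if n_targets else 1.0           # S / size(targets)
    return pairs, precision, recall
```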
Maximal Optimal Matching Advantages
- Takes into account both space and time.
- Can be generalized to make no assumptions about space and time.
- Optimal one-to-one matching has many nice properties.
- Can handle many-to-many matching.
- By pruning the data to compute only on sequences that overlap in time, matching can be made tractable (sketched below).
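A minimal sketch of the pruning step, under the assumption that each track's temporal extent is known as a (first_frame, last_frame) pair: only temporally overlapping truth/result pairs become candidates for matching, which keeps the cost matrix small.

```python
def temporally_overlapping_pairs(gt_spans, res_spans):
    """Yield (truth_index, result_index) pairs whose frame spans intersect."""
    for i, (g0, g1) in enumerate(gt_spans):
        for j, (r0, r1) in enumerate(res_spans):
            if g0 <= r1 and r0 <= g1:      # the spans share at least one frame
                yield i, j
```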
Object Matching
Object Matching
[Figure: bipartite matching between truth data and result data, with candidate match scores of .45, .9, .57, and .6.]
Experimental Results
We reran the tracking experiments using the…
- Add description of data.
- Add description of algorithms used for static and moving camera.
- Show graphs for our results vs. PSU's.
Example: Tracking Text: Frame
Example: Tracking Text: Tracking
There are three metrics for tracking: size, position, and angularity. Since text is given in bounding boxes, there is no angularity measure for box tracking, and the size metric failed for this example. There are better examples of these on slide 16.
Example: Tracking Text: Object
Example: Person Tracking: Frame
No tracking metrics could be generated, as Marti did not use the first-frame data.
Example: Person Tracking: Object
The bottom graph would be more informative with several more lines. It shows the distance of all matches, sorted by distance. The curve is a good visual way of showing how three or more algorithms perform on the same set of data. However, it throws out the information about which object was matched, so a scatter plot is a better representation for one or two evaluations.
Claims
- The metrics provide for true tracking evaluation (not just aggregated detection).
- Tolerances can still be set on various components of the distance measure.
- The approach provides a single point of comparison.
Fin
Dr. David Doermann, Dr. Rangachar Kasturi, David Mihalcik, Ilya Makedon, JinHyeong Park, Felix Suhenko, and many others
Tracking Graphs
Object Level Matching
Most obvious solution: many-to-many matching. Allows matching on any data type, at a price.
Pixel-Frame-Box Metrics
Look at each frame and ask a specific question about its contents:
- Number of pixels correctly matched.
- Number of boxes that have some overlap, or overlap greater than some threshold.
- How many boxes overlap a given box? (Fragmentation)
Look at all frames and ask a question:
- Number of frames correctly detected.
- Proper number of objects counted.
(See the box-overlap sketch below.)
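A minimal sketch of the box-level frame questions, assuming axis-aligned boxes given as (x0, y0, x1, y1) tuples; the helper names are illustrative.

```python
def box_overlap_area(a, b):
    """Area of intersection of two axis-aligned boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0

def boxes_detected(gt_boxes, res_boxes, min_overlap=0.0):
    """Count ground-truth boxes overlapped by some result box above a threshold."""
    return sum(1 for g in gt_boxes
               if any(box_overlap_area(g, r) > min_overlap for r in res_boxes))

def fragmentation(gt_box, res_boxes):
    """How many result boxes overlap a given ground-truth box."""
    return sum(1 for r in res_boxes if box_overlap_area(gt_box, r) > 0)
```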
Individual Box Tracking Metrics
Mostly useful for the retrieval problem, this solution looks at pairs of a ground-truth box and a result box. The metrics are:
- Position
- Size
- Orientation
(Sketched below.)
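A minimal sketch of the three per-box distances, assuming a box is represented as (center_x, center_y, width, height, angle_in_degrees); these are not the exact formulas used in the evaluation, just one plausible form.

```python
import math

def position_distance(gt, res):
    """Euclidean distance between the box centers."""
    return math.hypot(gt[0] - res[0], gt[1] - res[1])

def size_distance(gt, res):
    """Relative difference in box area."""
    gt_area, res_area = gt[2] * gt[3], res[2] * res[3]
    return abs(gt_area - res_area) / gt_area if gt_area else 0.0

def orientation_distance(gt, res):
    """Smallest angular difference in degrees (undefined for plain bounding boxes)."""
    d = abs(gt[4] - res[4]) % 180.0
    return min(d, 180.0 - d)
```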
Questions: Ignoring Ground Truth
Assume the evaluation routine is given a set of objects to ignore (or rules for determining what type of object to ignore). How does this affect the output?
- For pixel measures, simply don't count pixels in ignored regions. This works for tracking and frame evaluations.
- For object matches, do the complete match; when finished, ignore result data that matches ignored truth.
For example, we may only want to evaluate text that has a chance of being OCRed correctly, while not punishing detection of illegible text.
(A sketch of both cases follows below.)
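A minimal sketch of both cases described above, assuming masks are NumPy boolean arrays and that the object match is a list of (truth_id, result_id) pairs; the names are illustrative.

```python
import numpy as np

def pixel_counts_with_ignore(gt_mask, res_mask, ignore_mask):
    """True positives and totals, not counting pixels inside ignored regions."""
    keep = ~ignore_mask                                   # pixels that still count
    tp = np.logical_and(gt_mask, res_mask)[keep].sum()
    return tp, res_mask[keep].sum(), gt_mask[keep].sum()

def drop_ignored_matches(pairs, ignored_truth_ids):
    """After the complete match, discard result objects matched to ignored truth."""
    return [(t, r) for t, r in pairs if t not in ignored_truth_ids]
```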
Questions: Presenting the Results
Some basic graphs are built in:
- Line graphs for individual metrics
- Bar charts showing several metrics
For custom graphs, you have to do it yourself:
- ROC curves
- Scatter plots
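A minimal sketch of two of the custom graphs discussed in the examples (the sorted-distance curve and a scatter plot of match distances), using matplotlib; the function names are illustrative.

```python
import matplotlib.pyplot as plt

def sorted_distance_curve(runs: dict):
    """Plot each run's match distances sorted from best to worst.

    Useful for comparing three or more algorithms on the same data.
    """
    for name, distances in runs.items():
        plt.plot(sorted(distances), label=name)
    plt.xlabel("match rank")
    plt.ylabel("distance")
    plt.legend()
    plt.show()

def distance_scatter(distances):
    """Scatter of match distances, keeping the per-object association visible."""
    plt.scatter(range(len(distances)), distances)
    plt.xlabel("object index")
    plt.ylabel("distance")
    plt.show()
```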