Evaluating multiple object tracking accuracy and performance metrics in a real-time setting



Tracking multiple objects in a real-world environment is a hot research topic, and various techniques built on different features and algorithms have been proposed. Applications of these object tracking techniques range from video surveillance to intelligent interactive environments.
This blog post compiles the commonly used object tracking accuracy metrics, such as MOTA (multiple object tracking accuracy) and MOTP (multiple object tracking precision). We also cover other tracking metrics that are commonly used to compare different tracking algorithms.

Written for:

Engineers and project managers who need to understand and assess the best people/human detection method for a specific application and use case.


Tracking multiple objects (persons, dogs, cats, etc.) is about identifying who or what is present across a series of video frames over a period of time. Let’s say you need to monitor a place that people are prohibited from entering, or you own a shop and want to know the footfall. You cannot keep an eye on everything all of the time, so you’ll need an intelligent system that keeps track of things in the real world without losing them. That’s where tracking algorithms come into play.

Rise of multiple objects tracking in a real-time setting

The basic idea behind multiple object tracking is this: a detection model (deep learning or a traditional approach) finds and classifies all the objects present in a frame, and a tracking algorithm then uses the information from the detection model to do the tracking. The tracking algorithm assigns a unique tracking ID to every object present in the frame and tries to maintain the same ID in subsequent frames by finding correspondences between frames.
With the AI and computer vision community growing day by day, tracking multiple objects is an active research field with applications in many domains, from video surveillance and automatic indexing to intelligent interactive environments. As a result of rapid research and advancements, there are many different algorithms for multiple object tracking, a number of which perform very well across varied environments and lighting conditions.
Yes, we have incredible algorithms to implement and build some excellent products! But wait. How do we select the right algorithm for our application? How do we measure these algorithms? What is the scale for measuring tracking accuracy or detection accuracy?
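To make the ID assignment idea concrete, here is a minimal sketch. It assumes boxes are given as (x1, y1, x2, y2) tuples and uses a simple greedy overlap match between the previous frame and the current detections. The function names (iou, assign_ids) and the min_iou threshold are hypothetical choices for this illustration, and real trackers typically use more robust association than this.

```python
from itertools import count


def iou(a, b):
    """Overlap ratio of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign_ids(prev_tracks, detections, id_source, min_iou=0.3):
    """Greedy association (an assumption for this sketch, not a specific tracker).

    prev_tracks: dict of track ID -> box from the previous frame.
    detections:  list of boxes found in the current frame.
    Returns a dict of track ID -> box for the current frame.
    """
    # All (overlap, track ID, detection index) candidates, best overlap first.
    candidates = sorted(
        ((iou(box, det), tid, j)
         for tid, box in prev_tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    tracks, used_ids, used_dets = {}, set(), set()
    for overlap, tid, j in candidates:
        if overlap < min_iou:
            break  # remaining candidates overlap even less
        if tid in used_ids or j in used_dets:
            continue
        tracks[tid] = detections[j]  # keep the same ID as in the previous frame
        used_ids.add(tid)
        used_dets.add(j)
    # Detections that matched nothing start a new track with a fresh ID.
    for j, det in enumerate(detections):
        if j not in used_dets:
            tracks[next(id_source)] = det
    return tracks


id_source = count(1)
frame1 = assign_ids({}, [(0, 0, 10, 10), (50, 50, 60, 60)], id_source)
frame2 = assign_ids(frame1, [(51, 50, 61, 60), (1, 0, 11, 10)], id_source)
print(frame1)
print(frame2)
```

In the second frame the detections arrive in a different order, but each object keeps the ID it received in the first frame because it overlaps most with its own previous box.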

MOT Metrics

MOT metrics are used to evaluate the accuracy of tracking algorithms. There are two primary metrics that experts consider while evaluating tracking algorithms:

1. MOTP (Multiple Object Tracking Precision):

It measures how accurately the detection boxes are localized. It is loosely similar to the mAP metric.

[Figure: ground truth bounding boxes]
[Figure: model output bounding boxes]
In the above two pictures, the detection box coordinates are not the same. You can see slight changes in the coordinates, right? MOTP is used to find how well the model localizes the object with respect to the ground truth. It depends only on the detection output of the model, not on the tracker output.

2. MOTA (Multiple Object Tracking Accuracy):

It measures the overall accuracy of both the tracker and detection. It deals with both tracker output and detection output.


Before getting into the details of these two metrics, let’s see how they are computed in practice. As per CLEAR MOT 2008, three types of errors should be taken into consideration for tracker evaluation.
  1. Miss detection: an object is present in the ground truth but is not detected by the detection algorithm.
  2. False positive: an object is not present in the ground truth, but the detection algorithm reports it as an object (false detection).
  3. Mismatch error: an object in the ground truth is wrongly associated with some other object due to false tracking.
In the image above, the objects are mismatched with one another.
These errors are calculated for every single frame (t) in the series of frames. The cumulative sum of the error values is used to calculate the final MOT value.


To compute the MOT values, we need to have a standard approach to find the correspondence between Ground Truth (GT) and Prediction.
There are two commonly used distances out there,

1. Intersection over Union (IOU)

Intersection over Union (IOU) is a metric for evaluating an object detector’s accuracy. It is the ratio of the overlap area of two bounding boxes (the ground truth box and the predicted box) to the area of their union.
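As a quick illustration, IOU can be sketched in a few lines of Python. This assumes axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates; that box format and the example coordinates are assumptions made for this sketch.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Width and height of the overlapping rectangle (0 if the boxes do not overlap).
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


ground_truth = (10, 10, 50, 50)
prediction = (20, 20, 60, 60)
# Overlap is 30 x 30 = 900, union is 1600 + 1600 - 900 = 2300.
print(iou(ground_truth, prediction))
```

Identical boxes give an IOU of 1, and boxes that do not overlap at all give 0.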


2. Squared Euclidean distance:

Euclidean pixel distance is the length of the line segment between two points in pixel coordinates. Here, it is calculated between the center point of the ground truth box and the center point of the predicted box.

The Squared Euclidean pixel distance between two points (x1, y1) and (x2, y2) is,

d = (x_1 - x_2)^2 + (y_1 - y_2)^2
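A small sketch of this distance in Python, assuming boxes are given as (x1, y1, x2, y2) corners so that the center has to be computed first. The helper names are hypothetical.

```python
def center(box):
    """Center point of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def squared_center_distance(box_a, box_b):
    """Squared Euclidean pixel distance between the centers of two boxes."""
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    return (ax - bx) ** 2 + (ay - by) ** 2


ground_truth = (10, 10, 50, 50)  # center at (30, 30)
prediction = (20, 20, 60, 60)    # center at (40, 40)
d = squared_center_distance(ground_truth, prediction)
print(d)  # 10^2 + 10^2 = 200.0
```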

Note: The most preferred method for calculating correspondence is the IOU method.

1. A correspondence is made between ground truth objects and predicted objects to establish which pairs are correct matches.

2. If a ground truth object has NO MATCH in the detection output, the count of MISSES (FNt) is incremented by 1.

3. If an object in the detection output has NO MATCH in the real world (ground truth), it is considered a false detection. In that case, the count of FALSE POSITIVES (FPt) is incremented by 1.

Consider Frame 1, where a correspondence is made between ground truth person P1 and detection output person PD3; the pair is saved as (1, 3). If, in the next frame, ground truth person P1 is paired with a different person in the detection output (PD4), then the MISMATCH ERROR count (IDSt) is incremented by 1. The pair (1, 3) is deleted, and the pair (1, 4) is used in future frames.

4. As mentioned earlier, these errors will be calculated for a series of frames. In the end, we calculate MOTA with the following formula.
MOTA = 1 - \frac{\sum_t (FN_t + FP_t + IDS_t)}{\sum_t GT_t}

FN – false negatives / MISSES
FP – FALSE POSITIVES
IDS – ID switches / MISMATCH ERRORS
GT – ground truth object count
5. MOTP is calculated with the following formula.

MOTP = \frac{\sum_t d_t}{\sum_t c_t}

dt – distance between the localization of objects in the ground truth and the detection output, summed over the matches in frame t

ct – total matches made between the ground truth and the detection output in frame t
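The steps above can be sketched as a small accumulator. This is a minimal illustration, not a reference implementation: it assumes the correspondence step has already been done, so each frame comes with its ground truth IDs, predicted IDs and a list of (ground truth ID, predicted ID, distance) matches. The function name clear_mot and the input layout are hypothetical.

```python
def clear_mot(frames):
    """Accumulate misses, false positives and mismatches over frames.

    Each frame is a dict with (hypothetical) keys:
      "gt"      - set of ground truth object IDs present in the frame
      "hyp"     - set of predicted object IDs present in the frame
      "matches" - list of (gt_id, hyp_id, distance) correspondences
    Returns (mota, motp).
    """
    fn = fp = ids = gt_total = match_total = 0
    distance_total = 0.0
    last_pair = {}  # gt_id -> hyp_id it was most recently paired with

    for frame in frames:
        matched_gt = {g for g, _, _ in frame["matches"]}
        matched_hyp = {h for _, h, _ in frame["matches"]}
        fn += len(frame["gt"] - matched_gt)    # ground truth with no match
        fp += len(frame["hyp"] - matched_hyp)  # predictions with no match
        for g, h, d in frame["matches"]:
            if g in last_pair and last_pair[g] != h:
                ids += 1                       # paired with a different ID
            last_pair[g] = h                   # old pair is replaced
            distance_total += d
            match_total += 1
        gt_total += len(frame["gt"])

    mota = 1.0 - (fn + fp + ids) / gt_total
    motp = distance_total / match_total
    return mota, motp


frames = [
    {"gt": {1, 2}, "hyp": {3, 5},
     "matches": [(1, 3, 0.2), (2, 5, 0.1)]},
    {"gt": {1, 2}, "hyp": {4, 5, 6},
     "matches": [(1, 4, 0.3), (2, 5, 0.2)]},  # P1 switches from 3 to 4
]
mota, motp = clear_mot(frames)
print(mota, motp)
```

In this made-up example there are no misses, one false positive (prediction 6) and one mismatch (P1 moves from 3 to 4) across four ground truth objects, so MOTA is 1 - 2/4 = 0.5, and MOTP is the average of the four match distances.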

MOTA Calculation for a Sample Frame

To calculate these two metrics, we need two sets of data: a ground truth file containing the actual object positions and object IDs over a series of frames, and a file containing the detection and tracker output over the same series of frames.
[Figure: ground truth detection]
[Figure: predicted detection]
From the above pictures, we can count the misses, false positives and mismatch errors. Here there is 1 miss, 1 false positive and 1 mismatch error for 4 ground truth objects, so:

MOTA = 1 - ((1 + 1 + 1) / 4) = 0.25

NOTE: This is illustrated for a single frame for understanding. In practice, misses, false positives and mismatch errors are counted for every frame of the video, and MOTA is calculated from the totals.
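The same arithmetic can be checked with a short Python sketch. The per-frame lists and the three-frame example are made-up values used only to show how the counts accumulate over a video.

```python
def mota(fn_per_frame, fp_per_frame, ids_per_frame, gt_per_frame):
    """MOTA from per-frame counts of misses, false positives,
    mismatch errors and ground truth objects."""
    errors = sum(fn_per_frame) + sum(fp_per_frame) + sum(ids_per_frame)
    return 1.0 - errors / sum(gt_per_frame)


# The single sample frame: 1 miss, 1 false positive, 1 mismatch, 4 objects.
single_frame = mota([1], [1], [1], [4])
print(single_frame)  # 0.25

# A hypothetical three-frame video: 4 errors in total over 12 objects.
video = mota([1, 0, 0], [1, 0, 1], [1, 0, 0], [4, 4, 4])
print(video)
```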

MOTP Calculation for a Sample Frame

Let’s consider the IOU method to find the correspondence.
In this frame, the calculated IOU between Ground truth and the Predicted bounding box is 0.78.

Distance, dt = 1 – IOU = 0.22
Total match, ct = 1
MOTP = dt/ct = 0.22 / 1 = 0.22.

NOTE: This is illustrated for a single frame for understanding. In practice, distances and total matches are accumulated for every frame of the video, and MOTP is calculated from the totals.
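Here is the same calculation as a minimal Python sketch. It assumes the distance for each match is 1 - IOU, as in the sample above; the input layout (one list of match distances per frame) is a choice made for this illustration.

```python
def motp(distances_per_frame):
    """MOTP from per-frame lists of match distances (here, 1 - IOU)."""
    total_distance = sum(sum(frame) for frame in distances_per_frame)
    total_matches = sum(len(frame) for frame in distances_per_frame)
    return total_distance / total_matches


# The single sample frame: one match with IOU 0.78.
single_frame = motp([[1 - 0.78]])
print(round(single_frame, 2))  # 0.22

# A hypothetical two-frame video with three matches in total.
video = motp([[1 - 0.78, 1 - 0.90], [1 - 0.60]])
print(round(video, 2))
```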

What do these values signify?

1. With distance defined as 1 - IOU, MOTP ranges from 0 to 1 (multiply by 100 if you need MOTP as a percentage).
2. If the MOTP value is close to 1, the precision of the system is poor. If it is close to zero, the precision of the system is good.
3. For example, let’s say the IOU of the predicted and ground truth detection boxes is 1. Then the distance (1 - IOU) is 0, so the MOTP value is zero, which indicates a good detection system. Similarly, if the IOU is 0, MOTP becomes 1.
4. MOTA ranges from -inf to 1 (multiply by 100 if you need MOTA as a percentage).
5. If MOTA is close to 1, the accuracy of the system is good. If MOTA is around zero or less than zero, the system’s accuracy is poor.
6. For example, if there are no errors in the system, the system’s accuracy (MOTA = 1 - total error ratio) is 1 - 0 = 1.
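MOTA can go below zero because the total number of errors can exceed the number of ground truth objects. The short sketch below uses made-up counts (misses, false positives, mismatches, ground truth objects) to show the three regimes.

```python
# (misses, false positives, mismatches, ground truth objects), all hypothetical
cases = {
    "no errors": (0, 0, 0, 4),
    "as many errors as objects": (2, 1, 1, 4),
    "more errors than objects": (2, 3, 1, 4),
}

scores = {
    name: 1.0 - (fn + fp + ids) / gt
    for name, (fn, fp, ids, gt) in cases.items()
}

for name, score in scores.items():
    print(f"{name}: MOTA = {score}")
```

The first case gives 1, the second gives 0, and the third gives 1 - 6/4 = -0.5, which is why a tracker that produces many false positives can end up with a negative MOTA.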

Other Metrics

MOTA gives an overall tracking score, but it does not reveal the finer details of tracking accuracy. Other tracking metrics are available for that. Some of them are listed here with a brief description.

1. Mostly Tracked – the number of objects tracked for at least 80 percent of their lifespan.

2. Partially Tracked – the number of objects tracked for between 20 and 80 percent of their lifespan.

3. Mostly Lost – the number of objects tracked for less than 20 percent of their lifespan.
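These three categories can be sketched as a simple classification over per-object tracking ratios. The input format (frames tracked, frames in lifespan) and the example objects are assumptions for this illustration.

```python
def classify(tracked_frames, lifespan_frames):
    """Label one object by the share of its lifespan in which it was tracked."""
    ratio = tracked_frames / lifespan_frames
    if ratio >= 0.8:
        return "mostly tracked"
    if ratio < 0.2:
        return "mostly lost"
    return "partially tracked"


def summarize(objects):
    """Count objects per category.

    objects: dict of object ID -> (tracked_frames, lifespan_frames).
    """
    counts = {"mostly tracked": 0, "partially tracked": 0, "mostly lost": 0}
    for tracked, lifespan in objects.values():
        counts[classify(tracked, lifespan)] += 1
    return counts


# Hypothetical objects, each with a lifespan of 100 frames.
objects = {"P1": (95, 100), "P2": (50, 100), "P3": (10, 100), "P4": (80, 100)}
summary = summarize(objects)
print(summary)
```

Here P1 and P4 are mostly tracked, P2 is partially tracked and P3 is mostly lost.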


Today, object tracking is considered a mature field of study, but there is a concerning lack of consistency in how the community reports its findings. In the coming days, we can increase target detection accuracy by working with datasets built around different use cases.

Enable vTrack For Human/People and Object Detection.

VisAI Labs understands the need for edge-optimized human/people detection algorithms that satisfy a wide range of use cases and come pre-tested across various testing scenarios, so that you can concentrate on building your product or solution.
VisAI Labs human/people detection algorithms are built for people counting, congestion detection, dwell time detection, intrusion detection, social distancing checks, and tracking across buildings.

To know more, feel free to reach us at sales@visailabs.com
