An overview of people/human tracking methodologies

People/human tracking is a relatively new advancement that is gaining traction in real-time applications such as surveillance, human-robot interaction, driving assistance, remote surgery, and traffic control.
The fundamental goal of this advancement is to transfer human visual-processing ability to computers. Several research efforts have opened new horizons in artificial intelligence by using 2D and 3D visual features of the scene.
This blog post covers general tracking techniques as well as techniques suited to real-time people tracking.

Written for:

Engineers and project managers who are in the process of determining the best People Detection system for their specific product/use case.


Tracking in real time and at the edge is the need of the hour, as most surveillance and analytics cameras need to process data at the edge rather than streaming it to the cloud for processing. At a top level, tracking problems can be classified into two categories: Single Object Tracking and Multi-Object Tracking.
Single object tracking is primarily based on appearance and motion models. Given the object's position in the first frame as input, it tracks the object in subsequent frames. Single object tracking doesn't need consecutive detections to assist the tracker; it only uses the first-frame detection as its input. In contrast, multi-object tracking can work either with or without detection input.

With Detection:

It functions as a detection-based tracking system. The number of objects to track varies depending on the detections, and tracking performance is affected to some degree by detection accuracy.

Without Detection:

It is an extension of single object tracking to multiple objects. The number of objects to track is fixed, and bounding boxes for all objects in the first frame are required as input.
We'll focus on detection-based multi-object tracking in this blog, as real-time person tracking falls under it.
Multi-object tracking is further classified based on the mode of processing:
  • Online tracking
  • Offline tracking
Online tracking relies on present and past data to perform tracking, while offline tracking also relies on data from future frames. We'll focus on online tracking in this blog because we need to track objects in real time.
This blog is aimed at high-speed, real-time, detection-based tracking, which essentially associates detections with trackers.

Tracking Approaches

There are four significant models generally used in high-speed multi-object tracking use cases:
  1. Spatial Model
  2. Appearance Model
  3. Motion Model
  4. Interaction or Social Model
A general tracking algorithm uses at least two of the above models; all four can be combined for better accuracy, depending on available compute. Though there are several approaches under each model, we discuss some basic techniques for each.


Before getting into the details of each model, it helps to learn some basics. Tracking algorithms generally work with tracklets, maintained in a tracker list. The tracker list holds details about the currently active trackers, including the position of each object in each frame. For the first frame, a tracker is initialized for every detection box present; on subsequent frames, each detection is either mapped to an existing tracker in the list or used to create a new tracker, depending on the tracking algorithm. Below, the term "Tracker" refers to an item in this tracker list.
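This tracker-list bookkeeping can be sketched as follows. The `Tracker` class, the `update_trackers` function, and the `match` callback are illustrative names, not any specific library's API; `match` stands in for whatever association rule (spatial, appearance, motion) the tracking algorithm uses.

```python
from dataclasses import dataclass, field

@dataclass
class Tracker:
    tracker_id: int
    box: tuple                                   # (x, y, w, h) of last matched detection
    history: list = field(default_factory=list)  # position per frame

def update_trackers(trackers, detections, match):
    """Map each detection to an existing tracker, or start a new one.

    `match(tracker, detection)` is a placeholder for the association rule.
    """
    next_id = max((t.tracker_id for t in trackers), default=-1) + 1
    for det in detections:
        best = next((t for t in trackers if match(t, det)), None)
        if best is None:                         # unmatched detection -> new tracker
            best = Tracker(next_id, det)
            trackers.append(best)
            next_id += 1
        best.box = det                           # matched -> update position
        best.history.append(det)
    return trackers
```

On the first frame, `trackers` starts empty, so every detection creates a new tracker, matching the initialization step described above.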

Spatial Model

– IoU matching: the IoU score is computed between the current detection and all trackers, and the tracker with the highest IoU score is selected as the best match.

– Distance thresholding: the distance between the current detection box center and each tracker's center is computed; the tracker with the smallest distance is considered the best match.

– Object size: the object's size can also be used to filter among multiple candidates within the threshold limit.
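A rough sketch of the first two spatial techniques combined, with boxes as `(x, y, w, h)` tuples; the threshold values are illustrative and would be tuned per scene:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))   # overlap width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))   # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def center_distance(box_a, box_b):
    """Euclidean distance between box centers."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return ((ax + aw / 2 - bx - bw / 2) ** 2 +
            (ay + ah / 2 - by - bh / 2) ** 2) ** 0.5

def best_spatial_match(detection, trackers, iou_thresh=0.3, dist_thresh=50.0):
    """Prefer the highest-IoU tracker; fall back to nearest center."""
    scored = [(iou(detection, t), center_distance(detection, t), t) for t in trackers]
    by_iou = max(scored, key=lambda s: s[0], default=None)
    if by_iou and by_iou[0] >= iou_thresh:
        return by_iou[2]
    by_dist = min(scored, key=lambda s: s[1], default=None)
    if by_dist and by_dist[1] <= dist_thresh:
        return by_dist[2]
    return None
```

This mirrors the scenario in Figure 1 below: a fast-moving person falls through the IoU check (zero overlap) and is matched by the distance threshold instead.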


Figure 1: Frames 1 and 2 show two persons (P1 and P2) from the top view, moving in opposite directions. Under the spatial model, P1 moves fast, so the IoU overlap is zero and P1 is tracked by distance thresholding alone. P2 has IoU overlap and is tracked using both IoU and the distance threshold.

Appearance Model

Appearance-model tracking is based on the object's visual appearance, i.e., color- and structure-based cues built during the training phase.

Figure 2: Here, the same person is detected in different frames; the appearance model works mainly from the attire color. A limitation of the appearance model is that it also considers the background color: if the detection bounding box is large, the background may influence the decision, which can result in a wrong association of the person.

Some of the basic techniques which are used in real-time tracking are,

1. Correlation: the color data of the detection box is compared with the tracker's past set of data.

2. Histogram of colors: the color histogram of the detection is compared with the tracker's past histograms.

3. HOG (Histogram of oriented gradients): Histogram of oriented gradients can be used for shape comparison. However, performing this technique in real-time with multiple objects is computationally intensive.
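A minimal sketch of the color-histogram comparison (technique 2), assuming detection crops are 3-channel NumPy arrays; the bin count and the histogram-intersection measure are illustrative choices, not a prescribed method:

```python
import numpy as np

def color_histogram(crop, bins=8):
    """Per-channel histogram of a detection crop, L1-normalized."""
    hists = [np.histogram(crop[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(np.float64)
    return h / h.sum()

def histogram_similarity(h1, h2):
    """Histogram intersection: 1.0 for identical distributions, 0.0 for disjoint."""
    return float(np.minimum(h1, h2).sum())
```

In use, each tracker would store the histogram of its last matched crop, and a new detection would be accepted as a match when the similarity exceeds a tuned threshold (say, 0.7).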

Though the first two approaches are not computationally heavy, comparing every tracker against n detections is not optimal, so it is always good to get a candidate subset from the spatial model before performing the appearance comparison. This minimizes the time spent on the appearance model, but its accuracy then depends on the spatial model's accuracy.
In some use cases, the appearance model may not help due to appearance invariance or low variance across objects. For example, in Figure 1 we target a person's head from the top view, which might look similar for different persons. So, while choosing an appearance model, we need to make sure the objects are differentiable at least in the majority of cases, if not all.

Motion Model

The motion model estimates the object's future position based on its previous positions.

1. Moving average to predict the next position: compute the average x and y velocity over a period and predict the next position from that average. We can also use a weighted average, giving more weight to recent data than to old data. Tracking people especially calls for a weighted approach, as the movement (velocity and acceleration) can vary frequently.

2. Kalman filter to predict the next position: estimate velocity and acceleration from historical data. The Kalman filter is good at removing noisy data and giving an optimal estimate. More specifically, it filters out Gaussian noise, but noise in a real-world environment is unpredictable, which is the limitation of the Kalman approach.
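The weighted moving-average predictor from technique 1 can be sketched as follows; the exponential decay factor is an illustrative choice:

```python
def predict_next(positions, decay=0.5):
    """Predict the next (x, y) center from a list of past centers, oldest first.

    Recent frame-to-frame velocities get exponentially more weight.
    """
    if len(positions) < 2:
        return positions[-1] if positions else None
    # Frame-to-frame velocities.
    vels = [(b[0] - a[0], b[1] - a[1])
            for a, b in zip(positions, positions[1:])]
    # Exponential weights: newest velocity gets weight 1, older ones decay.
    weights = [decay ** (len(vels) - 1 - i) for i in range(len(vels))]
    total = sum(weights)
    vx = sum(w * v[0] for w, v in zip(weights, vels)) / total
    vy = sum(w * v[1] for w, v in zip(weights, vels)) / total
    x, y = positions[-1]
    return (x + vx, y + vy)
```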

These models are lightweight enough to run in real time without any change in throughput.
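Technique 2 above can be sketched as a minimal constant-velocity Kalman filter for a single tracked center point. The noise covariances below are illustrative guesses that would be tuned per camera and scene, not recommended values:

```python
import numpy as np

class KalmanPoint:
    """Constant-velocity Kalman filter over state (x, y, vx, vy)."""

    def __init__(self, x, y, dt=1.0):
        self.x = np.array([x, y, 0.0, 0.0])      # initial state
        self.P = np.eye(4) * 10.0                # state covariance
        self.F = np.array([[1, 0, dt, 0],        # constant-velocity motion model
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], float)
        self.H = np.array([[1, 0, 0, 0],         # we observe only (x, y)
                           [0, 1, 0, 0]], float)
        self.Q = np.eye(4) * 0.01                # process noise
        self.R = np.eye(2) * 1.0                 # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                        # predicted center

    def update(self, zx, zy):
        z = np.array([zx, zy])
        y = z - self.H @ self.x                  # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S) # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```

Each frame, the tracker calls `predict()` to get the expected position (used for association), then `update()` with the matched detection's center.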

Interaction or Social Model

The interaction or social model deals with the influence of one object on another. A person walking on the street, for instance, adjusts speed or direction based on other persons or objects to avoid a collision; the interaction model uses these cues to improve tracking. It is helpful in crowded scenarios, since it accounts for multiple objects' influences, and especially for tracking objects through temporary occlusion. Note that the occluding object must itself be tracked for the social model to apply: if you are tracking people and the occlusion comes from a vehicle or some other untracked object, the social model does not help.


We have discussed some basic tracking techniques that are widely used for object tracking. Tracking objects in real time is complex, as it requires a high-speed algorithm, which limits the options. Tracking objects like people is even more strenuous, as their movement is unpredictable.

Is it difficult to integrate people tracking, detection, and recognition features into your products? Contact us for more information.

VisAI Platform – Human/People tracking and detection Solution

VisAI Labs understands the fundamental necessity of edge-optimized human/people tracking and detection algorithms that satisfy a variety of use-cases and are pre-tested in different testing scenarios so that you can focus on the development of your product/solution.

VisAI Labs Human/People tracking and Detection Algorithms are built for People Counting, Congestion Detection, Intrusion Detection, Social Distance Checker, Dwell Time Detection, and Building Tracking.

Feel free to reach us at
