Most existing multiple object tracking (MOT) methods that rely solely on appearance features struggle to track highly deformable objects. Other MOT methods that use motion cues to associate identities across frames have difficulty handling egocentric videos effectively or efficiently. In this work, we propose DETracker, a new MOT method that jointly detects and tracks objects to obtain high-quality localization of deformable objects in egocentric videos. DETracker uses three novel modules, namely the motion disentanglement network (MDN), the patch association network (PAN), and the patch memory network (PMN), to explicitly tackle the difficulties caused by severe ego motion and fast-morphing target objects. DETracker is end-to-end trainable and achieves near real-time speed. We also present DogThruGlasses, a large-scale deformable multi-object tracking dataset with 150 videos and 73K annotated frames, collected by smart glasses. DETracker outperforms existing state-of-the-art methods on the DogThruGlasses dataset and the widely used YouTube-Hand dataset.
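The abstract above does not spell out how the three modules interact, so the sketch below is only a rough, hypothetical illustration of how an MDN, PAN, and PMN might compose in a joint detect-and-track forward pass. All class names, interfaces, tensor shapes, and the EMA-style memory update are assumptions made for illustration; they are not the authors' released implementation.

```python
# Hypothetical sketch only: module interfaces and shapes are assumptions,
# not the DETracker paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionDisentanglementNetwork(nn.Module):
    """Assumed role: separate object motion from camera (ego) motion
    between consecutive frame features."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, feat_prev, feat_curr):
        # Predict an object-motion residual after (assumed) ego-motion compensation.
        return self.net(torch.cat([feat_prev, feat_curr], dim=-1))


class PatchAssociationNetwork(nn.Module):
    """Assumed role: match patch-level features of existing tracks
    against current-frame detections."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, track_patches, det_patches):
        # Cosine-style affinity between track and detection patch features.
        a = F.normalize(self.proj(track_patches), dim=-1)
        b = F.normalize(self.proj(det_patches), dim=-1)
        return a @ b.transpose(-1, -2)  # (num_tracks, num_dets)


class PatchMemoryNetwork(nn.Module):
    """Assumed role: maintain per-track patch memory so gradual appearance
    changes of deformable objects are absorbed over time."""
    def __init__(self, momentum=0.9):
        super().__init__()
        self.momentum = momentum

    def forward(self, memory, new_patches):
        # Simple exponential moving average as a stand-in memory update.
        return self.momentum * memory + (1.0 - self.momentum) * new_patches


class DETrackerSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.mdn = MotionDisentanglementNetwork(dim)
        self.pan = PatchAssociationNetwork(dim)
        self.pmn = PatchMemoryNetwork()

    def forward(self, feat_prev, feat_curr, memory):
        motion = self.mdn(feat_prev, feat_curr)          # object motion, ego motion removed
        affinity = self.pan(memory, feat_curr + motion)  # patch-level association scores
        memory = self.pmn(memory, feat_curr)             # refresh per-track patch memory
        return affinity, memory


if __name__ == "__main__":
    dim, n = 256, 5
    model = DETrackerSketch(dim)
    f_prev, f_curr, mem = (torch.randn(n, dim) for _ in range(3))
    aff, mem = model(f_prev, f_curr, mem)
    print(aff.shape)  # torch.Size([5, 5])
```

In this reading, the affinity matrix would feed a standard assignment step (e.g., Hungarian matching) to link detections to tracks; again, whether DETracker does this is an assumption here.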
Comparison with SOTAs.
@inproceedings{huang_etal_cvpr23,
author = {Mingzhen Huang and Xiaoxing Li and Jun Hu and Honghong Peng and Siwei Lyu},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
title = {Tracking Multiple Deformable Objects in Egocentric Videos},
address = {Vancouver, Canada},
year = {2023},
}