Actor-identified Spatiotemporal Action Detection — Detecting Who Is Doing What in Videos

Fan Yang


The success of deep learning in video Action Recognition (AR) has motivated researchers to progressively move the AR task from the coarse level to the fine-grained level. Whereas conventional AR predicts only a single action label for an entire video, Temporal Action Detection (TAD) estimates the start and end time of each action in a video. Taking TAD a step further, Spatiotemporal Action Detection (SAD) localizes each action both spatially and temporally. However, SAD generally ignores who performs the action, even though identifying the actor can also be important. To this end, we propose a novel task, Actor-identified Spatiotemporal Action Detection (ASAD), to bridge the gap between existing SAD studies and the new demand for identifying actors. In ASAD, we not only detect the spatiotemporal boundary of each instance-level action but also assign a unique ID to each actor. Multiple Object Tracking (MOT) and Action Classification (AC) are the two fundamental elements of our approach to ASAD. Based on motion and appearance similarity, MOT associates the cross-frame detections of an actor into a tracklet. Each tracklet corresponds to exactly one actor, which is how the actor is identified. Given a tracklet, AC estimates the action labels of the actor within the corresponding spatiotemporal boundary. In this thesis, we studied ASAD in three parts. First, we systematically explored data association strategies in MOT, aiming to boost MOT performance and reduce annotation costs. Second, we attempted to make the AC module (e.g., skeleton-based AC) smaller, faster, and better. Last, we proposed an efficient method for integrating MOT and AC for ASAD, together with an evaluation standard for the ASAD task. We evaluated our MOT and AC modules separately on their respective datasets to demonstrate the effectiveness of our proposals.
By integrating the MOT and AC modules into the ASAD framework, we provided a baseline and illustrated how to evaluate ASAD performance under our proposed evaluation standard.
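
The two-stage pipeline described above (MOT links per-frame detections into one tracklet per actor, then AC labels each tracklet) can be illustrated with a minimal sketch. This is not the method developed in the thesis: the greedy IoU matching stands in for the motion-and-appearance association, the displacement rule stands in for a learned action classifier, and all names (`Tracklet`, `associate`, `classify_action`, `asad`) and thresholds are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Tracklet:
    actor_id: int                      # unique ID assigned to the actor
    frames: List[int] = field(default_factory=list)
    boxes: List[Box] = field(default_factory=list)


def associate(detections_per_frame: List[List[Box]],
              iou_threshold: float = 0.3) -> List[Tracklet]:
    """Greedy IoU association (assumed stand-in for the MOT stage)."""
    tracklets: List[Tracklet] = []
    for t, boxes in enumerate(detections_per_frame):
        # Only tracklets seen in the previous frame can be extended.
        candidates = [tr for tr in tracklets if tr.frames[-1] == t - 1]
        pairs = sorted(((iou(tr.boxes[-1], b), ti, bi)
                        for ti, tr in enumerate(candidates)
                        for bi, b in enumerate(boxes)), reverse=True)
        used_tracks, used_boxes = set(), set()
        for score, ti, bi in pairs:
            if score < iou_threshold:
                break                  # remaining pairs overlap even less
            if ti in used_tracks or bi in used_boxes:
                continue
            candidates[ti].frames.append(t)
            candidates[ti].boxes.append(boxes[bi])
            used_tracks.add(ti)
            used_boxes.add(bi)
        # Unmatched detections start new tracklets, i.e. new actor IDs.
        for bi, b in enumerate(boxes):
            if bi not in used_boxes:
                tracklets.append(Tracklet(len(tracklets), [t], [b]))
    return tracklets


def classify_action(tracklet: Tracklet, min_shift: float = 5.0) -> str:
    """Placeholder for the AC stage: label by horizontal centre displacement."""
    first, last = tracklet.boxes[0], tracklet.boxes[-1]
    shift = abs((last[0] + last[2]) / 2 - (first[0] + first[2]) / 2)
    return "moving" if shift > min_shift else "static"


def asad(detections_per_frame: List[List[Box]]) -> List[Dict[str, object]]:
    """Return one record per actor: ID, temporal extent, and action label."""
    return [{"actor_id": tr.actor_id, "start": tr.frames[0],
             "end": tr.frames[-1], "action": classify_action(tr)}
            for tr in associate(detections_per_frame)]
```

Each returned record answers "who is doing what, and when": the actor ID comes from the tracklet, the temporal boundary from its first and last frame, and the spatial boundary remains available through the tracklet's per-frame boxes.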