Online temporal action localization (On-TAL) is the task of identifying multiple action instances in a streaming video. Since existing methods take as input only a fixed-size video segment per iteration, they are limited in exploiting long-term context and require careful tuning of the segment size. To overcome these limitations, we propose the memory-augmented transformer (MATR). MATR utilizes a memory queue that selectively preserves past segment features, allowing the model to leverage long-term context for inference. We also propose a novel action localization method that observes the current input segment to predict the end time of the ongoing action and accesses the memory queue to estimate the start time of the action. Our method outperforms existing methods on two datasets, THUMOS14 and MUSES, surpassing not only TAL methods in the online setting but also some offline TAL methods.
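The memory queue above can be pictured as a bounded buffer of past segment features. The sketch below is a deliberate simplification: the paper's queue preserves features *selectively*, whereas here the selection policy is reduced to plain FIFO, and the capacity and feature shapes are illustrative assumptions, not values from the paper.

```python
from collections import deque

class MemoryQueue:
    """Bounded FIFO holding the most recent segment features.

    When the buffer is full, the oldest segment is discarded, so the
    queue always holds a fixed window of past context. (The actual MATR
    queue uses a selective retention policy; FIFO is a simplification.)
    """

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, segment_features):
        # Store the features of the current segment.
        self.buffer.append(segment_features)

    def read(self):
        # Flattened past context for the start decoder to attend to.
        return [f for segment in self.buffer for f in segment]

# Usage: capacity of 3 segments, each with 2 per-frame feature tokens.
memory = MemoryQueue(capacity=3)
for t in range(5):
    memory.push([f"seg{t}_frame{i}" for i in range(2)])
print(len(memory.buffer))  # 3 — only the latest three segments remain
```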
MATR consists of four parts: a feature extractor, a memory-augmented video encoder, an instance decoding module, and prediction heads. The current segment of the input streaming video serves as the unit input, and its frame-level features, referred to as segment features, are extracted by a video backbone network and a linear projection layer. The segment features are then fed to the memory-augmented video encoder, which encodes temporal context between frames of the current segment and stores the segment features in the memory queue. The instance decoding module localizes action instances via two Transformer decoders: the end decoder and the start decoder. Specifically, the end decoder references the encoded segment features to locate the action end near the current time, and the start decoder then refers to the memory queue to find the action start based on the stored past information. The queries for each instance consist of a class query for action classification and a boundary query for action localization. The outputs of the instance decoding module are fed to the prediction heads: an end prediction head, a start prediction head, and an action classification head. The entire model is trained end-to-end.
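The data flow through these four parts can be sketched in a few lines. All functions below are illustrative stand-ins, not the paper's trained networks: the backbone, encoder, and both decoders are replaced by trivial NumPy stubs, and the feature dimension, query count, and segment length are assumed values. The point is only the routing: the end decoder attends to the current segment, while the start decoder attends to the whole memory queue.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # feature dimension (illustrative, not from the paper)

def extract_segment_features(segment_frames):
    # Stand-in for the video backbone + linear projection layer.
    return rng.standard_normal((len(segment_frames), DIM))

def encode_segment(feats):
    # Stand-in for the memory-augmented video encoder (identity here;
    # a Transformer encoder in the actual model).
    return feats

def end_decoder(queries, encoded_segment):
    # Attends to the current segment to locate action ends.
    return queries + encoded_segment.mean(axis=0)

def start_decoder(queries, memory_feats):
    # Attends to the memory queue to locate action starts.
    return queries + memory_feats.mean(axis=0)

memory = []                               # memory queue (unbounded here)
queries = rng.standard_normal((4, DIM))   # 4 instance queries

for segment in [range(16), range(16)]:    # two streaming segments
    feats = encode_segment(extract_segment_features(list(segment)))
    memory.append(feats)
    end_repr = end_decoder(queries, feats)
    start_repr = start_decoder(queries, np.concatenate(memory))
    # end_repr / start_repr would feed the end, start, and class heads.
print(end_repr.shape, start_repr.shape)  # (4, 8) (4, 8)
```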
Our method outperforms all previous On-TAL methods on both benchmarks by a substantial margin: 4.9%p of average mAP on THUMOS14 and 0.7%p of average mAP on MUSES.
We conduct an ablation study to verify the effectiveness of each component of our model. The results show that all design choices of the proposed components are important for action instance localization. Additionally, as shown in Table 3, employing the memory queue improves performance over the variant without memory, and performance generally improves as the memory queue size increases.
BibTeX will be updated soon.