● Open Access


International Journal of Applied Mathematics in Control Engineering


Abstract

Spatio-temporal action localization is the task of classifying the action category in a frame sequence and locating its position in each frame. You Only Watch Once (YOWO) is an excellent one-stage algorithm for action classification and localization based on 2D and 3D CNNs, with fast inference speed. However, its detection precision needs further improvement for some practical applications. To improve precision, we introduce the multi-scale idea of the feature pyramid. For sampling, we propose randomly selecting the key frame from consecutive frames to obtain local features, and jointly predicting each frame with global features extracted from the 3D network. The proposed algorithm's frame-mAP reaches 81.6% and 87.0% on JHMDB-21 and UCF101-24 respectively, an impressive improvement of 7.2% and 6.6% over the YOWO algorithm. Moreover, inference speed reaches 15 fps on a single GTX 1660 Ti GPU. Compared with other state-of-the-art architectures, our method also achieves competitive performance.
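The sampling strategy described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `sample_clip` helper and the 16-frame clip length are assumptions for demonstration. A random key frame feeds the 2D branch for local features, while the full clip feeds the 3D branch for global spatio-temporal features.

```python
import random

def sample_clip(frames):
    """Split a clip into inputs for the 2D and 3D branches.

    frames: a list of consecutive frames (e.g. image arrays).
    Hypothetical helper illustrating the random key-frame idea,
    not the paper's actual code.
    """
    # 3D branch: the whole clip supplies global spatio-temporal features.
    clip_3d = frames
    # 2D branch: a randomly selected key frame supplies local features.
    key_index = random.randrange(len(frames))
    key_frame = frames[key_index]
    return key_frame, clip_3d

# Usage: a 16-frame clip of placeholder frames.
frames = [f"frame_{i}" for i in range(16)]
key_frame, clip_3d = sample_clip(frames)
```

At test time, predictions from the randomly sampled key frame are fused with the 3D features so that every frame in the clip is jointly predicted.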

Keywords: YOWO; Spatio-temporal action localization; One-stage; Feature pyramid

Figures from this paper: Fig. 1, Fig. 2, Fig. 3 (captions not available in this extract).