TY - JOUR
T1 - Binary dense sift flow based two stream CNN for human action recognition
AU - Park, Sang Kyoo
AU - Chung, Jun Ho
AU - Kang, Tae Koo
AU - Lim, Myo Taeg
N1 - Funding Information:
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (Grants No. NRF-2016R1D1A1B01016071 and NRF-2019R1A2C108974211).
Publisher Copyright:
© 2021, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2021/11
Y1 - 2021/11
AB - The two-stream CNN is a widely used network for human action recognition. It consists of a spatial stream and a temporal stream. The spatial stream, through which the RGB image passes, extracts the shape features of human motion. The temporal stream, through which the optical flow images pass, extracts the sequence features of the motion. However, the constraints that optical flow relies on, namely brightness constancy and piecewise smoothness, limit the performance of the two-stream CNN. One efficient approach to this problem is to expand the model to a three-stream network, fuse it with an LSTM, and add a modified pooling layer. This improves the performance of the model but increases the computational cost, and the limitations of optical flow remain. In this paper, without extending the network model, a two-stream CNN based on binary dense SIFT flow is used in place of optical flow. Unlike optical flow, binary dense SIFT flow, which is a feature-based matching flow field, is robust with respect to brightness constancy and piecewise smoothness. To evaluate the binary dense SIFT flow-based two-stream CNN, the UCF-101 dataset was selected for human action recognition. Furthermore, to evaluate its robustness with respect to brightness constancy and piecewise smoothness, a custom dataset was built from classes extracted from UCF-101. Finally, the proposed method was compared with the state-of-the-art method, which uses an optical flow-based two-stream CNN.
KW - Action recognition
KW - Binary dense SIFT flow
KW - Binary descriptor
KW - Two-Stream CNN
UR - http://www.scopus.com/inward/record.url?scp=85107787734&partnerID=8YFLogxK
U2 - 10.1007/s11042-021-10795-2
DO - 10.1007/s11042-021-10795-2
M3 - Article
AN - SCOPUS:85107787734
SN - 1380-7501
VL - 80
SP - 35697
EP - 35720
JO - Multimedia Tools and Applications
JF - Multimedia Tools and Applications
IS - 28-29
ER -