Abstract
This paper presents the KU-ISPL model for the TRECVID 2018 Video-to-Text (VTT) task. The VTT architecture is built on a stack of two LSTMs with an attention mechanism. We employ a sequence-to-sequence model to handle sequential input and output: the encoder encodes video frames into visual representations, and the decoder decodes those representations into textual words. The attention mechanism is exploited to make the best use of contextually pertinent frames in the input video; the decoder attends to the hidden states of the second LSTM in the encoder to obtain informative hidden states of its own. Visual features, acoustic features, and object detection results are extracted from deep learning models and concatenated into a single vector, which serves as the input descriptor of the model. The stacked LSTMs and the attention weights are trained jointly, so the whole model is an end-to-end trainable network. We submit four runs that combine different types of features to explore how each kind of information affects the quality of sentence generation. The sentence matching method is based on a fusion score of METEOR and BLEU. Because the TRECVID VTT task is open domain, the sentence generation and sentence matching systems are trained on several datasets, including MSVD, M-VAD, and MSR-VTT. Experimental results show that the proposed model outperforms the same model without the attention mechanism.
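The architecture described above is a standard attention-based sequence-to-sequence captioner. As a rough illustration only, the following is a minimal PyTorch sketch under our own assumptions: the layer sizes, the dot-product (Luong-style) attention, the use of teacher forcing, and all names such as `AttentiveSeq2Seq` are ours, not details given in the paper.

```python
# Minimal sketch of a stacked-LSTM seq2seq captioner with attention.
# All hyperparameters and the attention form are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveSeq2Seq(nn.Module):
    def __init__(self, feat_dim, hidden_dim, vocab_size):
        super().__init__()
        # Encoder: a stack of two LSTMs over per-frame feature vectors.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, num_layers=2, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        # Decoder consumes the previous word embedding plus the attention context.
        self.decoder = nn.LSTMCell(hidden_dim * 2, hidden_dim)
        self.out = nn.Linear(hidden_dim * 2, vocab_size)

    def forward(self, feats, captions):
        # feats: (B, T, feat_dim) concatenated visual/acoustic/detection features
        # captions: (B, L) ground-truth word indices (teacher forcing)
        enc_out, (h, c) = self.encoder(feats)       # enc_out: (B, T, H)
        hx, cx = h[-1], c[-1]                       # init decoder from top encoder layer
        context = enc_out.mean(dim=1)               # initial context vector
        emb = self.embed(captions)                  # (B, L, H)
        logits = []
        for t in range(captions.size(1)):
            hx, cx = self.decoder(torch.cat([emb[:, t], context], dim=1), (hx, cx))
            # Dot-product attention over the 2nd-LSTM encoder hidden states.
            scores = torch.bmm(enc_out, hx.unsqueeze(2)).squeeze(2)       # (B, T)
            alpha = F.softmax(scores, dim=1)                              # frame weights
            context = torch.bmm(alpha.unsqueeze(1), enc_out).squeeze(1)   # (B, H)
            logits.append(self.out(torch.cat([hx, context], dim=1)))
        return torch.stack(logits, dim=1)           # (B, L, vocab_size)

# Illustrative usage; feature dimensions are placeholders, not the paper's.
model = AttentiveSeq2Seq(feat_dim=2048, hidden_dim=512, vocab_size=10000)
feats = torch.randn(4, 30, 2048)                    # 4 videos, 30 frames each
caps = torch.randint(0, 10000, (4, 12))
print(model(feats, caps).shape)                     # torch.Size([4, 12, 10000])
```

Because the encoder, decoder, and attention weights all sit in one `nn.Module`, a single optimizer step updates them jointly, which is what makes such a network end-to-end trainable.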
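For the sentence matching subtask, one plausible reading of the METEOR/BLEU fusion score is sketched below. The paper does not specify the fusion weights or the tokenization, so the equal 0.5/0.5 weighting, the whitespace tokenizer, and the function name `fusion_score` are assumptions.

```python
# Hedged sketch of a METEOR + BLEU fusion score for sentence matching.
# Requires: pip install nltk; nltk.download('wordnet') for METEOR.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

def fusion_score(candidate, reference, w_meteor=0.5, w_bleu=0.5):
    # Whitespace tokenization is an assumption; the paper does not say.
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    bleu = sentence_bleu([ref_tokens], cand_tokens,
                         smoothing_function=SmoothingFunction().method1)
    meteor = meteor_score([ref_tokens], cand_tokens)
    return w_meteor * meteor + w_bleu * bleu

# Rank candidate sentences for a video by their fused score.
print(fusion_score("a man is playing a guitar", "a man plays the guitar"))
```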
| Original language | English |
| --- | --- |
| Publication status | Published - 2020 |
| Event | 2018 TREC Video Retrieval Evaluation, TRECVID 2018 - Gaithersburg, United States |
| Duration | 2018 Nov 13 → 2018 Nov 15 |
Conference
| Conference | 2018 TREC Video Retrieval Evaluation, TRECVID 2018 |
| --- | --- |
| Country/Territory | United States |
| City | Gaithersburg |
| Period | 2018 Nov 13 → 2018 Nov 15 |
ASJC Scopus subject areas
- Information Systems
- Signal Processing
- Electrical and Electronic Engineering