Abstract
This paper presents the KU-ISPL system for the TRECVID 2017 Video to Text (VTT) task. The core of the system is a stacked LSTM model for sentence generation. The system's input descriptors combine several deep learning-based features with multi-object detection results to capture diverse characteristics and key information from videos. We choose mid-level features from VGGNet and SoundNet as the main features, covering the visual and acoustic modalities. In addition, visual attributes for objects and places are used as high-level features. Finally, a visual syntax detector is fine-tuned with a sigmoid loss function to find key words. We submit four runs of the stacked LSTM model, each combining different types of features, to examine how each kind of information affects sentence-generation performance. Word2Vec is adopted for effective sentence encoding: the Word2Vec word embeddings are used both as the LSTM state inputs and as its targets. For sentence matching, the method is based on a fusion score of METEOR, BLEU, and the detection output, where the detection output represents the probability that a word is present in the video. Because the TRECVID VTT task is open domain, the sentence generation and sentence matching systems are trained on various databases such as MSVD, MPII-MD, MVAD, MSR-VTT, and TRECVID-VTT 2016.
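As a rough illustration of the generation model described above, the sketch below shows a two-layer (stacked) LSTM decoder that conditions on a fused video descriptor and consumes Word2Vec embeddings of the target sentence. The PyTorch framing, all dimensions, and the feature-injection scheme (feeding the projected video feature as the first time step) are assumptions made for illustration; the paper's exact configuration is not specified here.

```python
import torch
import torch.nn as nn

class StackedLSTMCaptioner(nn.Module):
    # Hypothetical dimensions: feat_dim for the fused multimodal descriptor,
    # embed_dim matching the Word2Vec embedding size.
    def __init__(self, feat_dim=4096, embed_dim=300, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, embed_dim)   # project video features into the embedding space
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)  # stacked LSTM
        self.out = nn.Linear(hidden_dim, vocab_size)      # per-step word logits

    def forward(self, video_feats, word_embeds):
        # video_feats: (B, feat_dim) fused multimodal descriptor per clip
        # word_embeds: (B, T, embed_dim) Word2Vec embeddings of the target sentence
        v = self.feat_proj(video_feats).unsqueeze(1)      # (B, 1, embed_dim)
        inputs = torch.cat([v, word_embeds], dim=1)       # feed the video feature, then the words
        h, _ = self.lstm(inputs)
        return self.out(h)                                # (B, T+1, vocab_size) logits

model = StackedLSTMCaptioner()
feats = torch.randn(2, 4096)        # two fused video descriptors
embeds = torch.randn(2, 12, 300)    # embeddings of two 12-word sentences
logits = model(feats, embeds)       # (2, 13, 10000)
```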
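For the sentence matching side, a minimal sketch of the score fusion described above might look as follows. The fusion rule (a simple weighted sum) and the weights are placeholders, not the paper's tuned values; `det_probs` stands for the detector's per-word existence probabilities.

```python
def fused_match_score(meteor, bleu, det_probs, weights=(0.4, 0.4, 0.2)):
    """Score a candidate sentence against a video by fusing METEOR, BLEU,
    and the mean detection probability of the sentence's words.
    Weights are illustrative assumptions, not the paper's values."""
    det = sum(det_probs) / len(det_probs) if det_probs else 0.0
    w_m, w_b, w_d = weights
    return w_m * meteor + w_b * bleu + w_d * det

# Example: metric scores computed elsewhere, detection probabilities per word.
score = fused_match_score(meteor=0.31, bleu=0.12, det_probs=[0.9, 0.7, 0.4])
```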
| Original language | English |
| --- | --- |
| Publication status | Published - 2017 |
| Event | 2017 TREC Video Retrieval Evaluation, TRECVID 2017 - Gaithersburg, United States |
| Duration | 2017 Nov 13 → 2017 Nov 15 |
Conference
| Conference | 2017 TREC Video Retrieval Evaluation, TRECVID 2017 |
| --- | --- |
| Country/Territory | United States |
| City | Gaithersburg |
| Period | 2017/11/13 → 2017/11/15 |
ASJC Scopus subject areas
- Electrical and Electronic Engineering
- Information Systems
- Signal Processing