Abstract
Video prediction, which maps a sequence of past video frames to realistic future frames, is a challenging task because it requires both generating realistic frames and modeling the coherent temporal relationship between consecutive frames. In this paper, we propose a hierarchical sequence-to-sequence prediction approach to address this challenge. We present an end-to-end trainable architecture in which the frame generator automatically encodes input frames into multiple levels of latent Convolutional Neural Network (CNN) features and then recursively generates future frames conditioned on the estimated hierarchical CNN features and the previous prediction. This design is intended to automatically learn hierarchical representations of video and their temporal dynamics. Convolutional Long Short-Term Memory (ConvLSTM) is used in combination with skip connections to separately capture the sequential structure at each level of the feature hierarchy. We adopt Scheduled Sampling to train our recurrent network, which facilitates convergence and produces high-quality sequence predictions. We evaluate our method on the Bouncing Balls, Moving MNIST, and KTH human action datasets, and report favorable results compared to existing methods.
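The Scheduled Sampling strategy mentioned in the abstract can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the inverse-sigmoid decay schedule, the scalar "frames", and the dummy `predict` function are all assumptions chosen only to show how the conditioning input is switched between ground truth and the model's own prediction during recursive generation.

```python
import math
import random

def sampling_prob(step, k=100.0):
    """Inverse-sigmoid decay (one common schedule): the probability of
    feeding the ground-truth frame rather than the model's own previous
    prediction starts near 1 and decays toward 0 as training proceeds."""
    return k / (k + math.exp(step / k))

def rollout(frames, predict, step, rng):
    """Recursively predict each next frame; at every step, Scheduled
    Sampling decides which previous frame conditions the next prediction."""
    eps = sampling_prob(step)
    prev = frames[0]
    preds = []
    for t in range(1, len(frames)):
        preds.append(predict(prev))
        # Early in training (eps near 1): mostly condition on ground truth.
        # Later (eps near 0): mostly feed back the model's own prediction,
        # matching the test-time recursive generation regime.
        prev = frames[t] if rng.random() < eps else preds[-1]
    return preds

# Toy demo: "frames" are scalars and the "model" just adds one.
preds = rollout([0, 1, 2, 3], lambda x: x + 1, step=0, rng=random.Random(0))
```

The gradual decay is the point of the technique: it eases the recurrent network from teacher forcing toward conditioning on its own outputs, reducing the train/test mismatch that hurts long-horizon predictions.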
| Original language | English |
|---|---|
| Article number | 8288 |
| Pages (from-to) | 1-14 |
| Number of pages | 14 |
| Journal | Applied Sciences (Switzerland) |
| Volume | 10 |
| Issue number | 22 |
| DOIs | |
| Publication status | Published - 2020 Nov 2 |
Bibliographical note
Publisher Copyright: © 2020 by the authors. Licensee MDPI, Basel, Switzerland.
Keywords
- Convolutional neural network
- Hierarchical features
- Long short-term memory
- Recurrent neural network
- Video prediction
ASJC Scopus subject areas
- General Materials Science
- Instrumentation
- General Engineering
- Process Chemistry and Technology
- Computer Science Applications
- Fluid Flow and Transfer Processes