PhD Proposal: Full-Length Video Understanding with Deep Networks

Talk
Joe Yue-Hei Ng
Time: 12.03.2015, 15:30 to 17:00
Location: AVW 4424

Video understanding is one of the fundamental problems in computer vision. Compared to still images, videos add a temporal component through which motion and other cues can be exploited for recognition. Encouraged by the success of deep convolutional neural networks (CNNs) on image classification, we extend deep convolutional networks to video understanding by modeling both spatial and temporal information. In particular, we are interested in full-length video understanding, which requires modeling both short-term motion and long-term context.
To effectively utilize deep networks, we need a comprehensive understanding of convolutional neural networks. We first study these networks in the domain of image retrieval. Previous work has assumed that the last layers give the best performance, as they do in classification. We show that for instance-level image retrieval, lower layers of convolutional neural networks often perform better than the last layers. We present an approach for extracting convolutional features from different layers of the network and adopt VLAD encoding to aggregate the features into a single vector for each image. Our work provides guidance for transferring deep convolutional networks to other tasks.
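As a rough illustration of this pipeline, the sketch below treats the spatial positions of an intermediate convolutional feature map as local descriptors and aggregates them with VLAD. The layer choice, codebook size, and normalization steps here are placeholder assumptions for illustration, not the exact configuration evaluated in the proposal.

```python
import numpy as np
from sklearn.cluster import KMeans

def vlad_encode(descriptors, kmeans):
    """Encode local descriptors (N x D) into a single VLAD vector:
    residuals to the nearest codebook centre are summed per centre,
    concatenated, power-normalised, and L2-normalised."""
    centers = kmeans.cluster_centers_              # (K, D)
    assignments = kmeans.predict(descriptors)      # (N,)
    K, D = centers.shape
    vlad = np.zeros((K, D), dtype=np.float32)
    for k in range(K):
        members = descriptors[assignments == k]
        if len(members):
            vlad[k] = (members - centers[k]).sum(axis=0)
    vlad = vlad.flatten()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))   # power normalisation
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

def feature_map_to_descriptors(feature_map):
    """Treat a conv feature map of shape (C, H, W) as H*W descriptors of dim C."""
    C, H, W = feature_map.shape
    return feature_map.reshape(C, H * W).T         # (H*W, C)

# Usage with random arrays standing in for real conv activations (hypothetical sizes):
rng = np.random.default_rng(0)
fake_maps = [rng.standard_normal((256, 14, 14)).astype(np.float32) for _ in range(10)]
all_desc = np.vstack([feature_map_to_descriptors(m) for m in fake_maps])
kmeans = KMeans(n_clusters=16, n_init=4, random_state=0).fit(all_desc)
image_vector = vlad_encode(feature_map_to_descriptors(fake_maps[0]), kmeans)
print(image_vector.shape)                          # one (16 * 256)-dim vector per image
```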
In the second part of the proposal, we propose and evaluate several deep neural network architectures that combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handling full-length videos. The first explores various convolutional temporal feature pooling architectures, examining the design choices that need to be made when adapting a CNN for this task. The second explicitly models the video as an ordered sequence of frames. For this purpose we employ a recurrent neural network with Long Short-Term Memory (LSTM) cells connected to the output of the underlying CNN. Experiments show that these networks improve performance significantly by capturing information over long videos.
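To make the two families of architectures concrete, here is a minimal sketch assuming per-frame features have already been extracted by a CNN. The feature dimension, the choice of max pooling over time, and the single-layer LSTM are illustrative assumptions rather than the exact models proposed.

```python
import torch
import torch.nn as nn

class FramePoolingClassifier(nn.Module):
    """Temporal feature pooling: collapse per-frame CNN features across time
    (max pooling shown here) and classify the pooled video descriptor."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_feats):            # frame_feats: (B, T, D)
        pooled, _ = frame_feats.max(dim=1)     # pool over the time axis
        return self.fc(pooled)

class FrameLSTMClassifier(nn.Module):
    """Sequence model: an LSTM reads the ordered per-frame CNN features and
    the final hidden state summarises the whole video."""
    def __init__(self, feat_dim, hidden_dim, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, frame_feats):            # frame_feats: (B, T, D)
        _, (h_n, _) = self.lstm(frame_feats)
        return self.fc(h_n[-1])

# Usage with random tensors standing in for CNN features of 120 frames:
feats = torch.randn(2, 120, 2048)
print(FramePoolingClassifier(2048, 101)(feats).shape)    # (2, 101)
print(FrameLSTMClassifier(2048, 512, 101)(feats).shape)  # (2, 101)
```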
In the future, we plan to extend these models to the temporal localization of activities in videos.
Examining Committee:
Committee Chair: Dr. Larry Davis
Dept. Representative: Dr. Hal Daume III
Committee Member: Dr. Rama Chellappa