Mimicking the human ability to understand visual data (images or videos) is a long-standing goal of computer vision. To achieve visual content understanding by machines, many recent works attempt to connect visual data with natural language, such as object labels and descriptions. This line of work is important not only for visual understanding itself but also for broad applications such as content-based visual data retrieval and automatic description generation to assist visually impaired people.
In the presentation, we will describe our work on developing cross-modal representations that enable us to associate videos with natural language. We explore two directions for constructing cross-modal representations: hand-crafted representations and data-driven representation learning. The experiments demonstrate that the proposed representations can be applied to a wide range of practical applications, including query-focused video summarization and content-based video retrieval with natural language queries.
We first introduce a hand-crafted representation that encodes objects in videos and noun words in sentences. This object-based representation is applied to video summarization guided by user-provided text. Next, we explore cross-modal representation learning and introduce deep models that map videos and sentences to a common feature space. Unlike the object-based representation, this approach incorporates various concepts, including objects, actions, and attributes. The performance of the learned representation is evaluated on several tasks: unsupervised video summarization, content-based video and sentence retrieval, and video captioning. Lastly, we propose a learning method for sequential video representations. The proposed model is designed to capture the dynamics of content within a video, and we alleviate the lack of training data by synthesizing training examples from existing video-description datasets. We evaluate the sequential video representation on the task of content-based video retrieval. The experimental results demonstrate that our cross-modal representation is useful for finding video content relevant to a sentence query.
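To make the idea of a common feature space concrete, the sketch below pairs a simple video encoder with a sentence encoder and trains them with a max-margin ranking loss, a standard recipe for video-sentence retrieval. It is an illustrative example only, assuming pre-extracted frame features and tokenized sentences; the module names, dimensions, and margin are placeholders rather than details of the models presented in the talk.

# Minimal sketch of a joint video-sentence embedding (illustrative, not the
# presenter's exact model). Assumes pre-extracted frame features and tokenized
# sentences; dimensions and the margin value are placeholder choices.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoEncoder(nn.Module):
    def __init__(self, frame_dim=2048, embed_dim=512):
        super().__init__()
        self.fc = nn.Linear(frame_dim, embed_dim)

    def forward(self, frames):                  # frames: (batch, n_frames, frame_dim)
        pooled = frames.mean(dim=1)             # mean-pool frame features over time
        return F.normalize(self.fc(pooled), dim=-1)

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size=10000, word_dim=300, embed_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.gru = nn.GRU(word_dim, embed_dim, batch_first=True)

    def forward(self, tokens):                  # tokens: (batch, n_words) word indices
        _, h = self.gru(self.embed(tokens))     # final hidden state summarizes the sentence
        return F.normalize(h.squeeze(0), dim=-1)

def ranking_loss(v, s, margin=0.2):
    # Max-margin triplet loss using all other pairs in the batch as negatives.
    scores = v @ s.t()                          # cosine similarities (inputs are L2-normalized)
    pos = scores.diag().unsqueeze(1)
    cost_s = (margin + scores - pos).clamp(min=0)      # video matched with wrong sentence
    cost_v = (margin + scores - pos.t()).clamp(min=0)  # sentence matched with wrong video
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_s.masked_fill(mask, 0).sum() + cost_v.masked_fill(mask, 0).sum()

# Toy forward/backward pass with random data.
v_enc, s_enc = VideoEncoder(), SentenceEncoder()
videos = torch.randn(8, 20, 2048)               # 8 videos, 20 frames each
sents = torch.randint(0, 10000, (8, 12))        # 8 paired sentences, 12 tokens each
loss = ranking_loss(v_enc(videos), s_enc(sents))
loss.backward()

At retrieval time, videos and a sentence query are embedded with the two encoders and ranked by cosine similarity in the shared space, which is how a content-based search over video collections can be carried out with natural language queries.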