PhD Defense: Towards Robust Spatial Perception from Visual and Linguistic Information

Talk
Ang Li
Time: 07.14.2017 11:00 to 13:00
Location: AVW 4172

Perceiving the spatial information (e.g., location and relationships) of objects, scenes, and even language descriptions is an important aspect of computer vision, one inherent in many applications such as face recognition, image matching, and multimodal image retrieval. The essential problem is to develop an efficient, robust visual representation that effectively encodes both visual appearance and spatial information. Many existing approaches struggle in real applications due to (a) high sensitivity to data noise and uncertainty, and (b) weak generalization to the large variety of natural language concepts. This thesis addresses these issues by developing effective low-level, mid-level, and high-level representations.
Low-level features are effective for matching images, but their performance can be sensitive to noise, especially under spatial transformations. In the first part, we model the uncertainty of low-level features such as line segments and key-point appearance descriptors, equipping these representations with uncertainty estimates so that they can be applied robustly in real scenarios such as image-based geolocation and active face authentication on mobile devices.
The mission of visual understanding is to express and describe image content, which essentially means relating images to human language. That typically involves finding a common representation inferable from both domains of data. In the second part, we propose a framework that extracts a mid-level spatial representation directly from language descriptions and matches such spatial layouts to detected object bounding boxes in order to retrieve indoor scene images from user text queries.
Most high-level visual features are learned from supervised datasets, whose scalability is largely limited by the need for dedicated human annotation. In the last part, we propose to learn visual representations from large-scale weakly supervised data for a large number of natural-language concepts, i.e., n-gram phrases. We propose a differentiable Jelinek-Mercer smoothing loss and train a deep convolutional neural network on images with associated user comments.
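As a rough illustration of the idea (not the exact formulation used in the thesis), Jelinek-Mercer smoothing interpolates the model's predicted phrase distribution with a background distribution estimated from the corpus, and the loss is the negative log-likelihood of the observed phrases under the smoothed distribution. The sketch below assumes a PyTorch-style setup; the function name, variable names, and interpolation weight are hypothetical.

    # Minimal sketch of a Jelinek-Mercer smoothed loss (illustrative names only).
    import torch
    import torch.nn.functional as F

    def jelinek_mercer_loss(logits, targets, background_probs, lam=0.8):
        # logits:           (batch, vocab) phrase scores predicted from image features
        # targets:          (batch,) indices of observed n-gram phrases
        # background_probs: (vocab,) corpus-level phrase frequencies, summing to 1
        # lam:              interpolation weight between model and background
        model_probs = F.softmax(logits, dim=-1)
        smoothed = lam * model_probs + (1.0 - lam) * background_probs
        nll = -torch.log(smoothed.gather(1, targets.unsqueeze(1)) + 1e-12)
        return nll.mean()

Because the interpolation keeps every phrase's probability bounded away from zero, the loss stays finite and differentiable even for rare n-grams.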
Examining Committee:
Chair: Dr. Larry S. Davis
Dean's rep: Dr. Rama Chellappa
Members: Dr. Hal Daume III
Dr. Ramani Duraiswami
Dr. Thomas Goldstein