Graduation Year


Document Type




Degree Name

Doctor of Philosophy (Ph.D.)

Degree Granting Department

Computer Science and Engineering

Major Professor

Yu Y. Sun, Ph.D.

Committee Member

Shaun Canavan, Ph.D.

Committee Member

Heather Culbertson, Ph.D.

Committee Member

John Licato, Ph.D.

Committee Member

Kyle Reed, Ph.D.


Calorie Estimation, Cooking Video Understanding, Ingredient Recognition, Knowledge Representation, Video Understanding, Deep Learning


In this dissertation, we discuss our work on analyzing cooking content toward the ultimate goal of automatic robotic manipulation. For a robot to perform a cooking task, it must both understand the scene and utilize prior knowledge. We explore two main sub-problems: knowledge extraction and inference, and visual understanding of the scene. Visual understanding of a scene requires algorithms that can visually infer information from a single image or video. Many algorithms from image classification, object detection, and activity recognition can be applied here. Although the emergence of deep learning has brought great advances, state-of-the-art algorithms in this area still have limitations. To address these limitations, we propose to combine structured knowledge representations with state-of-the-art deep learning techniques for visual understanding of cooking videos. Besides objects and motions, we recognize that the states of objects are also very important in interpreting the scene, and we therefore extensively explore the problem of states in visual cooking content. We introduce the state identification challenge in cooking applications and collect a dataset for research in the area of ingredient state analysis. We further investigate the problem of simultaneous knowledge extraction from a single image: extracting information about ingredients, their states, the interconnections between different objects in the scene, and the motion-object interconnections. This problem requires an algorithm that can model the correlations among various concepts in a single image simultaneously. Deep models that accept multiple inputs and generate multiple outputs are well suited to this problem; therefore, we propose to incorporate auto-regressive, self-attention-based mechanisms to extract knowledge from a single image.
We show that the knowledge acquired from a single image can be used for calorie estimation. We suggest that total knowledge extraction from a single image can be used in future work for task graph inference.
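To illustrate the kind of auto-regressive, self-attention-based decoding described above, the following is a minimal toy sketch: starting from an image feature vector, each step attends (with a causal mask) over the image feature and the concepts chosen so far, then greedily picks the next concept. All names, dimensions, and the greedy matching rule here are illustrative assumptions, not the dissertation's actual model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x):
    """Single-head scaled dot-product self-attention with a causal mask,
    so each position attends only to itself and earlier positions."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)            # (T, T) attention logits
    mask = np.triu(np.ones((T, T)), k=1)     # 1s strictly above the diagonal
    scores = np.where(mask == 1, -1e9, scores)
    return softmax(scores, axis=-1) @ x      # attention-weighted values

def greedy_decode(image_feat, concept_embeds, steps=3):
    """Toy auto-regressive decoding: condition on the image feature,
    repeatedly pick the concept embedding most compatible with the
    attended context, and feed it back as the next input token.
    (Hypothetical setup for illustration only.)"""
    seq = [image_feat]
    picked = []
    for _ in range(steps):
        ctx = causal_self_attention(np.stack(seq))[-1]   # context at last step
        idx = int(np.argmax(concept_embeds @ ctx))       # most compatible concept
        picked.append(idx)
        seq.append(concept_embeds[idx])                  # feed prediction back in
    return picked
```

In a real model, the concept vocabulary would cover ingredients, states, and motions jointly, so one decoding pass yields the multiple interrelated outputs the abstract describes.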