Graduation Year

2025

Document Type

Thesis

Degree

M.S.Cp.

Degree Name

MS in Computer Engineering (M.S.C.P.)

Degree Granting Department

Computer Science and Engineering

Major Professor

Sudeep Sarkar, Ph.D.

Committee Member

Shaun Canavan, Ph.D.

Committee Member

Mauricio Pamplona Segundo, Ph.D.

Keywords

Transformers, Online Machine Learning, Edge Computing, Event Understanding, Representation Learning

Abstract

Event segmentation is the task of autonomously detecting the boundaries of semantically connected sequences of actions within a video. It is connected to many areas of computer vision, including action recognition and event understanding. Nearly all current work requires the entire video to be stored in memory and processed in multiple passes. This consumes valuable resources, increases processing time, prevents the use of low-power, low-memory devices, and precludes event segmentation of long-form content and live video.

This thesis introduces EDGE-STREAMER, a real-time, lightweight, self-supervised transformer architecture capable of performing event segmentation in a single pass. EDGE-STREAMER uses techniques from reinforcement learning models to store representative information about the past for predicting events, rather than retaining the entire video. It takes advantage of weight sharing to reduce model size, further shrinking its memory footprint and its training requirements. During training, the model processes unlabeled video data in a single pass, producing event boundaries in real time.
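The single-pass idea above can be illustrated with a minimal sketch. This is not the thesis code: the class name, the mean-of-memory predictor, and the threshold rule are all hypothetical stand-ins for the transformer components, chosen only to show how a bounded memory of past features can flag event boundaries frame by frame without storing the whole video.

```python
import math

class StreamingSegmenter:
    """Hypothetical sketch of single-pass event segmentation with bounded memory.

    A fixed-size buffer summarizes the recent past; a frame whose feature is
    poorly predicted by that summary is flagged as an event boundary.
    """

    def __init__(self, mem_size=8, threshold=1.0):
        self.memory = []            # bounded summary of past frame features
        self.mem_size = mem_size
        self.threshold = threshold  # prediction-error cutoff for a boundary

    def step(self, feature):
        """Process one frame feature; return True at a predicted boundary."""
        if self.memory:
            dim = len(feature)
            # Naive predictor: mean of the remembered features.
            pred = [sum(f[i] for f in self.memory) / len(self.memory)
                    for i in range(dim)]
            error = math.sqrt(sum((a - b) ** 2 for a, b in zip(feature, pred)))
        else:
            error = 0.0
        is_boundary = error > self.threshold
        if is_boundary:
            self.memory = [feature]     # reset context at the new event
        else:
            self.memory.append(feature)
            if len(self.memory) > self.mem_size:
                self.memory.pop(0)      # constant memory: never the full video
        return is_boundary
```

Because each frame is seen exactly once and the buffer is capped at `mem_size` entries, memory use stays constant regardless of video length, which is the property that makes single-pass, edge-device operation possible.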

On Epic-Kitchens, EDGE-STREAMER matches the MoF performance of STREAMER to within 2%. On an NVIDIA Jetson Orin Nano, it sustains over 16 fps, showing that it is capable of real-time performance on an edge device.

The contributions of this thesis are (1) four low-memory event segmentation architectures; (2) a thorough analysis comparing the accuracy of EDGE-STREAMER to the state of the art; and (3) experimental data showing the viability of EDGE-STREAMER on edge hardware. These results open the door for embedded-device cameras, such as those used in surveillance, autonomous drones, and wearables, and suggest directions for future work, including multi-modal streaming and continual learning on embedded devices.
