Start Date
8-5-2025 9:15 AM
End Date
8-5-2025 10:30 AM
Document Type
Full Paper
Keywords
Image Segmentation, Scene Understanding, Computer Vision, Text-to-Image Model
DOI
https://doi.org/10.5038/AJIC8315
Text-to-Image Model-based Image Segmentation for Scene Understanding in Autonomous Robot Navigation
Image segmentation is essential for navigation and scene understanding in autonomous systems, particularly in unstructured outdoor environments. This study investigates the segmentation capabilities of DALL-E 3, a generative text-to-image model that is not explicitly trained for semantic segmentation. A custom segmentation pipeline was developed to evaluate and refine DALL-E 3 outputs on outdoor images from the RELLIS-3D dataset. The post-processing workflow applies morphological operations with varied structuring elements to refine the raw outputs. Segmentation accuracy was assessed using mean Intersection over Union (mIoU) across selected terrain classes. Results show that the developed post-processing refinement improved the raw DALL-E 3 outputs, and the resulting accuracy values are competitive with the supervised models HRNet+OCR and GSCNN. These results demonstrate that text-to-image models, when paired with domain-aware post-processing, offer a promising alternative for flexible, rapid-deployment segmentation in general-purpose robotics without requiring labeled training data. These efforts contribute to our research team's broader goal of enabling intelligent mobile robots capable of autonomous perception and decision-making in complex environments.
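
As a rough illustration of the post-processing and evaluation summarized in the abstract, the Python sketch below refines each class mask with morphological closing and opening using an elliptical structuring element, then computes mIoU over a chosen set of terrain classes. The kernel size, structuring-element shape, and class IDs are illustrative assumptions, not values taken from the paper.

import numpy as np
import cv2

def refine_mask(mask, kernel_size=5):
    # Close small gaps, then remove speckle noise in a binary class mask.
    # Kernel size and elliptical shape are assumed, not specified in the paper.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    closed = cv2.morphologyEx(mask.astype(np.uint8), cv2.MORPH_CLOSE, kernel)
    opened = cv2.morphologyEx(closed, cv2.MORPH_OPEN, kernel)
    return opened.astype(bool)

def mean_iou(pred, gt, class_ids):
    # Mean Intersection over Union over the selected terrain classes.
    # pred and gt are integer label maps of the same shape.
    ious = []
    for c in class_ids:
        pred_c = refine_mask(pred == c)
        gt_c = (gt == c)
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent from both prediction and ground truth
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

In practice, the paper's pipeline varies the structuring elements per class; this sketch uses a single fixed kernel only to show the general refine-then-score workflow.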