Start Date

December 5, 2025, 12:00 PM

End Date

December 5, 2025, 1:00 PM

Description

As generative visuomotor models become increasingly capable of representing large-scale robotic skills, they are beginning to transition from controlled laboratory environments into shared human-robot workspaces. While these models enable competent manipulation, their opaque, black-box decision-making introduces coordination challenges that can degrade safety, efficiency, and trust during collaboration. In safety-critical tasks such as medication dispensing, successful teamwork depends not only on physical capability but also on timing, predictability, and mutual awareness.

This project investigates how two state-of-the-art generative visuomotor policies, an Action-Chunking Transformer (ACT) and a Vision-Action Diffusion policy, perform in a real-world collaborative medication-dispensing task. We compare these policies against a human-human baseline to systematically analyze coordination failures. Across 288 trials involving 32 adult participants in a counterbalanced within-subjects design, we developed and validated a taxonomy of eight collaborative failure modes: Passive Wait, Redundant Retrieval, Missed Grab, Safety Conflict, Safety Avoidance, Slippage, Capability Miscalibration, and Task Model Uncertainty. Four independent coders achieved substantial agreement (Cohen's κ = 0.80), supporting the reliability of the failure classification.
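For readers unfamiliar with the agreement statistic: Cohen's κ compares observed inter-coder agreement against the agreement expected by chance from each coder's label marginals. Note that κ is defined for a pair of coders, so a four-coder study would typically report an average over coder pairs (or an alternative such as Fleiss' κ). A minimal sketch of the pairwise computation, using invented annotations (the labels below are illustrative, not the study's data):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters' categorical labels (same items, same order)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where the two coders match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over categories of the product of marginal proportions.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical annotations of six trials using the taxonomy's labels.
coder1 = ["PassiveWait", "MissedGrab", "PassiveWait",
          "Slippage", "RedundantRetrieval", "MissedGrab"]
coder2 = ["PassiveWait", "MissedGrab", "RedundantRetrieval",
          "Slippage", "RedundantRetrieval", "MissedGrab"]
print(round(cohens_kappa(coder1, coder2), 2))
```

Values above roughly 0.6 are conventionally read as substantial agreement, which is the interpretation the study applies to its κ = 0.80.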

Our findings reveal that the dominant failure modes were timing-related rather than purely mechanical. Passive Wait and Redundant Retrieval together accounted for nearly 60% of all observed failures, indicating that the primary limitation of current visuomotor models lies in synchronization and intent communication rather than manipulation accuracy. ACT demonstrated more confident but occasionally premature grasping, resulting in a higher rate of Missed Grab events. The Diffusion policy exhibited smoother but more hesitant motions, contributing to more frequent Passive Wait events. Although safety-critical failures were less frequent, instances of Safety Conflict and Safety Avoidance highlight the risks associated with insufficient transparency in robot intent signaling.

To support safer and more predictable collaborative robotics, we release both an annotated failure taxonomy and an interactive Bayesian diagnostic knowledge network that models relationships between failure types and task outcomes. This work contributes to human-centered robot design by identifying how opacity-driven failures degrade performance and by providing actionable insights for improving intent communication in next-generation generative robot controllers. Ultimately, improving transparency and predictability in visuomotor policies is essential for deploying collaborative robots in real-world, safety-critical environments.
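The diagnostic use of such a network can be illustrated with a toy two-node model (FailureMode → TaskOutcome), inverted by Bayes' rule to ask which failure mode is most probable given that a trial failed. Everything below, the structure and every probability, is a hypothetical placeholder for illustration, not the parameters of the released network:

```python
# Hypothetical prior over a subset of the taxonomy's failure modes.
p_mode = {"PassiveWait": 0.35, "RedundantRetrieval": 0.25,
          "MissedGrab": 0.20, "SafetyConflict": 0.20}

# Hypothetical conditional probability table: P(trial fails | failure mode).
p_fail_given_mode = {"PassiveWait": 0.4, "RedundantRetrieval": 0.3,
                     "MissedGrab": 0.7, "SafetyConflict": 0.9}

def posterior_mode_given_failure():
    """P(mode | trial failed) via Bayes' rule: prior x likelihood, normalized."""
    joint = {m: p_mode[m] * p_fail_given_mode[m] for m in p_mode}
    z = sum(joint.values())  # P(trial failed), marginalized over modes
    return {m: v / z for m, v in joint.items()}

post = posterior_mode_given_failure()
for mode, p in sorted(post.items(), key=lambda kv: -kv[1]):
    print(f"{mode}: {p:.3f}")
```

In this toy parameterization, Safety Conflict ends up most probable given a failed trial despite its low prior, because its assumed likelihood of causing failure is highest; the released network applies the same kind of reasoning across the full taxonomy and task outcomes.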


Analyzing Human–Robot Collaboration Failures in State-of-the-Art Generative Visuomotor Models
