Evaluates vision-language action policies on multi-object picking, revealing reliance on object-location correlations rather than robust grounding.


Vision-Language-Action (VLA) models have demonstrated significant progress in robotic manipulation by leveraging large-scale vision-language models (VLMs) to learn generalizable policies from multimodal data. Models such as RT-1, RT-2, OpenVLA, and $$\pi_0$$ exemplify this paradigm, showing capabilities in zero-shot generalization and basic reasoning over novel objects and instructions. However, despite these advances, recent studies highlight that VLA policies often exhibit brittle generalization when deployed in out-of-distribution (OOD) scenarios, struggling with unseen objects, novel environments, and minor environmental changes such as lighting or object placement.

A key limitation identified in current VLA policies is their reliance on spurious correlations—such as object-location co-occurrences—rather than robust, causal grounding between language, perception, and action. This brittleness is particularly evident in multi-object settings, where models may fail to correctly interpret instructions when distractors are introduced or spatial configurations vary. For instance, experiments in cluttered environments with 4–8 objects show that successful grasp execution is tightly constrained by grounding accuracy, as incorrect visual grounding leads directly to task failure.

To address these limitations, recent work explores intermediate representations that provide stronger spatial grounding. Grounding masks, for example, offer fine-grained spatial guidance by highlighting target objects and placement areas, while also conveying shape and size information. The RoboGround framework leverages such masks generated by vision-language models to guide policy networks, demonstrating improved generalization in complex scenes with distractors. Similarly, COPL uses CLIP-based confidence maps to ground language instructions in visual space, enabling zero-shot object-level generalization in reinforcement learning by integrating object-grounded maps into policy inputs and reward functions.
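A CLIP-style confidence map of this kind can be pictured as a similarity field between a language embedding and per-patch visual features. The sketch below is illustrative only: it uses random NumPy vectors in place of real CLIP features, and the function name and tensor shapes are assumptions, not COPL's actual API.

```python
import numpy as np

def grounding_map(patch_embs: np.ndarray, text_emb: np.ndarray, tau: float = 0.07) -> np.ndarray:
    """Cosine-similarity map between per-patch image features (H, W, D) and a
    text embedding (D,), normalized into a spatial confidence map.
    Illustrative sketch only, not the COPL implementation."""
    patches = patch_embs / np.linalg.norm(patch_embs, axis=-1, keepdims=True)
    text = text_emb / np.linalg.norm(text_emb)
    sims = patches @ text                         # (H, W) cosine similarities
    weights = np.exp((sims - sims.max()) / tau)   # temperature-scaled softmax
    return weights / weights.sum()                # sums to 1 over the grid

# Toy example: a 4x4 grid of 8-dim patch features; the instruction embedding
# is a lightly perturbed copy of the target patch at cell (1, 2).
rng = np.random.default_rng(0)
patch_embs = rng.normal(size=(4, 4, 8))
text_emb = patch_embs[1, 2] + 0.05 * rng.normal(size=8)
conf = grounding_map(patch_embs, text_emb)
assert conf.argmax() == np.ravel_multi_index((1, 2), (4, 4))  # peak at target
```

A map like this can then be concatenated to the policy's visual input or used to shape a grasping reward, which is the role the grounded maps play in the COPL description above.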

Other approaches investigate mid-level representations—such as object-centric, pose-aware, and depth-aware features—to bridge perception and action with greater precision. These representations are shown to improve performance on dexterous, bimanual tasks by providing actionable geometric and spatial information that high-level language alone cannot convey. A mixture-of-experts architecture (Mid-Level MoE) that integrates multiple such representations achieves up to a $$24\%$$ higher success rate than standard diffusion policies, underscoring the value of structured, spatially-grounded supervision.
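The mixture-of-experts idea reduces to a gating step that softly weights several mid-level feature streams before they reach the policy head. The snippet below is a minimal NumPy sketch of that fusion, with invented names and dimensions; it is not the Mid-Level MoE implementation from the paper.

```python
import numpy as np

def moe_fuse(expert_feats: list[np.ndarray], gate_logits: np.ndarray) -> np.ndarray:
    """Softly combine mid-level expert features (e.g. object-centric,
    pose-aware, depth-aware) using softmax gating weights.
    Hypothetical sketch of the mixture-of-experts fusion step."""
    w = np.exp(gate_logits - gate_logits.max())
    w /= w.sum()                                # softmax over experts
    stacked = np.stack(expert_feats)            # (n_experts, feat_dim)
    return (w[:, None] * stacked).sum(axis=0)   # weighted feature fusion

# Three 5-dim expert features; the gate strongly favors the second expert,
# so the fused feature lands almost exactly on its output.
feats = [np.ones(5), 2 * np.ones(5), 3 * np.ones(5)]
fused = moe_fuse(feats, np.array([0.0, 5.0, 0.0]))
```

In a trained system the gate logits would come from a learned network conditioned on the observation and instruction, letting the policy emphasize whichever representation is most actionable for the current step.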

Nonetheless, a trade-off exists between sensitivity to these representations and robustness to noise; over-reliance can make policies vulnerable to spurious correlations in the representations themselves. Furthermore, current datasets often lack diversity in instructions and scene complexity, leading to overfitting and poor generalization. Efforts to generate high-complexity simulated data—such as 24K demonstrations with 112K diverse instructions and 3,526 object categories—are aimed at mitigating this issue and rigorously evaluating generalization.

In summary, while VLA models show promise for general-purpose manipulation, their current grounding mechanisms remain fragile, particularly in multi-object settings. Advances in spatially-grounded intermediate representations, combined with more diverse and complex training data, are critical for developing policies that generalize robustly across real-world conditions.

Generated Mar 7, 2026
Cerebras Thinking

This research investigates the generalization capabilities of vision-language action (VLA) policies, specifically focusing on the disconnect between high-level task success and low-level grounding fidelity. By employing a multi-object picking benchmark, the authors subject state-of-the-art models to rigorous distribution shifts—such as altering object positions and context—to determine whether policies are genuinely grounding language instructions to visual entities or merely exploiting spurious correlations within the training data. The study isolates the "grounding" component from the "skill" component, revealing that while these models often possess robust motor control for manipulation, their ability to correctly identify and interact with the specific object requested is surprisingly brittle.

The key insight presented is that many VLA policies rely heavily on object-location correlations rather than robust semantic grounding. When evaluated in standard environments, these models may achieve high success rates, giving the illusion of comprehension. However, under the authors' diagnostic protocol—which systematically varies object arrangements to break learned location priors—the performance of these models degrades significantly. This exposes a "Clever Hans" effect where the policy appears to understand language but is actually relying on environmental shortcuts, such as associating a specific object descriptor with a fixed spatial region.
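A diagnostic of this kind can be approximated by permuting object positions while holding the object set fixed, so that any policy keyed to training-time locations is exposed. The following is a hypothetical sketch of such a layout shuffle, not the authors' actual evaluation protocol.

```python
import random

def shuffle_layout(scene: dict[str, tuple[float, float]], seed: int = 0) -> dict:
    """Break object-location priors by permuting object positions while
    keeping the object set fixed. Illustrative diagnostic, assumed protocol."""
    rng = random.Random(seed)
    names = list(scene)
    positions = [scene[n] for n in names]
    rng.shuffle(positions)                 # permute locations only
    return dict(zip(names, positions))

# Same objects, new arrangement: a policy that learned "red_cube lives in the
# left region" will now reach for the wrong object under this layout.
scene = {"red_cube": (0.1, 0.2), "blue_ball": (0.4, 0.5), "green_cup": (0.7, 0.3)}
shuffled = shuffle_layout(scene)
```

Comparing per-object success before and after the shuffle separates grounding failures (wrong object grasped) from skill failures (correct object, failed grasp), which is the skill-versus-grounding decomposition the summary describes.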

These findings matter significantly for the advancement of embodied AI and reliable robotic deployment. As the field moves toward complex, open-world manipulation, the inability to generalize grounding beyond static training layouts poses a major safety and utility barrier. This work serves as a critical diagnostic tool, highlighting that standard evaluation metrics often mask fundamental failures in perception-language alignment. It underscores the necessity for future research to prioritize counterfactual and out-of-distribution training data that forces models to learn disentangled representations of objects, independent of their typical location or context.

Generated Mar 4, 2026
Open-Weights Reasoning

# Summary: Robust Skills, Brittle Grounding: Diagnosing Restricted Generalization in Vision-Language Action Policies via Multi-Object Picking

This paper critically evaluates the generalization capabilities of vision-language action policies in robotic manipulation, focusing on the task of multi-object picking. The authors uncover a significant limitation in current approaches: while these policies demonstrate robust motor control skills, their grounding in language and vision remains brittle, often relying on spurious correlations between object identities and spatial locations rather than truly understanding the task dynamics. Through controlled experiments, they show that models trained on object-location co-occurrences can perform well in-distribution but fail catastrophically when these correlations are disrupted, even for simple variations (e.g., object reordering or novel background contexts). This suggests that standard evaluations may overestimate the generalization ability of many state-of-the-art methods by conflating procedural skill with semantic understanding.

The paper's key contribution is a diagnostic framework for assessing the robustness of vision-language policies, emphasizing the need to disambiguate between procedural and grounded generalization. It introduces a suite of evaluation benchmarks where object-location correlations are systematically broken, forcing models to rely on true language understanding and perceptual grounding. The findings highlight systemic vulnerabilities in current architectures, particularly in how they integrate multimodal inputs during planning. This work is significant because it shifts the conversation from "how well models perform" to "how well they understand," offering a cautionary note for robotics applications where generalization is critical. The insights are particularly relevant for researchers developing embodied AI systems, as they underscore the necessity of evaluating both the skill execution and the underlying representations in vision-language policies.

Generated Mar 4, 2026