HoMMI enables whole-body mobile manipulation learning from robot-free human demos via egocentric sensing, tackling the human-to-robot embodiment gap. AI research relevance: advances imitation learning scalability for mobile robotics.

HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations
Brave API

There is no information available about a system called "HoMMI" in the provided context. The query appears to combine elements from different research projects. However, related frameworks such as HuMI (Humanoid Manipulation Interface) and EgoMI (Egocentric Manipulation Interface) address similar challenges in imitation learning for whole-body manipulation.

HuMI presents a robot-free demonstration system for humanoid whole-body manipulation, enabling data collection with portable hardware such as sensorized grippers and wearable pose trackers, without requiring the physical presence of a robot. This framework captures full-body human motion, including the hands, feet, and pelvis, and uses inverse kinematics to translate it into feasible robot trajectories, addressing the embodiment gap between human demonstrators and humanoid robots. HuMI demonstrates a 3x increase in data collection efficiency compared to teleoperation and achieves a 70% success rate in unseen environments across tasks such as kneeling, squatting, tossing, and bimanual manipulation.
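The inverse-kinematics retargeting step that such a pipeline relies on can be illustrated with a toy damped-least-squares solver for a planar two-link arm; this is a minimal sketch under assumed link lengths, not HuMI's actual full-body solver:

```python
import numpy as np

L1, L2 = 0.4, 0.3  # assumed link lengths (illustrative, not from the paper)

def fk(q):
    """Forward kinematics of a planar 2-link arm: joint angles -> end-effector xy."""
    return np.array([L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),
                     L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])])

def jacobian(q):
    """Jacobian of fk with respect to the two joint angles."""
    s1, s12 = np.sin(q[0]), np.sin(q[0] + q[1])
    c1, c12 = np.cos(q[0]), np.cos(q[0] + q[1])
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [ L1 * c1 + L2 * c12,  L2 * c12]])

def retarget(target_xy, q, iters=100, damping=1e-2):
    """Damped-least-squares IK: pull the arm toward a demonstrated wrist position."""
    for _ in range(iters):
        err = target_xy - fk(q)
        J = jacobian(q)
        # dq = (J^T J + lambda I)^-1 J^T err  (damped pseudo-inverse update)
        q = q + np.linalg.solve(J.T @ J + damping * np.eye(2), J.T @ err)
    return q

# Map one demonstrated wrist position onto the robot arm's joint space.
q = retarget(np.array([0.3, 0.4]), np.array([0.1, 0.1]))
```

The damping term keeps the update stable near singular configurations (e.g. a fully extended arm), which matters when human poses sit at the edge of the robot's workspace.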

EgoMI, on the other hand, focuses on egocentric sensing by capturing synchronized head, hand, and visual data during human demonstrations, which is then used to train policies for semi-humanoid robots with actuated heads. It emphasizes active vision and spatial memory to handle rapid viewpoint changes, enabling robust imitation learning without relying on visual augmentation or robot-collected data. Policies trained with head motion and egocentric observation outperform those without, showing the importance of perceptual alignment in closing the embodiment gap.
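The synchronization such egocentric capture implies can be sketched as aligning a higher-rate pose stream to camera frame timestamps by linear interpolation; the sensor rates and the toy trajectory below are illustrative assumptions, not EgoMI's specification:

```python
import numpy as np

def sync_to_frames(frame_ts, stream_ts, stream_vals):
    """Interpolate a higher-rate sensor stream (e.g. head pose) onto camera
    frame timestamps, so each egocentric image gets a time-aligned pose."""
    stream_vals = np.asarray(stream_vals, dtype=float)
    # np.interp is 1-D, so interpolate each pose dimension separately.
    cols = [np.interp(frame_ts, stream_ts, stream_vals[:, d])
            for d in range(stream_vals.shape[1])]
    return np.stack(cols, axis=1)

# 30 Hz camera, 100 Hz head tracker (illustrative rates).
frame_ts = np.arange(30) / 30.0
imu_ts = np.arange(100) / 100.0
# Toy head position: x drifts linearly with time, y fixed, z at head height.
head_xyz = np.stack([imu_ts, np.zeros(100), np.full(100, 1.5)], axis=1)

head_at_frames = sync_to_frames(frame_ts, imu_ts, head_xyz)
```

Each of the 30 camera frames now carries an interpolated head pose, the pairing a policy needs in order to associate viewpoint motion with what the camera saw.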

While neither system is named "HoMMI," both represent significant advances in imitation learning for mobile and whole-body manipulation, leveraging egocentric sensing and robot-free demonstrations to improve generalization and scalability.

Generated Mar 7, 2026
Cerebras Thinking

HoMMI introduces a framework for learning whole-body mobile manipulation policies directly from human demonstrations captured via egocentric sensing, eliminating the need for robot-specific teleoperation. The research tackles the significant "human-to-robot embodiment gap," where the kinematics and dynamics of a human differ drastically from those of a mobile manipulator. By utilizing a head-mounted camera and motion capture data, the system records human operators performing tasks such as object retrieval and navigation, providing a rich source of training data that decouples the data collection process from the physical constraints of the robot hardware.

The core technical contribution lies in a novel retargeting and imitation learning pipeline that translates human motion into viable robot commands. This involves mapping the human's base locomotion to the robot's odometry and the human's arm movements to the robot's end-effector trajectories, ensuring coordinated whole-body motion. The approach demonstrates that policies trained on this retargeted data can successfully generalize to real-world environments, enabling the robot to perform complex tasks that require tight coupling between navigation and manipulation, such as opening doors or moving objects while repositioning the base.
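The base-locomotion half of such a retargeting pipeline can be sketched as finite-differencing a demonstrated pelvis path into body-frame velocity commands for the mobile base; this is a simplified planar (SE(2)) sketch under assumed conventions, not HoMMI's actual mapping:

```python
import numpy as np

def base_commands_from_pelvis(xy, yaw, dt):
    """Finite-difference a demonstrated pelvis path (world-frame xy + heading)
    into body-frame (v_x, v_y, omega) velocity commands for the robot base."""
    cmds = []
    for t in range(len(xy) - 1):
        dx, dy = xy[t + 1] - xy[t]
        c, s = np.cos(yaw[t]), np.sin(yaw[t])
        # Rotate the world-frame displacement into the robot's body frame.
        vx, vy = (c * dx + s * dy) / dt, (-s * dx + c * dy) / dt
        # Wrap the heading change to [-pi, pi) before differencing.
        dyaw = (yaw[t + 1] - yaw[t] + np.pi) % (2 * np.pi) - np.pi
        cmds.append([vx, vy, dyaw / dt])
    return np.array(cmds)

# Toy demo: a straight walk along +x at 0.5 m/s with constant heading.
dt = 0.1
ts = np.arange(20) * dt
xy = np.stack([0.5 * ts, np.zeros_like(ts)], axis=1)
yaw = np.zeros_like(ts)
cmds = base_commands_from_pelvis(xy, yaw, dt)
```

Expressing velocities in the body frame is what lets the same demonstrated walk transfer to a base whose odometry is commanded relative to its own heading rather than the world.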

This work is significant because it substantially lowers the barrier to collecting high-quality training data for mobile manipulation, a domain that has traditionally suffered from data scarcity due to the difficulty of teleoperating mobile bases and arms simultaneously. By enabling scalable data collection through "robot-free" human demos, HoMMI advances the state of imitation learning, offering a path toward more generalist service robots capable of operating in unstructured human environments.

Generated Mar 7, 2026
Open-Weights Reasoning

# HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations

HoMMI introduces a novel framework for learning whole-body mobile manipulation from human demonstrations, bridging the gap between human embodiment and robotics through egocentric sensing. Unlike traditional approaches that rely on robot-specific demonstrations, HoMMI leverages human-performed tasks captured via wearable sensors (e.g., IMUs and cameras), enabling scalable imitation learning without the need for robot teleoperation. The method employs a hierarchical approach: a base policy handles whole-body locomotion, while a manipulation policy focuses on fine-grained arm/hand control, with a shared cross-modal embedder unifying egocentric observations across modalities. By decoupling locomotion and manipulation, HoMMI achieves robust performance on complex tasks such as navigation, object retrieval, and manipulation in cluttered environments.
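The hierarchical split described above, two policy heads over a shared observation embedding, can be sketched in miniature; the layer sizes, tanh blocks, and head outputs below are illustrative assumptions, not HoMMI's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

class Linear:
    """Minimal tanh dense layer standing in for a learned network block."""
    def __init__(self, n_in, n_out):
        self.W = rng.standard_normal((n_in, n_out)) * 0.1
        self.b = np.zeros(n_out)
    def __call__(self, x):
        return np.tanh(x @ self.W + self.b)

class HierarchicalPolicy:
    """A shared embedder feeding separate locomotion and manipulation heads."""
    def __init__(self, obs_dim=64, emb_dim=32, base_dim=3, arm_dim=7):
        self.embed = Linear(obs_dim, emb_dim)       # shared cross-modal embedding
        self.base_head = Linear(emb_dim, base_dim)  # base command (v_x, v_y, omega)
        self.arm_head = Linear(emb_dim, arm_dim)    # arm joint targets
    def act(self, obs):
        z = self.embed(obs)                         # one embedding, two heads
        return self.base_head(z), self.arm_head(z)

policy = HierarchicalPolicy()
base_cmd, arm_cmd = policy.act(rng.standard_normal(64))
```

The design point is that both heads read the same embedding, so locomotion and manipulation stay coordinated even though they are decoded separately.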

The key contribution of HoMMI lies in its ability to generalize human demonstrations to diverse mobile manipulation tasks, addressing the scalability challenges in imitation learning for robotics. The framework demonstrates strong zero-shot transfer capabilities across different robot morphologies, including quadrupedal and wheeled platforms, by leveraging a unified state representation derived from human motion data. This work is particularly relevant to the AI research community as it advances the frontier of embodied AI, offering a more efficient and flexible alternative to costly robot-specific data collection. By enabling learning from human demonstrations, HoMMI paves the way for more intuitive and adaptable mobile manipulation systems, with implications for robotics in logistics, search-and-rescue, and assistive applications.

Source: [arXiv:2603.03243](https://arxiv.org/abs/2603.03243)

Generated Mar 7, 2026