Presents a learning framework for contact-rich manipulation tasks with implicit, subjective success criteria like food prep, addressing challenges in evaluation and reward engineering.

How to Peel with a Knife: Aligning Fine-Grained Manipulation with Human Preference
Brave API

A recent study presents a two-stage learning framework for contact-rich manipulation tasks with implicit and subjective success criteria, such as food preparation, where traditional quantitative evaluation and reward engineering are difficult. Using knife-based peeling as a representative task, the approach first learns an initial policy through force-aware data collection and imitation learning, enabling generalization across variations in object shape, size, stiffness, and texture. This stage addresses the challenge of limited high-quality demonstration data by integrating visual, proprioceptive, and force sensing into a compact representation for policy learning.
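The multimodal fusion described above can be sketched minimally. The paper's actual encoders are learned networks, so the function below (a hypothetical name, using simple z-normalization and concatenation in place of learned encoders) is only an illustrative stand-in for combining visual, proprioceptive, and force features into one compact state vector:

```python
import random

def fuse_observations(visual_feat, proprio_feat, force_feat):
    """Concatenate per-modality features into one compact state vector.

    Illustrative stand-in for the paper's learned representation: each
    modality is z-normalized and concatenated. The real system would use
    trained encoders; this only shows the fusion structure.
    """
    def z_norm(v):
        mean = sum(v) / len(v)
        std = (sum((x - mean) ** 2 for x in v) / len(v)) ** 0.5 or 1.0
        return [(x - mean) / std for x in v]

    return z_norm(visual_feat) + z_norm(proprio_feat) + z_norm(force_feat)

random.seed(0)
state = fuse_observations(
    [random.random() for _ in range(8)],  # e.g. an image embedding
    [0.1, -0.3, 0.5],                     # joint positions
    [2.4, 0.0, -1.1],                     # wrist force/torque reading
)
print(len(state))  # 14
```

Normalizing each modality separately before concatenation keeps a high-magnitude signal (such as raw force readings) from dominating the fused state, which is one common motivation for this kind of design.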

In the second stage, the policy is refined using preference-based fine-tuning guided by a learned reward model that combines both quantitative metrics (e.g., peel thickness) and qualitative human feedback on aspects like smoothness, continuity, and efficiency. The reward model is trained offline on human preference annotations, including segment-level thickness scores and trajectory-level holistic quality ratings, allowing the system to align robotic behavior with human notions of task quality without requiring additional expert demonstrations.
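Learning a reward from pairwise human preferences is commonly done with a Bradley-Terry model, where the probability that trajectory a is preferred over b is a sigmoid of the reward difference. The sketch below fits a linear reward with plain gradient descent on toy features; the feature names and linear form are assumptions for illustration, not the paper's architecture:

```python
import math, random

def train_preference_reward(pairs, feat_dim, lr=0.1, epochs=200):
    """Fit a linear reward r(x) = w . x from pairwise preferences.

    Bradley-Terry model: P(a preferred over b) = sigmoid(r(a) - r(b)).
    `pairs` lists (feat_a, feat_b) where a was the preferred outcome.
    Illustrative stand-in, not the paper's reward-model architecture.
    """
    w = [0.0] * feat_dim
    for _ in range(epochs):
        for fa, fb in pairs:
            diff = [x - y for x, y in zip(fa, fb)]
            logit = sum(wi * di for wi, di in zip(w, diff))
            p = 1.0 / (1.0 + math.exp(-logit))
            # Gradient step on -log p: push preferred outcomes higher.
            w = [wi + lr * (1.0 - p) * di for wi, di in zip(w, diff)]
    return w

# Toy data: feature = [peel_thickness, smoothness]; annotators prefer
# thin, smooth peels (hypothetical features, chosen for illustration).
random.seed(1)
pairs = []
for _ in range(50):
    good = [random.uniform(0.5, 1.5), random.uniform(0.7, 1.0)]  # thin, smooth
    bad = [random.uniform(2.0, 4.0), random.uniform(0.0, 0.5)]   # thick, rough
    pairs.append((good, bad))

w = train_preference_reward(pairs, feat_dim=2)
print(w[0] < 0 and w[1] > 0)  # True: thinner and smoother score higher
```

The same logistic loss extends directly to the paper's mixed supervision: segment-level scores and trajectory-level ratings can both be converted into ranked pairs before training.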

Using only 50–200 real-world peeling trajectories, the system achieves over 90% average success rates on produce such as cucumbers, apples, and potatoes, with performance improving by up to 40% after preference-based fine-tuning. Notably, policies trained on one produce category show strong zero-shot generalization both to unseen instances within the same category and to out-of-distribution produce from different categories: a cucumber-trained policy achieves 50% success on zucchini, an apple-trained policy 90% on pear, and a potato-trained policy 80% on daikon radish. This demonstrates the framework’s ability to scale to diverse, real-world manipulation tasks characterized by continuous, subjective quality measures.

Generated Mar 7, 2026
Cerebras Thinking

This research addresses the complex challenge of contact-rich manipulation tasks where success criteria are implicit and subjective, using the specific example of peeling vegetables with a knife. Unlike traditional robotics problems that rely on clear geometric objectives, tasks like food preparation depend on nuanced human preferences regarding efficiency, waste minimization, and aesthetic quality. The authors present a learning framework designed to overcome the "reward engineering" bottleneck, where defining a mathematical reward function for such dexterous, messy interactions is notoriously difficult. Instead of relying on hardcoded heuristics, the approach leverages data-driven methods to infer what constitutes a successful execution directly from human behavior and feedback.

The key contribution of this work is a pipeline that aligns robotic policies with human preferences through a learned reward model. By collecting demonstrations and preference data, the system trains a reward function that captures the fine-grained subtleties of the task—such as the optimal angle of the knife or the thickness of the peel—which are then used to optimize the robot’s control policy via reinforcement learning. This allows the robot to adapt to the physical variability of the object (e.g., the shape of a potato or carrot) and the compliance required during tool use. The authors demonstrate that this preference-based alignment enables the robot to perform the task in a manner that humans rate favorably, compared to baselines that might optimize for speed but produce excessive waste or unsafe motions.
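One simple way a learned preference reward can drive policy improvement, without a hand-coded objective, is reward-weighted behavior cloning: trajectories are re-weighted by the exponentiated learned reward so the policy imitates preferred behavior more strongly. The sketch below shows only that weighting step, under assumed toy data (mean peel thickness as the learned reward); it is not the paper's exact fine-tuning update:

```python
import math

def finetune_weights(trajectories, reward_fn, temperature=1.0):
    """Softmax weights exp(r/T) over trajectories for reward-weighted
    behavior cloning -- a sketch of one common preference-based
    fine-tuning scheme, not the paper's exact policy update.
    """
    scores = [reward_fn(t) for t in trajectories]
    weights = [math.exp(s / temperature) for s in scores]
    total = sum(weights)
    return [v / total for v in weights]

# Toy learned reward: penalize mean peel thickness over a trajectory.
reward = lambda traj: -sum(traj) / len(traj)
trajs = [[1.0, 1.2, 0.9], [3.0, 2.8, 3.2]]  # thin-peel vs thick-peel run
weights = finetune_weights(trajs, reward)
print(weights[0] > weights[1])  # True: the thin-peel run is up-weighted
```

The temperature controls how sharply the fine-tuning concentrates on the highest-reward demonstrations; at low temperature this approaches simply filtering for the best trajectories.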

This work is significant because it bridges the gap between rigid industrial automation and the adaptable skills required for domestic service robots. By validating a method to handle subjective, contact-heavy tasks, it provides a scalable blueprint for training robots to perform unstructured chores that lack clear, programmable success metrics. The ability to align robot behavior with human judgment in high-dimensional manipulation spaces moves the field closer to general-purpose robots capable of seamlessly integrating into human-centric environments like kitchens.

Open-Weights Reasoning


Title: How to Peel with a Knife: Aligning Fine-Grained Manipulation with Human Preference

Summary: This work introduces a learning framework for contact-rich manipulation tasks, such as peeling fruits or trimming vegetables, where success is defined by implicit, subjective criteria (e.g., "peeled cleanly" or "even slices"). Traditional robotic manipulation approaches struggle with such tasks because precise evaluation metrics and reward functions are hard to define, often relying on hand-engineered heuristics or sparse, binary signals (e.g., success/failure). The paper addresses this gap by integrating human feedback into the training loop, leveraging preference modeling to align robotic actions with human judgments of task quality. The proposed method combines offline reinforcement learning (RL) with preference-based optimization, enabling the robot to learn nuanced, fine-grained manipulation policies from demonstrations and human preference feedback, even when explicit reward shaping is infeasible.

Key Contributions & Insights:

1. Alignment with Human Preferences: The framework explicitly models human subjective evaluations (e.g., "well-peeled" vs. "poorly peeled") as a surrogate for reward, bypassing the need for hand-crafted metrics. This is achieved by training a preference model on human rankings of task outcomes, which then guides policy optimization.
2. Robustness to Contact-Rich Tasks: By embedding contact dynamics into the learning process (e.g., via subroutines for slicing or peeling), the approach handles the inherent uncertainty and variability in tasks like food preparation, where tool interactions are critical.
3. Scalability and Generalization: The method demonstrates generalizability across diverse manipulation tasks (e.g., peeling, trimming) and objects (e.g., apples, potatoes), suggesting potential for broader applications in domestic or service robotics.

Why It Matters: This work addresses a critical bottleneck in robotic manipulation: the difficulty of specifying success in tasks where human judgment is inherently subjective. By formalizing human feedback as a reward signal, the paper bridges the gap between robotic autonomy and real-world applicability, particularly in unstructured environments like kitchens. The implications extend beyond food prep to other contact-rich tasks (e.g., crafting, assembly) where traditional RL or imitation learning falls short. The approach also highlights the value of hybrid learning systems that combine offline data, learned sub-skills, and human-in-the-loop feedback—a promising direction for developing more adaptive and intuitive robotic agents.

Source: [arXiv:2603.03280](https://arxiv.org/abs/2603.03280)
