Dynamic Bipedal Loco-manipulation using Oracle Guided Multi-mode Policies with Mode-transition Preference

Dynamic Robotics and Control Laboratory, University of Southern California

One Oracle, Same Reward Weights, Different Robots!

Abstract

Loco-manipulation calls for effective whole-body control and contact-rich interaction with the object and the environment. Existing learning-based control frameworks rely on task-specific engineered rewards, training a set of low-level skill policies and explicitly switching between them with a high-level policy or FSM, which leads to quasi-static and fragile skill transitions. In contrast, to solve a highly dynamic task such as soccer, the robot should run towards the ball, decelerate into an optimal approach configuration, seamlessly switch to dribbling, and eventually score a goal: a continuum of smooth motion. To this end, we propose to learn a single Oracle Guided Multi-mode Policy (OGMP) that masters all the required modes and transition maneuvers to solve uni-object bipedal loco-manipulation tasks. Specifically, we design a multi-mode oracle as a closed-loop state-reference generator, viewing it as a hybrid automaton with continuous reference-generating dynamics and discrete mode jumps. Given such an oracle, we train an OGMP through bounded exploration around the generated reference. Furthermore, to enforce the desired sequence of mode transitions, we present a novel task-agnostic mode-switching preference reward that enhances performance. The proposed approach achieves successful dynamic loco-manipulation in omnidirectional soccer and box-moving tasks with the 16-DoF bipedal robot HECTOR.

Approach

Overview of the framework

Oracle Guided Policy Optimization


For a task \( \mathcal{T} \), given environment feedback \( \lambda_t \), the oracle generates a finite-horizon reference trajectory from a queried state \( x_t \) that lies within the \( \epsilon \)-neighbourhood of the optimal trajectory \( x^*_t \):

\[ X^{\Xi}_t:= x^{\Xi}[t,\,t+t_H] = \Xi(\lambda_t) \\ \textrm{s.t.} \quad \|x^{\Xi}_t - x^*_{t}\|_W < \epsilon \quad \forall t \in [0, \infty) \]

Given the task objective \( J_{\mathcal{T}} \), oracle-guided policy optimization is performed as:

\[ \pi^* := \arg \max_{\pi \in \Pi} J_{\mathcal{T}} \\ \textrm{s.t.} \quad \|x^{\pi}_t - x^{\Xi}_t\|_W < \rho \quad \forall t \in [0, \infty) \]

where \( x^{\pi}_t \) is the state trajectory induced by the policy \( \pi \), \( \rho \) is the maximum permissible deviation from the oracle reference \( x^{\Xi}_t \), and \( W \) is a user-defined weight matrix.

By restricting exploration to this bounded region around the reference, the policy escapes poor local optima and is guided toward the global optimum. As visualized in the video, the reference (green box) is linearly interpolated from the robot's current CoM state.
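To make the bounded-exploration constraint concrete, below is a minimal Python sketch (with hypothetical names and shapes; not the released implementation) of the weighted-norm check and the interpolated reference described above: an episode is terminated whenever the policy's state drifts farther than \( \rho \) from the oracle reference under the \( W \)-norm.

```python
import numpy as np

def weighted_deviation(x_pi, x_ref, W):
    """||x_pi - x_ref||_W = sqrt((x_pi - x_ref)^T W (x_pi - x_ref))."""
    e = x_pi - x_ref
    return float(np.sqrt(e @ W @ e))

def oracle_reference(x_t, x_target, horizon):
    """Hypothetical reference generator: linearly interpolate from the
    current CoM state toward the mode target over the preview horizon."""
    alphas = np.linspace(0.0, 1.0, horizon)
    return np.stack([(1 - a) * x_t + a * x_target for a in alphas])

def should_terminate(x_pi_t, x_ref_t, W, rho):
    """Bounded exploration: end the episode if the policy strays farther
    than rho (in the W-norm) from the oracle reference."""
    return weighted_deviation(x_pi_t, x_ref_t, W) > rho
```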

Designing Oracles - A Hybrid Automata Perspective

Multi-mode oracle design (left: hybrid automaton; right: soccer oracle)

The oracle generates continuous references in the state space \(\mathcal{X}\) and jumps between discrete modes \(m\in\mathcal{M}\) through switches in \(\mathcal{S}\), the set of permissible transitions. To design an oracle for a task \(\mathcal{T}\), the user chooses the relevant \(\mathcal{M}\), \(\mathcal{X}\), the reference-generating dynamics \(f\), and \(\mathcal{S}\), as shown in the hybrid automaton (image, top-left). These design choices can be coarse, merely guiding the optimization as an ansatz rather than offering a high-fidelity solution to the task. For instance, in uni-object loco-manipulation we can define an oracle with three modes: reach, manipulate, and detach. Using the same oracle, we synthesize a single multi-mode policy per task: soccer (image, top-right) and box-moving (bottom video).
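As a sketch of this hybrid-automaton view, the snippet below encodes the three modes, a permissible-transition set \(\mathcal{S}\), simple guard conditions, and linear-interpolation reference dynamics. The mode names follow the text; the guards, targets, and dynamics are simplified assumptions for illustration, not the paper's exact design.

```python
import numpy as np
from enum import IntEnum

class Mode(IntEnum):   # discrete modes M; the integer values double as ranks
    REACH = 0
    MANIPULATE = 1
    DETACH = 2

# Permissible transition set S: only forward switches are allowed here.
S = {(Mode.REACH, Mode.MANIPULATE), (Mode.MANIPULATE, Mode.DETACH)}

class MultiModeOracle:
    def __init__(self, horizon=50, contact_radius=0.3):
        self.mode = Mode.REACH
        self.horizon = horizon
        self.contact_radius = contact_radius

    def _guard(self, robot_com, obj_pos, goal_pos):
        """Simplified guard conditions that trigger discrete mode jumps."""
        if self.mode == Mode.REACH and np.linalg.norm(robot_com - obj_pos) < self.contact_radius:
            return Mode.MANIPULATE
        if self.mode == Mode.MANIPULATE and np.linalg.norm(obj_pos - goal_pos) < self.contact_radius:
            return Mode.DETACH
        return self.mode

    def query(self, robot_com, obj_pos, goal_pos):
        """Closed-loop reference generation: jump modes if a guard fires and
        the switch is in S, then roll out the mode's reference dynamics f."""
        next_mode = self._guard(robot_com, obj_pos, goal_pos)
        if next_mode != self.mode and (self.mode, next_mode) in S:
            self.mode = next_mode
        target = {Mode.REACH: obj_pos,
                  Mode.MANIPULATE: goal_pos,
                  Mode.DETACH: robot_com}[self.mode]
        alphas = np.linspace(0.0, 1.0, self.horizon)
        return self.mode, np.stack([(1 - a) * robot_com + a * target for a in alphas])
```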

Mode-Switching Preference Reward

Policies can converge to undesirable mode transitions by exploiting the coupled environment-oracle dynamics. For instance, switching from manipulate back to reach in soccer is undesirable (adjacent video, top), since we intend to learn reactive dribbling behaviors. To this end, we define mode ranks and a task-agnostic preference reward that penalizes transitions against the user-preferred direction (adjacent video, bottom): \[ r_{\text{pref}} = \begin{cases} -1 & \text{rank}(m_t) < \underset{0\leq \tau \leq t}{\max} \text{rank}(m_\tau)\\ 0 & \text{otherwise} \end{cases} \]
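A minimal sketch of this preference reward follows, assuming integer ranks reach = 0, manipulate = 1, detach = 2 in line with the preferred ordering (the rank assignment is an illustrative assumption):

```python
def preference_reward(mode_ranks):
    """Task-agnostic mode-switching preference reward (sketch): return -1
    whenever the current mode's rank drops below the highest rank reached
    so far in the episode (a backward transition), and 0 otherwise."""
    current = mode_ranks[-1]
    best_so_far = max(mode_ranks)
    return -1.0 if current < best_so_far else 0.0

# Example with ranks reach=0, manipulate=1, detach=2:
print(preference_reward([0, 1, 1]))  # 0.0  forward / steady transitions
print(preference_reward([0, 1, 0]))  # -1.0 manipulate -> reach is penalized
```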

Related Links

For the theory behind oracle-guided policy optimization, refer to the paper OGMP: Oracle Guided Multi-mode Policies for Agile and Versatile Robot Control.

BibTeX

@misc{ravichandar2024dynamicbipedallocomanipulationusing,
      title={Dynamic Bipedal Loco-manipulation using Oracle Guided Multi-mode Policies with Mode-transition Preference}, 
      author={Prashanth Ravichandar and Lokesh Krishna and Nikhil Sobanbabu and Quan Nguyen},
      year={2024},
      eprint={2410.01030},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2410.01030}, 
}