Dynamic Bipedal Loco-manipulation using Oracle Guided Multi-mode Policies with Mode-transition Preference

Dynamic Robotics and Control Laboratory, University of Southern California

One Oracle, Same Reward Weights, Different Robots!

Abstract

Loco-manipulation calls for effective whole-body control and contact-rich interaction with the object and the environment. Existing learning-based control frameworks rely on task-specific engineered rewards, training a set of low-level skill policies and explicitly switching between them with a high-level policy or FSM, which leads to quasi-static and fragile skill transitions. In contrast, to solve a highly dynamic task such as soccer, the robot should run towards the ball, decelerate into an optimal approach configuration, seamlessly switch to dribbling, and eventually score a goal: a continuum of smooth motion. To this end, we propose to learn a single Oracle Guided Multi-mode Policy (OGMP) that masters all the required modes and transition maneuvers to solve uni-object bipedal loco-manipulation tasks. Specifically, we design a multi-mode oracle as a closed-loop state-reference generator, viewing it as a hybrid automaton with continuous reference-generating dynamics and discrete mode jumps. Given such an oracle, we train an OGMP through bounded exploration around the generated reference. Furthermore, to enforce the desired sequence of mode transitions, we present a novel task-agnostic mode-switching preference reward that enhances performance. The proposed approach achieves successful dynamic loco-manipulation in omnidirectional soccer and box-moving tasks with the 16-DoF bipedal robot HECTOR.

Approach

Overview of the framework

Oracle Guided Policy Optimization


For a task \( \mathcal{T} \), given environment feedback \( \lambda_t \), the oracle generates a finite-horizon reference trajectory from a queried state \( x_t \) that lies within the \( \epsilon \)-neighbourhood of the optimal trajectory \( x^*_t \):

\[ X^{\Xi}_t:= x^{\Xi}[t,\,t+t_H] = \Xi(\lambda_t) \\ \textrm{s.t.} \quad \|x^{\Xi}_t - x^*_{t}\|_W < \epsilon \quad \forall t \in [0, \infty) \]

Given the task objective \( J_{\mathcal{T}} \), oracle-guided policy optimization is performed as:

\[ \pi^* := \arg \max_{\pi \in \Pi} J_{\mathcal{T}} \\ \textrm{s.t.} \quad \|x^{\pi}_t - x^{\Xi}_t\|_W < \rho \quad \forall t \in [0, \infty) \]

where \( x^{\pi}_t \) is the state trajectory induced by the policy \( \pi \), \( \rho \) is the maximum permissible deviation from the oracle reference \( x^{\Xi}_t \), and \( W \) is a user-defined weight matrix.

By restricting exploration to this bounded region around the reference, the policy escapes poor local optima and is guided toward the global optimum. As visualized in the video, the reference (green box) is linearly interpolated from the robot's current CoM state.
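To make the bounded-exploration constraint concrete, below is a minimal Python sketch (with hypothetical names and shapes; not the released implementation) of the weighted-norm check and the interpolated reference described above: an episode is terminated whenever the policy's state drifts farther than \( \rho \) from the oracle reference under the \( W \)-norm.

```python
import numpy as np

def weighted_deviation(x_pi, x_ref, W):
    """||x_pi - x_ref||_W = sqrt((x_pi - x_ref)^T W (x_pi - x_ref))."""
    e = x_pi - x_ref
    return float(np.sqrt(e @ W @ e))

def oracle_reference(x_t, x_target, horizon):
    """Hypothetical reference generator: linearly interpolate from the
    current CoM state toward the mode target over the preview horizon."""
    alphas = np.linspace(0.0, 1.0, horizon)
    return np.stack([(1 - a) * x_t + a * x_target for a in alphas])

def should_terminate(x_pi_t, x_ref_t, W, rho):
    """Bounded exploration: end the episode if the policy strays farther
    than rho (in the W-norm) from the oracle reference."""
    return weighted_deviation(x_pi_t, x_ref_t, W) > rho
```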

Designing Oracles - A Hybrid Automata Perspective

Multi-mode oracle design (left: hybrid automaton; right: soccer oracle)

The oracle generates continuous references in the state space \(\mathcal{X}\) and jumps between discrete modes \(m\in\mathcal{M}\) through switches in \(\mathcal{S}\), the set of permissible transitions. To design an oracle for a task \(\mathcal{T}\), the user chooses the relevant \(\mathcal{M}\), \(\mathcal{X}\), the reference-generating dynamics \(f\), and \(\mathcal{S}\), as shown in the hybrid automaton (image, top-left). These design choices can be coarse, merely guiding the optimization as an ansatz rather than offering a high-fidelity solution to the task. For instance, in uni-object loco-manipulation we can define an oracle with three modes: reach, manipulate, and detach. Using the same oracle, we synthesize a single multi-mode policy per task: soccer (image, top-right) and box-moving (bottom video).
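As a sketch of this hybrid-automaton view, the snippet below encodes the three modes, a permissible-transition set \(\mathcal{S}\), simple guard conditions, and linear-interpolation reference dynamics. The mode names follow the text; the guards, targets, and dynamics are simplified assumptions for illustration, not the paper's exact design.

```python
import numpy as np
from enum import IntEnum

class Mode(IntEnum):   # discrete modes M; the integer values double as ranks
    REACH = 0
    MANIPULATE = 1
    DETACH = 2

# Permissible transition set S: only forward switches are allowed here.
S = {(Mode.REACH, Mode.MANIPULATE), (Mode.MANIPULATE, Mode.DETACH)}

class MultiModeOracle:
    def __init__(self, horizon=50, contact_radius=0.3):
        self.mode = Mode.REACH
        self.horizon = horizon
        self.contact_radius = contact_radius

    def _guard(self, robot_com, obj_pos, goal_pos):
        """Simplified guard conditions that trigger discrete mode jumps."""
        if self.mode == Mode.REACH and np.linalg.norm(robot_com - obj_pos) < self.contact_radius:
            return Mode.MANIPULATE
        if self.mode == Mode.MANIPULATE and np.linalg.norm(obj_pos - goal_pos) < self.contact_radius:
            return Mode.DETACH
        return self.mode

    def query(self, robot_com, obj_pos, goal_pos):
        """Closed-loop reference generation: jump modes if a guard fires and
        the switch is in S, then roll out the mode's reference dynamics f."""
        next_mode = self._guard(robot_com, obj_pos, goal_pos)
        if next_mode != self.mode and (self.mode, next_mode) in S:
            self.mode = next_mode
        target = {Mode.REACH: obj_pos,
                  Mode.MANIPULATE: goal_pos,
                  Mode.DETACH: robot_com}[self.mode]
        alphas = np.linspace(0.0, 1.0, self.horizon)
        return self.mode, np.stack([(1 - a) * robot_com + a * target for a in alphas])
```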

Mode-Switching Preference Reward

Policies can converge to undesirable mode transitions by exploiting the coupled environment-oracle dynamics. For instance, switching from manipulate back to reach in soccer is undesirable (adjacent video, top), since we intend to learn reactive dribbling behaviors. To this end, we define mode ranks and a task-agnostic preference reward that penalizes transitions against the user-preferred direction (adjacent video, bottom): \[ r_{\text{pref}} = \begin{cases} -1 & \text{rank}(m_t) < \underset{0\leq \tau \leq t}{\max} \text{rank}(m_\tau)\\ 0 & \text{otherwise} \end{cases} \]
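A minimal sketch of this preference reward follows, assuming integer ranks reach = 0, manipulate = 1, detach = 2 in line with the preferred ordering (the rank assignment is an illustrative assumption):

```python
def preference_reward(mode_ranks):
    """Task-agnostic mode-switching preference reward (sketch): return -1
    whenever the current mode's rank drops below the highest rank reached
    so far in the episode (a backward transition), and 0 otherwise."""
    current = mode_ranks[-1]
    best_so_far = max(mode_ranks)
    return -1.0 if current < best_so_far else 0.0

# Example with ranks reach=0, manipulate=1, detach=2:
print(preference_reward([0, 1, 1]))  # 0.0  forward / steady transitions
print(preference_reward([0, 1, 0]))  # -1.0 manipulate -> reach is penalized
```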

Related Links

For the theory behind oracle-guided policy optimization, refer to the paper OGMP: Oracle Guided Multi-mode Policies for Agile and Versatile Robot Control.

BibTeX

@misc{ravichandar2024dynamicbipedallocomanipulationusing,
      title={Dynamic Bipedal Loco-manipulation using Oracle Guided Multi-mode Policies with Mode-transition Preference}, 
      author={Prashanth Ravichandar and Lokesh Krishna and Nikhil Sobanbabu and Quan Nguyen},
      year={2024},
      eprint={2410.01030},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2410.01030}, 
}