CoRL 2023 · robotics · reinforcement-learning · legged-locomotion · sim-to-real

Robot Parkour Learning

Ziwen Zhuang et al.
TL;DR

A single neural policy, trained end-to-end with reinforcement learning, enables a low-cost quadruped to climb high obstacles, leap over gaps, crawl under low bars, and squeeze through tilted slits, all from depth images and proprioception, with no motion capture and no heuristic gait library.

Why this paper matters

Quadruped locomotion used to rely on layered stacks: a planner picks footholds, a model-predictive controller tracks the plan, a low-level PID loop tracks the controller. Each layer is brittle when the environment stops looking like the one the designer imagined.

Robot Parkour Learning goes in the other direction. One policy, one reward, end-to-end RL, depth cameras in and torques out. The robot learns to run at obstacles, climb them, leap gaps, and crawl under bars — all without motion capture, all from onboard sensing. It's one of the cleanest demonstrations that agile whole-body behavior can be produced by a single learned controller if you stage the training carefully.

What they actually did

  • Hardware. An off-the-shelf low-cost Unitree A1 / Go1, plus a forward-facing depth camera. No custom actuators, no mocap.
  • Skills. Five: climb high obstacles (up to 0.4m — 1.5× the robot's height), leap gaps (0.6m — 1.5× body length), crawl under low bars, squeeze through tilted slits, and run.
  • Training pipeline (two stages):
    1. RL pre-training with a privileged expert. In simulation, give the policy access to oracle environment info (exact geometry, contact states) and train each skill with task-specific rewards and a curriculum that grows obstacle difficulty over time.
    2. Distillation to a vision-based student. Distill the five expert policies into a single policy that only sees depth images and proprioception — the kind of observations the real robot actually has. This is where sim-to-real transfer happens.
  • Sim-to-real. Heavy randomization over mass, friction, actuator lag, sensor noise, and obstacle geometry. Deploy the distilled policy directly on hardware — no real-world fine-tuning.
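The distillation stage above can be sketched as a supervised loop: roll out in randomized simulation, query the privileged expert for target actions, and regress the vision-based student onto them. This is a minimal toy sketch, not the authors' code; `expert_action` and `LinearStudent` are hypothetical stand-ins (the paper uses neural networks and a DAgger-style on-policy scheme), and domain randomization is reduced here to observation noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_action(privileged_obs):
    # Hypothetical privileged expert: it sees oracle environment info
    # (exact geometry, contact states) and outputs 12 joint targets.
    return np.tanh(privileged_obs[:12])

class LinearStudent:
    """Toy student mapping (depth features + proprioception) to actions."""
    def __init__(self, obs_dim, act_dim, lr=1e-2):
        self.W = np.zeros((act_dim, obs_dim))
        self.lr = lr

    def act(self, obs):
        return self.W @ obs

    def update(self, obs, target):
        # Behavior-cloning step: regress the student's action onto the
        # expert's action for the same underlying state.
        err = self.act(obs) - target
        self.W -= self.lr * np.outer(err, obs)
        return float(np.mean(err ** 2))

student = LinearStudent(obs_dim=32, act_dim=12)
losses = []
for step in range(500):
    # Domain randomization (mass, friction, actuator lag, sensor noise)
    # is sketched here as noise injected into the student's observation:
    # the student never sees the clean privileged state.
    privileged_obs = rng.normal(size=32)
    student_obs = privileged_obs + rng.normal(scale=0.1, size=32)
    target = expert_action(privileged_obs)
    losses.append(student.update(student_obs, target))

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

The point of the structure, not the toy model: the expert only has to solve the control problem, and the student only has to solve the perception problem, which is why the two-stage recipe beats end-to-end RL from vision.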

Key findings

  1. A single vision-conditioned policy can switch between five qualitatively different skills based purely on what it sees, with no explicit skill selector.
  2. Two-stage RL-then-distill beats end-to-end RL from vision. Training with privileged information first lets the expert actually learn the skill; then distillation compresses that knowledge into a policy that works from noisy depth.
  3. Sim-to-real without real-world data works at this scale of agility. The robot performs parkour-style maneuvers outdoors and indoors on hardware that was never seen during training.
  4. Low-cost hardware is enough. The result doesn't require exotic actuators. The bottleneck was the controller, not the robot.

Caveats worth remembering

  • Reward engineering is still doing work. Each skill has hand-crafted rewards and curricula. The paper shows RL can produce parkour given a good training setup — it doesn't show RL inventing that setup on its own.
  • The environment is still structured. Obstacles are boxes, gaps, bars, slits. Truly unstructured terrain (mud, loose rocks, rubble) is not in-distribution.
  • No long-horizon reasoning. The policy is reactive. It sees the next obstacle and decides how to cross it; it does not plan a route across many obstacles or reason about which path is faster.
  • Safety. Failures on real hardware are expensive. The paper sidesteps this by training purely in sim — so anyone reproducing the work pays a different cost: good simulation and good randomization.

Where it fits in physical AI

Parkour Learning sits on one end of the modern legged-locomotion spectrum: RL-end-to-end, distilled to vision, sim-to-real. On the other end sit MPC-with-learned-residuals approaches. The honest read is that neither has won — each wins on different axes (sample efficiency, interpretability, agility, safety).

For practitioners, the useful takeaways are the two-stage recipe (privileged expert → vision student) and the curriculum design. Both are now standard patterns across follow-up work on humanoids, manipulation, and multi-skill mobile robots — not just quadrupeds.
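A minimal sketch of what such a curriculum can look like, in the spirit of the stage-1 training: obstacle difficulty (e.g. climb height) ratchets up whenever the policy's recent success rate clears a threshold. The class name, thresholds, and step sizes here are illustrative assumptions, not the paper's values; only the 0.4 m ceiling comes from the paper.

```python
from collections import deque

class ObstacleCurriculum:
    """Adaptive difficulty: promote on high success rate, demote on low."""
    def __init__(self, start=0.05, max_height=0.40, step=0.05,
                 promote_at=0.8, demote_at=0.3, window=100):
        self.height = start            # current obstacle height (m)
        self.max_height = max_height   # hardest setting (0.4 m climb)
        self.step = step
        self.promote_at = promote_at
        self.demote_at = demote_at
        self.results = deque(maxlen=window)

    def record(self, success: bool):
        self.results.append(success)
        if len(self.results) < self.results.maxlen:
            return                     # not enough evidence yet
        rate = sum(self.results) / len(self.results)
        if rate >= self.promote_at and self.height < self.max_height:
            self.height = min(self.height + self.step, self.max_height)
            self.results.clear()       # re-estimate at the new difficulty
        elif rate <= self.demote_at and self.height > self.step:
            self.height -= self.step
            self.results.clear()

cur = ObstacleCurriculum()
# A policy that always succeeds climbs the curriculum to the ceiling.
for _ in range(1000):
    cur.record(True)
print(f"final height: {cur.height:.2f} m")
```

The design choice worth copying is the windowed success estimate with reset on promotion: it keeps the policy training near the edge of its competence instead of jumping straight to obstacles it cannot yet solve.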