2 Training Convergence & Latency Analysis

This section evaluates the convergence of the primary Reinforcement Learning (RL) Teacher and the Flow-Matched distillation process across all random seeds, as well as the \(O(1)\) inference speedup claims.

2.1 Data Loading

We aggregate the extracted Weights & Biases history files for all training runs.

2.2 RL Teacher Convergence

The Continuous PPO Teacher utilizes privileged terrain and physical parameters. Convergence is expected within 300 epochs.

2.3 Flow-Matched Distillation

The Student Policy is trained purely via Optimal Transport Flow Matching (Rectified Flows). The vector field loss targets the ground-truth Euler integration step.

2.4 Teacher RL Optimization Dynamics

The quality of the Flow-Matched distillation is strictly bounded by the structural optimality of the Teacher’s continuous manifold. The Teacher policy is trained via PPO, utilizing parallelization and privileged physical states. We dissect the internal RL optimization dynamics—Value Loss, Policy Gradient Loss, and Entropy—to mathematically verify manifold convergence prior to distillation.

2.4.1 Interpretation

The total loss curves empirically demonstrate that both the privileged PPO Teacher and the Flow-Matched Student converge to a steady state across all independent seeds.

Crucially, dissecting the PPO optimization reveals the transition from exploration to stable exploitation. The monotonic decay in entropy indicates that the stochastic policy condenses into a deterministic, low-variance action manifold. Concurrently, the stabilization of the v_loss (Value Function error) verifies that the Teacher accurately estimates the infinite-horizon return of the state space. By ensuring the Teacher has constructed a strict, deterministic optimal manifold, we guarantee that the Flow Student’s \(O(1)\) ODE mapping is distilling a coherent topology rather than random noise.