[ICML submitted] FLAG: Flow Policy MaxEnt-RL by Latent Augmented Guidance

Sungha Kim, Gawon Lee, Jusuk Lee, Jonghae Park, Daesol Cho, and H. Jin Kim

Abstract: Maximum entropy reinforcement learning (MaxEnt-RL) enables robust exploration, yet practical implementations often restrict policies to simple Gaussians. While recent MaxEnt-RL approaches incorporate expressive generative policies via weighted supervised learning, they rely on importance sampling, which is prone to effective-sample-size collapse and thus limits their scalability in high-dimensional action spaces. Our key insight is to mitigate this limitation by localizing the effective sampling region, thereby avoiding the weight degeneracy induced by standard importance sampling over the entire action space. To instantiate this insight, we introduce FLAG (Flow Policy MaxEnt-RL by Latent Augmented Guidance). FLAG leverages a composite policy consisting of a flow policy and a Gaussian head, optimized via a proxy MaxEnt-RL objective that we prove is consistent with the original MaxEnt-RL formulation. We empirically demonstrate that FLAG facilitates expressive policy learning even with limited importance samples and scales to high-dimensional control tasks. As a result, FLAG achieves state-of-the-art performance across challenging benchmarks.
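
For reference, two quantities the abstract invokes can be made concrete with standard textbook definitions in generic notation (these are not taken from the paper's own derivation). The MaxEnt-RL objective augments expected return with a policy-entropy bonus weighted by a temperature alpha, and the effective sample size (ESS) of a set of importance weights quantifies the weight degeneracy discussed above:

    % Standard MaxEnt-RL objective (generic notation, assumed here for illustration)
    J(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t} r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]

    % Effective sample size of N importance weights w_i = p(a_i)/q(a_i)
    \mathrm{ESS} = \frac{\left(\sum_{i=1}^{N} w_i\right)^{2}}{\sum_{i=1}^{N} w_i^{2}}

When a few weights dominate, as tends to happen when the proposal must cover an entire high-dimensional action space, ESS collapses toward 1; this is the degeneracy that motivates localizing the effective sampling region.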