PROPEL observes that neural policy representations are amenable to gradient-based learning but are hard to verify or interpret. On the other hand, symbolic/programmatic policy representations are relatively easy to verify and interpret but are more difficult to learn because of the combinatorial nature of program synthesis. Why not then simultaneously maintain two representations, one neural and one symbolic, during learning?
The PROPEL approach formalizes this insight by viewing program learning as constrained mirror ascent, a generalization of gradient ascent to constrained optimization settings. We consider two classes of policies: a highly expressive class H (implemented in practice using a mix of neural networks and symbolic functions) that possesses approximate gradients, and a more constrained, and possibly non-differentiable, class F of “desirable” symbolic/programmatic policies.
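The alternation this implies can be sketched as a lift-update-project loop: lift the current programmatic policy into H, take gradient ascent steps there, then project the result back onto F. The toy sketch below illustrates the control flow only; the reward, the class F (here, weight vectors on a half-integer grid), and the projection operator are all hypothetical stand-ins, not the paper's actual policy classes or imitation-based projection.

```python
import numpy as np

def reward_grad(w, target=np.array([3.0, -1.0])):
    # Gradient of a toy differentiable reward R(w) = -||w - target||^2,
    # standing in for approximate policy gradients available in H.
    return -2.0 * (w - target)

def project_to_programmatic(w):
    # Hypothetical stand-in for projection onto F: snap each weight
    # to the nearest half-integer (a crude "programmatic" class).
    return np.round(w * 2) / 2

def propel_sketch(n_outer=10, n_inner=20, lr=0.05):
    w_f = np.zeros(2)                       # current policy in F
    for _ in range(n_outer):
        w_h = w_f.copy()                    # LIFT: initialize H-policy from F-policy
        for _ in range(n_inner):
            w_h += lr * reward_grad(w_h)    # UPDATE: gradient ascent in H
        w_f = project_to_programmatic(w_h)  # PROJECT: nearest policy in F
    return w_f

print(propel_sketch())  # converges to the best policy in F near the optimum
```

In the actual method, the projection step is far richer than rounding (it involves imitation learning of the neural policy by a program synthesizer), but the same three-phase structure applies.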