PaT: Planning-after-Trial for Efficient Test-Time Code Generation

Youngsik Yoon1, Sungjae Lee1, Seockbean Song1, Siwei Wang2, Wei Chen2, Jungseul Ok1
1Pohang University of Science and Technology (POSTECH)
2Microsoft Research Asia (MSRA)
ACL 2026
Cost - Pass@1 trade-off

Cost (↓) - Pass@1 (↑) trade-off across diverse model sizes. We plot the average Pass@1 across foundational benchmarks (HumanEval, MBPP, and their EvalPlus variants) against the relative inference cost. PaT consistently advances the Pareto frontier across model sizes (Qwen3-4B, 8B, 14B, and 32B).

Abstract

Beyond training-time optimization, scaling test-time computation has emerged as a key paradigm to extend the reasoning capabilities of Large Language Models (LLMs). However, most existing methods adopt a rigid Planning-before-Trial (PbT) policy, which inefficiently allocates test-time compute by incurring planning overhead even on directly solvable problems. We propose Planning-after-Trial (PaT), an adaptive policy for code generation that invokes a planner only upon verification failure. This adaptive policy naturally enables a heterogeneous model configuration: a cost-efficient model handles generation attempts, while a powerful model is reserved for targeted planning interventions. Empirically, across multiple benchmarks and model families, our approach significantly advances the cost-performance Pareto frontier. Notably, our heterogeneous configuration achieves performance comparable to a large homogeneous model while reducing inference cost by approximately 69%.

Proposed Method


Comparison of existing methods with PaT (Ours). Problems are grouped by difficulty (easy, mid, and hard). Boxes denote the key components: a Generator (creates code), a Planner (decomposes the problem), and an Executor (verifies the solution). (a) Standard directly generates and executes; this works on easy problems but often fails on harder ones. (b) FunCoder always plans first, so planning cost is paid even when unnecessary. (c) PaT (Ours) trials first and plans only on failure, solving easy problems cheaply and hard problems adaptively.
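The control flow above can be sketched as a simple loop: attempt generation directly, verify against tests, and invoke the planner only when verification fails. This is an illustrative sketch, not the paper's implementation; `generate`, `plan`, and the test format are hypothetical stand-ins for the LLM calls (a cheap generator and a stronger planner) and the executor.

```python
def run_tests(code: str, tests) -> bool:
    """Executor: run candidate code against unit tests; True if all pass."""
    env = {}
    try:
        exec(code, env)
        return all(t(env) for t in tests)
    except Exception:
        return False


def pat_solve(problem, generate, plan, tests, max_rounds=2):
    """Planning-after-Trial loop: trial first, plan only on failure."""
    code = generate(problem, plan_hint=None)       # cheap direct attempt
    if run_tests(code, tests):
        return code                                # easy case: no planning cost paid
    for _ in range(max_rounds):
        hint = plan(problem)                       # stronger model decomposes the problem
        code = generate(problem, plan_hint=hint)   # retry guided by the plan
        if run_tests(code, tests):
            return code
    return code                                    # best effort once the budget is spent
```

With a generator that only succeeds when given a plan, the planner is called exactly once, while a directly solvable problem would incur zero planning calls.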

Main Results

Main Results Table

Performance and cost comparison on foundational benchmarks. We report Pass@1 across four benchmarks (HumanEval, HumanEval+, MBPP, and MBPP+) and their average ('Avg.'). Cost is normalized relative to the Standard baseline (1.00).

Heterogeneous Model Configuration

Trade-off curve for heterogeneous configurations

Trade-off curve for heterogeneous configurations. This plot visualizes Pass@1 performance vs. relative cost. Labels denote (Generator, Planner) sizes. Employing a powerful Qwen3-32B planner (dashed line) yields a significant performance gain for a marginal increase in cost compared to homogeneous configurations (solid line).

BibTeX

@article{yoon2026pat,
  title={PaT: Planning-after-Trial for Efficient Test-Time Code Generation},
  author={Yoon, Youngsik and Lee, Sungjae and Song, Seockbean and Wang, Siwei and Chen, Wei and Ok, Jungseul},
  journal={Submitted to ACL},
  year={2026}
}