PaT: Planning-after-Trial for Efficient Test-Time Code Generation

Youngsik Yoon1, Sungjae Lee1, Seockbean Song1, Siwei Wang2, Wei Chen2, Jungseul Ok1
1Pohang University of Science and Technology (POSTECH)
2Microsoft Research Asia (MSRA)
ACL 2026
Cost - Pass@1 trade-off

Cost (↓) - Pass@1 (↑) trade-off across diverse model sizes. We plot the average Pass@1 across foundational benchmarks (HumanEval, MBPP, and their EvalPlus variants) against the relative inference cost. PaT consistently advances the Pareto frontier across model sizes (Qwen3-4B, 8B, 14B, and 32B).

Abstract

Beyond training-time optimization, scaling test-time computation has emerged as a key paradigm to extend the reasoning capabilities of Large Language Models (LLMs). However, most existing methods adopt a rigid Planning-before-Trial (PbT) policy, which inefficiently allocates test-time compute by incurring planning overhead even on directly solvable problems. We propose Planning-after-Trial (PaT), an adaptive policy for code generation that invokes a planner only upon verification failure. This adaptive policy naturally enables a heterogeneous model configuration: a cost-efficient model handles generation attempts, while a powerful model is reserved for targeted planning interventions. Empirically, across multiple benchmarks and model families, our approach significantly advances the cost-performance Pareto frontier. Notably, our heterogeneous configuration achieves performance comparable to a large homogeneous model while reducing inference cost by approximately 69%.

Proposed Method


Comparison of existing methods with PaT (Ours). Problems are grouped by difficulty (easy, mid, and hard). Boxes denote the key components: a Generator (creates code), a Planner (decomposes the problem), and an Executor (verifies the solution). (a) Standard directly generates and executes; this works on easy problems but often fails on harder ones. (b) FunCoder always plans first, so planning cost is paid even when unnecessary. (c) PaT (Ours) trials first and plans only on failure, solving easy problems cheaply and hard problems adaptively.
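The control flow above can be sketched as a simple loop: attempt generation directly, verify against tests, and invoke the planner only when verification fails. This is an illustrative sketch, not the paper's implementation; `generate`, `plan`, and the test format are hypothetical stand-ins for the LLM calls (a cheap generator and a stronger planner) and the executor.

```python
def run_tests(code: str, tests) -> bool:
    """Executor: run candidate code against unit tests; True if all pass."""
    env = {}
    try:
        exec(code, env)
        return all(t(env) for t in tests)
    except Exception:
        return False


def pat_solve(problem, generate, plan, tests, max_rounds=2):
    """Planning-after-Trial loop: trial first, plan only on failure."""
    code = generate(problem, plan_hint=None)       # cheap direct attempt
    if run_tests(code, tests):
        return code                                # easy case: no planning cost paid
    for _ in range(max_rounds):
        hint = plan(problem)                       # stronger model decomposes the problem
        code = generate(problem, plan_hint=hint)   # retry guided by the plan
        if run_tests(code, tests):
            return code
    return code                                    # best effort once the budget is spent
```

With a generator that only succeeds when given a plan, the planner is called exactly once, while a directly solvable problem would incur zero planning calls.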

Main Results

Main Results Table

Performance and cost comparison on foundational benchmarks. We report Pass@1 across four benchmarks (HumanEval, HumanEval+, MBPP, and MBPP+) and their average ('Avg.'). Cost is normalized relative to the Standard baseline (1.00).

Heterogeneous Model Configuration

Trade-off curve for heterogeneous configurations

Trade-off curve for heterogeneous configurations. This plot visualizes Pass@1 performance vs. relative cost. Labels denote (Generator, Planner) sizes. Employing a powerful Qwen3-32B planner (dashed line) yields a significant performance gain for a marginal increase in cost compared to homogeneous configurations (solid line).

BibTeX

@article{yoon2026pat,
  title={PaT: Planning-after-Trial for Efficient Test-Time Code Generation},
  author={Yoon, Youngsik and Lee, Sungjae and Song, Seockbean and Wang, Siwei and Chen, Wei and Ok, Jungseul},
  journal={Submitted to ACL},
  year={2026}
}