May 7, 2026 · Essam Sleiman
meta-reward: reward modeling as harness optimization
Figure: an agent trajectory is passed to an LLM judge, which produces a reward score.
We propose meta-reward: a method for improving agent reward models by optimizing the evaluator harness around a fixed LLM judge.
Agent task performance is only as good as the reward signal it optimizes against. Traditionally, reward models are trained to score model outputs using preferences, ratings, or verifier labels. But agents are harder to evaluate: an effective reward model needs to judge an agent's full trajectory, not just its final output.
LLM judges provide a scalable way to score those trajectories. They can read full agent traces and convert trajectory-level behavior into a reward signal. A raw LLM judge, however, is still an underspecified reward model. To become reliable, it needs a surrounding system that grounds the reward signal: what evidence the judge sees, which rubrics it applies, and the procedure that turns its judgment into a score.
In our work, we pose this surrounding system as the evaluator harness and optimize it directly. On τ³-bench airline, optimizing the harness around a fixed Haiku 4.5 judge raises held-out agreement from 52.8% to 78.2% and improves natural best-of-N trajectory selection by up to +30.2 points.
Motivation
The core challenge in agent post-training is defining a reward signal that captures the behavior we want the agent to learn. In domains with verifiable outcomes, this is relatively clean. Math solutions can be checked deterministically and code can be evaluated with executable tests. In cases where automatic verification isn't available, reward signals are often constructed from human judgment.
For long-horizon agent tasks, reward specification is harder because the reward must judge the full trajectory, not just the final response. A customer support, research, or workflow agent is evaluated by what information it gathered, which tools it called, what policy it applied, when it changed external state, and when it chose not to act.
Without accurate trajectory-level supervision, we risk rewarding the right outcome for the wrong reasons. For example, an agent might reach the correct final state through a lucky guess, unnecessary tool use, or an unauthorized action.
We observed this in τ³-airline as an action bias: the untuned judge often over-rewarded visible state-changing actions, like cancellations, compensation, and booking changes, even when policy required restraint.
Human annotation can provide this trajectory-level supervision, but labeling full agent traces is slow and expensive to scale. LLM judges offer a more scalable approximation. They can read agent traces, evaluate behavior against task criteria, and turn that judgment into a reward signal. In our previous meta-agent work, we used LLM judges to score unlabeled agent traces during harness optimization.
Related: meta-agent: continual learning for agents, our open-source library that automatically and continuously improves agent harnesses from production traces.
But a judge call is not yet a reward procedure. Given a long trace and a rubric, the judge still has to infer what evidence matters, which constraints to prioritize, how to handle conflicting signals, and how to turn its reasoning into a score. Those choices determine what behavior gets rewarded. We call the system that specifies these choices the evaluator harness. It defines the trace view, policy context, checks, rubric, decision process, and scoring logic around the judge.
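Concretely, the harness can be thought of as a small configuration object that is optimized while the judge stays frozen. The sketch below is illustrative rather than the actual implementation; the field names simply mirror the components listed above.

```python
from dataclasses import dataclass, field

@dataclass
class EvaluatorHarness:
    """Illustrative container for everything that surrounds the fixed LLM judge.

    These are the pieces meta-reward optimizes; the judge's weights never change.
    """
    trace_view: str        # how the agent trajectory is rendered for the judge
    policy_context: str    # which policy excerpts the judge is shown
    checks: list[str] = field(default_factory=list)  # programmatic checks run on the trace
    rubric: str = ""       # the criteria the judge scores against
    decision_process: str = ""  # instructions for how the judge should work through the evidence
    scoring_logic: str = "Return a single integer reward from 0 to 100."
```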
meta-reward optimizes the evaluator harness directly. Using a small set of trusted trajectory preferences from human annotation or task-specific labels, it tunes the evaluation procedure so the judge's scores better align with trusted preferences and generalize to unseen trajectories.
How it works
meta-reward keeps the judge model parameters fixed and optimizes the evaluator harness around it: the trace view, policy context, checks, rubric, decision process, and scoring logic.
We start with a small set of trusted trajectory preferences. Each example contains two agent trajectories and a label for which trajectory should receive higher reward. The evaluator scores each trajectory 0–100 independently, then our system predicts the preference by choosing the higher-scoring trajectory.
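A minimal sketch of what a trusted preference example and the score-to-preference rule might look like (the names here are illustrative, not the library's API):

```python
from dataclasses import dataclass

@dataclass
class TrajectoryPreference:
    """One trusted example: two full agent trajectories and which one should score higher."""
    trajectory_a: str  # serialized agent trace (messages, tool calls, final state)
    trajectory_b: str
    preferred: str     # "a" or "b", from human annotation or task-specific labels

def predict_preference(score_a: float, score_b: float) -> str:
    """Pointwise 0-100 rewards become a pairwise prediction by taking the higher score."""
    if score_a == score_b:
        return "tie"
    return "a" if score_a > score_b else "b"
```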
The optimization loop is:
- Score. Run the current evaluator on trajectory pairs. Each trajectory receives a 0–100 reward score.
- Compare. Pick the higher-scoring trajectory and check whether it matches the trusted preference.
- Diagnose. Read the disagreements, along with the history of prior harnesses, scores, and failures, to identify recurring evaluator mistakes.
- Propose. Write targeted harness updates: changes to the trace view, policy context, checks, rubric, decision process, or scoring logic.
- Validate. Evaluate the candidate harness on held-out preference pairs. Store every result.
- Repeat. Continue the loop with the best validated evaluator harness.
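Putting these steps together, here is a minimal sketch of the loop, reusing the `EvaluatorHarness`, `TrajectoryPreference`, and `predict_preference` definitions from the sketches above. The `score_fn` and `propose_fn` callables stand in for the judge call and the proposer model; the structure is illustrative rather than the actual implementation.

```python
from typing import Callable

def agreement(harness: EvaluatorHarness,
              pairs: list[TrajectoryPreference],
              score_fn: Callable[[EvaluatorHarness, str], float]) -> float:
    """Fraction of preference pairs where the higher-scoring trajectory matches the trusted label."""
    hits = 0
    for pair in pairs:
        pred = predict_preference(score_fn(harness, pair.trajectory_a),
                                  score_fn(harness, pair.trajectory_b))
        hits += (pred == pair.preferred)
    return hits / len(pairs)

def optimize_harness(harness, score_fn, propose_fn, train_pairs, val_pairs, steps: int = 20):
    """Sketch of the meta-reward loop: score, compare, diagnose, propose, validate, repeat."""
    history = []  # every candidate harness, its validation score, and its failures
    best, best_val = harness, agreement(harness, val_pairs, score_fn)
    for _ in range(steps):
        # Score + Compare: collect the pairs the current best harness gets wrong.
        disagreements = [
            p for p in train_pairs
            if predict_preference(score_fn(best, p.trajectory_a),
                                  score_fn(best, p.trajectory_b)) != p.preferred
        ]
        # Diagnose + Propose: the proposer reads the disagreements and the history,
        # then writes a targeted update (trace view, policy context, checks, rubric, ...).
        candidate = propose_fn(best, disagreements, history)
        # Validate: keep the candidate only if it generalizes to held-out pairs.
        val = agreement(candidate, val_pairs, score_fn)
        history.append((candidate, val, disagreements))
        if val > best_val:
            best, best_val = candidate, val
    return best
```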
Figure: an example evaluator disagreement. One trajectory declines the refund and offers escalation; the other issues the refund but bypasses policy. Patch: add a policy-check tool the judge must call before deciding.
After optimization, the evaluator assigns a scalar reward to each trajectory. For pairwise evaluation, we choose the higher-scoring trajectory. For best-of-N, we choose the highest-scoring rollout in the pool.
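Both selection rules are simple argmax choices over the evaluator's scalar rewards. A minimal sketch, with `score_fn` and `harness` as in the earlier sketches:

```python
def select_pairwise(trajectory_a: str, trajectory_b: str, score_fn, harness) -> str:
    """Pairwise evaluation: return whichever trajectory the tuned evaluator scores higher."""
    score_a = score_fn(harness, trajectory_a)
    score_b = score_fn(harness, trajectory_b)
    return trajectory_a if score_a >= score_b else trajectory_b

def select_best_of_n(rollouts: list[str], score_fn, harness) -> str:
    """Best-of-N: return the rollout with the highest scalar reward in the pool."""
    return max(rollouts, key=lambda rollout: score_fn(harness, rollout))
```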
Results
We evaluated meta-reward on τ³-bench airline, a multi-turn customer-service benchmark where agents must follow policy, use tools correctly, and update external state.
Evaluator-harness tuning substantially improved agreement with trusted trajectory preferences. With the same judge weights held fixed, pointwise reward agreement increased from 52.8% to 78.2%. The gain also held for stronger judges, suggesting that meta-reward is improving the evaluation procedure itself, not only compensating for a weak judge.
The tuned evaluator also became a more decisive ranking signal. On the τ³ airline test pool, the mean margin in favor of the trusted-preferred trajectory increased from +6.4 to +31.1, while ties dropped from 83 to 16.
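A minimal sketch of how the margin and tie statistics can be computed, reusing `TrajectoryPreference` and a per-trajectory `score_fn` from the sketches above (illustrative, not the exact evaluation code): the margin is the score gap in favor of the trusted-preferred trajectory, and a tie is any pair the evaluator scores identically.

```python
def ranking_stats(pairs: list[TrajectoryPreference], score_fn, harness) -> tuple[float, int]:
    """Mean margin in favor of the trusted-preferred trajectory, plus the number of ties."""
    margins, ties = [], 0
    for pair in pairs:
        score_a = score_fn(harness, pair.trajectory_a)
        score_b = score_fn(harness, pair.trajectory_b)
        preferred, other = (score_a, score_b) if pair.preferred == "a" else (score_b, score_a)
        if score_a == score_b:
            ties += 1
        margins.append(preferred - other)
    return sum(margins) / len(margins), ties
```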
| Rollouts sampled (N) | Random selector | Baseline evaluator | Tuned evaluator | Oracle@N | Lift (tuned − baseline) |
|---|---|---|---|---|---|
| 1 | 55.6% | 55.6% | 55.6% | 55.6% | +0.0 |
| 2 | 56.3% | 58.1% | 70.4% | 76.4% | +12.3 |
| 4 | 57.9% | 60.4% | 79.2% | 92.4% | +18.8 |
| 8 | 57.6% | 51.9% | 82.1% | 99.2% | +30.2 |
As the table above shows, this translated into better trajectory selection: in natural best-of-N, the tuned evaluator selected stronger rollouts from the same candidate pools, improving performance by up to +30.2 points over the baseline evaluator.
The same direction held beyond τ³. On Plan-RewardBench, evaluator-harness tuning improved held-out agreement from 60.5% to 72.4%.
τ³-bench airline used Claude Haiku 4.5 as the evaluator and Claude Opus 4.6 as the proposer. Plan-RewardBench used Claude Opus 4.6 for both evaluator and proposer.
We then ran a few smaller ablations to better understand whether the learned reward was useful downstream and how much trusted supervision it needed.
A small reward-hacking stress test. A natural concern is that optimizing against a learned reward could simply game the judge. To examine this, we hid the official τ³ grader during downstream agent-harness optimization and used the tuned learned reward as the only search objective. We evaluated official τ³ success only after optimization. After six candidate updates, the learned-reward-optimized harness improved official τ³ success from 60% to 80%, while learned-reward holdout increased from 68.0% to 82.2%.
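In pseudocode, the stress-test setup amounts to searching over agent-harness candidates with the learned reward as the only objective and consulting the official grader once, after the search. The names below (`learned_reward`, `official_grader`, `candidate_updates`) are illustrative stand-ins, not real APIs, and `candidate_updates` abstracts over whatever proposes agent-harness changes during the search.

```python
def reward_hacking_stress_test(agent_harness, candidate_updates, learned_reward,
                               official_grader, tasks):
    """Search over agent-harness candidates using only the learned reward,
    then check official task success once, after the search is done."""
    best, best_reward = agent_harness, learned_reward(agent_harness, tasks)
    for update in candidate_updates:
        candidate = update(best)
        reward = learned_reward(candidate, tasks)  # the only signal visible during the search
        if reward > best_reward:
            best, best_reward = candidate, reward
    # Official success is measured only here, after optimization has finished.
    return official_grader(best, tasks)
```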
Sample efficiency ablation. We also tested how much trusted supervision meta-reward needed to tune a useful evaluator harness. In this small experiment, we used 5 trusted trajectory pairs for each included task, then varied how many training tasks were included. With 5 tasks, the tuned evaluator reached 60.2% held-out agreement. With 10 tasks, it reached 67.6%. With 20 tasks, it reached 79.6%, close to the 80.9% result from the full 25-task set. With 80% task coverage, it recovered 95.4% of the gain.
These results show that the tuned evaluator became more aligned and more useful for ranking. The next question we explored was which mistakes harness tuning fixed and which ones remained.
What changed in the reward signal
After demonstrating that evaluator-harness optimization improved agreement, we asked what actually changed in the reward signal. This is especially relevant because any agent optimizing against this reward will inherit its blind spots.
Harness tuning corrected an action bias. When we clustered the baseline reward model's failures, we found that the untuned judge often over-scored visible state-changing actions, such as cancellations, compensation, and booking changes. These actions looked like progress even when the policy did not authorize them. The tuned harness reduced this bias by making policy authorization, tool evidence, and final-state correctness explicit parts of the reward judgment.
On this cluster, Haiku improved from 46.4% to 76.8% after harness tuning. GPT-5.5 started much higher, at 75.4%, and improved to 85.5%. The tuned Haiku harness relied heavily on explicit policy rules to get these cases right, while the GPT-5.5 harness needed much less scaffolding. This suggests the stronger judge already had more of the action/restraint prior internally, while the weaker judge needed the harness to supply it.
The hardest remaining failures involved state mismatches. When we clustered the traces that harness tuning did not recover, most shared the same issue: both rollouts looked correct from the conversation alone, but one made a small mistake in a tool call or final-state update, and the judge missed it. For example:
user asks to change Alice's flight → agent retrieves reservation → agent calls update tool → tool updates Bob instead of Alice → agent says “your reservation has been updated”
On this slice, Haiku improved from 44.9% to 60.2% after tuning. GPT-5.5 started higher at 58.2% but only reached 61.3%. Unlike the action-bias cases, neither harness tuning nor a stronger base model closed most of the gap. These errors require detailed state auditing: tracking entities across a long conversation, checking tool arguments, and comparing the final database state against the user's request.
| Failure cluster | Haiku Baseline | Haiku Tuned | GPT-5.5 Baseline | GPT-5.5 Tuned |
|---|---|---|---|---|
| Action-bias | 46.4% | 76.8% | 75.4% | 85.5% |
| State-mismatch | 44.9% | 60.2% | 58.2% | 61.3% |
This suggests a boundary for evaluator-harness tuning. It works best when the judge is capable of the right decision but needs a better procedure for what to attend to. It doesn't work as well when the error depends on the judge's underlying ability to track long context, reconstruct state, or audit small factual differences. Those cases may require post-training the judge itself or a more specialized state-auditing component.
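To make the idea of a state-auditing component concrete, here is a minimal sketch of the kind of programmatic check that could surface such mismatches as evidence for the judge. It illustrates the direction rather than anything meta-reward currently includes, and the data shapes are assumptions.

```python
def audit_final_state(requested_updates: dict[str, dict], final_state: dict[str, dict]) -> list[str]:
    """Compare what the user asked for against the entities the tools actually changed.

    `requested_updates` maps an entity id (e.g. Alice's reservation) to the fields the
    user asked to change; `final_state` maps entity ids to their fields after the
    trajectory ran. Returns human-readable discrepancies to show the judge.
    """
    discrepancies = []
    for entity_id, wanted in requested_updates.items():
        actual = final_state.get(entity_id, {})
        for field_name, wanted_value in wanted.items():
            if actual.get(field_name) != wanted_value:
                discrepancies.append(
                    f"{entity_id}.{field_name}: expected {wanted_value!r}, "
                    f"found {actual.get(field_name)!r}"
                )
    return discrepancies
```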
What we learned
- Reward model design can be posed as a harness optimization problem. The same frozen judge became better aligned with expert preferences once we optimized the evaluator harness around it.
- Evaluator tuning produced a more useful reward signal. The tuned evaluator did not just match trusted preferences more often. It selected better trajectories in best-of-N, and in a small downstream stress test, optimizing against the learned reward improved official task success. This suggests the reward became more useful for filtering, reranking, and optimization.
- Evaluator-harness tuning has a boundary. It works best when the judge can make the right decision but needs a better procedure for attending to evidence, applying policy, or avoiding biases like over-rewarding visible actions. It helps less when the error requires deeper state reconstruction, where model-level judge training or a specialized state-auditing component may be needed.
Next steps
Scale the closed loop. We ran a small stress test where the tuned reward improved downstream agent-harness optimization. The next step is to scale this by using the learned reward across larger task sets, longer search budgets, and more agent systems, then measuring whether reward alignment consistently translates into higher official task success without entropy collapse or narrowing into reward-favored behaviors.
Go beyond harness tuning. Some failures remained even after evaluator-harness optimization, especially state-mismatch errors that require detailed state reconstruction. For these cases, we're interested to test whether post-training the judge itself can recover the remaining gap.
Benchmark the proposer. The loop depends on the proposer's ability to generate useful harness updates. A natural next benchmark is to measure how well frontier agents can drive the full harness-improvement loop on a defined task.