May 7, 2026 · Essam Sleiman
meta-reward: reward modeling as harness optimization
Figure: an agent trajectory is passed to an LLM judge, which produces a reward score.
We propose meta-reward: a method for improving agent reward models by optimizing the evaluator harness around a fixed LLM judge.
Agent task performance is only as good as the reward signal it optimizes against. Traditionally, reward models are trained to score model outputs using preferences, ratings, or verifier labels. But agents are harder to evaluate: an effective reward model needs to judge an agent's full trajectory, not just its final output.
LLM judges provide a scalable way to score those trajectories. They can read full agent traces and convert trajectory-level behavior into a reward signal. A raw LLM judge, however, is still an underspecified reward model. To become reliable, it needs a surrounding system that grounds the reward signal: what evidence the judge sees, which rubrics it applies, and the procedure that turns its judgment into a score.
In our work, we pose this surrounding system as the evaluator harness and optimize it directly. On τ³-bench airline, optimizing the harness around a fixed Haiku 4.5 judge raises held-out agreement from 52.8% to 78.2% and improves natural best-of-N trajectory selection by up to +30.2 points.
Motivation
The core challenge in agent post-training is defining a reward signal that captures the behavior we want the agent to learn. In domains with verifiable outcomes, this is relatively clean. Math solutions can be checked deterministically and code can be evaluated with executable tests. In cases where automatic verification isn't available, reward signals are often constructed from human judgment.
For long-horizon agent tasks, reward specification is harder because the reward must judge the full trajectory, not just the final response. A customer support, research, or workflow agent is evaluated by what information it gathered, which tools it called, what policy it applied, when it changed external state, and when it chose not to act.
Without accurate trajectory-level supervision, we risk rewarding the right outcome for the wrong reasons. For example, an agent might reach the correct final state through a lucky guess, unnecessary tool use, or an unauthorized action.
We observed this in τ³-airline as an action bias: the untuned judge often over-rewarded visible state-changing actions, like cancellations, compensation, and booking changes, even when policy required restraint.
Human annotation can provide this trajectory-level supervision, but labeling full agent traces is slow and expensive to scale. LLM judges offer a more scalable approximation. They can read agent traces, evaluate behavior against task criteria, and turn that judgment into a reward signal. In our previous meta-agent work, we used LLM judges to score unlabeled agent traces during harness optimization.
Related: meta-agent: continual learning for agents, our open-source library that automatically and continuously improves agent harnesses from production traces.
But a judge call is not yet a reward procedure. Given a long trace and a rubric, the judge still has to infer what evidence matters, which constraints to prioritize, how to handle conflicting signals, and how to turn its reasoning into a score. Those choices determine what behavior gets rewarded. We call the system that specifies these choices the evaluator harness. It defines the trace view, policy context, checks, rubric, decision process, and scoring logic around the judge.
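Concretely, the harness can be thought of as a small configuration object that is optimized while the judge stays frozen. The sketch below is illustrative rather than the actual implementation; the field names simply mirror the components listed above.

```python
from dataclasses import dataclass, field

@dataclass
class EvaluatorHarness:
    """Illustrative container for everything that surrounds the fixed LLM judge.

    These are the pieces meta-reward optimizes; the judge's weights never change.
    """
    trace_view: str        # how the agent trajectory is rendered for the judge
    policy_context: str    # which policy excerpts the judge is shown
    checks: list[str] = field(default_factory=list)  # programmatic checks run on the trace
    rubric: str = ""       # the criteria the judge scores against
    decision_process: str = ""  # instructions for how the judge should work through the evidence
    scoring_logic: str = "Return a single integer reward from 0 to 100."
```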
meta-reward optimizes the evaluator harness directly. Using a small set of trusted trajectory preferences from human annotation or task-specific labels, it tunes the evaluation procedure so the judge's scores better align with trusted preferences and generalize to unseen trajectories.
How it works
meta-reward keeps the judge model parameters fixed and optimizes the evaluator harness around it: the trace view, policy context, checks, rubric, decision process, and scoring logic.
We start with a small set of trusted trajectory preferences. Each example contains two agent trajectories and a label for which trajectory should receive higher reward. The evaluator scores each trajectory 0–100 independently, then our system predicts the preference by choosing the higher-scoring trajectory.
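A minimal sketch of what a trusted preference example and the score-to-preference rule might look like (the names here are illustrative, not the library's API):

```python
from dataclasses import dataclass

@dataclass
class TrajectoryPreference:
    """One trusted example: two full agent trajectories and which one should score higher."""
    trajectory_a: str  # serialized agent trace (messages, tool calls, final state)
    trajectory_b: str
    preferred: str     # "a" or "b", from human annotation or task-specific labels

def predict_preference(score_a: float, score_b: float) -> str:
    """Pointwise 0-100 rewards become a pairwise prediction by taking the higher score."""
    if score_a == score_b:
        return "tie"
    return "a" if score_a > score_b else "b"
```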
The optimization loop is:
- Score. Run the current evaluator on trajectory pairs. Each trajectory receives a 0–100 reward score.
- Compare. Pick the higher-scoring trajectory and check whether it matches the trusted preference.
- Diagnose. Read the disagreements, along with the history of prior harnesses, scores, and failures, to identify recurring evaluator mistakes.
- Propose. Write targeted harness updates: changes to the trace view, policy context, checks, rubric, decision process, or scoring logic.
- Validate. Evaluate the candidate harness on held-out preference pairs. Store every result.
- Repeat. Continue the loop with the best validated evaluator harness.
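Putting these steps together, here is a minimal sketch of the loop, reusing the `EvaluatorHarness`, `TrajectoryPreference`, and `predict_preference` definitions from the sketches above. The `score_fn` and `propose_fn` callables stand in for the judge call and the proposer model; the structure is illustrative rather than the actual implementation.

```python
from typing import Callable

def agreement(harness: EvaluatorHarness,
              pairs: list[TrajectoryPreference],
              score_fn: Callable[[EvaluatorHarness, str], float]) -> float:
    """Fraction of preference pairs where the higher-scoring trajectory matches the trusted label."""
    hits = 0
    for pair in pairs:
        pred = predict_preference(score_fn(harness, pair.trajectory_a),
                                  score_fn(harness, pair.trajectory_b))
        hits += (pred == pair.preferred)
    return hits / len(pairs)

def optimize_harness(harness, score_fn, propose_fn, train_pairs, val_pairs, steps: int = 20):
    """Sketch of the meta-reward loop: score, compare, diagnose, propose, validate, repeat."""
    history = []  # every candidate harness, its validation score, and its failures
    best, best_val = harness, agreement(harness, val_pairs, score_fn)
    for _ in range(steps):
        # Score + Compare: collect the pairs the current best harness gets wrong.
        disagreements = [
            p for p in train_pairs
            if predict_preference(score_fn(best, p.trajectory_a),
                                  score_fn(best, p.trajectory_b)) != p.preferred
        ]
        # Diagnose + Propose: the proposer reads the disagreements and the history,
        # then writes a targeted update (trace view, policy context, checks, rubric, ...).
        candidate = propose_fn(best, disagreements, history)
        # Validate: keep the candidate only if it generalizes to held-out pairs.
        val = agreement(candidate, val_pairs, score_fn)
        history.append((candidate, val, disagreements))
        if val > best_val:
            best, best_val = candidate, val
    return best
```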
Figure: an example evaluator disagreement. One trajectory declines the refund and offers escalation; the other issues the refund but bypasses policy. Patch: add a policy-check tool the judge must call before deciding.
After optimization, the evaluator assigns a scalar reward to each trajectory. For pairwise evaluation, we choose the higher-scoring trajectory. For best-of-N, we choose the highest-scoring rollout in the pool.
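Both selection rules are simple argmax choices over the evaluator's scalar rewards. A minimal sketch, with `score_fn` and `harness` as in the earlier sketches:

```python
def select_pairwise(trajectory_a: str, trajectory_b: str, score_fn, harness) -> str:
    """Pairwise evaluation: return whichever trajectory the tuned evaluator scores higher."""
    score_a = score_fn(harness, trajectory_a)
    score_b = score_fn(harness, trajectory_b)
    return trajectory_a if score_a >= score_b else trajectory_b

def select_best_of_n(rollouts: list[str], score_fn, harness) -> str:
    """Best-of-N: return the rollout with the highest scalar reward in the pool."""
    return max(rollouts, key=lambda rollout: score_fn(harness, rollout))
```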
Results
We evaluated meta-reward on τ³-bench airline, a multi-turn customer-service benchmark where agents must follow policy, use tools correctly, and update external state.
Evaluator-harness tuning substantially improved agreement with trusted trajectory preferences. With the same judge weights held fixed, pointwise reward agreement increased from 52.8% to 78.2%. The gain also held for stronger judges, suggesting that meta-reward is improving the evaluation procedure itself, not only compensating for a weak judge.
The tuned evaluator also became a more decisive ranking signal. On the τ³ airline test pool, the mean margin in favor of the trusted-preferred trajectory increased from +6.4 to +31.1, while ties dropped from 83 to 16.
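A minimal sketch of how the margin and tie statistics can be computed, reusing `TrajectoryPreference` and a per-trajectory `score_fn` from the sketches above (illustrative, not the exact evaluation code): the margin is the score gap in favor of the trusted-preferred trajectory, and a tie is any pair the evaluator scores identically.

```python
def ranking_stats(pairs: list[TrajectoryPreference], score_fn, harness) -> tuple[float, int]:
    """Mean margin in favor of the trusted-preferred trajectory, plus the number of ties."""
    margins, ties = [], 0
    for pair in pairs:
        score_a = score_fn(harness, pair.trajectory_a)
        score_b = score_fn(harness, pair.trajectory_b)
        preferred, other = (score_a, score_b) if pair.preferred == "a" else (score_b, score_a)
        if score_a == score_b:
            ties += 1
        margins.append(preferred - other)
    return sum(margins) / len(margins), ties
```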
| Rollouts sampled (N) | Random selector | Baseline evaluator | Tuned evaluator | Oracle@N | Lift (tuned − baseline) |
|---|---|---|---|---|---|
| 1 | 55.6% | 55.6% | 55.6% | 55.6% | +0.0 |
| 2 | 56.3% | 58.1% | 70.4% | 76.4% | +12.3 |
| 4 | 57.9% | 60.4% | 79.2% | 92.4% | +18.8 |
| 8 | 57.6% | 51.9% | 82.1% | 99.2% | +30.2 |
As the table above shows, this translated into better trajectory selection: in natural best-of-N, the tuned evaluator selected stronger rollouts from the same candidate pools, improving performance by up to +30.2 points over the baseline evaluator.
The same direction held beyond τ³. On Plan-RewardBench, evaluator-harness tuning improved held-out agreement from 60.5% to 72.4%.
τ³-bench airline used Claude Haiku 4.5 as the evaluator and Claude Opus 4.6 as the proposer. Plan-RewardBench used Claude Opus 4.6 for both evaluator and proposer.
We then ran a few smaller ablations to better understand whether the learned reward was useful downstream and how much trusted supervision it needed.
A small reward-hacking stress test. A natural concern is that optimizing against a learned reward could simply game the judge. To examine this, we hid the official τ³ grader during downstream agent-harness optimization and used the tuned learned reward as the only search objective. We evaluated official τ³ success only after optimization. After six candidate updates, the learned-reward-optimized harness improved official τ³ success from 60% to 80%, while learned-reward holdout increased from 68.0% to 82.2%.
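In pseudocode, the stress-test setup amounts to searching over agent-harness candidates with the learned reward as the only objective and consulting the official grader once, after the search. The names below (`learned_reward`, `official_grader`, `candidate_updates`) are illustrative stand-ins, not real APIs, and `candidate_updates` abstracts over whatever proposes agent-harness changes during the search.

```python
def reward_hacking_stress_test(agent_harness, candidate_updates, learned_reward,
                               official_grader, tasks):
    """Search over agent-harness candidates using only the learned reward,
    then check official task success once, after the search is done."""
    best, best_reward = agent_harness, learned_reward(agent_harness, tasks)
    for update in candidate_updates:
        candidate = update(best)
        reward = learned_reward(candidate, tasks)  # the only signal visible during the search
        if reward > best_reward:
            best, best_reward = candidate, reward
    # Official success is measured only here, after optimization has finished.
    return official_grader(best, tasks)
```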
Sample efficiency ablation. We also tested how much trusted supervision meta-reward needed to tune a useful evaluator harness. In this small experiment, we used 5 trusted trajectory pairs for each included task, then varied how many training tasks were included. With 5 tasks, the tuned evaluator reached 60.2% held-out agreement. With 10 tasks, it reached 67.6%. With 20 tasks, it reached 79.6%, close to the 80.9% result from the full 25-task set. With 80% task coverage, it recovered 95.4% of the gain.
These results show that the tuned evaluator became more aligned and more useful for ranking. The next question we explored was which mistakes harness tuning fixed and which ones remained.
What changed in the reward signal
After demonstrating that evaluator-harness optimization improved agreement, we asked what actually changed in the reward signal. This is especially relevant because any agent optimizing against this reward will inherit its blind spots.
Harness tuning corrected an action bias. When we clustered the baseline reward model's failures, we found that the untuned judge often over-scored visible state-changing actions, such as cancellations, compensation, and booking changes. These actions looked like progress even when the policy did not authorize them. The tuned harness reduced this bias by making policy authorization, tool evidence, and final-state correctness explicit parts of the reward judgment.
On this cluster, Haiku improved from 46.4% to 76.8% after harness tuning. GPT-5.5 started much higher, at 75.4%, and improved to 85.5%. The tuned Haiku harness relied heavily on explicit policy rules to get these cases right, while the GPT-5.5 harness needed much less scaffolding. This suggests the stronger judge already had more of the action/restraint prior internally, while the weaker judge needed the harness to supply it.
The hardest remaining failures involved state mismatches. When we clustered the traces that harness tuning did not recover, most shared the same issue: both rollouts looked correct from the conversation alone, but one made a small mistake in a tool call or final-state update, and the judge missed it. For example:
user asks to change Alice's flight → agent retrieves reservation → agent calls update tool → tool updates Bob instead of Alice → agent says “your reservation has been updated”
On this slice, Haiku improved from 44.9% to 60.2% after tuning. GPT-5.5 started higher at 58.2% but only reached 61.3%. Unlike the action-bias cases, neither harness tuning nor a stronger base model closed most of the gap. These errors require detailed state auditing: tracking entities across a long conversation, checking tool arguments, and comparing the final database state against the user's request.
| Failure cluster | Haiku Baseline | Haiku Tuned | GPT-5.5 Baseline | GPT-5.5 Tuned |
|---|---|---|---|---|
| Action-bias | 46.4% | 76.8% | 75.4% | 85.5% |
| State-mismatch | 44.9% | 60.2% | 58.2% | 61.3% |
This suggests a boundary for evaluator-harness tuning. It works best when the judge is capable of the right decision but needs a better procedure for what to attend to. It doesn't work as well when the error depends on the judge's underlying ability to track long context, reconstruct state, or audit small factual differences. Those cases may require post-training the judge itself or a more specialized state-auditing component.
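To make the idea of a state-auditing component concrete, here is a minimal sketch of the kind of programmatic check that could surface such mismatches as evidence for the judge. It illustrates the direction rather than anything meta-reward currently includes, and the data shapes are assumptions.

```python
def audit_final_state(requested_updates: dict[str, dict], final_state: dict[str, dict]) -> list[str]:
    """Compare what the user asked for against the entities the tools actually changed.

    `requested_updates` maps an entity id (e.g. Alice's reservation) to the fields the
    user asked to change; `final_state` maps entity ids to their fields after the
    trajectory ran. Returns human-readable discrepancies to show the judge.
    """
    discrepancies = []
    for entity_id, wanted in requested_updates.items():
        actual = final_state.get(entity_id, {})
        for field_name, wanted_value in wanted.items():
            if actual.get(field_name) != wanted_value:
                discrepancies.append(
                    f"{entity_id}.{field_name}: expected {wanted_value!r}, "
                    f"found {actual.get(field_name)!r}"
                )
    return discrepancies
```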
What we learned
- Reward model design can be posed as a harness optimization problem. The same frozen judge became better aligned with expert preferences once we optimized the evaluator harness around it.
- Evaluator tuning produced a more useful reward signal. The tuned evaluator did not just match trusted preferences more often. It selected better trajectories in best-of-N, and in a small downstream stress test, optimizing against the learned reward improved official task success. This suggests the reward became more useful for filtering, reranking, and optimization.
- Evaluator-harness tuning has a boundary. It works best when the judge can make the right decision but needs a better procedure for attending to evidence, applying policy, or avoiding biases like over-rewarding visible actions. It helps less when the error requires deeper state reconstruction, where model-level judge training or a specialized state-auditing component may be needed.
Next steps
Scale the closed loop. We ran a small stress test where the tuned reward improved downstream agent-harness optimization. The next step is to scale this by using the learned reward across larger task sets, longer search budgets, and more agent systems, then measuring whether reward alignment consistently translates into higher official task success without entropy collapse or narrowing into reward-favored behaviors.
Go beyond harness tuning. Some failures remained even after evaluator-harness optimization, especially state-mismatch errors that require detailed state reconstruction. For these cases, we're interested to test whether post-training the judge itself can recover the remaining gap.
Benchmark the proposer. The loop depends on the proposer's ability to generate useful harness updates. A natural next benchmark is to measure how well frontier agents can drive the full harness-improvement loop on a defined task.