ReBel

REWARDING BELIEFS, NOT ACTIONS

The agent's world model should not be a black box.

Make it explicit. Structured. Verifiable.

Credit what the agent knows, not what it does.

93.2% ALFWorld Success Rate
+20.4 pp Gain over GRPO
2.1× Sample Efficiency
3.2× Shorter Episodes (29.9 → 9.2 steps)

The World Model Hypothesis

Every capable agent maintains a world model — a compressed, structured representation of the environment used to reason, anticipate, and plan. Standard LLM agents bury this model inside opaque hidden states where it is invisible to credit assignment and free to drift without consequence. ReBel externalizes the world model as a structured belief segment, making it a first-class object that can be supervised, verified against observations, and corrected when wrong. This is a minimal but concrete instantiation of the world-model loop.

Belief → Think → Action

The Failure: Belief Drift

In partially observable environments, agents infer latent state from incomplete observations. Small inference errors compound over 30+ steps into belief drift — the agent thinks it holds an apple, but its hands are empty. Delayed terminal rewards can't trace the failure back to the original misinference. Credit assignment collapses.
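
To make the compounding concrete, here is a back-of-the-envelope sketch. It assumes each step's state inference is correct independently with a fixed probability; that is a simplification introduced only for illustration, and the function name and the 0.98 figure are hypothetical.

```python
def p_belief_intact(per_step_accuracy: float, steps: int) -> float:
    """Probability that no inference error has occurred after `steps` steps,
    assuming independent errors (an illustrative assumption)."""
    return per_step_accuracy ** steps

# Even a 2% per-step error rate leaves only about a 55% chance
# that the belief is still fully correct after 30 steps.
p30 = p_belief_intact(0.98, 30)
```

Under this toy model, a terminal reward observed at step 30 says little about which of the 30 inferences went wrong.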

The Fix: Belief as First-Class Variable

ReBel makes belief explicit, structured, and verifiable. At each step, the agent outputs a structured belief (object locations, states, task phase, predictions) alongside its reasoning. This belief is checked against subsequent observations — mismatches produce immediate, dense learning signals.
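
A structured belief of this kind could be represented roughly as follows. This is a minimal sketch, not ReBel's actual schema: the class name, field names, and predicate encoding are all assumptions chosen to mirror the four items listed above.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Belief:
    """Hypothetical container for one step's structured belief."""
    object_locations: dict[str, str]      # e.g. {"apple 1": "agent"}
    object_states: dict[str, str]         # e.g. {"apple 1": "cooled"}
    task_phase: str                       # e.g. "search" or "transport"
    predicted_keywords: tuple[str, ...]   # words expected in the next observation

    def predicates(self) -> frozenset[tuple[str, str, str]]:
        """Flatten the belief into (kind, subject, value) triples that can be checked."""
        preds = {("at", obj, loc) for obj, loc in self.object_locations.items()}
        preds |= {("state", obj, st) for obj, st in self.object_states.items()}
        preds.add(("phase", "task", self.task_phase))
        return frozenset(preds)

belief = Belief(
    object_locations={"apple 1": "agent"},
    object_states={"apple 1": "cooled"},
    task_phase="transport",
    predicted_keywords=("countertop",),
)
```

Flattening to hashable triples is a design choice made here so that beliefs can later be compared predicate by predicate against observations, or against each other.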

Belief inconsistency (left) vs. ReBel's consistent belief tracking (right). When the internal world model drifts, actions become invalid even with confident token probabilities. ReBel eliminates this failure mode.

Two Mechanisms, One Principle

Reward the agent for maintaining an accurate world model, not merely for reaching the goal. ReBel converts sparse terminal rewards into dense process-level signals through two complementary mechanisms.

Belief-Consistency Reward

A dense, step-wise signal that verifies each predicted predicate against subsequent observations. Three components track object locations & states (r_state), task phase (r_task), and expected observation keywords (r_pred). Unverifiable predicates go into a pending buffer and receive credit retroactively when evidence arrives. Observability masking prevents penalizing the agent for what it cannot yet see.
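
The pending buffer and observability mask can be sketched as below. This is an illustration under stated assumptions, not the paper's implementation: the function names, the +1/-1 scoring, and the dictionary layout are hypothetical, and the task-phase check (r_task) is omitted because the text does not say how the phase is verified.

```python
from collections import defaultdict

def verify_beliefs(step, predicted, observed, pending):
    """Settle location/state predicates against the newest observation.

    predicted: {entity: believed_value} emitted at `step`
    observed:  {entity: actual_value} for entities the observation reveals
    pending:   {entity: (believed_value, origin_step)}, mutated in place
    Returns {origin_step: mean score in [-1, 1]} for the steps that get credit.
    """
    for entity, believed in predicted.items():
        pending[entity] = (believed, step)  # newest belief about an entity wins
    scores = defaultdict(list)
    for entity in list(pending):
        if entity not in observed:
            continue  # observability mask: unseen entities earn no reward or penalty yet
        believed, origin = pending.pop(entity)
        scores[origin].append(1.0 if observed[entity] == believed else -1.0)
    return {origin: sum(v) / len(v) for origin, v in scores.items()}

def keyword_reward(predicted_keywords, observation_text):
    """Fraction of predicted keywords that appear in the next observation."""
    if not predicted_keywords:
        return 0.0
    text = observation_text.lower()
    return sum(kw.lower() in text for kw in predicted_keywords) / len(predicted_keywords)

pending = {}
# Step 0: the apple is visible and confirmed; the mug is unseen and stays pending.
first = verify_beliefs(0, {"apple 1": "fridge 1", "mug 1": "cabinet 2"},
                       {"apple 1": "fridge 1"}, pending)
# Step 1: the mug turns up elsewhere, so step 0 is penalized retroactively.
second = verify_beliefs(1, {}, {"mug 1": "countertop 1"}, pending)
```

Keying the returned credit by the step that made the prediction is what makes the signal retroactive: the score lands on the belief that was wrong, not on the step where the evidence happened to arrive.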

Belief-Anchor Step Advantage

Step-level GRPO variants such as GiGPO group rollouts by observation hash. This collapses under partial observability (POMDPs): most groups become singletons, which nullifies the step-level advantage. ReBel instead groups by belief equivalence class: two steps share a group if the agent believes the same predicates at that moment. This yields semantically homogeneous comparison groups even when physical states never repeat.

  • Decouples cognition from action — isolates belief error from action quality
  • Resilient under sparse rewards — captures belief-maintenance differences even when all terminal rewards are zero
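
Belief-equivalence grouping can be sketched in a few lines. The code below is an assumption-laden illustration: the function name is hypothetical, a belief is represented as a set of predicate triples, and the mean/std normalization is borrowed from the usual group-relative recipe rather than confirmed as ReBel's exact formula.

```python
from collections import defaultdict
from statistics import mean, pstdev

def belief_anchor_advantages(steps, eps=1e-6):
    """steps: list of (belief_predicates, step_return) pairs pooled across rollouts.

    Steps whose predicate sets are identical form one comparison group;
    each step's advantage is its return normalized within that group.
    """
    groups = defaultdict(list)
    for i, (predicates, _) in enumerate(steps):
        groups[frozenset(predicates)].append(i)

    advantages = [0.0] * len(steps)
    for members in groups.values():
        if len(members) < 2:
            continue  # singleton group: no peers to compare with, advantage stays 0
        returns = [steps[i][1] for i in members]
        mu, sigma = mean(returns), pstdev(returns)
        for i in members:
            advantages[i] = (steps[i][1] - mu) / (sigma + eps)
    return advantages

steps = [
    ({("at", "apple 1", "agent")}, 1.0),     # same belief, better outcome
    ({("at", "apple 1", "agent")}, 0.0),     # same belief, worse outcome
    ({("at", "apple 1", "fridge 1")}, 0.5),  # different belief, lands in its own group
]
adv = belief_anchor_advantages(steps)
```

The first two steps are compared against each other because the agent believed the same thing in both, regardless of what the raw observation text looked like; the third has no peer and contributes no step-level signal.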

Overview of ReBel. Structured belief generation → consistency verification against observations → belief-anchor grouping for stable step-level advantage. Dense self-supervised signals replace sparse terminal-only feedback.

Strongest 1.5B Agent on Both Benchmarks

ReBel establishes a new performance frontier on ALFWorld and WebShop. All RL methods use Qwen2.5-1.5B-Instruct; mean ± std over 3 random seeds.

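Columns Pick through Overall are ALFWorld success rates (%) by task type; the last two columns are WebShop.

| Paradigm | Method | Pick | Look | Clean | Heat | Cool | Pick2 | Overall | WebShop Score | WebShop SR |
|---|---|---|---|---|---|---|---|---|---|---|
| Prompt | GPT-4o | 75.3 | 60.8 | 31.2 | 56.7 | 21.6 | 49.8 | 48.0 | 31.8 | 23.7 |
| Prompt | Gemini-2.5-Pro | 92.8 | 63.3 | 62.1 | 69.0 | 26.6 | 58.7 | 60.3 | 42.5 | 35.9 |
| Prompt | Qwen2.5 | 5.9 | 5.5 | 3.3 | 9.7 | 4.2 | 0.0 | 4.1 | 23.1 | 5.2 |
| Prompt | ReAct | 17.4 | 20.5 | 15.7 | 6.2 | 7.7 | 2.0 | 12.8 | 40.1 | 11.3 |
| Prompt | Reflexion | 35.3 | 22.2 | 21.7 | 13.6 | 19.4 | 3.7 | 21.8 | 55.8 | 21.9 |
| RL | PPO | 64.8 | 40.5 | 57.1 | 60.6 | 46.4 | 47.4 | 54.4 | 73.8 | 51.5 |
| RL | RLOO | 88.3 | 52.8 | 71.0 | 62.8 | 66.4 | 56.9 | 69.7 | 73.9 | 52.1 |
| RL | GRPO | 85.3 | 53.7 | 84.5 | 78.2 | 59.7 | 53.5 | 72.8 | 75.8 | 56.8 |
| RL | GiGPO w/std | 94.4 | 67.5 | 94.8 | 94.4 | 79.8 | 76.4 | 86.7 | 83.1 | 65.0 |
| RL | GiGPO w/o std | 96.0 | 76.5 | 91.8 | 91.3 | 71.7 | 79.5 | 86.1 | 83.5 | 67.4 |
| RL | ReBel (Ours) | 91.8 | 84.2 | 91.3 | 95.8 | 84.8 | 96.5 | 93.2 | 79.8 | 75.1 |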

Hardest Task: Pick2

96.5%: +37.8 pp over Gemini-2.5-Pro (58.7%), +17.0 pp over the best GiGPO variant (79.5%).

2.1× Faster Convergence

Matches GRPO's final SR at iteration 35 (vs. 100), without dense human annotations.

3.2× Shorter Trajectories

Average episode length drops from 29.9 to 9.2 steps — accurate world model → efficient plans.

Training dynamics and per-task performance. (a) ReBel reaches GRPO's terminal SR at iteration ~35. (b) Per-task success rates sorted by trajectory length; Δ = gain over GRPO. (c) ReBel's advantage grows with task difficulty, confirming belief-tracking value scales with partial-observability depth.

Ablation & Efficiency
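| Variant | Belief Prompt | Belief Grouping | Step Adv. | Belief Reward | ALFWorld SR | Δ |
|---|---|---|---|---|---|---|
| B0: GRPO | ✗ | ✗ | ✗ | ✗ | 60.9 | |
| B1: + Belief Prompt | ✓ | ✗ | ✗ | ✗ | 78.1 | +17.2 |
| B2: + Grouping & StepAdv | ✓ | ✓ | ✓ | ✗ | 93.0 | +14.9 |
| B3: ReBel (full) | ✓ | ✓ | ✓ | ✓ | 96.9 | +3.9 |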

B0 → B1: Explicit Representation (+17.2)

Making latent state tracking explicit already provides a large gain. But B1 still relies on high-variance trajectory-level rewards — insufficient for long horizons.

B1 → B2: Belief-Based Grouping (+14.9)

A further large gain on top of B1. Belief-anchor step advantage provides dense optimization signals that identify critical intermediate subgoals; this is the mechanism behind the 2.1× sample-efficiency gain.

B2 → B3: Belief-Consistency Reward (+3.9)

The belief-consistency reward (r_cons) anchors the policy to environmental ground truth, discouraging actions that happen to be correct but rest on hallucinated states. All three components are synergistic.

Grouping Quality Drives Everything

Grouping quality → credit assignment → efficiency. (a) ReBel maintains low singleton ratios; GiGPO's observation-hash grouping frequently collapses. (b) Average episode length drops 29.9 → 9.2 steps. (c) ReBel reaches 85% rollout success with 1.6× fewer environment steps, with smoother convergence and tighter confidence bands.

The Chain: Grouping → Credit → Efficiency

Better groups → more reliable advantages → shorter execution → faster convergence. One causal story.

Semantic vs. Surface

Observation-hash grouping conflates distinct latent states; belief-equivalence grouping captures what the agent thinks is true, not what it sees.
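
The difference shows up directly in the singleton ratio. The sketch below is illustrative only: it assumes the ratio is the fraction of groups with exactly one member (the text does not give the exact definition), uses raw observation strings as a stand-in for observation hashes, and uses made-up belief keys.

```python
from collections import Counter

def singleton_ratio(keys):
    """Fraction of groups that contain exactly one step (assumed definition)."""
    counts = Counter(keys)
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)

# Four steps from different rollouts. The surface text never repeats exactly,
# but the agent's belief about the apple takes only two distinct values.
observation_keys = [
    "You open the fridge 1. You see an apple 1.",
    "The fridge 1 is open. In it, you see an apple 1, a tomato 2.",
    "You pick up the apple 1 from the fridge 1.",
    "You are holding apple 1. You close the fridge 1.",
]
belief_keys = ["apple@fridge", "apple@fridge", "apple@agent", "apple@agent"]

obs_ratio = singleton_ratio(observation_keys)
belief_ratio = singleton_ratio(belief_keys)
```

In this toy case every observation-keyed group is a singleton, while the belief-keyed groups each hold two comparable steps.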

Smoother, Tighter Convergence

ReBel's training is not just faster but more stable — narrower variance reflects the regularizing effect of belief-consistency supervision.

LLMs as Nascent World Models

ReBel is not just a better RL algorithm. It is a concrete architectural argument for how LLM agents should be built — with an explicit, inspectable world model at their core.

A New Credit Assignment Paradigm

Standard RL treats the value function as a black box. ReBel shows that externalized, verifiable belief can carry the credit signal directly — bypassing the need to reconstruct latent state.

Toward Predictive World Models

The belief segment today is a structured snapshot. At scale, this loop could grow into a genuine predictive model — anticipating observations, detecting causal structure, planning over imagined trajectories.

Belief (World Model) → Think (Reasoning) → Action (Execution)

Long-horizon decision making is not won by faster actions, but by truer beliefs — credit must flow to what the agent knows, not merely to what it does.

— ReBel