ReBel — Rewarding Beliefs, Not Actions

The World Model Hypothesis

Every capable agent maintains a world model — a compressed, structured representation of the environment used to reason, anticipate, and plan. Standard LLM agents bury this model inside opaque hidden states where it is invisible to credit assignment and free to drift without consequence. ReBel externalizes the world model as a structured belief segment, making it a first-class object that can be supervised, verified against observations, and corrected when wrong. This is a minimal but concrete instantiation of the world-model loop.

Belief → Think → Action

The Failure: Belief Drift

In partially observable environments, agents infer latent state from incomplete observations. Small inference errors compound over 30+ steps into belief drift — the agent thinks it holds an apple, but its hands are empty. Delayed terminal rewards can't trace the failure back to the original misinference. Credit assignment collapses.
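A back-of-the-envelope illustration of why small errors compound. This is not from the ReBel paper: it assumes, purely for illustration, that each step's state inference is correct independently with probability p, and that one wrong inference is enough to corrupt the belief. The values of p and T are hypothetical.

```latex
\Pr[\text{belief still correct after } T \text{ steps}] = p^{T},
\qquad p = 0.98,\ T = 30 \;\Rightarrow\; 0.98^{30} \approx 0.55 .
```

Under these assumptions, a 2% per-step error rate leaves the belief intact in only about half of 30-step episodes, and a terminal reward alone does not say which step went wrong.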

The Fix: Belief as First-Class Variable

ReBel makes belief explicit, structured, and verifiable. At each step, the agent outputs a structured belief (object locations, states, task phase, predictions) alongside its reasoning. This belief is checked against subsequent observations — mismatches produce immediate, dense learning signals.
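A minimal sketch of what such a structured belief and its check could look like. The `Belief` class, its field names, and `location_mismatches` are hypothetical names chosen for illustration; the actual ReBel belief schema may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Belief:
    """Hypothetical structured belief emitted alongside the reasoning."""
    locations: Dict[str, str]            # object -> believed location
    states: Dict[str, str]               # object -> believed state
    task_phase: str                      # e.g. "find", "heat", "place"
    predicted_keywords: Tuple[str, ...]  # words expected in the next observation


def location_mismatches(belief: Belief, observed: Dict[str, str]) -> List[str]:
    """Return objects whose believed location contradicts the observation.

    Objects absent from `observed` are skipped: they cannot be verified yet.
    """
    return sorted(
        obj
        for obj, loc in observed.items()
        if obj in belief.locations and belief.locations[obj] != loc
    )


# The drift example from above: the agent believes it holds the apple,
# but the next observation shows the apple still on the countertop.
belief = Belief(
    locations={"apple 1": "inventory", "mug 1": "cabinet 2"},
    states={"apple 1": "cool"},
    task_phase="heat",
    predicted_keywords=("microwave",),
)
drifted = location_mismatches(belief, {"apple 1": "countertop 1"})
```

Here `drifted` contains only `"apple 1"`; the mug is not visible in the observation, so it is neither confirmed nor penalized.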

Belief drift vs consistent belief tracking

Belief inconsistency (left) vs. ReBel's consistent belief tracking (right). When the internal world model drifts, actions become invalid even with confident token probabilities. ReBel is designed to remove this failure mode.

Method

Two Mechanisms, One Principle

Reward the agent for maintaining an accurate world model, not merely for reaching the goal. ReBel converts sparse terminal rewards into dense process-level signals through two complementary mechanisms.

Belief-Consistency Reward

A dense, step-wise signal that verifies each predicted predicate against subsequent observations. Three components track object locations & states (r_state), task phase (r_task), and expected observation keywords (r_pred). Unverifiable predicates go into a pending buffer and receive credit retroactively when evidence arrives. Observability masking prevents penalizing the agent for what it cannot yet see.
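The pending buffer and observability masking can be sketched as follows. This is a simplified illustration, not the paper's implementation: it folds the three components into a single predicate dictionary, scores each verified predicate as +1 or -1, and ignores the possibility that the world changes while a predicate is pending. The function name and the data layout are assumptions.

```python
from typing import Dict, List, Tuple


def consistency_rewards(
    beliefs: List[Dict[str, str]],
    evidence: List[Dict[str, str]],
) -> List[float]:
    """Assign per-step belief-consistency credit.

    beliefs[t]  : predicates asserted at step t (predicate -> believed value)
    evidence[t] : predicates revealed by the observation that follows step t
    """
    rewards = [0.0] * len(beliefs)
    # Each pending entry is (origin step, predicate, believed value).
    pending: List[Tuple[int, str, str]] = []
    for t, (belief, seen) in enumerate(zip(beliefs, evidence)):
        pending.extend((t, pred, value) for pred, value in belief.items())
        still_pending = []
        for origin, pred, value in pending:
            if pred not in seen:
                # Observability mask: no evidence yet, so no reward and no penalty.
                still_pending.append((origin, pred, value))
            else:
                # Retroactive credit goes to the step that made the claim.
                rewards[origin] += 1.0 if seen[pred] == value else -1.0
        pending = still_pending
    return rewards


beliefs = [
    {"loc(apple)": "fridge"},
    {"loc(apple)": "fridge", "phase": "find"},
    {"holding": "apple"},
]
evidence = [
    {},                                          # nothing verifiable yet
    {"phase": "find"},                           # phase confirmed
    {"loc(apple)": "fridge", "holding": "none"}, # fridge opened, hands empty
]
rewards = consistency_rewards(beliefs, evidence)
```

In this toy episode `rewards` is `[1.0, 2.0, -1.0]`: the location claims from steps 0 and 1 are credited only once the fridge is observed at step 2, and the false "holding" claim is penalized at the step that made it.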

Belief-Anchor Step Advantage

Step-level GRPO variants such as GiGPO group rollouts by observation hash. In POMDPs this grouping collapses: most groups become singletons, which nullifies the step-level advantage. ReBel instead groups by belief equivalence class: two steps share a group if the agent believes the same predicates at that moment. This yields semantically homogeneous comparison groups even when physical states never repeat.

Decouples cognition from action — isolates belief error from action quality
Resilient under sparse rewards — captures belief-maintenance differences even when all terminal rewards are zero
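Belief-anchor grouping can be sketched in a few lines. The assumptions here are ours, not the paper's: belief equivalence is taken to be exact equality of the predicate set, the per-step score is a hypothetical "step return", and the advantage is mean-centered within the group without standard-deviation normalization.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, FrozenSet, List, Tuple

Step = Tuple[FrozenSet[str], float]  # (belief predicates, step return)


def belief_anchor_advantages(steps: List[Step]) -> List[float]:
    """Mean-centered advantage within each belief-equivalence group."""
    groups: Dict[FrozenSet[str], List[int]] = defaultdict(list)
    for i, (belief, _) in enumerate(steps):
        groups[belief].append(i)

    advantages = [0.0] * len(steps)
    for indices in groups.values():
        if len(indices) < 2:
            continue  # singleton group: nothing to compare against
        baseline = mean(steps[i][1] for i in indices)
        for i in indices:
            advantages[i] = steps[i][1] - baseline
    return advantages


# Two steps from different rollouts share the same belief even if their raw
# observations differ, so they land in one group; the third is a singleton.
steps = [
    (frozenset({"holding(apple)", "phase(heat)"}), 1.0),
    (frozenset({"phase(heat)", "holding(apple)"}), 0.0),
    (frozenset({"phase(find)"}), 0.5),
]
advantages = belief_anchor_advantages(steps)
```

Here `advantages` is `[0.5, -0.5, 0.0]`. Using a frozenset as the group key makes the comparison independent of predicate order; a looser notion of equivalence would need a different key.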

Overview of ReBel. Structured belief generation → consistency verification against observations → belief-anchor grouping for stable step-level advantage. Dense self-supervised signals replace sparse terminal-only feedback.

Results

Strongest 1.5B Agent on Both Benchmarks

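ReBel achieves the highest overall ALFWorld success rate (93.2) and the highest WebShop success rate (75.1) among all compared methods, while GiGPO keeps a higher WebShop score (83.5 vs. 79.8). All RL methods use Qwen2.5-1.5B-Instruct; mean ± std over 3 random seeds.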

Paradigm	Method	ALFWorld Pick	ALFWorld Look	ALFWorld Clean	ALFWorld Heat	ALFWorld Cool	ALFWorld Pick2	ALFWorld Overall	WebShop Score	WebShop SR
Closed-source Frontier Models (zero-shot)
Prompt	GPT-4o	75.3	60.8	31.2	56.7	21.6	49.8	48.0	31.8	23.7
Prompt	Gemini-2.5-Pro	92.8	63.3	62.1	69.0	26.6	58.7	60.3	42.5	35.9
Prompt-based Agents (Qwen2.5-1.5B-Instruct)
Prompt	Qwen2.5	5.9	5.5	3.3	9.7	4.2	0.0	4.1	23.1	5.2
Prompt	ReAct	17.4	20.5	15.7	6.2	7.7	2.0	12.8	40.1	11.3
Prompt	Reflexion	35.3	22.2	21.7	13.6	19.4	3.7	21.8	55.8	21.9
RL-trained Agents (Qwen2.5-1.5B-Instruct, 3 seeds)
RL	PPO	64.8	40.5	57.1	60.6	46.4	47.4	54.4	73.8	51.5
RL	RLOO	88.3	52.8	71.0	62.8	66.4	56.9	69.7	73.9	52.1
RL	GRPO	85.3	53.7	84.5	78.2	59.7	53.5	72.8	75.8	56.8
RL	GiGPO w/std	94.4	67.5	94.8	94.4	79.8	76.4	86.7	83.1	65.0
RL	GiGPO w/o std	96.0	76.5	91.8	91.3	71.7	79.5	86.1	83.5	67.4
RL	ReBel (Ours)	91.8	84.2	91.3	95.8	84.8	96.5	93.2	79.8	75.1

🎯

Hardest Task: Pick2

96.5%: +37.8 pp over Gemini-2.5-Pro (58.7), +17.0 pp over the best GiGPO variant (79.5).

⚡

2.1× Faster Convergence

Matches GRPO's final SR at iteration 35 (vs. 100), without dense human annotations.

📉

3.2× Shorter Trajectories

Average episode length drops from 29.9 to 9.2 steps — accurate world model → efficient plans.

Training dynamics and per-task performance. (a) ReBel reaches GRPO's terminal SR at iteration ~35. (b) Per-task success rates sorted by trajectory length; Δ = gain over GRPO. (c) ReBel's advantage grows with task difficulty, confirming belief-tracking value scales with partial-observability depth.

Analysis

Ablation & Efficiency

Variant	Belief Prompt	Belief Grouping	Step Adv.	Belief Reward	ALFWorld SR	Δ
B0: GRPO	—	—	—	—	60.9	—
B1: + Belief Prompt	✓	—	—	—	78.1	+17.2
B2: + Grouping & StepAdv	✓	✓	✓	—	93.0	+14.9
B3: ReBel (full)	✓	✓	✓	✓	96.9	+3.9
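The Δ column can be re-derived from the success rates in the table, along with each stage's share of the total improvement over B0. The helper name `stepwise_gains` is ours; the numbers are copied from the table above.

```python
from typing import List, Tuple

# (variant, ALFWorld SR) copied from the ablation table.
ablation: List[Tuple[str, float]] = [
    ("B0: GRPO", 60.9),
    ("B1: + Belief Prompt", 78.1),
    ("B2: + Grouping & StepAdv", 93.0),
    ("B3: ReBel (full)", 96.9),
]


def stepwise_gains(rows: List[Tuple[str, float]]) -> List[Tuple[str, float, float]]:
    """Return (variant, gain in pp, share of total gain in %) for each stage."""
    total = rows[-1][1] - rows[0][1]
    result = []
    for (_, previous), (name, current) in zip(rows, rows[1:]):
        gain = current - previous
        result.append((name, round(gain, 1), round(100 * gain / total, 1)))
    return result


for name, gain, share in stepwise_gains(ablation):
    print(f"{name}: +{gain} pp ({share}% of the total gain)")
```

The three stages add 17.2, 14.9, and 3.9 pp, which is roughly 47.8%, 41.4%, and 10.8% of the 36.0 pp total.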

B0 → B1: Explicit Representation (+17.2)

Making latent state tracking explicit already provides a large gain. But B1 still relies on high-variance trajectory-level rewards — insufficient for long horizons.

B1 → B2: Belief-Based Grouping (+14.9)

The second-largest gain in the ablation, after explicit representation. Belief-anchor step advantage provides dense optimization signals that identify critical intermediate subgoals, which is the mechanism behind the 2.1× sample-efficiency gain.

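B2 → B3: Belief-Consistency Reward (+3.9)

The belief-consistency reward (r_cons, built from r_state, r_task, and r_pred) anchors the policy to environmental ground truth, so the agent is not rewarded for actions that happen to succeed while its belief state is hallucinated. All three components are synergistic.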

Efficiency

Grouping Quality Drives Everything

Grouping quality → credit assignment → efficiency. (a) ReBel maintains low singleton ratios; GiGPO's observation-hash grouping frequently collapses. (b) Average episode length drops 29.9 → 9.2 steps. (c) ReBel reaches 85% rollout success with 1.6× fewer environment steps, with smoother convergence and tighter confidence bands.
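One way to measure grouping quality is the singleton ratio. The sketch below assumes it is defined as the fraction of groups that contain exactly one step; the paper may define it differently (for example, as the fraction of steps). The keys stand in for either observation hashes or belief-equivalence keys and are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable


def singleton_ratio(keys: Iterable[Hashable]) -> float:
    """Fraction of groups with exactly one member (0.0 for empty input)."""
    counts = Counter(keys)
    if not counts:
        return 0.0
    singletons = sum(1 for size in counts.values() if size == 1)
    return singletons / len(counts)


# Four steps whose raw observations never repeat: every group is a singleton.
observation_keys = ["obs_a", "obs_b", "obs_c", "obs_d"]
# The same four steps keyed by belief: two groups of two, no singletons.
belief_keys = ["belief_1", "belief_1", "belief_2", "belief_2"]

obs_ratio = singleton_ratio(observation_keys)
belief_ratio = singleton_ratio(belief_keys)
```

In this toy case `obs_ratio` is 1.0 and `belief_ratio` is 0.0. A group of one has no within-group baseline, so a high singleton ratio means most steps receive no step-level advantage.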

🔗

The Chain: Grouping → Credit → Efficiency

Better groups → more reliable advantages → shorter execution → faster convergence. One causal story.

🏗️

Semantic vs. Surface

Observation-hash grouping conflates distinct latent states; belief-equivalence grouping captures what the agent thinks is true, not what it sees.

📊

Smoother, Tighter Convergence

ReBel's training is not just faster but more stable — narrower variance reflects the regularizing effect of belief-consistency supervision.

The Bigger Picture

LLMs as Nascent World Models

ReBel is not just a better RL algorithm. It is a concrete architectural argument for how LLM agents should be built — with an explicit, inspectable world model at their core.

A New Credit Assignment Paradigm

Standard RL treats the value function as a black box. ReBel shows that externalized, verifiable belief can carry the credit signal directly — bypassing the need to reconstruct latent state.

Toward Predictive World Models

The belief segment today is a structured snapshot. At scale, this loop could grow into a genuine predictive model — anticipating observations, detecting causal structure, planning over imagined trajectories.

Belief (World Model) → Think (Reasoning) → Action (Execution)