
RFC 004: Add delayed rewards support for trajectory-based scoring#337

Merged

Darktex merged 1 commit into main from rfc-004-delayed-rewards on Jan 29, 2026

Conversation

Darktex (Contributor) commented Jan 28, 2026

Summary

Extends RFC 004 to address Issue #107: the per-step `forward(action, obs)` API doesn't support delayed rewards, where the score depends on future events.

Examples of delayed reward scenarios:

  • Cursor Plan Mode: Reward for writing a plan depends on later execution success
  • Codenames: Spymaster's clue quality depends on Operative's subsequent guesses
  • Chess: Win/loss only known at game end, needs discounting back to earlier moves

Key Additions to RFC 004

Self-Accumulating TrajectoryRubric

Since OpenEnv doesn't batch (one env = one trajectory), the rubric itself accumulates the trajectory internally (a minimal sketch follows the list):

  1. TrajectoryRubric.__call__(action, obs) records step internally
  2. Returns 0.0 (or configurable intermediate reward) until obs.done=True
  3. On done, computes final score from accumulated trajectory
  4. reset() clears the internal buffer
  5. Composes naturally with Sequential, RubricDict, etc.
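
A minimal sketch of what this base class could look like, following the terminology in the summary above (the actual class in rfcs/004-rubrics.md may differ in detail):

```python
from typing import Any


class TrajectoryRubric:
    """Sketch: accumulates (action, obs) pairs, scores the trajectory on done."""

    def __init__(self, intermediate_reward: float = 0.0):
        self.intermediate_reward = intermediate_reward
        self._trajectory: list[tuple[Any, Any]] = []

    def __call__(self, action: Any, obs: Any) -> float:
        # Record the step internally.
        self._trajectory.append((action, obs))
        if not obs.done:
            # Configurable intermediate reward until the episode ends.
            return self.intermediate_reward
        # On done, compute the final score from the accumulated trajectory.
        return self.score_trajectory(self._trajectory)

    def score_trajectory(self, trajectory: list[tuple[Any, Any]]) -> float:
        raise NotImplementedError  # subclasses define the trajectory-level score

    def reset(self) -> None:
        # Called by the environment's reset(); clears the internal buffer.
        self._trajectory = []
```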

ExponentialDiscountingTrajectoryRubric

Standard gamma-based discounting: `r_t = gamma^(T-1-t) * R_final`
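
A sketch of how the discounting could sit on top of the base class sketched above; `final_reward` is a hypothetical hook for the problem-specific terminal signal:

```python
class ExponentialDiscountingTrajectoryRubric(TrajectoryRubric):
    """Sketch: discounts a single terminal reward back to earlier steps."""

    def __init__(self, gamma: float = 0.99, **kwargs):
        super().__init__(**kwargs)
        self.gamma = gamma

    def final_reward(self, trajectory) -> float:
        # Problem-specific terminal signal, e.g. +1 win / -1 loss in chess.
        raise NotImplementedError

    def score_trajectory(self, trajectory) -> float:
        return self.final_reward(trajectory)

    def compute_step_rewards(self) -> list[float]:
        # r_t = gamma^(T-1-t) * R_final: the last step gets full credit,
        # earlier steps geometrically less.
        T = len(self._trajectory)
        R_final = self.final_reward(self._trajectory)
        return [self.gamma ** (T - 1 - t) * R_final for t in range(T)]
```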

Memory Model

Trajectories are stored CPU-only to avoid GPU memory pressure. Environments with GPU tensors must move them to CPU before returning from step().
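
For illustration, a small helper along these lines would do the move, assuming observations are dicts that may carry torch tensors (the field layout is an assumption, not part of the RFC):

```python
import torch


def detach_to_cpu(obs: dict) -> dict:
    """Sketch: move any tensor fields to CPU before step() returns them,
    so buffered trajectories don't pin GPU memory for a whole episode."""
    return {
        k: v.detach().cpu() if isinstance(v, torch.Tensor) else v
        for k, v in obs.items()
    }
```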

Examples

  • Chess with win/loss and temporal discounting
  • Cursor Plan Mode with custom credit assignment
  • Codenames mixing per-step and trajectory rewards (composition sketched below)
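
As an illustration of the mixed case, here is a sketch of combining a per-step shaping rubric with a trajectory-level outcome rubric. WeightedSum is named in the RFC's container list, but this signature, and the Codenames field names, are assumptions:

```python
class CluePenaltyRubric:
    """Hypothetical per-step rubric: small penalty for long clues."""

    def __call__(self, action, obs) -> float:
        return -0.01 * len(action.clue.split())

    def reset(self) -> None:
        pass


class CodenamesOutcomeRubric(ExponentialDiscountingTrajectoryRubric):
    """Hypothetical trajectory rubric: win/loss at game end, discounted."""

    def final_reward(self, trajectory) -> float:
        _, last_obs = trajectory[-1]
        return 1.0 if last_obs.we_won else -1.0


class WeightedSum:
    """Sketch of a combining container; the framework's actual API may differ."""

    def __init__(self, rubrics: dict, weights: dict):
        self.rubrics, self.weights = rubrics, weights

    def __call__(self, action, obs) -> float:
        return sum(w * self.rubrics[name](action, obs)
                   for name, w in self.weights.items())

    def reset(self) -> None:
        for r in self.rubrics.values():
            r.reset()


rubric = WeightedSum(
    rubrics={"shaping": CluePenaltyRubric(),
             "outcome": CodenamesOutcomeRubric(gamma=0.95)},
    weights={"shaping": 1.0, "outcome": 1.0},
)
```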

Test Plan

  • Review RFC additions for clarity and completeness
  • Verify examples are correct and representative
  • Check consistency with existing RFC 004 content

Resolves: #107

Extends RFC 004 to address Issue #107: the per-step `forward(action, obs)`
API doesn't support delayed rewards where score depends on future events.

Key additions:
- TrajectoryRubric base class that accumulates (action, obs) pairs
- ExponentialDiscountingTrajectoryRubric with gamma-based credit assignment
- CPU-only memory model to avoid GPU pressure
- Examples: Chess (win/loss), Cursor Plan Mode, Codenames
- Environment integration and training loop patterns

Design insight: Since OpenEnv doesn't batch (one env = one trajectory),
the rubric itself accumulates the trajectory internally. No separate
trajectory buffer needed.

Resolves: #107
meta-cla Bot added the CLA Signed label (managed by the Meta Open Source bot) on Jan 28, 2026
greptile-apps Bot commented Jan 28, 2026

Greptile Overview

Greptile Summary

This PR extends RFC 004 to add delayed rewards support through a TrajectoryRubric abstraction that accumulates trajectory state internally and computes final scores when episodes complete.

Key additions:

  • TrajectoryRubric base class with self-accumulating pattern (no separate trajectory buffer needed)
  • ExponentialDiscountingTrajectoryRubric for standard gamma-based credit assignment
  • CPU-only memory model to avoid GPU pressure
  • Three comprehensive examples: Chess (win/loss with discounting), Cursor Plan Mode (custom credit assignment), and Codenames (mixed per-step + trajectory)
  • Natural composition with existing containers (Sequential, RubricDict, WeightedSum)

Design highlights:

  • Leverages "one env = one trajectory" principle from RFC 004
  • Maintains "rewards inside environment" principle from RFC 002
  • Environments call rubric.reset() during env.reset(); agents never access this
  • Returns intermediate rewards (default 0.0) until obs.done=True, then computes final trajectory score
  • Training loops can optionally retrieve per-step rewards via compute_step_rewards() for gradient computation (sketched below)

The RFC additions are well-structured, include clear examples, and fit naturally into the existing Rubric framework.
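
A sketch of that training-loop pattern, assuming the env exposes the rubric and returns observations carrying reward and done flags (the policy, env, and client calls here are hypothetical; OpenEnv's actual interfaces may differ):

```python
def collect_episode(env, rubric, policy):
    """Sketch: roll out one episode, then pull per-step credit for gradients."""
    obs = env.reset()            # env internally calls rubric.reset()
    steps = []
    while not obs.done:
        action = policy(obs)
        obs = env.step(action)   # env internally calls rubric(action, obs)
        steps.append((action, obs, obs.reward))  # reward stays 0.0 until done
    # After done, optionally replace the sparse terminal reward with
    # discounted per-step rewards for gradient computation.
    step_rewards = rubric.compute_step_rewards()
    return steps, step_rewards
```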

Confidence Score: 5/5

  • This RFC extension is safe to merge - it's a well-designed documentation addition with no code changes
  • Perfect score because this is an RFC document addition with comprehensive examples, clear motivation, proper alignment with existing principles, and no implementation code that could introduce bugs
  • No files require special attention - this is a single RFC document with clear, well-structured additions

Important Files Changed

| Filename | Overview |
| --- | --- |
| rfcs/004-rubrics.md | Added comprehensive delayed rewards section with TrajectoryRubric base class, exponential discounting implementation, and three practical examples (Chess, Cursor Plan Mode, Codenames) |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Agent
    participant Env as Environment
    participant TR as TrajectoryRubric
    participant Buffer as Internal Trajectory Buffer

    Note over Env,TR: Episode Start
    Agent->>Env: reset()
    Env->>TR: reset()
    TR->>Buffer: Clear trajectory []
    Env-->>Agent: initial_observation

    Note over Env,TR: During Episode (Step 1)
    Agent->>Env: step(action_1)
    Env->>TR: __call__(action_1, obs_1)
    TR->>Buffer: append((action_1, obs_1))
    Note over TR: obs_1.done = False
    TR-->>Env: return 0.0 (intermediate_reward)
    Env-->>Agent: obs_1 (reward=0.0)

    Note over Env,TR: During Episode (Step 2)
    Agent->>Env: step(action_2)
    Env->>TR: __call__(action_2, obs_2)
    TR->>Buffer: append((action_2, obs_2))
    Note over TR: obs_2.done = False
    TR-->>Env: return 0.0 (intermediate_reward)
    Env-->>Agent: obs_2 (reward=0.0)

    Note over Env,TR: Final Step (obs.done=True)
    Agent->>Env: step(action_T)
    Env->>TR: __call__(action_T, obs_T)
    TR->>Buffer: append((action_T, obs_T))
    Note over TR: obs_T.done = True
    TR->>TR: score_trajectory(buffer)
    Note over TR: Compute final score from<br/>full trajectory
    TR-->>Env: return final_score
    Env-->>Agent: obs_T (reward=final_score)

    Note over Agent,Buffer: Credit Assignment (optional)
    Agent->>Env: rubric.compute_step_rewards()
    TR->>TR: Apply discounting strategy<br/>r_t = gamma^(T-1-t) * R_final
    TR-->>Agent: [r_0, r_1, ..., r_T]

    Note over Agent,Buffer: Next Episode
    Agent->>Env: reset()
    Env->>TR: reset()
    TR->>Buffer: Clear trajectory []
```

Darktex commented Jan 29, 2026

Merging this quickly so we have a complete RFC

Darktex merged commit ae45c2e into main on Jan 29, 2026
5 checks passed
1 participant