
RFC 004: Add delayed rewards support for trajectory-based scoring#337

Merged

Darktex merged 1 commit into main from rfc-004-delayed-rewards on Jan 29, 2026

Conversation

Darktex (Contributor) commented Jan 28, 2026

Summary

Extends RFC 004 to address Issue #107: the per-step `forward(action, obs)` API doesn't support delayed rewards, where the score depends on future events.

Examples of delayed reward scenarios:

  • Cursor Plan Mode: Reward for writing a plan depends on later execution success
  • Codenames: Spymaster's clue quality depends on Operative's subsequent guesses
  • Chess: Win/loss only known at game end, needs discounting back to earlier moves

Key Additions to RFC 004

Self-Accumulating TrajectoryRubric

Since OpenEnv doesn't batch (one env = one trajectory), the rubric itself accumulates the trajectory internally (a minimal sketch follows the list):

  1. TrajectoryRubric.__call__(action, obs) records step internally
  2. Returns 0.0 (or configurable intermediate reward) until obs.done=True
  3. On done, computes final score from accumulated trajectory
  4. reset() clears the internal buffer
  5. Composes naturally with Sequential, RubricDict, etc.
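
A minimal sketch of what this base class could look like, following the terminology in the summary above (the actual class in rfcs/004-rubrics.md may differ in detail):

```python
from typing import Any


class TrajectoryRubric:
    """Sketch: accumulates (action, obs) pairs, scores the trajectory on done."""

    def __init__(self, intermediate_reward: float = 0.0):
        self.intermediate_reward = intermediate_reward
        self._trajectory: list[tuple[Any, Any]] = []

    def __call__(self, action: Any, obs: Any) -> float:
        # Record the step internally.
        self._trajectory.append((action, obs))
        if not obs.done:
            # Configurable intermediate reward until the episode ends.
            return self.intermediate_reward
        # On done, compute the final score from the accumulated trajectory.
        return self.score_trajectory(self._trajectory)

    def score_trajectory(self, trajectory: list[tuple[Any, Any]]) -> float:
        raise NotImplementedError  # subclasses define the trajectory-level score

    def reset(self) -> None:
        # Called by the environment's reset(); clears the internal buffer.
        self._trajectory = []
```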

ExponentialDiscountingTrajectoryRubric

Standard gamma-based discounting: `r_t = gamma^(T-1-t) * R_final`
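
A sketch of how the discounting could sit on top of the base class sketched above; `final_reward` is a hypothetical hook for the problem-specific terminal signal:

```python
class ExponentialDiscountingTrajectoryRubric(TrajectoryRubric):
    """Sketch: discounts a single terminal reward back to earlier steps."""

    def __init__(self, gamma: float = 0.99, **kwargs):
        super().__init__(**kwargs)
        self.gamma = gamma

    def final_reward(self, trajectory) -> float:
        # Problem-specific terminal signal, e.g. +1 win / -1 loss in chess.
        raise NotImplementedError

    def score_trajectory(self, trajectory) -> float:
        return self.final_reward(trajectory)

    def compute_step_rewards(self) -> list[float]:
        # r_t = gamma^(T-1-t) * R_final: the last step gets full credit,
        # earlier steps geometrically less.
        T = len(self._trajectory)
        R_final = self.final_reward(self._trajectory)
        return [self.gamma ** (T - 1 - t) * R_final for t in range(T)]
```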

Memory Model

Trajectories are stored CPU-only to avoid GPU memory pressure. Environments with GPU tensors must move them to CPU before returning from step().
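
For illustration, a small helper along these lines would do the move, assuming observations are dicts that may carry torch tensors (the field layout is an assumption, not part of the RFC):

```python
import torch


def detach_to_cpu(obs: dict) -> dict:
    """Sketch: move any tensor fields to CPU before step() returns them,
    so buffered trajectories don't pin GPU memory for a whole episode."""
    return {
        k: v.detach().cpu() if isinstance(v, torch.Tensor) else v
        for k, v in obs.items()
    }
```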

Examples

  • Chess with win/loss and temporal discounting
  • Cursor Plan Mode with custom credit assignment
  • Codenames mixing per-step and trajectory rewards (composition sketched below)
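
As an illustration of the mixed case, here is a sketch of combining a per-step shaping rubric with a trajectory-level outcome rubric. WeightedSum is named in the RFC's container list, but this signature, and the Codenames field names, are assumptions:

```python
class CluePenaltyRubric:
    """Hypothetical per-step rubric: small penalty for long clues."""

    def __call__(self, action, obs) -> float:
        return -0.01 * len(action.clue.split())

    def reset(self) -> None:
        pass


class CodenamesOutcomeRubric(ExponentialDiscountingTrajectoryRubric):
    """Hypothetical trajectory rubric: win/loss at game end, discounted."""

    def final_reward(self, trajectory) -> float:
        _, last_obs = trajectory[-1]
        return 1.0 if last_obs.we_won else -1.0


class WeightedSum:
    """Sketch of a combining container; the framework's actual API may differ."""

    def __init__(self, rubrics: dict, weights: dict):
        self.rubrics, self.weights = rubrics, weights

    def __call__(self, action, obs) -> float:
        return sum(w * self.rubrics[name](action, obs)
                   for name, w in self.weights.items())

    def reset(self) -> None:
        for r in self.rubrics.values():
            r.reset()


rubric = WeightedSum(
    rubrics={"shaping": CluePenaltyRubric(),
             "outcome": CodenamesOutcomeRubric(gamma=0.95)},
    weights={"shaping": 1.0, "outcome": 1.0},
)
```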

Test Plan

  • Review RFC additions for clarity and completeness
  • Verify examples are correct and representative
  • Check consistency with existing RFC 004 content

Resolves: #107

Extends RFC 004 to address Issue #107: the per-step `forward(action, obs)`
API doesn't support delayed rewards where score depends on future events.

Key additions:
- TrajectoryRubric base class that accumulates (action, obs) pairs
- ExponentialDiscountingTrajectoryRubric with gamma-based credit assignment
- CPU-only memory model to avoid GPU pressure
- Examples: Chess (win/loss), Cursor Plan Mode, Codenames
- Environment integration and training loop patterns

Design insight: Since OpenEnv doesn't batch (one env = one trajectory),
the rubric itself accumulates the trajectory internally. No separate
trajectory buffer needed.

Resolves: #107
meta-cla Bot added the CLA Signed label (managed by the Meta Open Source bot) on Jan 28, 2026
greptile-apps Bot commented Jan 28, 2026

Greptile Overview

Greptile Summary

This PR extends RFC 004 to add delayed rewards support through a TrajectoryRubric abstraction that accumulates trajectory state internally and computes final scores when episodes complete.

Key additions:

  • TrajectoryRubric base class with self-accumulating pattern (no separate trajectory buffer needed)
  • ExponentialDiscountingTrajectoryRubric for standard gamma-based credit assignment
  • CPU-only memory model to avoid GPU pressure
  • Three comprehensive examples: Chess (win/loss with discounting), Cursor Plan Mode (custom credit assignment), and Codenames (mixed per-step + trajectory)
  • Natural composition with existing containers (Sequential, RubricDict, WeightedSum)

Design highlights:

  • Leverages "one env = one trajectory" principle from RFC 004
  • Maintains "rewards inside environment" principle from RFC 002
  • Environments call rubric.reset() during env.reset(); agents never access this
  • Returns intermediate rewards (default 0.0) until obs.done=True, then computes final trajectory score
  • Training loops can optionally retrieve per-step rewards via compute_step_rewards() for gradient computation (sketched below)

The RFC additions are well-structured, include clear examples, and fit naturally into the existing Rubric framework.
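
A sketch of that training-loop pattern, assuming the env exposes the rubric and returns observations carrying reward and done flags (the policy, env, and client calls here are hypothetical; OpenEnv's actual interfaces may differ):

```python
def collect_episode(env, rubric, policy):
    """Sketch: roll out one episode, then pull per-step credit for gradients."""
    obs = env.reset()            # env internally calls rubric.reset()
    steps = []
    while not obs.done:
        action = policy(obs)
        obs = env.step(action)   # env internally calls rubric(action, obs)
        steps.append((action, obs, obs.reward))  # reward stays 0.0 until done
    # After done, optionally replace the sparse terminal reward with
    # discounted per-step rewards for gradient computation.
    step_rewards = rubric.compute_step_rewards()
    return steps, step_rewards
```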

Confidence Score: 5/5

  • This RFC extension is safe to merge - it's a well-designed documentation addition with no code changes
  • Perfect score because this is an RFC document addition with comprehensive examples, clear motivation, proper alignment with existing principles, and no implementation code that could introduce bugs
  • No files require special attention - this is a single RFC document with clear, well-structured additions

Important Files Changed

| Filename | Overview |
| --- | --- |
| rfcs/004-rubrics.md | Added comprehensive delayed rewards section with TrajectoryRubric base class, exponential discounting implementation, and three practical examples (Chess, Cursor Plan Mode, Codenames) |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Agent
    participant Env as Environment
    participant TR as TrajectoryRubric
    participant Buffer as Internal Trajectory Buffer

    Note over Env,TR: Episode Start
    Agent->>Env: reset()
    Env->>TR: reset()
    TR->>Buffer: Clear trajectory []
    Env-->>Agent: initial_observation

    Note over Env,TR: During Episode (Step 1)
    Agent->>Env: step(action_1)
    Env->>TR: __call__(action_1, obs_1)
    TR->>Buffer: append((action_1, obs_1))
    Note over TR: obs_1.done = False
    TR-->>Env: return 0.0 (intermediate_reward)
    Env-->>Agent: obs_1 (reward=0.0)

    Note over Env,TR: During Episode (Step 2)
    Agent->>Env: step(action_2)
    Env->>TR: __call__(action_2, obs_2)
    TR->>Buffer: append((action_2, obs_2))
    Note over TR: obs_2.done = False
    TR-->>Env: return 0.0 (intermediate_reward)
    Env-->>Agent: obs_2 (reward=0.0)

    Note over Env,TR: Final Step (obs.done=True)
    Agent->>Env: step(action_T)
    Env->>TR: __call__(action_T, obs_T)
    TR->>Buffer: append((action_T, obs_T))
    Note over TR: obs_T.done = True
    TR->>TR: score_trajectory(buffer)
    Note over TR: Compute final score from<br/>full trajectory
    TR-->>Env: return final_score
    Env-->>Agent: obs_T (reward=final_score)

    Note over Agent,Buffer: Credit Assignment (optional)
    Agent->>Env: rubric.compute_step_rewards()
    TR->>TR: Apply discounting strategy<br/>r_t = gamma^(T-1-t) * R_final
    TR-->>Agent: [r_0, r_1, ..., r_T]

    Note over Agent,Buffer: Next Episode
    Agent->>Env: reset()
    Env->>TR: reset()
    TR->>Buffer: Clear trajectory []
```

Darktex commented Jan 29, 2026

Merging this quickly so we have a complete RFC

Darktex merged commit ae45c2e into main on Jan 29, 2026
5 checks passed
1 participant