AnthropicProvider: expose cache_control for prompt caching on system/tools #57

@johnnichev

Description

Problem

Anthropic's prompt-caching API requires cache_control: {"type": "ephemeral"} markers on system blocks, tool entries, and messages to hit the 5-minute cache, which gives a ~90% discount on cached input tokens (plus better TTFT). Today, selectools.providers.anthropic_provider.AnthropicProvider sends:

# anthropic_provider.py:166-174 (sync complete) and :503-509 (acomplete)
request_args = {
    "model": model_name,
    "system": system_prompt,  # plain string — no cache_control possible
    "messages": payload,
    # ... other args elided ...
}
if tools:
    request_args["tools"] = [self._map_tool_to_anthropic(t) for t in tools]

# anthropic_provider.py:474-481
def _map_tool_to_anthropic(self, tool: Tool) -> Dict[str, Any]:
    schema = ...  # tool-schema extraction elided in this excerpt
    return {
        "name": schema["name"],
        "description": schema["description"],
        "input_schema": schema["parameters"],
        # no cache_control hook
    }

The installed anthropic SDK is 0.95.0, which is new enough to support caching; the gap is purely in the wrapper.

Impact for consumers

A real-world profile from Sheriff (a WhatsApp finance agent on Haiku 4.5):

  • Static system instructions: ~500 tokens
  • Category list: ~100-500 tokens (stable within session)
  • Tool schemas (40 tools × ~75 tokens): ~3000 tokens
  • Per-turn history/user message: ~250-1050 tokens

That's ~3,500-4,000 cacheable tokens per call. Any user sending a second WhatsApp message within 5 minutes would hit the cache and pay 10% of the normal input price on those tokens; that is material at scale on Haiku, and the TTFT win is independent of cost.
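For a rough sense of the arithmetic, here is a back-of-the-envelope comparison. The per-token price below is a deliberate placeholder, not Anthropic's actual Haiku rate; the ~10% cache-read rate is the figure cited above:

# Back-of-the-envelope cost for one cache-hit turn vs. an uncached turn.
PRICE_PER_MTOK = 1.00   # PLACEHOLDER $/1M input tokens; substitute the real Haiku rate
CACHEABLE = 3_500       # stable system prompt + category list + tool schemas
DYNAMIC = 650           # midpoint of the per-turn history/user range above

uncached = (CACHEABLE + DYNAMIC) / 1e6 * PRICE_PER_MTOK
cached = (0.10 * CACHEABLE + DYNAMIC) / 1e6 * PRICE_PER_MTOK  # cache reads bill at ~10%
print(f"uncached ${uncached:.6f} vs cached ${cached:.6f} "
      f"({1 - cached / uncached:.0%} cheaper on input)")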

Proposal

Add caching opt-ins to AnthropicProvider.__init__:

class AnthropicProvider:
    def __init__(
        self,
        api_key: str,
        default_model: str,
        # ... existing params unchanged ...
        cache_system: bool = False,  # NEW
        cache_tools: bool = False,   # NEW
    ): ...
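Call-site usage would then be a one-line opt-in (the model id here is illustrative):

import os

provider = AnthropicProvider(
    api_key=os.environ["ANTHROPIC_API_KEY"],
    default_model="claude-haiku-4-5",  # illustrative model id
    cache_system=True,
    cache_tools=True,
)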

When cache_system=True, wrap the system string in block form with a single cache_control: ephemeral marker. When cache_tools=True, add cache_control: ephemeral to the last tool in the outgoing schema list (Anthropic applies the marker to everything up to and including that entry).
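A minimal sketch of the request construction under those flags (the helper name _build_request_args is hypothetical; the block shape for system follows Anthropic's documented content-block format):

from typing import Any, Dict, List, Optional

CACHE_MARKER = {"type": "ephemeral"}

# inside AnthropicProvider:
def _build_request_args(self, model_name: str, system_prompt: str,
                        payload: List[Dict[str, Any]],
                        tools: Optional[List["Tool"]]) -> Dict[str, Any]:
    request_args: Dict[str, Any] = {"model": model_name, "messages": payload}
    if self.cache_system:
        # Block form is required to attach cache_control to the system prompt.
        request_args["system"] = [
            {"type": "text", "text": system_prompt, "cache_control": CACHE_MARKER}
        ]
    else:
        request_args["system"] = system_prompt  # current plain-string behavior
    if tools:
        mapped = [self._map_tool_to_anthropic(t) for t in tools]
        if self.cache_tools:
            # One marker on the last entry caches the entire tools list,
            # since Anthropic caches everything up to and including it.
            mapped[-1]["cache_control"] = CACHE_MARKER
        request_args["tools"] = mapped
    return request_args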

Surface response.usage.cache_creation_input_tokens and cache_read_input_tokens on UsageStats so consumers can observe hit rate.
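On the response side, the change could be as small as two optional fields; the dataclass shape below is an assumption about selectools' internals, and the field names mirror the SDK's usage object:

from dataclasses import dataclass
from typing import Optional

@dataclass
class UsageStats:
    input_tokens: int
    output_tokens: int
    # NEW: copied from response.usage when the SDK reports them,
    # e.g. usage.cache_read_input_tokens on a cache hit.
    cache_creation_input_tokens: Optional[int] = None
    cache_read_input_tokens: Optional[int] = None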

Why flags instead of automatic

Caching has minimum-token requirements (1024 for Sonnet/Opus, historically 2048 for Haiku; check the current docs before implementing) and a write surcharge on the first turn. Forcing it on would regress short-prompt users. Opt-in keeps the default behavior unchanged.

Scope

  • complete, acomplete, and stream paths.
  • _format_messages stays unchanged; this is purely about the top-level system and tools fields.
  • Response parsing stays unchanged except for two new optional fields on UsageStats.

Sheriff context

Filed from the Sheriff project (~/projects/sheriff) as a follow-up to the 2026-04-16 agentic-layer investigation. Full writeup: vault note prompt-caching-investigation.md in the Clovis vault. Happy to open a PR if the API shape looks right.
