Skip to content

fastCallDepth leaks through runWithCatch / evalWithCatch on caught panic #2529

@borisbat

Description

@borisbat

Symptom

After runWithCatch (or evalWithCatch) catches a panic that originated inside a fastcall path, Context::fastCallDepth is incremented but never decremented — the -- after the inner eval() is skipped by the throw/longjmp. Repeated caught-and-swallowed panics from the fastcall path grow fastCallDepth monotonically and eventually trigger a false "stack overflow, fast call depth limit exceeded" error from the bounds check at simulate.h:599 / simulate_nodes.h:1023.

Root cause

Both fastcall eval paths use a bare ++ / eval / -- sequence with no try/finally and no RAII guard:

Context::callOrFastcall (simulate.h):

fastCallDepth ++;
auto aa = abiArg;
abiArg = args;
result = fn->code->eval(*this);   // throw/longjmp here skips the decrement
stopFlags = 0;
abiArg = aa;
fastCallDepth --;

SimNode_FastCall<argCount>::eval (simulate_nodes.h, plus the typed EVAL_NODE macro variant): same shape.

The catch sites only restore some of the per-call state:

runWithCatch and evalWithCatch snapshot/restore abiArg / abiCMRES / abiThisBlockArg and pop the stack watermark on the catch path — but do not snapshot/restore fastCallDepth. The same omission exists in the four in-VM try_recover paths.

Context::restart() does zero fastCallDepth, but it asserts insideContext == 0, so it can only be used from outside an active eval — not from a callback / hook / surface-and-forget binding that's already locked.

Why this is more visible after #2526

The new Context::clearException() helper added in #2526 makes the leak more visible because surface-and-forget bindings (e.g. dasAudio mixer callback) now keep the same context alive across many caught panics instead of restart()-ing it. Pre-#2526 the typical path was either crash-out or restart(), both of which masked the leak. clearException() itself does NOT touch fastCallDepth (early review proposed it, but that would underflow uint32_t if called inside an active fastcall).

Two fix shapes

A. Catch-site fix (~5 sites): snapshot fastCallDepth on entry to runWithCatch / evalWithCatch / each in-VM try_recover path, restore in the catch / longjmp branch. Mirrors the existing aa / acm / atba snapshot pattern in those functions.

B. Increment-site fix (~3 sites): wrap each ++ / eval / -- in an RAII guard so the decrement runs in the dtor regardless of how control leaves. Touches hot eval paths, but bulletproof against any catch site (current or future) and removes the requirement that every catcher remember to do the snapshot.

B is structurally cleaner; A more closely mirrors the existing pattern. Either one fixes the symptom.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions