Release GIL during msgpack packing operations for true thread parallelism by Copilot · Pull Request #1 · Rexhaif/msgpack-python · GitHub

Copilot · 2026-02-12T11:11:00Z

msgpack-python's C extension holds the GIL throughout packing/unpacking, preventing parallel execution even though memcpy of payload bodies is pure C and doesn't touch Python objects.

Core Changes

msgpack/pack.h: Replaced PyMem_Realloc with realloc to enable GIL-free buffer allocation. Error handling moved to caller with GIL held.

msgpack/_packer.pyx:

Switched to malloc/free from PyMem_* allocators
Declared msgpack_pack_raw_body as nogil-safe
Wrapped raw body writes in with nogil: blocks for bytes, str, memoryview, ExtType
Added error checking after GIL reacquisition

msgpack/_unpacker.pyx: Switched to libc allocators for consistency (no GIL release yet - unpacker creates Python objects per element).

Performance

Thread parallelism (4 cores):

Before: 0.83x (serialized by GIL)
After: 3.84x (near-ideal)

Asyncio throughput with asyncio.to_thread():

Direct async (no threading): 156k ops/sec, 4138 async pings/sec
Small payloads + threading: 195k ops/sec, 488 async pings/sec
Large payloads + threading: 81k ops/sec (lower raw throughput, but 2.47x parallelism)

Latency Trade-off

P99 latency can increase 2-3x during large payload operations since event loop must wait for memcpy completion. Acceptable for batch/throughput workloads; keep payloads small for latency-critical applications.

Safety

Payload pointers extracted before entering nogil block
Python objects remain referenced on stack (no GC)
msgpack_packer is C struct (no Python API calls in nogil section)
Return value checked after GIL reacquisition

Usage Pattern

# Optimal: Use threading for large payloads
async def pack_large(data: bytes):
    if len(data) > 10 * 1024:
        return await asyncio.to_thread(msgpack.packb, data)  # GIL released, true parallelism
    else:
        return msgpack.packb(data)  # Direct call, better responsiveness

Benchmarks

Comprehensive benchmarks added:

benchmark_asyncio_parallel.py: Three-way comparison (small/large threading, direct async)
benchmark_gil_detailed.py: GIL vs no-GIL comparison across payload sizes
CPU-count based thread scaling to prevent event loop saturation

All existing tests pass. No API changes.

Original prompt

Goal

Release the GIL in the Cython C extension where it is safe to do so, allowing true parallelism for CPU-bound packing operations involving large payloads when using Python threads.

Background

Currently, msgpack-python's C extension never releases the GIL during packing or unpacking. While most of the code is deeply intertwined with Python C API calls (type checks, object creation, etc.) and must hold the GIL, there are specific phases — particularly the memcpy of raw payload bodies — that operate purely on C buffers and don't touch any Python objects. These can safely run without the GIL.

The biggest win is in _pack_inner in _packer.pyx, where large bytes, str, memoryview, and ext payloads are copied into the packer's internal buffer via msgpack_pack_raw_body(). For large payloads (e.g. serializing many MB-sized binary blobs from multiple threads), releasing the GIL around these memcpy calls enables true parallel execution.

Architecture of the Change

1. `msgpack/pack.h` — Switch buffer allocator from `PyMem_Realloc` to `realloc`

The msgpack_pack_write function currently uses PyMem_Realloc (which requires the GIL) and PyErr_NoMemory(). To allow msgpack_pack_raw_body to be called without the GIL, the internal buffer must use standard C malloc/realloc/free instead.

Change msgpack_pack_write from:

buf = (char*)PyMem_Realloc(buf, bs);
if (!buf) {
    PyErr_NoMemory();
    return -1;
}

To:

buf = (char*)realloc(buf, bs);
if (!buf) {
    return -1;  // caller checks and raises MemoryError with GIL held
}

2. `msgpack/_packer.pyx` — Switch allocator and add `nogil` sections

a) Switch __cinit__ / __dealloc__ allocator:

Change PyMem_Malloc → malloc in __cinit__
Change PyMem_Free → free in __dealloc__
Add from libc.stdlib cimport malloc, realloc, free at the top
Add from libc.string cimport memcpy at the top

b) Declare msgpack_pack_raw_body as nogil-safe:

Add a separate nogil declaration block:

cdef extern from "pack.h" nogil:
    int msgpack_pack_raw_body(msgpack_packer* pk, const char* body, size_t l)

Note: keep the original declarations too for the functions that DO need the GIL (they call PyErr_NoMemory on realloc failure paths). Only msgpack_pack_raw_body needs the nogil declaration since its internal msgpack_pack_write now uses plain realloc instead of PyMem_Realloc.

Actually, after the pack.h change (removing PyMem_Realloc and PyErr_NoMemory), ALL the pack functions become GIL-free. So the cleaner approach is:

Change the entire cdef extern from "pack.h": block to add nogil to ALL function declarations
The functions themselves are pure C after the pack.h change

c) Add nogil blocks around large raw body writes in _pack_inner:

Define a threshold constant:

cdef size_t NOGIL_THRESHOLD = 1024  # Only release GIL for payloads > 1KB

For the bytes/bytearray packing path (around line 185-191), change:

# Original:
msgpack_pack_bin(&self.pk, L)
msgpack_pack_raw_body(&self.pk, rawval, L)

# New:
msgpack_pack_bin(&self.pk, L)
if L > NOGIL_THRESHOLD:
    with nogil:
        msgpack_pack_raw_body(&self.pk, rawval, L)
else:
    msgpack_pack_raw_body(&self.pk, rawval, L)

Apply the same pattern to:

The unicode/str packing path (around line 203-204)
The memoryview packing path (around line 239-240)
The ExtType packing path (around line 220)

d) Handle realloc failure with GIL:

After each msgpack_pack_raw_body call (or any pack call that could trigger realloc), check the return value. If it returns -1 (realloc failed), raise MemoryError since we now have the GIL back:

cdef int rc
if L > NOGIL_THRESHOLD:
    with nogil:
        rc = msgpack_pack_raw_body(&self.pk, rawval, L)
    if rc == -1:
        raise MemoryError("Unable to allocate internal buffer.")
else:
    msgpack_pack_raw_body(&self.pk, rawval, L)

3. `msgpack/_unpacker.pyx` — Switch allocator for consistency

Change the Unpacker class to also use malloc/free instead of PyMem_Malloc/PyMem_Free for its internal buffer, for consistency. The unpacker's unpack_execute function creates Python objects at every step so we can't release the GIL there, but switching the allocator keeps the codebase consistent and opens the door for future work.

Add from libc.stdlib cimport malloc, realloc, free at the top
In __init__: self.buf = <char*>malloc(read_size) instead of PyMem_Malloc
In __dealloc__: free(self.buf) instead of PyMem_Free
In append_buffer: new_buf = <char*>realloc(...) or keep using malloc+memcpy+free pattern but with libc versions