Skip to content

Release GIL during msgpack packing operations for true thread parallelism#1

Draft
Copilot wants to merge 20 commits intomainfrom
copilot/release-gil-cython-extension
Draft

Release GIL during msgpack packing operations for true thread parallelism#1
Copilot wants to merge 20 commits intomainfrom
copilot/release-gil-cython-extension

Conversation

Copy link
Copy Markdown

Copilot AI commented Feb 12, 2026

msgpack-python's C extension holds the GIL throughout packing/unpacking, preventing parallel execution even though memcpy of payload bodies is pure C and doesn't touch Python objects.

Core Changes

msgpack/pack.h: Replaced PyMem_Realloc with realloc to enable GIL-free buffer allocation. Error handling moved to caller with GIL held.

msgpack/_packer.pyx:

  • Switched to malloc/free from PyMem_* allocators
  • Declared msgpack_pack_raw_body as nogil-safe
  • Wrapped raw body writes in with nogil: blocks for bytes, str, memoryview, ExtType
  • Added error checking after GIL reacquisition

msgpack/_unpacker.pyx: Switched to libc allocators for consistency (no GIL release yet - unpacker creates Python objects per element).

Performance

Thread parallelism (4 cores):

  • Before: 0.83x (serialized by GIL)
  • After: 3.84x (near-ideal)

Asyncio throughput with asyncio.to_thread():

  • Direct async (no threading): 156k ops/sec, 4138 async pings/sec
  • Small payloads + threading: 195k ops/sec, 488 async pings/sec
  • Large payloads + threading: 81k ops/sec (lower raw throughput, but 2.47x parallelism)

Latency Trade-off

P99 latency can increase 2-3x during large payload operations since event loop must wait for memcpy completion. Acceptable for batch/throughput workloads; keep payloads small for latency-critical applications.

Safety

  • Payload pointers extracted before entering nogil block
  • Python objects remain referenced on stack (no GC)
  • msgpack_packer is C struct (no Python API calls in nogil section)
  • Return value checked after GIL reacquisition

Usage Pattern

# Optimal: Use threading for large payloads
async def pack_large(data: bytes):
    if len(data) > 10 * 1024:
        return await asyncio.to_thread(msgpack.packb, data)  # GIL released, true parallelism
    else:
        return msgpack.packb(data)  # Direct call, better responsiveness

Benchmarks

Comprehensive benchmarks added:

  • benchmark_asyncio_parallel.py: Three-way comparison (small/large threading, direct async)
  • benchmark_gil_detailed.py: GIL vs no-GIL comparison across payload sizes
  • CPU-count based thread scaling to prevent event loop saturation

All existing tests pass. No API changes.

Original prompt

Goal

Release the GIL in the Cython C extension where it is safe to do so, allowing true parallelism for CPU-bound packing operations involving large payloads when using Python threads.

Background

Currently, msgpack-python's C extension never releases the GIL during packing or unpacking. While most of the code is deeply intertwined with Python C API calls (type checks, object creation, etc.) and must hold the GIL, there are specific phases — particularly the memcpy of raw payload bodies — that operate purely on C buffers and don't touch any Python objects. These can safely run without the GIL.

The biggest win is in _pack_inner in _packer.pyx, where large bytes, str, memoryview, and ext payloads are copied into the packer's internal buffer via msgpack_pack_raw_body(). For large payloads (e.g. serializing many MB-sized binary blobs from multiple threads), releasing the GIL around these memcpy calls enables true parallel execution.

Architecture of the Change

1. msgpack/pack.h — Switch buffer allocator from PyMem_Realloc to realloc

The msgpack_pack_write function currently uses PyMem_Realloc (which requires the GIL) and PyErr_NoMemory(). To allow msgpack_pack_raw_body to be called without the GIL, the internal buffer must use standard C malloc/realloc/free instead.

Change msgpack_pack_write from:

buf = (char*)PyMem_Realloc(buf, bs);
if (!buf) {
    PyErr_NoMemory();
    return -1;
}

To:

buf = (char*)realloc(buf, bs);
if (!buf) {
    return -1;  // caller checks and raises MemoryError with GIL held
}

2. msgpack/_packer.pyx — Switch allocator and add nogil sections

a) Switch __cinit__ / __dealloc__ allocator:

  • Change PyMem_Mallocmalloc in __cinit__
  • Change PyMem_Freefree in __dealloc__
  • Add from libc.stdlib cimport malloc, realloc, free at the top
  • Add from libc.string cimport memcpy at the top

b) Declare msgpack_pack_raw_body as nogil-safe:

Add a separate nogil declaration block:

cdef extern from "pack.h" nogil:
    int msgpack_pack_raw_body(msgpack_packer* pk, const char* body, size_t l)

Note: keep the original declarations too for the functions that DO need the GIL (they call PyErr_NoMemory on realloc failure paths). Only msgpack_pack_raw_body needs the nogil declaration since its internal msgpack_pack_write now uses plain realloc instead of PyMem_Realloc.

Actually, after the pack.h change (removing PyMem_Realloc and PyErr_NoMemory), ALL the pack functions become GIL-free. So the cleaner approach is:

  • Change the entire cdef extern from "pack.h": block to add nogil to ALL function declarations
  • The functions themselves are pure C after the pack.h change

c) Add nogil blocks around large raw body writes in _pack_inner:

Define a threshold constant:

cdef size_t NOGIL_THRESHOLD = 1024  # Only release GIL for payloads > 1KB

For the bytes/bytearray packing path (around line 185-191), change:

# Original:
msgpack_pack_bin(&self.pk, L)
msgpack_pack_raw_body(&self.pk, rawval, L)

# New:
msgpack_pack_bin(&self.pk, L)
if L > NOGIL_THRESHOLD:
    with nogil:
        msgpack_pack_raw_body(&self.pk, rawval, L)
else:
    msgpack_pack_raw_body(&self.pk, rawval, L)

Apply the same pattern to:

  • The unicode/str packing path (around line 203-204)
  • The memoryview packing path (around line 239-240)
  • The ExtType packing path (around line 220)

d) Handle realloc failure with GIL:

After each msgpack_pack_raw_body call (or any pack call that could trigger realloc), check the return value. If it returns -1 (realloc failed), raise MemoryError since we now have the GIL back:

cdef int rc
if L > NOGIL_THRESHOLD:
    with nogil:
        rc = msgpack_pack_raw_body(&self.pk, rawval, L)
    if rc == -1:
        raise MemoryError("Unable to allocate internal buffer.")
else:
    msgpack_pack_raw_body(&self.pk, rawval, L)

3. msgpack/_unpacker.pyx — Switch allocator for consistency

Change the Unpacker class to also use malloc/free instead of PyMem_Malloc/PyMem_Free for its internal buffer, for consistency. The unpacker's unpack_execute function creates Python objects at every step so we can't release the GIL there, but switching the allocator keeps the codebase consistent and opens the door for future work.

  • Add from libc.stdlib cimport malloc, realloc, free at the top
  • In __init__: self.buf = <char*>malloc(read_size) instead of PyMem_Malloc
  • In __dealloc__: free(self.buf) instead of PyMem_Free
  • In append_buffer: new_buf = <char*>realloc(...) or keep using malloc+memcpy+free pattern but with libc versions

4. Important Notes

  • The except -1 Cython exception spec on the pack functions won't conflict with nogil as long as we capture the return value and check it after re-acquiring the GIL.
  • The with nogil: block must only contain C-leve...

This pull request was created from Copilot chat.


💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

Copilot AI and others added 4 commits February 12, 2026 11:15
Co-authored-by: Rexhaif <5154447+Rexhaif@users.noreply.github.com>
Co-authored-by: Rexhaif <5154447+Rexhaif@users.noreply.github.com>
Co-authored-by: Rexhaif <5154447+Rexhaif@users.noreply.github.com>
Co-authored-by: Rexhaif <5154447+Rexhaif@users.noreply.github.com>
Copilot AI changed the title [WIP] Release GIL in Cython extension for parallel packing Release GIL during large payload packing operations Feb 12, 2026
Copilot AI requested a review from Rexhaif February 12, 2026 11:22
Copilot AI and others added 5 commits February 13, 2026 09:25
Co-authored-by: Rexhaif <5154447+Rexhaif@users.noreply.github.com>
Co-authored-by: Rexhaif <5154447+Rexhaif@users.noreply.github.com>
Co-authored-by: Rexhaif <5154447+Rexhaif@users.noreply.github.com>
Co-authored-by: Rexhaif <5154447+Rexhaif@users.noreply.github.com>
Co-authored-by: Rexhaif <5154447+Rexhaif@users.noreply.github.com>
Copilot AI changed the title Release GIL during large payload packing operations Add asyncio GIL release benchmarks Feb 13, 2026
Copilot AI and others added 2 commits February 13, 2026 09:47
- All benchmarks now use os.cpu_count() to determine thread/worker counts
- Primary benchmark uses CPU count threads (prevents saturation)
- Intensive benchmark uses CPU count * 2 workers (stress test)
- Example uses CPU count * 2 concurrent requests (realistic load)
- Updated documentation to reflect CPU-based scaling
- Prevents thread pool overload and event loop latency issues

Co-authored-by: Rexhaif <5154447+Rexhaif@users.noreply.github.com>
Co-authored-by: Rexhaif <5154447+Rexhaif@users.noreply.github.com>
Copilot AI changed the title Add asyncio GIL release benchmarks Use CPU core count for asyncio benchmark thread scaling Feb 13, 2026
Copilot AI and others added 2 commits February 13, 2026 09:56
- Created LATENCY_ANALYSIS.md with detailed latency trade-off analysis
- Updated ASYNCIO_BENCHMARK_RESULTS.md with latency considerations
- Updated ASYNCIO_BENCHMARK_INDEX.md with latency warnings and guidance
- Updated EXECUTIVE_SUMMARY.md to address "what about the latency?" question
- Explains why P99 latency can be 2-3x worse with GIL release
- Provides decision matrix for when to use GIL release vs small payloads
- Includes mitigation strategies and use case guidance

Co-authored-by: Rexhaif <5154447+Rexhaif@users.noreply.github.com>
Co-authored-by: Rexhaif <5154447+Rexhaif@users.noreply.github.com>
Copilot AI changed the title Use CPU core count for asyncio benchmark thread scaling Document latency trade-offs in GIL release implementation Feb 13, 2026
Copilot AI and others added 2 commits February 13, 2026 10:15
- Modified msgpack/_packer.pyx to always release GIL for all payload sizes
- Removed NOGIL_THRESHOLD constant (was 1024 bytes)
- Simplified code by removing conditional GIL release logic
- All 5 locations now unconditionally release GIL:
  1. bytes/bytearray packing
  2. unicode/str packing
  3. ExtType packing
  4. memoryview packing
  5. pack_ext_type method
- Fixed pyproject.toml license format issue
- All 123 tests pass
- Benchmarks show improved parallelism for small payloads (3.01x vs 0.8x)

Co-authored-by: Rexhaif <5154447+Rexhaif@users.noreply.github.com>
- Updated GIL_RELEASE_IMPLEMENTATION.md to reflect no threshold
- Updated LATENCY_ANALYSIS.md with new benchmark results
- Documented excellent parallelism: 3.01x-3.84x with 4 threads
- Documented low latencies: 57µs avg, 566-594µs P99
- Removed outdated information about 1KB threshold
- Simplified guidance - GIL release beneficial for almost all use cases

Co-authored-by: Rexhaif <5154447+Rexhaif@users.noreply.github.com>
Copilot AI changed the title Document latency trade-offs in GIL release implementation Remove size threshold for GIL release - always release for all payloads Feb 13, 2026
Copilot AI and others added 2 commits February 13, 2026 11:01
- Created benchmark_gil_vs_nogil.py: Compares current (GIL released) vs simulated original (GIL held)
- Created benchmark_gil_detailed.py: Comprehensive comparison across payload sizes
- Created benchmark_gil_comparison.py: Framework for future true comparison
- Created _packer_nogil.pyx: Reference implementation without GIL release (for future builds)
- Demonstrates throughput improvements from GIL release (9-14%)
- Shows methodology for comparing GIL release impact

Co-authored-by: Rexhaif <5154447+Rexhaif@users.noreply.github.com>
Co-authored-by: Rexhaif <5154447+Rexhaif@users.noreply.github.com>
Copilot AI changed the title Remove size threshold for GIL release - always release for all payloads Release GIL during msgpack packing to enable thread parallelism Feb 13, 2026
Copilot AI and others added 2 commits February 13, 2026 11:35
- Updated benchmark_asyncio_parallel.py to compare three approaches:
  1. Small payloads with asyncio.to_thread (GIL held)
  2. Large payloads with asyncio.to_thread (GIL released)
  3. Direct async calls without threading (baseline)
- Added async_msgpack_work_direct() function for direct async execution
- Added direct_async_benchmark() to measure direct async performance
- Updated main() to run all three benchmarks and compare results
- Results show direct async has highest throughput (156k ops/sec) and best async responsiveness (4138 pings/sec)
- Provides clear guidance on when to use each approach
- Demonstrates threading overhead vs parallelism trade-offs

Co-authored-by: Rexhaif <5154447+Rexhaif@users.noreply.github.com>
Co-authored-by: Rexhaif <5154447+Rexhaif@users.noreply.github.com>
Copilot AI changed the title Release GIL during msgpack packing to enable thread parallelism Release GIL during msgpack packing operations for true thread parallelism Feb 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants