Inspiration

Large Language Models are quickly outgrowing the boundaries of conventional infrastructure - some reaching hundreds of gigabytes and pushing far beyond what a single GPU container can handle. We love Google Cloud Run for its simplicity - autoscaling, revisions, IAM, and seamless serverless deployment - but it wasn’t designed for multi-GPU, multi-hundred-GB, or HPC-style workloads.

So we asked ourselves:

Can we run a model that’s “too big for one container”… without leaving the Cloud Run experience?

That question became Commissure.

In neuroscience, a commissure connects the hemispheres of the brain - enabling distributed intelligence.
Our project borrows that metaphor: Commissure connects multiple Cloud Run GPU services into one distributed LLM runtime, each stage acting as a “hemisphere” of a single AI mind.


What It Does

Commissure transforms Cloud Run into an HPC-grade LLM runtime that scales models far beyond a single GPU.

It splits a large model - e.g. Gemma-3-27B-Instruct - into three cooperating GPU microservices:

  • Stage A: Handles HTTP requests, tokenization, and runs the front layers.
  • Stage B: Executes the middle transformer layers.
  • Stage C: Produces the final activations, normalization, and logits.

These services communicate over gRPC, streaming bf16 boundary tensors from one stage to the next.
From the outside, users see a single OpenAI-compatible /v1/chat/completions endpoint, but under the hood, Cloud Run GPUs collaborate as one distributed brain.
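As a rough illustration of what "OpenAI-compatible" streaming means on the wire, the sketch below formats one generated text fragment as a server-sent event in the `chat.completion.chunk` shape. The function name `sse_chunk`, the completion id, and the model string are placeholders chosen for this example; they are not taken from the Commissure code.

```python
import json
import time
from typing import Optional


def sse_chunk(completion_id: str, model: str, delta_text: Optional[str],
              finish_reason: Optional[str] = None) -> str:
    """Format one streamed fragment as an OpenAI-style SSE event (hypothetical helper)."""
    # An empty delta is used for the final chunk that only carries finish_reason.
    delta = {} if delta_text is None else {"content": delta_text}
    payload = {
        "id": completion_id,
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model,
        "choices": [{"index": 0, "delta": delta, "finish_reason": finish_reason}],
    }
    # Each SSE event is a "data: ..." line followed by a blank line.
    return f"data: {json.dumps(payload)}\n\n"


# Sentinel event that closes an OpenAI-style stream.
SSE_DONE = "data: [DONE]\n\n"

if __name__ == "__main__":
    print(sse_chunk("chatcmpl-demo", "gemma-3-27b-it", "Hello"), end="")
    print(sse_chunk("chatcmpl-demo", "gemma-3-27b-it", None, "stop"), end="")
    print(SSE_DONE, end="")
```

A client that already speaks the OpenAI streaming format can consume events like these without knowing that three services produced them.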

Architecture

Each stage loads only a subset of the model’s layers:

$$ \text{Stage A: layers } 0..K_1-1,\quad \text{Stage B: } K_1..K_2-1,\quad \text{Stage C: } K_2..L-1 $$

Intermediate activations are transmitted as boundary tensors \( x_{B,S,d} \) over gRPC in bf16 precision - small enough to stream efficiently, precise enough to preserve accuracy.
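To make "bf16 on the wire" concrete, here is a minimal pure-Python sketch of the idea: each float32 value is rounded to its top 16 bits (bfloat16) and packed as two bytes, so a boundary tensor costs B × S × d × 2 bytes. This is an illustration under assumptions, not the project's serializer; a real stage would more likely reinterpret a PyTorch bf16 tensor's memory directly. All function names below are hypothetical, and NaN handling is left out.

```python
import struct
from typing import List


def f32_to_bf16_bits(value: float) -> int:
    """Round a float32 value to bfloat16 and return the 16-bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Round to nearest, ties to even, on the 16 bits that are dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_f32(bits16: int) -> float:
    """Widen a bfloat16 bit pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]


def encode_boundary(values: List[float]) -> bytes:
    """Pack a flattened boundary tensor as little-endian bf16 (2 bytes per value)."""
    return struct.pack(f"<{len(values)}H", *(f32_to_bf16_bits(v) for v in values))


def decode_boundary(payload: bytes) -> List[float]:
    """Inverse of encode_boundary."""
    count = len(payload) // 2
    return [bf16_bits_to_f32(b) for b in struct.unpack(f"<{count}H", payload)]


def boundary_bytes(batch: int, seq_len: int, d_model: int) -> int:
    """Payload size of one B x S x d boundary tensor in bf16."""
    return batch * seq_len * d_model * 2


if __name__ == "__main__":
    wire = encode_boundary([1.0, -2.0, 0.5, 3.14159])
    print(len(wire), decode_boundary(wire))
    # d_model=4096 is an assumed hidden size, used only for the arithmetic.
    print(boundary_bytes(1, 1, 4096), boundary_bytes(1, 2048, 4096))
```

With the assumed hidden size of 4096, a single decode step for one sequence is 8,192 bytes, while a 2,048-token prefill is 16,777,216 bytes, which is why the prefill transfer dominates the per-request traffic between stages.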

This architecture allows a 27-billion-parameter model to run within Cloud Run GPU limits, and scales naturally to multiple stages:

$$ L = \sum_i L_i $$

where each \( L_i \) is the number of layers held by one Cloud Run service, so the per-stage segments add up to the full model depth \( L \). Together, they form a seamless distributed inference graph, running entirely on managed serverless infrastructure.
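The partitioning above can be sketched in a few lines. The helper below splits a layer count into contiguous, near-equal ranges and gives a rough estimate of the bf16 weight footprint per stage. The layer count of 62 and the 27e9 parameter figure are used as illustrative inputs, the estimate ignores embeddings, the LM head, KV cache, and activations, and both function names are hypothetical.

```python
from typing import List, Tuple


def split_layers(num_layers: int, num_stages: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) layer ranges, as even as possible."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    ranges = []
    start = 0
    for stage in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if stage < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def approx_stage_weight_gb(total_params: float, bytes_per_param: int,
                           layers_in_stage: int, num_layers: int) -> float:
    """Rough weight size of one stage in GB, assuming parameters are spread evenly over layers."""
    return total_params * bytes_per_param * layers_in_stage / num_layers / 1e9


if __name__ == "__main__":
    ranges = split_layers(62, 3)  # 62 layers is an assumed, illustrative depth
    print(ranges)
    for start, end in ranges:
        print(start, end, round(approx_stage_weight_gb(27e9, 2, end - start, 62), 2))
```

Under these assumptions each stage holds roughly 18 GB of bf16 weights, which fits under the 24 GB of a single L4 with some headroom left for the KV cache, whereas the full 54 GB of weights would not.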


System Diagram (Tensor Streaming)

┌─────────────────────────────────────────────────────────────────────────────┐
│                           User Request (HTTPS)                              │
└────────────────────────────────┬────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STAGE A (Cloud Run Service – L4 GPU)                                       │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │ • FastAPI HTTP / SSE endpoint (public-facing, OpenAI-compatible)      │  │
│  │ • Tokenizer (chat templates, stop IDs, text → token IDs)              │  │
│  │ • Embeddings + decoder layers 0..K₁−1 (front of the model)            │  │
│  │ • Maintains its own KV cache for layers 0..K₁−1                       │  │
│  │ • Computes boundary activations: x₀ ∈ ℝ^{B×S×d_model}                 │  │
│  │ • Outputs x₀ as bf16 boundary tensor over gRPC                        │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────┬────────────────────────────────────────────┘
                                 │  gRPC stream (bf16-serialized x₀)
                                 ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STAGE B (Cloud Run Service – L4 GPU)                                       │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │ • gRPC bidirectional streaming server (Boundary.Decode)               │  │
│  │ • Middle decoder layers K₁..K₂−1                                      │  │
│  │ • Dynamic KV cache management for its layer range                     │  │
│  │ • Receives boundary tensor x₀                                         │  │
│  │ • Computes x₁ = f_B(x₀) through layers K₁..K₂−1                       │  │
│  │ • Shape preserved: x₀, x₁ ∈ ℝ^{B×S×d_model}                           │  │
│  │ • Sends x₁ as bf16 boundary tensor over gRPC                          │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────┬────────────────────────────────────────────┘
                                 │  gRPC stream (bf16-serialized x₁)
                                 ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STAGE C (Cloud Run Service – L4 GPU)                                       │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │ • gRPC bidirectional streaming server (Boundary.Decode)               │  │
│  │ • Final decoder layers K₂..L−1                                        │  │
│  │ • Final LayerNorm + LM Head                                           │  │
│  │ • Dynamic KV cache for its own layers                                 │  │
│  │ • Receives boundary tensor x₁                                         │  │
│  │ • Computes logits via x₂ = f_C(x₁)                                    │  │
│  │ • Token sampling (temperature, top-p)                                 │  │
│  │ • Returns next_token_id back over the same gRPC stream                │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘

Each token step can be expressed as the composition of stage functions:

$$ x_{t+1} = f_C\!\big(f_B(f_A(\text{input}))\big), $$

where activations are streamed between Cloud Run GPU services as boundary tensors

$$ x_{B,S,d} \in \mathbb{R}^{B \times S \times d}, $$

preserving the hidden state across stages in bf16 precision.
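The composition above can be read as a simple loop. The toy sketch below chains three stage callables per token step: the first step feeds the whole prompt (prefill), and later steps feed only the newest token, on the assumption that each real stage keeps earlier context in its own KV cache. The stage functions here are stand-in lambdas with no model behind them, and `generate` is a hypothetical name.

```python
from typing import Callable, List, Sequence

Hidden = List[float]


def generate(prompt_ids: Sequence[int],
             stage_a: Callable[[List[int]], Hidden],
             stage_b: Callable[[Hidden], Hidden],
             stage_c: Callable[[Hidden], int],
             stop_id: int,
             max_new_tokens: int) -> List[int]:
    """Greedy staged decoding: each step computes f_C(f_B(f_A(input)))."""
    new_tokens: List[int] = []
    step_input = list(prompt_ids)  # prefill: the full prompt goes through once
    for _ in range(max_new_tokens):
        x0 = stage_a(step_input)   # front layers -> boundary tensor x0
        x1 = stage_b(x0)           # middle layers -> boundary tensor x1
        next_id = stage_c(x1)      # final layers, LM head, sampling
        if next_id == stop_id:
            break
        new_tokens.append(next_id)
        step_input = [next_id]     # decode: only the newest token is sent
    return new_tokens


if __name__ == "__main__":
    # Toy stages standing in for the real services.
    toy_a = lambda ids: [float(i) for i in ids]
    toy_b = lambda x: [v + 1.0 for v in x]
    toy_c = lambda x: int(x[-1]) % 5
    print(generate([1, 2], toy_a, toy_b, toy_c, stop_id=0, max_new_tokens=10))
```

In the deployed system the two inner calls would be gRPC round trips rather than local function calls, but the control flow per token has the same shape.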


How We Built It

We built Commissure in Python 3.11, using:

  • PyTorch 2.8 + CUDA 12.8 - model slicing and tensor streaming
  • FastAPI - for the HTTP/SSE public interface (Stage A)
  • gRPC + Protocol Buffers - for bf16 boundary streaming between services
  • Google Cloud Run (L4 GPUs) - for all compute stages
  • Cloud Build, Artifact Registry, Secret Manager, and GCS - for fully automated deployment
  • A custom CLI - commissure up - that builds the Docker image, uploads weights, deploys all three services, and automatically wires them together (C → B → A)
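One plausible reason for the C → B → A order is that each upstream stage needs the address of its downstream neighbour before it can be configured, so the last stage is deployed first. The sketch below only computes that ordering; it does not call any Google Cloud API, and `wiring_plan` is a hypothetical helper, not part of the commissure CLI.

```python
from typing import List, Optional, Tuple


def wiring_plan(stages: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Return (stage, downstream) pairs in deployment order, last stage first."""
    plan = []
    for index in range(len(stages) - 1, -1, -1):
        # The final stage has no downstream; every other stage points at the next one.
        downstream = stages[index + 1] if index + 1 < len(stages) else None
        plan.append((stages[index], downstream))
    return plan


if __name__ == "__main__":
    for stage, downstream in wiring_plan(["A", "B", "C"]):
        print(f"deploy {stage}, downstream={downstream}")
```

The same ordering would extend to a pipeline with more than three stages without changes.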

We also implemented:

  • Lazy model loading to reduce cold-start latency
  • Asynchronous warm-up and caching
  • DynamicCache handling for cross-stage attention reuse
  • bf16 wire serialization for compact activation transfer
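Lazy loading with an asynchronous warm-up can be sketched as follows: the service starts without touching the weights, a background thread begins loading them, and the first request blocks only if the load has not finished yet. `LazyModel` and the loader callable are assumptions made for this example; the actual implementation in Commissure may differ.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyModel(Generic[T]):
    """Load an expensive object at most once, on first use or during warm-up."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._lock = threading.Lock()
        self._model: Optional[T] = None
        self._loaded = False

    def get(self) -> T:
        # Double-checked locking: the fast path skips the lock once loaded.
        if not self._loaded:
            with self._lock:
                if not self._loaded:
                    self._model = self._loader()
                    self._loaded = True
        return self._model  # type: ignore[return-value]

    def warm_up_in_background(self) -> threading.Thread:
        """Start loading without blocking the caller (e.g. at container start)."""
        thread = threading.Thread(target=self.get, daemon=True)
        thread.start()
        return thread


if __name__ == "__main__":
    lazy = LazyModel(lambda: "weights-for-layers-0..20")  # placeholder loader
    lazy.warm_up_in_background().join()
    print(lazy.get())
```

The benefit is that the container can start listening for health checks and requests right away instead of waiting for the full shard to load.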

Challenges We Ran Into

  • Memory ceilings: fitting partial model shards while preserving bf16 precision.
  • Inter-stage streaming: optimizing latency while maintaining numerical stability.
  • Checkpoint slicing: dynamically mapping Hugging Face prefixes across layer splits.
  • Cold starts: solved via lazy loading, async warm-ups, and cache persistence.
  • Serverless orchestration: ensuring ephemeral GPU services behave like one continuous model.
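The checkpoint-slicing challenge can be illustrated with a small key filter. The sketch assumes checkpoint keys follow the common Hugging Face pattern `...layers.<index>.<rest>`, keeps only the layers in a stage's half-open range, and renumbers them from zero so the stage can load them into a shorter layer stack. Whether Commissure renumbers layers or keeps global indices is not stated in the source, so treat the renumbering, the key names, and `slice_state_dict` itself as assumptions.

```python
import re
from typing import Dict, Tuple, TypeVar

T = TypeVar("T")

# Matches keys such as "model.layers.12.self_attn.q_proj.weight".
_LAYER_KEY = re.compile(r"^(?P<prefix>.*\blayers\.)(?P<index>\d+)(?P<rest>\..*)$")


def slice_state_dict(state: Dict[str, T], start: int, end: int,
                     extra_prefixes: Tuple[str, ...] = ()) -> Dict[str, T]:
    """Keep layers in [start, end), renumbered from 0, plus keys under extra_prefixes."""
    sliced: Dict[str, T] = {}
    for key, value in state.items():
        match = _LAYER_KEY.match(key)
        if match is None:
            # Non-layer tensors (embeddings, final norm, LM head) are kept
            # only when the stage explicitly asks for them.
            if key.startswith(extra_prefixes):
                sliced[key] = value
            continue
        index = int(match.group("index"))
        if start <= index < end:
            local_index = index - start
            sliced[f"{match.group('prefix')}{local_index}{match.group('rest')}"] = value
    return sliced


if __name__ == "__main__":
    demo = {
        "model.embed_tokens.weight": "E",
        "model.layers.0.mlp.weight": "L0",
        "model.layers.1.mlp.weight": "L1",
        "model.layers.2.mlp.weight": "L2",
        "model.norm.weight": "N",
    }
    print(slice_state_dict(demo, 0, 1, extra_prefixes=("model.embed_tokens.",)))
    print(slice_state_dict(demo, 1, 3))
```

In this picture, Stage A would request the embedding prefix, Stage C the final norm and LM head prefixes, and Stage B no extras at all.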

Accomplishments That We’re Proud Of

  • Successfully ran Gemma-3-27B across three cooperating Cloud Run GPU services, end-to-end.
  • Achieved real-time, streaming chat completions fully compatible with OpenAI’s API format.
  • Created a reproducible one-command deployment pipeline (commissure up) using Google Cloud tools.
  • Delivered a proof that Cloud Run can perform distributed inference once thought impossible for serverless.
  • Most importantly - made large-model inference accessible to teams that love Cloud Run’s simplicity.

What We Learned

  • Cloud Run is more capable than expected - it can host distributed compute graphs, not just APIs.
  • gRPC + bf16 is the key to bridging precision and performance in cross-container inference.
  • Layer partitioning + DynamicCache lets the supported model size grow roughly linearly with the number of services, since each added stage brings its own GPU memory.
  • Serverless can be fast, scalable, and HPC-grade when you rethink the architecture, not the platform.

What’s Next for Commissure

  • Broader model support: Extend the same split-runtime approach to LLaMA, DeepSeek, MiniMax, GPT-OSS, and future open-weight models.
  • Per-stage quantization: Enable seamless execution of hundreds-of-gigabyte checkpoints through mixed 4-bit / 8-bit compression - with no architecture changes required.
  • Adaptive scaling: Automatically tune layer splits (K₁, K₂, Lᵢ) based on GPU size, region, or workload.
  • Cross-region mesh inference: Link Cloud Run GPUs across continents for global, low-latency collaboration.
  • Distributed training: Reuse the same staged, gRPC-connected topology to experiment with training and fine-tuning across multiple Cloud Run GPUs.

Our next goal is simple - to keep pushing serverless beyond its limits, until Cloud Run thinks like a supercomputer.


Summary

Commissure bridges the gap between serverless and high-performance AI.
It proves that with the right architecture, even massive language models can run fully managed, secure, and elastic: no clusters, no manual scaling, just Cloud Run.
