Inspiration
Large Language Models are quickly outgrowing the boundaries of conventional infrastructure - some reaching hundreds of gigabytes and pushing far beyond what a single GPU container can handle. We love Google Cloud Run for its simplicity - autoscaling, revisions, IAM, and seamless serverless deployment - but it wasn't designed for multi-GPU, multi-hundred-GB, or HPC-style workloads.
So we asked ourselves:
Can we run a model that's "too big for one container"... without leaving the Cloud Run experience?
That question became Commissure.
In neuroscience, a commissure connects the hemispheres of the brain - enabling distributed intelligence.
Our project borrows that metaphor: Commissure connects multiple Cloud Run GPU services into one distributed LLM runtime, each stage acting as a "hemisphere" of a single AI mind.
What It Does
Commissure transforms Cloud Run into an HPC-grade LLM runtime that scales models far beyond a single GPU.
It splits a large model - e.g. Gemma-3-27B-Instruct - into three cooperating GPU microservices:
- Stage A: Handles HTTP requests, tokenization, and runs the front layers.
- Stage B: Executes the middle transformer layers.
- Stage C: Produces the final activations, normalization, and logits.
These services communicate through gRPC, streaming bf16 boundary tensors at high speed.
From the outside, users see a single OpenAI-compatible /v1/chat/completions endpoint, but under the hood, Cloud Run GPUs collaborate as one distributed brain.
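From the client's perspective, the request shape is exactly OpenAI's. A minimal sketch, assuming a hypothetical service URL (the real one is printed at deploy time) and an illustrative model name:

```python
import json

# Hypothetical Stage A endpoint; the real URL comes from the deployment output.
COMMISSURE_URL = "https://stage-a-example.run.app/v1/chat/completions"

# A standard OpenAI-compatible chat completion request body.
payload = {
    "model": "gemma-3-27b-it",  # model name is illustrative
    "messages": [
        {"role": "user", "content": "Explain what a commissure is."}
    ],
    "stream": True,              # Stage A streams tokens back via SSE
    "temperature": 0.7,
    "top_p": 0.95,
}

body = json.dumps(payload)
# e.g. with the `requests` library:
#   requests.post(COMMISSURE_URL, data=body, stream=True)
print(json.loads(body)["model"])  # → gemma-3-27b-it
```

Because the schema is unchanged, existing OpenAI client libraries can be pointed at the Stage A base URL without modification.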
Architecture
Each stage loads only a subset of the model's layers:
$$ \text{Stage A: layers } 0..K_1{-}1,\quad \text{Stage B: } K_1..K_2{-}1,\quad \text{Stage C: } K_2..L{-}1 $$
Intermediate activations are transmitted as boundary tensors $x \in \mathbb{R}^{B \times S \times d}$ over gRPC in bf16 precision - small enough to stream efficiently, precise enough to preserve accuracy.
This architecture allows a 27-billion-parameter model to run within Cloud Run GPU limits, and scales naturally to multiple stages:
$$ L = \sum_i L_i $$
where each $L_i$ corresponds to a Cloud Run service holding its own segment of the model. Together, they form a seamless distributed inference graph, running entirely on managed serverless infrastructure.
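As a sketch, the split points can be computed from the layer count alone. Here is a minimal even-split helper (the real split points $K_1$ and $K_2$ are tunable and need not be uniform):

```python
def split_layers(num_layers: int, num_stages: int = 3) -> list[range]:
    """Partition decoder layers 0..num_layers-1 into contiguous per-stage ranges."""
    base, rem = divmod(num_layers, num_stages)
    ranges, start = [], 0
    for i in range(num_stages):
        size = base + (1 if i < rem else 0)  # spread any remainder over early stages
        ranges.append(range(start, start + size))
        start += size
    return ranges

# A hypothetical 60-layer model split across three Cloud Run services:
stages = split_layers(60)
print([(r.start, r.stop) for r in stages])  # → [(0, 20), (20, 40), (40, 60)]
```

Stage A then owns the first range, Stage B the second, and Stage C the rest, matching the partition above.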
System Diagram (Tensor Streaming)
                         User Request (HTTPS)
                                  │
                                  ▼
┌─ STAGE A (Cloud Run Service, L4 GPU) ────────────────────────────────
│  • FastAPI HTTP / SSE endpoint (public-facing, OpenAI-compatible)
│  • Tokenizer (chat templates, stop IDs, text → token IDs)
│  • Embeddings + decoder layers 0..K₁-1 (front of the model)
│  • Maintains its own KV cache for layers 0..K₁-1
│  • Computes boundary activations: x₁ ∈ ℝ^{B×S×d_model}
│  • Outputs x₁ as a bf16 boundary tensor over gRPC
└─────────────────────────────────┬────────────────────────────────────
                                  │  gRPC stream (bf16-serialized x₁)
                                  ▼
┌─ STAGE B (Cloud Run Service, L4 GPU) ────────────────────────────────
│  • gRPC bidirectional streaming server (Boundary.Decode)
│  • Middle decoder layers K₁..K₂-1
│  • Dynamic KV cache management for its layer range
│  • Receives boundary tensor x₁
│  • Computes x₂ = f_B(x₁) through layers K₁..K₂-1
│  • Shape preserved: x₁, x₂ ∈ ℝ^{B×S×d_model}
│  • Sends x₂ as a bf16 boundary tensor over gRPC
└─────────────────────────────────┬────────────────────────────────────
                                  │  gRPC stream (bf16-serialized x₂)
                                  ▼
┌─ STAGE C (Cloud Run Service, L4 GPU) ────────────────────────────────
│  • gRPC bidirectional streaming server (Boundary.Decode)
│  • Final decoder layers K₂..L-1
│  • Final LayerNorm + LM Head
│  • Dynamic KV cache for its own layers
│  • Receives boundary tensor x₂
│  • Computes logits via x₃ = f_C(x₂)
│  • Token sampling (temperature, top-p)
│  • Returns next_token_id back over the same gRPC stream
└──────────────────────────────────────────────────────────────────────
Each token step can be expressed as the composition of stage functions:
$$ x_{t+1} = f_C\big(f_B(f_A(\text{input}))\big), $$
where activations are streamed between Cloud Run GPU services as boundary tensors
$$ x_{B,S,d} \in \mathbb{R}^{B \times S \times d}, $$
preserving the hidden state across stages in bf16 precision.
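The bf16 wire encoding is easy to illustrate without a deep-learning framework: bfloat16 is simply the top 16 bits of an IEEE-754 float32, so a boundary tensor can be truncated for the wire and zero-extended on receipt. A NumPy sketch (the actual runtime streams PyTorch tensors; this only shows the encoding):

```python
import numpy as np

def to_bf16_bytes(x: np.ndarray) -> bytes:
    """Serialize a float32 tensor as bf16: keep the high 16 bits of each element."""
    u32 = x.astype(np.float32).view(np.uint32)
    # Round-to-nearest-even on the truncated low mantissa bits.
    rounded = u32 + 0x7FFF + ((u32 >> 16) & 1)
    return (rounded >> 16).astype(np.uint16).tobytes()

def from_bf16_bytes(buf: bytes, shape) -> np.ndarray:
    """Deserialize bf16 bytes back to float32 by zero-extending the low 16 bits."""
    u16 = np.frombuffer(buf, dtype=np.uint16)
    return (u16.astype(np.uint32) << 16).view(np.float32).reshape(shape)

x = np.random.randn(2, 8, 16).astype(np.float32)  # a (B, S, d) boundary tensor
wire = to_bf16_bytes(x)
x_back = from_bf16_bytes(wire, x.shape)
print(len(wire), x.nbytes)  # the bf16 wire payload is half the float32 size
```

This is what makes the boundary tensors "small enough to stream": the wire cost is 2 bytes per hidden dimension per token, at roughly three decimal digits of precision.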
How We Built It
We built Commissure in Python 3.11, using:
- PyTorch 2.8 + CUDA 12.8 - model slicing and tensor streaming
- FastAPI - for the HTTP/SSE public interface (Stage A)
- gRPC + Protocol Buffers - for bf16 boundary streaming between services
- Google Cloud Run (L4 GPUs) - for all compute stages
- Cloud Build, Artifact Registry, Secret Manager, and GCS - for fully automated deployment
- A custom CLI, `commissure up`, that builds the Docker image, uploads weights, deploys all three services, and automatically wires them together (C → B → A)
We also implemented:
- Lazy model loading to reduce cold-start latency
- Asynchronous warm-up and caching
- DynamicCache handling for cross-stage attention reuse
- bf16 wire serialization for compact activation transfer
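The lazy-loading piece can be sketched as a thread-safe deferred initializer; the stand-in loader below is a placeholder for the real weight-loading code, which is not shown:

```python
import threading

class LazyStage:
    """Load a stage's weights on first use; safe under concurrent requests."""

    def __init__(self, loader):
        self._loader = loader  # e.g. a function that reads weight shards from GCS
        self._model = None
        self._lock = threading.Lock()

    @property
    def model(self):
        if self._model is None:          # fast path once warm
            with self._lock:
                if self._model is None:  # double-checked: only one loader runs
                    self._model = self._loader()
        return self._model

calls = []
stage = LazyStage(lambda: calls.append("load") or "weights-for-layers-0..20")
print(stage.model)  # first access triggers the load
print(stage.model)  # second access reuses the cached result
print(calls)        # → ['load']
```

The container starts serving health checks immediately, and the expensive load happens at most once, on the first real request or an asynchronous warm-up ping.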
Challenges We Ran Into
- Memory ceilings: fitting partial model shards while preserving bf16 precision.
- Inter-stage streaming: optimizing latency while maintaining numerical stability.
- Checkpoint slicing: dynamically mapping Hugging Face prefixes across layer splits.
- Cold starts: solved via lazy loading, async warm-ups, and cache persistence.
- Serverless orchestration: ensuring ephemeral GPU services behave like one continuous model.
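The checkpoint-slicing challenge boils down to filtering a Hugging Face state dict by layer index and renumbering the kept layers to a 0-based range. A sketch using the common `model.layers.<i>.` key pattern (the exact Gemma key names are an assumption here):

```python
import re

def slice_state_dict(keys, start: int, stop: int) -> dict:
    """Map checkpoint keys for layers [start, stop) to 0-based names for one stage.

    Non-layer keys (embeddings, final norm, lm_head) are passed through here;
    the real system would also filter those by which stage needs them.
    """
    layer_re = re.compile(r"^(model\.layers\.)(\d+)(\..+)$")
    out = {}
    for k in keys:
        m = layer_re.match(k)
        if m is None:
            out[k] = k  # non-layer tensor: keep its name as-is
            continue
        idx = int(m.group(2))
        if start <= idx < stop:
            out[k] = f"{m.group(1)}{idx - start}{m.group(3)}"
    return out

keys = [f"model.layers.{i}.mlp.up_proj.weight" for i in range(6)] + ["model.norm.weight"]
mapping = slice_state_dict(keys, 2, 4)
print(mapping["model.layers.2.mlp.up_proj.weight"])  # renamed to layer 0 for its stage
```

Each stage then loads only its own renamed shard, so its local module indices always start at zero regardless of where the split falls.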
Accomplishments That We're Proud Of
- Successfully ran Gemma-3-27B across three cooperating Cloud Run GPU services, end-to-end.
- Achieved real-time, streaming chat completions fully compatible with OpenAI's API format.
- Created a reproducible one-command deployment pipeline (`commissure up`) using Google Cloud tools.
- Demonstrated that Cloud Run can perform distributed inference once thought impossible for serverless.
- Most importantly, made large-model inference accessible to teams that love Cloud Run's simplicity.
What We Learned
- Cloud Run is more capable than expected - it can host distributed compute graphs, not just APIs.
- gRPC + bf16 is the key to bridging precision and performance in cross-container inference.
- Layer partitioning + DynamicCache can scale models linearly with the number of services.
- Serverless can be fast, scalable, and HPC-grade when you rethink the architecture, not the platform.
What's Next for Commissure
- Broader model support: Extend the same split-runtime approach to LLaMA, DeepSeek, MiniMax, GPT-OSS, and future open-weight models.
- Per-stage quantization: Enable seamless execution of hundreds-of-gigabyte checkpoints through mixed 4-bit / 8-bit compression - with no architecture changes required.
- Adaptive scaling: Automatically tune layer splits (K₁, K₂, Lᵢ) based on GPU size, region, or workload.
- Cross-region mesh inference: Link Cloud Run GPUs across continents for global, low-latency collaboration.
- Distributed training: Reuse the same staged, gRPC-connected topology to experiment with training and fine-tuning across multiple Cloud Run GPUs.
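The adaptive-scaling item can be sketched as a memory-weighted split, under the simplifying assumption that per-layer memory cost is roughly uniform (true only to first order, since KV caches and the LM head skew the balance):

```python
def adaptive_split(num_layers: int, gpu_mem_gb: list[float]) -> list[int]:
    """Assign layer counts to stages in proportion to available GPU memory."""
    total = sum(gpu_mem_gb)
    counts = [int(num_layers * g / total) for g in gpu_mem_gb]
    # Hand leftover layers (from rounding down) to the largest GPUs first.
    for i in sorted(range(len(counts)), key=lambda i: -gpu_mem_gb[i]):
        if sum(counts) == num_layers:
            break
        counts[i] += 1
    return counts

# Three stages: two 24 GB L4s and one hypothetical 48 GB card.
print(adaptive_split(60, [24.0, 24.0, 48.0]))  # → [15, 15, 30]
```

The same function generalizes to any number of stages, which is what lets the topology grow beyond three services.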
Our next goal is simple: keep pushing serverless beyond its limits, until Cloud Run thinks like a supercomputer.
Summary
Commissure bridges the gap between serverless and high-performance AI.
It proves that with the right architecture, even massive language models can run fully managed, secure, and elastic - no clusters, no manual scaling, just Cloud Run.