vMLX - JANGTQ uber-compressed MLX models: L2 disk cache (survives restarts) + L1 paged cache (fast TTFT) + hybrid SSM scheduler + continuous batching, and more!
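For intuition, here is a minimal sketch of the tiered-cache idea this description hints at: a fixed-capacity in-memory L1 of KV pages backed by an on-disk L2 that persists across restarts. The class, method names, and layout are hypothetical, not vMLX's actual API.

```python
# Minimal sketch of a two-tier KV cache: in-memory L1 pages with an
# on-disk L2 that survives restarts. Names and layout are hypothetical,
# not vMLX's actual API.
import os
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, l2_dir: str, l1_capacity: int = 256):
        self.l1 = OrderedDict()          # page_id -> bytes, kept in LRU order
        self.l1_capacity = l1_capacity
        self.l2_dir = l2_dir             # persists across process restarts
        os.makedirs(l2_dir, exist_ok=True)

    def _l2_path(self, page_id: str) -> str:
        return os.path.join(self.l2_dir, f"{page_id}.page")

    def put(self, page_id: str, page: bytes) -> None:
        self.l1[page_id] = page
        self.l1.move_to_end(page_id)
        if len(self.l1) > self.l1_capacity:       # evict LRU page to disk
            victim, data = self.l1.popitem(last=False)
            with open(self._l2_path(victim), "wb") as f:
                f.write(data)

    def get(self, page_id: str) -> bytes | None:
        if page_id in self.l1:                    # L1 hit: fast path
            self.l1.move_to_end(page_id)
            return self.l1[page_id]
        path = self._l2_path(page_id)             # L2 hit: promote back to L1
        if os.path.exists(path):
            with open(path, "rb") as f:
                data = f.read()
            self.put(page_id, data)
            return data
        return None                               # miss: caller re-runs prefill
```

An L2 hit avoids recomputing prefill for a returning prompt, which is where the fast-TTFT claim would come from in a design like this.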
Algorithm-system co-design: accurate and efficient 2-bit KV-cache quantization for LLM inference.
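As a rough illustration of what 2-bit KV-cache quantization involves, below is a minimal group-wise affine quantizer that packs four 2-bit codes per byte. The group size and packing layout are assumptions for the sketch, not this repo's actual scheme.

```python
# Minimal sketch of group-wise 2-bit affine KV-cache quantization.
# Group size and packing layout are assumptions, not the repo's scheme.
import numpy as np

GROUP = 64  # values per quantization group (assumed)

def quantize_2bit(x: np.ndarray):
    """Quantize a flat tensor to 2-bit codes plus per-group scale/offset."""
    x = x.astype(np.float32).reshape(-1, GROUP)
    lo, hi = x.min(axis=1, keepdims=True), x.max(axis=1, keepdims=True)
    scale = (hi - lo) / 3.0                      # 2 bits -> 4 levels: 0..3
    scale = np.where(scale == 0, 1.0, scale)     # guard flat groups
    q = np.clip(np.round((x - lo) / scale), 0, 3).astype(np.uint8)
    # Pack 4 x 2-bit codes into each byte.
    packed = q[:, 0::4] | (q[:, 1::4] << 2) | (q[:, 2::4] << 4) | (q[:, 3::4] << 6)
    return packed, scale, lo

def dequantize_2bit(packed, scale, lo):
    """Unpack codes and reconstruct an fp32 approximation."""
    q = np.empty((packed.shape[0], packed.shape[1] * 4), dtype=np.uint8)
    for i, shift in enumerate((0, 2, 4, 6)):
        q[:, i::4] = (packed >> shift) & 0b11
    return (q.astype(np.float32) * scale + lo).reshape(-1)

kv = np.random.randn(2 * GROUP).astype(np.float32)   # toy K/V slice
packed, s, z = quantize_2bit(kv)
err = np.abs(dequantize_2bit(packed, s, z) - kv).max()
print(f"max abs error: {err:.3f}; bytes: {kv.nbytes} -> {packed.nbytes} (+ scales)")
```

The per-group scale/offset pairs are the accuracy lever: smaller groups track outliers better at the cost of more metadata, which is the kind of trade-off the algorithm-system co-design refers to.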
KV-cache compression for LLMs: reference implementations of TurboAngle and TurboQuant codecs with Triton GPU kernels
PagedAttention alone vs. PagedAttention + TurboQuant KV-cache quantization: experiments across sequence lengths comparing memory, latency, and accuracy.
Production-ready 2/4-bit KV-cache quantization for vLLM via Triton; ~70% VRAM savings and a 1.8x speedup.
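To ground the memory numbers in the two entries above, a back-of-the-envelope KV-cache size estimate is sketched below. The model dimensions are illustrative (roughly 7B-class defaults), not taken from either repo, and the 4.5-bit effective width assumes 4-bit codes plus per-group scale overhead.

```python
# Back-of-the-envelope KV-cache sizing; model dims are illustrative assumptions.
def kv_cache_bytes(seq_len, layers=32, kv_heads=32, head_dim=128, bits=16):
    # 2x for K and V, one entry per layer / head / token position.
    return 2 * layers * kv_heads * head_dim * seq_len * bits / 8

for seq in (2_048, 8_192, 32_768):
    fp16 = kv_cache_bytes(seq, bits=16)
    q4 = kv_cache_bytes(seq, bits=4.5)   # 4-bit codes + scale overhead (assumed)
    print(f"{seq:>6} tokens: fp16 {fp16/2**30:5.2f} GiB -> 4-bit {q4/2**30:5.2f} GiB "
          f"({1 - q4/fp16:.0%} saved)")
```

Under these assumptions the saving comes out to about 72%, consistent with the ~70% VRAM figure claimed above.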
LLM inference in C/C++, including TurboQuant and ternary models, referencing work from PrismML-Eng & TheTom. Grab a Bonsai Ternary 8B model and test it out: https://huggingface.co/prism-ml/Ternary-Bonsai-8B-gguf
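One way to fetch a GGUF file from the linked model repo is via `huggingface_hub`; the snippet below lists the repo's `.gguf` files rather than guessing a filename, since the exact filenames are not given here.

```python
# Download a GGUF file from the model repo linked above, then point a
# llama.cpp-style runtime at the resulting local path.
from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "prism-ml/Ternary-Bonsai-8B-gguf"
gguf_files = [f for f in list_repo_files(repo_id) if f.endswith(".gguf")]
path = hf_hub_download(repo_id=repo_id, filename=gguf_files[0])
print(f"model downloaded to: {path}")
```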