Alibaba Cloud Tair has partnered with SGLang to develop HiCache, a hierarchical KV cache infrastructure designed specifically for “agentic inference,” the emerging AI paradigm in which models make multi-turn decisions, self-reflect, and collaborate with other agents. The system addresses critical bottlenecks in traditional cache mechanisms and is now integrated as a core component in the SGLang framework, with plans to become a key module in Alibaba Cloud Tair KVCache. Early tests show cache hit rates reaching 80%, a 56% reduction in time-to-first-token (TTFT), and doubled inference throughput.
What’s New: Multi-Layer Cache Architecture
HiCache constructs a unified hierarchical cache system that spans GPU memory, host memory, local disk, and remote distributed storage like 3FS. This approach tackles three major problems in AI agent workloads: state bloat from extremely long contexts, lack of cross-turn session persistence, and cache isolation between multi-agent tasks.
The system uses intelligent heat-aware scheduling to keep frequently accessed “hot” data in GPU memory while transparently offloading “cold” data to larger-capacity storage. A GPU with only 40GB memory can now leverage CPU memory to expand effective cache capacity beyond 200GB, with storage-layer integration supporting terabyte-level ultra-long context caching.
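The heat-aware tiering can be pictured as a small hot tier backed by a much larger cold one. The following is a minimal sketch in plain Python, not HiCache’s actual implementation: the class name, the capacity counts, and the dict-based “tiers” are stand-ins for real GPU and host memory pools.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier cache: a small 'GPU' tier for hot entries and a larger
    'host' tier for cold ones. Real HiCache tracks KV blocks per prefix and
    moves tensors between device and host memory; here we just move values
    between two Python dicts to illustrate heat-aware placement."""

    def __init__(self, gpu_capacity: int, host_capacity: int):
        self.gpu = OrderedDict()   # hot tier, LRU order (oldest first)
        self.host = OrderedDict()  # cold tier
        self.gpu_capacity = gpu_capacity
        self.host_capacity = host_capacity

    def get(self, key):
        if key in self.gpu:                # hot hit: refresh recency
            self.gpu.move_to_end(key)
            return self.gpu[key]
        if key in self.host:               # cold hit: promote back to the hot tier
            value = self.host.pop(key)
            self._put_gpu(key, value)
            return value
        return None                        # miss: caller must recompute the KV block

    def put(self, key, value):
        self._put_gpu(key, value)

    def _put_gpu(self, key, value):
        self.gpu[key] = value
        self.gpu.move_to_end(key)
        while len(self.gpu) > self.gpu_capacity:      # demote least-recently-used entries
            cold_key, cold_value = self.gpu.popitem(last=False)
            self.host[cold_key] = cold_value
            if len(self.host) > self.host_capacity:   # evict entirely once the cold tier is full
                self.host.popitem(last=False)
```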
3FS, DeepSeek’s open-source distributed file system, provides the storage backbone, delivering 6.6 TiB/s of aggregate read bandwidth on a 180-node cluster with RDMA network support.
Why It Matters: Breaking Through Memory Limits
Traditional KV cache mechanisms struggle with AI agents that maintain context across hours of interaction, not just single requests. Programming agents operating in “Think-Act-Observe” loops add tokens incrementally but must retain full historical state, causing cache requirements to explode from gigabytes to petabytes.
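A rough back-of-the-envelope calculation shows why the footprint grows so fast. The model dimensions and session counts below are illustrative assumptions chosen for the example, not figures from the announcement.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    # Each layer stores a key and a value vector per token: 2 * kv_heads * head_dim values.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Illustrative dimensions for a large dense model (assumed, not quoted from the announcement).
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
per_session = per_token * 1_000_000          # one million-token agent context
fleet = per_session * 3_000                  # a few thousand persistent sessions

print(f"{per_token / 2**10:.0f} KiB per token, "
      f"{per_session / 2**30:.0f} GiB per session, "
      f"{fleet / 2**40:.0f} TiB across the fleet")
```

With these assumed numbers, a single million-token session already needs roughly 300 GiB of KV state, and a few thousand persistent sessions approach the petabyte range.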
HiCache enables models to handle contexts stretching to millions of tokens, far beyond what GPU memory alone can hold. This unlocks practical deployment of multi-agent systems that need shared memory across tasks, complete tool-call traces, and long-term user preference tracking.
The 56% TTFT improvement means faster response times for users interacting with AI agents, while the doubled queries per second (QPS) supports higher concurrent workloads.
Technical Implementation
At the core is HiRadixTree, a dual-layer prefix cache tree that automatically synchronizes KV cache blocks between GPU and CPU memory; a simplified sketch follows the feature list below. Key features include:
- Pluggable storage backends supporting 3FS, Mooncake, and NIXL
- Zero-copy data transmission through unified batch operations
- LRU-based eviction prioritizing high-frequency data
- Kubernetes-based deployment with automatic fault recovery
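As a rough illustration of how a prefix cache tree and a pluggable storage backend might fit together, here is a simplified Python sketch. The class and method names (StorageBackend, PrefixCache, longest_prefix) are hypothetical, and the structure is deliberately naive compared with the production design.

```python
from typing import Optional, Protocol

class StorageBackend(Protocol):
    """Minimal interface a storage tier would expose; real backends (3FS,
    Mooncake, NIXL) batch transfers and avoid copies, which this toy skips."""
    def load(self, block_id: str) -> bytes: ...
    def save(self, block_id: str, data: bytes) -> None: ...

class PrefixNode:
    def __init__(self):
        self.children: dict[int, "PrefixNode"] = {}  # next token -> child node
        self.block_id: Optional[str] = None          # KV block for the prefix ending here
        self.on_gpu: bool = False                    # which tier currently holds the block

class PrefixCache:
    """Toy prefix tree keyed token-by-token. HiRadixTree compresses token runs
    into radix-tree edges and keeps GPU/CPU copies in sync; this sketch keeps
    one node per token to stay short."""

    def __init__(self, backend: StorageBackend):
        self.root = PrefixNode()
        self.backend = backend

    def longest_prefix(self, tokens: list[int]) -> tuple[int, Optional[str]]:
        """Return how many leading tokens are cached and the deepest block id,
        so the engine only has to prefill the uncached suffix."""
        node, matched, block_id = self.root, 0, None
        for i, tok in enumerate(tokens):
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, i + 1
            if child.block_id is not None:
                block_id = child.block_id
        return matched, block_id

    def insert(self, tokens: list[int], block_id: str, kv_data: bytes) -> None:
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())
        node.block_id, node.on_gpu = block_id, True
        self.backend.save(block_id, kv_data)   # write-through to the slower tier
```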
The system supports hybrid attention models combining full attention and linear attention layers, including recent architectures like DeepSeek-V3.
What’s Next: Expanded Model Support
The Tair and SGLang teams are working on enhanced sparse attention support and smarter scheduling strategies that dynamically adjust backup and prefetch rates based on real-time bandwidth usage. HiCache will expand compatibility with additional inference engines beyond SGLang, including vLLM, RTP-LLM, and TensorRT-LLM.
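As a thought experiment for what bandwidth-aware scheduling could look like, a controller might throttle prefetch concurrency when the interconnect is saturated and probe upward when there is headroom. The function and thresholds below are purely illustrative assumptions, not part of any announced API.

```python
def adjust_prefetch_slots(current_slots: int, bandwidth_util: float,
                          min_slots: int = 1, max_slots: int = 16) -> int:
    """Additive-increase / multiplicative-decrease on the number of in-flight
    prefetches, keyed off measured link utilization (0.0-1.0)."""
    if bandwidth_util > 0.8:            # link is saturated: back off quickly
        return max(min_slots, current_slots // 2)
    if bandwidth_util < 0.5:            # headroom available: probe upward slowly
        return min(max_slots, current_slots + 1)
    return current_slots                # in between: hold steady
```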
The technology is currently deployed within Alibaba Cloud infrastructure, though no public release timeline has been announced for third-party access.
Frequently Asked Questions
What is HiCache and how does it differ from traditional KV cache?
HiCache is a hierarchical caching system that spans GPU memory, CPU memory, and distributed storage, unlike traditional KV cache limited to GPU memory. It uses heat-aware scheduling to automatically move data between layers, enabling terabyte-scale context caching for AI agents that need persistent multi-turn memory.
Why do AI agents need HiCache instead of standard caching?
AI agents engage in continuous “Think-Act-Observe” loops over hours, maintaining context across tool calls, decisions, and collaborations. Standard per-request caching can’t persist state between turns or share memory across multiple agents, causing redundant computation and decision conflicts that HiCache eliminates through global cache sharing.
What performance improvements does HiCache deliver?
Integration with 3FS KVStore achieved 80% cache hit rates, reduced average time-to-first-token by 56%, and doubled inference QPS in production tests. A single 40GB GPU can expand effective cache capacity beyond 200GB through CPU memory extension.
Which inference frameworks support HiCache?
HiCache currently serves as a core component in SGLang, with planned support for vLLM, RTP-LLM, and TensorRT-LLM through unified storage interfaces. The system works with hybrid models including DeepSeek-V3 and architectures using Mamba or sliding window attention.