
Alibaba Cloud Tair Partners with SGLang to Build HiCache: New Cache System for Agentic AI Inference


Alibaba Cloud Tair has partnered with SGLang to develop HiCache, a hierarchical KV cache infrastructure designed specifically for "agentic inference," the emerging AI paradigm in which models make multi-turn decisions, self-reflect, and collaborate with other agents. The system addresses critical bottlenecks in traditional cache mechanisms and is now integrated as a core component in the SGLang framework, with plans to become a key module in Alibaba Cloud Tair KVCache. Early tests show cache hit rates reaching 80%, a 56% reduction in time-to-first-token (TTFT), and doubled inference throughput.

What’s New: Multi-Layer Cache Architecture

HiCache constructs a unified hierarchical cache system that spans GPU memory, host memory, local disk, and remote distributed storage like 3FS. This approach tackles three major problems in AI agent workloads: state bloat from extremely long contexts, lack of cross-turn session persistence, and cache isolation between multi-agent tasks.

The system uses intelligent heat-aware scheduling to keep frequently accessed “hot” data in GPU memory while transparently offloading “cold” data to larger-capacity storage. A GPU with only 40GB memory can now leverage CPU memory to expand effective cache capacity beyond 200GB, with storage-layer integration supporting terabyte-level ultra-long context caching.
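The hot/cold tiering described above can be sketched as a toy two-level LRU cache: a small "GPU" tier for hot entries and a larger "host" tier that absorbs demoted ones. This is a minimal illustration of the idea, not HiCache's actual API; the class and capacity parameters are assumptions.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier cache sketch: hot entries stay in a small 'GPU' tier,
    cold entries are demoted to a larger 'host' tier. Illustrative only."""

    def __init__(self, gpu_capacity, host_capacity):
        self.gpu = OrderedDict()   # hot tier, tracked in LRU order
        self.host = OrderedDict()  # cold tier, tracked in LRU order
        self.gpu_capacity = gpu_capacity
        self.host_capacity = host_capacity

    def put(self, key, value):
        self.gpu[key] = value
        self.gpu.move_to_end(key)  # newest access becomes most-recent
        self._evict()

    def get(self, key):
        if key in self.gpu:              # hot hit: refresh recency
            self.gpu.move_to_end(key)
            return self.gpu[key]
        if key in self.host:             # cold hit: promote back to the hot tier
            value = self.host.pop(key)
            self.put(key, value)
            return value
        return None                      # miss: caller must recompute the KV entry

    def _evict(self):
        # Demote least-recently-used hot entries instead of discarding them.
        while len(self.gpu) > self.gpu_capacity:
            k, v = self.gpu.popitem(last=False)
            self.host[k] = v
        # Only the coldest host entries are dropped entirely.
        while len(self.host) > self.host_capacity:
            self.host.popitem(last=False)
```

The key design point mirrored here is that eviction from the fast tier is a demotion, not a deletion, so a later request can promote the entry back without recomputing it.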

3FS, DeepSeek's open-source distributed file system, provides the storage backbone with 6.6 TiB/s read bandwidth across 180-node clusters and RDMA network support.

Why It Matters: Breaking Through Memory Limits

Traditional KV cache mechanisms struggle with AI agents that maintain context across hours of interaction, not just single requests. Programming agents operating in “Think-Act-Observe” loops add tokens incrementally but must retain full historical state, causing cache requirements to explode from gigabytes to petabytes.
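The scale of the problem follows from simple arithmetic: each retained token carries a K and a V tensor for every layer. The sketch below uses hypothetical model dimensions (32 layers, 8 KV heads, head size 128, fp16) purely to show the shape of the calculation; real models vary.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Estimate KV cache size for a given context length.

    Per token, each layer stores one K and one V tensor of shape
    (kv_heads, head_dim) at dtype_bytes per element. The model
    dimensions here are hypothetical defaults, not any specific model.
    """
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens
```

With these assumed dimensions, each token costs 128 KiB of cache, so a million-token agent history needs on the order of 131 GB, which is well past any single GPU's memory and is exactly the gap the host-memory and storage tiers are meant to absorb.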

HiCache enables models to handle contexts stretching to millions of tokens far beyond what GPU memory alone can hold. This unlocks practical deployment of multi-agent systems that need shared memory across tasks, complete tool call traces, and long-term user preference tracking.

The 56% TTFT improvement means faster response times for users interacting with AI agents, while the doubled QPS supports higher concurrent workloads.

Technical Implementation

At the core of the architecture is HiRadixTree, a dual-layer prefix cache tree that automatically synchronizes KV cache entries between GPU and CPU memory. Key features include:

  • Pluggable storage backends supporting 3FS, Mooncake, and NIXL
  • Zero-copy data transmission through unified batch operations
  • LRU-based eviction prioritizing high-frequency data
  • Kubernetes-based deployment with automatic fault recovery
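The prefix-tree idea behind HiRadixTree can be sketched with a plain trie over token IDs: a new request walks the tree to find how many leading tokens already have cached KV entries and only recomputes the tail. This is a simplified illustration in the spirit of SGLang's RadixAttention, not HiCache's real implementation.

```python
class PrefixCache:
    """Toy prefix tree over token IDs, sketching prefix-based KV reuse.
    Simplified for illustration; real radix trees compress chains of nodes."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        # Record that KV entries for every prefix of `tokens` now exist.
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def longest_prefix(self, tokens):
        # Return how many leading tokens already have cached KV entries;
        # only the remaining suffix needs prefill computation.
        node, matched = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node = node[t]
            matched += 1
        return matched
```

In a multi-turn agent session each new turn shares the entire previous conversation as its prefix, which is why tree-structured sharing yields the high hit rates reported above.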

The system supports hybrid attention models combining full attention and linear attention layers, including recent architectures like DeepSeek-V3.

What’s Next: Expanded Model Support

The Tair and SGLang teams are working on enhanced sparse attention support and smarter scheduling strategies that dynamically adjust backup and prefetch rates based on real-time bandwidth usage. HiCache will expand compatibility with additional inference engines beyond SGLang, including vLLM, RTP-LLM, and TensorRT-LLM.
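One way to picture the bandwidth-aware scheduling mentioned above is a policy that scales the prefetch rate to whatever bandwidth live traffic leaves free, minus a safety headroom. This is a hypothetical policy sketch, not the actual Tair/SGLang scheduler.

```python
def prefetch_rate(link_capacity_gbps, measured_load_gbps, max_rate, headroom=0.1):
    """Scale prefetch throughput to spare link bandwidth.

    Hypothetical policy: reserve `headroom` of the link for bursts,
    subtract the currently measured foreground load, and scale the
    prefetch rate linearly with whatever capacity remains.
    """
    usable = link_capacity_gbps * (1.0 - headroom)
    spare = max(0.0, usable - measured_load_gbps)
    return min(max_rate, max_rate * spare / link_capacity_gbps)
```

The point of such a controller is that prefetching stays aggressive when the RDMA link is idle but backs off automatically before it can starve latency-critical inference traffic.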

The technology is currently deployed within Alibaba Cloud infrastructure, though no public release timeline has been announced for third-party access.

Frequently Asked Questions

What is HiCache and how does it differ from traditional KV cache?

HiCache is a hierarchical caching system that spans GPU memory, CPU memory, and distributed storage, unlike traditional KV cache limited to GPU memory. It uses heat-aware scheduling to automatically move data between layers, enabling terabyte-scale context caching for AI agents that need persistent multi-turn memory.

Why do AI agents need HiCache instead of standard caching?

AI agents engage in continuous “Think-Act-Observe” loops over hours, maintaining context across tool calls, decisions, and collaborations. Standard per-request caching can’t persist state between turns or share memory across multiple agents, causing redundant computation and decision conflicts that HiCache eliminates through global cache sharing.

What performance improvements does HiCache deliver?

Integration with 3FS KVStore achieved 80% cache hit rates, reduced average time-to-first-token by 56%, and doubled inference QPS in production tests. A single 40GB GPU can expand effective cache capacity beyond 200GB through CPU memory extension.

Which inference frameworks support HiCache?

HiCache currently serves as a core component in SGLang, with planned support for vLLM, RTP-LLM, and TensorRT-LLM through unified storage interfaces. The system works with hybrid models including DeepSeek-V3 and architectures using Mamba or sliding window attention.

Mohammad Kashif
Senior Technology Analyst and Writer at AdwaitX, specializing in the convergence of Mobile Silicon, Generative AI, and Consumer Hardware. Moving beyond spec sheets, his reviews rigorously test "real-world" metrics analyzing sustained battery efficiency, camera sensor behavior, and long-term software support lifecycles. Kashif’s data-driven approach helps enthusiasts and professionals distinguish between genuine innovation and marketing hype, ensuring they invest in devices that offer lasting value.
