
    Alibaba Cloud Tair Partners with SGLang to Build HiCache: New Cache System for Agentic AI Inference

    Alibaba Cloud Tair has partnered with SGLang to develop HiCache, a hierarchical KV cache infrastructure designed specifically for “agentic inference,” the emerging AI paradigm in which models make multi-turn decisions, self-reflect, and collaborate with other agents. The system addresses critical bottlenecks in traditional cache mechanisms and is now integrated as a core component in the SGLang framework, with plans to become a key module in Alibaba Cloud Tair KVCache. Early tests show cache hit rates reaching 80%, a 56% reduction in time-to-first-token (TTFT), and doubled inference throughput.

    What’s New: Multi-Layer Cache Architecture

    HiCache constructs a unified hierarchical cache system that spans GPU memory, host memory, local disk, and remote distributed storage like 3FS. This approach tackles three major problems in AI agent workloads: state bloat from extremely long contexts, lack of cross-turn session persistence, and cache isolation between multi-agent tasks.

    The system uses intelligent heat-aware scheduling to keep frequently accessed “hot” data in GPU memory while transparently offloading “cold” data to larger-capacity storage. A GPU with only 40GB memory can now leverage CPU memory to expand effective cache capacity beyond 200GB, with storage-layer integration supporting terabyte-level ultra-long context caching.

    3FS, DeepSeek’s open-source distributed file system, provides the storage backbone with 6.6 TiB/s read bandwidth across 180-node clusters and RDMA network support.

    Why It Matters: Breaking Through Memory Limits

    Traditional KV cache mechanisms struggle with AI agents that maintain context across hours of interaction, not just single requests. Programming agents operating in “Think-Act-Observe” loops add tokens incrementally but must retain full historical state, causing cache requirements to explode from gigabytes to petabytes.

    HiCache enables models to handle contexts stretching to millions of tokens, far beyond what GPU memory alone can hold. This unlocks practical deployment of multi-agent systems that need shared memory across tasks, complete tool call traces, and long-term user preference tracking.

    The 56% TTFT improvement means faster response times for users interacting with AI agents, while the doubled QPS (queries per second) supports higher concurrent workloads.

    Technical Implementation

    At the core of HiCache is HiRadixTree, a dual-layer prefix cache tree that automatically synchronizes KV cache entries between GPU and CPU memory. Key features include:

    • Pluggable storage backends supporting 3FS, Mooncake, and NIXL
    • Zero-copy data transmission through unified batch operations
    • LRU-based eviction prioritizing high-frequency data
    • Kubernetes-based deployment with automatic fault recovery
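    The prefix-reuse idea behind a radix-style cache tree can be sketched as a token-level trie: a lookup walks the tree and reports how many leading tokens of a new request already have cached KV state, so only the unmatched suffix needs recomputation. This is a simplified illustration, not the HiRadixTree implementation; the class names and per-token KV placeholder are hypothetical.

```python
class PrefixCacheNode:
    """One node per token position along a cached sequence."""
    __slots__ = ("children", "kv")

    def __init__(self):
        self.children: dict[int, "PrefixCacheNode"] = {}
        self.kv = None  # placeholder for the KV state cached at this prefix

class PrefixCache:
    """Token-level prefix tree: insert() records KV state along a token
    sequence; match() returns how many leading tokens of a new request
    are already cached."""

    def __init__(self):
        self.root = PrefixCacheNode()

    def insert(self, tokens: list[int], kv_per_token: list) -> None:
        node = self.root
        for tok, kv in zip(tokens, kv_per_token):
            node = node.children.setdefault(tok, PrefixCacheNode())
            node.kv = kv

    def match(self, tokens: list[int]) -> int:
        node, hits = self.root, 0
        for tok in tokens:
            nxt = node.children.get(tok)
            if nxt is None:
                break
            node, hits = nxt, hits + 1
        return hits
```

    In the dual-layer design, each node would additionally record which tier (GPU or CPU) holds its KV data, and eviction would walk the tree in LRU order rather than deleting nodes outright.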

    The system supports hybrid attention models combining full attention and linear attention layers, including recent architectures like DeepSeek-V3.

    What’s Next: Expanded Model Support

    The Tair and SGLang teams are working on enhanced sparse attention support and smarter scheduling strategies that dynamically adjust backup and prefetch rates based on real-time bandwidth usage. HiCache will expand compatibility with additional inference engines beyond SGLang, including vLLM, RTP-LLM, and TensorRT-LLM.

    The technology is currently deployed within Alibaba Cloud infrastructure, though no public release timeline has been announced for third-party access.

    FAQ

    What is HiCache and how does it differ from traditional KV cache?

    HiCache is a hierarchical caching system that spans GPU memory, CPU memory, and distributed storage, unlike traditional KV cache limited to GPU memory. It uses heat-aware scheduling to automatically move data between layers, enabling terabyte-scale context caching for AI agents that need persistent multi-turn memory.

    Why do AI agents need HiCache instead of standard caching?

    AI agents engage in continuous “Think-Act-Observe” loops over hours, maintaining context across tool calls, decisions, and collaborations. Standard per-request caching can’t persist state between turns or share memory across multiple agents, causing redundant computation and decision conflicts that HiCache eliminates through global cache sharing.

    What performance improvements does HiCache deliver?

    Integration with 3FS KVStore achieved 80% cache hit rates, reduced average time-to-first-token by 56%, and doubled inference QPS in production tests. A single 40GB GPU can expand effective cache capacity beyond 200GB through CPU memory extension.

    Which inference frameworks support HiCache?

    HiCache currently serves as a core component in SGLang, with planned support for vLLM, RTP-LLM, and TensorRT-LLM through unified storage interfaces. The system works with hybrid models including DeepSeek-V3 and architectures using Mamba or sliding window attention.

    Mohammad Kashif, Senior Technology Analyst and Writer at AdwaitX
