
    Alibaba Cloud Tair Partners with SGLang to Build HiCache: New Cache System for Agentic AI Inference


    Alibaba Cloud Tair has partnered with SGLang to develop HiCache, a hierarchical KV cache infrastructure designed for “agentic inference,” the emerging AI paradigm in which models make multi-turn decisions, self-reflect, and collaborate with other agents. The system addresses critical bottlenecks in traditional cache mechanisms and is already integrated as a core component of the SGLang framework, with plans to become a key module in Alibaba Cloud Tair KVCache. Early tests show cache hit rates reaching 80%, a 56% reduction in time-to-first-token (TTFT), and doubled inference throughput.

    What’s New: Multi-Layer Cache Architecture

    HiCache constructs a unified hierarchical cache system that spans GPU memory, host memory, local disk, and remote distributed storage like 3FS. This approach tackles three major problems in AI agent workloads: state bloat from extremely long contexts, lack of cross-turn session persistence, and cache isolation between multi-agent tasks.

    The system uses intelligent heat-aware scheduling to keep frequently accessed “hot” data in GPU memory while transparently offloading “cold” data to larger-capacity storage. A GPU with only 40GB memory can now leverage CPU memory to expand effective cache capacity beyond 200GB, with storage-layer integration supporting terabyte-level ultra-long context caching.
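
    The tiering idea can be illustrated with a toy two-level cache. This is a minimal sketch of heat-aware placement, not HiCache's actual scheduler; all class and key names are invented for illustration:

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier cache: a small "GPU" tier backed by a larger "host" tier.

    Hot entries stay in the GPU tier; on overflow, the least recently used
    entry is demoted to the host tier instead of being discarded outright.
    """

    def __init__(self, gpu_capacity: int, host_capacity: int):
        self.gpu = OrderedDict()   # small, fast tier (kept in LRU order)
        self.host = OrderedDict()  # large, slower tier
        self.gpu_capacity = gpu_capacity
        self.host_capacity = host_capacity

    def get(self, key):
        if key in self.gpu:                    # hot hit: refresh recency
            self.gpu.move_to_end(key)
            return self.gpu[key]
        if key in self.host:                   # cold hit: promote to GPU tier
            return self.put(key, self.host.pop(key))
        return None                            # miss: caller must recompute

    def put(self, key, value):
        self.gpu[key] = value
        self.gpu.move_to_end(key)
        if len(self.gpu) > self.gpu_capacity:  # demote LRU entry to host tier
            old_key, old_value = self.gpu.popitem(last=False)
            self.host[old_key] = old_value
            if len(self.host) > self.host_capacity:
                self.host.popitem(last=False)  # evict entirely
        return value

cache = TieredKVCache(gpu_capacity=2, host_capacity=4)
cache.put("turn-1", "kv-1")
cache.put("turn-2", "kv-2")
cache.put("turn-3", "kv-3")                   # "turn-1" demoted to host tier
assert "turn-1" in cache.host
assert cache.get("turn-1") == "kv-1"          # promoted back on access
```

    The real system additionally tracks access frequency ("heat"), batches transfers, and spans four tiers down to remote storage, but the promote-on-hit, demote-on-overflow pattern is the core mechanism that lets a 40GB GPU front a much larger effective cache.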

    3FS, the distributed file system open-sourced by DeepSeek, provides the storage backbone, delivering 6.6 TiB/s aggregate read bandwidth across a 180-node cluster with RDMA network support.

    Why It Matters: Breaking Through Memory Limits

    Traditional KV cache mechanisms struggle with AI agents that maintain context across hours of interaction, not just single requests. Programming agents operating in “Think-Act-Observe” loops add tokens incrementally but must retain full historical state, causing cache requirements to explode from gigabytes to petabytes.
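
    The scale of the problem follows from simple arithmetic: KV cache size grows linearly with context length. A back-of-the-envelope calculation (the model shape below is illustrative of a large GQA model, not any specific system named in this article):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Per token, each layer stores a K and a V tensor of num_kv_heads * head_dim
    # elements; fp16 uses 2 bytes per element.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative 80-layer model with 8 KV heads (GQA) and head_dim 128, in fp16
per_million = kv_cache_bytes(80, 8, 128, 1_000_000)
print(per_million / 2**30)  # prints 305.17578125 -> ~305 GiB for ONE 1M-token context
```

    A single million-token session already exceeds any GPU's memory, and a fleet of agents each holding long histories multiplies that further, which is why offloading to host memory and distributed storage becomes necessary.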

    HiCache enables models to handle contexts stretching to millions of tokens far beyond what GPU memory alone can hold. This unlocks practical deployment of multi-agent systems that need shared memory across tasks, complete tool call traces, and long-term user preference tracking.

    The 56% TTFT improvement means faster response times for users interacting with AI agents, while the doubled QPS (queries per second) supports higher concurrent workloads.

    Technical Implementation

    At the core of the architecture is HiRadixTree, a dual-layer prefix cache tree that automatically synchronizes the KV cache between GPU and CPU memory. Key features include:

    • Pluggable storage backends supporting 3FS, Mooncake, and NIXL
    • Zero-copy data transmission through unified batch operations
    • LRU-based eviction prioritizing high-frequency data
    • Kubernetes-based deployment with automatic fault recovery
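
    The prefix-tree idea behind HiRadixTree can be sketched as a trie over token sequences: a new request walks the tree to find its longest already-cached prefix, so only the uncached suffix needs prefill. This is a simplified single-tier toy, not HiCache's implementation, and all names are invented:

```python
class PrefixNode:
    def __init__(self):
        self.children = {}   # token -> child node
        self.has_kv = False  # True if KV for the prefix ending here is cached

class PrefixCache:
    """Toy prefix cache: records which token prefixes have KV computed."""

    def __init__(self):
        self.root = PrefixNode()

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixNode())
            node.has_kv = True  # every prefix of an inserted sequence is cached

    def longest_cached_prefix(self, tokens):
        node, matched = self.root, 0
        for i, t in enumerate(tokens):
            if t not in node.children or not node.children[t].has_kv:
                break
            node = node.children[t]
            matched = i + 1
        return matched

cache = PrefixCache()
cache.insert([1, 2, 3, 4])                             # turn 1: prompt + answer
hit = cache.longest_cached_prefix([1, 2, 3, 4, 5, 6])  # turn 2 extends turn 1
print(hit)  # prints 4: four tokens reused, only tokens 5 and 6 need prefill
```

    In the real dual-layer design, each tree node additionally records which tier (GPU or CPU) holds its KV blocks, and the LRU eviction listed above operates on these nodes.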

    The system supports hybrid attention models combining full attention and linear attention layers, including recent architectures like DeepSeek-V3.

    What’s Next: Expanded Model Support

    The Tair and SGLang teams are working on enhanced sparse attention support and smarter scheduling strategies that dynamically adjust backup and prefetch rates based on real-time bandwidth usage. HiCache will expand compatibility with additional inference engines beyond SGLang, including vLLM, RTP-LLM, and TensorRT-LLM.

    The technology is currently deployed within Alibaba Cloud infrastructure, though no public release timeline has been announced for third-party access.

    Frequently Asked Questions

    What is HiCache and how does it differ from traditional KV cache?

    HiCache is a hierarchical caching system that spans GPU memory, CPU memory, and distributed storage, unlike traditional KV cache limited to GPU memory. It uses heat-aware scheduling to automatically move data between layers, enabling terabyte-scale context caching for AI agents that need persistent multi-turn memory.

    Why do AI agents need HiCache instead of standard caching?

    AI agents engage in continuous “Think-Act-Observe” loops over hours, maintaining context across tool calls, decisions, and collaborations. Standard per-request caching can’t persist state between turns or share memory across multiple agents, causing redundant computation and decision conflicts that HiCache eliminates through global cache sharing.

    What performance improvements does HiCache deliver?

    Integration with 3FS KVStore achieved 80% cache hit rates, reduced average time-to-first-token by 56%, and doubled inference QPS in production tests. A single 40GB GPU can expand effective cache capacity beyond 200GB through CPU memory extension.

    Which inference frameworks support HiCache?

    HiCache currently serves as a core component in SGLang, with planned support for vLLM, RTP-LLM, and TensorRT-LLM through unified storage interfaces. The system works with hybrid models including DeepSeek-V3 and architectures using Mamba or sliding window attention.

    Mohammad Kashif
    Covers smartphones, AI, and emerging tech, explaining how new features affect daily life. Reviews focus on battery life, camera behavior, update policies, and long-term value to help readers choose the right gadgets and software.
