
    Alibaba Cloud Tair Partners with SGLang to Build HiCache: New Cache System for Agentic AI Inference

    Alibaba Cloud Tair has partnered with SGLang to develop HiCache, a hierarchical KV cache infrastructure designed specifically for “agentic inference,” the emerging AI paradigm in which models make multi-turn decisions, self-reflect, and collaborate with other agents. The system addresses critical bottlenecks in traditional cache mechanisms and is now integrated as a core component in the SGLang framework, with plans to become a key module in Alibaba Cloud Tair KVCache. Early tests show cache hit rates reaching 80%, a 56% reduction in time-to-first-token (TTFT), and doubled inference throughput.

    What’s New: Multi-Layer Cache Architecture

    HiCache constructs a unified hierarchical cache system that spans GPU memory, host memory, local disk, and remote distributed storage like 3FS. This approach tackles three major problems in AI agent workloads: state bloat from extremely long contexts, lack of cross-turn session persistence, and cache isolation between multi-agent tasks.

    The system uses intelligent heat-aware scheduling to keep frequently accessed “hot” data in GPU memory while transparently offloading “cold” data to larger-capacity storage. A GPU with only 40GB memory can now leverage CPU memory to expand effective cache capacity beyond 200GB, with storage-layer integration supporting terabyte-level ultra-long context caching.

    3FS, DeepSeek’s open-source distributed file system, provides the storage backbone with 6.6 TiB/s read bandwidth across 180-node clusters and RDMA network support.

    Why It Matters: Breaking Through Memory Limits

    Traditional KV cache mechanisms struggle with AI agents that maintain context across hours of interaction, not just single requests. Programming agents operating in “Think-Act-Observe” loops add tokens incrementally but must retain full historical state, causing cache requirements to explode from gigabytes to petabytes.

    HiCache enables models to handle contexts stretching to millions of tokens, far beyond what GPU memory alone can hold. This unlocks practical deployment of multi-agent systems that need shared memory across tasks, complete tool call traces, and long-term user preference tracking.

    The 56% TTFT improvement means faster response times for users interacting with AI agents, while the doubled QPS (queries per second) supports higher concurrent workloads.

    Technical Implementation

    At the core of HiCache is HiRadixTree, a dual-layer prefix cache tree that automatically synchronizes KV cache entries between GPU and CPU memory. Key features include:

    • Pluggable storage backends supporting 3FS, Mooncake, and NIXL
    • Zero-copy data transmission through unified batch operations
    • LRU-based eviction prioritizing high-frequency data
    • Kubernetes-based deployment with automatic fault recovery
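    The prefix-reuse idea behind a radix-style cache tree can be sketched as a token-level trie: a lookup walks the tree and reports how many leading tokens of a new request already have cached KV state, so only the unmatched suffix needs recomputation. This is a simplified illustration, not the HiRadixTree implementation; the class names and per-token KV placeholder are hypothetical.

```python
class PrefixCacheNode:
    """One node per token position along a cached sequence."""
    __slots__ = ("children", "kv")

    def __init__(self):
        self.children: dict[int, "PrefixCacheNode"] = {}
        self.kv = None  # placeholder for the KV state cached at this prefix

class PrefixCache:
    """Token-level prefix tree: insert() records KV state along a token
    sequence; match() returns how many leading tokens of a new request
    are already cached."""

    def __init__(self):
        self.root = PrefixCacheNode()

    def insert(self, tokens: list[int], kv_per_token: list) -> None:
        node = self.root
        for tok, kv in zip(tokens, kv_per_token):
            node = node.children.setdefault(tok, PrefixCacheNode())
            node.kv = kv

    def match(self, tokens: list[int]) -> int:
        node, hits = self.root, 0
        for tok in tokens:
            nxt = node.children.get(tok)
            if nxt is None:
                break
            node, hits = nxt, hits + 1
        return hits
```

    In the dual-layer design, each node would additionally record which tier (GPU or CPU) holds its KV data, and eviction would walk the tree in LRU order rather than deleting nodes outright.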

    The system supports hybrid attention models combining full attention and linear attention layers, including recent architectures like DeepSeek-V3.

    What’s Next: Expanded Model Support

    The Tair and SGLang teams are working on enhanced sparse attention support and smarter scheduling strategies that dynamically adjust backup and prefetch rates based on real-time bandwidth usage. HiCache will expand compatibility with additional inference engines beyond SGLang, including vLLM, RTP-LLM, and TensorRT-LLM.

    The technology is currently deployed within Alibaba Cloud infrastructure, though no public release timeline has been announced for third-party access.

    FAQ

    What is HiCache and how does it differ from traditional KV cache?

    HiCache is a hierarchical caching system that spans GPU memory, CPU memory, and distributed storage, unlike traditional KV cache limited to GPU memory. It uses heat-aware scheduling to automatically move data between layers, enabling terabyte-scale context caching for AI agents that need persistent multi-turn memory.

    Why do AI agents need HiCache instead of standard caching?

    AI agents engage in continuous “Think-Act-Observe” loops over hours, maintaining context across tool calls, decisions, and collaborations. Standard per-request caching can’t persist state between turns or share memory across multiple agents, causing redundant computation and decision conflicts that HiCache eliminates through global cache sharing.

    What performance improvements does HiCache deliver?

    Integration with 3FS KVStore achieved 80% cache hit rates, reduced average time-to-first-token by 56%, and doubled inference QPS in production tests. A single 40GB GPU can expand effective cache capacity beyond 200GB through CPU memory extension.

    Which inference frameworks support HiCache?

    HiCache currently serves as a core component in SGLang, with planned support for vLLM, RTP-LLM, and TensorRT-LLM through unified storage interfaces. The system works with hybrid models including DeepSeek-V3 and architectures using Mamba or sliding window attention.

    Mohammad Kashif, Senior Technology Analyst and Writer at AdwaitX
