DeepSeek’s new V3.2-Exp adds DeepSeek Sparse Attention (DSA) and pushes 128K context with big price cuts. If your app benefits from long prompts or retrieval with repeated context, you’ll likely pay a lot less, especially when you hit the cache. Performance broadly tracks V3.1, with small coding gains.
What DeepSeek just launched
V3.2-Exp is an experimental step on top of V3.1-Terminus. The headline change is DeepSeek Sparse Attention (DSA), which trims the amount of work the model does on long sequences. It ships on the app, web, and API, with open assets on Hugging Face and GitHub. Pricing has been cut across the board, with the eye-catcher being well under 3¢ per 1M input tokens on cache hits, plus a 50% cut on cache misses. Benchmarks remain similar to V3.1 overall, with a small boost on programming tasks.
Sparse Attention, simply explained
Full transformer attention compares every token with every other token. That’s quadratic work; costs explode as context grows. Sparse attention keeps the most relevant links and ignores the rest, so the model does far fewer comparisons while keeping quality close to full attention. DeepSeek adds a “lightning indexer” to score and keep the good stuff, which is how it stretches to 128K tokens without blowing up memory or latency.
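To make the mechanics concrete, here is a minimal top-k sketch in PyTorch. It illustrates the general idea only: the score-and-prune step stands in for DeepSeek’s lightning indexer (whose actual design is not reproduced here), and this toy still materializes the full score matrix, so it saves nothing by itself.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k=64):
    """Toy top-k sparse attention: each query keeps only its highest-scoring
    key links instead of attending to every token (dense attention).
    Shapes: q, k, v are (seq_len, d_model)."""
    scores = q @ k.T / (q.shape[-1] ** 0.5)        # (seq, seq) relevance scores
    top_vals, top_idx = scores.topk(min(top_k, scores.shape[-1]), dim=-1)
    mask = torch.full_like(scores, float("-inf"))  # prune everything by default
    mask.scatter_(-1, top_idx, top_vals)           # restore only the kept links
    weights = F.softmax(mask, dim=-1)              # pruned links get zero weight
    return weights @ v

q = k = v = torch.randn(1024, 64)
out = topk_sparse_attention(q, k, v, top_k=64)     # 64 links per query vs. 1024
```

Real sparse-attention kernels avoid computing most of those scores in the first place, which is where the memory and latency savings on 128K contexts come from.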
V3.2-Exp vs V3.1-Terminus (quick compare)
- Context length: 128K for V3.2-Exp; designed for long docs, RAG, and multi-turn workflows.
- Costs: Cache hit input pricing drops below 3¢/M tokens. Cache miss and output are also lower than before. In real apps that reuse long system prompts or retrieved chunks, cache hits compound savings.
- Quality: Broad parity with V3.1; coding shows a small uptick (e.g., higher Codeforces rating).
- Ecosystem: Day-one support from SGLang and vLLM, plus open kernels (TileLang + CUDA/FlashMLA) for researchers and infra teams.
Where it helps (and where it might not)
Good fits
- RAG and memory-heavy agents: Long prompts with repeated context get more cache hits and fewer wasted comparisons.
- Batch analytics and summarization: Large docs, logs, and transcripts that previously pushed you into timeouts or OOMs.
- Cost-sensitive prototypes: Teams that want GPT-4-ish outputs for long context tasks but can’t swallow quadratic costs.
Less ideal (for now)
- Latency-critical, short prompts: Sparse attention shines with longer contexts; tiny prompts may see little gain.
- Benchmarks where V3.1 edged V3.2: A few reasoning tasks dip slightly; if you’re married to those scores, test before switching.
Comparison: DeepSeek V3.1 vs V3.2-Exp
| Feature | V3.1-Terminus | V3.2-Exp |
|---|---|---|
| Attention | Dense/MLA | Sparse (DSA) + lightning indexer |
| Context | 128K | 128K |
| Coding (Codeforces) | ~2046 | ~2121 |
| API input (cache hit) | ~$0.07/M (historical) | $0.028/M (as of 2025-09-30) |
| Fit | General-purpose | Long-context, cost-sensitive |
How to try it today (quickstart)
- Check pricing & quotas in the API docs and enable caching in your stack.
- Spin up with SGLang or vLLM for immediate support and KV-cache wins.
- Start with your longest prompt path (system + RAG chunks). Measure cache-hit rate and cost per output (a minimal sketch follows this list).
- Tune retrieval windows (don’t flood the context if it’s not helping).
- Watch memory: 128K is friendly, not free—monitor VRAM/CPU RAM in production.
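A minimal measurement sketch, assuming the OpenAI-compatible DeepSeek endpoint, the `deepseek-chat` model alias, and that the response’s usage object exposes cached vs. uncached prompt-token counts; confirm the exact field names and current prices in the API docs.

```python
# Hedged sketch: endpoint, model alias, and usage field names are assumptions.
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_KEY", base_url="https://api.deepseek.com")

LONG_SYSTEM_PROMPT = "...your reused system prompt + RAG chunks..."

resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},  # reused prefix -> cache hits
        {"role": "user", "content": "Summarize the retrieved documents."},
    ],
)

u = resp.usage
hit = getattr(u, "prompt_cache_hit_tokens", 0)
miss = getattr(u, "prompt_cache_miss_tokens", u.prompt_tokens)
print(f"cache-hit rate: {hit / max(u.prompt_tokens, 1):.1%}")
print(f"output tokens:  {u.completion_tokens}")
```

Run this against your real longest prompts; the hit rate you see there, not the headline price, determines what you actually save.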
Limitations & considerations
- Experimental: This is an intermediate step; expect updates.
- Caching matters: Your real-world cost depends on hit rate.
- Framework parity: vLLM/SGLang support is good, but pin to releases that list V3.2-Exp support (see the sketch after this list).
- Numbers move: Pricing and context limits can change; always reconfirm before committing budgets.
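For self-hosted testing, here is a minimal sketch using vLLM’s offline API. The Hugging Face model ID and DSA support in your installed build are assumptions; pin vLLM (or SGLang) to a release that explicitly lists V3.2-Exp before benchmarking.

```python
# Hedged sketch: model ID and hardware sizing are assumptions, not a recipe.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3.2-Exp",  # assumed HF repo name; verify
    tensor_parallel_size=8,                 # adjust to your GPU count
)
params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(["Summarize this 100-page transcript: ..."], params)
print(outputs[0].outputs[0].text)
```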
Frequently Asked Questions
Is V3.2-Exp open source?
Weights and repos are available; deployment terms are permissive for builders.
Does it truly keep quality?
Mostly similar to V3.1 across public suites; coding shows a small bump.
What’s the context limit?
128K tokens, which is plenty for most RAG and doc workflows.
What if my prompts are short?
You may not see headline savings; test with your real traffic.
Checklist
- Confirm latest pricing and enable caching.
- Route long-context paths to V3.2-Exp first.
- Measure hit rate, latency, and cost per output in GA4 or your observability stack.
- Keep a rollback to V3.1 for tasks where it still wins.
Glossary
Sparse attention: An efficiency trick that preserves only the most useful attention links.
Cache hit/miss: Re-using prior compute on the same prompt chunk (hit) vs recomputing (miss).
KV cache: Stored keys/values from attention layers used to speed up repeated context.
128K context: Roughly a book-length window the model can “see” at once.
Featured Answer Boxes
What is DeepSeek Sparse Attention?
A pruning method that scores tokens and keeps only the most relevant attention links. It cuts memory and compute on long prompts while keeping output quality close to dense attention, enabling 128K context at lower cost.
How much does DeepSeek cost now?
As of Sept 29–30, 2025: input cache hit ≈ $0.028/M tokens, cache miss ≈ $0.28/M, output ≈ $0.42/M. Always check the pricing page before deploying.
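A back-of-envelope example at those rates (they move, so re-check before budgeting): a hypothetical request with a 50K-token prompt, an 80% cache-hit rate, and 1K output tokens.

```python
# Cost per request at the Sept 2025 rates quoted above (re-check the pricing page).
HIT, MISS, OUT = 0.028 / 1e6, 0.28 / 1e6, 0.42 / 1e6  # $ per token

def request_cost(prompt_tokens, hit_rate, output_tokens):
    hits = prompt_tokens * hit_rate
    misses = prompt_tokens - hits
    return hits * HIT + misses * MISS + output_tokens * OUT

# 50K-token prompt, 80% cached, 1K output -> roughly $0.0043 per request
print(f"${request_cost(50_000, 0.80, 1_000):.4f}")
```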
Is V3.2-Exp faster for long prompts?
Yes: sparse attention reduces work as context grows. You’ll see the biggest gains on long prompts that repeat content (RAG, memory), especially with a high cache-hit rate.