    DeepSeek V3.2-Exp: Sparse Attention, faster long-context, cheaper API


    DeepSeek’s new V3.2-Exp adds DeepSeek Sparse Attention (DSA) and pushes 128K context with big price cuts. If your app benefits from long prompts or retrieval with repeated context, you’ll likely pay a lot less, especially when you hit the cache. Performance broadly tracks V3.1, with small coding gains.

    What DeepSeek just launched

    V3.2-Exp is an experimental step on top of V3.1-Terminus. The headline change is DeepSeek Sparse Attention (DSA), which trims the amount of work the model does on long sequences. It ships on the app, web, and API, with open assets on Hugging Face and GitHub. Pricing has been cut across the board, with the eye-catcher being well under 3¢ per 1M input tokens on cache hits, plus a 50% cut on cache misses. Benchmarks remain similar to V3.1 overall, with a small boost on programming tasks.

    Sparse Attention, simply explained

    Full transformer attention compares every token with every other token. That’s quadratic work; costs explode as context grows. Sparse attention keeps the most relevant links and ignores the rest, so the model does far fewer comparisons while keeping quality close to full attention. DeepSeek adds a “lightning indexer” to score and keep the good stuff, which is how it stretches to 128K tokens without blowing up memory or latency.
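
    To make this concrete, here is a minimal top-k sparse attention sketch in NumPy. It illustrates the principle only, not DeepSeek's actual DSA kernels: a real implementation uses a lightweight indexer so it never scores every pair, while this toy version computes the full score matrix and then discards most of it.

    import numpy as np

    def sparse_attention(q, k, v, top_k=4):
        """q: (n, d) queries; k, v: (m, d) keys/values. Returns (n, d)."""
        d = q.shape[-1]
        scores = q @ k.T / np.sqrt(d)                   # (n, m) relevance scores
        # Keep only the top_k links per query; mask the rest out of the softmax.
        keep = np.argpartition(scores, -top_k, axis=-1)[:, -top_k:]
        mask = np.full_like(scores, -np.inf)
        np.put_along_axis(mask, keep, 0.0, axis=-1)
        masked = scores + mask
        weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over kept links
        return weights @ v

    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
    out = sparse_attention(q, k, v, top_k=4)            # each query keeps 4 of 8 keys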

    V3.2-Exp vs V3.1-Terminus (quick compare)

    • Context length: 128K for V3.2-Exp; designed for long docs, RAG, and multi-turn workflows.
    • Costs: Cache-hit input pricing drops below 3¢/M tokens. Cache-miss and output rates are also lower than before. In real apps that reuse long system prompts or retrieved chunks, cache hits compound the savings (see the cost sketch after this list).
    • Quality: Broad parity with V3.1; coding shows a small uptick (e.g., higher Codeforces rating).
    • Ecosystem: Day-one support from SGLang and vLLM, plus open kernels (TileLang + CUDA/FlashMLA) for researchers and infra teams.
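
    A quick back-of-envelope on the cost bullet above: blended input cost per million tokens at a given cache-hit rate, using the listed launch prices ($0.028/M on hits, $0.28/M on misses); reconfirm them on the pricing page before budgeting.

    HIT, MISS = 0.028, 0.28   # USD per 1M input tokens (launch pricing; verify)

    def blended_input_cost(hit_rate: float) -> float:
        """Average cost per 1M input tokens at the given cache-hit rate."""
        return hit_rate * HIT + (1 - hit_rate) * MISS

    for rate in (0.0, 0.5, 0.9):
        print(f"hit rate {rate:.0%}: ${blended_input_cost(rate):.3f}/M input tokens")
    # 0%: $0.280/M; 50%: $0.154/M; 90%: $0.053/M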

    Where it helps (and where it might not)

    Good fits

    • RAG and memory-heavy agents: Long prompts with repeated context get more cache hits and fewer wasted comparisons.
    • Batch analytics and summarization: Large docs, logs, and transcripts that previously pushed you into timeouts or OOMs.
    • Cost-sensitive prototypes: Teams that want GPT-4-ish outputs for long context tasks but can’t swallow quadratic costs.

    Less ideal (for now)

    • Latency-critical, short prompts: Sparse attention shines with longer contexts; tiny prompts may see little gain.
    • Benchmarks where V3.1 edged V3.2: A few reasoning tasks dip slightly; if you’re married to those scores, test before switching.

    Comparison: DeepSeek V3.1 vs V3.2-Exp

    Feature | V3.1-Terminus | V3.2-Exp
    Attention | Dense (MLA) | Sparse (DSA) + lightning indexer
    Context | 128K | 128K
    Coding (Codeforces) | ~2046 | ~2121
    API input (cache hit) | ~$0.07/M historically | $0.028/M (as of 2025-09-30)
    Fit | General-purpose | Long-context, cost-sensitive

    How to try it today (quickstart)

    1. Check pricing & quotas in the API docs and enable caching in your stack.
    2. Spin up with SGLang or vLLM for immediate support and KV-cache wins.
    3. Start with your longest prompt path (system + RAG chunks). Measure cache-hit rate and cost per output (see the sketch after this list).
    4. Tune retrieval windows (don’t flood the context if it’s not helping).
    5. Watch memory: 128K is friendly, not free—monitor VRAM/CPU RAM in production.
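
    Here is a minimal sketch of steps 2–3, assuming DeepSeek's OpenAI-compatible endpoint (https://api.deepseek.com) and the openai Python client. The model ID and the shape of the usage object can change, so verify both against the current API reference.

    import os
    from openai import OpenAI

    # DeepSeek serves an OpenAI-compatible API; "deepseek-chat" points at the
    # current chat model (verify the ID in the docs before relying on it).
    client = OpenAI(
        api_key=os.environ["DEEPSEEK_API_KEY"],
        base_url="https://api.deepseek.com",
    )

    retrieved_chunks = ["...chunk 1...", "...chunk 2..."]  # placeholder retriever output

    # Keep the long, stable content (system prompt + retrieved chunks) at the
    # front so repeated calls share a prefix and can hit the cache.
    system_prompt = "You are a support assistant.\n\n" + "\n".join(retrieved_chunks)

    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Summarize the retrieved context."},
        ],
    )

    print(response.choices[0].message.content)
    # Log the usage object per request: DeepSeek reports cache hit/miss token
    # counts there (field names may change; inspect it rather than hard-coding).
    print(response.usage)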

    Limitations & considerations

    • Experimental: This is an intermediate step; expect updates.
    • Caching matters: Your real-world cost depends on hit rate.
    • Framework parity: vLLM/SGLang support is good, but pinned versions help.
    • Numbers move: Pricing and context limits can change; always reconfirm before committing budgets.

    Frequently Asked Questions

    Is V3.2-Exp open source?
    Weights and repos are available on Hugging Face and GitHub; deployment terms are permissive for builders.

    Does it truly keep quality?
    Mostly similar to V3.1 across public suites; coding shows a small bump.

    What’s the context limit?
    128K tokens, which is plenty for most RAG and doc workflows.

    What if my prompts are short?
    You may not see headline savings; test with your real traffic.

    Checklist

    • Confirm latest pricing and enable caching.
    • Route long-context paths to V3.2-Exp first (see the routing sketch after this list).
    • Measure cache-hit rate, latency, and cost per output in GA4 or your observability stack.
    • Keep a rollback to V3.1 for tasks where it still wins.
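
    A hypothetical routing sketch for the checklist above; the model IDs, task tags, and the 8K threshold are all placeholders to adapt to whatever your deployment exposes and your own measurements show.

    V32_EXP = "deepseek-v3.2-exp"              # placeholder IDs; use your real ones
    V31_FALLBACK = "deepseek-v3.1-terminus"
    ROLLBACK_TASKS = {"hard-reasoning-suite"}  # populate from your own evals

    def pick_model(prompt_tokens: int, task: str) -> str:
        """Route long-context traffic to V3.2-Exp; keep a V3.1 rollback path."""
        if task in ROLLBACK_TASKS:             # tasks where V3.1 measurably wins
            return V31_FALLBACK
        # Long prompts benefit most from sparse attention and prefix caching.
        return V32_EXP if prompt_tokens >= 8_000 else V31_FALLBACK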

    Glossary

    Sparse attention: An efficiency trick that preserves only the most useful attention links.
    Cache hit/miss: Re-using prior compute on the same prompt chunk (hit) vs recomputing (miss).
    KV cache: Stored keys/values from attention layers used to speed up repeated context.
    128K context: Roughly a book-length window the model can “see” at once.

    What is DeepSeek Sparse Attention?

    A pruning method that scores tokens and keeps only the most relevant attention links. It cuts memory and compute on long prompts while keeping output quality close to dense attention, enabling 128K context at lower cost.

    How much does DeepSeek cost now?

    As of Sept 29–30, 2025: input (cache hit) $0.028/M tokens; input (cache miss) $0.28/M; output $0.42/M. Always check the pricing page before deploying.
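
    As a worked example under those prices (reconfirm before budgeting): a single request with 100K input tokens and 2K output tokens costs well under a cent fully cached, and under three cents fully uncached.

    IN_HIT, IN_MISS, OUT = 0.028, 0.28, 0.42   # USD per 1M tokens (verify first)
    tokens_in, tokens_out = 100_000, 2_000

    cached = tokens_in / 1e6 * IN_HIT + tokens_out / 1e6 * OUT
    uncached = tokens_in / 1e6 * IN_MISS + tokens_out / 1e6 * OUT
    print(f"fully cached:   ${cached:.5f}")    # $0.00364
    print(f"fully uncached: ${uncached:.5f}")  # $0.02884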

    Is V3.2-Exp faster for long prompts?

    Yes; sparse attention reduces work as context grows. You’ll see the biggest gains on long prompts that repeat content (RAG, memory), especially with a high cache-hit rate.
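
    Rough arithmetic shows why. Dense attention at 128K context does n² comparisons, while top-k sparse attention does n·k; the k below is purely illustrative, since DeepSeek has not published DSA as a single fixed k.

    n, k = 128_000, 2_048   # context length; k is an illustrative link budget
    dense = n * n           # every token attends to every token
    sparse = n * k          # each token keeps only k selected links
    print(f"dense: {dense:.2e}  sparse: {sparse:.2e}  ratio: {dense / sparse:.0f}x")
    # dense: 1.64e+10  sparse: 2.62e+08  ratio: 62x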

    Mohammad Kashif
    Covers smartphones, AI, and emerging tech, explaining how new features affect daily life. Reviews focus on battery life, camera behavior, update policies, and long-term value to help readers choose the right gadgets and software.
