Quick Brief
- Grok TTS API (Beta) is priced at $4.20 per 1 million characters, with 100 concurrent requests per team
- Five distinct voices (eve, ara, rex, sal, leo) cover use cases from customer support to authoritative narration
- Inline speech tags let developers control pauses, laughter, whispers, and pitch without extra tooling
- Supports 20 languages including Hindi and Bengali, with automatic language detection via the auto code
xAI just made voice a first-class API feature, and it changes what developers can build in a single afternoon. The Grok TTS API delivers expressive, human-like speech with fine-grained delivery control through inline tags, multiple output formats, and a pricing model accessible to independent developers and enterprises alike. This guide covers every verified capability and exactly how to integrate it.
What the Grok TTS API Actually Delivers
The API converts text into spoken audio with a single POST request to https://api.x.ai/v1/tts. No pipeline stitching, no separate prosody model. You send text, a voice ID, and a language code, and you receive raw audio bytes directly.
Default output is MP3 at 24 kHz and 128 kbps, which handles most web use cases cleanly. For higher fidelity, the API supports up to 48 kHz at 192 kbps. For telephony, G.711 μ-law at 8 kHz is natively supported.
5 Voices and When to Use Each
Each voice has a defined personality rather than a generic label. Choosing the right one can affect user trust and completion rates in conversational apps.
| Voice | Tone | Best Use Case |
|---|---|---|
| eve | Energetic, upbeat | Demos, announcements, consumer apps |
| ara | Warm, friendly | Customer support, onboarding flows |
| rex | Confident, clear | Business tools, corporate communications |
| sal | Smooth, balanced | Podcasts, general narration |
| leo | Authoritative, strong | Educational content, instructions |
Voice IDs are case-insensitive: eve, Eve, and EVE all work. You can also list available voices programmatically via the GET /v1/tts/voices endpoint without hardcoding them.
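The table and the case-insensitivity rule can be mirrored in a small local lookup for validating input before a request is sent. This is a minimal sketch: `VOICES` and `normalize_voice` are hypothetical names introduced for illustration, and the authoritative list remains whatever GET /v1/tts/voices returns.

```python
# Hypothetical local copy of the voice table; the live list comes from
# GET /v1/tts/voices and may differ from this snapshot.
VOICES = {
    "eve": "Energetic, upbeat",
    "ara": "Warm, friendly",
    "rex": "Confident, clear",
    "sal": "Smooth, balanced",
    "leo": "Authoritative, strong",
}


def normalize_voice(voice_id: str) -> str:
    """Lower-case a voice ID and check it against the five voices above."""
    key = voice_id.strip().lower()
    if key not in VOICES:
        raise ValueError(f"unknown voice: {voice_id!r}")
    return key
```

For example, `normalize_voice("EVE")` returns `"eve"`, while an unlisted name raises `ValueError` before any request is made.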
Speech Tags: The Feature That Sets Grok TTS Apart
Many TTS APIs offer plain text-in, audio-out with little delivery control. Grok TTS lets you embed expressive instructions directly in the text string.
Two tag types exist:
- Inline tags like [pause], [laugh], [long-pause], [chuckle], [sigh], and [breath] fire a vocal expression at the exact position in the sentence
- Wrapping tags like <whisper>text</whisper>, <slow>text</slow>, and <soft>text</soft> change delivery style across an entire phrase
A practical example from xAI’s own documentation: "So I walked in and [pause] there it was. [laugh] I honestly could not believe it!" produces dramatically more natural audio than flat synthesis of the same text. For interactive storytelling apps or conversational interfaces, this eliminates the need for a separate expressive TTS layer.
Combining tags creates layered effects. <slow><soft>Goodnight, sleep well.</soft></slow> renders calm, measured narration without any additional configuration.
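Because tags are plain text, they can be composed with ordinary string helpers. The sketch below is illustrative only: `inline` and `wrap` are hypothetical helpers, and the two sets contain just the tags named in this article, not necessarily every tag the API accepts.

```python
# Tags named in this article; the API may accept others.
INLINE_TAGS = {"pause", "long-pause", "laugh", "chuckle", "sigh", "breath"}
WRAPPING_TAGS = {"whisper", "slow", "soft"}


def inline(tag: str) -> str:
    """Return an inline tag such as [pause]."""
    if tag not in INLINE_TAGS:
        raise ValueError(f"unknown inline tag: {tag!r}")
    return f"[{tag}]"


def wrap(tag: str, text: str) -> str:
    """Wrap text in a delivery-style tag such as <soft>...</soft>."""
    if tag not in WRAPPING_TAGS:
        raise ValueError(f"unknown wrapping tag: {tag!r}")
    return f"<{tag}>{text}</{tag}>"


# Nesting wrap() calls reproduces the layered example from the text.
goodnight = wrap("slow", wrap("soft", "Goodnight, sleep well."))
```

Here `goodnight` holds `<slow><soft>Goodnight, sleep well.</soft></slow>`, which is sent as the ordinary text value with no extra parameters.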
Output Formats and Audio Quality Control
The API supports five codecs suited to different deployment targets:
- MP3 at up to 192 kbps for web and mobile playback
- WAV for lossless post-production workflows
- PCM for raw audio pipelines and real-time processing
- μ-law for telephony at 8 kHz (G.711)
- A-law for telephony at 8 kHz (G.711)
Sample rates span 8 kHz narrowband to 48 kHz studio-grade. This range means one API handles everything from an IVR phone line to a high-fidelity narration pipeline without codec conversion.
Note: Raw codecs (PCM, μ-law, A-law) are not directly playable in the browser. Use MP3 or WAV for browser-based playback applications.
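One way to keep codec choices consistent across a codebase is a small preset table keyed by deployment target. This is a sketch under assumptions: `AudioFormat`, `PRESETS`, and `format_for` are hypothetical names, and the request fields that carry these values to the API are not shown because they are not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioFormat:
    codec: str
    sample_rate_hz: int
    browser_playable: bool  # raw codecs are not playable in the browser


# Presets follow the guidance in this article: MP3 at 24 kHz for web,
# G.711 mu-law at 8 kHz for telephony, WAV at 44.1 kHz for post-production.
PRESETS = {
    "web": AudioFormat("mp3", 24_000, True),
    "telephony": AudioFormat("mulaw", 8_000, False),
    "post_production": AudioFormat("wav", 44_100, True),
}


def format_for(target: str) -> AudioFormat:
    """Look up the preset for a deployment target."""
    try:
        return PRESETS[target]
    except KeyError:
        raise ValueError(f"unknown target: {target!r}") from None
```

Keeping `browser_playable` on the preset makes it easy to reject a telephony format before it reaches an `<audio>` element.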
Streaming TTS via WebSocket
For applications where audio must start before the full text is ready, the bidirectional WebSocket endpoint at wss://api.x.ai/v1/tts streams audio back as base64-encoded chunks in real time. There is no total character limit over WebSocket, only a 15,000-character cap per individual text.delta message.
The connection stays open after audio.done, enabling multi-turn sessions without reconnect overhead. This is essential for voice assistants where each response builds on the previous one. The WebSocket endpoint caps at 50 concurrent sessions per team, and each session has a Session Permit TTL of 600 seconds.
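The per-message cap and the base64 encoding translate into two small helpers on the client side. This is a hedged sketch: the JSON field names `type` and `delta` are assumptions about the message shape, and `text_delta_messages` and `decode_audio_chunk` are hypothetical names. Only the `text.delta` event name, the 15,000-character cap, and the base64 encoding come from the description above.

```python
import base64
import json

MAX_DELTA_CHARS = 15_000  # cap per individual text.delta message


def text_delta_messages(text: str, max_chars: int = MAX_DELTA_CHARS):
    """Yield JSON strings that each carry at most max_chars characters."""
    for start in range(0, len(text), max_chars):
        chunk = text[start:start + max_chars]
        # Assumed message shape; check the official docs for the real fields.
        yield json.dumps({"type": "text.delta", "delta": chunk})


def decode_audio_chunk(b64_chunk: str) -> bytes:
    """Turn one base64-encoded audio chunk back into raw bytes."""
    return base64.b64decode(b64_chunk)
```

Note that `len()` counts Unicode code points; whether the API counts characters the same way is not stated, so leaving some headroom below the cap is a reasonable precaution.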
For simpler batch use cases, the standard POST endpoint handles up to 15,000 characters per request with a 15-minute timeout.
Pricing: What You Actually Pay
At $4.20 per 1 million characters during Beta, a maximum-length POST request of 15,000 characters costs about $0.063. The POST endpoint allows 100 concurrent requests per team, which suits most production workloads.
xAI explicitly states on the pricing page: “Pricing and rate limits may change when the API becomes generally available.” Factor this into cost planning for long-running integrations.
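For cost planning, the per-character rate reduces to simple arithmetic. The helper below is an illustrative sketch using the Beta price quoted in this article; `estimate_cost_usd` is a hypothetical name, and the constant should be updated if pricing changes at general availability.

```python
from decimal import Decimal

# Beta price quoted in this article; subject to change at GA.
PRICE_PER_MILLION_CHARS = Decimal("4.20")


def estimate_cost_usd(characters: int) -> Decimal:
    """Estimate the cost in USD of synthesizing the given character count."""
    if characters < 0:
        raise ValueError("characters must be non-negative")
    return PRICE_PER_MILLION_CHARS * characters / Decimal(1_000_000)
```

As a worked example, 1,000 maximum-length requests of 15,000 characters each come to 15 million characters, or $63.00 at the Beta rate.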
Limits Reference
| Property | POST Endpoint | WebSocket Endpoint |
|---|---|---|
| Max text length | 15,000 characters | No limit (15,000 per delta) |
| Request timeout | 15 minutes | No timeout |
| Concurrent sessions | 100 per team | 50 per team |
| Session TTL | N/A | 600 seconds |
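Text longer than the POST limit has to be split before sending. The following sketch shows one possible approach, breaking at whitespace where it can and falling back to a hard cut otherwise; `split_for_post` is a hypothetical helper, not part of any xAI SDK.

```python
MAX_POST_CHARS = 15_000  # POST endpoint limit from the table above


def split_for_post(text: str, limit: int = MAX_POST_CHARS) -> list:
    """Split text into pieces of at most `limit` characters.

    Prefers to break at a space so words stay intact; if a run of
    `limit` characters contains no space, it is cut at the limit.
    """
    pieces = []
    remaining = text.strip()
    while len(remaining) > limit:
        cut = remaining.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no usable space: hard cut
        pieces.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        pieces.append(remaining)
    return pieces
```

Splitting at sentence boundaries would likely sound more natural than splitting at arbitrary spaces, since each piece is synthesized independently; this sketch keeps the logic minimal.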
Language Support: India-Relevant Coverage
The API covers 20 languages with BCP-47 codes and includes Hindi (hi) and Bengali (bn) natively. Arabic dialects for Egypt (ar-EG), Saudi Arabia (ar-SA), and UAE (ar-AE) are distinct entries, not a single generalized model. Chinese (Simplified), Japanese, Korean, and Vietnamese round out the Asia-Pacific coverage.
The auto language detection option handles mixed-language input without forcing developers to identify language server-side first. For Indian apps where Hinglish content is common, this reduces preprocessing overhead significantly.
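A client can fall back to detection whenever it has no trusted language code. The sketch below assumes the literal value `auto` selects detection, as described above; `KNOWN_CODES` lists only the codes named in this article, and `language_param` is a hypothetical helper.

```python
from typing import Optional

# Only the codes named in this article; the full list of 20 is in
# xAI's documentation.
KNOWN_CODES = {"hi", "bn", "ar-EG", "ar-SA", "ar-AE"}


def language_param(code: Optional[str] = None) -> str:
    """Return a recognized BCP-47 code, or "auto" to request detection."""
    if not code:
        return "auto"
    lang, _, region = code.strip().partition("-")
    # BCP-47 convention: lower-case language, upper-case region.
    normalized = lang.lower() + (f"-{region.upper()}" if region else "")
    return normalized if normalized in KNOWN_CODES else "auto"
```

Falling back to `auto` for unrecognized codes is a design choice for this sketch; a stricter client might raise an error instead.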
xAI’s documentation also notes the model can generate speech in additional languages beyond the listed 20, with varying degrees of accuracy.
Integration: What It Takes to Start
A working TTS call requires three parameters: text, voice_id, and language. The API key goes in the Authorization header. The response body contains raw audio bytes written directly to a file or piped to a player.
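A minimal call using only the Python standard library might look like the sketch below. Two details are assumptions rather than facts from this article: that the three parameters are sent as a JSON body, and that the Authorization header uses the Bearer scheme. The function names are hypothetical.

```python
import json
import urllib.request

TTS_URL = "https://api.x.ai/v1/tts"


def build_tts_request(text: str, voice_id: str, language: str,
                      api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the TTS endpoint."""
    body = json.dumps(
        {"text": text, "voice_id": voice_id, "language": language}
    ).encode("utf-8")
    return urllib.request.Request(
        TTS_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed scheme
            "Content-Type": "application/json",    # assumed body format
        },
    )


def synthesize_to_file(request: urllib.request.Request, path: str,
                       timeout: float = 900.0) -> None:
    """Send the request and write the raw audio bytes to a file."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(path, "wb") as out:
        out.write(audio)
```

Calling `synthesize_to_file(build_tts_request("Hello", "eve", "auto", key), "hello.mp3")` would write the default MP3 output to disk. The 900-second timeout mirrors the 15-minute request limit.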
For browser-based apps, the official docs explicitly state never to call the API directly from client-side JavaScript, because doing so exposes the API key. Route requests through a backend proxy, then use the Web Audio API or an <audio> element to play the returned blob. Safari requires the AudioContext to be created inside a user gesture handler; otherwise the context stays suspended and no audio plays.
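The backend proxy can be very small. The sketch below uses Python's standard `http.server` and keeps the upstream call behind an injected `synthesize` callable, so the API key never leaves the server. All names (`handle_tts`, `make_handler`, `synthesize`) are hypothetical, the default voice and language are arbitrary choices, and the `audio/mpeg` content type assumes the default MP3 output.

```python
import json
from http.server import BaseHTTPRequestHandler
from typing import Callable, Tuple


def handle_tts(body: bytes,
               synthesize: Callable[[dict], bytes]) -> Tuple[int, str, bytes]:
    """Validate a browser request; return (status, content type, payload)."""
    try:
        payload = json.loads(body or b"{}")
        text = payload["text"]
    except (ValueError, KeyError, TypeError):
        return 400, "text/plain", b"expected JSON with a 'text' field"
    audio = synthesize({
        "text": text,
        "voice_id": payload.get("voice_id", "eve"),   # arbitrary default
        "language": payload.get("language", "auto"),
    })
    return 200, "audio/mpeg", audio


def make_handler(synthesize: Callable[[dict], bytes]):
    """Create a request handler class bound to a server-side synthesizer."""
    class ProxyHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length") or 0)
            status, content_type, payload = handle_tts(
                self.rfile.read(length), synthesize)
            self.send_response(status)
            self.send_header("Content-Type", content_type)
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)
    return ProxyHandler
```

The handler class can be passed to `http.server.HTTPServer` with a `synthesize` function that performs the authenticated upstream call. A production proxy would also need authentication, rate limiting, and input size checks, none of which are shown here.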
Limitations Worth Knowing
The WebSocket Session Permit TTL of 600 seconds constrains very long uninterrupted sessions. Beta status means pricing and rate limits carry no GA guarantees. Speech tag behavior in non-English languages is not explicitly documented, so expressive controls may vary in accuracy outside English. The concurrent WebSocket session cap of 50 per team requires connection pooling strategies for high-traffic voice applications.
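One simple way to stay under the concurrent session cap is to gate session creation behind a semaphore. The sketch below illustrates the idea with `asyncio`; `SessionLimiter` is a hypothetical class, and `job` stands in for whatever coroutine opens and uses a WebSocket session.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class SessionLimiter:
    """Allow at most max_sessions jobs to run at the same time.

    Create the limiter inside a running event loop so the semaphore
    binds to that loop on older Python versions.
    """

    def __init__(self, max_sessions: int = 50):
        self._semaphore = asyncio.Semaphore(max_sessions)
        self.active = 0
        self.peak = 0  # highest concurrency observed, for monitoring

    async def run(self, job: Callable[[], Awaitable[T]]) -> T:
        """Wait for a free slot, then run the job and release the slot."""
        async with self._semaphore:
            self.active += 1
            self.peak = max(self.peak, self.active)
            try:
                return await job()
            finally:
                self.active -= 1
```

Callers beyond the cap simply wait for a slot instead of failing. This limits sessions within one process only; multiple server instances sharing a team quota would need coordination that this sketch does not attempt.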
Frequently Asked Questions (FAQs)
What is the Grok Text to Speech API?
The Grok TTS API is xAI’s developer service that converts text into spoken audio via a single API call. It supports 5 voices, 20 languages, expressive speech tags, and multiple audio codecs including MP3, WAV, PCM, and telephony formats. It is currently in Beta.
How much does the Grok TTS API cost?
The API is priced at $4.20 per 1 million characters during Beta. Each team gets 100 concurrent requests on the POST endpoint and 50 concurrent sessions on the WebSocket streaming endpoint. Pricing and rate limits may change when the API reaches general availability.
What voices does Grok TTS support?
Five voices are available: eve (energetic and upbeat), ara (warm and friendly), rex (confident and professional), sal (smooth and versatile), and leo (authoritative and strong). Each is optimized for specific content types, from customer support to educational narration.
Does Grok TTS support Hindi or other Indian languages?
Yes. Hindi (hi) and Bengali (bn) are explicitly supported alongside 18 other languages. The auto language detection mode handles mixed-language input without requiring server-side language identification. The model may also handle additional languages beyond the listed 20, with varying accuracy.
Can I stream audio in real time with Grok TTS?
Yes. The WebSocket endpoint at wss://api.x.ai/v1/tts streams audio back as base64-encoded chunks as text is being sent. The connection supports multi-turn sessions without reconnecting, subject to a Session Permit TTL of 600 seconds per session.
What are speech tags in Grok TTS?
Speech tags are inline instructions embedded directly in the text that control vocal delivery. Inline tags like [laugh] or [pause] fire a single expression at that position. Wrapping tags like <whisper>text</whisper> change the delivery style of an entire phrase. Both types require no extra API parameters.
Is Grok TTS suitable for telephony applications?
Yes. The API natively outputs G.711 μ-law and A-law codecs at 8 kHz, which are standard formats for telephony systems. Match the codec to the use case: mulaw or alaw at 8 kHz for telephony, mp3 at 24 kHz for web, and wav at 44.1 kHz or higher for post-production.
What is the maximum text length per request?
The standard POST endpoint accepts up to 15,000 characters per request with a 15-minute timeout. For longer content, the WebSocket endpoint has no total character limit, with individual delta messages capped at 15,000 characters each.

