Amazon announced Nova 2 Sonic on December 2, 2025, a speech-to-speech model designed for natural, real-time conversational AI. Unlike traditional voice assistants that chain speech-to-text and text-to-speech steps, Nova 2 Sonic processes audio directly, delivering responses in under 700 milliseconds with what AWS describes as industry-leading accuracy. The model now supports Portuguese and Hindi, features polyglot voices that speak multiple languages natively, and includes a one-million token context window for sustained conversations. Developers can integrate Nova 2 Sonic through Amazon Bedrock’s bidirectional streaming API, with pricing approximately 80% lower than OpenAI’s GPT-4o Realtime.
Amazon Web Services just launched Nova 2 Sonic, a speech-to-speech model that processes voice conversations entirely in the audio domain with no text conversion required. Announced at AWS re:Invent 2025 on December 2, this upgrade brings polyglot voices, expanded language support, and a massive one-million token context window. For developers building voice assistants, customer support bots, or interactive AI applications, Nova 2 Sonic promises faster responses and lower costs than competing models from OpenAI and Google.
What Is Amazon Nova 2 Sonic?
Amazon Nova 2 Sonic is AWS’s second-generation speech-to-speech foundation model that enables real-time, human-like voice conversations through Amazon Bedrock. Unlike traditional voice AI systems that convert speech to text, process the text with a language model, then convert back to speech, Nova 2 Sonic handles the entire conversation in the audio domain. This unified architecture preserves acoustic features like tone, emotion, and speaking style from the input audio, resulting in responses that adapt to the user’s sentiment and energy level.
The model delivers best-in-class streaming speech understanding with robustness to background noise, diverse accents, and speaking styles. It supports efficient dialog handling with natural turn-taking, including the ability to detect user interruptions and non-verbal cues like laughter, hesitations, and pauses. AWS claims Nova 2 Sonic offers superior reasoning, instruction following, and tool invocation accuracy compared to the original Nova Sonic model released earlier in 2025.
Amazon Nova 2 Sonic is a speech-to-speech AI model that processes voice conversations entirely in the audio domain without text conversion. It delivers real-time responses in under 700ms with polyglot voices, one-million token context, and native support for nine languages including Portuguese and Hindi.
Key Features and Technical Capabilities
Speech-to-Speech Architecture
Nova 2 Sonic’s unified speech-to-speech architecture eliminates the traditional pipeline delay caused by multiple conversion steps. Independent testing shows the model responds in just over one second on average, faster than both OpenAI’s GPT-4o and Google’s Gemini Flash 2.0. Developers running their own real-world tests report response latencies under 700 milliseconds, approaching true real-time conversation territory.
The model preserves acoustic features from input audio, meaning if you speak with excitement, the AI response matches your enthusiasm. This emotional adaptation happens automatically without requiring explicit instructions, creating more natural-feeling conversations than robotic text-to-speech alternatives.
Polyglot Voice Support
One of Nova 2 Sonic’s breakthrough features is polyglot voices: a single voice that can speak multiple languages with native expressivity. The model now supports nine languages: English (American and British accents), Spanish, French, Italian, German, Portuguese, and Hindi. This expansion from the original five-language support makes Nova 2 Sonic more accessible for global applications.
Developers can choose between masculine-sounding and feminine-sounding expressive voices, with the ability to adjust tone, pace, and style for specific use cases. The model demonstrates cultural awareness by adapting responses based on linguistic and cultural contexts.
One-Million Token Context Window
Nova 2 Sonic includes a one-million token context window, a substantial increase that enables sustained interactions without losing conversation history. For reference, one million tokens can handle approximately 750,000 words, hours of audio, or hundreds of pages of documentation. This massive context capacity allows the model to maintain coherent conversations across complex, multi-turn dialogues without requiring developers to manually manage conversation state.
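As a rough sanity check on those figures, the word estimate corresponds to a rule of thumb of about 0.75 words per token. The sketch below uses that ratio to estimate how much of the window a conversation has consumed. The ratio is an approximation implied by the numbers above, not a tokenizer specification, and the function name is illustrative.

```python
# Rule-of-thumb ratio implied by "1M tokens ≈ 750,000 words".
# This is an approximation, not a property of any specific tokenizer.
WORDS_PER_TOKEN = 0.75
CONTEXT_WINDOW_TOKENS = 1_000_000


def tokens_remaining(words_so_far: int, window: int = CONTEXT_WINDOW_TOKENS) -> int:
    """Estimate how many context tokens are left after `words_so_far` words."""
    tokens_used = round(words_so_far / WORDS_PER_TOKEN)
    return max(window - tokens_used, 0)


if __name__ == "__main__":
    # Words that fit in the full window under the assumed ratio.
    print(round(CONTEXT_WINDOW_TOKENS * WORDS_PER_TOKEN))
    # Tokens left after a 30,000-word conversation.
    print(tokens_remaining(30_000))
```

Spoken audio is tokenized differently from written text, so an estimate like this is only a planning aid for the text side of a session.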
Cross-Modal Interaction
The model supports seamless switching between voice and text in the same session, giving users flexibility to type when speaking isn’t convenient. Nova 2 Sonic also introduces asynchronous tool calling, which allows the model to perform multi-step tasks and invoke external tools without interrupting conversation flow. This capability is critical for building practical voice assistants that need to look up information, make calculations, or interact with other systems while maintaining natural dialogue.
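The idea behind asynchronous tool calling can be illustrated with a minimal `asyncio` sketch: the tool runs in the background while the dialogue loop keeps going, and the result is spoken once it arrives. This is a generic illustration of the pattern, not the Bedrock API; `lookup_order_status` and `converse` are hypothetical names, and the sleep stands in for a real network call.

```python
import asyncio


async def lookup_order_status(order_id: str) -> str:
    """Hypothetical slow tool; the sleep stands in for a network call."""
    await asyncio.sleep(0.05)
    return f"Order {order_id} has shipped"


async def converse(order_id: str) -> list[str]:
    """Start a tool call, keep the dialogue going, then report the result."""
    transcript = []
    # Launch the tool in the background instead of awaiting it immediately.
    tool_task = asyncio.create_task(lookup_order_status(order_id))
    transcript.append("assistant: Let me check that for you.")
    # The dialogue continues while the tool call is still pending.
    while not tool_task.done():
        transcript.append("assistant: (keeps the conversation going)")
        await asyncio.sleep(0.02)
    transcript.append(f"assistant: {tool_task.result()}")
    return transcript


if __name__ == "__main__":
    for line in asyncio.run(converse("A123")):
        print(line)
```

A blocking design would await the tool before saying anything, leaving dead air on the line; scheduling it as a task is what keeps the conversation flowing.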
What’s New in Nova 2 Sonic vs Original Nova Sonic
Nova 2 Sonic builds on the foundation of the original Nova Sonic model launched earlier in 2025, adding several significant upgrades:
- Expanded language support: Added Portuguese and Hindi to the original five languages
- Polyglot voices: Same voice can now speak different languages with native expressivity
- Turn-taking controllability: Developers can set low, medium, or high pause sensitivity to customize when the model responds
- Cross-modal interaction: Users can switch between voice and text in the same conversation
- Asynchronous tool calling: Support for multi-step tasks without breaking conversation flow
- One-million token context: Massive expansion from the previous context limit
- Enhanced reasoning: Superior instruction following and tool invocation accuracy
Amazon Nova 2 Sonic vs Competitors
Nova 2 Sonic vs OpenAI GPT-4o Realtime
According to testing by research firm Artificial Analysis, Amazon Nova 2 Sonic responds faster than OpenAI’s GPT-4o Realtime voice model. More significantly, AWS claims Nova 2 Sonic costs nearly 80% less than GPT-4o for real-time voice interactions. While GPT-4o Realtime offers robust multimodal capabilities with text, audio, and vision inputs, Nova 2 Sonic focuses specifically on optimizing the speech-to-speech experience.
OpenAI’s model uses a WebSocket or WebRTC interface with a 32,000 token context window and 4,096 max output tokens. In contrast, Nova 2 Sonic’s one-million token context window provides substantially more conversation memory. OpenAI released its Realtime API in public beta in October 2024 and it has since gained widespread adoption, but Nova 2 Sonic’s pricing advantage could shift developer preferences for voice-heavy applications.
Nova 2 Sonic vs Google Gemini 2.5 Flash
Google’s Gemini 2.5 Flash with native audio offers impressive voice quality with 30 HD voices across 24 languages, giving it more language coverage than Nova 2 Sonic’s nine languages. Gemini 2.5 Flash includes advanced features like “Proactive Audio” (responds only when relevant) and “Affective Dialog” (understands emotional expressions). The model also supports multi-speaker dialogue generation, creating two-person “NotebookLM-style” audio overviews from text.
However, Amazon’s integration advantages through AWS infrastructure and services like Amazon Connect give Nova 2 Sonic an edge for enterprise deployments already using AWS. Independent speed tests show Nova 2 Sonic outperforming Gemini Flash 2.0 on latency. Google’s model excels at multimodal generation tasks, while Nova 2 Sonic optimizes specifically for conversational speed and cost.
| Feature | Amazon Nova 2 Sonic | OpenAI GPT-4o Realtime | Google Gemini 2.5 Flash |
|---|---|---|---|
| Response Latency | <700ms | ~1 second | Variable |
| Context Window | 1M tokens | 32K tokens | Not specified |
| Languages Supported | 9 | Multiple | 24 |
| Voice Options | Masculine/Feminine | Multiple | 30 HD voices |
| Pricing Advantage | 80% cheaper than GPT-4o | Baseline | Not disclosed |
| Emotional Adaptation | Yes | Yes | Yes (Affective Dialog) |
| Cross-Modal (Voice+Text) | Yes | Yes | Yes |
Pricing and Cost Comparison
Amazon Nova 2 Sonic costs approximately $0.0034 per 1,000 input tokens and $0.0136 per 1,000 output tokens through Amazon Bedrock. For a voice assistant handling continuous conversations, this scales to roughly $7 per day for ten hours of active interaction. This represents nearly 80% cost savings compared to OpenAI’s GPT-4o Realtime API for equivalent voice interactions.
Pricing varies by AWS region, with availability currently in US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Stockholm). Developers can access the model through Amazon Bedrock’s on-demand pricing without upfront commitments, or optimize costs further with reserved capacity for predictable workloads. The pay-as-you-go model means you only pay for actual tokens processed, making it cost-effective for both prototyping and production deployments.
Amazon Nova 2 Sonic pricing is approximately $0.0034 per 1K input tokens and $0.0136 per 1K output tokens. This equals roughly $7 per day for 10 hours of conversation, representing 80% cost savings versus OpenAI’s GPT-4o Realtime.
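To see how the per-token prices turn into a daily figure, the arithmetic can be sketched as follows. The prices are the approximate ones quoted above; the hourly token volumes are hypothetical, chosen only so the total lands near the quoted $7, since the underlying usage rates are not stated.

```python
# Approximate per-1K-token prices quoted above (they vary by region).
INPUT_PRICE_PER_1K = 0.0034
OUTPUT_PRICE_PER_1K = 0.0136


def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the on-demand cost in USD for the given token usage."""
    input_cost = (input_tokens / 1000) * INPUT_PRICE_PER_1K
    output_cost = (output_tokens / 1000) * OUTPUT_PRICE_PER_1K
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical volumes: 100,000 input and 25,000 output tokens per hour.
    hourly = session_cost(100_000, 25_000)
    print(f"${hourly:.2f} per hour, ${hourly * 10:.2f} per ten-hour day")
```

Under these assumed volumes the estimate comes to $0.68 per hour, or $6.80 for ten hours. Output tokens cost four times as much as input tokens, so a talkative assistant raises the bill faster than a talkative user.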
Integration Options and Developer Access
Amazon Bedrock Bidirectional Streaming API
Developers integrate Nova 2 Sonic through Amazon Bedrock’s HTTP/2-based bidirectional streaming API, which enables low-latency, real-time audio communication. The API supports progressive rendering of responses as they’re generated, context maintenance across multiple conversation turns without resending previous information, and thoughtful handling of interruptions and corrections. This streaming architecture minimizes perceived latency by starting audio playback before the entire response completes generation.
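A simplified sketch of what one user turn might look like on the wire is shown below. It only builds the JSON payloads and does not call Bedrock. The event and field names are assumptions modelled on the event schema documented for the original Nova Sonic, heavily trimmed, and should be checked against the current Bedrock documentation; the chunk size and function name are hypothetical.

```python
import base64
import json
import uuid

CHUNK_BYTES = 1024  # hypothetical chunk size for streaming raw PCM audio


def build_events(pcm_audio: bytes) -> list[str]:
    """Return the JSON events for one user audio turn, in send order.

    Event and field names are assumed, not confirmed for Nova 2 Sonic.
    """
    prompt, content = str(uuid.uuid4()), str(uuid.uuid4())
    ids = {"promptName": prompt, "contentName": content}
    events = [
        {"sessionStart": {"inferenceConfiguration": {"maxTokens": 1024}}},
        {"promptStart": {"promptName": prompt}},
        {"contentStart": {**ids, "type": "AUDIO", "role": "USER"}},
    ]
    # Audio goes out in small chunks so the model can begin processing
    # before the user has finished speaking.
    for start in range(0, len(pcm_audio), CHUNK_BYTES):
        chunk = pcm_audio[start:start + CHUNK_BYTES]
        encoded = base64.b64encode(chunk).decode("ascii")
        events.append({"audioInput": {**ids, "content": encoded}})
    events += [
        {"contentEnd": ids},
        {"promptEnd": {"promptName": prompt}},
        {"sessionEnd": {}},
    ]
    return [json.dumps({"event": event}) for event in events]


if __name__ == "__main__":
    # 2,500 bytes of silence becomes three audioInput events.
    for raw in build_events(b"\x00" * 2500):
        print(next(iter(json.loads(raw)["event"])))
```

In a real client these events would be written to the open bidirectional stream as audio is captured, while a separate reader consumes the model's audio output events and plays them as they arrive.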
Telephony Provider Integration
Nova 2 Sonic seamlessly integrates with Amazon Connect for call center applications, plus leading third-party telephony providers including Vonage, Twilio, and AudioCodes. The model also works with open-source conversational AI frameworks like LiveKit and Pipecat, giving developers flexibility in choosing their infrastructure. This broad integration support means teams can add voice AI capabilities to existing communication systems without rebuilding their entire stack.
Available AWS Regions
As of December 2025, Amazon Nova 2 Sonic is available in four AWS regions: US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Stockholm). AWS typically expands regional availability over time based on demand, so additional regions may become available in 2026. Developers can access Nova 2 Sonic through the Amazon Bedrock console or programmatically via the AWS SDK.
Real-World Use Cases
Nova 2 Sonic targets several practical applications where real-time voice interaction creates business value:
- Customer support automation: Replace traditional IVR systems with natural voice assistants that understand context and handle complex queries without frustrating menu navigation
- Outbound marketing calls: Generate personalized voice campaigns with emotional adaptation that sounds human rather than robotic
- Voice-enabled personal assistants: Build AI companions that maintain conversation history across sessions and adapt to user speaking styles
- Interactive education: Create language learning applications where the AI tutor provides pronunciation feedback and cultural context
- Healthcare virtual assistants: Develop patient intake systems that handle medical terminology accurately while maintaining empathetic tone
- Smart home integration: Power voice interfaces for connected devices with low latency and background noise robustness
- Enterprise meeting assistants: Build voice-activated tools that take notes, summarize discussions, and answer questions during video calls
Pros and Cons
Pros:
- Industry-leading response latency under 700ms for real-time conversations
- 80% lower cost than OpenAI GPT-4o Realtime for equivalent workloads
- One-million token context window enables extended conversations
- Polyglot voices speak nine languages with native expressivity
- Emotional adaptation preserves and responds to user sentiment
- Seamless AWS integration for existing Bedrock and Connect users
- Cross-modal support allows mixing voice and text in same session
- Robust handling of background noise and diverse accents
Cons:
- Limited to four AWS regions at launch (December 2025)
- Fewer languages than Google Gemini 2.5 Flash (9 vs 24)
- Fewer voice options compared to competitors (masculine/feminine vs 30 HD voices)
- No standalone mobile SDK; requires AWS infrastructure
- Documentation and testing resources still building out post-launch
- Requires AWS account and Bedrock access for experimentation
Technical Specifications
| Specification | Details |
|---|---|
| Model Type | Speech-to-Speech Foundation Model |
| Architecture | Unified audio-domain processing (no text conversion) |
| Response Latency | <700ms (developer testing) |
| Context Window | 1 million tokens |
| Languages | English (US/UK), Spanish, French, Italian, German, Portuguese, Hindi |
| Voice Types | Masculine-sounding and feminine-sounding expressive voices |
| Input Support | Streaming audio + text (cross-modal) |
| Output Support | Streaming audio with adaptive prosody |
| API Protocol | HTTP/2 bidirectional streaming |
| Noise Robustness | Background noise filtering + accent adaptation |
| Turn-Taking | Configurable pause sensitivity (low/medium/high) |
| Tool Calling | Asynchronous multi-step task support |
| Pricing (Input) | ~$0.0034 per 1K tokens |
| Pricing (Output) | ~$0.0136 per 1K tokens |
| Availability | 4 AWS regions (US East/West, Tokyo, Stockholm) |
| Integration | Bedrock API, Amazon Connect, Vonage, Twilio, AudioCodes, LiveKit, Pipecat |
| Launch Date | December 2, 2025 |
Frequently Asked Questions (FAQs)
What is the difference between speech-to-speech and text-to-speech AI?
Speech-to-speech models like Nova 2 Sonic process audio directly without converting to text, preserving acoustic features like tone and emotion. Text-to-speech on its own only turns written text into audio; traditional voice systems built around it convert speech to text, process it with a language model, then convert the result back to audio, which adds latency and loses emotional context.
Can Nova 2 Sonic handle interruptions during conversations?
Yes, Nova 2 Sonic detects user interruptions and non-verbal cues like laughter, hesitations, and inter-sentential pauses to enable natural turn-taking. Developers can adjust turn-taking sensitivity to low, medium, or high based on their use case requirements.
How does Nova 2 Sonic compare to ChatGPT’s voice mode?
Nova 2 Sonic responds faster than OpenAI’s GPT-4o Realtime (under 700ms vs ~1 second) and costs approximately 80% less for equivalent voice interactions. Nova 2 Sonic also offers a larger one-million token context window compared to GPT-4o’s 32,000 tokens.
Does Nova 2 Sonic require AWS infrastructure to use?
Yes, Nova 2 Sonic is available exclusively through Amazon Bedrock, requiring an AWS account and Bedrock access. However, it integrates with third-party telephony providers like Twilio and Vonage, allowing some deployment flexibility.
What are polyglot voices in Nova 2 Sonic?
A polyglot voice is a single voice that can speak multiple languages with native expressivity and pronunciation. This means the same voice character can switch between English, Spanish, French, and other supported languages while maintaining natural-sounding delivery for each language.
Can Nova 2 Sonic be used for call center automation?
Yes, Nova 2 Sonic integrates directly with Amazon Connect for call center applications, plus third-party providers like Vonage, Twilio, and AudioCodes. It handles streaming speech recognition with background noise robustness and natural dialog flow suitable for customer support automation.
What is the maximum conversation length Nova 2 Sonic can handle?
With its one-million token context window, Nova 2 Sonic can maintain conversations equivalent to approximately 750,000 words or hours of audio. This enables sustained interactions without losing conversation history or requiring manual state management.
Does Nova 2 Sonic support tool calling and function invocation?
Yes, Nova 2 Sonic supports asynchronous tool calling, allowing it to invoke external functions and tools while maintaining conversation flow. The model shows superior tool invocation accuracy compared to the original Nova Sonic.
Featured Snippet Boxes
How much does Nova 2 Sonic cost?
Approximately $0.0034 per 1K input tokens and $0.0136 per 1K output tokens. This equals roughly $7 per day for 10 hours of conversation, representing 80% cost savings versus OpenAI’s GPT-4o Realtime.
What languages does Nova 2 Sonic support?
Nine languages: English (American and British accents), Spanish, French, Italian, German, Portuguese, and Hindi. The model features polyglot voices that can speak multiple languages with native expressivity using the same voice.
How fast is Nova 2 Sonic?
It responds in under 700 milliseconds in real-world testing. Average response times are just over one second, faster than both OpenAI GPT-4o and Google Gemini Flash 2.0, according to research firm Artificial Analysis.
Where is Nova 2 Sonic available?
In four AWS regions: US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Stockholm). Developers access it through Amazon Bedrock’s bidirectional streaming API.
What’s new in Nova 2 Sonic vs the original?
It adds Portuguese and Hindi language support, polyglot voices, turn-taking controllability, cross-modal interaction between voice and text, asynchronous tool calling, one-million token context window, and enhanced reasoning compared to the original Nova Sonic.
