    Amazon Nova 2 Sonic: AWS Launches Speech-to-Speech AI With Polyglot Voices and Million-Token Context

    Amazon announced Nova 2 Sonic on December 2, 2025: a speech-to-speech model designed for natural, real-time conversational AI. Unlike traditional voice assistants that convert speech to text and back to speech, Nova 2 Sonic processes audio directly, with developers reporting responses in under 700 milliseconds. The model now supports Portuguese and Hindi, features polyglot voices that speak multiple languages natively, and includes a one-million token context window for sustained conversations. Developers can integrate Nova 2 Sonic through Amazon Bedrock’s bidirectional streaming API, with pricing approximately 80% lower than OpenAI’s GPT-4o Realtime.

    Amazon Web Services just launched Nova 2 Sonic, a speech-to-speech model that processes voice conversations entirely in the audio domain with no text conversion required. Announced at AWS re:Invent 2025 on December 2, this upgrade brings polyglot voices, expanded language support, and a massive one-million token context window. For developers building voice assistants, customer support bots, or interactive AI applications, Nova 2 Sonic promises faster responses and lower costs than competing models from OpenAI and Google.

    What Is Amazon Nova 2 Sonic?

    Amazon Nova 2 Sonic is AWS’s second-generation speech-to-speech foundation model that enables real-time, human-like voice conversations through Amazon Bedrock. Unlike traditional voice AI systems that convert speech to text, process the text with a language model, then convert back to speech, Nova 2 Sonic handles the entire conversation in the audio domain. This unified architecture preserves acoustic features like tone, emotion, and speaking style from the input audio, resulting in responses that adapt to the user’s sentiment and energy level.

    The model delivers best-in-class streaming speech understanding with robustness to background noise, diverse accents, and speaking styles. It supports efficient dialog handling with natural turn-taking, including the ability to detect user interruptions and non-verbal cues like laughter, hesitations, and pauses. AWS claims Nova 2 Sonic offers superior reasoning, instruction following, and tool invocation accuracy compared to the original Nova Sonic model released earlier in 2025.

    Amazon Nova 2 Sonic is a speech-to-speech AI model that processes voice conversations entirely in the audio domain without text conversion. It delivers real-time responses in under 700ms with polyglot voices, one-million token context, and native support for nine languages including Portuguese and Hindi.

    Key Features and Technical Capabilities

    Speech-to-Speech Architecture

    Nova 2 Sonic’s unified speech-to-speech architecture eliminates the traditional pipeline delay caused by multiple conversion steps. Independent testing shows the model responds in just over one second on average, faster than both OpenAI’s GPT-4o and Google’s Gemini Flash 2.0. Real-world tests from developers report response latencies under 700 milliseconds, approaching true real-time conversation territory.

    The model preserves acoustic features from input audio, meaning if you speak with excitement, the AI response matches your enthusiasm. This emotional adaptation happens automatically without requiring explicit instructions, creating more natural-feeling conversations than robotic text-to-speech alternatives.

    Polyglot Voice Support

    One of Nova 2 Sonic’s breakthrough features is polyglot voices: a single voice that can speak multiple languages with native expressivity. The model now supports nine languages: English (American and British accents), Spanish, French, Italian, German, Portuguese, and Hindi. This expansion from the original five-language support makes Nova 2 Sonic more accessible for global applications.

    Developers can choose between masculine-sounding and feminine-sounding expressive voices, with the ability to adjust tone, pace, and style for specific use cases. The model demonstrates cultural awareness by adapting responses based on linguistic and cultural contexts.

    One-Million Token Context Window

    Nova 2 Sonic includes a one-million token context window, a substantial increase that enables sustained interactions without losing conversation history. For reference, one million tokens can handle approximately 750,000 words, hours of audio, or hundreds of pages of documentation. This massive context capacity allows the model to maintain coherent conversations across complex, multi-turn dialogues without requiring developers to manually manage conversation state.
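
    The 750,000-word figure corresponds to a rule of thumb of roughly 0.75 English words per token. The sketch below works through that arithmetic; the ratio is an assumption that applies to text, and audio is tokenized differently, so treat the results as order-of-magnitude estimates only. The function names are illustrative.

```python
# Rule-of-thumb ratio implied by the article: 750,000 words per 1,000,000 tokens.
# Assumption: this holds for English text only; audio tokenization differs.
WORDS_PER_TOKEN = 0.75
CONTEXT_WINDOW_TOKENS = 1_000_000


def words_that_fit(context_tokens: int = CONTEXT_WINDOW_TOKENS,
                   words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Approximate number of English words that fit in the context window."""
    return int(context_tokens * words_per_token)


def remaining_tokens(words_so_far: int,
                     context_tokens: int = CONTEXT_WINDOW_TOKENS,
                     words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Approximate tokens left after a transcript of `words_so_far` words."""
    used = round(words_so_far / words_per_token)
    return max(context_tokens - used, 0)


if __name__ == "__main__":
    print(words_that_fit())           # 750000
    print(remaining_tokens(150_000))  # 800000
```

    Under this assumption, a 150,000-word transcript would consume about a fifth of the window.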

    Cross-Modal Interaction

    The model supports seamless switching between voice and text in the same session, giving users flexibility to type when speaking isn’t convenient. Nova 2 Sonic also introduces asynchronous tool calling, which allows the model to perform multi-step tasks and invoke external tools without interrupting conversation flow. This capability is critical for building practical voice assistants that need to look up information, make calculations, or interact with other systems while maintaining natural dialogue.
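
    To illustrate the idea behind asynchronous tool calling, here is a minimal sketch using Python’s asyncio. It is not Nova 2 Sonic’s API: the tool, the order ID, and the dialogue lines are all hypothetical, and the short sleeps stand in for network and speaking time. The point is only that the tool runs in the background while the dialogue continues.

```python
import asyncio


async def lookup_order_status(order_id: str) -> str:
    """Hypothetical slow tool; the sleep stands in for a network call."""
    await asyncio.sleep(0.05)
    return f"Order {order_id} has shipped"


async def say(transcript: list[str], line: str) -> None:
    """Record one conversational turn; the sleep simulates speaking time."""
    await asyncio.sleep(0.01)
    transcript.append(line)


async def conversation() -> list[str]:
    """Start a tool call, keep talking, then report the result when it lands."""
    transcript: list[str] = []
    # Launch the tool in the background instead of awaiting it immediately.
    tool_task = asyncio.create_task(lookup_order_status("A123"))

    # The dialogue continues while the tool is still running.
    await say(transcript, "assistant: Let me check that for you.")
    await say(transcript, "user: Thanks. What are your opening hours?")
    await say(transcript, "assistant: We are open 9 to 5 on weekdays.")

    # Only now wait for the tool result and fold it into the conversation.
    result = await tool_task
    await say(transcript, f"assistant: {result}.")
    return transcript


if __name__ == "__main__":
    for line in asyncio.run(conversation()):
        print(line)
```

    A synchronous design would instead pause the conversation until the lookup returned, which is the interruption the feature is meant to avoid.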

    What’s New in Nova 2 Sonic vs Original Nova Sonic

    Nova 2 Sonic builds on the foundation of the original Nova Sonic model launched earlier in 2025, adding several significant upgrades:

    • Expanded language support: Added Portuguese and Hindi to the original five languages
    • Polyglot voices: Same voice can now speak different languages with native expressivity
    • Turn-taking controllability: Developers can set low, medium, or high pause sensitivity to customize when the model responds
    • Cross-modal interaction: Users can switch between voice and text in the same conversation
    • Asynchronous tool calling: Support for multi-step tasks without breaking conversation flow
    • One-million token context: Massive expansion from the previous context limit
    • Enhanced reasoning: Superior instruction following and tool invocation accuracy
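
    As a rough illustration of the turn-taking setting in the list above, an application might validate the three pause-sensitivity levels before sending them to the model. Everything in this sketch is hypothetical: the class names, the payload keys, and the string values are placeholders, not the Bedrock API, so check the official documentation for the real field names.

```python
from dataclasses import dataclass
from enum import Enum


class PauseSensitivity(Enum):
    """The three levels the article names; the string values are assumed."""
    LOW = "LOW"
    MEDIUM = "MEDIUM"
    HIGH = "HIGH"


@dataclass(frozen=True)
class TurnTakingConfig:
    """Hypothetical client-side settings object, not a Bedrock type."""
    pause_sensitivity: PauseSensitivity = PauseSensitivity.MEDIUM

    def to_payload(self) -> dict:
        # "turnTaking" and "pauseSensitivity" are placeholder key names.
        return {"turnTaking": {"pauseSensitivity": self.pause_sensitivity.value}}


def parse_sensitivity(raw: str) -> PauseSensitivity:
    """Validate user input such as 'high' before building a config."""
    try:
        return PauseSensitivity[raw.strip().upper()]
    except KeyError:
        raise ValueError(f"expected low, medium, or high, got {raw!r}") from None


if __name__ == "__main__":
    config = TurnTakingConfig(parse_sensitivity("high"))
    print(config.to_payload())
```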

    Amazon Nova 2 Sonic vs Competitors

    Nova 2 Sonic vs OpenAI GPT-4o Realtime

    According to testing by research firm Artificial Analysis, Amazon Nova 2 Sonic responds faster than OpenAI’s GPT-4o Realtime voice model. More significantly, AWS claims Nova 2 Sonic costs nearly 80% less than GPT-4o for real-time voice interactions. While GPT-4o Realtime offers robust multimodal capabilities with text, audio, and vision inputs, Nova 2 Sonic focuses specifically on optimizing the speech-to-speech experience.

    OpenAI’s model uses a WebSocket or WebRTC interface with a 32,000 token context window and 4,096 max output tokens. In contrast, Nova 2 Sonic’s one-million token context window provides substantially more conversation memory. OpenAI opened its Realtime API preview in October 2024 and it has since gained widespread adoption, but Nova 2 Sonic’s pricing advantage could shift developer preferences for voice-heavy applications.

    Nova 2 Sonic vs Google Gemini 2.5 Flash

    Google’s Gemini 2.5 Flash with native audio offers impressive voice quality with 30 HD voices across 24 languages, more language coverage than Nova 2 Sonic’s nine languages. Gemini 2.5 Flash includes advanced features like “Proactive Audio” (responds only when relevant) and “Affective Dialog” (understands emotional expressions). The model also supports multi-speaker dialogue generation, creating two-person “NotebookLM-style” audio overviews from text.

    However, Amazon’s integration advantages through AWS infrastructure and services like Amazon Connect give Nova 2 Sonic an edge for enterprise deployments already using AWS. Independent speed tests show Nova 2 Sonic outperforming Gemini Flash 2.0 on latency. Google’s model excels at multimodal generation tasks, while Nova 2 Sonic optimizes specifically for conversational speed and cost.

    Feature | Amazon Nova 2 Sonic | OpenAI GPT-4o Realtime | Google Gemini 2.5 Flash
    Response Latency | <700ms | ~1 second | Variable
    Context Window | 1M tokens | 32K tokens | Not specified
    Languages Supported | 9 | Multiple | 24
    Voice Options | Masculine/Feminine | Multiple | 30 HD voices
    Pricing Advantage | 80% cheaper than GPT-4o | Baseline | Not disclosed
    Emotional Adaptation | Yes | Yes | Yes (Affective Dialog)
    Cross-Modal (Voice+Text) | Yes | Yes | Yes

    Pricing and Cost Comparison

    Amazon Nova 2 Sonic costs approximately $0.0034 per 1,000 input tokens and $0.0136 per 1,000 output tokens through Amazon Bedrock. For a voice assistant handling continuous conversations, this scales to roughly $7 per day for ten hours of active interaction. This represents nearly 80% cost savings compared to OpenAI’s GPT-4o Realtime API for equivalent voice interactions.
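
    The per-token rates above can be turned into a quick estimate. The sketch below uses the two quoted prices; the daily token volumes in the example are hypothetical, since the article does not state what volume lies behind its $7 figure.

```python
INPUT_PRICE_PER_1K = 0.0034   # USD per 1,000 input tokens, as quoted in the article
OUTPUT_PRICE_PER_1K = 0.0136  # USD per 1,000 output tokens, as quoted in the article


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for the given token counts."""
    cost = ((input_tokens / 1000) * INPUT_PRICE_PER_1K
            + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K)
    return round(cost, 4)


if __name__ == "__main__":
    # Hypothetical daily volume: 1,000,000 input and 250,000 output tokens.
    print(estimate_cost(1_000_000, 250_000))  # 6.8
```

    With those assumed volumes the estimate comes to about $6.80, in the same range as the $7 per day cited above. Output tokens cost four times as much as input tokens, so the model’s share of the talking drives the bill.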

    Pricing varies by AWS region, with availability currently in US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Stockholm). Developers can access the model through Amazon Bedrock’s on-demand pricing without upfront commitments, or optimize costs further with reserved capacity for predictable workloads. The pay-as-you-go model means you only pay for actual tokens processed, making it cost-effective for both prototyping and production deployments.

    Amazon Nova 2 Sonic pricing is approximately $0.0034 per 1K input tokens and $0.0136 per 1K output tokens. This equals roughly $7 per day for 10 hours of conversation, representing 80% cost savings versus OpenAI’s GPT-4o Realtime.

    Integration Options and Developer Access

    Amazon Bedrock Bidirectional Streaming API

    Developers integrate Nova 2 Sonic through Amazon Bedrock’s HTTP/2-based bidirectional streaming API, which enables low-latency, real-time audio communication. The API supports progressive rendering of responses as they’re generated, context maintenance across multiple conversation turns without resending previous information, and thoughtful handling of interruptions and corrections. This streaming architecture minimizes perceived latency by starting audio playback before the entire response completes generation.
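
    As a rough sketch of what a client prepares before streaming, the code below splits raw audio into small chunks and wraps each one in a JSON event with base64-encoded content. It makes no AWS calls. The event shape (`audioInput`, `promptName`, `contentName`, `content`) is modeled on the format documented for the original Nova Sonic and is an assumption here, as are the 16 kHz 16-bit mono input format and the 32 ms chunk size; verify all of them against the current Bedrock documentation.

```python
import base64
import json
import uuid

SAMPLE_RATE_HZ = 16_000   # assumed input format: 16 kHz, 16-bit mono PCM
BYTES_PER_SAMPLE = 2
CHUNK_MS = 32             # hypothetical chunk duration


def chunk_pcm(pcm: bytes, chunk_ms: int = CHUNK_MS) -> list[bytes]:
    """Split raw PCM audio into fixed-duration chunks for streaming."""
    size = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]


def audio_input_event(prompt_name: str, content_name: str, chunk: bytes) -> str:
    """Wrap one audio chunk as a JSON event with base64-encoded content."""
    return json.dumps({
        "event": {
            "audioInput": {
                "promptName": prompt_name,
                "contentName": content_name,
                "content": base64.b64encode(chunk).decode("ascii"),
            }
        }
    })


def build_audio_events(pcm: bytes) -> list[str]:
    """Return the ordered audio events for one user utterance."""
    prompt_name, content_name = str(uuid.uuid4()), str(uuid.uuid4())
    return [audio_input_event(prompt_name, content_name, c)
            for c in chunk_pcm(pcm)]


if __name__ == "__main__":
    one_second_of_silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    events = build_audio_events(one_second_of_silence)
    print(len(events), "events; first event is", len(events[0]), "characters")
```

    Small chunks are what let the service start responding before the user has finished a long utterance; in a real client each event would be written to the open bidirectional stream as soon as the microphone produces it.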

    Telephony Provider Integration

    Nova 2 Sonic seamlessly integrates with Amazon Connect for call center applications, plus leading third-party telephony providers including Vonage, Twilio, and AudioCodes. The model also works with open-source conversational AI frameworks like LiveKit and Pipecat, giving developers flexibility in choosing their infrastructure. This broad integration support means teams can add voice AI capabilities to existing communication systems without rebuilding their entire stack.

    Available AWS Regions

    As of December 2025, Amazon Nova 2 Sonic is available in four AWS regions: US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Stockholm). AWS typically expands regional availability over time based on demand, so additional regions may become available in 2026. Developers can access Nova 2 Sonic through the Amazon Bedrock console or programmatically via the AWS SDK.

    Real-World Use Cases

    Nova 2 Sonic targets several practical applications where real-time voice interaction creates business value:

    • Customer support automation: Replace traditional IVR systems with natural voice assistants that understand context and handle complex queries without frustrating menu navigation
    • Outbound marketing calls: Generate personalized voice campaigns with emotional adaptation that sounds human rather than robotic
    • Voice-enabled personal assistants: Build AI companions that maintain conversation history across sessions and adapt to user speaking styles
    • Interactive education: Create language learning applications where the AI tutor provides pronunciation feedback and cultural context
    • Healthcare virtual assistants: Develop patient intake systems that handle medical terminology accurately while maintaining empathetic tone
    • Smart home integration: Power voice interfaces for connected devices with low latency and background noise robustness
    • Enterprise meeting assistants: Build voice-activated tools that take notes, summarize discussions, and answer questions during video calls

    Pros and Cons

    Pros:

    • Industry-leading response latency under 700ms for real-time conversations
    • 80% lower cost than OpenAI GPT-4o Realtime for equivalent workloads
    • One-million token context window enables extended conversations
    • Polyglot voices speak nine languages with native expressivity
    • Emotional adaptation preserves and responds to user sentiment
    • Seamless AWS integration for existing Bedrock and Connect users
    • Cross-modal support allows mixing voice and text in same session
    • Robust handling of background noise and diverse accents

    Cons:

    • Limited to four AWS regions at launch (December 2025)
    • Fewer languages than Google Gemini 2.5 Flash (9 vs 24)
    • Fewer voice options compared to competitors (masculine/feminine vs 30 HD voices)
    • No standalone mobile SDK; requires AWS infrastructure
    • Documentation and testing resources still building out post-launch
    • Requires AWS account and Bedrock access for experimentation

    Technical Specifications

    Specification | Details
    Model Type | Speech-to-Speech Foundation Model
    Architecture | Unified audio-domain processing (no text conversion)
    Response Latency | <700ms (developer testing)
    Context Window | 1 million tokens
    Languages | English (US/UK), Spanish, French, Italian, German, Portuguese, Hindi
    Voice Types | Masculine-sounding and feminine-sounding expressive voices
    Input Support | Streaming audio + text (cross-modal)
    Output Support | Streaming audio with adaptive prosody
    API Protocol | HTTP/2 bidirectional streaming
    Noise Robustness | Background noise filtering + accent adaptation
    Turn-Taking | Configurable pause sensitivity (low/medium/high)
    Tool Calling | Asynchronous multi-step task support
    Pricing (Input) | ~$0.0034 per 1K tokens
    Pricing (Output) | ~$0.0136 per 1K tokens
    Availability | 4 AWS regions (US East/West, Tokyo, Stockholm)
    Integration | Bedrock API, Amazon Connect, Vonage, Twilio, AudioCodes, LiveKit, Pipecat
    Launch Date | December 2, 2025

    Frequently Asked Questions (FAQs) 

    What is the difference between speech-to-speech and text-to-speech AI?
    Speech-to-speech models like Nova 2 Sonic process audio directly without converting to text, preserving acoustic features like tone and emotion. Text-to-speech, by contrast, only turns text into audio; a traditional voice assistant chains speech recognition, a language model, and text-to-speech, which adds latency and loses emotional context.

    Can Nova 2 Sonic handle interruptions during conversations?
    Yes, Nova 2 Sonic detects user interruptions and non-verbal cues like laughter, hesitations, and inter-sentential pauses to enable natural turn-taking. Developers can adjust turn-taking sensitivity to low, medium, or high based on their use case requirements.

    How does Nova 2 Sonic compare to ChatGPT’s voice mode?
    Nova 2 Sonic responds faster than OpenAI’s GPT-4o Realtime (under 700ms vs ~1 second) and costs approximately 80% less for equivalent voice interactions. Nova 2 Sonic also offers a larger one-million token context window compared to GPT-4o’s 32,000 tokens.

    Does Nova 2 Sonic require AWS infrastructure to use?
    Yes, Nova 2 Sonic is available exclusively through Amazon Bedrock, requiring an AWS account and Bedrock access. However, it integrates with third-party telephony providers like Twilio and Vonage, allowing some deployment flexibility.

    What are polyglot voices in Nova 2 Sonic?
    Polyglot voices are a single voice that can speak multiple languages with native expressivity and pronunciation. This means the same voice character can switch between English, Spanish, French, and other supported languages while maintaining natural-sounding delivery for each language.

    Can Nova 2 Sonic be used for call center automation?
    Yes, Nova 2 Sonic integrates directly with Amazon Connect for call center applications, plus third-party providers like Vonage, Twilio, and AudioCodes. It handles streaming speech recognition with background noise robustness and natural dialog flow suitable for customer support automation.

    What is the maximum conversation length Nova 2 Sonic can handle?
    With its one-million token context window, Nova 2 Sonic can maintain conversations equivalent to approximately 750,000 words or hours of audio. This enables sustained interactions without losing conversation history or requiring manual state management.

    Does Nova 2 Sonic support tool calling and function invocation?
    Yes, Nova 2 Sonic supports asynchronous tool calling, allowing it to invoke external functions and tools while maintaining conversation flow. The model shows superior tool invocation accuracy compared to the original Nova Sonic.

    Mohammad Kashif
    Covers smartphones, AI, and emerging tech, explaining how new features affect daily life. Reviews focus on battery life, camera behavior, update policies, and long-term value to help readers choose the right gadgets and software.
