TL;DR
Adding voice to your AI chatbot means working with two different Gemini API approaches and two different Python SDKs. Use the standard API (google.generativeai) for speech-to-text transcription and the Live API (google.genai) for real-time text-to-speech synthesis. The best architectural decision you can make is transcribing voice to text first, then routing through your existing text-based orchestrator. That gives every feature voice support for free. Model names, endpoints, pricing, and SDK packages differ completely between the two surfaces.
The architecture that actually works
When teams add voice, they tend to build a parallel voice pipeline alongside their text pipeline. That doubles your maintenance surface and guarantees the two drift apart over time.
The pattern that holds up in production is almost disappointingly simple:
Voice Message → STT (Transcribe) → Text Orchestrator → Response Text → TTS → Audio Reply
Convert voice to text at the entry point and every downstream feature (RAG search, function calling, intent routing) automatically supports voice input. No additional code. This is the same composability principle that keeps systems maintainable: do one thing well at each layer.
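To make that concrete, here's a minimal sketch of the entry-point dispatch. The helper names (transcribe, synthesize, send_text_reply, send_audio_reply) and the orchestrator interface are placeholders for whatever your stack already has:

```python
async def handle_message(message) -> None:
    # Normalize everything to text at the boundary
    if message.type == "audio":
        text = await transcribe(message.audio_bytes)  # STT step, shown below
    else:
        text = message.text

    # One orchestrator serves both modalities; no parallel voice pipeline
    reply_text = await orchestrator.run(text)

    # Reply in kind: synthesize audio only for voice conversations
    if message.type == "audio":
        await send_audio_reply(message, await synthesize(reply_text))
    else:
        await send_text_reply(message, reply_text)
```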
Speech-to-text with the standard API
For transcribing pre-recorded audio (voice messages your bot receives), use the standard Gemini API via the google-generativeai SDK. You have a complete audio file, not a real-time stream, so the Live API would be overkill.
```python
# SDK: google-generativeai (pip install google-generativeai)
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-3.1-flash")

# Raw bytes of the incoming voice message (platform-specific fetch)
audio_bytes = download_voice_message(message_id)

response = model.generate_content([
    "Transcribe this audio to text accurately.",
    {"mime_type": "audio/m4a", "data": audio_bytes},
])
transcript = response.text
```
In our tests, a 15-second voice message transcribed in ~2.1 seconds (median, n=50, gemini-3.1-flash, us-central1). Multi-language recognition handled mixed-language sentences within a single utterance without issues. You also skip the overhead of maintaining a persistent WebSocket connection.
Text-to-speech with the Live API
Generating spoken audio from text requires the Gemini Live API and, confusingly, a completely different SDK: google-genai (not google-generativeai). The model name is different too.
```python
# SDK: google-genai (pip install google-genai)
from google import genai

client = genai.Client(
    vertexai=True,
    project="your-project",
    location="us-central1",  # MUST be regional, not global
)

async def synthesize(response_text: str) -> bytes:
    async with client.aio.live.connect(
        model="gemini-live-2.5-flash-native-audio",
        config={"response_modalities": ["AUDIO"]},  # ask for spoken audio back
    ) as session:
        await session.send_client_content(
            turns=[{"role": "user", "parts": [{"text": response_text}]}]
        )
        # Collect PCM audio chunks from the stream
        return await collect_audio_response(session)
```
The output is raw PCM audio. You’ll need ffmpeg to convert it to something usable like m4a before sending it back to users.
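For reference, a conversion along these lines works via subprocess. The input flags assume the Live API's documented 24 kHz, 16-bit, mono little-endian output; verify against your actual stream before shipping:

```python
import subprocess

def pcm_to_m4a(pcm_path: str, m4a_path: str) -> None:
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-f", "s16le", "-ar", "24000", "-ac", "1",  # describe the raw PCM input
            "-i", pcm_path,
            "-c:a", "aac",  # AAC audio inside an .m4a container
            m4a_path,
        ],
        check=True,
    )
```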
A note on the SDK split: the standard API uses google-generativeai (import google.generativeai). The Live API uses google-genai (from google import genai). These are distinct packages with different installation commands, authentication flows, and API surfaces. Mixing them up is the single most common integration mistake I’ve seen, and the error messages won’t help you figure out what went wrong.
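If both packages end up in the same codebase, explicit aliases at the import site make the split visible at a glance. The alias names here are just a suggestion:

```python
# requirements.txt needs both lines:
#   google-generativeai  <- standard API (STT, text generation)
#   google-genai         <- Live API (real-time TTS)
import google.generativeai as standard_genai
from google import genai as live_genai
```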
Where things go wrong
Building against these APIs, I ran into several traps that weren’t documented anywhere obvious:
| Pitfall | What you’d expect | What actually happens |
|---|---|---|
| SDK package | One unified SDK | Two separate packages: google-generativeai vs google-genai |
| Model naming | Consistent convention | gemini-3.1-flash (standard) vs gemini-live-2.5-flash-native-audio (Live) |
| Endpoint | Global endpoint works | Live API requires a regional endpoint (us-central1). Global fails silently. |
| Part.from_text() | Positional arg works | Requires keyword syntax (Part.from_text(text=...)); positional args raise confusing errors |
| Audio format | Ready-to-use output | Raw PCM. Requires ffmpeg conversion to m4a/mp3. |
| Streaming | Both APIs stream | Standard is request/response. Only Live API streams bidirectionally. |
| Pricing | Similar cost model | Live API charges per session-second; standard charges per token |
What this costs
Most guides skip this part, which is a disservice. Standard API pricing is token-based, so it’s predictable and cheap for short transcriptions. Live API pricing is session-based: you pay for connection duration, not just output volume. For high-volume bots, Live API costs ran 3-5x higher per interaction in our testing, depending on response length and connection overhead. Check the current Gemini pricing docs before committing, because rates change.
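As a rough comparison model, with every rate left as a parameter you fill in from the pricing page (nothing here is a real rate):

```python
def standard_api_cost(in_tokens: int, out_tokens: int,
                      rate_in: float, rate_out: float) -> float:
    # Token-based: cost scales with content, not wall-clock time
    return in_tokens * rate_in + out_tokens * rate_out

def live_api_cost(session_seconds: float, rate_per_second: float) -> float:
    # Session-based: cost scales with connection time, including idle time
    return session_seconds * rate_per_second
```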
Watch for platform quirks
If you’re integrating with messaging platforms, reply token constraints will bite you. LINE, for example, lets a reply token be used only once. If your bot sends a text reply and then wants to follow up with audio, you need a push message API for the second response because the reply token is already consumed.
This kind of platform-specific constraint is exactly why the transcribe-first architecture pays off. Your orchestrator handles the logic; your platform adapter handles the delivery quirks. Clean separation.
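A sketch of that boundary for LINE's one-shot reply tokens; send_line_reply and send_line_push are hypothetical helpers standing in for your actual LINE client calls:

```python
async def deliver(user_id: str, reply_token: str,
                  text: str, audio_url: str | None) -> None:
    # The reply token is single-use: spend it on the text response...
    await send_line_reply(reply_token, text)      # hypothetical helper
    # ...so any follow-up audio must go through the push API
    if audio_url is not None:
        await send_line_push(user_id, audio_url)  # hypothetical helper
```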
Before you ship
Monitor your end-to-end voice round-trip time. Users expect sub-3-second responses for voice interactions, and you’ll be surprised how quickly latency adds up across the transcription, orchestration, and synthesis steps.
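A simple way to see where the budget goes is to time each stage explicitly. A sketch, where the three stage coroutines stand in for your own pipeline:

```python
import time

async def timed(stage: str, coro):
    start = time.perf_counter()
    result = await coro
    print(f"{stage}: {time.perf_counter() - start:.2f}s")  # or send to your metrics
    return result

# Inside your async handler:
transcript = await timed("stt", transcribe(audio_bytes))
reply_text = await timed("orchestrate", orchestrator.run(transcript))
pcm_data = await timed("tts", synthesize(reply_text))
# Alert when the sum creeps past your ~3-second budget
```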
Make sure your ffmpeg binary is available in your deployment environment (container images often omit it). Set up proper error handling for the Live API WebSocket connection too. It will drop under load if you’re not managing connections carefully.
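The ffmpeg check in particular is cheap to do at process startup instead of failing on the first voice reply:

```python
import shutil

# Fail fast if the container image is missing ffmpeg
if shutil.which("ffmpeg") is None:
    raise RuntimeError("ffmpeg not found on PATH; add it to the image")
```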
One more thing: take breaks during long WebSocket debugging sessions. I’m serious. Fatigue introduces subtle bugs that cost more time than the break would have.
What to remember
Use the standard google-generativeai SDK for STT and google-genai with the Live API for TTS. Know which package you’re importing and why. They look similar enough to confuse you at 2am.
Hard-code us-central1 (or your nearest supported region) for Live API calls. The global Vertex AI endpoint doesn’t support them. This is the number one deployment failure I’ve seen.
Transcribe first, then route through your existing text orchestrator. Build the voice layer as a thin adapter, not a parallel pipeline. Your future self maintaining this system will thank you.