MVP Factory
AI startup development

Krystian Wiewiór · 5 min read

TL;DR

Adding voice to your AI chatbot means working with two different Gemini API approaches and two different Python SDKs. Use the standard API (google.generativeai) for speech-to-text transcription and the Live API (google.genai) for real-time text-to-speech synthesis. The best architectural decision you can make is transcribing voice to text first, then routing through your existing text-based orchestrator. That gives every feature voice support for free. Model names, endpoints, pricing, and SDK packages differ completely between the two surfaces.


The architecture that actually works

When teams add voice, they tend to build a parallel voice pipeline alongside their text pipeline. That doubles your maintenance surface and guarantees the two drift apart over time.

The pattern that holds up in production is almost disappointingly simple:

Voice Message → STT (Transcribe) → Text Orchestrator → Response Text → TTS → Audio Reply

Convert voice to text at the entry point and every downstream feature (RAG search, function calling, intent routing) automatically supports voice input. No additional code. This is the same composability principle that keeps systems maintainable: do one thing well at each layer.
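The entry point above can be sketched in a few lines. `transcribe` and `orchestrate` are hypothetical stand-ins for your STT call and your existing text pipeline; the point is that everything downstream only ever sees text.

```python
# Sketch of the transcribe-first entry point. `transcribe` and
# `orchestrate` are placeholders for your STT call and existing
# text pipeline -- not real API names.

def handle_message(message: dict, transcribe, orchestrate) -> str:
    """Route voice and text through the same orchestrator."""
    if message["type"] == "voice":
        text = transcribe(message["audio"])  # STT at the boundary
    else:
        text = message["text"]
    # RAG, function calling, and intent routing all live behind
    # `orchestrate` and never know the input was spoken.
    return orchestrate(text)
```

Because the voice layer is a thin adapter over the text path, adding a new downstream feature never requires touching the voice code.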

Speech-to-text with the standard API

For transcribing pre-recorded audio (voice messages your bot receives), use the standard Gemini API via the google-generativeai SDK. You have a complete audio file, not a real-time stream, so the Live API would be overkill.

# SDK: google-generativeai (pip install google-generativeai)
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])  # auth before any calls

model = genai.GenerativeModel("gemini-3.1-flash")

audio_bytes = download_voice_message(message_id)  # your platform's download helper

response = model.generate_content([
    "Transcribe this audio to text accurately.",
    {"mime_type": "audio/m4a", "data": audio_bytes}
])

transcript = response.text

In our tests, a 15-second voice message transcribed in ~2.1 seconds (median, n=50, gemini-3.1-flash, us-central1). Multi-language recognition handled mixed-language sentences within a single utterance without issues. You also skip the overhead of maintaining a persistent WebSocket connection.

Text-to-speech with the Live API

Generating spoken audio from text requires the Gemini Live API and, confusingly, a completely different SDK: google-genai (not google-generativeai). The model name is different too.

# SDK: google-genai (pip install google-genai)
from google import genai

client = genai.Client(
    vertexai=True,
    project="your-project",
    location="us-central1"  # MUST be regional, not global
)

async with client.aio.live.connect(
    model="gemini-live-2.5-flash-native-audio"
) as session:
    await session.send_client_content(
        turns=[{"role": "user", "parts": [{"text": response_text}]}]
    )
    # Collect PCM audio chunks from the stream
    pcm_data = await collect_audio_response(session)

The output is raw PCM audio. You’ll need ffmpeg to convert it to something usable like m4a before sending it back to users.
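That conversion step can be sketched like this. The sample rate and format (24 kHz, 16-bit mono, little-endian) are assumptions about the Live API's native-audio output; verify them against what your session actually returns before hard-coding anything.

```python
# Sketch of the ffmpeg conversion step. Sample rate and PCM layout
# (24 kHz, s16le, mono) are ASSUMPTIONS -- check your actual output.
import subprocess

def ffmpeg_pcm_to_m4a_cmd(pcm_path: str, m4a_path: str,
                          sample_rate: int = 24000) -> list:
    """Build the ffmpeg command; the caller runs it with subprocess."""
    return [
        "ffmpeg", "-y",
        "-f", "s16le",            # raw signed 16-bit little-endian PCM
        "-ar", str(sample_rate),  # input sample rate
        "-ac", "1",               # mono
        "-i", pcm_path,
        "-c:a", "aac",            # m4a containers carry AAC audio
        m4a_path,
    ]

# subprocess.run(ffmpeg_pcm_to_m4a_cmd("reply.pcm", "reply.m4a"), check=True)
```

Building the argument list separately from running it also makes the conversion step trivially unit-testable.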

A note on the SDK split: the standard API uses google-generativeai (import google.generativeai). The Live API uses google-genai (from google import genai). These are distinct packages with different installation commands, authentication flows, and API surfaces. Mixing them up is the single most common integration mistake I’ve seen, and the error messages won’t help you figure out what went wrong.

Where things go wrong

Building against these APIs, I ran into several traps that weren’t documented anywhere obvious:

| Pitfall | What you'd expect | What actually happens |
| --- | --- | --- |
| SDK package | One unified SDK | Two separate packages: google-generativeai vs google-genai |
| Model naming | Consistent convention | gemini-3.1-flash (standard) vs gemini-live-2.5-flash-native-audio (Live) |
| Endpoint | Global endpoint works | Live API requires a regional endpoint (us-central1). Global fails silently. |
| Part.from_text() | Positional arg works | Must use keyword-argument syntax or it throws unexpected errors |
| Audio format | Ready-to-use output | Raw PCM. Requires ffmpeg conversion to m4a/mp3. |
| Streaming | Both APIs stream | Standard is request/response. Only the Live API streams bidirectionally. |
| Pricing | Similar cost model | Live API charges per session-second; standard charges per token |

What this costs

Most guides skip this part, which is a disservice. Standard API pricing is token-based, so it’s predictable and cheap for short transcriptions. Live API pricing is session-based: you pay for connection duration, not just output volume. For high-volume bots, Live API costs ran 3-5x higher per interaction in our testing, depending on response length and connection overhead. Check the current Gemini pricing docs before committing, because rates change.
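The two cost models are easy to compare with a back-of-envelope calculation. The rates used below are placeholders, not real Gemini prices; plug in current numbers from the pricing page.

```python
# Back-of-envelope cost comparison. Rates are PLACEHOLDERS, not real
# Gemini prices -- substitute current values from the pricing docs.

def standard_cost(input_tokens: int, output_tokens: int,
                  in_rate: float, out_rate: float) -> float:
    """Token-based pricing: cost tracks volume, independent of wall time."""
    return input_tokens * in_rate + output_tokens * out_rate

def live_cost(session_seconds: float, per_second_rate: float) -> float:
    """Session-based pricing: cost tracks connection duration, even when idle."""
    return session_seconds * per_second_rate
```

The structural difference matters more than the exact rates: a slow downstream step is free under token pricing but billable under session pricing, which is one reason the Live API came out 3-5x more expensive per interaction in our tests.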

Watch for platform quirks

If you’re integrating with messaging platforms, reply token constraints will bite you. LINE, for example, lets a reply token be used only once. If your bot sends a text reply and then wants to follow up with audio, you need a push message API for the second response because the reply token is already consumed.
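A platform adapter can encode that rule directly. This is a sketch against LINE's Messaging API reply and push endpoints; it only builds the request without sending it, and real code needs error handling and a channel access token.

```python
# Sketch of the reply-vs-push split for LINE. Builds the request but
# does not send it; retries and error handling are omitted.
import json
import urllib.request

LINE_BASE = "https://api.line.me/v2/bot/message"

def build_line_request(target: str, messages: list, is_first_reply: bool,
                       channel_token: str) -> urllib.request.Request:
    """First response spends the one-shot reply token; any follow-up
    (e.g. the audio after a text reply) must be pushed to the user ID."""
    if is_first_reply:
        url, body = f"{LINE_BASE}/reply", {"replyToken": target, "messages": messages}
    else:
        url, body = f"{LINE_BASE}/push", {"to": target, "messages": messages}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {channel_token}"},
        method="POST",
    )
```

Keeping the reply/push decision inside the adapter means the orchestrator never needs to know which message in a turn is the first one.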

This kind of platform-specific constraint is exactly why the transcribe-first architecture pays off. Your orchestrator handles the logic; your platform adapter handles the delivery quirks. Clean separation.

Before you ship

Monitor your end-to-end voice round-trip time. Users expect sub-3-second responses for voice interactions, and you’ll be surprised how quickly latency adds up across the transcription, orchestration, and synthesis steps.
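A minimal way to see where that budget goes is to time each stage separately. The stage bodies below are stand-ins for the real STT, orchestrator, and TTS calls.

```python
# Per-stage latency instrumentation; the `pass` bodies are stand-ins
# for the real transcription, orchestration, and synthesis calls.
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str, timings: dict):
    """Record wall time per stage so you can see where the budget goes."""
    start = time.monotonic()
    try:
        yield
    finally:
        timings[stage] = time.monotonic() - start

timings = {}
with timed("stt", timings):
    pass  # transcribe(audio_bytes)
with timed("orchestrate", timings):
    pass  # orchestrator.handle(transcript)
with timed("tts", timings):
    pass  # synthesize(response_text)

round_trip = sum(timings.values())  # alert if this creeps past ~3s
```

Logging the per-stage breakdown (not just the total) is what tells you whether to optimize transcription, the orchestrator, or synthesis first.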

Make sure your ffmpeg binary is available in your deployment environment (container images often omit it). Set up proper error handling for the Live API WebSocket connection too. It will drop under load if you’re not managing connections carefully.
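A small preflight check at startup catches the missing-ffmpeg case before the first voice reply fails:

```python
# Fail fast at service startup rather than on the first voice reply.
import shutil

def missing_binaries(required=("ffmpeg",)) -> list:
    """Return the required executables not found on PATH."""
    return [name for name in required if shutil.which(name) is None]

# At service startup:
# if missing_binaries():
#     raise RuntimeError("voice pipeline needs ffmpeg on PATH")
```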

One more thing: take breaks during long WebSocket debugging sessions. I’m serious. Fatigue introduces subtle bugs that cost more time than the break would have.

What to remember

Use the standard google-generativeai SDK for STT and google-genai with the Live API for TTS. Know which package you’re importing and why. They look similar enough to confuse you at 2am.

Hard-code us-central1 (or your nearest supported region) for Live API calls. The global Vertex AI endpoint doesn’t support them. This is the number one deployment failure I’ve seen.

Transcribe first, then route through your existing text orchestrator. Build the voice layer as a thin adapter, not a parallel pipeline. Your future self maintaining this system will thank you.

