How to build a voice agent with ElevenLabs API and Python

1) FTC disclosure

This article contains an ElevenLabs affiliate link. If you sign up through that link, I may earn a commission at no extra cost to you.

2) TL;DR: what we are building + stack

If you searched for "build voice agent elevenlabs python", you likely want a tutorial that actually runs, not architecture slides. In this guide, we build a complete local voice agent that listens from your microphone, generates a response, and speaks back using ElevenLabs.

The build is intentionally simple but production-minded. We keep clear boundaries between STT, response generation, and TTS so you can swap components later without rewriting the entire project.

If you are still deciding whether the platform is worth paying for after the prototype stage, the full developer review goes deeper into quality, DX, and where ElevenLabs fits better than nearby alternatives.

What you will have at the end:

  1. A runnable CLI voice agent loop in Python
  2. Real microphone input with SpeechRecognition
  3. Real audio synthesis with ElevenLabs Python SDK
  4. Two response modes: a rule-based mode and an optional OpenAI mode
  5. A test checklist to verify every stage locally

Stack used in this tutorial:

  1. Python 3.9+
  2. elevenlabs (official Python SDK) for TTS
  3. SpeechRecognition + pyaudio for microphone input
  4. python-dotenv for configuration
  5. openai (optional) for LLM replies

Expected effort:

3) Prerequisites (Python 3.9+, ElevenLabs account, basic async knowledge)

Before writing files, make sure your local environment is realistic for audio work. Most project failures at this stage are not code logic failures. They are environment mismatches.

Required:

  1. Python 3.9 or newer
  2. An ElevenLabs account and API key
  3. A working microphone
  4. Basic familiarity with asynchronous and IO-bound Python code

System checks you should do first:

  1. Confirm Python version:
python --version
  2. Confirm microphone availability at OS level (outside Python first).
  3. Confirm outbound internet access for API requests.
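The checks above can also be scripted as a small preflight. Here is a minimal sketch using only the standard library; the required key name matches the .env file created later, and the bounds are illustrative:

```python
import os
import sys


def check_python(min_version: tuple = (3, 9)) -> bool:
    """Return True when the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= min_version


def check_env_keys(required: list) -> list:
    """Return the names of required environment variables that are missing or blank."""
    return [name for name in required if not os.getenv(name, "").strip()]


if __name__ == "__main__":
    if not check_python():
        sys.exit("Python 3.9+ is required")
    missing = check_env_keys(["ELEVENLABS_API_KEY"])
    if missing:
        print(f"Missing env vars: {', '.join(missing)}")
    else:
        print("Preflight OK")
```

Running this before any audio code turns a vague "it doesn't work" into a specific, fixable message.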

Why these prerequisites matter:

pyaudio ships platform-specific builds, so a Python version mismatch often breaks installation; a microphone the OS cannot see fails silently inside Python; and both STT and TTS in this build are network calls, so blocked outbound access surfaces as confusing request errors.

4) Architecture overview (diagram text-based)

The architecture below is small enough for a tutorial and clean enough for a real MVP. Each block has one job.

+-------------------+      +-----------------------+      +------------------+
|  Microphone Input | ---> |  STT Layer            | ---> |  Agent Brain     |
|  (device audio)   |      |  SpeechRecognition    |      |  rule or OpenAI  |
+-------------------+      +-----------------------+      +------------------+
                                                                   |
                                                                   v
                    +--------------------------------------------------------------+
                    | TTS Layer (ElevenLabs Python SDK)                            |
                    | client.text_to_speech.convert(...) -> save audio file        |
                    +--------------------------------------------------------------+
                                                                   |
                                                                   v
                                                       +--------------------------+
                                                       | Local Output (MP3 file)  |
                                                        | Playback or downstream   |
                                                       +--------------------------+

Design decisions and why they are practical:

  1. STT, brain, and TTS live in separate modules, so each stage can be smoke-tested alone.
  2. The brain is plain text-in, text-out, which keeps it swappable between rule and OpenAI modes.
  3. TTS writes a file rather than streaming, which is simpler to verify locally.

For developers shipping real products, this separation is more important than adding features too early.
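One lightweight way to enforce these boundaries in Python is structural typing. The sketch below is my own illustration, not part of any SDK; the Protocol names are assumptions:

```python
from typing import Protocol


class SpeechToText(Protocol):
    def listen_once(self) -> str: ...


class Brain(Protocol):
    def reply(self, user_text: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize_to_file(self, text: str, output_path: str) -> str: ...


def run_turn(stt: SpeechToText, brain: Brain, tts: TextToSpeech, output_path: str) -> str:
    """One conversation turn: capture speech, decide on a reply, synthesize it."""
    user_text = stt.listen_once()
    reply = brain.reply(user_text)
    return tts.synthesize_to_file(reply, output_path)
```

Any object with matching method signatures satisfies the protocol, so you can swap a layer without inheritance or changes to the loop.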

5) Step 1: Setup + install dependencies

Create a new project and virtual environment.

mkdir voice-agent-elevenlabs-python
cd voice-agent-elevenlabs-python
python -m venv .venv

Activate environment:

# Windows PowerShell
.venv\Scripts\Activate.ps1

# macOS/Linux
source .venv/bin/activate

Install dependencies:

pip install elevenlabs SpeechRecognition pyaudio python-dotenv openai

Create a .env file:

ELEVENLABS_API_KEY=your_elevenlabs_api_key_here
ELEVENLABS_VOICE_ID=JBFqnCBsd6RMkjVDRZzb
ELEVENLABS_MODEL_ID=eleven_multilingual_v2
OPENAI_API_KEY=
OPENAI_MODEL=gpt-4.1-mini

Create requirements.txt for reproducibility:

elevenlabs
SpeechRecognition
pyaudio
python-dotenv
openai

Create this minimal project structure:

voice-agent-elevenlabs-python/
  .env
  requirements.txt
  tts_engine.py
  stt_engine.py
  brain.py
  main.py

Why we keep only six files now:

Each file owns exactly one responsibility, and each engine module can be run directly as its own smoke test. Adding more structure before the pipeline works just multiplies the places a bug can hide.

6) Step 2: ElevenLabs TTS integration (verified Python SDK snippet)

This section uses the verified ElevenLabs Python SDK pattern: the ElevenLabs client and client.text_to_speech.convert(...).

Create tts_engine.py:

import os
from pathlib import Path
from typing import Optional

from dotenv import load_dotenv
from elevenlabs import save
from elevenlabs.client import ElevenLabs


load_dotenv()


class TTSEngine:
    def __init__(self) -> None:
        api_key = os.getenv("ELEVENLABS_API_KEY", "").strip()
        if not api_key:
            raise ValueError("Missing ELEVENLABS_API_KEY")

        self.client = ElevenLabs(api_key=api_key)
        self.voice_id = os.getenv("ELEVENLABS_VOICE_ID", "JBFqnCBsd6RMkjVDRZzb")
        self.model_id = os.getenv("ELEVENLABS_MODEL_ID", "eleven_multilingual_v2")

    def synthesize_to_file(self, text: str, output_path: Optional[str] = None) -> str:
        if not text.strip():
            raise ValueError("Text cannot be empty")

        audio = self.client.text_to_speech.convert(
            voice_id=self.voice_id,
            text=text,
            model_id=self.model_id,
            output_format="mp3_44100_128",
        )

        if output_path is None:
            output_path = str(Path("output_reply.mp3").resolve())

        save(audio, output_path)
        return output_path


if __name__ == "__main__":
    engine = TTSEngine()
    path = engine.synthesize_to_file("Hello from your Python voice agent.")
    print(f"Audio saved to: {path}")

Run a direct TTS smoke test:

python tts_engine.py

If output_reply.mp3 is created, your TTS integration is valid. Do this before adding STT and agent logic so failures stay isolated.

Once that smoke test works, this production-oriented pricing breakdown helps you estimate what happens when the tutorial starts turning into real monthly usage.

7) Step 3: Add STT input layer (SpeechRecognition library)

Now we add speech input. The goal here is robust microphone capture with clear error behavior.

Create stt_engine.py:

import speech_recognition as sr


class STTEngine:
    def __init__(self) -> None:
        self.recognizer = sr.Recognizer()

    def listen_once(self, timeout: int = 8, phrase_time_limit: int = 12) -> str:
        with sr.Microphone() as source:
            print("Calibrating ambient noise...")
            self.recognizer.adjust_for_ambient_noise(source, duration=0.7)
            print("Listening now...")
            audio = self.recognizer.listen(
                source,
                timeout=timeout,
                phrase_time_limit=phrase_time_limit,
            )

        try:
            transcript = self.recognizer.recognize_google(audio)
            return transcript.strip()
        except sr.UnknownValueError:
            return ""
        except sr.RequestError as exc:
            raise RuntimeError(f"STT backend request failed: {exc}") from exc


if __name__ == "__main__":
    stt = STTEngine()
    text = stt.listen_once()
    if text:
        print(f"Transcript: {text}")
    else:
        print("No recognizable speech captured.")

Run STT test:

python stt_engine.py

Expected behavior:

You should see the calibration and listening messages, then either your transcript printed back or the message about no recognizable speech.

If transcript quality is poor, tune duration, timeout, and phrase_time_limit before moving on.
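If you find yourself re-tuning these knobs often, it helps to centralize them in one place. Here is a sketch of that idea; the bounds are illustrative defaults I chose, not library requirements:

```python
from dataclasses import dataclass


@dataclass
class STTConfig:
    ambient_duration: float = 0.7   # seconds passed to adjust_for_ambient_noise
    timeout: int = 8                # seconds to wait for speech to start
    phrase_time_limit: int = 12     # max seconds captured per utterance

    def clamped(self) -> "STTConfig":
        """Keep values inside a sane envelope so a typo cannot stall the loop."""
        return STTConfig(
            ambient_duration=min(max(self.ambient_duration, 0.3), 2.0),
            timeout=min(max(self.timeout, 2), 30),
            phrase_time_limit=min(max(self.phrase_time_limit, 3), 60),
        )
```

A config object like this also makes tuning sessions reproducible: you can log the exact values that produced a good transcript.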

8) Step 4: Connect LLM response layer (OpenAI or simple rule-based)

This module keeps your decision logic independent from speech IO. We support two modes:

  1. rule: deterministic keyword matching, works offline with no extra API key
  2. openai: LLM-generated replies via the OpenAI API, requires OPENAI_API_KEY

Create brain.py:

import os
from typing import Literal

from dotenv import load_dotenv


load_dotenv()

Mode = Literal["rule", "openai"]


class AgentBrain:
    def __init__(self, mode: Mode = "rule") -> None:
        self.mode = mode
        self.openai_client = None
        self.model = os.getenv("OPENAI_MODEL", "gpt-4.1-mini")

        if mode == "openai":
            api_key = os.getenv("OPENAI_API_KEY", "").strip()
            if not api_key:
                raise ValueError("OPENAI_API_KEY is required for openai mode")

            from openai import OpenAI

            self.openai_client = OpenAI(api_key=api_key)

    def reply(self, user_text: str) -> str:
        clean_text = user_text.strip()
        if not clean_text:
            return "I could not hear you clearly. Please try again."

        if self.mode == "openai":
            return self._openai_reply(clean_text)

        return self._rule_reply(clean_text)

    def _rule_reply(self, text: str) -> str:
        t = text.lower()

        if "hello" in t or "hi" in t:
            return "Hello. I am your local developer voice agent."
        if "status" in t:
            return "System status: local pipeline running with SpeechRecognition and ElevenLabs."
        if "weather" in t:
            return "I do not have live weather yet. Next step is adding a weather API tool."
        if "quit" in t or "exit" in t:
            return "Acknowledged. Say goodbye and close the session."

        return f"You said: {text}. I am currently in rule mode."

    def _openai_reply(self, text: str) -> str:
        assert self.openai_client is not None

        response = self.openai_client.responses.create(
            model=self.model,
            input=(
                "You are a concise voice assistant for software developers. "
                "Answer with practical steps and keep replies under 3 short sentences.\n\n"
                f"User: {text}"
            ),
        )

        output = (response.output_text or "").strip()
        if output:
            return output
        return "I generated an empty reply. Please ask again with more context."

Why this design is useful in real projects:

The brain never touches audio, so you can unit test it as plain text, keep rule mode as an offline fallback when the LLM is unavailable, and upgrade the model without touching any STT or TTS code.
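Because the brain is plain text-in, text-out, it is cheap to unit test. Here is a sketch with the rule branches inlined for self-containment; in the real project you would import AgentBrain from brain.py instead:

```python
def rule_reply(text: str) -> str:
    # Mirror of the rule-mode branches in brain.py, inlined for a standalone test.
    t = text.lower()
    if "hello" in t or "hi" in t:
        return "Hello. I am your local developer voice agent."
    if "quit" in t or "exit" in t:
        return "Acknowledged. Say goodbye and close the session."
    return f"You said: {text}. I am currently in rule mode."


def test_rule_reply() -> None:
    assert rule_reply("hello there").startswith("Hello")
    assert "Acknowledged" in rule_reply("please quit now")
    assert rule_reply("random input").startswith("You said:")


if __name__ == "__main__":
    test_rule_reply()
    print("rule brain tests passed")
```

Tests like these run in milliseconds with no microphone, no speakers, and no API keys, which is exactly why the speech layers are kept out of the brain.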

9) Step 5: Run and test locally

Now wire all modules into one executable loop.

Create main.py:

import argparse
from pathlib import Path

from brain import AgentBrain
from stt_engine import STTEngine
from tts_engine import TTSEngine


def run(mode: str) -> None:
    stt = STTEngine()
    brain = AgentBrain(mode=mode)
    tts = TTSEngine()

    print(f"Voice agent started in {mode} mode")
    print("Speak naturally. Say 'quit' or 'exit' to stop.")

    while True:
        try:
            user_text = stt.listen_once()
        except Exception as exc:
            print(f"STT error: {exc}")
            continue

        if not user_text:
            print("No speech recognized. Try again.")
            continue

        print(f"User: {user_text}")
        reply_text = brain.reply(user_text)
        print(f"Agent: {reply_text}")

        output_file = Path("output_reply.mp3").resolve()
        try:
            saved = tts.synthesize_to_file(reply_text, str(output_file))
            print(f"Audio reply saved: {saved}")
        except Exception as exc:
            print(f"TTS error: {exc}")

        low = user_text.lower()
        if "quit" in low or "exit" in low:
            break

    print("Voice agent stopped")


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", choices=["rule", "openai"], default="rule")
    args = parser.parse_args()
    run(args.mode)


if __name__ == "__main__":
    main()

Run in rule mode first:

python main.py --mode rule

If rule mode works, run OpenAI mode:

python main.py --mode openai

Local validation checklist (do not skip):

  1. STT captures speech reliably in at least 5 consecutive turns
  2. Rule mode returns expected responses for known trigger phrases
  3. output_reply.mp3 is generated for each turn
  4. OpenAI mode works without changing any STT/TTS code
  5. Process exits cleanly when user says quit

At this point, you have a working voice-agent foundation that can be demoed to users or teammates.

10) Common errors + how to fix

This section saves real debugging time. Most issues are not complex, but they happen frequently.

Error: ModuleNotFoundError: No module named 'pyaudio'

Cause:

The package was installed into a different interpreter than the one running the script, or pyaudio has no prebuilt wheel for your platform and failed to compile.

Fix:

  1. Verify active environment path with where python or which python
  2. Reinstall pyaudio in the active environment
  3. If install fails on Windows, install build tools and retry

Error: Missing ELEVENLABS_API_KEY

Cause:

The .env file is not found from the current working directory, or the variable name has a typo.

Fix:

  1. Confirm .env is in project root
  2. Confirm exact key name: ELEVENLABS_API_KEY
  3. Add a quick debug print of current working directory when needed

Error: empty transcript even when microphone works

Cause:

The capture window or calibration is mismatched to your environment: ambient-noise calibration too short, or timeout and phrase_time_limit too small for how you actually speak.

Fix:

  1. Increase adjust_for_ambient_noise(..., duration=1.0)
  2. Increase timeout and phrase_time_limit
  3. Move microphone closer and reduce background noise

Error: STT backend request failed

Cause:

recognize_google calls Google's web API over the network, so transient network problems or service-side failures surface here as RequestError.

Fix:

  1. retry with exponential backoff
  2. keep a fallback input path (typed text mode)
  3. log failure counts so you can see if issue is systemic
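The backoff idea in fix 1 can be a small generic wrapper. Here is a sketch; the delays are illustrative and should be tuned for your STT backend:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(fn: Callable[[], T], attempts: int = 3, base_delay: float = 0.5) -> T:
    """Call fn, retrying on RuntimeError with exponentially growing sleeps."""
    for attempt in range(attempts):
        try:
            return fn()
        except RuntimeError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the real error
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")
```

In main.py you could then wrap the capture call as `with_backoff(stt.listen_once)` without changing the STT engine itself.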

Error: OpenAI mode returns auth or quota error

Cause:

The OPENAI_API_KEY is invalid or expired, or the account has exhausted its quota.

Fix:

  1. run in --mode rule to keep product development unblocked
  2. verify OPENAI_API_KEY and account quotas
  3. add a startup check that validates OpenAI connection once
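Fix 3 can be a cached one-time validator, so a bad key fails fast at startup instead of mid-conversation. Here is a sketch; the ping callable is an assumption, and with OpenAI it could be any cheap request made once:

```python
from typing import Callable, Optional


class StartupCheck:
    """Run an expensive validation exactly once and remember the outcome."""

    def __init__(self, ping: Callable[[], None]) -> None:
        self._ping = ping
        self._ok: Optional[bool] = None

    def ensure(self) -> None:
        if self._ok is None:
            try:
                self._ping()       # first call: actually hit the backend
                self._ok = True
            except Exception as exc:
                self._ok = False
                raise RuntimeError(f"Startup check failed: {exc}") from exc
        elif not self._ok:
            raise RuntimeError("Startup check previously failed")
```

Calling ensure() at the top of run() means every later turn can assume credentials are valid.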

Error: generated MP3 exists but playback is inconsistent

Cause:

The player is opening the file while synthesis is still writing it, or each turn overwrites output_reply.mp3 while the previous reply is still playing.

Fix:

  1. confirm file plays manually first
  2. write timestamped files during testing
  3. only add auto-play after synthesis is stable
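Fix 2, timestamped output files, is a one-liner worth standardizing. A sketch (the filename pattern is my own convention):

```python
from datetime import datetime
from pathlib import Path


def timestamped_reply_path(directory: str = ".", prefix: str = "reply") -> Path:
    """Build a unique-per-second MP3 path like reply_20250101_120000.mp3."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return Path(directory) / f"{prefix}_{stamp}.mp3"
```

Passing this into synthesize_to_file during testing keeps every turn's audio on disk, which also makes regressions in voice quality easy to compare.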

11) Next steps: deploy, scale, swap components

Once local testing is stable, move from prototype to a service that survives real traffic.

Phase A: Harden reliability

  1. Add retries with exponential backoff around STT and TTS calls
  2. Add timeouts, structured logs, and a startup check for API credentials

Phase B: Reduce latency and cost

  1. Move from file generation toward streaming TTS when responsiveness matters
  2. Track usage against your ElevenLabs pricing tier before traffic grows

Phase C: Service architecture

  1. Wrap the loop in a single API service first; split into modular services only when traffic demands it

Phase D: Product-level capabilities

  1. Add tools the brain can call, such as the weather API hinted at in rule mode
  2. Add conversation memory and per-user voice configuration

A practical shipping strategy is: local monolith -> single API service -> modular services only when traffic demands it. That path keeps complexity proportional to real usage.

If you think your roadmap is likely to become streaming-first rather than file-generation first, the comparison with PlayHT shows where that alternative starts to make more architectural sense.

12) FAQ

Can I build this without OpenAI?

Yes. Rule mode is enough to run a full microphone -> response -> TTS loop. OpenAI is optional.

Which ElevenLabs method is used here?

This tutorial uses the verified Python SDK call: client.text_to_speech.convert(...).

Should I deploy as one app or multiple services first?

Start with one app for speed. Split services only after you have real usage and clear bottlenecks.

What should I optimize first for production?

Reliability before intelligence: retries, timeouts, logs, then quality tuning.

Can I swap SpeechRecognition later?

Yes. Because STT is isolated in stt_engine.py, you can replace it with another provider while preserving the rest of the pipeline.