
Building Multi-Agent Systems with OpenAI Agents SDK

A practical guide to building production multi-agent systems where specialized AI agents collaborate on complex tasks.

By Abdul Basit Khuhawar

Single-agent architectures hit a ceiling fast. One agent juggling research, writing, editing, and optimization produces mediocre results at every stage. The solution is multi-agent systems — architectures where specialized agents collaborate, each focused on what it does best.

I've been building multi-agent systems in production for months. This guide covers the architecture patterns, implementation with OpenAI's Agents SDK, and the production lessons that documentation doesn't teach you.

Why Multi-Agent?

A single LLM call can do many things adequately. But "adequately" isn't production-quality. When you decompose a complex task into specialized roles, each agent can:

  • Carry a focused system prompt optimized for its specific function
  • Use only the tools relevant to its domain
  • Be evaluated and improved independently
  • Be swapped out or upgraded without affecting the rest of the pipeline

This is the same principle behind microservices. Specialization enables quality at scale.

The Architecture

Every multi-agent system I build follows a consistent pattern:

Orchestrator Agent. The coordinator. It receives the initial task, breaks it into subtasks, delegates to specialist agents, and assembles the final output. It decides the execution order and handles failures.

Specialist Agents. Domain-focused workers. A writer agent, an editor agent, a research agent, an SEO agent. Each has a narrow system prompt, a curated tool set, and clear input/output contracts.

Handoff Protocol. The mechanism by which one agent passes work to another. This is where most multi-agent systems break. Clean handoffs require structured data, not freeform text.

Implementation with OpenAI Agents SDK

The OpenAI Agents SDK makes building these systems straightforward. Here's a concrete example: a content production pipeline with three agents.

Defining the Agents

from agents import Agent, Runner, handoff

research_agent = Agent(
    name="Researcher",
    instructions="""You are a research specialist. Given a topic,
    find key facts, statistics, and expert perspectives.
    Return structured research briefs with citations.
    Be thorough but concise. No filler.""",
    # web_search and document_reader are custom function tools, defined elsewhere
    tools=[web_search, document_reader],
)

writer_agent = Agent(
    name="Writer",
    instructions="""You are a technical writer. Given a research
    brief, produce a well-structured article. Write in a direct,
    authoritative voice. Every paragraph must earn its place.
    Use specific examples and data points from the research.""",
    tools=[],
)

editor_agent = Agent(
    name="Editor",
    instructions="""You are a senior editor. Review the draft for:
    - Factual accuracy against the research brief
    - Clarity and conciseness (cut every unnecessary word)
    - Logical flow and argument structure
    - SEO optimization for the target keyword
    Return the final polished version.""",
    # seo_analyzer is a custom function tool, defined elsewhere
    tools=[seo_analyzer],
)

Orchestrating with Handoffs

The key to clean multi-agent execution is defining explicit handoffs. The SDK supports this natively:

orchestrator = Agent(
    name="ContentOrchestrator",
    instructions="""You coordinate content production.
    1. Send the topic to the Researcher for a research brief
    2. Send the brief to the Writer for a first draft
    3. Send the draft + brief to the Editor for final polish
    Always pass complete context between agents.""",
    handoffs=[
        handoff(research_agent),
        handoff(writer_agent),
        handoff(editor_agent),
    ],
)

result = Runner.run_sync(
    orchestrator,
    input="Write a 1500-word article on agentic AI in healthcare"
)
print(result.final_output)

Structured Handoff Data

Freeform text handoffs cause information loss. I define Pydantic models for inter-agent communication:

from pydantic import BaseModel

class ResearchBrief(BaseModel):
    topic: str
    key_facts: list[str]
    statistics: list[dict]
    expert_quotes: list[str]
    sources: list[str]

class ArticleDraft(BaseModel):
    title: str
    content: str
    word_count: int
    target_keywords: list[str]

Each agent's output is validated against these schemas. If an agent returns malformed data, the orchestrator catches it and retries — before it poisons the downstream pipeline.

Production Patterns

Error Handling and Retries

Agents fail. Models hallucinate. APIs time out. Production systems need graceful degradation:

import asyncio
from agents import Runner

async def run_with_retry(agent, input_text, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = await Runner.run(agent, input=input_text)
            return result.final_output
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the failure to the caller
            await asyncio.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s

Evaluation Loops

The editor agent isn't enough. I add a separate evaluation step that scores the output against quantitative criteria:

evaluator = Agent(
    name="QualityEvaluator",
    instructions="""Score the article on a 1-10 scale for:
    - Factual accuracy
    - Writing quality
    - SEO optimization
    - Completeness
    Return JSON with scores and specific improvement notes.
    If any score is below 7, the article needs revision.""",
)

If the evaluator flags issues, the orchestrator routes back to the relevant specialist. This creates a feedback loop that converges on quality.
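The routing decision itself is plain logic once the evaluator's JSON is parsed. A minimal sketch, assuming a hypothetical mapping from evaluation criteria to the specialist that owns each one:

```python
REVISION_THRESHOLD = 7

# Hypothetical mapping: which specialist owns each evaluation criterion
CRITERION_OWNER = {
    "factual_accuracy": "Researcher",
    "writing_quality": "Writer",
    "seo_optimization": "Editor",
    "completeness": "Writer",
}

def route_revisions(scores: dict[str, int]) -> list[str]:
    """Return the specialists to re-run, based on scores below threshold."""
    owners = [CRITERION_OWNER[c] for c, s in scores.items()
              if s < REVISION_THRESHOLD]
    return sorted(set(owners))
```

The orchestrator loops: evaluate, route, re-run the flagged specialists, and re-evaluate, with a hard cap on iterations so a stubborn draft cannot loop forever.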

Cost Management

Multi-agent systems multiply API calls. Every agent call costs tokens. Production systems need guardrails:

  • Token budgets per agent. The researcher gets 4K output tokens. The writer gets 8K. Enforce these limits.
  • Model tiering. Not every agent needs GPT-4o. The researcher benefits from strong reasoning. The formatter can run on GPT-4o-mini at a tenth the cost.
  • Caching. If the same research query runs twice, serve from cache. I use Redis with a 24-hour TTL for research briefs.

Observability

You cannot debug what you cannot see. Every agent call should log:

  • Input and output (truncated for storage)
  • Token usage and latency
  • Tool calls made
  • Handoff source and destination
  • Any errors or retries

I pipe all of this into a structured logging system. When something goes wrong — and it will — you need the full trace.

When Not to Use Multi-Agent

Multi-agent systems add complexity. Use them when:

  • The task genuinely has distinct phases requiring different capabilities
  • Quality requirements demand specialized evaluation at each stage
  • You need to iterate on individual components independently

Don't use them for simple tasks that a single well-prompted agent handles fine. Over-engineering is a real risk.

Getting Started

If you're building your first multi-agent system:

  1. Start with two agents — a worker and an evaluator. Get the feedback loop right before adding more specialists.
  2. Define structured interfaces between agents from day one. Don't let freeform text become your contract.
  3. Instrument everything. Add logging before you add features.
  4. Set token budgets early. Runaway costs in development become production nightmares.
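Token budgets are worth enforcing in code, not just in prompts. A minimal sketch of per-agent budget tracking; the budget figures and names are illustrative:

```python
# Hypothetical per-agent output-token budgets
TOKEN_BUDGETS = {"Researcher": 4096, "Writer": 8192, "Editor": 4096}

class BudgetExceeded(Exception):
    """Raised when an agent's cumulative token usage exceeds its budget."""

def charge(usage: dict[str, int], agent: str, tokens: int) -> None:
    """Record token usage for an agent; raise if its budget is blown."""
    spent = usage.get(agent, 0) + tokens
    if spent > TOKEN_BUDGETS[agent]:
        raise BudgetExceeded(
            f"{agent} exceeded its {TOKEN_BUDGETS[agent]}-token budget"
        )
    usage[agent] = spent
```

The orchestrator calls `charge` after every agent run and treats `BudgetExceeded` like any other failure: abort, degrade, or fall back to a cheaper model tier.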

The OpenAI Agents SDK handles the infrastructure — tool execution, handoffs, context management. Your job is designing the right agent roles, the right tool sets, and the right evaluation criteria.

I've been building these systems for production workloads and the results speak for themselves. If you're exploring multi-agent architectures for your business, I'm happy to walk through what a system would look like for your specific use case.

Want me to build this for your business?

Get in touch to discuss how agentic AI can transform your operations.