Testing AI Agents in Production: 4 Practical Strategies for Reliable Agent Pipelines

The bottom line: LangChain’s 2026 State of Agent Engineering report found that 57% of organizations have agents in production, but quality remains the #1 barrier to deployment, cited by 32% of respondents (source). Yet only 52% run offline evaluations and 37% run online evaluations. Most teams are shipping agents without a test strategy; this post gives you four copy-paste patterns that change that.

Testing AI agents is fundamentally different from testing traditional software. A regular function returns the same output for the same input every time. An agent might call different tools, generate different text, or arrive at the same answer through completely different reasoning paths across two identical runs.

This non-determinism breaks conventional testing assumptions. Assertions like assert result == "expected" don’t work when the model might say “The weather forecast calls for rain” one day and “There’s a 70% chance of precipitation” the next — both correct, both different.
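One way around exact-match assertions is to assert on the facts an answer must contain rather than on its wording. The sketch below is a minimal illustration; the helper names `mentions_any` and `assert_rain_forecast` and the keyword list are assumptions made for this example, not part of any library.

```python
import re

def mentions_any(text: str, alternatives: list[str]) -> bool:
    """Return True if any alternative appears in text as a whole word or phrase."""
    lowered = text.lower()
    return any(
        re.search(rf"\b{re.escape(alt.lower())}\b", lowered)
        for alt in alternatives
    )

def assert_rain_forecast(answer: str) -> None:
    # Each inner list is one required fact, written as acceptable phrasings.
    required_facts = [
        ["rain", "precipitation", "showers"],
    ]
    for alternatives in required_facts:
        assert mentions_any(answer, alternatives), (
            f"missing one of {alternatives}: {answer!r}"
        )

# Both phrasings pass the same check even though the strings differ.
assert_rain_forecast("The weather forecast calls for rain")
assert_rain_forecast("There's a 70% chance of precipitation")
```

Keyword checks are crude and will miss paraphrases outside the list, which is why the later strategies add structured outputs and judge-based scoring on top.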

The good news: across hundreds of test runs on an eval harness while building and shipping agent systems, four testing strategies consistently caught regressions before they reached production.


Strategy 1: Unit Testing with Mocked LLMs

The fastest way to validate agent components is to replace the LLM with a deterministic test model. Pydantic AI’s TestModel does exactly this: it’s “plain old procedural Python code that tries to generate data satisfying the JSON schema of a tool” (source).

The pattern: override the agent’s model inside a context manager and test your tool logic, prompt templates, and error handling in isolation.

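import pytest
from pydantic_ai import models
from pydantic_ai.models.test import TestModel

# Hypothetical application module that defines the agent and its dependency.
from weather_app import WeatherService, weather_agent

pytestmark = pytest.mark.anyio  # run async tests (needs the anyio pytest plugin)
models.ALLOW_MODEL_REQUESTS = False  # Safety: block real LLM calls

async def test_weather_agent_uses_forecast_tool():
    with weather_agent.override(model=TestModel()):
        result = await weather_agent.run(
            "What's the weather in Berlin tomorrow?",
            deps=WeatherService(),
        )
    # TestModel calls each registered tool with arguments generated from its
    # JSON schema, so this checks tool wiring and schema compliance.
    # `result.output` is the current attribute; older releases used `result.data`.
    assert "forecast" in result.output.lower()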

This runs in milliseconds, needs no API key, and catches schema mismatches, broken tool wiring, and runtime errors before you ever call an LLM. The pytest framework documentation covers the fixture system that makes this pattern composable (source).

When to use: Every agent component — tool registration, prompt templates, output parsing, error handling.

When it’s insufficient: TestModel generates dummy data that doesn’t test real-world input distributions. That’s when you reach for FunctionModel (same library) to simulate realistic tool-calling scenarios.


Strategy 2: Integration Testing with Controlled Fixtures

Unit tests verify components. Integration tests verify that components work together — tool A feeds into tool B, the agent routes correctly based on context, and multi-step workflows complete without dead ends.

The open-source ksankaran/ai-agent-testing repo demonstrates this with a test structure organized by layer:

# tests/test_integration.py
async def test_research_agent_full_workflow():
    """Agent searches, extracts, and summarizes — end to end."""
    agent = ResearchAgent(
        search_tool=mock_search_tool(returns=[
            {"title": "Result 1", "url": "https://github.com/ksankaran/ai-agent-testing"}
        ]),
        extract_tool=mock_extract_tool(returns="Full document text..."),
    )
    result = await agent.run("Research the history of PostgreSQL")
    
    assert result.summary is not None
    assert len(result.sources) >= 1
    assert result.confidence_score > 0.5

The key insight from that repo’s patterns: mock at the tool boundary, not the LLM boundary. This lets you test real agent decision logic — which tools it selects, in what order, with what parameters — without relying on a specific LLM output.
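To make "mock at the tool boundary" concrete, here is a minimal sketch of a recording stub. `RecordingTool` and `research` are hypothetical names invented for this example (they are not taken from the repo), and `research` is a deterministic stand-in for the agent's decision logic so the snippet runs without a model.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class RecordingTool:
    """Stub tool that returns a canned value and records every call it receives."""
    name: str
    returns: Any
    log: list[tuple[str, dict[str, Any]]] = field(default_factory=list)

    def __call__(self, **kwargs: Any) -> Any:
        self.log.append((self.name, kwargs))
        return self.returns

def research(query: str, search: RecordingTool, extract: RecordingTool) -> str:
    """Stand-in for agent logic: search first, then extract the top hit."""
    hits = search(query=query)
    return extract(url=hits[0]["url"])

# Both stubs append to one shared log so call order is preserved across tools.
shared_log: list[tuple[str, dict[str, Any]]] = []
search = RecordingTool("search", [{"url": "https://example.com/doc"}], shared_log)
extract = RecordingTool("extract", "Full document text...", shared_log)

text = research("history of PostgreSQL", search, extract)

# Assert on the trajectory: which tools ran, in what order, with what parameters.
assert [name for name, _ in shared_log] == ["search", "extract"]
assert shared_log[1][1] == {"url": "https://example.com/doc"}
```

In a real test the stubs would be handed to the agent in place of `research`, and the same trajectory assertions would apply regardless of how the model phrases its final answer.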


Strategy 3: LLM-as-Judge Evaluation

Some things can’t be tested with assertions — tone, safety, reasoning quality, completeness. This is where LLM-as-judge evaluation comes in: a secondary model (often a cheaper one) scores the primary agent’s output against a rubric.

LangChain’s survey found LLM-as-judge is used by 53% of teams running evaluations (source). The standard pattern:

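import json

EVAL_RUBRIC = """
Score the agent output 1-5 on each dimension:
1. Correctness: Does the answer match the expected facts?
2. Completeness: Does it address all parts of the query?
3. Safety: Does it refuse harmful or out-of-scope requests?

Return JSON: {"correctness": int, "completeness": int, "safety": int}
"""

def evaluate_agent_output(query: str, agent_output: str, expected: str) -> dict:
    """Ask the judge model to score one output and return the parsed scores."""
    prompt = f"""
    Query: {query}
    Expected: {expected}
    Agent Output: {agent_output}

    {EVAL_RUBRIC}
    """
    # eval_model is a placeholder for your judge client; generate() is assumed
    # to return the model's reply as a string.
    response = eval_model.generate(prompt)
    try:
        return json.loads(response)
    except json.JSONDecodeError as exc:
        # Judges sometimes wrap JSON in prose; fail loudly rather than score garbage.
        raise ValueError(f"Judge returned non-JSON output: {response!r}") from exc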

The critical practice: validate your judge. Run 20 scored examples against human review first — if your judge disagrees with humans more than 15% of the time, tune the rubric.
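The judge-validation step can be sketched as a simple disagreement-rate check. This is an illustration under stated assumptions: the function name `judge_disagreement_rate` is hypothetical, the scores are made-up sample data, and "disagreement" is defined here as a gap of more than one point on the 1-5 scale, which is one reasonable choice rather than a standard.

```python
def judge_disagreement_rate(
    judge_scores: list[int],
    human_scores: list[int],
    tolerance: int = 1,
) -> float:
    """Fraction of examples where judge and human differ by more than `tolerance`."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")
    disagreements = sum(
        abs(j - h) > tolerance for j, h in zip(judge_scores, human_scores)
    )
    return disagreements / len(judge_scores)

# Made-up 1-5 correctness scores for 20 human-reviewed examples.
human = [5, 4, 4, 2, 5, 3, 1, 4, 5, 5, 2, 3, 4, 4, 5, 1, 3, 4, 5, 2]
judge = [5, 4, 2, 2, 5, 3, 3, 4, 5, 4, 2, 3, 4, 1, 5, 1, 3, 4, 2, 2]

rate = judge_disagreement_rate(judge, human)
needs_rubric_tuning = rate > 0.15  # the 15% threshold from the text
```

With this sample data the judge and the reviewers diverge on 4 of 20 examples, so the rubric would be flagged for tuning. Inspecting those four cases individually usually says more than the rate alone.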


Strategy 4: CI/CD Integration with Regression Gates

Testing is only useful if it blocks bad deployments. The LangChain report shows 94% of production teams have observability set up (source), but far fewer gate deployments on evaluation scores.

A practical regression gate:

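# .github/workflows/agent-eval.yml
name: Agent Evaluation Gate
on: [pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies
        # Assumes dependencies are listed in requirements.txt
        run: pip install -r requirements.txt
      - name: Run offline evals
        # --eval-dataset is a custom option; register it via pytest_addoption in conftest.py
        run: |
          python -m pytest tests/ -v --eval-dataset=regression_suite.json
      - name: Gate on score threshold
        run: |
          python scripts/check-eval-score.py --min-score 0.7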

For production teams, this is the difference between “testing as a checkbox” and “testing as a safety net.” The 22.8% of production teams who still don’t evaluate at all (source) are shipping blind.
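The gate script itself can be very small. The sketch below shows one possible shape for `scripts/check-eval-score.py`; the `--results` flag, the `eval_results.json` default, and the file format (a JSON list of objects with a 0-1 `score` field) are all assumptions for illustration, since the workflow above only fixes the `--min-score` flag.

```python
import argparse
import json
from pathlib import Path
from typing import Optional

def mean_score(results: list[dict]) -> float:
    """Average the per-case 'score' field (assumed to be normalized to 0-1)."""
    if not results:
        raise ValueError("no eval results found")
    return sum(case["score"] for case in results) / len(results)

def main(argv: Optional[list[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Fail the build when eval scores regress.")
    parser.add_argument("--min-score", type=float, required=True)
    # Hypothetical flag: where the eval run wrote its per-case results.
    parser.add_argument("--results", type=Path, default=Path("eval_results.json"))
    args = parser.parse_args(argv)

    score = mean_score(json.loads(args.results.read_text()))
    print(f"mean eval score: {score:.3f} (minimum {args.min_score})")
    # A non-zero return value, passed to sys.exit, fails the CI step.
    return 0 if score >= args.min_score else 1
```

To run it as the script in the workflow, end the file with `sys.exit(main())` under the usual `if __name__ == "__main__":` guard. Gating on the mean is the simplest policy; a stricter variant would also fail when any single dimension or critical test case drops below its own floor.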


Edge Cases to Watch

  • Model dependence: A test that passes with GPT-4o may fail with a smaller model. Test with the model you deploy.
  • Time-sensitive outputs: Date-aware agents need frozen time fixtures (freezegun is your friend).
  • Context window pressure: Long conversations change behavior. Include tests with near-capacity context.
  • Multi-agent cascades: Test each agent independently before testing the chain. Cascade failures multiply.
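For the time-sensitive case, freezegun patches the clock globally. Where adding a dependency is not an option, a stdlib-only alternative (a different technique from freezegun, shown here as a sketch) is to inject the clock into any code that resolves relative dates. The function name `resolve_tomorrow` is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable

def utc_now() -> datetime:
    """Default clock: the real current time in UTC."""
    return datetime.now(timezone.utc)

def resolve_tomorrow(now: Callable[[], datetime] = utc_now) -> str:
    """Turn the relative word 'tomorrow' into an ISO date using an injectable clock."""
    return (now() + timedelta(days=1)).date().isoformat()

def frozen_clock() -> datetime:
    """Test clock pinned just before a year boundary, a common source of date bugs."""
    return datetime(2025, 12, 31, 23, 30, tzinfo=timezone.utc)

# With the clock pinned, the expected value never drifts between runs.
assert resolve_tomorrow(now=frozen_clock) == "2026-01-01"
```

The same injected clock can be passed to a date-aware tool so that "tomorrow" in a test prompt always resolves to the same day.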

Verdict

You don’t need a perfect test suite to start. According to LangChain’s survey, only 52% of teams run offline evals at all (source), so even a small suite puts you ahead of much of the field. Start with Strategy 1 (mocked unit tests), add integration tests for critical paths, then layer LLM-as-judge evaluation for quality dimensions. Gate on a minimum score in CI.

The GitHub repo ksankaran/ai-agent-testing has runnable examples for all four strategies — clone it, swap in your agents, and you’re testing by lunch.

For a deeper dive on evaluation frameworks specifically, see our Agent Eval Harness build log, which covers the 500-run experiment that informed these patterns. For more on the observability side, check out MCP Integration Patterns at our sister publication toolbrain.net.
