This cheat sheet has (obviously?) been made by a language model from the transcript of the PyBay 2025 talk by Pamela Fox.
Use LMs as fast test scaffolding, not as oracles: humans still define invariants, edge cases, and what “correct” means.
Quick Checklist
1. Reduce Redundancy
Use parametrization instead of near-duplicate tests, and centralize setup with fixtures that also clean up.
Key ideas:
- Parametrize when only a few values differ.
- Keep shared setup/teardown in fixtures, not copy-pasted into every test.
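A minimal sketch of both ideas (the normalize_name function and the contacts fixture are illustrative, not from the talk): one parametrized test replaces several near-duplicates, and the code after yield in the fixture is the teardown.

```python
import pytest

# Illustrative only: a tiny pure function to test.
def normalize_name(raw: str) -> str:
    return " ".join(raw.split()).title()

@pytest.fixture
def contacts():
    store = {"ada": "ada@example.com"}  # shared setup lives in one place
    yield store
    store.clear()                       # teardown runs after each test

@pytest.mark.parametrize(
    "raw, expected",
    [
        ("  ada   lovelace ", "Ada Lovelace"),
        ("GRACE HOPPER", "Grace Hopper"),
        ("élodie durand", "Élodie Durand"),
    ],
)
def test_normalize_name(raw, expected):
    assert normalize_name(raw) == expected

def test_lookup_known_contact(contacts):
    assert contacts["ada"] == "ada@example.com"
```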
2. Make Fake Data Look Real and Diverse
Prefer data-generation libraries such as Faker over hand-rolled strings; seed them for reproducibility and control locales.
Also explicitly test edge cases:
- Names with non-ASCII characters
- Names with multiple middle names
- Names with accents or hyphens
- Very long or single-character names
- Addresses and locales outside your usual region
LMs tend to give “John Doe” / “Jane Smith”; your tests shouldn’t.
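A seeded, multi-locale Faker sketch (the make_profile helper and the checks are illustrative):

```python
from faker import Faker

Faker.seed(0)                               # same fake data on every run
fake = Faker(["en_US", "ja_JP", "es_MX"])   # draw names/addresses from several locales

def make_profile():
    # Illustrative helper: whatever your code builds from user-supplied data.
    return {"name": fake.name(), "address": fake.address()}

def test_profile_accepts_names_from_any_locale():
    for _ in range(20):
        profile = make_profile()
        assert profile["name"].strip()       # never empty or whitespace-only
        assert "\n" not in profile["name"]   # sanity check, not a business rule
```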
3. Assert Behavior, Not Just Shape
Don’t only check that keys exist—validate values and constraints. For JSON APIs, snapshot full responses when that’s more expressive.
Guidelines:
- Snapshot only stable fields (no timestamps, random IDs, etc.), or filter before snapshotting.
- Treat snapshot diffs like code changes: review them, don’t click “accept” blindly.
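A sketch of the snapshot idea using the syrupy pytest plugin (one option among several; the route, fields, and client fixture are assumptions, not from the talk):

```python
# Assumes a `client` fixture (e.g. FastAPI TestClient) and the syrupy plugin,
# which provides the `snapshot` fixture. Route and field names are illustrative.
def test_bee_detail_matches_snapshot(client, snapshot):
    data = client.get("/bees/1").json()
    data.pop("updated_at", None)   # strip unstable fields before comparing
    assert data == snapshot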
4. Treat Coverage as a Tool, Not a Target
Coverage highlights what you forgot to test; it does not prove correctness.
- Run line and branch coverage.
- Inspect uncovered lines and branches, especially around conditionals and error handling.
- Add tests that exercise those paths with meaningful assertions.
Even at ~100% line coverage, you may still be missing behaviors—property-based tests (§5) and fuzzing (§6) help there.
5. Break the Happy Path with Property-Based Testing
Use Hypothesis to explore tricky inputs (empty/huge/negative/unicode, weird floats, etc.).
You can also model realistic numeric ranges:
```python
from hypothesis import given, strategies as st

@given(
    lat=st.floats(min_value=-90, max_value=90, allow_nan=False),
    lon=st.floats(min_value=-180, max_value=180, allow_nan=False),
)
def test_bees_active_accepts_valid_coords(client, lat, lon):
    res = client.get("/bees/active", params={"lat": lat, "lon": lon})
    assert res.status_code in {200, 400, 404}  # but not 500
```
Add input validation so invalid inputs yield clear 4xx responses, not 500s.
6. Fuzz APIs End-to-End from the Spec
Given an OpenAPI/Swagger spec, use Schemathesis (built on Hypothesis) to generate inputs across routes/params and reproduce failures with minimal examples or curl commands.
Conceptual example:
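(A sketch in the Schemathesis 3.x style; the URL is a placeholder and exact API names vary between versions.)

```python
import schemathesis

# Point Schemathesis at the running app's OpenAPI spec (placeholder URL).
schema = schemathesis.from_uri("http://localhost:8000/openapi.json")

@schema.parametrize()          # one generated test per operation in the spec
def test_api_conforms_to_schema(case):
    case.call_and_validate()   # sends the request and checks the response
                               # against the spec (no 500s, schema violations, etc.)
```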
This explores your API like Hypothesis explores functions, and gives you a concrete request to reproduce each failure.
7. Make Tests Deterministic and Re-Runnable
- Seed Faker, random, NumPy, and Hypothesis.
- Freeze time if logic depends on “now”.
- Ensure fixtures clean DB, files, and any in-memory/global state.
- Run the suite twice and in parallel (locally/CI) to reveal hidden shared state.
- Pin dependency versions and control environment variables (TZ, feature flags, API base URLs).
If a test passes sometimes and fails sometimes, assume there is a bug until proven otherwise.
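A conftest.py-style sketch of the seeding and time-freezing pieces (assumes the freezegun package; the seed value and fixture are illustrative):

```python
import random
from datetime import datetime

import pytest
from faker import Faker
from freezegun import freeze_time

SEED = 1234  # arbitrary, but fixed

@pytest.fixture(autouse=True)
def _seed_everything():
    random.seed(SEED)
    Faker.seed(SEED)
    # If you use NumPy: numpy.random.seed(SEED)
    # For Hypothesis, use settings profiles (e.g. derandomize=True) instead.
    yield

@freeze_time("2025-01-01 12:00:00")
def test_report_date_is_stable():
    assert datetime.now().year == 2025
```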
8. Prefer Higher-Value Test Levels When Appropriate
For user-facing web apps, bias toward integration/E2E tests (e.g., Playwright) that cover real user journeys:
- “User signs up, logs in, creates X, sees Y”
- “User searches with weird filters and still gets a reasonable result or a clear error”
Support these with a smaller set of focused unit tests for core pure logic (parsers, calculations, transformations).
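A sketch of one such journey with pytest-playwright (the URL, form labels, and success text are placeholders for your app):

```python
from playwright.sync_api import Page, expect

# Assumes the pytest-playwright plugin, which provides the `page` fixture.
def test_user_signs_up_and_sees_dashboard(page: Page):
    page.goto("http://localhost:8000/signup")
    page.get_by_label("Email").fill("ada@example.com")
    page.get_by_label("Password").fill("correct-horse-battery-staple")
    page.get_by_role("button", name="Sign up").click()
    expect(page.get_by_text("Welcome")).to_be_visible()
```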
9. Use LMs as Generalists
Let an LM scaffold, but enforce specialist practices (pytest features, Faker, Hypothesis, snapshots), and always review/refactor.
Prompt stub (adapt as needed):
Write pytest tests for a FastAPI app. Use fixtures from conftest.py. Parametrize similar cases; use fixtures for DB setup/teardown; use Faker with a fixed seed and explicit locale; add at least one Hypothesis test per route’s inputs; write one snapshot test per JSON route (only snapshot stable fields); run coverage and add tests for uncovered lines/branches; ensure tests are idempotent and re-runnable.
LMs are good at volume; humans are responsible for correctness.
10. Practical Workflow
- Establish robust fixtures (app/client/DB) with cleanup.
- Generate a first pass with an LM using the prompt above.
- Refactor: parametrize, centralize factories, seed Faker, add snapshots.
- Add Hypothesis on risky inputs; fix validation to return 4xx, not 500.
- Coverage pass: close gaps for important lines/branches.
- Add E2E smoke tests for key flows (login, main user journeys).
- Ensure determinism: seed, freeze time, run twice/parallel, pin deps.
- Merge gates: green CI, coverage threshold met, snapshot diffs reviewed, human review of LM-generated tests.