AI Agents Are Only as Good as Your Test Suite — Here’s Why That Changes Everything About How You Write Tests

Jeffrey Way said it at PHPverse 2026: “If you’re leveraging AI agents, a full test suite is a requirement.” Here’s what that means in practice — the tests you need before you can trust an agent to touch your codebase, the ones that catch what AI gets wrong, and the coverage threshold that makes agentic development safe.

Jeffrey Way stood up at PHPverse 2026 and said something that should have been obvious but wasn’t being said often enough: if you’re leveraging AI agents, a full test suite is a requirement, not a nice-to-have. Most developers heard that and nodded along without changing anything about how they write tests. The statement deserves more than a nod. It’s a structural claim about how software gets built now, and it inverts a priority order most teams have held for a decade — tests as the thing you write after the feature works, time permitting, often skipped under deadline pressure.

That order doesn’t survive contact with agentic development. When a human writes a feature, the test suite documents intent and catches regressions. When an AI agent writes a feature, the test suite is the only thing standing between “the agent says this works” and “this actually works.” Those are not the same claim, and conflating them is how teams end up shipping confidently broken code.

What Changes When the Author Isn’t a Human

A human developer who writes a function generally knows what they intended it to do. If the function is wrong, they have a mental model to compare it against — they wrote the model, they can debug against it. An AI agent that writes a function has no persistent mental model across sessions. It generated plausible code based on the prompt and the context it had access to. It can self-correct within a session — modern agents run the test suite, see failures, and fix the implementation — but that self-correction is only as good as what the test suite catches.

Example: Claude Code workflow
1. Agent generates authentication middleware
2. Runs existing test suite → 3 tests fail
3. Analyzes failure logs
4. Fixes implementation
5. Re-runs tests → all pass
6. Presents final diff for review

This loop is the actual mechanism of modern agentic coding — plan, write, test, analyze failures, fix, retest. It’s also where the entire safety model lives. Step 2 only catches what step 2’s test suite is capable of catching. If the test suite doesn’t assert the right things, step 5 reports “all pass” on code that’s wrong in ways nobody is checking for. The agent isn’t lying. It’s accurately reporting that the tests you wrote are satisfied. The gap is between what you tested and what you needed.

This is the inversion: the test suite isn’t documentation of what a human already verified by writing the code carefully. It’s the verification itself. With a human author, the test suite is the second line of defense. With an agent author, it’s the first and sometimes the only line.

The Audit Finding That Should Change How You Read Agent Benchmarks

Independent analysis of SWE-bench Verified, a widely used benchmark for measuring whether AI agents can fix real GitHub issues, found that a significant proportion of the hardest verified tasks have test suites that wouldn’t actually catch the intended bug. Frontier models score well against these benchmarks partly because the bar the test suite sets is lower than it appears.

This finding generalizes past the benchmark. It’s not a flaw specific to SWE-bench but a structural fact about test suites in general: a test suite of unknown quality cannot reliably gate AI-generated code, because “the tests pass” and “the code is correct” are only the same statement when the test suite has actually closed the gap between them. A weak test suite makes broken code look shippable. An agent that’s good at making tests pass is not the same as an agent that’s good at writing correct code, unless your tests are good enough that the only way to pass them is to be correct.
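To make that gap concrete, here is a minimal, framework-free sketch in plain PHP. The two implementations and the two checks are hypothetical, invented for illustration rather than taken from any real codebase. One implementation is correct; the other subtracts the percentage as if it were an amount in cents. Only the behavioral check tells them apart.

```php
<?php
declare(strict_types=1);

// Hypothetical correct implementation: totals are integer cents.
function discountCorrect(int $totalCents, int $percent): int
{
    return $totalCents - intdiv($totalCents * $percent, 100);
}

// Hypothetical buggy implementation: treats the percentage as cents.
function discountBuggy(int $totalCents, int $percent): int
{
    return $totalCents - $percent;
}

// Weak check: only constrains the return type.
function weakCheck(callable $discount): bool
{
    return is_int($discount(10000, 10));
}

// Strong check: constrains the actual value for known inputs.
function strongCheck(callable $discount): bool
{
    return $discount(10000, 10) === 9000
        && $discount(10000, 0) === 10000
        && $discount(999, 50) === 500;
}

foreach (['discountCorrect', 'discountBuggy'] as $impl) {
    printf(
        "%s: weak=%s strong=%s\n",
        $impl,
        weakCheck($impl) ? 'pass' : 'fail',
        strongCheck($impl) ? 'pass' : 'fail'
    );
}
// Prints:
// discountCorrect: weak=pass strong=pass
// discountBuggy: weak=pass strong=fail
```

An agent iterating against only the weak check would report success on both versions, and it would be telling the truth.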

The Test Categories That Actually Gate Agent Output

Not all tests provide equal protection against AI-generated mistakes. Some categories catch the specific failure modes agents are prone to; others provide much weaker coverage even at high line-coverage percentages.

Feature tests over unit tests for anything agent-touched

Unit tests verify a function in isolation. An agent can write a function that passes its own unit test perfectly while breaking the integration that function lives inside — wrong return type assumption, a side effect the agent didn’t model, an Eloquent relationship that behaves differently than the agent assumed.

// Unit test — passes, tells you almost nothing about correctness
public function test_calculate_discount_returns_a_float(): void
{
    $service = new DiscountService;
    $result  = $service->calculate(100, 0.1);

    $this->assertIsFloat($result);
}

// Feature test — exercises the actual code path an agent's change runs through
public function test_applying_a_discount_code_reduces_order_total(): void
{
    $order = Order::factory()->create(['total' => 100_00]);
    $code  = DiscountCode::factory()->create(['percentage' => 10]);

    $response = $this->actingAs($order->customer)
        ->postJson("/orders/{$order->id}/discount", ['code' => $code->code]);

    $response->assertOk();
    $this->assertEquals(90_00, $order->fresh()->total);
    $this->assertDatabaseHas('discount_redemptions', [
        'order_id'         => $order->id,
        'discount_code_id' => $code->id,
    ]);
}

The feature test asserts the thing that actually matters — the order total changed correctly, a redemption record exists, the response is correct. An agent refactoring DiscountService internals has to keep all of that true. The unit test only constrains the return type, which an agent can satisfy while breaking everything else.

Negative-path and edge-case tests — the category agents skip by default

When an agent writes a feature, it tends to write tests for the happy path it was reasoning about. It does not reliably write tests for the inputs nobody mentioned — empty strings, negative numbers, concurrent requests, malformed payloads, missing relationships. This is the category most worth writing yourself, deliberately, before letting an agent touch the surrounding code.

public function test_discount_code_cannot_be_applied_twice_to_the_same_order(): void
{
    $order = Order::factory()->create();
    $code  = DiscountCode::factory()->create();

    $this->actingAs($order->customer)
        ->postJson("/orders/{$order->id}/discount", ['code' => $code->code])
        ->assertOk();

    $response = $this->actingAs($order->customer)
        ->postJson("/orders/{$order->id}/discount", ['code' => $code->code]);

    $response->assertStatus(422);
    $response->assertJsonValidationErrors(['code']);
}

public function test_expired_discount_codes_are_rejected(): void
{
    $order = Order::factory()->create();
    $code  = DiscountCode::factory()->create(['expires_at' => now()->subDay()]);

    $response = $this->actingAs($order->customer)
        ->postJson("/orders/{$order->id}/discount", ['code' => $code->code]);

    $response->assertStatus(422);
}

public function test_discount_application_is_protected_against_race_conditions(): void
{
    $code = DiscountCode::factory()->create(['max_uses' => 1]);

    // Two different orders compete for the same single-use code. The test
    // sends these requests one after the other, so it pins the single-use
    // rule itself; truly parallel requests need a separate concurrency test.
    $responses = Order::factory()->count(2)->create()->map(
        fn (Order $order) => $this->actingAs($order->customer)
            ->postJson("/orders/{$order->id}/discount", ['code' => $code->code])
    );

    $successCount = $responses
        ->filter(fn ($response) => $response->status() === 200)
        ->count();

    $this->assertEquals(1, $successCount, 'Exactly one request should succeed');
}

These three tests encode business rules that an agent has no way of inferring from the codebase alone — “a code can’t be applied twice,” “expired codes are rejected,” “concurrent redemption of a single-use code must not double-apply.” Without these tests existing before the agent works on the discount system, there’s nothing stopping a refactor from silently breaking any of them. With these tests in place, an agent that breaks the race condition protection gets a failing test immediately, in the same loop where it’s already running tests and fixing failures.
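One caveat on the race-condition rule: requests issued one after another inside a single test process never interleave, so a test like that verifies the single-use limit but not behavior under real parallelism. The sketch below is a deterministic, in-memory illustration of the difference. CodeRow, canRedeem, and recordRedemption are hypothetical names, and the class is only a stand-in for a database row; in a real application the fix would typically live at the database level (a lock or a conditional update), which this sketch does not show. It assumes PHP 8.0 or later for constructor promotion.

```php
<?php
declare(strict_types=1);

// Hypothetical in-memory stand-in for a discount_codes row.
final class CodeRow
{
    public function __construct(public int $uses = 0, public int $maxUses = 1) {}
}

// Check-then-act split into two steps, the way a naive controller might do it.
function canRedeem(CodeRow $row): bool
{
    return $row->uses < $row->maxUses;
}

function recordRedemption(CodeRow $row): void
{
    $row->uses++;
}

// Sequential requests: each one checks and commits before the next starts.
$sequential = new CodeRow();
$sequentialSuccesses = 0;
foreach ([1, 2] as $request) {
    if (canRedeem($sequential)) {
        recordRedemption($sequential);
        $sequentialSuccesses++;
    }
}

// Interleaved requests: both pass the check before either one commits.
$interleaved = new CodeRow();
$checkA = canRedeem($interleaved);
$checkB = canRedeem($interleaved);
$interleavedSuccesses = 0;
foreach ([$checkA, $checkB] as $passedCheck) {
    if ($passedCheck) {
        recordRedemption($interleaved);
        $interleavedSuccesses++;
    }
}

echo "sequential: {$sequentialSuccesses} success(es), uses={$sequential->uses}\n";
echo "interleaved: {$interleavedSuccesses} success(es), uses={$interleaved->uses}\n";
// Prints:
// sequential: 1 success(es), uses=1
// interleaved: 2 success(es), uses=2
```

The same naive code passes the sequential scenario and double-applies in the interleaved one, which is why the concurrency rule deserves its own dedicated test rather than being assumed from the sequential one.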

Contract and schema tests — the category that catches silent API drift

When an agent refactors a controller or modifies a resource transformer, the response shape can change in ways that are individually reasonable but collectively break every client consuming that endpoint. Schema tests pin the contract.

public function test_order_resource_response_matches_the_expected_schema(): void
{
    $order = Order::factory()->has(OrderLine::factory()->count(3))->create();

    $response = $this->actingAs($order->customer)
        ->getJson("/api/orders/{$order->id}");

    $response->assertJsonStructure([
        'data' => [
            'id',
            'type',
            'attributes' => [
                'total',
                'status',
                'created_at',
            ],
            'relationships' => [
                'lines' => [
                    'data',
                ],
            ],
        ],
    ]);

    // Pin the types, not just the keys: an agent could change
    // total from integer cents to a formatted string and the
    // structure test above would still pass
    $response->assertJsonPath('data.attributes.total', fn ($value) => is_int($value));
}

Structure alone isn’t enough — assertJsonStructure confirms the keys exist, not that an agent didn’t change total from an integer (cents) to a formatted string. The type assertion on top of the structure assertion is what actually pins the contract.
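The same idea can be sketched without the framework. The helper below is hypothetical (contractViolations is not a Laravel or PHPUnit function); it walks a decoded JSON payload and checks both key presence and PHP type against a small hand-written contract. Treat it as an illustration of why type pinning catches what key checks miss, not as a replacement for a real schema validator. It assumes PHP 8.0 or later for get_debug_type().

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns a list of contract violations.
 * $contract maps dot-notation paths to expected get_debug_type() names.
 */
function contractViolations(array $payload, array $contract): array
{
    $violations = [];

    foreach ($contract as $path => $expectedType) {
        $value = $payload;
        $found = true;

        // Walk the payload one path segment at a time.
        foreach (explode('.', $path) as $segment) {
            if (!is_array($value) || !array_key_exists($segment, $value)) {
                $violations[] = "missing key: {$path}";
                $found = false;
                break;
            }
            $value = $value[$segment];
        }

        if ($found) {
            $actualType = get_debug_type($value);
            if ($actualType !== $expectedType) {
                $violations[] = "{$path}: expected {$expectedType}, got {$actualType}";
            }
        }
    }

    return $violations;
}

$contract = [
    'data.id'                => 'int',
    'data.attributes.total'  => 'int',    // integer cents
    'data.attributes.status' => 'string',
];

// Same keys in both payloads; only the type of total differs.
$before = json_decode('{"data":{"id":7,"attributes":{"total":9000,"status":"paid"}}}', true);
$after  = json_decode('{"data":{"id":7,"attributes":{"total":"$90.00","status":"paid"}}}', true);

var_dump(contractViolations($before, $contract)); // empty array
var_dump(contractViolations($after, $contract));  // one violation, for data.attributes.total
```

A key-presence check would accept both payloads, since the keys are identical. Only the type comparison flags the second one.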

Database state assertions over response assertions alone

An agent can make an HTTP response look correct while the system ends up in the wrong state. The classic failure modes are a job that’s supposed to be queued but isn’t, and a model that’s updated in memory but never persisted because of a missed save() call inside a conditional branch the agent didn’t fully trace.

public function test_completing_an_order_dispatches_the_confirmation_email_job(): void
{
    Queue::fake();

    $order = Order::factory()->create(['status' => 'pending']);

    $this->actingAs($order->customer)
        ->postJson("/orders/{$order->id}/complete")
        ->assertOk();

    // Assert the database state, not just the HTTP response
    $this->assertDatabaseHas('orders', [
        'id'     => $order->id,
        'status' => 'completed',
    ]);

    // Assert the side effect actually happened
    Queue::assertPushed(SendOrderConfirmationEmail::class, function ($job) use ($order) {
        return $job->order->id === $order->id;
    });
}

All three assertions matter independently. The response assertion alone tells you the controller returned 200. The database assertion tells you the status change was actually persisted, and the queue assertion tells you the confirmation job was actually dispatched. An agent refactoring this flow can break any one of them while leaving the others intact, so testing each of them is what closes the gap.

The Coverage Threshold That Makes Agentic Development Safe

Line coverage percentage is a weak proxy for the thing that actually matters, which is whether your tests would fail if an agent introduced the specific category of bug agents are prone to introducing. A codebase at 95% line coverage built entirely from unit tests that assert types and not behavior provides less real protection than a codebase at 70% coverage built from feature tests that assert business rules.

The threshold that matters is closer to a checklist than a percentage:

Before letting an agent modify a part of the codebase, confirm:

→ Every public-facing endpoint touched has a feature test
  asserting the full response, not just the status code

→ Every business rule has an explicit test for the rule being
  violated, not just the rule being followed
  ("can't redeem twice," "can't go negative," "can't exceed limit")

→ Every side effect (job dispatch, event fire, notification send,
  database write) has its own assertion, separate from the
  HTTP response assertion

→ Every external integration point (payment gateway, third-party
  API, webhook) has a test using a fake/mock that asserts the
  exact payload sent, not just that a call was made

→ Race conditions and concurrent access patterns relevant to the
  feature have at least one test, even if it's a basic one

→ The response schema for any API contract is pinned with type
  assertions, not just key-presence assertions

This isn’t a coverage number you can compute with a single command. It’s a property of the test suite that requires a human to evaluate — which is itself the argument for writing it before the agent starts working, not after. Once the checklist holds, an agent operating in that codebase has a genuine safety net. Before it holds, the agent’s “tests pass” signal is unreliable in exactly the ways that matter most.
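Of the checklist items above, the external-integration one is the only one without an example in this article, so here is a framework-free sketch of the idea. PaymentGateway, RecordingGateway, chargeOrder, and the payload keys are all hypothetical names chosen for illustration. The fake records every payload it receives, so the check can assert on exactly what was sent rather than on the mere fact that a call happened.

```php
<?php
declare(strict_types=1);

// Hypothetical gateway abstraction the application codes against.
interface PaymentGateway
{
    public function charge(array $payload): void;
}

// Test double that records payloads instead of calling a real API.
final class RecordingGateway implements PaymentGateway
{
    /** @var list<array> */
    public array $calls = [];

    public function charge(array $payload): void
    {
        $this->calls[] = $payload;
    }
}

// Hypothetical code under test: builds the payload for an order.
function chargeOrder(PaymentGateway $gateway, int $orderId, int $totalCents): void
{
    $gateway->charge([
        'amount'          => $totalCents,
        'currency'        => 'usd',
        'idempotency_key' => "order-{$orderId}",
    ]);
}

$gateway = new RecordingGateway();
chargeOrder($gateway, 42, 9000);

// Weak: only proves that some call was made.
$callWasMade = count($gateway->calls) === 1;

// Strong: pins the exact payload, so a changed unit or key fails the check.
$payloadIsExact = $gateway->calls === [[
    'amount'          => 9000,
    'currency'        => 'usd',
    'idempotency_key' => 'order-42',
]];

var_dump($callWasMade, $payloadIsExact); // bool(true) bool(true)
```

If a refactor started sending the amount in dollars instead of cents, the weak check would keep passing and only the exact-payload check would fail. In a Laravel codebase, Http::fake() together with Http::assertSent() can play the role of the recording fake.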

Why “the Agent Writes Its Own Tests” Doesn’t Solve This

The instinct to let the agent generate the tests alongside the feature is reasonable but doesn’t fully close the gap. An agent reasoning about what to test is reasoning from the same context window it used to write the feature — if it didn’t model the race condition while writing the feature, it’s unlikely to write a test for the race condition either. The blind spots in the implementation and the blind spots in the self-generated tests correlate, because they come from the same incomplete model of the problem.

This is the practical argument for writing the edge-case and business-rule tests yourself, as a human, before agentic work begins on a piece of the codebase — not because the agent can’t write tests, but because the tests that matter most are exactly the ones an agent is least likely to think to write on its own. The agent is good at writing tests for what it understood about the task. The dangerous gaps are in what it didn’t understand it needed to handle.

Tests an agent usually writes unprompted:
✅ Happy path assertions for the feature as described
✅ Basic validation error tests for fields explicitly mentioned
✅ Type and structure assertions

Tests an agent tends to miss unprompted:
❌ Race conditions and concurrent access
❌ Business rules implied by the domain but not stated in the prompt
❌ Edge cases at numeric boundaries (zero, negative, max int)
❌ Cross-cutting concerns (what happens to related records on delete)
❌ Backward compatibility for existing API consumers
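As an illustration of the numeric-boundary item in the list above, here is a small table-driven sketch in plain PHP. applyPercentDiscount and its rules (integer cents, percentage limited to 0 through 100, negative totals rejected) are assumptions made up for the example, not rules taken from any real system.

```php
<?php
declare(strict_types=1);

// Hypothetical rule set: totals are integer cents, percent must be 0..100.
function applyPercentDiscount(int $totalCents, int $percent): int
{
    if ($totalCents < 0) {
        throw new InvalidArgumentException('Total cannot be negative');
    }
    if ($percent < 0 || $percent > 100) {
        throw new InvalidArgumentException('Percent must be between 0 and 100');
    }

    // Split the total into whole hundreds and a remainder so that
    // very large totals never overflow the integer range.
    return $totalCents
        - intdiv($totalCents, 100) * $percent
        - intdiv(($totalCents % 100) * $percent, 100);
}

$cases = [
    // [total, percent, expected result, or null when an exception is expected]
    [0, 10, 0],
    [1, 50, 1],
    [10000, 0, 10000],
    [10000, 100, 0],
    [PHP_INT_MAX, 100, 0],
    [-1, 10, null],
    [100, -1, null],
    [100, 101, null],
];

$failures = 0;
foreach ($cases as [$total, $percent, $expected]) {
    try {
        $actual = applyPercentDiscount($total, $percent);
    } catch (InvalidArgumentException $e) {
        $actual = null;
    }
    if ($actual !== $expected) {
        $failures++;
        echo "FAIL: total={$total} percent={$percent}\n";
    }
}

echo $failures === 0 ? "all boundary cases hold\n" : "{$failures} case(s) failed\n";
```

None of these rows is hard to write. The point is that somebody has to decide what the correct answer at each boundary is, and that decision is domain knowledge the prompt rarely contains.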

The right workflow isn’t “agent writes the feature, agent writes the tests, ship it.” It’s “human writes the tests that encode business rules and edge cases, agent writes the feature against that suite, agent’s own generated happy-path tests add incremental coverage on top.” The human-written layer is the gate. The agent-written layer is supplementary.

What This Looks Like in a Real Pull Request

A practical workflow for a feature being built with significant agent involvement:

1. Human writes the feature test for the happy path
   — defines what "this feature works" means in concrete terms

2. Human writes 2-4 edge case / business rule tests
   — the ones that encode domain knowledge the agent doesn't have

3. Human writes the database state and side-effect assertions
   — pins what should be true after the action, beyond the response

4. Agent implements the feature against this test suite
   — runs the suite, fails, iterates, until everything passes

5. Agent (optionally) adds incremental tests for paths it discovered
   — additional coverage the human didn't think to write upfront

6. Human reviews the diff — code AND the agent-added tests
   — agent-written tests need the same scrutiny as agent-written code,
     since a weak test can rubber-stamp a weak implementation

Step 6 is easy to skip and shouldn’t be. An agent that’s struggling to make a feature work correctly can, not through malice but through the mechanics of “make tests pass,” write a weaker test that’s easier to satisfy rather than fix the underlying implementation. A test that only calls $response->assertOk() instead of asserting the actual side effect still passes while providing none of the protection the original test intended. Reviewing agent-written tests with the same skepticism as agent-written code is part of the discipline, not an optional extra step.

The Statement in Full Context

Jeffrey Way’s framing at PHPverse 2026 places this inside a larger shift — that AI is changing the role of the developer, not eliminating the need for engineering judgment. The judgment moves. It moves away from “did I write this function correctly” and toward “did I specify, in a way a test suite can verify, what correct means for this part of the system.” That’s a different skill than the one most developers spent a career building, and it’s the skill the test suite requirement is actually asking for.

A full test suite, in this context, doesn’t mean 100% line coverage. It means a suite that has actually encoded what your application is supposed to do — the business rules, the edge cases, the side effects, the contracts — in a form an agent’s “run tests, see failure, fix code” loop can act on. Teams that have this in place can hand meaningful, multi-file changes to an agent and trust the result the way they’d trust a thorough code review. Teams that don’t have this in place are trusting an agent’s self-report that “the tests pass” — on a test suite that may not be checking the things that matter.

The requirement isn’t really about AI. It’s the same requirement good engineering always had — verify behavior, not just function signatures; test the rule being broken, not just the rule being followed; assert the side effect, not just the response code. AI agents didn’t invent the need for this. They removed the option of skipping it and getting away with it, because now there’s a system in the loop that will happily ship whatever your tests allow it to ship.