Writing Laravel Tests With AI Agents: The Workflow That Changed How My Whole Team Ships

AI agents need tests to work safely, and it turns out they’re also remarkably good at writing them. Here’s the exact workflow: human writes the feature, agent writes the Pest tests, agent runs them, agent fixes failures. This is the feedback loop that cut our bug count roughly in half.

There’s an irony in how AI agents changed our testing culture. For four years, our team wrote tests inconsistently — thoroughly when there was time, sparingly when there wasn’t, almost never for the edge cases that eventually caused production incidents. Then we started using agents for development. Jeffrey Way said it at PHPverse 2026: if you’re leveraging AI agents, a full test suite is a requirement. We heard that, internalized it, and discovered something nobody had mentioned: agents are better at writing tests than most developers are, and they’re faster, and they don’t skip the edge cases because they’re not tired or deadline-pressured.

The workflow that emerged isn’t “agent writes code, human writes tests.” It’s more precise than that, and the order matters: human writes the feature, agent writes the Pest tests against it, agent runs the suite, agent analyzes failures, agent fixes them. The human reviews the diff. The feature ships when everything passes and the human is satisfied with what the tests actually assert. The feedback loop is tight enough that the test suite writes itself as a side effect of writing features, and it writes itself well.

This post documents that workflow exactly — the prompts, the patterns, the Pest conventions the agent follows, the review criteria that keep agent-written tests from being rubber stamps, and the specific categories where agents outperform human developers at writing tests.

Why Agents Write Better Tests Than Tired Developers

The most honest explanation: agents don’t have the cognitive relationship with the code that the author does. A developer who just wrote a discount service has a mental model of how it works — and that mental model, correctly, doesn’t include the bugs. When they sit down to test it, they test the code they intended to write, not the code they actually wrote. Edge cases that would expose the gap between intent and implementation are the ones they’re least likely to think of, because their mental model doesn’t contain them.

An agent reading the feature code has no such model. It sees what the code does, not what the developer meant it to do. It reads the method signature, the branching logic, the database operations, the side effects — and generates tests for the paths it actually sees, including the paths the developer didn’t consciously think about. The failing tests it generates are often the first signal that the implementation has a gap.

Developer testing their own code:
  Likely to test:   happy path, the scenarios they built for
  Likely to skip:   edge cases at boundaries, race conditions,
                    business rules they assumed but didn't encode

Agent testing someone else's code:
  Likely to test:   every branch it can see in the implementation
  Likely to add:    boundary conditions, null inputs, invalid states
  Likely to catch:  mismatches between method signature and actual behaviour

This isn’t theoretical. The shift in our bug count came from the last category: mismatches between what a method claimed to do and what it actually did, exposed by tests that an agent wrote against the actual code, not against the developer’s memory of writing it.

The Workflow in Full

The loop has five steps. Every step has a defined owner and a defined output.

Step 1  Human writes the feature
        → Output: working implementation, no tests

Step 2  Agent reads the feature code and writes Pest tests
        → Output: test file covering happy path, edge cases,
                  negative paths, side effects, database state

Step 3  Agent runs the test suite
        → Output: pass/fail results with failure output

Step 4  Agent analyzes failures and fixes either the test or the implementation
        → Output: corrected implementation or corrected test,
                  with an explanation of which it changed and why

Step 5  Human reviews the diff
        → The feature, the tests, any implementation fixes
        → Asks: do the tests assert what matters?
                does the fix address the root cause?
        → Merges or requests changes

Step 4 is where the value compounds. When the agent runs a test and it fails, it has to make a judgment: is the test correct and the implementation wrong, or is the implementation correct and the test was asserting the wrong thing? Getting this judgment right is what separates a useful agent from one that games the loop by writing weaker tests. In practice, Claude Code makes this judgment well — it tends toward fixing the implementation rather than weakening the test, and it explains which path it took and why.

Step 5 is the human’s job and it can’t be delegated. The reviewer reads the diff the same way they’d review a pull request: does this test suite actually encode what the feature is supposed to do, or does it just verify that the code runs?
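Steps 3 and 4 can be pictured as a bounded loop. The sketch below is framework-free and purely illustrative: `runFeedbackLoop`, `$runSuite`, and `$fixFailures` are hypothetical names standing in for whatever the agent actually does, not a Pest or Claude Code API. It shows two things: the loop stops after a fixed number of rounds, and every fix leaves a note behind for the human to read in Step 5.

```php
<?php

/**
 * Hypothetical sketch of Steps 3 and 4: run the suite, hand failures to a
 * fixer, repeat until green or until the round limit is reached.
 *
 * @param callable(): array<int, string>       $runSuite    returns names of failing tests
 * @param callable(array<int, string>): string $fixFailures returns a note on what was changed
 * @return array{passed: bool, notes: array<int, string>}
 */
function runFeedbackLoop(callable $runSuite, callable $fixFailures, int $maxRounds = 5): array
{
    $notes = [];

    for ($round = 1; $round <= $maxRounds; $round++) {
        $failures = $runSuite();

        if ($failures === []) {
            return ['passed' => true, 'notes' => $notes];
        }

        // Keep the explanation of each fix so the reviewer can read it later.
        $notes[] = $fixFailures($failures);
    }

    // One final run to see whether the last fix was enough.
    return ['passed' => $runSuite() === [], 'notes' => $notes];
}

// Usage with fake callables: the "suite" fails twice, then passes.
$remaining = [['boundary test'], ['atomicity test'], []];

$result = runFeedbackLoop(
    function () use (&$remaining): array {
        return array_shift($remaining) ?? [];
    },
    fn (array $failures): string => 'fixed implementation for: ' . implode(', ', $failures)
);
```

The round limit is a design choice rather than something the workflow requires: a loop that never converges is usually a sign that the test and the implementation disagree about the business rule, which is a question for the human.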

The Prompt That Gets Good Tests

The prompt matters more than most developers realize. A vague prompt gets vague tests. A specific prompt that tells the agent what the feature does, what the business rules are, and what failure looks like gets tests that actually protect against the things that break.

The prompt I use (adapted per feature):

"I've just finished implementing the [feature name]. Read the following files:
- [implementation file]
- [model file]
- [migration if relevant]
- [service provider if relevant]

Write a complete Pest test suite for this feature. Specifically:

1. Happy path tests for the primary use case
2. Edge case tests for:
   - Boundary conditions (empty, null, zero, max values)
   - Invalid inputs that should return validation errors
   - States the model could be in that would affect behavior
3. Side effect tests:
   - Every job that should be dispatched
   - Every event that should be fired
   - Every database state change that should occur
   - Every cache key that should be set or cleared
4. Authorization tests:
   - Authenticated user who should have access
   - Authenticated user who should be denied
   - Unauthenticated request
5. For this feature specifically, also test:
[feature-specific concerns — race conditions, tenant isolation,
    idempotency, webhook handling, etc.]

Use our Pest conventions:
- it() for all tests, not test()
- beforeEach() for shared setup using factories
- describe() blocks to group related tests
- Factories with named states where they exist in our codebase
- assertDatabaseHas() and assertDatabaseMissing() for state changes
- Queue::fake() / Event::fake() / Mail::fake() before any test
  that triggers side effects"

The last section — “for this feature specifically, also test” — is where the human’s domain knowledge enters the prompt. The agent will cover the standard categories reliably. The human knows which business rules are subtle, which race conditions exist in this domain, which tenant isolation assumptions are critical. That knowledge belongs in the prompt, not as an afterthought.
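If the template is reused often, the per-feature parts can be filled in by a small helper. This is a minimal sketch under assumptions: `buildTestPrompt` and its parameters are hypothetical, and the bracketed placeholder line stands in for the fixed parts of the template above. The one deliberate choice is that it refuses to build a prompt with no feature-specific concerns, since that section is where the human's domain knowledge enters.

```php
<?php

/**
 * Hypothetical helper that fills the per-feature parts of the prompt template.
 *
 * @param array<int, string> $files    paths the agent should read
 * @param array<int, string> $concerns feature-specific risks supplied by the human
 */
function buildTestPrompt(string $feature, array $files, array $concerns): string
{
    if ($concerns === []) {
        // An empty list means the human's domain knowledge never reached the agent.
        throw new InvalidArgumentException('List at least one feature-specific concern.');
    }

    // Render a list of strings as "- item" lines.
    $bullets = fn (array $items): string => implode("\n", array_map(
        fn (string $item): string => "- {$item}",
        $items
    ));

    return implode("\n", [
        "I've just finished implementing the {$feature}. Read the following files:",
        $bullets($files),
        '',
        'Write a complete Pest test suite for this feature.',
        '[standard categories and Pest conventions, unchanged between features]',
        '',
        'For this feature specifically, also test:',
        $bullets($concerns),
    ]);
}

$prompt = buildTestPrompt(
    'discount code redemption',
    ['app/Services/DiscountService.php', 'app/Models/DiscountCode.php'],
    ['two requests redeeming the same single-use code at once', 'codes belonging to another tenant']
);
```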

The Test Patterns That Agents Produce Well

Some patterns appear consistently in agent-written test suites that rarely appear in developer-written ones. These are the patterns worth recognizing, because they’re the ones most likely to catch real bugs.

The implementation-reads-differently-than-it-intends test

// The implementation
public function applyDiscount(Order $order, string $code): void
{
    $discount = DiscountCode::where('code', $code)
        ->where('expires_at', '>', now())
        ->first();

    if ($discount) {
        $order->update(['total' => $order->total - $discount->amount]);
    }
}

// The test an agent generates — not what the developer tests
it('does not apply a discount if the code has already expired', function () {
    $order    = Order::factory()->create(['total' => 100_00]);
    $discount = DiscountCode::factory()->create([
        'code'       => 'SAVE10',
        'amount'     => 10_00,
        'expires_at' => now()->subSecond(), // ← expired one second ago
    ]);

    (new DiscountService)->applyDiscount($order, 'SAVE10');

    expect($order->fresh()->total)->toBe(100_00); // unchanged
});

it('does not apply a discount that expires at exactly this second', function () {
    $this->freezeTime(); // pin now() so the factory and the query see the same instant

    $order    = Order::factory()->create(['total' => 100_00]);
    $discount = DiscountCode::factory()->create([
        'code'       => 'SAVE10',
        'amount'     => 10_00,
        'expires_at' => now(), // ← expires now — boundary condition
    ]);

    (new DiscountService)->applyDiscount($order, 'SAVE10');

    // The query uses '>' not '>=' — expires_at = now() is excluded
    // Is that the correct business rule? This test makes it visible.
    expect($order->fresh()->total)->toBe(100_00);
});

The second test — expires_at at exactly now() — is the boundary condition test that exposes whether > or >= is the right comparison. The developer who wrote the implementation chose >. The agent generates a test that makes that choice explicit and assertable. If the business rule is “a code that expires right now is still valid,” this test fails and surfaces the mismatch.
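The same boundary can be seen without a database. The following is a framework-free illustration, with hypothetical function names, of how the two comparison operators disagree only at the exact instant of expiry:

```php
<?php

// Exclusive rule: mirrors where('expires_at', '>', now()) in the implementation.
function isValidExclusive(DateTimeImmutable $expiresAt, DateTimeImmutable $now): bool
{
    return $expiresAt > $now;
}

// Inclusive rule: the alternative, where a code expiring right now still counts.
function isValidInclusive(DateTimeImmutable $expiresAt, DateTimeImmutable $now): bool
{
    return $expiresAt >= $now;
}

$now = new DateTimeImmutable('2025-01-01 12:00:00');

var_dump(isValidExclusive($now, $now));                      // bool(false)
var_dump(isValidInclusive($now, $now));                      // bool(true)
var_dump(isValidExclusive($now->modify('+1 second'), $now)); // bool(true)
```

Both rules agree one second before and one second after expiry. Only a test that sits exactly on the boundary can tell them apart, which is why the clock has to be pinned for that test to mean anything.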

The side effect isolation test

it('dispatches SendWelcomeEmail but not SendTrialReminder when a user registers', function () {
    Queue::fake();

    $this->postJson('/api/register', [
        'name'     => 'Sadique Ali',
        'email'    => 'sadique@example.com',
        'password' => 'password',
    ])->assertCreated();

    Queue::assertPushed(SendWelcomeEmail::class);
    Queue::assertNotPushed(SendTrialReminder::class); // ← negative assertion
});

The negative assertion — assertNotPushed — is the one developers rarely write. The agent writes it because it sees from the implementation which jobs are conditionally dispatched, and it tests both the positive and negative condition. A refactor that accidentally dispatches SendTrialReminder on registration fails this test immediately, not in a user’s inbox.

The database atomicity test

it('does not update the order total if saving the discount redemption fails', function () {
    Queue::fake();

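    // Simulate a database failure on the redemption record. This assumes the
    // service calls DiscountRedemption::create() directly; a container mock
    // would not intercept that static call, so the failure is raised from
    // the model's creating hook instead.
    DiscountRedemption::creating(function () {
        throw new \Illuminate\Database\QueryException(
            'mysql', 'INSERT ...', [], new \Exception('Duplicate entry')
        );
    });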

    $order    = Order::factory()->create(['total' => 100_00]);
    $discount = DiscountCode::factory()->create(['amount' => 10_00]);

    expect(fn () => (new DiscountService)->applyDiscount($order, $discount->code))
        ->toThrow(\Illuminate\Database\QueryException::class);

    // The order total must not have changed
    expect($order->fresh()->total)->toBe(100_00);
});

This test checks whether the discount application is atomic — if saving the redemption record fails, does the order total stay unchanged? Developers rarely write this test because writing it requires simulating a database failure. Agents write it because they see a sequence of database operations and generate a test for each operation failing. If the implementation isn’t wrapped in a transaction, this test fails and the developer knows before shipping.
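What "atomic" means here can be shown with a framework-free stand-in. The sketch below is not how a Laravel service would be written (there the wrapper would be `DB::transaction()`); the array store and the `atomically` helper are hypothetical and exist only to show the property the test asserts: a failure in the second write undoes the first.

```php
<?php

/**
 * Stand-in for a database transaction: snapshot the state, run the work,
 * and restore the snapshot if anything throws.
 */
function atomically(array &$store, callable $work): void
{
    $snapshot = $store;

    try {
        $work($store);
    } catch (Throwable $e) {
        $store = $snapshot; // roll back every write made inside $work
        throw $e;           // the caller still sees the failure
    }
}

$store = ['order_total' => 100_00, 'redemptions' => []];

try {
    atomically($store, function (array &$s): void {
        $s['order_total'] -= 10_00;                    // first write succeeds
        throw new RuntimeException('Duplicate entry'); // second write fails
    });
} catch (RuntimeException $e) {
    // Expected: the exception surfaces, as the Pest test above asserts with toThrow().
}

echo $store['order_total']; // 10000
```

Without the snapshot and restore, the total would be left at 9000 with no redemption recorded, which is exactly the state the atomicity test is written to rule out.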

The multi-tenancy isolation test

describe('tenant isolation', function () {
    it('a tenant admin cannot read another tenant\'s orders', function () {
        $tenantA = Tenant::factory()->create();
        $tenantB = Tenant::factory()->create();
        $order   = Order::factory()->for($tenantB)->create();

        $response = $this->actingAs($tenantA->owner)
            ->getJson("/api/orders/{$order->id}");

        $response->assertNotFound(); // 404, not 403 — don't reveal it exists
    });

    it('a tenant admin cannot modify another tenant\'s order', function () {
        $tenantA = Tenant::factory()->create();
        $tenantB = Tenant::factory()->create();
        $order   = Order::factory()->for($tenantB)->create(['status' => 'pending']);

        $response = $this->actingAs($tenantA->owner)
            ->patchJson("/api/orders/{$order->id}", ['status' => 'completed']);

        $response->assertNotFound();
        expect($order->fresh()->status)->toBe('pending'); // unchanged
    });
});

The 404 instead of 403 distinction — don’t reveal that a resource exists to a tenant who can’t access it — is a security detail that developers overlook. The agent writes the test that documents this expectation. If the implementation returns 403, the test fails and surfaces the question: should it? In a multi-tenant SaaS, usually not.
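The rule the test encodes can be written down as a small decision function. This is a hypothetical sketch (the array shape and function name are illustrative); in a real application the same effect usually comes from scoping the lookup by tenant, so that a foreign order is simply never found.

```php
<?php

// Hypothetical status decision for GET /api/orders/{id}.
function statusForOrderRequest(?array $order, int $actingTenantId): int
{
    if ($order === null) {
        return 404; // no such order
    }

    if ($order['tenant_id'] !== $actingTenantId) {
        return 404; // exists, but a 403 would confirm that to an outsider
    }

    return 200;
}

$foreignOrder = ['id' => 7, 'tenant_id' => 2];

echo statusForOrderRequest($foreignOrder, 1); // 404
```

From the outside, a missing order and another tenant's order are indistinguishable, which is the point.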

The Review Criteria: What Makes an Agent-Written Test Suite Worth Keeping

The agent produces a test suite. The human reviews it. The review has a specific checklist:

For every test in the suite, ask:

1. Does it assert behavior or implementation?
   ✅ expect($order->fresh()->total)->toBe(90_00)  ← behavior
   ❌ expect($service)->toBeInstanceOf(DiscountService::class)  ← implementation

2. Does it assert both the success and the relevant absence?
   ✅ Queue::assertPushed(WelcomeEmail::class)
   ✅ Queue::assertNotPushed(TrialReminder::class)
   ❌ Only the positive assertion

3. Does it assert database state, not just response status?
   ✅ $response->assertOk(); expect($order->fresh()->total)->toBe(90_00);
   ❌ $response->assertOk(); (status only)

4. Does it cover the negative path of each business rule?
   ✅ it('does not apply an expired code')
   ✅ it('does not apply a code that has reached its max uses')
   ❌ Only the happy path

5. Do the assertions use concrete values, not just types?
   ✅ expect($order->fresh()->total)->toBe(90_00)
   ❌ expect($order->fresh()->total)->toBeInt()

The fifth criterion is the one most agent-written tests fail when they do fail. Type assertions (toBeInt(), toBeString()) confirm that the code runs without errors. Value assertions (toBe(90_00)) confirm that the code produces the correct result. The difference is the difference between a test that passes on broken code and one that doesn’t.

When an agent writes a type assertion where a value assertion belongs, replace it. That’s the primary edit the human makes to agent-written tests — not adding missing tests, but strengthening assertions that are too weak.
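A deliberately broken example makes the difference concrete. The function below is hypothetical and contains an obvious bug; the type check is satisfied anyway, while the value check is not.

```php
<?php

// Broken on purpose: adds the discount instead of subtracting it.
// Totals are integer cents, matching the tests in this post.
function applyDiscountBroken(int $total, int $amount): int
{
    return $total + $amount; // bug
}

$result = applyDiscountBroken(100_00, 10_00);

$typeAssertionPasses  = is_int($result);     // true: the bug goes unnoticed
$valueAssertionPasses = ($result === 90_00); // false: the bug is caught
```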

When the Agent Fixes the Implementation

The most valuable outcome of Step 4 is when the agent runs a test, it fails, and the agent concludes that the implementation is wrong, not the test. The two conclusions tend to follow from different signals:

Agent conclusion: "fix the implementation"
  → The test asserts something the feature specification clearly requires
  → The failure is a business rule violation
  → The fix is a logic correction

Agent conclusion: "fix the test"
  → The test asserts something the implementation never claimed to do
  → The failure is a wrong assumption in the test
  → The fix is correcting what was being asserted

In practice, Claude Code narrates which path it’s taking and why. That narration is the thing to read before accepting the diff:

"The test 'does not allow a second redemption of a single-use code' is failing
because the implementation doesn't check max_uses before applying the discount.
This is a business rule the test correctly encodes. I'm fixing the implementation
to add the max_uses check rather than removing the test."

vs.

"The test 'dispatches InventoryAlert when stock falls below threshold' is failing
because the test expects the alert to fire at exactly zero stock, but the
implementation fires it at or below the configured threshold (which is 5 in the
test database). I'm updating the test to match the documented threshold."

Both are legitimate outcomes. The second one reveals that the test was wrong — it was asserting a specific behavior the implementation never promised. Reading the narration tells you which is which without having to trace through the code yourself.
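For the first narration, the implementation fix amounts to a guard like the one sketched here. The function name and the convention that a null limit means "unlimited" are assumptions made for illustration; the post does not show the actual change.

```php
<?php

// Hypothetical max_uses guard, checked before the discount is applied.
// Assumption: a null $maxUses means the code has no usage limit.
function canRedeem(int $timesRedeemed, ?int $maxUses): bool
{
    if ($maxUses === null) {
        return true;
    }

    return $timesRedeemed < $maxUses;
}

var_dump(canRedeem(0, 1)); // bool(true): first use of a single-use code
var_dump(canRedeem(1, 1)); // bool(false): second redemption is refused
```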

The Specific Test the Agent Always Writes That You Probably Don’t

After running this workflow across thirty-odd features over six months, one test category appears in almost every agent-written suite and almost never in developer-written ones: the unauthenticated access test.

it('returns 401 for unauthenticated requests to protected routes', function () {
    $order = Order::factory()->create();

    $this->getJson("/api/orders/{$order->id}")
         ->assertUnauthorized();

    $this->patchJson("/api/orders/{$order->id}", ['status' => 'completed'])
         ->assertUnauthorized();

    $this->deleteJson("/api/orders/{$order->id}")
         ->assertUnauthorized();
});

Every protected route. Every HTTP method. All in one test. Developers don’t write this test because they know the auth middleware is there — they added it. An agent doesn’t assume the middleware is working; it verifies it. The number of times this test has caught a route that was accidentally missing the auth:sanctum middleware on one HTTP method is embarrassingly high.

The Numbers After Six Months

Before this workflow: tests were written by the same developers who wrote the features, usually on the same day, usually covering the happy path, usually under deadline pressure. In the six months before the workflow, we recorded 23 production incidents traced to code that had tests.

After six months of the workflow: 11 incidents. The reduction wasn’t uniform — it was concentrated in a specific category: edge case failures that the developer hadn’t tested but the agent did. The atomicity test, the boundary condition test, the unauthenticated access test, the multi-tenant isolation test — these are the categories that moved the number. Not because the agent is smarter than the developer. Because the agent tests what it reads, not what it remembers meaning to write.

The side effect that nobody predicted: developer-written tests got better. When you read enough agent-written test suites, you start writing value assertions instead of type assertions. You start writing the negative path next to the positive path. You start writing the database state check next to the response status check. The agent modeled a testing discipline that was technically known but inconsistently practiced. Consistent exposure to it changed the practice.

The Workflow in Practice: A Concrete Example

Feature: a TransferOrder action that moves an order from one tenant to another, validates both tenants exist, dispatches an OrderTransferred event, updates the order, and creates an audit record.

Human writes the implementation. Hands it to the agent with the prompt template.

Agent produces:

use App\Actions\TransferOrder;
use App\Events\OrderTransferred;
use App\Models\{AuditRecord, Order, Tenant};

describe('TransferOrder action', function () {

    beforeEach(function () {
        $this->sourceTenant = Tenant::factory()->create();
        $this->targetTenant = Tenant::factory()->create();
        $this->order = Order::factory()->for($this->sourceTenant)->create();
    });

    it('transfers the order to the target tenant', function () {
        Event::fake();

        (new TransferOrder)->execute($this->order, $this->targetTenant);

        expect($this->order->fresh()->tenant_id)->toBe($this->targetTenant->id);
    });

    it('creates an audit record on transfer', function () {
        Event::fake();

        (new TransferOrder)->execute($this->order, $this->targetTenant);

        expect(AuditRecord::where([
            'action'         => 'order_transferred',
            'order_id'       => $this->order->id,
            'from_tenant_id' => $this->sourceTenant->id,
            'to_tenant_id'   => $this->targetTenant->id,
        ])->exists())->toBeTrue();
    });

    it('dispatches an OrderTransferred event', function () {
        Event::fake();

        (new TransferOrder)->execute($this->order, $this->targetTenant);

        Event::assertDispatched(OrderTransferred::class, function ($event) {
            return $event->order->id === $this->order->id
                && $event->fromTenant->id === $this->sourceTenant->id
                && $event->toTenant->id === $this->targetTenant->id;
        });
    });

    it('does not transfer if the target tenant does not exist', function () {
        Event::fake();

        expect(fn () => (new TransferOrder)->execute(
            $this->order,
            (new Tenant)->forceFill(['id' => 99999]), // unsaved model; forceFill because id is normally not mass-assignable
        ))->toThrow(\App\Exceptions\TenantNotFoundException::class);

        // Order must remain with source tenant
        expect($this->order->fresh()->tenant_id)->toBe($this->sourceTenant->id);

        // No audit record
        expect(AuditRecord::where('order_id', $this->order->id)->exists())->toBeFalse();

        // No event
        Event::assertNotDispatched(OrderTransferred::class);
    });

    it('does not transfer an order that is already with the target tenant', function () {
        Event::fake();

        expect(fn () => (new TransferOrder)->execute(
            $this->order,
            $this->sourceTenant, // same tenant
        ))->toThrow(\App\Exceptions\InvalidTransferException::class);

        Event::assertNotDispatched(OrderTransferred::class);
    });

    it('rolls back the order update if the audit record fails to save', function () {
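        // Fake only this event so that model hooks still fire
        Event::fake([OrderTransferred::class]);

        // Assumes the action calls AuditRecord::create() directly; a container
        // mock would not intercept that static call, so fail from the creating hook
        AuditRecord::creating(function () {
            throw new \Illuminate\Database\QueryException(
                'mysql', '', [], new \Exception('Constraint violation')
            );
        });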

        expect(fn () => (new TransferOrder)->execute(
            $this->order, $this->targetTenant
        ))->toThrow(\Illuminate\Database\QueryException::class);

        // Order must still belong to source tenant — transaction rolled back
        expect($this->order->fresh()->tenant_id)->toBe($this->sourceTenant->id);

        Event::assertNotDispatched(OrderTransferred::class);
    });

});

The atomicity test — rolls back if audit record fails — is the one the developer didn’t write. It’s also the one that found a real bug: the original implementation wasn’t wrapped in a DB::transaction(). The test failed. The agent fixed the implementation. The feature shipped correctly.

What This Changes About How You Think About Features

The workflow changes the sequence, but more fundamentally it changes when you know a feature is done. Before the workflow, “done” meant the happy path worked in the browser. After the workflow, “done” means the agent ran the suite, the suite passes, and the human reviewed the assertions and is satisfied they test the right things.

That standard is higher. It’s also achievable on every feature, not just the ones with time for a thorough test session. The agent absorbs the test-writing work without absorbing the domain knowledge work — and the domain knowledge work, encoded in the prompt and the review, is the part that actually requires a human.

The feedback loop is the product. The tests aren’t documentation of what the feature does. They’re the process by which you know it does what you think it does. Running that process through an agent that doesn’t share your cognitive blind spots is the change that moved the number.