Tool calling, file reading, query execution, test running, and the moment it flagged a bug in a service class I hadn’t touched in two years: a real experiment with the Laravel AI SDK’s agent capabilities, including what went right, what went wrong, and what I’d never let it do unsupervised.
The Laravel AI SDK ships with an agent abstraction. You define tools — PHP functions the LLM can call — and the agent decides which tools to use, in what order, how many times, to accomplish a goal described in plain English. The documentation shows clean examples. They work.
I wanted to know what happens when you point one at a real production codebase with real technical debt, real complexity, and real stakes. Not a toy project — an actual SaaS application with 200+ models, a decade of decisions layered on top of each other, and tests that had drifted from the code they were supposed to cover.
This is what happened over seven days.
The Setup: Tools I Gave the Agent
The agent needs tools to interact with the codebase. I built five:
// app/Agents/Tools/ReadFileTool.php
use Laravel\Ai\Tools\Tool;
class ReadFileTool extends Tool
{
public string $name = 'read_file';
public string $description = 'Read the contents of a file in the Laravel project. Use relative paths from the project root.';
public function __invoke(
#[Description('Relative path to the file, e.g. app/Models/User.php')]
string $path
): string {
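// Resolve symlinks and ".." segments before checking containment.
$root = realpath(base_path());
$realPath = realpath(base_path($path));
if ($realPath === false || !is_file($realPath)) {
return "Error: File not found at {$path}";
}
// Compare against the root plus a separator, so that a sibling directory
// such as "/var/www/app-old" does not pass a check for "/var/www/app".
if (!str_starts_with($realPath, $root . DIRECTORY_SEPARATOR)) {
return "Error: Access denied: path is outside the project root";
}
return file_get_contents($realPath);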
}
}
// app/Agents/Tools/ListDirectoryTool.php
class ListDirectoryTool extends Tool
{
public string $name = 'list_directory';
public string $description = 'List files and directories in a path. Useful for exploring the codebase structure.';
public function __invoke(
#[Description('Directory path relative to project root')]
string $path = '',
#[Description('File extension filter, e.g. php. Leave empty for all files.')]
string $extension = '',
): string {
$fullPath = base_path($path);
if (!is_dir($fullPath)) {
return "Error: Not a directory: {$path}";
}
$files = collect(scandir($fullPath))
->filter(fn($f) => $f !== '.' && $f !== '..')
->filter(fn($f) => !$extension || str_ends_with($f, ".{$extension}"))
->values()
->all();
return json_encode($files);
}
}
// app/Agents/Tools/RunArtisanTool.php
class RunArtisanTool extends Tool
{
public string $name = 'run_artisan';
public string $description = 'Run a read-only Artisan command. Only whitelisted commands are allowed.';
private const ALLOWED_COMMANDS = [
'route:list',
'model:show',
'test --filter',
'queue:failed',
'migrate:status',
];
public function __invoke(
#[Description('The Artisan command to run, e.g. route:list --json')]
string $command
): string {
$isAllowed = collect(self::ALLOWED_COMMANDS)
->some(fn($allowed) => str_starts_with($command, $allowed));
if (!$isAllowed) {
return "Error: Command not allowed. Permitted commands: " . implode(', ', self::ALLOWED_COMMANDS);
}
\Artisan::call($command);
return \Artisan::output();
}
}
// app/Agents/Tools/ReadOnlyQueryTool.php
class ReadOnlyQueryTool extends Tool
{
public string $name = 'run_query';
public string $description = 'Run a read-only SQL SELECT query on the database. Use for analysing data structure or counts. No INSERT/UPDATE/DELETE allowed.';
public function __invoke(
#[Description('A SELECT SQL query to run')]
string $sql
): string {
// Strict read-only enforcement
$normalised = strtolower(trim($sql));
if (!str_starts_with($normalised, 'select')) {
return "Error: Only SELECT queries are allowed.";
}
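// Block dangerous SQL even within SELECT. Match whole words only, so that
// columns such as created_at, updated_at or deleted_at are not rejected.
$blocked = ['insert', 'update', 'delete', 'drop', 'truncate', 'alter', 'create', 'exec', 'execute'];
foreach ($blocked as $keyword) {
if (preg_match('/\b' . $keyword . '\b/', $normalised)) {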
return "Error: Query contains blocked keyword: {$keyword}";
}
}
try {
$results = DB::select($sql);
return json_encode($results, JSON_PRETTY_PRINT);
} catch (\Exception $e) {
return "Query error: " . $e->getMessage();
}
}
}
// app/Agents/Tools/RunTestsTool.php
class RunTestsTool extends Tool
{
public string $name = 'run_tests';
public string $description = 'Run a specific test or test file. Returns pass/fail results and output.';
public function __invoke(
#[Description('Test filter pattern, e.g. UserTest or OrderControllerTest::test_it_creates_an_order')]
string $filter
): string {
// Sanitise input — no shell injection
$filter = preg_replace('/[^a-zA-Z0-9_\-:\/\\\\]/', '', $filter);
$process = new \Symfony\Component\Process\Process(
['php', 'artisan', 'test', '--filter=' . $filter, '--no-coverage'],
base_path(),
timeout: 60,
);
$process->run();
return $process->getOutput() . $process->getErrorOutput();
}
}
Building the Agent
<?php
// app/Console/Commands/RunCodeAgent.php
namespace App\Console\Commands;
use App\Agents\Tools\{ListDirectoryTool, ReadFileTool, RunArtisanTool, ReadOnlyQueryTool, RunTestsTool};
use Illuminate\Console\Command;
use Illuminate\Support\Facades\AI;
use Illuminate\Support\Facades\Log;
class RunCodeAgent extends Command
{
protected $signature = 'agent:run {task}';
protected $description = 'Run the code analysis agent with a given task';
public function handle(): int
{
$task = $this->argument('task');
$this->info("Starting agent for task: {$task}");
$this->line('');
$agent = AI::agent()
->model('claude-3-5-sonnet-20241022')
->system(
"You are a senior Laravel developer analysing a production codebase. " .
"You have read-only access to files, directories, and the database. " .
"You can run tests but cannot modify any files. " .
"Be thorough. Explore the codebase systematically before drawing conclusions. " .
"Always verify your findings before reporting them. " .
"Report what you find accurately — including uncertainty when you are uncertain."
)
->tools([
new ListDirectoryTool(),
new ReadFileTool(),
new RunArtisanTool(),
new ReadOnlyQueryTool(),
new RunTestsTool(),
])
->maxSteps(30);
$response = $agent->ask($task);
$this->line($response->text);
// Log all tool calls for auditing
foreach ($response->steps as $step) {
foreach ($step->toolCalls as $call) {
Log::info('Agent tool call', [
'tool' => $call->name,
'input' => $call->arguments,
]);
}
}
return self::SUCCESS;
}
}
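The maxSteps(30) cap matters because an agent is, underneath, a loop: the model picks a tool, sees the result, and picks again until it decides it has an answer. The following is a minimal sketch of that idea, not the SDK's implementation. The runAgentLoop function, the scripted $fakeModel stand-in for the LLM, and the closure-based tool are all hypothetical names of my own.

```php
<?php
// Hypothetical sketch of a tool-calling loop with a step cap.
// $model stands in for the LLM: it receives the transcript and returns either
// ['tool' => name, 'args' => [...]] or ['final' => text].
function runAgentLoop(callable $model, array $tools, string $task, int $maxSteps): array
{
    $transcript = [['role' => 'user', 'content' => $task]];

    for ($step = 1; $step <= $maxSteps; $step++) {
        $decision = $model($transcript);

        if (isset($decision['final'])) {
            return ['answer' => $decision['final'], 'steps' => $step, 'transcript' => $transcript];
        }

        // Run the requested tool and feed its result back into the transcript.
        $name = $decision['tool'];
        $result = isset($tools[$name])
            ? $tools[$name](...$decision['args'])
            : "Error: unknown tool {$name}";

        $transcript[] = ['role' => 'tool', 'name' => $name, 'content' => $result];
    }

    // Step cap reached without a final answer.
    return ['answer' => null, 'steps' => $maxSteps, 'transcript' => $transcript];
}

// Scripted stand-in for the model: call one tool, then answer.
$fakeModel = function (array $transcript): array {
    return count($transcript) === 1
        ? ['tool' => 'list_directory', 'args' => ['app']]
        : ['final' => 'Saw: ' . end($transcript)['content']];
};

$tools = [
    'list_directory' => fn (string $path): string => json_encode(["{$path}/Models"], JSON_UNESCAPED_SLASHES),
];

$run = runAgentLoop($fakeModel, $tools, 'Explore the app', 30);
echo $run['answer'], PHP_EOL;
```

Without the cap, a model that keeps requesting tools would never leave the loop, which is the runaway case the limit guards against.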
Day 1: Mapping the Application Architecture
Task: “Explore the codebase structure and produce a summary of the application’s architecture — what it does, the main domain models, and the primary entry points.”
I watched the agent work in real time. It started by listing the top-level directories, then methodically explored app/Models, app/Http/Controllers, routes/api.php, and a few key service classes.
What it found — in one pass, without being told where to look:
The agent's summary (condensed):
This is a B2B SaaS application for order management. Key domain models:
- Order (central entity) — 47 relationships, polymorphic activities log
- Product (with variants via ProductVariant) — complex pricing via PriceRule
- Customer (with CustomerLocation for multi-site customers)
- Invoice (generated from Orders, with LineItem)
- Supplier (with SupplierOrder for procurement)
Primary entry points:
- REST API (routes/api.php): 89 routes, primarily for mobile clients
- Web routes (routes/web.php): 34 routes, admin panel
- Queue jobs: 23 job classes, primarily for async order processing
Architecture observation: The OrderController handles 847 lines of code and
appears to contain business logic that would be better in service or action classes.
The OrderService class (618 lines) is imported by 14 different classes.
Potential issue identified: 3 routes in routes/api.php reference controllers
that don't exist in app/Http/Controllers/Api. This would cause 500 errors if
those routes are hit.
The missing controller finding was legitimate. Three routes in the API file referenced controller classes that had been renamed during a refactor, and the route definitions were never updated to match. Nothing hit them in normal usage, but they existed and would have returned 500 errors.
Day 1 verdict: Accurate architectural overview, one legitimate bug found. Time to reach the same conclusions manually: probably a day.
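The dangling-route check is simple enough to automate once you know to look for it. Here is a minimal sketch in plain PHP, assuming the route table has already been reduced to a URI => [class, method] map. The function findBrokenRoutes, the sample routes, and the class names are hypothetical, and a real version would read the map from the router rather than hard-code it.

```php
<?php
// Hypothetical controller so the example has one class that does exist.
final class OrderApiController
{
    public function index(): array
    {
        return [];
    }
}

// Return the routes whose controller class or method cannot be found.
function findBrokenRoutes(array $routes): array
{
    $broken = [];
    foreach ($routes as $uri => [$class, $method]) {
        if (!class_exists($class)) {
            $broken[$uri] = "class {$class} not found";
        } elseif (!method_exists($class, $method)) {
            $broken[$uri] = "method {$class}::{$method} not found";
        }
    }
    return $broken;
}

// Illustrative route map: one valid route, one renamed class, one missing method.
$routes = [
    'GET /api/orders'   => [OrderApiController::class, 'index'],
    'GET /api/invoices' => ['RenamedInvoiceController', 'index'],
    'POST /api/orders'  => [OrderApiController::class, 'store'],
];

print_r(findBrokenRoutes($routes));
```

A check like this could run in CI so that a renamed controller fails the build instead of waiting for a request to find it.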
Day 2: Test Coverage Analysis
Task: “Analyse the test suite. Which critical models and controllers have no test coverage? Which tests are likely outdated based on code they test?”
The agent ran php artisan test --filter=skip_impossible_filter to see the test output format, then read through the tests directory systematically.
Agent findings:
Test coverage gaps (no tests found for):
- CustomerController (API) — 0 tests
- InvoiceService — 0 tests
- PriceRule (complex pricing logic) — 0 tests
- SupplierOrderController — 0 tests
Likely outdated tests:
- OrderTest::test_it_calculates_shipping — references $order->shipping_cost
but the Order model no longer has a shipping_cost attribute.
Current model uses a ShippingCalculator service.
- UserTest::test_it_sends_welcome_email — calls Mail::fake() but the
WelcomeEmail mailable was replaced with a WelcomeNotification 8 months ago.
This test may be testing dead code.
Suggestion: Run the potentially outdated tests to confirm they still pass
and investigate whether they're testing the intended behaviour.
I ran the two flagged tests. The test_it_calculates_shipping test was still passing, but only because the shipping_cost accessor fell back to the ShippingCalculator result. It passed by accident: the assertion held, but not for the reason the test was written to check.
The test_it_sends_welcome_email test was also passing — but it was checking that WelcomeEmail was sent, and the notification system happened to dispatch WelcomeEmail internally. The test wasn’t wrong, but it was testing at the wrong layer.
Day 2 verdict: Useful, actionable findings. The outdated test analysis was better than I expected — it read the tests, read the models, and identified the discrepancy.
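The "no tests found" half of this analysis can be approximated with a naming convention. The sketch below treats a class Foo as covered when a file named FooTest.php exists. That is an assumption of mine, not how the agent worked, and it cannot see classes that are exercised indirectly through other tests. The function name and file paths are illustrative.

```php
<?php
// Naming-convention heuristic: "Foo" counts as covered if "FooTest.php" exists.
function findUntestedClasses(array $classFiles, array $testFiles): array
{
    // Collect the class names implied by the test file names.
    $tested = [];
    foreach ($testFiles as $file) {
        $base = basename($file, '.php');
        if (str_ends_with($base, 'Test')) {
            $tested[substr($base, 0, -4)] = true;
        }
    }

    // Anything without a matching test file is reported.
    $untested = [];
    foreach ($classFiles as $file) {
        $class = basename($file, '.php');
        if (!isset($tested[$class])) {
            $untested[] = $class;
        }
    }
    return $untested;
}

// Illustrative file lists; in practice these would come from a directory scan.
$classes = ['app/Models/Order.php', 'app/Services/InvoiceService.php', 'app/Models/PriceRule.php'];
$tests = ['tests/Unit/OrderTest.php', 'tests/Feature/UserTest.php'];

print_r(findUntestedClasses($classes, $tests));
```

The second half of the agent's analysis, spotting tests that reference attributes the model no longer has, needs the actual file contents and is where reading the code pays off over a filename match.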
Day 3: The Query Performance Investigation
Task: “Analyse the OrderController and its related queries. Identify potential N+1 problems and missing eager loading.”
This is where the agent made its most impressive move. It read the controller, identified the relationships being accessed, checked whether they were eager-loaded, and then — on its own initiative — ran a SELECT query to understand the data volume:
-- The agent ran this unprompted:
SELECT COUNT(*) as total, AVG(items_count) as avg_items
FROM (
SELECT o.id, COUNT(oi.id) as items_count
FROM orders o
LEFT JOIN order_items oi ON oi.order_id = o.id
GROUP BY o.id
) subquery
Agent finding:
In OrderController::index():
The method loads orders with ->with('customer') but then accesses
$order->items in the view template (line 89 of orders/index.blade.php).
This creates an N+1 for order items.
Average items per order: 8.3 (from database query)
This means for a page of 25 orders, approximately 25 additional queries
fire for items — 26 queries total instead of 2.
In OrderController::show():
The method loads the order without eager loading ANY relationships.
The show view accesses: order.customer, order.items, order.items.product,
order.items.product.category, order.invoice, order.activities.
Estimated queries per show() request: 1 (order) + 1 (customer) +
1 (items) + N (products, one per item) + N (categories) + 1 (invoice) +
1 (activities) = 5 + 2N queries where N = number of items.
For an average order of 8.3 items: approximately 22 queries per show() request.
Fix: Order::with(['customer', 'items.product.category', 'invoice', 'activities'])
I verified this in Telescope. The show() method was indeed making 22-25 queries per request. The agent’s estimate from static analysis was accurate.
Day 3 verdict: Excellent. The agent independently decided to query the database to quantify the problem — I hadn’t told it to do that. The N+1 analysis was precise and actionable.
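The 5 + 2N estimate for show() is easy to check by hand: with the 8.3 average, 5 + 2 × 8.3 = 21.6, which rounds to the 22 the agent reported. A small sketch of the same arithmetic follows, assuming ordinary relationships where eager loading issues one query per relationship level. The function names are mine.

```php
<?php
// Lazy loading in show(): order, customer, items, invoice and activities
// cost one query each (5), plus one product and one category query per item.
function lazyShowQueries(int $items): int
{
    return 5 + 2 * $items;
}

// Eager loading: one query for the order plus one per relationship level
// (customer, items, items.product, items.product.category, invoice, activities).
function eagerShowQueries(): int
{
    return 1 + 6;
}

foreach ([1, 8, 10] as $items) {
    printf("%2d items: %2d lazy vs %d eager\n", $items, lazyShowQueries($items), eagerShowQueries());
}
```

The eager count stays flat as orders grow, which is why the fix matters most for the largest orders.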
Day 4: The Moment It Found Something I’d Forgotten
Task: “Look at the OrderService class and identify areas of concern — complexity, coupling, potential bugs.”
The agent spent 4 tool calls just reading the service class (it’s 618 lines — it had to read it in sections). Then it explored the classes that use it.
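The ReadFileTool shown earlier returns whole files, so reading in sections depends on how the model and SDK handle large results. One way to make sectioned reads explicit is a ranged variant that returns numbered lines. This is a hypothetical helper, not something I ran during the experiment, and readLineRange and its parameters are my own names.

```php
<?php
// Hypothetical helper: return a numbered slice of a file so a large class can
// be read in sections. $start is 1-based.
function readLineRange(string $path, int $start, int $count): string
{
    $lines = file($path, FILE_IGNORE_NEW_LINES);
    if ($lines === false) {
        return "Error: cannot read {$path}";
    }

    $start = max(1, $start);
    // Preserve keys so each line keeps its original (0-based) index.
    $slice = array_slice($lines, $start - 1, max(0, $count), true);

    $out = [];
    foreach ($slice as $index => $line) {
        $out[] = ($index + 1) . ': ' . $line;
    }
    $out[] = sprintf('[%d of %d lines shown]', count($slice), count($lines));

    return implode("\n", $out);
}

// Demo on a temporary five-line file.
$tmp = tempnam(sys_get_temp_dir(), 'agent');
file_put_contents($tmp, "one\ntwo\nthree\nfour\nfive\n");
echo readLineRange($tmp, 2, 2), PHP_EOL;
unlink($tmp);
```

Numbering the lines lets the agent cite a specific location in its findings, and the trailing count tells it how much of the file is still unread.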
Agent finding (abridged):
The OrderService has a method calculateOrderTotal() that takes a $discountCode parameter.
This method:
1. Validates the discount code (calls DiscountService::validate())
2. Calculates line item totals
3. Applies the discount
4. Calculates tax
5. Returns the total
Potential bug identified: The discount validation on line 247 checks
if the discount code is 'active' but does NOT check if it has exceeded its
usage_limit. The Coupon model has a usage_limit column and a times_used counter,
but OrderService::calculateOrderTotal() does not check times_used >= usage_limit.
The check DOES exist in CheckoutController::applyDiscount() (route: POST /checkout/discount),
but it does NOT exist in the direct API route POST /api/orders (which also accepts
a discount_code parameter and calls OrderService::calculateOrderTotal directly).
A user who knows the API could bypass the UI validation and use an expired
or over-limit coupon via the API endpoint.
I pulled up the code. The agent was exactly right. The CheckoutController had the usage limit check — added 14 months ago when a bug was reported. But the API route, which was added 8 months ago for the mobile app, called OrderService::calculateOrderTotal() directly and inherited the original, pre-fix version of the validation.
The coupon usage limit bypass had existed in the API for 8 months. Nobody noticed because most customers use the web checkout, not the API directly.
Day 4 verdict: This finding alone justified the entire experiment. A real security/business logic bug, buried in a 618-line service class, invisible to code review because the relevant context was spread across two files added 6 months apart.
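The bug existed because the usage-limit rule lived in one caller instead of in the code every caller shares. One way to close that kind of gap is to put the rule on the coupon itself. The sketch below uses a plain-PHP stand-in rather than the real Eloquent model. The field names follow the agent's finding, but the class shape, the validateDiscount function, and the choice to treat a null limit as unlimited are my assumptions, not the fix as shipped.

```php
<?php
// Hypothetical stand-in for the Coupon model. Treating a null usage limit as
// "unlimited" is an assumption made for this sketch.
final class Coupon
{
    public function __construct(
        public readonly string $status,
        public readonly ?int $usageLimit,
        public readonly int $timesUsed,
    ) {
    }

    public function isRedeemable(): bool
    {
        if ($this->status !== 'active') {
            return false;
        }
        return $this->usageLimit === null || $this->timesUsed < $this->usageLimit;
    }
}

// Both entry points (web checkout and API) would call this one rule.
function validateDiscount(Coupon $coupon): void
{
    if (!$coupon->isRedeemable()) {
        throw new DomainException('Coupon is inactive or has reached its usage limit.');
    }
}

validateDiscount(new Coupon('active', 100, 42)); // under the limit, no exception

try {
    validateDiscount(new Coupon('active', 100, 100)); // at the limit
} catch (DomainException $e) {
    echo $e->getMessage(), PHP_EOL;
}
```

With the check in one place, a new entry point inherits it by default instead of having to remember it.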
Day 5: What It Got Wrong
Task: “Identify any models that are candidates for splitting — too many responsibilities, too many relationships.”
The agent flagged the Product model as a candidate for splitting, suggesting that pricing logic (PriceRule), inventory tracking (StockMovement), and product metadata (ProductAttribute) should be separate domain models with their own services.
This was technically reasonable but practically wrong. The Product model is large because the business domain is complex — the relationships aren’t incidental coupling, they’re essential domain structure. Splitting the model would have introduced cross-domain queries without meaningful benefit.
The agent didn’t have the context to know this. It applied a pattern (large model = over-coupled) without understanding the domain rationale for the model’s complexity.
What the agent didn't know:
- That ProductAttribute was added recently and IS a candidate for extraction
- That StockMovement is already in the process of being moved to its own service
- That PriceRule splitting had been discussed and rejected for business reasons
- The 2-year history of decisions that shaped the current structure
Day 5 verdict: The structural refactoring suggestions were a mixed bag. Some were valid (ProductAttribute). Some were correct in the abstract but wrong for the specific context (PriceRule). The agent had no way to know the difference without domain context I never gave it.
Day 6: Test Running and Regression Detection
Task: “Run the full test suite for the OrderService and report any failures or flaky tests.”
The agent ran tests in batches, filtering by service name:
Agent: Running php artisan test --filter=OrderService...
Agent: Running php artisan test --filter=OrderTest...
Agent: Running php artisan test --filter=Order...
Results: 23 tests, all passed. But the agent noticed something:
Agent observation:
test_it_applies_discount_correctly runs in 2.3 seconds — significantly
slower than other tests in the suite (average: 0.08s).
Examining the test... it calls OrderService::calculateOrderTotal() with a real
DiscountService::validate() call, which makes an HTTP request to an external
discount validation API. This test is not using a mock.
If the external API is unavailable or slow, this test will be slow or fail
intermittently. Recommend mocking the DiscountService in this test.
Correct. That test had been flaky in CI for months — nobody had traced it back to the real HTTP call.
Day 6 verdict: The slow test finding was useful. The value came from analysing the test output for patterns, not just from reading pass/fail.
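Spotting an outlier like this does not need an agent once the durations are available. Here is a minimal sketch that flags tests running far slower than the rest. It compares against the median rather than the average the agent quoted, because a single slow test drags the average up. The function name, the factor of 10, and every test name except test_it_applies_discount_correctly are hypothetical.

```php
<?php
// Flag tests whose duration exceeds $factor times the median duration.
// $durations maps test name => seconds.
function findSlowTests(array $durations, float $factor = 10.0): array
{
    if ($durations === []) {
        return [];
    }

    $sorted = array_values($durations);
    sort($sorted);
    $mid = intdiv(count($sorted), 2);
    $median = count($sorted) % 2 === 1
        ? $sorted[$mid]
        : ($sorted[$mid - 1] + $sorted[$mid]) / 2;

    // array_filter keeps the test names as keys.
    return array_filter($durations, fn (float $seconds): bool => $seconds > $median * $factor);
}

// Illustrative timings in seconds.
$durations = [
    'test_it_creates_an_order'           => 0.07,
    'test_it_calculates_tax'             => 0.08,
    'test_it_applies_discount_correctly' => 2.3,
    'test_it_rejects_empty_orders'       => 0.09,
];

print_r(findSlowTests($durations));
```

A report like this only says which test is slow. Tracing the cause to an unmocked HTTP call still required reading the test, which is the part the agent did.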
Day 7: Asking It to Write Code (With Supervision)
For the final day, I added a WriteFileTool — with a confirmation prompt before any write:
// app/Agents/Tools/WriteFileTool.php
use function Laravel\Prompts\confirm;
use function Laravel\Prompts\info;
class WriteFileTool extends Tool
{
public string $name = 'write_file';
public string $description = 'Write content to a file. ALWAYS confirm with the user before writing.';
public function __invoke(
string $path,
string $content,
#[Description('Explanation of what this change does and why')]
string $reason,
): string {
// The agent is instructed to always confirm, but we enforce it in code too.
// $this->components belongs to console commands, not tools, so the prompt
// uses Laravel Prompts, which works when the agent runs from an interactive command.
info("Agent wants to write: {$path}");
info("Reason: {$reason}");
if (!confirm('Allow this write?', default: false)) {
return "Write cancelled by user.";
}
file_put_contents(base_path($path), $content);
return "File written: {$path}";
}
}
Task: “Fix the N+1 problem you identified on Day 3 in OrderController::index(). Run the relevant tests before and after to verify no regressions.”
The agent:
- Re-read OrderController::index() to confirm its earlier finding
- Ran the existing OrderControllerTest (all passed)
- Proposed adding ->with('items') to the query
- Asked for confirmation (I approved)
- Wrote the change
- Re-ran the tests (all still passed)
- Reported the fix
The fix was correct. The tests confirmed it didn’t break anything. The agent ran the tests both before and after the change, as the task asked, and it re-read the controller first without being told to, a cautious habit it had likely inferred from the system prompt.
Day 7 verdict: Supervised code writing worked well for a targeted, well-understood fix. The agent was cautious — it re-read the code, verified tests before changing, and verified again after.
What I Learned About Running AI Agents on Real Codebases
What it does well:
1. Cross-file pattern recognition. The coupon bypass bug required connecting CheckoutController to OrderService to the API route to the Coupon model. A human code reviewer might miss this unless they happened to be looking at all four simultaneously. The agent reads across all of them in a single session and keeps them in context together.
2. Quantifying problems. Deciding to run a SQL query to measure the N+1 impact — that wasn’t prompted. The agent independently decided to put numbers on the problem rather than just describe it.
3. Tedious analysis at scale. Reading 618 lines of a service class, cross-referencing with 14 callers, building a complete picture of how the method is used — this takes a human developer significant time and effort. The agent did it in minutes.
4. Test output analysis. Noticing that one test is 30× slower than the others and tracing why — that’s the kind of observation that’s easy to miss when you’re focused on pass/fail.
What it doesn’t do well:
1. Domain context. The Product model refactoring suggestions showed the hard limit: the agent can identify patterns but can’t know which patterns are justified by business decisions made two years ago. It doesn’t know what it doesn’t know.
2. Context that isn’t in the code. The agent had no access to the git history, PR comments, internal documentation, or Slack discussions where architectural decisions were explained. Code is often the end result of non-obvious reasoning that lives nowhere in the code.
3. Knowing when to stop. Given a vague task like “find areas of concern,” the agent will find concerns everywhere — some real, some not, all presented with similar confidence. Humans calibrate urgency from context. The agent struggles with triage.
The Rules I’d Apply to Any Agent with Codebase Access
✓ Read-only by default — no WriteFileTool in standard sessions
✓ WriteFileTool requires explicit human confirmation per write
✓ No access to .env, secrets, credentials — blocked at the tool level
✓ No access to production database — staging only
✓ SQL tool restricted to SELECT — no DDL or DML
✓ Artisan tool whitelisted — only safe read commands
✓ All tool calls logged for audit
✓ maxSteps cap — prevent runaway agent loops
✓ System prompt that includes "report uncertainty explicitly"
✓ Every finding reviewed by a human before acting on it
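The rule about .env and credentials only holds if a file tool checks for it. Below is a minimal sketch of a deny-list such a tool could consult before reading or writing. The isSecretPath function and its patterns are hypothetical and deliberately incomplete, and the check should run on the resolved path rather than on the raw string the model supplies.

```php
<?php
// Hypothetical deny-list for file tools. Patterns are illustrative, not complete.
function isSecretPath(string $relativePath): bool
{
    $path = strtolower(str_replace('\\', '/', $relativePath));
    $name = basename($path);

    // .env, .env.production, .env.backup and similar.
    if ($name === '.env' || str_starts_with($name, '.env.')) {
        return true;
    }

    // Composer stores repository credentials in auth.json.
    if ($name === 'auth.json') {
        return true;
    }

    // Private keys and certificate bundles.
    return in_array(pathinfo($name, PATHINFO_EXTENSION), ['pem', 'key', 'p12'], true);
}

foreach (['.env', 'config/app.php', 'storage/oauth-private.key'] as $candidate) {
    echo $candidate, ' => ', isSecretPath($candidate) ? 'blocked' : 'allowed', PHP_EOL;
}
```

A deny-list like this is a backstop. Files that hard-code secrets under ordinary names will still pass, so keeping credentials out of the repository remains the main defence.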
The tools are not the risk. The risk is acting on AI findings without human verification. The coupon bug the agent found was real. The Product model refactoring suggestion was misguided. Without human judgment to distinguish them, acting on both would have been a mistake.
Final Thoughts
The experiment produced real value. One legitimate security bug, several N+1 findings with accurate impact estimates, a flaky test traced to its root cause, and an architectural map of a codebase I work in daily but had never looked at from the outside.
None of those findings were things I couldn’t have found myself. But some of them — the coupon bypass bug in particular — I hadn’t found in eight months of working on the codebase. The agent found it in four hours.
The honest framing: AI agents don’t replace code review or senior developer judgment. They do something different — they systematically apply a specific type of analysis (cross-file pattern matching, exhaustive code reading) that humans do less thoroughly because it’s tedious and time-consuming. The value is in doing the tedious part at machine speed and presenting the findings for human evaluation.
The findings are inputs, not conclusions. That framing makes agents genuinely useful without over-relying on their judgment in areas — like domain context and architectural trade-offs — where they lack essential information.
Use them for investigation. Review their output carefully. Act on the findings you can verify.
