Stop Sending Your Documents to Third-Party APIs: Parsel Parses PDFs, Word Files, and Images Locally

PDFs, Word documents, spreadsheets, presentations, scanned images with OCR — Parsel extracts text, structured data, and coordinates locally on your server, with a fluent PHP API that feels like it was built for Laravel. No API keys. No third-party services. No per-page billing.


Every application that handles user-uploaded documents faces the same moment: “We need to extract text from this PDF.” And then comes the familiar chain of decisions. Call an external API? Which one? What happens to the data? What’s the per-page cost at scale? What if the document contains sensitive information — contracts, medical records, financial statements — that shouldn’t leave the server?

Parsel by ShipfastLabs answers all of those questions at once.

It’s a fast, open-source PHP document parser that processes files locally on your server: no external service, no data leaving your infrastructure, no per-call pricing. It supports PDFs, Word documents (docx), Excel spreadsheets (xlsx), PowerPoint presentations (pptx), and images (including OCR with Tesseract). And its fluent API keeps most extraction tasks to a single chained expression.


What Parsel Actually Does

Parsel is a PHP wrapper around liteparse (the lit binary), a high-performance document parsing engine that does the heavy lifting. Parsel gives you a fluent PHP API on top of it and handles binary resolution, process management, and streaming of large documents. It also ships a fake runner for testing.

What you can extract:

  • Plain text — the full document as a string, headers removed
  • Structured page data — every text item with its x, y, width, height, font name, font size, and OCR confidence score
  • Page screenshots — render pages as images
  • Metadata — document properties
  • Coordinates — exact position of every text element on every page

And it processes everything locally. Your invoices, contracts, and medical records stay on your server.


Installation

PHP Package

composer require shipfastlabs/parsel

Parsel requires PHP 8.4 or greater.

The lit Binary

Parsel delegates the actual parsing to the lit binary from liteparse. Install it using whichever toolchain is available in your environment:

# Via Cargo (Rust)
cargo install liteparse

# Via pip (Python)
pip install liteparse

# Via npm (Node.js)
npm i -g @llamaindex/liteparse

For Office documents and images, you also need:

  • LibreOffice — for .docx, .xlsx, .pptx conversion
  • ImageMagick — for image processing
  • Tesseract — for OCR support (multilingual)

# Ubuntu / Debian
apt-get install libreoffice imagemagick tesseract-ocr

# macOS
brew install libreoffice imagemagick tesseract
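
Before deploying, it can help to confirm that these tools are actually reachable from PHP. The sketch below is not part of Parsel; it simply searches PATH for a handful of executables. The names soffice, magick, convert, and tesseract are assumptions about how these packages are commonly installed and may differ on your system.

```php
<?php

// Hypothetical helper, not part of Parsel: look for an executable on PATH.
function findOnPath(string $name, ?string $pathEnv = null): ?string
{
    $pathEnv ??= (string) getenv('PATH');

    foreach (explode(PATH_SEPARATOR, $pathEnv) as $dir) {
        if ($dir === '') {
            continue;
        }
        $candidate = rtrim($dir, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR . $name;
        if (is_file($candidate) && is_executable($candidate)) {
            return $candidate;
        }
    }

    return null;
}

// Executable names are assumptions; adjust them for your platform.
foreach (['lit', 'soffice', 'magick', 'convert', 'tesseract'] as $tool) {
    $path = findOnPath($tool);
    echo str_pad($tool, 10), $path ?? 'not found', PHP_EOL;
}
```

Running this once in a deploy script or health check surfaces a missing dependency before the first user upload does.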

The Basics: Text Extraction

The simplest use case — extract plain text from a document:

use Shipfastlabs\Parsel;

// Extract all text from a PDF
$text = Parsel::file('invoice.pdf')->text();

// From a Word document
$text = Parsel::file('contract.docx')->text();

// From an Excel spreadsheet
$text = Parsel::file('report.xlsx')->text();

// From a PowerPoint presentation
$text = Parsel::file('deck.pptx')->text();

The same ->text() method works across all supported file types. You don’t need to know or care which parser handles each format — Parsel routes to the right engine based on the file extension.

Parsing Raw Bytes

For uploaded files or database blobs that haven’t been written to disk, use bytes(). Because there’s no filename to infer the type from, you provide the extension explicitly:

// From an uploaded file (Laravel)
$bytes = $request->file('document')->get();
$text  = Parsel::bytes($bytes, 'pdf')->text();

// From a database blob
$document = Document::find($id);
$text     = Parsel::bytes($document->file_content, 'docx')->text();
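
Since bytes() relies on the extension you hand it, it is worth normalising and validating that value first. The helper below is a minimal sketch, not part of Parsel: the function name and the allowlist are hypothetical, and the image extensions in particular are assumptions about what your setup supports.

```php
<?php

// Hypothetical allowlist; the image extensions in particular are assumptions.
const SUPPORTED_EXTENSIONS = ['pdf', 'docx', 'xlsx', 'pptx', 'png', 'jpg', 'jpeg'];

// Derive a normalised, lower-case extension from a client-supplied filename.
function extensionFor(string $originalName): string
{
    $extension = strtolower(pathinfo($originalName, PATHINFO_EXTENSION));

    if (!in_array($extension, SUPPORTED_EXTENSIONS, true)) {
        throw new InvalidArgumentException("Unsupported file type: '{$extension}'");
    }

    return $extension;
}

echo extensionFor('Q3 Report.PDF'), PHP_EOL;    // pdf
echo extensionFor('contract.v2.docx'), PHP_EOL; // docx
```

In a Laravel controller the filename would typically come from getClientOriginalName() on the uploaded file. Client-supplied names are untrusted, so a MIME type check alongside this is a sensible addition.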

Structured Documents: Coordinates, Fonts, and Confidence Scores

When you need more than raw text — when you’re building a document intelligence feature, extracting specific fields from invoices, or validating form layouts — ->parse() returns a full Document object.

$document = Parsel::file('invoice.pdf')->parse();

// Document-level data
echo $document->text;        // full text
echo $document->pageCount(); // number of pages
print_r($document->metadata);

// Page-level data
foreach ($document->pages as $page) {
    echo "Page {$page->number}: {$page->width}×{$page->height}px\n";
    echo $page->text;

    // Item-level data — every text element with position
    foreach ($page->items as $item) {
        echo "{$item->text}";
        echo " @ ({$item->x}, {$item->y})";
        echo " size: {$item->width}×{$item->height}";
        echo " font: {$item->fontName} {$item->fontSize}pt";
        echo " confidence: {$item->confidence}\n";
    }
}

The Full Document Data Model

Document
├── text         → full document text as string
├── metadata     → document properties (author, title, creation date, etc.)
├── pageCount()  → total number of pages
└── pages[]
    ├── number   → 1-based page number
    ├── width    → page width in points/pixels
    ├── height   → page height in points/pixels
    ├── text     → full page text
    └── items[]
        ├── text       → the text content of this item
        ├── x          → horizontal position
        ├── y          → vertical position
        ├── width      → bounding box width
        ├── height     → bounding box height
        ├── fontName   → font family
        ├── fontSize   → font size in points
        └── confidence → OCR confidence score (0-100, when OCR is used)
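
To give a feel for what the coordinates make possible, here is a small sketch that finds the value printed to the right of a label. It works on plain arrays that mirror the item fields in the tree above, so it runs without the lit binary. The function name, the tolerance, and the sample data are all made up for illustration, and it assumes items on the same line share a similar y value.

```php
<?php

// Hypothetical helper: return the text of the nearest item to the right of a label.
// Items are plain arrays here, mirroring the item fields in the data model.
function valueRightOf(array $items, string $label, float $yTolerance = 3.0): ?string
{
    foreach ($items as $anchor) {
        if (trim($anchor['text']) !== $label) {
            continue;
        }

        $rightEdge = $anchor['x'] + $anchor['width'];
        $best = null;

        foreach ($items as $item) {
            $sameLine = abs($item['y'] - $anchor['y']) <= $yTolerance;
            if ($sameLine && $item['x'] >= $rightEdge
                && ($best === null || $item['x'] < $best['x'])) {
                $best = $item;
            }
        }

        return $best['text'] ?? null;
    }

    return null;
}

// Made-up sample data in the shape of page items.
$items = [
    ['text' => 'Invoice No:', 'x' => 400.0, 'y' => 52.0, 'width' => 60.0],
    ['text' => 'INV-1042',    'x' => 470.0, 'y' => 52.5, 'width' => 48.0],
    ['text' => 'Date:',       'x' => 400.0, 'y' => 68.0, 'width' => 28.0],
    ['text' => '2024-03-01',  'x' => 470.0, 'y' => 68.2, 'width' => 55.0],
];

echo valueRightOf($items, 'Date:'), PHP_EOL; // 2024-03-01
```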

As an Array

If you’re storing the parsed output, queuing it, or returning it as JSON:

$array = Parsel::file('document.pdf')->toArray();
// Serialisable, cacheable, queueable

Page Selection: Parse Only What You Need

For documents with dozens or hundreds of pages, parsing the whole thing when you only need a few is wasteful. Parsel’s page selection API is flexible and additive:

// Single page
Parsel::file('report.pdf')->page(7)->text();

// Specific pages
Parsel::file('report.pdf')->pages(1, 3, 5)->text();

// Page range
Parsel::file('report.pdf')->pageRange(1, 5)->text();

// Range as string + individual page
Parsel::file('report.pdf')->pages('1-5', 10)->text();

// Combined range and specific page
Parsel::file('report.pdf')->pageRange(1, 5)->page(10)->text();

// Cap total pages parsed
Parsel::file('report.pdf')->maxPages(50)->text();

Page selection methods are chainable and additive — you can combine them freely before calling ->text() or ->parse().
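
One way to picture "additive" selection is as a set of page numbers that each call adds to. The sketch below illustrates that idea only; it is not Parsel’s implementation, and whether Parsel itself sorts or de-duplicates the selection is an assumption here.

```php
<?php

// Hypothetical illustration of additive page selection; not Parsel's internals.
// Accepts ints and 'from-to' strings, returns a sorted list without duplicates.
function expandPages(int|string ...$specs): array
{
    $pages = [];

    foreach ($specs as $spec) {
        if (is_int($spec) || ctype_digit($spec)) {
            $pages[] = (int) $spec;
        } elseif (preg_match('/^(\d+)-(\d+)$/', $spec, $m) === 1 && (int) $m[1] <= (int) $m[2]) {
            array_push($pages, ...range((int) $m[1], (int) $m[2]));
        } else {
            throw new InvalidArgumentException("Invalid page spec: {$spec}");
        }
    }

    $pages = array_values(array_unique($pages));
    sort($pages);

    return $pages;
}

echo implode(',', expandPages('1-5', 10)), PHP_EOL;   // 1,2,3,4,5,10
echo implode(',', expandPages(7, '6-8', 2)), PHP_EOL; // 2,6,7,8
```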


OCR: Scanned Documents and Images

By default, OCR is disabled to keep parsing fast. Enable it when you’re working with scanned PDFs, photographs of documents, or images:

// Basic OCR (uses system default language)
$text = Parsel::file('scanned-receipt.pdf')->withOcr()->text();

// OCR with options
$text = Parsel::file('french-document.pdf')
    ->withOcr(
        language:     'fra',                     // Tesseract language code
        tessdataPath: '/usr/share/tessdata',      // custom tessdata location
        serverUrl:    'http://localhost:8828/ocr', // OCR server endpoint (local in this example)
        workers:      8,                           // parallel workers
    )
    ->text();

// Image with OCR
$text = Parsel::file('invoice-photo.jpg')
    ->withOcr(language: 'eng')
    ->text();

// Explicitly disable OCR (already the default; this just states the intent)
$text = Parsel::file('digital-pdf.pdf')->withoutOcr()->text();

OCR confidence scores appear in item-level data when you use ->parse() — useful for filtering out low-confidence extractions or flagging documents that need human review.
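
A review filter along those lines could look like the sketch below. It uses plain arrays in place of parsed items so it runs on its own. The function name and the threshold of 80 are arbitrary choices, and it assumes confidence is null for text that did not come from OCR.

```php
<?php

// Assumptions: confidence is on a 0-100 scale and is null for items that
// did not come from OCR. The default threshold of 80 is arbitrary.
function lowConfidenceItems(array $items, float $threshold = 80.0): array
{
    return array_values(array_filter(
        $items,
        fn (array $item): bool => $item['confidence'] !== null
            && $item['confidence'] < $threshold,
    ));
}

// Made-up sample items.
$items = [
    ['text' => 'TOTAL',  'confidence' => 96.0],
    ['text' => '1O4.50', 'confidence' => 61.5], // likely a misread "104.50"
    ['text' => 'Thanks', 'confidence' => null], // native text, no OCR involved
];

$suspect = lowConfidenceItems($items);
$needsReview = count($suspect) > 0;

echo $needsReview ? 'flag for review: ' . $suspect[0]['text'] : 'ok', PHP_EOL;
```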


Rendering Options

// Set DPI for rendering (higher = better quality, slower)
Parsel::file('document.pdf')->withDpi(300)->parse();

// Preserve small text that might otherwise be dropped
Parsel::file('document.pdf')->preserveSmallText()->text();

// Password-protected PDFs
Parsel::file('confidential.pdf')->withPassword('hunter2')->text();

// Specify a custom lit binary path for this call
Parsel::file('document.pdf')->withBinary('/usr/local/bin/lit')->text();

// Set a timeout for long-running parses
Parsel::file('large-document.pdf')->withTimeout(120)->text();

Global Configuration

For application-wide settings, configure Parsel once in a service provider:

// app/Providers/AppServiceProvider.php
use Shipfastlabs\Parsel;

public function boot(): void
{
    // Set the lit binary globally
    Parsel::usingBinary('/usr/local/bin/lit');

    // Set a global default timeout
    Parsel::defaultTimeout(120);
}

Passing Through Options

If Parsel doesn’t yet have a dedicated method for an option you need, you can pass it directly:

// Boolean flag
Parsel::file('document.pdf')->option('some-new-flag')->text();

// Option with a value
Parsel::file('document.pdf')->option('some-new-flag', 42)->text();

This forward-compatibility mechanism means Parsel stays useful even when lit ships new features before Parsel exposes them as dedicated methods.


Screenshots: Rendering Pages as Images

When you need visual representations of document pages — for thumbnails, previews, or visual verification:

// Render pages 1-5 as images into a directory
$screenshots = Parsel::file('document.pdf')
    ->pageRange(1, 5)
    ->screenshots('/tmp/parsel-pages');

// $screenshots contains the paths to the generated image files
foreach ($screenshots as $imagePath) {
    // upload to S3, generate thumbnails, etc.
}

Pass a dedicated output directory that doesn’t contain unrelated files — Parsel returns all image files found in the directory after parsing.
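
A simple way to guarantee a clean directory is to create a fresh one per call and remove it afterwards. The two helpers below are a sketch using only PHP’s standard library; their names are hypothetical and nothing here is provided by Parsel.

```php
<?php

// Hypothetical helper: create a fresh, empty directory for one screenshots() call.
function makeScreenshotDir(string $prefix = 'parsel-pages-'): string
{
    $dir = sys_get_temp_dir() . DIRECTORY_SEPARATOR . $prefix . bin2hex(random_bytes(8));

    if (!mkdir($dir, 0700) && !is_dir($dir)) {
        throw new RuntimeException("Could not create directory: {$dir}");
    }

    return $dir;
}

// Remove the directory and the files in it once the images have been handled.
function removeScreenshotDir(string $dir): void
{
    foreach (glob($dir . DIRECTORY_SEPARATOR . '*') ?: [] as $file) {
        if (is_file($file)) {
            unlink($file);
        }
    }
    rmdir($dir);
}

$dir = makeScreenshotDir();
// ... pass $dir to ->screenshots($dir) and process the returned paths ...
removeScreenshotDir($dir);
```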


Saving Output to Disk

// Save as plain text
Parsel::file('document.pdf')->save('document.txt');

// Save as JSON (structured document data)
Parsel::file('document.pdf')->save('document.json');

When the path ends in .json, Parsel writes the full structured output. Any other extension writes plain text.


Streaming Large Documents

For large documents, loading everything into memory at once isn’t practical. lazyPages() processes one page at a time, keeping memory usage flat regardless of document size:

// Process a 500-page document without loading it all into memory
foreach (Parsel::file('large-report.pdf')->lazyPages() as $page) {
    foreach ($page->items as $item) {
        // Process item by item
        // Memory stays constant — only the current page is in memory
    }
}

This is the right approach for background jobs that process large documents or when you need to stream output to the user progressively.
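
The memory benefit only holds if the consumer also lets go of each page once it is done with it. The sketch below shows one way to do that: write every page out as a line of JSON as soon as it arrives. A stand-in generator replaces lazyPages() so the example runs without the lit binary; the function names and the NDJSON format are choices made for this illustration.

```php
<?php

// Stand-in for lazyPages(): any iterable that yields one page at a time works.
// Pages are plain arrays here so the sketch runs without the lit binary.
function fakePages(int $count): Generator
{
    for ($n = 1; $n <= $count; $n++) {
        yield ['number' => $n, 'text' => "Text of page {$n}"];
    }
}

// Write each page as one JSON line as soon as it arrives, so earlier pages
// never need to be kept in memory.
function writePagesAsNdjson(iterable $pages, string $path): int
{
    $handle = fopen($path, 'wb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open {$path} for writing");
    }

    $written = 0;
    try {
        foreach ($pages as $page) {
            fwrite($handle, json_encode($page, JSON_THROW_ON_ERROR) . "\n");
            $written++;
        }
    } finally {
        fclose($handle);
    }

    return $written;
}

$out = tempnam(sys_get_temp_dir(), 'pages');
echo writePagesAsNdjson(fakePages(500), $out), " pages written\n";
```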


Real-World Patterns

Invoice Processing Pipeline

class InvoiceProcessor
{
    public function extract(string $path): array
    {
        $document = Parsel::file($path)
            ->withDpi(150)
            ->withOcr(language: 'eng')
            ->parse();

        $headerRegion = [];

        foreach ($document->pages as $page) {
            foreach ($page->items as $item) {
                // Collect items in the top-right of the page, where the
                // invoice number usually sits. This assumes y is measured
                // from the top edge; 150 and 0.6 are layout-specific guesses.
                if ($item->y < 150 && $item->x > $page->width * 0.6) {
                    $headerRegion[] = $item->text;
                }
            }
        }

        return [
            'full_text'     => $document->text,
            'page_count'    => $document->pageCount(),
            'header_region' => $headerRegion,
        ];
    }
}
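
A natural next step is to pick the invoice number out of the header_region strings. The sketch below does that with a regular expression. The pattern ("INV", an optional dash or space, then at least three digits) is purely hypothetical; real invoices vary widely, so treat it as a starting point.

```php
<?php

// Hypothetical pattern: "INV", an optional dash or space, then 3+ digits.
// Returns the first match in upper case, or null when nothing matches.
function findInvoiceNumber(array $headerTexts): ?string
{
    foreach ($headerTexts as $text) {
        if (preg_match('/\bINV[- ]?\d{3,}\b/i', $text, $match) === 1) {
            return strtoupper($match[0]);
        }
    }

    return null;
}

// Made-up header strings, as they might come back in 'header_region'.
$header = ['ACME Ltd.', 'Invoice No: inv-20417', 'Date: 2024-03-01'];

echo findInvoiceNumber($header) ?? 'not found', PHP_EOL; // INV-20417
```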

Background Job with Streaming

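use App\Models\Document; // assumed location of the Eloquent model
use Illuminate\Contracts\Queue\ShouldQueue;
use Shipfastlabs\Parsel;

class ParseLargeDocument implements ShouldQueue
{
    public function __construct(
        private string $documentPath,
        private int    $documentId,
    ) {}

    public function handle(): void
    {
        // Fail fast if the record is gone, before doing any parsing work
        $record = Document::findOrFail($this->documentId);

        $pageContents = [];

        // Only one parsed page is held at a time; the trimmed-down
        // $pageContents array still grows with the page count
        foreach (Parsel::file($this->documentPath)->lazyPages() as $page) {
            $pageContents[] = [
                'page'   => $page->number,
                'text'   => $page->text,
                'items'  => collect($page->items)->map(fn($i) => [
                    'text' => $i->text,
                    'x'    => $i->x,
                    'y'    => $i->y,
                ])->toArray(),
            ];
        }

        $record->update([
            'parsed_content' => $pageContents,
            'parsed_at'      => now(),
        ]);
    }
}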

Caching Expensive Parses

public function getParsedDocument(int $documentId): array
{
    return Cache::remember(
        "document:{$documentId}:parsed",
        now()->addDay(),
        function () use ($documentId) {
            $document = Document::findOrFail($documentId);
            return Parsel::bytes($document->file_content, $document->extension)
                ->withOcr()
                ->toArray();
        }
    );
}
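
Keying the cache on the document ID works as long as a document’s content and parse options never change. An alternative is to derive the key from the file content and the options used, so identical uploads share an entry and a change in options does not reuse stale output. The helper below is a sketch; its name, the key format, and the option names are all hypothetical.

```php
<?php

// Hypothetical helper: build a cache key from file content plus parse options.
function parseCacheKey(string $bytes, array $options = []): string
{
    ksort($options); // make the key independent of option order

    return 'parsel:' . hash('sha256', $bytes) . ':' . md5(json_encode($options, JSON_THROW_ON_ERROR));
}

// Same content and same options in a different order produce the same key.
$a = parseCacheKey('%PDF-1.7 ...', ['ocr' => true, 'language' => 'eng']);
$b = parseCacheKey('%PDF-1.7 ...', ['language' => 'eng', 'ocr' => true]);

var_dump($a === $b); // bool(true)
```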

Testing: The Fake Runner

Parsel ships with a fake runner that lets you test document parsing code without installing or running the real lit binary. This is essential for CI environments and for unit tests that don’t need the full binary stack.

use Shipfastlabs\Parsel;

it('extracts invoice data correctly', function () {
    $fakeOutput = file_get_contents(__DIR__ . '/fixtures/invoice-parsed.json');

    $fake = Parsel::fake([
        '--format json' => $fakeOutput,
    ]);

    $result = Parsel::file('invoice.pdf')->parse();

    // Assertions on the parsed result
    expect($result->pageCount())->toBe(2)
        ->and($result->pages[0]->items[0]->text)->toBe('INVOICE');

    // Assert that the right command was called
    expect($fake->recordedCommands()[0])->toContain('--format', 'json');
});

it('passes OCR options to the binary', function () {
    $fake = Parsel::fake(['invoice.pdf' => '{}']);

    Parsel::file('invoice.pdf')
        ->withOcr(language: 'fra')
        ->text();

    expect($fake->recordedCommands()[0])->toContain('fra');
});

Response keys are matched as substrings of the command line. When multiple responses match, the longest matching key wins — giving you flexible, specific control over mock responses.
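
To make the matching rule concrete, here is a small stand-alone sketch of "substring match, longest key wins". It illustrates the rule as described, not Parsel’s actual code; the function name and the example command line are made up.

```php
<?php

// Sketch of the matching rule described above, not Parsel's actual code.
// Returns the response whose key is the longest substring of the command line.
function resolveFakeResponse(array $responses, string $commandLine): ?string
{
    $bestKey = null;

    foreach (array_keys($responses) as $key) {
        $key = (string) $key;
        if (str_contains($commandLine, $key)
            && ($bestKey === null || strlen($key) > strlen($bestKey))) {
            $bestKey = $key;
        }
    }

    return $bestKey === null ? null : $responses[$bestKey];
}

$responses = [
    '--format json'             => '{"pages":[]}',
    'invoice.pdf --format json' => '{"pages":[{"number":1}]}',
];

// The command line below is made up for illustration.
echo resolveFakeResponse($responses, 'lit parse invoice.pdf --format json'), PHP_EOL;
```

Both keys match that command line, and the longer, more specific one is used. A command for a different file would fall back to the generic '--format json' response.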


Binary Resolution Order

When Parsel needs to call lit, it resolves the binary in this order:

1. withBinary() per-call override
2. Parsel::usingBinary() global configuration
3. PARSEL_LIT_BINARY environment variable
4. lit on the system PATH

If none of these resolve, Parsel throws a BinaryNotFoundException. For production deployments, set PARSEL_LIT_BINARY in your .env to make the binary path explicit:

# .env
PARSEL_LIT_BINARY=/usr/local/bin/lit
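
The resolution order amounts to "first non-empty candidate wins". The sketch below expresses that as a plain function. The function name and parameters are hypothetical, a RuntimeException stands in for BinaryNotFoundException, and the PATH lookup is represented by a hard-coded value.

```php
<?php

// Sketch of the four-step order above; not Parsel's implementation.
// RuntimeException stands in for BinaryNotFoundException.
function resolveLitBinary(?string $perCall, ?string $globalConfig, ?string $envValue, ?string $pathLookup): string
{
    foreach ([$perCall, $globalConfig, $envValue, $pathLookup] as $candidate) {
        if ($candidate !== null && $candidate !== '') {
            return $candidate;
        }
    }

    throw new RuntimeException('lit binary not found');
}

$env = getenv('PARSEL_LIT_BINARY');

echo resolveLitBinary(
    null,                         // no withBinary() override
    null,                         // no Parsel::usingBinary() setting
    $env === false ? null : $env, // PARSEL_LIT_BINARY, if set
    '/usr/bin/lit',               // pretend a PATH lookup found this
), PHP_EOL;
```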

When to Use Parsel

Parsel is the right choice when:

  • Privacy is non-negotiable. Documents processed locally never leave your server. No cloud API sees your users’ contracts, medical records, or financial statements.
  • Cost scales with document volume. Per-page billing from cloud APIs becomes expensive at scale. Parsel’s cost is fixed: the server that runs it.
  • You need structured output. Coordinates, font metadata, and OCR confidence scores for document intelligence use cases.
  • You’re working with diverse file types. One API for PDFs, Word, Excel, PowerPoint, and images — rather than separate integrations for each.
  • Large documents need streaming. lazyPages() keeps memory usage constant regardless of document size.

Quick Reference

// Basic text extraction
Parsel::file('doc.pdf')->text()
Parsel::bytes($bytes, 'pdf')->text()

// Structured output
Parsel::file('doc.pdf')->parse()            // → Document
Parsel::file('doc.pdf')->toArray()          // → array

// Page selection
->page(7)
->pages(1, 3, 5)
->pages('1-5', 10)
->pageRange(1, 5)
->maxPages(50)

// OCR
->withOcr()
->withOcr(language: 'fra', workers: 4)
->withoutOcr()

// Rendering
->withDpi(300)
->preserveSmallText()
->withPassword('secret')
->withTimeout(120)
->withBinary('/path/to/lit')

// Output
->save('output.txt')          // plain text
->save('output.json')         // structured JSON
->screenshots('/tmp/pages')   // page images

// Streaming
->lazyPages()                 // generator, one page at a time

// Global config
Parsel::usingBinary('/path/to/lit')
Parsel::defaultTimeout(120)

// Testing
Parsel::fake(['--format json' => $fakeOutput])

Final Thoughts

Document parsing is one of those features where the “obvious” solution — call an external API — hides significant costs in terms of money, data privacy, and vendor dependency. Parsel makes the better solution just as easy.

The API is genuinely elegant. Parsel::file('invoice.pdf')->withOcr()->text() is as simple as it gets. The structured document output with coordinates and font metadata opens up document intelligence use cases that would be difficult to build any other way. The fake runner means tests work in CI without a binary installed. The lazyPages() API means large documents are a non-issue.

For any PHP application that processes documents — SaaS platforms, legal tech, fintech, healthcare, logistics — Parsel deserves a serious look.
