Need to extract structured data from messy HTML, PDFs, or unstructured text? Laravel’s AI SDK has a killer feature: structured output that lets you define exactly what data you want and get back a validated, type-safe array.
The Problem
You’re scraping product data from old supplier websites. Some pages have tables, others have random HTML layouts. You need to extract product names, prices, features, and locations consistently.
The Solution: JsonSchema + AI SDK
use function Laravel\Ai\agent;
use Illuminate\Contracts\JsonSchema\JsonSchema;
use Laravel\Ai\Files\Document;

// Fetch the page and wrap it as an attachment
$html = file_get_contents($productUrl);
$document = Document::fromString($html, 'text/plain');

$response = agent(
    instructions: 'Extract product information from the HTML',
    schema: fn (JsonSchema $schema) => [
        'name' => $schema->string()->required(),
        'price' => $schema->number()->min(0)->required(),
        'features' => $schema->array()->items($schema->string()),
        'location' => $schema->object([
            'city' => $schema->string()->required(),
            'state' => $schema->string()->required(),
        ])->withoutAdditionalProperties(),
    ],
)->prompt(
    'Extract all product details from this page.',
    attachments: [$document],
);

$data = $response->toArray();
// ['name' => 'Pro Widget', 'price' => 29.99, ...]
What’s Happening Here
- You define a JsonSchema describing your exact data structure
- The LLM reads the messy HTML and extracts the data
- The AI SDK validates the response against your schema
- You get back a clean, type-safe array matching your spec
The schema acts as a contract. If the LLM tries to return invalid data (missing required fields, wrong types, extra keys), the SDK catches it.
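Under the hood, a fluent definition like the one above compiles down to standard JSON Schema, which is what the model is steered against. As a rough sketch (the SDK's exact output may differ), the example corresponds to something like:

```json
{
    "type": "object",
    "properties": {
        "name": { "type": "string" },
        "price": { "type": "number", "minimum": 0 },
        "features": {
            "type": "array",
            "items": { "type": "string" }
        },
        "location": {
            "type": "object",
            "properties": {
                "city": { "type": "string" },
                "state": { "type": "string" }
            },
            "required": ["city", "state"],
            "additionalProperties": false
        }
    },
    "required": ["name", "price"]
}
```

Thinking in these terms helps when debugging: if the model keeps returning a field you didn't ask for, `additionalProperties: false` (via withoutAdditionalProperties()) is how you shut it out.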
Schema Features
You can define:
- Types: string(), number(), integer(), boolean(), array(), object()
- Constraints: min(), max(), enum(['Option1', 'Option2'])
- Requirements: required(), nullable()
- Nested structures: objects within arrays, arrays within objects
- Descriptions: description('...') to guide the LLM
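Combining these, a richer product schema might look like the sketch below. It uses the same fluent builder methods as the earlier example; the field names (condition, weight_kg, variants) are hypothetical additions for illustration:

```php
'condition' => $schema->string()
    // Constrain the model to a fixed set of values
    ->enum(['new', 'used', 'refurbished'])
    ->required(),
'weight_kg' => $schema->number()->min(0)->nullable()
    // A description guides the LLM toward the right source field
    ->description('Shipping weight in kilograms, if listed'),
// Nested structure: an array of objects
'variants' => $schema->array()->items(
    $schema->object([
        'sku' => $schema->string()->required(),
        'price' => $schema->number()->min(0)->required(),
    ])->withoutAdditionalProperties(),
),
```

Marking a field nullable() rather than omitting it is useful when the data is sometimes absent: you still get the key back, just with a null value, so downstream code stays simple.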
Real-World Use Cases
- Data migration: Extract structured records from legacy system exports
- Content extraction: Pull article metadata from blog posts
- Invoice parsing: Extract line items, totals, dates from PDF invoices
- Form filling: Auto-populate forms from uploaded documents
Why This Beats Regex or DOM Parsing
Traditional scraping breaks when layouts change. LLMs understand meaning, not structure. They can find “the product price” even if it moves from a table to a div to a JSON blob embedded in a script tag.
The JsonSchema ensures you still get reliable, validated output—even when the source format is chaos.