Extract Structured Data with Laravel AI SDK and JsonSchema


Need to extract structured data from messy HTML, PDFs, or unstructured text? Laravel’s AI SDK has a killer feature: structured output, which lets you define exactly what data you want and get back a validated array in exactly that shape.

The Problem

You’re scraping product data from old supplier websites. Some pages have tables, others have random HTML layouts. You need to extract product names, prices, features, and locations consistently.

The Solution: JsonSchema + AI SDK

use Illuminate\Contracts\JsonSchema\JsonSchema;
use Laravel\Ai\Files\Document;

use function Laravel\Ai\agent;

// Fetch the raw page; file_get_contents() returns false on failure.
$html = file_get_contents($productUrl);

if ($html === false) {
    throw new RuntimeException("Could not fetch {$productUrl}");
}

// Wrap the markup in a Document attachment (sent as plain text).
$document = Document::fromString($html, 'text/plain');

$response = agent(
    instructions: 'Extract product information from the HTML',
    // The schema describes the exact shape the response must have.
    schema: fn (JsonSchema $schema) => [
        'name' => $schema->string()->required(),
        'price' => $schema->number()->min(0)->required(),
        'features' => $schema->array()->items($schema->string()),
        'location' => $schema->object([
            'city' => $schema->string()->required(),
            'state' => $schema->string()->required(),
        ])->withoutAdditionalProperties(),
    ],
)->prompt(
    'Extract all product details from this page.',
    attachments: [$document],
);

// The structured response as a plain array.
$data = $response->toArray();
// ['name' => 'Pro Widget', 'price' => 29.99, ...]

What’s Happening Here

  1. You define a JsonSchema describing your exact data structure
  2. The LLM reads the messy HTML and extracts the data
  3. The AI SDK validates the response against your schema
  4. You get back a clean, type-safe array matching your spec

The schema acts as a contract. If the LLM tries to return invalid data (missing required fields, wrong types, extra keys), the SDK catches it.
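Even with the schema as a contract, it can be worth re-checking the array at the boundary before it reaches your database. Here is a minimal sketch in plain PHP, assuming the product schema from the example above. The productShapeErrors() helper is hypothetical and not part of the Laravel AI SDK, and the sample values are made up for illustration.

```php
/**
 * Hypothetical helper (not part of the Laravel AI SDK): re-checks an
 * extracted product array against the shape the schema asked for.
 *
 * @return list<string> Human-readable problems; an empty list means the shape is fine.
 */
function productShapeErrors(array $data): array
{
    $errors = [];

    if (!is_string($data['name'] ?? null) || $data['name'] === '') {
        $errors[] = 'name must be a non-empty string';
    }

    $price = $data['price'] ?? null;
    if (!(is_int($price) || is_float($price)) || $price < 0) {
        $errors[] = 'price must be a number >= 0';
    }

    // features is not required in the schema, so a missing key is fine.
    $features = $data['features'] ?? [];
    if (!is_array($features) || array_filter($features, fn ($f) => !is_string($f)) !== []) {
        $errors[] = 'features must be a list of strings';
    }

    // location is not required either, but city and state are required inside it.
    $location = $data['location'] ?? null;
    if ($location !== null) {
        foreach (['city', 'state'] as $key) {
            if (!is_array($location) || !is_string($location[$key] ?? null)) {
                $errors[] = "location.$key must be a string";
            }
        }
    }

    return $errors;
}

// Made-up sample data: one well-formed record, one with the price as a string.
$good = [
    'name' => 'Pro Widget',
    'price' => 29.99,
    'features' => ['Waterproof'],
    'location' => ['city' => 'Austin', 'state' => 'TX'],
];
$bad = ['name' => 'Pro Widget', 'price' => '29.99'];

print_r(productShapeErrors($good)); // empty list
print_r(productShapeErrors($bad));  // one entry describing the price problem
```

In a Laravel app you would probably reach for the framework's validator instead of hand-written checks. The point is only that the extracted array should be treated like any other external input.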

Schema Features

You can define:

  • Types: string(), number(), integer(), boolean(), array(), object()
  • Constraints: min(), max(), enum(['Option1', 'Option2'])
  • Requirements: required(), nullable()
  • Nested structures: Objects within arrays, arrays within objects
  • Descriptions: description('...') to guide the LLM
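To show how these pieces might combine, here is a sketch of a schema for the invoice-parsing case. It uses only the builder methods listed above plus items() and withoutAdditionalProperties() from the earlier example. The field names, enum values, and the exact chaining order are assumptions for illustration, so check them against the SDK version you have installed.

```php
use Illuminate\Contracts\JsonSchema\JsonSchema;

// Hypothetical invoice schema; field names and enum values are illustrative.
$invoiceSchema = fn (JsonSchema $schema) => [
    'invoice_number' => $schema->string()
        ->description('The invoice identifier as printed on the document')
        ->required(),

    // enum() restricts the value to a fixed set of options.
    'currency' => $schema->string()->enum(['USD', 'EUR', 'GBP'])->required(),

    // nullable() lets the model return null when the document has no due date.
    'due_date' => $schema->string()
        ->description('Due date in YYYY-MM-DD format, or null if not stated')
        ->nullable(),

    // Nested structure: an array whose items are objects.
    'line_items' => $schema->array()->items(
        $schema->object([
            'description' => $schema->string()->required(),
            'quantity' => $schema->integer()->min(1)->required(),
            'unit_price' => $schema->number()->min(0)->required(),
        ])->withoutAdditionalProperties()
    )->required(),
];
```

The closure can then be passed as the schema: argument to agent(), in the same way as the inline closure in the product example.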

Real-World Use Cases

  • Data migration: Extract structured records from legacy system exports
  • Content extraction: Pull article metadata from blog posts
  • Invoice parsing: Extract line items, totals, dates from PDF invoices
  • Form filling: Auto-populate forms from uploaded documents

Why This Beats Regex or DOM Parsing

Traditional scraping breaks when layouts change, because regexes and selectors are tied to the markup. An LLM works from what the content means rather than where it sits in the page, so it can find “the product price” even if it moves from a table to a div to a JSON blob embedded in a script tag.

The JsonSchema ensures you still get reliable, validated output, even when the source format is chaos.

Daryle De Silva

VP of Technology

11+ years building and scaling web applications. Writing about what I learn in the trenches.
