Need to extract structured data from messy HTML, PDFs, or unstructured text? Laravel’s AI SDK has a killer feature: structured output that lets you define exactly what data you want and get back a validated, type-safe array.
The Problem
You’re scraping product data from old supplier websites. Some pages have tables, others have random HTML layouts. You need to extract product names, prices, features, and locations consistently.
The Solution: JsonSchema + AI SDK
use function Laravel\Ai\agent;
use Illuminate\Contracts\JsonSchema\JsonSchema;
use Laravel\Ai\Files\Document;

// Fetch the page and wrap it as an attachment
$html = file_get_contents($productUrl);
$document = Document::fromString($html, 'text/plain');

$response = agent(
    instructions: 'Extract product information from the HTML',
    schema: fn (JsonSchema $schema) => [
        'name' => $schema->string()->required(),
        'price' => $schema->number()->min(0)->required(),
        'features' => $schema->array()->items($schema->string()),
        'location' => $schema->object([
            'city' => $schema->string()->required(),
            'state' => $schema->string()->required(),
        ])->withoutAdditionalProperties(),
    ],
)->prompt(
    'Extract all product details from this page.',
    attachments: [$document],
);

$data = $response->toArray();
// ['name' => 'Pro Widget', 'price' => 29.99, ...]
What’s Happening Here
- You define a JsonSchema describing your exact data structure
- The LLM reads the messy HTML and extracts the data
- The AI SDK validates the response against your schema
- You get back a clean, type-safe array matching your spec
The schema acts as a contract. If the LLM tries to return invalid data (missing required fields, wrong types, extra keys), the SDK catches it.
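Under the hood, a fluent definition like the one above compiles down to standard JSON Schema, which is what the model is steered against. As a rough sketch (the SDK's exact output may differ), the example corresponds to something like:

```json
{
    "type": "object",
    "properties": {
        "name": { "type": "string" },
        "price": { "type": "number", "minimum": 0 },
        "features": {
            "type": "array",
            "items": { "type": "string" }
        },
        "location": {
            "type": "object",
            "properties": {
                "city": { "type": "string" },
                "state": { "type": "string" }
            },
            "required": ["city", "state"],
            "additionalProperties": false
        }
    },
    "required": ["name", "price"]
}
```

Thinking in these terms helps when debugging: if the model keeps returning a field you didn't ask for, `additionalProperties: false` (via withoutAdditionalProperties()) is how you shut it out.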
Schema Features
You can define:
- Types: string(), number(), integer(), boolean(), array(), object()
- Constraints: min(), max(), enum(['Option1', 'Option2'])
- Requirements: required(), nullable()
- Nested structures: objects within arrays, arrays within objects
- Descriptions: description('...') to guide the LLM
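Combining these, a richer product schema might look like the sketch below. It uses the same fluent builder methods as the earlier example; the field names (condition, weight_kg, variants) are hypothetical additions for illustration:

```php
'condition' => $schema->string()
    // Constrain the model to a fixed set of values
    ->enum(['new', 'used', 'refurbished'])
    ->required(),
'weight_kg' => $schema->number()->min(0)->nullable()
    // A description guides the LLM toward the right source field
    ->description('Shipping weight in kilograms, if listed'),
// Nested structure: an array of objects
'variants' => $schema->array()->items(
    $schema->object([
        'sku' => $schema->string()->required(),
        'price' => $schema->number()->min(0)->required(),
    ])->withoutAdditionalProperties(),
),
```

Marking a field nullable() rather than omitting it is useful when the data is sometimes absent: you still get the key back, just with a null value, so downstream code stays simple.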
Real-World Use Cases
- Data migration: Extract structured records from legacy system exports
- Content extraction: Pull article metadata from blog posts
- Invoice parsing: Extract line items, totals, dates from PDF invoices
- Form filling: Auto-populate forms from uploaded documents
Why This Beats Regex or DOM Parsing
Traditional scraping breaks when layouts change. LLMs understand meaning, not structure. They can find “the product price” even if it moves from a table to a div to a JSON blob embedded in a script tag.
The JsonSchema ensures you still get reliable, validated output—even when the source format is chaos.