Use Laravel AI’s JsonSchema builder with agents to extract structured data from unstructured HTML. Define your schema with detailed descriptions, constraints, and nested objects for reliable extraction.
The Problem
Traditional regex/DOM parsing is brittle and breaks when HTML structure changes. You need to extract structured data from web pages reliably.
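To make the brittleness concrete, here is a minimal sketch in plain PHP. The markup, the `price` class name, and the `extractPrice()` helper are all invented for illustration; the point is only that a pattern tied to one markup shape silently returns nothing once that shape shifts.

```php
<?php

// Hypothetical helper: pulls a price out of markup with a fixed pattern.
function extractPrice(string $html): ?float
{
    // Tied to one exact shape: <span class="price">$12.50</span>
    if (preg_match('/<span class="price">\$([0-9]+(?:\.[0-9]+)?)<\/span>/', $html, $m) === 1) {
        return (float) $m[1];
    }

    return null; // pattern did not match
}

// Invented sample markup: the same price before and after a redesign.
$before = '<div><span class="price">$12.50</span></div>';
$after  = '<div><span class="price price--sale" data-currency="USD">12.50 USD</span></div>';

var_dump(extractPrice($before)); // float(12.5)
var_dump(extractPrice($after));  // NULL
```

Nothing about the price itself changed between the two snippets, only the surrounding attributes and the currency format, yet the second call finds no match.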
The Solution
Laravel AI’s JsonSchema + agent pattern provides LLM-based extraction with strict type validation:
use Laravel\Ai\Files\Document;
use Illuminate\Contracts\JsonSchema\JsonSchema;

use function Laravel\Ai\agent;

$document = Document::fromString($htmlContent, 'text/plain');

$response = agent(
    instructions: 'You are a data extraction assistant. Extract structured information from the attached document.',
    schema: fn (JsonSchema $schema) => [
        'title' => $schema->string()
            ->description('Main title of the content')
            ->required(),
        'items' => $schema->array()
            ->description('List of related items')
            ->min(1)
            ->items(
                $schema->object([
                    'name' => $schema->string()->required(),
                    'value' => $schema->integer()->min(0)->nullable(),
                    'metadata' => $schema->object([
                        'lat' => $schema->number()->min(-90)->max(90),
                        'lng' => $schema->number()->min(-180)->max(180),
                    ])->withoutAdditionalProperties()->nullable(),
                ])->withoutAdditionalProperties()
            ),
        'category' => $schema->string()
            ->enum(['Type1', 'Type2'])
            ->required(),
    ],
)->prompt(
    'Extract data from the attached document.',
    attachments: [$document],
);

$data = $response->toArray();
Key Techniques
1. Use withoutAdditionalProperties() to Prevent Hallucination
$schema->object([
    'name' => $schema->string()->required(),
    'count' => $schema->integer()->min(1)->nullable(),
])->withoutAdditionalProperties() // Prevents adding unexpected fields
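As a belt-and-braces measure, the same rule can be re-checked on the application side once the response has been converted to an array. This is a sketch in plain PHP, assuming the configured provider may not enforce the schema strictly. The `unexpectedKeys()` helper and the sample data are hypothetical and not part of Laravel AI.

```php
<?php

// Hypothetical helper: lists keys in $data that the schema did not declare.
function unexpectedKeys(array $data, array $allowedKeys): array
{
    return array_keys(array_diff_key($data, array_flip($allowedKeys)));
}

// Invented sample: 'rating' is not one of the declared properties.
$item = ['name' => 'Summit', 'count' => 3, 'rating' => 'great'];

$extra = unexpectedKeys($item, ['name', 'count']);
// $extra is ['rating']; an empty array means the object matched the declared shape.
```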
2. Add Min/Max Constraints for Numbers
'elevation' => $schema->integer()->min(0)->max(10000),
'latitude' => $schema->number()->min(-90)->max(90),
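If you would rather not rely on the model alone to honour numeric bounds, the same limits can be asserted after extraction. This is a minimal sketch in plain PHP: `inRange()` is a hypothetical helper, and the sample row is invented.

```php
<?php

// Hypothetical helper: true when $value is an int or float inside [$min, $max].
function inRange(mixed $value, int|float $min, int|float $max): bool
{
    return (is_int($value) || is_float($value)) && $value >= $min && $value <= $max;
}

// Invented sample row using the two fields from the schema above.
$row = ['elevation' => 4810, 'latitude' => 45.83];

$valid = inRange($row['elevation'], 0, 10000)
    && inRange($row['latitude'], -90, 90);
// $valid is true here; a latitude of 145.0 would make it false.
```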
3. Provide Detailed Descriptions
'difficulty' => $schema->string()
    ->description('Difficulty level: beginner, intermediate, or expert')
    ->enum(['beginner', 'intermediate', 'expert'])
    ->required(),
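One way to carry the allowed values into application code is a PHP backed enum (PHP 8.1+) that mirrors the schema's list. This is a sketch only: the `Difficulty` enum is hypothetical and must be kept in sync with the schema by hand.

```php
<?php

// Hypothetical enum mirroring the schema's allowed values.
enum Difficulty: string
{
    case Beginner = 'beginner';
    case Intermediate = 'intermediate';
    case Expert = 'expert';
}

// tryFrom() returns null instead of throwing when the value is not a case.
$known = Difficulty::tryFrom('expert');    // Difficulty::Expert
$unknown = Difficulty::tryFrom('extreme'); // null
```

A `null` result is a cheap signal that the extracted value fell outside the declared list and should be rejected or retried.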
Why This Matters
- Resilient to format variations: LLMs understand content semantically, not just structure
- Type-safe output: JsonSchema ensures validated, structured data
- Prevents hallucination: withoutAdditionalProperties() + constraints = strict validation
- Self-documenting: Schema descriptions double as documentation
Real-World Example
Extracting product data from an e-commerce site:
$schema->object([
    'title' => $schema->string()->required(),
    'price' => $schema->number()->min(0)->required(),
    'stock_status' => $schema->string()
        ->enum(['in_stock', 'out_of_stock', 'pre_order'])
        ->required(),
    'specs' => $schema->object([
        'brand' => $schema->string()->nullable(),
        'model' => $schema->string()->nullable(),
        'dimensions' => $schema->object([
            'length' => $schema->number()->min(0),
            'width' => $schema->number()->min(0),
            'height' => $schema->number()->min(0),
        ])->withoutAdditionalProperties()->nullable(),
    ])->withoutAdditionalProperties(),
])->withoutAdditionalProperties()
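Once the data comes back as an array, it can be mapped onto a typed value object so the rest of the application never touches raw keys. This is a sketch in plain PHP 8.1+: the `Product` class, its `fromArray()` method, and the sample values are hypothetical, and it assumes the extracted array follows the schema above.

```php
<?php

// Hypothetical value object for the extracted product fields.
final class Product
{
    public function __construct(
        public readonly string $title,
        public readonly float $price,
        public readonly string $stockStatus,
        public readonly ?string $brand = null,
    ) {}

    // Builds a Product from an array shaped like the schema above.
    public static function fromArray(array $data): self
    {
        return new self(
            title: $data['title'],
            price: (float) $data['price'],
            stockStatus: $data['stock_status'],
            brand: $data['specs']['brand'] ?? null,
        );
    }
}

// Invented sample data standing in for an extraction result.
$product = Product::fromArray([
    'title' => 'Trail Tent',
    'price' => 199,
    'stock_status' => 'in_stock',
    'specs' => ['brand' => null, 'model' => 'TT-2'],
]);
```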
The payoff: When the HTML changes (and it will), your extraction is far more likely to keep working, because the LLM reads the meaning of the content, not just its structure.