Written by Zhicheng Zhang on April 15, 2026
Project

Joseph Perrier Ingestion Engine

This project is a modular web crawler and structured data ingestion pipeline designed to extract high-quality data from the Joseph Perrier winery website, a real-world data source characterized by inconsistent structure, fragmented content, and high-latency responses.

Unlike simple scrapers, this system is built to handle challenging web data conditions, including:

Semi-structured and inconsistent HTML layouts
Fragmented multi-page entity representation (winery data distributed across sections)
Multi-language routing (French to English normalization)
Rich media extraction from heterogeneous DOM patterns
Deterministic crawling with strict deduplication under noisy link graphs

To address these challenges, the system adopts a modular and extensible architecture, with a particular focus on the parsing layer. Instead of relying on a monolithic parser, it uses a pluggable extractor-based design (the Strategy pattern), where each extractor encapsulates the logic for a specific field or semantic block. A shared execution framework orchestrates these extractors, enabling independent evolution, fault isolation, and flexible composition of parsing logic.

This design reduces coupling between HTML structure and parsing logic: a layout change usually touches a single extractor rather than the whole parser, which keeps the system workable on partially structured and evolving data sources.

More broadly, the system is structured as a reusable crawling framework rather than a one-off scraper, with a strong emphasis on extensibility, deterministic behavior, and resilience to unstable, real-world web data conditions.

Performance Snapshot

On a high-latency and inconsistently structured real-world website, the system demonstrates:

~95% of discovered links filtered out as duplicates via canonicalization and deterministic scheduling
<1% error rate with retry-based fault tolerance
Reconstruction of 28 normalized entities (products and winery) from 50+ pages
Extraction of 150+ media assets across heterogeneous page structures
Stable operation under network-bound conditions (~10s P95 latency)

Architecture

High-Level Pipeline

The crawler follows a structured pipeline:

Seed URLs
↓
Frontier (URL queue + dedupe)
↓
Fetcher (HTTP client with retry)
↓
Parser (page classification + routing)
↓
Extractor Layer (Pluggable Strategy-based parsing)
↓
Structured Entities (Winery / Product / Media)
↓
Batch Write (persistence)

Crawl Orchestration

The crawling process is coordinated by a central orchestrator that:

Initializes the crawl with seed URLs
Manages concurrent fetching and parsing
Controls crawl scope via URL filtering
Handles language routing (FR → EN)
Buffers and flushes extracted data in batches

The system uses a thread pool to run fetch-and-parse work in parallel; the pool size (MAX_WORKERS) caps concurrency so traversal stays efficient without overloading the site.
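
The parallel fetching and batch flushing described above can be sketched as follows. This is a simplified illustration, not the project's orchestrator: BatchBuffer and crawl_level are hypothetical names, the process callable stands in for "fetch and parse one URL", and the defaults mirror the MAX_WORKERS and BATCH_SIZE values from the environment configuration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, List


class BatchBuffer:
    """Collects records and hands them to a flush callback in fixed-size batches."""

    def __init__(self, flush: Callable[[List[Any]], None], batch_size: int = 25) -> None:
        self._flush = flush
        self._batch_size = batch_size
        self._items: List[Any] = []

    def add(self, item: Any) -> None:
        self._items.append(item)
        if len(self._items) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Write out whatever is buffered, if anything."""
        if self._items:
            self._flush(self._items)
            self._items = []


def crawl_level(
    urls: Iterable[str],
    process: Callable[[str], Any],
    buffer: BatchBuffer,
    max_workers: int = 8,
) -> None:
    """Process a set of URLs in parallel and buffer the results on the calling thread."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, which keeps output order stable.
        for record in pool.map(process, urls):
            buffer.add(record)
    buffer.flush()  # write the final partial batch
```

Because only the calling thread touches the buffer, the sketch needs no lock. Note that pool.map re-raises a worker's exception when its result is consumed, so process is expected to handle its own errors.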

Frontier (URL Scheduling)

The frontier manages the set of URLs to be crawled.

Key properties:

Queue-based traversal (BFS-style)
Deduplication via normalized URLs
Separation between:
- Seen URLs (already processed)
- Queued URLs (pending processing)

This prevents redundant crawling and avoids infinite loops caused by URL variations such as query parameters.
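
A minimal sketch of such a frontier, assuming URLs are already normalized before they are added. The class and method names are illustrative and not taken from the project.

```python
from collections import deque
from typing import Deque, Optional, Set


class Frontier:
    """FIFO URL frontier with deduplication (BFS-style traversal)."""

    def __init__(self) -> None:
        self._queue: Deque[str] = deque()  # pending URLs in arrival order
        self._queued: Set[str] = set()     # URLs currently waiting in the queue
        self._seen: Set[str] = set()       # URLs already handed out for processing

    def add(self, url: str) -> bool:
        """Enqueue a URL; return False if it is already queued or seen."""
        if url in self._seen or url in self._queued:
            return False
        self._queue.append(url)
        self._queued.add(url)
        return True

    def pop(self) -> Optional[str]:
        """Return the next URL to crawl, or None when the frontier is empty."""
        if not self._queue:
            return None
        url = self._queue.popleft()
        self._queued.discard(url)
        self._seen.add(url)
        return url

    def __len__(self) -> int:
        return len(self._queue)
```

This version is not thread-safe; if several workers add or pop URLs concurrently, the calls would need to be guarded by a lock.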

URL Normalization

URL normalization is a foundational component of the crawler.

Normalization includes:

Removing query parameters
Removing fragments
Stripping trailing slashes

This ensures that logically identical pages map to a single canonical URL, which is critical for:

Deduplication
Crawl correctness
Stable data ingestion

Media URLs use a slightly different normalization strategy to preserve meaningful query parameters when necessary.
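
The three rules above, and the looser media variant, might look like this with the standard library. The function names are hypothetical, and keeping the full query string for media is an assumption about what "meaningful query parameters" means here.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Drop the query string and fragment, then strip trailing slashes from the path."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def normalize_media_url(url: str) -> str:
    """Drop only the fragment; the query string is kept because it may select a variant."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))
```

With these rules, https://example.com/wines/?page=2#top and https://example.com/wines both map to https://example.com/wines, so the frontier treats them as one page.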

Fetcher (HTTP Layer)

The fetcher is responsible for retrieving page content.

Features:

Persistent HTTP session
Automatic redirect handling
Retry with exponential backoff
Configurable timeout and headers

The fetcher returns both the response body and the final resolved URL, which is used for canonicalization and deduplication.
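
Retry with exponential backoff can be sketched independently of the HTTP client. In this illustration, fetch_once is a hypothetical callable that performs one request and returns the body together with the final URL; the attempt count and base delay are assumed values, not the project's settings.

```python
import time
from typing import Callable, Tuple

FetchResult = Tuple[str, str]  # (response body, final URL after redirects)


def fetch_with_retry(
    fetch_once: Callable[[str], FetchResult],
    url: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> FetchResult:
    """Call fetch_once(url), retrying failures with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_once(url)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles each time: base_delay * 1, 2, 4, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Passing sleep as a parameter keeps the function testable without real waiting. A production version would likely retry only on transient errors such as timeouts or 5xx responses rather than on every exception.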

Language-Aware Crawling

The crawler detects and handles multilingual content.

Strategy:

Detect French pages using HTML attributes or URL patterns
Extract the corresponding English version using language switch links
Enqueue the English page instead of parsing the French version

This guarantees:

Consistent language across all extracted data
Avoidance of duplicate ingestion across locales
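
One way to implement this strategy with the standard library is sketched below. It assumes the page declares its language in the lang attribute of the html tag and that the language switch is an a or link element carrying hreflang="en"; the real site's markup may differ, and detection by URL pattern is left out.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class _LangScanner(HTMLParser):
    """Records the <html lang> value and the first link marked hreflang="en"."""

    def __init__(self) -> None:
        super().__init__()
        self.page_lang: Optional[str] = None
        self.english_href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "html" and attr.get("lang"):
            self.page_lang = attr["lang"].lower()
        elif tag in ("a", "link") and self.english_href is None:
            hreflang = (attr.get("hreflang") or "").lower()
            if hreflang.startswith("en") and attr.get("href"):
                self.english_href = attr["href"]


def english_alternate(html: str, page_url: str) -> Optional[str]:
    """Return the absolute English URL to enqueue if the page is French, else None."""
    scanner = _LangScanner()
    scanner.feed(html)
    is_french = (scanner.page_lang or "").startswith("fr")
    if is_french and scanner.english_href:
        return urljoin(page_url, scanner.english_href)
    return None
```

When the function returns a URL, the caller would enqueue it and skip parsing the French page; when it returns None, the page is parsed as usual.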

Parser (Page Classification)

The parser routes pages to specific parsing logic based on URL patterns.

Supported page types:

Product detail pages
Product catalog/listing pages
Winery pages (homepage, history, family)

This routing layer ensures that:

Each page is parsed with the correct logic
Parsing complexity is isolated per page type
The system remains extensible for additional page types
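
URL-based routing of this kind can be expressed as an ordered table of patterns. The path patterns below are hypothetical placeholders, since the site's actual URL scheme is not described here; only the three page types come from the text.

```python
import re
from enum import Enum
from typing import Optional
from urllib.parse import urlsplit


class PageType(Enum):
    PRODUCT_DETAIL = "product_detail"
    PRODUCT_LISTING = "product_listing"
    WINERY = "winery"


# Hypothetical path patterns, checked in order; the first match wins.
ROUTES = [
    (re.compile(r"^/en/champagnes/[^/]+$"), PageType.PRODUCT_DETAIL),
    (re.compile(r"^/en/champagnes$"), PageType.PRODUCT_LISTING),
    (re.compile(r"^/en(/(history|family))?$"), PageType.WINERY),
]


def classify(url: str) -> Optional[PageType]:
    """Return the page type for a URL, or None if no route matches."""
    path = urlsplit(url).path.rstrip("/")
    for pattern, page_type in ROUTES:
        if pattern.match(path):
            return page_type
    return None
```

Supporting a new page type then means adding one enum member and one row to the table, which is the extensibility property listed above.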

Extractor Architecture

Design Overview

The core parsing logic is implemented as a modular extractor system based on the Strategy pattern.

Each extractor:

Targets a specific field or semantic block
Defines its own DOM selector
Implements extraction logic independently
Returns structured data, media, and discovered URLs

The parser composes multiple extractors dynamically and aggregates their outputs.

Execution Model

For a given page:

A predefined list of extractors is executed
Each extractor contributes partial results
Outputs are merged into a unified structured representation, with later extractors able to override or enrich previously extracted fields

This design provides:

Strong separation of concerns
Fault isolation (failure in one extractor does not break others)
Ease of extension (new fields require only a new extractor)
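
The execution model can be illustrated with a small runner. This is a sketch under simplifying assumptions: extractors are plain callables that receive the page HTML, and ExtractionResult and run_extractors are hypothetical names rather than the project's API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ExtractionResult:
    data: Dict[str, Any] = field(default_factory=dict)  # structured fields
    media: List[str] = field(default_factory=list)      # media URLs
    links: List[str] = field(default_factory=list)      # discovered page URLs


# An extractor is any callable that turns page HTML into a partial result.
Extractor = Callable[[str], ExtractionResult]


def run_extractors(html: str, extractors: List[Extractor]) -> ExtractionResult:
    """Run extractors in order and merge their partial results."""
    merged = ExtractionResult()
    for extractor in extractors:
        try:
            partial = extractor(html)
        except Exception:
            # Fault isolation: one failing extractor does not stop the others.
            continue
        merged.data.update(partial.data)  # later extractors override earlier fields
        merged.media.extend(partial.media)
        merged.links.extend(partial.links)
    return merged
```

A real runner would probably log the swallowed exception so that a silently failing extractor can still be noticed.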

Product Extraction

Product detail pages are parsed into highly structured entities.

Extracted fields include:

Basic identity (name, URL, description)
Technical attributes (dosage, aging, temperature, blend)
Grape composition (percentages per variety)
Awards and ratings (structured pairs)
Data sheet links (PDF)

Many of these fields are derived from semi-structured DOM patterns such as key-value blocks or alternating nodes, requiring custom parsing logic rather than relying solely on static CSS selectors.
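
Two of these patterns are easy to show in isolation. The sketch below assumes the node texts have already been pulled from the DOM in document order, and that grape percentages appear in a form like "35% Chardonnay"; both the input shapes and the sample values in the tests are made up for illustration.

```python
import re
from typing import Dict, List


def pair_alternating(texts: List[str]) -> Dict[str, str]:
    """Pair label/value texts that alternate in document order: [k1, v1, k2, v2, ...]."""
    cleaned = [t.strip() for t in texts if t.strip()]
    # Even positions are labels, odd positions are values.
    # A trailing label without a value is dropped by zip.
    return dict(zip(cleaned[0::2], cleaned[1::2]))


# A number, a percent sign, then a variety name running up to the next separator.
_GRAPE = re.compile(r"(\d+(?:[.,]\d+)?)\s*%\s*([^\d,;%]+)")


def parse_grape_composition(text: str) -> Dict[str, float]:
    """Parse text such as '35% Chardonnay, 65% Pinot Noir' into {variety: percent}."""
    result: Dict[str, float] = {}
    for percent, variety in _GRAPE.findall(text):
        result[variety.strip()] = float(percent.replace(",", "."))
    return result
```

Neither function depends on a particular CSS selector, which is the point: the selector finds the block, and the pairing logic interprets it.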

Winery Extraction

Winery data is distributed across multiple pages and must be assembled incrementally.

Sources:

Homepage: general description
History page: timeline of events
Family page: key individuals and descriptions

Each page produces a partial record, which is later merged into a complete winery entity.

Media Extraction

Media extraction is treated as a first-class concern.

The system extracts media from multiple sources:

Image tags (img)
Video tags (video, source)
Picture elements
CSS background images
Custom attributes used by the site

Additional logic includes:

Resolving srcset to select the highest quality asset
Normalizing media URLs
Deduplicating media per page
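
Resolving srcset to the highest-quality candidate could work roughly as follows. The helper name is hypothetical, and the parser is deliberately simple: it splits candidates on commas, so URLs that themselves contain commas would need a more careful parser.

```python
from typing import Optional


def best_from_srcset(srcset: str) -> Optional[str]:
    """Return the candidate URL with the largest width (w) or density (x) descriptor."""
    best_url: Optional[str] = None
    best_score = -1.0
    for candidate in srcset.split(","):
        parts = candidate.split()
        if not parts:
            continue
        url, score = parts[0], 1.0  # a candidate without a descriptor counts as 1x
        if len(parts) > 1 and parts[1][-1] in "wx":
            try:
                score = float(parts[1][:-1])
            except ValueError:
                pass  # malformed descriptor: keep the default score
        if score > best_score:
            best_url, best_score = url, score
    return best_url
```

The returned URL is typically relative, so it would still go through absolute-URL resolution and media normalization before being stored.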

Link Discovery

The crawler continuously discovers new URLs during parsing.

Sources of discovered URLs:

Anchor tags within the page
Extractors emitting additional links

Discovered URLs are:

Converted to absolute URLs
Filtered to ensure they are internal
Normalized before being added to the frontier
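
These three steps can be sketched with the standard library alone. The function below is an illustration with hypothetical names; "internal" is interpreted as "same host as the current page", which is an assumption.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin, urlsplit, urlunsplit


class _AnchorCollector(HTMLParser):
    """Collects the href of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def discover_links(html: str, page_url: str) -> List[str]:
    """Return absolute, internal, normalized links in first-seen order."""
    collector = _AnchorCollector()
    collector.feed(html)
    host = urlsplit(page_url).netloc
    found: List[str] = []
    for href in collector.hrefs:
        parts = urlsplit(urljoin(page_url, href))  # resolve relative links
        if parts.scheme not in ("http", "https") or parts.netloc != host:
            continue  # skip mailto:, tel:, and external hosts
        # Same normalization as page URLs: no query, no fragment, no trailing slash.
        normalized = urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))
        if normalized not in found:
            found.append(normalized)
    return found
```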

Crawl Scope Control

To prevent uncontrolled crawling, the system restricts URLs based on allowed path prefixes.

Only relevant sections of the website are crawled, such as:

Product listings
Product detail pages
Winery-related pages

This ensures:

Efficient crawling
No drift into irrelevant or infinite sections
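
A prefix check of this kind is short. The prefixes below are hypothetical examples, not the project's configuration.

```python
from typing import Sequence
from urllib.parse import urlsplit

# Hypothetical allowed sections; the real list depends on the site's URL scheme.
ALLOWED_PREFIXES = ("/en/champagnes", "/en/history", "/en/family")


def in_scope(url: str, allowed_prefixes: Sequence[str] = ALLOWED_PREFIXES) -> bool:
    """Return True if the URL path equals an allowed prefix or lies beneath it."""
    path = urlsplit(url).path.rstrip("/")
    return any(path == p or path.startswith(p + "/") for p in allowed_prefixes)
```

Matching on a path-segment boundary keeps a path such as /en/champagnes-news from slipping through a plain startswith test.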

Data Model

The system produces three primary entity types: Winery, Product, and Media.

Winery

Represents a winery as a composite entity assembled from multiple pages.

Key attributes:

Name and website
Description
Family structure (key individuals and roles)
Historical timeline (year-event pairs)

The model supports incremental enrichment, where different pages contribute different parts of the final entity.

Product

Represents a structured wine product with rich attributes.

Key features:

Strong normalization of technical fields (e.g., numeric dosage instead of raw text)
Structured grape composition (percentages per variety)
Parsed awards and ratings
Support for semi-structured key-value extraction

This design enables downstream querying and analysis rather than simple text storage.
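
As an illustration of what normalized technical fields could look like, here is a possible shape for the product record together with a dosage parser. The field names and the grams-per-litre unit are assumptions; the actual schema is not shown in this document.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Product:
    name: str
    url: str
    description: Optional[str] = None
    dosage_g_per_l: Optional[float] = None                         # numeric, not raw text
    grapes: Dict[str, float] = field(default_factory=dict)         # variety -> percent
    awards: List[Tuple[str, str]] = field(default_factory=list)    # (source, rating) pairs


_NUMBER = re.compile(r"\d+(?:[.,]\d+)?")


def parse_dosage(raw: str) -> Optional[float]:
    """Extract the first number from text such as 'Dosage: 7 g/L'; None if absent."""
    match = _NUMBER.search(raw)
    return float(match.group().replace(",", ".")) if match else None
```

Storing the dosage as a number is what makes a query such as "all products with dosage below 6" possible downstream.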

Media

Represents media assets associated with pages.

Attributes:

Media type (image or video)
URL (normalized)
Source page URL

Media records are deduplicated and retain provenance for traceability.
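
A frozen dataclass is one convenient way to get both properties, since equal records hash alike and can be tracked in a set. The field names follow the attribute list above but are otherwise assumed.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Media:
    media_type: str        # "image" or "video"
    url: str               # normalized media URL
    source_page_url: str   # page on which the asset was found (provenance)


def dedupe_media(items: Iterable[Media]) -> List[Media]:
    """Drop repeated records while keeping first-seen order."""
    seen = set()
    unique: List[Media] = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique
```

Because the source page is part of the record, the same asset found on two pages yields two records, which keeps deduplication per page and preserves provenance.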

Data Ingestion Strategy

The system follows a progressive structuring and merge-based ingestion approach:

Raw HTML is fetched from pages
Extractors convert DOM content into structured fields
Page-level results are aggregated into intermediate objects
Partial records from different pages are merged
Final structured entities are persisted

A key aspect of the system is its ability to handle partial and evolving data:

Records can be incomplete when first extracted
Later crawls can enrich existing data
Merging logic keeps previously extracted values when a later page omits them, so enrichment does not erase existing data
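
A merge function with these properties might look like the sketch below. It treats records as plain dictionaries, which is an assumption; the rules (skip empty values, union lists, recurse into dictionaries, let newer non-empty scalars win) are one reasonable policy, not necessarily the project's. The sample values in the usage note are placeholders.

```python
from typing import Any, Dict


def merge_records(existing: Dict[str, Any], incoming: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new record that enriches `existing` with non-empty values from `incoming`."""
    merged = dict(existing)
    for key, value in incoming.items():
        if value in (None, "", [], {}):
            continue  # empty incoming values never erase existing data
        current = merged.get(key)
        if isinstance(current, list) and isinstance(value, list):
            # Union of both lists, keeping order and skipping repeats.
            merged[key] = current + [v for v in value if v not in current]
        elif isinstance(current, dict) and isinstance(value, dict):
            merged[key] = merge_records(current, value)
        else:
            merged[key] = value  # new field, or a newer non-empty scalar
    return merged
```

For example, merging a homepage record {"name": "Example Winery", "timeline": []} with a history record {"name": "", "timeline": [{"year": 1900, "event": "placeholder"}]} keeps the name and gains the timeline entry.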

Setup

Install Dependencies


pip install -r requirements.txt

Environment Configuration

Create a .env file with:

SUPABASE_URL=your_supabase_url
SUPABASE_KEY=your_supabase_key
BASE_URL=https://www.josephperrier.com/

TIMEOUT=20
MAX_WORKERS=8
BATCH_SIZE=25
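
For illustration, these variables could be read into a typed settings object as sketched below. The Settings class and load_settings function are hypothetical, and the sketch assumes the .env values have already been loaded into the process environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    supabase_url: str
    supabase_key: str
    base_url: str
    timeout: int = 20       # seconds per request
    max_workers: int = 8    # thread pool size
    batch_size: int = 25    # records per batch write


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables; required keys raise KeyError if missing."""
    return Settings(
        supabase_url=env["SUPABASE_URL"],
        supabase_key=env["SUPABASE_KEY"],
        base_url=env["BASE_URL"],
        timeout=int(env.get("TIMEOUT", "20")),
        max_workers=int(env.get("MAX_WORKERS", "8")),
        batch_size=int(env.get("BATCH_SIZE", "25")),
    )
```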

Initialize Database


python scripts/init_db.py

Run the Crawler


python -m src.cli.main crawl

You can also run the crawler directly:

python src/main.py

The crawler starts from the configured base URL and recursively discovers and processes relevant pages.

CLI Usage

After setting up the environment and initializing the database, you can use the CLI to run the crawler and inspect the results.

Run the Crawler

python -m src.cli.main crawl

This will start crawling from the configured base URL and populate the database.

List Products

python -m src.cli.main list-products

You can control how many results are returned:

python -m src.cli.main list-products --limit 10

Show a Specific Product

python -m src.cli.main show-product "cuvée royale brut"

This will display all stored fields for the selected product.

Notes

Make sure your .env file is correctly configured before running the CLI.
All commands should be run from the project root directory.

Example Run

Input:


https://www.josephperrier.com/

Output:

Structured winery data assembled from multiple pages
Product catalog with detailed attributes
Media assets extracted from across the site
Internal links discovered and traversed

Key Design Decisions

1. Extractor-Based Parsing

Instead of a monolithic parser, the system uses small, composable extractors.

Benefits:

Easier maintenance
Independent evolution of parsing logic
Better fault isolation

2. Deterministic Crawling

Strict URL normalization and deduplication ensure:

No duplicate crawling
Stable ingestion behavior
Predictable crawl graph

3. Structure-Aware Parsing

Many fields are not directly extractable via simple selectors.

The system handles:

Key-value pairing
Alternating DOM patterns
Nested content structures

4. Multi-Page Entity Assembly

Entities such as wineries are not confined to a single page.

The system:

Extracts partial records
Merges them across pages
Produces a complete structured entity

5. Robustness to Real-World Web Variability

The crawler is designed to handle:

Inconsistent HTML structures
Missing fields
Layout variations across pages

This is achieved through:

Defensive extraction logic
Independent extractors
Merge-based data modeling

Future Improvements

Generalize the crawler to support multiple winery sites
Introduce distributed crawling (queue-based architecture)
Add persistent deduplication (e.g., Redis)
Improve observability (metrics and tracing)
Implement incremental recrawling strategies
