Joseph Perrier Ingestion Engine

This project is a modular web crawler and structured data ingestion pipeline designed to extract high-quality data from the Joseph Perrier winery website, a real-world data source characterized by inconsistent structure, fragmented content, and high-latency responses.

Unlike simple scrapers, this system is built to handle challenging web data conditions, including:

To address these challenges, the system adopts a modular and extensible architecture, with a particular focus on the parsing layer. Instead of relying on a monolithic parser, it uses a pluggable extractor-based design (Strategy pattern), where each extractor encapsulates the logic for a specific field or semantic block. A unified execution framework orchestrates these extractors, enabling independent evolution, fault isolation, and flexible composition of parsing logic.

This design avoids tight coupling between HTML structure and parsing logic, making the system resilient to layout changes and to partially structured, evolving data sources.

More broadly, the system is structured as a reusable crawling framework rather than a one-off scraper, with a strong emphasis on extensibility, deterministic behavior, and resilience to unstable, real-world web data conditions.

Performance Snapshot

On a high-latency and inconsistently structured real-world website, the system demonstrates:

Architecture

High-Level Pipeline

The crawler follows a structured pipeline:


Seed URLs
↓
Frontier (URL queue + dedupe)
↓
Fetcher (HTTP client with retry)
↓
Parser (page classification + routing)
↓
Extractor Layer (Pluggable Strategy-based parsing)
↓
Structured Entities (Winery / Product / Media)
↓
Batch Write (persistence)

Crawl Orchestration

The crawling process is coordinated by a central orchestrator that:

The system uses a thread pool for parallel crawling, enabling efficient traversal while maintaining control over concurrency.
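
As a rough illustration of thread-pool crawling, the sketch below assumes a hypothetical `fetch_and_parse(url)` callable that returns the links found on a page. The names `crawl`, `fetch_and_parse`, and `max_workers` are illustrative and are not taken from the project's code.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, Iterable, List, Set


def crawl(seeds: Iterable[str],
          fetch_and_parse: Callable[[str], List[str]],
          max_workers: int = 8) -> Set[str]:
    """Visit every reachable URL at most once using a bounded thread pool."""
    seen: Set[str] = set(seeds)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {pool.submit(fetch_and_parse, url) for url in seen}
        while pending:
            # Block until at least one page finishes, then schedule its links.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                try:
                    links = future.result()
                except Exception:
                    continue  # a failed page should not stop the crawl
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        pending.add(pool.submit(fetch_and_parse, link))
    return seen
```

Only the coordinating thread touches `seen`, so the sketch needs no lock; the pool size caps how many pages are fetched at once.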

Frontier (URL Scheduling)

The frontier manages the set of URLs to be crawled.

Key properties:

This prevents redundant crawling and avoids infinite loops caused by URL variations such as query parameters.
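
A minimal sketch of a deduplicating frontier, assuming URLs are already normalized before they are added. The class and method names here are hypothetical, not the project's actual interface.

```python
import threading
from collections import deque
from typing import Deque, Optional, Set


class Frontier:
    """FIFO URL queue that admits each URL at most once."""

    def __init__(self) -> None:
        self._queue: Deque[str] = deque()
        self._seen: Set[str] = set()
        self._lock = threading.Lock()  # safe to share between worker threads

    def add(self, url: str) -> bool:
        """Enqueue the URL if it has not been seen; return True when added."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            self._queue.append(url)
            return True

    def pop(self) -> Optional[str]:
        """Return the next URL to crawl, or None when the queue is empty."""
        with self._lock:
            return self._queue.popleft() if self._queue else None
```

Because a URL stays in the seen set after it is popped, a page that links back to an already crawled URL cannot re-enter the queue.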

URL Normalization

URL normalization is a foundational component of the crawler.

Normalization includes:

This ensures that logically identical pages map to a single canonical URL, which is critical for:

Media URLs use a slightly different normalization strategy to preserve meaningful query parameters when necessary.
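
The following sketch shows one way such normalization could look. The specific rules (lowercasing scheme and host, dropping fragments, stripping query strings for pages, trimming trailing slashes) and the `keep_query` flag are assumptions for illustration; the project's actual rules may differ.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str, keep_query: bool = False) -> str:
    """Map equivalent URLs to one canonical form (assumed rules, see above)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"        # "/en/" and "/en" become one URL
    query = parts.query if keep_query else ""   # media URLs may keep their query
    return urlunsplit((scheme, netloc, path, query, ""))  # fragment is dropped
```

A page URL would be normalized with the default arguments, while a media URL could pass `keep_query=True` so that a size or version parameter survives.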

Fetcher (HTTP Layer)

The fetcher is responsible for retrieving page content.

Features:

The fetcher returns both the response body and the final resolved URL, which is used for canonicalization and deduplication.
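
A hedged sketch of a retrying fetch that returns the body together with the final URL. It uses the standard library's `urllib.request` rather than whatever HTTP client the project uses, and the function names, retry count, and exponential backoff are assumptions.

```python
import time
from typing import Callable, Tuple
from urllib.request import urlopen


def http_send(url: str, timeout: float = 20) -> Tuple[str, str]:
    """Fetch a URL once; return (body, final URL after redirects)."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, "replace"), resp.geturl()


def fetch_with_retry(url: str,
                     send: Callable[[str], Tuple[str, str]] = http_send,
                     retries: int = 3,
                     backoff: float = 0.5) -> Tuple[str, str]:
    """Call send(url), retrying on network errors with exponential backoff."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return send(url)
        except OSError:  # urllib's URLError and socket timeouts derive from OSError
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (2 ** attempt))
    raise AssertionError("unreachable")
```

Passing `send` as a parameter keeps the retry logic independent of the HTTP client, which also makes it easy to exercise without network access.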

Language-Aware Crawling

The crawler detects and handles multilingual content.

Strategy:

This guarantees:

Parser (Page Classification)

The parser routes pages to specific parsing logic based on URL patterns.
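
URL-pattern routing of this kind can be sketched as an ordered table of regular expressions. The paths and page-type names below are invented placeholders, not the site's real URL layout or the project's actual page types.

```python
import re
from urllib.parse import urlsplit

# Hypothetical routes: the first matching pattern decides the page type.
ROUTES = [
    (re.compile(r"^/products/[^/]+/?$"), "product_detail"),
    (re.compile(r"^/products/?$"), "product_list"),
    (re.compile(r"^/about(/.*)?$"), "winery"),
]


def classify(url: str) -> str:
    """Return the page type for a URL, or "unknown" if no route matches."""
    path = urlsplit(url).path or "/"
    for pattern, page_type in ROUTES:
        if pattern.match(path):
            return page_type
    return "unknown"
```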

Supported page types:

This routing layer ensures that:

Extractor Architecture

Design Overview

The core parsing logic is implemented using a modular extractor system based on the Strategy pattern, where each extractor encapsulates a specific parsing strategy for a field or semantic block.

Each extractor:

The parser composes multiple extractors dynamically and aggregates their outputs.
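
To make the composition concrete, here is a small sketch of Strategy-style extractors with per-extractor fault isolation. The `Extractor` protocol, `TitleExtractor`, and `run_extractors` are hypothetical names, and the regex-based title lookup merely stands in for real DOM parsing.

```python
import re
from typing import Any, Dict, Iterable, Protocol, Tuple


class Extractor(Protocol):
    name: str

    def extract(self, html: str) -> Dict[str, Any]:
        """Return the fields this extractor is responsible for."""
        ...


class TitleExtractor:
    name = "title"

    def extract(self, html: str) -> Dict[str, Any]:
        match = re.search(r"<h1[^>]*>(.*?)</h1>", html, re.S | re.I)
        return {"title": match.group(1).strip()} if match else {}


def run_extractors(html: str,
                   extractors: Iterable[Extractor]
                   ) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Run each extractor independently and merge their outputs."""
    result: Dict[str, Any] = {}
    errors: Dict[str, str] = {}
    for extractor in extractors:
        try:
            result.update(extractor.extract(html))
        except Exception as exc:  # one broken extractor must not sink the page
            errors[extractor.name] = repr(exc)
    return result, errors
```

Adding a field then means writing one more class with a `name` and an `extract` method and appending it to the list passed to `run_extractors`.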

Execution Model

For a given page:

This design provides:

Product Extraction

Product detail pages are parsed into highly structured entities.

Extracted fields include:

Many of these fields are derived from semi-structured DOM patterns such as key-value blocks or alternating nodes, requiring custom parsing logic rather than relying solely on static CSS selectors.
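
As an illustration of parsing alternating nodes, the sketch below pairs each label node with the value node that follows it. It assumes the labels and values appear as `<dt>`/`<dd>` elements, which is a simplification chosen for the example; the real pages may use other tags, and the helper names are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple


class _TextCollector(HTMLParser):
    """Collect (tag, text) pairs for the tags of interest, in document order."""

    def __init__(self, tags: Tuple[str, ...]) -> None:
        super().__init__()
        self.tags = tags
        self.items: List[Tuple[str, str]] = []
        self._open: Optional[str] = None
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self._open, self._buffer = tag, []

    def handle_data(self, data):
        if self._open:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._open:
            text = " ".join("".join(self._buffer).split())  # collapse whitespace
            self.items.append((tag, text))
            self._open = None


def parse_key_values(html: str) -> Dict[str, str]:
    """Pair each <dt> label with the first <dd> value that follows it."""
    collector = _TextCollector(("dt", "dd"))
    collector.feed(html)
    fields: Dict[str, str] = {}
    key: Optional[str] = None
    for tag, text in collector.items:
        if tag == "dt":
            key = text
        elif key is not None:
            fields[key] = text
            key = None  # ignore extra values with no label
    return fields
```

A value node without a preceding label is skipped rather than guessed at, which keeps the output predictable when the markup is irregular.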

Winery Extraction

Winery data is distributed across multiple pages and must be assembled incrementally.

Sources:

Each page produces a partial record, which is later merged into a complete winery entity.
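
A possible merge step is sketched below. The policy shown (the first non-empty value wins for scalar fields, list fields are unioned in order) is an assumption made for the example, as are the function and field names.

```python
from typing import Any, Dict

_EMPTY = (None, "", [], {})


def merge_partial(base: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Merge a partial record into base without overwriting known values."""
    merged = dict(base)
    for key, value in update.items():
        if value in _EMPTY:
            continue  # nothing to contribute for this field
        current = merged.get(key)
        if isinstance(current, list) and isinstance(value, list):
            # Union list fields while preserving the original order.
            merged[key] = current + [v for v in value if v not in current]
        elif current in _EMPTY:
            merged[key] = value
    return merged
```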

Media Extraction

Media extraction is treated as a first-class concern.

The system extracts media from multiple sources:

Additional logic includes:

URL Discovery

The crawler continuously discovers new URLs during parsing.

Sources of discovered URLs:

Discovered URLs are:

Crawl Scope Control

To prevent uncontrolled crawling, the system restricts URLs based on allowed path prefixes.
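
A prefix check of this kind might look like the sketch below. The function name and its parameters are hypothetical, and the prefixes used in any call would come from the project's own configuration.

```python
from typing import Iterable
from urllib.parse import urlsplit


def in_scope(url: str, allowed_host: str, allowed_prefixes: Iterable[str]) -> bool:
    """Return True if the URL is on the allowed host under an allowed path prefix."""
    parts = urlsplit(url)
    if parts.netloc.lower() != allowed_host.lower():
        return False
    path = parts.path or "/"
    for prefix in allowed_prefixes:
        base = prefix.rstrip("/")
        # Match whole path segments so "/en/wines" does not admit "/en/wines-old".
        if path == base or path.startswith(base + "/"):
            return True
    return False
```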

Only relevant sections of the website are crawled, such as:

This ensures:

Data Model

The system produces three primary entity types: Winery, Product, and Media.

Winery

Represents a winery as a composite entity assembled from multiple pages.

Key attributes:

The model supports incremental enrichment, where different pages contribute different parts of the final entity.

Product

Represents a structured wine product with rich attributes.

Key features:

This design enables downstream querying and analysis rather than simple text storage.

Media

Represents media assets associated with pages.

Attributes:

Media records are deduplicated and retain provenance for traceability.
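
Deduplication with provenance could be sketched as follows. The `MediaRecord` fields shown are illustrative and do not reproduce the project's actual attribute list.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class MediaRecord:
    url: str
    source_pages: List[str] = field(default_factory=list)  # where the asset was seen


def dedupe_media(found: Iterable[Tuple[str, str]]) -> List[MediaRecord]:
    """Collapse (media_url, page_url) pairs into one record per media URL."""
    records: Dict[str, MediaRecord] = {}
    for media_url, page_url in found:
        record = records.setdefault(media_url, MediaRecord(media_url))
        if page_url not in record.source_pages:
            record.source_pages.append(page_url)
    return list(records.values())
```

Each asset is stored once, and its `source_pages` list records every page that referenced it.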

Data Ingestion Strategy

The system follows a progressive structuring and merge-based ingestion approach:

  1. Raw HTML is fetched from pages
  2. Extractors convert DOM content into structured fields
  3. Page-level results are aggregated into intermediate objects
  4. Partial records from different pages are merged
  5. Final structured entities are persisted
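
The final persistence step can be sketched as chunked writes. Here `write_batch` is a hypothetical callable standing in for the real database client, and the default batch size simply mirrors the `BATCH_SIZE` value from the environment configuration.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


def write_in_batches(records: Iterable[T],
                     write_batch: Callable[[List[T]], None],
                     batch_size: int = 25) -> int:
    """Persist records chunk by chunk; return how many were written."""
    total = 0
    for chunk in batched(records, batch_size):
        write_batch(chunk)
        total += len(chunk)
    return total
```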

A key aspect of the system is its ability to handle partial and evolving data:

Setup

Install Dependencies


pip install -r requirements.txt

Environment Configuration

Create a .env file with:


SUPABASE_URL=your_supabase_url
SUPABASE_KEY=your_supabase_key
BASE_URL=https://www.josephperrier.com/

TIMEOUT=20
MAX_WORKERS=8
BATCH_SIZE=25
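
For illustration, these values could be read into a typed settings object as sketched below. The `Settings` class and `load_settings` function are hypothetical, and the sketch assumes the variables are already present in the process environment (for example, loaded from the .env file by a tool such as python-dotenv).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    base_url: str
    timeout: int       # seconds per HTTP request
    max_workers: int   # thread pool size
    batch_size: int    # records per database write


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build settings from environment variables, using the defaults shown above."""
    env = os.environ if env is None else env
    return Settings(
        base_url=env.get("BASE_URL", "https://www.josephperrier.com/"),
        timeout=int(env.get("TIMEOUT", "20")),
        max_workers=int(env.get("MAX_WORKERS", "8")),
        batch_size=int(env.get("BATCH_SIZE", "25")),
    )
```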

Initialize Database


python scripts/init_db.py

Run the Crawler


python -m src.cli.main crawl

You can also run the crawler directly:

python src/main.py

The crawler starts from the configured base URL and recursively discovers and processes relevant pages.

CLI Usage

After setting up the environment and initializing the database, you can use the CLI to run the crawler and inspect the results.

Run the crawler

python -m src.cli.main crawl

This will start crawling from the configured base URL and populate the database.

List products

python -m src.cli.main list-products

You can control how many results are returned:

python -m src.cli.main list-products --limit 10

Show a specific product

python -m src.cli.main show-product "cuvée royale brut"

This will display all stored fields for the selected product.

Notes

Example Run

Input:


https://www.josephperrier.com/

Output:

Key Design Decisions

1. Extractor-Based Parsing

Instead of a monolithic parser, the system uses small, composable extractors.

Benefits:

2. Deterministic Crawling

Strict URL normalization and deduplication ensure:

3. Structure-Aware Parsing

Many fields are not directly extractable via simple selectors.

The system handles:

4. Multi-Page Entity Assembly

Entities such as wineries are not confined to a single page.

The system:

5. Robustness to Real-World Web Variability

The crawler is designed to handle:

This is achieved through:

Future Improvements