<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://zhichzhang.dev//feed.xml" rel="self" type="application/atom+xml" /><link href="https://zhichzhang.dev//" rel="alternate" type="text/html" /><updated>2026-05-12T03:39:40+00:00</updated><id>https://zhichzhang.dev//feed.xml</id><title type="html">Zhicheng Zhang</title><subtitle>Zhicheng Zhang&apos;s Personal Website</subtitle><author><name>Zhicheng Zhang</name><email>zhicheng.zhang.cs@gmail.com</email></author><entry><title type="html">Joseph Perrier Ingestion Engine</title><link href="https://zhichzhang.dev//2026-04-15/joseph-perrier-ingestion-engine" rel="alternate" type="text/html" title="Joseph Perrier Ingestion Engine" /><published>2026-04-15T22:30:25+00:00</published><updated>2026-04-15T22:30:25+00:00</updated><id>https://zhichzhang.dev//2026-04-15/joseph-perrier-ingestion-engine</id><content type="html" xml:base="https://zhichzhang.dev//2026-04-15/joseph-perrier-ingestion-engine"><![CDATA[<!--excerpt-->

<p><a href="https://github.com/zhichzhang/jp-ingestion-engine">This project</a> is a modular web crawler and structured data ingestion pipeline designed to extract high-quality data from the Joseph Perrier winery website, a <strong>real-world data source characterized by inconsistent structure, fragmented content, and high-latency responses</strong>.</p>

<p>Unlike simple scrapers, this system is built to handle challenging web data conditions, including:</p>

<ul>
  <li>Semi-structured and inconsistent HTML layouts</li>
  <li>Fragmented multi-page entity representation (winery data distributed across sections)</li>
  <li>Multi-language routing (French to English normalization)</li>
  <li>Rich media extraction from heterogeneous DOM patterns</li>
  <li>Deterministic crawling with strict deduplication under noisy link graphs</li>
</ul>

<p>To address these challenges, the system adopts a <strong>modular and extensible architecture</strong>, with a particular focus on the parsing layer. Instead of relying on a monolithic parser, it uses a <strong>pluggable extractor-based design (Strategy Pattern)</strong>, where each extractor encapsulates the logic for a specific field or semantic block. A unified execution framework orchestrates these extractors, enabling independent evolution, fault isolation, and flexible composition of parsing logic.</p>

<p>This design avoids tight coupling between HTML structure and parsing logic, making the system resilient to layout changes and robust under partially structured and evolving data sources.</p>

<p>More broadly, the system is structured as a reusable crawling framework rather than a one-off scraper, with a strong emphasis on <strong>extensibility, deterministic behavior, and resilience to unstable, real-world web data conditions</strong>.</p>

<h3 id="performance-snapshot">Performance Snapshot</h3>

<p>On a high-latency and inconsistently structured real-world website, the system demonstrates:</p>

<ul>
  <li>~95% duplicate filtering via canonicalization and deterministic scheduling</li>
  <li>&lt;1% error rate with retry-based fault tolerance</li>
  <li>Reconstruction of 28 normalized entities (products and winery) from 50+ pages</li>
  <li>Extraction of 150+ media assets across heterogeneous page structures</li>
  <li>Stable operation under network-bound conditions (~10s P95 latency)</li>
</ul>

<h2 id="architecture">Architecture</h2>

<h3 id="high-level-pipeline">High-Level Pipeline</h3>

<p>The crawler follows a structured pipeline:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
Seed URLs
↓
Frontier (URL queue + dedupe)
↓
Fetcher (HTTP client with retry)
↓
Parser (page classification + routing)
↓
Extractor Layer (Pluggable Strategy-based parsing)
↓
Structured Entities (Winery / Product / Media)
↓
Batch Write (persistence)

</code></pre></div></div>

<h3 id="crawl-orchestration">Crawl Orchestration</h3>

<p>The crawling process is coordinated by a central orchestrator that:</p>

<ul>
  <li>Initializes the crawl with seed URLs</li>
  <li>Manages concurrent fetching and parsing</li>
  <li>Controls crawl scope via URL filtering</li>
  <li>Handles language routing (FR → EN)</li>
  <li>Buffers and flushes extracted data in batches</li>
</ul>

<p>The system uses a thread pool for parallel crawling, enabling efficient traversal while maintaining control over concurrency.</p>
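
<p>The buffering and flushing step could look roughly like the sketch below. It is an illustration rather than the project's code: <code class="language-plaintext highlighter-rouge">BatchBuffer</code> and the <code class="language-plaintext highlighter-rouge">write_batch</code> callback are hypothetical names, and the default of 25 simply mirrors the <code class="language-plaintext highlighter-rouge">BATCH_SIZE</code> value shown in the setup section. A lock is included because worker threads would share the buffer.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import threading


class BatchBuffer:
    """Collect records and pass them to write_batch once batch_size is reached."""

    def __init__(self, write_batch, batch_size=25):
        self._write_batch = write_batch  # hypothetical persistence callback
        self._batch_size = batch_size
        self._items = []
        self._lock = threading.Lock()  # worker threads share this buffer

    def add(self, record):
        """Buffer one record and flush automatically when the batch is full."""
        with self._lock:
            self._items.append(record)
            if len(self._items) == self._batch_size:
                self._flush_locked()

    def flush(self):
        """Write whatever is left; intended to be called once the crawl ends."""
        with self._lock:
            self._flush_locked()

    def _flush_locked(self):
        # Caller must hold the lock. Empty buffers are not written.
        if self._items:
            batch, self._items = self._items, []
            self._write_batch(batch)
</code></pre></div></div>

<p>For simplicity the write happens while the lock is held, so a slow database call would briefly block other workers. Handing the batch off outside the lock is a possible refinement.</p>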

<h3 id="frontier-url-scheduling">Frontier (URL Scheduling)</h3>

<p>The frontier manages the set of URLs to be crawled.</p>

<p>Key properties:</p>

<ul>
  <li>Queue-based traversal (BFS-style)</li>
  <li>Deduplication via normalized URLs</li>
  <li>Separation between:
    <ul>
      <li>Seen URLs (already processed)</li>
      <li>Queued URLs (pending processing)</li>
    </ul>
  </li>
</ul>

<p>This prevents redundant crawling and avoids infinite loops caused by URL variations such as query parameters.</p>
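
<p>A minimal sketch of a frontier with these properties follows. The class and method names are assumptions made for illustration, and the <code class="language-plaintext highlighter-rouge">normalize</code> hook stands in for whatever canonicalization the crawler applies.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import deque


class Frontier:
    """FIFO URL queue that never hands out the same normalized URL twice."""

    def __init__(self, normalize=lambda url: url):
        # normalize is a hypothetical hook; the identity function is the default.
        self._normalize = normalize
        self._queue = deque()  # pending URLs in discovery order (BFS)
        self._queued = set()   # URLs currently waiting in the queue
        self._seen = set()     # URLs already handed out for processing

    def add(self, url):
        """Enqueue a URL unless it is already queued or seen. Returns True if added."""
        key = self._normalize(url)
        if key in self._seen or key in self._queued:
            return False
        self._queue.append(key)
        self._queued.add(key)
        return True

    def pop(self):
        """Return the next URL in BFS order, or None when the queue is empty."""
        if not self._queue:
            return None
        key = self._queue.popleft()
        self._queued.discard(key)
        self._seen.add(key)
        return key
</code></pre></div></div>

<p>This version is not thread-safe. With a thread pool, calls to <code class="language-plaintext highlighter-rouge">add</code> and <code class="language-plaintext highlighter-rouge">pop</code> would need to be serialized, for example with a lock.</p>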

<h3 id="url-normalization">URL Normalization</h3>

<p>URL normalization is a foundational component of the crawler.</p>

<p>Normalization includes:</p>

<ul>
  <li>Removing query parameters</li>
  <li>Removing fragments</li>
  <li>Stripping trailing slashes</li>
</ul>

<p>This ensures that logically identical pages map to a single canonical URL, which is critical for:</p>

<ul>
  <li>Deduplication</li>
  <li>Crawl correctness</li>
  <li>Stable data ingestion</li>
</ul>

<p>Media URLs use a slightly different normalization strategy to preserve meaningful query parameters when necessary.</p>
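
<p>The two normalization rules can be sketched with the standard library as shown below. The function names are hypothetical, and the media variant assumes that preserving query parameters means keeping the whole query string while still dropping the fragment.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from urllib.parse import urlsplit, urlunsplit


def normalize_page_url(url):
    """Drop the query string and fragment, then strip any trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def normalize_media_url(url):
    """Drop only the fragment so query parameters on media URLs survive."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))
</code></pre></div></div>

<p>Under these rules a page URL with a trailing slash, a tracking parameter, and an anchor collapses to the same canonical string as the bare path, which is what lets the frontier treat them as one page.</p>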

<h3 id="fetcher-http-layer">Fetcher (HTTP Layer)</h3>

<p>The fetcher is responsible for retrieving page content.</p>

<p>Features:</p>

<ul>
  <li>Persistent HTTP session</li>
  <li>Automatic redirect handling</li>
  <li>Retry with exponential backoff</li>
  <li>Configurable timeout and headers</li>
</ul>

<p>The fetcher returns both the response body and the final resolved URL, which is used for canonicalization and deduplication.</p>
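
<p>Retry with exponential backoff can be illustrated independently of any HTTP client. In the sketch below, <code class="language-plaintext highlighter-rouge">fetch</code> is a hypothetical callable that returns a <code class="language-plaintext highlighter-rouge">(body, final_url)</code> pair and raises on failure. The attempt count and delays are illustrative defaults, not the project's settings.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time


def fetch_with_retry(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds, doubling the wait after each failure.

    fetch is a hypothetical callable returning (body, final_url); it stands in
    for whatever HTTP client the project uses.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # base, 2x base, 4x base, ...
</code></pre></div></div>

<p>Passing <code class="language-plaintext highlighter-rouge">sleep</code> as a parameter keeps the function testable without real waiting. A production version would likely catch only network-related exceptions and add jitter to the delay.</p>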

<h3 id="language-aware-crawling">Language-Aware Crawling</h3>

<p>The crawler detects and handles multilingual content.</p>

<p>Strategy:</p>

<ul>
  <li>Detect French pages using HTML attributes or URL patterns</li>
  <li>Extract the corresponding English version using language switch links</li>
  <li>Enqueue the English page instead of parsing the French version</li>
</ul>

<p>This guarantees:</p>

<ul>
  <li>Consistent language across all extracted data</li>
  <li>Avoidance of duplicate ingestion across locales</li>
</ul>
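
<p>One way to implement the detection step is sketched below using only the standard library. It assumes the page declares its language on the <code class="language-plaintext highlighter-rouge">html</code> element and that the language switcher marks its links with an <code class="language-plaintext highlighter-rouge">hreflang</code> attribute. Both are assumptions about the markup, and the names are hypothetical.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from html.parser import HTMLParser


class LanguageProbe(HTMLParser):
    """Collect the page language and the first link flagged as English."""

    def __init__(self):
        super().__init__()
        self.page_lang = None
        self.english_href = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "html" and attributes.get("lang"):
            self.page_lang = attributes["lang"].lower()
        # Assumption: the language switcher marks its links with hreflang.
        is_english = (attributes.get("hreflang") or "").lower().startswith("en")
        if tag in ("a", "link") and is_english and self.english_href is None:
            self.english_href = attributes.get("href")


def english_redirect(html):
    """Return the English URL to enqueue for a French page, else None."""
    probe = LanguageProbe()
    probe.feed(html)
    if probe.page_lang and probe.page_lang.startswith("fr"):
        return probe.english_href
    return None
</code></pre></div></div>

<p>The returned link may be relative, so it would still need to be resolved against the page URL and normalized before it reaches the frontier.</p>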

<h3 id="parser-page-classification">Parser (Page Classification)</h3>

<p>The parser routes pages to specific parsing logic based on URL patterns.</p>

<p>Supported page types:</p>

<ul>
  <li>Product detail pages</li>
  <li>Product catalog/listing pages</li>
  <li>Winery pages (homepage, history, family)</li>
</ul>

<p>This routing layer ensures that:</p>

<ul>
  <li>Each page is parsed with the correct logic</li>
  <li>Parsing complexity is isolated per page type</li>
  <li>The system remains extensible for additional page types</li>
</ul>
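
<p>URL-based routing of this kind can be sketched as an ordered list of patterns. The path patterns below are hypothetical placeholders; the real site's URL scheme may differ.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re
from urllib.parse import urlsplit

# Hypothetical path patterns; the real site's URL scheme may differ.
ROUTES = [
    ("product_detail", re.compile(r"^/en/champagnes/[^/]+$")),
    ("product_listing", re.compile(r"^/en/champagnes$")),
    ("winery", re.compile(r"^/en(/history|/family)?$")),
]


def classify(url):
    """Return the first page type whose pattern matches the URL path."""
    path = urlsplit(url).path.rstrip("/")
    for page_type, pattern in ROUTES:
        if pattern.match(path):
            return page_type
    return None  # unknown pages are skipped rather than guessed at
</code></pre></div></div>

<p>Supporting a new page type then amounts to appending one entry to the table and providing the matching parsing logic.</p>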

<h2 id="extractor-architecture">Extractor Architecture</h2>

<h3 id="design-overview">Design Overview</h3>

<p>The core parsing logic is implemented as a modular extractor system based on the Strategy pattern.</p>

<p>Each extractor:</p>

<ul>
  <li>Targets a specific field or semantic block</li>
  <li>Defines its own DOM selector</li>
  <li>Implements extraction logic independently</li>
  <li>Returns structured data, media, and discovered URLs</li>
</ul>

<p>The parser composes multiple extractors dynamically and aggregates their outputs.</p>

<h3 id="execution-model">Execution Model</h3>

<p>For a given page:</p>

<ul>
  <li>A predefined list of extractors is executed</li>
  <li>Each extractor contributes partial results</li>
  <li>Outputs are merged into a unified structured representation, with later extractors able to override or enrich previously extracted fields</li>
</ul>

<p>This design provides:</p>

<ul>
  <li>Strong separation of concerns</li>
  <li>Fault isolation (failure in one extractor does not break others)</li>
  <li>Ease of extension (new fields require only a new extractor)</li>
</ul>
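
<p>The execution model can be illustrated with a small sketch. Here a plain dictionary stands in for the parsed page, and the two extractor classes and their field names are hypothetical examples rather than the project's actual extractors.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class TitleExtractor:
    """Example strategy: contributes the name field."""

    def extract(self, page):
        return {"name": page["title"].strip()}


class DosageExtractor:
    """Example strategy: contributes a numeric dosage_g_per_l field."""

    def extract(self, page):
        return {"dosage_g_per_l": float(page["dosage"])}


def run_extractors(page, extractors):
    """Run each extractor in order, merging results and isolating failures."""
    record, errors = {}, []
    for extractor in extractors:
        try:
            record.update(extractor.extract(page))  # later extractors win
        except Exception as exc:
            # One broken extractor must not discard what the others produced.
            errors.append((type(extractor).__name__, repr(exc)))
    return record, errors
</code></pre></div></div>

<p>If the page has no dosage value, <code class="language-plaintext highlighter-rouge">DosageExtractor</code> fails on its own and the record still carries the name, which is the fault isolation described above.</p>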

<h3 id="product-extraction">Product Extraction</h3>

<p>Product detail pages are parsed into highly structured entities.</p>

<p>Extracted fields include:</p>

<ul>
  <li>Basic identity (name, URL, description)</li>
  <li>Technical attributes (dosage, aging, temperature, blend)</li>
  <li>Grape composition (percentages per variety)</li>
  <li>Awards and ratings (structured pairs)</li>
  <li>Data sheet links (PDF)</li>
</ul>

<p>Many of these fields are derived from semi-structured DOM patterns such as key-value blocks or alternating nodes, requiring custom parsing logic rather than relying solely on static CSS selectors.</p>
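
<p>The helpers below sketch what such custom parsing might involve once the text of the relevant nodes has been collected. The function names, the strict label/value alternation, and the text formats (for example a blend written as percentages followed by variety names) are assumptions for illustration.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re


def pair_alternating(texts):
    """Turn ['Dosage', '7 g/L', 'Ageing', '3 years'] into a dict.

    Assumes labels and values strictly alternate; a trailing label
    without a value is dropped.
    """
    cleaned = [t.strip() for t in texts if t.strip()]
    return dict(zip(cleaned[0::2], cleaned[1::2]))


def parse_blend(text):
    """Extract a variety-to-percent mapping from text like '60% A, 40% B'."""
    pattern = re.compile(r"(\d+(?:\.\d+)?)\s*%\s*([^\d,;%]+)")
    return {name.strip(): float(pct) for pct, name in pattern.findall(text)}


def parse_dosage(text):
    """Return the dosage in g/L as a float, or None when no number is present."""
    match = re.search(r"(\d+(?:[.,]\d+)?)\s*g", text)
    return float(match.group(1).replace(",", ".")) if match else None
</code></pre></div></div>

<p>Returning <code class="language-plaintext highlighter-rouge">None</code> instead of raising keeps a missing or oddly formatted field from failing the whole page.</p>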

<h3 id="winery-extraction">Winery Extraction</h3>

<p>Winery data is distributed across multiple pages and must be assembled incrementally.</p>

<p>Sources:</p>

<ul>
  <li>Homepage: general description</li>
  <li>History page: timeline of events</li>
  <li>Family page: key individuals and descriptions</li>
</ul>

<p>Each page produces a partial record, which is later merged into a complete winery entity.</p>

<h3 id="media-extraction">Media Extraction</h3>

<p>Media extraction is treated as a first-class concern.</p>

<p>The system extracts media from multiple sources:</p>

<ul>
  <li>Image tags (<code class="language-plaintext highlighter-rouge">img</code>)</li>
  <li>Video tags (<code class="language-plaintext highlighter-rouge">video</code>, <code class="language-plaintext highlighter-rouge">source</code>)</li>
  <li>Picture elements</li>
  <li>CSS background images</li>
  <li>Custom attributes used by the site</li>
</ul>

<p>Additional logic includes:</p>

<ul>
  <li>Resolving <code class="language-plaintext highlighter-rouge">srcset</code> to select the highest quality asset</li>
  <li>Normalizing media URLs</li>
  <li>Deduplicating media per page</li>
</ul>
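
<p>Selecting the highest quality candidate from a <code class="language-plaintext highlighter-rouge">srcset</code> value can be sketched as follows. This simplified version splits candidates on commas, so it assumes the URLs themselves contain no commas. It also treats a candidate without a descriptor as 1x.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def best_from_srcset(srcset):
    """Return the URL with the largest width (w) or density (x) descriptor."""
    candidates = []
    for entry in srcset.split(","):
        parts = entry.split()
        if not parts:
            continue
        # A candidate without a descriptor is treated as 1x.
        descriptor = parts[1] if len(parts) == 2 else "1x"
        try:
            score = float(descriptor[:-1])
        except ValueError:
            continue  # malformed descriptor: skip this candidate
        candidates.append((score, parts[0]))
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]
</code></pre></div></div>

<p>The selected URL would then go through media URL normalization and per-page deduplication like any other asset.</p>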

<h3 id="link-discovery">Link Discovery</h3>

<p>The crawler continuously discovers new URLs during parsing.</p>

<p>Sources of discovered URLs:</p>

<ul>
  <li>Anchor tags within the page</li>
  <li>Extractors emitting additional links</li>
</ul>

<p>Discovered URLs are:</p>

<ul>
  <li>Converted to absolute URLs</li>
  <li>Filtered to ensure they are internal</li>
  <li>Normalized before being added to the frontier</li>
</ul>
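
<p>These three steps can be sketched with the standard library. The function name is hypothetical, and "internal" is interpreted here as "same host as the page being parsed".</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from urllib.parse import urljoin, urlsplit, urlunsplit


def discover_links(page_url, hrefs):
    """Resolve hrefs against the page, keep same-host links, and normalize them."""
    host = urlsplit(page_url).netloc
    found = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https") or parts.netloc != host:
            continue  # skips mailto:, tel:, and external sites
        path = parts.path.rstrip("/")
        canonical = urlunsplit((parts.scheme, parts.netloc, path, "", ""))
        if canonical not in found:
            found.append(canonical)
    return found
</code></pre></div></div>

<p>Results keep their discovery order, which helps keep traversal deterministic across runs.</p>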

<h3 id="crawl-scope-control">Crawl Scope Control</h3>

<p>To prevent uncontrolled crawling, the system restricts URLs based on allowed path prefixes.</p>

<p>Only relevant sections of the website are crawled, such as:</p>

<ul>
  <li>Product listings</li>
  <li>Product detail pages</li>
  <li>Winery-related pages</li>
</ul>

<p>This ensures:</p>

<ul>
  <li>Efficient crawling</li>
  <li>No drift into irrelevant or infinite sections</li>
</ul>
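
<p>A prefix-based scope check might look like the sketch below. The prefixes listed are hypothetical placeholders rather than the project's actual allow-list.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from urllib.parse import urlsplit

# Hypothetical prefixes; the project's real allow-list may differ.
ALLOWED_PREFIXES = ("/en/champagnes", "/en/history", "/en/family")


def in_scope(url, prefixes=ALLOWED_PREFIXES):
    """True when the URL path equals an allowed prefix or sits beneath it."""
    path = urlsplit(url).path.rstrip("/")
    return any(path == p or path.startswith(p + "/") for p in prefixes)
</code></pre></div></div>

<p>Matching on whole path segments, rather than on a raw string prefix, avoids accidentally admitting sibling sections whose names merely begin with an allowed prefix.</p>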

<h2 id="data-model">Data Model</h2>

<p>The system produces three primary entity types: Winery, Product, and Media.</p>

<h3 id="winery">Winery</h3>

<p>Represents a winery as a composite entity assembled from multiple pages.</p>

<p>Key attributes:</p>

<ul>
  <li>Name and website</li>
  <li>Description</li>
  <li>Family structure (key individuals and roles)</li>
  <li>Historical timeline (year-event pairs)</li>
</ul>

<p>The model supports incremental enrichment, where different pages contribute different parts of the final entity.</p>

<h3 id="product">Product</h3>

<p>Represents a structured wine product with rich attributes.</p>

<p>Key features:</p>

<ul>
  <li>Strong normalization of technical fields (e.g., numeric dosage instead of raw text)</li>
  <li>Structured grape composition (percentages per variety)</li>
  <li>Parsed awards and ratings</li>
  <li>Support for semi-structured key-value extraction</li>
</ul>

<p>This design enables downstream querying and analysis rather than simple text storage.</p>

<h3 id="media">Media</h3>

<p>Represents media assets associated with pages.</p>

<p>Attributes:</p>

<ul>
  <li>Media type (image or video)</li>
  <li>URL (normalized)</li>
  <li>Source page URL</li>
</ul>

<p>Media records are deduplicated and retain provenance for traceability.</p>
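
<p>The three entity types could be modeled with dataclasses along the following lines. The field names and types are assumptions derived from the attribute lists above, not the project's actual schema.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Media:
    media_type: str       # "image" or "video"
    url: str              # normalized media URL
    source_page_url: str  # page the asset was found on (provenance)


@dataclass
class Product:
    name: str
    url: str
    description: Optional[str] = None
    dosage_g_per_l: Optional[float] = None                 # numeric, not raw text
    grape_composition: dict = field(default_factory=dict)  # variety to percent
    awards: list = field(default_factory=list)             # structured pairs
    data_sheet_url: Optional[str] = None


@dataclass
class Winery:
    name: str
    website: str
    description: Optional[str] = None
    family: list = field(default_factory=list)    # (person, role) pairs
    timeline: list = field(default_factory=list)  # (year, event) pairs
</code></pre></div></div>

<p>Optional fields and empty-collection defaults reflect the incremental enrichment described above, where a record may be created before all of its sources have been crawled.</p>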

<h2 id="data-ingestion-strategy">Data Ingestion Strategy</h2>

<p>The system follows a progressive structuring and merge-based ingestion approach:</p>

<ol>
  <li>Raw HTML is fetched from pages</li>
  <li>Extractors convert DOM content into structured fields</li>
  <li>Page-level results are aggregated into intermediate objects</li>
  <li>Partial records from different pages are merged</li>
  <li>Final structured entities are persisted</li>
</ol>

<p>A key aspect of the system is its ability to handle partial and evolving data:</p>

<ul>
  <li>Records can be incomplete when first extracted</li>
  <li>Later crawls can enrich existing data</li>
  <li>Merging logic ensures no loss of information</li>
</ul>
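
<p>One possible merge policy is sketched below using plain dictionaries. It is an assumption about how merging could work, not the project's implementation: empty values never overwrite populated ones, lists are unioned in order, nested dictionaries are merged recursively, and a newer non-empty scalar replaces an older one.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def merge_records(existing, incoming):
    """Merge two partial records so that populated fields are never blanked.

    Empty values are ignored, lists are unioned in order, and nested
    dicts are merged recursively.
    """
    merged = dict(existing)
    for key, value in incoming.items():
        if value in (None, "", [], {}):
            continue  # nothing new to contribute
        current = merged.get(key)
        if isinstance(current, dict) and isinstance(value, dict):
            merged[key] = merge_records(current, value)
        elif isinstance(current, list) and isinstance(value, list):
            merged[key] = current + [item for item in value if item not in current]
        else:
            merged[key] = value  # fill a gap or refresh a scalar
    return merged
</code></pre></div></div>

<p>With this policy a homepage record holding only a description and a history record holding only a timeline combine into a single entity that carries both, regardless of which page was crawled first.</p>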

<h2 id="setup">Setup</h2>

<h3 id="install-dependencies">Install Dependencies</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
pip install -r requirements.txt

</code></pre></div></div>

<h3 id="environment-configuration">Environment Configuration</h3>

<p>Create a <code class="language-plaintext highlighter-rouge">.env</code> file with:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
SUPABASE_URL=your_supabase_url
SUPABASE_KEY=your_supabase_key
BASE_URL=https://www.josephperrier.com/

TIMEOUT=20
MAX_WORKERS=8
BATCH_SIZE=25

</code></pre></div></div>

<h3 id="initialize-database">Initialize Database</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
python scripts/init_db.py

</code></pre></div></div>

<h3 id="run-the-crawler">Run the Crawler</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
python -m src.cli.main crawl

</code></pre></div></div>

<p>You can also run the crawler directly:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python src/main.py
</code></pre></div></div>

<p>The crawler starts from the configured base URL and recursively discovers and processes relevant pages.</p>

<h2 id="cli-usage">CLI Usage</h2>

<p>After setting up the environment and initializing the database, you can use the CLI to run the crawler and inspect the results.</p>

<h3 id="run-the-crawler-1">Run the crawler</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python <span class="nt">-m</span> src.cli.main crawl
</code></pre></div></div>

<p>This will start crawling from the configured base URL and populate the database.</p>

<h3 id="list-products">List products</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python <span class="nt">-m</span> src.cli.main list-products
</code></pre></div></div>

<p>You can control how many results are returned:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python <span class="nt">-m</span> src.cli.main list-products <span class="nt">--limit</span> 10
</code></pre></div></div>

<h3 id="show-a-specific-product">Show a specific product</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python <span class="nt">-m</span> src.cli.main show-product <span class="s2">"cuvée royale brut"</span>
</code></pre></div></div>

<p>This will display all stored fields for the selected product.</p>

<h3 id="notes">Notes</h3>

<ul>
  <li>Make sure your <code class="language-plaintext highlighter-rouge">.env</code> file is correctly configured before running the CLI.</li>
  <li>All commands should be run from the project root directory.</li>
</ul>

<h2 id="example-run">Example Run</h2>

<p>Input:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
https://www.josephperrier.com/

</code></pre></div></div>

<p>Output:</p>

<ul>
  <li>Structured winery data assembled from multiple pages</li>
  <li>Product catalog with detailed attributes</li>
  <li>Media assets extracted from across the site</li>
  <li>Internal links discovered and traversed</li>
</ul>

<h2 id="key-design-decisions">Key Design Decisions</h2>

<h3 id="1-extractor-based-parsing">1. Extractor-Based Parsing</h3>

<p>Instead of a monolithic parser, the system uses small, composable extractors.</p>

<p>Benefits:</p>

<ul>
  <li>Easier maintenance</li>
  <li>Independent evolution of parsing logic</li>
  <li>Better fault isolation</li>
</ul>

<h3 id="2-deterministic-crawling">2. Deterministic Crawling</h3>

<p>Strict URL normalization and deduplication ensure:</p>

<ul>
  <li>No duplicate crawling</li>
  <li>Stable ingestion behavior</li>
  <li>Predictable crawl graph</li>
</ul>

<h3 id="3-structure-aware-parsing">3. Structure-Aware Parsing</h3>

<p>Many fields are not directly extractable via simple selectors.</p>

<p>The system handles:</p>

<ul>
  <li>Key-value pairing</li>
  <li>Alternating DOM patterns</li>
  <li>Nested content structures</li>
</ul>

<h3 id="4-multi-page-entity-assembly">4. Multi-Page Entity Assembly</h3>

<p>Entities such as wineries are not confined to a single page.</p>

<p>The system:</p>

<ul>
  <li>Extracts partial records</li>
  <li>Merges them across pages</li>
  <li>Produces a complete structured entity</li>
</ul>

<h3 id="5-robustness-to-real-world-web-variability">5. Robustness to Real-World Web Variability</h3>

<p>The crawler is designed to handle:</p>

<ul>
  <li>Inconsistent HTML structures</li>
  <li>Missing fields</li>
  <li>Layout variations across pages</li>
</ul>

<p>This is achieved through:</p>

<ul>
  <li>Defensive extraction logic</li>
  <li>Independent extractors</li>
  <li>Merge-based data modeling</li>
</ul>

<h2 id="future-improvements">Future Improvements</h2>

<ul>
  <li>Generalize crawler to support multiple winery sites</li>
  <li>Introduce distributed crawling (queue-based architecture)</li>
  <li>Add persistent deduplication (e.g., Redis)</li>
  <li>Improve observability (metrics and tracing)</li>
  <li>Implement incremental recrawling strategies</li>
</ul>]]></content><author><name>Zhicheng Zhang</name></author><category term="Project" /><summary type="html"><![CDATA[A modular backend data ingestion system for extracting and structuring semi-structured web data, designed with extensible parsing and resilient processing under real-world conditions.]]></summary></entry><entry><title type="html">Wrapping Up My Time at Prox Shopping</title><link href="https://zhichzhang.dev//2026-03-20/prox-shopping-intern" rel="alternate" type="text/html" title="Wrapping Up My Time at Prox Shopping" /><published>2026-03-20T23:14:00+00:00</published><updated>2026-03-20T23:14:00+00:00</updated><id>https://zhichzhang.dev//2026-03-20/prox-shopping-intern</id><content type="html" xml:base="https://zhichzhang.dev//2026-03-20/prox-shopping-intern"><![CDATA[<!--excerpt-->

<div style="text-align: center; margin: 1.5rem 0 2rem;">
  <div style="
    width: 140px;
    height: 140px;
    border-radius: 50%;
    overflow: hidden;
    margin: 0 auto;
  ">
    <img src="https://www.joinprox.com/assets/prox-logo-Cg5H0Ryj.png" alt="Prox Shopping" style="
        width: 100%;
        height: 100%;
        object-fit: cover;
        display: block;
      " />
  </div>
</div>

<p>Over the past few months at Prox Shopping, I worked on backend systems that sat much closer to real product data than anything I had handled before. The work covered ingestion, attribution, and product resolution, and while each part had its own shape, they all revolved around the same problem: making the system behave sensibly even when the inputs were messy, incomplete, or simply not as clean as one would like.</p>

<p>What I learned most was that backend work is often defined less by the visible features it enables than by the invisible decisions that hold those features together. Details such as failure boundaries, data modeling, and observability tend to look small at first, yet they become the difference between something that merely works in a demo and something that can survive real traffic with some degree of composure. That was the part of the experience that stayed with me the most.</p>

<p>I am grateful to the team for the support, patience, and trust they gave me throughout the internship. It was a genuinely valuable stretch of time, and it sharpened both my engineering judgment and my understanding of what it means to build systems that people can actually rely on.</p>]]></content><author><name>Zhicheng Zhang</name></author><category term="Diary" /><summary type="html"><![CDATA[Worked as a backend engineering intern at Prox Shopping, focusing on ingestion, attribution, and product resolution systems, and gained a deeper understanding of designing reliable systems under real-world constraints.]]></summary></entry><entry><title type="html">Wrapping Up My Fogsight Internship</title><link href="https://zhichzhang.dev//2025-08-01/fogsight-intern" rel="alternate" type="text/html" title="Wrapping Up My Fogsight Internship" /><published>2025-08-01T22:30:25+00:00</published><updated>2025-08-01T22:30:25+00:00</updated><id>https://zhichzhang.dev//2025-08-01/fogsight-intern</id><content type="html" xml:base="https://zhichzhang.dev//2025-08-01/fogsight-intern"><![CDATA[<!--excerpt-->

<div style="text-align: center; margin: 2rem 0 2.5rem;">
  <div style="
    width: 320px;
    height: 200px;
    margin: 0 auto;
    display: flex;
    align-items: center;
    justify-content: center;
  ">
    <img src="https://fogsight.ai/assets/logo-BdlAC-4E.png" alt="Fogsight" style="
        width: 100%;
        height: 100%;
        object-fit: contain;
        display: block;
      " />
  </div>
</div>

<p>This internship at Fogsight was my first time working in an industry setting, and it turned out to be much more challenging than I initially expected. Instead of building something from scratch, I spent most of my time working on a React-based refactor of the existing frontend, which meant understanding and reshaping a system that was already in use.</p>

<p>The refactor itself involved reorganizing the client-side structure, improving maintainability, and making the overall system easier to extend. What made it difficult was not just the technical part, but figuring out how to evolve an existing codebase without breaking assumptions that were already embedded in it. That process forced me to think more carefully about system boundaries, data flow, and how frontend architecture interacts with backend behavior.</p>

<p>I was fortunate to work under the guidance of my mentor, Zhengwentai Sun, who helped me navigate both the technical challenges and the broader engineering context behind the work. With his support, I was eventually able to complete the refactor and see the system in a much more structured form.</p>

<p>Looking back, this experience changed how I think about frontend engineering. It is no longer just about building interfaces, but about designing systems that can evolve over time while remaining reliable. Finishing the refactor felt genuinely rewarding, especially given how uncertain it felt at the beginning.</p>]]></content><author><name>Zhicheng Zhang</name></author><category term="Diary" /><summary type="html"><![CDATA[Completed a React-based refactor of the existing frontend during my first industry internship at Fogsight, with a focus on maintainability and extensibility.]]></summary></entry></feed>