I build client-owned patent data infrastructure: on-prem pipelines that turn USPTO/EPO PDF and XML releases into byte-verifiable, deduplicated, ClickHouse-ready corpora for legal, search, analytics, and AI workflows.
I build infrastructure that turns large, messy technical document estates into compact, typed, verifiable data substrates for enterprise teams.
I design and build client-owned patent corpus infrastructure for organizations that need high-volume patent data to become searchable, joinable, verifiable, and usable across internal workflows.
The core system turns ongoing USPTO and EPO patent releases into a private ClickHouse corpus with deterministic file identity, deduplicated document state, page-level structure, PDF/XML alignment, structured metadata, regex-derived facts, and portable Parquet packages. Legal, analytics, product, search, AI, and competitive-intelligence teams can work from the same source-of-truth data layer.
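As a rough illustration of deterministic file identity and deduplicated document state, the sketch below groups files by a hash of their bytes and treats the first path in sorted order as canonical. It is a minimal sketch under stated assumptions, not the production pipeline: the function names and example paths are hypothetical, and because the Python standard library has no BLAKE3, it uses BLAKE2b as a stand-in.

```python
import hashlib
from collections import defaultdict


def content_id(data: bytes) -> str:
    """Return a hex digest that identifies a file purely by its bytes.

    Assumption: BLAKE2b is a stand-in here; a BLAKE3 binding would be
    swapped in where BLAKE3 addressing is required.
    """
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def deduplicate(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Group file paths by content hash.

    Paths are visited in sorted order, so the first path in each group
    is a stable (hypothetical) choice of canonical copy.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for path in sorted(files):
        groups[content_id(files[path])].append(path)
    return dict(groups)


# Hypothetical release contents: two byte-identical copies and one distinct file.
example_files = {
    "release_b/doc.pdf": b"same bytes",
    "release_a/doc.pdf": b"same bytes",
    "release_a/other.pdf": b"different bytes",
}
example_groups = deduplicate(example_files)
```

Because identity depends only on bytes, the same file arriving in two releases under different names collapses to one group regardless of ingest order.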
My current commercial focus is an on-prem patent intelligence data layer for patent-heavy enterprises. It compiles raw patent-office source material into owned infrastructure: BLAKE3-addressed files, UUID-isolated ingest snapshots, null-free corpus tables, explicit sentinel states, page-aligned text, table geometry, jurisdiction-aware metadata, regex capture facts, and ClickHouse/Parquet outputs.
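One way to picture null-free tables with explicit sentinel states is a normalization step that maps loosely typed source records onto a fixed row type, replacing every missing value with a named sentinel. The field names and sentinel values below are assumptions chosen for illustration, not the actual schema.

```python
from dataclasses import dataclass

# Hypothetical sentinels; a real corpus may choose different values.
UNKNOWN_TEXT = "__UNKNOWN__"
UNKNOWN_DATE = "1970-01-01"
UNKNOWN_COUNT = -1


@dataclass(frozen=True)
class PatentRow:
    """Illustrative fixed-schema row in which no field is ever None."""
    publication_number: str
    jurisdiction: str
    filing_date: str
    page_count: int


def to_row(raw: dict) -> PatentRow:
    """Normalize a raw record, substituting sentinels for missing values."""

    def text(key: str, default: str) -> str:
        value = raw.get(key)
        if value is None or str(value).strip() == "":
            return default
        return str(value).strip()

    pages = raw.get("page_count")
    valid_pages = isinstance(pages, int) and not isinstance(pages, bool) and pages >= 0
    return PatentRow(
        publication_number=text("publication_number", UNKNOWN_TEXT),
        jurisdiction=text("jurisdiction", UNKNOWN_TEXT),
        filing_date=text("filing_date", UNKNOWN_DATE),
        page_count=pages if valid_pages else UNKNOWN_COUNT,
    )


# Hypothetical record with several fields missing.
example_row = to_row({"publication_number": " EXAMPLE-0001 ", "page_count": 12})
```

With sentinels, "missing" becomes an ordinary value that can be filtered, counted, and joined on, rather than a null that behaves differently in each downstream tool.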
This gives internal teams capabilities that compound:
raw patent releases
-> deterministic ingest
-> content identity
-> canonical deduplication
-> PDF/XML/page alignment
-> fixed-schema ClickHouse tables
-> Parquet packages
-> legal/search/AI/analytics workflows
The main value is operational. A team can hash any patent PDF and join it to the corpus. A review package can be exported as Parquet and hashed. A weekly competitor-monitoring slice can be regenerated and compared automatically. A legal team can open the exact PDF behind a ClickHouse row. A data team can build Python tools on the same stable corpus. An AI team can create embeddings, retrieval indexes, summaries, translations, and evaluation sets from a consistent data substrate.
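The automatic comparison of a regenerated slice can be sketched as a diff between two manifests that map document identifiers to content hashes. This is a minimal sketch under stated assumptions: the manifest shape, the identifiers, and the hash strings are hypothetical placeholders, and the hashes would in practice come from the corpus itself.

```python
def diff_slices(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two manifests of {document_id: content_hash}.

    Returns the identifiers that were added, removed, or whose content
    hash changed between the previous and the current slice.
    """
    prev_ids, curr_ids = set(previous), set(current)
    return {
        "added": sorted(curr_ids - prev_ids),
        "removed": sorted(prev_ids - curr_ids),
        "changed": sorted(
            doc_id for doc_id in prev_ids & curr_ids
            if previous[doc_id] != current[doc_id]
        ),
    }


# Hypothetical manifests for two consecutive weekly slices.
last_week = {"doc-a": "hash-1", "doc-b": "hash-2", "doc-c": "hash-3"}
this_week = {"doc-a": "hash-1", "doc-b": "hash-9", "doc-d": "hash-4"}
example_diff = diff_slices(last_week, this_week)
```

Because both slices are keyed by stable identifiers and content hashes, an unchanged slice yields three empty lists, which makes "nothing changed this week" a checkable result rather than an assumption.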
My work combines production MLOps, data engineering, and corpus-scale systems design.
This work fits enterprises with large patent estates, active competitor monitoring, recurring legal review, internal search needs, AI initiatives, portfolio analytics, or cross-team document workflows. The system creates a private patent data substrate that internal teams can extend through SQL, Python, dashboards, APIs, embeddings, alerts, and review tools.
If your organization works with large patent corpora and needs owned infrastructure rather than one-off exports, I am open to focused commercial discussions.