Hi, I'm Tobias Klein.

Patent Corpus Infrastructure Architect

I build client-owned patent data infrastructure: on-prem pipelines that turn USPTO/EPO PDF and XML releases into byte-verifiable, deduplicated, ClickHouse-ready corpora for legal, search, analytics, and AI workflows.

Explore Patent Corpus Infrastructure

Turning Patent Documents Into Enterprise Data Infrastructure

I specialize in deterministic corpus-construction systems for high-volume technical documents. My current focus is private patent intelligence infrastructure: BLAKE3 content identity, UUID-based ingest isolation, DuckDB staging, page/XML/PDF alignment, fixed-schema ClickHouse tables, Parquet work packages, and Python-ready analytical workflows. The result is an owned data layer that lets enterprise teams search, join, verify, package, enrich, and operationalize patent data at scale.


I build infrastructure that turns large, messy technical document estates into compact, typed, verifiable data substrates for enterprise teams.

Patent Corpus Infrastructure

I design and build client-owned patent corpus infrastructure for organizations that need high-volume patent data to become searchable, joinable, verifiable, and usable across internal workflows.

The core system turns ongoing USPTO and EPO patent releases into a private ClickHouse corpus with deterministic file identity, deduplicated document state, page-level structure, PDF/XML alignment, structured metadata, regex-derived facts, and portable Parquet packages. Legal, analytics, product, search, AI, and competitive-intelligence teams can work from the same source-of-truth data layer.

What the System Enables

  • Exact document identity: every PDF receives a full-length BLAKE3 content hash, giving teams a byte-level join key that is independent of filenames, patent IDs, folder names, and source packages.
  • Canonical deduplication: repeated files collapse into a single stable corpus identity, while source paths, original names, and duplicate observations remain queryable.
  • ClickHouse serving layer: the enterprise receives fast SQL access to document-level, page-level, metadata, text, table, and regex facts.
  • Denormalized analytical rows: document context is copied onto each evidence row, so a single hash join returns usable patent context without further lookups.
  • Portable Parquet work packages: any SQL-defined subset can become a typed, compressed, hash-checkable artifact for legal review, presentations, outside counsel, dashboards, or ML experiments.
  • Python-ready workflows: ClickHouse, DuckDB, PyArrow, Pandas, and NumPy can operate directly on exported tables and Parquet slices.
  • Stable automation: recurring exports can be hashed, compared, skipped, published, or regenerated based on deterministic package identity.
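
As a minimal sketch of the content-identity and deduplication ideas in the list above: the snippet below hashes each file's bytes and groups every observed path under its hash, so duplicates collapse to one identity while the individual paths stay visible. The function names (content_hash, group_by_content) are hypothetical and chosen only for illustration. It uses the third-party blake3 package if it is installed and otherwise falls back to BLAKE2b from the standard library purely as a stand-in; the production system's actual code is not shown here.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

try:
    # Third-party `blake3` package, used when it is installed.
    from blake3 import blake3 as _new_hasher
except ImportError:
    # Stand-in for illustration only: BLAKE2b from the standard library.
    def _new_hasher():
        return hashlib.blake2b(digest_size=32)


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return a hex digest of the file's bytes, read in fixed-size chunks."""
    hasher = _new_hasher()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def group_by_content(paths):
    """Map each content hash to every path observed with those exact bytes."""
    observations = defaultdict(list)
    for path in sorted(paths):
        observations[content_hash(path)].append(path)
    return dict(observations)
```

Calling group_by_content on a folder of PDFs returns one key per distinct byte sequence. A key with more than one path is a duplicate observation, which is the information a deduplicated corpus would keep queryable.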

Current Product Focus

My current commercial focus is an on-prem patent intelligence data layer for patent-heavy enterprises. It compiles raw patent-office source material into owned infrastructure: BLAKE3-addressed files, UUID-isolated ingest snapshots, null-free corpus tables, explicit sentinel states, page-aligned text, table geometry, jurisdiction-aware metadata, regex capture facts, and ClickHouse/Parquet outputs.
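
To illustrate what "null-free corpus tables" with "explicit sentinel states" can look like in practice, here is a small sketch. The state names, the PageRow shape, and the to_page_row helper are assumptions made for this example, not the real schema: the point is only that a missing value is replaced by a typed default plus a state column that says why the value is missing.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TextState(str, Enum):
    """Hypothetical validity states for a page's extracted text."""
    OK = "ok"
    EMPTY_PAGE = "empty_page"
    EXTRACTION_FAILED = "extraction_failed"


@dataclass(frozen=True)
class PageRow:
    content_hash: str
    page_number: int
    text: str              # never None; "" when no text is available
    text_state: TextState  # explains why text is empty, if it is


def to_page_row(content_hash: str, page_number: int,
                raw_text: Optional[str]) -> PageRow:
    """Convert a possibly missing extraction result into a null-free row."""
    if raw_text is None:
        # The extractor produced nothing at all: record the failure explicitly.
        return PageRow(content_hash, page_number, "", TextState.EXTRACTION_FAILED)
    if not raw_text.strip():
        # The extractor ran but the page has no text content.
        return PageRow(content_hash, page_number, "", TextState.EMPTY_PAGE)
    return PageRow(content_hash, page_number, raw_text, TextState.OK)
```

With this shape, a query can filter on text_state instead of testing for NULL, and an empty string never has an ambiguous meaning.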

This gives internal teams capabilities that compound:

raw patent releases
-> deterministic ingest
-> content identity
-> canonical deduplication
-> PDF/XML/page alignment
-> fixed-schema ClickHouse tables
-> Parquet packages
-> legal/search/AI/analytics workflows

The main value is operational. A team can hash any patent PDF and join it to the corpus. A review package can be exported as Parquet and hashed. A weekly competitor-monitoring slice can be regenerated and compared automatically. A legal team can open the exact PDF behind a ClickHouse row. A data team can build Python tools on the same stable corpus. An AI team can create embeddings, retrieval indexes, summaries, translations, and evaluation sets from a consistent data substrate.
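
The "regenerate and compare" step can be sketched as follows. This is an illustration under stated assumptions, not the system's export code: it hashes the logical rows of a slice in a canonical order and encoding (JSON lines) rather than the bytes of a Parquet file, and it uses SHA-256 from the standard library as a stand-in for BLAKE3. The names package_digest and plan_export are hypothetical.

```python
import hashlib
import json


def package_digest(rows):
    """Hash rows in a canonical order and encoding.

    Equal row content gives an equal digest, regardless of input order.
    """
    canonical = sorted(
        json.dumps(row, sort_keys=True, separators=(",", ":")) for row in rows
    )
    hasher = hashlib.sha256()  # stand-in for BLAKE3 in this sketch
    for line in canonical:
        hasher.update(line.encode("utf-8"))
        hasher.update(b"\n")
    return hasher.hexdigest()


def plan_export(rows, previous_digest):
    """Decide whether a regenerated slice needs to be published again."""
    digest = package_digest(rows)
    action = "skip" if digest == previous_digest else "publish"
    return action, digest
```

A weekly job could store the digest of the last published slice, call plan_export on the regenerated rows, and publish only when the digest changes.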

Technical Foundation

My work combines production MLOps, data engineering, and corpus-scale systems design:

  • DuckDB staging for relational run state, validation gates, and transformation control
  • ClickHouse materialization for enterprise-scale search, joins, aggregations, dashboards, and internal APIs
  • BLAKE3 hashing for content identity, deduplication, corpus joins, and package verification
  • Parquet exports for compact, typed, portable analytical handoffs
  • Python integrations with PyArrow, Pandas, NumPy, and ClickHouse clients
  • Hydra-style configuration, reproducible execution, structured logging, and CI-oriented development
  • PDF/XML/text/table extraction pipelines with page-level alignment and explicit validity states

Commercial Fit

This work fits enterprises with large patent estates, active competitor monitoring, recurring legal review, internal search needs, AI/search initiatives, portfolio analytics, or cross-team document workflows. The system creates a private patent data substrate that internal teams can extend through SQL, Python, dashboards, APIs, embeddings, alerts, and review tools.

If your organization works with large patent corpora and needs owned infrastructure rather than one-off exports, I am open to focused commercial discussions.

Connect on LinkedIn