Tobias Klein: Machine Learning Engineer Portfolio

I build infrastructure that turns large, messy technical document estates into compact, typed, verifiable data substrates for enterprise teams.

Patent Corpus Infrastructure

I design and build client-owned patent corpus infrastructure for organizations that need high-volume patent data to become searchable, joinable, verifiable, and usable across internal workflows.

The core system turns ongoing USPTO and EPO patent releases into a private ClickHouse corpus with deterministic file identity, deduplicated document state, page-level structure, PDF/XML alignment, structured metadata, regex-derived facts, and portable Parquet packages. Legal, analytics, product, search, AI, and competitive-intelligence teams can work from the same source-of-truth data layer.

What the System Enables

Exact document identity: every PDF receives a full-length BLAKE3 content hash, giving teams a byte-level join key that does not depend on filenames, patent IDs, folder names, or source packages.
Canonical deduplication: repeated files collapse into stable corpus identity while source paths, original names, and duplicate observations remain queryable.
ClickHouse serving layer: the enterprise receives fast SQL access to document-level, page-level, metadata, text, table, and regex facts.
Denormalized analytical rows: document context is broadcast onto evidence rows, so a hash join returns immediately useful patent context.
Portable Parquet work packages: any SQL-defined subset can become a typed, compressed, hash-checkable artifact for legal review, presentations, outside counsel, dashboards, or ML experiments.
Python-ready workflows: ClickHouse, DuckDB, PyArrow, Pandas, and NumPy can operate directly on exported tables and Parquet slices.
Stable automation: recurring exports can be hashed, compared, skipped, published, or regenerated based on deterministic package identity.
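
The first two capabilities, content identity and canonical deduplication, can be illustrated with a minimal sketch. It is not the production implementation. The function names, paths, and sample bytes are hypothetical, and the code falls back to BLAKE2b from the standard library when the third-party `blake3` package is not installed, so the fallback digests are not BLAKE3 digests.

```python
import hashlib
from collections import defaultdict

try:
    # Third-party package that provides real BLAKE3, if installed.
    from blake3 import blake3 as _hasher
except ImportError:
    # Stand-in only: BLAKE2b from the standard library, NOT BLAKE3.
    _hasher = hashlib.blake2b


def content_hash(data: bytes) -> str:
    """Return the untruncated hex digest of the raw file bytes."""
    return _hasher(data).hexdigest()


def deduplicate(observations):
    """Group (source_path, file_bytes) pairs by content hash.

    Each distinct byte sequence becomes one corpus identity, while every
    path it was observed under stays attached to that identity.
    """
    corpus = defaultdict(list)
    for source_path, data in observations:
        corpus[content_hash(data)].append(source_path)
    return dict(corpus)


# Hypothetical observations: the same bytes delivered under two names.
observed = [
    ("release_a/US1234567.pdf", b"%PDF-1.7 same bytes"),
    ("release_b/copy_of_US1234567.pdf", b"%PDF-1.7 same bytes"),
    ("release_b/EP7654321.pdf", b"%PDF-1.7 other bytes"),
]
corpus = deduplicate(observed)
for digest, paths in corpus.items():
    print(digest[:16], paths)
```

Keying on bytes rather than names means a renamed or re-delivered file lands on the identity it already has, and the list of paths keeps the duplicate observations queryable.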

Current Product Focus

My current commercial focus is an on-prem patent intelligence data layer for patent-heavy enterprises. It compiles raw patent-office source material into owned infrastructure: BLAKE3-addressed files, UUID-isolated ingest snapshots, null-free corpus tables, explicit sentinel states, page-aligned text, table geometry, jurisdiction-aware metadata, regex capture facts, and ClickHouse/Parquet outputs.

This gives internal teams capabilities that compound:

raw patent releases
-> deterministic ingest
-> content identity
-> canonical deduplication
-> PDF/XML/page alignment
-> fixed-schema ClickHouse tables
-> Parquet packages
-> legal/search/AI/analytics workflows
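
The fixed-schema table step in the flow above relies on null-free rows with explicit sentinel states. A minimal sketch of that idea follows. The state names, the `PageRow` fields, and the `to_page_row` helper are assumptions made for illustration and do not reflect the actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TextState(str, Enum):
    """Hypothetical validity states for extracted page text."""
    OK = "ok"
    EMPTY_PAGE = "empty_page"
    EXTRACTION_FAILED = "extraction_failed"


@dataclass(frozen=True)
class PageRow:
    file_hash: str
    page_number: int
    text: str              # never None; "" when there is no usable text
    text_state: TextState  # says why the text is empty, if it is


def to_page_row(file_hash: str, page_number: int,
                raw_text: Optional[str]) -> PageRow:
    """Map a possibly missing extraction result onto a null-free row."""
    if raw_text is None:
        return PageRow(file_hash, page_number, "", TextState.EXTRACTION_FAILED)
    if not raw_text.strip():
        return PageRow(file_hash, page_number, "", TextState.EMPTY_PAGE)
    return PageRow(file_hash, page_number, raw_text, TextState.OK)


rows = [
    to_page_row("aa11", 1, "A method for ..."),
    to_page_row("aa11", 2, "   "),
    to_page_row("aa11", 3, None),
]
for row in rows:
    print(row.page_number, row.text_state.value)
```

Because the reason for a missing value is stored as data, a downstream query can filter on the state column instead of guessing what a NULL was meant to say.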

The important value is operational. A team can hash any patent PDF and join it to the corpus. A review package can be exported as Parquet and hashed. A weekly competitor-monitoring slice can be regenerated and compared automatically. A legal team can open the exact PDF behind a ClickHouse row. A data team can build Python tools on the same stable corpus. An AI team can create embeddings, retrieval indexes, summaries, translations, and evaluation sets from a consistent data substrate.
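
The regenerate-and-compare step for a recurring slice can be sketched as follows. This is an illustration under stated assumptions, not the actual export logic: it hashes a canonical JSON serialization of the rows rather than a Parquet file, it uses BLAKE2b from the standard library as a stand-in for BLAKE3, and `package_digest`, `plan_export`, and the sample rows are hypothetical.

```python
import hashlib
import json


def package_digest(rows):
    """Hash a canonical serialization of the rows.

    Keys are sorted within each row and rows are sorted as strings, so
    the digest does not depend on key order or row order.
    """
    canonical = sorted(json.dumps(row, sort_keys=True) for row in rows)
    hasher = hashlib.blake2b()  # stand-in for BLAKE3
    for line in canonical:
        hasher.update(line.encode("utf-8"))
        hasher.update(b"\n")
    return hasher.hexdigest()


def plan_export(rows, previous_digest):
    """Return ("skip" | "publish", digest) for a regenerated slice."""
    digest = package_digest(rows)
    action = "skip" if digest == previous_digest else "publish"
    return action, digest


last_week = [{"file_hash": "aa11", "assignee": "ExampleCo"}]
# Same content, different key order: should be recognized as unchanged.
unchanged = [{"assignee": "ExampleCo", "file_hash": "aa11"}]
# One additional document: should trigger a new package.
changed = unchanged + [{"assignee": "ExampleCo", "file_hash": "bb22"}]

previous = package_digest(last_week)
print(plan_export(unchanged, previous)[0])
print(plan_export(changed, previous)[0])
```

The automation only needs to store the previous digest. Whether the slice is republished follows from comparing two strings.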

Technical Foundation

My work combines production MLOps, data engineering, and corpus-scale systems design:

DuckDB staging for relational run state, validation gates, and transformation control
ClickHouse materialization for enterprise-scale search, joins, aggregations, dashboards, and internal APIs
BLAKE3 hashing for content identity, deduplication, corpus joins, and package verification
Parquet exports for compact, typed, portable analytical handoffs
Python integrations with PyArrow, Pandas, NumPy, and ClickHouse clients
Hydra-style configuration, reproducible execution, structured logging, and CI-oriented development
PDF/XML/text/table extraction pipelines with page-level alignment and explicit validity states
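
As a rough illustration of the validation gates mentioned in the first item, the sketch below runs violation-counting queries against a staging table before anything is materialized. It uses the standard-library `sqlite3` module purely as a stand-in for DuckDB so that it runs without extra dependencies. The table name, columns, and gate definitions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staged_pages ("
    "file_hash TEXT, page_number INTEGER, text_state TEXT)"
)
conn.executemany(
    "INSERT INTO staged_pages VALUES (?, ?, ?)",
    [
        ("aa11", 1, "ok"),
        ("aa11", 2, "empty_page"),
        ("bb22", 1, "ok"),
        ("bb22", 1, "ok"),  # deliberate duplicate page to trip a gate
    ],
)

# Each gate is a query that counts violations; zero means the gate passes.
GATES = {
    "no_null_hash": "SELECT COUNT(*) FROM staged_pages WHERE file_hash IS NULL",
    "page_numbers_start_at_1":
        "SELECT COUNT(*) FROM staged_pages WHERE page_number < 1",
    "unique_page_per_file":
        "SELECT COUNT(*) FROM ("
        "  SELECT file_hash, page_number FROM staged_pages"
        "  GROUP BY file_hash, page_number HAVING COUNT(*) > 1)",
}


def run_gates(connection, gates):
    """Return {gate_name: violation_count} for every gate."""
    return {
        name: connection.execute(sql).fetchone()[0]
        for name, sql in gates.items()
    }


results = run_gates(conn, GATES)
can_materialize = all(count == 0 for count in results.values())
print(results, can_materialize)
```

Expressing each gate as a query keeps the check next to the data it guards, and a failing run reports how many rows violated the rule rather than only that something went wrong.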

Commercial Fit

This work fits enterprises with large patent estates, active competitor monitoring, recurring legal review, internal search needs, AI initiatives, portfolio analytics, or cross-team document workflows. The system creates a private patent data substrate that internal teams can extend through SQL, Python, dashboards, APIs, embeddings, alerts, and review tools.

If your organization works with large patent corpora and needs owned infrastructure rather than one-off exports, I am open to focused commercial discussions.

Connect on LinkedIn

Latest Projects

Transformations as the Backbone of a Modular MLOps Pipeline

Poor code organization leads to "pipeline spaghetti," where data ingestion, cleaning, feature engineering, and modeling code are tangled together. This tangle often arises when code is developed in a linear fashion (for example, in one giant notebook) rather than separated into reusable modules for each pipeline stage. The result is code that is hard to test or reuse.
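
By contrast, a modular layout gives each stage its own function with a plain input and output, so that stages can be tested and reused independently. The sketch below is a toy illustration, not the pipeline from the article. The stage names, the `id,value` line format, and the derived feature are hypothetical.

```python
def ingest(raw_lines):
    """Parse 'id,value' lines into records with string fields."""
    return [dict(zip(("id", "value"), line.split(","))) for line in raw_lines]


def clean(records):
    """Drop records with an empty value and convert the rest to float."""
    return [
        {"id": r["id"], "value": float(r["value"])}
        for r in records
        if r["value"].strip()
    ]


def add_features(records):
    """Attach a derived feature without mutating the input records."""
    return [{**r, "value_squared": r["value"] ** 2} for r in records]


def run_pipeline(data, stages):
    """Apply each stage in order, feeding its output to the next stage."""
    for stage in stages:
        data = stage(data)
    return data


result = run_pipeline(["a,2", "b,", "c,3"], [ingest, clean, add_features])
print(result)
```

Each function can be unit-tested with a small list of inputs, and a stage can be swapped or reordered by editing the list passed to `run_pipeline`.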

Hi, I'm Tobias Klein.

Patent Corpus Infrastructure Architect