Evidence-Driven Patent PDF Processing Before Extraction

From PDF extraction to patent data delivery through strict, executable validation.


Evidence before conversion

Companies with large patent portfolios do not need another system that says it can extract text from PDFs.

They need a system that turns difficult patent documents into reliable, analyst-ready data.

That difference starts before OCR, before text conversion, and before field extraction.

Each PDF is opened with a low-level PDF parser. The pipeline collects the file path, validity flag, page count, PDF version, attachment count, metadata dictionary fields, page sizes, page widths, page heights, and page indexes. This is the raw evidence layer.

There are no fallbacks in this path. If the PDF cannot produce a valid page count, the downstream structure cannot be built from it. If the page count is less than one, the file is marked invalid. Otherwise, the page count becomes the controlling value for the rest of the metadata extraction.
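
A minimal sketch of this evidence layer, assuming pypdfium2 as the low-level parser (the library, helper names, and dictionary keys below are illustrative, not confirmed pipeline code):

import pypdfium2 as pdfium

def collect_pdf_evidence(pdf_path):
    """Raw evidence layer: facts read directly from the PDF, with no fallback values."""
    pdf_file_metadata = {"file_path": pdf_path, "valid": False}
    pdf_doc = pdfium.PdfDocument(pdf_path)

    pdf_file_metadata["page_count"] = len(pdf_doc)            # parser-derived page count
    pdf_file_metadata["pdf_version"] = pdf_doc.get_version()
    pdf_file_metadata["metadata"] = pdf_doc.get_metadata_dict()

    # Validity gate: a page count below one marks the file invalid,
    # and nothing downstream is built from it.
    if pdf_file_metadata["page_count"] < 1:
        return pdf_file_metadata

    pdf_file_metadata["valid"] = True
    return pdf_file_metadata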

Page geometry is collected by iterating over the exact page-count range:

page_sizes = [
    pdf_doc.get_page_size(i) for i in range(pdf_file_metadata["page_count"])
]

That line matters. The page count is not decorative metadata. It defines the page-size extraction range. If the page count were wrong for even one PDF, the geometry extraction would fail, under-collect, overrun, or produce a structure that could not be trusted. There is no fallback branch that guesses the missing geometry.

Widths, heights, and indexes are derived directly from the collected page-size list:

pdf_file_metadata["page_widths"] = [page_size[0] for page_size in page_sizes]
pdf_file_metadata["page_heights"] = [page_size[1] for page_size in page_sizes]
pdf_file_metadata["index"] = [*range(len(page_sizes))]

The index is not an independently invented field. It is tied to the page-size sequence produced from the parser-derived page count. That gives the pipeline a concrete pre-conversion structure: for each page, there must be a page index, a width, and a height.

This pre-conversion structure becomes a routing condition later. If conversion produces one page fewer, one page more, or any page structure that does not match the parser-derived page count, the conversion is marked unsuccessful. The PDF is not absorbed into the corpus. It is tracked through provenance, rerun automatically when appropriate, and reprocessed only if the failure is not caused by the source PDF itself.
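
As a hedged sketch, the routing condition reduces to one strict comparison against the pre-conversion facts (the function name and the shape of converted_pages are illustrative):

def conversion_matches_evidence(pdf_file_metadata, converted_pages):
    """Conversion succeeds only if it reproduces the parser-derived page structure exactly."""
    # One page fewer, one page more, or any index mismatch fails the whole document.
    if len(converted_pages) != pdf_file_metadata["page_count"]:
        return False
    return [page["index"] for page in converted_pages] == pdf_file_metadata["index"]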

This is deliberately unforgiving. The system does not compensate for missing pages, patch around mismatched counts, or allow a partially converted document into the analytical dataset. A page-count mismatch means the extraction failed.

Invariants and control signals

This is why the later subject-field validation has weight.

The page count used in the query is not a loose metadata value copied from an arbitrary field. It is the same parser-derived value that controls page-size extraction, page-index construction, conversion validation, corpus routing, and provenance tracking. It must already be correct for the document to proceed.

The discovered invariant then compares that parser-derived page count to the value encoded in the PDF subject field:

countif((page_count != main."substring"(subject, 2, 3)))  count_star()
--------------------------------------------------------  ------------
0                                                         7923

This is a row-level Boolean test. For every PDF, the pipeline compares the parser-derived page count against the three-character segment of the subject field starting at its second character. countif counts every mismatch. The result is zero only if all 7,923 files pass.
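
The output above omits the statement that produced it. A plausible reconstruction in DuckDB, assuming the same file table queried later in this section, is:

SELECT countif(page_count != substring(subject, 2, 3)), count_star() FROM file;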

The significance is that two independent signals agree completely. The page count comes from the low-level parser and governs downstream conversion checks. The subject value comes from the PDF metadata dictionary. Across the full batch, they resolve to the same page count every time.

That is not a casual pattern. It is a structural invariant.

A second invariant proves the USPTO title pattern:

countif((((len(title) - 2) - len(regexp_extract(title, '[0]{4,6}'))) = 18))
---------------------------------------------------------------------------
7923

The rule is exact: remove the fixed US prefix, remove the zero-padding block defined as [0]{4,6}, and the remaining identifier must be 18 characters. Across the full batch, it is.
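
Re-expressed outside SQL, the rule is a single predicate. This Python version is only an illustration of the stated check (it treats a missing padding block as a failure, which the SQL form leaves implicit):

import re

def title_matches_uspto_pattern(title: str) -> bool:
    # Zero-padding block of four to six zeros, as in the SQL invariant.
    padding = re.search(r"0{4,6}", title)
    if padding is None:
        return False
    # Drop the two-character US prefix and the padding; exactly 18 characters must remain.
    return len(title) - 2 - len(padding.group()) == 18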

A third control signal comes from the PDF creation timestamp. This value is captured from raw PDF metadata before OCR, before text conversion, and before field extraction:

SELECT max(CAST(substring(creationdate,3) AS BIGINT)) FROM file;
max(CAST(main."substring"(creationdate, 3) AS BIGINT))
------------------------------------------------------
20260319171846
SELECT min(CAST(substring(creationdate,3) AS BIGINT)) FROM file;
min(CAST(main."substring"(creationdate, 3) AS BIGINT))
------------------------------------------------------
20260309165515

The substring(creationdate,3) is deliberate. The PDF date prefix has a fixed length, so the ordered timestamp always begins at the third character. No regex is needed: a fixed-offset substring is simpler, faster, and more suitable at scale.
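
In Python terms, and assuming the stored creationdate keeps the standard "D:" prefix with no timezone suffix (as the values above suggest), the same operation is a fixed slice:

creationdate = "D:20260319171846"    # raw metadata value; the digits match the batch maximum above
creation_ts = int(creationdate[2:])  # Python's 0-based [2:] is SQL's 1-based substring(x, 3)
assert creation_ts == 20260319171846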

That metadata timestamp becomes an independent boundary for later extraction. After the PDF is converted and a filing-date candidate is extracted from the conversion output, the candidate must satisfy the file-level chronology:

filing_date_candidate_from_conversion_output <= pdf_creation_timestamp_from_metadata

If it does not, the extraction is invalid. Not low-confidence. Not queued for review. Invalid.
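
A minimal sketch of that gate, assuming the filing-date candidate is normalized to a YYYYMMDD integer and the creation timestamp to YYYYMMDDHHMMSS (the function and parameter names are illustrative):

def filing_date_within_chronology(filing_date_yyyymmdd: int,
                                  creation_ts_yyyymmddhhmmss: int) -> bool:
    # Compare at day granularity: truncate the creation timestamp to its date part.
    # A candidate later than the PDF's own creation time is impossible,
    # so the extraction is rejected outright.
    return filing_date_yyyymmdd <= creation_ts_yyyymmddhhmmss // 1_000_000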

From extraction to delivery

This is why the approach is different from standard OCR, PDF conversion, or generic document AI.

Those systems usually stop at extraction: text, fields, entities, coordinates, normalized values. This system goes further. It discovers the structure the document standard imposes, expresses that structure as executable commands, tests them across the full batch, and promotes only the patterns that hold without exception.

That is the difference between extracting and delivering.

Extracting means producing candidate output. Delivering means proving which output is allowed to enter the data model.

The pipeline is built from exact checks: parser-derived page count against subject structure, title length against prefix and padding rules, metadata timestamps against extracted filing dates, file identifiers against metadata identifiers, page geometry against per-page indexes, and conversion output against pre-conversion page facts. Each check is independent. Each check is executable. Each check either holds or fails.

There is no hidden tolerance for almost-correct patterns. No try/except path that turns structural failure into usable data. No human review loop required to make the result acceptable. If a rule fails once within its intended scope, it is removed or the feature is redesigned.

This is the standard that makes the system suitable for client infrastructure. The work is done before deployment: identify the invariant, prove it on real documents, encode it as an assertion, and let the pipeline enforce it at scale.
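
A hedged sketch of what "encode it as an assertion" can look like in practice (the runner, rule registry, and record shape are illustrative):

def enforce_invariants(records, invariants):
    """Run every promoted invariant over the full batch; a single failure blocks promotion."""
    for name, rule in invariants.items():
        failures = [record["file_path"] for record in records if not rule(record)]
        # No tolerance branch: one failing document within scope
        # invalidates the rule or sends the feature back for redesign.
        assert not failures, f"invariant '{name}' failed for {len(failures)} file(s)"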

The value is not that the system can read a patent PDF. Many tools can produce converted text. The value is that the system can determine which extracted patent facts are structurally valid, connect them to cleaner sources such as XML and API records, reject impossible values, route failed conversions correctly, preserve provenance, and deliver data analysts can use without re-checking every document.

We do not stop at extraction. We deliver patent data that has survived the document’s own evidence.