MarkSentry -- Zero-Trust Document-to-Markdown Conversion

The problem

What MarkItDown ignores entirely

Microsoft MarkItDown converts documents to Markdown quickly -- and that is where the safety story ends. Every path it trusts, every URL it leaves embedded, every macro container it skips becomes an attack surface in RAG pipelines and document processing workflows. MarkSentry was built to close those gaps.

No path traversal protection

MarkItDown resolves file paths without checking for null-byte injection, UNC paths (\\server\share), or symlink traversal. Any filename from untrusted input is a potential directory traversal attack against the converting server.

No SSRF blocking

DOCX and PDF files can embed URLs that trigger outbound requests when processed. MarkItDown performs no IP-range validation. RFC-1918 addresses, 169.254.169.254 (cloud metadata), and file:// URIs are all reachable.

No macro stripping

OOXML containers (.docx, .xlsx, .pptx) can carry VBA macros in vbaProject.bin. MarkItDown extracts and renders these files without removing the macro payload, leaving it available to any downstream processor that re-opens the document.

No zip-bomb detection

ZIP-based formats can nest compressed content many layers deep, with compression ratios that expand a small file into gigabytes. MarkItDown imposes no ratio or nesting-depth limit -- a single crafted .docx can exhaust server memory.

No magic-byte validation

Renaming a PE executable to report.pdf bypasses MarkItDown's extension-based dispatch. MarkSentry reads the file header and rejects any file whose magic bytes do not match the claimed extension before any parser is invoked.

No PII redaction

Extracted text from medical records, financial filings, and legal documents lands verbatim in Markdown. SSNs, credit card numbers, PEM private keys, AWS credentials, and JWT tokens flow unmodified into RAG vector stores and log files.

The four-stage pipeline

Zero-Trust Sanitizer -- Reject Before Parsing

Every input passes through the sanitizer before any parser is invoked. Null bytes, UNC paths, URI schemes, and symlinks that resolve outside allowed_base are rejected immediately. The path is resolved and jailed to allowed_base using pathlib's relative_to(). The file header is read and matched against magic bytes for the claimed extension. ZIP containers are rejected when the expansion ratio exceeds 100:1 or the nesting depth exceeds 3. OOXML files have vbaProject.bin stripped from the ZIP before any parser touches them. All embedded URLs are scanned against blocked IP ranges (RFC-1918, 127.0.0.0/8, 169.254.0.0/16, and IPv6 equivalents) and blocked URI schemes (file://, smb://, ldap://).

Format Parser Registry -- PDF, DOCX, ZIP

After sanitization, input is dispatched to the appropriate parser by extension. PdfParser uses pdfminer.six with per-page bounding-box extraction. DocxParser iterates body elements via python-docx and lxml, rendering headings, lists, tables, and hyperlinks to GFM Markdown; OMML equations are extracted from m:oMath elements and converted to LaTeX via recursive XML descent over the OMML namespace. ZipParser recursively extracts members up to MAX_NEST_DEPTH=2 and dispatches each to the registry, with a second path-traversal fence on every extracted member path.

Multi-Column Layout Processor -- Gap-Analysis Algorithm

PDF documents from academic papers, technical reports, and scanned journals frequently use two-column layouts. Naive line-by-line extraction interleaves both columns. MarkSentry builds a 1-point-resolution horizontal coverage histogram from all text bounding boxes on the page, finds the widest gap between non-zero regions, and uses that gap to define column bands. Blocks that span more than 70% of page width are classified as full-width (titles, captions, equations) and placed outside the column reconstruction. The final reading order is: full-width header blocks, then the columns in left-to-right order (each column read top to bottom), then full-width footer blocks.

Evaluated on 50 IEEE double-column PDFs: 90% column boundary accuracy.
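
The gap-analysis idea can be sketched in a few lines. This is a minimal illustration, not MarkSentry's layout.py: the names widest_gap and reading_order are hypothetical, boxes are assumed to be (x0, y0, x1, y1) tuples in points with the origin at the bottom-left of the page (as in PDF), only a single column split is handled, and any full-width block that is not above the columns is treated as a footer.

```python
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin bottom-left


def widest_gap(boxes: List[BBox], page_width: float,
               full_width_frac: float = 0.70) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the widest empty x-range between text regions."""
    width = int(page_width)
    covered = [False] * width  # 1-point-resolution coverage histogram
    for x0, _, x1, _ in boxes:
        if (x1 - x0) > full_width_frac * page_width:
            continue  # full-width blocks would hide the column gap
        for x in range(max(0, int(x0)), min(width, int(x1) + 1)):
            covered[x] = True

    best = None
    run_start = None
    seen_text = False  # ignore the left margin; only gaps between text count
    for x, is_covered in enumerate(covered):
        if is_covered:
            if run_start is not None and seen_text:
                if best is None or (x - run_start) > (best[1] - best[0]):
                    best = (run_start, x)
            run_start = None
            seen_text = True
        elif run_start is None:
            run_start = x
    return best  # the right margin is never closed, so it is ignored too


def reading_order(boxes: List[BBox], page_width: float,
                  full_width_frac: float = 0.70) -> List[BBox]:
    """Order blocks: full-width headers, left column, right column, footers."""
    def top_down(b: BBox) -> float:
        return -b[3]  # larger y1 means higher on the page

    full = [b for b in boxes if (b[2] - b[0]) > full_width_frac * page_width]
    cols = [b for b in boxes if b not in full]
    gap = widest_gap(boxes, page_width, full_width_frac)
    if gap is None or not cols:
        return sorted(boxes, key=top_down)  # single-column page

    split = (gap[0] + gap[1]) / 2
    left = sorted((b for b in cols if (b[0] + b[2]) / 2 < split), key=top_down)
    right = sorted((b for b in cols if (b[0] + b[2]) / 2 >= split), key=top_down)
    col_top = max(b[3] for b in cols)
    headers = sorted((b for b in full if b[1] >= col_top), key=top_down)
    footers = sorted((b for b in full if b[1] < col_top), key=top_down)
    return headers + left + right + footers
```

A production version would also need to handle more than two columns and full-width blocks in the middle of a page, which this sketch does not attempt.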

PII Filter -- 10-Category Regex Engine with Luhn Validation

After Markdown is produced, the PII filter scans the text using compiled regex patterns in 10 categories: PEM private keys, AWS keys (access and secret), JWT tokens, US SSNs, email addresses, credit card numbers (Luhn-validated to filter out arbitrary 16-digit sequences), phone numbers, IPv4 addresses, password assignment expressions, and high-entropy hex strings. Each match is replaced with [REDACTED:TYPE]. The audit-pii subcommand runs a dry-run pass that counts and categorizes findings without redacting, letting operators assess exposure before enabling masking in production pipelines.
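
As a rough illustration of regex matching combined with a Luhn check, the sketch below redacts two of the categories. The patterns, the names luhn_ok and redact, and the CREDIT_CARD and SSN labels are assumptions chosen for the example; they are not MarkSentry's actual _PiiPattern entries.

```python
import re

# Illustrative patterns only; real-world patterns need more care.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")  # 13 to 19 digits


def luhn_ok(candidate: str) -> bool:
    """Return True if the digits in candidate pass the Luhn checksum."""
    digits = [int(c) for c in candidate if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def redact(text: str) -> str:
    """Mask SSNs, and card-like digit runs that pass the Luhn check."""
    text = SSN_RE.sub("[REDACTED:SSN]", text)

    def mask_card(match: re.Match) -> str:
        # Leave digit runs that fail Luhn untouched (likely not a card).
        return "[REDACTED:CREDIT_CARD]" if luhn_ok(match.group()) else match.group()

    return CARD_RE.sub(mask_card, text)
```

Note that the Luhn check only lowers the false-positive rate: roughly one in ten arbitrary digit sequences still passes it by chance.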

Architecture

Four subsystems. One zero-trust pipeline.

Every document passes through sanitizer, parser, layout processor, and PII filter in sequence. No stage is optional when the input is untrusted.

  Input: .pdf / .docx / .xlsx / .pptx / .zip / nested .zip
                         |
                         v
    +--------------------------------------------+
    |         Zero-Trust Sanitizer               |
    |  - Null-byte + UNC path rejection          |
    |  - symlink jail (relative_to enforcement)  |
    |  - Magic-byte vs. extension validation     |
    |  - Zip-bomb: ratio > 100:1, depth > 3      |
    |  - VBA macro strip (vbaProject.bin)        |
    |  - SSRF URL scan (RFC-1918, 169.254.x.x)   |
    +--------------------------------------------+
                         |
                         v
    +--------------------------------------------+
    |        Format Parser Registry              |
    |   PdfParser         DocxParser             |
    |   - pdfminer.six    - python-docx + lxml   |
    |   - LTTextBox grid  - Headings / Lists     |
    |   - Table detect    - GFM pipe tables      |
    |   - Math symbols    - OMML -> LaTeX        |
    |                                            |
    |   ZipParser (recursive, MAX_NEST_DEPTH=2)  |
    +--------------------------------------------+
                         |
                         v
    +--------------------------------------------+
    |   Multi-Column Layout Processor            |
    |  - 1-pt coverage histogram per page        |
    |  - Widest gap -> column boundary split     |
    |  - Full-width block classification (70%)   |
    |  - Reading-order: header -> col-major ->   |
    |    footer                                  |
    +--------------------------------------------+
                         |
                         v
    +--------------------------------------------+
    |           PII Filter                       |
    |  10 regex categories (Luhn CC, high-       |
    |  entropy hex, PEM keys, AWS keys, JWT,     |
    |  SSN, email, phone, IPv4, password)        |
    |  --> [REDACTED:TYPE] masking               |
    |  --> audit-pii dry-run mode                |
    +--------------------------------------------+
                         |
                         v
              Markdown Output (.md)

🔒

sanitizer.py -- Zero-Trust Input Guard

Public sanitize() function that runs all six checks in sequence. Returns a SanitizationResult dataclass. Raises SanitizationError on any policy violation. Blocks 10 IP networks including IPv6 equivalents.

path-jail magic-bytes zip-bomb macro-strip ssrf-block

📄

layout.py -- Column Gap Analysis

1-point-resolution horizontal histogram over all BBox objects on a page. Detects column count (capped at 4), assigns blocks to column bands, classifies full-width elements, and emits blocks in human reading order.

bbox-histogram gap-analysis reading-order full-width

👁

pii_filter.py -- Regex + Luhn Engine

10 compiled _PiiPattern entries. Credit card numbers pass through a Luhn checksum validator to reduce false positives on arbitrary 16-digit sequences. High-entropy hex detection catches obfuscated tokens.

luhn-cc pem-keys aws-keys jwt ssn

📃

pdf_parser.py -- Layout-Aware PDF

pdfminer.six LAParams traversal. Heading detection by font-size ratio relative to median body font (lower 60th percentile). Table detection by y-band clustering (4 pt) requiring 3+ rows with 2+ columns. Emits GFM pipe tables.

pdfminer.six heading-detect table-detect gfm-tables

📄

docx_parser.py -- OMML Equation Parser

Iterates body XML by tag (p, tbl, sdt). Extracts m:oMath elements and converts them to LaTeX via recursive XML descent. Reconstructs hyperlinks from w:hyperlink XML. Tracks list nesting via w:numPr/w:ilvl.

omml-latex hyperlinks nested-lists gfm-tables

∑

math_converter.py -- OMML to LaTeX

Recursive XML descent over the OMML namespace. Handles fractions (\frac), radicals (\sqrt), n-ary operators (\int, \sum, \prod), delimiters, matrices (\begin{pmatrix}), accents, and equation arrays. 200+ Unicode-to-LaTeX mappings.

\frac \sqrt \int/\sum matrix 200+ symbols
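
To make the recursive descent concrete, here is a small sketch that handles only fractions (m:f) and radicals (m:rad). It uses the standard library's ElementTree rather than lxml, and the function name to_latex is hypothetical; it illustrates the technique and is not the code in math_converter.py.

```python
import xml.etree.ElementTree as ET

# OMML namespace used by m:oMath and its children.
M = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def q(tag: str) -> str:
    """Qualify a local tag name with the OMML namespace."""
    return f"{{{M}}}{tag}"


def to_latex(el) -> str:
    """Recursively convert an OMML element to a LaTeX string."""
    if el is None:
        return ""
    if el.tag == q("t"):  # text run
        return el.text or ""
    if el.tag.endswith("Pr"):  # property elements (fPr, radPr, rPr, ...)
        return ""
    if el.tag == q("f"):  # fraction: numerator and denominator
        num = to_latex(el.find(q("num")))
        den = to_latex(el.find(q("den")))
        return rf"\frac{{{num}}}{{{den}}}"
    if el.tag == q("rad"):  # radical: optional degree and radicand
        degree = to_latex(el.find(q("deg")))
        body = to_latex(el.find(q("e")))
        if degree:
            return rf"\sqrt[{degree}]{{{body}}}"
        return rf"\sqrt{{{body}}}"
    # Any other container: concatenate the children's output.
    return "".join(to_latex(child) for child in el)
```

Parsing an m:oMath fragment with ET.fromstring and passing the root to to_latex yields the LaTeX string; n-ary operators, matrices, accents, and symbol mappings would each need their own branch.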

Threat coverage

What MarkSentry catches before output

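All input checks run before the first parser byte is read, not after. PII redaction, column reconstruction, and equation conversion then run on the parsed content before output is returned.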

🔏

Path Traversal Attacks

Null bytes in filenames (file.pdf\x00.sh), UNC paths (\\attacker\share\file), URI schemes (file://), and symlinks resolving outside the allowed base directory. Rejected by _resolve_and_jail() before any file content is read.
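
A path jail of this kind can be sketched with pathlib. The function below is a hypothetical stand-in written for illustration, not the real _resolve_and_jail(); the exception name PathRejected and the choice to interpret relative paths against the base directory are assumptions.

```python
from pathlib import Path


class PathRejected(ValueError):
    """Raised when a path violates the jail policy (hypothetical name)."""


def resolve_and_jail_sketch(raw: str, allowed_base: str) -> Path:
    """Reject suspicious paths, then confine the result to allowed_base."""
    if "\x00" in raw:
        raise PathRejected("null byte in path")
    if raw.startswith(("\\\\", "//")):
        raise PathRejected("UNC path")
    if "://" in raw:
        raise PathRejected("URI scheme")

    base = Path(allowed_base).resolve()
    candidate = Path(raw)
    if not candidate.is_absolute():
        candidate = base / candidate  # assumption: relative to the base

    # resolve() follows symlinks and collapses ".." components.
    resolved = candidate.resolve()
    try:
        resolved.relative_to(base)  # raises ValueError if outside base
    except ValueError:
        raise PathRejected(f"{raw!r} escapes {base}") from None
    return resolved
```

Because the containment check runs on the resolved path, a symlink inside the base that points elsewhere is rejected in the same way as a "../" sequence.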

🌐

SSRF via Embedded URLs

DOCX relationship targets and PDF embedded links are scanned against 10 blocked IP networks. Loopback (127.0.0.0/8), link-local (169.254.0.0/16), RFC-1918 (10.x, 172.16-31.x, 192.168.x), CGNAT (100.64.0.0/10), and IPv6 equivalents (::1, fc00::/7, fe80::/10) are all blocked.
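
The IP-range check can be approximated with the standard ipaddress module. This is a sketch under stated assumptions: the name is_blocked and the two constants are illustrative, only the networks named above are listed, and hostnames are not resolved, so a DNS name that points at a private address would pass this simplified check.

```python
import ipaddress
from urllib.parse import urlsplit

# Networks named in the text above; the list is illustrative.
BLOCKED_NETWORKS = [ipaddress.ip_network(n) for n in (
    "127.0.0.0/8",      # loopback
    "169.254.0.0/16",   # link-local, includes cloud metadata
    "10.0.0.0/8",       # RFC-1918
    "172.16.0.0/12",    # RFC-1918
    "192.168.0.0/16",   # RFC-1918
    "100.64.0.0/10",    # CGNAT
    "::1/128",          # IPv6 loopback
    "fc00::/7",         # IPv6 unique local
    "fe80::/10",        # IPv6 link-local
)]
BLOCKED_SCHEMES = {"file", "smb", "ldap"}


def is_blocked(url: str) -> bool:
    """Return True if the URL uses a blocked scheme or a blocked literal IP."""
    parts = urlsplit(url)
    if parts.scheme.lower() in BLOCKED_SCHEMES:
        return True
    host = parts.hostname
    if host is None:
        return False  # no network host to check
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # a DNS name: this sketch does not resolve it
    return any(addr in net for net in BLOCKED_NETWORKS)
```

Checking only literal addresses leaves DNS-based bypasses open; a stricter variant would resolve the hostname and test every returned address.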

🔢

VBA Macro Containers

OOXML files are opened as ZIP archives. Any member named vbaProject.bin is excluded from the reconstituted archive before parsing. [Content_Types].xml is rewritten to remove the VBA content type declaration. The sanitized copy is what the parser sees.
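
The stripping step amounts to rebuilding the ZIP without the macro member. The sketch below shows that part only, using the standard zipfile module; the function name strip_vba is hypothetical, and the [Content_Types].xml rewrite described above is deliberately left out.

```python
import io
import zipfile


def strip_vba(data: bytes) -> bytes:
    """Return a copy of an OOXML archive without any vbaProject.bin member.

    Sketch only: the [Content_Types].xml rewrite is not performed here.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            name = info.filename.rsplit("/", 1)[-1].lower()
            if name == "vbaproject.bin":
                continue  # drop the macro payload
            # Copy every other member unchanged.
            dst.writestr(info, src.read(info))
    return out.getvalue()
```

The parser would then be handed the returned bytes instead of the original file, so the macro container never reaches it.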

📁

Zip Bombs

Each ZIP member is checked for two conditions: uncompressed-to-compressed ratio greater than 100:1, and recursive nesting depth greater than 3 levels. Either condition immediately raises SanitizationError. No partial extraction occurs before the check completes.
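
Both limits can be checked from the archive's directory entries. The following is a simplified sketch: check_zip and ZipBombError are hypothetical names, and the sizes it trusts are the ones declared in the ZIP headers. Unlike the behavior described above, it reads nested archives into memory in order to inspect them, though only after their own ratio has been checked.

```python
import io
import zipfile

MAX_RATIO = 100  # uncompressed : compressed
MAX_DEPTH = 3    # levels of nested archives


class ZipBombError(Exception):
    """Raised when an archive exceeds the ratio or depth limit (hypothetical)."""


def check_zip(data: bytes, depth: int = 1) -> None:
    """Raise ZipBombError if the archive exceeds the ratio or depth limit."""
    if depth > MAX_DEPTH:
        raise ZipBombError(f"nesting depth {depth} exceeds {MAX_DEPTH}")
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            # Ratio from the declared sizes, before any decompression.
            if info.compress_size > 0:
                ratio = info.file_size / info.compress_size
                if ratio > MAX_RATIO:
                    raise ZipBombError(
                        f"{info.filename}: ratio {ratio:.0f}:1 exceeds {MAX_RATIO}:1")
            if info.filename.lower().endswith(".zip"):
                # Nested archive: already ratio-checked, so bounded in size.
                check_zip(zf.read(info), depth + 1)
```

Since header values can be forged, a hardened implementation would also cap the number of bytes actually produced during decompression.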

🗞

Extension Spoofing

Files are validated by reading their magic bytes against a per-extension signature table (.pdf: %PDF-, .docx: PK header, .png: \x89PNG, .jpg: \xff\xd8\xff, and more). A PE executable renamed to report.pdf is rejected before any parser runs.
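
A signature table like the one described can be as simple as a dictionary keyed by extension. The table below covers only the four signatures listed above, and the name magic_matches is an assumption for the example.

```python
import os

# Leading bytes expected for each extension (subset, for illustration).
MAGIC = {
    ".pdf": (b"%PDF-",),
    ".docx": (b"PK\x03\x04",),   # ZIP local file header
    ".png": (b"\x89PNG",),
    ".jpg": (b"\xff\xd8\xff",),
}


def magic_matches(filename: str, header: bytes) -> bool:
    """Return True if header starts with a signature valid for the extension."""
    ext = os.path.splitext(filename)[1].lower()
    signatures = MAGIC.get(ext)
    if signatures is None:
        return False  # unknown extension: reject by default
    return header.startswith(signatures)
```

A Windows PE executable begins with the bytes "MZ", so magic_matches("report.pdf", b"MZ\x90\x00") returns False and the file would be rejected before dispatch.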

🔐

PII in Converted Text

After Markdown is produced, 10 regex categories scan for credentials and personal data. PEM private keys, AWS key pairs, JWT tokens, US SSNs, credit card numbers (Luhn-validated), email addresses, phone numbers, IPv4 addresses, password assignments, and high-entropy hex strings are all replaced with [REDACTED:TYPE] before output.

📈

Multi-Column Interleaving

Two-column PDF layouts extracted naively produce garbled text where left and right column sentences are interleaved. MarkSentry's gap-analysis processor reconstructs the correct reading order before the Markdown is assembled, preserving document coherence for downstream NLP and RAG ingestion.

∫

Math Equation Loss

DOCX files produced by Word, LibreOffice, and LaTeX round-trips contain equations in OMML XML. Naive text extraction renders them as empty or garbled strings. MarkSentry's recursive OMML parser produces valid LaTeX for 91% of equations in the IEEE equation corpus evaluated during development.

Comparison

MarkSentry vs existing conversion tools

MarkItDown, Pandoc, and Apache Tika each solve the conversion problem in different ways. None treats the input document as untrusted. MarkSentry does.

Capability	MarkSentry	MarkItDown	Pandoc	Apache Tika
PDF to Markdown	✓	✓	limited	text only
DOCX to Markdown	✓	✓	✓	text only
Multi-column PDF layout	✓ gap-analysis	✗	✗	✗
Path traversal protection	✓ null-byte + UNC + symlink	✗	partial	partial
SSRF URL blocking	✓ 10 IP networks	✗	✗	✗
VBA macro stripping	✓	✗	✗	✗
Zip-bomb detection	✓ ratio + depth	✗	✗	✗
Magic-byte validation	✓	✗	✗	✓
PII redaction (10 categories)	✓ Luhn-validated CC	✗	✗	✗
OMML-to-LaTeX equations	✓ recursive XML	partial	✓	✗
GFM table extraction	✓	✓	✓	✗
Recursive ZIP support	✓ depth-3 cap	partial	✗	✓
Local-first (no cloud API)	✓	✓	✓	✓
Python SDK + CLI	✓	✓	CLI only	Java API

MarkSentry is designed to be a security layer around document conversion, not a replacement for Pandoc as a general-purpose format transformer. Use MarkSentry when documents arrive from untrusted sources, feed RAG pipelines, or must comply with data handling policies.

Installation

Get started in 60 seconds

Install via pip, point it at any document, and get secure Markdown with a full audit trail.

● 1. Install

pip install marksentry

# With all optional dependencies
pip install "marksentry[all]"

Requires Python 3.10+. Core dependencies: pdfminer.six, python-docx, lxml, click, rich.

● 2. Convert a PDF

marksentry convert paper.pdf \
  --multi-column \
  --detect-math \
  --output paper.md

Multi-column layout reconstruction enabled. OMML and Unicode math symbols converted to LaTeX inline and display blocks.

● 3. Convert with PII masking

marksentry convert report.docx \
  --mask-pii \
  --detect-math \
  --output report.md

All 10 PII categories are redacted in the output. Credit card candidates must pass Luhn validation, which filters out most arbitrary digit sequences.

● 4. Audit PII (dry-run)

marksentry audit-pii \
  medical_record.pdf

Counts and categorizes PII findings without redacting. Use to assess exposure before enabling masking in production pipelines.

Sample output -- security-passed conversion

● marksentry convert ieee_paper.pdf --multi-column --detect-math

  [✓] Zero-Trust sanitizer passed
  [✓] Magic bytes: PDF signature verified
  [✓] Zip-bomb check: N/A (not a ZIP container)
  [✓] SSRF scan: 0 blocked URLs found
  [✓] Detected 2-column layout (87% confidence, gap at x=305)
  [✓] Extracted 3 OMML equations to LaTeX
  [✓] 14 tables rendered as GFM pipe syntax
  [✓] Conversion complete: 12 pages, 4,218 words
      Output: ieee_paper.md (68 KB)

Column detection confidence is the fraction of page-width covered cleanly by the detected column bands.

Sample output -- PII audit

● marksentry audit-pii patient_intake.docx

  PII Audit -- patient_intake.docx
  +-----------------------+-------+-----------------------------+
  | Category              | Count | Sample (masked)             |
  +-----------------------+-------+-----------------------------+
  | EMAIL                 |   4   | j*****@hospital.org         |
  | SSN                   |   2   | ***-**-4521                 |
  | CREDIT_CARD           |   1   | ****-****-****-9823 [Luhn]  |
  | PHONE                 |   3   | (***) ***-4892              |
  +-----------------------+-------+-----------------------------+
  10 values found across 4 categories.
  Run with --mask-pii to redact in conversion output.

Audit mode reads the document fully but writes no output file. Safe to run on sensitive documents before committing to a conversion pipeline.

Python SDK usage

● Embed in a RAG ingestion pipeline

from marksentry import convert

result = convert(
    "uploads/report.pdf",
    mask_pii=True,
    multi_column=True,
    detect_math=True,
    check_ssrf=True,
    allowed_base="/safe/uploads",
)

print(result.markdown)      # clean, secured Markdown
print(result.warnings)      # PII redaction counts, layout notes
print(result.metadata)      # title, author, created date

All sanitizer checks are applied automatically. Pass allowed_base to restrict which directories are accessible. Raises SanitizationError on any policy violation.
