Open Source · Zero-Trust Input · Multi-Column PDF · PII Redaction · MIT License

Document conversion with zero-trust security

MarkItDown converts documents. MarkSentry converts them securely. Path traversal jailing, SSRF blocking, VBA macro stripping, zip-bomb detection, multi-column PDF layout reconstruction, OMML-to-LaTeX math, and PII redaction -- all local, no cloud dependencies.

Built with: Python 3.10+ · pdfminer.six · python-docx · lxml · click + rich

17 Tests Passing (pytest)
10 PII Pattern Categories
0/25 Security Bypasses in Evaluation
90% Multi-Column Layout Accuracy

The problem

What MarkItDown ignores entirely

Microsoft MarkItDown converts documents to Markdown quickly -- and that is where the safety story ends. Every path it trusts, every URL it leaves embedded, every macro container it skips becomes an attack surface in RAG pipelines and document processing workflows. MarkSentry was built to close those gaps.

No path traversal protection

MarkItDown resolves file paths without checking for null-byte injection, UNC paths (\\server\share), or symlink traversal. Any filename from untrusted input is a potential directory traversal attack against the converting server.

No SSRF blocking

DOCX and PDF files can embed URLs that trigger outbound requests when processed. MarkItDown performs no IP-range validation. RFC-1918 addresses, 169.254.169.254 (cloud metadata), and file:// URIs are all reachable.

No macro stripping

OOXML containers (.docx, .xlsx, .pptx) can carry VBA macros in vbaProject.bin. MarkItDown extracts and renders these files without removing the macro payload, leaving it available to any downstream processor that re-opens the document.

No zip-bomb detection

ZIP-based formats can nest many layers of compressed content, and a highly compressible payload of a few kilobytes can expand to gigabytes. MarkItDown imposes no ratio or nesting-depth limit, so a single crafted .docx can exhaust server memory.

No magic-byte validation

Renaming a PE executable to report.pdf bypasses MarkItDown's extension-based dispatch. MarkSentry reads the file header and rejects any file whose magic bytes do not match the claimed extension before any parser is invoked.

No PII redaction

Extracted text from medical records, financial filings, and legal documents lands verbatim in Markdown. SSNs, credit card numbers, PEM private keys, AWS credentials, and JWT tokens flow unmodified into RAG vector stores and log files.

The four-stage pipeline
1. Zero-Trust Sanitizer -- Reject Before Parsing

Every input passes through the sanitizer before any parser is invoked. Null bytes, UNC paths, URI schemes, and symlinks that resolve outside the allowed base are rejected immediately. The path is resolved and jailed to allowed_base using pathlib's relative_to(). The file header is read and matched against the magic bytes for the claimed extension. ZIP containers are rejected when the expansion ratio exceeds 100:1 or the nesting depth exceeds 3. OOXML files have vbaProject.bin stripped from the ZIP before any parser touches them. All embedded URLs are scanned against blocked IP ranges (RFC-1918, 127.0.0.0/8, 169.254.0.0/16, and IPv6 equivalents) and blocked URI schemes (file://, smb://, ldap://).

2. Format Parser Registry -- PDF, DOCX, ZIP

After sanitization, input is dispatched to the appropriate parser by extension. PdfParser uses pdfminer.six with per-page bounding-box extraction. DocxParser iterates body elements via python-docx and lxml, rendering headings, lists, tables, and hyperlinks to GFM Markdown. ZipParser recursively extracts members up to MAX_NEST_DEPTH=2 and dispatches each to the registry, with a second path-traversal fence on every extracted member path. OMML equations are extracted from m:oMath elements and converted to LaTeX via recursive XML descent over the OMML namespace.

3. Multi-Column Layout Processor -- Gap-Analysis Algorithm

PDF documents from academic papers, technical reports, and scanned journals frequently use two-column layouts. Naive line-by-line extraction interleaves both columns. MarkSentry builds a 1-point-resolution horizontal coverage histogram from all text bounding boxes on the page, finds the widest gap between non-zero regions, and uses that gap to define column bands. Blocks that span more than 70% of page width are classified as full-width (titles, captions, equations) and placed outside the column reconstruction. The final reading order is: full-width header blocks, then the columns from left to right (each read top to bottom), then full-width footer blocks.

Evaluated on 50 IEEE double-column PDFs: 90% column boundary accuracy.
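
The gap analysis described above can be sketched in a few lines. This is a minimal illustration and not MarkSentry's layout.py: the function names find_column_gap and reading_order are hypothetical, boxes are assumed to be (x0, y0, x1, y1) tuples in PDF points with the y axis pointing up, only a single gap (two columns) is handled, and a full-width block counts as a header only when it sits above every column block.

```python
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows upward


def _is_full_width(box: BBox, page_width: float, ratio: float) -> bool:
    return (box[2] - box[0]) > ratio * page_width


def find_column_gap(boxes: List[BBox], page_width: float,
                    ratio: float = 0.70) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the widest uncovered x-range between text."""
    width = int(page_width)
    coverage = [0] * (width + 1)  # one bin per point
    for box in boxes:
        if _is_full_width(box, page_width, ratio):
            continue  # full-width blocks would hide the gap
        for x in range(max(0, int(box[0])), min(width, int(box[2])) + 1):
            coverage[x] += 1
    covered = [x for x, count in enumerate(coverage) if count > 0]
    if not covered:
        return None
    best, run_start = None, None
    for x in range(covered[0], covered[-1] + 1):
        if coverage[x] == 0:
            if run_start is None:
                run_start = x
        elif run_start is not None:
            if best is None or (x - run_start) > (best[1] - best[0]):
                best = (run_start, x)
            run_start = None
    return best


def reading_order(boxes: List[BBox], page_width: float,
                  ratio: float = 0.70) -> List[BBox]:
    """Order blocks: full-width headers, left column, right column, footers."""
    def top_first(box: BBox) -> float:
        return -box[3]

    gap = find_column_gap(boxes, page_width, ratio)
    if gap is None:
        return sorted(boxes, key=top_first)  # single column
    split = (gap[0] + gap[1]) / 2
    full = [b for b in boxes if _is_full_width(b, page_width, ratio)]
    cols = [b for b in boxes if not _is_full_width(b, page_width, ratio)]
    left = sorted((b for b in cols if (b[0] + b[2]) / 2 < split), key=top_first)
    right = sorted((b for b in cols if (b[0] + b[2]) / 2 >= split), key=top_first)
    column_top = max(b[3] for b in cols)
    header = sorted((b for b in full if b[1] >= column_top), key=top_first)
    footer = sorted((b for b in full if b[1] < column_top), key=top_first)
    return header + left + right + footer
```

Excluding full-width blocks from the histogram matters: a title that spans the page would otherwise cover the gutter and mask the column boundary.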

4. PII Filter -- 10-Category Regex Engine with Luhn Validation

After Markdown is produced, the PII filter scans the text using compiled regex patterns for ten categories: PEM private keys, AWS keys (access and secret), JWT tokens, US SSNs, email addresses, credit card numbers (Luhn-validated to eliminate 16-digit false positives), phone numbers, IPv4 addresses, password assignment expressions, and high-entropy hex strings. Each match is replaced with [REDACTED:TYPE]. The audit-pii subcommand runs a dry-run pass that counts and categorizes findings without redacting, letting operators assess exposure before enabling masking in production pipelines.
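
To make the Luhn step and the redact/audit split concrete, here is a small sketch covering two of the ten categories. It is an illustration under stated assumptions, not the contents of pii_filter.py: the regexes, the PATTERNS table, and the function names luhn_ok, find_pii, redact, and audit are all hypothetical, and overlapping matches between categories are not handled.

```python
import re


def luhn_ok(number: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


# label -> (pattern, optional validator); both patterns are illustrative
PATTERNS = {
    "SSN": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), None),
    "CREDIT_CARD": (re.compile(r"\b(?:\d[ -]?){12,18}\d\b"), luhn_ok),
}


def find_pii(text: str):
    """Return sorted (start, end, label) tuples for every validated match."""
    hits = []
    for label, (pattern, validator) in PATTERNS.items():
        for match in pattern.finditer(text):
            if validator is None or validator(match.group()):
                hits.append((match.start(), match.end(), label))
    return sorted(hits)


def redact(text: str) -> str:
    """Replace each finding with [REDACTED:TYPE], working right to left."""
    for start, end, label in reversed(find_pii(text)):
        text = text[:start] + f"[REDACTED:{label}]" + text[end:]
    return text


def audit(text: str) -> dict:
    """Dry run: count findings per category without changing the text."""
    counts = {}
    for _, _, label in find_pii(text):
        counts[label] = counts.get(label, 0) + 1
    return counts
```

Because redact and audit share find_pii, the dry run reports exactly what masking would replace. A digit run that fails the checksum, such as an order number, is left untouched.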


Architecture

Four subsystems. One zero-trust pipeline.

Every document passes through sanitizer, parser, layout processor, and PII filter in sequence. No stage is optional when the input is untrusted.

  Input: .pdf / .docx / .xlsx / .pptx / .zip / nested .zip
                         |
                         v
    +--------------------------------------------+
    |         Zero-Trust Sanitizer               |
    |  - Null-byte + UNC path rejection          |
    |  - Symlink jail (relative_to enforcement)  |
    |  - Magic-byte vs. extension validation     |
    |  - Zip-bomb: ratio > 100:1, depth > 3      |
    |  - VBA macro strip (vbaProject.bin)        |
    |  - SSRF URL scan (RFC-1918, 169.254.x.x)   |
    +--------------------------------------------+
                         |
                         v
    +--------------------------------------------+
    |        Format Parser Registry              |
    |   PdfParser         DocxParser             |
    |   - pdfminer.six    - python-docx + lxml   |
    |   - LTTextBox grid  - Headings / Lists     |
    |   - Table detect    - GFM pipe tables      |
    |   - Math symbols    - OMML -> LaTeX        |
    |                                            |
    |   ZipParser (recursive, MAX_NEST_DEPTH=2)  |
    +--------------------------------------------+
                         |
                         v
    +--------------------------------------------+
    |   Multi-Column Layout Processor            |
    |  - 1-pt coverage histogram per page        |
    |  - Widest gap -> column boundary split     |
    |  - Full-width block classification (70%)   |
    |  - Reading-order: header -> col-major ->   |
    |    footer                                  |
    +--------------------------------------------+
                         |
                         v
    +--------------------------------------------+
    |           PII Filter                       |
    |  10 regex categories (Luhn CC, high-       |
    |  entropy hex, PEM keys, AWS keys, JWT,     |
    |  SSN, email, phone, IPv4, password)        |
    |  --> [REDACTED:TYPE] masking               |
    |  --> audit-pii dry-run mode                |
    +--------------------------------------------+
                         |
                         v
              Markdown Output (.md)
🔒

sanitizer.py -- Zero-Trust Input Guard

Public sanitize() function that runs all six checks in sequence. Returns a SanitizationResult dataclass. Raises SanitizationError on any policy violation. Blocks 10 IP networks including IPv6 equivalents.

path-jail magic-bytes zip-bomb macro-strip ssrf-block
📄

layout.py -- Column Gap Analysis

1-point-resolution horizontal histogram over all BBox objects on a page. Detects column count (capped at 4), assigns blocks to column bands, classifies full-width elements, and emits blocks in human reading order.

bbox-histogram gap-analysis reading-order full-width
👁

pii_filter.py -- Regex + Luhn Engine

10 compiled _PiiPattern entries. Credit card numbers pass through a Luhn checksum validator to eliminate false positives on arbitrary 16-digit sequences. High-entropy hex detection catches obfuscated tokens.

luhn-cc pem-keys aws-keys jwt ssn
📃

pdf_parser.py -- Layout-Aware PDF

pdfminer.six LAParams traversal. Heading detection by font-size ratio relative to median body font (lower 60th percentile). Table detection by y-band clustering (4 pt) requiring 3+ rows with 2+ columns. Emits GFM pipe tables.

pdfminer.six heading-detect table-detect gfm-tables
📄

docx_parser.py -- OMML Equation Parser

Iterates body XML by tag (p, tbl, sdt). Extracts m:oMath elements and converts them to LaTeX via recursive XML descent. Reconstructs hyperlinks from w:hyperlink XML. Tracks list nesting via w:numPr/w:ilvl.

omml-latex hyperlinks nested-lists gfm-tables

math_converter.py -- OMML to LaTeX

Recursive XML descent over the OMML namespace. Handles fractions (\frac), radicals (\sqrt), n-ary operators (\int, \sum, \prod), delimiters, matrices (\begin{pmatrix}), accents, and equation arrays. 200+ Unicode-to-LaTeX mappings.

\frac \sqrt \int/\sum matrix 200+ symbols
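
A recursive descent over OMML can be sketched with the standard library alone. The example below handles only fractions, radicals, and text runs; it is an assumption-laden illustration and not math_converter.py. The names omml_to_latex, _children, and _m are hypothetical. The namespace URI is the standard Office Open XML math namespace.

```python
import xml.etree.ElementTree as ET

M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def _m(name: str) -> str:
    """Qualified ElementTree tag for an element in the OMML namespace."""
    return f"{{{M_NS}}}{name}"


def _children(node) -> str:
    """Convert every child of node and concatenate the results."""
    if node is None:
        return ""
    return "".join(omml_to_latex(child) for child in node)


def omml_to_latex(node) -> str:
    tag = node.tag
    if tag.endswith("Pr"):
        return ""  # property containers (fPr, radPr, rPr, ctrlPr) carry no text
    if tag == _m("t"):
        return node.text or ""
    if tag == _m("f"):  # fraction: m:num over m:den
        numerator = _children(node.find(_m("num")))
        denominator = _children(node.find(_m("den")))
        return r"\frac{%s}{%s}" % (numerator, denominator)
    if tag == _m("rad"):  # radical: optional m:deg, body in m:e
        degree = _children(node.find(_m("deg")))
        body = _children(node.find(_m("e")))
        if degree:
            return r"\sqrt[%s]{%s}" % (degree, body)
        return r"\sqrt{%s}" % body
    return _children(node)  # oMath, r, and anything unhandled: recurse
```

Falling through to _children for unknown elements keeps the text of unsupported constructs instead of dropping it, which is one reasonable default for a converter that must not lose content silently.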

Threat coverage

What MarkSentry catches before output

All sanitizer checks run before any parser reads the file. Layout reconstruction and PII redaction run afterward, on the parsed content, before output is written.

🔏

Path Traversal Attacks

Null bytes in filenames (file.pdf\x00.sh), UNC paths (\\attacker\share\file), URI schemes (file://), and symlinks resolving outside the allowed base directory. Rejected by _resolve_and_jail() before any file content is read.
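
A path jail of this kind can be sketched with pathlib. The function below is a hypothetical stand-in (jail_path and JailError are illustrative names, not MarkSentry's _resolve_and_jail()), and it assumes user paths are interpreted relative to the allowed base directory.

```python
from pathlib import Path


class JailError(ValueError):
    """Raised when a path fails one of the jail checks."""


def jail_path(user_path: str, allowed_base: str) -> Path:
    if "\x00" in user_path:
        raise JailError("null byte in path")
    if user_path.startswith(("\\\\", "//")):
        raise JailError("UNC path")
    if "://" in user_path:
        raise JailError("URI scheme in path")
    base = Path(allowed_base).resolve()
    # resolve() collapses ".." segments and follows symlinks, so the
    # containment check below sees the real target location.
    candidate = (base / user_path).resolve()
    try:
        candidate.relative_to(base)
    except ValueError:
        raise JailError(f"path escapes {base}") from None
    return candidate
```

Checking containment after resolution is the important ordering: a string prefix test on the unresolved path would accept uploads/../../etc/passwd and any symlink that points outside the base.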

🌐

SSRF via Embedded URLs

DOCX relationship targets and PDF embedded links are scanned against 10 blocked IP networks. Loopback (127.0.0.0/8), link-local (169.254.0.0/16), RFC-1918 (10.x, 172.16-31.x, 192.168.x), CGNAT (100.64.0.0/10), and IPv6 equivalents (::1, fc00::/7, fe80::/10) are all blocked.
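
The range check can be sketched with the standard ipaddress module. The network list below contains only the ranges named above, the scheme list follows the sanitizer description, and is_blocked_url is a hypothetical name. Hostnames that are not IP literals pass in this sketch, because DNS resolution is out of its scope.

```python
import ipaddress
from urllib.parse import urlsplit

BLOCKED_NETWORKS = [ipaddress.ip_network(cidr) for cidr in (
    "127.0.0.0/8",     # loopback
    "169.254.0.0/16",  # link-local, includes cloud metadata 169.254.169.254
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC-1918
    "100.64.0.0/10",   # CGNAT
    "::1/128", "fc00::/7", "fe80::/10",  # IPv6 equivalents
)]
BLOCKED_SCHEMES = {"file", "smb", "ldap"}


def is_blocked_url(url: str) -> bool:
    parts = urlsplit(url)
    if parts.scheme.lower() in BLOCKED_SCHEMES:
        return True
    host = parts.hostname
    if host is None:
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return False  # a hostname, not an IP literal (not resolved here)
    return any(
        address.version == network.version and address in network
        for network in BLOCKED_NETWORKS
    )
```

A production check would also need to handle hostnames that resolve to blocked addresses, which this sketch deliberately leaves out.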

🔢

VBA Macro Containers

OOXML files are opened as ZIP archives. Any member named vbaProject.bin is excluded from the reconstituted archive before parsing. [Content_Types].xml is rewritten to remove the VBA content type declaration. The sanitized copy is what the parser sees.
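
The archive rewrite can be illustrated with zipfile. This sketch copies every member except vbaProject.bin into a new in-memory archive. The name strip_vba is hypothetical, and the [Content_Types].xml rewrite described above is omitted for brevity.

```python
import io
import zipfile


def strip_vba(ooxml_bytes: bytes) -> bytes:
    """Return a copy of the archive without any vbaProject.bin member."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(ooxml_bytes)) as source, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
        for info in source.infolist():
            basename = info.filename.rsplit("/", 1)[-1]
            if basename.lower() == "vbaproject.bin":
                continue  # drop the macro container
            target.writestr(info.filename, source.read(info.filename))
    return output.getvalue()
```

Matching on the basename catches the payload wherever it sits in the package (word/, xl/, or ppt/).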

📁

Zip Bombs

Each ZIP member is checked for two conditions: an uncompressed-to-compressed size ratio greater than 100:1, and a recursive nesting depth greater than 3 levels. Either condition immediately raises SanitizationError. No partial extraction occurs before the check completes.
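
Both conditions can be checked from the archive's directory entries before anything is extracted. The sketch below is illustrative: check_zip and ZipBombError are hypothetical names, the outermost archive is assumed to count as depth 1, and the ratio relies on the sizes declared in the ZIP headers.

```python
import io
import zipfile

MAX_RATIO = 100  # uncompressed : compressed
MAX_DEPTH = 3    # the outermost archive counts as depth 1 (assumption)


class ZipBombError(ValueError):
    """Raised when an archive exceeds the ratio or nesting limits."""


def check_zip(data: bytes, depth: int = 1) -> None:
    if depth > MAX_DEPTH:
        raise ZipBombError(f"nesting depth {depth} exceeds {MAX_DEPTH}")
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        members = archive.infolist()
        # First pass: ratio check on every member, using header sizes only.
        for info in members:
            if info.file_size > MAX_RATIO * max(info.compress_size, 1):
                raise ZipBombError(f"{info.filename}: ratio above {MAX_RATIO}:1")
        # Second pass: recurse into nested archives that passed the ratio check.
        for info in members:
            if info.filename.lower().endswith(".zip"):
                check_zip(archive.read(info.filename), depth + 1)
```

Running the ratio pass over all members first means a nested archive is only read once its own expansion is known to be bounded.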

🗞

Extension Spoofing

Files are validated by reading their magic bytes against a per-extension signature table (.pdf: %PDF-, .docx: PK header, .png: \x89PNG, .jpg: \xff\xd8\xff, and more). A PE executable renamed to report.pdf is rejected before any parser runs.
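
A signature table of this shape is straightforward to sketch. The table below holds only the four signatures listed above. The name magic_matches is hypothetical, and rejecting unknown extensions is an assumption of this sketch.

```python
from pathlib import Path

# extension -> accepted leading byte sequences
MAGIC = {
    ".pdf": (b"%PDF-",),
    ".docx": (b"PK\x03\x04",),
    ".png": (b"\x89PNG",),
    ".jpg": (b"\xff\xd8\xff",),
}


def magic_matches(filename: str, header: bytes) -> bool:
    """True when the file header matches a signature for the claimed extension."""
    signatures = MAGIC.get(Path(filename).suffix.lower())
    if signatures is None:
        return False  # unknown extension: reject (assumption)
    return header.startswith(signatures)
```

A PE executable begins with the bytes MZ, so a file named report.pdf with that header fails the check before any parser is selected.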

🔐

PII in Converted Text

After Markdown is produced, 10 regex categories scan for credentials and personal data. PEM private keys, AWS key pairs, JWT tokens, US SSNs, credit card numbers (Luhn-validated), email addresses, phone numbers, and high-entropy hex strings are all replaced with [REDACTED:TYPE] before output.

📈

Multi-Column Interleaving

Two-column PDF layouts extracted naively produce garbled text where left and right column sentences are interleaved. MarkSentry's gap-analysis processor reconstructs the correct reading order before the Markdown is assembled, preserving document coherence for downstream NLP and RAG ingestion.

Math Equation Loss

DOCX files produced by Word, LibreOffice, and LaTeX round-trips contain equations in OMML XML. Naive text extraction renders them as empty or garbled strings. MarkSentry's recursive OMML parser produces valid LaTeX for 91% of equations in the IEEE equation corpus evaluated during development.


Comparison

MarkSentry vs existing conversion tools

MarkItDown, Pandoc, and Apache Tika each solve the conversion problem in different ways. None treats the input document as untrusted. MarkSentry does.

Capability MarkSentry MarkItDown Pandoc Apache Tika
PDF to Markdown limited text only
DOCX to Markdown text only
Multi-column PDF layout ✓ gap-analysis
Path traversal protection ✓ null-byte + UNC + symlink partial partial
SSRF URL blocking ✓ 10 IP networks
VBA macro stripping
Zip-bomb detection ✓ ratio + depth
Magic-byte validation
PII redaction (10 categories) ✓ Luhn-validated CC
OMML-to-LaTeX equations ✓ recursive XML partial
GFM table extraction
Recursive ZIP support ✓ depth-3 cap partial
Local-first (no cloud API)
Python SDK + CLI CLI only Java API

MarkSentry is designed to be a security layer around document conversion, not a replacement for Pandoc as a general-purpose format transformer. Use MarkSentry when documents arrive from untrusted sources, feed RAG pipelines, or must comply with data handling policies.


Installation

Get started in 60 seconds

Install via pip, point it at any document, and get secure Markdown with a full audit trail.

1. Install
pip install marksentry

# With all optional dependencies
pip install "marksentry[all]"
Requires Python 3.10+. Core dependencies: pdfminer.six, python-docx, lxml, click, rich.
2. Convert a PDF
marksentry convert paper.pdf \
  --multi-column \
  --detect-math \
  --output paper.md
Multi-column layout reconstruction enabled. OMML and Unicode math symbols converted to LaTeX inline and display blocks.
3. Convert with PII masking
marksentry convert report.docx \
  --mask-pii \
  --detect-math \
  --output report.md
All 10 PII categories redacted in output. Credit cards pass Luhn validation to eliminate false positives.
4. Audit PII (dry-run)
marksentry audit-pii \
  medical_record.pdf
Counts and categorizes PII findings without redacting. Use to assess exposure before enabling masking in production pipelines.
Sample output -- security-passed conversion
marksentry convert ieee_paper.pdf --multi-column --detect-math
  [✓] Zero-Trust sanitizer passed
  [✓] Magic bytes: PDF signature verified
  [✓] Zip-bomb check: N/A (not a ZIP container)
  [✓] SSRF scan: 0 blocked URLs found
  [✓] Detected 2-column layout (87% confidence, gap at x=305)
  [✓] Extracted 3 OMML equations to LaTeX
  [✓] 14 tables rendered as GFM pipe syntax
  [✓] Conversion complete: 12 pages, 4,218 words
      Output: ieee_paper.md (68 KB)
Column detection confidence is the fraction of page-width covered cleanly by the detected column bands.
Sample output -- PII audit
marksentry audit-pii patient_intake.docx
  PII Audit -- patient_intake.docx
  +-----------------------+-------+-----------------------------+
  | Category              | Count | Sample (masked)             |
  +-----------------------+-------+-----------------------------+
  | EMAIL                 |   4   | j*****@hospital.org         |
  | SSN                   |   2   | ***-**-4521                 |
  | CREDIT_CARD           |   1   | ****-****-****-9823 [Luhn]  |
  | PHONE                 |   3   | (***) ***-4892              |
  +-----------------------+-------+-----------------------------+
  10 values found across 4 categories.
  Run with --mask-pii to redact in conversion output.
Audit mode reads the document fully but writes no output file. Safe to run on sensitive documents before committing to a conversion pipeline.
Python SDK usage
Embed in a RAG ingestion pipeline
from marksentry import convert

result = convert(
    "/safe/uploads/report.pdf",  # must resolve inside allowed_base
    mask_pii=True,
    multi_column=True,
    detect_math=True,
    check_ssrf=True,
    allowed_base="/safe/uploads",
)

print(result.markdown)      # clean, secured Markdown
print(result.warnings)      # PII redaction counts, layout notes
print(result.metadata)      # title, author, created date
All sanitizer checks are applied automatically. Pass allowed_base to restrict which directories are accessible. convert() raises SanitizationError on any policy violation.

Secure your document pipeline ⭐

If MarkSentry caught a path traversal, blocked an SSRF candidate, or reconstructed a two-column layout that every other tool mangled, open an issue or leave a star. It helps researchers and engineers building RAG pipelines and document processors find this work.

Author

Built by

Sunil Gentyala, Independent Researcher
IEEE Senior Member No. 101760715  ·  CISM  ·  Security Researcher
sunil.gentyala@ieee.org