MarkItDown converts documents. MarkSentry converts them securely. Path traversal jailing, SSRF blocking, VBA macro stripping, zip-bomb detection, multi-column PDF layout reconstruction, OMML-to-LaTeX math, and PII redaction -- all local, no cloud dependencies.
Microsoft MarkItDown converts documents to Markdown quickly -- and that is where the safety story ends. Every path it trusts, every URL it leaves embedded, every macro container it skips becomes an attack surface in RAG pipelines and document processing workflows. MarkSentry was built to close those gaps.
Every input passes through the sanitizer before any parser is invoked.
Paths containing null bytes, UNC prefixes, or URI schemes are rejected immediately.
The path is then resolved, following symlinks, and jailed to allowed_base using Python's
relative_to(), so a symlink that points outside the base is rejected as well. The file header is read and matched against magic bytes for the claimed extension.
ZIP containers are checked for ratio (>100:1) and nesting depth (>3).
OOXML files have vbaProject.bin stripped from the ZIP before any parser touches them.
All embedded URLs are scanned against blocked IP ranges (RFC-1918, 127.0.0.0/8, 169.254.0.0/16,
and IPv6 equivalents) and blocked schemes (file://, smb://, ldap://).
After sanitization, input is dispatched to the appropriate parser by extension.
PdfParser uses pdfminer.six with per-page bounding-box extraction.
DocxParser iterates body elements via python-docx and lxml, rendering headings,
lists, tables, and hyperlinks to GFM Markdown. ZipParser recursively
extracts members up to MAX_NEST_DEPTH=2 and dispatches each to the registry,
with a second path-traversal fence on every extracted member path.
OMML equations are extracted from m:oMath elements and converted to LaTeX
via recursive XML descent over the OMML namespace.
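The second path-traversal fence on extracted members can be approximated with a pure string check on each archive member name. This is a minimal sketch, not MarkSentry's code: `safe_member_path` is a hypothetical helper, and the rules (no null bytes, no absolute paths, no `..` components, no drive prefixes) are assumptions about what such a fence would enforce.

```python
from pathlib import PurePosixPath


def safe_member_path(member_name: str) -> PurePosixPath:
    """Return a normalized relative path, or raise ValueError if the name could escape."""
    if "\x00" in member_name:
        raise ValueError("null byte in member name")
    # ZIP names use "/", but hostile archives may carry backslashes instead.
    path = PurePosixPath(member_name.replace("\\", "/"))
    if path.is_absolute() or ".." in path.parts:
        raise ValueError(f"unsafe member path: {member_name!r}")
    # Reject Windows drive prefixes such as "C:/..." (assumed policy).
    if path.parts and path.parts[0].endswith(":"):
        raise ValueError(f"drive prefix in member path: {member_name!r}")
    return path
```

Because the check never touches the filesystem, it can run on every name in the archive listing before a single member is extracted.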
PDF documents from academic papers, technical reports, and scanned journals frequently
use two-column layouts. Naive line-by-line extraction interleaves both columns.
MarkSentry builds a 1-point-resolution horizontal coverage histogram from all
text bounding boxes on the page, finds the widest gap between non-zero regions,
and uses that gap to define column bands. Blocks that span more than 70% of page
width are classified as full-width (titles, captions, equations) and placed outside
the column reconstruction. The final reading order is: full-width header blocks,
then the columns in left-to-right order (each column read top to bottom), then full-width footer blocks.
Evaluated on 50 IEEE double-column PDFs: 90% column boundary accuracy.
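The gap analysis described above can be sketched in a few lines. This is an illustration under stated assumptions rather than MarkSentry's implementation: the names `widest_gap` and `reading_order` are hypothetical, blocks are `(x0, top, x1, bottom, text)` tuples with y measured downward from the top of the page, full-width blocks are left out of the histogram so they do not fill the gap, and only the two-column case is handled.

```python
from typing import List, Optional, Tuple

Block = Tuple[float, float, float, float, str]  # x0, top, x1, bottom, text


def widest_gap(hist: List[int]) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the widest zero run that lies between covered bins."""
    covered = [i for i, count in enumerate(hist) if count > 0]
    if not covered:
        return None
    best, run_start = None, None
    for i in range(covered[0], covered[-1] + 1):
        if hist[i] == 0:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if best is None or i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
    return best


def reading_order(blocks: List[Block], page_width: float,
                  full_width_ratio: float = 0.7) -> List[str]:
    def top(b: Block) -> float:
        return b[1]

    limit = full_width_ratio * page_width
    full = [b for b in blocks if b[2] - b[0] > limit]
    narrow = [b for b in blocks if b[2] - b[0] <= limit]

    # 1-point-resolution horizontal coverage histogram over the narrow blocks.
    hist = [0] * int(page_width)
    for x0, _, x1, _, _ in narrow:
        for x in range(max(int(x0), 0), min(int(x1), len(hist))):
            hist[x] += 1

    gap = widest_gap(hist)
    if gap is None:  # single column: plain top-to-bottom order
        return [b[4] for b in sorted(blocks, key=top)]

    split = (gap[0] + gap[1]) / 2
    left = sorted((b for b in narrow if (b[0] + b[2]) / 2 < split), key=top)
    right = sorted((b for b in narrow if (b[0] + b[2]) / 2 >= split), key=top)
    first_top = min(top(b) for b in narrow)
    headers = sorted((b for b in full if top(b) < first_top), key=top)
    footers = sorted((b for b in full if top(b) >= first_top), key=top)
    return [b[4] for b in headers + left + right + footers]
```

On a 612 pt page with columns spanning x=50..290 and x=320..560, the widest gap is 290..320 and the split lands at x=305, so the left column is emitted in full before the right one.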
After Markdown is produced, the PII filter scans the text using compiled regex patterns
for: PEM private keys, AWS access keys, AWS secret keys, JWT tokens, US SSNs, email addresses,
credit card numbers (Luhn-validated to eliminate 16-digit false positives),
phone numbers, IPv4 addresses, password assignment expressions, and high-entropy hex strings.
Each match is replaced with [REDACTED:TYPE]. The audit-pii
subcommand runs a dry-run pass that counts and categorizes findings without redacting,
letting operators assess exposure before enabling masking in production pipelines.
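A reduced sketch of the masking and dry-run behavior, covering three of the categories. The regular expressions, the `PATTERNS` table, and the `redact` function are illustrative assumptions, not the project's compiled patterns; the Luhn checksum itself is the standard algorithm.

```python
import re
from collections import Counter


def luhn_valid(digits: str) -> bool:
    """Standard Luhn checksum: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# Illustrative patterns only; real-world patterns need more care.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){15}\d\b"),
}


def redact(text: str, dry_run: bool = False):
    """Return (text, counts). With dry_run=True the text is left untouched."""
    counts = Counter()
    for label, pattern in PATTERNS.items():
        def replace(match, label=label):
            value = match.group()
            if label == "CREDIT_CARD" and not luhn_valid(re.sub(r"\D", "", value)):
                return value  # 16 digits that fail Luhn are not treated as a card
            counts[label] += 1
            return value if dry_run else f"[REDACTED:{label}]"
        text = pattern.sub(replace, text)
    return text, counts
```

Running `redact(text, dry_run=True)` gives the per-category counts that an audit pass would report, while the default call returns the masked text.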
Every document passes through sanitizer, parser, layout processor, and PII filter in sequence. No stage is optional when the input is untrusted.
Input: .pdf / .docx / .xlsx / .pptx / .zip / nested .zip
                      |
                      v
+--------------------------------------------+
| Zero-Trust Sanitizer                       |
| - Null-byte + UNC path rejection           |
| - Symlink jail (relative_to enforcement)   |
| - Magic-byte vs. extension validation      |
| - Zip-bomb: ratio > 100:1, depth > 3       |
| - VBA macro strip (vbaProject.bin)         |
| - SSRF URL scan (RFC-1918, 169.254.x.x)    |
+--------------------------------------------+
                      |
                      v
+--------------------------------------------+
| Format Parser Registry                     |
| PdfParser           DocxParser             |
| - pdfminer.six      - python-docx + lxml   |
| - LTTextBox grid    - Headings / Lists     |
| - Table detect      - GFM pipe tables      |
| - Math symbols      - OMML -> LaTeX        |
|                                            |
| ZipParser (recursive, MAX_NEST_DEPTH=2)    |
+--------------------------------------------+
                      |
                      v
+--------------------------------------------+
| Multi-Column Layout Processor              |
| - 1-pt coverage histogram per page         |
| - Widest gap -> column boundary split      |
| - Full-width block classification (70%)    |
| - Reading-order: header -> col-major ->    |
|   footer                                   |
+--------------------------------------------+
                      |
                      v
+--------------------------------------------+
| PII Filter                                 |
| 10 regex categories (Luhn CC, high-        |
| entropy hex, PEM keys, AWS keys, JWT,      |
| SSN, email, phone, IPv4, password)         |
| --> [REDACTED:TYPE] masking                |
| --> audit-pii dry-run mode                 |
+--------------------------------------------+
                      |
                      v
Markdown Output (.md)
Public sanitize() function that runs all six checks in sequence.
Returns a SanitizationResult dataclass. Raises SanitizationError
on any policy violation. Blocks 10 IP networks including IPv6 equivalents.
1-point-resolution horizontal histogram over all BBox objects on a page.
Detects column count (capped at 4), assigns blocks to column bands, classifies
full-width elements, and emits blocks in human reading order.
10 compiled _PiiPattern entries. Credit card numbers pass through
a Luhn checksum validator to eliminate false positives on arbitrary 16-digit
sequences. High-entropy hex detection catches obfuscated tokens.
pdfminer.six LAParams traversal. Heading detection by font-size ratio
relative to median body font (lower 60th percentile). Table detection by y-band
clustering (4 pt) requiring 3+ rows with 2+ columns. Emits GFM pipe tables.
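The y-band clustering rule can be illustrated as follows. This is a simplified sketch with hypothetical helpers (`rows_by_y_band`, `to_gfm_table`) that works on `(x, y, text)` tuples; it additionally assumes every row has the same number of cells, which is a restriction of this example and is not stated for MarkSentry.

```python
from typing import List, Optional, Tuple

Item = Tuple[float, float, str]  # x, y, text


def rows_by_y_band(items: List[Item], band: float = 4.0) -> List[List[str]]:
    """Group text items into rows whose y lies within `band` points of the row's first item."""
    rows = []
    for x, y, text in sorted(items, key=lambda it: (it[1], it[0])):
        if rows and abs(y - rows[-1]["y"]) <= band:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Within each row, order cells left to right.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


def to_gfm_table(rows: List[List[str]]) -> Optional[str]:
    """Emit a GFM pipe table if there are 3+ rows of 2+ columns, else None."""
    if len(rows) < 3 or len(rows[0]) < 2:
        return None
    if any(len(row) != len(rows[0]) for row in rows):
        return None  # ragged rows: simplifying assumption of this sketch
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

The first clustered row is treated as the header row, which is the simplest choice when no font information is available.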
Iterates body XML by tag (p, tbl, sdt). Extracts m:oMath elements and
converts them to LaTeX via recursive XML descent. Reconstructs hyperlinks
from w:hyperlink XML. Tracks list nesting via w:numPr/w:ilvl.
Recursive XML descent over the OMML namespace. Handles fractions
(\frac), radicals (\sqrt), n-ary operators
(\int, \sum, \prod), delimiters,
matrices (\begin{pmatrix}), accents, and equation arrays. 200+ Unicode-to-LaTeX mappings.
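To make the recursive descent concrete, here is a minimal sketch that handles only text runs, fractions, radicals, and superscripts. The function name `omml_to_latex` is hypothetical and the coverage is far narrower than what the description above lists; the element names (`m:f`, `m:num`, `m:den`, `m:rad`, `m:deg`, `m:e`, `m:sSup`, `m:sup`, `m:t`) and the namespace URI come from the OMML schema.

```python
import xml.etree.ElementTree as ET

OMML_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def q(tag: str) -> str:
    """Qualify a local tag name with the OMML namespace."""
    return "{" + OMML_NS + "}" + tag


def omml_to_latex(node: ET.Element) -> str:
    def part(name: str) -> str:
        child = node.find(q(name))
        return omml_to_latex(child) if child is not None else ""

    if node.tag == q("t"):  # text run content
        return node.text or ""
    if node.tag == q("f"):  # fraction
        return "\\frac{" + part("num") + "}{" + part("den") + "}"
    if node.tag == q("rad"):  # radical, with optional degree
        degree, body = part("deg"), part("e")
        if degree:
            return "\\sqrt[" + degree + "]{" + body + "}"
        return "\\sqrt{" + body + "}"
    if node.tag == q("sSup"):  # superscript
        return "{" + part("e") + "}^{" + part("sup") + "}"
    if node.tag.endswith("Pr"):  # property elements carry formatting, not content
        return ""
    # Any other container: concatenate the children in document order.
    return "".join(omml_to_latex(child) for child in node)
```

Each additional construct (n-ary operators, delimiters, matrices) would add one more branch of the same shape: pick out the named children, recurse, and wrap the results in the matching LaTeX command.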
All checks run before the first parser byte is read, not after.
Null bytes in filenames (file.pdf\x00.sh), UNC paths
(\\attacker\share\file), URI schemes (file://),
and symlinks resolving outside the allowed base directory.
Rejected by _resolve_and_jail() before any I/O.
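The jail check can be sketched with pathlib. This is an illustration, not the body of `_resolve_and_jail()`: the function below is a hypothetical stand-in, `SanitizationError` is redefined locally so the snippet is self-contained, relative paths are resolved against the base directory (an assumption), and POSIX path semantics are assumed.

```python
from pathlib import Path


class SanitizationError(Exception):
    """Local stand-in for the error type named in this document."""


def resolve_and_jail(raw_path: str, allowed_base: str) -> Path:
    # String-level rejections come first, before the filesystem is consulted.
    if "\x00" in raw_path:
        raise SanitizationError("null byte in path")
    if raw_path.startswith(("\\\\", "//")):
        raise SanitizationError("UNC path rejected")
    if "://" in raw_path:
        raise SanitizationError("URI scheme rejected")

    base = Path(allowed_base).resolve()
    candidate = Path(raw_path)
    if not candidate.is_absolute():
        candidate = base / candidate  # assumption: relative paths are relative to the base

    # resolve() collapses ".." and follows symlinks, so the containment
    # test below sees the real target rather than the spelled path.
    resolved = candidate.resolve()
    try:
        resolved.relative_to(base)
    except ValueError:
        raise SanitizationError(f"{raw_path!r} escapes the allowed base") from None
    return resolved
```

Using `relative_to()` on resolved paths avoids the classic prefix-comparison bug in which `/safe/uploads-evil` passes a `startswith("/safe/uploads")` test.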
DOCX relationship targets and PDF embedded links are scanned against 10 blocked IP networks. Loopback (127.0.0.0/8), link-local (169.254.0.0/16), RFC-1918 (10.x, 172.16-31.x, 192.168.x), CGNAT (100.64.0.0/10), and IPv6 equivalents (::1, fc00::/7, fe80::/10) are all blocked.
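A minimal sketch of the URL check using the standard `ipaddress` module. The `BLOCKED_NETWORKS` list contains only the nine networks spelled out above (the tenth is not named here, so it is left out), and `is_blocked_url` is a hypothetical helper. Hostnames are not resolved, so DNS-based bypasses are outside the scope of this example.

```python
import ipaddress
from urllib.parse import urlsplit

# Only the networks explicitly named in the text above.
BLOCKED_NETWORKS = [ipaddress.ip_network(n) for n in (
    "127.0.0.0/8", "169.254.0.0/16",
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
    "100.64.0.0/10",
    "::1/128", "fc00::/7", "fe80::/10",
)]
BLOCKED_SCHEMES = {"file", "smb", "ldap"}


def is_blocked_url(url: str) -> bool:
    parts = urlsplit(url)
    if parts.scheme.lower() in BLOCKED_SCHEMES:
        return True
    host = parts.hostname
    if host is None:
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return False  # a hostname rather than an IP literal; not resolved in this sketch
    # Membership is False when the address and network versions differ.
    return any(address in network for network in BLOCKED_NETWORKS)
```

The cloud metadata endpoint `http://169.254.169.254/` falls inside the link-local range and is therefore flagged.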
OOXML files are opened as ZIP archives. Any member named
vbaProject.bin is excluded from the reconstituted archive
before parsing. [Content_Types].xml is rewritten to remove
the VBA content type declaration. The sanitized copy is what the parser sees.
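The stripping step can be sketched with the standard `zipfile` module, working entirely in memory. `strip_vba` is a hypothetical name, and the regex used to drop the content-type entry is a shortcut for illustration; a production version would more likely edit `[Content_Types].xml` with an XML parser.

```python
import io
import re
import zipfile


def strip_vba(ooxml_bytes: bytes) -> bytes:
    """Return a copy of the OOXML archive without vbaProject.bin."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(ooxml_bytes)) as src, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            name = info.filename
            # Skip the macro container wherever it sits (word/, xl/, ppt/).
            if name.rsplit("/", 1)[-1].lower() == "vbaproject.bin":
                continue
            data = src.read(info)
            if name == "[Content_Types].xml":
                # Drop any Default/Override entry that mentions vbaProject.
                data = re.sub(rb"<(?:Default|Override)\b[^>]*vbaProject[^>]*/>", b"", data)
            dst.writestr(name, data)
    return output.getvalue()
```

The original bytes are never modified; downstream parsers would be handed only the returned copy.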
Each ZIP member is checked for two conditions: an uncompressed-to-compressed size ratio
greater than 100:1, and recursive nesting depth greater than 3 levels.
Either condition immediately raises SanitizationError.
No partial extraction occurs before the check completes.
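A sketch of both checks using the sizes recorded in the archive's central directory. The names `check_zip_bomb`, `MAX_RATIO`, and `MAX_DEPTH` are hypothetical, `SanitizationError` is redefined locally, and counting the outermost archive as depth 1 is an assumption. Note that inspecting a nested archive requires reading that one member into memory, which this sketch does only after the member has passed the ratio check.

```python
import io
import zipfile

MAX_RATIO = 100  # uncompressed : compressed
MAX_DEPTH = 3    # assumption: the outermost archive counts as depth 1


class SanitizationError(Exception):
    """Local stand-in for the error type named in this document."""


def check_zip_bomb(data: bytes, depth: int = 1) -> None:
    if depth > MAX_DEPTH:
        raise SanitizationError(f"nesting depth {depth} exceeds {MAX_DEPTH}")
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        members = archive.infolist()
        # First pass: declared sizes only, nothing is decompressed yet.
        for info in members:
            if info.file_size > MAX_RATIO * max(info.compress_size, 1):
                raise SanitizationError(f"{info.filename}: ratio exceeds {MAX_RATIO}:1")
        # Second pass: descend into nested archives that passed the ratio check.
        for info in members:
            if info.filename.lower().endswith(".zip"):
                check_zip_bomb(archive.read(info), depth + 1)
```

Declared sizes can be forged, so a hardened version would also cap the number of bytes actually read while decompressing.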
Files are validated by reading their magic bytes against a per-extension
signature table (.pdf: %PDF-, .docx: PK header,
.png: \x89PNG, .jpg: \xff\xd8\xff, and more).
A PE executable renamed to report.pdf is rejected before any parser runs.
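The signature lookup amounts to a prefix comparison. The table below holds only the four signatures listed above, with the ZIP local-file-header bytes `PK\x03\x04` standing in for the "PK header"; `magic_matches` is a hypothetical helper, and rejecting unknown extensions is an assumption of this sketch.

```python
import os

MAGIC_SIGNATURES = {
    ".pdf": b"%PDF-",
    ".docx": b"PK\x03\x04",  # ZIP local file header
    ".png": b"\x89PNG",
    ".jpg": b"\xff\xd8\xff",
}


def magic_matches(filename: str, header: bytes) -> bool:
    """True if the leading bytes match the signature for the claimed extension."""
    extension = os.path.splitext(filename)[1].lower()
    signature = MAGIC_SIGNATURES.get(extension)
    # Unknown extensions are rejected here (assumed policy).
    return signature is not None and header.startswith(signature)
```

A Windows executable begins with the bytes `MZ`, so `magic_matches("report.pdf", b"MZ\x90\x00")` returns `False`.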
After Markdown is produced, 10 regex categories scan for credentials and personal data.
PEM private keys, AWS key pairs, JWT tokens, US SSNs, credit card numbers
(Luhn-validated), email addresses, phone numbers, IPv4 addresses, password assignment
expressions, and high-entropy hex strings are all replaced with [REDACTED:TYPE] before output.
Two-column PDF layouts extracted naively produce garbled text where left and right column sentences are interleaved. MarkSentry's gap-analysis processor reconstructs the correct reading order before the Markdown is assembled, preserving document coherence for downstream NLP and RAG ingestion.
DOCX files produced by Word, LibreOffice, and LaTeX round-trips contain equations in OMML XML. Naive text extraction renders them as empty or garbled strings. MarkSentry's recursive OMML parser produces valid LaTeX for 91% of equations in the IEEE equation corpus evaluated during development.
MarkItDown, Pandoc, and Apache Tika each solve the conversion problem in different ways. None treats the input document as untrusted. MarkSentry does.
| Capability | MarkSentry | MarkItDown | Pandoc | Apache Tika |
|---|---|---|---|---|
| PDF to Markdown | ✓ | ✓ | limited | text only |
| DOCX to Markdown | ✓ | ✓ | ✓ | text only |
| Multi-column PDF layout | ✓ gap-analysis | ✗ | ✗ | ✗ |
| Path traversal protection | ✓ null-byte + UNC + symlink | ✗ | partial | partial |
| SSRF URL blocking | ✓ 10 IP networks | ✗ | ✗ | ✗ |
| VBA macro stripping | ✓ | ✗ | ✗ | ✗ |
| Zip-bomb detection | ✓ ratio + depth | ✗ | ✗ | ✗ |
| Magic-byte validation | ✓ | ✗ | ✗ | ✓ |
| PII redaction (10 categories) | ✓ Luhn-validated CC | ✗ | ✗ | ✗ |
| OMML-to-LaTeX equations | ✓ recursive XML | partial | ✓ | ✗ |
| GFM table extraction | ✓ | ✓ | ✓ | ✗ |
| Recursive ZIP support | ✓ depth-3 cap | partial | ✗ | ✓ |
| Local-first (no cloud API) | ✓ | ✓ | ✓ | ✓ |
| Python SDK + CLI | ✓ | ✓ | CLI only | Java API |
MarkSentry is designed to be a security layer around document conversion, not a replacement for Pandoc as a general-purpose format transformer. Use MarkSentry when documents arrive from untrusted sources, feed RAG pipelines, or must comply with data handling policies.
Install via pip, point it at any document, and get secure Markdown with a full audit trail.
pip install marksentry
# With all optional dependencies
pip install "marksentry[all]"

marksentry convert paper.pdf \
  --multi-column \
  --detect-math \
  --output paper.md

marksentry convert report.docx \
  --mask-pii \
  --detect-math \
  --output report.md

marksentry audit-pii \
  medical_record.pdf
[✓] Zero-Trust sanitizer passed
[✓] Magic bytes: PDF signature verified
[✓] Zip-bomb check: N/A (not a ZIP container)
[✓] SSRF scan: 0 blocked URLs found
[✓] Detected 2-column layout (87% confidence, gap at x=305)
[✓] Extracted 3 OMML equations to LaTeX
[✓] 14 tables rendered as GFM pipe syntax
[✓] Conversion complete: 12 pages, 4,218 words
Output: ieee_paper.md (68 KB)
PII Audit -- patient_intake.docx
+-----------------------+-------+-----------------------------+
| Category              | Count | Sample (masked)             |
+-----------------------+-------+-----------------------------+
| EMAIL                 | 4     | j*****@hospital.org         |
| SSN                   | 2     | ***-**-4521                 |
| CREDIT_CARD           | 1     | ****-****-****-9823 [Luhn]  |
| PHONE                 | 3     | (***) ***-4892              |
+-----------------------+-------+-----------------------------+
10 values found across 4 categories.
Run with --mask-pii to redact in conversion output.
from marksentry import convert
result = convert(
"uploads/report.pdf",
mask_pii=True,
multi_column=True,
detect_math=True,
check_ssrf=True,
allowed_base="/safe/uploads",
)
print(result.markdown) # clean, secured Markdown
print(result.warnings) # PII redaction counts, layout notes
print(result.metadata) # title, author, created date
Set allowed_base to restrict which directories are accessible. convert() raises SanitizationError on any policy violation.
If MarkSentry caught a path traversal, blocked an SSRF candidate, or reconstructed a two-column layout that every other tool mangled, open an issue or leave a star. It helps researchers and engineers building RAG pipelines and document processors find this work.