# OCR and Documents

Extract text from PDFs/scans (pymupdf, marker-pdf).
## Skill metadata

| Field | Value |
|---|---|
| Source | Bundled (installed by default) |
| Path | skills/productivity/ocr-and-documents |
| Version | 2.3.0 |
| Author | Hermes Agent |
| License | MIT |
| Tags | PDF, Documents, Research, Arxiv, Text-Extraction, OCR |
| Related skills | powerpoint |
## Reference: full SKILL.md

The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
## PDF & Document Extraction

- For DOCX: use python-docx (parses actual document structure, far better than OCR).
- For PPTX: see the powerpoint skill (uses python-pptx with full slide/notes support).

This skill covers PDFs and scanned documents.
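The routing above can be sketched as a small lookup table. This is an illustration under our own naming (the route labels and the fallback are ours, not part of the skill):

```python
# Illustrative file-type routing for the guidance above.
# Route values are descriptive labels, not real APIs.
import os

FALLBACK = "this skill (web_extract / pymupdf / marker-pdf)"
ROUTES = {
    ".docx": "python-docx",
    ".pptx": "powerpoint skill (python-pptx)",
    ".pdf": FALLBACK,
}

def route(path: str) -> str:
    ext = os.path.splitext(path)[1].lower()
    # Scanned images and unknown formats also fall through to this skill.
    return ROUTES.get(ext, FALLBACK)
```

For example, `route("report.docx")` returns `"python-docx"`, while a scanned image path falls through to this skill.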
### Step 1: Remote URL Available?

If the document has a URL, always try web_extract first:

```
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
web_extract(urls=["https://example.com/report.pdf"])
```

This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.

Only use local extraction when: the file is local, web_extract fails, or you need batch processing.
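The Step 1 rule reads naturally as a tiny decision function. A sketch under our own naming, not part of the skill's tooling:

```python
# Prefer web_extract for URLs; fall back to local extraction when the
# file is local, web_extract failed, or batch processing is needed.
def choose_strategy(source: str, web_extract_failed: bool = False,
                    batch: bool = False) -> str:
    is_url = source.startswith(("http://", "https://"))
    if is_url and not web_extract_failed and not batch:
        return "web_extract"
    return "local"
```

For example, `choose_strategy("https://example.com/report.pdf")` returns `"web_extract"`, while a local path returns `"local"`.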
### Step 2: Choose Local Extractor

| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
|---|---|---|
| Text-based PDF | ✅ | ✅ |
| Scanned PDF (OCR) | ❌ | ✅ (90+ languages) |
| Tables | ✅ (basic) | ✅ (high accuracy) |
| Equations / LaTeX | ❌ | ✅ |
| Code blocks | ❌ | ✅ |
| Forms | ❌ | ✅ |
| Headers/footers removal | ❌ | ✅ |
| Reading order detection | ❌ | ✅ |
| Images extraction | ✅ (embedded) | ✅ (with context) |
| Images → text (OCR) | ❌ | ✅ |
| EPUB | ✅ | ✅ |
| Markdown output | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| Install size | ~25MB | ~3-5GB (PyTorch + models) |
| Speed | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |
Decision: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.
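The decision rule can be made explicit as a predicate. An illustrative sketch only; the flag names are ours:

```python
# Any advanced requirement pushes the choice to marker-pdf;
# otherwise pymupdf is the lightweight default.
def choose_extractor(needs_ocr: bool = False, needs_equations: bool = False,
                     needs_forms: bool = False, complex_layout: bool = False) -> str:
    if needs_ocr or needs_equations or needs_forms or complex_layout:
        return "marker-pdf"
    return "pymupdf"
```

So a text-based report maps to pymupdf, while a scanned contract (`needs_ocr=True`) maps to marker-pdf.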
If the user needs marker capabilities but the system lacks ~5GB free disk:

> "This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf, which works for text-based PDFs but not scanned documents or equations."
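The free-space check behind that message can be done with the standard library. A minimal sketch; the 5 GB threshold mirrors the estimate above:

```python
import shutil

def free_gb(path: str = ".") -> float:
    # Free bytes on the filesystem containing `path`, in decimal GB.
    return shutil.disk_usage(path).free / 1e9

MARKER_NEEDED_GB = 5.0
if free_gb() < MARKER_NEEDED_GB:
    print(f"Only {free_gb():.1f} GB free; suggest web_extract or pymupdf instead.")
```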
### pymupdf (lightweight)

```
pip install pymupdf pymupdf4llm
```

Via helper script:

```
python scripts/extract_pymupdf.py document.pdf               # Plain text
python scripts/extract_pymupdf.py document.pdf --markdown    # Markdown
python scripts/extract_pymupdf.py document.pdf --tables      # Tables
python scripts/extract_pymupdf.py document.pdf --images out/ # Extract images
python scripts/extract_pymupdf.py document.pdf --metadata    # Title, author, pages
python scripts/extract_pymupdf.py document.pdf --pages 0-4   # Specific pages
```

Inline:
```
python3 -c "
import pymupdf
doc = pymupdf.open('document.pdf')
for page in doc:
    print(page.get_text())
"
```
### marker-pdf (high-quality OCR)

```
# Check disk space first
python scripts/extract_marker.py --check

pip install marker-pdf
```

Via helper script:

```
python scripts/extract_marker.py document.pdf                   # Markdown
python scripts/extract_marker.py document.pdf --json            # JSON with metadata
python scripts/extract_marker.py document.pdf --output_dir out/ # Save images
python scripts/extract_marker.py scanned.pdf                    # Scanned PDF (OCR)
python scripts/extract_marker.py document.pdf --use_llm         # LLM-boosted accuracy
```

CLI (installed with marker-pdf):
```
marker_single document.pdf --output_dir ./output
marker /path/to/folder --workers 4   # Batch
```
### Arxiv Papers

```
# Abstract only (fast)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])

# Full paper
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])

# Search
web_search(query="arxiv GRPO reinforcement learning 2026")
```
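The two arXiv URL forms differ only in one path segment. A small helper (hypothetical, not part of the skill) makes the pattern explicit:

```python
# Build the abstract and PDF URLs for an arXiv ID
# (the ID below is the example used in this doc).
def arxiv_urls(arxiv_id: str) -> dict:
    return {
        "abstract": f"https://arxiv.org/abs/{arxiv_id}",
        "pdf": f"https://arxiv.org/pdf/{arxiv_id}",
    }

print(arxiv_urls("2402.03300")["pdf"])  # https://arxiv.org/pdf/2402.03300
```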
### Split, Merge & Search

pymupdf handles these natively — use execute_code or inline Python:
```python
# Split: extract pages 1-5 to a new PDF
import pymupdf

doc = pymupdf.open("report.pdf")
new = pymupdf.open()
for i in range(5):
    new.insert_pdf(doc, from_page=i, to_page=i)
new.save("pages_1-5.pdf")
```

```python
# Merge multiple PDFs
import pymupdf

result = pymupdf.open()
for path in ["a.pdf", "b.pdf", "c.pdf"]:
    result.insert_pdf(pymupdf.open(path))
result.save("merged.pdf")
```

```python
# Search for text across all pages
import pymupdf

doc = pymupdf.open("report.pdf")
for i, page in enumerate(doc):
    results = page.search_for("revenue")
    if results:
        print(f"Page {i+1}: {len(results)} match(es)")
        print(page.get_text("text"))
```

No extra dependencies needed — pymupdf covers split, merge, search, and text extraction in one package.
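The helper script's `--pages 0-4` flag implies a small page-spec parser. A hedged sketch of what such parsing could look like (not the script's actual code):

```python
# Parse page specs like "0-4" or "0,2,5-7" into a list of page indices.
def parse_pages(spec: str) -> list:
    pages = []
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-")
            pages.extend(range(int(start), int(end) + 1))
        else:
            pages.append(int(part))
    return pages

print(parse_pages("0-4"))  # [0, 1, 2, 3, 4]
```

The resulting indices can be fed to `insert_pdf(doc, from_page=i, to_page=i)` as in the split example above.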
- `web_extract` is always first choice for URLs
- pymupdf is the safe default — instant, no models, works everywhere
- marker-pdf is for OCR, scanned docs, equations, complex layouts — install only when needed
- Both helper scripts accept `--help` for full usage
- marker-pdf downloads ~2.5GB of models to `~/.cache/huggingface/` on first use
- For Word docs: `pip install python-docx` (better than OCR — parses actual structure)
- For PowerPoint: see the `powerpoint` skill (uses python-pptx)