Runs locally · no upload

PDF to Markdown

Converts PDFs to Markdown — text layer directly, scanned pages via OCR. All in your browser tab.

How It Works

  1. Pick a PDF

    Drag & drop or use the file picker. Up to 50 files per batch, 50 MB per file. Encrypted PDFs are detected and reported.

  2. Check the mode

    If the PDF has a text layer, direct extraction runs. Otherwise OCR kicks in. The tool tells you up front which mode applies.

  3. Download Markdown

    A single file downloads directly as `.md`. Multiple files come back as a ZIP, with referenced images and a conversion report.

Privacy

No file is ever sent to a server. The PDF is parsed and turned into Markdown inside your browser tab. After the first load the tool also works offline, with no tracking and no signup.

PDFs are the standard format for finished documents — and the worst format when you want to feed the content into Obsidian, a wiki, or a RAG index. This tool turns PDFs into clean Markdown: headings become `#`-headers, bullet lists become lists, paragraphs become paragraphs. For scanned pages an OCR fallback reads the text out of the page image. The PDF never leaves your machine.

01 — How to Use

How do you use this tool?

  1. Drag a PDF in or use the file picker — up to 50 MB per file
  2. Check the options — the OCR fallback for scanned pages is on by default
  3. Click 'Convert' and download the `.md` file — multiple files come back as a ZIP

Why convert PDF to Markdown?

Markdown has become the lingua franca for AI workflows, wikis and personal knowledge systems. Obsidian, Logseq, Hugo, Astro content collections, Claude Code files and almost every RAG index expect Markdown, not PDFs. Anyone who needs to feed a stack of contracts, papers or whitepapers into a knowledge base hits the same wall: PDFs are designed for humans, not machines.

This tool makes the reverse trip practical. From a PDF you get a clean .md file with detectable structure: headings as #-headers, lists as bullet points, paragraphs as paragraphs. Anything that can’t be reliably converted — complex tables, mathematical formulas, multi-column layouts with marginalia — is flagged honestly instead of being half-fabricated.

How does the conversion work technically?

If the PDF has an embedded text layer, an established open-source PDF library extracts the text along with position and font size. A layout heuristic groups text blocks into paragraphs, infers heading levels from font size and position, and recognises bullet markers (•, -, numeral + period) as lists. The result is a GitHub Flavored Markdown document that renders natively in Obsidian, VS Code and any standard Markdown pipeline.
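The layout heuristic can be pictured with a small sketch. It is written in Python for illustration only and is not the tool's actual code: the input shape (one `(text, font_size)` pair per extracted line), the function name `lines_to_markdown` and the "most common size is body text" rule are all assumptions.

```python
import re
from collections import Counter

# Bullet markers named in the text: "•", "-", or a numeral followed by a period.
BULLET = re.compile(r"^\s*(?:[•\-]|(\d+)\.)\s+(.*)$")


def lines_to_markdown(lines):
    """Turn hypothetical (text, font_size) pairs into Markdown lines."""
    # Assumption: the most frequent font size is body text.
    body_size = Counter(size for _, size in lines).most_common(1)[0][0]
    # Every larger size becomes a heading level: largest -> '#', next -> '##', ...
    heading_sizes = sorted({s for _, s in lines if s > body_size}, reverse=True)
    level = {size: rank + 1 for rank, size in enumerate(heading_sizes)}

    out = []
    for text, size in lines:
        text = text.strip()
        if not text:
            continue
        if size in level:
            # Markdown supports at most six heading levels.
            out.append("#" * min(level[size], 6) + " " + text)
            continue
        match = BULLET.match(text)
        if match:
            # Keep the number for ordered items, normalise other markers to "-".
            marker = f"{match.group(1)}." if match.group(1) else "-"
            out.append(f"{marker} {match.group(2)}")
        else:
            out.append(text)
    return "\n".join(out)
```

A real heuristic would also use position (indentation, vertical gaps) to merge lines into paragraphs; this sketch only covers the font-size and bullet-marker parts.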

Scanned PDFs have no text layer — the pages are images. Here the tool switches into OCR mode: a proven WebAssembly OCR model reads the text out of the image, with language packs for English, German and other European languages. The model is cached in the browser on first use (~12 MB); after that the tool keeps working without an internet connection.
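How the tool decides between the two modes is not described here. One plausible approach, sketched below in Python as an assumption rather than the actual implementation, is to count the visible characters the text layer yields per page and fall back to OCR when there are too few. The function name `choose_mode` and the threshold `min_chars` are hypothetical.

```python
def choose_mode(page_texts, min_chars=20):
    """Return 'direct' or 'ocr' for each page's extracted text.

    page_texts: list of strings, one per page, as yielded by text-layer
    extraction (empty or near-empty for scanned pages).
    min_chars: hypothetical threshold of non-whitespace characters.
    """
    modes = []
    for text in page_texts:
        # Count only visible characters; whitespace alone is not a text layer.
        visible = sum(1 for ch in text if not ch.isspace())
        modes.append("direct" if visible >= min_chars else "ocr")
    return modes
```

Deciding per page rather than per file would let a mixed document (typed pages plus scanned appendices) use direct extraction where it can and OCR only where it must.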

What is the tool actually used for?

  • Filling an Obsidian vault. A pile of academic papers becomes Markdown files where you can set links and backlinks.
  • Seeding Claude Code or a coding wiki. Architecture PDFs become Markdown that lives next to the code files.
  • RAG index preparation. Markdown chunks much more cleanly than PDF — splitters work along heading boundaries.
  • Logseq block import. Markdown headings become Logseq blocks.
  • Hugo / Astro content migration. Existing PDF documentation becomes static-site content.
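The RAG point above can be made concrete with a short sketch of a heading-boundary splitter. It is a minimal Python illustration, not part of the tool; the function name `split_by_headings` is hypothetical, and it deliberately ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re

# ATX headings: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown):
    """Split Markdown into (heading, body) chunks at heading lines.

    Text before the first heading is returned with heading None.
    Note: '#' lines inside fenced code blocks are not special-cased.
    """
    chunks = []
    title, body = None, []

    def flush():
        # Emit the current chunk unless it is completely empty.
        if title is not None or any(line.strip() for line in body):
            chunks.append((title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk keeps its heading as metadata, which is what makes Markdown easier to index than raw PDF text: the splitter never has to guess where a section starts.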

What survives the conversion — and what doesn’t?

Preserved: headings (with detectable hierarchy), paragraphs, lists (ordered and unordered), inline formatting like bold and italic, links with anchor text, simple tables, images as referenced files.

Flagged with a hint block instead of 1:1 conversion: complex tables with merged cells, mathematical formulas, multi-column layouts with cross-references, footnote linking. The hint block makes clear where the conversion sees its limit — you decide how to clean up.

Not in this version: annotations, form-field data, embedded files, OCG (optional content group) layers. These sit outside the page text that extraction reads and need separate handling. Support is planned for a later phase, once the current version runs stably.

How does the tool keep my PDF private?

Many free PDF-to-Markdown services upload the file to a server, convert it there, and send the result back. The business model often depends on that, because the server sees the content, even when the service claims to delete files after 24 hours. For confidential contracts, medical records or internal strategy PDFs that’s rarely acceptable.

None of that happens here. The PDF is parsed in your browser tab, the OCR model runs as a WebAssembly module inside the same tab, the Markdown is assembled in memory and offered as a download. Open the Network panel of your developer tools and watch: not a single byte of your PDF leaves your machine.

This tool is part of the Markdown converter family — a set of browser-only converters that prepare office formats for AI and wiki workflows:

  • DOCX to Markdown — Word documents straight to Markdown with heading structure and lists preserved.
  • XLSX to Markdown — Excel and ODS sheets as GFM pipe tables, multi-sheet support included.
  • HTML to Markdown — web pages or HTML snippets, file or paste mode.
  • Remove Metadata — strip EXIF, GPS and XMP fields from images and PDFs locally in the browser.
