Building an AI-Powered PDF Assistant from Scratch β€” A Deep Dive into PdfMind

TL;DR: This article provides an in-depth look at building a modern AI PDF assistant, covering core technical challenges and solutions including multimodal document understanding, intelligent context compression, scanned document OCR, LaTeX formula rendering, and Word export.


:bullseye: Background

In daily work and study, we often need to read large volumes of PDF documents β€” research papers, reports, manuals, etc. Traditional AI Q&A tools typically only do simple text extraction and struggle when faced with complex tables, mathematical formulas, or scanned PDFs.

PdfMind was created to solve these pain points.

PdfMind Demo


Core Features

| Feature | Description |
| --- | --- |
| :open_book: Smart Q&A | Multi-turn conversations powered by Gemini 2.5 Flash, with rigorous/casual modes |
| :camera_with_flash: Screenshot Questions | Select any region (charts/formulas) and get AI analysis with full document context |
| :bar_chart: Table Extraction | Multimodal recognition, handles complex headers and merged cells |
| :memo: Notes Export | One-click save Q&A to notes, export to high-quality Word documents (LaTeX supported) |
| :brain: Context Compression | AI generates structured memory, reducing token consumption by 80%+ |

:building_construction: Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        Frontend (React + Vite)               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚ PDF Viewer   β”‚  β”‚ Chat Panel   β”‚  β”‚  Notes Manager   β”‚    β”‚
β”‚  β”‚ (PDF.js)     β”‚  β”‚ (Markdown)   β”‚  β”‚  (Export DOCX)   β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚                 β”‚                   β”‚
          β–Ό                 β–Ό                   β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Cloudflare Pages Functions (Serverless)         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚ /api/ai      β”‚  β”‚ /api/ark/*   β”‚  β”‚  /api/word       β”‚    β”‚
β”‚  β”‚ (AI Gateway) β”‚  β”‚ (Scanned OCR)β”‚  β”‚  (Proxy)         β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚                 β”‚                   β”‚
          β–Ό                 β–Ό                   β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Google Gemini    β”‚ β”‚ Volcengine Ark   β”‚ β”‚ Cloud Run (Pandoc) β”‚
β”‚ 2.5 Flash        β”‚ β”‚ (Doubao Vision)  β”‚ β”‚ Word Export        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Technology Stack

| Layer | Technology | Rationale |
| --- | --- | --- |
| Frontend | React + Vite + Tailwind | High dev efficiency, blazing-fast HMR |
| PDF Rendering | PDF.js | Open-source standard; supports text selection and screenshots |
| Backend | Cloudflare Pages Functions | Zero cold start, global edge deployment |
| Main Chat AI | Gemini 2.5 Flash | Strong multimodal capabilities, large context window |
| OCR/Scanned | Volcengine Ark (Doubao Vision) | Excellent Chinese document support |
| Context Compression | Qwen-Flash (via MuleRun) | Cost-effective, stable structured output |
| Word Export | Pandoc (Cloud Run) | Industry standard, excellent formula/image support |

:light_bulb: Deep Dive: Core Technologies

1. Intelligent Context Compression: 80% Token Reduction

Pain Point: A 50-page PDF can contain 100K+ tokens after text extraction. Sending the full text for every query is prohibitively expensive.

Solution: AI Context Memory

// Preprocess on PDF upload
export async function preloadPdfText(file, onProgress) {
  // Step 1: Extract raw text
  const pdfText = await extractPdfText(file);

  // Step 2: AI generates structured memory
  const compressorPrompt = `
  # Role: Knowledge Architect
  
  Compress the document into structured "AI Context Memory":
  1. Meta Info (topic, purpose, type)
  2. Key Glossary (terminology)
  3. Logical Structure (document flow)
  4. Core Mechanisms (key formulas/concepts)
  5. Key Tables (critical table data)
  6. Key Data Points (important statistics)
  `;

  const contextMemory = await callQwenFlash(compressorPrompt, pdfText);
  CONTEXT_MEMORY_CACHE.set(file.id, contextMemory);
}
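
Once the memory is cached, each subsequent chat turn can send the compact memory instead of the raw extraction. A minimal sketch of that assembly step (the `buildChatMessages` helper and message shape are illustrative assumptions, not the repo's actual API):

```javascript
// Illustrative sketch: compose a chat request around the cached memory.
// `contextMemory` is the ~8K-token blob produced by the compressor above.
function buildChatMessages(contextMemory, history, question) {
  return [
    {
      role: "system",
      content:
        "You are a PDF assistant. Answer strictly from this AI Context Memory:\n" +
        contextMemory,
    },
    ...history, // prior { role, content } turns, already small
    { role: "user", content: question },
  ];
}
```

Every turn now costs roughly the memory size plus the conversation so far, rather than the full 100K+-token extraction.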

Results:

| Scenario | Raw Text | Compressed Memory | Reduction |
| --- | --- | --- | --- |
| 50-page paper | 120K tokens | 8K tokens | 93% |
| 100-page report | 250K tokens | 15K tokens | 94% |

2. Scanned PDF Processing: OCR Workflow

Scanned PDFs have no extractable text layer and require OCR.

Option A: Local Images + Gemini Vision (Fallback)

const images = await convertPdfToImages(file, 10);
const response = await callGeminiWithImages(prompt, images);

Option B: Volcengine Ark Document Understanding (Recommended)

// 1. Upload file to Ark
const fileId = await arkUpload(file);

// 2. Poll for processing completion
await pollFileStatus(fileId);

// 3. Direct chat (Ark handles OCR + vectorization internally)
const result = await arkChat(fileId, prompt);

Why Ark?

  • Optimized for long documents, no manual pagination needed
  • Built-in OCR + document structure understanding
  • Superior support for Chinese handwriting and complex layouts
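
`pollFileStatus` itself is not shown above; one plausible shape is a capped exponential-backoff loop, sketched here with a generic `checkStatus` callback standing in for the real Ark status endpoint (the status strings are assumptions):

```javascript
// Hypothetical polling helper; checkStatus() is assumed to resolve to
// "processing", "ready", or "failed" (the real Ark status values may differ).
async function pollUntilReady(checkStatus, { maxAttempts = 30, baseDelayMs = 1000 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await checkStatus();
    if (status === "ready") return true;
    if (status === "failed") throw new Error("Ark file processing failed");
    // Exponential backoff, capped at 10 seconds between polls
    const delayMs = Math.min(baseDelayMs * 2 ** attempt, 10000);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error("Timed out waiting for Ark file processing");
}
```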

3. Screenshot Questions: Deep Multimodal Correlation

Users can select any region in the PDF preview (e.g., complex charts), and the AI answers by combining the screenshot with full document context.

Key Challenge: Don’t just β€œdescribe the image” β€” must correlate with document context.

Prompt Design:

If the user sends a screenshot:
- Locate the corresponding paragraph in the full document
- Combine visual information with document text
- Example: "As shown in the figure..., this aligns with what's mentioned on Page X..."

Implementation:

// Image sent with message, placed before text so model sees visual context first
if (msg.image && isActiveImage) {
  content.push({ type: "image_url", image_url: { url: msg.image } });
}
content.push({ type: "text", text: msg.text });

4. Multimodal Table Extraction

Plain text extraction cannot understand table structure, especially merged cells and complex headers.

Solution: Send PDF images + extracted text together

export async function generateStructuredData(file) {
  // Text as auxiliary context
  const pdfText = await extractPdfText(file);

  // Images are core (preserve table lines, structure)
  const images = await convertPdfToImages(file, 10);

  const prompt = `
  Extract all bordered tables:
  - Merge headers: e.g., "Measurement" with "Before" and "After" -> ["Measurement-Before", "Measurement-After"]
  - Preserve original numbering: a., b., c. should NOT become 1, 2, 3
  - Output in JSON format
  `;

  return await callGeminiMultimodal(prompt, images, pdfText);
}
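
Downstream code then has to honor that header-merge convention. A hypothetical helper (not from the repo) showing the flattening the prompt asks the model to produce:

```javascript
// Illustrative: flatten a two-level header into the "Parent-Child" column
// names described in the prompt. `children` is an assumed input shape.
function flattenHeaders(headers) {
  // headers: [{ name, children? }] where children are sub-column names
  return headers.flatMap((h) =>
    h.children && h.children.length
      ? h.children.map((c) => `${h.name}-${c}`)
      : [h.name]
  );
}
```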

5. Perfect LaTeX Formula Rendering & Export

Frontend Rendering: KaTeX (much faster than MathJax for synchronous in-browser rendering)

Word Export: Pandoc tex_math_dollars extension

Gotcha: AI sometimes wraps formulas in backticks, breaking rendering.

Solution: Backend preprocessing

# Strip backticks
md_text = re.sub(r'`(\$[^\$`]+?\$)`', r'\1', md_text)

# Strip incorrect bold markers
md_text = re.sub(r'\*\*(\$[^\$\n]+?\$)\*\*', r'\1', md_text)

6. Word Export Service: Cloud Run + Pandoc

Why a Separate Service?

  • Cloudflare Workers cannot run Pandoc (a native binary)
  • Pandoc is the industry standard for Markdown → DOCX conversion

Architecture:

Cloudflare Pages Function (/api/word)
        β”‚
        β–Ό (Proxy request)
Google Cloud Run (Flask + Pandoc)
        β”‚
        β–Ό
Return generated .docx file
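
The Pages Function side of this flow can be sketched as a thin streaming proxy. Everything here is an assumed shape, not the repo's actual handler: `WORD_EXPORT_URL` is a hypothetical environment binding holding the Cloud Run URL, and a real Pages Function would `export` `onRequestPost`:

```javascript
// Sketch of /api/word as a Cloudflare Pages Function proxy (shape assumed).
async function onRequestPost({ request, env }) {
  // Forward the Markdown payload to the Cloud Run converter as-is
  const upstream = await fetch(env.WORD_EXPORT_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: request.body,
  });
  // Stream the generated .docx straight back to the browser
  return new Response(upstream.body, {
    status: upstream.status,
    headers: {
      "Content-Type":
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    },
  });
}
```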

Processing Flow:

def convert_markdown_to_docx(md_text, tmpdir):
    # 1. Protect math formulas (prevent Markdown processing from breaking them)
    # 2. Extract base64 images to temp files
    # 3. Preprocess list formatting
    # 4. Call Pandoc for conversion
    # 5. Return DOCX binary

Lesson Learned: Placeholder Collision Bug

# ❌ Wrong: PLACEHOLDER1 can match inside PLACEHOLDER10
placeholder = f"MATHINLINEPLACEHOLDER{i}END"

# ✅ Correct: bracket the index so no placeholder can match inside another
placeholder = f"MATHINLINEPLACEHOLDER[{i}]END"

# ...and restore using the exact same string, highest index first
for i in range(len(formulas) - 1, -1, -1):
    md_text = md_text.replace(f"MATHINLINEPLACEHOLDER[{i}]END", formulas[i])

:rocket: Deployment Architecture

Frontend + API Layer: Cloudflare Pages

# One-command deploy
npm run build
npm run pages:deploy

Advantages:

  • Global edge nodes, minimal latency
  • Functions with zero cold start
  • Generous free tier (100K requests/day)

Word Export Service: Google Cloud Run

gcloud run deploy word-export \
  --source ./docs/word-export \
  --region us-central1

Dockerfile Highlights:

FROM python:3.11-slim

# Install Pandoc
RUN wget https://github.com/jgm/pandoc/releases/download/3.2/pandoc-3.2-1-amd64.deb \
    && dpkg -i pandoc-3.2-1-amd64.deb

# Generate custom Word template (set font to Arial)
COPY generate_reference.py .
RUN python generate_reference.py

COPY . .
CMD exec gunicorn --bind :$PORT main:app

:bar_chart: Performance Metrics

| Metric | Value |
| --- | --- |
| Initial Load Time | < 2s |
| 50-page PDF Preprocessing | 8-15s |
| Single Q&A Response | 2-5s |
| Word Export | 1-3s |

:graduation_cap: Key Takeaways

  1. Multimodal is King: Pure text processing can never perfectly handle tables, charts, and formulas.
  2. Context Compression is Critical: For multi-turn conversations, pre-generated structured memory dramatically reduces costs.
  3. Prompt IS the Product: 80% of AI output quality depends on prompt design.
  4. Fallbacks are Essential: Scanned OCR failures, network timeouts β€” all need graceful degradation.
  5. Details Make the Experience: Formula formatting, numbering preservation, multilingual support β€” all require polish.

:folded_hands: Acknowledgments

Thanks to these open source projects:

  • Mozilla PDF.js
  • John MacFarlane’s Pandoc
  • KaTeX Team
  • Cloudflare’s generous free tier

About the Author: A full-stack developer who loves technology and tinkering. If you found this article helpful, feel free to leave a comment!
