Building an AI-Powered PDF Assistant from Scratch β€” A Deep Dive into PdfMind

TL;DR: This article provides an in-depth look at building a modern AI PDF assistant, covering core technical challenges and solutions including multimodal document understanding, intelligent context compression, scanned document OCR, LaTeX formula rendering, and Word export.


:bullseye: Background

In daily work and study, we often need to read large volumes of PDF documents β€” research papers, reports, manuals, etc. Traditional AI Q&A tools typically only do simple text extraction and struggle when faced with complex tables, mathematical formulas, or scanned PDFs.

PdfMind was created to solve these pain points.

PdfMind Demo


Core Features

| Feature | Description |
| --- | --- |
| :open_book: Smart Q&A | Multi-turn conversations powered by Gemini 2.5 Flash, with rigorous/casual modes |
| :camera_with_flash: Screenshot Questions | Select any region (charts/formulas) and get AI analysis with full document context |
| :bar_chart: Table Extraction | Multimodal recognition, handles complex headers and merged cells |
| :memo: Notes Export | One-click save Q&A to notes, export to high-quality Word documents (LaTeX supported) |
| :brain: Context Compression | AI generates structured memory, reducing token consumption by 80%+ |

:building_construction: Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        Frontend (React + Vite)               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚ PDF Viewer   β”‚  β”‚ Chat Panel   β”‚  β”‚  Notes Manager   β”‚    β”‚
β”‚  β”‚ (PDF.js)     β”‚  β”‚ (Markdown)   β”‚  β”‚  (Export DOCX)   β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚                 β”‚                   β”‚
          β–Ό                 β–Ό                   β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Cloudflare Pages Functions (Serverless)         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚ /api/ai      β”‚  β”‚ /api/ark/*   β”‚  β”‚  /api/word       β”‚    β”‚
β”‚  β”‚ (AI Gateway) β”‚  β”‚ (Scanned OCR)β”‚  β”‚  (Proxy)         β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚                 β”‚                   β”‚
          β–Ό                 β–Ό                   β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Google Gemini    β”‚ β”‚ Volcengine Ark   β”‚ β”‚ Cloud Run (Pandoc) β”‚
β”‚ 2.5 Flash        β”‚ β”‚ (Doubao Vision)  β”‚ β”‚ Word Export        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Technology Stack

| Layer | Technology | Rationale |
| --- | --- | --- |
| Frontend | React + Vite + Tailwind | High dev efficiency, blazing-fast HMR |
| PDF Rendering | PDF.js | Open-source standard; supports text selection and screenshots |
| Backend | Cloudflare Pages Functions | Zero cold start, global edge deployment |
| Main Chat AI | Gemini 2.5 Flash | Strong multimodal capabilities, large context window |
| OCR/Scanned | Volcengine Ark (Doubao Vision) | Excellent Chinese document support |
| Context Compression | Qwen-Flash (via MuleRun) | Cost-effective, stable structured output |
| Word Export | Pandoc (Cloud Run) | Industry standard, excellent formula/image support |

:light_bulb: Deep Dive: Core Technologies

1. Intelligent Context Compression: 80% Token Reduction

Pain Point: A 50-page PDF can contain 100K+ tokens after text extraction. Sending the full text for every query is prohibitively expensive.

Solution: AI Context Memory

// Preprocess on PDF upload
export async function preloadPdfText(file, onProgress) {
  // Step 1: Extract raw text
  const pdfText = await extractPdfText(file);

  // Step 2: AI generates structured memory
  const compressorPrompt = `
  # Role: Knowledge Architect
  
  Compress the document into structured "AI Context Memory":
  1. Meta Info (topic, purpose, type)
  2. Key Glossary (terminology)
  3. Logical Structure (document flow)
  4. Core Mechanisms (key formulas/concepts)
  5. Key Tables (critical table data)
  6. Key Data Points (important statistics)
  `;

  const contextMemory = await callQwenFlash(compressorPrompt, pdfText);
  CONTEXT_MEMORY_CACHE.set(file.id, contextMemory);
}
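
Once the memory is cached, each subsequent chat turn can send the compact memory instead of the raw extraction. A minimal sketch of that assembly step (the `buildChatMessages` helper and message shape are illustrative assumptions, not the repo's actual API):

```javascript
// Illustrative sketch: compose a chat request around the cached memory.
// `contextMemory` is the ~8K-token blob produced by the compressor above.
function buildChatMessages(contextMemory, history, question) {
  return [
    {
      role: "system",
      content:
        "You are a PDF assistant. Answer strictly from this AI Context Memory:\n" +
        contextMemory,
    },
    ...history, // prior { role, content } turns, already small
    { role: "user", content: question },
  ];
}
```

Every turn now costs roughly the memory size plus the conversation so far, rather than the full 100K+-token extraction.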

Results:

| Scenario | Raw Text | Compressed Memory | Reduction |
| --- | --- | --- | --- |
| 50-page paper | 120K tokens | 8K tokens | 93% |
| 100-page report | 250K tokens | 15K tokens | 94% |

2. Scanned PDF Processing: OCR Workflow

Scanned PDFs have no extractable text layer and require OCR.

Option A: Local Images + Gemini Vision (Fallback)

const images = await convertPdfToImages(file, 10);
const response = await callGeminiWithImages(prompt, images);

Option B: Volcengine Ark Document Understanding (Recommended)

// 1. Upload file to Ark
const fileId = await arkUpload(file);

// 2. Poll for processing completion
await pollFileStatus(fileId);

// 3. Direct chat (Ark handles OCR + vectorization internally)
const result = await arkChat(fileId, prompt);

Why Ark?

  • Optimized for long documents, no manual pagination needed
  • Built-in OCR + document structure understanding
  • Superior support for Chinese handwriting and complex layouts
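
`pollFileStatus` itself is not shown above; one plausible shape is a capped exponential-backoff loop, sketched here with a generic `checkStatus` callback standing in for the real Ark status endpoint (the status strings are assumptions):

```javascript
// Hypothetical polling helper; checkStatus() is assumed to resolve to
// "processing", "ready", or "failed" (the real Ark status values may differ).
async function pollUntilReady(checkStatus, { maxAttempts = 30, baseDelayMs = 1000 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await checkStatus();
    if (status === "ready") return true;
    if (status === "failed") throw new Error("Ark file processing failed");
    // Exponential backoff, capped at 10 seconds between polls
    const delayMs = Math.min(baseDelayMs * 2 ** attempt, 10000);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error("Timed out waiting for Ark file processing");
}
```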

3. Screenshot Questions: Deep Multimodal Correlation

Users can select any region in the PDF preview (e.g., complex charts), and the AI answers by combining the screenshot with full document context.

Key Challenge: Don’t just β€œdescribe the image” β€” must correlate with document context.

Prompt Design:

If the user sends a screenshot:
- Locate the corresponding paragraph in the full document
- Combine visual information with document text
- Example: "As shown in the figure..., this aligns with what's mentioned on Page X..."

Implementation:

// Image sent with message, placed before text so model sees visual context first
if (msg.image && isActiveImage) {
  content.push({ type: "image_url", image_url: { url: msg.image } });
}
content.push({ type: "text", text: msg.text });

4. Multimodal Table Extraction

Plain text extraction cannot understand table structure, especially merged cells and complex headers.

Solution: Send PDF images + extracted text together

export async function generateStructuredData(file) {
  // Text as auxiliary context
  const pdfText = await extractPdfText(file);

  // Images are core (preserve table lines, structure)
  const images = await convertPdfToImages(file, 10);

  const prompt = `
  Extract all bordered tables:
  - Merge headers: e.g., "Measurement" with "Before" and "After" -> ["Measurement-Before", "Measurement-After"]
  - Preserve original numbering: a., b., c. should NOT become 1, 2, 3
  - Output in JSON format
  `;

  return await callGeminiMultimodal(prompt, images, pdfText);
}
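
Downstream code then has to honor that header-merge convention. A hypothetical helper (not from the repo) showing the flattening the prompt asks the model to produce:

```javascript
// Illustrative: flatten a two-level header into the "Parent-Child" column
// names described in the prompt. `children` is an assumed input shape.
function flattenHeaders(headers) {
  // headers: [{ name, children? }] where children are sub-column names
  return headers.flatMap((h) =>
    h.children && h.children.length
      ? h.children.map((c) => `${h.name}-${c}`)
      : [h.name]
  );
}
```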

5. Perfect LaTeX Formula Rendering & Export

Frontend Rendering: KaTeX (much faster than MathJax for synchronous in-browser rendering)

Word Export: Pandoc tex_math_dollars extension

Gotcha: AI sometimes wraps formulas in backticks, breaking rendering.

Solution: Backend preprocessing

# Strip backticks
md_text = re.sub(r'`(\$[^\$`]+?\$)`', r'\1', md_text)

# Strip incorrect bold markers
md_text = re.sub(r'\*\*(\$[^\$\n]+?\$)\*\*', r'\1', md_text)

6. Word Export Service: Cloud Run + Pandoc

Why a Separate Service?

  • Cloudflare Workers cannot run Pandoc (a native binary)
  • Pandoc is the industry standard for Markdown → DOCX conversion

Architecture:

Cloudflare Pages Function (/api/word)
        β”‚
        β–Ό (Proxy request)
Google Cloud Run (Flask + Pandoc)
        β”‚
        β–Ό
Return generated .docx file
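
The Pages Function side of this flow can be sketched as a thin streaming proxy. Everything here is an assumed shape, not the repo's actual handler: `WORD_EXPORT_URL` is a hypothetical environment binding holding the Cloud Run URL, and a real Pages Function would `export` `onRequestPost`:

```javascript
// Sketch of /api/word as a Cloudflare Pages Function proxy (shape assumed).
async function onRequestPost({ request, env }) {
  // Forward the Markdown payload to the Cloud Run converter as-is
  const upstream = await fetch(env.WORD_EXPORT_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: request.body,
  });
  // Stream the generated .docx straight back to the browser
  return new Response(upstream.body, {
    status: upstream.status,
    headers: {
      "Content-Type":
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    },
  });
}
```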

Processing Flow:

def convert_markdown_to_docx(md_text, tmpdir):
    # 1. Protect math formulas (prevent Markdown processing from breaking them)
    # 2. Extract base64 images to temp files
    # 3. Preprocess list formatting
    # 4. Call Pandoc for conversion
    # 5. Return DOCX binary

Lesson Learned: Placeholder Collision Bug

# ❌ Wrong: PLACEHOLDER1 can match inside PLACEHOLDER10
placeholder = f"MATHINLINEPLACEHOLDER{i}END"

# ✅ Correct: bracket the index so no placeholder can match inside another
placeholder = f"MATHINLINEPLACEHOLDER[{i}]END"

# ...and restore using the exact same string, highest index first
for i in range(len(formulas) - 1, -1, -1):
    md_text = md_text.replace(f"MATHINLINEPLACEHOLDER[{i}]END", formulas[i])

:rocket: Deployment Architecture

Frontend + API Layer: Cloudflare Pages

# One-command deploy
npm run build
npm run pages:deploy

Advantages:

  • Global edge nodes, minimal latency
  • Functions with zero cold start
  • Generous free tier (100K requests/day)

Word Export Service: Google Cloud Run

gcloud run deploy word-export \
  --source ./docs/word-export \
  --region us-central1

Dockerfile Highlights:

FROM python:3.11-slim

# Install Pandoc
RUN wget https://github.com/jgm/pandoc/releases/download/3.2/pandoc-3.2-1-amd64.deb \
    && dpkg -i pandoc-3.2-1-amd64.deb

# Generate custom Word template (set font to Arial)
COPY generate_reference.py .
RUN python generate_reference.py

COPY . .
CMD exec gunicorn --bind :$PORT main:app

:bar_chart: Performance Metrics

| Metric | Value |
| --- | --- |
| Initial Load Time | < 2s |
| 50-page PDF Preprocessing | 8-15s |
| Single Q&A Response | 2-5s |
| Word Export | 1-3s |

:graduation_cap: Key Takeaways

  1. Multimodal is King: Pure text processing can never perfectly handle tables, charts, and formulas.
  2. Context Compression is Critical: For multi-turn conversations, pre-generated structured memory dramatically reduces costs.
  3. Prompt IS the Product: 80% of AI output quality depends on prompt design.
  4. Fallbacks are Essential: Scanned OCR failures, network timeouts β€” all need graceful degradation.
  5. Details Make the Experience: Formula formatting, numbering preservation, multilingual support β€” all require polish.

:folded_hands: Acknowledgments

Thanks to these open source projects:

  • Mozilla PDF.js
  • John MacFarlane’s Pandoc
  • KaTeX Team
  • Cloudflare’s generous free tier

About the Author: A full-stack developer who loves technology and tinkering. If you found this article helpful, feel free to leave a comment!
