TL;DR: This article provides an in-depth look at building a modern AI PDF assistant, covering core technical challenges and solutions including multimodal document understanding, intelligent context compression, scanned document OCR, LaTeX formula rendering, and Word export.
Background
In daily work and study, we often need to read large volumes of PDF documents: research papers, reports, manuals, and more. Traditional AI Q&A tools typically do only simple text extraction and struggle with complex tables, mathematical formulas, and scanned PDFs.
PdfMind was created to solve these pain points.
Core Features
| Feature | Description |
|---|---|
| Smart Q&A | Multi-turn conversations powered by Gemini 2.5 Flash, with rigorous/casual modes |
| Screenshot Analysis | Select any region (charts/formulas) and get AI analysis with full document context |
| Table Extraction | Multimodal recognition that handles complex headers and merged cells |
| Notes & Word Export | One-click save of Q&A to notes; export to high-quality Word documents (LaTeX supported) |
| Context Compression | AI-generated structured memory reduces token consumption by 80%+ |
Architecture
┌──────────────────────────────────────────────────────────────┐
│                   Frontend (React + Vite)                    │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────────┐  │
│  │  PDF Viewer  │   │  Chat Panel  │   │  Notes Manager   │  │
│  │   (PDF.js)   │   │  (Markdown)  │   │  (Export DOCX)   │  │
│  └──────┬───────┘   └──────┬───────┘   └────────┬─────────┘  │
└─────────┼──────────────────┼────────────────────┼────────────┘
          │                  │                    │
          ▼                  ▼                    ▼
┌──────────────────────────────────────────────────────────────┐
│           Cloudflare Pages Functions (Serverless)            │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────────┐  │
│  │   /api/ai    │   │  /api/ark/*  │   │    /api/word     │  │
│  │ (AI Gateway) │   │ (Scanned OCR)│   │     (Proxy)      │  │
│  └──────┬───────┘   └──────┬───────┘   └────────┬─────────┘  │
└─────────┼──────────────────┼────────────────────┼────────────┘
          │                  │                    │
          ▼                  ▼                    ▼
┌──────────────────┐ ┌──────────────────┐ ┌────────────────────┐
│  Google Gemini   │ │  Volcengine Ark  │ │ Cloud Run (Pandoc) │
│    2.5 Flash     │ │ (Doubao Vision)  │ │    Word Export     │
└──────────────────┘ └──────────────────┘ └────────────────────┘
Technology Stack
| Layer | Technology | Rationale |
|---|---|---|
| Frontend | React + Vite + Tailwind | High dev efficiency, blazing fast HMR |
| PDF Rendering | PDF.js | Open source standard, supports text selection and screenshots |
| Backend | Cloudflare Pages Functions | Zero cold start, global edge deployment |
| Main Chat AI | Gemini 2.5 Flash | Strong multimodal capabilities, large context window |
| OCR/Scanned | Volcengine Ark (Doubao Vision) | Excellent Chinese document support |
| Context Compression | Qwen-Flash (via MuleRun) | Cost-effective, stable structured output |
| Word Export | Pandoc (Cloud Run) | Industry standard, excellent formula/image support |
Deep Dive: Core Technologies
1. Intelligent Context Compression: 80% Token Reduction
Pain Point: A 50-page PDF can contain 100K+ tokens after text extraction. Sending the full text for every query is prohibitively expensive.
Solution: AI Context Memory
// Preprocess on PDF upload
export async function preloadPdfText(file, onProgress) {
  // Step 1: Extract raw text
  const pdfText = await extractPdfText(file);

  // Step 2: AI generates structured memory
  const compressorPrompt = `
# Role: Knowledge Architect
Compress the document into structured "AI Context Memory":
1. Meta Info (topic, purpose, type)
2. Key Glossary (terminology)
3. Logical Structure (document flow)
4. Core Mechanisms (key formulas/concepts)
5. Key Tables (critical table data)
6. Key Data Points (important statistics)
`;
  const contextMemory = await callQwenFlash(compressorPrompt, pdfText);
  CONTEXT_MEMORY_CACHE.set(file.id, contextMemory);
}
Results:
| Scenario | Raw Text | Compressed Memory | Compression |
|---|---|---|---|
| 50-page paper | 120K tokens | 8K tokens | 93% |
| 100-page report | 250K tokens | 15K tokens | 94% |
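At query time, the cached memory stands in for the raw text in the system prompt. A minimal sketch of that assembly (the cache and the function names here are illustrative, not the project's actual API):

```python
# Illustrative: answer questions using the compressed memory
# instead of the full extracted text. Names are hypothetical.
CONTEXT_MEMORY_CACHE = {}

def build_chat_messages(file_id: str, question: str) -> list[dict]:
    """Assemble chat messages around the ~8K-token structured memory."""
    memory = CONTEXT_MEMORY_CACHE.get(file_id, "")
    system = (
        "You are a PDF assistant. Answer strictly from the document "
        "memory below.\n\n# AI Context Memory\n" + memory
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

CONTEXT_MEMORY_CACHE["paper-1"] = "1. Meta Info: survey of ..."
msgs = build_chat_messages("paper-1", "What is the main contribution?")
```

Every follow-up turn reuses the same small memory block, which is where the token savings in the table above come from.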
2. Scanned PDF Processing: OCR Workflow
Scanned PDFs have no extractable text layer and require OCR.
Option A: Local Images + Gemini Vision (Fallback)
const images = await convertPdfToImages(file, 10);
const response = await callGeminiWithImages(prompt, images);
Option B: Volcengine Ark Document Understanding (Recommended)
// 1. Upload file to Ark
const fileId = await arkUpload(file);
// 2. Poll for processing completion
await pollFileStatus(fileId);
// 3. Direct chat (Ark handles OCR + vectorization internally)
const result = await arkChat(fileId, prompt);
Why Ark?
- Optimized for long documents, no manual pagination needed
- Built-in OCR + document structure understanding
- Superior support for Chinese handwriting and complex layouts
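The "poll for processing completion" step can be sketched as a generic backoff loop; `get_status` below is a stand-in for whatever status call the Ark SDK actually exposes (hypothetical here):

```python
import time

def poll_until_ready(get_status, timeout_s=120.0, interval_s=1.0, max_interval_s=8.0):
    """Poll a status callable until it reports 'ready', with exponential backoff.

    `get_status` stands in for the real Ark file-status call.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status == "ready":
            return True
        if status == "failed":
            raise RuntimeError("document processing failed")
        time.sleep(interval_s)
        # Back off up to a ceiling so long documents don't hammer the API
        interval_s = min(interval_s * 2, max_interval_s)
    raise TimeoutError("document processing timed out")

# Example: a fake status source that becomes ready on the third call
states = iter(["processing", "processing", "ready"])
assert poll_until_ready(lambda: next(states), timeout_s=5, interval_s=0.01) is True
```

Raising on `"failed"` instead of looping forever is what lets the caller fall back to Option A.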
3. Screenshot Questions: Deep Multimodal Correlation
Users can select any region in the PDF preview (e.g., complex charts), and the AI answers by combining the screenshot with full document context.
Key Challenge: Don't just "describe the image"; the answer must correlate with the document context.
Prompt Design:
If the user sends a screenshot:
- Locate the corresponding paragraph in the full document
- Combine visual information with document text
- Example: "As shown in the figure..., this aligns with what's mentioned on Page X..."
Implementation:
// The image is sent with the message, placed before the text so the model sees the visual context first
if (msg.image && isActiveImage) {
  content.push({ type: "image_url", image_url: { url: msg.image } });
}
content.push({ type: "text", text: msg.text });
4. Multimodal Table Extraction
Plain text extraction cannot understand table structure, especially merged cells and complex headers.
Solution: Send PDF images + extracted text together
export async function generateStructuredData(file) {
  // Text serves as auxiliary context
  const pdfText = await extractPdfText(file);

  // Images are the core input (they preserve table lines and structure)
  const images = await convertPdfToImages(file, 10);

  const prompt = `
Extract all bordered tables:
- Merge headers: e.g., "Measurement" with "Before" and "After" -> ["Measurement-Before", "Measurement-After"]
- Preserve original numbering: a., b., c. should NOT become 1, 2, 3
- Output in JSON format
`;
  return await callGeminiMultimodal(prompt, images, pdfText);
}
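For the merged-header case in the prompt, the target JSON might look like the following (an illustrative shape with made-up rows, not the app's exact schema):

```python
import json

# Illustrative target structure for a table whose "Measurement" header
# spans "Before" and "After" sub-columns, with original a./b. numbering kept.
table = {
    "headers": ["Item", "Measurement-Before", "Measurement-After"],
    "rows": [
        ["a. Weight (kg)", "72.4", "68.1"],
        ["b. Height (cm)", "175", "175"],
    ],
}
print(json.dumps(table, indent=2))
```

Flattening the two-level header into hyphenated column names keeps the table rectangular, which downstream Markdown and DOCX rendering both require.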
5. Perfect LaTeX Formula Rendering & Export
Frontend Rendering: KaTeX (substantially faster than MathJax for synchronous rendering)
Word Export: Pandoc tex_math_dollars extension
Gotcha: AI sometimes wraps formulas in backticks, breaking rendering.
Solution: Backend preprocessing
import re

# Strip backticks wrapped around inline formulas
md_text = re.sub(r'`(\$[^\$`]+?\$)`', r'\1', md_text)
# Strip stray bold markers around inline formulas
md_text = re.sub(r'\*\*(\$[^\$\n]+?\$)\*\*', r'\1', md_text)
6. Word Export Service: Cloud Run + Pandoc
Why a Separate Service?
- Cloudflare Workers cannot run Pandoc (binary program)
- Pandoc is the industry standard for Markdown → DOCX conversion
Architecture:
Cloudflare Pages Function (/api/word)
        │
        ▼  (proxy request)
Google Cloud Run (Flask + Pandoc)
        │
        ▼
Return the generated .docx file
Processing Flow:
def convert_markdown_to_docx(md_text, tmpdir):
# 1. Protect math formulas (prevent Markdown processing from breaking them)
# 2. Extract base64 images to temp files
# 3. Preprocess list formatting
# 4. Call Pandoc for conversion
# 5. Return DOCX binary
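Step 2 can be sketched as a regex pass that writes each inline data URI to a temp file and rewrites the Markdown link so Pandoc can embed the image; a simplified version, not the service's exact code:

```python
import base64
import os
import re

# Matches Markdown images with inline base64 payloads (png/jpeg only here)
DATA_URI = re.compile(r'!\[([^\]]*)\]\(data:image/(png|jpeg);base64,([A-Za-z0-9+/=]+)\)')

def extract_base64_images(md_text: str, tmpdir: str) -> str:
    """Replace inline base64 images with file references Pandoc can read."""
    counter = 0

    def _save(match: re.Match) -> str:
        nonlocal counter
        alt, ext, payload = match.group(1), match.group(2), match.group(3)
        counter += 1
        path = os.path.join(tmpdir, f"img{counter}.{ext}")
        with open(path, "wb") as f:
            f.write(base64.b64decode(payload))
        return f"![{alt}]({path})"

    return DATA_URI.sub(_save, md_text)
```

Pandoc then picks the files up from `tmpdir` during conversion, instead of choking on multi-megabyte data URIs.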
Lesson Learned: Placeholder Collision Bug
# ❌ Wrong: when replacing, PLACEHOLDER1 also matches the start of PLACEHOLDER10
placeholder = f"MATHINLINEPLACEHOLDER{i}END"

# ✅ Correct: brackets isolate the index
placeholder = f"MATHINLINEPLACEHOLDER[{i}]END"

# ...and restore in reverse order
for i in range(len(formulas) - 1, -1, -1):
    md_text = md_text.replace(f"MATHINLINEPLACEHOLDER[{i}]END", formulas[i])
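The full protect/convert/restore cycle with bracketed placeholders can be sketched end to end (simplified; in the real service, Pandoc runs between the two halves):

```python
import re

# Inline math: a $...$ span with no nested $ or newline
MATH = re.compile(r'\$[^$\n]+?\$')

def protect_math(md_text: str):
    """Swap inline formulas for collision-safe bracketed placeholders."""
    formulas = MATH.findall(md_text)
    for i, formula in enumerate(formulas):
        md_text = md_text.replace(formula, f"MATHINLINEPLACEHOLDER[{i}]END", 1)
    return md_text, formulas

def restore_math(md_text: str, formulas):
    """Restore formulas; reverse order is belt-and-braces on top of the brackets."""
    for i in range(len(formulas) - 1, -1, -1):
        md_text = md_text.replace(f"MATHINLINEPLACEHOLDER[{i}]END", formulas[i])
    return md_text
```

With the brackets in place, placeholder `[1]` can never be consumed as a prefix of `[10]`, so the round trip is lossless.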
Deployment Architecture
Frontend + API Layer: Cloudflare Pages
# One-command deploy
npm run build
npm run pages:deploy
Advantages:
- Global edge nodes, minimal latency
- Functions with zero cold start
- Generous free tier (100K requests/day)
Word Export Service: Google Cloud Run
gcloud run deploy word-export \
--source ./docs/word-export \
--region us-central1
Dockerfile Highlights:
FROM python:3.11-slim
# Install wget, then Pandoc (slim images don't ship wget)
RUN apt-get update && apt-get install -y --no-install-recommends wget \
    && wget https://github.com/jgm/pandoc/releases/download/3.2/pandoc-3.2-1-amd64.deb \
    && dpkg -i pandoc-3.2-1-amd64.deb \
    && rm pandoc-3.2-1-amd64.deb
# Generate custom Word template (set font to Arial)
COPY generate_reference.py .
RUN python generate_reference.py
COPY . .
CMD exec gunicorn --bind :$PORT main:app
Performance Metrics
| Metric | Value |
|---|---|
| Initial Load Time | < 2s |
| 50-page PDF Preprocessing | 8-15s |
| Single Q&A Response | 2-5s |
| Word Export | 1-3s |
Key Takeaways
- Multimodal is King: Pure text processing can never perfectly handle tables, charts, and formulas.
- Context Compression is Critical: For multi-turn conversations, pre-generated structured memory dramatically reduces costs.
- Prompt IS the Product: 80% of AI output quality depends on prompt design.
- Fallbacks are Essential: Scanned-OCR failures and network timeouts all need graceful degradation.
- Details Make the Experience: Formula formatting, numbering preservation, and multilingual support all require polish.
Resources
- Try it Live: https://mulerun.com/@Aichimp/pdf
- GitHub Open Source: https://github.com/aichimp/PDF_mind
- Become a MuleRun Creator: Apply Here - Publish your AI apps on the MuleRun platform
Acknowledgments
Thanks to these open source projects:
- Mozilla PDF.js
- John MacFarlane's Pandoc
- KaTeX Team
- Cloudflare's generous free tier
About the Author: A full-stack developer who loves technology and tinkering. If you found this article helpful, feel free to leave a comment!