What is PDF/R & Why It Matters

From the Humble Scanner to the Intelligent Document — PDF/R Is the Foundation

PDF/R (PDF/Raster) was co-developed by the TWAIN Working Group and the PDF Association to solve a problem that has existed since the beginning of document scanning: scanners produce raster images, but the world runs on documents. TIFF and JPEG are image formats — they carry pixels, not documents. PDF/R changes that.

Published as ISO 23504 in 2020, PDF/R is a strictly defined subset of the PDF specification, purpose-built for scanned raster image documents. It is compact enough to be generated directly by scanner firmware on resource-constrained embedded systems, yet its output is fully valid PDF that any conforming PDF reader can open. It is the native output format of TWAIN Direct, the cloud-native, driverless scanning protocol, and can be adopted as a standalone format for any scanning or imaging workflow.

Now, as the document capture ecosystem evolves toward AI pipelines, content authenticity standards, business-process automation, and next-generation compression, the TWAIN Innovation Cloud is convening developers and adopters to explore how PDF/R extends to meet these emerging requirements — and to build the capabilities together.

PDF/R Foundation Capabilities (ISO 23504)
What PDF/R Delivers Today — The Open Baseline
Image Support
  • Bitonal (1-bit)
  • Grayscale (8-bit)
  • RGB Color (24-bit)
  • JPEG compression
  • CCITT Group 4 Fax (lossless)
  • Uncompressed
  • Multi-page documents
PDF/R Document Features
  • 100% PDF compatible
  • Metadata support
  • Digital signature support
  • Encryption support
  • Fixed-buffer generation (firmware)
  • No full PDF parser required to generate
  • Portable across all PDF consumers
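The firmware-oriented items above (fixed-buffer generation, no full PDF parser required) follow from how little structure a single-image PDF needs. As a rough illustration only, the Python sketch below writes a one-page PDF wrapping an uncompressed 8-bit grayscale image by appending objects and recording their byte offsets. The function name `build_raster_pdf` and the 300 dpi default are assumptions made for this example, and the output is a plain minimal PDF that has not been checked against ISO 23504 conformance rules.

```python
def build_raster_pdf(width, height, gray_pixels, dpi=300):
    """Wrap raw 8-bit grayscale pixels in a minimal one-page PDF (illustrative only)."""
    if len(gray_pixels) != width * height:
        raise ValueError("expected exactly one byte per pixel")

    # PDF user space is measured in points (1/72 inch).
    page_w = width * 72 / dpi
    page_h = height * 72 / dpi

    # Content stream: scale the unit square to the page size and paint the image.
    content = f"q {page_w:.2f} 0 0 {page_h:.2f} 0 0 cm /Im0 Do Q".encode("ascii")

    image_dict = (
        f"<< /Type /XObject /Subtype /Image /Width {width} /Height {height} "
        f"/ColorSpace /DeviceGray /BitsPerComponent 8 /Length {len(gray_pixels)} >>\n"
    ).encode("ascii")
    page_dict = (
        f"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 {page_w:.2f} {page_h:.2f}] "
        f"/Resources << /XObject << /Im0 4 0 R >> >> /Contents 5 0 R >>"
    ).encode("ascii")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",                      # object 1
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",              # object 2
        page_dict,                                                  # object 3
        image_dict + b"stream\n" + gray_pixels + b"\nendstream",   # object 4
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",  # object 5
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, needed for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"          # mandatory free entry for object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each xref entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)
```

Because every object is appended once and only its offset is remembered, a writer of this shape needs no random access to earlier output, which is the property that makes streaming generation from a small buffer plausible.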
Emerging Capabilities — Where PDF/R Is Going
Content Provenance · EMERGING
C2PA Content Credentials — Tamper-Evident Provenance in Every Scan
By embedding a C2PA Content Credentials manifest within the PDF/R document structure, every scanned file can carry a cryptographically signed, tamper-evident record of its origin: the scanning device, the operator identity, the capture timestamp, and whether AI was involved in any processing step. This manifest travels with the PDF/R file to every downstream recipient. Any C2PA-conformant validator — including Adobe, Microsoft, and contentcredentials.org — can verify the document's full provenance chain and detect any post-capture modification. C2PA-enabled PDF/R turns a scanned document into an authenticated document — one that proves where it came from and that it hasn't been altered.
// C2PA manifest embedded in PDF/R XMP metadata stream or as PDF attachment.
// Signed using C2PA Trust List certificate. Verifiable by any C2PA Validator.
// Aligns with C2PA Specification v2.x Generator Product conformance requirements.
C2PA Spec v2.x · Cryptographic Signing · Tamper Detection · AI Disclosure · Ecosystem Interoperable
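The tamper-evidence idea described above (a signed record that binds capture details to a hash of the content) can be sketched with the Python standard library. This is deliberately not the C2PA format: real C2PA manifests are stored in JUMBF containers and signed with COSE using X.509 certificates, whereas this sketch uses a JSON claim and an HMAC with a shared key purely as a stand-in. All function and field names here are hypothetical.

```python
import hashlib
import hmac
import json


def make_provenance_record(pdf_bytes, device, operator, captured_at, ai_used, key):
    """Build a signed claim that binds capture details to the file's SHA-256 hash."""
    claim = {
        "device": device,
        "operator": operator,
        "captured_at": captured_at,
        "ai_used": ai_used,
        "content_sha256": hashlib.sha256(pdf_bytes).hexdigest(),
    }
    # Canonical serialisation so signer and verifier hash identical bytes.
    payload = json.dumps(claim, sort_keys=True).encode("utf-8")
    # HMAC is a stand-in for the certificate-based signature C2PA actually uses.
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"claim": claim, "signature": signature}


def verify_provenance_record(pdf_bytes, record, key):
    """Return True only if the claim is unmodified and matches the file content."""
    payload = json.dumps(record["claim"], sort_keys=True).encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, record["signature"]):
        return False  # the claim itself was altered or signed with another key
    current_hash = hashlib.sha256(pdf_bytes).hexdigest()
    return record["claim"]["content_sha256"] == current_hash
```

In this sketch the record is kept beside the file rather than inside it. An embedded manifest has to exclude its own bytes from the content hash, which is one of the details the C2PA specification defines and this example does not attempt.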
Business-Process Metadata · EXPANDING
XMP Metadata for Business Process — Routing, Workflow, and Compliance in the File
Extensible Metadata Platform (XMP) is an ISO-standardized metadata framework (ISO 16684-1) that allows structured, machine-readable metadata to be embedded directly within PDF/R files. It extends well beyond basic document information to full business-process metadata: document type, routing destination, approval state, retention schedule, compliance tags, department codes, custodian identity, and any custom workflow attributes your organisation requires. When a PDF/R file arrives at an ECM, ERP, or AI pipeline with XMP business-process metadata embedded, the receiving system can determine what the document is, where it belongs, what rules apply to it, and how to process it, without human routing, interpretation, or re-keying. Because the XMP packet is stored inside the file, it survives file copy and email transmission, and it carries through format conversion when the converting tool preserves XMP, making it a strong mechanism for encoding business intent at the scanner.
// XMP packet embedded in PDF/R document metadata stream.
// Custom schemas for workflow routing, retention policy, document classification.
// ISO 16684-1 (XMP) compliant. Readable by any XMP-aware application or AI pipeline.
ISO 16684-1 (XMP) · Workflow Routing · Retention Policy · Custom Schemas · Machine-Readable
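To make the custom-schema idea concrete, the sketch below builds an XMP packet carrying workflow properties and reads them back, using only the Python standard library. The `wf` namespace URI and the property names are hypothetical placeholders for whatever schema an organisation defines; the packet wrapper and the `x:xmpmeta` / `rdf:RDF` / `rdf:Description` nesting follow the usual XMP layout. Placing the packet into the PDF metadata stream is outside the scope of this example.

```python
import xml.etree.ElementTree as ET

X_NS = "adobe:ns:meta/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
WF_NS = "http://example.com/ns/workflow/1.0/"  # hypothetical custom schema namespace


def build_xmp_packet(fields):
    """Serialise simple key/value workflow properties as an XMP packet string."""
    ET.register_namespace("x", X_NS)
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("wf", WF_NS)

    xmpmeta = ET.Element(f"{{{X_NS}}}xmpmeta")
    rdf = ET.SubElement(xmpmeta, f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": ""})
    for name, value in fields.items():
        # Each property becomes a child element in the custom namespace.
        ET.SubElement(desc, f"{{{WF_NS}}}{name}").text = str(value)

    body = ET.tostring(xmpmeta, encoding="unicode")
    header = '<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    return header + "\n" + body + "\n" + '<?xpacket end="w"?>'


def read_workflow_fields(packet):
    """Extract the custom workflow properties from a packet built above."""
    start = packet.index("<x:xmpmeta")
    end = packet.index("</x:xmpmeta>") + len("</x:xmpmeta>")
    root = ET.fromstring(packet[start:end])
    desc = next(root.iter(f"{{{RDF_NS}}}Description"))
    prefix = f"{{{WF_NS}}}"
    return {
        child.tag[len(prefix):]: child.text
        for child in desc
        if child.tag.startswith(prefix)
    }
```

A receiving system could call something like `read_workflow_fields` and route on `documentType` or `retentionYears` without opening the raster content at all. Values come back as strings, so typed properties need an agreed convention in the schema.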
Document Structure · ENHANCED
Properly Structured PDF Output — Document Semantics Beyond Raster Images
PDF/R's strict subset definition keeps implementation simple for embedded systems, but properly structured PDF/R output goes further by adding correct PDF document structure that enables richer integration with PDF-consuming downstream systems. This includes a proper page tree, document-level metadata in the PDF document catalog, correctly formed XMP metadata streams, an optional OCR text layer (invisible text positioned over the images), proper PDF encryption and permission flags, and digital signature fields. The result is a PDF/R file that not only passes PDF validation but also participates fully in PDF-based workflow ecosystems: archival (with PDF/A compatibility in mind), accessibility pipelines, and enterprise content management systems that expect document-grade PDF, not just image-wrapped PDF.
// Proper PDF document catalog and page tree structure.
// XMP metadata stream at document level. Optional invisible OCR text overlay.
// PDF encryption flags, permission controls, digital signature fields.
// Validated against PDF 1.7 / ISO 32000 conformance requirements.
Document Catalog · OCR Text Layer · Digital Signatures · Encryption Flags · PDF/A Alignment
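The invisible OCR text layer mentioned above relies on PDF text rendering mode 3, which positions glyphs for search and selection without painting them. As a hedged sketch, the function below turns OCR words with pixel bounding boxes into content-stream operators. The function names, the `(text, x, y, width, height)` tuple layout, and the `F1` font resource name are assumptions for illustration; a real writer must also define that font in the page resources and handle text outside the font's encoding.

```python
def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_stream(words, page_height_px, dpi=300, font="F1"):
    """Build content-stream operators that place OCR words invisibly over a scan.

    `words` is an iterable of (text, x, y, width, height) tuples in pixels,
    measured from the top-left corner of the scanned image.
    """
    scale = 72.0 / dpi        # pixels to PDF points
    ops = ["BT", "3 Tr"]      # begin text; rendering mode 3 = neither fill nor stroke
    for text, x, y, width, height in words:
        font_size = height * scale
        left = x * scale
        # PDF's origin is bottom-left, so flip the vertical axis.
        baseline = (page_height_px - (y + height)) * scale
        ops.append(f"/{font} {font_size:.2f} Tf")
        ops.append(f"1 0 0 1 {left:.2f} {baseline:.2f} Tm")
        ops.append(f"({escape_pdf_string(text)}) Tj")
    ops.append("ET")
    return "\n".join(ops)
```

This sketch sizes the font from the box height and ignores the box width. Production OCR layers usually also apply horizontal scaling so that the selectable text spans the same width as the printed word.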
Next-Generation Compression · EMERGING
JPEG-XL — Superior Quality at Lower File Sizes for Scanned Documents
JPEG-XL (JXL) is the next-generation image compression standard (ISO/IEC 18181) that delivers dramatically superior quality-to-filesize ratios compared to legacy JPEG — particularly for document scanning use cases. For scanned text and mixed-content documents, JPEG-XL's modular architecture enables lossless compression that beats PNG by ~35%, and lossy compression that beats JPEG at equivalent visual quality by 60% or better. JPEG-XL also supports progressive decoding, HDR, wide color gamut, animation, and lossless re-encoding of existing JPEG files without quality loss (via its JPEG bitstream recompression capability). Incorporating JPEG-XL as an additional compression option within the PDF/R framework would allow scanners and capture devices to produce significantly smaller files with higher fidelity — critical for mobile capture, cloud transmission, and long-term archival workflows.
// JPEG-XL (ISO/IEC 18181) as an additional PDF/R compression stream type.
// Lossless mode: ~35% smaller than PNG, perfect for bitonal / text documents.
// Lossy mode: 60%+ smaller than JPEG at equivalent SSIM score.
// JPEG recompression: lossless transcoding of existing JPEG without quality loss.
ISO/IEC 18181 (JPEG-XL) · Lossless + Lossy · Progressive Decoding · Wide Color Gamut · JPEG Recompression
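If JPEG-XL became an additional stream type alongside legacy JPEG, a consumer or test harness would need a cheap way to tell the two apart before choosing a decoder. The sketch below checks leading signature bytes only: `FF 0A` for a bare JPEG-XL codestream, the 12-byte `JXL ` signature box for the JPEG-XL container, and `FF D8 FF` for legacy JPEG. The function name and return labels are hypothetical, and how a JPEG-XL stream would actually be declared inside PDF/R is not specified in the text above.

```python
JXL_CODESTREAM_MAGIC = b"\xff\x0a"
JXL_CONTAINER_MAGIC = bytes.fromhex("0000000c4a584c200d0a870a")  # 'JXL ' signature box
JPEG_MAGIC = b"\xff\xd8\xff"


def sniff_image_stream(data):
    """Classify an image byte stream by its leading signature bytes."""
    if data.startswith(JXL_CONTAINER_MAGIC):
        return "jpeg-xl-container"
    if data.startswith(JXL_CODESTREAM_MAGIC):
        return "jpeg-xl-codestream"
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    return "unknown"
```

A signature check says nothing about whether the rest of the stream decodes correctly, so it suits routing and logging rather than validation. Actual encoding and decoding would go through a real codec such as libjxl.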
AI-Ready Architecture · EMERGING
AI / LLM-Ready PDF/R — Structured for Machine Consumption from First Scan
As AI and large language model pipelines increasingly process scanned documents, the gap between what scanners produce and what AI systems need has become a critical inefficiency. AI/LLM-ready PDF/R addresses this by encoding the document's structure and semantics into the file at capture time — so that no pre-processing transformation is needed before the file enters an AI pipeline. This includes: embedded OCR text with spatial coordinates (bounding boxes) aligned to the raster image for visual grounding; structured XMP metadata describing document type, classification, and extraction hints; C2PA provenance manifest confirming the document's authentic origin (critical for AI training data integrity and RAG pipeline trustworthiness); and optional AI-generated document summaries or entity extraction results embedded as PDF/R annotations or XMP custom metadata — so that AI augmentation travels with the document rather than being stored separately in a sidecar system.
// OCR text with bounding boxes: spatially-aligned text for visual grounding.
// XMP: document classification, entity extraction hints, confidence scores.
// C2PA: provenance manifest for AI training data integrity and RAG trustworthiness.
// Embedded AI augmentation: summaries, entities, classifications as XMP / annotations.
OCR + Bounding Boxes · Visual Grounding · RAG Pipeline Ready · AI Training Integrity · Embedded Augmentation · LLM Context Structured
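One way to picture "OCR text with bounding boxes for visual grounding" is as a small, resolution-independent structure that a model can consume alongside the page image. The sketch below assumes OCR words have already been extracted with pixel coordinates and a confidence score, then emits JSON with coordinates normalised to the 0..1 range in rough reading order. The `OcrWord` type, its field names, and the 0.5 confidence threshold are illustrative assumptions, not part of PDF/R.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrWord:
    text: str
    x: int              # left edge in pixels, from the top-left of the page
    y: int              # top edge in pixels
    width: int
    height: int
    confidence: float   # 0.0 to 1.0, as reported by the OCR engine


def to_grounded_context(words, page_width_px, page_height_px, min_confidence=0.5):
    """Serialise OCR words as JSON with bounding boxes normalised to 0..1."""
    items = []
    # Sort top-to-bottom, then left-to-right, as a simple reading-order heuristic.
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if word.confidence < min_confidence:
            continue  # drop low-confidence words rather than mislead the model
        items.append({
            "text": word.text,
            "bbox": [
                round(word.x / page_width_px, 4),
                round(word.y / page_height_px, 4),
                round((word.x + word.width) / page_width_px, 4),
                round((word.y + word.height) / page_height_px, 4),
            ],
            "confidence": word.confidence,
        })
    return json.dumps(items)
```

Normalised boxes stay valid if the page image is resampled before it reaches the model. Sorting by top edge is only a heuristic, and multi-column layouts would need a proper layout analysis step.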
PDF/R vs Legacy Formats
✗ TIFF / JPEG — The Legacy Approach
  • Image formats — pixels only, no document structure
  • No embedded provenance or authenticity data
  • No business-process metadata survives transmission
  • TIFF: no PDF compatibility without conversion
  • No AI/LLM semantic structure out of the box
  • JPEG: lossy only, no lossless document option
  • Fragmented metadata across sidecar files
✓ PDF/R — The Document Format
  • Document format — metadata, signatures, encryption
  • C2PA provenance manifest embedded at capture
  • XMP business-process metadata travels with file
  • 100% PDF compatible — reads everywhere
  • AI/LLM-ready: OCR + bounding boxes + structured metadata
  • JPEG-XL option: superior lossless + lossy compression
  • ISO 23504 standardized — royalty-free, open
Who Should Use This TIC Offer
🖨
Scanner & Device Manufacturers
Firmware developers implementing PDF/R output natively in scanner hardware — evaluating JPEG-XL support, C2PA signing at capture, and XMP metadata embedding from device firmware.
💻
Document Capture ISVs (TIC)
Software developers building TWAIN Direct or TWAIN Classic capture applications — adding PDF/R output with full metadata, provenance, and AI-ready structure to their scan workflows.
🤖
AI / LLM Platform Developers
Teams building document intelligence pipelines who want a standardized, provenance-verified, semantically structured input format from scanners — replacing ad-hoc TIFF/JPEG + OCR sidecar approaches.
☁️
Cloud & ECM Platform Builders
Cloud document management, ECM, and content services platforms adding native PDF/R ingestion — leveraging XMP business-process metadata for automatic routing, classification, and compliance tagging.
🔒
Compliance & Records Platforms
Healthcare, legal, government, and financial services platforms requiring tamper-evident, provenance-verified document input — where C2PA-embedded PDF/R satisfies audit and legal defensibility requirements.
🔬
Standards Researchers & Pilots
Academic researchers, standards body participants, and innovation labs evaluating JPEG-XL, AI/LLM document structuring, and C2PA integration within a royalty-free, open ISO scanning standard.
🔓
Royalty-Free · Open Download · ISO Standardized — Yours to Use and Build On
PDF/R is completely free to use, implement, and build on. The specification is available for royalty-free download from the TWAIN Working Group and the PDF Association. Sample code is available on GitHub. The standard is ISO-published (ISO 23504) and co-developed with the PDF Association — one of the world's leading open document standards organizations. There is no licence fee, no certification cost, and no vendor lock-in. Build it into your scanner, your application, your cloud platform, or your AI pipeline — it belongs to the ecosystem.
What's Included in the TIC PDF/R Programme
  • PDF/R specification and technical resources — full access to the PDF/R specification (ISO 23504 / PDF/R-1), sample code repository on GitHub, and the TWAIN Working Group's PDF/R technical documentation — everything needed to implement PDF/R output in a scanner, application, or cloud platform.
  • C2PA integration technical consultation — a working session with TWG technical experts on embedding C2PA Content Credentials manifests within PDF/R output — covering manifest structure, Trust List certificate requirements, C2PA Generator Product conformance pathway, and verification via contentcredentials.org.
  • XMP metadata schema design session — a consultation on designing XMP custom metadata schemas for your specific business-process requirements — document type classification, routing codes, retention policies, compliance tags, and any domain-specific metadata your workflow demands — and how to embed them in PDF/R output at scan time.
  • JPEG-XL feasibility evaluation — a technical review of incorporating JPEG-XL as an additional compression option in your PDF/R implementation — covering the ISO/IEC 18181 specification, available open-source encoders/decoders (libjxl), and the performance/quality tradeoffs for your specific document types and target file sizes.
  • AI/LLM-ready PDF/R architecture review — a technical session on structuring PDF/R output for direct consumption by AI document intelligence pipelines — OCR text with spatial bounding boxes, XMP classification metadata, C2PA provenance for training data integrity, and embedding AI-generated augmentation within the PDF/R file structure.
  • Properly structured PDF output review — a review of your current or planned PDF/R implementation against PDF document structure best practices — page tree, document catalog, metadata streams, encryption, digital signatures, and OCR invisible text layer — ensuring your output participates fully in PDF-consuming downstream ecosystems.
  • TWAIN Direct integration pathway — guidance on incorporating PDF/R as the native output format within a TWAIN Direct driverless scanning workflow — covering the TWAIN Direct protocol's PDF/R output specification, cloud delivery, and integration with downstream document management and AI systems.
  • Direct access to TWG technical experts — TIC programme participants engage directly with TWAIN Working Group engineers and PDF/R experts — including members of the original TWG/PDF Association development team — for implementation questions, architecture review, and emerging capability development.
  • TIC ecosystem connection — introduction to TIC sponsor companies whose products are directly relevant to your PDF/R implementation: ExactCODE (RISC-V/open-source), C2PA (content provenance), Verve Capture (enterprise capture), JSE Imaging (TWAIN software), Dynamsoft (scanning SDK), Thin Scanner (cloud capture), and others in the TIC ecosystem.
TIC PDF/R Adoption & Development Programme
Register Your PDF/R Interest
The TWAIN Working Group will be in touch within 3 business days to connect you with the right technical resources and experts.