pdf-to-csv

n8n community node to convert PDF documents to CSV format with advanced structure detection, smart extraction method selection, and enhanced table preservation for ERP documents

Package Information

Released: 9/4/2025
Downloads: 409 weekly / 409 monthly
Latest Version: 2.7.0
Author: jkong0221

Documentation

n8n-nodes-pdf-to-csv

An n8n community node for converting PDF documents to CSV format. This node provides robust PDF parsing capabilities with coordinate-based extraction and OCR support for image-based PDFs.

Features

  • 📄 Convert PDF documents to CSV format
  • 🔗 Support for both binary data and URL inputs
  • 🎯 Three parsing methods: Coordinate-based, OCR, and Auto-detect
  • 📊 Multiple output formats (CSV string, JSON array, binary data, Excel)
  • 🔍 OCR support for scanned/image-based PDFs using Tesseract.js
  • ⚡ Fast coordinate-based extraction for native PDF text
  • 🤖 Auto-detect with intelligent fallback from coordinate to OCR
  • ⚙️ Configurable CSV delimiters and headers
  • 🔧 Built-in error handling and validation

Installation

Community Nodes (Recommended)

  1. Go to Settings > Community Nodes in your n8n instance
  2. Select Install
  3. Enter n8n-nodes-pdf-to-csv
  4. Agree to the risks and select Install

Manual Installation

  1. Clone this repository or download the source code
  2. Install dependencies:
    pnpm install
    
  3. Build the node:
    pnpm build
    
  4. Link the node to your n8n installation:
    pnpm link
    cd ~/.n8n/custom
    pnpm link n8n-nodes-pdf-to-csv
    
  5. Restart your n8n instance

Docker Installation

If you're using n8n with Docker, you can install this node as follows:

  1. Create a Dockerfile extending the n8n image:

    FROM n8nio/n8n
    USER root
    RUN npm install -g n8n-nodes-pdf-to-csv
    USER node
    
  2. Build and run your custom image:

    docker build -t n8n-custom .
    docker run -it --rm --name n8n -p 5678:5678 n8n-custom
    

Usage

Basic PDF to CSV Conversion

  1. Add the PDF to CSV node to your workflow

  2. Configure the input type:

    • Binary Data: Use when PDF comes from a previous node (e.g., HTTP Request, Google Drive)
    • URL: Provide a direct URL to the PDF file
  3. Choose parsing method:

    • Auto-Detect: Tries coordinate-based first, falls back to OCR if needed (recommended)
    • Coordinate-Based (Fast): Uses coordinate analysis for native PDF text (fastest)
    • OCR (Image-Based): Uses Tesseract.js for scanned/image PDFs (slower but works with any PDF)
  4. Configure output format:

    • CSV String: Returns formatted CSV text
    • JSON Array: Returns structured JSON data
    • Binary Data: Returns downloadable CSV file
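The JSON Array format can be pictured as the same rows keyed by the header row. The following is a minimal sketch, not the node's actual implementation: the helper name `rowsToJson` and the row representation (an array of string arrays whose first row holds the headers) are assumptions made for illustration.

```typescript
// Hypothetical helper: turn extracted rows into an array of objects,
// using the first row as the keys.
function rowsToJson(rows: string[][]): Record<string, string>[] {
  const headers = rows[0] ?? [];
  const body = rows.slice(1);
  return body.map((row) => {
    const record: Record<string, string> = {};
    headers.forEach((header, i) => {
      // Missing cells become empty strings so every record has the same keys.
      record[header] = row[i] ?? "";
    });
    return record;
  });
}

// Example input: one header row and two data rows, the second one short.
const sampleRows: string[][] = [
  ["Item", "Qty"],
  ["Widget", "4"],
  ["Gadget"],
];
console.log(JSON.stringify(rowsToJson(sampleRows), null, 2));
```

Padding short rows keeps every record the same shape, which tends to make downstream nodes easier to configure.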

Input Configuration

Binary Data Input

{
  "inputType": "binaryData",
  "binaryPropertyName": "data"
}

URL Input

{
  "inputType": "url",
  "pdfUrl": "https://example.com/document.pdf"
}
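The two input configurations above map naturally onto a discriminated union. The sketch below is illustrative only: the type names and the `validateInput` helper are assumptions and are not part of the node's published API. Only the field names `inputType`, `binaryPropertyName`, and `pdfUrl` come from this document.

```typescript
// Hypothetical types mirroring the two input configurations.
type BinaryInput = { inputType: "binaryData"; binaryPropertyName: string };
type UrlInput = { inputType: "url"; pdfUrl: string };
type PdfInput = BinaryInput | UrlInput;

// Returns a list of problems; an empty list means the config looks usable.
function validateInput(input: PdfInput): string[] {
  const problems: string[] = [];
  if (input.inputType === "binaryData") {
    if (input.binaryPropertyName.trim() === "") {
      problems.push("binaryPropertyName must not be empty");
    }
  } else {
    try {
      // The URL constructor throws on strings that are not absolute URLs.
      const url = new URL(input.pdfUrl);
      if (url.protocol !== "http:" && url.protocol !== "https:") {
        problems.push("pdfUrl must use http or https");
      }
    } catch {
      problems.push("pdfUrl is not a valid URL");
    }
  }
  return problems;
}
```

Checking the configuration before any download or parsing starts lets a workflow fail early with a readable message.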

Parsing Methods

Auto-Detect (Recommended)

Intelligent method that tries coordinate-based extraction first, then falls back to OCR if needed. Best for:

  • Unknown PDF types
  • Mixed document collections
  • When you want the fastest method that works
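The fallback behaviour described above can be sketched roughly as follows. This is an assumption about the control flow, not the node's source code: the `Extractor` type, the `autoDetect` function, and the rule "an empty result counts as failure" are all hypothetical.

```typescript
// Hypothetical extractor signature: resolves to table rows, or rejects on failure.
type Extractor = (pdf: Uint8Array) => Promise<string[][]>;

interface AutoResult {
  method: "coordinate" | "ocr";
  rows: string[][];
}

// Try the fast coordinate-based extractor first; use OCR only if it
// throws or yields no rows.
async function autoDetect(
  pdf: Uint8Array,
  coordinate: Extractor,
  ocr: Extractor,
): Promise<AutoResult> {
  try {
    const rows = await coordinate(pdf);
    // Assumption: an empty result means no usable text layer was found.
    if (rows.length > 0) {
      return { method: "coordinate", rows };
    }
  } catch {
    // Coordinate extraction failed; fall through to OCR below.
  }
  return { method: "ocr", rows: await ocr(pdf) };
}
```

Returning the method that produced the rows makes it easier to see, per document, whether the slower OCR path was taken.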

Coordinate-Based (Fast)

Uses coordinate analysis to detect table structure from native PDF text. Best for:

  • PDFs created digitally (not scanned)
  • Documents with clear table structures
  • When speed is important
  • Most business reports and invoices
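To give an idea of what coordinate analysis involves, here is a simplified sketch that groups positioned text fragments into rows. The `TextItem` shape, the `groupIntoRows` function, and the tolerance value are assumptions for illustration. The node's real algorithm is not documented here and is likely more involved (for example, it would also need to align columns across rows).

```typescript
// Hypothetical shape of a positioned text fragment. Real PDF libraries
// expose similar data under different names.
interface TextItem {
  text: string;
  x: number;
  y: number;
}

// Group fragments into rows by vertical position, then order each row
// from left to right.
function groupIntoRows(items: TextItem[], yTolerance = 2): string[][] {
  // Assumption: y grows downward. PDF coordinates often grow upward,
  // in which case the sort order would need to be reversed.
  const sorted = [...items].sort((a, b) => a.y - b.y);
  const rows: { y: number; items: TextItem[] }[] = [];
  for (const item of sorted) {
    const current = rows[rows.length - 1];
    if (current !== undefined && Math.abs(current.y - item.y) <= yTolerance) {
      // Close enough to the current row's baseline: same row.
      current.items.push(item);
    } else {
      rows.push({ y: item.y, items: [item] });
    }
  }
  return rows.map((row) =>
    [...row.items].sort((a, b) => a.x - b.x).map((item) => item.text),
  );
}
```

The tolerance absorbs small baseline differences between fragments on the same visual line. Too large a value would merge adjacent rows.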

OCR (Image-Based)

Uses Tesseract.js optical character recognition to extract text from PDF images. Best for:

  • Scanned documents
  • Image-based PDFs
  • PDFs where coordinate extraction fails
  • Documents with complex layouts or mixed content

Note: OCR is slower than coordinate-based extraction, but it works whether or not the PDF contains an embedded text layer.

Example Workflow

{
  "nodes": [
    {
      "parameters": {
        "url": "https://example.com/report.pdf",
        "responseFormat": "file",
        "options": {}
      },
      "type": "n8n-nodes-base.httpRequest",
      "typeVersion": 1,
      "position": [250, 300],
      "id": "http-request",
      "name": "Download PDF"
    },
    {
      "parameters": {
        "operation": "convert",
        "inputType": "binaryData",
        "binaryPropertyName": "data",
        "parsingMethod": "auto",
        "outputFormat": "csvString",
        "csvDelimiter": ",",
        "includeHeaders": true,
        "skipEmptyLines": true
      },
      "type": "n8n-nodes-pdf-to-csv.pdfToCsv",
      "typeVersion": 1,
      "position": [450, 300],
      "id": "pdf-to-csv",
      "name": "PDF to CSV"
    }
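  ],
  "connections": {
    "Download PDF": {
      "main": [[{ "node": "PDF to CSV", "type": "main", "index": 0 }]]
    }
  }
}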

Configuration Options

Parameter           Type     Default     Description
inputType           Options  binaryData  Source of PDF file (binaryData/url)
binaryPropertyName  String   data        Name of binary property containing PDF
pdfUrl              String   -           URL of PDF file to convert
parsingMethod       Options  auto        Method for parsing PDF content (auto/coordinate/ocr)
csvDelimiter        String   ,           Delimiter for CSV output
includeHeaders      Boolean  true        Treat first row as headers
skipEmptyLines      Boolean  true        Skip empty lines in PDF
outputFormat        Options  csvString   Format of output data
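As an illustration of how `csvDelimiter` and `skipEmptyLines` might affect the output, the sketch below serializes rows to CSV text. It is not taken from the node's source. The function names are hypothetical, and the quoting rules follow the common RFC 4180 convention, which the node may or may not use.

```typescript
// Quote a cell when it contains the delimiter, a double quote, or a
// line break. Embedded quotes are doubled (RFC 4180 style).
function escapeCell(cell: string, delimiter: string): string {
  const needsQuotes =
    cell.includes(delimiter) ||
    cell.includes('"') ||
    cell.includes("\n") ||
    cell.includes("\r");
  return needsQuotes ? `"${cell.replace(/"/g, '""')}"` : cell;
}

// Hypothetical serializer honoring csvDelimiter and skipEmptyLines.
function toCsv(
  rows: string[][],
  csvDelimiter = ",",
  skipEmptyLines = true,
): string {
  return rows
    // A row counts as empty when every cell is blank.
    .filter((row) => !skipEmptyLines || row.some((cell) => cell.trim() !== ""))
    .map((row) => row.map((cell) => escapeCell(cell, csvDelimiter)).join(csvDelimiter))
    .join("\n");
}

console.log(toCsv([["Name", "Note"], ["", ""], ["Acme, Inc.", "paid"]]));
```

Switching the delimiter to `;` is a common choice for locales where the comma is the decimal separator. In that case cells containing commas no longer need quoting.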

Supported File Types

  • PDF documents (.pdf)
  • Password-protected PDFs are not currently supported

Error Handling

The node includes comprehensive error handling for:

  • Invalid PDF files
  • Network errors when fetching URLs
  • Parsing failures
  • Memory limitations for large files

Errors can be handled using n8n's built-in error handling mechanisms, such as the node's Continue On Fail setting or a dedicated error workflow.
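One inexpensive check for the "invalid PDF" case is to look at the file header, since PDF files start with the ASCII marker `%PDF-`. The sketch below shows the idea. The function `looksLikePdf` is hypothetical, and whether the node performs this particular check is an assumption.

```typescript
// Hypothetical pre-check: does the buffer start with the bytes "%PDF-"?
function looksLikePdf(bytes: Uint8Array): boolean {
  const magic = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  return (
    bytes.length >= magic.length &&
    magic.every((byte, i) => bytes[i] === byte)
  );
}

// An HTML error page fetched from a URL fails the check.
const htmlPage = new TextEncoder().encode("<html>Not found</html>");
console.log(looksLikePdf(htmlPage));
```

A check like this catches the frequent case where a URL returns an HTML error page instead of a document. It does not prove that the rest of the file is well formed.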

Limitations

  • Large PDF files may consume significant memory
  • Complex PDF layouts may not parse perfectly with auto-detection
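  • Scanned (image-only) PDFs contain no extractable text and must be processed with the OCR method, either directly or through the Auto-Detect fallback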
  • Password-protected PDFs are not supported

Development

Prerequisites

  • Node.js 18.10 or higher
  • pnpm 7.18 or higher

Setup

git clone https://github.com/your-username/n8n-nodes-pdf-to-csv.git
cd n8n-nodes-pdf-to-csv
pnpm install

Build

pnpm build

Development Mode

pnpm dev

Linting

pnpm lint
pnpm lintfix

Testing

pnpm test

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes
  4. Run tests and linting: pnpm test && pnpm lint
  5. Commit your changes: git commit -m 'Add amazing feature'
  6. Push to the branch: git push origin feature/amazing-feature
  7. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Changelog

v1.0.0

  • Initial release
  • Basic PDF to CSV conversion
  • Multiple parsing methods
  • Flexible output formats
  • Comprehensive error handling
