n8n-nodes-pdf-to-csv
An n8n community node for converting PDF documents to CSV format. This node provides robust PDF parsing capabilities with coordinate-based extraction and OCR support for image-based PDFs.
Features
- 📄 Convert PDF documents to CSV format
- 🔗 Support for both binary data and URL inputs
- 🎯 Three parsing methods: Coordinate-based, OCR, and Auto-detect
- 📊 Multiple output formats (CSV string, JSON array, binary data, Excel)
- 🔍 OCR support for scanned/image-based PDFs using Tesseract.js
- ⚡ Fast coordinate-based extraction for native PDF text
- 🤖 Auto-detect with intelligent fallback from coordinate to OCR
- ⚙️ Configurable CSV delimiters and headers
- 🔧 Built-in error handling and validation
Installation
Community Nodes (Recommended)
- Go to Settings > Community Nodes in your n8n instance
- Select Install
- Enter n8n-nodes-pdf-to-csv as the package name
- Agree to the risks and select Install
Manual Installation
- Clone this repository or download the source code
- Install dependencies:
    pnpm install
- Build the node:
    pnpm build
- Link the node to your n8n installation:
    pnpm link
    cd ~/.n8n/custom
    pnpm link n8n-nodes-pdf-to-csv
- Restart your n8n instance
Docker Installation
If you're using n8n with Docker, you can install this node as follows:
- Create a Dockerfile extending the n8n image:
    FROM n8nio/n8n
    USER root
    RUN npm install -g n8n-nodes-pdf-to-csv
    USER node
- Build and run your custom image:
    docker build -t n8n-custom .
    docker run -it --rm --name n8n -p 5678:5678 n8n-custom
Usage
Basic PDF to CSV Conversion
- Add the PDF to CSV node to your workflow
- Configure the input type:
  - Binary Data: Use when the PDF comes from a previous node (e.g., HTTP Request, Google Drive)
  - URL: Provide a direct URL to the PDF file
- Choose the parsing method:
  - Auto-Detect: Tries coordinate-based extraction first and falls back to OCR if needed (recommended)
  - Coordinate-Based (Fast): Uses coordinate analysis for native PDF text (fastest)
  - OCR (Image-Based): Uses Tesseract.js for scanned/image PDFs (slower but works with any PDF)
- Configure the output format:
  - CSV String: Returns formatted CSV text
  - JSON Array: Returns structured JSON data
  - Binary Data: Returns a downloadable CSV file
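To make the JSON Array output format more concrete, here is a minimal TypeScript sketch of how parsed rows could be turned into objects. It is an illustration only, not the node's implementation: the function name `rowsToObjects`, the generic `columnN` keys, and the way `includeHeaders` is interpreted are all assumptions.

```typescript
// Hypothetical helper: convert parsed rows into JSON-style records.
// Assumption: with includeHeaders, the first row supplies the keys;
// otherwise generic keys (column1, column2, ...) are generated.
function rowsToObjects(
  rows: string[][],
  includeHeaders: boolean,
): Record<string, string>[] {
  if (rows.length === 0) return [];
  const width = Math.max(...rows.map((row) => row.length));
  const keys: string[] = [];
  for (let i = 0; i < width; i++) {
    const header = includeHeaders ? (rows[0][i] || '').trim() : '';
    // Fall back to a generated key when a header cell is missing or blank.
    keys.push(header !== '' ? header : `column${i + 1}`);
  }
  const dataRows = includeHeaders ? rows.slice(1) : rows;
  return dataRows.map((row) => {
    const record: Record<string, string> = {};
    keys.forEach((key, i) => {
      // Short rows are padded with empty strings.
      record[key] = row[i] !== undefined ? row[i] : '';
    });
    return record;
  });
}
```

Note that duplicate header names would overwrite each other in this sketch; a real implementation would need a rule for making keys unique.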
Input Configuration
Binary Data Input
{
"inputType": "binaryData",
"binaryPropertyName": "data"
}
URL Input
{
"inputType": "url",
"pdfUrl": "https://example.com/document.pdf"
}
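The two input shapes above could be modelled and checked before any parsing starts. The following TypeScript sketch is an assumption-laden illustration: the `PdfInput` type and `validateInput` function are hypothetical names, and the URL check is a deliberately simple pattern match rather than whatever validation the node performs.

```typescript
// Hypothetical typing of the two input configurations shown above.
type PdfInput =
  | { inputType: 'binaryData'; binaryPropertyName: string }
  | { inputType: 'url'; pdfUrl: string };

// Return a list of problems; an empty list means the input looks usable.
function validateInput(input: PdfInput): string[] {
  const problems: string[] = [];
  if (input.inputType === 'binaryData') {
    if (input.binaryPropertyName.trim() === '') {
      problems.push('binaryPropertyName must not be empty');
    }
  } else if (!/^https?:\/\/\S+$/i.test(input.pdfUrl)) {
    // Simplified check: only http(s) URLs without whitespace are accepted.
    problems.push('pdfUrl must be an http(s) URL');
  }
  return problems;
}
```

Returning a list of problems instead of throwing lets a caller report every issue at once.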
Parsing Methods
Auto-Detect (Recommended)
Intelligent method that tries coordinate-based extraction first, then falls back to OCR if needed. Best for:
- Unknown PDF types
- Mixed document collections
- When you want the fastest method that works
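The fallback behaviour can be sketched in a few lines of TypeScript. This is a simplified illustration, not the node's source: `Extractor` and `autoDetect` are hypothetical names, the extractors are synchronous for brevity (real ones would be asynchronous), and treating "no non-empty cells" as a failure is an assumption.

```typescript
type Table = string[][];

// Hypothetical extractor signature; synchronous here to keep the sketch short.
type Extractor = (pdf: Uint8Array) => Table;

interface AutoDetectResult {
  method: 'coordinate' | 'ocr';
  rows: Table;
}

// Try the fast extractor first; fall back to OCR when it throws
// or returns no non-empty cells (an assumed failure criterion).
function autoDetect(
  pdf: Uint8Array,
  coordinate: Extractor,
  ocr: Extractor,
): AutoDetectResult {
  try {
    const rows = coordinate(pdf);
    const hasText = rows.some((row) => row.some((cell) => cell.trim() !== ''));
    if (hasText) {
      return { method: 'coordinate', rows };
    }
  } catch {
    // Coordinate extraction failed; fall through to OCR.
  }
  return { method: 'ocr', rows: ocr(pdf) };
}
```

Returning the method alongside the rows makes it easy to log which path was taken for a given document.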
Coordinate-Based (Fast)
Uses coordinate analysis to detect table structure from native PDF text. Best for:
- PDFs created digitally (not scanned)
- Documents with clear table structures
- When speed is important
- Most business reports and invoices
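As a rough illustration of what coordinate analysis means, the TypeScript sketch below groups positioned text fragments into rows by their vertical position and orders each row from left to right. The `TextItem` shape, the `groupIntoRows` name, and the tolerance value are assumptions made for the example; the node's actual algorithm may differ.

```typescript
// A positioned text fragment, roughly as a PDF text layer exposes it.
// This shape is an assumption of the sketch.
interface TextItem {
  text: string;
  x: number; // left edge
  y: number; // baseline
}

// Group fragments into rows by baseline (within yTolerance of the row's
// first fragment), then order the cells in each row from left to right.
function groupIntoRows(items: TextItem[], yTolerance = 2): string[][] {
  // PDF y coordinates grow upward, so larger y values are nearer the top.
  const sorted = [...items].sort((a, b) => b.y - a.y);
  const rows: TextItem[][] = [];
  for (const item of sorted) {
    const current = rows[rows.length - 1];
    if (current && Math.abs(current[0].y - item.y) <= yTolerance) {
      current.push(item);
    } else {
      rows.push([item]);
    }
  }
  return rows.map((row) =>
    [...row].sort((a, b) => a.x - b.x).map((cell) => cell.text),
  );
}
```

This only orders cells within a row. To keep columns aligned when a cell is missing, a fuller implementation would also cluster the x positions into column boundaries.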
OCR (Image-Based)
Uses Tesseract.js optical character recognition to extract text from PDF images. Best for:
- Scanned documents
- Image-based PDFs
- PDFs where coordinate extraction fails
- Documents with complex layouts or mixed content
Note: OCR is slower but more universally compatible with different PDF types.
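OCR yields plain text rather than positioned fragments, so the table structure has to be inferred from the text itself. The TypeScript sketch below shows one simple heuristic, splitting each recognized line on tabs or runs of two or more whitespace characters. The function name `ocrTextToRows` and the splitting rule are assumptions for illustration; the call to Tesseract.js itself is not shown.

```typescript
// Split recognized text into rows and cells.
// Heuristic (assumed): tabs or runs of 2+ whitespace characters separate cells,
// while single spaces stay inside a cell.
function ocrTextToRows(text: string): string[][] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line !== '')
    .map((line) => line.split(/\t+|\s{2,}/).map((cell) => cell.trim()));
}
```

The heuristic breaks down when OCR collapses column gaps to a single space, which is one reason OCR results on complex layouts may need manual review.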
Example Workflow
{
  "nodes": [
    {
      "parameters": {
        "url": "https://example.com/report.pdf",
        "responseFormat": "file",
        "options": {}
      },
      "type": "n8n-nodes-base.httpRequest",
      "typeVersion": 1,
      "position": [250, 300],
      "id": "http-request",
      "name": "Download PDF"
    },
    {
      "parameters": {
        "operation": "convert",
        "inputType": "binaryData",
        "binaryPropertyName": "data",
        "parsingMethod": "auto",
        "outputFormat": "csvString",
        "csvDelimiter": ",",
        "includeHeaders": true,
        "skipEmptyLines": true
      },
      "type": "n8n-nodes-pdf-to-csv.pdfToCsv",
      "typeVersion": 1,
      "position": [450, 300],
      "id": "pdf-to-csv",
      "name": "PDF to CSV"
    }
  ],
  "connections": {
    "Download PDF": {
      "main": [[{ "node": "PDF to CSV", "type": "main", "index": 0 }]]
    }
  }
}
Configuration Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| inputType | Options | binaryData | Source of PDF file (binaryData/url) |
| binaryPropertyName | String | data | Name of binary property containing PDF |
| pdfUrl | String | - | URL of PDF file to convert |
| parsingMethod | Options | auto | Method for parsing PDF content (auto/coordinate/ocr) |
| csvDelimiter | String | , | Delimiter for CSV output |
| includeHeaders | Boolean | true | Treat first row as headers |
| skipEmptyLines | Boolean | true | Skip empty lines in PDF |
| outputFormat | Options | csvString | Format of output data |
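To show how csvDelimiter and skipEmptyLines could affect the CSV String output, here is a small TypeScript sketch. It is not taken from the node's source: `toCsv`, `escapeCell`, and the `CsvOptions` interface are hypothetical, the quoting follows the usual RFC 4180 convention, and includeHeaders is left out because its exact effect on CSV output is not described here.

```typescript
// Subset of the node options relevant to CSV serialization (assumed shape).
interface CsvOptions {
  csvDelimiter: string;
  skipEmptyLines: boolean;
}

// Quote a cell when it contains the delimiter, a double quote, or a line break.
function escapeCell(cell: string, delimiter: string): string {
  const needsQuotes =
    cell.indexOf(delimiter) !== -1 ||
    cell.indexOf('"') !== -1 ||
    /[\r\n]/.test(cell);
  return needsQuotes ? `"${cell.replace(/"/g, '""')}"` : cell;
}

function toCsv(rows: string[][], options: CsvOptions): string {
  // A row counts as empty when every cell is blank after trimming.
  const kept = options.skipEmptyLines
    ? rows.filter((row) => row.some((cell) => cell.trim() !== ''))
    : rows;
  return kept
    .map((row) =>
      row
        .map((cell) => escapeCell(cell, options.csvDelimiter))
        .join(options.csvDelimiter),
    )
    .join('\n');
}
```

Quoting depends on the chosen delimiter: a cell containing a comma is quoted with "," as the delimiter but left as-is with ";".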
Supported File Types
- PDF documents (.pdf)
- Password-protected PDFs are not currently supported
Error Handling
The node includes comprehensive error handling for:
- Invalid PDF files
- Network errors when fetching URLs
- Parsing failures
- Memory limitations for large files
Errors can be handled using n8n's built-in error handling mechanisms, such as the node's Continue On Fail setting or a separate error workflow.
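As an example of the kind of check behind "Invalid PDF files", the TypeScript sketch below tests whether a byte buffer starts with the PDF signature. The function name `looksLikePdf` is hypothetical, and this is only an assumed pre-check, not a description of the node's validation.

```typescript
// Check for the "%PDF-" signature that a PDF file starts with.
function looksLikePdf(bytes: Uint8Array): boolean {
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  if (bytes.length < signature.length) return false;
  return signature.every((byte, i) => bytes[i] === byte);
}
```

The check is strict: some PDF readers tolerate a few leading bytes before the signature, and this sketch does not. A cheap test like this can reject HTML error pages or other non-PDF downloads before any parsing starts.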
Limitations
- Large PDF files may consume significant memory
- Complex PDF layouts may not parse perfectly with auto-detection
- Scanned (image-based) PDFs can only be parsed with the OCR method, which is slower than coordinate-based extraction
- Password-protected PDFs are not supported
Development
Prerequisites
- Node.js 18.10 or higher
- pnpm 7.18 or higher
Setup
git clone https://github.com/your-username/n8n-nodes-pdf-to-csv.git
cd n8n-nodes-pdf-to-csv
pnpm install
Build
pnpm build
Development Mode
pnpm dev
Linting
pnpm lint
pnpm lintfix
Testing
pnpm test
Contributing
- Fork the repository
- Create a feature branch:
    git checkout -b feature/amazing-feature
- Make your changes
- Run tests and linting:
    pnpm test && pnpm lint
- Commit your changes:
    git commit -m 'Add amazing feature'
- Push to the branch:
    git push origin feature/amazing-feature
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
- 📧 Email: your.email@example.com
- 🐛 Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions
Changelog
v1.0.0
- Initial release
- Basic PDF to CSV conversion
- Multiple parsing methods
- Flexible output formats
- Comprehensive error handling