PDF to CSV

Convert PDF documents to CSV format using coordinate-based extraction or OCR

Overview

This node converts PDF documents into tabular output: a CSV string, a JSON array, a binary CSV file, or an Excel (.xlsx) file. It extracts tables using one of three parsing methods: coordinate-based extraction (fast, and best suited to PDFs with native text), OCR-based extraction (for scanned or image-only PDFs), or an auto-detect mode that tries coordinate-based extraction first and falls back to OCR if needed.
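
The auto-detect behaviour can be sketched as follows. This is a minimal illustration, not the node's actual code: coordinate_extract and ocr_extract are hypothetical callables standing in for the two extraction paths, and the fallback rule (no non-blank cell found, or the coordinate pass raising an error) is an assumption, since the description above only says "if needed".

```python
from typing import Callable, List

Rows = List[List[str]]


def auto_detect_extract(
    pdf_bytes: bytes,
    coordinate_extract: Callable[[bytes], Rows],  # hypothetical fast path
    ocr_extract: Callable[[bytes], Rows],         # hypothetical OCR path
) -> Rows:
    """Try the coordinate-based path first; fall back to OCR if it finds no text."""
    try:
        rows = coordinate_extract(pdf_bytes)
    except Exception:
        # Assumption: a failed coordinate pass is treated like an empty result.
        rows = []

    # Keep the coordinate result only if at least one cell has visible text.
    if any(cell.strip() for row in rows for cell in row):
        return rows
    return ocr_extract(pdf_bytes)
```

Passing the extractors in as arguments keeps the sketch independent of any particular PDF or OCR library.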

Common scenarios where this node is beneficial include:

Automating data extraction from invoices, reports, or financial statements in PDF format.
Converting scanned PDFs into editable spreadsheet formats.
Integrating PDF data extraction into workflows for further processing or analysis.

Practical examples:

Extracting sales data tables from monthly PDF reports and converting them into Excel files for accounting.
Parsing scanned purchase orders using OCR and outputting JSON arrays for database insertion.
Downloading a PDF from a URL, extracting its table content, and returning it as a CSV string for immediate use.

Properties

Name	Meaning
Input Type	Source of the PDF file. Binary Data: the PDF is read from a binary property of the previous node's output. URL: the PDF is downloaded from a direct URL.
Binary Property	(Required if Input Type is Binary Data) The name of the binary property containing the PDF file in the input data.
PDF URL	(Required if Input Type is URL) The URL of the PDF file to convert.
Parsing Method	Method used to extract text from the PDF. Coordinate-Based (Fast): extracts text by its position on the page; best for native PDFs. OCR (Image-Based): uses OCR to read text from scanned or image-only PDFs. Auto-Detect: tries coordinate-based extraction first, then falls back to OCR if needed.
Output Format	Format of the converted output. CSV String: returns the parsed data as a CSV string. JSON Array: returns the parsed data as a JSON array. Binary Data: returns the CSV as binary data suitable for download. Excel File: returns the data as an Excel (.xlsx) file.
CSV Delimiter	Delimiter character to use for CSV output (only applies when output format is CSV String or Binary Data). Default is comma (,).
Include Headers	Whether to include column headers in the output data. Defaults to true.
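
To make the effect of CSV Delimiter and Include Headers concrete, here is a small sketch using Python's standard csv module. The function name rows_to_csv and the sample data are illustrative assumptions; the node itself is described as using papaparse, so its quoting and line endings may differ.

```python
import csv
import io
from typing import List, Sequence


def rows_to_csv(
    headers: Sequence[str],
    rows: List[Sequence[str]],
    delimiter: str = ",",          # mirrors the CSV Delimiter property
    include_headers: bool = True,  # mirrors the Include Headers property
) -> str:
    """Serialize table rows to a CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    if include_headers:
        writer.writerow(headers)
    writer.writerows(rows)
    return buf.getvalue()


# Semicolon-delimited output with a header row.
print(rows_to_csv(["item", "qty"], [["Widget", "2"]], delimiter=";"))
```

Fields that contain the delimiter are quoted automatically by the csv module, which is why choosing a delimiter that rarely occurs in the data keeps the output easier to read.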

Output

The node outputs an array of items, each containing:

A json field whose content depends on the selected output format:
- For CSV String, a UTF-8 encoded CSV string (with optional BOM).
- For JSON Array, an array of objects representing rows keyed by column headers (if headers included).
- For Binary Data, a confirmation message in JSON plus a binary property containing the CSV file encoded in base64, with appropriate MIME type and filename.
- For Excel File, a confirmation message in JSON plus a binary property containing the Excel file encoded in base64, with appropriate MIME type and filename.
For the Binary Data and Excel File formats, the file is attached under a binary property named data, which has these fields:
- data: Base64-encoded file content.
- mimeType: Content type (text/csv; charset=utf-8 or Excel MIME type).
- fileName: Suggested filename (converted.csv or converted.xlsx).
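
The output shapes described above can be sketched like this. The helper names are hypothetical and the dictionaries only mirror the fields listed in this section (data, mimeType, fileName); the real items may carry additional fields.

```python
import base64
from typing import Dict, List, Sequence


def rows_to_records(headers: Sequence[str], rows: List[Sequence[str]]) -> List[Dict[str, str]]:
    """JSON Array shape: one object per row, keyed by column header."""
    return [dict(zip(headers, row)) for row in rows]


def csv_binary_property(csv_text: str) -> Dict[str, Dict[str, str]]:
    """Binary Data shape: Base64 content plus MIME type and file name,
    stored under a binary property named "data"."""
    encoded = base64.b64encode(csv_text.encode("utf-8")).decode("ascii")
    return {
        "data": {
            "data": encoded,
            "mimeType": "text/csv; charset=utf-8",
            "fileName": "converted.csv",
        }
    }


records = rows_to_records(["item", "qty"], [["Widget", "2"]])
binary = csv_binary_property("item,qty\nWidget,2\n")
```

A downstream step can recover the original CSV text by Base64-decoding the inner data field and decoding the bytes as UTF-8.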

Dependencies

Requires the following npm packages bundled with the node:
- papaparse for CSV parsing and generation.
- xlsx for Excel file creation.
- pdfreader, pdf-parse, pdf-table-extractor for PDF parsing and table extraction.
- tesseract.js for OCR-based text extraction.
- pdf2pic for converting PDF pages to images for OCR.
Uses Node.js built-in modules like child_process, fs, and path.
For Python fallback extraction, requires Python 3 with the pdfplumber package installed and accessible in the environment.
The node may require temporary file system access to write/read intermediate files during extraction.
HTTP requests to fetch PDFs from URLs require network access.
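
Before relying on the Python fallback, it can help to confirm that the interpreter can actually import pdfplumber. The following is a small diagnostic sketch, not part of the node; it assumes you run it with the same Python interpreter the node would invoke, which this documentation does not specify.

```python
import importlib.util
import sys
from typing import List


def check_python_fallback() -> List[str]:
    """Return a list of problems that would break the Python fallback (empty if none)."""
    problems = []
    if sys.version_info.major < 3:
        problems.append("Python 3 is required, found " + sys.version.split()[0])
    # find_spec returns None when the package cannot be located on the import path.
    if importlib.util.find_spec("pdfplumber") is None:
        problems.append("pdfplumber is not importable; install it, e.g. with 'pip install pdfplumber'")
    return problems


if __name__ == "__main__":
    for problem in check_python_fallback():
        print(problem)
```

An empty result only shows that the package is present for this interpreter; it does not prove the node picks up the same environment.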

Troubleshooting

Common Issues

No text found in PDF: This usually means the PDF is image-based and coordinate extraction failed. Switching to OCR parsing method or using Auto-Detect can resolve this.
Memory limit exceeded: Large PDFs with many text items may exceed memory limits. Consider splitting large PDFs or increasing available memory.
Timeout errors: Extraction processes have timeouts (e.g., 30 seconds for coordinate extraction, 60 seconds for the Python fallback). Complex or large PDFs may need more time or a simpler extraction method.
Python fallback failures: If Python or required packages are missing, the fallback extraction will fail. Ensure Python 3 and pdfplumber are installed and accessible.
OCR errors or no text extracted: OCR depends on Tesseract language data and image quality. Poor scans or unsupported languages may cause extraction failure.
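
The timeout behaviour mentioned above follows a common pattern: run the extraction as a child process and give up once a time limit passes. The sketch below shows that pattern with Python's subprocess module. It is an illustration of the idea only; the node is described as using Node.js child_process, and the function name and example command here are assumptions.

```python
import subprocess
import sys
from typing import List, Optional


def run_with_timeout(cmd: List[str], timeout_s: float) -> Optional[str]:
    """Run cmd and return its stdout, or None if it runs longer than timeout_s seconds."""
    try:
        done = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when this limit is exceeded
            check=True,         # a non-zero exit status raises CalledProcessError
        )
    except subprocess.TimeoutExpired:
        return None
    return done.stdout


# Example: a quick command that finishes well within the limit.
output = run_with_timeout([sys.executable, "-c", "print('ok')"], timeout_s=10)
```

Returning None on timeout lets the caller decide whether to retry with another parsing method or report the failure.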

Error Messages and Resolutions

"PDF conversion failed: No text found in PDF - may be image-based": Switch the Parsing Method to OCR or Auto-Detect.
"PDF parsing error: ...": Indicates a low-level PDF reading issue; check the integrity of the PDF file.
"Memory limit exceeded": Reduce the PDF size or increase the memory allocation.
"Python script timeout after 60 seconds": The PDF is too complex for the Python fallback; try another parsing method.
"OCR extraction failed: ...": Check the Tesseract installation and the supported languages.
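
If a workflow needs to react to these errors automatically, the list above can be turned into a simple lookup. This is a hedged sketch: the message fragments and fixes come from this section, but matching by substring is an assumption, because some messages end in variable detail and the exact wrapping of each message is not documented here.

```python
from typing import Optional

# (message fragment, suggested resolution) pairs taken from the list above.
KNOWN_ERRORS = [
    ("No text found in PDF", "Switch the Parsing Method to OCR or Auto-Detect."),
    ("PDF parsing error:", "Check the integrity of the PDF file."),
    ("Memory limit exceeded", "Reduce the PDF size or increase the memory allocation."),
    ("Python script timeout", "Try another parsing method."),
    ("OCR extraction failed:", "Check the Tesseract installation and the supported languages."),
]


def suggest_fix(message: str) -> Optional[str]:
    """Return the documented resolution for an error message, or None if it is unknown."""
    for fragment, fix in KNOWN_ERRORS:
        if fragment in message:
            return fix
    return None
```

For example, suggest_fix("Memory limit exceeded") returns the memory-related advice, while an unrecognized message returns None so the caller can surface the raw error instead.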

Links and References

This node extracts tabular data from both native-text and scanned PDFs, using coordinate-based extraction, OCR, or an auto-detect mode that tries coordinate-based extraction first and falls back to OCR.