PDF4me

Comprehensive PDF and document processing: generate barcodes, convert files, extract data, manipulate images, and automate workflows with the PDF4me API

Overview

This node operation, "Extract Text By Expression," allows users to extract text from PDF documents by applying a regular expression pattern. It supports multiple input methods for the PDF file: binary data from a previous node, a base64-encoded string, or a URL pointing to the PDF. Users can specify which pages to process and provide advanced options for fine-tuning extraction.
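
For the base64 input method, the PDF bytes have to be encoded as a string before they are passed to the node. A minimal sketch in Python, assuming the file content is already in memory; the helper name `pdf_to_base64` and the placeholder bytes are illustrative and not part of the node:

```python
import base64


def pdf_to_base64(pdf_bytes: bytes) -> str:
    """Encode raw PDF bytes as an ASCII base64 string."""
    return base64.b64encode(pdf_bytes).decode("ascii")


# Stand-in bytes for illustration; a real PDF would be read with open(path, "rb").
sample = b"%PDF-1.4\n% minimal placeholder, not a valid document\n"
encoded = pdf_to_base64(sample)

# Decoding the string returns the original bytes unchanged.
assert base64.b64decode(encoded) == sample
```

The resulting string is what would go into the Base64 PDF Content property when Input Data Type is set to Base64 String.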

Common scenarios include:

  • Extracting specific data such as percentages, email addresses, or codes embedded in PDFs.
  • Automating data retrieval from invoices, reports, or contracts where certain patterns are known.
  • Processing multi-page PDFs selectively to optimize performance.

Practical examples:

  • Extract all email addresses from pages 1 to 3 of an uploaded invoice PDF.
  • Retrieve all percentage values from a PDF report accessed via URL.
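
Patterns for these two examples can be tried locally before they are pasted into the Expression property. The sketch below uses Python's `re` module; the patterns and sample text are illustrative, deliberately simple, and not taken from the node. The regular expression flavor used by the PDF4me service is not stated here, so its behavior may differ from Python's.

```python
import re

# Illustrative patterns; simple matchers, not exhaustive validators.
EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
PERCENT_PATTERN = r"\d+(?:\.\d+)?\s?%"

# Made-up text standing in for the content of a PDF page.
sample_text = "Contact billing@example.com or sales@example.org. VAT 7.7 %, discount 10%."

emails = re.findall(EMAIL_PATTERN, sample_text)
percentages = re.findall(PERCENT_PATTERN, sample_text)

print(emails)
print(percentages)
```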

Properties

Name | Meaning
Input Data Type | How the PDF file is provided. Options: Binary Data (from previous node), Base64 String (direct content), URL (link to PDF file).
Input Binary Field | Name of the binary property that contains the PDF file when Input Data Type is Binary Data. Usually "data" for file uploads.
Base64 PDF Content | Base64-encoded string of the PDF document content. Used when Input Data Type is Base64 String.
PDF URL | URL of the PDF file to extract text from. Used when Input Data Type is URL.
Document Name | Name of the document used internally during processing. Defaults to "document.pdf".
Expression | Regular expression pattern to search for within the PDF text. For example, %, US, or a literal address such as email@example.com.
Page Sequence | Pages to process. Examples: "1-" for all pages starting from page 1, "1,2,3" for specific pages, "1-5" for a range of pages.
Advanced Options | Collection of additional settings. Includes "Custom Profiles", a JSON field for adjusting API call properties as described in the external API documentation.
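
The Page Sequence formats listed above can be read as follows. This is a sketch of one plausible interpretation in Python, not the service's own parser: the function name `expand_page_sequence`, the clamping to the document length, and the duplicate handling are all assumptions made for illustration.

```python
from typing import List


def expand_page_sequence(sequence: str, page_count: int) -> List[int]:
    """Expand a sequence such as "1-", "1,2,3" or "1-5" into page numbers."""
    pages: List[int] = []
    for part in sequence.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text)
            # An open-ended range like "1-" runs to the last page.
            end = int(end_text) if end_text else page_count
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Clamp to the document length (assumed behavior) and skip duplicates.
        for page in range(start, min(end, page_count) + 1):
            if page not in pages:
                pages.append(page)
    return pages
```

Under these assumptions, `expand_page_sequence("1-", 4)` covers every page of a four-page document, while `expand_page_sequence("1-5", 3)` stops at page 3.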

Output

The output is an array of JSON objects corresponding to each input item processed. Each object contains extracted text snippets matching the provided regular expression from the specified pages of the PDF.

If the node processes multiple inputs, the output array will contain results for each input in order.

No binary data output is produced by this operation; it focuses solely on textual extraction.

Dependencies

  • Requires access to the PDF4me API, the external service that performs the text extraction by regular expression.
  • The node expects proper configuration of API credentials (e.g., an API key) in n8n to authenticate requests to the PDF processing service.
  • Network access is needed if the PDF is provided via URL or if the API is cloud-based.

Troubleshooting

  • Common issues:

    • Invalid or malformed regular expressions may cause no matches or errors.
    • Incorrect page sequence format might lead to unexpected results or failure.
    • Providing an invalid URL or inaccessible PDF will result in errors.
    • Missing or incorrect binary property name when using binary input will cause the node to fail to find the PDF data.
  • Error messages and resolutions:

    • "Failed to fetch PDF from URL": Check the URL accessibility and correctness.
    • "Invalid regular expression": Verify the syntax of the expression.
    • "Binary property not found": Ensure the binary field name matches the actual property in the input data.
    • API authentication errors: Confirm that the API key or credential is correctly set up in n8n.
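
Some of these problems can be caught before the node runs. The following Python sketch checks that an expression compiles and that a URL is absolute; the function name `preflight_errors` and its messages are hypothetical and unrelated to the node's own error strings. Python's regular expression syntax may differ from the one the service uses, so a pattern that passes here is only a first sanity check.

```python
import re
from typing import List, Optional
from urllib.parse import urlparse


def preflight_errors(expression: str, pdf_url: Optional[str] = None) -> List[str]:
    """Return a list of problems found before the node is run."""
    problems: List[str] = []
    if not expression:
        problems.append("expression is empty")
    else:
        try:
            # Compiling surfaces syntax errors such as an unclosed group.
            re.compile(expression)
        except re.error as exc:
            problems.append(f"expression does not compile: {exc}")
    if pdf_url is not None:
        parsed = urlparse(pdf_url)
        # Only absolute http(s) URLs with a host can be fetched remotely.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("URL must be an absolute http(s) address")
    return problems
```

An empty list means no problem was detected locally; it does not guarantee that the PDF is reachable or that the credentials are valid.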
