PDF4me

Comprehensive PDF and document processing: generate barcodes, convert files, extract data, manipulate images, and automate workflows with the PDF4me API.

Overview

The Extract Resources operation extracts text content and embedded images from a PDF document. The PDF can be supplied in one of three ways: as binary data from a previous node, as a base64-encoded string, or as a URL pointing to the file.

Common scenarios where this operation is useful:

  • Extracting text for indexing, searching, or further processing.
  • Retrieving images embedded in PDFs for use in other workflows or analysis.
  • Processing specific pages or page ranges within a PDF.
  • Handling PDFs that arrive as binary data, base64 strings, or URLs.

Practical examples:

  • Extract all text from an invoice PDF received as binary data to automate data entry.
  • Extract images from a PDF brochure provided via URL to reuse them in marketing materials.
  • Extract text and images only from pages 2 to 5 of a large PDF report supplied as a base64 string.
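
For the base64 scenario in the last example, the PDF has to be encoded before it is placed in the Base64 PDF Content field. A minimal Python sketch of that step, assuming the file is already available as bytes; the function name and the placeholder bytes are illustrative and not part of the node:

```python
import base64


def pdf_bytes_to_base64(pdf_bytes: bytes) -> str:
    """Encode raw PDF bytes as an ASCII base64 string."""
    return base64.b64encode(pdf_bytes).decode("ascii")


# Stand-in bytes for illustration; a real workflow would read an actual PDF file.
sample_pdf = b"%PDF-1.4\n% placeholder content\n"
encoded = pdf_bytes_to_base64(sample_pdf)

# Round-trip check: decoding must give back the original bytes.
assert base64.b64decode(encoded) == sample_pdf
```

Because every PDF begins with `%PDF-`, a correctly encoded file produces a string starting with `JVBERi0`, which makes a quick visual check possible.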

Properties

| Name | Meaning |
|---|---|
| Input Data Type | Choose how to provide the PDF file to extract resources from. Options: Binary Data (from previous node), Base64 String (base64-encoded PDF content), URL (link to PDF file). |
| Input Binary Field | Name of the binary property containing the PDF file (usually "data" for file uploads). Used when Input Data Type is Binary Data. |
| Base64 PDF Content | Base64-encoded PDF document content. Used when Input Data Type is Base64 String. |
| PDF URL | URL of the PDF file to extract resources from. Used when Input Data Type is URL. |
| Document Name | Name of the document used internally during processing. Defaults to "document.pdf". |
| Extract Text | Boolean flag indicating whether to extract text content from the PDF. Default is true. |
| Extract Images | Boolean flag indicating whether to extract images from the PDF. Default is false. |
| Return Images as Binary | Boolean flag indicating whether extracted images should be returned as binary data in addition to JSON metadata. Default is false. |
| Binary Data Name | Name of the binary data property in the output when returning images as binary. Default is "image". Only shown if Return Images as Binary is true. |
| Advanced Options | Collection of additional options (Pages, Custom Profiles). |
| Advanced Options > Pages | Pages to extract resources from. Format examples: "all" (default), "1,2" (specific pages), "2-5" (page range), or combinations like "1-3,5,7". |
| Advanced Options > Custom Profiles | JSON string that adjusts custom properties or profiles for API calls, allowing advanced configuration per https://dev.pdf4me.com/apiv2/documentation/. |
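
The Pages formats listed above can be checked locally before a workflow runs. The sketch below is not the node's or the API's own parser; it is a hypothetical helper (`parse_pages`) that assumes page numbers are 1-based and ranges are inclusive, and that the total page count is known in advance:

```python
def parse_pages(spec: str, page_count: int) -> list[int]:
    """Expand a Pages string such as "1-3,5,7" into sorted 1-based page numbers.

    Assumes 1-based, inclusive ranges; raises ValueError on malformed input.
    """
    spec = spec.strip().lower()
    if spec == "all":
        return list(range(1, page_count + 1))

    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Reject reversed ranges and pages outside the document.
        if start < 1 or end < start or end > page_count:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```

With a 10-page document, `parse_pages("1-3,5,7", 10)` returns `[1, 2, 3, 5, 7]` and `parse_pages("2-5", 10)` returns `[2, 3, 4, 5]`, while a reversed range such as `"5-2"` raises `ValueError`.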

Output

The node outputs an array of items, each containing a json field with the extracted resources:

  • If Extract Text is enabled, the JSON includes the extracted text content from the specified pages.
  • If Extract Images is enabled, the JSON includes metadata about the extracted images.
  • If Return Images as Binary is enabled, the node also outputs the actual image files as binary data under the specified binary property name (default "image").

The exact JSON structure depends on the extraction results; in general it contains fields for the extracted text blocks and for image details.

Dependencies

  • Requires access to the PDF4me API, which performs the resource extraction.
  • API credentials for PDF4me must be configured in n8n so the node can authenticate.
  • Network access is required to reach the API and, when the URL input type is used, for the PDF to be fetched.

Troubleshooting

  • Common Issues:

    • Providing incorrect or missing PDF input data (e.g., wrong binary property name or invalid base64 string) will cause extraction to fail.
    • Specifying invalid page ranges or malformed custom profile JSON may result in errors.
    • Network issues or invalid URLs when using URL input type can prevent fetching the PDF.
  • Error Messages:

    • Errors related to missing input data usually indicate misconfiguration of the input properties.
    • Parsing errors for custom profiles suggest invalid JSON syntax.
    • API authentication failures require checking the configured API key or token.
  • Resolutions:

    • Verify the input data matches the selected input type and property names.
    • Validate JSON syntax for custom profiles before saving.
    • Ensure API credentials are correctly set up and have necessary permissions.
    • Confirm URLs are accessible and point to valid PDF files.
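
Several of these resolutions can be automated as pre-flight checks before the node runs. The following Python sketch is one possible approach under stated assumptions, not the node's own validation: the function names are hypothetical, an empty Custom Profiles value is assumed to be acceptable, and the `%PDF-` header test is a heuristic rather than full PDF validation.

```python
import base64
import binascii
import json
from urllib.parse import urlparse


def check_custom_profiles(text: str) -> list[str]:
    """Return problems found in a Custom Profiles string (empty list if none)."""
    if not text.strip():
        return []  # assumption: the field is optional and may be left empty
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"Custom Profiles is not valid JSON: {exc.msg} at position {exc.pos}"]
    return []


def check_base64_pdf(text: str) -> list[str]:
    """Return problems found in a Base64 PDF Content string."""
    compact = "".join(text.split())  # tolerate line breaks inside the string
    try:
        raw = base64.b64decode(compact, validate=True)
    except (binascii.Error, ValueError):
        return ["Base64 PDF Content is not valid base64"]
    if not raw.startswith(b"%PDF-"):
        return ["decoded content does not start with the %PDF- header"]
    return []


def check_pdf_url(url: str) -> list[str]:
    """Return problems found in a PDF URL (syntax only, no network request)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return ["PDF URL must be an absolute http(s) URL"]
    return []
```

Each function returns a list of messages rather than raising, so the results for all configured inputs can be collected and reported together. The URL check only inspects syntax; whether the URL is reachable and serves a valid PDF still has to be confirmed separately.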
