PDF Vector

Turn complex PDFs, Word documents, or images into clean Markdown text, and search across millions of academic papers, using PDF Vector with OCR support.

Overview

This node integrates with the PDF Vector API to extract structured data from documents such as PDFs, Word files, or images. It supports providing the document either via a URL or by uploading a file in binary form. The extraction is guided by a user-defined prompt and a JSON schema that specifies the expected structure of the extracted data.

Common scenarios where this node is beneficial include:

  • Extracting invoice details (e.g., invoice number, dates, totals, line items) from scanned or digital invoices.
  • Parsing structured information from contracts, reports, or forms.
  • Automating data entry workflows by converting complex documents into clean, structured JSON data.

Practical example: You have a batch of invoices in PDF format accessible via URLs or uploaded files. Using this node, you can specify a prompt like "Extract all invoice details including line items, totals, and dates" along with a JSON schema defining the invoice structure. The node will return the extracted data ready for further processing or storage.
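
As a minimal sketch of the example above, the prompt and schema might look like the following. The field names (invoice_number, issue_date, total, line_items, and so on) are illustrative assumptions, not names required by PDF Vector; only the top-level "type": "object" and "additionalProperties": false reflect the requirements documented for this node.

```python
import json

# Prompt taken from the practical example above.
prompt = "Extract all invoice details including line items, totals, and dates"

# Hypothetical invoice schema; every field name here is illustrative only.
invoice_schema = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "invoice_number": {"type": "string"},
        "issue_date": {"type": "string"},
        "total": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "additionalProperties": False,
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"},
                },
                "required": ["description", "quantity", "unit_price"],
            },
        },
    },
    "required": ["invoice_number", "issue_date", "total", "line_items"],
}

# Serialized text that could be pasted into the node's JSON Schema field.
schema_text = json.dumps(invoice_schema, indent=2)
print(schema_text)
```

Building the schema as a Python dict and serializing it with json.dumps avoids hand-written JSON mistakes such as trailing commas or a capitalized False.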

Properties

  • Input Type: Choose how to provide the document:
    • URL: Provide a URL to the document.
    • File: Upload a PDF, Word document, or image file from the workflow.
  • Document URL: The URL of the document to extract data from. Required if Input Type is URL.
  • Binary Property: The name of the binary property containing the file data. Required if Input Type is File. This should match the binary data property from a previous node.
  • Prompt: Instructions for data extraction (1-2000 characters). For example, "Extract all invoice details including line items, totals, and dates". Guides the extraction process.
  • JSON Schema: A JSON schema defining the structure of the data to extract. Must be a valid JSON object with "type": "object" and "additionalProperties": false. Used to validate and shape the extracted data.
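
The Prompt and JSON Schema constraints listed above can be checked before a workflow runs. The following is a rough sketch, not the node's own validation code; check_extract_inputs is a hypothetical helper name, and the error wording is made up for illustration.

```python
import json


def check_extract_inputs(prompt: str, schema_text: str) -> list[str]:
    """Return a list of problems; an empty list means the inputs look valid."""
    problems = []

    # The Prompt property is documented as 1-2000 characters.
    if not 1 <= len(prompt) <= 2000:
        problems.append("prompt must be 1-2000 characters")

    # The schema must parse as JSON before anything else can be checked.
    try:
        schema = json.loads(schema_text)
    except json.JSONDecodeError as exc:
        return problems + [f"schema is not valid JSON: {exc.msg}"]

    if not isinstance(schema, dict):
        return problems + ["schema must be a JSON object"]

    # Documented requirements: "type": "object", "additionalProperties": false.
    if schema.get("type") != "object":
        problems.append('schema must have "type": "object"')
    if schema.get("additionalProperties") is not False:
        problems.append('schema must have "additionalProperties": false')
    return problems
```

An empty return value means the inputs satisfy the constraints stated in this section. It does not guarantee that the PDF Vector API will accept the schema.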

Output

The node outputs JSON data matching the structure defined by the provided JSON schema. Each item corresponds to one input document processed.

  • The json output contains the extracted structured data fields as specified by the schema.
  • If the extraction fails or encounters errors, the output includes an error object describing the issue.
  • The node does not output binary data directly; it only returns structured JSON results based on the document content.
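
Downstream steps usually need to separate successful extractions from failed ones. The sketch below models output items as plain dicts with a "json" key. It assumes the error object sits under a key named "error", which this page does not confirm, so inspect a real failed item before relying on it. split_results is a hypothetical helper name.

```python
def split_results(items):
    """Separate successful extractions from failed ones.

    Assumes each item looks like {"json": {...}} and that a failed extraction
    carries its details under a hypothetical "error" key.
    """
    succeeded, failed = [], []
    for item in items:
        data = item.get("json", {})
        if "error" in data:
            failed.append(data["error"])
        else:
            succeeded.append(data)
    return succeeded, failed


# Example with made-up items: one success and one failure.
items = [
    {"json": {"invoice_number": "INV-1"}},
    {"json": {"error": {"message": "Bad request"}}},
]
ok, bad = split_results(items)
```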

Dependencies

  • Requires an active connection to the PDF Vector API service.
  • Needs an API key credential configured in n8n for authentication with the PDF Vector API.
  • The node uses HTTP requests to communicate with the external API endpoint at https://www.pdfvector.com/v1/api/extract.
  • For file inputs, the node expects binary data to be available in the specified binary property from previous nodes.
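
To make the dependency on the endpoint concrete, here is a hedged sketch of how a request to it might be assembled outside n8n. Only the endpoint URL comes from this page. Bearer-token authentication and the body field names "url", "prompt", and "schema" are assumptions, so check the PDF Vector API reference before using them. The function only builds the request and does not send it.

```python
import json
import urllib.request

# Endpoint named in the Dependencies list above.
EXTRACT_URL = "https://www.pdfvector.com/v1/api/extract"


def build_extract_request(api_key, document_url, prompt, schema):
    """Build (but do not send) a POST request for the extract endpoint.

    Assumptions: Bearer-token auth and the body field names "url", "prompt",
    and "schema". None of these are confirmed by this page.
    """
    body = json.dumps({"url": document_url, "prompt": prompt, "schema": schema})
    return urllib.request.Request(
        EXTRACT_URL,
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would make a real network call and would presumably consume account credits, so the sketch stops short of that step.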

Troubleshooting

  • No binary data found in property: Ensure the Binary Property value exactly matches the name of the property that holds the file data in the previous node's output.
  • Binary data is empty or invalid: Verify that the file was correctly uploaded or passed through the workflow and is not corrupted.
  • Invalid input type: The input type must be either "url" or "file". Check the node configuration.
  • Invalid JSON schema: The schema must be valid JSON, have "type": "object", and include "additionalProperties": false. Use the recommended JSON Schema Editor to validate.
  • API errors:
    • Invalid API key: Check your API credentials configuration.
    • Insufficient credits: Add more credits to your PDF Vector account.
    • Rate limit exceeded: Wait before making more requests.
    • Bad request: Review the prompt and schema for correctness.
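
The first three checks in the list above can be approximated with a small pre-flight function. This is a sketch under assumptions, not the node's source: it models an n8n item as a plain dict whose binary entries hold base64 text under a "data" key, which matches n8n's in-memory binary mode and may not apply when binary data is stored on disk. preflight is a hypothetical helper, and its messages are modeled on the list above rather than copied from the node.

```python
import base64


def preflight(item, input_type, binary_property="data", document_url=""):
    """Return an error message modeled on the list above, or None if OK."""
    if input_type not in ("url", "file"):
        return 'Invalid input type: must be "url" or "file"'

    if input_type == "url":
        # Document URL is required when Input Type is URL.
        return None if document_url else "Document URL is required"

    # File input: the named binary property must exist on the item.
    entry = item.get("binary", {}).get(binary_property)
    if entry is None:
        return f"No binary data found in property: {binary_property}"

    # The payload must be non-empty, well-formed base64.
    try:
        raw = base64.b64decode(entry.get("data", ""), validate=True)
    except ValueError:
        return "Binary data is empty or invalid"
    return None if raw else "Binary data is empty or invalid"
```

A None result only means the input is present and decodable. It does not confirm that the bytes are a readable PDF, Word document, or image.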
