Overview
This node integrates with the PDF Vector API to extract structured data from documents such as PDFs, Word files, or images. It supports providing the document either via a URL or by uploading a file in binary form. The extraction is guided by a user-defined prompt and a JSON schema that specifies the expected structure of the extracted data.
Common scenarios where this node is beneficial include:
- Extracting invoice details (e.g., invoice number, dates, totals, line items) from scanned or digital invoices.
- Parsing structured information from contracts, reports, or forms.
- Automating data entry workflows by converting complex documents into clean, structured JSON data.
Practical example: You have a batch of invoices in PDF format accessible via URLs or uploaded files. Using this node, you can specify a prompt like "Extract all invoice details including line items, totals, and dates" along with a JSON schema defining the invoice structure. The node will return the extracted data ready for further processing or storage.
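As a rough illustration of the invoice example above, a schema could be drafted as a Python dictionary and serialized to the JSON text expected by the JSON Schema field. The field names (invoice_number, invoice_date, total, line_items and the line item fields) are assumptions chosen for illustration, not names required by the PDF Vector API. The nested object also sets "additionalProperties": false, although the documentation only states that requirement for the top level.

```python
import json

# Hypothetical invoice schema; every field name here is an illustrative assumption.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "invoice_date": {"type": "string"},
        "total": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "amount": {"type": "number"},
                },
                "additionalProperties": False,
            },
        },
    },
    "additionalProperties": False,
}

prompt = "Extract all invoice details including line items, totals, and dates"

# Serialize to the JSON text that would be pasted into the JSON Schema field.
schema_json = json.dumps(invoice_schema, indent=2)
print(schema_json)
```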
Properties
| Name | Meaning |
|---|---|
| Input Type | How the document is provided. URL: provide a URL to the document. File: use a PDF, Word document, or image file passed as binary data from the workflow. |
| Document URL | The URL of the document to extract data from. Required if Input Type is URL. |
| Binary Property | The name of the binary property containing the file data. Required if Input Type is File. This should match the binary data property from a previous node. |
| Prompt | Instructions for data extraction (1-2000 characters). For example, "Extract all invoice details including line items, totals, and dates". Guides the extraction process. |
| JSON Schema | A JSON schema defining the structure of the data to extract. Must be a valid JSON object with "type": "object" and "additionalProperties": false. Used to validate and shape the extracted data. |
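
The constraints listed for Prompt and JSON Schema can be checked before the node runs. The following is a minimal sketch of such a pre-check, not the node's own validation logic; the function name check_extract_inputs and the wording of the messages are hypothetical.

```python
import json


def check_extract_inputs(prompt, schema_text):
    """Return a list of problems; an empty list means the inputs look valid."""
    problems = []
    # The prompt must be between 1 and 2000 characters.
    if not 1 <= len(prompt) <= 2000:
        problems.append("prompt must be 1-2000 characters")
    # The schema must parse as JSON.
    try:
        schema = json.loads(schema_text)
    except json.JSONDecodeError as exc:
        problems.append(f"schema is not valid JSON: {exc.msg}")
        return problems
    # The schema must be a JSON object with the two required settings.
    if not isinstance(schema, dict):
        problems.append("schema must be a JSON object")
        return problems
    if schema.get("type") != "object":
        problems.append('schema must have "type": "object"')
    if schema.get("additionalProperties") is not False:
        problems.append('schema must have "additionalProperties": false')
    return problems


good = '{"type": "object", "properties": {}, "additionalProperties": false}'
print(check_extract_inputs("Extract the invoice number", good))  # → []
```

An empty list means none of the documented constraints were violated; it does not guarantee that the API will accept the schema.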
Output
The node outputs JSON data matching the structure defined by the provided JSON schema. Each item corresponds to one input document processed.
- The json output contains the extracted structured data fields as specified by the schema.
- If the extraction fails or encounters errors, the output includes an error object describing the issue.
- The node does not output binary data directly; it only returns structured JSON results based on the document content.
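Downstream steps may want to separate successful extractions from failed ones. The sketch below assumes that each item is a dictionary with a "json" entry and that a failed extraction is marked by a key named "error"; the documentation only says that an error object is included, so the key name and the sample values are assumptions.

```python
def split_results(items):
    """Separate extracted records from items that carry an error object."""
    extracted, failed = [], []
    for item in items:
        data = item.get("json", {})
        # Assumption: a failed extraction is marked by an "error" key.
        if "error" in data:
            failed.append(data["error"])
        else:
            extracted.append(data)
    return extracted, failed


# Sample items with made-up values, one success and one failure.
items = [
    {"json": {"invoice_number": "INV-001", "total": 120.5}},
    {"json": {"error": {"message": "Bad request"}}},
]
extracted, failed = split_results(items)
print(len(extracted), len(failed))  # → 1 1
```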
Dependencies
- Requires an active connection to the PDF Vector API service.
- Needs an API key credential configured in n8n for authentication with the PDF Vector API.
- The node uses HTTP requests to communicate with the external API endpoint at https://www.pdfvector.com/v1/api/extract.
- For file inputs, the node expects binary data to be available in the specified binary property from previous nodes.
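To make the HTTP dependency concrete, the sketch below builds, but does not send, a request to the endpoint named above for a URL input. Only the endpoint URL comes from this documentation. The POST method, the bearer-token header, and the body field names "url", "prompt", and "schema" are assumptions for illustration; consult the PDF Vector API documentation for the actual request format.

```python
import json
import urllib.request

ENDPOINT = "https://www.pdfvector.com/v1/api/extract"  # from the Dependencies list


def build_extract_request(api_key, document_url, prompt, schema):
    """Build (but do not send) a request for a URL-based extraction."""
    # Assumption: the body field names "url", "prompt", "schema" are hypothetical.
    body = json.dumps({"url": document_url, "prompt": prompt, "schema": schema})
    return urllib.request.Request(
        ENDPOINT,
        data=body.encode("utf-8"),
        headers={
            # Assumption: the API key is sent as a bearer token.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",  # Assumption: extraction is requested with POST.
    )


request = build_extract_request(
    api_key="test-key",  # placeholder, not a real credential
    document_url="https://example.com/invoice.pdf",
    prompt="Extract the invoice number",
    schema={"type": "object", "properties": {}, "additionalProperties": False},
)
print(request.get_method(), request.full_url)
```

Inside n8n the node performs this step itself using the configured credential, so a script like this is only relevant when testing connectivity outside the workflow.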
Troubleshooting
- No binary data found in property: Ensure the Binary Property value exactly matches the name of the binary property that holds the file data in the previous node.
- Binary data is empty or invalid: Verify that the file was correctly uploaded or passed through the workflow and is not corrupted.
- Invalid input type: The input type must be either "url" or "file". Check the node configuration.
- Invalid JSON schema: The schema must be valid JSON, have "type": "object", and include "additionalProperties": false. Use the recommended JSON Schema Editor to validate.
- API errors:
  - Invalid API key: Check your API credentials configuration.
  - Insufficient credits: Add more credits to your PDF Vector account.
  - Rate limit exceeded: Wait before making more requests.
  - Bad request: Review the prompt and schema for correctness.
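For the rate limit case, "wait before making more requests" can be automated with exponential backoff when calling the API from a script. This is a generic sketch, not behavior built into the node; RateLimitError and call_with_backoff are hypothetical names, and the caller is assumed to raise RateLimitError when the API reports that the limit was exceeded.

```python
import time


class RateLimitError(Exception):
    """Hypothetical exception raised by the caller on a rate limit response."""


def call_with_backoff(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry `call` on RateLimitError, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            sleep(base_delay * 2 ** attempt)


# Demonstration with a stand-in call that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_extract():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("rate limit exceeded")
    return {"status": "ok"}


waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky_extract, sleep=waits.append)
print(result, waits)  # → {'status': 'ok'} [1.0, 2.0]
```

Passing the sleep function as a parameter keeps the demonstration instant; in real use the default time.sleep applies.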
Links and References
- PDF Vector API Documentation
- JSON Schema Editor — Tool to create and validate JSON schemas for data extraction