Overview
This node parses documents such as PDFs, Word documents, and images (JPG, PNG) into clean, structured text using the PDF Vector API. It accepts a document either by URL or as a binary file passed through the workflow. It can optionally use Large Language Model (LLM) parsing to improve extraction quality, with a configurable usage mode.
This node is useful when you need to extract readable text or structured data from complex document formats for further processing, analysis, or search indexing. For example:
- Automatically converting academic papers or reports into Markdown text.
- Extracting invoice details or other structured information from scanned documents.
- Parsing resumes or contracts uploaded as files or referenced by URLs.
Properties
| Name | Meaning |
|---|---|
| Input Type | Choose how to provide the document: URL (provide a link to the document) or File (upload a file from the workflow). |
| Document URL | The URL of the document or image to parse (supports PDF, Word, JPG, PNG). Required if Input Type is URL. |
| Binary Property | The name of the binary property containing the file data from the previous node. Required if Input Type is File. Default is "data". |
| Use LLM | Determines the approach for using Large Language Model parsing: Auto (automatic decision), Always (always use LLM), or Never (do not use LLM). |
Output
The node outputs JSON data representing the parsed content of the document. The exact structure depends on the document and the parsing performed by the external API but generally includes clean text extracted from the input document.
If the input was a file, the node reads the binary data, encodes it in base64, and sends it to the API; the output JSON contains the parsed textual content.
No binary output is produced by this node.
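The two input paths described above can be sketched as a small request-body builder. This is only an illustration under stated assumptions: the field names `url`, `file`, and `useLLM`, the lowercase mode values, and the function `build_parse_payload` are hypothetical and are not taken from the PDF Vector API documentation or the node's source.

```python
import base64

VALID_INPUT_TYPES = ("url", "file")
VALID_LLM_MODES = ("auto", "always", "never")


def build_parse_payload(input_type, *, url=None, binary=None,
                        binary_property="data", use_llm="auto"):
    """Build a JSON-serializable body for a parse request.

    `binary` stands in for the previous node's binary output: a dict that
    maps property names to raw bytes. All key names are assumptions.
    """
    if input_type not in VALID_INPUT_TYPES:
        raise ValueError(f"Invalid input type: {input_type!r}")
    if use_llm not in VALID_LLM_MODES:
        raise ValueError(f"Invalid Use LLM mode: {use_llm!r}")

    payload = {"useLLM": use_llm}  # hypothetical field name
    if input_type == "url":
        if not url:
            raise ValueError("Document URL is required when Input Type is URL")
        payload["url"] = url
    else:
        data = (binary or {}).get(binary_property)
        if not data:
            raise ValueError(
                f"No binary data found in property {binary_property!r}"
            )
        # File input: the raw bytes are base64-encoded before being sent.
        payload["file"] = base64.b64encode(data).decode("ascii")
    return payload


if __name__ == "__main__":
    print(build_parse_payload("url", url="https://example.com/report.pdf"))
    print(build_parse_payload("file", binary={"data": b"%PDF-1.7"}))
```

Validating the input type and the binary property before any request is made means a misconfigured workflow fails with a clear message rather than an API error.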
Dependencies
- Requires an active connection to the PDF Vector API service.
- Requires an API key credential configured in n8n for authenticating requests to the PDF Vector API.
- The node uses HTTP POST requests to endpoints like `/parse` on the PDF Vector API.
- If using file input, the node expects binary data from a previous node in the workflow.
Troubleshooting
- No binary data found: If the node is set to use file input but the specified binary property does not exist or is empty, it will throw an error indicating missing or invalid binary data. Ensure the previous node outputs the file correctly under the specified binary property name.
- Invalid input type: The node only accepts `"url"` or `"file"` as input types. Any other value will cause an error.
- API errors: Common HTTP errors returned by the API include:
  - 401 Unauthorized: Invalid API key. Verify your API credentials.
  - 402 Payment Required: Insufficient credits in your PDF Vector account. Add more credits to continue.
  - 429 Too Many Requests: Rate limit exceeded. Wait before sending more requests.
  - 400 Bad Request: Usually caused by malformed input or schema issues. Check the input parameters carefully.
- JSON Schema errors (for extract operation): The schema must be valid JSON, have `"type": "object"`, and include `"additionalProperties": false`. Otherwise, the node throws a validation error.
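The checks listed above can be reproduced outside the node when debugging a workflow. The following is a minimal sketch, not the node's actual implementation: the helper names `explain_status` and `validate_extract_schema` are hypothetical, and the hint texts simply restate the troubleshooting entries.

```python
import json

# Hints restating the API error entries in the troubleshooting list.
STATUS_HINTS = {
    400: "Bad Request: check the input parameters and schema.",
    401: "Unauthorized: verify your API credentials.",
    402: "Payment Required: add credits to your PDF Vector account.",
    429: "Too Many Requests: wait before sending more requests.",
}


def explain_status(status_code):
    """Return a troubleshooting hint for an HTTP status code."""
    return STATUS_HINTS.get(
        status_code, f"Unexpected HTTP status {status_code}."
    )


def validate_extract_schema(schema_text):
    """Apply the three schema rules and return the parsed schema."""
    try:
        schema = json.loads(schema_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Schema is not valid JSON: {exc}") from exc
    if not isinstance(schema, dict) or schema.get("type") != "object":
        raise ValueError('Schema must have "type": "object"')
    # The value must be the literal false, not merely absent or falsy.
    if schema.get("additionalProperties") is not False:
        raise ValueError('Schema must include "additionalProperties": false')
    return schema


if __name__ == "__main__":
    good = (
        '{"type": "object", '
        '"properties": {"total": {"type": "number"}}, '
        '"additionalProperties": false}'
    )
    print(validate_extract_schema(good)["type"])
    print(explain_status(429))
```

Running the schema through a check like this before saving it in the node can separate schema mistakes from the other causes of a 400 response.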
Links and References
- PDF Vector API Documentation
- JSON Schema Editor (useful for creating schemas when extracting structured data)