Actions (80)
- Add Attachment To PDF
- Add Barcode To PDF
- Add Form Fields To PDF
- Add HTML Header Footer
- Add Image Stamp To PDF
- Add Image Watermark To Image
- Add Margin To PDF
- Add Page Number To PDF
- Add Text Stamp To PDF
- Add Text Watermark To Image
- AI-Invoice Parser
- AI-Process Contract
- AI-Process HealthCard
- Classify Document
- Compress Image
- Compress PDF
- Convert HTML To PDF
- Convert Image Format
- Convert JSON To Excel
- Convert Markdown To PDF
- Convert PDF To Editable PDF Using OCR
- Convert PDF To Excel
- Convert PDF To PowerPoint
- Convert PDF To Word
- Convert To PDF
- Convert URL to PDF
- Convert VISIO
- Convert Word to PDF Form
- Create Images From PDF
- Create PDF/A
- Create Swiss QR Bill
- Crop Image
- Delete Blank Pages From PDF
- Delete Unwanted Pages From PDF
- Split PDF By Barcode
- Disable Tracking Changes In Word
- Enable Tracking Changes In Word
- Extract Attachment From PDF
- Extract Form Data From PDF
- Extract Pages From PDF
- Extract Resources
- Extract Table From PDF
- Extract Text By Expression
- Extract Text From Word
- Fill PDF Form
- Find And Replace Text
- Flip Image
- Flatten PDF
- Generate Barcode
- Generate Document Single
- Generate Documents Multiple
- Get Document From Pdf4me
- Get Image Metadata
- Get PDF Metadata
- Split PDF By Swiss QR
- Get Tracking Changes In Word
- Image Extract Text
- Linearize PDF
- Merge Multiple PDFs
- Overlay PDFs
- Parse Document
- Protect PDF
- Read Barcode From Image
- Read Barcode From PDF
- Read SwissQR Code
- Remove EXIF Tags From Image
- Repair PDF Document
- Replace Text With Image
- Replace Text With Image In Word
- Resize Image
- Rotate Document
- Rotate Image
- Rotate Image By EXIF Data
- Rotate PDF Page
- Sign PDF
- Split PDF By Text
- Split PDF Regular
- Unlock PDF
- Update Hyperlinks Annotation
- Upload File To PDF4me
Overview
The "Extract Text By Expression" operation extracts text from a PDF document by matching a regular expression against its content. The PDF can be supplied in three ways: as binary data from a previous node, as a base64-encoded string, or as a URL pointing to the file. You can restrict processing to specific pages and pass advanced options to fine-tune the extraction.
Common scenarios include:
- Extracting specific data such as percentages, email addresses, or codes embedded in PDFs.
- Automating data retrieval from invoices, reports, or contracts where certain patterns are known.
- Processing multi-page PDFs selectively to optimize performance.
Practical examples:
- Extract all email addresses from pages 1 to 3 of an uploaded invoice PDF.
- Retrieve all percentage values from a PDF report accessed via URL.
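The two examples above depend on a suitable pattern for the Expression property. The following is a minimal sketch of patterns that could serve as a starting point, tried locally with Python's `re` module before being pasted into the node. The pattern strings, the helper `find_matches`, and the sample text are illustrative assumptions, and the regex dialect used by the processing service may differ from Python's.

```python
import re
from typing import List

# Illustrative patterns only; the service's regex dialect may differ from Python's.
EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
PERCENT_PATTERN = r"\d+(?:\.\d+)?\s?%"


def find_matches(pattern: str, text: str) -> List[str]:
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)


# Hypothetical text standing in for content extracted from a PDF page.
sample = "Contact billing@example.com or sales@example.org. VAT 7.7%, discount 10 %."

emails = find_matches(EMAIL_PATTERN, sample)
percentages = find_matches(PERCENT_PATTERN, sample)
print(emails)
print(percentages)
```

Testing a pattern against representative text first makes it easier to tell a pattern problem apart from a page-selection or input problem later on.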
Properties
| Name | Meaning |
|---|---|
| Input Data Type | Choose how to provide the PDF file to extract text from. Options: Binary Data (from previous node), Base64 String (direct content), URL (link to PDF file). |
| Input Binary Field | Name of the binary property containing the PDF file when using Binary Data input type. Usually "data" for file uploads. |
| Base64 PDF Content | Base64 encoded string representing the PDF document content. Used when Input Data Type is Base64 String. |
| PDF URL | URL to the PDF file to extract text from. Used when Input Data Type is URL. |
| Document Name | Name of the document used internally during processing. Defaults to "document.pdf". |
| Expression | Regular expression pattern to search for within the PDF text. For example, %, US, or a literal such as email@example.com. Note that an unescaped "." in a regular expression matches any character. |
| Page Sequence | Specifies which pages to process. Examples: "1-" for all pages starting from page 1, "1,2,3" for specific pages, "1-5" for a range of pages. |
| Advanced Options | Collection of additional settings. Includes "Custom Profiles", where JSON can be provided to adjust API call properties as described in the PDF4me API documentation. |
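To make the Page Sequence formats in the table concrete, here is a rough sketch of how such a string could be expanded into page numbers. The function `expand_page_sequence` is hypothetical and is not part of the node. It assumes that pages are 1-based and that an open range such as "1-" runs to the last page; the service may interpret edge cases differently.

```python
from typing import List


def expand_page_sequence(sequence: str, total_pages: int) -> List[int]:
    """Expand a sequence such as "1-", "1,2,3" or "1-5" into sorted page numbers.

    Assumption: pages are 1-based and an open range ("3-") runs to the last page.
    """
    pages = set()
    for part in sequence.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in page sequence: {sequence!r}")
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text)  # raises ValueError for input like "-5"
            end = int(end_text) if end_text.strip() else total_pages
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Clamp the upper bound to the number of pages in the document.
        pages.update(range(start, min(end, total_pages) + 1))
    return sorted(pages)


print(expand_page_sequence("1-", 4))
print(expand_page_sequence("1,2,3", 10))
print(expand_page_sequence("1-5", 3))
```

Checking a sequence this way before a run can catch typos such as reversed ranges, which would otherwise only show up as unexpected results.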
Output
The output is an array of JSON objects, one per input item, in input order. Each object contains the text snippets on the specified pages that matched the regular expression.
No binary data output is produced by this operation; it focuses solely on textual extraction.
Dependencies
- Requires access to the PDF4me API, the external PDF processing service that performs the regular-expression extraction.
- The node expects proper configuration of API credentials (e.g., an API key) in n8n to authenticate requests to the PDF processing service.
- Network access is needed if the PDF is provided via URL or if the API is cloud-based.
Troubleshooting
Common issues:
- Invalid or malformed regular expressions may produce no matches or cause errors.
- Incorrect page sequence format might lead to unexpected results or failure.
- Providing an invalid URL or inaccessible PDF will result in errors.
- Missing or incorrect binary property name when using binary input will cause the node to fail to find the PDF data.
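Several of the issues listed above can be caught locally before the node runs. The sketch below shows one possible pre-flight check. The function `preflight_errors` and its messages are hypothetical and are not the node's own validation. It also assumes that a valid PDF starts with the `%PDF-` header, which holds for typical files.

```python
import base64
import binascii
import re
from typing import List
from urllib.parse import urlparse


def preflight_errors(expression: str, pdf_url: str = "", base64_pdf: str = "") -> List[str]:
    """Collect problems that can be detected locally before the node runs."""
    errors = []

    # The expression must at least compile (Python dialect, used as an approximation).
    try:
        re.compile(expression)
    except re.error as exc:
        errors.append(f"Invalid regular expression: {exc}")

    # A URL input should be an absolute http(s) address.
    if pdf_url:
        parsed = urlparse(pdf_url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            errors.append("PDF URL must be an absolute http(s) URL")

    # A base64 input should decode cleanly and look like a PDF.
    if base64_pdf:
        try:
            raw = base64.b64decode(base64_pdf, validate=True)
        except (binascii.Error, ValueError):
            errors.append("Base64 PDF Content is not valid base64")
        else:
            if not raw.startswith(b"%PDF-"):
                errors.append("Decoded content does not start with the %PDF- header")

    return errors


print(preflight_errors("(unclosed"))
print(preflight_errors("%", pdf_url="ftp://example.com/report.pdf"))
```

An empty list means none of these local checks failed. It does not guarantee that the service will accept the request, since authentication and reachability can only be verified by the call itself.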
Error messages and resolutions:
- "Failed to fetch PDF from URL": Check the URL accessibility and correctness.
- "Invalid regular expression": Verify the syntax of the expression.
- "Binary property not found": Ensure the binary field name matches the actual property in the input data.
- API authentication errors: Confirm that the API key or credential is correctly set up in n8n.
Links and References
- PDF4me API Documentation — for details on custom profiles and advanced options.
- Regular expression tutorials for crafting effective patterns: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions