PDF4me

Comprehensive PDF and document processing: generate barcodes, convert files, extract data, manipulate images, and automate workflows with the PDF4me API.

Actions: 80

Overview

This node operation extracts text content from a Word document (.docx). It supports multiple input methods for the Word file, including binary data from a previous node, a base64-encoded string, or a URL pointing to the Word file. Users can specify page ranges and apply extraction options such as removing comments, headers/footers, and accepting tracked changes. This functionality is useful for automating document processing workflows where textual data needs to be extracted from Word files for further analysis, indexing, or integration with other systems.

Practical examples:

Extracting contract text from uploaded Word documents to feed into a legal review system.
Processing scanned reports converted to Word format by extracting clean text without comments or headers.
Automating data extraction from Word templates hosted online via URLs.

Properties

Name	Meaning
Input Data Type	Method to provide the Word file: "Binary Data" (from previous node), "Base64 String" (base64 encoded content), or "URL" (link to the Word file).
Input Binary Field	Name of the binary property containing the Word file when using "Binary Data" input type (commonly "data").
Base64 Word Content	Base64 encoded string of the Word document content, used if "Base64 String" input type is selected.
Word URL	URL to the Word file, used if "URL" input type is selected.
Document Name	The name assigned to the document during processing (default "document.docx").
Start Page Number	Starting page number for text extraction (default 1).
End Page Number	Ending page number for text extraction (default 3).
Extraction Options	Collection of boolean options controlling extraction behavior: • Remove Comments — whether to exclude comments from extracted text. • Remove Header/Footer — whether to exclude headers and footers. • Accept Changes — whether to accept tracked changes in the document.
Output Binary Field Name	Name of the binary property where the output file will be stored (default "data").
Advanced Options	Custom JSON profiles for advanced API call configurations, allowing fine-tuning of extraction parameters according to external API documentation.
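
The way these properties depend on each other can be sketched as a small pre-flight check. This is only an illustration under stated assumptions: the class `ExtractionSettings`, the function `validate_settings`, the input type identifiers, and the `False` defaults for the boolean options are all hypothetical and are not taken from the node's source or the PDF4me API. Only the defaults for the binary field, document name, and page numbers come from the table above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical identifiers for the three documented input methods.
INPUT_TYPES = {"binaryData", "base64", "url"}


@dataclass
class ExtractionSettings:
    """Illustrative mirror of the node properties (field names are assumptions)."""
    input_data_type: str
    binary_field: str = "data"               # Input Binary Field
    base64_content: Optional[str] = None     # Base64 Word Content
    word_url: Optional[str] = None           # Word URL
    document_name: str = "document.docx"     # Document Name
    start_page: int = 1                      # Start Page Number
    end_page: int = 3                        # End Page Number
    remove_comments: bool = False            # assumed default
    remove_header_footer: bool = False       # assumed default
    accept_changes: bool = False             # assumed default


def validate_settings(s: ExtractionSettings) -> List[str]:
    """Return a list of problems; an empty list means the settings are consistent."""
    problems: List[str] = []
    if s.input_data_type not in INPUT_TYPES:
        problems.append(f"unknown input data type: {s.input_data_type!r}")
    if s.input_data_type == "binaryData" and not s.binary_field:
        problems.append("binary input selected but no binary field name given")
    if s.input_data_type == "base64" and not s.base64_content:
        problems.append("base64 input selected but no content given")
    if s.input_data_type == "url" and not (s.word_url or "").startswith(
        ("http://", "https://")
    ):
        problems.append("URL input selected but no http(s) URL given")
    if s.start_page < 1:
        problems.append("start page must be 1 or greater")
    if s.end_page < s.start_page:
        problems.append("end page must not be smaller than start page")
    return problems
```

A check like this only catches inconsistent settings. Whether the page range fits the actual document can only be determined by the service once it has opened the file.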

Output

The node outputs an array of items. Each item has a json field containing the text extracted from the specified pages of the Word document. An item may also include a binary property, named by "Output Binary Field Name", that holds the processed document or the extracted content as a file suitable for downstream nodes.
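
As a rough sketch of how a downstream step might consume this output, the helpers below walk a list of items shaped like n8n items, with a `json` object and an optional `binary` object. The key name `text` for the extracted content is an assumption, since the exact field name is not stated here. The functions `collect_text` and `has_binary` are hypothetical helpers, not part of the node.

```python
from typing import Any, Dict, List


def collect_text(items: List[Dict[str, Any]], text_key: str = "text") -> str:
    """Join the extracted text of every item that carries a non-empty string.

    `text_key` is an assumed field name; adjust it to match the real output.
    """
    parts: List[str] = []
    for item in items:
        value = item.get("json", {}).get(text_key)
        if isinstance(value, str) and value.strip():
            parts.append(value.strip())
    return "\n\n".join(parts)


def has_binary(item: Dict[str, Any], field_name: str = "data") -> bool:
    """Check whether an item carries a binary property under the given name."""
    return field_name in item.get("binary", {})
```

Checking for the binary property before using it matters because the binary output is optional.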

Dependencies

Requires access to the PDF4me API, the external service that performs the text extraction.
Requires PDF4me API authentication configured in n8n credentials (e.g., an API key or token).
Requires network access to fetch the Word file if the URL input method is used.

Troubleshooting

Common issues:
- Incorrect input data type selection leading to missing or invalid file input.
- Invalid or inaccessible URL when using the URL input method.
- Page range values outside the actual document page count.
- API authentication failures or quota limits reached.

Error messages and resolutions:
- "File not found or inaccessible" — Verify the URL or binary input contains valid Word file data.
- "Invalid base64 content" — Ensure the base64 string is correctly encoded and complete.
- "Authentication failed" — Check API credentials and permissions.
- "Page number out of range" — Adjust start/end page numbers within the document's page count.
- "Extraction failed due to unsupported document format" — Confirm the input file is a valid Word document (.docx).
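
Two of the errors above, invalid base64 content and an unsupported document format, can often be caught locally before the file is sent. The sketch below is a heuristic, not the service's own validation. It relies on the fact that a .docx file is a ZIP package and assumes the main document part is stored at `word/document.xml`, which is where Word writes it. The function names are hypothetical.

```python
import base64
import binascii
import io
import zipfile


def decode_base64_strict(content: str) -> bytes:
    """Decode base64 and raise ValueError on stray characters or bad padding.

    With validate=True, line breaks inside the string are also rejected,
    so wrapped base64 must be unwrapped first.
    """
    try:
        return base64.b64decode(content.strip(), validate=True)
    except binascii.Error as exc:
        raise ValueError(f"Invalid base64 content: {exc}") from exc


def looks_like_docx(data: bytes) -> bool:
    """Heuristic: the bytes form a ZIP archive containing word/document.xml."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return "word/document.xml" in archive.namelist()
    except zipfile.BadZipFile:
        return False
```

A legacy .doc file or a PDF renamed to .docx fails the second check, which points to the format as the likely cause rather than the credentials or the page range.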

Links and References

PDF4me API Documentation — For details on custom profiles and advanced options.
Microsoft Word File Format Overview — Background on Word document structure.
Base64 Encoding Reference — Explanation of base64 encoding used for input content.