## Overview
This node crawls a website starting from a specified root URL, systematically following links up to a defined depth and breadth and extracting content and metadata according to user-defined options. This is useful for website auditing, data scraping, SEO analysis, and content aggregation.
For example, you can use this node to crawl a blog homepage and extract all article URLs and their summaries, or to audit a website’s structure by limiting crawl depth and breadth.
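To make the depth/breadth semantics concrete, here is a minimal TypeScript sketch of a breadth-first crawl with the same kinds of limits. It is illustrative only and does not reflect the node's actual implementation; the `crawl` function, its naive href regex, and its parameter names are assumptions.

```typescript
// Minimal sketch of a depth/breadth-limited breadth-first crawl.
// Illustrative only: the node's real implementation is not shown in this document.
type CrawlResult = { url: string; depth: number; html: string };

async function crawl(
  rootUrl: string,
  { maxDepth, maxBreadth, limit }: { maxDepth: number; maxBreadth: number; limit: number },
): Promise<CrawlResult[]> {
  const seen = new Set<string>([rootUrl]);
  const results: CrawlResult[] = [];
  let frontier = [rootUrl];

  // Depth 0 is the root page itself; each pass follows one more level of links.
  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      if (results.length >= limit) return results; // Limit caps total pages returned
      const html = await (await fetch(url)).text();
      results.push({ url, depth, html });
      // Naive href extraction; queue at most maxBreadth new links per level.
      for (const [, link] of html.matchAll(/href="(https?:\/\/[^"]+)"/g)) {
        if (next.length >= maxBreadth) break;
        if (!seen.has(link)) {
          seen.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return results;
}
```

For example, `crawl("https://example.com", { maxDepth: 2, maxBreadth: 10, limit: 50 })` would visit at most 50 pages across the root page and two levels of links.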
## Properties
| Name | Meaning |
|---|---|
| URL | The root URL where the crawler begins its operation. |
| Options | A collection of optional parameters to customize the crawl (an example configuration follows this table): |
| - Instructions | Natural language instructions guiding the crawler's behavior. |
| - Max Depth | Maximum number of link levels the crawler follows from the root URL (minimum 1). |
| - Max Breadth | Maximum number of links to follow per level during the crawl (minimum 1). |
| - Limit | Maximum total number of results (pages) to return from the crawl (minimum 1). |
| - Select Paths | Regex patterns; only URLs whose paths match at least one pattern are crawled. |
| - Select Domains | Regex patterns restricting the crawl to matching domains or subdomains. |
| - Exclude Paths | Regex patterns; URLs whose paths match any pattern are skipped. |
| - Exclude Domains | Regex patterns excluding matching domains or subdomains from the crawl. |
| - Allow External | Boolean flag indicating whether to follow links that lead to external domains beyond the root domain. |
| - Include Images | Boolean flag indicating whether images found during the crawl should be included in the results. |
| - Extract Depth | Level of detail for content extraction: "Basic" or "Advanced". |
| - Format | Format of the extracted page content: either "Markdown" or plain "Text". |
| - Include Favicon | Boolean flag indicating whether to include the favicon URL for each crawled page in the results. |
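For instance, an options set like the following would restrict a crawl to blog posts on the root domain. The key names here are hypothetical illustrations; the node's actual parameter names are those listed in the table above.

```typescript
// Hypothetical options object; the key names are assumptions for illustration.
const options = {
  maxDepth: 2,
  maxBreadth: 20,
  limit: 50,
  selectPaths: ["^/blog/.*"], // only follow URLs whose path starts with /blog/
  excludePaths: ["^/blog/tag/.*"], // skip tag archive pages
  allowExternal: false, // stay on the root domain
  format: "Markdown",
  includeFavicon: true,
};
```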
## Output
The node outputs JSON data representing the crawl results. Each item typically includes:
- The URL of the crawled page.
- Extracted content in the selected format (Markdown or Text).
- Metadata such as the page title, plus the favicon URL if requested.
- Images, if Include Images is enabled.
- Crawl depth and link relationships, depending on the selected extraction depth.
When images are included, they appear within the JSON output rather than as binary data.
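The exact field names depend on the implementation, but each result item can be pictured roughly as follows; all names and values here are assumptions for illustration.

```typescript
// Hypothetical shape of a single crawl result; field names are assumptions.
interface CrawlItem {
  url: string;
  depth: number; // how many link levels from the root URL
  content: string; // Markdown or plain text, per the Format option
  title?: string;
  favicon?: string; // only present when Include Favicon is enabled
  images?: string[]; // image URLs, only present when Include Images is enabled
}

const example: CrawlItem = {
  url: "https://example.com/blog/post-1",
  depth: 1,
  content: "# Post 1\n\nA short summary of the article…",
  title: "Post 1",
  favicon: "https://example.com/favicon.ico",
};
```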
## Dependencies
- Requires internet access to perform HTTP requests to target URLs.
- May require an API key or authentication token when crawling protected or rate-limited sites (not explicitly shown in the code).
- No internal credential names or environment variables are exposed in the provided code snippet.
## Troubleshooting
Common issues:
- An invalid or unreachable root URL, causing the crawl to fail.
- Overly restrictive regex patterns resulting in no URLs being crawled.
- Max depth or breadth set too high, causing long execution times or timeouts.
- External links being skipped because "Allow External" is false, which can yield fewer results than expected.
Error messages:
- Network errors when the root URL or linked pages cannot be reached.
- Validation errors if required properties like URL are missing or malformed.
- Rate limiting or blocking by target websites when the crawl is too aggressive.
Resolutions:
- Verify the root URL is correct and accessible.
- Adjust regex filters to ensure they match the intended URLs (a quick way to test patterns is sketched below).
- Lower max depth/breadth or add delays between requests if supported.
- Enable "Allow External" if external domains need to be crawled.