## Overview
This node crawls a website starting from a specified root URL, systematically following links up to a defined depth and breadth and extracting content and metadata according to user-defined options. This is useful for website auditing, data scraping, SEO analysis, and content aggregation.
For example, you can use this node to crawl a blog homepage and extract all article URLs and their summaries, or to audit a website’s structure by limiting crawl depth and breadth.
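To make the depth/breadth semantics concrete, here is a minimal TypeScript sketch of a breadth-first crawl with the same kinds of limits. It is illustrative only and does not reflect the node's actual implementation; the `crawl` function, its naive href regex, and its parameter names are assumptions.

```typescript
// Minimal sketch of a depth/breadth-limited breadth-first crawl.
// Illustrative only: the node's real implementation is not shown in this document.
type CrawlResult = { url: string; depth: number; html: string };

async function crawl(
  rootUrl: string,
  { maxDepth, maxBreadth, limit }: { maxDepth: number; maxBreadth: number; limit: number },
): Promise<CrawlResult[]> {
  const seen = new Set<string>([rootUrl]);
  const results: CrawlResult[] = [];
  let frontier = [rootUrl];

  // Depth 0 is the root page itself; each pass follows one more level of links.
  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      if (results.length >= limit) return results; // Limit caps total pages returned
      const html = await (await fetch(url)).text();
      results.push({ url, depth, html });
      // Naive href extraction; queue at most maxBreadth new links per level.
      for (const [, link] of html.matchAll(/href="(https?:\/\/[^"]+)"/g)) {
        if (next.length >= maxBreadth) break;
        if (!seen.has(link)) {
          seen.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return results;
}
```

For example, `crawl("https://example.com", { maxDepth: 2, maxBreadth: 10, limit: 50 })` would visit at most 50 pages across the root page and two levels of links.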
## Properties
| Name | Meaning |
|---|---|
| URL | The root URL where the crawler begins its operation. |
| Options | A collection of optional parameters to customize the crawl (an example configuration follows this table): |
| - Instructions | Natural language instructions guiding the crawler's behavior. |
| - Max Depth | Maximum number of link levels the crawler follows from the root URL (minimum 1). |
| - Max Breadth | Maximum number of links to follow per level during the crawl (minimum 1). |
| - Limit | Maximum total number of results (pages) to return from the crawl (minimum 1). |
| - Select Paths | Regex patterns; only URLs whose paths match at least one pattern are crawled. |
| - Select Domains | Regex patterns restricting the crawl to matching domains or subdomains. |
| - Exclude Paths | Regex patterns; URLs whose paths match any pattern are skipped. |
| - Exclude Domains | Regex patterns excluding matching domains or subdomains from the crawl. |
| - Allow External | Boolean flag indicating whether to follow links that lead to external domains beyond the root domain. |
| - Include Images | Boolean flag indicating whether images found during the crawl should be included in the results. |
| - Extract Depth | Level of detail for content extraction: "Basic" or "Advanced". |
| - Format | Format of the extracted page content: either "Markdown" or plain "Text". |
| - Include Favicon | Boolean flag indicating whether to include the favicon URL for each crawled page in the results. |
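For instance, an options set like the following would restrict a crawl to blog posts on the root domain. The key names here are hypothetical illustrations; the node's actual parameter names are those listed in the table above.

```typescript
// Hypothetical options object; the key names are assumptions for illustration.
const options = {
  maxDepth: 2,
  maxBreadth: 20,
  limit: 50,
  selectPaths: ["^/blog/.*"], // only follow URLs whose path starts with /blog/
  excludePaths: ["^/blog/tag/.*"], // skip tag archive pages
  allowExternal: false, // stay on the root domain
  format: "Markdown",
  includeFavicon: true,
};
```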
## Output
The node outputs JSON data representing the crawl results. Each item typically includes:
- The URL of the crawled page.
- Extracted content in the selected format (Markdown or Text).
- Metadata such as the page title, plus the favicon URL if requested.
- Images, if Include Images is enabled.
- Crawl depth and link relationships, depending on the selected extraction depth.
When images are included, they appear within the JSON output rather than as binary data.
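The exact field names depend on the implementation, but each result item can be pictured roughly as follows; all names and values here are assumptions for illustration.

```typescript
// Hypothetical shape of a single crawl result; field names are assumptions.
interface CrawlItem {
  url: string;
  depth: number; // how many link levels from the root URL
  content: string; // Markdown or plain text, per the Format option
  title?: string;
  favicon?: string; // only present when Include Favicon is enabled
  images?: string[]; // image URLs, only present when Include Images is enabled
}

const example: CrawlItem = {
  url: "https://example.com/blog/post-1",
  depth: 1,
  content: "# Post 1\n\nA short summary of the article…",
  title: "Post 1",
  favicon: "https://example.com/favicon.ico",
};
```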
## Dependencies
- Requires internet access to perform HTTP requests to target URLs.
- May require an API key or authentication token when crawling protected or rate-limited sites (not explicitly shown in the code).
- No internal credential names or environment variables are exposed in the provided code snippet.
## Troubleshooting
Common issues:
- An invalid or unreachable root URL, causing the crawl to fail.
- Overly restrictive regex patterns resulting in no URLs being crawled.
- Max depth or breadth set too high, causing long execution times or timeouts.
- External links being skipped because "Allow External" is false, which can yield fewer results than expected.
Error messages:
- Network errors when the root URL or linked pages cannot be reached.
- Validation errors if required properties like URL are missing or malformed.
- Rate limiting or blocking by target websites when the crawl is too aggressive.
Resolutions:
- Verify the root URL is correct and accessible.
- Adjust regex filters to ensure they match the intended URLs (a quick way to test patterns is sketched below).
- Lower max depth/breadth or add delays between requests if supported.
- Enable "Allow External" if external domains need to be crawled.