Tavily

Tavily API

Overview

This node performs a web crawling operation starting from a specified root URL. It systematically follows links on the website up to a defined depth and breadth, extracting content and metadata according to user-defined options. This is useful for scenarios such as website auditing, data scraping, SEO analysis, or content aggregation.

For example, you can use this node to crawl a blog homepage and extract all article URLs and their summaries, or to audit a website’s structure by limiting crawl depth and breadth.
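To make the operation concrete, here is a minimal sketch of an equivalent crawl issued directly against Tavily's REST API. The `https://api.tavily.com/crawl` endpoint, the `TAVILY_API_KEY` environment variable name, and the snake_case parameter names are assumptions based on Tavily's public API documentation, not taken from this node's code:

```python
import os

import requests

# Hedged sketch of the crawl this node performs, calling Tavily's REST API
# directly. Endpoint, auth scheme, and field names are assumptions.
api_key = os.environ["TAVILY_API_KEY"]  # assumed environment variable name

resp = requests.post(
    "https://api.tavily.com/crawl",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "url": "https://example.com/blog",  # root URL to start from
        "max_depth": 2,     # follow links up to two levels deep
        "max_breadth": 20,  # at most 20 links per level
        "limit": 50,        # return at most 50 pages
    },
    timeout=120,
)
resp.raise_for_status()
for page in resp.json().get("results", []):
    print(page.get("url"))
```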

Properties

URL: The root URL where the crawler begins its operation.
Options: A collection of optional parameters to customize the crawl:
- Instructions: Natural-language instructions that guide the crawler's behavior.
- Max Depth: Maximum number of link levels the crawler follows from the root URL (minimum 1).
- Max Breadth: Maximum number of links to follow per level during the crawl (minimum 1).
- Limit: Maximum total number of results (pages) to return from the crawl (minimum 1).
- Select Paths: Regex patterns; only URLs whose paths match are crawled.
- Select Domains: Regex patterns that restrict crawling to specific domains or subdomains.
- Exclude Paths: Regex patterns; URLs whose paths match are excluded from the crawl.
- Exclude Domains: Regex patterns that exclude specific domains or subdomains from crawling.
- Allow External: Whether to follow links that lead to external domains beyond the root domain.
- Include Images: Whether images found during the crawl are included in the results.
- Extract Depth: Level of detail for content extraction: "Basic" or "Advanced".
- Format: Format of the extracted page content: "Markdown" or plain "Text".
- Include Favicon: Whether to include the favicon URL for each crawled page in the results.
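
For illustration, these UI fields might map onto a crawl request body like the hypothetical one below. The snake_case keys and lowercase option values are assumptions based on Tavily's public API documentation; the node may serialize its fields differently:

```python
# Hypothetical mapping of the node's UI fields onto a crawl request body.
# The snake_case key names are assumptions, not confirmed from node code.
crawl_body = {
    "url": "https://example.com/blog",             # URL
    "instructions": "Collect article pages only",  # Instructions
    "max_depth": 2,                                # Max Depth
    "max_breadth": 20,                             # Max Breadth
    "limit": 50,                                   # Limit
    "select_paths": [r"^/blog/.*"],                # Select Paths
    "exclude_paths": [r"^/blog/tag/.*"],           # Exclude Paths
    "allow_external": False,                       # Allow External
    "include_images": False,                       # Include Images
    "extract_depth": "basic",                      # Extract Depth
    "format": "markdown",                          # Format
    "include_favicon": True,                       # Include Favicon
}
```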

Output

The node outputs JSON data representing the crawl results. Each item typically includes:

  • The URL of the crawled page.
  • The extracted content in the selected format (Markdown or Text).
  • Metadata such as the page title, plus the favicon URL when Include Favicon is enabled.
  • Image URLs, when Include Images is enabled.
  • Crawl depth and link relationships, depending on the chosen Extract Depth.

If images are included, they are referenced within the JSON output rather than returned as binary data.
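
A downstream step might consume this output as sketched below. The field names (`url`, `raw_content`, `images`, `favicon`) are assumptions based on the output shape described above, not confirmed from the node's code:

```python
# Hedged sketch of processing crawl results in a downstream step.
# Field names are assumptions; adjust to the node's actual output.
def summarize_results(results: list[dict]) -> None:
    for item in results:
        print(item.get("url", "<no url>"))
        content = item.get("raw_content") or ""
        print(content[:200])  # preview of the extracted content
        for image_url in item.get("images", []):
            print("image:", image_url)
        if item.get("favicon"):
            print("favicon:", item["favicon"])
```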

Dependencies

  • Requires internet access to perform HTTP requests to target URLs.
  • May require an API key or authentication token if crawling protected or rate-limited sites (not explicitly shown in code).
  • No internal credential names or environment variables are exposed in the provided code snippet.

Troubleshooting

  • Common issues:

    • Invalid or unreachable root URL causing crawl failure.
    • Overly restrictive regex patterns resulting in no URLs being crawled.
    • Setting max depth or breadth too high may cause long execution times or timeouts.
    • External links are not followed when "Allow External" is false, which can yield fewer results than expected.
  • Error messages:

    • Network errors when the root URL or linked pages cannot be reached.
    • Validation errors if required properties like URL are missing or malformed.
    • Rate limiting or blocking by target websites when the crawl is too aggressive.
  • Resolutions:

    • Verify the root URL is correct and accessible.
    • Adjust regex filters so they match the intended URLs (a quick local check is sketched after this list).
    • Lower max depth/breadth or add delays between requests if supported.
    • Enable "Allow External" if external domains need to be crawled.
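
Before launching a full crawl, it can help to sanity-check the path filters locally. The sketch below tests illustrative Select Paths and Exclude Paths patterns against sample URL paths; the patterns and paths are hypothetical:

```python
import re

# Quick local check that Select Paths / Exclude Paths patterns match the
# URL paths you expect. Patterns and sample paths are illustrative only.
select_paths = [r"^/blog/.*"]
exclude_paths = [r"^/blog/tag/.*"]
samples = ["/blog/post-1", "/blog/tag/n8n", "/about"]

for path in samples:
    selected = any(re.search(p, path) for p in select_paths)
    excluded = any(re.search(p, path) for p in exclude_paths)
    verdict = "crawled" if selected and not excluded else "skipped"
    print(f"{path}: {verdict}")
```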
