ScrapegraphAI

Turn any webpage into usable data in one shot – ScrapegraphAI explores the website and extracts the content you need.

Overview

The Smart Crawler - Crawl operation enables automated crawling of websites to extract structured data using AI. It explores a given URL and follows links up to a specified depth and page limit, optionally restricting the crawl to the same domain. The node uses an AI prompt to guide content extraction from the crawled pages. Users can also define a custom JSON schema to structure the output data.
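The interaction of depth, page limit, and the same-domain restriction can be sketched as a breadth-first traversal. This is only an illustration over a hypothetical in-memory link graph (`LINKS`), not ScrapegraphAI's actual crawler; it assumes that a depth of 1 means "the start page plus the pages it links to directly".

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real pages (no network access).
LINKS = {
    "https://example.com/": [
        "https://example.com/about",
        "https://example.com/products",
        "https://other.example.org/",
    ],
    "https://example.com/about": ["https://example.com/team"],
    "https://example.com/products": ["https://example.com/products/widget"],
    "https://example.com/team": [],
    "https://example.com/products/widget": [],
    "https://other.example.org/": [],
}


def plan_crawl(start_url, depth, max_pages, same_domain_only=True):
    """Return the URLs a breadth-first crawl would visit, in order."""
    if max_pages < 1:
        return []
    start_host = urlparse(start_url).netloc
    visited = [start_url]
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue and len(visited) < max_pages:
        url, level = queue.popleft()
        if level >= depth:
            continue  # do not follow links beyond the depth limit
        for link in LINKS.get(url, []):
            if link in seen:
                continue
            if same_domain_only and urlparse(link).netloc != start_host:
                continue  # skip links that leave the starting domain
            seen.add(link)
            visited.append(link)
            queue.append((link, level + 1))
            if len(visited) >= max_pages:
                break  # page budget exhausted
    return visited
```

With this graph, `plan_crawl("https://example.com/", depth=1, max_pages=10)` visits the start page, `/about`, and `/products`, and skips the link to `other.example.org`.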

This node is beneficial for scenarios such as:

  • Gathering company information or product details from multiple pages within a website.
  • Extracting structured insights from complex websites without manual scraping setup.
  • Automating data collection workflows where content spans several linked pages.

Example use cases:

  • Crawling a business directory site to collect company names, descriptions, and websites.
  • Extracting blog post summaries across multiple pages of a news site.
  • Collecting product specifications from e-commerce category pages.

Properties

Name | Meaning
URL | The starting URL to begin crawling from.
Prompt | AI prompt that instructs how to extract content from the crawled pages.
Cache Website | Whether to cache the website content during crawling to optimize repeated requests.
Depth | Number of link levels to follow from the initial URL (crawling depth).
Max Pages | Maximum number of pages to crawl in total.
Same Domain Only | Whether to restrict crawling to pages within the same domain as the starting URL.
Use Custom Output Schema | Whether to apply a user-defined JSON schema to structure the extracted output data.
Output Schema | JSON schema defining the structure and types of the output data when "Use Custom Output Schema" is enabled.
Render Heavy JS | Whether to render JavaScript-heavy websites during crawling (may consume additional credits).
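As a rough sketch of how these properties might map onto a request body, the helper below collects them into a dictionary. The function name and every key name (`cache_website`, `same_domain_only`, `schema`, `render_heavy_js`, and so on) are assumptions made for illustration; the source does not list the field names the API expects.

```python
import json


def build_crawl_payload(url, prompt, *, cache_website=False, depth=1,
                        max_pages=10, same_domain_only=True,
                        use_custom_schema=False, output_schema="",
                        render_heavy_js=False):
    """Collect the node properties into a request body (key names are assumed)."""
    if depth < 1 or max_pages < 1:
        raise ValueError("depth and max_pages must be at least 1")
    payload = {
        "url": url,
        "prompt": prompt,
        "cache_website": cache_website,
        "depth": depth,
        "max_pages": max_pages,
        "same_domain_only": same_domain_only,
        "render_heavy_js": render_heavy_js,
    }
    # The schema is attached only when "Use Custom Output Schema" is enabled.
    if use_custom_schema:
        payload["schema"] = json.loads(output_schema)
    return payload
```

Keeping the schema out of the payload unless the toggle is enabled mirrors the way "Output Schema" only applies when "Use Custom Output Schema" is on.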

Output

The node outputs a JSON object representing the crawl results. The structure depends on whether a custom output schema is used:

  • By default, the output contains raw extracted data from the crawled pages as interpreted by the AI.
  • If a custom output schema is provided, the output conforms to that schema, enabling structured and typed data extraction.

The output JSON typically includes arrays or objects with fields such as company names, descriptions, and URLs, depending on the prompt and schema.

The node does not output binary data.
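To make the custom-schema case concrete, here is a minimal sketch of a schema for company data together with a small checker that reports missing required fields. The schema, the sample result, and the helper `missing_required` are all hypothetical; the checker inspects only `type`, `required`, `properties`, and `items`, so it is not a full JSON Schema validator.

```python
# Hypothetical schema one might paste into "Output Schema".
SCHEMA = {
    "type": "object",
    "properties": {
        "companies": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "description": {"type": "string"},
                    "website": {"type": "string"},
                },
                "required": ["name", "website"],
            },
        }
    },
    "required": ["companies"],
}


def missing_required(value, schema, path="$"):
    """Return the paths of required keys that are absent from value."""
    problems = []
    if schema.get("type") == "object" and isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                problems.append(f"{path}.{key}")
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                problems.extend(
                    missing_required(value[key], sub_schema, f"{path}.{key}")
                )
    elif schema.get("type") == "array" and isinstance(value, list):
        for index, item in enumerate(value):
            problems.extend(
                missing_required(item, schema.get("items", {}), f"{path}[{index}]")
            )
    return problems
```

A check like this could run in a later workflow step to flag items where the extraction left a required field empty.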

Dependencies

  • Requires an API key credential for authentication with the ScrapegraphAI service.
  • Makes HTTP POST requests to the ScrapegraphAI API endpoint /crawl to perform the crawling and extraction.
  • Optional: a custom output schema, if used, must be supplied as valid JSON.
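The authenticated POST to /crawl might look roughly like the sketch below, which only builds the request and does not send it. The base URL `API_BASE` and the header name `API_KEY_HEADER` are assumptions that should be checked against the ScrapegraphAI API documentation; inside n8n, the node and its credential handle this step for you.

```python
import json
import urllib.request

# Assumed values, not confirmed by this page.
API_BASE = "https://api.scrapegraphai.com/v1"
API_KEY_HEADER = "SGAI-APIKEY"


def build_crawl_request(api_key, payload):
    """Build (but do not send) a JSON POST request for the /crawl endpoint."""
    if not api_key:
        raise ValueError("An API key is required")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/crawl",
        data=body,
        headers={"Content-Type": "application/json", API_KEY_HEADER: api_key},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; that call is left out here so the sketch runs without network access or a real key.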

Troubleshooting

  • Invalid JSON in Output Schema: If the custom output schema JSON is malformed, the node throws an error indicating invalid JSON. To fix, ensure the JSON schema is correctly formatted.
  • API Authentication Errors: Missing or incorrect API credentials will cause authentication failures. Verify the API key is configured properly in n8n credentials.
  • Crawl Limits: Very high Depth or Max Pages values can cause long execution times or hit API rate limits. Start with small values and raise them only as far as your use case and API plan allow.
  • JavaScript Rendering: Enabling "Render Heavy JS" increases resource usage and may incur extra costs. Disable it if not necessary.
  • Same Domain Restriction: If fewer results are returned than expected, check whether "Same Domain Only" is enabled. When it is, links that point to other domains are not followed, so disable it if the content you need lives on a different domain.
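For the invalid-JSON case above, a schema can be checked before it is pasted into the node. The following is a minimal sketch; `parse_output_schema` is a hypothetical helper, not part of the node, and its error wording is illustrative.

```python
import json


def parse_output_schema(raw):
    """Parse a schema string, reporting where the JSON is malformed."""
    try:
        schema = json.loads(raw)
    except json.JSONDecodeError as err:
        raise ValueError(
            f"Invalid JSON in Output Schema "
            f"(line {err.lineno}, column {err.colno}): {err.msg}"
        ) from err
    if not isinstance(schema, dict):
        raise ValueError("Output Schema must be a JSON object")
    return schema
```

A trailing comma, such as in `{"type": "object",}`, is a common cause of this error, and the reported line and column point to where parsing stopped.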