ScrapegraphAI

Turn any webpage into usable data in one shot – ScrapegraphAI explores the website and extracts the content you need.

Actions (7)

Scrape Actions
- Scrape
Agentic Scraper Actions
- Execute
Markdownify Actions
- Convert
Search Scraper Actions
- Search
Smart Crawler Actions
- Crawl
- Get Status
Smart Scraper Actions
- Scrape

Overview

The Smart Crawler - Crawl operation enables automated crawling of websites to extract structured data using AI. It explores a given URL and follows links up to a specified depth and page limit, optionally restricting the crawl to the same domain. The node uses an AI prompt to guide content extraction from the crawled pages. Users can also define a custom JSON schema to structure the output data.
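The interaction between depth, page limit, and the same-domain restriction can be illustrated with a small breadth-first traversal. This is only a sketch of the general idea, not ScrapegraphAI's actual crawler: the function name `plan_crawl`, the in-memory `LINKS` graph, and the exact counting rules (the start page counts toward the page limit, depth 1 means "follow links once") are assumptions made for illustration.

```python
from collections import deque
from urllib.parse import urlparse


def plan_crawl(start_url, links, depth=1, max_pages=10, same_domain_only=True):
    """Return the URLs a depth- and page-limited crawl would visit, in order.

    `links` maps each URL to the URLs it links to (a stand-in for real pages).
    """
    start_host = urlparse(start_url).netloc
    visited = [start_url]          # the start page counts toward max_pages
    seen = {start_url}
    queue = deque([(start_url, 0)])

    while queue:
        url, level = queue.popleft()
        if level >= depth:
            continue               # do not follow links past the depth limit
        for target in links.get(url, []):
            if len(visited) >= max_pages:
                return visited     # page budget exhausted
            if target in seen:
                continue           # never visit the same URL twice
            if same_domain_only and urlparse(target).netloc != start_host:
                continue           # skip links that leave the starting domain
            seen.add(target)
            visited.append(target)
            queue.append((target, level + 1))
    return visited


# Hypothetical link graph used only for this illustration.
LINKS = {
    "https://example.com/": [
        "https://example.com/about",
        "https://other.example.org/",
        "https://example.com/products",
    ],
    "https://example.com/products": ["https://example.com/products/1"],
}
```

With this graph, raising `depth` from 1 to 2 adds the product detail page, while disabling `same_domain_only` lets the crawl reach `other.example.org`.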

This node is beneficial for scenarios such as:

- Gathering company information or product details from multiple pages within a website.
- Extracting structured insights from complex websites without manual scraping setup.
- Automating data collection workflows where content spans several linked pages.

Example use cases:

- Crawling a business directory site to collect company names, descriptions, and websites.
- Extracting blog post summaries across multiple pages of a news site.
- Collecting product specifications from e-commerce category pages.

Properties

Name	Meaning
URL	The starting URL to begin crawling from.
Prompt	AI prompt that instructs how to extract content from the crawled pages.
Cache Website	Whether to cache the website content during crawling to optimize repeated requests.
Depth	Number of link levels to follow from the initial URL (crawling depth).
Max Pages	Maximum number of pages to crawl in total.
Same Domain Only	Whether to restrict crawling to pages within the same domain as the starting URL.
Use Custom Output Schema	Whether to apply a user-defined JSON schema to structure the extracted output data.
Output Schema	JSON schema defining the structure and types of the output data when "Use Custom Output Schema" is enabled.
Render Heavy JS	Whether to render JavaScript-heavy websites during crawling (may consume additional credits).
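As a rough illustration of how the properties above fit together, the following sketch gathers them into a single dictionary. The function name, the key names, the default values, and the "at least 1" checks are all hypothetical; they are not taken from the node's source or from the ScrapegraphAI API, so consult the official documentation for the real field names.

```python
def build_crawl_options(url, prompt, *, cache_website=True, depth=1, max_pages=10,
                        same_domain_only=True, render_heavy_js=False,
                        output_schema=None):
    """Collect the Crawl properties into one dictionary (key names are hypothetical)."""
    if not url or not prompt:
        raise ValueError("URL and Prompt are both required")
    if depth < 1 or max_pages < 1:
        # Assumed lower bounds, chosen for this sketch only.
        raise ValueError("Depth and Max Pages must be at least 1")

    options = {
        "url": url,                            # URL
        "prompt": prompt,                      # Prompt
        "cache_website": cache_website,        # Cache Website
        "depth": depth,                        # Depth
        "max_pages": max_pages,                # Max Pages
        "same_domain_only": same_domain_only,  # Same Domain Only
        "render_heavy_js": render_heavy_js,    # Render Heavy JS
    }
    # Output Schema is only included when "Use Custom Output Schema" is enabled.
    if output_schema is not None:
        options["output_schema"] = output_schema
    return options
```

For example, `build_crawl_options("https://example.com", "Extract company names", depth=2)` yields a dictionary without an `output_schema` key, mirroring a run with "Use Custom Output Schema" disabled.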

Output

The node outputs a JSON object representing the crawl results. The structure depends on whether a custom output schema is used:

- By default, the output contains raw extracted data from the crawled pages as interpreted by the AI.
- If a custom output schema is provided, the output conforms to that schema, enabling structured and typed data extraction.

The output JSON typically includes arrays or objects with fields such as company names, descriptions, and URLs, depending on the prompt and schema.
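To make the custom schema option concrete, here is a hypothetical schema for the company-directory use case, written as a Python dictionary and serialized to JSON for pasting into the Output Schema field. The field names (`companies`, `name`, `description`, `website`) are invented for this example, and the `missing_required` helper only checks top-level required keys; it is not a full JSON Schema validator.

```python
import json

# Hypothetical schema: a list of companies with a name, description, and website.
COMPANY_SCHEMA = {
    "type": "object",
    "properties": {
        "companies": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "description": {"type": "string"},
                    "website": {"type": "string"},
                },
                "required": ["name"],
            },
        }
    },
    "required": ["companies"],
}


def missing_required(data, schema):
    """List the top-level required keys of `schema` that are absent from `data`."""
    return [key for key in schema.get("required", []) if key not in data]


# JSON text suitable for the Output Schema field.
schema_text = json.dumps(COMPANY_SCHEMA, indent=2)
```

A result shaped like `{"companies": [{"name": "Acme"}]}` passes the `missing_required` check, while an empty object does not.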

The node does not output binary data.

Dependencies

- Requires an API key credential for authentication with the ScrapegraphAI service.
- Makes HTTP POST requests to the ScrapegraphAI API endpoint /crawl to perform the crawling and extraction.
- If a custom output schema is used, it must be valid JSON.
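The request described above can be sketched with the Python standard library. Only the POST method and the /crawl path come from this page; the base URL, the `SGAI-APIKEY` header name, and the payload keys are assumptions for illustration and should be checked against the ScrapegraphAI documentation. The function builds the request without sending it.

```python
import json
import urllib.request


def build_crawl_request(api_key, payload,
                        base_url="https://api.example.com/v1",  # placeholder base URL
                        key_header="SGAI-APIKEY"):               # assumed header name
    """Build (but do not send) a POST request to the /crawl endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/crawl",
        data=body,
        headers={key_header: api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Payload keys here are hypothetical.
request = build_crawl_request("my-api-key", {"url": "https://example.com", "prompt": "Extract names"})
```

Passing the resulting object to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without network access or a real key.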

Troubleshooting

- Invalid JSON in Output Schema: If the custom output schema is malformed, the node throws an error indicating invalid JSON. Check the schema for missing quotes, trailing commas, or unbalanced braces.
- API Authentication Errors: Missing or incorrect API credentials cause authentication failures. Verify that the API key is configured correctly in the n8n credentials.
- Crawl Limits: Very high Depth or Max Pages values may lead to long execution times or hit API rate limits. Set them no higher than the task requires and within your API usage limits.
- JavaScript Rendering: Enabling "Render Heavy JS" increases resource usage and may consume additional credits. Disable it unless the target pages need it.
- Same Domain Restriction: If expected pages are missing from the results, check whether "Same Domain Only" is enabled. When it is, links that lead to other domains are not crawled.
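For the invalid JSON case, it can help to check the schema text before pasting it into the node. The helper below is a minimal sketch using Python's standard `json` module; the function name and error wording are made up here and do not reflect the node's own error messages.

```python
import json


def parse_output_schema(text):
    """Parse schema text, raising ValueError with the error position if it is malformed."""
    try:
        schema = json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(
            f"Invalid JSON in Output Schema (line {err.lineno}, column {err.colno}): {err.msg}"
        ) from err
    if not isinstance(schema, dict):
        # A JSON Schema document is expected to be an object at the top level here.
        raise ValueError("Output Schema must be a JSON object")
    return schema
```

Calling `parse_output_schema('{"type": "object"}')` returns the parsed dictionary, while a string such as `'{"type": }'` raises a `ValueError` that points at the offending position.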

Links and References

ScrapegraphAI Documentation (for API details and usage)
JSON Schema Specification (to create custom output schemas)