Firecrawl

Get data from Firecrawl API

Overview

The Firecrawl node extracts structured data from web pages through the Firecrawl API. It is intended for web scraping and data extraction tasks that gather information from one or more URLs, optionally guided by a prompt and a JSON schema that shape the extracted data. Typical uses include monitoring website content changes, aggregating data from multiple sources, and extracting specific information such as product details, news articles, or metadata.

Practical examples include:

  • Extracting product prices and descriptions from e-commerce sites.
  • Monitoring changes in competitor websites using change tracking formats.
  • Capturing screenshots of web pages for visual records.
  • Summarizing content from a list of URLs for research purposes.

Properties

Name | Meaning
URLs | List of URLs (supports glob patterns) to extract data from.
Prompt | Text prompt to guide the extraction process, helping tailor the data extraction.
Schema | JSON schema defining the expected structure of the extracted data.
Ignore Sitemap | Whether to ignore the website sitemap during crawling (default: true).
Include Subdomains | Whether to include subdomains of the website when crawling (default: false).
Enable Web Search | Whether to enable web search to find additional data beyond the provided URLs (default: false).
Show Sources | Whether to show the sources used during data extraction (default: false).
Scrape Options | Options controlling how content is scraped during crawling, including output formats (e.g., markdown, HTML, JSON, screenshots), prompts and schemas for JSON extraction, change tracking modes, screenshot settings, etc.
Only Main Content | Whether to return only the main content of the page, excluding headers, navigation bars, footers, etc. (default: true).
Include Tags | List of HTML tags to explicitly include in the output.
Exclude Tags | List of HTML tags to exclude from the output.
Headers | Custom HTTP headers to send with each request.
Wait For (Ms) | Milliseconds to wait for the page to load before fetching content.
Mobile | Whether to emulate scraping from a mobile device (default: false).
Skip TLS Verification | Whether to skip TLS certificate verification when making requests (default: false).
Timeout (Ms) | Request timeout in milliseconds (default: 30000).
Actions | List of actions to interact with dynamic content before scraping, such as clicking elements, pressing keys, scrolling, waiting, writing text, or taking screenshots.
Location | Location settings for the request, including country code and preferred languages/locales.
Remove Base64 Images | Whether to remove base64-encoded images from the output (default: true).
Block Ads | Whether to enable ad blocking and cookie popup blocking during scraping (default: true).
Store In Cache | Whether to store the page in the Firecrawl index and cache; disable for sensitive scraping parameters or data protection concerns (default: true).
Proxy | Type of proxy to use for requests: Basic or Stealth.
Additional Fields | Additional custom JSON properties to add to the request body when using a custom request body.
Use Custom Body | Whether to use a fully custom request body instead of the standard parameters.
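
As a rough illustration of how these properties might combine into a single request, the sketch below assembles a body from a handful of them. The function name and every key name (urls, prompt, schema, scrapeOptions, and so on) are assumptions modeled on camel-cased property labels, not names confirmed by this page; check the Firecrawl API reference before relying on them.

```python
import json


def build_extract_body(urls, prompt=None, schema=None, *,
                       use_custom_body=False, additional_fields=None,
                       **options):
    """Assemble a request body from node-style properties.

    All key names in the returned dict are hypothetical.
    """
    if use_custom_body:
        # "Use Custom Body" replaces the standard parameters entirely;
        # "Additional Fields" supplies the custom JSON properties.
        return dict(additional_fields or {})

    body = {"urls": list(urls)}
    if prompt:
        body["prompt"] = prompt
    if schema:
        # Accept the schema as a JSON string or an already-parsed dict.
        body["schema"] = json.loads(schema) if isinstance(schema, str) else schema

    # Fallback values are the defaults listed in the properties table.
    body["ignoreSitemap"] = options.get("ignore_sitemap", True)
    body["includeSubdomains"] = options.get("include_subdomains", False)
    body["enableWebSearch"] = options.get("enable_web_search", False)
    body["showSources"] = options.get("show_sources", False)
    body["scrapeOptions"] = {
        "onlyMainContent": options.get("only_main_content", True),
        "mobile": options.get("mobile", False),
        "timeout": options.get("timeout_ms", 30000),
        "blockAds": options.get("block_ads", True),
    }
    return body


body = build_extract_body(
    ["https://example.com/products/*"],
    prompt="Extract the product name and price.",
    schema='{"type": "object", "properties": {"price": {"type": "string"}}}',
    timeout_ms=60000,
)
print(json.dumps(body, indent=2))
```

Keeping the defaults in one place makes it easy to see which properties a workflow actually overrides.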

Output

The node outputs JSON data representing the content extracted from the specified URLs. The structure of the JSON output depends on the provided schema and on the formats selected in the scrape options. It can include:

  • Structured data matching the user-defined schema.
  • Raw or processed HTML content.
  • Markdown summaries.
  • Change tracking data showing differences between crawls.
  • Screenshots encoded as images (binary data).
  • Lists of links or other extracted elements.

If screenshots are requested, the node outputs binary data representing the captured images.
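
A downstream step, such as external post-processing of exported items, might separate the structured fields from any screenshot data. The sketch below assumes the common n8n item shape: a json object plus an optional binary object whose entries carry a base64-encoded data string and a mimeType. The binary property name "screenshot" and the sample fields are hypothetical.

```python
import base64


def split_item(item):
    """Return (structured_data, {name: raw_bytes}) for one n8n-style item."""
    structured = item.get("json", {})
    images = {}
    for name, entry in item.get("binary", {}).items():
        # Binary entries are assumed to hold base64-encoded payloads.
        images[name] = base64.b64decode(entry["data"])
    return structured, images


sample = {
    "json": {"title": "Example product", "price": "19.99"},
    "binary": {
        "screenshot": {
            "data": base64.b64encode(b"\x89PNG").decode("ascii"),
            "mimeType": "image/png",
        }
    },
}

data, images = split_item(sample)
print(data["title"], sorted(images))
```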

Dependencies

  • Requires an API key credential for authenticating with the Firecrawl API.
  • The node sends HTTP requests to the Firecrawl API endpoint (default: https://api.firecrawl.dev/v2).
  • Optional proxy configuration can be set within the node properties.
  • No additional external dependencies beyond the Firecrawl API and n8n's HTTP request capabilities.
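
For orientation, an authenticated call to the API could be prepared roughly as follows. This is a sketch under assumptions: the "/extract" path and the Bearer authorization scheme are not stated on this page, and the API key is a placeholder. The request is built but not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.firecrawl.dev/v2"  # default endpoint listed above


def build_request(api_key, body, path="/extract"):
    """Prepare (but do not send) an authenticated POST request."""
    return urllib.request.Request(
        BASE_URL + path,  # "/extract" is an assumed path
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request("YOUR_API_KEY", {"urls": ["https://example.com"]})
print(req.get_method(), req.full_url)
# Sending it would look like: urllib.request.urlopen(req, timeout=30)
```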

Troubleshooting

  • Timeouts: If the request times out, consider increasing the "Timeout (Ms)" property.
  • Invalid URLs: Ensure URLs are correctly formatted and accessible; glob patterns should be valid.
  • Schema Errors: Incorrect JSON schema definitions may cause extraction failures; validate JSON syntax carefully.
  • Authentication Failures: Verify that the API key credential is correctly configured and has necessary permissions.
  • Dynamic Content Issues: If dynamic page content is not loaded properly, use the "Actions" property to interact with the page (e.g., wait, click).
  • TLS Verification: If scraping sites with self-signed certificates, enable "Skip TLS Verification".
  • Ad Blocking: If page content is obscured by ads or cookie popups, ensure "Block Ads" is enabled.
  • Cache Concerns: Disable "Store In Cache" if sensitive data is involved or if you want fresh data on every run.
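
Two of the checks above, URL format and JSON schema syntax, can be run before a workflow is executed. The helper below is a minimal sketch; the function name preflight and its messages are illustrative and not part of the node.

```python
import json
from urllib.parse import urlparse


def preflight(urls, schema_text=None):
    """Return a list of problems; an empty list means nothing obvious is wrong."""
    problems = []
    for url in urls:
        parts = urlparse(url)
        # Require an explicit http(s) scheme and a host; glob characters
        # in the path (e.g. "/*") are left alone.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append(f"Invalid URL: {url}")
    if schema_text:
        try:
            schema = json.loads(schema_text)
        except json.JSONDecodeError as exc:
            problems.append(f"Schema is not valid JSON: {exc.msg}")
        else:
            if not isinstance(schema, dict):
                problems.append("Schema must be a JSON object")
    return problems


print(preflight(["https://example.com/*", "example.com/products"],
                '{"type": "object"}'))
```

This only catches syntax-level mistakes; it does not confirm that a URL is reachable or that the schema matches the page content.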
