Firecrawl

Get data from the Firecrawl API

Actions: 7

Overview

The Firecrawl node extracts structured data from web pages through the Firecrawl API. It is intended for web scraping and data extraction across one or more URLs, optionally guided by a prompt and a JSON schema that shape the extracted data. Typical uses include monitoring website content changes, aggregating data from multiple sources, and extracting specific information such as product details, news articles, or metadata.

Practical examples include:

Extracting product prices and descriptions from e-commerce sites.
Monitoring changes in competitor websites using change tracking formats.
Capturing screenshots of web pages for visual records.
Summarizing content from a list of URLs for research purposes.

Properties

Name	Meaning
URLs	List of URLs (supports glob patterns) to extract data from.
Prompt	Text prompt to guide the extraction process, helping tailor the data extraction.
Schema	JSON schema defining the expected structure of the extracted data.
Ignore Sitemap	Whether to ignore the website sitemap during crawling (default: true).
Include Subdomains	Whether to include subdomains of the website when crawling (default: false).
Enable Web Search	Whether to enable web search to find additional data beyond the provided URLs (default: false).
Show Sources	Whether to show the sources used during data extraction (default: false).
Scrape Options	Options controlling how content is scraped during crawling, including output formats (e.g., markdown, HTML, JSON, screenshots), prompts and schemas for JSON extraction, change tracking modes, screenshot settings, etc.
Only Main Content	Whether to return only the main content of the page, excluding headers, navigation bars, footers, etc. (default: true).
Include Tags	List of HTML tags to explicitly include in the output.
Exclude Tags	List of HTML tags to exclude from the output.
Headers	Custom HTTP headers to send with each request.
Wait For (Ms)	Milliseconds to wait for the page to load before fetching content.
Mobile	Whether to emulate scraping from a mobile device (default: false).
Skip TLS Verification	Whether to skip TLS certificate verification when making requests (default: false).
Timeout (Ms)	Request timeout in milliseconds (default: 30000).
Actions	List of actions to interact with dynamic content before scraping, such as clicking elements, pressing keys, scrolling, waiting, writing text, or taking screenshots.
Location	Location settings for the request including country code and preferred languages/locales.
Remove Base64 Images	Whether to remove base64 encoded images from the output (default: true).
Block Ads	Enables ad-blocking and cookie popup blocking during scraping (default: true).
Store In Cache	Whether to store the page in the Firecrawl index and cache; disable for sensitive scraping parameters or data protection concerns (default: true).
Proxy	Type of proxy to use for requests: Basic or Stealth.
Additional Fields	Additional custom JSON properties to add to the request body when using a custom request body.
Use Custom Body	Whether to use a fully custom request body instead of the standard parameters.
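
As a rough illustration of how these properties might map onto a request body, the sketch below assembles a payload from a few of them. The key names (urls, prompt, schema, ignoreSitemap, scrapeOptions, and so on), the helper build_extract_body, and the example product schema are assumptions chosen to mirror the property names above, not a confirmed description of what the node sends; check the Firecrawl API documentation for the exact fields.

```python
import json
from typing import Any, Optional


def build_extract_body(
    urls: list[str],
    prompt: Optional[str] = None,
    schema: Optional[dict[str, Any]] = None,
    additional_fields: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Assemble a hypothetical request body from node-style properties."""
    body: dict[str, Any] = {
        "urls": urls,
        # Defaults copied from the property table above (key names assumed).
        "ignoreSitemap": True,
        "includeSubdomains": False,
        "enableWebSearch": False,
        "showSources": False,
        "scrapeOptions": {
            "onlyMainContent": True,
            "mobile": False,
            "timeout": 30000,
        },
    }
    if prompt:
        body["prompt"] = prompt
    if schema:
        body["schema"] = schema
    # Extra keys are merged last so they can override the defaults.
    body.update(additional_fields or {})
    return body


# Illustrative JSON schema for the "product prices and descriptions" use case.
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "description": {"type": "string"},
    },
    "required": ["name", "price"],
}

body = build_extract_body(
    ["https://example.com/products/*"],
    prompt="Extract the product name, price, and description.",
    schema=product_schema,
)
print(json.dumps(body, indent=2))
```

Merging the additional fields last is a design choice in this sketch: it lets a caller override any default without the helper needing a parameter for every property.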

Output

The node outputs JSON data representing the content extracted from the specified URLs. The structure of the JSON output depends on the provided schema and on the formats selected in the scrape options. It can include:

Structured data matching the user-defined schema.
Raw or processed HTML content.
Markdown summaries.
Change tracking data showing differences between crawls.
Screenshots encoded as images (binary data).
Lists of links or other extracted elements.

If screenshots are requested, the node outputs binary data representing the captured images.
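
The sketch below shows one way a downstream step might separate structured data from image bytes in a single output item. It assumes the item follows the usual n8n layout, with a json object and a binary object whose entries carry base64-encoded data and a mimeType; the key screenshot and the helper split_item are hypothetical names used only for illustration.

```python
import base64
from typing import Any


def split_item(item: dict[str, Any]) -> tuple[dict[str, Any], dict[str, bytes]]:
    """Return (structured data, decoded binary attachments) for one item."""
    data = item.get("json", {})
    attachments = {
        name: base64.b64decode(entry["data"])
        for name, entry in item.get("binary", {}).items()
    }
    return data, attachments


# Hypothetical item: the payload is only the 8-byte PNG signature,
# standing in for a real screenshot.
sample_item = {
    "json": {"name": "Widget", "price": 9.99},
    "binary": {
        "screenshot": {
            "data": base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii"),
            "mimeType": "image/png",
        }
    },
}

data, attachments = split_item(sample_item)
print(data["name"], len(attachments["screenshot"]))
```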

Dependencies

Requires an API key credential for authenticating with the Firecrawl API.
The node sends HTTP requests to the Firecrawl API endpoint (default https://api.firecrawl.dev/v2).
Optional proxy configuration can be set within the node properties.
No additional external dependencies beyond the Firecrawl API and n8n's HTTP request capabilities.
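
For readers who want to see what such a request could look like outside n8n, here is a minimal sketch that builds, but does not send, an authenticated POST against the default endpoint. The Bearer authorization scheme, the /extract path, the placeholder key, and the helper make_request are assumptions for illustration; the node itself handles this through its credential.

```python
import json
import urllib.request

BASE_URL = "https://api.firecrawl.dev/v2"  # default endpoint named above


def make_request(api_key: str, path: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{path.lstrip('/')}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder key and assumed path; nothing is sent over the network here.
req = make_request("fc-EXAMPLE-KEY", "/extract", {"urls": ["https://example.com"]})
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen(req, timeout=30) would actually send it; note that this timeout is in seconds, whereas the node's "Timeout (Ms)" property is in milliseconds.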

Troubleshooting

Timeouts: If the request times out, consider increasing the "Timeout (Ms)" property.
Invalid URLs: Ensure URLs are correctly formatted and accessible; glob patterns should be valid.
Schema Errors: Incorrect JSON schema definitions may cause extraction failures; validate JSON syntax carefully.
Authentication Failures: Verify that the API key credential is correctly configured and has necessary permissions.
Dynamic Content Issues: If dynamic page content is not loaded properly, use the "Actions" property to interact with the page (e.g., wait, click).
TLS Verification: If scraping sites with self-signed certificates, enable "Skip TLS Verification".
Ad Blocking: If ads or cookie popups obscure the content, ensure "Block Ads" is enabled.
Cache Concerns: Disable "Store In Cache" if sensitive data is involved or if you want fresh data on every run.
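
Two of the failure modes above, invalid URLs and schema syntax errors, can be caught before the node runs. The following is a minimal pre-flight sketch under that assumption; the helper preflight is hypothetical, and it checks only URL shape and JSON syntax, not whether a glob pattern matches anything or whether the schema is semantically valid.

```python
import json
from urllib.parse import urlparse


def preflight(urls: list[str], schema_text: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for url in urls:
        parsed = urlparse(url)
        # Require an explicit http(s) scheme and a host part.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"invalid URL: {url!r}")
    try:
        schema = json.loads(schema_text)
        if not isinstance(schema, dict):
            problems.append("schema must be a JSON object")
    except json.JSONDecodeError as exc:
        problems.append(f"schema is not valid JSON: {exc.msg}")
    return problems


ok = preflight(["https://example.com/blog/*"], '{"type": "object"}')
bad = preflight(["example.com/blog"], '{"type": ')
print(ok)
print(bad)
```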

Links and References

Firecrawl API Documentation: https://firecrawl.dev/docs
MDN Web Docs on Accept-Language header: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Language
n8n Documentation: https://docs.n8n.io/