Firecrawl

Get data from Firecrawl API

Overview

The node integrates with the Firecrawl API to perform website crawling and scraping tasks. It allows users to crawl a specified URL, extract content in various formats, and control the crawling behavior through numerous options such as concurrency, delays, path inclusion/exclusion, and more.

This node is beneficial for scenarios where automated data extraction from websites is needed, such as:

  • Collecting blog posts or articles from a website.
  • Monitoring changes on web pages over time.
  • Extracting structured data or summaries from web content.
  • Capturing screenshots of web pages for visual records.
  • Filtering crawled URLs by patterns or tags.

For example, a user can configure the node to crawl a news site’s blog section, exclude certain paths like archives, limit the number of results, and output the scraped content in markdown format.
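
The configuration described above might translate into a request body along these lines. This is a minimal sketch: the key names (`excludePaths`, `includePaths`, `limit`, `scrapeOptions`, `formats`) are assumptions derived from the property names on this page, the site URL and path pattern are hypothetical, and the node may assemble its body differently.

```python
import json


def build_crawl_body(url, exclude_paths=None, include_paths=None,
                     limit=None, formats=None):
    """Assemble a crawl request body, omitting options that were not set.

    The key names are assumed for illustration, not taken from the node's source.
    """
    if limit is not None and limit < 1:
        raise ValueError("limit must be at least 1")

    body = {"url": url}
    if exclude_paths:
        body["excludePaths"] = list(exclude_paths)
    if include_paths:
        body["includePaths"] = list(include_paths)
    if limit is not None:
        body["limit"] = limit
    if formats:
        body["scrapeOptions"] = {"formats": list(formats)}
    return body


# Hypothetical news site: crawl the blog, skip archives, cap at 25 results.
body = build_crawl_body(
    "https://news.example.com/blog",
    exclude_paths=["blog/archive/.*"],
    limit=25,
    formats=["markdown"],
)
print(json.dumps(body, indent=2))
```

Leaving unset options out of the body, rather than sending nulls, lets the service apply its own defaults.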

Properties

Name: Meaning
Url: The starting URL to crawl (e.g., https://firecrawl.dev).
Prompt: A prompt string used during the crawl, typically to guide content extraction or summarization.
Limit: Maximum number of crawl results to return (minimum 1).
Delay: Delay in milliseconds between requests during the crawl.
Max Concurrency: Maximum number of concurrent page scrapes allowed during the crawl.
Exclude Paths: List of URL path patterns (regex-like) to exclude from crawling (e.g., blog/* excludes /blog/article-1).
Include Paths: List of URL path patterns to include; only URLs matching one of these patterns are crawled.
Crawl Options: Collection of boolean flags controlling crawl behavior:
• Ignore Sitemap: whether to ignore sitemap.xml.
• Ignore Query Params: treat URLs that differ only in their query parameters as the same page.
• Allow External Links: follow links to external sites.
• Allow Subdomains: follow links to subdomains of the main domain.
Scrape Options: Options controlling how content is scraped:
• Formats: output formats such as markdown, html, json, links, screenshot, summary, change tracking.
• Only Main Content: whether to extract only the main page content.
• Include Tags / Exclude Tags: specify HTML tags to include or exclude.
• Actions: list of interactions to perform before scraping (click, press key, scroll, wait, write text, screenshot).
• Location: country and preferred languages for the request.
• Remove Base64 Images: whether to remove base64 images.
• Block Ads: enable ad and cookie popup blocking.
• Store In Cache: whether to store the page in the Firecrawl cache.
• Proxy: type of proxy to use (basic or stealth).
Headers: Custom HTTP headers to send with each request (key-value pairs).
Wait For (Ms): Milliseconds to wait for the page to load before scraping content.
Mobile: Whether to emulate a mobile device during scraping.
Skip TLS Verification: Whether to skip TLS certificate verification when making requests.
Timeout (Ms): Request timeout in milliseconds.
Additional Fields: Custom JSON properties to add to the request body for advanced use cases.
Use Custom Body: Option to provide a fully custom request body instead of using the predefined parameters.
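
One plausible way the last two properties interact with the predefined parameters is sketched below. The precedence rules are assumptions made for illustration (a custom body replaces everything; additional fields override predefined keys in a shallow merge), and `customFlag` is a hypothetical field. This page does not document the node's actual merge behavior.

```python
import json


def resolve_request_body(predefined, additional_fields_json="",
                         use_custom_body=False, custom_body_json=""):
    """Return the body to send, under the assumed precedence rules."""
    if use_custom_body:
        # Assumption: a custom body is sent as-is and ignores other properties.
        body = json.loads(custom_body_json)
        if not isinstance(body, dict):
            raise ValueError("custom body must be a JSON object")
        return body

    body = dict(predefined)  # copy so the caller's dict is not mutated
    if additional_fields_json.strip():
        extra = json.loads(additional_fields_json)
        if not isinstance(extra, dict):
            raise ValueError("additional fields must be a JSON object")
        # Assumption: additional fields win over predefined keys (shallow merge).
        body.update(extra)
    return body


predefined = {"url": "https://firecrawl.dev", "limit": 10}
merged = resolve_request_body(predefined, '{"limit": 5, "customFlag": true}')
custom = resolve_request_body(predefined, use_custom_body=True,
                              custom_body_json='{"url": "https://example.com"}')
print(merged)
print(custom)
```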

Output

The node outputs JSON data representing the results of the crawl and scrape operation. The structure includes details about the crawled pages and their extracted content according to the selected formats. Possible output formats include:

  • Markdown: Extracted content in markdown format.
  • HTML / Raw HTML: Full or partial HTML content.
  • JSON: Structured data extracted using a JSON schema.
  • Links: List of links found on pages.
  • Summary: Summarized content of pages.
  • Change Tracking: Differences detected between crawls, supporting modes like git-diff or JSON diff.
  • Screenshot: Binary image data capturing the page screenshot (if requested).

If screenshots are included, binary data representing images is output alongside JSON metadata.
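
Downstream steps often need just one format per page. The sketch below assumes each page in the output is an object with an optional `markdown` string and a `metadata.sourceURL` field; that shape is an assumption for illustration, so inspect a real execution's output before relying on these names.

```python
def collect_markdown(pages):
    """Map each page's source URL to its markdown, skipping pages without any."""
    result = {}
    for page in pages:
        markdown = page.get("markdown")
        url = page.get("metadata", {}).get("sourceURL")
        if markdown and url:
            result[url] = markdown
    return result


# Hypothetical sample shaped like the assumed output.
sample_pages = [
    {"markdown": "# Post one",
     "metadata": {"sourceURL": "https://example.com/blog/one"}},
    {"links": ["https://example.com/about"],
     "metadata": {"sourceURL": "https://example.com/blog/two"}},
]
by_url = collect_markdown(sample_pages)
print(by_url)
```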

Dependencies

  • Requires an API key credential for authenticating with the Firecrawl API.
  • The node sends HTTP requests to the Firecrawl service endpoint (default https://api.firecrawl.dev/v2).
  • Optional proxy configuration (basic or stealth) can be set.
  • Supports advanced scraping features that may require additional permissions or API plan capabilities on Firecrawl.
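
For orientation, an authenticated request to the default endpoint could be built roughly as follows. This is a sketch under stated assumptions: the `/crawl` route and the Bearer-token `Authorization` header are not given on this page, and `fc-PLACEHOLDER` stands in for a real key. The request is constructed but not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.firecrawl.dev/v2"  # default endpoint named above


def make_crawl_request(api_key, body, base_url=BASE_URL):
    """Build (but do not send) a POST request carrying the API key.

    Assumptions: the crawl route is `/crawl` and the key travels as a
    Bearer token; neither detail is stated on this page.
    """
    if not api_key:
        raise ValueError("an API key is required")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/crawl",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = make_crawl_request("fc-PLACEHOLDER", {"url": "https://firecrawl.dev"})
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request, timeout=30)
```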

Troubleshooting

  • Timeouts: If requests time out, consider increasing the "Timeout (Ms)" property or reducing concurrency.
  • Empty Results: Check if the URL is correct and accessible; verify include/exclude path patterns do not filter out all pages.
  • Invalid JSON Schema: When using JSON extraction formats, ensure the provided JSON schema is valid and matches expected data.
  • Permission Errors: Confirm the API key credential is valid and has sufficient permissions.
  • TLS Errors: If TLS verification fails, enabling "Skip TLS Verification" might help, but use it cautiously because it disables certificate validation.
  • Proxy Issues: If using proxies, verify proxy settings and connectivity.
  • Ad Blocking Side Effects: Enabling ad-blocking may sometimes block desired content; try disabling if content is missing.
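
When diagnosing empty results, it can help to test include/exclude patterns against a few known URLs before running a crawl. The sketch below assumes each pattern is a regular expression searched against the URL path; Firecrawl's real matching rules may differ, and the example URLs are hypothetical.

```python
import re
from urllib.parse import urlparse


def surviving_urls(urls, include_paths=(), exclude_paths=()):
    """Return the URLs that pass the include/exclude patterns.

    Assumption: each pattern is a regular expression searched against the
    URL path; the service's actual matching rules may differ.
    """
    include = [re.compile(p) for p in include_paths]
    exclude = [re.compile(p) for p in exclude_paths]
    kept = []
    for url in urls:
        path = urlparse(url).path
        if include and not any(rx.search(path) for rx in include):
            continue  # include list is set and nothing matched
        if any(rx.search(path) for rx in exclude):
            continue  # explicitly excluded
        kept.append(url)
    return kept


urls = [
    "https://example.com/blog/article-1",
    "https://example.com/docs/intro",
]
# Overlapping include and exclude patterns leave nothing to crawl.
both = surviving_urls(urls, include_paths=["blog/.*"], exclude_paths=["blog/*"])
only_exclude = surviving_urls(urls, exclude_paths=["blog/*"])
print(both)
print(only_exclude)
```

An empty list from a check like this suggests the patterns, not the site, are the cause of the empty crawl.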
