Webpage Content Extractor

Extracts the main content from a webpage's HTML. Similar to the "Reader" mode in your browser, it ignores headers, footers, banners, etc.

Overview

The Webpage Content Extractor node processes raw HTML code and extracts the main readable content from a web page, similar to the "Reader" mode in modern browsers. It removes extraneous elements such as headers, footers, banners, and advertisements, providing only the core article or content. This node is particularly useful for workflows that need to analyze, summarize, or repurpose web articles, blog posts, or news stories.

Practical examples:

  • Automatically summarizing news articles for newsletters.
  • Extracting the main text from blog posts for sentiment analysis.
  • Archiving clean versions of web content without clutter.

Properties

Name | Type | Meaning
HTML Code | String | The full HTML source code of the webpage to extract content from. Typically obtained using an HTTP Request node.

Output

The node outputs a JSON object with the following fields:

Field | Type | Description
excerpt | String | A short summary or excerpt of the main content.
siteName | String | The name of the website, if available.
length | Number | The character count of the extracted main content.
textContent | String | The plain text version of the main content, with all HTML tags removed.
content | String | The HTML-formatted main content (cleaned up, suitable for display).
title | String | The title of the article or main content.
language | String | The detected language code of the content (e.g., "en").
byline | String | The author or byline information, if available.
publishedTime | String | The publication date/time of the content, if available.

Example output:

{
  "excerpt": "This is a summary of the article...",
  "siteName": "Example News",
  "length": 1234,
  "textContent": "Full plain text of the article...",
  "content": "<div><p>Full HTML content...</p></div>",
  "title": "Breaking News: Example Event",
  "language": "en",
  "byline": "By Jane Doe",
  "publishedTime": "2024-06-01T12:00:00Z"
}
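
For downstream steps (for example, a Code node that prepares newsletter entries), it can help to give the output a type. The sketch below mirrors the field table in a TypeScript interface and adds a small helper that builds a one-line teaser. The names ExtractedPage and toTeaser are hypothetical and not part of the node. Typing the "if available" fields as possibly null or absent is an assumption, not confirmed behavior.

```typescript
// Shape of one output item, mirroring the field table above.
// Assumption: fields documented as "if available" may be null or missing.
interface ExtractedPage {
  excerpt?: string | null;
  siteName?: string | null;
  length: number;
  textContent: string;
  content: string;
  title: string;
  language?: string | null;
  byline?: string | null;
  publishedTime?: string | null;
}

// Hypothetical helper: builds a one-line teaser from an extracted page.
// Falls back to the plain text when no excerpt is present.
function toTeaser(page: ExtractedPage, maxChars = 80): string {
  const summary = (page.excerpt || page.textContent).trim();
  const clipped =
    summary.length > maxChars
      ? summary.slice(0, maxChars).replace(/\s+$/, "") + "..."
      : summary;
  const source = page.siteName ? ` (${page.siteName})` : "";
  return `${page.title}${source}: ${clipped}`;
}

const example: ExtractedPage = {
  excerpt: "This is a summary of the article...",
  siteName: "Example News",
  length: 1234,
  textContent: "Full plain text of the article...",
  content: "<div><p>Full HTML content...</p></div>",
  title: "Breaking News: Example Event",
  language: "en",
  byline: "By Jane Doe",
  publishedTime: "2024-06-01T12:00:00Z",
};

console.log(toTeaser(example));
// Breaking News: Example Event (Example News): This is a summary of the article...
```

Using excerpt first keeps the teaser short when the page supplies a description; textContent is only a fallback and is clipped to maxChars.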

Dependencies

  • External Libraries:
    • @mozilla/readability: Used for extracting the main content from HTML.
    • jsdom: Used to parse and simulate the DOM environment for Readability.
  • n8n Configuration:
    • No special API keys or environment variables are required.
    • The node expects valid HTML input, typically fetched using n8n's HTTP Request node.

Troubleshooting

  • Common Issues:
    • Invalid or incomplete HTML: If the provided HTML is malformed or missing key elements, extraction may fail or return empty results.
    • Non-article pages: Pages without clear main content (e.g., homepages, search results) may not yield meaningful output.
  • Error Messages:
    • "Could not extract main contents of webpage."
      • Cause: The Readability library could not identify the main content in the provided HTML.
      • Resolution: Ensure you are passing the full HTML of an article or content-rich page. Try fetching the page again or check if the URL points to a valid article.
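
One way to catch the issues above before they reach the node is a quick pre-flight check on the HTML string, for instance in a Code node placed after the HTTP Request node. The following is a rough heuristic sketch only. The function name checkHtmlInput and the 200-character threshold are illustrative assumptions, and passing the check does not guarantee that extraction will succeed.

```typescript
// Result of a cheap pre-flight check on the HTML input.
type HtmlCheck = { ok: true } | { ok: false; reason: string };

// Hypothetical heuristic, not part of the node: flags input that is
// unlikely to be the full HTML of a page. minLength is an arbitrary
// illustrative threshold.
function checkHtmlInput(html: unknown, minLength = 200): HtmlCheck {
  if (typeof html !== "string" || html.trim() === "") {
    return { ok: false, reason: "HTML input is empty or not a string" };
  }
  // A full page normally contains an <html> or <body> tag; JSON, binary
  // data, or a bare fragment usually does not.
  if (!/<(html|body)[\s>]/i.test(html)) {
    return { ok: false, reason: "No <html> or <body> tag found" };
  }
  if (html.length < minLength) {
    return {
      ok: false,
      reason: `HTML is only ${html.length} characters; the page may be incomplete`,
    };
  }
  return { ok: true };
}

const result = checkHtmlInput('{"error": "not found"}');
if (!result.ok) {
  console.log(result.reason);
}
```

A failed check points to the request step (wrong URL, non-HTML response, truncated body) rather than to the extractor itself.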
