Webpage Content Extractor

Extracts the main content from a webpage's HTML source. Similar to the "Reader" mode in your browser, it ignores headers, footers, banners, etc. The node does not fetch the page itself; it works on HTML you pass in.

Overview

The Webpage Content Extractor node processes raw HTML code and extracts the main readable content from a web page, similar to the "Reader" mode in modern browsers. It removes extraneous elements such as headers, footers, banners, and advertisements, providing only the core article or content. This node is particularly useful for workflows that need to analyze, summarize, or repurpose web articles, blog posts, or news stories.

Practical examples:

- Automatically summarizing news articles for newsletters.
- Extracting the main text from blog posts for sentiment analysis.
- Archiving clean versions of web content without clutter.

Properties

Name	Type	Meaning
HTML Code	String	The full HTML source code of the webpage to extract content from. Typically obtained using an HTTP Request node.

Output

The node outputs a JSON object with the following fields:

Field	Type	Description
excerpt	String	A short summary or excerpt of the main content.
siteName	String	The name of the website, if available.
length	Number	The character count of the extracted main content.
textContent	String	The plain text version of the main content, with all HTML tags removed.
content	String	The HTML-formatted main content (cleaned up, suitable for display).
title	String	The title of the article or main content.
language	String	The detected language code of the content (e.g., "en").
byline	String	The author or byline information, if available.
publishedTime	String	The publication date/time of the content, if available.

Example output:

{
  "excerpt": "This is a summary of the article...",
  "siteName": "Example News",
  "length": 1234,
  "textContent": "Full plain text of the article...",
  "content": "<div><p>Full HTML content...</p></div>",
  "title": "Breaking News: Example Event",
  "language": "en",
  "byline": "By Jane Doe",
  "publishedTime": "2024-06-01T12:00:00Z"
}
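
Downstream steps usually need only a few of these fields. As a minimal sketch, assuming the output has exactly the shape shown above, the following TypeScript types the object and builds a short plain-text digest entry from it. The names `ExtractedPage` and `toDigestEntry` are hypothetical and not part of the node, and treating the "if available" fields as possibly null or missing is an assumption.

```typescript
// Shape of the node's output as documented above. Fields described as
// "if available" are modelled as optional or null (an assumption).
interface ExtractedPage {
  excerpt: string;
  siteName?: string | null;
  length: number;
  textContent: string;
  content: string;
  title: string;
  language?: string | null;
  byline?: string | null;
  publishedTime?: string | null;
}

// Hypothetical helper: build a plain-text digest entry from one extracted page.
// Missing or empty metadata fields are skipped.
function toDigestEntry(page: ExtractedPage): string {
  const meta = [page.byline, page.siteName, page.publishedTime].filter(
    (part): part is string => typeof part === "string" && part.trim() !== "",
  );
  const lines: string[] = [page.title.trim()];
  if (meta.length > 0) {
    lines.push(meta.join(" | "));
  }
  lines.push(page.excerpt.trim());
  return lines.join("\n");
}

// The example output from this section.
const example: ExtractedPage = {
  excerpt: "This is a summary of the article...",
  siteName: "Example News",
  length: 1234,
  textContent: "Full plain text of the article...",
  content: "<div><p>Full HTML content...</p></div>",
  title: "Breaking News: Example Event",
  language: "en",
  byline: "By Jane Doe",
  publishedTime: "2024-06-01T12:00:00Z",
};

console.log(toDigestEntry(example));
// Breaking News: Example Event
// By Jane Doe | Example News | 2024-06-01T12:00:00Z
// This is a summary of the article...
```

Filtering the metadata before joining keeps the entry readable when a page has no byline or publication time.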

Dependencies

External Libraries:
- @mozilla/readability: Used for extracting the main content from HTML.
- jsdom: Used to parse and simulate the DOM environment for Readability.

n8n Configuration:
- No special API keys or environment variables are required.
- The node expects valid HTML input, typically fetched using n8n's HTTP Request node.

Troubleshooting

Common Issues:
- Invalid or incomplete HTML: If the provided HTML is malformed or missing key elements, extraction may fail or return empty results.
- Non-article pages: Pages without clear main content (e.g., homepages, search results) may not yield meaningful output.

Error Messages:
- "Could not extract main contents of webpage."
  - Cause: The Readability library could not identify the main content in the provided HTML.
  - Resolution: Ensure you are passing the full HTML of an article or content-rich page. Try fetching the page again or check if the URL points to a valid article.
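
Some of these problems can be caught in the workflow itself, for example in a Code node placed before or after this node. The sketch below shows two rough checks: one on the HTML input and one on the extracted output. `looksLikeFullHtml`, `hasMeaningfulContent`, and both thresholds are hypothetical. They are heuristics only: valid HTML may omit the tags the first check looks for, and passing it does not guarantee that Readability will find an article.

```typescript
// Hypothetical pre-check: does the input look like a complete HTML document?
// Rejects very short strings and input without an <html> open tag or a
// </body> close tag, which often indicates a truncated or non-HTML response.
function looksLikeFullHtml(html: string, minLength: number = 500): boolean {
  const trimmed = html.trim();
  if (trimmed.length < minLength) {
    return false;
  }
  return /<html[\s>]/i.test(trimmed) && /<\/body\s*>/i.test(trimmed);
}

// Hypothetical post-check: did the extraction yield enough text to be useful?
// Non-article pages tend to produce very short output. The 200-character
// default is an arbitrary illustration, not a value used by the node.
function hasMeaningfulContent(
  output: { length: number; textContent: string },
  minChars: number = 200,
): boolean {
  return output.length >= minChars && output.textContent.trim().length >= minChars;
}

const truncated = "<html><body><p>Loading";
const complete =
  "<!DOCTYPE html><html><head><title>t</title></head><body><article>" +
  "<p>Some paragraph text.</p>".repeat(30) +
  "</article></body></html>";

console.log(looksLikeFullHtml(truncated)); // false
console.log(looksLikeFullHtml(complete)); // true
console.log(hasMeaningfulContent({ length: 12, textContent: "Search: 0 hits" })); // false
```

Items that fail either check can be routed to a separate branch instead of stopping the workflow.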