BrightData

Interact with Bright Data to scrape websites, or use existing datasets from the marketplace to generate adapted snapshots.

Overview

This node connects to the Bright Data platform to download snapshot data produced by web scraping jobs. The "Download Snapshot" operation under the "Web Scraper" resource retrieves part of a previously created snapshot in JSON or CSV format. Large datasets can therefore be processed incrementally: the batch size sets how many records each part contains, and each request downloads one numbered part.

Common scenarios include:

Incrementally downloading large scraped datasets to avoid memory overload.
Retrieving specific parts of a snapshot for partial analysis.
Choosing between JSON or CSV formats depending on downstream processing needs.
Optionally compressing the downloaded data to reduce bandwidth usage.

Example: A user has triggered a web scraping job that produced a snapshot with 10,000 records. Using this node, they can download the snapshot in 5 parts of 2,000 records each, specifying batch size and part number, and receive the data as compressed JSON.
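The arithmetic in this example can be sketched as follows. This is a minimal illustration, assuming parts are contiguous, fixed-size slices of the snapshot with a smaller final part when the record count is not a multiple of the batch size. The helper names `part_count` and `part_bounds` are hypothetical and are not part of the node.

```python
from typing import Tuple


def part_count(total_records: int, batch_size: int) -> int:
    """Number of parts needed to cover total_records at batch_size per part."""
    if total_records < 0:
        raise ValueError("total_records must be non-negative")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Integer ceiling division: a final, smaller part holds the remainder.
    return (total_records + batch_size - 1) // batch_size


def part_bounds(part: int, total_records: int, batch_size: int) -> Tuple[int, int]:
    """Zero-based [start, end) record range covered by a 1-based part number."""
    if not 1 <= part <= part_count(total_records, batch_size):
        raise ValueError("part is out of range")
    start = (part - 1) * batch_size
    return start, min(start + batch_size, total_records)


print(part_count(10_000, 2_000))      # 5
print(part_bounds(5, 10_000, 2_000))  # (8000, 10000)
```

Knowing the part count up front makes it possible to loop over parts 1 through 5 without requesting a part that does not exist.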

Properties

Name	Meaning
Batch Size	The number of records to download in each batch.
Part	The part number of the snapshot to download (e.g., 1 for first part, 2 for second, etc.).
Snapshot ID	The unique identifier of the snapshot to operate on (required).
Format	The format of the returned data. Options: JSON, CSV.
Compress	Whether to compress the response using gzip format (true/false).
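As a rough illustration of how these properties constrain one another, the sketch below validates a set of options before any request is made. The `DownloadOptions` class, its field names, the default values, and the lowercase format strings are all assumptions made for this example; the node defines its own parameter handling.

```python
from dataclasses import dataclass

# Assumption: formats are written in lowercase here; the node lists them as JSON and CSV.
ALLOWED_FORMATS = ("json", "csv")


@dataclass(frozen=True)
class DownloadOptions:
    """Hypothetical container mirroring the properties in the table above."""
    snapshot_id: str          # required
    batch_size: int = 1000    # placeholder default, not taken from the node
    part: int = 1             # parts are numbered from 1
    fmt: str = "json"
    compress: bool = False

    def __post_init__(self) -> None:
        if not self.snapshot_id.strip():
            raise ValueError("snapshot_id is required")
        if self.batch_size <= 0:
            raise ValueError("batch_size must be a positive integer")
        if self.part < 1:
            raise ValueError("part numbers start at 1")
        if self.fmt not in ALLOWED_FORMATS:
            raise ValueError(f"fmt must be one of {ALLOWED_FORMATS}")


# "s_example" is a made-up snapshot ID used only for illustration.
options = DownloadOptions(snapshot_id="s_example", batch_size=2000, part=3, fmt="csv")
print(options)
```

Rejecting bad values locally gives a clearer error than waiting for the API to refuse the request.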

Output

The node outputs the downloaded snapshot data in the json field of the output items. The structure depends on the chosen format:

JSON: Parsed JSON objects representing the scraped records.
CSV: The scraped records as CSV text, probably either a raw string or rows parsed from it; the exact shape is not confirmed.

If compression is enabled, the node decompresses the gzip response before emitting the data.
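For readers who call the API outside the node, the decoding step described above might look like the sketch below. It is not the node's own code: the function name `decode_snapshot_body` is hypothetical, and the sketch assumes a UTF-8 body that is either a JSON array of records or CSV text with a header row.

```python
import csv
import gzip
import io
import json
from typing import Dict, List


def decode_snapshot_body(body: bytes, fmt: str, compressed: bool) -> List[Dict]:
    """Turn a raw response body into a list of record dictionaries."""
    # Undo gzip first, so the parsers below always see plain text.
    raw = gzip.decompress(body) if compressed else body
    text = raw.decode("utf-8")
    if fmt == "json":
        data = json.loads(text)
        # Assumption: a single object is treated as a one-record list.
        return data if isinstance(data, list) else [data]
    if fmt == "csv":
        # Assumption: the first CSV row holds the column names.
        return [dict(row) for row in csv.DictReader(io.StringIO(text))]
    raise ValueError(f"unsupported format: {fmt!r}")


sample = gzip.compress(json.dumps([{"title": "Example", "price": 9.5}]).encode("utf-8"))
print(decode_snapshot_body(sample, "json", compressed=True))
```

Note that CSV values come back as strings, so numeric columns need an explicit conversion downstream.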

The node does not appear to produce binary output.

Dependencies

Requires an API key credential for authenticating with the Bright Data platform.
Connects to the Bright Data API endpoint at https://api.brightdata.com.
The node uses HTTP requests with query parameters to specify batch size, part number, snapshot ID, format, and compression.
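A request of this kind could be assembled roughly as shown below. Only the base URL comes from this page. The endpoint path, the query parameter names, the placement of the snapshot ID in the path, and the Bearer authorization scheme are assumptions made for illustration and should be checked against the Bright Data API reference. No network call is made.

```python
from typing import Dict
from urllib.parse import quote, urlencode

BASE_URL = "https://api.brightdata.com"
# Assumption: illustrative path; confirm against the official API reference.
SNAPSHOT_PATH = "/datasets/v3/snapshot/{snapshot_id}"


def build_download_url(snapshot_id: str, batch_size: int, part: int,
                       fmt: str = "json", compress: bool = False) -> str:
    """Build a download URL from the node's properties (parameter names assumed)."""
    query = urlencode({
        "format": fmt,
        "compress": str(compress).lower(),  # "true" or "false"
        "batch_size": batch_size,
        "part": part,
    })
    path = SNAPSHOT_PATH.format(snapshot_id=quote(snapshot_id, safe=""))
    return f"{BASE_URL}{path}?{query}"


def auth_headers(api_key: str) -> Dict[str, str]:
    """Assumption: the API key is sent as a Bearer token."""
    return {"Authorization": f"Bearer {api_key}"}


# "s_example" is a made-up snapshot ID.
print(build_download_url("s_example", batch_size=2000, part=1, compress=True))
```

Percent-encoding the snapshot ID keeps unexpected characters from changing the path.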

Troubleshooting

Invalid Snapshot ID: If the snapshot ID is incorrect or does not exist, the API will likely return an error. Verify the snapshot ID is correct.
Batch Size Too Large: Requesting too many records per batch may cause timeouts or memory issues. Reduce batch size.
Part Number Out of Range: Requesting a part number beyond the available parts will result in empty or error responses. Ensure part numbers are within valid range.
Compression Issues: If compression is enabled but the client cannot handle gzip, data may be unreadable. Disable compression if unsupported.
Authentication Errors: Missing or invalid API credentials will prevent access. Confirm API key configuration.
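One way to avoid out-of-range part numbers when the total record count is unknown is to request parts in order and stop at the first empty one. The sketch below assumes an out-of-range part yields an empty result rather than an error, which this page does not guarantee. `iter_parts` and the `fetch_part` callback are hypothetical names, and the callback stands in for whatever performs the actual download.

```python
from typing import Callable, Dict, Iterator, List


def iter_parts(fetch_part: Callable[[int], List[Dict]],
               max_parts: int = 1000) -> Iterator[List[Dict]]:
    """Yield each part's records in order, stopping at the first empty part."""
    # max_parts is a safety cap so a misbehaving source cannot loop forever.
    for part in range(1, max_parts + 1):
        records = fetch_part(part)
        if not records:
            return
        yield records


# Stand-in for a real download: 5 in-memory records served in parts of 2.
DATA = [{"id": i} for i in range(5)]
BATCH_SIZE = 2


def fake_fetch(part: int) -> List[Dict]:
    start = (part - 1) * BATCH_SIZE
    return DATA[start:start + BATCH_SIZE]


for number, records in enumerate(iter_parts(fake_fetch), start=1):
    print(number, len(records))
```

If the API signals an out-of-range part with an error instead, the exception raised by `fetch_part` propagates to the caller, which can treat it as the end of the snapshot or as a genuine failure.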