Dedup Datasets Worker is an enterprise-grade tool for merging and deduplicating datasets from multiple JSON/JSONL sources. Optimized for the CoreClaw platform with enhanced features and robust error handling.
| Feature | Description |
|---|---|
| 📦 Multi-source Merge | Load and merge data from direct input, URLs, or Core Datasets |
| 🎯 Composite Key Dedup | Deduplicate based on multi-field combinations (e.g., productId + sku) |
| 🔄 Dual Processing Modes | dedup-after-load (preserve order) or dedup-as-loading (streaming) |
| 🔧 Custom Transformations | Pre/post deduplication JavaScript transformation functions |
| 🚀 Auto Format Detection | Automatically detect JSON/JSONL based on file extension |
| 🌐 Proxy Support | Built-in proxy configuration for the CoreClaw cloud environment |
| 💾 State Persistence | Automatic state saving for recovery from interruptions |
| 🔍 Duplicate Detection | Find and output duplicate items separately |
CoreClaw requires the schema b field to reference an array-typed property. In this worker, b is bound to an internal placeholder field named runUnits so the UI can load reliably without splitting dedup logic by business fields.
- Do not remove runUnits from the schema when publishing to CoreClaw.
- runUnits is a platform compatibility field, not a business input.
- fields remains the actual dedup key configuration used by the worker (a hedged schema sketch follows the table below).

| Parameter | Type | Description |
|---|---|---|
| fields | array | Fields to use for deduplication (e.g., ["productId", "sku"]) |
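To make the runUnits note concrete, here is a minimal sketch of how the input schema might declare both properties. The exact CoreClaw schema layout is an assumption here; only the idea that runUnits is an array-typed placeholder and fields carries the real dedup keys comes from the notes above.

```json
{
  "title": "Dedup Datasets Worker input",
  "type": "object",
  "properties": {
    "runUnits": {
      "title": "Run units (platform placeholder)",
      "type": "array",
      "description": "Compatibility field required by CoreClaw; not used by the dedup logic.",
      "default": []
    },
    "fields": {
      "title": "Dedup fields",
      "type": "array",
      "description": "Field names whose combined values identify a unique item.",
      "example": ["productId", "sku"]
    }
  },
  "required": ["fields"]
}
```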
| Parameter | Type | Description |
|---|---|---|
| dataSourceType | string | Data source type: "direct-input", "network-url", or "core-dataset" |
| inputData | string | JSON array data (when dataSourceType = "direct-input") |
| inputUrls | array | URL list of data files (when dataSourceType = "network-url") |
| datasetIds | array | Core Dataset ID list (when dataSourceType = "core-dataset") |
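For orientation, a minimal direct-input configuration might look like the sketch below. The product records are illustrative values, not data from the worker's documentation.

```json
{
  "fields": ["productId", "sku"],
  "dataSourceType": "direct-input",
  "inputData": "[{\"productId\": \"SKU-1001\", \"sku\": \"A-BLACK\"}, {\"productId\": \"SKU-1002\", \"sku\": \"B-WHITE\"}]"
}
```

Switching dataSourceType to "network-url" replaces inputData with inputUrls, and "core-dataset" replaces it with datasetIds, as described in the table above.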
| Parameter | Type | Default | Description |
|---|---|---|---|
| inputFormat | string | "json" | File format: json or jsonl (auto-detected by extension) |
| output | string | "unique-items" | Output type: unique-items, duplicate-items, or nothing |
| mode | string | "dedup-after-load" | Processing mode |
| fieldsToLoad | array | [] | Only load specified fields to save memory |
| preDedupTransformFunction | string | "" | JavaScript function to transform data before deduplication |
| postDedupTransformFunction | string | "" | JavaScript function to transform data after deduplication |
| customInputData | string | "" | Custom data object (JSON) passed to transform functions |
| nullAsUnique | boolean | false | Treat null/undefined values as unique |
| parallelLoads | integer | 10 | Number of parallel file loads (1-100) |
| parallelPushes | integer | 5 | Number of parallel data pushes (1-50) |
| batchSize | integer | 5000 | Batch size for processing (100-50000) |
| appendFileSource | boolean | false | Add __fileSource__ field to track file origin |
| verboseLog | boolean | false | Enable detailed logging |
Deduplicate products by the productId + sku combination — items that share a productId but have different sku values are kept as unique items.
Result: 2 unique items — the third entry (same productId + sku as the first) is deduplicated.
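A sketch of such an input is shown below; the three product records are illustrative values chosen to match the described outcome, not data from the worker's documentation.

```json
{
  "fields": ["productId", "sku"],
  "dataSourceType": "direct-input",
  "inputData": "[{\"productId\": \"SKU-1001\", \"sku\": \"A-BLACK\", \"price\": 19.99}, {\"productId\": \"SKU-1001\", \"sku\": \"B-WHITE\", \"price\": 21.99}, {\"productId\": \"SKU-1001\", \"sku\": \"A-BLACK\", \"price\": 18.50}]",
  "output": "unique-items"
}
```

The first two records differ in sku, so both survive; the third repeats the productId + sku pair of the first and is dropped, leaving 2 unique items.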
Merge product catalogs from multiple URLs, filter out invalid items before dedup, and enrich after dedup.
Pipeline: Load 3 sources → Filter by minPrice & stock → Round prices → Dedup by productId+sku → Add mergedAt timestamp → Output
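The exact signature of the transform functions is not spelled out above, so the sketch below assumes each function receives the loaded items array plus the parsed customInputData object and returns the transformed array; treat that signature, the example URLs, and the minPrice value as assumptions.

```json
{
  "fields": ["productId", "sku"],
  "dataSourceType": "network-url",
  "inputUrls": [
    "https://example.com/catalog-a.json",
    "https://example.com/catalog-b.json",
    "https://example.com/catalog-c.jsonl"
  ],
  "customInputData": "{\"minPrice\": 1}",
  "preDedupTransformFunction": "(items, custom) => items.filter(i => i.price >= custom.minPrice && i.stock > 0).map(i => ({ ...i, price: Math.round(i.price * 100) / 100 }))",
  "postDedupTransformFunction": "(items) => items.map(i => ({ ...i, mergedAt: new Date().toISOString() }))"
}
```

The pre-dedup function covers the filtering and price rounding steps, the fields setting performs the productId + sku dedup, and the post-dedup function adds the mergedAt timestamp.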
Deduplicate across multiple Core Datasets to find unique entries.
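A minimal configuration for this case might look like the following sketch; the dataset IDs are placeholders.

```json
{
  "fields": ["productId", "sku"],
  "dataSourceType": "core-dataset",
  "datasetIds": ["dataset-id-1", "dataset-id-2"],
  "output": "unique-items"
}
```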
Find all duplicate entries for data quality auditing.
Result: 1 duplicate item (the second SKU-1001/A-BLACK entry)
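Only the output type changes for this case. A sketch, reusing the illustrative records from the first example:

```json
{
  "fields": ["productId", "sku"],
  "dataSourceType": "direct-input",
  "inputData": "[{\"productId\": \"SKU-1001\", \"sku\": \"A-BLACK\"}, {\"productId\": \"SKU-1001\", \"sku\": \"B-WHITE\"}, {\"productId\": \"SKU-1001\", \"sku\": \"A-BLACK\"}]",
  "output": "duplicate-items"
}
```

The second SKU-1001/A-BLACK record is the only item reported.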
| Dataset Size | Recommended Mode | Memory | Speed |
|---|---|---|---|
| < 10K | dedup-after-load | Low | Fast |
| 10K-100K | dedup-after-load | Medium | Fast |
| 100K-1M | dedup-as-loading | Low | Medium |
| > 1M | dedup-as-loading | Low | Slow |
- Use the fieldsToLoad parameter to load only the fields you need.
- Adjust batchSize based on available memory.
- Increase parallelLoads for faster processing.
- Files with a .jsonl extension are parsed as JSONL regardless of inputFormat.

A tuning sketch combining these options follows the troubleshooting table below.

| Problem | Solution |
|---|---|
| JavaScript heap out of memory | Switch to dedup-as-loading mode, reduce batchSize, use fieldsToLoad, use JSONL format |
| All items appear as unique | Verify field names, enable verboseLog: true, confirm data contains the specified fields |
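As referenced above, here is a sketch of a memory-friendly configuration for large inputs; the URL is a placeholder and the numeric values are starting points to tune, not documented recommendations.

```json
{
  "fields": ["productId", "sku"],
  "dataSourceType": "network-url",
  "inputUrls": ["https://example.com/large-catalog.jsonl"],
  "mode": "dedup-as-loading",
  "fieldsToLoad": ["productId", "sku", "price"],
  "batchSize": 1000,
  "parallelLoads": 20,
  "verboseLog": true
}
```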