Kael Odin

Dataset Deduplication & Merge Tool

odin-kael/dataset-deduplication-and-merge-tool

Dedup Datasets Worker is an enterprise-grade tool for merging and deduplicating datasets from multiple JSON/JSONL sources. Optimized for the CoreClaw platform with enhanced features and robust error handling.

✨ Key Features

| Feature | Description |
| --- | --- |
| 📦 Multi-source Merge | Load and merge data from direct input, URLs, or Core Datasets |
| 🎯 Composite Key Dedup | Deduplicate based on multi-field combinations (e.g., productId + sku) |
| 🔄 Dual Processing Modes | dedup-after-load (preserve order) or dedup-as-loading (streaming) |
| 🔧 Custom Transformations | Pre/post deduplication JavaScript transformation functions |
| 🚀 Auto Format Detection | Automatically detect JSON/JSONL based on file extension |
| 🌐 Proxy Support | Built-in proxy configuration for CoreClaw cloud environment |
| 💾 State Persistence | Automatic state saving for recovery from interruptions |
| 🔍 Duplicate Detection | Find and output duplicate items separately |

🚀 Quick Start

bash
npm install
npm start

📋 Input Parameters

Platform Compatibility Note

CoreClaw requires the schema b field to reference an array-typed property. In this worker, b is bound to an internal placeholder field named runUnits so the UI can load reliably without splitting dedup logic by business fields.

  • Do not remove runUnits from the schema when publishing to CoreClaw.
  • runUnits is a platform compatibility field, not a business input.
  • fields remains the actual dedup key configuration used by the worker.

Required Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| fields | array | Fields to use for deduplication (e.g., ["productId", "sku"]) |

Data Source Parameters (Choose One)

| Parameter | Type | Description |
| --- | --- | --- |
| dataSourceType | string | Data source type: "direct-input", "network-url", or "core-dataset" |
| inputData | string | JSON array data (when dataSourceType="direct-input") |
| inputUrls | array | URL list of data files (when dataSourceType="network-url") |
| datasetIds | array | Core Dataset ID list (when dataSourceType="core-dataset") |
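
Because inputData is typed as a string, the JSON array has to be serialized before it is placed in the input, and its inner quotes must be escaped when the whole input is written as JSON. The following is a minimal sketch of one way to do that in Node.js; buildDirectInput is a hypothetical helper for illustration, not part of the worker.

```javascript
// Hypothetical helper: builds a worker input object for "direct-input" mode.
// inputData is documented as a string, so the array is serialized first.
function buildDirectInput(items, fields) {
  return {
    dataSourceType: "direct-input",
    inputData: JSON.stringify(items), // array -> JSON string
    fields,
  };
}

const input = buildDirectInput(
  [
    { productId: "SKU-1001", sku: "A-BLACK" },
    { productId: "SKU-1001", sku: "A-WHITE" },
  ],
  ["productId", "sku"]
);

// Serializing the whole input escapes the inner quotes automatically.
const payload = JSON.stringify(input, null, 2);
console.log(payload);
```

Generating the payload this way avoids hand-escaping every quote inside inputData.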

Optional Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| inputFormat | string | "json" | File format: json or jsonl (auto-detected by extension) |
| output | string | "unique-items" | Output type: unique-items, duplicate-items, or nothing |
| mode | string | "dedup-after-load" | Processing mode |
| fieldsToLoad | array | [] | Only load specified fields to save memory |
| preDedupTransformFunction | string | "" | JavaScript function to transform data before deduplication |
| postDedupTransformFunction | string | "" | JavaScript function to transform data after deduplication |
| customInputData | string | "" | Custom data object (JSON) passed to transform functions |
| nullAsUnique | boolean | false | Treat null/undefined values as unique |
| parallelLoads | integer | 10 | Number of parallel file loads (1-100) |
| parallelPushes | integer | 5 | Number of parallel data pushes (1-50) |
| batchSize | integer | 5000 | Batch size for processing (100-50000) |
| appendFileSource | boolean | false | Add __fileSource__ field to track file origin |
| verboseLog | boolean | false | Enable detailed logging |
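
The nullAsUnique flag is described only briefly, so the sketch below shows one plausible reading of it: an item with a null or undefined dedup field is never compared against other items. This is an assumption for illustration; dedupKey and dedup are hypothetical names, and the worker's real behavior may differ.

```javascript
// Sketch of one possible reading of nullAsUnique; the worker's real
// behavior may differ. Returns null when the item should skip dedup.
function dedupKey(item, fields, nullAsUnique) {
  const values = fields.map((field) => item[field]);
  const hasMissing = values.some((v) => v === null || v === undefined);
  if (hasMissing && nullAsUnique) {
    return null; // never compared against other items
  }
  return JSON.stringify(values.map((v) => v ?? null));
}

function dedup(items, fields, nullAsUnique = false) {
  const seen = new Set();
  return items.filter((item) => {
    const key = dedupKey(item, fields, nullAsUnique);
    if (key === null) return true;   // treated as unique
    if (seen.has(key)) return false; // duplicate of an earlier item
    seen.add(key);
    return true;
  });
}

const rows = [
  { productId: "SKU-1", sku: null },
  { productId: "SKU-1", sku: null },
  { productId: "SKU-2", sku: "B" },
];

console.log(dedup(rows, ["productId", "sku"], false).length); // 2
console.log(dedup(rows, ["productId", "sku"], true).length);  // 3
```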

💡 Usage Examples

Example 1: E-Commerce Product Dedup (Composite Key)

Deduplicate products by the productId + sku combination. Entries that share a productId but have different SKUs are kept as separate unique items.

json
{
  "dataSourceType": "direct-input",
  "inputData": "[{\"productId\":\"SKU-1001\",\"sku\":\"A-BLACK\",\"name\":\"Wireless Earbuds Pro\",\"price\":79.99,\"category\":\"Electronics\",\"rating\":4.8},{\"productId\":\"SKU-1001\",\"sku\":\"A-WHITE\",\"name\":\"Wireless Earbuds Pro\",\"price\":79.99,\"category\":\"Electronics\",\"rating\":4.8},{\"productId\":\"SKU-1001\",\"sku\":\"A-BLACK\",\"name\":\"Wireless Earbuds Pro (Duplicate)\",\"price\":69.99,\"category\":\"Electronics\",\"rating\":4.5}]",
  "fields": ["productId", "sku"]
}

Result: 2 unique items — the third entry (same productId + sku as the first) is deduplicated.
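
To make the composite-key behavior concrete, here is a minimal sketch of deduplication on productId + sku that keeps the first item seen for each key. It is not the worker's actual code: compositeKey and dedupByFields are hypothetical names, and the JSON-encoded key is an assumption chosen for the illustration.

```javascript
// Sketch of composite-key deduplication; not the worker's actual code.
// The key is the JSON-encoded list of field values, so ["a", "bc"] and
// ["ab", "c"] cannot collide the way plain string concatenation could.
function compositeKey(item, fields) {
  return JSON.stringify(fields.map((field) => item[field] ?? null));
}

// Keeps the first item seen for each key and preserves input order.
function dedupByFields(items, fields) {
  const seen = new Set();
  const unique = [];
  for (const item of items) {
    const key = compositeKey(item, fields);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(item);
    }
  }
  return unique;
}

const products = [
  { productId: "SKU-1001", sku: "A-BLACK", name: "Wireless Earbuds Pro", price: 79.99 },
  { productId: "SKU-1001", sku: "A-WHITE", name: "Wireless Earbuds Pro", price: 79.99 },
  { productId: "SKU-1001", sku: "A-BLACK", name: "Wireless Earbuds Pro (Duplicate)", price: 69.99 },
];

const uniqueProducts = dedupByFields(products, ["productId", "sku"]);
console.log(uniqueProducts.map((p) => p.sku)); // [ 'A-BLACK', 'A-WHITE' ]
```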

Example 2: Multi-Source Merge with Transform Pipeline

Merge product catalogs from multiple URLs, filter out invalid items before dedup, and enrich after dedup.

json
{
  "dataSourceType": "network-url",
  "inputUrls": [
    { "url": "https://api.example.com/catalog/electronics.json" },
    { "url": "https://api.example.com/catalog/wearables.json" },
    { "url": "https://api.example.com/catalog/audio.jsonl" }
  ],
  "inputFormat": "json",
  "fields": ["productId", "sku"],
  "preDedupTransformFunction": "async (items, customData) => {\n  return items\n    .filter(item => item.price >= customData.minPrice && item.stock > 0)\n    .map(item => ({...item, price: Math.round(item.price * 100) / 100}));\n}",
  "postDedupTransformFunction": "async (items) => {\n  const ts = new Date().toISOString();\n  return items.map(item => ({...item, mergedAt: ts, source: 'catalog-merge'}));\n}",
  "customInputData": "{\"minPrice\": 10}",
  "appendFileSource": true,
  "verboseLog": true
}

Pipeline: Load 3 sources → Filter by minPrice & stock → Round prices → Dedup by productId+sku → Add mergedAt timestamp → Output
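
The stage order in this pipeline can be sketched locally to test transform functions before submitting them. The example below assumes the worker calls the pre-dedup function with (items, customData) and the post-dedup function with (items), as the example signatures suggest. runPipeline, dedupByFields, and the sample records are hypothetical stand-ins, not the worker's implementation.

```javascript
// Sketch of the stage order described above; the dedup step is a
// simplified stand-in for the worker's own logic.
const preDedupTransform = async (items, customData) =>
  items
    .filter((item) => item.price >= customData.minPrice && item.stock > 0)
    .map((item) => ({ ...item, price: Math.round(item.price * 100) / 100 }));

const postDedupTransform = async (items) => {
  const ts = new Date().toISOString();
  return items.map((item) => ({ ...item, mergedAt: ts, source: "catalog-merge" }));
};

function dedupByFields(items, fields) {
  const seen = new Set();
  return items.filter((item) => {
    const key = JSON.stringify(fields.map((f) => item[f] ?? null));
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

async function runPipeline(items, fields, customData) {
  const filtered = await preDedupTransform(items, customData);
  const unique = dedupByFields(filtered, fields);
  return postDedupTransform(unique);
}

// Made-up sample records standing in for the three loaded sources.
const loaded = [
  { productId: "P1", sku: "A", price: 19.999, stock: 3 },
  { productId: "P1", sku: "A", price: 19.999, stock: 3 }, // duplicate
  { productId: "P2", sku: "B", price: 5, stock: 10 },      // below minPrice
  { productId: "P3", sku: "C", price: 42, stock: 0 },      // out of stock
];

runPipeline(loaded, ["productId", "sku"], { minPrice: 10 }).then((result) => {
  console.log(result.length, result[0].price); // 1 20
});
```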

Example 3: Core Dataset Cross-Reference Dedup

Deduplicate across multiple Core Datasets to find unique entries.

json
{
  "dataSourceType": "core-dataset",
  "datasetIds": ["ds-product-crawl-a", "ds-product-crawl-b", "ds-supplier-feed"],
  "fields": ["productId", "sku"],
  "mode": "dedup-as-loading",
  "batchSize": 10000,
  "fieldsToLoad": ["productId", "sku", "name", "price", "stock"],
  "appendFileSource": true
}

Example 4: Large-Scale Streaming Dedup (>100K records)

json
{
  "dataSourceType": "network-url",
  "inputUrls": [{ "url": "https://data.example.com/full-catalog.jsonl" }],
  "inputFormat": "jsonl",
  "fields": ["productId", "sku"],
  "mode": "dedup-as-loading",
  "batchSize": 10000,
  "fieldsToLoad": ["productId", "sku", "name", "price"],
  "verboseLog": true
}
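
The memory benefit of streaming deduplication comes from holding only the set of seen keys and the current batch, not the full dataset. The sketch below illustrates that idea over JSONL lines; it is an assumption about the general technique, not the worker's dedup-as-loading code, and dedupJsonlLines is a hypothetical name.

```javascript
// Sketch of streaming dedup over JSONL text: only the set of seen keys
// and the current batch stay in memory, not the full dataset.
function* dedupJsonlLines(lines, fields, batchSize) {
  const seen = new Set();
  let batch = [];
  for (const line of lines) {
    if (line.trim() === "") continue; // skip blank lines
    const item = JSON.parse(line);
    const key = JSON.stringify(fields.map((f) => item[f] ?? null));
    if (seen.has(key)) continue; // drop repeats as they arrive
    seen.add(key);
    batch.push(item);
    if (batch.length >= batchSize) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // flush the final partial batch
}

const jsonl = [
  '{"productId":"P1","sku":"A","name":"Earbuds"}',
  '{"productId":"P2","sku":"B","name":"Watch"}',
  '{"productId":"P1","sku":"A","name":"Earbuds (again)"}',
  '{"productId":"P3","sku":"C","name":"Speaker"}',
];

const batches = [...dedupJsonlLines(jsonl, ["productId", "sku"], 2)];
console.log(batches.map((b) => b.length)); // [ 2, 1 ]
```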

Example 5: Duplicate Detection & Analysis

Find all duplicate entries for data quality auditing.

json
{
  "dataSourceType": "direct-input",
  "inputData": "[{\"productId\":\"SKU-1001\",\"sku\":\"A-BLACK\",\"name\":\"Earbuds\"},{\"productId\":\"SKU-1001\",\"sku\":\"A-BLACK\",\"name\":\"Earbuds Pro\"},{\"productId\":\"SKU-1002\",\"sku\":\"B-WHITE\",\"name\":\"Watch\"}]",
  "fields": ["productId", "sku"],
  "output": "duplicate-items"
}

Result: 1 duplicate item (the second SKU-1001/A-BLACK entry)
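
A minimal sketch of how duplicate items could be separated from first occurrences, assuming that "duplicate" means every later item whose key has already been seen. splitDuplicates is a hypothetical name, not the worker's API.

```javascript
// Sketch: split items into first occurrences and later repeats.
function splitDuplicates(items, fields) {
  const seen = new Set();
  const unique = [];
  const duplicates = [];
  for (const item of items) {
    const key = JSON.stringify(fields.map((f) => item[f] ?? null));
    if (seen.has(key)) {
      duplicates.push(item);
    } else {
      seen.add(key);
      unique.push(item);
    }
  }
  return { unique, duplicates };
}

const items = [
  { productId: "SKU-1001", sku: "A-BLACK", name: "Earbuds" },
  { productId: "SKU-1001", sku: "A-BLACK", name: "Earbuds Pro" },
  { productId: "SKU-1002", sku: "B-WHITE", name: "Watch" },
];

const { unique, duplicates } = splitDuplicates(items, ["productId", "sku"]);
console.log(duplicates.map((d) => d.name)); // [ 'Earbuds Pro' ]
```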

📊 Performance Guide

| Dataset Size | Recommended Mode | Memory | Speed |
| --- | --- | --- | --- |
| < 10K | dedup-after-load | Low | Fast |
| 10K-100K | dedup-after-load | Medium | Fast |
| 100K-1M | dedup-as-loading | Low | Medium |
| > 1M | dedup-as-loading | Low | Slow |

🔧 Optimization Tips

  1. Load Only Required Fields: Use fieldsToLoad parameter
  2. Use JSONL Format: More memory-efficient than JSON
  3. Adjust Batch Size: Tune batchSize based on available memory
  4. Enable Parallel Loading: Increase parallelLoads for faster processing
  5. File Format Auto-detection: Worker automatically detects .jsonl files regardless of inputFormat
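
Extension-based format detection (tip 5) can be sketched as follows. The fallback to inputFormat for unknown extensions and the treatment of a .json extension are assumptions; detectFormat and parseBody are hypothetical helpers, not the worker's functions.

```javascript
// Hypothetical helper: chooses a parser from the URL's file extension and
// falls back to the configured inputFormat when the extension is unknown.
function detectFormat(url, inputFormat = "json") {
  const path = new URL(url).pathname.toLowerCase();
  if (path.endsWith(".jsonl")) return "jsonl";
  if (path.endsWith(".json")) return "json";
  return inputFormat;
}

// Parses a response body as JSONL (one object per line) or as JSON.
function parseBody(text, format) {
  if (format === "jsonl") {
    return text
      .split("\n")
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line));
  }
  const parsed = JSON.parse(text);
  return Array.isArray(parsed) ? parsed : [parsed];
}

console.log(detectFormat("https://api.example.com/catalog/audio.jsonl", "json")); // jsonl
console.log(detectFormat("https://api.example.com/catalog/export?id=7", "jsonl")); // jsonl
console.log(parseBody('{"a":1}\n{"a":2}\n', "jsonl").length); // 2
```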

🐛 Troubleshooting

| Problem | Solution |
| --- | --- |
| JavaScript heap out of memory | Switch to dedup-as-loading mode, reduce batchSize, use fieldsToLoad, use JSONL format |
| All items appear as unique | Verify field names, enable verboseLog: true, confirm data contains the specified fields |
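
To confirm that the data actually contains the configured fields, a small local check on a sample of the data can help. The sketch below counts how many items lack each field; missingFieldCounts is a hypothetical diagnostic, not something the worker provides.

```javascript
// Hypothetical diagnostic: counts how many items lack each dedup field.
function missingFieldCounts(items, fields) {
  const counts = Object.fromEntries(fields.map((f) => [f, 0]));
  for (const item of items) {
    for (const field of fields) {
      if (item[field] === undefined || item[field] === null) counts[field] += 1;
    }
  }
  return counts;
}

const sample = [
  { productId: "SKU-1001", SKU: "A-BLACK" }, // note the upper-case "SKU"
  { productId: "SKU-1002", SKU: "B-WHITE" },
];

console.log(missingFieldCounts(sample, ["productId", "sku"])); // { productId: 0, sku: 2 }
```

A non-zero count for a field usually points to a misspelled or differently cased field name in the fields parameter.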

Pricing

Failed results don't count

Rating

4.7

Developer

Kael Odin

Worker Stats

15 Total runs
Success rate: 86.67%
Last updated: Apr 20, 2026

Categories

Google
