Dedup Datasets Worker is an enterprise-grade tool for merging and deduplicating datasets from multiple JSON/JSONL sources. Optimized for the CoreClaw platform with enhanced features and robust error handling.
| Feature | Description |
|---|---|
| 📦 Multi-source Merge | Load and merge data from direct input, URLs, or Core Datasets |
| 🎯 Composite Key Dedup | Deduplicate based on multi-field combinations (e.g., productId + sku) |
| 🔄 Dual Processing Modes | dedup-after-load (preserve order) or dedup-as-loading (streaming) |
| 🔧 Custom Transformations | Pre/post deduplication JavaScript transformation functions |
| 🚀 Auto Format Detection | Automatically detect JSON/JSONL based on file extension |
| 🌐 Proxy Support | Built-in proxy configuration for the CoreClaw cloud environment |
| 💾 State Persistence | Automatic state saving for recovery from interruptions |
| 🔍 Duplicate Detection | Find and output duplicate items separately |
CoreClaw requires the schema b field to reference an array-typed property. In this worker, b is bound to an internal placeholder field named runUnits so the UI can load reliably without splitting dedup logic by business fields.
- Do not remove runUnits from the schema when publishing to CoreClaw.
- runUnits is a platform compatibility field, not a business input.
- fields remains the actual dedup key configuration used by the worker (a hedged schema sketch follows the table below).

| Parameter | Type | Description |
|---|---|---|
| fields | array | Fields to use for deduplication (e.g., ["productId", "sku"]) |
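To make the runUnits note concrete, here is a minimal sketch of how the input schema might declare both properties. The exact CoreClaw schema layout is an assumption here; only the idea that runUnits is an array-typed placeholder and fields carries the real dedup keys comes from the notes above.

```json
{
  "title": "Dedup Datasets Worker input",
  "type": "object",
  "properties": {
    "runUnits": {
      "title": "Run units (platform placeholder)",
      "type": "array",
      "description": "Compatibility field required by CoreClaw; not used by the dedup logic.",
      "default": []
    },
    "fields": {
      "title": "Dedup fields",
      "type": "array",
      "description": "Field names whose combined values identify a unique item.",
      "example": ["productId", "sku"]
    }
  },
  "required": ["fields"]
}
```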
| Parameter | Type | Description |
|---|---|---|
| dataSourceType | string | Data source type: "direct-input", "network-url", or "core-dataset" |
| inputData | string | JSON array data (when dataSourceType = "direct-input") |
| inputUrls | array | URL list of data files (when dataSourceType = "network-url") |
| datasetIds | array | Core Dataset ID list (when dataSourceType = "core-dataset") |
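For orientation, a minimal direct-input configuration might look like the sketch below. The product records are illustrative values, not data from the worker's documentation.

```json
{
  "fields": ["productId", "sku"],
  "dataSourceType": "direct-input",
  "inputData": "[{\"productId\": \"SKU-1001\", \"sku\": \"A-BLACK\"}, {\"productId\": \"SKU-1002\", \"sku\": \"B-WHITE\"}]"
}
```

Switching dataSourceType to "network-url" replaces inputData with inputUrls, and "core-dataset" replaces it with datasetIds, as described in the table above.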
| Parameter | Type | Default | Description |
|---|---|---|---|
| inputFormat | string | "json" | File format: json or jsonl (auto-detected by extension) |
| output | string | "unique-items" | Output type: unique-items, duplicate-items, or nothing |
| mode | string | "dedup-after-load" | Processing mode |
| fieldsToLoad | array | [] | Only load specified fields to save memory |
| preDedupTransformFunction | string | "" | JavaScript function to transform data before deduplication |
| postDedupTransformFunction | string | "" | JavaScript function to transform data after deduplication |
| customInputData | string | "" | Custom data object (JSON) passed to transform functions |
| nullAsUnique | boolean | false | Treat null/undefined values as unique |
| parallelLoads | integer | 10 | Number of parallel file loads (1-100) |
| parallelPushes | integer | 5 | Number of parallel data pushes (1-50) |
| batchSize | integer | 5000 | Batch size for processing (100-50000) |
| appendFileSource | boolean | false | Add __fileSource__ field to track file origin |
| verboseLog | boolean | false | Enable detailed logging |
Deduplicate products by the productId + sku combination — items that share a productId but have different sku values are kept as unique items.
Result: 2 unique items — the third entry (same productId + sku as the first) is deduplicated.
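A sketch of such an input is shown below; the three product records are illustrative values chosen to match the described outcome, not data from the worker's documentation.

```json
{
  "fields": ["productId", "sku"],
  "dataSourceType": "direct-input",
  "inputData": "[{\"productId\": \"SKU-1001\", \"sku\": \"A-BLACK\", \"price\": 19.99}, {\"productId\": \"SKU-1001\", \"sku\": \"B-WHITE\", \"price\": 21.99}, {\"productId\": \"SKU-1001\", \"sku\": \"A-BLACK\", \"price\": 18.50}]",
  "output": "unique-items"
}
```

The first two records differ in sku, so both survive; the third repeats the productId + sku pair of the first and is dropped, leaving 2 unique items.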
Merge product catalogs from multiple URLs, filter out invalid items before dedup, and enrich after dedup.
Pipeline: Load 3 sources → Filter by minPrice & stock → Round prices → Dedup by productId+sku → Add mergedAt timestamp → Output
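The exact signature of the transform functions is not spelled out above, so the sketch below assumes each function receives the loaded items array plus the parsed customInputData object and returns the transformed array; treat that signature, the example URLs, and the minPrice value as assumptions.

```json
{
  "fields": ["productId", "sku"],
  "dataSourceType": "network-url",
  "inputUrls": [
    "https://example.com/catalog-a.json",
    "https://example.com/catalog-b.json",
    "https://example.com/catalog-c.jsonl"
  ],
  "customInputData": "{\"minPrice\": 1}",
  "preDedupTransformFunction": "(items, custom) => items.filter(i => i.price >= custom.minPrice && i.stock > 0).map(i => ({ ...i, price: Math.round(i.price * 100) / 100 }))",
  "postDedupTransformFunction": "(items) => items.map(i => ({ ...i, mergedAt: new Date().toISOString() }))"
}
```

The pre-dedup function covers the filtering and price rounding steps, the fields setting performs the productId + sku dedup, and the post-dedup function adds the mergedAt timestamp.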
Deduplicate across multiple Core Datasets to find unique entries.
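A minimal configuration for this case might look like the following sketch; the dataset IDs are placeholders.

```json
{
  "fields": ["productId", "sku"],
  "dataSourceType": "core-dataset",
  "datasetIds": ["dataset-id-1", "dataset-id-2"],
  "output": "unique-items"
}
```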
Find all duplicate entries for data quality auditing.
Result: 1 duplicate item (the second SKU-1001/A-BLACK entry)
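Only the output type changes for this case. A sketch, reusing the illustrative records from the first example:

```json
{
  "fields": ["productId", "sku"],
  "dataSourceType": "direct-input",
  "inputData": "[{\"productId\": \"SKU-1001\", \"sku\": \"A-BLACK\"}, {\"productId\": \"SKU-1001\", \"sku\": \"B-WHITE\"}, {\"productId\": \"SKU-1001\", \"sku\": \"A-BLACK\"}]",
  "output": "duplicate-items"
}
```

The second SKU-1001/A-BLACK record is the only item reported.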
| Dataset Size | Recommended Mode | Memory | Speed |
|---|---|---|---|
| < 10K | dedup-after-load | Low | Fast |
| 10K-100K | dedup-after-load | Medium | Fast |
| 100K-1M | dedup-as-loading | Low | Medium |
| > 1M | dedup-as-loading | Low | Slow |
- Use the fieldsToLoad parameter to load only the fields you need.
- Adjust batchSize based on available memory.
- Increase parallelLoads for faster processing.
- Files with a .jsonl extension are parsed as JSONL regardless of inputFormat.

A tuning sketch combining these options follows the troubleshooting table below.

| Problem | Solution |
|---|---|
| JavaScript heap out of memory | Switch to dedup-as-loading mode, reduce batchSize, use fieldsToLoad, use JSONL format |
| All items appear as unique | Verify field names, enable verboseLog: true, confirm data contains the specified fields |
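As referenced above, here is a sketch of a memory-friendly configuration for large inputs; the URL is a placeholder and the numeric values are starting points to tune, not documented recommendations.

```json
{
  "fields": ["productId", "sku"],
  "dataSourceType": "network-url",
  "inputUrls": ["https://example.com/large-catalog.jsonl"],
  "mode": "dedup-as-loading",
  "fieldsToLoad": ["productId", "sku", "price"],
  "batchSize": 1000,
  "parallelLoads": 20,
  "verboseLog": true
}
```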