Dataset Deduplication & Merge Tool

odin-kael/dataset-deduplication-and-merge-tool

Dedup Datasets Worker merges and deduplicates datasets from multiple JSON/JSONL files. It is optimized for the CafeScraper platform and includes robust error handling.

2,000 Free Results
Run Unit (Required)

Internal single-run placeholder field for CoreClaw concurrency. Keep the default value.

Type: array
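Scenario (Optional)

Select a preset scenario to auto-fill the recommended configuration.

Type: select
Default: ecommerce-products
Options:
  • 🛒 E-commerce product data merge
  • 🕷️ Crawler result cleaning
  • 👥 User data consolidation
  • 📋 Log event deduplication
  • ⚙️ Custom scenario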
Dedup Fields (Required)

List of field names used to identify duplicates.

Type: array
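
To illustrate how a list of dedup fields can identify duplicates, here is a minimal Python sketch that builds a composite key from the configured fields. It is not the Worker's actual implementation; the function names `dedup_key` and `unique_items` are hypothetical, and the records are a shortened version of the default input data.

```python
from typing import Any, Iterable

def dedup_key(item: dict[str, Any], fields: Iterable[str]) -> tuple:
    """Build a composite key from the configured dedup fields.
    Assumes the field values are hashable (strings, numbers, None)."""
    return tuple(item.get(f) for f in fields)

def unique_items(items: list[dict[str, Any]], fields: list[str]) -> list[dict[str, Any]]:
    """Keep the first item seen for each composite key."""
    seen: set[tuple] = set()
    result = []
    for item in items:
        key = dedup_key(item, fields)
        if key not in seen:
            seen.add(key)
            result.append(item)
    return result

records = [
    {"productId": "P001", "sku": "SKU-A-BLACK", "price": 299.0},
    {"productId": "P001", "sku": "SKU-A-BLACK", "price": 279.0},
    {"productId": "P001", "sku": "SKU-A-WHITE", "price": 299.0},
    {"productId": "P002", "sku": "SKU-B", "price": 1299.0},
]

# One field: every P001 row collapses into a single item.
by_product = unique_items(records, ["productId"])
# Two fields: the black and white SKUs of P001 stay separate.
by_product_and_sku = unique_items(records, ["productId", "sku"])
```

Adding more fields makes the key stricter, so fewer records are treated as duplicates.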
Merge Strategy (Optional)

Strategy for merging duplicate records.

Type: select
Default: keep-newest
Options:
  • Keep First
  • Keep Newest
  • Merge Fields
  • Keep Most Complete
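
The listing does not define the exact semantics of each strategy, so the following Python sketch shows one plausible reading of Keep First, Merge Fields, and Keep Most Complete for a group of records that share the same dedup key. The helper names and the definition of a "filled" value (not null and not an empty string) are assumptions made for illustration.

```python
from typing import Any

Record = dict[str, Any]

def is_filled(value: Any) -> bool:
    """Assumption: a field counts as filled if it is neither None nor ''."""
    return value is not None and value != ""

def keep_first(group: list[Record]) -> Record:
    """Return the first record of the duplicate group unchanged."""
    return group[0]

def merge_fields(group: list[Record]) -> Record:
    """Start from the first record and fill its empty fields from later duplicates."""
    merged = dict(group[0])
    for record in group[1:]:
        for field, value in record.items():
            if not is_filled(merged.get(field)) and is_filled(value):
                merged[field] = value
    return merged

def keep_most_complete(group: list[Record]) -> Record:
    """Return the record with the most filled fields (first one wins ties)."""
    return max(group, key=lambda r: sum(is_filled(v) for v in r.values()))

group = [
    {"id": 1, "name": "A", "email": None},
    {"id": 1, "name": "", "email": "a@x.io", "phone": "123"},
]
merged = merge_fields(group)
most_complete = keep_most_complete(group)
```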
Timestamp Field (Optional)

Timestamp field used to decide which record is newer (keep-newest strategy).

Type: string
Default: updatedAt
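
As a rough sketch of how a keep-newest comparison on this field might work, the Python example below picks the record with the latest ISO 8601 timestamp from a group of duplicates. The function names are hypothetical, and the handling of missing or unparseable timestamps (treated as oldest) is an assumption, not documented behavior.

```python
from datetime import datetime
from typing import Any

def parse_timestamp(value: Any) -> datetime:
    """Parse an ISO 8601 string; missing or invalid values sort as oldest.
    Assumes timestamps carry no timezone offset, as in the default input data."""
    try:
        return datetime.fromisoformat(value)
    except (TypeError, ValueError):
        return datetime.min

def keep_newest(group: list[dict[str, Any]], timestamp_field: str = "updatedAt") -> dict[str, Any]:
    """Return the duplicate with the latest timestamp (first one wins ties)."""
    return max(group, key=lambda r: parse_timestamp(r.get(timestamp_field)))

group = [
    {"productId": "P001", "price": 299.0, "updatedAt": "2024-01-20T10:30:00"},
    {"productId": "P001", "price": 279.0, "updatedAt": "2024-01-22T14:20:00"},
    {"productId": "P001", "price": 259.0},  # no timestamp: treated as oldest
]
newest = keep_newest(group)
```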
Data Source Type (Optional)

Select where the input data comes from.

Type: select
Default: direct-input
Options:
  • Direct Input
  • Network URL
  • Core Dataset
Input Data (Optional)

Input data as a JSON array.

Type: string
Default: [{"productId": "P001", "sku": "SKU-A-BLACK", "name": "无线蓝牙耳机 Pro", "price": 299.00, "stock": 156, "source": "京东旗舰店", "updatedAt": "2024-01-20T10:30:00"}, {"productId": "P001", "sku": "SKU-A-BLACK", "name": "无线蓝牙耳机 Pro (黑)", "price": 279.00, "stock": 200, "source": "天猫旗舰店", "updatedAt": "2024-01-22T14:20:00"}, {"productId": "P001", "sku": "SKU-A-WHITE", "name": "无线蓝牙耳机 Pro", "price": 299.00, "stock": 88, "source": "京东旗舰店", "updatedAt": "2024-01-20T10:30:00"}, {"productId": "P002", "sku": "SKU-B", "name": "智能手表 Ultra", "price": 1299.00, "stock": 45, "source": "官网", "updatedAt": "2024-01-18T09:00:00"}]
Data URLs (Optional)

List of data file URLs.

Type: array
Dataset IDs (Optional)

List of Core Dataset IDs.

Type: array
Input Format (Optional)

Format of the input data.

Type: select
Default: json
Options:
  • JSON (array)
  • JSONL (one JSON object per line)
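
The difference between the two formats can be sketched in Python as follows. The function `parse_items` and the format names `"json"` and `"jsonl"` are illustrative assumptions; only the standard-library `json` module is used.

```python
import json

def parse_items(text: str, input_format: str = "json") -> list[dict]:
    """Parse input text as a JSON array or as JSONL (one object per line)."""
    if input_format == "json":
        data = json.loads(text)
        if not isinstance(data, list):
            raise ValueError("JSON input must be an array of items")
        return data
    if input_format == "jsonl":
        # Blank lines are skipped; every other line must be valid JSON.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    raise ValueError(f"unsupported input format: {input_format}")

json_text = '[{"id": 1}, {"id": 2}]'
jsonl_text = '{"id": 1}\n\n{"id": 2}\n'

from_json = parse_items(json_text, "json")
from_jsonl = parse_items(jsonl_text, "jsonl")
```

Both calls yield the same list of items; JSONL has the advantage that it can be read line by line without holding the whole file in memory.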
Output Content (Optional)

Type of output to produce.

Type: select
Default: unique-items
Options:
  • Unique Items (after deduplication)
  • Duplicate Items
  • Statistics Only
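
One way to picture the three output types is a single pass that separates first occurrences from repeats and counts both. This Python sketch is an assumption about how the outputs relate to each other; the function name `split_output` and the statistic names are hypothetical.

```python
from typing import Any

def split_output(items: list[dict[str, Any]], fields: list[str]):
    """Return (unique items, duplicate items, statistics) for the given dedup fields."""
    seen: set[tuple] = set()
    unique, duplicates = [], []
    for item in items:
        key = tuple(item.get(f) for f in fields)
        if key in seen:
            duplicates.append(item)  # a repeat of an earlier key
        else:
            seen.add(key)
            unique.append(item)      # first occurrence of this key
    stats = {"total": len(items), "unique": len(unique), "duplicates": len(duplicates)}
    return unique, duplicates, stats

items = [{"id": 1}, {"id": 2}, {"id": 1}, {"id": 1}]
unique, duplicates, stats = split_output(items, ["id"])
```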
Generate Report (Optional)

Output a deduplication difference report.

Type: boolean
Default: true
Processing Mode (Optional)

Deduplication processing mode. For large datasets (>100K), use 'As Loading'.

Type: select
Default: dedup-after-load
Options:
  • After Load (load everything first, then deduplicate)
  • As Loading (deduplicate while loading)
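
The reason 'As Loading' suits large datasets can be sketched with a Python generator: only the set of seen keys is kept in memory, and each item is emitted or dropped as soon as it is read. This is an illustration of the general streaming technique, not the Worker's code; `dedup_as_loading` is a hypothetical name.

```python
from typing import Any, Iterable, Iterator

def dedup_as_loading(
    sources: Iterable[Iterable[dict[str, Any]]], fields: list[str]
) -> Iterator[dict[str, Any]]:
    """Yield each item the first time its key appears, across all sources.
    Memory grows with the number of distinct keys, not with the number of items."""
    seen: set[tuple] = set()
    for source in sources:
        for item in source:
            key = tuple(item.get(f) for f in fields)
            if key in seen:
                continue
            seen.add(key)
            yield item

# Sources are consumed lazily, so they can be file readers or network streams.
source_a = iter([{"id": 1}, {"id": 2}])
source_b = iter([{"id": 2}, {"id": 3}])
streamed = list(dedup_as_loading([source_a, source_b], ["id"]))
```

The trade-off is that strategies needing the whole duplicate group, such as merging fields, are harder to apply when items are emitted immediately.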
Load Fields Only (Optional)

Load only the specified fields to reduce memory usage.

Type: array
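
Loading only selected fields amounts to projecting each item onto a subset of its keys. The Python sketch below assumes that fields missing from an item are simply omitted; the function name `project` is hypothetical.

```python
from typing import Any

def project(item: dict[str, Any], fields: list[str]) -> dict[str, Any]:
    """Keep only the listed fields; fields absent from the item are skipped."""
    return {f: item[f] for f in fields if f in item}

item = {"productId": "P002", "sku": "SKU-B", "price": 1299.0, "stock": 45}
slim = project(item, ["productId", "sku", "rating"])
```

The dedup fields themselves would need to be included in the list, otherwise every projected item would produce the same empty key.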
Pre-Dedup Transform (Optional)

Custom transform function applied before deduplication.

Type: string
Post-Dedup Transform (Optional)

Custom transform function applied after deduplication.

Type: string
Custom Input Data (Optional)

Custom data passed to the transform functions (JSON format).

Type: string
Treat Null as Unique (Optional)

Treat null/undefined values as unique values.

Type: boolean
Default: false
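
The effect of this switch can be sketched in Python as follows. With the option off, items whose dedup field is null collapse into one; with it on, each such item is kept. How the Worker treats keys where only some fields are null is not stated, so the "any field is null" rule here is an assumption, and `dedup` is a hypothetical name.

```python
from typing import Any

def dedup(
    items: list[dict[str, Any]], fields: list[str], treat_null_as_unique: bool = False
) -> list[dict[str, Any]]:
    """Deduplicate by key; optionally keep every item whose key contains a null."""
    seen: set[tuple] = set()
    result = []
    for item in items:
        key = tuple(item.get(f) for f in fields)  # missing fields become None
        if treat_null_as_unique and any(v is None for v in key):
            result.append(item)  # never matched against other items
            continue
        if key not in seen:
            seen.add(key)
            result.append(item)
    return result

items = [
    {"email": None, "n": 1},
    {"email": None, "n": 2},
    {"email": "a@x.io", "n": 3},
    {"email": "a@x.io", "n": 4},
]
nulls_collapsed = dedup(items, ["email"])
nulls_kept = dedup(items, ["email"], treat_null_as_unique=True)
```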
Parallel Loads (Optional)

Number of threads used to load data sources in parallel.

Type: integer
Default: 10
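
A bounded thread pool is one common way to load several sources concurrently. The Python sketch below uses the standard-library `ThreadPoolExecutor`; the `load_all` function and the in-memory `fake_loader` are hypothetical stand-ins for whatever the Worker does to fetch a URL or dataset.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable

def load_all(
    sources: list[str],
    loader: Callable[[str], list[dict[str, Any]]],
    parallel_loads: int = 10,
) -> list[list[dict[str, Any]]]:
    """Load every source with at most `parallel_loads` threads.
    Results come back in the same order as `sources`."""
    with ThreadPoolExecutor(max_workers=parallel_loads) as pool:
        return list(pool.map(loader, sources))

# Stand-in loader that reads from memory instead of the network.
fake_files = {"a.json": [{"id": 1}], "b.json": [{"id": 2}, {"id": 3}]}

def fake_loader(name: str) -> list[dict[str, Any]]:
    return fake_files[name]

loaded = load_all(["a.json", "b.json"], fake_loader, parallel_loads=2)
```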
Parallel Pushes (Optional)

Number of threads used to push data in parallel.

Type: integer
Default: 5
Batch Size (Optional)

Number of items processed per batch.

Type: integer
Default: 5000
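
Batching simply slices the item stream into fixed-size chunks, with a shorter final chunk if the total is not a multiple of the batch size. A minimal Python sketch, with the hypothetical helper `batched`:

```python
from itertools import islice
from typing import Any, Iterable, Iterator

def batched(items: Iterable[Any], batch_size: int = 5000) -> Iterator[list[Any]]:
    """Yield consecutive lists of up to `batch_size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch

batch_lengths = [len(batch) for batch in batched(range(12000))]
```

With 12,000 items and the default size of 5000, this produces two full batches and one batch of 2000.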
Append File Source (Optional)

Add a __fileSource__ field that records where each item came from.

Type: boolean
Default: false
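
Tagging items with their origin could look like the Python sketch below. The field name `__fileSource__` comes from the option description; the function `tag_source` and the example URL are hypothetical.

```python
from typing import Any

def tag_source(items: list[dict[str, Any]], source: str) -> list[dict[str, Any]]:
    """Return copies of the items with a __fileSource__ field added."""
    return [{**item, "__fileSource__": source} for item in items]

original = [{"id": 1}, {"id": 2}]
tagged = tag_source(original, "https://example.com/data-a.json")
```

The originals are left untouched, so the extra field does not interfere with dedup keys unless it is listed among the dedup fields.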
Verbose Log (Optional)

Enable verbose log output.

Type: boolean
Default: false

Pricing

Failed results don't count

Rating

5.0

Developer

Kael Odin

Worker Stats

15 Total runs
Success rate: 86.67%
Last updated: Apr 20, 2026

Categories

Google

CoreClaw

Deploy ready-to-use Workers to accelerate your data collection workflows.

Email: support@coreclaw.com

Address

Apex DataWorks Limited

UNIT 9, 1/F, THE CLOUD, 111 TUNG CHAU STREET, TAI KOK TSUI, KOWLOON,HONG KONG