Dedup Datasets Worker is a tool for merging and deduplicating datasets from multiple JSON/JSONL files. It is fully optimized for the CafeScraper platform, with enhanced features and robust error handling.
Internal single-run concurrency placeholder for CoreClaw. Keep the default value.
Select a preset scenario to auto-fill the best configuration.
List of field names used to identify duplicates.
Strategy for merging duplicate records.
Timestamp field used to decide which record is newer (keep-newest strategy).
Select the data source.
JSON array data.
List of data file URLs.
List of Core Dataset IDs.
Input data format.
Output type.
Output a dedup difference report.
Dedup processing mode. Large datasets (>100K) should use 'As Loading'.
Load only the specified fields to reduce memory usage.
Custom transform function applied before dedup.
Custom transform function applied after dedup.
Custom data passed to the transform functions (JSON format).
Treat null/undefined values as unique.
Number of threads used to load data sources in parallel.
Number of threads used to push data in parallel.
Batch size for each processing pass.
Add a __fileSource__ field that records where each item came from.
Enable verbose logging output.
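To make the interaction between the dedup fields, the keep-newest strategy, and the null handling option more concrete, here is a minimal Python sketch. It is not the Worker's implementation: the function name dedup, its parameter names, the "keep-first" strategy label, and the sample field names (id, name, updatedAt) are all assumptions made for illustration. It also assumes that dedup field values are simple scalars and that timestamps are ISO-8601 strings, which compare correctly as plain text.

```python
from typing import Any, Dict, Iterable, List, Optional

Record = Dict[str, Any]


def dedup(
    records: Iterable[Record],
    fields: List[str],
    strategy: str = "keep-first",          # hypothetical label; only "keep-newest" appears in the field list
    timestamp_field: Optional[str] = None,  # used only by "keep-newest"
    null_as_unique: bool = False,
) -> List[Record]:
    """Illustrative dedup: one record is kept per distinct combination of `fields`."""
    kept: Dict[tuple, Record] = {}  # dicts preserve insertion order
    unique_counter = 0

    for record in records:
        values = tuple(record.get(name) for name in fields)

        if null_as_unique and any(value is None for value in values):
            # A missing or null key never matches anything else,
            # so give this record a key that no other record can share.
            key = ("unique", unique_counter)
            unique_counter += 1
        else:
            key = ("key", values)

        if key not in kept:
            kept[key] = record
        elif strategy == "keep-newest" and timestamp_field is not None:
            # Assumption: ISO-8601 timestamp strings, so string comparison is enough.
            current = kept[key].get(timestamp_field) or ""
            candidate = record.get(timestamp_field) or ""
            if candidate > current:
                kept[key] = record

    return list(kept.values())


sample = [
    {"id": 1, "name": "a", "updatedAt": "2024-01-01"},
    {"id": 1, "name": "b", "updatedAt": "2024-03-01"},
    {"id": 2, "name": "c", "updatedAt": "2024-02-01"},
    {"id": None, "name": "d", "updatedAt": "2024-02-01"},
    {"id": None, "name": "e", "updatedAt": "2024-02-02"},
]

newest = dedup(sample, ["id"], strategy="keep-newest", timestamp_field="updatedAt")
print([r["name"] for r in newest])
```

In this sketch, a replaced record keeps the position of the first record seen with the same key. With null_as_unique left off, the two records whose id is null collapse into one; with it on, both survive.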