Dataset Deduplication & Merge Tool

odin-kael/dataset-deduplication-and-merge-tool

Dedup Datasets Worker merges and deduplicates datasets from multiple JSON/JSONL files. It is optimized for the CafeScraper platform and includes robust error handling.

2,000 Free Results

You can call the Worker from your own applications through the CoreClaw API. To get started, you need a CoreClaw account and your API token, which you can find in the overview in your Console. The PHP example below starts a run and prints the returned run ID.

<?php

// API URL
const API_URL = "https://openapi.coreclaw.com/api/v1/scraper/run";

// Your API KEY
const API_KEY = "<YOUR_API_KEY>";

// curl timeout (seconds)
const TIMEOUT = 30;

/**
 * Run scraper
 *
 * @param array $params Request parameters
 * @param string $apiKey API Key
 * @return array Return result ["success" => bool, "run_slug" => string|null, "error" => string|null]
 */
function runScraper(array $params, string $apiKey): array
{
    // Initialize cURL
    $ch = curl_init();

    // Set cURL options
    curl_setopt_array($ch, [
        CURLOPT_URL => API_URL,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_ENCODING => "",
        CURLOPT_MAXREDIRS => 10,
        CURLOPT_TIMEOUT => TIMEOUT,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
        CURLOPT_CUSTOMREQUEST => "POST",
        CURLOPT_POSTFIELDS => json_encode($params),
        CURLOPT_HTTPHEADER => [
            "api-key: " . $apiKey,
            "Content-Type: application/json"
        ],
    ]);

    // Execute request
    $response = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $error = curl_error($ch);

    // Close cURL
    curl_close($ch);

    // Check cURL error
    if ($error) {
        return [
            "success" => false,
            "run_slug" => null,
            "error" => "cURL error: " . $error
        ];
    }

    // Check HTTP status code
    if ($httpCode !== 200) {
        return [
            "success" => false,
            "run_slug" => null,
            "error" => "HTTP error: " . $httpCode . " - " . $response
        ];
    }

    // Parse response
    $result = json_decode($response, true);
    if (json_last_error() !== JSON_ERROR_NONE) {
        return [
            "success" => false,
            "run_slug" => null,
            "error" => "JSON decode error: " . json_last_error_msg()
        ];
    }

    // Check business error code
    if (isset($result["code"]) && $result["code"] !== 0) {
        return [
            "success" => false,
            "run_slug" => null,
            "error" => "Business error: " . (isset($result["message"]) ? $result["message"] : "Unknown error") . " (code: " . $result["code"] . ")"
        ];
    }

    // Return success result
    return [
        "success" => true,
        "run_slug" => isset($result["data"]["run_slug"]) ? $result["data"]["run_slug"] : null,
        "error" => null
    ];
}

/**
 * Main function
 */
function main()
{
    // Build request parameters
    $requestParams = [
        "scraper_slug" => "01KG2DV66JTCN65ZBTRX3M456E",
        "version" => "v1.0.8",
        "input" => [
            "parameters" => [
                "system" => [
                    "proxy_region" => "",
                    "cpus" => 0.125,
                    "memory" => 512,
                    "execute_limit_time_seconds" => 1800,
                    "max_total_charge" => 0,
                    "max_total_traffic" => 0
                ],
                "custom" => [
                    "runUnits" => [
                        ["url" => "https://coreclaw.local/__single_run__"]
                    ],
                    "scenario" => "ecommerce-products",
                    // Records sharing the same productId + sku are treated as duplicates
                    "fields" => [
                        ["string" => "productId"],
                        ["string" => "sku"]
                    ],
                    "mergeStrategy" => "keep-newest",
                    "timestampField" => "updatedAt",
                    "dataSourceType" => "direct-input",
                    // Inline records as a JSON string (assumed to be JSON text, matching inputFormat)
                    "inputData" => '['
                        . '{"productId": "P001", "sku": "SKU-A-BLACK", "name": "Wireless Bluetooth Headphones Pro", "price": 299.00, "stock": 156, "source": "JD.com Flagship Store", "updatedAt": "2024-01-20T10:30:00"}, '
                        . '{"productId": "P001", "sku": "SKU-A-BLACK", "name": "Wireless Bluetooth Headphones Pro (Black)", "price": 279.00, "stock": 200, "source": "Tmall Flagship Store", "updatedAt": "2024-01-22T14:20:00"}, '
                        . '{"productId": "P001", "sku": "SKU-A-WHITE", "name": "Wireless Bluetooth Headphones Pro", "price": 299.00, "stock": 88, "source": "JD.com Flagship Store", "updatedAt": "2024-01-20T10:30:00"}, '
                        . '{"productId": "P002", "sku": "SKU-B", "name": "Smart Watch Ultra", "price": 1299.00, "stock": 45, "source": "Official Website", "updatedAt": "2024-01-18T09:00:00"}'
                        . ']',
                    "inputUrls" => [
                        ["url" => "https://raw.githubusercontent.com/kael-odin/worker-dedup-datasets/main/test/data1.json"]
                    ],
                    "datasetIds" => [],
                    "inputFormat" => "json",
                    "output" => "unique-items",
                    "generateReport" => true,
                    "mode" => "dedup-after-load",
                    "fieldsToLoad" => [],
                    "nullAsUnique" => false,
                    "parallelLoads" => 10,
                    "parallelPushes" => 5,
                    "batchSize" => 5000,
                    "appendFileSource" => false,
                    "verboseLog" => false
                ]
            ]
        ],
        "callback_url" => "https://your-domain.com/callback"
    ];

    // Send request
    echo "Sending request to API...\n";
    $result = runScraper($requestParams, API_KEY);

    // Handle result
    if ($result["success"]) {
        echo "Worker run successful!\n";
        echo "Run record ID: " . $result["run_slug"] . "\n";
        echo "You can use this ID to query run status and results\n";
    } else {
        echo "Request failed!\n";
        echo "Error message: " . $result["error"] . "\n";
    }
}

// Execute main function
main();
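
The request above sets `fields` to `productId` and `sku`, `mergeStrategy` to `keep-newest`, and `timestampField` to `updatedAt`. As a rough local illustration of what those settings suggest, the sketch below deduplicates the same sample records on the composite key and keeps the record with the latest timestamp. The function name `dedupKeepNewest` is hypothetical, and the logic is an assumption about the intended behavior, not the Worker's actual implementation.

```php
<?php

declare(strict_types=1);

/**
 * Deduplicate records on a composite key, keeping the newest by timestamp.
 * Hypothetical helper for illustration; not part of the CoreClaw API.
 *
 * @param array<int, array<string, mixed>> $records
 * @param string[] $keyFields
 * @return array<int, array<string, mixed>>
 */
function dedupKeepNewest(array $records, array $keyFields, string $timestampField): array
{
    $newest = [];
    foreach ($records as $record) {
        // Build the composite key from the selected fields, e.g. ["P001","SKU-A-BLACK"].
        $key = json_encode(array_map(fn(string $f) => $record[$f] ?? null, $keyFields));

        // Assumes ISO 8601 timestamps in one format and timezone, which compare correctly as strings.
        if (!isset($newest[$key]) || strcmp($record[$timestampField], $newest[$key][$timestampField]) > 0) {
            $newest[$key] = $record;
        }
    }
    return array_values($newest);
}

// Same keys, prices, and timestamps as the sample inputData in the request.
$records = [
    ["productId" => "P001", "sku" => "SKU-A-BLACK", "price" => 299.00, "updatedAt" => "2024-01-20T10:30:00"],
    ["productId" => "P001", "sku" => "SKU-A-BLACK", "price" => 279.00, "updatedAt" => "2024-01-22T14:20:00"],
    ["productId" => "P001", "sku" => "SKU-A-WHITE", "price" => 299.00, "updatedAt" => "2024-01-20T10:30:00"],
    ["productId" => "P002", "sku" => "SKU-B", "price" => 1299.00, "updatedAt" => "2024-01-18T09:00:00"],
];

$unique = dedupKeepNewest($records, ["productId", "sku"], "updatedAt");
echo count($unique), " unique items\n"; // prints "3 unique items"
```

Under these assumptions, the two `SKU-A-BLACK` records collapse into the one updated on 2024-01-22, while `SKU-A-WHITE` and `SKU-B` are kept because their key combinations are distinct.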

Additional Resources

API Reference Documentation
Complete API documentation with all endpoints and parameters

Pricing

Failed results don't count

Rating

5.0

Developer

Kael Odin

Worker Stats

15 Total runs
Success rate: 86.67%
Last updated: Apr 20, 2026

Categories

Google
