Dataset Deduplication & Merge Tool

odin-kael/dataset-deduplication-and-merge-tool

Dedup Datasets Worker merges and deduplicates datasets from multiple JSON/JSONL files. It is optimized for the CafeScraper platform and includes robust error handling.
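
To make the deduplication idea concrete, here is a minimal Java sketch of one common strategy: group records by a composite key and keep the record with the newest timestamp. The `Item` record, the `dedup` helper, and the choice of `productId` plus `sku` as the key are illustrative assumptions based on the sample request on this page. This is not the Worker's actual implementation. Records require Java 16 or later.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class KeepNewestDedup {
    /** Hypothetical record shape, mirroring the key and timestamp fields of the sample input. */
    record Item(String productId, String sku, double price, LocalDateTime updatedAt) {}

    /** Keeps one item per (productId, sku) pair: the one with the latest updatedAt. */
    static List<Item> dedup(List<Item> items) {
        // LinkedHashMap preserves the order in which each key was first seen.
        Map<String, Item> newestByKey = new LinkedHashMap<>();
        for (Item item : items) {
            // Composite key; assumes "|" does not occur inside the field values.
            String key = item.productId() + "|" + item.sku();
            newestByKey.merge(key, item,
                (kept, candidate) -> candidate.updatedAt().isAfter(kept.updatedAt()) ? candidate : kept);
        }
        return new ArrayList<>(newestByKey.values());
    }

    public static void main(String[] args) {
        List<Item> items = List.of(
            new Item("P001", "SKU-A-BLACK", 299.00, LocalDateTime.parse("2024-01-20T10:30:00")),
            new Item("P001", "SKU-A-BLACK", 279.00, LocalDateTime.parse("2024-01-22T14:20:00")),
            new Item("P001", "SKU-A-WHITE", 299.00, LocalDateTime.parse("2024-01-20T10:30:00")),
            new Item("P002", "SKU-B", 1299.00, LocalDateTime.parse("2024-01-18T09:00:00")));
        dedup(items).forEach(System.out::println);
    }
}
```

With this input the sketch returns three items. The two `SKU-A-BLACK` records collapse into the one updated on 2024-01-22, while `SKU-A-WHITE` and `SKU-B` pass through unchanged.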

2,000 Free Results

You can access the Worker programmatically from your own applications using the CoreClaw API. To get started, you need a CoreClaw account and your API token, which you can find in the overview in your Console. The example below uses Java.

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class ScraperRunSimple {
    // API URL
    private static final String API_URL = "https://openapi.coreclaw.com/api/v1/scraper/run";

    // Your API KEY
    private static final String API_KEY = "<YOUR_API_KEY>";

    // Request timeout (seconds)
    private static final int TIMEOUT = 30;
    
    public static void main(String[] args) {
        // Build request JSON
        String jsonBody = buildRequestBody();
        
        // Create HttpClient
        HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(TIMEOUT))
            .build();
        
        // Create HttpRequest
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(API_URL))
            .timeout(Duration.ofSeconds(TIMEOUT))
            .header("api-key", API_KEY)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
            .build();

        System.out.println("Sending request to API...");

        try {
            // Send request
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

            // Check HTTP status code
            int statusCode = response.statusCode();
            if (statusCode != 200) {
                System.out.println("Request failed!");
                System.out.println("HTTP error: " + statusCode + " - " + response.body());
                return;
            }

            // Parse response (simple string handling, no external libraries needed)
            String responseBody = response.body();
            System.out.println("Response content: " + responseBody);

            // Extract run_slug (simple parsing)
            String runSlug = extractRunSlug(responseBody);
            if (runSlug != null) {
                System.out.println("Worker run successful!");
                System.out.println("Run ID: " + runSlug);
                System.out.println("You can use this ID to query run status and results");
            } else {
                System.out.println("Request failed!");
                System.out.println("Unable to parse run_slug");
            }
        } catch (IOException e) {
            System.out.println("Request failed!");
            System.out.println("IO error: " + e.getMessage());
        } catch (InterruptedException e) {
            System.out.println("Request failed!");
            System.out.println("Request interrupted: " + e.getMessage());
            Thread.currentThread().interrupt();
        }
    }

    /**
     * Build request JSON body
     */
    private static String buildRequestBody() {
        return """
            {
                "scraper_slug": "01KG2DV66JTCN65ZBTRX3M456E",
                "version": "v1.0.8",
                "input": {
                    "parameters": {
                        "system": {
                            "proxy_region": "",
                            "cpus": 0.125,
                            "memory": 512,
                            "execute_limit_time_seconds": 1800,
                            "max_total_charge": 0,
                            "max_total_traffic": 0
                        },
                        "custom": {
                            "runUnits": [
                                { "url": "https://coreclaw.local/__single_run__" }
                            ],
                            "scenario": "ecommerce-products",
                            "fields": [
                                { "string": "productId" },
                                { "string": "sku" }
                            ],
                            "mergeStrategy": "keep-newest",
                            "timestampField": "updatedAt",
                            "dataSourceType": "direct-input",
                            "inputData": "[{\\"productId\\": \\"P001\\", \\"sku\\": \\"SKU-A-BLACK\\", \\"name\\": \\"无线蓝牙耳机 Pro\\", \\"price\\": 299.00, \\"stock\\": 156, \\"source\\": \\"京东旗舰店\\", \\"updatedAt\\": \\"2024-01-20T10:30:00\\"}, {\\"productId\\": \\"P001\\", \\"sku\\": \\"SKU-A-BLACK\\", \\"name\\": \\"无线蓝牙耳机 Pro (黑)\\", \\"price\\": 279.00, \\"stock\\": 200, \\"source\\": \\"天猫旗舰店\\", \\"updatedAt\\": \\"2024-01-22T14:20:00\\"}, {\\"productId\\": \\"P001\\", \\"sku\\": \\"SKU-A-WHITE\\", \\"name\\": \\"无线蓝牙耳机 Pro\\", \\"price\\": 299.00, \\"stock\\": 88, \\"source\\": \\"京东旗舰店\\", \\"updatedAt\\": \\"2024-01-20T10:30:00\\"}, {\\"productId\\": \\"P002\\", \\"sku\\": \\"SKU-B\\", \\"name\\": \\"智能手表 Ultra\\", \\"price\\": 1299.00, \\"stock\\": 45, \\"source\\": \\"官网\\", \\"updatedAt\\": \\"2024-01-18T09:00:00\\"}]",
                            "inputUrls": [
                                { "url": "https://raw.githubusercontent.com/kael-odin/worker-dedup-datasets/main/test/data1.json" }
                            ],
                            "datasetIds": [],
                            "inputFormat": "json",
                            "output": "unique-items",
                            "generateReport": true,
                            "mode": "dedup-after-load",
                            "fieldsToLoad": [],
                            "nullAsUnique": false,
                            "parallelLoads": 10,
                            "parallelPushes": 5,
                            "batchSize": 5000,
                            "appendFileSource": false,
                            "verboseLog": false
                        }
                    }
                },
                "callback_url": "https://your-domain.com/callback"
            }
            """;
    }

    /**
     * Extract run_slug from response (regex match, no external libraries needed).
     * Returns null when the field is absent.
     */
    private static String extractRunSlug(String json) {
        // Match "run_slug": "xxx", tolerating whitespace around the colon
        java.util.regex.Matcher matcher = java.util.regex.Pattern
            .compile("\"run_slug\"\\s*:\\s*\"([^\"]*)\"")
            .matcher(json);
        return matcher.find() ? matcher.group(1) : null;
    }
}
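
The request above passes `inputData` as a JSON array serialized into a JSON string, so every inner quote must arrive at the API as `\"`. Inside a Java text block, `\"` is itself processed as an escape and yields a bare quote, which makes hand-escaping easy to get wrong. The sketch below shows one way to do the escaping in code instead. `JsonStringEscaper` and `escape` are hypothetical helper names, not part of the CoreClaw API, and a JSON library would serve equally well.

```java
class JsonStringEscaper {
    /** Escapes text so it can sit between double quotes as a JSON string value. */
    static String escape(String raw) {
        StringBuilder sb = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '"' -> sb.append("\\\"");
                case '\\' -> sb.append("\\\\");
                case '\n' -> sb.append("\\n");
                case '\r' -> sb.append("\\r");
                case '\t' -> sb.append("\\t");
                default -> {
                    if (c < 0x20) {
                        // Remaining control characters must be written as \\uXXXX in JSON.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A shortened version of the sample records, written as ordinary JSON.
        String records = "[{\"productId\": \"P001\", \"sku\": \"SKU-A-BLACK\"}]";
        String field = "\"inputData\": \"" + escape(records) + "\"";
        System.out.println(field);
    }
}
```

The escaped value can then be concatenated into the request body in place of a hand-written `inputData` literal.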

Additional Resources

API Reference Documentation
Complete API documentation with all endpoints and parameters

Pricing

Failed results don't count

Rating

5.0

Developer

Kael Odin

Worker Stats

15 Total runs
Success rate: 86.67%
Last updated: Apr 20, 2026

Categories

Google

You might also like

Explore more popular scrapers from our marketplace

Google Search Results (SERP) Scraper API

by CoreClaw

It queries the Google search engine by keyword and returns a structured SERP summary, including the final search parameters, organic results, related queries, and people-also-ask data.

4.8
590 runs
From $1.2/1,000 results

Google Sheets Import Export Tool

by Kael Odin

A powerful Google Sheets data import export tool designed for data synchronization, backup, and integration between Google Sheets and external systems. Supports three operation modes, two authentication methods, batch processing, data deduplication, and automatic backup.

5.0
2 runs
From $1.2/1,000 results

Cheerio Web Scraping

by Kael Odin

A high-speed static page scraper based on Cheerio, designed specifically for static HTML pages. Uses Cheerio for HTML parsing, delivering speeds 10-50 times faster than full browser rendering.

5.0
3 runs
From $1.2/1,000 results

Playwright Web Scraping

by Kael Odin

A powerful cross-browser web scraping tool using Playwright for complete browser rendering. Supports Chromium, Firefox, and WebKit browser engines. Perfect for dynamic pages, single-page applications (SPAs), infinite scroll pages, and cross-browser testing scenarios.

5.0
4 runs
From $1.2/1,000 results
CoreClaw

Deploy ready-to-use Workers to accelerate your data collection workflows.

Email: support@coreclaw.com

Address

Apex DataWorks Limited

UNIT 9, 1/F, THE CLOUD, 111 TUNG CHAU STREET, TAI KOK TSUI, KOWLOON, HONG KONG