Kael Odin

Shopify Product Scraper

odin-kael/shopify-scraper-worker

Extract titles, prices, SKUs, variants, images, inventory status & metadata from any Shopify store. Just enter the URL—we handle sitemap discovery, WAF bypass & structured output automatically.


What is Shopify Product Scraper?

Shopify Product Scraper crawls product data from any Shopify store using its JSON API, sitemap discovery, and a browser-based anti-bot fallback. It extracts title, price, description, SKU, variants, images, inventory status, and metadata.

Shopify Product Scraper Features

  • Zero-config discovery: Provide a store URL and the worker parses robots.txt to find product sitemaps automatically
  • Dual data source: Tries /products.json bulk API first (fast), falls back to individual product endpoints via browser (robust)
  • Browser fingerprint integration: Uses the platform-hosted remote browser to bypass WAF / Cloudflare when direct HTTP is blocked
  • Sitemap fallback: Tries 5 common sitemap paths when robots.txt doesn't declare one
  • Locale-aware URL normalization: Strips Shopify locale prefixes (/es-US/products/) for correct JSON API access
  • JSON+HTML mode: Optionally fetch HTML pages first, then JSON — preserving raw HTML for custom parsing
  • Extendable output: Inject a custom JavaScript function to transform, filter, or enrich each product row
  • Scraper lifecycle hooks: Inject hooks at PRE/POST navigation, URL filtering, and RUN/FINISH stages
  • Concurrent crawling: Configurable parallelism (1-20) with automatic retry on failure
  • All variants expanded: Each variant becomes a separate output row with its own SKU, price, option attributes, and images

Quick Start

Input

json
{
    "startUrl": [{ "url": "https://www.gymshark.com" }],
    "maxRequestsPerCrawl": 0,
    "maxConcurrency": 20,
    "checkForBanner": true,
    "fetchHtml": false,
    "debugLog": false,
    "extendOutputFunction": "async ({ data, item, product, images, fns, name, request, variants, context, customData, input, platform }) => { return item; }",
    "extendScraperFunction": "async ({ fns, customData, platform, label }) => {}",
    "customData": {}
}
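
Because extendOutputFunction and extendScraperFunction are passed as strings, a raw line break inside them makes the input invalid JSON. One way to avoid hand-escaping is to write the functions normally and serialize them. The sketch below assumes the input is prepared in a Node.js script; how the resulting JSON is submitted to the platform is outside its scope.

```javascript
// Write the extend functions as ordinary functions, then convert them
// to source strings, since the input expects strings.
const extendOutputFunction = async ({ item }) => {
    return item;
};

const extendScraperFunction = async ({ label }) => {
    // No-op for every lifecycle stage.
};

const input = {
    startUrl: [{ url: 'https://www.gymshark.com' }],
    maxRequestsPerCrawl: 0,
    maxConcurrency: 20,
    checkForBanner: true,
    fetchHtml: false,
    debugLog: false,
    extendOutputFunction: extendOutputFunction.toString(),
    extendScraperFunction: extendScraperFunction.toString(),
    customData: {},
};

// JSON.stringify escapes the newlines inside the function source,
// so the result is valid JSON.
const inputJson = JSON.stringify(input, null, 4);
console.log(inputJson);
```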

Output fields

Field | Type | Description
--- | --- | ---
url | string | Product page URL
title | string | Product title
id | string | Shopify product ID (GUID stripped)
sku | string | Variant SKU
description | string | Product description (HTML stripped)
price | number | Variant price
currency | string | Currency (defaults to "USD")
availability | string | "in stock" or "out of stock"
color | string | Option value for color
size | string | Option value for size
material | string | Option value for material
display_name | string | Variant display name
product_type | string | Shopify product type
images_urls | string[] | Product + variant image URLs (deduped, query strings stripped)
brand | string | Product vendor
video_urls | string[] | Video URLs (reserved)
created_at | string | ISO 8601 creation timestamp
updated_at | string | ISO 8601 update timestamp
published_at | string | ISO 8601 publish timestamp
additional | object | Extra metadata: variant_attributes, variant_title, scraped_at, barcode, taxcode, stock_count, tags, weight, requires_shipping, plus any custom option keys
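
To illustrate how one product becomes one row per variant, here is a minimal sketch. It is not the worker's code. The input field names (handle, vendor, options, variants, option1 to option3, available) follow the shape commonly returned by a store's /products.json. The function name expandVariants and the exact mapping onto output fields are assumptions, and only a subset of the fields above is filled in.

```javascript
// Sketch only: expands a products.json-style product into one row per variant.
function expandVariants(product, storeUrl) {
    const optionNames = (product.options || []).map((o) => String(o.name).toLowerCase());
    return (product.variants || []).map((variant) => {
        const row = {
            url: `${storeUrl.replace(/\/$/, '')}/products/${product.handle}`,
            title: product.title,
            id: String(product.id),
            sku: variant.sku || '',
            price: Number(variant.price), // the price is typically a string such as "30.00"
            availability: variant.available ? 'in stock' : 'out of stock',
            brand: product.vendor,
            product_type: product.product_type,
            additional: { variant_title: variant.title },
        };
        // option1..option3 map positionally onto the product's option names.
        [variant.option1, variant.option2, variant.option3].forEach((value, i) => {
            const name = optionNames[i];
            if (!name || value == null) return;
            if (['color', 'size', 'material'].includes(name)) row[name] = value;
            else row.additional[name] = value; // custom option keys
        });
        return row;
    });
}

// Made-up sample product for demonstration.
const sampleProduct = {
    id: 1001,
    handle: 'training-tee',
    title: 'Training Tee',
    vendor: 'ExampleBrand',
    product_type: 'T-Shirts',
    options: [{ name: 'Color' }, { name: 'Size' }, { name: 'Fit' }],
    variants: [
        { title: 'Black / M / Slim', sku: 'TT-BLK-M', price: '30.00', available: true, option1: 'Black', option2: 'M', option3: 'Slim' },
        { title: 'Black / L / Slim', sku: 'TT-BLK-L', price: '30.00', available: false, option1: 'Black', option2: 'L', option3: 'Slim' },
    ],
};

const sampleRows = expandVariants(sampleProduct, 'https://example.com/');
console.log(sampleRows.length); // prints 2
```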

Usage Examples

Example 1: Basic scrape — crawl all products from one store

json
{
    "startUrl": [{ "url": "https://www.gymshark.com" }],
    "maxRequestsPerCrawl": 0,
    "maxConcurrency": 20
}

Example 2: Multiple stores in parallel

The startUrl array controls how the platform splits work across concurrent subtasks (via the b field).

json
{
    "startUrl": [
        { "url": "https://www.gymshark.com" },
        { "url": "https://www.spanx.com" },
        { "url": "https://www.nativecos.com" }
    ],
    "maxRequestsPerCrawl": 0,
    "maxConcurrency": 20
}

Example 3: Custom output mapping — price threshold and discount calculation

Use extendOutputFunction to transform or reject each row. Return null to skip.

javascript
async ({ data, item, fns, input, platform }) => {
    // Filter out low-price items
    if (item.price < 10) return null;

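    // Compute discount percentage (compare_at_price may arrive as a string or be absent)
    const comparePrice = Number(data.product?.variants?.[0]?.compare_at_price);
    if (comparePrice > 0 && item.price) {
        item.additional = item.additional || {};
        item.additional.discount_pct = Math.round((1 - item.price / comparePrice) * 100);
    }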

    // Exclude gift cards
    if (item.product_type === 'Gift Card') return null;

    return item;
}
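
Before a full crawl, it can help to try a filter like this against a few hand-made rows. The harness below is a hypothetical local check, not part of the platform. runLocally and the sample rows are made up, and the real runner passes more fields (fns, input, platform, and so on) than this sketch does.

```javascript
// A filter in the same spirit as the example above.
const discountFilter = async ({ item, data }) => {
    if (item.price < 10) return null;
    const compare = Number(data.product?.variants?.[0]?.compare_at_price);
    if (compare > 0 && item.price) {
        item.additional = item.additional || {};
        item.additional.discount_pct = Math.round((1 - item.price / compare) * 100);
    }
    return item;
};

// Hypothetical harness: calls the function once per sample and drops
// rows for which it returns null, mimicking "return null to skip".
async function runLocally(fn, samples) {
    const kept = [];
    for (const { item, data } of samples) {
        const result = await fn({ item: { ...item }, data });
        if (result != null) kept.push(result);
    }
    return kept;
}

const samples = [
    { item: { title: 'Socks', price: 5 }, data: { product: { variants: [{}] } } },
    { item: { title: 'Hoodie', price: 30 }, data: { product: { variants: [{ compare_at_price: '40.00' }] } } },
];

runLocally(discountFilter, samples).then((rows) => {
    // Socks are dropped; Hoodie gets discount_pct = round((1 - 30/40) * 100) = 25
    console.log(rows);
});
```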

Example 4: Scraper lifecycle — inject custom sitemap URLs

Use extendScraperFunction to hook into different stages of the crawl lifecycle.

javascript
async ({ fns, customData, platform, label }) => {
    if (label === 'SETUP') {
        // Access the request queue to inject additional URLs
        const extraSitemap = customData.extraSitemapUrl;
        if (extraSitemap) {
            // The runner passes { requestQueue } at SETUP stage
        }
    }

    if (label === 'FILTER_SITEMAP_URL') {
        // this.url is the product/sitemap URL being evaluated
        // this.filter(false) excludes it from the crawl
    }

    if (label === 'PRENAVIGATION') {
        // this.crawlingContext.request — modify headers before each request
    }

    if (label === 'FINISHED') {
        // this.crawler — access crawler stats, persist state
    }
}

Example 5: HTML mode — parse embedded JSON-LD from product pages

Enable fetchHtml to get the full HTML page alongside the JSON API response. The HTML body is available in request.userData.body inside the output function.

json
{
    "startUrl": [{ "url": "https://www.colourpop.com" }],
    "maxRequestsPerCrawl": 20,
    "fetchHtml": true
}

In extendOutputFunction:

javascript
async ({ data, item, request, context }) => {
    const htmlBody = request.userData.body;
    if (htmlBody) {
        // Parse JSON-LD or meta tags from the HTML
        // context.$ is cheerio loaded from the HTML page
    }
    return item;
}
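
As one possible way to fill in the parsing step, the sketch below pulls JSON-LD blocks out of an HTML string with a regular expression so that it has no dependencies. The helper name extractJsonLd and the sample HTML are assumptions. Inside the worker, context.$ could be used to select the script tags instead.

```javascript
// Sketch: collect every parseable <script type="application/ld+json"> block.
function extractJsonLd(html) {
    const blocks = [];
    const pattern = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
    for (const match of html.matchAll(pattern)) {
        try {
            blocks.push(JSON.parse(match[1].trim()));
        } catch (err) {
            // Skip malformed blocks rather than failing the whole row.
        }
    }
    return blocks;
}

// Made-up page body standing in for request.userData.body.
const sampleHtml = `
<html><head>
<script type="application/ld+json">{"@type":"Product","name":"Lip Kit","brand":{"name":"Example"}}</script>
<script type="application/ld+json">{ not valid json }</script>
</head><body></body></html>`;

const productLd = extractJsonLd(sampleHtml).find((block) => block['@type'] === 'Product');
console.log(productLd.name); // prints "Lip Kit"
```

Within extendOutputFunction, the found values could then be copied onto item.additional before returning the row.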

Example 6: Debug mode — inspect failed JSON responses

Set debugLog to true to save failed responses (those missing a product title) to storage for inspection.

json
{
    "startUrl": [{ "url": "https://www.kith.com" }],
    "maxRequestsPerCrawl": 5,
    "debugLog": true
}

Input Reference

Parameter | Type | Default | Description
--- | --- | --- | ---
startUrl | array | required | Shopify store URLs. Also the b (split) field for concurrency.
maxRequestsPerCrawl | integer | 0 | Max products to crawl. 0 = unlimited.
maxConcurrency | integer | 20 | Max parallel requests (1-20).
maxRequestRetries | integer | 3 | Retries on failure before giving up.
checkForBanner | boolean | true | Verify robots.txt contains "Shopify" before crawling (non-Shopify stores still proceed).
fetchHtml | boolean | false | Fetch HTML pages before JSON API calls (2x requests).
debugLog | boolean | false | Verbose logging; saves failed JSON responses for inspection.
extendOutputFunction | string | passthrough | JavaScript function (async) to transform/filter output rows. Return null to skip.
extendScraperFunction | string | no-op | JavaScript function (async) for scraper lifecycle hooks.
customData | object | {} | Arbitrary data accessible in both extend functions.

Known Limitations

  • WAF-protected stores: Stores with aggressive WAF (Cloudflare, Akamai) may return challenge pages instead of product data. These appear in output with titles like "Verifying your connection..." and empty product fields. Filter them in extendOutputFunction:
    javascript
    if (!item.sku || item.title === 'Verifying your connection...') return null;
  • Browser-only mode: When products.json is blocked, all requests go through the browser, which is slower (~1 req/sec per concurrent browser). With 5 concurrent browsers, expect ~5 products/sec.
  • Currency detection: Always outputs "USD" — multi-currency stores need custom parsing via extendOutputFunction.
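
The browser-only throughput figure above lends itself to a rough runtime estimate. The helper below is back-of-the-envelope arithmetic only. estimateRunSeconds is a made-up name, and applying the 2x request cost of fetchHtml in browser-only mode is an assumption.

```javascript
// Rough estimate: ~1 request per second per concurrent browser.
function estimateRunSeconds(productCount, concurrentBrowsers, fetchHtml = false) {
    const requestsPerProduct = fetchHtml ? 2 : 1; // fetchHtml doubles the requests
    const requestsPerSecond = concurrentBrowsers * 1;
    return Math.ceil((productCount * requestsPerProduct) / requestsPerSecond);
}

console.log(estimateRunSeconds(1000, 5)); // prints 200
console.log(estimateRunSeconds(1000, 5, true)); // prints 400
```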

Proxy & Network

On CoreClaw, all outbound HTTP requests go through the platform's SOCKS5 proxy. The proxy address is read from PROXY_AUTH and PROXY_DOMAIN environment variables (set automatically by the platform).

The browser is connected via WebSocket CDP (ChromeWs env var + PROXY_AUTH auth). Both are platform-injected — no manual configuration needed.

FAQ

What types of Shopify stores are supported?

All online stores built on the Shopify platform can be scraped, regardless of theme or language version. The tool automatically detects and handles localized URLs.

Can I scrape non-Shopify sites?

This tool is designed specifically for Shopify. Non-Shopify sites may be attempted, but data structures may not be compatible. The checkForBanner parameter verifies whether robots.txt contains Shopify identifiers.
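
A check of this kind could look roughly like the sketch below. It assumes a case-insensitive substring test for "Shopify" and also collects any Sitemap directives. The function name inspectRobotsTxt and the sample text are hypothetical, and the worker's real detection logic may differ.

```javascript
// Sketch: inspect the text of a robots.txt file that has already been downloaded.
function inspectRobotsTxt(text) {
    const sitemaps = [];
    for (const line of text.split(/\r?\n/)) {
        // Sitemap directives look like "Sitemap: https://example.com/sitemap.xml".
        const match = line.match(/^\s*sitemap:\s*(\S+)/i);
        if (match) sitemaps.push(match[1]);
    }
    return { looksLikeShopify: /shopify/i.test(text), sitemaps };
}

// Made-up robots.txt content for demonstration.
const sampleRobots = [
    '# we use Shopify as our ecommerce platform',
    'User-agent: *',
    'Disallow: /cart',
    'Sitemap: https://example.com/sitemap.xml',
].join('\n');

console.log(inspectRobotsTxt(sampleRobots));
```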

What's the optimal concurrency setting?

The default of 20 concurrent requests works well for most scenarios. For stores with strict rate limiting, we recommend reducing it to 5-10.

Pricing

Failed results don't count

Rating

4.5

Developer

Kael Odin

Worker Stats

53 Total runs
Success rate: 78.95%
Last updated: May 20, 2026

Categories

E-Commerce, Other
