Shopify Product Scraper

odin-kael/shopify-scraper-worker

Extract titles, prices, SKUs, variants, images, inventory status & metadata from any Shopify store. Just enter the URL—we handle sitemap discovery, WAF bypass & structured output automatically.

What is Shopify Product Scraper?

Shopify Product Scraper crawls product data from any Shopify store using its JSON API, sitemap discovery, and a browser-based anti-bot fallback. It extracts title, price, description, SKU, variants, images, inventory status, and metadata.

Shopify Product Scraper Features

Zero-config discovery: Provide a store URL and the worker parses robots.txt to find product sitemaps automatically
Dual data source: Tries /products.json bulk API first (fast), falls back to individual product endpoints via browser (robust)
Browser fingerprint integration: Uses the platform-hosted remote browser to bypass WAF / Cloudflare when direct HTTP is blocked
Sitemap fallback: Tries 5 common sitemap paths when robots.txt doesn't declare one
Locale-aware URL normalization: Strips Shopify locale prefixes (/es-US/products/) for correct JSON API access
JSON+HTML mode: Optionally fetch HTML pages first, then JSON — preserving raw HTML for custom parsing
Extendable output: Inject a custom JavaScript function to transform, filter, or enrich each product row
Scraper lifecycle hooks: Inject hooks at PRE/POST navigation, URL filtering, and RUN/FINISH stages
Concurrent crawling: Configurable parallelism (1-20) with automatic retry on failure
All variants expanded: Each variant becomes a separate output row with its own SKU, price, option attributes, and images
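
The discovery and URL normalization features above can be pictured with a small sketch. It is not the worker's actual code: the function names `extractSitemapUrls` and `toProductJsonUrl` and the locale pattern are assumptions made for illustration, and the sketch assumes a product's JSON lives at its page path with `.json` appended.

```javascript
// Sketch only: names and the locale pattern are assumptions,
// not the worker's real implementation.

// Collect every "Sitemap:" directive from a robots.txt body.
function extractSitemapUrls(robotsTxt) {
    return robotsTxt
        .split(/\r?\n/)
        .map((line) => line.match(/^\s*sitemap:\s*(\S+)/i))
        .filter(Boolean)
        .map((match) => match[1]);
}

// Strip a leading locale segment such as /es-US/ or /fr/ and
// point the URL at the product's .json endpoint.
function toProductJsonUrl(productUrl) {
    const url = new URL(productUrl);
    url.pathname = url.pathname
        .replace(/^\/[a-z]{2}(?:-[a-z]{2})?(?=\/products\/)/i, '')
        .replace(/\/$/, '');
    if (!url.pathname.endsWith('.json')) url.pathname += '.json';
    url.search = ''; // drop ?variant=... and similar query strings
    return url.toString();
}

console.log(toProductJsonUrl('https://example.com/es-US/products/tee?variant=1'));
// prints "https://example.com/products/tee.json"
```

The lookahead on `/products/` keeps the locale pattern from matching the first two letters of an ordinary path segment.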

Quick Start

Input

json

{
    "startUrl": [{ "url": "https://www.gymshark.com" }],
    "maxRequestsPerCrawl": 0,
    "maxConcurrency": 20,
    "checkForBanner": true,
    "fetchHtml": false,
    "debugLog": false,
    "extendOutputFunction": "async ({ data, item, product, images, fns, name, request, variants, context, customData, input, platform }) => {\n  return item;\n}",
    "extendScraperFunction": "async ({ fns, customData, platform, label }) => {\n}",
    "customData": {}
}
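
Both extend functions are passed as strings, and JSON strings cannot contain raw line breaks, so multi-line functions are easy to get wrong when written by hand. One way to avoid this, sketched here under the assumption that the input is assembled in a Node.js script, is to write the hooks as real functions and serialize them:

```javascript
// Sketch: serialize hook functions to strings so the input is valid JSON.
const extendOutputFunction = async ({ item }) => {
    // Pass every row through unchanged.
    return item;
};

const extendScraperFunction = async ({ label }) => {
    // No-op lifecycle hook.
};

const input = {
    startUrl: [{ url: 'https://www.gymshark.com' }],
    maxRequestsPerCrawl: 0,
    maxConcurrency: 20,
    extendOutputFunction: extendOutputFunction.toString(),
    extendScraperFunction: extendScraperFunction.toString(),
    customData: {},
};

// JSON.stringify escapes the newlines inside the function source.
const inputJson = JSON.stringify(input, null, 4);
console.log(inputJson);
```

Because the functions are shipped as source text, they cannot rely on variables from the surrounding script; anything they need has to come through their arguments, for example `customData`.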

Output fields

| Field | Type | Description |
| --- | --- | --- |
| `url` | string | Product page URL |
| `title` | string | Product title |
| `id` | string | Shopify product ID (GUID stripped) |
| `sku` | string | Variant SKU |
| `description` | string | Product description (HTML stripped) |
| `price` | number | Variant price |
| `currency` | string | Currency code (currently always `"USD"`) |
| `availability` | string | `"in stock"` or `"out of stock"` |
| `color` | string | Option value for color |
| `size` | string | Option value for size |
| `material` | string | Option value for material |
| `display_name` | string | Variant display name |
| `product_type` | string | Shopify product type |
| `images_urls` | string[] | Product + variant image URLs (deduped, query strings stripped) |
| `brand` | string | Product vendor |
| `video_urls` | string[] | Video URLs (reserved) |
| `created_at` | string | ISO 8601 creation timestamp |
| `updated_at` | string | ISO 8601 update timestamp |
| `published_at` | string | ISO 8601 publish timestamp |
| `additional` | object | Extra metadata: `variant_attributes`, `variant_title`, `scraped_at`, `barcode`, `taxcode`, `stock_count`, `tags`, `weight`, `requires_shipping`, plus any custom option keys |
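
Since every variant is emitted as its own row, downstream code often needs to regroup rows by product. The following is a minimal sketch of that step; `groupByProduct` is a hypothetical helper, and the sample rows contain made-up values rather than real scraper output.

```javascript
// Sketch: regroup variant-level rows into one entry per product.
// Only fields listed in the output table are used.
function groupByProduct(rows) {
    const products = new Map();
    for (const row of rows) {
        if (!products.has(row.id)) {
            products.set(row.id, { id: row.id, title: row.title, url: row.url, variants: [] });
        }
        products.get(row.id).variants.push({
            sku: row.sku,
            price: row.price,
            availability: row.availability,
            color: row.color,
            size: row.size,
        });
    }
    return [...products.values()];
}

// Illustrative rows with made-up values (not real scraper output).
const sampleRows = [
    { id: '1001', title: 'Example Tee', url: 'https://example.com/products/example-tee', sku: 'TEE-S', price: 20, availability: 'in stock', color: 'Black', size: 'S' },
    { id: '1001', title: 'Example Tee', url: 'https://example.com/products/example-tee', sku: 'TEE-M', price: 20, availability: 'out of stock', color: 'Black', size: 'M' },
    { id: '1002', title: 'Example Cap', url: 'https://example.com/products/example-cap', sku: 'CAP-1', price: 15, availability: 'in stock', color: 'Navy', size: 'One Size' },
];

const grouped = groupByProduct(sampleRows);
console.log(grouped.length, grouped[0].variants.length); // prints: 2 2
```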

Usage Examples

Example 1: Basic scrape — crawl all products from one store

json

{
    "startUrl": [{ "url": "https://www.gymshark.com" }],
    "maxRequestsPerCrawl": 0,
    "maxConcurrency": 20
}

Example 2: Multiple stores in parallel

The startUrl array doubles as the platform's split field (the b field): the platform distributes its entries across concurrent subtasks, so each store can be crawled in parallel.

json

{
    "startUrl": [
        { "url": "https://www.gymshark.com" },
        { "url": "https://www.spanx.com" },
        { "url": "https://www.nativecos.com" }
    ],
    "maxRequestsPerCrawl": 0,
    "maxConcurrency": 20
}

Example 3: Custom output mapping — price threshold and discount calculation

Use extendOutputFunction to transform or reject each row. Return null to skip.

javascript

async ({ data, item, fns, input, platform }) => {
    // Exclude gift cards and low-price items
    if (item.product_type === 'Gift Card') return null;
    if (item.price < 10) return null;

    // Compute discount percentage. compare_at_price may arrive as a string,
    // and this reads the first variant's value for every row.
    const comparePrice = Number(data.product?.variants?.[0]?.compare_at_price);
    if (comparePrice > 0 && item.price > 0 && comparePrice > item.price) {
        item.additional = item.additional ?? {};
        item.additional.discount_pct = Math.round((1 - item.price / comparePrice) * 100);
    }

    return item;
}

Example 4: Scraper lifecycle — inject custom sitemap URLs

Use extendScraperFunction to hook into different stages of the crawl lifecycle.

javascript

async ({ fns, customData, platform, label }) => {
    if (label === 'SETUP') {
        // Access the request queue to inject additional URLs
        const extraSitemap = customData.extraSitemapUrl;
        if (extraSitemap) {
            // The runner passes { requestQueue } at SETUP stage
        }
    }

    if (label === 'FILTER_SITEMAP_URL') {
        // this.url is the product/sitemap URL being evaluated
        // this.filter(false) excludes it from the crawl
    }

    if (label === 'PRENAVIGATION') {
        // this.crawlingContext.request — modify headers before each request
    }

    if (label === 'FINISHED') {
        // this.crawler — access crawler stats, persist state
    }
}
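
The FILTER_SITEMAP_URL stage lends itself to a pure predicate that can be tested outside the crawler. The sketch below is an assumption-laden illustration: `shouldCrawl` is a hypothetical helper and `excludePatterns` is a hypothetical `customData` key, neither of which is part of the worker's documented options.

```javascript
// Sketch: a pure predicate that a FILTER_SITEMAP_URL hook could call.
// `excludePatterns` is a hypothetical customData key, not a documented option.
function shouldCrawl(url, customData = {}) {
    const patterns = customData.excludePatterns ?? [];
    // Keep the URL only if no exclusion pattern matches it.
    return !patterns.some((pattern) => new RegExp(pattern, 'i').test(url));
}

// Inside the hook (following the comments in Example 4), this might be used as:
//   if (label === 'FILTER_SITEMAP_URL' && !shouldCrawl(this.url, customData)) {
//       this.filter(false);
//   }

console.log(shouldCrawl('https://example.com/products/gift-card', { excludePatterns: ['gift-card'] }));
// prints: false
```

Because the hook is submitted as a string, a helper like this would have to be defined inside the hook body rather than in an outer scope.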

Example 5: HTML mode — parse embedded JSON-LD from product pages

Enable fetchHtml to get the full HTML page alongside the JSON API response.
The HTML body is available in request.userData.body inside the output function.

json

{
    "startUrl": [{ "url": "https://www.colourpop.com" }],
    "maxRequestsPerCrawl": 20,
    "fetchHtml": true
}

In extendOutputFunction:

javascript

async ({ data, item, request, context }) => {
    const htmlBody = request.userData.body;
    if (htmlBody) {
        // Parse JSON-LD or meta tags from the HTML
        // context.$ is cheerio loaded from the HTML page
    }
    return item;
}
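
The parsing step left as a comment above could look roughly like the following. This is a sketch that works on the raw HTML string with a regular expression instead of cheerio, so it has no dependencies; `extractJsonLd` and `findProductLd` are hypothetical helper names, and nested `@graph` structures are not handled.

```javascript
// Sketch: pull JSON-LD blocks out of raw HTML with a regular expression.
function extractJsonLd(html) {
    const pattern = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
    const blocks = [];
    for (const match of html.matchAll(pattern)) {
        try {
            blocks.push(JSON.parse(match[1]));
        } catch {
            // Skip malformed JSON-LD rather than failing the row.
        }
    }
    return blocks;
}

// Return the first block whose @type is "Product", or null if none exists.
function findProductLd(html) {
    return extractJsonLd(html)
        .flat() // a single script tag may hold an array of objects
        .find((block) => block && block['@type'] === 'Product') ?? null;
}

const sampleHtml = '<html><script type="application/ld+json">{"@type":"Product","name":"Tee"}</script></html>';
console.log(findProductLd(sampleHtml).name); // prints: Tee
```

Inside extendOutputFunction, the result could be attached to the row, for example as `item.additional.json_ld = findProductLd(request.userData.body)`, with the helpers defined in the function body.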

Example 6: Debug mode — inspect failed JSON responses

Set debugLog to true to save failed responses (those missing a product title) to storage for inspection.

json

{
    "startUrl": [{ "url": "https://www.kith.com" }],
    "maxRequestsPerCrawl": 5,
    "debugLog": true
}

Input Reference

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `startUrl` | array | required | Shopify store URLs. Also the `b` (split) field for concurrency. |
| `maxRequestsPerCrawl` | integer | `0` | Max products to crawl. `0` = unlimited. |
| `maxConcurrency` | integer | `20` | Max parallel requests (1-20). |
| `maxRequestRetries` | integer | `3` | Retries on failure before giving up. |
| `checkForBanner` | boolean | `true` | Verify `robots.txt` contains `"Shopify"` before crawling (non-Shopify stores still proceed). |
| `fetchHtml` | boolean | `false` | Fetch HTML pages before JSON API calls (2x requests). |
| `debugLog` | boolean | `false` | Verbose logging; saves failed JSON responses for inspection. |
| `extendOutputFunction` | string | passthrough | JavaScript function (async) to transform/filter output rows. Return `null` to skip. |
| `extendScraperFunction` | string | no-op | JavaScript function (async) for scraper lifecycle hooks. |
| `customData` | object | `{}` | Arbitrary data accessible in both extend functions. |

Known Limitations

WAF-protected stores: Stores with aggressive WAF (Cloudflare, Akamai) may return challenge pages instead of product data. These appear in output with titles like "Verifying your connection..." and empty product fields. Filter them in extendOutputFunction:
javascript
```
if (!item.sku || item.title === 'Verifying your connection...') return null;
```
Browser-only mode: When products.json is blocked, all requests go through the browser, which is slower (~1 req/sec per concurrent browser). With 5 concurrent browsers, expect ~5 products/sec.
Currency detection: Always outputs "USD" — multi-currency stores need custom parsing via extendOutputFunction.
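
The browser-only throughput figure above translates into a rough run-time estimate. The sketch below is back-of-the-envelope arithmetic, assuming one request per product and the quoted rate of about 1 request per second per browser; `estimateBrowserCrawlSeconds` is a hypothetical helper.

```javascript
// Sketch: rough duration estimate for browser-only mode, using the
// ~1 request/second per browser figure quoted above.
function estimateBrowserCrawlSeconds(productCount, concurrentBrowsers, reqPerSecPerBrowser = 1) {
    if (concurrentBrowsers <= 0 || reqPerSecPerBrowser <= 0) {
        throw new RangeError('concurrency and rate must be positive');
    }
    return Math.ceil(productCount / (concurrentBrowsers * reqPerSecPerBrowser));
}

// 3,000 products with 5 concurrent browsers: 3000 / 5 = 600 seconds.
console.log(estimateBrowserCrawlSeconds(3000, 5)); // prints: 600
```

Retries and fetchHtml (which doubles the request count) would lengthen the real figure.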

Proxy & Network

On CoreClaw, all outbound HTTP requests go through the platform's SOCKS5 proxy.
The proxy address is read from PROXY_AUTH and PROXY_DOMAIN environment variables (set automatically by the platform).

The remote browser is reached over a WebSocket CDP connection (endpoint from the ChromeWs environment variable, authenticated with PROXY_AUTH).
Both are platform-injected — no manual configuration needed.
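
For local debugging it can help to see how such variables might combine into a proxy URL. The sketch below rests on an unverified assumption: that PROXY_AUTH has the form `user:password` and PROXY_DOMAIN the form `host:port`. The platform's real formats are not stated here, and `buildProxyUrl` is a hypothetical helper.

```javascript
// Sketch: assemble a SOCKS5 proxy URL from the two variables named above.
// ASSUMPTION: PROXY_AUTH is "user:password" and PROXY_DOMAIN is "host:port";
// the platform's actual formats may differ.
function buildProxyUrl(env = process.env) {
    const { PROXY_AUTH: auth, PROXY_DOMAIN: domain } = env;
    if (!auth || !domain) return null; // variables absent, e.g. running locally

    // Split on the first colon only, so passwords may contain colons.
    const [user, ...rest] = auth.split(':');
    const password = rest.join(':');
    return `socks5://${encodeURIComponent(user)}:${encodeURIComponent(password)}@${domain}`;
}

console.log(buildProxyUrl({ PROXY_AUTH: 'u:p@ss', PROXY_DOMAIN: 'proxy.example:1080' }));
// prints: socks5://u:p%40ss@proxy.example:1080
```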

FAQ

What types of Shopify stores are supported?

All online stores built on the Shopify platform can be scraped, regardless of theme or language version. The tool automatically detects and handles localized URLs.

Can I scrape non-Shopify sites?

This tool is designed specifically for Shopify. Non-Shopify sites may be attempted, but data structures may not be compatible. The checkForBanner parameter verifies whether robots.txt contains Shopify identifiers.

What's the optimal concurrency setting?

The default of 20 concurrent requests works well for most scenarios. For stores with strict rate limiting, we recommend reducing it to 5-10.

Pricing

Failed results don't count

Rating

5.0

Developer

Kael Odin

Worker Stats

53 Total runs

Success rate: 78.95%

Last updated: May 20, 2026
