
Extract titles, prices, SKUs, variants, images, inventory status & metadata from any Shopify store. Just enter the URL—we handle sitemap discovery, WAF bypass & structured output automatically.
Shopify Product Scraper crawls product data from any Shopify store via its JSON API, sitemap discovery, and a browser-based anti-bot fallback. It extracts title, price, description, SKU, variants, images, inventory status, and metadata.
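The sitemap discovery step can be illustrated with a minimal sketch that collects the `Sitemap:` lines of a robots.txt body. The function name `extractSitemapUrls` and the sample robots.txt are hypothetical; this is not the scraper's actual implementation.

```javascript
// Collect the URLs declared on "Sitemap:" lines of a robots.txt body.
// Hypothetical helper for illustration only.
function extractSitemapUrls(robotsTxt) {
  const urls = [];
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop comments
    const match = /^sitemap\s*:\s*(\S+)/i.exec(line); // directive is case-insensitive
    if (match) urls.push(match[1]);
  }
  return urls;
}

// Made-up robots.txt content for the example.
const sample = [
  "# we use Shopify as our ecommerce platform",
  "User-agent: *",
  "Disallow: /cart",
  "Sitemap: https://example.com/sitemap.xml",
].join("\n");

console.log(extractSitemapUrls(sample));
```

An empty result corresponds to the case where robots.txt declares no sitemap, which a crawler has to handle separately.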
- Reads robots.txt to find product sitemaps automatically
- Tries the /products.json bulk API first (fast), falls back to individual product endpoints via browser (robust)
- Handles the case where robots.txt doesn't declare a sitemap
- Handles localized URL paths (e.g. /es-US/products/) for correct JSON API access

| Field | Type | Description |
|---|---|---|
| url | string | Product page URL |
| title | string | Product title |
| id | string | Shopify product ID (GUID stripped) |
| sku | string | Variant SKU |
| description | string | Product description (HTML stripped) |
| price | number | Variant price |
| currency | string | Currency (defaults to "USD") |
| availability | string | "in stock" or "out of stock" |
| color | string | Option value for color |
| size | string | Option value for size |
| material | string | Option value for material |
| display_name | string | Variant display name |
| product_type | string | Shopify product type |
| images_urls | string[] | Product + variant image URLs (deduped, query strings stripped) |
| brand | string | Product vendor |
| video_urls | string[] | Video URLs (reserved) |
| created_at | string | ISO 8601 creation timestamp |
| updated_at | string | ISO 8601 update timestamp |
| published_at | string | ISO 8601 publish timestamp |
| additional | object | Extra metadata: variant_attributes, variant_title, scraped_at, barcode, taxcode, stock_count, tags, weight, requires_shipping, plus any custom option keys |
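Several fields in the table are normalized values (ID with the GUID prefix stripped, de-duplicated image URLs without query strings, availability as a string). The sketch below shows one plausible way to derive them. The helper names are hypothetical, and both the `gid://shopify/...` input form and the boolean `available` flag are assumptions about the upstream data, not confirmed details of this scraper.

```javascript
// Strip a Shopify global ID prefix, e.g. "gid://shopify/Product/123" -> "123".
// Plain numeric IDs pass through unchanged (as strings).
function stripGid(id) {
  return String(id).replace(/^gid:\/\/shopify\/\w+\//, "");
}

// Remove query strings, then de-duplicate while preserving first-seen order.
function normalizeImageUrls(urls) {
  const cleaned = urls.map((u) => u.split("?")[0]);
  return [...new Set(cleaned)];
}

// Map an assumed boolean `available` flag to the strings used in the table.
function toAvailability(available) {
  return available ? "in stock" : "out of stock";
}

console.log(stripGid("gid://shopify/Product/632910392"));
console.log(
  normalizeImageUrls([
    "https://cdn.example.com/a.jpg?v=1",
    "https://cdn.example.com/a.jpg?v=2",
    "https://cdn.example.com/b.jpg",
  ])
);
console.log(toAvailability(false));
```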
The startUrl array controls how the platform splits work across concurrent subtasks (via the b field).
Use extendOutputFunction to transform or reject each row. Return null to skip.
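As a minimal sketch of transforming and rejecting rows: the argument names `item` and `customData`, the helper `keepRow`, and the `minPrice` key are all assumptions made for illustration, since the exact function signature is not documented here.

```javascript
// Pure helper so the filtering rule can be tested on its own.
// `minPrice` is a hypothetical key supplied through customData.
function keepRow(item, customData = {}) {
  if (item.availability === "out of stock") return false;
  const minPrice = customData.minPrice ?? 0;
  return typeof item.price !== "number" || item.price >= minPrice;
}

// Assumed signature: the platform may pass different argument names.
const extendOutputFunction = async ({ item, customData }) => {
  if (!keepRow(item, customData)) return null; // returning null skips the row
  return { ...item, title: String(item.title).trim() }; // light transform
};
```

Keeping the rule in a plain synchronous helper makes it easy to check locally before pasting the function into the input.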
Use extendScraperFunction to hook into different stages of the crawl lifecycle.
Enable fetchHtml to get the full HTML page alongside the JSON API response.
The HTML body is available in request.userData.body inside the output function.
In extendOutputFunction:
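A hedged sketch of reading the HTML body follows. Only `request.userData.body` comes from the description above; the argument names `item` and `request`, the helper `extractMetaDescription`, and the `meta_description` key are assumptions. The regex expects `name` to precede `content` and is brittle, so an HTML parser would be preferable for real pages.

```javascript
// Extract the content of <meta name="description" ...> from raw HTML.
// Hypothetical helper; assumes the name attribute comes before content.
function extractMetaDescription(html) {
  const match = /<meta\s+name=["']description["']\s+content=(["'])(.*?)\1/i.exec(html || "");
  return match ? match[2] : null;
}

// Assumed signature: the platform may pass different argument names.
const extendOutputFunction = async ({ item, request }) => {
  const html = request.userData.body; // available when fetchHtml is enabled
  return {
    ...item,
    additional: {
      ...item.additional,
      meta_description: extractMetaDescription(html), // hypothetical extra key
    },
  };
};
```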
Set debugLog: true to save failed responses (those missing a product title) to storage for inspection.
| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrl | array | required | Shopify store URLs. Also the b (split) field for concurrency. |
| maxRequestsPerCrawl | integer | 0 | Max products to crawl. 0 = unlimited. |
| maxConcurrency | integer | 20 | Max parallel requests (1-20). |
| maxRequestRetries | integer | 3 | Retries on failure before giving up. |
| checkForBanner | boolean | true | Verify robots.txt contains "Shopify" before crawling (non-Shopify stores still proceed). |
| fetchHtml | boolean | false | Fetch HTML pages before JSON API calls (2x requests). |
| debugLog | boolean | false | Verbose logging; saves failed JSON responses for inspection. |
| extendOutputFunction | string | passthrough | JavaScript function (async) to transform/filter output rows. Return null to skip. |
| extendScraperFunction | string | no-op | JavaScript function (async) for scraper lifecycle hooks. |
| customData | object | {} | Arbitrary data accessible in both extend functions. |
extendOutputFunction:
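Since the currency field defaults to "USD", one use of extendOutputFunction is to override it per store. The input below is a sketch under assumptions: plain URL strings in `startUrl`, a `currency` key in `customData`, and the argument names `item` and `customData` are all hypothetical; only the parameter names come from the table.

```javascript
// Example input object; values are illustrative, not recommended defaults.
const input = {
  startUrl: ["https://example.com"], // assumed: plain URL strings
  maxRequestsPerCrawl: 100,
  maxConcurrency: 10,
  customData: { currency: "EUR" }, // hypothetical key read by the function below
  // The function is passed as a string, per the parameter table.
  extendOutputFunction: `async ({ item, customData }) => ({
    ...item,
    currency: customData.currency || item.currency,
  })`,
};

console.log(JSON.stringify(input, null, 2));
```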
- When products.json is blocked, all requests go through the browser, which is slower (~1 req/sec per concurrent browser). With 5 concurrent browsers, expect ~5 products/sec.
- Currency defaults to "USD"; multi-currency stores need custom parsing via extendOutputFunction.

On CoreClaw, all outbound HTTP requests go through the platform's SOCKS5 proxy.
The proxy address is read from PROXY_AUTH and PROXY_DOMAIN environment variables (set automatically by the platform).
The browser is connected via WebSocket CDP (ChromeWs env var + PROXY_AUTH auth).
Both are platform-injected — no manual configuration needed.
All online stores built on the Shopify platform can be scraped, regardless of theme or language version. The tool automatically detects and handles localized URLs.
This tool is designed specifically for Shopify. Non-Shopify sites may be attempted, but data structures may not be compatible. The checkForBanner parameter verifies whether robots.txt contains Shopify identifiers.
The default of 20 concurrent requests works well for most scenarios. For stores with strict rate limiting, we recommend reducing it to 5-10.
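To get a feel for how concurrency affects run time in browser-fallback mode, the rough figure of ~1 req/sec per concurrent browser can be turned into a back-of-the-envelope estimate. The function name is hypothetical and the rate is only the approximate figure quoted for browser mode; actual throughput varies by store.

```javascript
// Rough crawl-time estimate for browser-fallback mode.
// reqPerSecPerBrowser defaults to the ~1 req/sec approximation.
function estimateCrawlSeconds(productCount, concurrentBrowsers, reqPerSecPerBrowser = 1) {
  if (concurrentBrowsers <= 0 || reqPerSecPerBrowser <= 0) {
    throw new RangeError("concurrency and rate must be positive");
  }
  return productCount / (concurrentBrowsers * reqPerSecPerBrowser);
}

// 3000 products with 5 concurrent browsers at ~1 req/sec each.
console.log(estimateCrawlSeconds(3000, 5)); // 600 (about 10 minutes)
```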