Extracts website content with Crawl4AI and returns each page as Markdown, HTML, or plain text. Supports configurable crawl depth, wait conditions, CSS selectors, and link discovery. Requires no code; results can be exported as CSV or JSON in one click.
Starting URLs to crawl (e.g. https://example.com). One or more URLs.
Maximum number of pages to process in total (1–10000).
Maximum link depth from each start URL (0–10).
Number of concurrent page tasks (1–50).
Timeout per page request, in seconds (5–600).
Run the browser in headless mode.
Output content format: markdown, html, or text.
Maximum number of output items to push (1–200000).
Only follow links within the domains of the start URLs.
Only include URLs matching these regex patterns (optional).
Exclude URLs matching these regex patterns.
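The three URL filters above (same-domain restriction, include patterns, exclude patterns) can be sketched as a single predicate. This is a minimal illustration, not the scraper's actual implementation: the function name `should_follow` and its parameter names are hypothetical, and it assumes exact host matching, `re.search` semantics, and that exclusions take precedence over inclusions.

```python
import re
from urllib.parse import urlparse


def should_follow(url, start_urls, same_domain_only=True,
                  include_patterns=(), exclude_patterns=()):
    """Return True if a discovered link passes all three URL filters.

    Hypothetical helper: names and precedence rules are assumptions.
    """
    host = urlparse(url).hostname or ""

    if same_domain_only:
        # Assumption: a link is "in domain" only if its host exactly
        # matches the host of one of the start URLs.
        start_hosts = {urlparse(u).hostname for u in start_urls}
        if host not in start_hosts:
            return False

    # Assumption: an exclude match rejects the URL even if it is included.
    if any(re.search(p, url) for p in exclude_patterns):
        return False

    # An empty include list is treated as "include everything".
    if include_patterns and not any(re.search(p, url) for p in include_patterns):
        return False

    return True
```

Whether the real actor treats subdomains as part of the same domain, or matches patterns against the full URL or only the path, is not stated in the description, so those details should be checked against its behavior.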
Maximum number of retries for a failed page (0–10).
Remove navigation-heavy lines and normalize whitespace.
Include the unmodified content in a separate field.
Truncate content to this length (0 = unlimited, max 500000).
Length of the content excerpt used for previews (0–5000).
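The truncation and excerpt settings can be illustrated with a short sketch. It assumes both lengths are counted in characters and that the excerpt is taken from the start of the (already truncated) content; the function name and the output field names are hypothetical and are not taken from the scraper's output schema.

```python
def apply_length_limits(content, max_length=0, excerpt_length=200):
    """Truncate content and derive a preview excerpt.

    Assumptions: lengths are in characters; 0 for max_length means
    "no limit"; the excerpt is the leading slice of the content.
    """
    truncated = content if max_length == 0 else content[:max_length]
    return {
        "content": truncated,
        "excerpt": truncated[:excerpt_length],
        "was_truncated": len(truncated) < len(content),
    }
```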
Page load strategy: domcontentloaded (fast), load, or networkidle (for SPAs and slow sites).
CSS selector to wait for before extraction (e.g. .article-body). Leave empty to skip waiting.
Extract only the content inside this CSS selector (e.g. main, .content). Leave empty to extract the full page.
Mode: full extracts content; discover_only returns only URLs and links (no content).
Include links_internal and links_external arrays in each item (full mode only).
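Taken together, the settings above form one input configuration. The sketch below shows what such a configuration might look like and checks it against the documented ranges and allowed values. The field names (`start_urls`, `max_pages`, and so on) are hypothetical, since the description lists only what each setting does, not its key; the numeric ranges and the option values come from the descriptions above.

```python
# Documented numeric ranges; the keys are hypothetical field names.
RANGES = {
    "max_pages": (1, 10000),
    "max_depth": (0, 10),
    "concurrency": (1, 50),
    "timeout_seconds": (5, 600),
    "max_items": (1, 200000),
    "max_retries": (0, 10),
    "max_content_length": (0, 500000),
    "excerpt_length": (0, 5000),
}

# Documented option values; the keys are hypothetical field names.
CHOICES = {
    "output_format": {"markdown", "html", "text"},
    "wait_until": {"domcontentloaded", "load", "networkidle"},
    "mode": {"full", "discover_only"},
}


def validate_input(cfg):
    """Return a list of human-readable problems (empty if cfg is valid)."""
    errors = []
    if not cfg.get("start_urls"):
        errors.append("start_urls: at least one URL is required")
    for key, (low, high) in RANGES.items():
        if key in cfg and not (low <= cfg[key] <= high):
            errors.append(f"{key}: {cfg[key]} is outside {low}-{high}")
    for key, allowed in CHOICES.items():
        if key in cfg and cfg[key] not in allowed:
            errors.append(f"{key}: {cfg[key]!r} is not one of {sorted(allowed)}")
    return errors


example = {
    "start_urls": ["https://example.com"],
    "max_pages": 100,
    "max_depth": 2,
    "concurrency": 5,
    "timeout_seconds": 60,
    "output_format": "markdown",
    "wait_until": "networkidle",
    "wait_for_selector": ".article-body",
    "content_selector": "main",
    "mode": "full",
    "include_links": True,
}
```

Calling `validate_input(example)` returns an empty list; changing `max_depth` to 11 or `output_format` to an unsupported value adds one message per problem.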