Intelligently extract website content using Crawl4AI, retrieving page content in various formats (Markdown, HTML, or plain text). Supports configurable depth, wait conditions, CSS selectors, and comprehensive link discovery. Zero-code operation, one-click export in CSV or JSON format.
A web content extractor is an automated data extraction tool specifically designed for bulk scraping page content from websites, supporting multiple output formats and intelligent content cleaning. With CoreClaw, you can obtain structured web content with zero code, facilitating content aggregation, SEO analysis, AI knowledge base construction, and data mining.
Each crawled page yields the following fields:

- 📄 Page URL
- 📝 Page Title
- 📖 Markdown Content
- 🌐 HTML Content
- 📄 Plain Text Content
- 📊 Content Summary
- 🔗 Internal Links
- 🌐 External Links
- 📏 Crawling Depth
- 📡 HTTP Status Code
CoreClaw Web Content Extractor handles proxy rotation, task scheduling, data standardization, and final delivery for you in the background. In just a few minutes, you can get your data by configuring the input parameters below:
| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls | array | - | Required, list of starting URLs |
| maxPages | integer | 50 | Maximum number of pages to process (1-10000) |
| maxDepth | integer | 2 | Maximum link depth (0-10) |
| concurrency | integer | 5 | Concurrent page tasks (1-50) |
| requestTimeoutSecs | integer | 60 | Page timeout (5-600 seconds) |
| extractMode | string | markdown | Output format: markdown, html, or text |
| waitUntil | string | domcontentloaded | When to consider the page loaded |
| waitForSelector | string | - | CSS selector to wait for |
| cssSelector | string | - | Extract only the page area matching this selector |
| sameDomainOnly | boolean | true | Only follow same-domain links |
| includePatterns | array | [] | Regex patterns to include |
| excludePatterns | array | [] | Regex patterns to exclude |
| cleanContent | boolean | true | Clean and normalize content |
| maxContentChars | integer | 0 | Truncate content (0=no limit) |
| crawlMode | string | full | full (crawl and extract content) or discover_only (collect links only) |
Example 1: Basic Crawling
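A minimal input for a basic crawl, using only parameters from the table above; the start URL is a placeholder:

```json
{
  "startUrls": ["https://example.com"],
  "maxPages": 50,
  "maxDepth": 2,
  "extractMode": "markdown"
}
```

This crawls up to 50 pages, following links two levels deep, and returns each page as Markdown.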
Example 2: Extract Specific Area
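A sketch that restricts extraction to one page region; the `article` selector is an assumption about the target site's markup:

```json
{
  "startUrls": ["https://example.com/blog"],
  "cssSelector": "article",
  "waitForSelector": "article",
  "extractMode": "text"
}
```

Waiting for the same selector before extracting helps on pages that render the article client-side.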
Example 3: Discover Links Only
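To map a site's link structure without extracting page content, switch crawlMode to discover_only:

```json
{
  "startUrls": ["https://example.com"],
  "crawlMode": "discover_only",
  "maxDepth": 3,
  "sameDomainOnly": true
}
```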
For your convenience, the output results are displayed in tables and tabs. You can download the results in CSV or JSON format.
Basic Fields
Content Fields (returned according to the selected extraction mode)
Auxiliary Fields
Example Data:
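An illustrative output record. The exact field names below are assumptions inferred from the field list above, not the platform's published schema:

```json
{
  "url": "https://example.com/guide",
  "title": "Getting Started Guide",
  "markdown": "# Getting Started\n\nWelcome to the guide...",
  "internalLinks": ["https://example.com/guide/install"],
  "externalLinks": ["https://github.com/unclecode/crawl4ai"],
  "depth": 1,
  "statusCode": 200
}
```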
Use the maxDepth parameter to control crawling depth: 0 crawls only the start URLs, 1 also follows links found on those pages, and so on, up to a maximum of 10.
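For example, a crawl limited to the start pages plus one level of linked pages (the URL is a placeholder):

```json
{
  "startUrls": ["https://example.com"],
  "maxDepth": 1
}
```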
Yes. Use the cssSelector parameter to specify the page area to extract:
- `article` - extracts article content (the `<article>` element)
- `.content` - extracts content with the specified class name
- `#main` - extracts content with the specified ID

Use the following two methods to filter links (shown in the sketch after this list):

- includePatterns - regex patterns a link must match to be crawled
- excludePatterns - regex patterns that exclude matching links
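A sketch combining both pattern lists; the regex values are placeholders:

```json
{
  "startUrls": ["https://example.com"],
  "includePatterns": ["/blog/.*"],
  "excludePatterns": [".*\\?page=.*", "/tag/.*"]
}
```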
Limits for different parameters:

- maxPages: 1-10000
- maxDepth: 0-10
- concurrency: 1-50
- requestTimeoutSecs: 5-600 seconds
Use the smart wait parameters: waitUntil controls when the page is considered loaded, and waitForSelector delays extraction until the specified element appears. See the sketch below.
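For a page that renders its content client-side, a sketch combining both wait parameters; the `#main` selector is an assumption about the target page's markup:

```json
{
  "startUrls": ["https://example.com/app"],
  "waitUntil": "domcontentloaded",
  "waitForSelector": "#main",
  "requestTimeoutSecs": 120
}
```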
The content cleaning feature (cleanContent, enabled by default) automatically cleans and normalizes the extracted content; pair it with maxContentChars to cap the size of each page's output, as in the sketch below.
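A sketch that keeps cleaning on and truncates each page to 20,000 characters:

```json
{
  "startUrls": ["https://example.com"],
  "cleanContent": true,
  "maxContentChars": 20000
}
```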
Explore more popular scrapers from our marketplace
by CoreClaw
Queries the Google search engine by keyword and returns a structured SERP summary, including the final search parameters, organic results, related queries, and people-also-ask data.
by Odin Kael
Dedup Datasets Worker is a powerful tool for merging and deduplicating datasets from multiple JSON/JSONL files. Fully optimized for the CafeScraper platform with enhanced features and robust error handling.
by Odin Kael
A powerful Google Sheets data import/export tool designed for data synchronization, backup, and integration between Google Sheets and external systems. Supports three operation modes, two authentication methods, batch processing, data deduplication, and automatic backup.
by Odin Kael
A high-speed scraper designed specifically for static HTML pages. Uses Cheerio for HTML parsing, delivering speeds 10-50 times faster than full browser rendering.