A high-performance web scraper for RAG and AI, featuring Google search integration, dual-mode extraction (HTTP/Browser), and multi-format output.
A RAG web browser is an automated content-extraction tool that provides real-time web search and scraping for Retrieval-Augmented Generation (RAG) pipelines and AI applications. With CoreClaw you can obtain structured web content with zero code, powering AI chatbots, knowledge-base construction, content aggregation, and data mining.
Two scraping engines are available: raw-http (fast HTTP requests) and browser-playwright (full browser rendering).

| 🔍 Google Search Results | 📄 Page Titles & Descriptions |
|---|---|
| 📝 Markdown Formatted Content | 📄 Plain Text Content |
| 🌐 Raw HTML Content | 🏷️ Page Metadata |
| 🌍 Language Identification | ⏱️ Scraping Performance Metrics |
| 📊 HTTP Status Codes | 🔗 Page URL Information |
CoreClaw RAG Web Browser handles proxy rotation, task scheduling, concurrency control, and data standardization for you in the background. In just a few minutes, you can get your data by following these steps:
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| query | string | - | - | Required. Search keyword or direct URL |
| maxResults | number | 3 | 1-100 | Maximum number of search results |
| outputFormat | string | "markdown" | text/markdown/html | Output format |
| scrapingTool | string | "raw-http" | raw-http/browser-playwright | Scraping engine |
| requestTimeoutSecs | number | 40 | 1-300 | Request timeout in seconds |
| serpMaxRetries | number | 2 | 0-5 | Number of Google search retries |
| maxRequestRetries | number | 1 | 0-3 | Number of target page retries |
| dynamicContentWaitSecs | number | 10 | 0-60 | Wait time for dynamic content |
| desiredConcurrency | number | 3 | 1-10 | Number of parallel scraping operations |
| removeCookieWarnings | boolean | true | - | Automatically remove cookie popups |
| htmlTransformer | string | "none" | none/readableText | HTML content transformation |
| removeElementsCssSelector | string | - | - | CSS selector for elements to remove |
| debugMode | boolean | false | - | Enable debug logs and metrics |
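As a reference, the parameters above combine into a single JSON input. The sketch below shows each documented default spelled out explicitly (query is required and has no default; its value here is a placeholder, and removeElementsCssSelector is omitted because it has no default):

```json
{
  "query": "retrieval-augmented generation",
  "maxResults": 3,
  "outputFormat": "markdown",
  "scrapingTool": "raw-http",
  "requestTimeoutSecs": 40,
  "serpMaxRetries": 2,
  "maxRequestRetries": 1,
  "dynamicContentWaitSecs": 10,
  "desiredConcurrency": 3,
  "removeCookieWarnings": true,
  "htmlTransformer": "none",
  "debugMode": false
}
```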
Example 1: Scraping based on Google Search
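A minimal input sketch for keyword-driven scraping, using only parameters from the table above (the query value is a placeholder):

```json
{
  "query": "vector database comparison",
  "maxResults": 5,
  "outputFormat": "markdown"
}
```

The top 5 Google results for the keyword are scraped and returned as Markdown.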
Example 2: Direct Scraping of a Specific URL
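Per the parameter table, query also accepts a direct URL. A sketch for scraping a single known page (placeholder URL):

```json
{
  "query": "https://example.com/blog/post",
  "maxResults": 1,
  "outputFormat": "markdown"
}
```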
Example 3: Concurrent Multi-Page Scraping
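A sketch combining a larger result count with a higher desiredConcurrency to scrape several pages in parallel (values chosen from the documented ranges):

```json
{
  "query": "open source web scraping",
  "maxResults": 10,
  "desiredConcurrency": 5,
  "scrapingTool": "raw-http"
}
```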
Example 4: Readability-Optimized Extraction
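A sketch that enables the readability transformer and strips non-content elements, per the content-filtering options documented below (placeholder URL and selectors):

```json
{
  "query": "https://example.com/news/article",
  "scrapingTool": "browser-playwright",
  "htmlTransformer": "readableText",
  "removeElementsCssSelector": ".advertisement, .sidebar, .footer",
  "removeCookieWarnings": true
}
```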
For your convenience, output results are displayed in tables and tabs. You can choose to download the results in JSON format.
Each scraped page will output the following data:
- Crawl Information (crawl)
- Debug Information (debug)
- Search Result (searchResult)
- Metadata (metadata)
- Content Output
JSON Example:
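Only the top-level groups (crawl, debug, searchResult, metadata, plus the content field) come from this document; the individual field names below are illustrative assumptions of what a result record might contain:

```json
{
  "crawl": {
    "httpStatusCode": 200,
    "loadedUrl": "https://example.com/article"
  },
  "searchResult": {
    "title": "Example Article",
    "description": "Result snippet from Google",
    "url": "https://example.com/article"
  },
  "metadata": {
    "title": "Example Article",
    "languageCode": "en"
  },
  "markdown": "# Example Article\n\nPage content..."
}
```

The debug group is included only when debugMode is enabled.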
What is the difference between the raw-http and browser-playwright modes?

raw-http mode: sends plain HTTP requests and parses the returned HTML. It is fast and lightweight, but cannot execute JavaScript.

browser-playwright mode: renders each page in a full browser, so JavaScript-driven content loads before extraction. It is slower, but handles dynamic sites.
Markdown - Recommended Format
Plain Text
HTML
Recommendation: Use Markdown for RAG applications, Plain Text for text analysis, and HTML for exact structure requirements.
The desiredConcurrency parameter controls the number of pages scraped simultaneously:
| Concurrency | Use Case | Notes |
|---|---|---|
| 1-3 | Low-frequency scraping, site-friendly | Recommended default |
| 4-7 | High-frequency, performance-priority | Monitor for rate limits |
| 8-10 | Large bulk scraping | May trigger anti-scraping mechanisms |
Recommendation: Start with 3 and adjust based on the target site's response.
For dynamic content requiring JavaScript rendering:

- Use browser-playwright mode
- Set the dynamic content wait time with the dynamicContentWaitSecs parameter
- Verify content loading by enabling debugMode to see loading details

Filter content using the following methods:

- Remove Cookie Warnings: set removeCookieWarnings: true
- Custom Element Filtering: use the removeElementsCssSelector parameter (e.g. .advertisement, .sidebar, .footer)
- Readability Extraction: set htmlTransformer: "readableText"

Explore more popular scrapers from our marketplace
by CoreClaw
It queries the Google search engine by keyword and returns a structured SERP summary, including the final search parameters, organic results, related queries, and people-also-ask data.
by Odin Kael
Dedup Datasets Worker is a powerful tool for merging and deduplicating datasets from multiple JSON/JSONL files. Fully optimized for the CafeScraper platform with enhanced features and robust error handling.
by Odin Kael
A powerful Google Sheets data import/export tool designed for data synchronization, backup, and integration between Google Sheets and external systems. It supports three operation modes, two authentication methods, batch processing, data deduplication, and automatic backup.
by Odin Kael
A high-speed static page scraper based on Cheerio, designed specifically for static HTML pages. Uses Cheerio for HTML parsing, delivering speeds 10-50 times faster than full browser rendering.