A high-performance web scraper for RAG and AI, featuring Google search integration, dual-mode extraction (HTTP/Browser), and multi-format output.
A RAG web browser is an automated content-extraction tool that provides real-time web search and scraping for Retrieval-Augmented Generation (RAG) pipelines and AI applications. With CoreClaw you can obtain structured web content with zero code, powering AI chatbots, knowledge-base construction, content aggregation, and data mining.
Two scraping engines are available: raw-http (fast HTTP requests) and browser-playwright (full browser rendering).

| 🔍 Google Search Results | 📄 Page Titles & Descriptions |
|---|---|
| 📝 Markdown Formatted Content | 📄 Plain Text Content |
| 🌐 Raw HTML Content | 🏷️ Page Metadata |
| 🌍 Language Identification | ⏱️ Scraping Performance Metrics |
| 📊 HTTP Status Codes | 🔗 Page URL Information |
CoreClaw RAG Web Browser handles proxy rotation, task scheduling, concurrency control, and data standardization for you in the background. In just a few minutes, you can get your data by following these steps:
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| query | string | - | - | Required. Search keyword or direct URL |
| maxResults | number | 3 | 1-100 | Maximum number of search results |
| outputFormat | string | "markdown" | text/markdown/html | Output format |
| scrapingTool | string | "raw-http" | raw-http/browser-playwright | Scraping engine |
| requestTimeoutSecs | number | 40 | 1-300 | Request timeout in seconds |
| serpMaxRetries | number | 2 | 0-5 | Number of Google search retries |
| maxRequestRetries | number | 1 | 0-3 | Number of target page retries |
| dynamicContentWaitSecs | number | 10 | 0-60 | Wait time for dynamic content |
| desiredConcurrency | number | 3 | 1-10 | Number of parallel scraping operations |
| removeCookieWarnings | boolean | true | - | Automatically remove cookie popups |
| htmlTransformer | string | "none" | none/readableText | HTML content transformation |
| removeElementsCssSelector | string | - | - | CSS selector for elements to remove |
| debugMode | boolean | false | - | Enable debug logs and metrics |
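As a reference, the parameters above combine into a single JSON input. The sketch below shows each documented default spelled out explicitly (query is required and has no default; its value here is a placeholder, and removeElementsCssSelector is omitted because it has no default):

```json
{
  "query": "retrieval-augmented generation",
  "maxResults": 3,
  "outputFormat": "markdown",
  "scrapingTool": "raw-http",
  "requestTimeoutSecs": 40,
  "serpMaxRetries": 2,
  "maxRequestRetries": 1,
  "dynamicContentWaitSecs": 10,
  "desiredConcurrency": 3,
  "removeCookieWarnings": true,
  "htmlTransformer": "none",
  "debugMode": false
}
```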
Example 1: Scraping based on Google Search
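A minimal input sketch for keyword-driven scraping, using only parameters from the table above (the query value is a placeholder):

```json
{
  "query": "vector database comparison",
  "maxResults": 5,
  "outputFormat": "markdown"
}
```

The top 5 Google results for the keyword are scraped and returned as Markdown.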
Example 2: Direct Scraping of a Specific URL
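Per the parameter table, query also accepts a direct URL. A sketch for scraping a single known page (placeholder URL):

```json
{
  "query": "https://example.com/blog/post",
  "maxResults": 1,
  "outputFormat": "markdown"
}
```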
Example 3: Concurrent Multi-Page Scraping
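A sketch combining a larger result count with a higher desiredConcurrency to scrape several pages in parallel (values chosen from the documented ranges):

```json
{
  "query": "open source web scraping",
  "maxResults": 10,
  "desiredConcurrency": 5,
  "scrapingTool": "raw-http"
}
```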
Example 4: Readability-Optimized Extraction
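A sketch that enables the readability transformer and strips non-content elements, per the content-filtering options documented below (placeholder URL and selectors):

```json
{
  "query": "https://example.com/news/article",
  "scrapingTool": "browser-playwright",
  "htmlTransformer": "readableText",
  "removeElementsCssSelector": ".advertisement, .sidebar, .footer",
  "removeCookieWarnings": true
}
```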
For your convenience, output results are displayed in tables and tabs. You can choose to download the results in JSON format.
Each scraped page will output the following data:
- Crawl Information (crawl)
- Debug Information (debug)
- Search Result (searchResult)
- Metadata (metadata)
- Content Output
JSON Example:
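Only the top-level groups (crawl, debug, searchResult, metadata, plus the content field) come from this document; the individual field names below are illustrative assumptions of what a result record might contain:

```json
{
  "crawl": {
    "httpStatusCode": 200,
    "loadedUrl": "https://example.com/article"
  },
  "searchResult": {
    "title": "Example Article",
    "description": "Result snippet from Google",
    "url": "https://example.com/article"
  },
  "metadata": {
    "title": "Example Article",
    "languageCode": "en"
  },
  "markdown": "# Example Article\n\nPage content..."
}
```

The debug group is included only when debugMode is enabled.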
What is the difference between the raw-http and browser-playwright modes?

raw-http mode: sends plain HTTP requests and parses the returned HTML. It is fast and lightweight, but cannot execute JavaScript.

browser-playwright mode: renders each page in a full browser, so JavaScript-driven content loads before extraction. It is slower, but handles dynamic sites.
Markdown - Recommended Format
Plain Text
HTML
Recommendation: Use Markdown for RAG applications, Plain Text for text analysis, and HTML for exact structure requirements.
The desiredConcurrency parameter controls the number of pages scraped simultaneously:
| Concurrency | Use Case | Notes |
|---|---|---|
| 1-3 | Low-frequency scraping, site-friendly | Recommended default |
| 4-7 | High-frequency, performance-priority | Monitor for rate limits |
| 8-10 | Large bulk scraping | May trigger anti-scraping mechanisms |
Recommendation: Start with 3 and adjust based on the target site's response.
For dynamic content requiring JavaScript rendering:

- Use browser-playwright mode
- Set the dynamic content wait time with the dynamicContentWaitSecs parameter
- Verify content loading by enabling debugMode to see loading details

Filter content using the following methods:

- Remove Cookie Warnings: set removeCookieWarnings: true
- Custom Element Filtering: use the removeElementsCssSelector parameter (e.g. .advertisement, .sidebar, .footer)
- Readability Extraction: set htmlTransformer: "readableText"

Explore more popular scrapers from our marketplace
by CoreClaw
It queries the Google search engine by keyword and returns a structured SERP summary, including the final search parameters, organic results, related queries, and people-also-ask data.
by Odin Kael
Dedup Datasets Worker is a powerful tool for merging and deduplicating datasets from multiple JSON/JSONL files. Fully optimized for the CafeScraper platform with enhanced features and robust error handling.
by Odin Kael
A powerful Google Sheets data import/export tool designed for data synchronization, backup, and integration between Google Sheets and external systems. It supports three operation modes, two authentication methods, batch processing, data deduplication, and automatic backup.
by Odin Kael
A high-speed static page scraper based on Cheerio, designed specifically for static HTML pages. Uses Cheerio for HTML parsing, delivering speeds 10-50 times faster than full browser rendering.