
A complete guide for automated data scraping script development, covering file structure, SDK core functions, code examples, and FAQs.

| File Name | Description |
|---|---|
| main.py | Script entry file (execution entry point), named main |
| requirements.txt | Python dependency management file |
| input_schema.json | UI input form configuration file |
| README.md | Project documentation file |
| sdk.py | SDK core functionality module |
| sdk_pb2.py | Enhanced data processing module |
| sdk_pb2_grpc.py | Network communication module |

The following three SDK files must be placed in the script root directory:

| File Name | Core Functionality |
|---|---|
| sdk.py | Basic functionality module |
| sdk_pb2.py | Enhanced data processing module |
| sdk_pb2_grpc.py | Network communication module |

These three files form the script's "toolkit", providing all core functions required for interacting with the mid-platform system and running web crawlers.
Retrieve external configuration parameters at script startup (e.g., target website URL, search keywords):
Use Case: When scraping data from different websites, pass different parameters without modifying the code.
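
A minimal sketch of how a script might handle those parameters once they have been retrieved. The SDK call that returns them is not shown in this guide, so the sketch starts from a plain dictionary. The function name `resolve_params`, the keys `url`, `keyword`, and `max_items`, and the default values are all hypothetical.

```python
from typing import Any, Dict

# Hypothetical defaults; the real parameter names come from input_schema.json.
DEFAULTS: Dict[str, Any] = {"keyword": "", "max_items": 10}


def resolve_params(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Merge externally supplied parameters over defaults and validate them."""
    params = {**DEFAULTS, **raw}
    if not params.get("url"):
        raise ValueError("missing required parameter: url")
    # Form inputs often arrive as strings, so coerce the numeric field.
    params["max_items"] = int(params["max_items"])
    return params


# In a real script, `raw` would come from the SDK's parameter-retrieval call.
raw = {"url": "https://example.com/search", "max_items": "25"}
params = resolve_params(raw)
print(params["url"], params["keyword"], params["max_items"])
```

Validating parameters once at startup keeps the scraping logic free of per-site constants, which is what lets the same script run against different targets.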
Log messages at different levels during script execution, displayed in the mid-platform interface:

| Log Level | Description |
|---|---|
| debug | Most detailed debugging info, suitable for development |
| info | Normal process log, recommended for key steps |
| warn | Warning message, indicates potential issues without stopping |
| error | Error message, indicates critical issues requiring attention |

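
To illustrate when each level applies, here is a sketch that uses Python's standard `logging` module as a stand-in; it is not the SDK's own logging interface, which this guide does not show. Note that the standard library spells the warn level `warning`. The function `parse_price` and its inputs are hypothetical.

```python
import logging

# Stand-in logger built on Python's standard logging module.
logger = logging.getLogger("scraper_sketch")
logger.setLevel(logging.DEBUG)  # emit every level, including debug
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger.addHandler(handler)


def parse_price(text: str):
    """Parse a price string, logging at the level that matches each outcome."""
    logger.debug("raw price text: %r", text)  # debug: development detail
    if not text.strip():
        logger.warning("price is empty, skipping")  # warn: issue, run continues
        return None
    try:
        value = float(text.replace("$", "").replace(",", ""))
    except ValueError:
        logger.error("cannot parse price: %r", text)  # error: needs attention
        return None
    logger.info("parsed price %.2f", value)  # info: key step completed
    return value


prices = [parse_price(t) for t in ["$1,299.00", "", "n/a"]]
```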
After scraping data, return it to the mid-platform system in two steps:
Define the table structure (similar to setting Excel column headers):
Field Description:

| Field | Description |
|---|---|
| label | Column header displayed in the table (user-visible) |
| key | Unique data identifier (used in code, lowercase English + underscore recommended) |
| format | Data type: text / integer / boolean / array / object |

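
A sketch of what a header definition could look like, assuming headers are expressed as a list of dictionaries carrying the three fields above. The SDK call that registers the headers is not shown here, and the column names and the `validate_headers` helper are hypothetical.

```python
ALLOWED_FORMATS = {"text", "integer", "boolean", "array", "object"}

# Hypothetical columns for illustration.
headers = [
    {"label": "Title", "key": "title", "format": "text"},
    {"label": "Like Count", "key": "like_count", "format": "integer"},
    {"label": "Tags", "key": "tags", "format": "array"},
]


def validate_headers(headers):
    """Return a list of problems found in a header definition (empty if valid)."""
    problems = []
    seen = set()
    for i, h in enumerate(headers):
        missing = {"label", "key", "format"} - set(h)
        if missing:
            problems.append(f"header {i}: missing {sorted(missing)}")
            continue
        if h["key"] in seen:
            problems.append(f"header {i}: duplicate key {h['key']!r}")
        seen.add(h["key"])
        if h["format"] not in ALLOWED_FORMATS:
            problems.append(f"header {i}: unknown format {h['format']!r}")
    return problems


print(validate_headers(headers))
```

Checking the definition locally before a run catches duplicate keys and misspelled formats early.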
After setting headers, start pushing scraped data:
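
Because keys used when pushing data must match the header keys exactly, including case, a script can check each row before pushing it. The sketch below shows only that check; the actual push call belongs to the SDK and is not reproduced here. The helper `check_row` and the sample rows are hypothetical.

```python
headers = [
    {"label": "Title", "key": "title", "format": "text"},
    {"label": "Like Count", "key": "like_count", "format": "integer"},
]


def check_row(row, headers):
    """Raise ValueError unless the row's keys exactly match the header keys."""
    expected = {h["key"] for h in headers}
    actual = set(row)
    if actual != expected:
        missing = sorted(expected - actual)
        extra = sorted(actual - expected)
        raise ValueError(f"row keys mismatch: missing={missing}, extra={extra}")
    return row


rows = [
    {"title": "First post", "like_count": 120},
    {"Title": "Second post", "like_count": 45},  # wrong case: "Title" != "title"
]

accepted, rejected = [], []
for row in rows:
    try:
        accepted.append(check_row(row, headers))  # a real script would push here
    except ValueError as exc:
        rejected.append(str(exc))
```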
Important Reminders:
Specify all third-party Python packages and their versions required to run the script:

| Format | Description |
|---|---|
| package==version | Install specific version for environment consistency |
| package | No version specified, auto-install latest version |

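
As an illustration of the two formats, the sketch below scans the text of a requirements file and lists the packages that have no exact version. The package names and version numbers are examples only, and `unpinned_packages` is a hypothetical helper, not part of the platform.

```python
def unpinned_packages(requirements_text: str):
    """Return package names that are listed without an exact == version."""
    unpinned = []
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


# Example file contents: two pinned packages and one unpinned.
example = """\
requests==2.31.0
beautifulsoup4==4.12.3
lxml
"""
print(unpinned_packages(example))
```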
| Stage | Description |
|---|---|
| 1. Receive Instructions | Get input parameters (e.g., target URL, collection quantity) |
| 2. Proxy Setup | Configure proxy server to access restricted websites |
| 3. Auto Execution | Automatically scrape target page information based on parameters |
| 4. Report Results | Convert unstructured data to standard format, generate table |
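
The four stages can be sketched as a single function. This is an outline under assumptions, not the platform's actual main.py: `fetch` and `report` are injected placeholders for the real scraping and SDK reporting calls, and the parameter names `url`, `max_items`, and `proxy_url` are hypothetical. The proxy mapping uses the `{"http": ..., "https": ...}` shape accepted by the `requests` library.

```python
from typing import Callable, Dict, List, Optional


def build_proxies(proxy_url: Optional[str]) -> Optional[Dict[str, str]]:
    """Stage 2: build a proxy mapping in the shape the requests library accepts."""
    if not proxy_url:
        return None
    return {"http": proxy_url, "https": proxy_url}


def run(params: Dict, fetch: Callable, report: Callable) -> int:
    """Run the four stages; fetch and report are injected placeholders."""
    # Stage 1: receive instructions.
    url = params["url"]
    limit = int(params.get("max_items", 10))
    # Stage 2: proxy setup.
    proxies = build_proxies(params.get("proxy_url"))
    # Stage 3: automatic execution.
    raw_items: List[Dict] = fetch(url, proxies)[:limit]
    # Stage 4: report results in a uniform shape.
    rows = [{"title": str(item.get("title", "")).strip()} for item in raw_items]
    for row in rows:
        report(row)
    return len(rows)


# Usage with stand-ins instead of real network and SDK calls.
def fake_fetch(url, proxies):
    return [{"title": "  A "}, {"title": "B"}, {"title": "C"}]


collected = []
count = run({"url": "https://example.com", "max_items": 2}, fake_fetch, collected.append)
```

Injecting `fetch` and `report` keeps the stage order testable without network access or the SDK files.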
Pinning versions ensures that the same package versions are used across different environments (dev, test, prod), which avoids inconsistent behavior or compatibility issues caused by version differences.
If no version is specified, the system installs the latest version, which may be incompatible with the script. Pin the versions of core dependencies.
To add a dependency, add a new line to requirements.txt in the format package==version or package, then re-upload the zip archive.
If dependency installation fails, check network connectivity or try switching Python package mirrors. Contact the system admin if issues persist.
The three SDK files (sdk.py, sdk_pb2.py, sdk_pb2_grpc.py) must be placed in the script root directory (the folder containing main).
Use SDK or CoreSDK directly in code to call related functions.
Keys used when pushing data must exactly match those defined in headers (case-sensitive).