Web Crawler Connector
Crawl a site or section of a site into RetainDB when a single URL is too narrow but a full unconstrained crawl would be messy.
Use the web crawler connector when you need multiple related pages from the same site and you want RetainDB to follow links for you.
This is the right tool for a docs section, help center, or small internal website. It is the wrong tool for “crawl the whole internet starting from our homepage.”
Use this connector when
- a single page is not enough
- the site has a clear boundary you can describe
- you want crawl-based discovery instead of a sitemap file
Create the source
curl -X POST "https://api.retaindb.com/v1/projects/proj_123/sources" \
  -H "Authorization: Bearer $RETAINDB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Acme Docs Crawl",
    "connector_type": "web",
    "config": {
      "start_url": "https://docs.acme.com",
      "max_depth": 3,
      "allow_paths": ["/guides", "/reference"]
    }
  }'
Why allow_paths matters
Without a boundary, crawls get noisy fast.
Use path constraints and a reasonable max_depth so you ingest the part of the site you actually want instead of navigation clutter, changelogs, or irrelevant marketing pages.
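To make the effect of a boundary concrete, here is a minimal Python sketch of the kind of check a bounded crawl implies. It is not RetainDB's implementation: the function name in_crawl_boundary is hypothetical, and it assumes allow_paths entries match as path prefixes on the start URL's host, which this page does not specify.

```python
from urllib.parse import urlsplit


def in_crawl_boundary(url, start_url, allow_paths, depth, max_depth):
    """Return True if a discovered link would fall inside the boundary.

    ASSUMPTION (not confirmed by the RetainDB docs): allow_paths entries
    are treated as path prefixes on the same host as start_url.
    """
    if depth > max_depth:
        return False  # too many link hops away from start_url
    target, start = urlsplit(url), urlsplit(start_url)
    if target.netloc != start.netloc:
        return False  # different host: outside the site entirely
    path = target.path or "/"
    # Match "/guides" and "/guides/...", but not "/guides-old".
    return any(
        path == prefix or path.startswith(prefix.rstrip("/") + "/")
        for prefix in allow_paths
    )
```

Under that assumption, with the config above, /guides/setup at depth 1 would be kept, while /changelog and /guides-old/x would be skipped.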
Start sync and monitor it
curl -X POST "https://api.retaindb.com/v1/sources/src_123/sync" \
  -H "Authorization: Bearer $RETAINDB_API_KEY"

curl "https://api.retaindb.com/v1/sources/src_123/status" \
  -H "Authorization: Bearer $RETAINDB_API_KEY"
What a good first crawl looks like
Start smaller than you think:
- one docs section
- low depth
- explicit allow paths
Once retrieval looks good, expand the crawl boundary.
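A small first crawl can also be scripted. The sketch below builds a deliberately narrow config and polls the status endpoint shown earlier. The helper names (first_crawl_payload, fetch_status, wait_for_sync) are hypothetical, and the status response shape is an assumption: this page does not document it, so the code assumes a JSON object with a "status" field whose terminal values are "completed" or "failed". Adjust both to the real response.

```python
import json
import os
import time
import urllib.request

API = "https://api.retaindb.com/v1"


def first_crawl_payload(name, start_url, section, max_depth=1):
    """Build a narrow source config: one section, low depth."""
    return {
        "name": name,
        "connector_type": "web",
        "config": {
            "start_url": start_url,
            "max_depth": max_depth,
            "allow_paths": [section],
        },
    }


def fetch_status(source_id):
    """GET /sources/{id}/status and decode the JSON body."""
    request = urllib.request.Request(
        f"{API}/sources/{source_id}/status",
        headers={"Authorization": f"Bearer {os.environ['RETAINDB_API_KEY']}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_sync(source_id, fetch=fetch_status, interval=5.0, max_polls=60):
    """Poll until an assumed terminal status is seen or polls run out."""
    for _ in range(max_polls):
        body = fetch(source_id)
        # ASSUMPTION: the "status" field and these values are not documented here.
        if body.get("status") in ("completed", "failed"):
            return body
        time.sleep(interval)
    raise TimeoutError(f"sync for {source_id} did not finish in time")
```

Passing fetch as a parameter keeps the polling loop testable without network access; in real use, call wait_for_sync("src_123") after starting the sync.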
Common mistakes
Crawl explosion
If the site fans out quickly, the crawler may spend effort on pages you do not care about. Tighten allow_paths first.
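A rough upper bound shows how quickly depth compounds. The sketch below is plain arithmetic, not a RetainDB feature: if every page links to links_per_page new in-boundary pages (a simplifying assumption that ignores duplicates), a crawl to depth d can reach at most 1 + b + b^2 + ... + b^d pages.

```python
def worst_case_pages(links_per_page, max_depth):
    """Upper bound on pages reached: sum of links_per_page**d for d in 0..max_depth."""
    return sum(links_per_page ** d for d in range(max_depth + 1))
```

With 20 new links per page, depth 2 is bounded by 421 pages and depth 3 by 8,421. Tightening allow_paths lowers the number of links followed per page, which shrinks every term of the sum at once.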
Duplicate or low-value pages
If the site has many similar pages, navigation shells, or printer views, they dilute retrieval results. Narrow allow_paths so those pages fall outside the crawl boundary.
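One way to spot this before expanding a crawl is to group a list of candidate URLs by a normalized form. The sketch below is a client-side audit aid, not something the connector is documented to do; the function names and the query parameters treated as noise are illustrative assumptions.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# ASSUMPTION: example query parameters that do not change page content.
NOISE_PARAMS = {"print", "utm_source", "utm_medium", "utm_campaign"}


def normalize(url):
    """Drop fragments, noise parameters, and trailing slashes."""
    parts = urlsplit(url)
    query = urlencode(
        [(key, value)
         for key, value in parse_qsl(parts.query, keep_blank_values=True)
         if key not in NOISE_PARAMS]
    )
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, query, ""))


def duplicate_groups(urls):
    """Return {normalized_url: [original urls]} for groups with more than one URL."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}
```

If many URLs collapse into the same group, the site is likely serving the same content under several addresses.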
Using it for JS-heavy pages
If critical content only appears after client-side rendering, the AI Browser connector may be a better fit.
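A quick way to check is to see whether a phrase you expect on the page is present in the raw HTML, before any JavaScript runs. The following sketch uses only the Python standard library; visible_text and appears_without_js are hypothetical helper names, and the check is a heuristic rather than a description of how either connector works.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the page text with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def appears_without_js(html, phrase):
    """True if phrase is in the server-rendered text (case-insensitive)."""
    return phrase.lower() in visible_text(html).lower()
```

Fetch the page with curl, pass the response body as html, and compare against what you see in a browser. If the phrase is missing from the raw HTML but visible in the browser, the content is likely rendered client-side.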
Next step
If the site already has a reliable sitemap, use the sitemap connector instead. If you only need one page, go back to the URL connector.