Crawling website content
This page explains how the crawl pipeline discovers public pages, fetches clean text, and refreshes the content that powers the knowledge base.
What does crawling do?
The crawl pipeline discovers and fetches pages from your site, then converts them into clean text that powers Hi, I'm Kai answers. Crawling is the first step in building a knowledge base from existing website content.
The crawler follows public URLs from the root page and respects exclusion rules. It stores a status for each URL so you can see which pages were discovered, fetched, skipped, or failed.
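The discovery behavior described above can be sketched as a breadth-first walk that records a status per URL. This is only an illustration, not the product's implementation: the `discover` function, the `example-crawler` user agent, the in-memory `site_links` map, and the same-host restriction are all assumptions made for the example. The robots.txt check uses Python's standard `urllib.robotparser`.

```python
from collections import deque
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def discover(root, links, robots_txt, user_agent="example-crawler"):
    """Walk links breadth-first from root and assign each URL a status.

    `links` is a hypothetical in-memory map of page URL -> outgoing URLs,
    standing in for real HTTP fetches and HTML parsing.
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    root_host = urlparse(root).netloc

    status = {}
    queue = deque([root])
    while queue:
        url = queue.popleft()
        if url in status:
            continue  # already seen
        if urlparse(url).netloc != root_host:
            continue  # assumption: stay on the site being crawled
        if not rules.can_fetch(user_agent, url):
            status[url] = "skipped"  # excluded by robots.txt
            continue
        status[url] = "discovered"  # found, not yet fetched
        queue.extend(links.get(url, []))
    return status


site_links = {
    "https://example.com/": [
        "https://example.com/pricing",
        "https://example.com/admin/login",
    ],
    "https://example.com/pricing": ["https://example.com/"],
}
robots = "User-agent: *\nDisallow: /admin/\n"
result = discover("https://example.com/", site_links, robots)
for url, state in result.items():
    print(state, url)
```

In this sketch the `/admin/login` page ends up as skipped because the sample robots.txt disallows `/admin/`, while the other two pages are marked discovered and would move on to fetching.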
How do I start a crawl?
- Go to the Crawl page.
- If your site is not listed yet, click Add site and enter its URL.
- Click Start crawl. The system begins discovering pages from the root URL.
After the crawl starts, review the status list and let the process finish before making large knowledge base edits.
What do crawl statuses mean?
The crawl status table has two columns: the status value stored by the crawler and the meaning of that value for the page.
| Status | Meaning |
|---|---|
| discovered | URL found but not yet fetched. |
| fetching | Page is being downloaded. |
| fetched | Content successfully retrieved. |
| skipped | Page excluded by robots.txt or crawl rules. |
| failed | Fetch attempt failed because of a timeout, 4xx response, or 5xx response. |
Healthy crawls end with most important pages in the fetched state. Investigate repeated failed statuses on pages that should be answer sources.
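If you keep a copy of the status list, a review like the one described above might look like the following sketch. The `review_crawl` function, the `statuses` dictionary, and the list of answer-source URLs are hypothetical; the product does not necessarily expose its data in this shape. The sketch inspects a single crawl, so spotting repeated failures would mean comparing the result across several runs.

```python
from collections import Counter


def review_crawl(statuses, answer_sources):
    """Count pages per status and list answer-source pages that failed."""
    counts = Counter(statuses.values())
    needs_attention = sorted(
        url for url in answer_sources if statuses.get(url) == "failed"
    )
    return counts, needs_attention


# Hypothetical status list copied from one crawl.
statuses = {
    "https://example.com/": "fetched",
    "https://example.com/pricing": "failed",
    "https://example.com/hours": "fetched",
    "https://example.com/admin/login": "skipped",
}
answer_sources = ["https://example.com/pricing", "https://example.com/hours"]

counts, needs_attention = review_crawl(statuses, answer_sources)
print(dict(counts))
print(needs_attention)
```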
How do I re-crawl changed content?
Re-run a crawl any time your content changes. Stale content is replaced during re-ingestion so outdated information does not persist in the knowledge base.
For high-impact content such as pricing, hours, or policies, re-crawl immediately after publishing the website update.
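To illustrate what replacing stale content during re-ingestion could mean, here is a minimal sketch that treats the knowledge base as a dictionary keyed by URL. The `reingest` function and the data layout are assumptions for the example only; the actual storage and replacement logic are not described on this page.

```python
def reingest(knowledge_base, fresh_pages):
    """Overwrite stored text with freshly crawled text, keyed by URL.

    Returns the URLs whose stored content was added or replaced.
    """
    replaced = []
    for url, text in fresh_pages.items():
        if knowledge_base.get(url) != text:
            knowledge_base[url] = text  # the stale entry does not persist
            replaced.append(url)
    return replaced


# Hypothetical stored content and the result of a re-crawl.
knowledge_base = {
    "https://example.com/pricing": "Plans start at $10 per month.",
    "https://example.com/hours": "Open 9 to 5, Monday to Friday.",
}
fresh_pages = {
    "https://example.com/pricing": "Plans start at $12 per month.",
    "https://example.com/hours": "Open 9 to 5, Monday to Friday.",
}

replaced = reingest(knowledge_base, fresh_pages)
print(replaced)
```

Only the pricing page changes here, so it is the only entry replaced; the unchanged hours page is left as stored.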