Troubleshooting Sitecore Search Crawling Failures: A Step-by-Step Guide
Sitecore Search offers the following pull sources:
Web crawler - a tool that crawls your content by starting from a point and following hyperlinks.
Advanced web crawler - a powerful and highly customizable crawler that crawls your content and adds it to an index.
API crawler - a crawler specifically designed to crawl API endpoints that return JSON.
- This issue can arise when the crawler parses your source and finds that it is not in the expected format.
- For example, if the source is sitemap.xml and it does not render as well-formed XML, the crawl will fail.
- To prevent this, ensure that your sitemap (https://site.com/sitemap.xml) is always correctly formatted.
- Rerun the crawl and index, and check whether the job runs to completion. Navigate to Sources in the CEC, find the source, and click the "Recrawl and reindex" link.
- There could be an issue with the Sitecore Search platform itself, in which case reach out to Sitecore Support via a ticket.
- We recently faced an issue where Sitecore Search crawling started to fail intermittently with the error "Job failed due to heartbeat error". Sitecore Support confirmed that there was an ongoing issue behind the heartbeat error and promptly released a new version with the fix.
- An admin or developer may have made a change shortly before the crawling started failing. If the scripts in the document extractors start throwing errors, the crawling job will be affected.
- One option is to undo the recent change and see whether the issue is fixed and the crawling succeeds again.
- If it is, further troubleshooting of the changes to the document extractor scripts may be required.
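
The sitemap format check mentioned earlier can be run outside Sitecore Search before triggering a recrawl. The following is a minimal sketch in Python using only the standard library. It is not part of Sitecore Search, and the function names check_sitemap_text and check_sitemap_url are made up for this illustration. It assumes the sitemap uses a urlset or sitemapindex root element, as the sitemaps.org protocol describes, and it only checks that the XML is well-formed and contains at least one loc entry. It does not reproduce whatever validation the crawler itself performs.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Root elements allowed by the sitemaps.org protocol.
SITEMAP_ROOTS = {"urlset", "sitemapindex"}


def _local_name(tag):
    # Drop a namespace prefix such as
    # "{http://www.sitemaps.org/schemas/sitemap/0.9}urlset" -> "urlset".
    return tag.rsplit("}", 1)[-1]


def check_sitemap_text(xml_text):
    """Return (ok, message) for a sitemap passed as str or bytes.

    Hypothetical helper for illustration; not a Sitecore Search API.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Malformed XML (unclosed tag, stray characters, HTML error page, ...).
        return False, f"not well-formed XML: {exc}"

    root_name = _local_name(root.tag)
    if root_name not in SITEMAP_ROOTS:
        return False, f"unexpected root element <{root_name}>"

    locs = [el for el in root.iter() if _local_name(el.tag) == "loc"]
    if not locs:
        return False, "no <loc> entries found"
    return True, f"{len(locs)} <loc> entries found"


def check_sitemap_url(url, timeout=10):
    """Download the sitemap at `url` and run the same checks on it."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return check_sitemap_text(resp.read())
```

Calling check_sitemap_url("https://site.com/sitemap.xml") returns a pair such as (False, "not well-formed XML: ...") when the sitemap is broken. That makes it easy to confirm a source format problem before raising a support ticket or reverting a document extractor change.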