Configuring Sitecore Search Document Extractor: A Step-by-Step Guide
Document extraction in Sitecore Search is the process of crawling the pages of a website and extracting specific data or DOM elements from them. The extracted content is converted into a structured format that can be processed and indexed for search.
This blog is a step-by-step guide to setting up Document Extractors in Sitecore Search. Below are some of the common key attributes that are required for search to work properly:
- Title or Name of the page
- Content/Description of the pages
- Meta tags
- Key elements from page components that should be included in the search index
The steps below configure a JavaScript Document Extractor that extracts attribute values for an advanced web crawler or an API crawler:
- In the Customer Engagement Console (CEC), click Sources and open the custom Source. On the Source Settings page, go to Document Extractors and click Edit on the right.
- Add the extractor: enter a name (for example, Demo JavaScript Extractor) and select JS as the Extractor type.
- In the Taggers section, click Add Tagger. The extractor function must use Cheerio syntax and must return an array of objects.
The tagger editor comes pre-populated with a sample JS script. The sample already extracts the following fields from the pages of the website:
- title
- description
- searchtitle meta tag
- Open Graph type
- Open Graph URL
- Open Graph description tag
- Language is not part of the sample, but it can be extracted in a similar way from the page body or the URL
// Sample extractor function. Change the function to suit your individual needs
function extract(request, response) {
    $ = response.body;
    return [{
        'description': $('meta[name="description"]').attr('content') || $('meta[property="og:description"]').attr('content') || $('p').text(),
        'name': $('meta[name="searchtitle"]').attr('content') || $('title').text(),
        'type': $('meta[property="og:type"]').attr('content') || 'website_content',
        'url': $('meta[property="og:url"]').attr('content')
    }];
}
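Language is not covered by the sample script. As a minimal sketch of the URL-based approach, the helper below reads a language code from the first path segment. The function name languageFromUrl, the fallback value 'en', and the assumption that the site places the language at the start of the path (for example /en/blogs) are all hypothetical and not taken from Sitecore's sample.

```javascript
// Hypothetical helper: derive a language code from the first path segment,
// e.g. https://www.example.com/en/blogs/post -> 'en'.
// Assumes a two-letter language, optionally followed by a region ('en-us'),
// at the start of the path. Adjust the pattern to your own URL scheme.
function languageFromUrl(url, fallback = 'en') {
    // Split the path into non-empty segments and take the first one.
    const firstSegment = new URL(url).pathname.split('/').filter(Boolean)[0] || '';
    const match = /^[a-z]{2}(-[a-z]{2})?$/i.exec(firstSegment);
    // Fall back to a default when the path carries no language segment.
    return match ? match[0].toLowerCase() : fallback;
}
```

Inside the extractor, a value read from the page body, such as $('html').attr('lang'), could be tried first, with a helper like this as the fallback. How the page URL is exposed to the function (for example as a property of request) should be confirmed against the Sitecore documentation.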
Conditional logic can be implemented to set key variables based on the page URL:
if (url.includes('/blogs')) {
    type = 'Blogs';
} else if (url.includes('about-us')) {
    type = 'About Us';
} else ..
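As the number of URL patterns grows, the if/else chain can be replaced with a small lookup table. The sketch below is one possible way to do that. The names TYPE_RULES and typeFromUrl are hypothetical, the two rules are the ones from the snippet above, and the default value is borrowed from the sample extractor's 'website_content' fallback.

```javascript
// Ordered list of [URL fragment, type] pairs; the first match wins.
// Only these two rules come from the article; add your own as needed.
const TYPE_RULES = [
    ['/blogs', 'Blogs'],
    ['about-us', 'About Us'],
];

// Hypothetical helper: map a page URL to a content type.
// The default mirrors the fallback used in the sample extractor.
function typeFromUrl(url, defaultType = 'website_content') {
    const rule = TYPE_RULES.find(([fragment]) => url.includes(fragment));
    return rule ? rule[1] : defaultType;
}
```

The result can then be used as the 'type' value in the object returned by extract. Because the first matching rule wins, more specific fragments should be listed before more general ones.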
For the extracted attributes to work properly in search, they must also be configured under Textual Relevance in Domain Configuration and in Global Widget Settings. The following article describes in detail how to configure attributes for Textual Relevance:
https://sitecorebasics5.blogspot.com/2023/10/sitecore-search-how-to-configure.html
Other extraction methods are also available, such as XPath and CSS document extractors. Choose the one that best fits your requirements.
Reference:
https://doc.sitecore.com/search/en/users/search-user-guide/configuring-document-extractors.html