Friday, November 3, 2023

Configuring Sitecore Search Document Extraction: A Step-by-Step Guide

Document extraction in Sitecore Search refers to crawling the pages of a website and extracting specific data or DOM elements from them. The extracted content is converted into a structured format that can be processed and indexed for search.

This post is a step-by-step guide to setting up document extractors in Sitecore Search. Below are some of the common key attributes that are required for search to work properly:

Title or Name of the page
Content/Description of the pages
Meta tags
Key elements from page components that should be included in search

Here is a quick walkthrough of configuring a JavaScript document extractor to extract attribute values for an advanced web crawler or an API crawler:

In the CEC portal, click Sources and then click the custom source. In Source Settings, go to Document Extractors and click Edit on the right.
Next, add the extractor: enter a name, e.g. Demo JavaScript Extractor, and select JS as the extractor type.
In the Taggers section, click Add Tagger. The function must use Cheerio syntax and must return an array of objects.

Below is the sample JS script that is already available in the tagger editor. The sample extracts the following fields from the pages of the website:

title
description
searchtitle meta tag
Open Graph type
Open Graph URL
Open Graph description tag
Similarly, the language can be extracted from the body or the URL

// Sample extractor function. Change the function to suit your individual needs
function extract(request, response) {
    $ = response.body;

    return [{
        'description': $('meta[name="description"]').attr('content') || $('meta[property="og:description"]').attr('content') || $('p').text(),
        'name': $('meta[name="searchtitle"]').attr('content') || $('title').text(),
        'type': $('meta[property="og:type"]').attr('content') || 'website_content',
        'url': $('meta[property="og:url"]').attr('content')
    }];
}

Conditional logic can be implemented to extract key variables based on URLs:

    if (url.includes('/blogs')) {
      type = 'Blogs';
    } else if (url.includes('about-us')) {
      type = 'About Us';
    } // further else-if branches can be added as needed
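
The conditional above can be folded into a small helper that the extractor calls. The following is a minimal sketch, not part of the sample shipped in the editor: the helper name typeFromUrl, the URL fragments and the fallback value are illustrative, and it assumes the page URL is available to the extractor (for example as request.url).

```javascript
// Hypothetical helper: maps a page URL to a content type label.
// The URL fragments and labels are examples only; adjust them to your site.
function typeFromUrl(url) {
  if (url.includes('/blogs')) {
    return 'Blogs';
  } else if (url.includes('about-us')) {
    return 'About Us';
  }
  // Same fallback value the sample extractor uses for 'type'
  return 'website_content';
}

// Example usage outside the crawler, with plain strings
console.log(typeFromUrl('https://site.com/blogs/my-post')); // Blogs
console.log(typeFromUrl('https://site.com/about-us'));      // About Us
console.log(typeFromUrl('https://site.com/contact'));       // website_content
```

Inside the extractor, the returned value could then be assigned to the 'type' key of the returned object.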

One key step is to configure these attributes under Textual Relevance in the domain configuration and also in Global Widget Settings, otherwise they will not work properly in search. The following article describes in detail how to configure attributes for textual relevance:

https://sitecorebasics5.blogspot.com/2023/10/sitecore-search-how-to-configure.html

There are other extraction methods as well, such as XPath and CSS document extractors. Choose the one that fits your requirement.

Reference:

https://doc.sitecore.com/search/en/users/search-user-guide/configuring-document-extractors.html

Thursday, October 26, 2023

Sitecore Search: How to Configure Textual Relevance for Better Search Results

Textual relevance in Sitecore Search is determined by how closely a potential result matches the visitor's search query. To configure textual relevance, you specify which content areas Sitecore Search should look in for matching terms and the relative importance of each area. This post shows how to configure textual relevance in Sitecore Search to give customers a better search experience.

For example, if the attributes name and description are configured for textual relevance, Sitecore Search looks for matching terms in these attributes.
Multiple attributes can be configured with the textual relevance feature.
Each attribute can be given a weight/priority in the Global Widget Settings. This weight is used with other factors to determine the order of documents in search results.

Below are the steps to configure Textual Relevance in Sitecore Search:

To add an attribute and enable it for textual relevance:

In the CEC portal, click Administration > Domain Settings. Under Attributes, click Add Attribute at the top right corner.
Click Settings > Entity and choose the relevant entity. In the Display Name field, enter a display name for the attribute, e.g. Name.
In the Attribute Name field, enter the attribute's key, e.g. Name. This value is used later in the source configuration.
On the Use For Features tab, select the Textual Relevance option. Click Save and then click Publish.

The next step is to configure textual relevance at the domain level:

In the CEC portal, click Administration > Domain Settings > Feature Configuration.
Click Textual Relevance > Add Attribute.
On the attribute where textual relevance needs to be added, click Add Analyzer and then click Add.
By default, the Multi-Locale Standard Analyzer is already set on the attribute, but a different analyzer can be selected from the available list if required. Click Save and Publish.

The next step is to enable the new attribute for textual relevance in the global widget:

In the CEC portal, click Global Resources > Global Widget > Global Widget Settings > Textual Relevance. Click Advanced Mode.
Here, numeric weights can be assigned to the different attribute/analyzer combinations.
To include an attribute, click Include.
In the WEIGHT column, assign a weight to the attribute, e.g. enter 2 for Name and 1 for Description.
Click Save and Publish.

Once the attributes are set up for textual relevance, run a rescan and re-index, then check whether the search results have updated and better matches are shown for the search query.

Thursday, October 12, 2023

Troubleshooting Sitecore Search Crawling Failures: A Step-by-Step Guide

Sitecore Search offers the following pull sources:

Web crawler - a tool that crawls your content by starting from a point and following hyperlinks.
Advanced web crawler - a powerful and highly customizable crawler that crawls your content and adds it to an index.
API crawler - a crawler specifically designed to crawl API endpoints that return JSON.

Sitecore Search crawls the website to extract the latest content using the configured trigger, which is usually sitemap.xml. Crawling can start to fail for several reasons, and when it does the index stops receiving the latest content from the website. This post covers common reasons why crawling fails and how to resolve them.

Below are some potential options that you can try to remediate the issue faster:

Crawling may fail when the system attempts to parse your source and finds that it is not in the expected format.

For example, if the source is sitemap.xml and it does not render as valid XML, the crawl will fail.
To prevent this, ensure that your sitemap (https://site.com/sitemap.xml) is always formatted correctly.
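
A quick pre-check of the sitemap body can catch the most common failures before a crawl is triggered. The sketch below is a heuristic only and not a full XML validator: the function name checkSitemap is hypothetical, and it assumes the sitemap uses the standard urlset or sitemapindex root element with loc entries.

```javascript
// Heuristic pre-check for a sitemap body. This is NOT a full XML parser:
// it only catches common failures such as an HTML error page or a
// truncated document being served at the sitemap URL.
function checkSitemap(xmlText) {
  const problems = [];
  const text = xmlText.trim();

  // The XML declaration is optional, so strip it before checking the root element
  const body = text.replace(/^<\?xml[^>]*\?>\s*/, '');

  const rootMatch = body.match(/^<(urlset|sitemapindex)[\s>]/);
  if (!rootMatch) {
    problems.push('root element is not <urlset> or <sitemapindex>');
  } else if (!text.endsWith(`</${rootMatch[1]}>`)) {
    problems.push(`missing closing </${rootMatch[1]}> tag`);
  }

  // Collect the URLs listed in <loc> elements
  const urls = [...text.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g)].map((m) => m[1]);
  if (urls.length === 0) {
    problems.push('no <loc> entries found');
  }

  return { ok: problems.length === 0, problems, urls };
}

// Example with an inline sample; in practice the text would come from the sitemap URL
const sample = '<?xml version="1.0" encoding="UTF-8"?>' +
  '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' +
  '<url><loc>https://site.com/</loc></url></urlset>';
console.log(checkSitemap(sample));
```

A sitemap that passes this check can still be rejected by a strict parser, so treat a failure here as a strong signal and a pass as a first filter only.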

Rerun the crawl and the index and check whether it progresses to completion. Navigate to Sources in the CEC portal, find the source and click the "Recrawl and reindex" link.

There could also be an issue with the Sitecore Search platform itself, in which case please raise a ticket with Sitecore Support.

We recently faced an issue where Sitecore Search crawling started to fail intermittently with the error "Job failed due to heartbeat error". Sitecore Support confirmed that there was an ongoing issue behind the heartbeat error and promptly released a new version with the fix.

An admin or developer may have made a change shortly before the crawling started failing. If the scripts in the document extractors start throwing errors, the crawling job is affected.

One option is to undo the recent change and see whether the issue gets fixed and the crawl succeeds again.
Further troubleshooting of the changes to the document extractor scripts may be required.
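
One way to narrow down a failing extractor script is to run it locally against a stub before pasting it back into the CEC portal. The sketch below is an assumption-heavy illustration: makeStubBody and runExtractor are hypothetical helpers, and the stub only imitates the attr() and text() calls of the Cheerio-style object the crawler provides.

```javascript
// Minimal stand-in for the Cheerio-style `$` the crawler provides.
// It only supports the attr() and text() calls used below; it is not Cheerio.
function makeStubBody(values) {
  return (selector) => ({
    attr: () => values[selector],
    text: () => values[selector] || '',
  });
}

// Hypothetical local harness: runs an extractor and reports errors
// instead of letting them surface for the first time in a crawl job.
function runExtractor(extract, request, response) {
  try {
    const result = extract(request, response);
    if (!Array.isArray(result)) {
      return { ok: false, error: 'extractor must return an array of objects' };
    }
    return { ok: true, result };
  } catch (err) {
    return { ok: false, error: err.message };
  }
}

// Extractor under test (simplified for the example)
function extract(request, response) {
  const $ = response.body;
  return [{
    name: $('title').text(),
    url: $('meta[property="og:url"]').attr('content') || request.url,
  }];
}

const response = { body: makeStubBody({ title: 'Home' }) };
console.log(runExtractor(extract, { url: 'https://site.com/' }, response));
```

If the harness reports an error for the changed script but not for the previous version, the recent change is the likely cause of the failing crawl.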

Thursday, August 31, 2023

Sitecore PowerShell Extensions: Creating Sitecore users in bulk using SPE

This post shows how to create multiple users and assign them the required roles automatically using Sitecore PowerShell Extensions (SPE). When many users have to be created in Sitecore and each assigned individual roles, doing it manually one by one takes a while. Instead, the user creation and role assignment can be scripted in Sitecore PowerShell Extensions.

Here is an example that creates multiple users in Sitecore. Reference to the SPE documentation: https://doc.sitecorepowershell.com/appendix/security/new-user

New-User -Identity usrA -Enabled -Password b -Email usrA@gmail.com -FullName "User A"
New-User -Identity usrB -Enabled -Password b -Email usrB@gmail.com -FullName "User B"
New-User -Identity usrC -Enabled -Password b -Email usrC@gmail.com -FullName "User C"

Once these users are created, the next step is to add them to their respective roles. Below are the commands to add the individual users to the roles.

Add-RoleMember -Identity "Developer" -Members "usrA", "usrC"
Add-RoleMember -Identity "Publisher" -Members "usrB"

Reference to the SPE documentation: https://doc.sitecorepowershell.com/appendix/security/add-rolemember

Hope this article helps you create users in bulk and saves a lot of manual effort.

Monday, August 7, 2023

Allowing PDF file redirects on the Sitecore website

In a standard Sitecore setup, redirects for PDF URLs are not allowed or handled by Sitecore. This post covers the use case of allowing redirects for PDF files.

To let content authors set up these PDF redirects, the PDF extension has to be allowed in the processor FilterUrlFilesAndExtensions.

Below is the configuration required for an SXA website:

After making the above change, the URL https://sc102.dev.local/dummypage/dummy.pdf gets redirected successfully to https://sc102.dev.local/home

Below is the configuration required for a normal Sitecore website without the SXA module:

<sitecore>
  <pipelines>
    <preprocessRequest>
      <processor type="Sitecore.Pipelines.PreprocessRequest.FilterUrlExtensions, Sitecore.Kernel">
        <param desc="Allowed extensions (comma separated)">aspx, ashx, asmx, pdf</param>
      </processor>
    </preprocessRequest>
  </pipelines>
</sitecore>

Sunday, August 6, 2023

Issue with Sitecore CleanUpAgent resulting in low disk space on Sitecore servers

This post describes how to resolve an issue with the CleanupAgent where old log files are not cleaned up automatically by the agent, resulting in low disk space on the Sitecore servers.

In multiple versions of Sitecore, there is a bug in the CleanupAgent: the task executes but leaves many log files behind. Because of the rollingStyle set in the log appender, a new log file is created for every 10MB, with a name ending in a numeric suffix such as azure.log.xxxx.txt.1, azure.log.xxxx.txt.2 and so on.

This bug leads to:

Low disk space on the Sitecore servers
The size of the app becoming very large
Backups starting to fail, and site restores failing because of the enormous backup size

This issue can be resolved with an easy fix. With the below pattern, the CleanupAgent picks up the log files ending with the numeric suffix, and the logs folder shrinks to the expected size.

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/" xmlns:role="http://www.sitecore.net/xmlconfig/role/" xmlns:env="http://www.sitecore.net/xmlconfig/env" xmlns:set="http://www.sitecore.net/xmlconfig/set/">
  <sitecore>
    <scheduling>
      <agent type="Sitecore.Tasks.CleanupAgent">
        <files>
        </files>
      </agent>
    </scheduling>
  </sitecore>
</configuration>
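
To see why rolled-over files can be missed, it helps to compare a wildcard pattern that ends in .txt with one that also allows a trailing suffix. The sketch below is a generic illustration, assuming simple * and ? wildcard semantics: the two patterns are examples and are not taken from the configuration above, and the actual matching is done by Sitecore, not by this code.

```javascript
// Converts a simple wildcard pattern (* and ?) into an anchored regular expression
function wildcardToRegExp(pattern) {
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, '\\$&');
  return new RegExp('^' + escaped.replace(/\*/g, '.*').replace(/\?/g, '.') + '$', 'i');
}

// File names in the style described above
const files = ['azure.log.xxxx.txt', 'azure.log.xxxx.txt.1', 'azure.log.xxxx.txt.2'];

// Illustrative patterns only (assumptions, not the post's configuration)
for (const pattern of ['*log.*.txt', '*log.*.txt.*']) {
  const re = wildcardToRegExp(pattern);
  console.log(pattern, '->', files.filter((f) => re.test(f)));
}
```

Under these assumptions, the pattern ending in .txt matches only the first file, while the pattern with the trailing wildcard matches the two files with numeric suffixes.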

Friday, August 4, 2023

Encrypt Sitecore credentials by securing credentials under App services Configuration

This post shows how to secure the connection strings of the Sitecore platform as per the recommendations from Sitecore. This is a very important step that stops the credentials from being exposed to unauthorised access.

By default, the Sitecore passwords are stored in ConnectionStrings.config for the different roles. As per the recommendations from Sitecore, they should be encrypted so that passwords are not exposed without authorization.

For the Core roles and XP Service roles, you can use the method below to secure the credentials of the website.

Step 1: For each of the roles, open App Service Editor/Advanced Tools and navigate to the file "site/wwwroot/App_Config/ConnectionStrings.config".

Step 2: For each of the connection strings in ConnectionStrings.config (for example core, web, master, security, etc.), create a new entry under App Service -> Configuration -> "Connection Strings" section.

The name of the entry is the same as the name in the file, and the value is the complete "connectionString" value from the file. Set the type to SQLAzure for the database connection strings.

For other connection strings, set the type to Custom.

Step 3: Empty the connectionString value of each entry in ConnectionStrings.config. App Service fetches the connection strings from the Connection Strings section of the Configuration automatically.

For the webjobs in the roles xConnect Search, Cortex Processing and Marketing Operations, below is the method to move their credentials into the Configuration under App Service.

Step 1: For each of the webjobs mentioned above, open App Service Editor/Advanced Tools and navigate to wwwroot\App_Data\jobs\continuous\xxx\App_Config, where xxx is one of the folders ProcessingEngine/IndexWorker/AutomationEngine depending on the webjob.

Step 2: For each connection string, create an App setting whose name is the prefix "SITECORE_CONNECTIONSTRINGS_" followed by the connection string name from ConnectionStrings.config. The name must be in capital letters.

Step 3: