Tuesday, November 14, 2023

Sitecore Search : Addressing exceptions on the website crawling for the failed pages

Sitecore Search : Addressing common exceptions on the website crawling for the failed pages

This post explains how to troubleshoot common crawling errors in Sitecore Search when pages fail to be crawled. When a site crawler encounters errors, the dashboard shows that errors occurred, but it does not provide a detailed log of the issue. The steps below show where to find the full details of these errors.

To check the status of the scheduled scans on the site crawler:
  • Log in to the CEC portal and click Sources.
  • A summary of the last crawl is displayed here. It shows:
    • Last Run Status: Finished if the crawl completed, or Failed if it stopped due to a failure
    • Last Run Time
    • Items Indexed: number of items indexed
    • A summary of errors, if any occurred while crawling the site
In the example below, after the crawl finished, the summary shows 3 configuration errors, but it gives no further details for troubleshooting.




More information on the crawl results is available under the Analytics tab. Follow these steps to view it:

  • Navigate to Analytics -> Sources -> Overview, then select the source at the bottom.

  • This page lists the last few crawl runs with details such as run duration, status, Items Indexed, and Job Run ID.


  • Click the last run to see whether any documents were dropped or failed.



  • In this example, the crawl of the page https://devsite.com/about-us/news failed because the page was unavailable or threw errors while loading, which can then be investigated further. Using the steps above, crawling errors can be identified for faster troubleshooting.

Wednesday, November 8, 2023

Executing Tasks with Sitecore PowerShell Extensions: A Practical Guide


Sitecore PowerShell Extensions (SPE) is a popular module for the Sitecore CMS that provides a scripting environment and a set of commands for administrators and developers. It is commonly used to automate tasks within Sitecore, such as content management, reporting, and maintenance. This post is a practical guide to using SPE for common Sitecore tasks.


Below are some examples of how SPE can be used for common tasks in Sitecore:

Item Manipulation:

  • Creating Items: Sitecore items can be created programmatically using SPE. For example, the script below creates a new item under the Articles folder using the Article template.

    • New-Item -Path "master:\content\MySite\Articles" -Name "NewArticle" -ItemType "MySite/Article"

  • Copying / Moving Items: using the SPE scripts below, items can be copied or moved from one location to another.

    • Copy-Item -Path "master:\content\MySite\Articles\Article1" -Destination "master:\content\MySite\Articles\Article2"

    • Get-Item -Path "master:\content\MySite\Articles\Article1" | Move-Item  -Destination "master:\content\MySite\Articles"

  • Deleting Items: using the SPE command below, items can be deleted. The -Permanently parameter specifies that the item should be deleted permanently rather than moved to the Recycle Bin.

    • Remove-Item -Path "master:\content\MySite\Articles\Article1" -Permanently

  • Publishing Items: using the SPE command below, items can be published from one database to another, such as from the master database to the web database.

    • Get-Item -Path master:\content\home | Publish-Item -Recurse -PublishMode Incremental

  • To publish to multiple target databases:
    • $targets = [string[]]@('web','internet')
    • Publish-Item -Path master:\content\home -Target $targets

  • Bulk Operations: using the command below, bulk updates can be made on items, such as changing the template of multiple items or updating fields.

    • Get-ChildItem -Path "master:\content\MySite\Articles" | ForEach-Object {
              # Item.ChangeTemplate() expects a TemplateItem, not a string, so use SPE's Set-ItemTemplate with a template path
              Set-ItemTemplate -Item $_ -Template "MySite/UpdatedArticleTemplate"
      }
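  • Creating Users and Assigning Roles: Sitecore user creation and role assignment can be automated as well using the commands below. New-User identifies the account with -Identity, and role membership is added separately with Add-RoleMember.

    • New-User -Identity "dev.user" -Password "password" -Email "dev.user@example.com" -Enabled
    • "Content Author", "Content Reviewer" | ForEach-Object { Add-RoleMember -Identity $_ -Members "dev.user" }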

Friday, November 3, 2023

Configuring Sitecore Search Document Extraction: A Step-by-Step Guide

Configuring Sitecore Search Document Extractor: A Step-by-Step Guide 

Document extraction in Sitecore Search refers to the process of crawling the pages of a website and extracting specific data or DOM elements from them. It converts page content into a structured format that can be processed and indexed for search.

This post is a step-by-step guide to setting up document extractors in Sitecore Search. Below are some of the common key attributes required for search to work properly:
  • Title or name of the page
  • Content/description of the page
  • Meta tags
  • Key elements from page components that should be included in search
Here is a quick walkthrough of configuring a JavaScript document extractor to extract attribute values for an advanced web crawler or an API crawler:
  • On the CEC portal, click Sources, then click the custom source. In the Source Settings configuration, click Document Extractors, then click Edit on the right.
  • Next, add the extractor: enter the name Demo JavaScript Extractor and select JS as the extractor type.
  • In the Taggers section, click Add Tagger. The function must use Cheerio syntax and must return an array of objects.


This is the sample JS script already available in the tagger editor. It extracts the following fields from the pages of the website:
  • title
  • description
  • searchtitle meta tag
  • Open Graph type
  • Open Graph URL
  • Open Graph description tag
  • Similarly, the language can be extracted from the body or the URL

// Sample extractor function. Change the function to suit your individual needs
function extract(request, response) {
    $ = response.body;

    return [{
        'description': $('meta[name="description"]').attr('content') || $('meta[property="og:description"]').attr('content') || $('p').text(),
        'name': $('meta[name="searchtitle"]').attr('content') || $('title').text(),
        'type': $('meta[property="og:type"]').attr('content') || 'website_content',
        'url': $('meta[property="og:url"]').attr('content')
    }];
}
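
As mentioned above, the language can also be derived from the URL. The snippet below is a minimal sketch of that idea, not part of the Sitecore sample: the helper name languageFromUrl and the assumption that the locale is the first path segment (for example /en/ or /fr-ca/) are hypothetical and should be adapted to your site's URL structure.

```javascript
// Hypothetical helper: read a locale such as "en" or "fr-ca" from the first
// path segment of a URL. Returns the supplied fallback when none is found.
function languageFromUrl(url, fallback) {
    // Assumes URLs shaped like https://host/<locale>/rest-of-path
    const match = url.match(/^https?:\/\/[^\/]+\/([a-z]{2}(?:-[a-z]{2})?)(?:[\/?#]|$)/i);
    return match ? match[1].toLowerCase() : fallback;
}
```

Inside the extractor, the result could be returned as an additional attribute alongside name, type, and url, provided a matching attribute is configured for the domain. Note that any two-letter first segment is treated as a locale here, so a stricter allow-list may be preferable.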

Conditional logic can be implemented to extract key variables based on URLs:

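    // Assumes the page URL is available as request.url inside extract()
    const url = request.url;
    let type = 'website_content';
    if (url.includes('/blogs')) {
      type = 'Blogs';
    } else if (url.includes('about-us')) {
      type = 'About Us';
    } // else ... add further branches as needed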
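
Putting the pieces together, the following is a sketch of how the URL-based condition might be combined with the sample extractor. The helper name typeFromUrl is hypothetical, and the use of request.url as a fallback when no og:url meta tag is present is an assumption; the 'website_content' default is taken from the sample script.

```javascript
// Hypothetical helper: map a page URL to a content type label.
function typeFromUrl(url) {
    if (url.includes('/blogs')) return 'Blogs';
    if (url.includes('about-us')) return 'About Us';
    return 'website_content'; // same default as the sample extractor
}

function extract(request, response) {
    const $ = response.body; // Cheerio-style selector function for the crawled page

    // Prefer the Open Graph URL; fall back to the crawled URL (assumed to be request.url)
    const url = $('meta[property="og:url"]').attr('content') || request.url;

    // The extractor must return an array of objects
    return [{
        'name': $('meta[name="searchtitle"]').attr('content') || $('title').text(),
        'type': typeFromUrl(url),
        'url': url
    }];
}
```

Keeping the URL-to-type mapping in its own function makes it easy to add further branches without touching the selector logic.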

One key step is to configure these attributes under Textual Relevance in Domain Configuration, and also in Global Widget Settings, so that they work properly. The following article describes in detail how to configure attributes for textual relevance:

https://sitecorebasics5.blogspot.com/2023/10/sitecore-search-how-to-configure.html

Other extraction methods are also available, such as XPath and CSS document extractors. Choose the one that best fits your requirements.

Reference:
https://doc.sitecore.com/search/en/users/search-user-guide/configuring-document-extractors.html