SharePoint Fast Search Concepts and Terminology – Part 4/4

This blog post is Part 4/4 of a blog post series that will help you get familiar with a few concepts and terms used in any search technology. Because this series focuses on ‘Fast Search for SharePoint’, you may see jargon specific to it.

Please check the other related posts: Part 1 | Part 2 | Part 3 | Part 4

The following graphic gives a ten-thousand-foot view of what I am trying to capture and explain in this blog post series.

 

Fast Search Terminology and Concepts

8. Linguistics

Search engines perform a lot of language-specific processing. This includes applying rules specific to the language, filtering offensive language, handling synonyms, and so on. A detailed explanation of how FS4SP handles individual languages is out of scope for this blog post.

9. Tokenization

Once content sources are defined, the search engine crawls them and builds an index from what it finds. Some search engines refer to the crawled items as documents. Tokenization is the process of analyzing those documents and breaking their text into terms, or tokens, that the search engine can recognize.
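
To make the idea concrete, here is a minimal sketch of a tokenizer in Python. It is only an illustration, not how FS4SP tokenizes: the function name tokenize and the rule "lowercase, then keep runs of letters and digits" are assumptions chosen for simplicity.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

tokens = tokenize("FAST Search for SharePoint indexes documents.")
print(tokens)
```

A real tokenizer also has to deal with language-specific rules (word boundaries, accents, compound words), which is where tokenization and linguistics overlap.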

10. Keyword Rank

This is a technique for boosting the rank of documents that contain predefined keywords. It is a useful way to deliberately raise those documents within the result set for a better search experience.
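
The idea can be sketched in a few lines of Python. This is a simplified illustration and not the FS4SP implementation: the function keyword_rank, the flat boost value, and the sample documents are all hypothetical.

```python
def keyword_rank(results, boosted_keywords, boost=10.0):
    """Add a fixed boost per predefined keyword found, then re-sort by score."""
    ranked = []
    for doc_text, base_score in results:
        words = set(doc_text.lower().split())
        hits = sum(1 for keyword in boosted_keywords if keyword in words)
        ranked.append((doc_text, base_score + boost * hits))
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Hypothetical results as (text, base score) pairs.
results = [("quarterly sales report", 6.0), ("holiday policy handbook", 5.0)]
ranked = keyword_rank(results, {"policy"})
print(ranked[0][0])
```

Without the boost the sales report would be listed first; with ‘policy’ as a predefined keyword the handbook moves to the top.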

11. Rank Profiles

Rank profiles are a core component of FS4SP that defines how items are ranked within a result set. You can create multiple rank profiles via PowerShell, each defining how weights are applied to items or to individual ranking components.
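
As a rough mental model, a rank profile can be pictured as a named set of weights that is applied to the ranking signals of each item. The Python sketch below only illustrates that model; the profile names, the signal names (title_match, body_match, freshness), and the weighted-sum formula are assumptions and do not reflect the actual FS4SP rank components or PowerShell cmdlets.

```python
# Hypothetical rank profiles: each maps a ranking component to a weight.
RANK_PROFILES = {
    "default": {"title_match": 2.0, "body_match": 1.0, "freshness": 0.5},
    "news": {"title_match": 1.0, "body_match": 1.0, "freshness": 3.0},
}

def score(item_signals, profile_name):
    """Weighted sum of the item's signals using the chosen profile."""
    weights = RANK_PROFILES[profile_name]
    return sum(weight * item_signals.get(name, 0.0) for name, weight in weights.items())

# Hypothetical signal values for one item.
item = {"title_match": 1.0, "body_match": 0.5, "freshness": 0.2}
default_score = score(item, "default")
news_score = score(item, "news")
print(default_score, news_score)
```

The same item gets a different score under each profile, which is the reason for keeping several profiles side by side.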

 

 

 

 


SharePoint Fast Search Concepts and Terminology – Part 1/4

This blog post is Part 1/4 of a blog post series that will help you get familiar with a few concepts and terms used in any search technology. Because this series focuses on ‘Fast Search for SharePoint’, you may see jargon specific to it.

Please check the other related posts: Part 1 | Part 2 | Part 3 | Part 4

Before we dive deep into these concepts, let's try to capture the overall process flow for any search engine. Typically, an organization's content is stored in databases or as documents on a file system. These documents can be of any type: Word, Excel, images, videos, PDFs, PowerPoint presentations, etc. The primary job of any search technology is to crawl, index, and surface results.

  1. Crawl:

    1. Once you have identified the content you want to crawl, you make the search engine aware of where it is located and grant the appropriate permissions to crawl it. A crawl basically collects data, primarily metadata, about the content.
  2. Index:

    1. Indexing is similar to how an index works in a book: the entries are just pointers to the actual location of the content. Typically, indexes are physical files that are written to the file system as the output of a crawl.
  3. Search:

    1. Once crawling and indexing are complete, it's time to search these indexes. Because the indexes already exist on the file system, querying them is a lot faster than scanning the original content.
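
The three steps above can be sketched as a tiny in-memory pipeline in Python. This is a toy model under stated assumptions, not how FS4SP works internally: the content source is a hard-coded dictionary with made-up paths, and the index lives in memory rather than in files on disk.

```python
from collections import defaultdict

# Hypothetical content source: document location -> text.
CONTENT_SOURCE = {
    "docs/budget.docx": "Annual budget planning",
    "docs/travel.pdf": "Travel policy and budget limits",
}

def crawl(source):
    """Collect (location, text) pairs from the content source."""
    return list(source.items())

def build_index(crawled):
    """Map each term to the set of locations that contain it."""
    index = defaultdict(set)
    for location, text in crawled:
        for term in text.lower().split():
            index[term].add(location)
    return index

def search(index, term):
    """Look the term up in the index instead of scanning every document."""
    return sorted(index.get(term.lower(), set()))

index = build_index(crawl(CONTENT_SOURCE))
print(search(index, "budget"))
```

The index stores pointers (locations) rather than the content itself, which mirrors the book-index analogy.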

The following graphic gives a ten-thousand-foot view of what I am trying to capture and explain in this blog post series.

Fast Search Terminology and Concepts

Let's get started:

1. Content Processing:

This is the nexus of any search engine. It defines which data sources the search engine should crawl and governs the quality of the content that is crawled. It is all about cleaning and enriching the content before it reaches the index. Some of the tasks include:

  1. Making sure that the search engine doesn’t crawl a lot of noise, which in turn degrades the results returned to the user
  2. Detecting the language and applying the rules for that language
  3. Extracting metadata, and so on.
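
A couple of these tasks can be sketched as small processing stages in Python. This is a simplified illustration only: the stage names (drop_noise, extract_metadata, process), the noise phrase, and the "first line is the title" rule are all hypothetical, and language detection is left out because it needs a dedicated library.

```python
def drop_noise(lines, noise_phrases):
    """Remove empty lines and lines that match a known noise phrase."""
    return [
        line for line in lines
        if line.strip() and line.strip().lower() not in noise_phrases
    ]

def extract_metadata(lines):
    """Treat the first remaining line as the title (an assumption for this sketch)."""
    title = lines[0].strip() if lines else ""
    return {"title": title, "line_count": len(lines)}

def process(raw_text, noise_phrases):
    """Run the stages in order and return the enriched document."""
    lines = drop_noise(raw_text.splitlines(), noise_phrases)
    return {"metadata": extract_metadata(lines), "body": "\n".join(lines)}

raw = "Expense Policy\n\nShare this page\nSubmit receipts within 30 days."
doc = process(raw, {"share this page"})
print(doc["metadata"])
```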

2. Query Processing:

This kicks in after the user performs a search. The search engine analyzes what the user is actually requesting and accepts additional query parameters if needed. It also matches the query against items in the search index and returns the search results to the user.

3. Relevancy:

This is the measure of how accurate or precise the search results are. Various factors determine how good the relevancy is; for instance, the more often users find the intended results at the top of the results list, the better the relevancy. Various techniques can improve relevancy, and they are discussed in other parts of this blog post series.
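
One common way to put a number on "intended results at the top" is precision at k: the share of the top k results that are relevant. The Python sketch below is only an illustration of that metric; it is not necessarily how FS4SP evaluates relevancy, and the document IDs are made up.

```python
def precision_at_k(ranked_results, relevant, k):
    """Fraction of the top-k results that appear in the relevant set."""
    top_k = ranked_results[:k]
    if not top_k:
        return 0.0
    return sum(1 for item in top_k if item in relevant) / len(top_k)

# Hypothetical ranked result IDs and the IDs a user considers relevant.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
print(precision_at_k(ranked, relevant, 2))
```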

4. Query Expansion: Best Bets, Synonyms, Lemmatization

Query expansion is a technique to improve recall. When the user searches for a term, the search engine searches not only for that specific term but also for other relevant terms. This section explains the various techniques by which query expansion can be accomplished.

4.1 Lemmatization

Lemma is a Greek word that means assumption; in linguistics, it is the canonical form of a word. For instance, if the user searches for ‘ship’, the search engine would also search for ‘shipped’, ‘ships’, etc. This is not to be confused with stemming, which only changes or substitutes the ending of a word. For instance, stemming would search for ‘See’, ‘Seen’, and ‘Seeing’ but not ‘Saw’, whereas lemmatization would search for ‘Saw’ as well.
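
The difference can be shown with a small Python sketch. Both functions are deliberately naive and are assumptions made for illustration: the stemmer only strips three hard-coded suffixes, and the lemma table is hand-written, whereas a real lemmatizer relies on a full dictionary for each language.

```python
# Hypothetical, hand-written lemma table for the verb "see".
LEMMAS = {"see": "see", "sees": "see", "seen": "see", "seeing": "see", "saw": "see"}

def stem(word):
    """Naive stemmer: strip one known suffix from the end of the word."""
    word = word.lower()
    for suffix in ("ing", "n", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    """Look up the canonical form; fall back to the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)

for word in ("See", "Seen", "Seeing", "Saw"):
    print(word, stem(word), lemmatize(word))
```

Stemming maps ‘Seen’ and ‘Seeing’ to ‘see’ but leaves ‘Saw’ untouched, because only the ending of the word is inspected; the lemma lookup maps all four to ‘see’.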

4.2 Synonyms

This is one of the most popular recall techniques: the search engine returns results not only for the search terms but also for their synonyms. For instance, if the user searches for ‘Joy’, the results may include ‘Merry’, ‘Happy’, ‘Elated’, ‘Celebration’, etc.
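
A minimal Python sketch of synonym expansion follows. The synonym table and the function expand_query are hypothetical; in a real deployment the synonyms would come from a dictionary managed by the search administrator.

```python
# Hypothetical synonym dictionary, using the example from the text.
SYNONYMS = {"joy": ["merry", "happy", "elated", "celebration"]}

def expand_query(query):
    """Return the query terms followed by their synonyms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + SYNONYMS.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query("Joy"))
```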

4.3 Best Bets/Visual Best Bets

Best Bets are usually links displayed at the top of the search results that point to different pages or content. These links are manually curated by an administrator to display for a particular search term. A Visual Best Bet is similar to a Best Bet, except that an additional image is provided along with the link and description.
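
Conceptually, a Best Bet is a curated lookup from a search term to one or more links that are shown above the regular results. The Python sketch below illustrates that idea only; the table BEST_BETS, the example URL, and the function search_with_best_bets are hypothetical and not part of any SharePoint API.

```python
# Hypothetical curated table: search term -> links chosen by an administrator.
BEST_BETS = {
    "holiday": [
        {
            "title": "Holiday calendar",
            "url": "https://intranet.example.com/holidays",
            "description": "Official company holiday list",
        }
    ],
}

def search_with_best_bets(query, organic_results):
    """Return curated links for the term, followed by the regular results."""
    best_bets = BEST_BETS.get(query.strip().lower(), [])
    return {"best_bets": best_bets, "results": organic_results}

page = search_with_best_bets("Holiday", ["Holiday party photos", "Holiday rota"])
print(page["best_bets"][0]["title"])
```

A Visual Best Bet could be modeled the same way with one extra field that holds the image location.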