
SharePoint Fast Search Concepts and Terminology – Part 4/4

This blog post is Part 4 of 4 in a series that will help you get familiar with a few concepts and terms used in any search technology. Because the series focuses on ‘Fast Search for SharePoint’ (FS4SP), you may see jargon specific to it.

Please check the other related posts: Part 1 | Part 2 | Part 3 | Part 4

The following graphic gives a ten-thousand-foot view of what I am trying to capture and explain in this blog post series.

 

Fast Search Terminology and Concepts

8. Linguistics

Search engines perform a lot of language-specific processing. This includes applying rules specific to each language, offensive-language filtering, synonyms and so on. Explaining in detail how FS4SP behaves for individual languages is out of scope for this blog post.

9. Tokenization

When content sources are defined, search engines crawl them and generate indexes. Some search engines refer to the crawled items as documents. Tokenization is the process of analyzing those documents and breaking the text into terms, or tokens, that the search engine can recognize.
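As a rough illustration of tokenization, here is a minimal Python sketch. It assumes the simplest possible rule (lowercase the text and keep runs of letters and digits); this is not how FS4SP tokenizes, and the function name `tokenize` is made up for this example.

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphanumeric runs as tokens."""
    # Assumption: a token is any run of ASCII letters or digits.
    return re.findall(r"[a-z0-9]+", text.lower())

tokens = tokenize("FAST Search for SharePoint: crawl, index, search!")
print(tokens)
```

Real tokenizers also deal with language-specific word boundaries, which is where the linguistics processing mentioned earlier comes in.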

10. Keyword Rank

This is a technique to improve the ranking of documents that contain predefined keywords. It is a useful way to forcefully boost specific documents in the result set for a better search experience.
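The idea can be sketched in a few lines of Python. The keyword list, the boost values and the function name `keyword_boost` are all hypothetical and only show the principle; they are not FS4SP settings.

```python
# Hypothetical predefined keywords and the boost each one contributes.
BOOSTED_KEYWORDS = {"fs4sp": 2.0, "sharepoint": 1.0}

def keyword_boost(text, boosts=BOOSTED_KEYWORDS):
    """Sum the boosts of every predefined keyword that occurs in the text."""
    words = set(text.lower().split())
    return sum(boost for keyword, boost in boosts.items() if keyword in words)

for title in ["FS4SP configuration guide", "SharePoint and FS4SP setup", "Holiday schedule"]:
    print(title, keyword_boost(title))
```

A document that mentions more of the predefined keywords collects a larger boost and therefore moves up in the result set.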

11. Rank Profiles

Rank profiles are a core component of FS4SP that defines how items are ranked within a result set. Multiple rank profiles can be created via PowerShell, each defining how weights are applied to items and rank components.
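To make the idea of weights concrete, here is a small Python sketch of a rank profile as a set of weights over rank components. The component names (`freshness`, `quality`, `keyword_match`) and the values are assumptions for illustration, not actual FS4SP rank profile settings.

```python
# Hypothetical rank profile: one weight per rank component.
RANK_PROFILE = {"freshness": 0.2, "quality": 0.3, "keyword_match": 0.5}

def score(components, profile):
    """Weighted sum of an item's component values under the given profile."""
    return sum(weight * components.get(name, 0.0) for name, weight in profile.items())

# Hypothetical items with a value between 0 and 1 for each component.
items = {
    "doc-a": {"freshness": 0.9, "quality": 0.2, "keyword_match": 0.4},
    "doc-b": {"freshness": 0.1, "quality": 0.8, "keyword_match": 0.9},
}

ranked = sorted(items, key=lambda name: score(items[name], RANK_PROFILE), reverse=True)
print(ranked)
```

Changing the weights in the profile changes the order of the same items, which is the point of having more than one profile.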

 

 

 

 



SharePoint Fast Search Concepts and Terminology – Part 3/4

This blog post is Part 3 of 4 in a series that will help you get familiar with a few concepts and terms used in any search technology. Because the series focuses on ‘Fast Search for SharePoint’ (FS4SP), you may see jargon specific to it.

Please check the other related posts: Part 1 | Part 2 | Part 3 | Part 4

The following graphic gives a ten-thousand-foot view of what I am trying to capture and explain in this blog post series.

Fast Search Terminology and Concepts

7. Rank Tuning

By definition, ‘Rank’ is a position in a hierarchy; in the search world, ‘Rank’ is the position of a search result item in the result set. For instance, consider Bing or Google Search: when you search for ‘shoes’ or ‘vacation’, or in fact any search term, you get a lot of search results. The first few results are the ones users click, and they end up being the revenue generators. So it is important that your website shows up among the first few results, and that is determined by Rank. Seriously, how many times have you gone past the first page of Google search results? There are various techniques that can be used to tune your ranking. In the Fast Search world we have the following techniques.

7.1 Static/Quality Rank Tuning: URL Depth, Doc Rank, Site Rank, Hard Wired Boost

As the name indicates, this technique works independently of the search term: items are ranked the same way regardless of what the user searches for. Confused? OK, let me go through a few examples, and you will see how ranking can be tuned independently of search terms.

URL Depth Ranking:
Observe the URLs below and think about which of the two links you would prefer to click.

1. http://www.sdakoju.wordpress.com/post/2015/mostpopular/fastsearchconfiguration

                                       OR

2. http://www.sdakoju.wordpress.com/FS4SPConfiguration

I would choose the second one because it is cleaner, more readable and friendlier. Its depth is much lower than that of the first URL. ‘Depth’ is treated as an indicator of the importance or popularity of a page: the deeper a page lies within the site, the less popular it is assumed to be. Naturally, you would highlight your popular pages as landing pages or one level below them. Search engines want to show popular pages, not disliked or unknown pages lying somewhere deep down in the site.
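URL depth is easy to measure. The Python sketch below assumes that depth simply means the number of path segments after the host name; how FS4SP actually computes it may differ.

```python
from urllib.parse import urlparse

def url_depth(url):
    """Count the non-empty path segments of a URL."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])

deep = "http://www.sdakoju.wordpress.com/post/2015/mostpopular/fastsearchconfiguration"
shallow = "http://www.sdakoju.wordpress.com/FS4SPConfiguration"
print(url_depth(deep), url_depth(shallow))
```

Under this assumption the first URL has a depth of 4 and the second a depth of 1, so the second would get the better static rank.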

Doc Rank:
If a page is referenced from multiple other pages, that influences its rank. Such pages are ranked higher.

Site Rank:
This is similar to Doc Rank: the more links that point to a site, or to items within the site, the higher the site's rank.

Hard Wired Boost:
Items can be given a static rank via PowerShell so that they are forcefully shown as high-ranking items.
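The link-based idea behind Doc Rank and Site Rank can be sketched by counting inbound links. The link graph below is made up, and plain counting is an assumption for illustration; real engines use more elaborate link analysis.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
links = {
    "/home": ["/products", "/about"],
    "/blog": ["/products"],
    "/about": ["/products", "/home"],
}

# Count how many links point at each page.
inbound = Counter(target for targets in links.values() for target in targets)

# Pages with more inbound links come first.
ranked = [page for page, _ in inbound.most_common()]
print(inbound["/products"], ranked[0])
```

Here "/products" is referenced from three pages, so it would get the highest link-based rank.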

7.2 Dynamic Ranking:

This rank value is based on the query and its relation to the items in the result set. It is computed by multiple techniques and algorithms that influence ranking. Covering those is out of scope for this blog post; please refer to Tune Dynamic Rank.

 


SharePoint Fast Search Concepts and Terminology – Part 2/4

This blog post is Part 2 of 4 in a series that will help you get familiar with a few concepts and terms used in any search technology. Because the series focuses on ‘Fast Search for SharePoint’ (FS4SP), you may see jargon specific to it.

Please check the other related posts: Part 1 | Part 2 | Part 3 | Part 4

The following graphic gives a ten-thousand-foot view of what I am trying to capture and explain in this blog post series.

Fast Search Terminology and Concepts

5. Recall and Precision

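Recall is, loosely, the total number of results in the result set for a query; more precisely, it is the share of all relevant items that the query actually returns. Precision is the share of the returned results that are relevant. You have to find a fair balance between Recall and Precision. The result set should not be too large and should avoid noise as much as possible: if you push Recall too high, irrelevant results creep in and Precision drops, and if Recall is too low, relevant results go missing.

There are various techniques to improve Recall, such as Synonyms and Stemming.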
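The trade-off is easier to see with numbers. The sketch below assumes the usual definitions (recall as the share of relevant items that were returned, precision as the share of returned items that are relevant); the document IDs are made up.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

relevant = ["d1", "d2", "d3", "d4"]

# A narrow query returns only good results but misses half of them.
narrow = precision_recall(["d1", "d2"], relevant)

# A broad query finds everything but drags in noise.
broad = precision_recall(["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"], relevant)
print(narrow, broad)
```

The narrow query scores (1.0, 0.5) and the broad one (0.5, 1.0), which is the balance you are tuning.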

Synonyms:

This is a common and fairly obvious technique. For instance, if you search for “happy”, the search engine would also query for “joy”, “elated”, “merry” and so on.

Stemming:

Stemming reduces a word to its root form. The search engine compares the root forms of the search terms with the root forms of the words in the indexed documents. For example, if the user enters “viewer” as the query, the search engine searches for “view” and returns documents containing view, views, viewer, viewing and so on. Words such as preview or review carry a prefix, so suffix stemming alone does not reduce them to “view”.
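A deliberately naive suffix-stripping stemmer shows the principle. The suffix list and the minimum stem length are assumptions chosen for this example; production stemmers use far richer rules.

```python
# Assumed suffixes, checked in this order.
SUFFIXES = ("ing", "er", "ed", "s")

def stem(word):
    """Strip the first matching suffix, keeping at least three letters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for term in ["view", "views", "viewer", "viewing"]:
    print(term, "->", stem(term))
```

All four words reduce to "view", so a query for any one of them can match documents containing the others.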

Recall and Precision Balance


6.  Corpus

Corpus is the Latin term for body. In the search world, it refers to the full body of content, across all content sources, that the crawler crawls and indexes. The following graphic gives an example of what a corpus can include.

Corpus in Search


 


SharePoint Fast Search Concepts and Terminology – Part 1/4

This blog post is Part 1 of 4 in a series that will help you get familiar with a few concepts and terms used in any search technology. Because the series focuses on ‘Fast Search for SharePoint’ (FS4SP), you may see jargon specific to it.

Please check the other related posts: Part 1 | Part 2 | Part 3 | Part 4

Before we dive deep into these concepts, let's capture the overall process flow of any search engine. Typically, an organization's content is stored in databases or as documents on a file system. These documents can be of any type: Word, Excel, images, videos, PDFs, PowerPoint presentations and so on. The primary job of any search technology is to crawl, index and surface results.

  1. Crawl:

    1. Once you have identified the content you want to crawl, you make the search engine aware of where it is located and grant it the appropriate permissions to crawl. A crawl basically collects data, primarily metadata, about the content.
  2. Index:

    1. Indexing is similar to how an index works in a book: the entries are just pointers to the actual location of the content. Typically, indexes are physical files that are added to the file system as the output of a crawl.
  3. Search:

    1. Once crawling and indexing are complete, it is time to search these indexes. Since the indexes are present on the file system, querying them is a lot faster.
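The three steps above can be sketched with a tiny in-memory inverted index. This is only an illustration: the "crawled" documents are a hard-coded dictionary, the file names are made up, and a real engine stores its index in files rather than in a Python dict.

```python
from collections import defaultdict

# Hypothetical crawl output: location of each document -> its extracted text.
documents = {
    "docs/plan.docx": "search project plan",
    "docs/budget.xlsx": "project budget",
}

def build_index(docs):
    """Map every term to the set of locations that contain it."""
    index = defaultdict(set)
    for location, text in docs.items():
        for term in text.lower().split():
            index[term].add(location)
    return index

def search(index, term):
    """Look the term up in the index instead of scanning every document."""
    return sorted(index.get(term.lower(), set()))

index = build_index(documents)
print(search(index, "project"))
```

Just like a book index, each entry only points at where the content lives, which is why a lookup is so much faster than reading every document at query time.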

The following graphic gives ten thousand foot view of what I am trying to capture and explain in this blog post series.

Fast Search Terminology and Concepts


Let's get started:

1. Content Processing:

This is the nexus of any search engine. It defines which data sources the search engine should crawl and determines the quality of the content that goes into the index. It is all about enriching the content before it is indexed. Some of the tasks include:

  1. Making sure that the search engine doesn't crawl a lot of noise, which would otherwise hurt the recall
  2. Detecting the language and applying its rules
  3. Extracting metadata, and the list goes on.

2. Query Processing:

This kicks in after the user performs a search. The search engine analyzes what the user is actually requesting and accepts additional query parameters if needed. It also matches items in the search index and returns the search results to the user.

3. Relevancy:

This is the measure of how accurate or precise the search results are. Various factors determine how good the relevancy is; for instance, the more often users find the intended results at the top of the result set, the better the relevancy. Several techniques can improve relevancy, and they are discussed in the other parts of this blog post series.

4. Query Expansion: Best Bets, Synonyms, Lemmatization

Query expansion is a technique to improve Recall. When the user searches for a term, the search engine searches not only for that specific term but also for other relevant terms. This section explains the techniques by which query expansion can be accomplished.

4.1 Lemmatization

Lemma is a Greek word meaning assumption; in linguistics it is the canonical (dictionary) form of a word. For instance, if the user searches for ‘ship’, the engine would also search for ‘shipped’, ‘ships’ etc. Lemmatization is not to be confused with Stemming, which only strips or substitutes the ending of a word. For instance, Stemming would search for ‘See’, ‘Seen’ and ‘Seeing’ but not ‘Saw’, whereas Lemmatization would search for ‘Saw’ as well.
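A dictionary lookup is the simplest way to picture lemmatization. The table below is a tiny hand-written sample for illustration only; real engines ship full language dictionaries, and the function names are made up.

```python
# Hand-written sample: inflected form -> lemma (canonical form).
LEMMAS = {
    "saw": "see", "seen": "see", "seeing": "see", "sees": "see",
    "shipped": "ship", "ships": "ship",
}

def lemma(word):
    """Return the canonical form of a word, or the word itself if unknown."""
    word = word.lower()
    return LEMMAS.get(word, word)

def expand(term):
    """Return the lemma of the term plus every known form of that lemma."""
    base = lemma(term)
    return sorted({base} | {form for form, root in LEMMAS.items() if root == base})

print(expand("see"))
print(expand("ships"))
```

Because "saw" is stored as a form of "see", it is included in the expansion even though it shares no common ending with the other forms, which is exactly what suffix stripping cannot do.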

4.2 Synonyms

This is one of the most popular Recall techniques: the search engine returns results not only for the search term but also for its synonyms. For instance, if the user searches for ‘Joy’, the results may include items matching ‘Merry’, ‘Happy’, ‘Elated’, ‘Celebration’ etc.
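In its simplest form, synonym expansion is a dictionary lookup before the query runs. The synonym table and the function name `expand_query` below are hypothetical and only illustrate the idea.

```python
# Hypothetical synonym dictionary maintained by an administrator.
SYNONYMS = {"joy": ["merry", "happy", "elated", "celebration"]}

def expand_query(term):
    """Return the original term followed by its configured synonyms."""
    term = term.lower()
    return [term] + SYNONYMS.get(term, [])

print(expand_query("Joy"))
print(expand_query("budget"))
```

The engine would then search for every term in the expanded list, which raises Recall at the risk of some extra noise.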

4.3 Best Bets/Visual Best Bets

Best Bets are usually links displayed at the top of the search results, pointing to particular pages or content. These links are manually curated by an administrator and are displayed for a particular search term. A Visual Best Bet is similar to a Best Bet, except that an image is provided along with the link and description.
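Conceptually, a Best Bet is a curated lookup that runs alongside the normal search. In the Python sketch below, the search term, titles and intranet URLs are all made up for illustration.

```python
# Hypothetical curated best bets: search term -> (title, link) pairs.
BEST_BETS = {
    "vacation": [("Vacation policy", "http://intranet.example/hr/vacation-policy")],
}

def results_with_best_bets(query, organic_results):
    """Return the curated best bets for the query first, then the regular results."""
    pinned = BEST_BETS.get(query.strip().lower(), [])
    return pinned + [result for result in organic_results if result not in pinned]

organic = [("Team vacation calendar", "http://intranet.example/teams/calendar")]
page = results_with_best_bets("Vacation", organic)
print(page[0])
```

The curated link lands at the top regardless of how the regular results are ranked.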