Fast Search

This blog post is Part 2/4 of blog post series that will help you to get familiar with few concepts and terminologies referred in any search technology. As this blog series is more focused towards ‘Fast Search for SharePoint’ you may see jargon relevant to this.

Please check other related posts Part 1| Part 2 | Part 3 | Part 4

The following graphic gives ten thousand foot view of what I am trying to capture and explain in this blog post series.

5. Recall and Precision

The total number of results in the result set for a query. You have to find a fair balance between Recall and Precision. The results set should not be too large and should avoid noise as much as possible. If the Recall is too large or too small it will hamper Precision.

There are various techniques to improve Recall such as Synonyms, Stemming etc.

Synonyms:

This is pretty common technique and a very obvious one. For instance, if you search for “happy” the search would also query for “joy”, “elated”, “merry” etc.

Stemming:

The use of Stemming is to get to the root form of a word. Stemming compares the root forms of the search terms to the documents in its content sources. For example, if the user enters “viewer” as the query, the search engine searches for “view” and returns all documents with view, viewer, viewing, preview, review etc

Recall and Precision Balance

6. Corpus

Corpus is Latin term for body. In the Search world, it refers to the scope of all the content sources the crawler would crawl and indexes. Following gives an example of what Corpus can include.

Corpus in Search

This blog post is Part 1/4 of blog post series that will help you to get familiar with few concepts and terminologies referred in any search technology. As this blog series is more focused towards ‘Fast Search for SharePoint’ you may see jargon relevant to this.

Please check other related posts Part 1| Part 2 | Part 3 | Part 4

Before we deep dive into these concepts, lets try to capture the overall process flow for any search engine. Typically content for any organization is stored in databases or documents on a file system. These documents can be of any type word, excel, images, videos, pdfs power-points etc. The primary job of any search technology is to crawl, index and surface results.

Crawl:
1. Once you have identified what content you want to crawl you will make search engine aware of where these are located and will grant appropriate permissions to crawl. A crawl is basically collecting data and primarily metadata about the content.
Index:
1. Indexing is similar to how indexes work with books, they are just pointers to the actual location of the content. Typically indexes are physical files that are added to the file system and are output of a crawl.
Search:
1. Once the crawling and Indexing are complete its time to search these indexes, since these indexes are present on the file system querying these is lot faster.

The following graphic gives ten thousand foot view of what I am trying to capture and explain in this blog post series.

Fast Search Terminology and Concepts

Lets get started:

1. Content Processing:

This is the nexus of any search engine, this defines what data-sources the search engine should crawl and the quality of the content itself for crawling. This is all about enriching the content even before it is being crawled. Some of the tasks include

Making sure that the search engine doesn’t crawl lot of noise consequently impacting the recall
Detecting the language and applying rules
Extracting meta data etc. and the list goes on.

2. Query Processing:

This kicks in after user performs the search, the search engine analyzes what the user is actually requesting and will accept additional query parameters if needed. This also matches result items in the search index and returning search results to the user.

3. Relevancy:

This is the measure of how accurate/precise the search results are. There are various factors that determine how good the relevancy is, for instance the more the user find the intended results in the top search the better the relevancy is. There are various techniques that can improve relevancy which will be discussed more in other part of this blog post.

4. Query Expansion: Best Bets, Synonyms, Lemmatization

Query expansion is a technique to improve Recall. The user may search for a term, search engines would not only search for specific term but also other relevant terms. This section explains on various techniques on how Query expansion can be accomplished.

4.1 Lemmatization

Lemma is a greek word which mean assumption or the canonical form of the word. For instance if the user searches for ship it would search shipped, ships etc. Not to be confused with Stemming where only the end of the word changes where it substitutes only the ending. For instance, Stemming would search for See, Seen, Seeing but no Saw. Where as Lemmatization would search for ‘Saw’ as well.

4.2 Synonyms

This is one of the most popular Recall technique, where Search engine would return results not only to the search terms but also for its synonyms. For instance if the user searches for ‘Joy’, it may include results for ‘Merry’, ‘Happy’, ‘Elated’, ‘Celebration’ etc.

4.3 Best Bets/Visual Best Bets

Best Bets are usually links displayed on the top of the search results pointing to different pages or content. These links are manually curated by administrator to display for a particular search term. Visual Best Bet similar to Best Bet except that an additional image is provided along with link and description.

SharePoint Fast Search Concepts and Terminology – Part 2/4

5. Recall and Precision

Synonyms:

6. Corpus

SharePoint Fast Search Concepts and Terminology – Part 1/4

Crawl:

Index:

Search: