This blog post is Part 1 of 4 in a series that will help you get familiar with a few concepts and terms used in any search technology. Because this series focuses on ‘Fast Search for SharePoint’, you may see jargon specific to that product.
Please check the other related posts: Part 1 | Part 2 | Part 3 | Part 4
Before we dive deep into these concepts, let's capture the overall process flow of a search engine. Typically, an organization's content is stored in databases or as documents on a file system. These documents can be of any type: Word, Excel, PowerPoint, PDF, images, videos, and so on. The primary job of any search technology is to crawl the content, index it, and surface results.
- Crawl: Once you have identified the content you want to crawl, you tell the search engine where it is located and grant it the permissions it needs to read it. A crawl essentially collects data, primarily metadata, about the content.
- Index: An index works like the index of a book: its entries are just pointers to the actual location of the content. Indexes are typically physical files that are written to the file system as the output of a crawl.
- Search: Once crawling and indexing are complete, it's time to search the indexes. Because the indexes are already present on the file system, querying them is a lot faster than scanning the original content.
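To make the crawl, index, and search steps above a bit more concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not how Fast Search works internally: the `DOCUMENTS` dictionary stands in for a real content source, and the function names `crawl`, `build_index`, and `search` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical in-memory "content source"; a real crawler would read
# databases or file shares it has been granted access to.
DOCUMENTS = {
    "doc1.docx": "Quarterly sales report for the retail team",
    "doc2.pdf": "Retail store opening checklist",
    "doc3.pptx": "Engineering roadmap and sales forecast",
}


def crawl(source):
    """Collect the text and basic metadata for every item in the source."""
    for path, text in source.items():
        yield {"path": path, "extension": path.rsplit(".", 1)[-1], "text": text}


def build_index(items):
    """Map each lowercased word to the set of paths that contain it."""
    index = defaultdict(set)
    for item in items:
        for word in item["text"].lower().split():
            index[word].add(item["path"])
    return index


def search(index, query):
    """Return the paths that contain every word in the query, sorted."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)


index = build_index(crawl(DOCUMENTS))
print(search(index, "sales"))         # ['doc1.docx', 'doc3.pptx']
print(search(index, "retail sales"))  # ['doc1.docx']
```

Note that `search` never touches the documents themselves. It only looks up pointers in the index, which is why querying is fast once the crawl and indexing work is done.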
The following graphic gives a ten-thousand-foot view of what I am trying to capture and explain in this blog post series.
Fast Search Terminology and Concepts
Let's get started:
1. Content Processing:
Content processing is at the heart of any search engine. It defines which data sources the search engine should crawl and determines the quality of the content that ends up in the index. It is all about enriching the content before it is indexed. Some of the tasks include:
- Making sure that the search engine doesn’t crawl a lot of noise, which would clutter the index and hurt precision
- Detecting the language and applying language-specific rules
- Extracting metadata, and the list goes on
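The tasks listed above can be sketched as a small processing step that runs on each crawled item. This is only an illustration: the stop-word lists, the `MIN_WORDS` noise threshold, and the `process` function are assumptions of mine, and real language detection is far more sophisticated than counting stop words.

```python
import re

# Tiny, hypothetical stop-word lists used to guess the language.
STOP_WORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "de": {"der", "und", "ist", "das", "zu"},
}
MIN_WORDS = 3  # assumed threshold: shorter items are treated as noise


def detect_language(words):
    """Pick the language whose stop words appear most often (None if no hits)."""
    scores = {lang: sum(w in stops for w in words) for lang, stops in STOP_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def process(item):
    """Return an enriched copy of the item, or None if it is noise."""
    words = re.findall(r"\w+", item["text"].lower())
    if len(words) < MIN_WORDS:
        return None
    return {
        **item,
        "language": detect_language(words),
        "word_count": len(words),
        "file_type": item["path"].rsplit(".", 1)[-1],
    }


items = [
    {"path": "report.docx", "text": "The budget is part of the annual report"},
    {"path": "notiz.txt", "text": "Der Bericht ist fertig und das Budget steht"},
    {"path": "empty.txt", "text": "n/a"},
]
processed = [p for p in map(process, items) if p is not None]
for p in processed:
    print(p["path"], p["language"], p["word_count"])
```

The third item is dropped as noise, and the other two come out tagged with a detected language and some extracted metadata.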
2. Query Processing:
Query processing kicks in after the user performs a search. The search engine analyzes what the user is actually requesting and accepts additional query parameters if needed. It then matches the query against items in the search index and returns the search results to the user.
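As a rough sketch of what analyzing a query and accepting additional parameters can look like, the Python below splits a raw query into free-text terms and `key:value` filters and then matches both against a few items. The `key:value` syntax, the `filetype` property, and the function names are assumptions for illustration, not the actual Fast Search query language.

```python
def parse_query(raw):
    """Split a raw query into free-text terms and key:value filters."""
    terms, filters = [], {}
    for token in raw.lower().split():
        key, sep, value = token.partition(":")
        if sep and key and value:
            filters[key] = value
        else:
            terms.append(token)
    return terms, filters


def run_query(raw, items):
    """Return titles of items containing all terms and satisfying all filters."""
    terms, filters = parse_query(raw)
    results = []
    for item in items:
        words = item["text"].lower().split()
        if all(t in words for t in terms) and all(
            str(item.get(k, "")).lower() == v for k, v in filters.items()
        ):
            results.append(item["title"])
    return results


# Hypothetical indexed items with one metadata property each.
ITEMS = [
    {"title": "Budget 2012", "text": "annual budget report", "filetype": "xlsx"},
    {"title": "Budget memo", "text": "budget memo for managers", "filetype": "docx"},
    {"title": "Travel policy", "text": "travel policy report", "filetype": "docx"},
]
print(run_query("budget filetype:docx", ITEMS))  # ['Budget memo']
```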
3. Relevancy:
Relevancy is a measure of how accurate the search results are. Various factors determine how good the relevancy is; for instance, the more often users find the intended results at the top of the result list, the better the relevancy. Various techniques can improve relevancy, and they are discussed in the other parts of this blog post series.
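One way to put a number on "the intended results are near the top" is with two commonly used measures, precision at k and reciprocal rank. The sketch below is only meant to illustrate the idea; it is not how Fast Search computes its ranking, and the sample rankings are made up.

```python
def precision_at_k(results, relevant, k):
    """Fraction of the top-k results that are relevant."""
    top = results[:k]
    if not top:
        return 0.0
    return sum(r in relevant for r in top) / len(top)


def reciprocal_rank(results, relevant):
    """1 / position of the first relevant result, or 0.0 if there is none."""
    for position, result in enumerate(results, start=1):
        if result in relevant:
            return 1.0 / position
    return 0.0


relevant = {"a", "b"}                # items the user actually wanted
ranking_one = ["a", "b", "x", "y"]   # relevant items at the top
ranking_two = ["x", "y", "a", "b"]   # same items, ranked lower

print(precision_at_k(ranking_one, relevant, 2), reciprocal_rank(ranking_one, relevant))
print(precision_at_k(ranking_two, relevant, 2), reciprocal_rank(ranking_two, relevant))
```

Both rankings return exactly the same four items, but the first one scores higher on both measures because the relevant items appear first.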
4. Query Expansion: Best Bets, Synonyms, Lemmatization
Query expansion is a technique for improving recall. When the user searches for a term, the search engine searches not only for that specific term but also for other relevant terms. This section explains the various techniques that can be used to accomplish query expansion.
4.1 Lemmatization
Lemma is a Greek word that means assumption; in linguistics, it refers to the canonical form of a word. For instance, if the user searches for ‘ship’, the search engine would also search for ‘shipped’, ‘ships’, etc. Lemmatization is not to be confused with stemming, which only strips or substitutes the ending of a word. For instance, stemming would search for ‘See’, ‘Seen’, and ‘Seeing’ but not ‘Saw’, whereas lemmatization would search for ‘Saw’ as well.
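The difference is easiest to see side by side. Below is a deliberately naive sketch: the suffix list in `stem` and the small `LEMMAS` dictionary are hypothetical and nowhere near what a real stemmer or lemmatizer uses, but they show why a rule that only touches word endings cannot get from ‘saw’ to ‘see’.

```python
# Hypothetical suffix list and lemma dictionary, for illustration only.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMAS = {
    "saw": "see", "seen": "see", "seeing": "see", "sees": "see",
    "shipped": "ship", "ships": "ship", "shipping": "ship",
}


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)


for w in ("seeing", "sees", "saw"):
    print(w, "stem:", stem(w), "lemma:", lemmatize(w))
```

For ‘seeing’ and ‘sees’ both functions arrive at ‘see’, but for ‘saw’ the stemmer leaves the word unchanged while the dictionary lookup still maps it to ‘see’.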
4.2 Synonyms
This is one of the most popular recall techniques: the search engine returns results not only for the search terms but also for their synonyms. For instance, if the user searches for ‘Joy’, the results may include matches for ‘Merry’, ‘Happy’, ‘Elated’, ‘Celebration’, etc.
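A minimal sketch of synonym expansion might look like the following. The `SYNONYMS` dictionary and the `expand_query` function are assumptions for illustration; in a real deployment the synonym list would be maintained by an administrator.

```python
# Hypothetical synonym dictionary; in practice an administrator curates this.
SYNONYMS = {
    "joy": ["merry", "happy", "elated"],
    "car": ["automobile"],
}


def expand_query(query):
    """Return the query terms plus their synonyms, without duplicates, in order."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term, *SYNONYMS.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query("Joy car"))
# ['joy', 'merry', 'happy', 'elated', 'car', 'automobile']
```

The expanded list is what would then be matched against the index, which is how documents that never mention ‘Joy’ can still show up in the results.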
4.3 Best Bets/Visual Best Bets
Best Bets are usually links displayed at the top of the search results that point to specific pages or content. These links are manually curated by an administrator to display for a particular search term. A Visual Best Bet is similar to a Best Bet, except that an image is provided along with the link and description.
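Conceptually, a Best Bet is just a curated lookup from a search term to promoted links that are shown ahead of the regular results. Here is a minimal sketch, assuming a hypothetical `BEST_BETS` table with made-up titles and example URLs; the optional `image` key stands in for a Visual Best Bet.

```python
# Hypothetical curated table: search term -> promoted links.
# An entry with an "image" key plays the role of a Visual Best Bet.
BEST_BETS = {
    "vacation": [
        {
            "title": "HR Leave Policy",
            "url": "http://intranet.example/hr/leave",
            "description": "How to request time off",
        }
    ],
    "benefits": [
        {
            "title": "Benefits Portal",
            "url": "http://intranet.example/benefits",
            "description": "Enrollment and plan details",
            "image": "http://intranet.example/img/benefits.png",
        }
    ],
}


def search_with_best_bets(query, organic_results):
    """Put curated links for the query ahead of the organic results."""
    promoted = BEST_BETS.get(query.strip().lower(), [])
    return {"best_bets": promoted, "results": organic_results}


page = search_with_best_bets(" Vacation ", ["doc1.docx", "doc2.pdf"])
for bet in page["best_bets"]:
    print("BEST BET:", bet["title"], bet["url"])
for result in page["results"]:
    print(result)
```

Because the lookup is keyed on the exact term, the administrator stays in full control of which links are promoted for which searches.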