AASHAN

Creating a Fulltext Search Engine in Rust - Part 1

Wed Dec 06 2023 · 7 min read
Creating a Fulltext Search Engine in Rust - Part 1

Preface

For the past few weeks, I have been working on a full-text search functionality for a few projects in my company. I have written code that process, index and then search for the data using elasticsearch. It was fascinating as well as troublesome at times because of the little knowledge I had about how search engines work internally. So, I had to do a lot of research on elasticsearch queries, filters and different ways of searching for stuff in elasticsearch. However, the queries that I learned were limited to elsticsearch. So, the inner nerd in me kicked in and I wanted to learn how a search engine works instead. This blog documents how learned the fundamentals of a search engine and tried implementing a very basic fulltext search engine from scratch.

Wtf is a full-text search engine?

According to wikipedia, In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references). So, naturally, a fulltext search engine is a program that does full-text search on a list of documents (called indices) and gets the documents that match the query based on certain parameters.

Why fullltext search?

There are many advantages of full-text search over normal search. First of which is, fulltext search is blazing fast. It utilizes a data structure called inverted index to get the search results without ever having to look into all the documents that are indexed. Instead, it matches each of the tokens in your search query and aggregates the search results from the documents that are associated with the tokens. We will discuss how an inverted index map works later in the post.

How does it work?

Well, this is a question that has to be answered in parts. Before we can answer how a fulltext search engine works, there are a couple of terms we need to understand. I will briefly talk about the concepts sorrounding fulltext search engine in the points below.

Semantic Analysis

Semantic analysis can be a separate series of blogs in itself if we try to cover it in depth. But for now, we can understand it as a process of pre-processing the text that we are trying to search or index. The process may include simple steps such as removing punctuations from sentences to a bit more complex tasks like getting rid of the stop words and stemming.

Semantic analysis is a series of individual steps. In our particular implementation, we first remove the punctuation and symbols. The alphanumeric text then is fed to a tokenizer which breaks down the large text into small individual words. These words are then filtered for stop words. The filtered words are stemmed and the stem words are individually stored as indices.

Processes like stemming and stop word reduction are specially useful because they get rid of the parts in search data that are of less significance. For example, in the words smile, smiled, smiling, and smiles the root word is smile and this is what we want to store in our search index because this will match with any of the above words when searching and might provide much more insightful search results. Similarly, words like a, the etc are so repetitive that they barely hold any significance in searching for documents. So, even if we remove these words, we still have a pretty good search result based on the words that provide more significance in search while preserving a lot of memory by getting rid of insignificant words.

Tokenization

Tokenization is a process of converting a long text to smaller tokens. For the simplest example, tokens can be individual words on a sentence. We can split the sentences by white spaces and then there we go, we have our tokens ready for indexing.

Inverted Index

use std::collection::HashMap;

struct InvertedIndex {
    documents: HashMap<String, HashMap<String, String>>,
    tokens: HashMap<String, Vec<String>>
}

Alright, lets discuss the inverted index data structure. So, an index is what a table is in a relational database for a search engine. It stores the data that you want to search for. In addition to the data that you ask the index to hold, it also generates additional data that helps in searching. The image above contains the basic structure of an inverted index. It cotains two properties, ie. documents and tokens where the former is the one that is used to store the data provided by the user, whereas the later stores data generated by the search engine.

Suppose we want to index two sentences This is a random text which is being indexed. and This is another text that will be indexed.. Now, after tokenization, we are left with this, is, a, random, text .... be, indexed for the first sentence. Similarly we can generate the tokens for the second sentence like this, is, another, ... , be, indexed.

Now, when indexing, we assign an id to each of the sentences and store them whole in the documents property. For the tokens, we store each token along with their respective document ids in which the tokens can be found. So, the token indexed will have two ids stored in an array (vector in rust) and the token another which only occurs in the second sentence will only hold the id of the second document.

When searching, we tokenize the search query the same way we did when we were indexing the data. After the tokens are created, we check to see if the token is present in the tokens property of our inverted index. If the token is present, we take the document ids associated with the token and add them into our search results. This way, we don't actually go through all the documents when searching, but just go through indices in a hashmap, which is way faster in comparision with looping over thousands and thousands of documents and checking to see if text fragment is present in a long text.

Next?

Since this has already been a very long and boring post, I'll write the code in a separate follow up post to this. I hope you got to learn something today. If you are from future and I've already published the follow up post, I'll update the related posts link below so that you can find the follow up article there.

Update: I have published the second part. Please see the related articles below.

Have some questions? Let's get in touch

Related posts