Previously, we went through the theoretical concepts behind a full-text search engine. Today, I want to go through the code for implementing an inverted index, which is the foundation for creating a search engine. So, let's get started.
Note that this post assumes that you have a basic understanding of the Rust language. Even if you don't, I'll try my best to explain what is going on in each section.
Semantic Analysis
Before we even begin with indexing and searching, we need to prepare some methods that can help us with cleaning up text, tokenization, stop word removal and stemming. So, let's create a function for each of these tasks.
Cleaning Up the Input Text

In this part, we will create a function that takes a string, removes all punctuation and symbols, and returns only the alphanumeric values. This is important because if we have a document

Is anyone here?

then we would like that document to be listed in the results whether we search for here, here? or here. and so on.

pub fn cleanup(input: String) -> String {
    input
        .chars()
        .filter(|&c| c.is_alphanumeric() || c.is_whitespace())
        .collect()
}
Here, we preserve alphanumeric and whitespace characters: we need the alphanumeric characters for indexing, and we will use the whitespace for tokenization later on.
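As a quick sanity check, here is a hypothetical usage of cleanup (not part of the final code, just an illustration):

fn main() {
    // Punctuation is dropped but whitespace is preserved,
    // so the words can still be split apart later
    let cleaned = cleanup("Is anyone here?".to_string());
    assert_eq!(cleaned, "Is anyone here");
}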
Tokenization

For tokenization, we will first decode Unicode strings. This is important because we want to show results for

Mesut Özil

when someone searches for ozil. For this, we are using the unidecode crate.

use unidecode::unidecode;

pub fn tokenize(input: String) -> Vec<String> {
    unidecode(&input)
        .split_whitespace()
        .map(|s| s.to_string())
        .collect()
}
Here, we are converting Unicode to ASCII and splitting the text on whitespace. We collect all the parts generated that way and ship them out as a vector of strings.
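A quick hypothetical check of what this gives us:

fn main() {
    // unidecode transliterates "Özil" to "Ozil" before we split on whitespace
    let tokens = tokenize("Mesut Özil".to_string());
    assert_eq!(tokens, vec!["Mesut", "Ozil"]);
}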
Stop Words Removal

As we already discussed in the first part, stop words carry very little significance because of their repetitive nature. However, stop words depend on the language of the text we are trying to index, so pinpointing them exactly is complex. For the sake of simplicity, we can remove a few of the most repetitive words of the English language.

pub fn remove_stop_words(words: &Vec<String>) -> Vec<String> {
    let stop_words: Vec<String> = Vec::from(["a".to_string(), "of".to_string()]);
    words
        .iter()
        .filter(|word| {
            let lowercase = word.to_lowercase();
            !stop_words.contains(&lowercase)
        })
        .cloned()
        .collect()
}
Here, we have an array of stop words, and we loop over the input to check whether each input string is a stop word. If not, we add it to a new array and return that array.
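A small illustrative example, assuming the tiny stop word list above:

fn main() {
    let words = vec![
        "A".to_string(),
        "Game".to_string(),
        "of".to_string(),
        "Thrones".to_string(),
    ];
    // "A" and "of" are filtered out (the check lowercases each word first)
    assert_eq!(remove_stop_words(&words), vec!["Game", "Thrones"]);
}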
Stemming

Stemming is a very complex topic on its own, and it also depends on the language of the string. In our example, we will use a very basic stemming algorithm that strips certain suffixes from a word.

pub fn stem(words: &Vec<String>) -> Vec<String> {
    words.iter().map(|word| stem_word(word)).collect()
}

fn stem_word(word: &str) -> String {
    let mut result = String::from(word);
    // Remove plural forms
    if result.ends_with("s") {
        result.pop();
    }
    // Remove common verb suffixes
    if result.ends_with("ed") {
        result.pop();
        result.pop();
    }
    if result.ends_with("ing") {
        result.pop();
        result.pop();
        result.pop();
    }
    // Add more rules as needed
    result
}
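To see what this does (and where it falls short), here is a hypothetical run:

fn main() {
    let words = vec![
        "cats".to_string(),
        "played".to_string(),
        "running".to_string(),
    ];
    // Naive suffix stripping: note that "running" becomes "runn", not "run"
    assert_eq!(stem(&words), vec!["cat", "play", "runn"]);
}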
Finally, we can create a function called analyse, which combines all of these steps and generates a vector of strings, as follows:
use itertools::Itertools; // .unique() comes from the itertools crate

pub fn analyse(input: String) -> Vec<String> {
    let cleaned_sentence: String = cleanup(input);
    let words: Vec<String> = tokenize(cleaned_sentence);
    let without_stop_words: Vec<String> = remove_stop_words(&words);
    let root_words = stem(&without_stop_words);
    root_words
        .into_iter()
        .unique()
        .map(|s| s.to_lowercase())
        .collect()
}
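Putting it all together, a hypothetical call to analyse (from inside the module) behaves like this:

fn main() {
    let tokens = analyse("The cats were playing!".to_string());
    // cleanup drops "!", tokenize splits the words, stemming strips the
    // suffixes, and everything is lowercased at the end. Note that "the"
    // survives because our tiny stop word list only contains "a" and "of".
    assert_eq!(tokens, vec!["the", "cat", "were", "play"]);
}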
It's a good idea to bundle these functions into a separate module. So, we create a new file called semantic_analysis.rs and declare it as a module in our main.rs as mod semantic_analysis;
Now that we have tokens, we are ready to proceed.
Inverted Index
As we discussed in the previous post, we want to store documents as well as tokens in the inverted index. So, the structure of the inverted index depends on what type of data we want to store, but the base remains constant: we will have at least two properties, one for documents and the other for tokens.
struct InvertedIndex {
    documents: HashMap<String, String>,
    tokens: HashMap<String, Vec<String>>
}
The equivalent TypeScript code would be something like this:

type InvertedIndex = {
    documents: Record<string, string>,
    tokens: Record<string, Array<string>>
}
In the code above, the documents map can only hold string data. For example, it can store a plain string as a document. If we need a slightly more complex data structure, for example a key-value pair of strings, we will need to change the type of documents from HashMap<String, String> to HashMap<String, HashMap<String, String>>.

Of course, we could support multiple data structures by using generics, but for the sake of simplicity (while still storing more than plain strings), we will use the variant with HashMap<String, HashMap<String, String>>.
If you don't know what a HashMap<String, String> is, it's basically a set of key-value pairs in which the key is of type String and the value is also of type String. In general, the first type parameter is the type of the key and the second one is the type of the value.
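If that's still abstract, here is a tiny standalone sketch:

use std::collections::HashMap;

fn main() {
    let mut map: HashMap<String, String> = HashMap::new();
    map.insert("name".to_string(), "Mesut Özil".to_string());
    // get() looks a value up by its key and returns an Option
    assert_eq!(map.get("name"), Some(&"Mesut Özil".to_string()));
}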
So, our final struct looks something like the following
use serde::{Deserialize, Serialize};
use std::collections::HashMap;

#[derive(Debug, Serialize, Deserialize)]
pub struct InvertedIndex {
    pub(crate) documents: HashMap<String, HashMap<String, String>>,
    pub(crate) tokens: HashMap<String, Vec<String>>
}
Here, I have used the serde and serde_json crates so that I can serialize and deserialize our indices to JSON and store them for persistence. We won't be using these features right now, but maybe in the future we can build a search engine that actually persists data.
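Just to sketch what that persistence could look like (purely illustrative; index.json is a made-up file name and we don't use this anywhere in this post):

use std::fs;

// Dump the whole index to a JSON file and read it back later
fn save(index: &InvertedIndex) -> std::io::Result<()> {
    let json = serde_json::to_string(index).expect("index should serialize");
    fs::write("index.json", json)
}

fn load() -> std::io::Result<InvertedIndex> {
    let json = fs::read_to_string("index.json")?;
    Ok(serde_json::from_str(&json).expect("file should contain a valid index"))
}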
Methods
For our methods, we want two public methods that can be used to store something to our index and search for stuff in our index.
impl InvertedIndex {
    pub fn search(&self, query: String) -> HashMap<String, &HashMap<String, String>> {
        // body goes here
    }

    pub fn index(&mut self, record: &HashMap<String, String>) -> &InvertedIndex {
        // index the record
        return self;
    }
}
As the method names suggest, the search method will be responsible for searching through our indices and generating the results, whereas the index method will feed a record into our index. Since we have to index documents before we can search them, let's look at the index method first.
pub fn index(&mut self, record: &HashMap<String, String>) -> &InvertedIndex {
    return self;
}
Here, the index method takes a mutable reference to the inverted index along with a reference to a hashmap, which is the key-value pair to be indexed. In JavaScript terms, this is basically a regular method of an object which can modify the contents of that object, and it takes as a parameter the key-value pair that we need to index.
It's pretty easy to add the document. You just generate a unique id for each document and insert the document into the documents hashmap.
pub fn index(&mut self, record: &HashMap<String, String>) -> &InvertedIndex {
    let document_id: String = Uuid::new_v4().to_string();
    let mut cloned_record = record.clone();
    cloned_record.insert("_id".to_string(), document_id.clone());
    self.documents.insert(document_id.clone(), cloned_record);
    return self;
}
We want to make sure that each document has a unique id, so the id is a UUID v4 string generated using the uuid crate (brought in with use uuid::Uuid;).
What about the tokens?
The most important part of indexing is, of course, storing the tokens and the references to the documents. So, after storing the document, we generate tokens from each value of the document we are trying to index. Then we loop through the tokens, and for each generated token we check whether it already exists in the tokens hashmap.

If it exists, we update the token's entry to include the new document id; if it doesn't, we insert a fresh entry containing just this document id. So, our final index function looks something like this.
pub fn index(&mut self, record: &HashMap<String, String>) -> &InvertedIndex {
    // Store the document under a freshly generated id
    let document_id: String = Uuid::new_v4().to_string();
    let mut cloned_record = record.clone();
    cloned_record.insert("_id".to_string(), document_id.clone());
    self.documents.insert(document_id.clone(), cloned_record);

    // Run every value of the record through the semantic analysis pipeline
    let mut tokens: Vec<String> = Vec::new();
    for (_key, value) in record.iter() {
        let words = semantic_analysis::analyse(value.to_string());
        for word in words {
            tokens.push(word);
        }
    }

    // Point each token at this document, deduplicating the id lists
    for token in &tokens {
        if self.tokens.contains_key(token.as_str()) {
            let mut documents = self.tokens.get(token.as_str()).unwrap().clone();
            documents.push(document_id.to_string());
            self.tokens
                .insert(token.to_string(), documents.into_iter().unique().collect());
        } else {
            self.tokens
                .insert(token.to_string(), Vec::from([document_id.to_string()]));
        }
    }
    return self;
}
And just like that, our index method is done.
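To see it in action, here is a hypothetical usage sketch (assuming the struct and its methods live in the same crate):

use std::collections::HashMap;

fn main() {
    let mut index = InvertedIndex {
        documents: HashMap::new(),
        tokens: HashMap::new(),
    };

    let mut record: HashMap<String, String> = HashMap::new();
    record.insert("title".to_string(), "Is anyone here?".to_string());
    index.index(&record);

    // One document was stored, and tokens such as "anyone" and "here"
    // now point at its generated id
    assert_eq!(index.documents.len(), 1);
    assert!(index.tokens.contains_key("here"));
}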
Searching in an Inverted Index
Searching in an inverted index is pretty straightforward. You already have all the data you need in your indices and you are just checking to see if something exists in the HashMap. Let's implement the search method now.
pub fn search(&self, query: String) -> HashMap<String, &HashMap<String, String>> {
    todo!() // TODO: Implement search
}
From the signature of the search method, we can see that we receive a string query and try to return references to the hashmaps (i.e., the documents that are indexed in the inverted index). To start, we create a new mutable HashMap that will hold our search results while we search through our indices. The next step is to analyse the search query just the way we did when indexing: we remove punctuation and special characters, decode Unicode to ASCII, tokenize, remove stop words and stem the tokens to their root words.
pub fn search(&self, query: String) -> HashMap<String, &HashMap<String, String>> {
    let mut results = HashMap::new();
    // Analyse the query with the same pipeline we used for indexing
    let parts = semantic_analysis::analyse(query);
    for part in parts {
        // Each matching token maps to a list of document ids
        let tokens = self.tokens.get(&part);
        match tokens {
            Some(document_ids) => {
                for document_id in document_ids {
                    let document = self.documents.get(document_id);
                    match document {
                        Some(document) => {
                            let id = document.get("_id").unwrap();
                            if results.contains_key(id) {
                                continue;
                            }
                            results.insert(id.clone(), document);
                        }
                        None => (),
                    };
                }
            }
            None => (),
        };
    }
    return results;
}
And there we go. We have a full-text search engine at our disposal now.
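Here is a small end-to-end sketch of how the whole thing could be used (again hypothetical, with a made-up record):

use std::collections::HashMap;

fn main() {
    let mut index = InvertedIndex {
        documents: HashMap::new(),
        tokens: HashMap::new(),
    };

    let mut record: HashMap<String, String> = HashMap::new();
    record.insert("name".to_string(), "Mesut Özil".to_string());
    index.index(&record);

    // The query matches because analyse() transliterated "Özil" to "ozil"
    // while the document was being indexed
    let results = index.search("ozil".to_string());
    assert_eq!(results.len(), 1);
}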
Improvements
This is the bare minimum when we are talking about full-text search; we can only improve from here. Some of the things that could be improved are:
- Support for multiple indices
- Support for complex data structure (not just objects with string key and string values)
- CRUD apis for indices and documents
- Mechanism for connecting with external applications (JSON API or something similar)
However, these things are out of scope for this implementation. If I decide to develop this application further, I'll continue to post updates through this blog.
I have created a GitHub repository that contains the work I have done so far. You can find it here.
If you have any questions or suggestions on anything or if you just want to reach out and say hi, I have my contact links on the website. Please feel free to do so.