Trigrams are a very straightforward solution to this, but they have one major limitation: they break on misspellings and alternative spellings. For example, the words "vodka" and "votka" share no trigrams yet are obviously very similar.
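To see why, here's a quick sketch (the helper is hypothetical, not from any post): the character trigrams of the two words have zero overlap.

```python
def trigrams(word):
    # All overlapping 3-character substrings of the word.
    return {word[i:i + 3] for i in range(len(word) - 2)}

print(trigrams("vodka"))                      # {'vod', 'odk', 'dka'}
print(trigrams("votka"))                      # {'vot', 'otk', 'tka'}
print(trigrams("vodka") & trigrams("votka"))  # set() -- nothing in common
```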
I recently worked on this problem for a project and came up with a more robust solution using pre-computed Levenshtein indexes. You've probably heard of Levenshtein distance as a measure of word similarity: it counts the number of single-character edits required to transform one string into another. For example, "vodka" and "votka" have a Levenshtein distance of 1.
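For concreteness, here's a minimal sketch of that distance computation (the standard Wagner-Fischer dynamic program; the function name is mine, not from the original post):

```python
def levenshtein(a, b):
    # Row-by-row dynamic programming over the edit-distance table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # delete from a
                            curr[j - 1] + 1,           # insert into a
                            prev[j - 1] + (ca != cb))) # substitute
        prev = curr
    return prev[-1]

print(levenshtein("vodka", "votka"))  # 1
```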
Using a related approach (single-character edits), it's possible to build a so-called Levenshtein index.
The major issue with a vanilla Levenshtein index is that it's impractically large. There is a space/time trade-off, however, that makes it much smaller: instead of generating all Levenshtein variations for a word, you only use the set of variations created by deleting single characters from it.
Since that doesn't match all possible variations, we have to compensate somehow. With a full index, "votka" would be in the set of variations for "vodka", which means that if we had misspelled vodka when searching for it we'd still get a match. With our deletions-only index, however, the closest entry we'd get is "voka". To work around this, we generate the deletion-only variations of our search term as well. In this case our search term "vodka" would become "odka", "vdka", "voka", etc., and each of those terms would be matched against the index. That makes the query slower overall, since it contains a number of sub-queries, but in practice it's more than fast enough.
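Putting both halves together, a minimal sketch of a deletions-only index and its query side might look like this (names and structure are my own; the actual implementation described in the post will differ):

```python
from collections import defaultdict

def deletions(word):
    # All strings obtained by deleting exactly one character from the word.
    return {word[:i] + word[i + 1:] for i in range(len(word))}

def build_index(vocabulary):
    # Map each word, and each of its single-deletion variants, back to the word.
    index = defaultdict(set)
    for word in vocabulary:
        for key in {word} | deletions(word):
            index[key].add(word)
    return index

def lookup(index, term):
    # Probe the index with the term itself and all of its deletion variants,
    # then union the candidate sets.
    candidates = set()
    for key in {term} | deletions(term):
        candidates |= index.get(key, set())
    return candidates

index = build_index(["vodka", "whiskey", "gin"])
print(lookup(index, "votka"))  # {'vodka'} -- matched via the shared variant 'voka'
```

Note this covers single-character substitutions because both the indexed word and the misspelled query reduce to the same deletion variant ("voka" here); deeper edits would need deeper deletion sets.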
I wrote a meandering blog post about it here: http://lattejed.com/writing-an-embedded-full-text-search-eng...

There are some additional ideas in there, such as how to make a Levenshtein index handle prefix matches better (e.g., for incremental search), that may be helpful if you need to implement something like this.
If you do read that, I'd like to point out that I no longer recommend LevelDB. That doesn't change anything of importance in the post, though.
Super interesting approach, thanks for contributing it.
From what I see in your article, you used this for searching through mail, so I guess it works on a pretty decent volume of data, but just to confirm: do you think it's viable on something like GitLab's multi-resource full-text search? For reference, what is the size of this index for your mailbox? (Assuming it's a usual contractor dev mailbox, with mainly lots of notification mails and discussions with daily customers.)
PS: don't worry about the other reactions. Users have invaded our dev world nowadays; that's the sad side of having attractive salaries.