
Trigrams are a very straightforward solution to this, but they have one major limitation: they break with misspellings and alternative spellings. For example, the words "vodka" and "votka" share no trigrams yet are obviously very similar.

I recently worked on this problem for a project and came up with a more robust solution using pre-computed Levenshtein indexes. You've probably heard of Levenshtein distance as a measure of word similarity. It counts the number of single-character edits that are required to transform one string into another. For example "vodka" and "votka" have a Levenshtein distance of 1.
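To make the metric concrete, here's a minimal sketch of the standard dynamic-programming Levenshtein distance (this is the textbook algorithm, not code from the post):

```python
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = cur
    return prev[-1]

print(levenshtein("vodka", "votka"))  # 1
```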

Using a related approach (single character edits) it's possible to build a so-called Levenshtein index.

The major issue with a vanilla Levenshtein index is that it's impractically large. There is a space/time trade-off, however, that makes it much smaller. The basic idea is that instead of generating all Levenshtein variations for a word, you use only the set of variations created by deleting single characters from the word.

Since that doesn't match all possible variations, we have to compensate somehow. For example, with a full index, "votka" would be in the set of variations for "vodka", which means that if we had misspelled vodka when searching for it we'd still get a match. With our deletions-only index, however, we'd get "voka" as our closest match. To work around this, we generate the deletion-only variations of our search term as well. In this case our search term "vodka" would become "odka", "vdka", "voka", etc., and each of those terms would be matched against the index. That means our query is slower overall, as it contains a number of sub-queries, but in practice it's more than fast enough.
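The scheme above can be sketched in a few lines. This is my own illustrative Python (the helper names are mine, not from the post): single-deletion variants are indexed for every word, and a query expands the search term into its own variants so that one edit on either side still meets in the middle.

```python
from collections import defaultdict

def deletions(word: str) -> set[str]:
    # every variant produced by deleting exactly one character
    return {word[:i] + word[i + 1:] for i in range(len(word))}

def build_index(words):
    index = defaultdict(set)
    for w in words:
        index[w].add(w)          # the word itself
        for v in deletions(w):
            index[v].add(w)      # each single-deletion variant points back
    return index

def lookup(index, term):
    # query with the term plus its own deletion variants
    candidates = set()
    for key in {term} | deletions(term):
        candidates |= index.get(key, set())
    return candidates

idx = build_index(["vodka", "whiskey"])
print(lookup(idx, "votka"))  # {'vodka'}
```

Here "votka" matches because both it and "vodka" reduce to the shared key "voka".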

I wrote a meandering blog post about it here: http://lattejed.com/writing-an-embedded-full-text-search-eng...

There are some additional ideas in there, such as how to make a Levenshtein index handle prefix matches better (e.g., incremental search), that may be helpful if you need to implement something like this.

If you do read that I'd like to point out that I no longer recommend LevelDB. That doesn't change anything of importance in the post though.




vodka and votka do share trigrams: __v, _vo, and ka_, as the PG rules for generating them add two spaces at the beginning and one at the end.
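A quick sketch of that padding rule (my own approximation of pg_trgm's behavior for a single lower-cased word, not the actual PostgreSQL source):

```python
def trigrams(word: str) -> set[str]:
    # pg_trgm pads each word with two leading spaces and one
    # trailing space before taking every 3-character window
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

shared = trigrams("vodka") & trigrams("votka")
print(sorted(shared))  # ['  v', ' vo', 'ka ']
```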


Super interesting approach, thanks for contributing it.

From what I see in your article, you used this for searching through mail, so I guess it works on a pretty decent volume of data, but just to confirm: do you think it's viable on something like GitLab's multi-resource full text search? For reference, what is the size of this index for your mailbox? (Assuming it's a usual contractor dev mailbox, with mainly lots of notification mails and discussions with daily customers.)

PS: don't worry about other reactions; users have invaded our dev world nowadays, the sad side effect of attractive salaries


How about some in-database spelling auto-correction:

http://blog.databasepatterns.com/2014/08/postgresql-spelling...


What do you recommend instead of LevelDB?


Or, you know, use a search engine. There are countless little things like that, and out-of-the-box search in DBs will never catch up.


The resulting product is fast, is easy to tune (for my particular application), and handles every edge case and language I've thrown at it.

Do you have any specific concern you'd like to share or are you just generally dismissive of other people's hard work?


I'm dismissive of pointless hard work.



