Categories

  • EN
  • R-language
  • data analysis
  • quant

At work, the client requested, if existing search engine could accept singular and plural forms equally, e. g. “partner” and “partners” would lead to the same result.

The first option - stemming. In that case, search engine would use root of a word, e. g. “partn”. However, stemming has many weaknesses: two different words might have same root, a user can misspell the root of the word, except English and few others languages it is not that trivial to implement stemming.

Levenshtein distance comes as the second option. The algorithm is simple - you have two words and you calculate the difference between them. You can insert, delete or replace any character, but it will cost you. Let’s imagine, an user enters “Levenstin distances” into search engine and expects to find revalent information. However, he just made 2 errors by misspeling the author’s name and he used plural form of “distance”. If search engine accepts 3 errors - the user will get relevant information.

The challenge comes, when you have a dictionary of terms (e. g. more that 1 mil.) and you want to get similar terms based on Levenshtein distance. You can visit every entry in the dictionary (very costly) or you can push dictionary into the trie. Do you need a proof for the cost? There we go:

Red color indicates the performance of the search, when all terms are in the trie, green - simple dictionary.

Now we come to the second part of the post - why to bother and plot such graphs, if we could check few entries to determine average time and the winner? The reason is simple - we trust in God, all others must bring data. To say it differently - while profiling the code, you should be interested in average time AND variation. As you can see in the graph above, variation of the blue color is very small - it takes approximately the same time to scan whole dictionary. However, red has higher variation - the result can take for while or it can finish just at the beginning, but overall it works faster. Now, imagine, that a programmer wants to define, which implementation A or B for volatile cache is much faster. Let’s assume, that big O notion is not going to help and she conducts 2 test for A and 2 for B. While running test A, cache size expands, while B - shrinks. As the result, B wins over A and she makes wrong choice. However, her colleague claims, that despite A has greater volatility, it is much faster and she tried with 500 queries! Whom should I trust?

I use this piece for code profiling:

simple=read.table('simple.txt')
node=read.table('node.txt')

simple=cbind(simple,as.character(c('simple')))
colnames(simple)=c('time','type')
node=cbind(node,c('node'))
colnames(node)=c('time','type')

rez=data.frame(rbind(simple, node))

require(ggplot2)

ggplot(rez,aes(time,fill=type))+geom_density(alpha=0.6,size=1.3)+scale_x_log10()

The data, C++ code for Levenshtein distance and trie can be find on GitHub.

I found this source very useful: http://stevehanov.ca/blog/index.php?id=114