They told me the solution is too complitated. How to process synonyms at Language Processing?

It was sunny day with opportunity to find a job in Frontend Dev and Machine Learning area.

To get (probably) a job of course – you need to solve The Test Task.

Imagine you have to write a search engine, with:

  • data are given from api, it is a separated list with tags, but tags are at low range of given information
  • search engine has to be aware of different words in comparison to API tags so the synonyms has to be recognized
  • it must be fast

I started to search how to recognize a string of words in comparison to other one.
Result has to be a number at range <0, 1> which describes similarity and so the threshold should be defined.

I have found an algorithm called: cosine similarity.

The implementation should be in JavaScript so I will show code at that language.

Basically cosine similarity in a two strings is build with prepare step and measure step.

With:

  • Given strings:
    • “Hello at banit.pl”
    • “Welcome at banit.pl”

We can observe:

  • First string is built with three words
  • Second string is also build with three words
  • In comparison of these string the is only one difference – “Hello” and “Welcome”

Prepare step should generate a pair of vectors, where in every vector every word exists and every vector separately describes if word exist in opposite one. So, the result should looks like:

termFreqMapToVector [ 1, 1, 1, 0 ] [ 0, 1, 1, 1 ]

with given dictionaries:

dict { Hello: true, at: true, 'banit.pl': true, Welcome: true }

And you’ll see that 1 != 0 and 0 != 1 at words “Hello” and “Welcome”.

A result function is defined by:

function cosineSimilarity (vecA, vecB) {
     return vecDotProduct(vecA, vecB) / (vecMagnitude(vecA) * vecMagnitude(vecB))
}

With helpers:

function vecDotProduct (vecA, vecB) {
    let product = 0
    for (let i = 0; i < vecA.length; i++) {
        product += vecA[i] * vecB[i]
    }
    return product
}

function vecMagnitude (vec) {
    let sum = 0
    for (let i = 0; i < vec.length; i++) {
        sum += vec[i] * vec[i]
    }
    return Math.sqrt(sum)
}

My idea was to get a pairs connected to words “Hello” and “Welcome” a defined weights which the result will be more closed to 1 and not to be:

Comparison result: 0.6666666666666667

So, if the synonym means exactly or more or less the given word – termFreqMapToVector should have three elements:

termFreqMapToVector [ 0.8, 1, 1 ] [ 1, 1, 1 ]

And you see that “Welcome” is similar to “Hello” at weight = 0.8.

So the result is:

Comparison result: 0.9949366763261821

Check the code.

And for sure – this is not complicated as they told. Do you extended it and patent it, morons?

Cheers!

Leave a Reply