Java - check word similarity (fuzzy comparison with bigrams)

Created by:
Root-ssh

In this article, we would like to show how to check word similarity in Java.

The logic below:

  1. calculates the bigrams of each word,
  2. counts bigram hits (pairs of equal bigrams) to find similarity,
  3. divides the doubled hit count by the total number of bigrams to calculate the final word similarity.
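
The three steps above can be traced by hand for a single pair of words. The sketch below is only an illustration: the class BigramSteps and its method names are hypothetical, and it assumes the same convention as the FuzzyUtils listing in this article (words are lowercased and the last character is kept as a one-letter token).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, used only to trace the three steps.
class BigramSteps {

    // Step 1: split a word into lowercase bigrams; the last character
    // is kept as a one-letter token (assumed convention).
    static List<String> bigrams(String word) {
        String code = word.toLowerCase();
        List<String> result = new ArrayList<>();
        for (int i = 0; i < code.length(); ++i) {
            result.add(code.substring(i, Math.min(i + 2, code.length())));
        }
        return result;
    }

    // Step 2: count every pair of equal bigrams.
    static int countHits(List<String> a, List<String> b) {
        int hits = 0;
        for (String x : a) {
            for (String y : b) {
                if (x.equals(y)) {
                    hits += 1;
                }
            }
        }
        return hits;
    }

    // Step 3: divide the doubled hit count by the total number of bigrams.
    static double similarity(String a, String b) {
        List<String> aBigrams = bigrams(a);
        List<String> bBigrams = bigrams(b);
        int total = aBigrams.size() + bBigrams.size();
        if (total == 0) {
            return 0.0;
        }
        return (2.0 * countHits(aBigrams, bBigrams)) / total;
    }

    public static void main(String[] args) {
        List<String> a = bigrams("John1");
        List<String> b = bigrams("John2");
        System.out.println(a);                              // [jo, oh, hn, n1, 1]
        System.out.println(b);                              // [jo, oh, hn, n2, 2]
        System.out.println(countHits(a, b));                // 3
        System.out.println(similarity("John1", "John2"));   // 0.6
    }
}
```

Three of the ten bigrams match in pairs (jo, oh, hn), so the score is 2 * 3 / 10 = 0.6.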

The result of the checkSimilarity() function below indicates how similar two words are.

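The similarity score starts at 0, where:

  • 0 - means: the words are totally different (no bigrams in common),
  • between 0 and 1 - means: the words contain similar parts,
  • 1 - means: the words are the same (letter case is ignored),
  • >1 - may occur when a shared bigram repeats within the words, because every pair of equal bigrams counts as a hit.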
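
The range of the score can be checked on a few inputs. The following is a compact sketch, not the article's implementation: the class ScoreRange and the method score are hypothetical names, and the method is assumed to follow the same counting rule as checkSimilarity() (lowercased bigrams plus a trailing one-letter token, every pair of equal bigrams counted as a hit).

```java
// Hypothetical class that reproduces the scoring rule in compact form.
class ScoreRange {

    static double score(String a, String b) {
        String x = a.toLowerCase();
        String y = b.toLowerCase();
        if (x.isEmpty() || y.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (int i = 0; i < x.length(); ++i) {
            // Bigram starting at i; the last one is a single character.
            String p = x.substring(i, Math.min(i + 2, x.length()));
            for (int j = 0; j < y.length(); ++j) {
                String q = y.substring(j, Math.min(j + 2, y.length()));
                if (p.equals(q)) {
                    hits += 1;
                }
            }
        }
        // Each word yields as many tokens as it has characters.
        return (2.0 * hits) / (x.length() + y.length());
    }

    public static void main(String[] args) {
        System.out.println(score("Ann", "Matt"));     // 0.0
        System.out.println(score("John1", "John2"));  // 0.6
        System.out.println(score("Chris", "chris"));  // 1.0
        System.out.println(score("aaa", "aaa"));      // 1.6666666666666667
    }
}
```

In the last call the bigram "aa" occurs twice in each word, which gives 4 hits for "aa" plus 1 hit for the trailing "a", so the score is 2 * 5 / 6 and exceeds 1.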

This kind of approach may be applied in fuzzy search.
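
As a rough illustration of fuzzy search, a misspelled query can be matched against a list of known words by picking the candidate with the highest score. The sketch below is an assumption about how such a search might look: FuzzySearchSketch, score and bestMatch are hypothetical names, and score is assumed to follow the same bigram rule as checkSimilarity().

```java
import java.util.List;

// Hypothetical sketch of a fuzzy lookup built on the bigram score.
class FuzzySearchSketch {

    // Assumed scoring rule: lowercased bigrams plus a trailing
    // one-letter token; every pair of equal bigrams is one hit.
    static double score(String a, String b) {
        String x = a.toLowerCase();
        String y = b.toLowerCase();
        if (x.isEmpty() || y.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (int i = 0; i < x.length(); ++i) {
            String p = x.substring(i, Math.min(i + 2, x.length()));
            for (int j = 0; j < y.length(); ++j) {
                String q = y.substring(j, Math.min(j + 2, y.length()));
                if (p.equals(q)) {
                    hits += 1;
                }
            }
        }
        return (2.0 * hits) / (x.length() + y.length());
    }

    // Returns the candidate with the highest score,
    // or null when no candidate scores above 0.
    static String bestMatch(String query, List<String> candidates) {
        String best = null;
        double bestScore = 0.0;
        for (String candidate : candidates) {
            double s = score(query, candidate);
            if (s > bestScore) {
                bestScore = s;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> names = List.of("Apple", "Google", "Oracle");
        System.out.println(bestMatch("Gogle", names));  // Google
        System.out.println(bestMatch("xyz", names));    // null
    }
}
```

A real search would probably also apply a minimum score threshold, so that weak matches such as a single shared ending are not returned.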

Practical example

Program.java file:

package com.example;

// FuzzyUtils is located in the same package, so no import is needed.
public class Program {

    public static void main(String[] args) {

        System.out.println(FuzzyUtils.checkSimilarity("Chris",  "Chris"));  // 1.0
        System.out.println(FuzzyUtils.checkSimilarity("John1",  "John2"));  // 0.6
        System.out.println(FuzzyUtils.checkSimilarity("Google", "Gogle"));  // 0.9090909090909091
        System.out.println(FuzzyUtils.checkSimilarity("Ann",    "Matt" ));  // 0.0
    }
}

Output:

1.0
0.6
0.9090909090909091
0.0

 

FuzzyUtils.java file:

package com.example;

import java.util.Objects;

public class FuzzyUtils {

    // Splits a word into lowercase bigrams. The last character is kept
    // as a one-letter item, so the array has one item per character.
    public static String[] createBigram(String word) {
        String code = word.toLowerCase();
        int length = code.length();
        if (length == 0) {
            return new String[0];
        }
        String[] vector = new String[length];
        int limit = length - 1;
        for (int i = 0; i < limit; ++i) {
            vector[i] = code.substring(i, i + 2);
        }
        vector[limit] = code.substring(limit, length);
        return vector;
    }

    // Returns 0.0 when the words share no bigrams;
    // higher values mean more shared bigrams.
    public static double checkSimilarity(String a, String b) {
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        String[] aBigram = createBigram(a);
        String[] bBigram = createBigram(b);
        // Every pair of equal bigrams counts as one hit.
        int hits = 0;
        for (int x = 0; x < aBigram.length; ++x) {
            for (int y = 0; y < bBigram.length; ++y) {
                if (Objects.equals(aBigram[x], bBigram[y])) {
                    hits += 1;
                }
            }
        }
        if (hits > 0) {
            // Doubled hits divided by the total number of bigrams in both words.
            int total = aBigram.length + bBigram.length;
            return (2.0 * hits) / total;
        }
        return 0.0;
    }
}

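Note: do not compare sentences or whole texts using the above function. Every pair of equal bigrams counts as a hit, so common bigrams that repeat in longer texts can inflate the score and lead to comparison mistakes.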

