Java - check word similarity (fuzzy comparison with bigrams)
In this article, we would like to show how to check word similarity in Java.
The logic below:
- calculates the bigrams of each word,
- counts the bigram hits between the two words,
- divides the doubled hit count by the total number of bigrams in both words to get the final similarity.
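As a worked illustration of the three steps above, the sketch below prints the intermediate values for one pair of words. The class and method names (BigramWalkthrough, bigrams, countHits) are hypothetical and used only for this illustration. Like the article's createBigram(), it treats the last character of a word as one extra single-character piece.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BigramWalkthrough {

    // Step 1: split a word into overlapping two-character pieces and append
    // the last character as a final single-character piece.
    public static List<String> bigrams(String word) {
        String code = word.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < code.length(); ++i) {
            result.add(code.substring(i, i + 2));
        }
        if (!code.isEmpty()) {
            result.add(code.substring(code.length() - 1));
        }
        return result;
    }

    // Step 2: count every pair of equal pieces between the two lists.
    public static int countHits(List<String> a, List<String> b) {
        int hits = 0;
        for (String x : a) {
            for (String y : b) {
                if (x.equals(y)) {
                    hits += 1;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> a = bigrams("Google");
        List<String> b = bigrams("Gogle");
        int hits = countHits(a, b);
        System.out.println(a);    // [go, oo, og, gl, le, e]
        System.out.println(b);    // [go, og, gl, le, e]
        System.out.println(hits); // 5
        // Step 3: doubled hits divided by the total number of pieces (6 + 5).
        System.out.println(2.0 * hits / (a.size() + b.size())); // 0.9090909090909091
    }
}
```

Here 5 hits out of 6 + 5 pieces give 2 * 5 / 11, which is the 0.909... score reported for this pair in the article's example.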
The result of the checkSimilarity() function below indicates how similar two words are.
The similarity starts from 0, where:
- 0 means the words are totally different,
- values between 0 and 1 mean the words share some bigrams,
- 1 or more means the words are the same or contain similar parts (values above 1 are possible when a bigram repeats inside a word).
This kind of approach may be applied in fuzzy search.
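To make the fuzzy search idea concrete, here is a minimal sketch that keeps only the candidates whose score against a query reaches a threshold. The class FuzzySearchSketch, the search() helper, the candidate list and the 0.5 threshold are all assumptions made for this illustration. The similarity() method is a compact restatement of the bigram score described above, not the article's own code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class FuzzySearchSketch {

    // Piece starting at index i: two characters, or one at the end of the word.
    private static String piece(String s, int i) {
        return s.substring(i, Math.min(i + 2, s.length()));
    }

    // Bigram score: doubled number of equal piece pairs divided by the
    // total number of pieces in both words.
    public static double similarity(String a, String b) {
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        String x = a.toLowerCase(Locale.ROOT);
        String y = b.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (int i = 0; i < x.length(); ++i) {
            for (int j = 0; j < y.length(); ++j) {
                if (piece(x, i).equals(piece(y, j))) {
                    hits += 1;
                }
            }
        }
        return 2.0 * hits / (x.length() + y.length());
    }

    // Hypothetical helper: returns the candidates scoring at least `threshold`.
    public static List<String> search(String query, List<String> candidates, double threshold) {
        List<String> matches = new ArrayList<>();
        for (String candidate : candidates) {
            if (similarity(query, candidate) >= threshold) {
                matches.add(candidate);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Google", "Goggles", "Apple", "Golang");
        System.out.println(search("Gogle", words, 0.5)); // [Google, Goggles]
    }
}
```

The threshold is a tuning choice: a lower value returns more loosely related words, a higher value returns only close matches.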
Practical example
Program.java file:
package com.example;

public class Program {

    public static void main(String[] args) {
        System.out.println(FuzzyUtils.checkSimilarity("Chris", "Chris"));   // 1.0
        System.out.println(FuzzyUtils.checkSimilarity("John1", "John2"));   // 0.6
        System.out.println(FuzzyUtils.checkSimilarity("Google", "Gogle"));  // 0.9090909090909091
        System.out.println(FuzzyUtils.checkSimilarity("Ann", "Matt"));      // 0.0
    }
}
Output:
1.0
0.6
0.9090909090909091
0.0
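FuzzyUtils.java file:
package com.example;

import java.util.Locale;
import java.util.Objects;

public class FuzzyUtils {

    // Splits a word into lowercase two-character pieces.
    // The last character is added as a final one-character piece,
    // so the result has as many elements as the word has characters.
    public static String[] createBigram(String word) {
        // Locale.ROOT keeps lowercasing independent of the default locale.
        String code = word.toLowerCase(Locale.ROOT);
        int length = code.length();
        if (length == 0) {
            return new String[0];
        }
        String[] vector = new String[length];
        int limit = length - 1;
        for (int i = 0; i < limit; ++i) {
            vector[i] = code.substring(i, i + 2);
        }
        vector[limit] = code.substring(limit, length);
        return vector;
    }

    // Returns 0.0 for words without common bigrams and 1.0 for identical
    // words without repeated bigrams; repeated bigrams can push the result above 1.
    public static double checkSimilarity(String a, String b) {
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        String[] aBigram = createBigram(a);
        String[] bBigram = createBigram(b);
        // Count every pair of equal bigrams between the two words.
        int hits = 0;
        for (int x = 0; x < aBigram.length; ++x) {
            for (int y = 0; y < bBigram.length; ++y) {
                if (Objects.equals(aBigram[x], bBigram[y])) {
                    hits += 1;
                }
            }
        }
        if (hits > 0) {
            // Total number of bigrams in both words.
            int total = aBigram.length + bBigram.length;
            return (2.0 * hits) / (double) total;
        }
        return 0.0;
    }
}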
Note: do not use the above function to compare sentences or whole texts, as it may give misleading results.
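One plausible reason for this caution (an interpretation, not something stated above) is that every pair of equal bigrams counts as a hit, so a bigram that repeats inside the input is counted several times. The sketch below reproduces the same score from per-bigram occurrence counts to make that visible. The class RepeatedBigramDemo and its method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class RepeatedBigramDemo {

    // Counts how often each piece (two characters, or the single last
    // character) occurs in the lowercased input.
    public static Map<String, Integer> pieceCounts(String word) {
        String code = word.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < code.length(); ++i) {
            String piece = code.substring(i, Math.min(i + 2, code.length()));
            counts.merge(piece, 1, Integer::sum);
        }
        return counts;
    }

    // Same score as the nested-loop version: a piece occurring m times in a
    // and n times in b contributes m * n hits.
    public static double similarity(String a, String b) {
        Map<String, Integer> countsA = pieceCounts(a);
        Map<String, Integer> countsB = pieceCounts(b);
        int hits = 0;
        int sizeA = 0;
        for (Map.Entry<String, Integer> entry : countsA.entrySet()) {
            sizeA += entry.getValue();
            hits += entry.getValue() * countsB.getOrDefault(entry.getKey(), 0);
        }
        int sizeB = 0;
        for (int n : countsB.values()) {
            sizeB += n;
        }
        return hits == 0 ? 0.0 : 2.0 * hits / (sizeA + sizeB);
    }

    public static void main(String[] args) {
        System.out.println(similarity("Chris", "Chris"));   // 1.0, no repeated pieces
        System.out.println(similarity("banana", "banana")); // about 1.67, "an" and "na" repeat
        System.out.println(similarity("aaaa", "aa"));       // about 1.33, although the inputs differ
    }
}
```

Longer inputs such as sentences tend to contain more repeated bigrams, so their scores are harder to interpret than scores for single words.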