EN
Java - check words similarity (fuzzy compare with bigrams)
8 points
In this article, we would like to show how to check words similarity in Java.
Below logic:
- calculates words bigrams,
- counts bigram hits to find similarity,
- divides hits by bigrams to calculate final words similarity.
Below checkSimilarity()
function result indicates how two words are similar.
Similarity measured is from 0
, where:
0
- means: the worlds are totally different,>=1
- means: the words are the same or contain similar part.
That kind of approach is may be applied in fuzzy search.
Program.java
file:
xxxxxxxxxx
1
package com.example;
2
3
public class Program {
4
5
public static void main(String[] args) {
6
7
System.out.println(FuzzyUtils.checkSimilarity("Chris", "Chris")); // 1
8
System.out.println(FuzzyUtils.checkSimilarity("John1", "John2")); // 0.6
9
System.out.println(FuzzyUtils.checkSimilarity("Google", "Gogle")); // 0.9090909090909091
10
System.out.println(FuzzyUtils.checkSimilarity("Ann", "Matt" )); // 0
11
}
12
}
Output:
xxxxxxxxxx
1
1.0
2
0.6
3
0.9090909090909091
4
0.0
FuzzyUtils.java
file:
xxxxxxxxxx
1
package com.example;
2
3
import java.util.Objects;
4
5
public class FuzzyUtils {
6
7
public static String[] createBigram(String word) {
8
int length = word.length();
9
if (length == 0) {
10
return new String[0];
11
}
12
String code = word.toLowerCase();
13
String[] vector = new String[length];
14
int limit = length - 1;
15
for (int i = 0; i < limit; ++i) {
16
vector[i] = code.substring(i, i + 2);
17
}
18
vector[limit] = code.substring(limit, length);
19
return vector;
20
}
21
22
public static double checkSimilarity(String a, String b) {
23
if (a.isEmpty() || b.isEmpty()) {
24
return 0.0;
25
}
26
String[] aBigram = createBigram(a);
27
String[] bBigram = createBigram(b);
28
int hits = 0;
29
for (int x = 0; x < aBigram.length; ++x) {
30
for (int y = 0; y < bBigram.length; ++y) {
31
if (Objects.equals(aBigram[x], bBigram[y])) {
32
hits += 1;
33
}
34
}
35
}
36
if (hits > 0) {
37
int union = aBigram.length + bBigram.length;
38
return (2.0 * hits) / (double)union;
39
}
40
return 0.0;
41
}
42
}
Note: do not compare sentences or whole texts using the above function - it may lead to comparison mistakes.