Java - i18n equivalent for \w in regular expressions (i18n word character matching)
In this short article, we would like to show how to replace the \w
rule with one that matches i18n word characters in Java.
By default, \w is equivalent to [a-zA-Z_0-9], so it matches only ASCII letters, digits and the underscore.
To match i18n word characters we should use:
[\p{L}_\p{N}]
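To make the difference concrete, the following minimal sketch runs both rules over the same text. The class name `WordClassComparison`, the helper `findAll` and the sample text are made up for this illustration; only `Pattern` and `Matcher` come from the standard library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordClassComparison {

    // Hypothetical helper: collects every match of the given regex in the text.
    static List<String> findAll(String regex, String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "żółty red_1";

        // Default \w is ASCII-only, so the Polish letters ż, ó, ł are skipped.
        System.out.println(findAll("\\w+", text));              // prints [ty, red_1]

        // The i18n rule matches letters and digits from any script.
        System.out.println(findAll("[\\p{L}_\\p{N}]+", text));  // prints [żółty, red_1]
    }
}
```

With the default rule the word żółty is cut down to "ty", while the i18n rule returns it whole.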
In this section, the program below iterates through the text, finding i18n word characters grouped into words.
Each line printed in the output represents a single matched word.
package com.example;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Program {

    public static void main(String[] args) {

        Pattern pattern = Pattern.compile("[\\p{L}_\\p{N}]+"); // i18n equivalent for \w

        String text = "日本 żółty Россия red";
        Matcher matcher = pattern.matcher(text);

        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}
Output:
日本
żółty
Россия
red
Note: the above rule can have problems with some scripts/alphabets, e.g. Hebrew, where vowel points are combining marks (\p{M}) and are matched by neither \p{L} nor \p{N}.
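The problem mentioned in the note can be reproduced with a pointed Hebrew word. The sketch below is an illustration only and assumes the issue is caused by combining marks. It compares the rule from this article with two possible alternatives: adding \p{M} to the character class, or using \w together with the (?U) embedded flag (the inline form of Pattern.UNICODE_CHARACTER_CLASS, available since Java 7). The class name `HebrewMarksDemo`, the helper `countMatches` and the sample word are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HebrewMarksDemo {

    // Hypothetical helper: counts how many separate matches the regex finds.
    static int countMatches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "shalom" written with vowel points; the points are combining marks (category Mn).
        String word = "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD";

        // Rule from the article: the marks interrupt the match, so one word becomes 3 pieces.
        System.out.println(countMatches("[\\p{L}_\\p{N}]+", word));        // prints 3

        // Alternative 1 (assumption): also accept combining marks.
        System.out.println(countMatches("[\\p{L}\\p{M}_\\p{N}]+", word));  // prints 1

        // Alternative 2 (assumption): Unicode-aware \w via the (?U) flag.
        System.out.println(countMatches("(?U)\\w+", word));                // prints 1
    }
}
```

Which alternative fits depends on the input: the \p{M} variant is the smaller change to the rule above, while (?U) also changes the meaning of other predefined classes such as \d and \s in the same pattern.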