Intl.Segmenter API
Have you ever tried to split a text into parts? Sure, there is string.split(separator) for that.
So you can just easily split text along whitespace.
"When in  doubt, don't! Just do it.".split(" ");
[ "When", "in", "", "doubt,", "don't!", "Just", "do", "it." ]
Even this little example shows that "just easily" often underestimates a problem.
As you can see, punctuation marks stay glued to the words. But you could easily filter them out.
And the typo of two spaces between words creates an extra, empty array element. But you could easily collapse them.
And what if people start using line breaks, or the em dash (U+2014) character? You could easily fix your code.
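To make that pile of workarounds concrete, here is a minimal sketch of what such "easy" fixes tend to look like. The regular expressions are my own assumptions for illustration, not a recommended solution; they only cover whitespace runs and a handful of trailing punctuation marks.

```javascript
// Naive word splitting: every "easy fix" adds another special case.
const text = "When in  doubt, don't!\nJust do it.";

const naiveWords = text
  .split(/\s+/) // treat runs of spaces, tabs and line breaks as one separator
  .map((word) => word.replace(/[.,!?;:]+$/, "")) // strip trailing punctuation
  .filter((word) => word.length > 0); // drop leftovers that are now empty

console.log(naiveWords);
// [ "When", "in", "doubt", "don't", "Just", "do", "it" ]
```

It works for this input, but quotation marks, dashes or other languages would each need yet another rule.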
And if you want to separate sentences, similar problems arise. They may be separated by colons or exclamation marks. And did you consider question marks? Or dots inside a sentence, as in "Do it 10.000 times!"?
Well, enough with the constructed problems. There is a browser API for that!
The Intl.Segmenter API supports meaningful, locale-aware segmentation of sentences and words.
const myString = "When in doubt, don't! Just do it. Do it 10.000 times!";
const segmenter = new Intl.Segmenter("en", { granularity: "sentence" });
const segments = segmenter.segment(myString);
You construct a segmenter by giving it a locale and a granularity. After that, you can pass strings to its segment() method to split them. The result is an iterable Segments object. Converting it into an array reveals that each part is an object containing the found segment, its index in the input, and the input itself (the whole string).
const segmentArray = Array.from(segments);
// 👇
[
{
"segment": "When in doubt, don't! ",
"index": 0,
"input": "When in doubt, don't! Just do it. Do it 10.000 times!"
},
{
"segment": "Just do it. ",
"index": 22,
"input": "When in doubt, don't! Just do it. Do it 10.000 times!"
},
{
"segment": "Do it 10.000 times!",
"index": 34,
"input": "When in doubt, don't! Just do it. Do it 10.000 times!"
}
]
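You do not have to convert to an array first. As a small sketch: the object returned by segment() is iterable, so for...of works directly, and it also offers a containing() method that looks up the segment covering a given code unit index. The index 25 below is just an arbitrary position I picked inside the second sentence.

```javascript
const text = "When in doubt, don't! Just do it. Do it 10.000 times!";
const sentenceSegmenter = new Intl.Segmenter("en", { granularity: "sentence" });
const sentenceSegments = sentenceSegmenter.segment(text);

// Iterate directly; each item has `segment`, `index` and `input`.
for (const { segment, index } of sentenceSegments) {
  console.log(index, JSON.stringify(segment));
}

// Look up the sentence that covers a given position in the string.
const hit = sentenceSegments.containing(25);
console.log(hit.segment); // "Just do it. "
```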
To get useful parts, we need to map that:
const myString = "When in doubt, don't! Just do it. Do it 10.000 times!";
const segmenter = new Intl.Segmenter("en", { granularity: "sentence" });
const segmentsIterator = segmenter.segment(myString);
const segments = Array.from(segmentsIterator).map((segment) => segment.segment);
// 👇
[
"When in doubt, don't! ",
"Just do it. ",
"Do it 10.000 times!"
]
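Note that each sentence keeps its trailing whitespace. If you want clean sentences, a small helper can trim them. This is only a sketch; splitSentences is a hypothetical name of mine, not part of the API.

```javascript
// Hypothetical helper: split text into trimmed sentences for a given locale.
function splitSentences(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "sentence" });
  return Array.from(segmenter.segment(text), ({ segment }) => segment.trim())
    .filter((sentence) => sentence.length > 0); // skip whitespace-only segments
}

console.log(splitSentences("When in doubt, don't! Just do it. Do it 10.000 times!"));
// [ "When in doubt, don't!", "Just do it.", "Do it 10.000 times!" ]
```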
As I said, the Intl.Segmenter API splits not only sentences but words as well:
const myString = "When in doubt, don't! Just do it. Do it 10.000 times!";
const segmenter = new Intl.Segmenter("en", { granularity: "word" });
const segmentsIterator = segmenter.segment(myString);
const segments = Array.from(segmentsIterator).
map((segment) => segment.segment);
// 👇
[
"When",
" ",
"in",
" ",
"doubt",
",",
" ",
"don't",
"!",
" ",
"Just",
" ",
"do",
" ",
"it",
".",
" ",
"Do",
" ",
"it",
" ",
"10.000",
" ",
"times",
"!"
]
As you can see, whitespace and punctuation are still included. You have to filter them out. But the segmenter helps you, because with word granularity each segment object has the property isWordLike.
const wordSegments = Array.from(segmentsIterator).
filter((segment) => segment.isWordLike).
map((segment) => segment.segment);
// 👇
[
"When",
"in",
"doubt",
"don't",
"Just",
"do",
"it",
"Do",
"it",
"10.000",
"times"
]
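Building on isWordLike, counting words or building a simple word frequency table becomes short. The following is a sketch under my own assumptions: countWords is a made-up helper name, and lowercasing with toLocaleLowerCase is just one possible way to normalize.

```javascript
// Hypothetical helper: count how often each word occurs in a text.
function countWords(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const counts = new Map();
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    if (!isWordLike) continue; // skip whitespace and punctuation
    const word = segment.toLocaleLowerCase(locale);
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return counts;
}

const wordCounts = countWords("When in doubt, don't! Just do it. Do it 10.000 times!");
console.log(wordCounts.get("do")); // 2
console.log(wordCounts.get("it")); // 2
console.log(wordCounts.size); // 9
```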
The coolest part is the Intl part, because this works for other languages (and punctuation systems) as well.
const myString = "私の猫の名前はベティです。彼女はとてもかわいいです！";
const segmenter = new Intl.Segmenter("ja", { granularity: "sentence" });
const segmentsIterator = segmenter.segment(myString);
const segments = Array.from(segmentsIterator).map((segment) => segment.segment);
// 👇
[
"私の猫の名前はベティです。",
"彼女はとてもかわいいです！"
]
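One thing to watch: the locale argument is a BCP 47 language tag, so Japanese is "ja" ("JP" is the region code for Japan). An unknown tag does not throw; as far as I know, the segmenter silently falls back to the runtime's default locale. Here is a quick sketch to check what you actually got, using resolvedOptions() and supportedLocalesOf(). The sample sentence is my own.

```javascript
// Ask which of the requested locales the runtime supports for segmentation.
const supported = Intl.Segmenter.supportedLocalesOf(["ja", "jp"]);
console.log(supported); // expected to contain "ja" (if Japanese data is available) but not "jp"

// Check which locale a segmenter really resolved to.
const jaSegmenter = new Intl.Segmenter("ja", { granularity: "sentence" });
console.log(jaSegmenter.resolvedOptions().locale);

// Sample text (my own): two sentences ending in "。" and "？".
const jaSentences = Array.from(
  jaSegmenter.segment("こんにちは。元気ですか？"),
  ({ segment }) => segment
);
console.log(jaSentences.length); // 2
```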
Last but not least, there is a third granularity of segmentation: the grapheme. The only examples for this I could find are emojis. Emojis often consist of several Unicode code points, which makes them hard to split.
'✌☝️🙂😵💫'.split('');
Array(9) [ "✌", "☝", "\ufe0f", "\ud83d", "\ude42", "\ud83d", "\ude35", "\ud83d", "\udcab" ]
But with a Segmenter, you can do it.
const myString = "✌☝️🙂😵💫";
const segmenter = new Intl.Segmenter("de", { granularity: "grapheme" });
const segmentsIterator = segmenter.segment(myString);
const segments = Array.from(segmentsIterator).map((segment) => segment.segment);
// 👇
Array(5) [ "✌", "☝️", "🙂", "😵", "💫" ]
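Emojis are not the only case, though. As far as I can tell, graphemes matter whenever you want to count or cut "characters" as a user perceives them, for example letters with combining accents or emoji joined by a zero-width joiner. A minimal sketch; countGraphemes and truncateGraphemes are hypothetical helper names:

```javascript
// Hypothetical helpers based on grapheme segmentation.
const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

function countGraphemes(text) {
  let count = 0;
  for (const _ of graphemeSegmenter.segment(text)) count++;
  return count;
}

function truncateGraphemes(text, max) {
  const parts = Array.from(graphemeSegmenter.segment(text), ({ segment }) => segment);
  return parts.slice(0, max).join(""); // never cuts inside a grapheme
}

const accented = "e\u0301"; // "e" followed by a combining acute accent
console.log(accented.length); // 2 (UTF-16 code units)
console.log(countGraphemes(accented)); // 1

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"; // three emoji joined by ZWJ
console.log(family.length); // 8
console.log(countGraphemes(family)); // 1

console.log(truncateGraphemes("a" + family + "b", 2) === "a" + family); // true
```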