How to count words in a Unicode string using PHP

Counting the words in a Unicode string with PHP sounds too easy. After all, PHP has plenty of string manipulation functions, and str_word_count looks like all we need. But there is a problem.

While this function works fine for English strings, developers find it unpredictable because it sometimes counts symbols as words too. The real problem, though, is that str_word_count is not multibyte-aware, so it does not count accurately for Unicode strings.

I faced this problem while working on one of my websites, which is entirely in Hindi. When I searched, I was surprised to find no straightforward solution or built-in function for this. Ideally a standard function would work for every language, but languages differ too much in structure to allow that.

A new function to count words in a Unicode string using PHP

So I wrote a small function that can be used anywhere to count the words in a Unicode string, and it works for a large number of popular languages. It first removes all punctuation marks and digits so that they are not counted as words. Then it replaces every block of white space, including tabs and new lines, with a single space character.

Now all the words in the string are separated by exactly one space. We can simply split (explode) the string into an array and count its elements to get the word count.

You can see the code below.
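
The steps described above could be sketched roughly as follows. This is only a minimal sketch, not a definitive implementation: the function name count_unicode_words is hypothetical, the input is assumed to be valid UTF-8, and the Unicode property classes \p{P} (punctuation) and \p{N} (numbers) are my assumption for what counts as "punctuation marks and digits".

```php
<?php
/**
 * Counts space-separated words in a UTF-8 string.
 * The name count_unicode_words is hypothetical, chosen for this sketch.
 */
function count_unicode_words(string $text): int
{
    // Step 1: remove punctuation (\p{P}) and numbers (\p{N}).
    // The "u" modifier makes PCRE treat the pattern and subject as UTF-8.
    // preg_replace returns null on failure (e.g. invalid UTF-8), hence the fallback.
    $clean = preg_replace('/[\p{P}\p{N}]+/u', '', $text) ?? '';

    // Step 2: collapse every run of white space (spaces, tabs, new lines)
    // into a single space, then drop leading and trailing spaces.
    $clean = trim(preg_replace('/\s+/u', ' ', $clean) ?? '');

    // An empty string would otherwise explode into one empty element.
    if ($clean === '') {
        return 0;
    }

    // Step 3: split on the single-space separator and count the pieces.
    return count(explode(' ', $clean));
}

echo count_unicode_words('नमस्ते दुनिया, यह PHP है।'), "\n";   // 5
echo count_unicode_words("Hello,   world!\n42 times"), "\n"; // 3
```

Because punctuation is removed rather than replaced with a space, a hyphenated term such as "well-known" is counted as one word. Replacing punctuation with a space instead would count it as two, so pick whichever behaviour suits your content.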

Just copy this code into your PHP project and start using the function to count words in any Unicode string.

It works equally well for English strings. In my experience it is more accurate than str_word_count.

Remember, it works accurately only for strings in which spaces separate the words. It may not work accurately for languages like Mandarin, where words are not separated by spaces.
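
To make that limitation concrete, here is a small sketch using the same whitespace-splitting idea. The helper name count_space_separated is hypothetical and exists only for this illustration. A Chinese sentence written without spaces comes back as a single word, even though it contains three.

```php
<?php
// Hypothetical helper: drops punctuation and numbers, then splits on runs of white space.
function count_space_separated(string $text): int
{
    $clean = preg_replace('/[\p{P}\p{N}]+/u', '', $text) ?? '';
    // PREG_SPLIT_NO_EMPTY discards empty pieces, so an empty input yields 0.
    $words = preg_split('/\s+/u', $clean, -1, PREG_SPLIT_NO_EMPTY);
    return $words === false ? 0 : count($words);
}

$english = 'I like programming';
$chinese = '我喜欢编程'; // roughly "I like programming": three words, no spaces

echo count_space_separated($english), "\n"; // 3
echo count_space_separated($chinese), "\n"; // 1
```

Languages like this need dictionary-based word segmentation, which simple splitting on spaces cannot provide.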

Please let me know how you liked this article in the comments section below.
