As the title implies, word counts aren’t reasonable for Chinese languages (I’m using Cantonese, but the problem applies to Mandarin as well). There are often texts which have 100s or 1000s of characters/words, but only show as 10s of words. The tokenizer should switch to character count for these languages.
I think Readlang may need better logic for counting words, but it should not use character counts for word counts. Lots of Chinese words require more than one character. As an example, counting a word like 自行车 (bicycle) as three words would not work as you will want to learn the word as a concept, with all three characters together.
Well, the issue for the user experience is just that Readlang doesn’t seem to give a meaningful estimate of how long the text is. Character count would at least be meaningful, or you could divide that number by two or three to approximate a word count. I assume Thai has the same problem, though it would probably want a higher divisor.
@Michael_Yurko You said the text may have “100s or 1000s of characters/words”. Which do you mean? They aren’t the same thing.
Either way, it’s quite far from the mark currently. Instead of 1, 2, or 3 characters per counted word, I’m seeing more like 20 or 30. (It seems to depend on the amount of punctuation and line breaks.)
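To illustrate why the count could depend on punctuation and line breaks, here is a minimal sketch of a counter that only splits on whitespace and punctuation. This is a guess at the kind of behaviour being described, not Readlang's actual tokenizer; the function names and the sample sentence are made up for illustration.

```python
import re
import unicodedata

# Assumption: a "naive" counter that treats any run of text between
# whitespace or punctuation as one word. Not Readlang's real code.
SEPARATORS = re.compile(r"[\s，。！？、；：,.!?;:]+")


def naive_word_count(text: str) -> int:
    """Count the non-empty chunks left after splitting on separators."""
    return sum(1 for chunk in SEPARATORS.split(text) if chunk)


def letter_count(text: str) -> int:
    """Count characters in a Unicode letter category (CJK ideographs are 'Lo')."""
    return sum(1 for ch in text if unicodedata.category(ch).startswith("L"))


sample = "我今天骑自行车去学校，因为天气很好。\n放学以后我想去图书馆看书。"

words = naive_word_count(sample)   # 3: one "word" per clause
letters = letter_count(sample)     # 28 characters
print(words, letters, round(letters / words, 1))
```

Because Chinese has no spaces between words, each clause is counted as a single word, so the ratio of characters to counted words grows with clause length rather than with real word length.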
I’ve just changed it to show character counts instead of word counts for Chinese, Cantonese, Japanese, and Thai.
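For anyone curious what a per-language switch like this might look like, here is a rough sketch. It is not Readlang's code: the language codes, the `text_length` name, and the choice to count only letters and digits (ignoring punctuation and combining marks) are all assumptions made for the example.

```python
import unicodedata

# Hypothetical language codes for Chinese, Cantonese, Japanese, and Thai.
CHARACTER_COUNT_LANGUAGES = {"zh", "yue", "ja", "th"}


def text_length(text: str, language: str) -> tuple[int, str]:
    """Return (count, unit) for a text, choosing the unit by language."""
    if language in CHARACTER_COUNT_LANGUAGES:
        # Assumption: count letters and digits only, so punctuation,
        # whitespace, and combining marks do not inflate the number.
        count = sum(
            1 for ch in text if unicodedata.category(ch)[0] in ("L", "N")
        )
        return count, "characters"
    # Languages written with spaces: whitespace-separated tokens.
    return len(text.split()), "words"


print(text_length("自行车", "zh"))            # (3, 'characters')
print(text_length("I ride a bicycle", "en"))  # (4, 'words')
```

Returning the unit alongside the number lets the interface label the figure correctly, so 自行车 shows as 3 characters rather than being presented as 3 words.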