How does Readlang know the difficulty of a text?

Steve · June 5, 2024, 8:49am

The difficulty is calculated by a combination of:

Automated Readability Index: wikipedia article
The percentage of words which are in the top 2000 most frequent words in the language. The majority of the word frequency lists are based on movie subtitles and come from this site: Invoke IT Word Frequency Lists

It’s far from perfect, but gives somewhat plausible results for Spanish, English, French and German texts.
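
As a rough illustration of the two ingredients listed above, here is a minimal Python sketch. It is not Readlang's actual code: the tokenization rules, the function names and the tiny word set are assumptions, and the thread does not say how the two numbers are combined, so they are only computed separately.

```python
import re


def automated_readability_index(text: str) -> float:
    """Classic English ARI: 4.71 * chars/words + 0.5 * words/sentences - 21.43."""
    # Assumption: sentences end at '.', '!' or '?'; words are runs of \w characters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    if not words:
        raise ValueError("text must contain at least one word")
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


def top_n_coverage(text: str, frequent_words: set[str]) -> float:
    """Fraction of word occurrences in `text` that appear in `frequent_words`."""
    words = [w.lower() for w in re.findall(r"\w+", text)]
    if not words:
        raise ValueError("text must contain at least one word")
    return sum(w in frequent_words for w in words) / len(words)


# Hypothetical stand-in for a real "top 2000" list.
sample_frequent = {"the", "cat"}
sample_text = "The cat sat. The dog ran."
ari_score = automated_readability_index(sample_text)
coverage = top_n_coverage(sample_text, sample_frequent)
```

A lower ARI and a higher coverage would both point towards an easier text.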

Anna_Vernerova · July 12, 2024, 12:22pm

Regarding the Automated Readability Index, I’ve been wondering how much the coefficients differ between languages, i.e., whether the estimated level would improve a lot if Readlang used language-fitted coefficients.

For Czech, I’ve found this formula:

Automated Readability Index
= 0.631 × words/sentences + 3.666 × characters/words − 19.491

(Bendová and Cinková: Adaptation of Classic Readability Metrics to Czech)

while the original formula for English is
Automated Readability Index
= 0.5 × words/sentences + 4.71 × characters/words − 21.43

This seems to reflect the fact that Czech words are generally longer and that there are fewer of them in a sentence (there’s more inflection and fewer auxiliaries).
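
To get a feel for how much the two coefficient sets quoted above differ, here is a small Python sketch that applies both to the same text statistics. The statistics (12 words per sentence, 5 characters per word) are made up for illustration, and the function name is an assumption.

```python
def ari(words_per_sentence: float, chars_per_word: float,
        coefficients: tuple[float, float, float]) -> float:
    """ARI in the form a * words/sentences + b * characters/words - c."""
    a, b, c = coefficients
    return a * words_per_sentence + b * chars_per_word - c


# Coefficients (a, b, c) as quoted in this post.
ENGLISH = (0.5, 4.71, 21.43)
CZECH = (0.631, 3.666, 19.491)  # Bendová and Cinková

# Hypothetical text statistics, not taken from any real text.
words_per_sentence, chars_per_word = 12.0, 5.0
english_score = ari(words_per_sentence, chars_per_word, ENGLISH)
czech_score = ari(words_per_sentence, chars_per_word, CZECH)
```

For these made-up numbers the English formula gives about 8.1 and the Czech one about 6.4, so the same surface statistics are rated easier under the Czech fit.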

Maxim_Yazovskij · July 12, 2024, 2:03pm

Then how can it label the difficulty of a text in minor languages, for which a list of the most used words is unknown (Kazakh, for example)?

Maxim_Yazovskij · July 13, 2024, 3:15am

I just checked this more carefully, and it seems that yes, there is a word list for every language, even mighty KAZAKH.
But the link to the source of that list is currently not working. Also, the lists for the different languages are based on nothing more than automated word counts over a large amount of text. The German list, for example, was apparently compiled from thousands of texts, which is why an entry there reads something like "XXX word, mentioned 4914 times", with lower counts further down.
For minor languages, by contrast, they literally had only a couple of texts. After position 25 in the 5K list, every word was mentioned just ONCE, i.e. "xxx word, 1 time". That means 99% of the list is words that were mentioned just once! Sometimes there are even English words! It looks like they just ran the analysis on a single Kazakh-language wiki page and called it a day!

So in minor languages (at least for sure in Kazakh) it’s unreliable at best, and outright wrong at worst
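
The sparsity check described above can be sketched in a few lines of Python. The function names and the 50% threshold are assumptions chosen for illustration, and the sample counts are synthetic, shaped like the list described in this post (25 entries with a real count, then 4975 entries seen once).

```python
def singleton_share(counts: list[int]) -> float:
    """Fraction of entries in a frequency list whose count is exactly 1."""
    if not counts:
        raise ValueError("frequency list is empty")
    return sum(1 for c in counts if c == 1) / len(counts)


def looks_too_sparse(counts: list[int], max_singleton_share: float = 0.5) -> bool:
    """Flag a list as unreliable when most entries occur only once.

    The 0.5 default is an arbitrary illustrative threshold.
    """
    return singleton_share(counts) > max_singleton_share


# Synthetic counts: 25 words seen more than once, 4975 words seen once.
sparse_counts = [100 - i for i in range(25)] + [1] * 4975
share = singleton_share(sparse_counts)
```

With these synthetic counts the singleton share is 0.995, so the list would be flagged.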

Aaron_Smith · July 25, 2024, 12:07am

Yeah, I too would like some more insight into how reading levels are calculated. I’m studying Ukrainian and, when I add what I feel are basic A1 level texts, the platform marks them as C1-C2. I understand the complexity behind automating this; however, it does make the search for public texts (in my target language at least) hard to use.

I wonder if it might be more useful to allow users to override that value or maybe allow AI to rank the text as a premium feature.

Steve · September 18, 2024, 4:01pm

To clarify: Readlang doesn’t have a word frequency list for Kazakh and neither does it show the difficulty level for Kazakh texts.

It does have a word frequency list for Ukrainian, but it’s very likely that the CEFR levels are inaccurate, as you suggest. I need to change something here. It would probably be preferable to just get the uploader of the text to set the difficulty level. I plan to improve the public libraries at some point in the next few months and I’ll tackle this along with that work.

Anna_Vernerova · September 20, 2024, 9:49pm

I’d suggest that, as a first attempt, the calculation should not be based on the 2000 most common words in a language, but on the most common n% of the frequency list, where n is chosen so that the selected part of the list covers 80% of all word occurrences (i.e., 80% of the sum of all frequencies given in the frequency list).

My reasoning for that suggestion:

some languages have a much larger number of word forms derived from the same lemma than others, which is somewhat reflected in the size of the word lists (the greatest is for Hungarian, but Arabic, Finnish, Turkish, Serbian and Czech also have larger lists than English, despite the fact that we can expect them to be based upon smaller corpora).
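
The cutoff proposed above could be computed roughly as follows. This is only a sketch: the function name is an assumption, the counts are invented, and a real frequency list would of course be much longer.

```python
def coverage_cutoff(counts: list[int], target: float = 0.8) -> int:
    """Smallest number of top-ranked entries whose counts sum to >= target of the total."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("frequency list has no occurrences")
    running = 0
    for rank, count in enumerate(ordered, start=1):
        running += count
        if running >= target * total:
            return rank
    return len(ordered)


# Invented counts for a four-word "language".
toy_counts = [60, 25, 10, 5]
cutoff_rank = coverage_cutoff(toy_counts)            # number of words to keep
cutoff_percent = 100 * cutoff_rank / len(toy_counts)  # the "n%" from the post
```

Here the top two entries already cover 85 of 100 occurrences, so the cutoff is rank 2, i.e. n = 50% of this toy list. For a heavily inflected language the same 80% target would typically select more entries than it would for English.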

Cassandra · April 27, 2025, 4:08pm

A reason that A1 texts are marked as C1 may be inflection. Highly inflected languages (i.e. those with declensions like nominative, genitive, illative, and so forth) often only have the nominative and a couple of other forms in the frequency list. Lemmatization could help here. Not sure how verb conjugation is addressed.