Word Frequency for Wordlist

One thing that I’m struggling with is when and if to use flashcards. Reviewing flashcards takes time and I don’t like doing it. I like reading; I’d much rather spend my time reading. Tapping on a word while reading probably takes the same time as reviewing that word separately as part of a flashcard review session. Moreover, when I read I see the word in different contexts, so it’s actually better for retention. So if the time commitment is the same, retention through reading is better thanks to the varied contexts, and on top of this I get the benefits of reading itself, then why review flashcards at all, unless one simply loves reviewing flashcards much more than reading?

Even for those of us who’d rather spend our time reading, there might be a place for flashcards, and that is low frequency words. There are words that only show up once in a book, and some words don’t show up at all. So, how many books do I have to read to assimilate those words? 10 books? 20? 50? What if I want to read in a few languages at a time through rotation? If I see a word, how much time will pass before I see it again? It could be months, a year, a few years. The interval is too long to actually assimilate those words. That’s the reason taking a language from a high intermediate or low advanced level to a high advanced level takes so much time. It might take me 10 years to actually assimilate those words.

That’s where I think flashcards, reviewing the words in their context (not classic flashcards), can have an important role. But how do we know which words those are? I don’t want to review the high frequency words with flashcards; I want to acquire them through reading. That means I will look them up many times when I read, so they will be in my word list mixed up with the low frequency words. And since I can’t tell which words are the low frequency ones, I end up not using flashcards at all, because I keep seeing the high frequency ones which I’d rather acquire through reading.

If we could add a column indicating frequency and the ability to filter by it, then we could review just those. That would truly accelerate our language acquisition and take us to a high advanced level significantly faster. That’s where I see the potential for flashcards.

There are many open source solutions for word frequency, for example Voyant Tools. I can import a text and it makes a list of all unique words and gives me a count. It can do this for a single book or across all the texts that I import. It also gives me all kinds of statistics that help me determine which text is easier to read.

https://voyant-tools.org/

If we could incorporate something like this in Readlang, then I would upload the next 5-10 books I’m planning to read, and if I could get a count across my corpus, I could select just the words that show up, say, fewer than 5 times in the entire corpus and review those.
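The counting and filtering described here is simple to sketch. This is just an illustration with a naive tokenizer, not anything Readlang actually does:

```python
import re
from collections import Counter

def corpus_frequencies(texts):
    """Count how often each word form appears across a list of texts."""
    counts = Counter()
    for text in texts:
        # Lowercase and extract letter runs; a real tokenizer would need
        # per-language handling of apostrophes, hyphens, and clitics.
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return counts

def rare_words(counts, max_count=5):
    """Words appearing fewer than max_count times: flashcard candidates."""
    return {word for word, n in counts.items() if n < max_count}

books = ["The cat sat. The cat slept.", "A dog barked at the cat."]
freq = corpus_frequencies(books)
candidates = rare_words(freq)  # 'dog' appears once, so it qualifies
```

The cutoff of 5 is the arbitrary threshold from the paragraph above; in practice it would be user-configurable.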

Alternatively, for those people who prefer learning through flashcards as opposed to reading, they could start reviewing the highest frequency words first. I think everyone would benefit from something like this.

Another benefit would be that with word frequency we could have a readability index so we could rank texts in order of difficulty and read the easier ones first.
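One crude way to get such a ranking is to score each text by how much of it is covered by words the learner already knows (or by a general common-word list). This is just a rough proxy sketch, not a standard readability formula:

```python
import re

def coverage(text, common_words):
    """Fraction of a text's tokens found in a common-word set; higher
    coverage suggests an easier text for this particular reader."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in common_words for t in tokens) / len(tokens)

# Rank texts easiest-first by their coverage of an assumed known-word set.
common = {"the", "cat", "sat", "on", "mat"}
texts = ["The cat sat on the mat", "Quantum chromodynamics perplexes felines"]
ranked = sorted(texts, key=lambda t: coverage(t, common), reverse=True)
```

Real readability indexes also weigh sentence length and word length, but even plain frequency coverage would be enough to order texts from easier to harder.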

This would be a very powerful feature. Again, there are many open source engines that could be used to implement this in Readlang.

Another alternative, less ideal but still good, would be to just count the number of times I select a word and keep track of that. So let’s say I read 5-10 books. I will look up the more frequent words more and the less frequent words less. The more frequent words I will likely just assimilate through reading. The words that I only looked up 1-2 times I will likely not assimilate. So if Readlang could keep track of how many times I looked up a word, I could just select the low frequency words I looked up and review those with flashcards. This would be a much simpler way to implement this feature and still get the benefit of it. Though implementing the actual word frequency one would be the most powerful.

With a tool like Voyant I could just import 50-100 articles, blogs, etc. I want to read and read them in order of difficulty. This would provide something similar to graded readers, since it would be a grading. Of course I can do that today by importing them to Voyant as well as to Readlang, but that’s too much work. I am doing it with books, but doing it with short articles takes just too much time.


Are you aware that words currently enter practice from the most frequent to the least frequent ones, based on a predefined large corpus (an old version of the OpenSubtitles corpus, I think)? You could just go to Word list, order by frequency, select the words on the last page and practice those. However, this would be really frustrating, and what’s the point? There are likely thousands of words that only appear once in a collection of 20 books. Most of them will not appear a single time in the next 20 books you read. Unless you already know all the words that appear commonly, there’s no point studying the rare words.

The main problem is, of course, the predefined corpus. For some languages, it’s too small. It also does not reflect the needs and interests of a particular user. I’ve suggested before that each user should have their own frequency list based on either the books they’ve read already or those they have in their personal library. That would personalize the frequency and thus also difficulty ratings.


I wasn’t aware of that. Thanks for pointing that out!

But to your point, if this is not based on the texts in my library and it doesn’t include a frequency count, then I don’t know how many times it will show up, so it’s hard to figure out which ones to practice.

Here’s the challenge. Not all books are available in electronic format; many we still have to buy in print. So how do we read those? With the dictionary. If I have a comprehension of 98%, I will have to look up about 7 words per page (assuming a page of roughly 350 words). It’s inefficient to read a text like that and look up the same words in the dictionary over and over. Those are all reasons why Readlang is so great: even if I have to look up 7 words per page with Readlang, I can still maintain a good speed. To read a printed text at a good speed I would probably have to look up only one word per page, which would require that I know 99.7% of the vocabulary. To build a vocabulary like that through reading alone (on Readlang) will take a very long time. But that also means I’m restricting myself to reading only what’s available in electronic format, and there are works in print that I want to read. Hence the dilemma: can I accelerate my vocabulary acquisition so I can read print efficiently sooner?

One way would be to take other works, in electronic format, by the author whose work I want to read in print and use flashcards to internalize the less frequent vocabulary. Another would be to use flashcards for a set of vocabulary from a field. So if I want to read French philosophy in print, I can read other books in electronic format, review the less frequent words, and that would put me in a better place to read French philosophy in print sooner.

How did people learn languages before Readlang, before Anki, before flashcards? One method I’ve heard is that they would read a grammar book, then they would start reading a page from a text with a dictionary and make a list of the words they didn’t know. Then read that page over and over until they knew all the words. Then move to the next section and repeat the process. Do this until they could read without a dictionary. No need to mention the disadvantages of this method, but one advantage is that they would internalize the low frequency vocabulary much sooner, which means that they could stop relying on the dictionary much sooner. If I just read and rely on the frequency of the word to internalize it, then it would take a very long time to internalize the low frequency vocabulary, which means I would still have to rely a very long time on a dictionary.

I think looking up words in a dictionary is a very inefficient way to learn a language. At a beginner/intermediate phase, using word-for-word interlinear texts is much more efficient, but there aren’t enough of them out there to build a large enough vocabulary. And creating interlinear texts from actual translations, like I proposed, while more efficient at the intermediate level than Readlang or an actual dictionary, is still less efficient than proper word-for-word interlinear texts. Reading the same text repeatedly like this is also quite inefficient, because you have to spend time finding the words in the translation; they’re not always right there under or above the unknown words. And reading like this leaves us with the problem of how to internalize low frequency words so we can also read the print-only books efficiently.

I think there’s no single answer; there’s room for all of these methods. Use manuals at the beginner level, and read all the interlinear texts over and over to internalize their entire vocabulary efficiently. Then use graded readers with Readlang. Then create interlinear texts from translations for more difficult books to still keep 100% comprehension and build vocabulary. Read like this a lot, then use flashcards to internalize the low frequency vocabulary.

The challenge is to keep a balance between reading, which is fun, and using flashcards, which is not, and to spend as much time as possible reading, using flashcards only when necessary.

Does anyone have a better method?

I think you’re underestimating the power of reading with less than 100% comprehension. I commonly listen to audiobooks from the librivox collection. The first three chapters of, let’s say, Jane Austen are painful. I listen to each chapter three times and still only have a vague idea of what’s going on. By the last chapter, I listen once and understand enough to enjoy the book. How is it possible? I’ve internalized much of the Jane-Austen-and-the-topic-of-this-book-specific vocabulary just by listening to the audiobook. I still have no idea about the words that appear only once or twice in the book, but not knowing 3 words per page does not significantly lower enjoyment.

I understand that if you want to read philosophy, you might need more precise comprehension than I need for 19th century novels. But you still could read 19th century novels as a step on your language journey and not worry about translating each word.

It’s not true that with 100% comprehension you learn the rare words earlier. You learn the rare words at the expense of the frequent words, and hence at, e.g., the 2,000-known-words milestone, the fraction of words you know in a new text will be much lower. It’s not a good way unless you’re learning in order to read only a very limited collection of texts, like learning New Testament Greek.


In general I agree with the point that reading is more enjoyable than flashcards.

I agree with Anna that learning the high frequency words first is best. The problem with learning low frequency words is that there are tons of them and each specific one is not likely to be useful again for a long time (unless it’s a low frequency word in general usage, but high frequency in a domain you are interested in, in which case it is a high frequency word for you and so is worth focussing on).

Even though flashcards are not as enjoyable as reading IMO, there are some advantages:

  • They are convenient to do in bite-sized sessions if you only have a spare minute or so. Reading can be frustrating to do in such short sessions.
  • They give you practice in active recall by getting you to translate from your first language to your target language (especially if you have typing enabled). When you want to speak or write you need to be able to think of the words to say, and reading alone doesn’t always feel like enough. I’ve found that it’s possible to know what a Spanish word means when I read it but still be stumped later when I try to remember it in order to write or speak. Flashcards provide this active recall practice.

This is a cool idea. I like it, I’m just scared about the complexity of implementing and whether it’s worth doing vs other potential improvements.

I’m wondering how computationally expensive it would be to store each user’s frequency list and simply recalculate it every time they add/edit/delete a text by counting all the words in their corpus again. I understand that for very large user corpora, such an approach would be somewhat wasteful, but how many users have more than 10 full-length books’ worth of text in their library, and how often do such users actually update it? As the code for counting word forms across a collection of texts would be quite simple, and the use of such a list would be very similar to the way the OpenSubtitles counts are used now, I’m curious what I’m overlooking.

In terms of data storage, saving each user’s personal frequency counts would require at most roughly twice the size of their library in additional storage (in the worst case of a small corpus, nearly every word appears only once, so a full listing of the vocabulary takes up the same space as the library itself, plus some space for the numerical value representing the number of appearances of each word form). Luckily, for very large corpora, the ratio is much smaller; Heaps’ law gives the following approximation: size-of-vocabulary(corpus with N words) = aN^b, where for Czech a = 118.0895 and b = 0.5113 (i.e., a Czech corpus with 1,000,000 words (roughly 20 books) would contain about 138,000 different word forms). This is assuming that we’d only be storing word-form/count pairs; I do see how difficult it would be to construct a nice index of occurrences…
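As a sanity check, the Heaps' law estimate is easy to reproduce (the a and b constants are the Czech values quoted above):

```python
def heaps_vocabulary(n_tokens, a=118.0895, b=0.5113):
    """Heaps' law: V(N) = a * N**b.
    Estimates distinct word forms in a corpus of n_tokens running words."""
    return a * n_tokens ** b

# A 1,000,000-token Czech corpus, roughly 20 books:
v = heaps_vocabulary(1_000_000)  # about 138,000 word forms
```

The sublinear exponent (b ≈ 0.5) is what makes per-user storage cheap for large libraries: doubling the corpus adds far fewer than double the distinct forms.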


You’re right. It’s doable. The computational complexity or storage required isn’t necessarily that bad, especially since this would be a premium-only feature. It’s more the added complexity of the whole system, including the codebase and the user interface, and the work involved in implementing and maintaining it. There are questions to think about, like:

  • Should it include every text in your library or let you specify the texts you want to include? Maybe there could be a special shelf containing all the texts you want to contribute.
  • It could incentivize people to upload a ton of texts covering a field they’re interested in just to improve their frequency list, which seems wasteful. Should this be restricted? Or should there be a way to upload texts purely for the sake of contributing to a word frequency list, even if the text itself will never be read and therefore doesn’t need to be stored?
  • Should the frequency list be updated every time you add or edit a text, or maybe you just trigger generation of this list manually?

Even with those questions answered, the point about the work involved to create it and maintain the more complex system stands, and I need to balance it against other potential improvements.

I still think it’s a cool idea, and I’m not ruling it out in the long term, but no promises.


You’re right that people would be tempted to upload texts just to skew the frequencies :hushed:

You’re thinking about giving users a lot more choice than I was hoping for. The smallest feature that would be both immediately useful and easily extendable would be to allow users to upload their own frequency list in CSV format. It would be less friendly for non-programmers, for sure, but there are online tools for counting frequencies in a text, so the greatest hurdle for the user would be concatenating all of their texts into one. Also, this way, people can grab existing frequency lists.
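Building such a CSV from a set of texts is only a few lines. The word,count column layout here is just a guess at what an import format might look like, not a documented Readlang format:

```python
import csv
import io
import re
from collections import Counter

def frequency_csv(texts):
    """Produce a word,count CSV, most frequent words first.
    The column layout is a placeholder, not an actual import spec."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["word", "count"])
    for word, n in counts.most_common():
        writer.writerow([word, n])
    return buf.getvalue()

csv_text = frequency_csv(["le chat dort", "le chien dort aussi"])
```

Concatenating a whole library first, as suggested above, would just mean passing all the texts in one list.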

Anyway, I’m just brainstorming; it’s obvious you cannot immediately implement all suggested improvements. Luckily it’s not necessary, ReadLang is already such a great tool! :slight_smile:


I think it should include uploaded texts, not just the ones read. If a toggle is added, then don’t include them by default; if there isn’t one, then do include them.

I think a toggle would be useful because it might allow me to focus on a set of texts. I might have a large library I want to read, but maybe I want to focus on an author first.

The use case is, let’s say I want to read the texts of the French philosopher Rene Girard. Some of his texts are in print only. I upload the texts I’m planning to read first with Readlang. Then I start reading. Do I review any words with flashcards? There’s no point in reviewing the high frequency words, since I will assimilate those by the time I’m done reading the texts anyway. But it will help to review the lower frequency words, since that will save me looking them up in the dictionary when I read the books in physical format. His vocabulary isn’t very extensive, since he’s writing philosophy, not literature, so it is possible to master it in its entirety so I can read any of his texts without a dictionary.

Again, I’m not saying at all that one should learn the lower frequency words before the high frequency words. What I’m saying is that if one wants to master the entire vocabulary (of an author in a field, which is corpus specific and more limited), then the most efficient way to do that is through a combination of reading and flashcards.

What are the alternatives? One is doing flashcards of all words, starting with high frequency words. Why do that if those words will be assimilated by reading the works themselves? Another is to not do any flashcards, but that means reading the physical books with a dictionary. Neither is ideal.

And I get that one can get the meaning without knowing all the words. That works for literature, but for other fields like philosophy you might miss important points about the argument.

Another way of framing the use case is this: let’s say I want to master the vocabulary of Girard so I can read his print works without a dictionary. Can we make Readlang intelligent enough to help me achieve that in the most optimal way? If I need to see a word 10, 15, 20 times before I internalize it (whatever the number is), then Readlang could tell me which flashcards I need to review by taking into account the texts I haven’t read yet but will. This means it will not show me the high frequency words, because I will assimilate them by reading. It also means it will take my reading into account. So, for the words that fall below the threshold (let’s say my corpus will only show me a word 8 times but I need to see it 10 times), it will show them to me fewer times than the words that appear even less frequently in the text.
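The arithmetic behind that threshold idea is simple; the target of 10 exposures below is just the illustrative number from the paragraph above:

```python
def reviews_needed(corpus_occurrences, target_exposures=10):
    """Flashcard reviews to schedule for a word, assuming each future
    occurrence in the as-yet-unread corpus counts as one exposure.
    target_exposures is an illustrative guess, not a known constant."""
    return max(0, target_exposures - corpus_occurrences)

# A word I'll meet 8 times while reading needs about 2 flashcard reviews;
# a word I'll meet 12 times needs none.
```

So frequent words drop out of the flashcard queue entirely, and the rarer a word is in the planned reading, the more of its exposures the flashcards have to supply.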

The more intelligent the tool is the more useful it becomes. So it can use spaced repetition but it could also take the entire exposure into account, including exposure via reading and compensate that with flashcards.

Right now flashcards only take spaced repetition into account, but if they ignore the vocabulary I will assimilate through reading, then they act more as an alternative to reading and less as a complementary tool. If I spend time reviewing flashcards and the algorithm feeds me the high frequency words first, then that’s time I don’t spend acquiring those words in various contexts via reading. And the algorithm still leaves me with the problem of how to assimilate the low frequency words (assuming I want to), since those will be the last words it feeds me. So if I follow the traditional route, let’s say I do a lot of reading and then go do flashcards, now I have to spend time deleting the words I’ve already internalized. More time not spent reading. So what do I do? I routinely go and delete everything on my word list. But then I also delete my low frequency words, which I should review. Or I do partial deletes, which takes time. And where’s the cutoff, since I don’t get any quantification?

The only alternative is the one Anna suggested: go to the last page, manually select the less frequent words, and do those. But that’s not optimal. What about the words I’ve seen 8 times that need 2-5 more reviews to be committed to long-term memory?

Bottom line: spaced repetition works, and reviewing words in context with flashcards is a good tool. It seems this could be done in a better way than assuming that we don’t do any reading, which is what the flashcard system does now. That’s the biggest drawback of the flashcard model right now, as I see it. It assumes that users don’t do any reading, or if they do, it’s only to select words to be memorized via flashcards. It doesn’t adjust the vocabulary it feeds me based on the reading I did. It gives prominence to flashcard memorization instead of acting as an augmentation and reinforcement of reading. And given the intelligence we have available today, maybe there could be an easier solution to this.

Now I’ve finally understood your use case.

What you could do: assuming you have your “corpus” available in some form, create the frequency list and decide which words you expect to assimilate by reading. You can import them into ReadLang using the Secret .csv flashcard import feature.

You can set the “Next Practice Date” to a year from now (probably with some variation so that they do not all enter practice at once) and set the revision interval to, e.g., two years.
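Generating those staggered dates is straightforward. The word/date pairing below is only a placeholder, since I don't know the exact columns the import expects:

```python
import random
from datetime import date, timedelta

def staggered_schedule(words, base_days=365, jitter_days=60, seed=1):
    """Pair each word with a 'next practice' date about a year out, with
    random jitter so the fake cards don't all come due on the same day.
    base_days and jitter_days are illustrative values, not known defaults."""
    rng = random.Random(seed)
    today = date.today()
    return [
        (word,
         (today + timedelta(days=base_days
                            + rng.randint(-jitter_days, jitter_days))).isoformat())
        for word in words
    ]

rows = staggered_schedule(["chat", "chien", "oiseau"])
```

Each row could then be written out as one line of the flashcard import CSV.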

The advantage: ReadLang will think you have learnt and practiced these words already. You’ll be tested on each of these words after a year. If you have assimilated it by then, the revision interval will become even longer, so long that you probably do not need to worry about it too much. If you forget by then, then the revision interval is shortened - I don’t remember by how much or to what default value, maybe Steve will tell us.

Disadvantage: now your Word List view will be cluttered with words you actually do not want to practice. However, you can filter to words which are ready for practice, and the scheduled words will be hidden (until the date for which you’ve scheduled them).


But if I import a list, that list will not be tied to the text, right? I won’t get the context, and that’s the most important thing. When I review a flashcard I recall not only the word but its context.

Sure, I can get a list of low frequency words, but they won’t have the translation unless I do that separately, and why do that elsewhere? The whole point of using Readlang is to look up words. So now I need to stop using Readlang and find a different solution so I can use Readlang again? And they won’t be in the context in which I read the word, which is also important, since only Readlang has that information, namely, which word I looked up when I read the text.

This seems like way too much work, and it doesn’t really work in conjunction with reading. Even if I could find a way to do this, it would be a one-time thing, not a complement to daily reading.

I’d rather forget about flashcards and just read until there’s a viable solution to this.

I agree with you that simply reading might be the best use of your time.

My idea was that you create fake flashcards only for the words you expect to assimilate by reading, and give them due dates long after you’ll assimilate their meaning. That way, when you start Flashcard practice now, only the cards for medium-to-low frequency words would be scheduled for practice. (Those flashcards would be created during reading, not imported, so they would naturally contain context and relevant translation.)

You can add additional contexts to existing flashcards during reading, so over time, you could populate the fake imported cards with actual contexts. It’s just a few clicks for each context you want to add.

When you use ReadLang for reading and you click on a word for which you already have a flashcard, ReadLang shows you the translation that is on the card - if it doesn’t fit in the context, you have to open the card, use the Explain function and add additional meaning to the card. Thus, it is important that the imported cards contain translations relevant to the type of text you intend to read. However, I think you could take the list of words that you intend to import, tell ChatGPT what kind of texts you intend to read, and ask it to provide translations. I’d expect it will do a reasonably good job. It should also be able to give the answer in the format that can be imported into ReadLang.

Is it worth going through this just to force ReadLang to skip practicing cards that you expect to encounter multiple times while reading? I don’t know.


Now I get it! Thanks for clarifying!

Readlang does an excellent job translating words in context. I’m not sure how to replicate that elsewhere at the same standard; it would require translating the entire text and effectively creating an interlinear translation. I can’t just take a list of high frequency words and translate each of them, since that would mean they get translated out of context. And if I’m going through the trouble of creating an interlinear text, I’ll just read that.

However, the idea of suppressing the high frequency words is appealing. Maybe one way to do it would be:

  1. read the corpus
  2. export the entire word list and delete it from Readlang
  3. look up the words against a frequency list from Voyant on the same corpus
  4. flag and delete the high frequency words
  5. reimport the low frequency words back into Readlang (assuming I can reimport everything without any hiccups)
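Steps 3-5 amount to a simple join against the frequency list. A sketch, where the cutoff of 5 and the two-column (word, translation) row shape are just assumptions about the export format:

```python
def low_frequency_rows(exported_rows, corpus_counts, cutoff=5):
    """Keep only exported word-list rows whose word appears fewer than
    `cutoff` times in the corpus frequency counts (steps 3-4), leaving
    the rows to reimport (step 5)."""
    return [row for row in exported_rows
            if corpus_counts.get(row[0], 0) < cutoff]

# Hypothetical export: (word, translation) pairs from the word list,
# plus frequency counts computed by Voyant over the same corpus.
exported = [("chat", "cat"), ("chien", "dog"), ("axolotl", "axolotl")]
counts = {"chat": 40, "chien": 12, "axolotl": 2}
keep = low_frequency_rows(exported, counts)  # only the rare word survives
```

Any word missing from the frequency list is treated as count 0, i.e. rare, which errs on the side of keeping it for review.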

Seems like that could be done in a few minutes, and as a one-time project it might be acceptable. Let’s say I read 10 books and do it for that corpus.
