We Are Entering the Golden Age of Language Acquisition!

Steve, please let me know if this is something you would be interested in exploring. If not, let me know and I will look elsewhere.

I think that Readlang is the best tool out there for reading at an advanced C level. If I know 98% of the words, that means I only have to look up 6-8 words per page and Readlang is the most efficient way to do it.

However, at an intermediate level, where I know much less than 98%, looking up every word is much less efficient. At a B level, I think reading and building up vocabulary with interlinear texts is the best way.

I believe Readlang has the potential to become the best tool out there for the intermediate level as well, if we use LLMs to create interlinear texts.

Moreover, I think that Readlang could totally change the language acquisition process.

I’m proposing two ways to do that. I’ve tested both of them with ChatGPT and they work. It’s just a matter of implementing them.

The first is to take a text in the target language (let’s say Spanish) and generate a thought for thought translation into the language one is learning from (let’s say English), which is the best way to understand the text. Then also generate a word for word translation that keeps the same word order as the Spanish. At a beginner or intermediate level I don’t have enough vocabulary to figure out what’s going on on my own. So reading for comprehension is important, which is why I have the thought for thought translation in the first row. I read a sentence or part of a sentence in English, then I read the Spanish. Most of the time I can figure out the meaning and which word is which. But sometimes there are idioms, and I can get the wrong understanding of the words. That’s where the word for word translation comes in: it is there just for instances where the thought for thought translation is confusing. I generate this text in ChatGPT and import it into Readlang. I occasionally look up individual words to check the ChatGPT output, but for the most part it is good enough. At a beginner or intermediate level, this improves my reading speed significantly. After I acquire enough vocabulary, I can switch the order and have the Spanish first and the two English translations after.

This could be easily implemented in Readlang. It would require implementing a prompt that generates this text. This became possible with the o1 model; no prior model I tried could do it. I’m doing this with o1-mini. The translation itself is not perfect since the model is smaller, which is why I still use the Readlang translation occasionally. I can take an article in Spanish and read it at a decent speed even if I only know 25% of the words, and acquire the rest of the vocabulary in the process. I can learn a language by reading. With Readlang alone that takes too long, since I have to look up so many words. In my prompt I break the text up into sentences or smaller units so that the three lines of text are displayed next to each other. That makes it easy to locate words. If a sentence is too long, it wraps and displays two or more lines of English, then two or more of Spanish, etc., which slows me down. But the model is pretty good at splitting them up.
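To make the splitting rule concrete, here is a minimal Python sketch of it. It is only an illustration, not what the model does internally: the function name `split_into_rows` is made up, and it only breaks after commas, semicolons and colons (it does not detect conjunctions, which the prompt also allows as break points).

```python
def split_into_rows(sentence: str, max_words: int = 15) -> list[str]:
    """Split a sentence into rows of at most max_words words,
    preferring to break right after a comma, semicolon or colon."""
    words = sentence.split()
    rows = []
    while len(words) > max_words:
        cut = max_words  # fallback: hard cut at the word limit
        # Walk backwards through the first max_words words and stop at
        # the last one that ends with a pause mark.
        for i in range(max_words, 0, -1):
            if words[i - 1].endswith((",", ";", ":")):
                cut = i
                break
        rows.append(" ".join(words[:cut]))
        words = words[cut:]
    if words:
        rows.append(" ".join(words))
    return rows


if __name__ == "__main__":
    text = ("one two three four five six seven eight nine ten, eleven twelve "
            "thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty")
    for row in split_into_rows(text):
        print(row)
```

A rule like this is cruder than the model, which also looks at meaning, but it shows why 15 words is a workable limit: every row stays short enough to fit on one line.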

The second idea is, I believe, a revolutionary one and has the potential to change everything. At least it did change everything for me. The idea is to take two electronic editions of a book, let’s say a book in Spanish and its English translation, and use a GenAI model to generate an interlinear text. Similar to the format above, I wrote a prompt to take the first sentence of the actual English text (with no alterations), or break the sentence up if it’s too long, then find the corresponding Spanish and place it in the second row. Then it generates a word for word translation in the third row.

The advantage of this second method is that I’m using a human translation. I use the word for word machine translation mainly when the human translation takes liberties or when it translates idioms. I don’t read the machine translation; I use it occasionally when I’m struggling with the human translation. I still use Readlang to look up words in the dictionary or have the more advanced LLM explain words. But I’m mainly reading the human translation.

Sometimes two sentences in English are translated as one in Spanish and vice versa, but I asked the model to take that into account and it did. That’s why this could not be done before LLMs unless one did it manually. The two texts would get misaligned quickly. Also, this is only possible with the o1 or o1-mini model. Earlier models would not finish the task no matter how many times I asked them to do it.

Why is this revolutionary? Because I can start reading more advanced books much earlier in the learning process. What are the alternatives at the beginner and intermediate levels? One is to look up 50-75% of the words with Readlang. That’s just not efficient. Another is to read two books side by side, but that is impractical, as I have to move my eyes from one to the other and find where I am in the text every time. Anyone who has tried this knows it’s not very efficient. Another option is to read half a dozen manuals and all the interlinear texts out there 5-10 times until I internalize the vocabulary. Another option is to read with Readlang a little bit, then do flashcards until I internalize the vocabulary, then read more, and so on.

None of these options appeal to me. I have not met anyone who LOVES flashcards. People do them because they believe it’s an efficient way to acquire vocabulary, but even the most committed language learners see them as a chore. Personally, I don’t want to do flashcards, I want to read. So my version of flashcards is to read a manual or interlinear text 5-10 times until I completely internalize the vocabulary. The downside is that this ignores the spaced repetition aspect and the fact that I memorize some words much faster than others, but the upside is that all the words are in context. Another downside is that I get bored of the text, unless it’s poetry.

But how many interlinear poetry translations have you found out there? Poems are usually not translated word for word, so it’s hard to acquire vocabulary that way. The only ones I could find are the Lieder published by Leyerle Publications. They published the complete songs of Schubert, Schumann, Brahms and Strauss. They give you a phonetic transcription in the first line, followed by the German, then word for word English, then thought for thought English, then a commentary at the end. It’s the best way to learn German poetry. They also published opera libretti, but only for German, Italian and French. They have nothing in Spanish. And they are expensive, $100 per book. Reading fairytales 5-10 times in a row gets boring pretty quickly. It becomes a chore.

That’s why I believe that what we are going through right now is so revolutionary. I’m on day 50 of my first Assimil manual in Spanish (still a beginner) and have no prior experience with the language. And I started reading the book Papyrus: The Invention of Books in the Ancient World by Irene Vallejo. This is a text that has 20,000 unique words, so it’s a C2 text. I took the original Spanish and the English translation and generated the interlinear format as described. By reading the human translation first, I comprehend 100% of the content. Then I read the Spanish and I can mostly understand it, meaning I can make out which word is which. I occasionally rely on the word for word translation and every now and then use Readlang. I wasn’t planning to start reading in Spanish in my first year because it’s just too inefficient to look up most of the vocabulary. But I wanted to read this book in English, so I thought, why not read it in Spanish as well while I’m reading it in English? On the downside, it will take me 3 times longer to read; on the upside, I will acquire vocabulary in the process. And it works. I’m starting to internalize the high frequency words without even trying. I’m just reading.

The reason I believe that this is revolutionary is that for me, this is not studying. This is reading! It’s slow reading for sure, but reading, nonetheless. And significantly faster than looking up every word. It’s a book I want to read badly, I’m excited to read it. I love it. I can’t wait to get back to it. And that draws me to it. And I just read it in Spanish as well and acquire vocabulary in the process. This is an entirely different experience than doing flashcards or reading the same texts 5-10 times.

Another reason this is revolutionary is that you can basically take any two texts, in any two languages, and read like this. You can take languages for which there are no interlinear texts at all, or not even good manuals. You can also get an audiobook and shadow it and learn the sounds of the language that way. You can take a book that you find interesting. Existing interlinear texts are all literary texts. But some people want to read crime, mystery or fantasy novels, or history, biography, science or academic books. Whatever you want, you can read it. When my Spanish is more advanced, I could take a book and generate a Portuguese, Spanish and English interlinear (all human translations) plus a word for word English row, and learn Portuguese via Spanish. And improve my Spanish in the process as well. How many interlinear texts are there that have these three languages side by side? Zero! How about Dutch, German and English? Zero. How about French, Italian and English? How about Norwegian, Danish, German and English? The possibilities are endless.

Alexander Arguelles said many times that people want to read the great classics too soon. But you have to work your way up to it. Reading a literary masterpiece should be your goal but it will take time to get there. The reason he said that is that people lack the vocabulary to read at that level. And looking words up in the dictionary and through Readlang or reading with two books side by side is just not an efficient way to read. He said that because these texts are not available in interlinear formats. But now they can be.

But the most important reason this is revolutionary is that it collapses the goal and the process. The reason I (and many other people) want to learn languages is to read in them. But why wait such a long time, why “study” so long and so hard, why struggle with flashcards? Why not just read?!? We can both read and acquire the language at the same time. If the motivation for learning languages is to read, why not use that motivation to drive the entire process, to drive the language acquisition?

This concept is not new. This is the Hamiltonian method. What’s new is the technology. Until now someone had to create these texts by hand, which is why we have so few interlinear texts. Now you can pick any text, the text you want to read right now, and just read it. Whatever grabs you. And because you are reading and not studying flashcards, you will do more of it, and because you will do more of it, you will acquire vocabulary faster. The reward for doing it this way is to get faster at it by internalizing more words, which would allow you to read more.

The technical challenge is that I still have to do it manually in ChatGPT, and I can only do about 7,000 words at a time via the chat because it limits the text input. The o1-mini model could process about 15,000 words at a time through the API, but I don’t have the skills to do it on the back end.

The ideal solution would be to upload the epub files for the two books then have a script in the back end that would break those texts up to a manageable size, run the prompt, then put them back together. This could be done.
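The break-up, run, and reassemble steps could look roughly like the Python sketch below. It rests on several assumptions that are mine, not Readlang’s: the text has already been extracted from the epub into a plain string, paragraphs are separated by blank lines, and `run_prompt` is a hypothetical placeholder for whatever function sends one chunk to the model and returns its output. It also handles a single text only; for two books the chunk boundaries would additionally have to line up between the two languages (for example by chapter), which is not shown here.

```python
from typing import Callable


def chunk_paragraphs(text: str, max_words: int = 7000) -> list[str]:
    """Group paragraphs into chunks of at most max_words words.
    A single paragraph longer than max_words becomes its own chunk."""
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in text.split("\n\n"):
        n = len(para.split())
        # Start a new chunk if adding this paragraph would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def process_book(text: str,
                 run_prompt: Callable[[str], str],
                 max_words: int = 7000) -> str:
    """Run the (hypothetical) model call on each chunk and stitch
    the results back together in the original order."""
    return "\n\n".join(run_prompt(chunk)
                       for chunk in chunk_paragraphs(text, max_words))


if __name__ == "__main__":
    sample = "\n\n".join(["uno dos tres"] * 5)
    # str.upper stands in for the model call in this demo.
    print(process_book(sample, str.upper, max_words=6))
```

The 7,000-word default is the chat limit mentioned above; with API access the budget could be raised.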

A simpler way is to just build into Readlang a way to upload multiple texts, display them, and generate the interlinear text. Let’s say upload text 1, then text 2 (and even text 3 as an option), then display text 1 in the first row, text 2 in the second row and the word for word LLM-generated translation in the row of your choice. Maybe have a menu to select the order. And then build a prompt to run this. This would be super easy to implement. It would take a bit more time because we would have to break up the text manually, but we could have this as a temporary solution until we figure out how to handle a whole book. I’m sure the context window for the models will keep expanding, so maybe they will be able to handle more soon, but that would be more expensive too. So ideally we would have a solution to break the texts up and run them with the cheaper models.
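As a rough illustration of the row-order menu, here is a small Python sketch. The data layout (one dict per unit with the keys `translation`, `original` and `literal`) and the function name are assumptions made up for this example, not anything Readlang has today.

```python
def render_interlinear(units: list[dict[str, str]],
                       order: tuple[str, ...] = ("translation", "original", "literal")) -> str:
    """Render each unit as consecutive rows in the chosen order,
    with a blank line between units."""
    blocks = ["\n".join(unit[key] for key in order) for unit in units]
    return "\n\n".join(blocks)


if __name__ == "__main__":
    units = [
        {"translation": "The white wall",
         "original": "La pared blanca",
         "literal": "The wall white"},
    ]
    # A more advanced reader might prefer the Spanish first.
    print(render_interlinear(units, order=("original", "translation", "literal")))
```

Keeping the three rows as separate fields, rather than as one block of text, is what would make the order selectable after the text has been generated.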

The compute for this can get expensive. I don’t have exact numbers, but using o1-mini to handle a 450-page book like this might cost around $15. Not cheap, but totally worth it. If you want to buy a book in this format from Doppeltext (without the word for word) it will cost you $10-20. So the cost is comparable. We would probably want to build an API for this or sell credits for the compute.

Most people give up on a language at the beginner stage. Of those who persist, most give up at the intermediate stage. Very few people get to an advanced stage. Very few people acquire a vocabulary of 20,000 words to be able to read at an advanced level. I believe this has the potential to change everything because it allows people to get to the reward faster and acquire the language through reading what grabs them, whatever that is. They can get the basics by going through a manual 10 times and then just start reading like this and in time acquire the language. And since the process is enjoyable, they will be much more likely to stick with it.

Here are the two prompts that I wrote to do this. I keep tweaking them so they will likely evolve but they get the job done.


Here’s the prompt to take a text in Spanish and translate it:

For this chat you are a translator and creator of interlinear texts, like those published by Leyerle Publications. The purpose is to help me learn a language by providing a thought for thought and a word for word translation. I will give you a text in a language other than English.

Take the 1st sentence of the text I gave you. If the sentence has more than 15 words, split it so that each row doesn’t have more than 15 words. When you split it, pay attention to the meaning and make sure that the split is in a place where there is a natural pause, like after a comma or conjunction.

In the first row I want you to create an interlinear text by translating the text thought for thought into English so that the text can be understood in English. Place this translated thought for thought text in the first row.

In the second row reproduce the original text exactly.

In the third row, translate the original text word for word. Each word must always be translated in context. The meaning of each translated word must always correspond to the meaning of the word used in that context. The order of the translation must always be exactly the same as the original even if it doesn’t sound good in English. Do not add any dashes between words. The purpose of this translation is to easily identify what the words in the original mean.

Then add a blank row. The height of the blank row must always be double the height of text rows.

Then take the 2nd sentence from the text I gave you, or if the 1st sentence has been split, the next part of that sentence, and continue the steps presented above for the entire text.

The output must be a text split by sentence or part of sentence and each sentence or part of sentence must always have three lines: a thought for thought translation in the first row, the exact original reproduction of the text I gave you in the second row and a word for word translation keeping the exact word order of the text I gave you in the third row. The order of the rows must always be the same.

It is absolutely critical that you follow these instructions exactly and that you do not alter the text I give you. The text I give you should be 100% exactly the same in the output text. And the order should always be the same: your thought for thought translation first, the original text second and then your word for word translation third.

Also, always pay attention to the meaning. For example, if you see a row with a special character like a dash (-) or some other special character or number, that doesn’t count as an English row and must be skipped.

Display the output text in a plaintext box in its entirety. Do not stop until the entire text has been displayed!


Here’s the prompt to take two translations and generate an interlinear text with them:

I will give you two texts. One in English and another one in another language (Spanish, German). I want you to create an interlinear text by breaking the English text up sentence by sentence and after each sentence paste the corresponding sentence from the other language without altering either text.

Take the 1st sentence in the 1st row from the English text I gave you. If the sentence has more than 15 words, split it so that each row doesn’t have more than 15 words. When you split it, pay attention to the meaning and make sure that the split is in a place where there is a natural pause, like after a comma or conjunction. Also, when you split the English sentence you have to make sure that you do it in a place where you can also split the Spanish sentence without losing meaning.

If two sentences in English are translated into one sentence in Spanish, split the one in Spanish as well but make sure that the parts correspond to the two English ones. If one English sentence corresponds to two sentences in Spanish, split the English sentence but make sure that the parts correspond to the two in Spanish. You need to match the meaning of the Spanish words to the meaning of the English words without altering the words or their order in any way. The only thing you do is split them.

Also, always pay attention to the meaning. For example, if you see a row with a special character like a dash (-) or some other non-word character, that doesn’t count as an English row and must be skipped. The first English row in every triplet should always be a sentence with meaningful English words.

For example, if I give you an English and a Spanish text, in the 1st row take the first English sentence exactly word for word, unless it has more than 15 words, in which case split it at an appropriate place to not break off the meaning. Also, when you split the English sentence you have to make sure that you do it in a place where you can also split the Spanish sentence without losing meaning.

Then in the 2nd separate row take the corresponding sentence from the Spanish text exactly word for word and paste it after the English sentence. If the English sentence has been split, you will have to split the Spanish sentence, but the Spanish sentence will have to contain the same meaning as the English sentence.

Then in the 3rd separate row create an interlinear translation by translating the Spanish text word for word into English. Follow exactly the same word order as the Spanish even if it is unnatural in English. For example “pared blanca” must be translated as “wall white”, “gente peligrosa” must be translated as “people dangerous” and “forasteros armados” must be translated as “strangers armed” to align 100% with the Spanish word order.

Then add a blank row. The height of the blank row should be double the height of text rows.

In the 4th row take the next English sentence exactly word for word or the next part of the first sentence if it was broken off, unless it has more than 15 words, in which case split it at an appropriate place to not break off the meaning. Also, when you split the English sentence you have to make sure that you do it in a place where you can also split the Spanish sentence without losing meaning.

Then in the 5th separate row take the corresponding sentence from the Spanish text exactly word for word and paste it after the English sentence. If the English sentence has been split, you will have to split the Spanish sentence, but the Spanish sentence will have to contain the same meaning as the English sentence.

Then in the 6th separate row create an interlinear translation by translating the Spanish text word for word into English. Follow exactly the same word order as the Spanish even if it is unnatural in English. For example “pared blanca” should be translated as “wall white” not as “white wall” to preserve the Spanish word order.

Do this for the entire text. Do not stop until the entire text has been parsed like this.

The output should have the exact English text that I gave you in the first line, the exact Spanish text I gave you in the second line and the word for word translation that you generated in the third line. There should be three separate output rows for each sentence.

It is absolutely critical that you follow these instructions exactly and that you do not alter the two texts I give you. The texts I give you should be 100% exactly the same in the output text. And the order of the lines should always be the same, original English first, original Spanish second and then your word for word English translation of the Spanish text third.

Display the output text in a plaintext box in its entirety. Do not stop until the entire text has been displayed!
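Since both prompts insist that the supplied texts come back unaltered, the output can be checked mechanically. Below is a minimal Python sketch of such a check for the two-text prompt. It assumes the output format described above (three rows per unit, units separated by one or more blank lines) and compares the texts after collapsing whitespace; the function names are made up for illustration.

```python
def parse_triplets(output: str) -> list[tuple[str, ...]]:
    """Split model output into (English, Spanish, word-for-word) units.
    Units are separated by one or more blank lines."""
    triplets = []
    block: list[str] = []
    for line in output.splitlines() + [""]:  # trailing "" flushes the last unit
        if line.strip():
            block.append(line.strip())
        elif block:
            if len(block) != 3:
                raise ValueError(f"expected 3 rows per unit, got {len(block)}: {block}")
            triplets.append(tuple(block))
            block = []
    return triplets


def normalize(text: str) -> str:
    """Collapse all runs of whitespace so line breaks do not matter."""
    return " ".join(text.split())


def texts_preserved(output: str, english: str, spanish: str) -> bool:
    """True if rows 1 and 2, joined in order, reproduce the two input texts."""
    triplets = parse_triplets(output)
    got_en = normalize(" ".join(t[0] for t in triplets))
    got_es = normalize(" ".join(t[1] for t in triplets))
    return got_en == normalize(english) and got_es == normalize(spanish)


if __name__ == "__main__":
    out = ("The wall is white.\nLa pared es blanca.\nThe wall is white.\n\n\n"
           "Dangerous people came.\nVino gente peligrosa.\nCame people dangerous.\n")
    print(texts_preserved(out,
                          "The wall is white. Dangerous people came.",
                          "La pared es blanca. Vino gente peligrosa."))
```

A check like this only confirms that nothing was added, dropped or reworded; it cannot tell whether the English and Spanish rows in a unit actually correspond to each other.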

I like the idea that Readlang could provide the textual and word-by-word translations without me having to click individual words. I agree that at my current level, I’d find it motivating.

Alignment of two translations has been possible since long before GPT, as it’s been necessary for creating machine translation systems ever since the 90s. There might even be some free tools available, at least for non-commercial use (which unfortunately doesn’t include Readlang). The real difficulty is managing copyright issues: for Readlang to offer texts in the library, both the original and the translation would have to be in the public domain, so most of the time you’d have to get the user to supply both copies themselves, and I guess only a handful of users would bother looking up a professional translation when the AI generated one is enough for understanding the text… However, people studying to become translators would certainly find it useful!

I wasn’t aware that this was possible before LLMs. I don’t know programming. I guess what’s new then is the ability for non-programmers to tell a model to do this in plain English. I can basically do this myself even though I’m not a programmer, but it takes more work.

Now that you tell me this was possible, I’m shocked that nobody has provided this solution. I get that creating interlinear word for word translations by hand is labor intensive, but if this was possible and it could be automated, why wasn’t this done at scale before? There are so many texts out of copyright. Doppeltext is the only one I’m aware of that is doing this. And they have only 1 long novel in Spanish. This is the best way to learn a language. There are millions of people trying to learn languages. This could have been done at scale using software and literally tens of thousands of books in the public domain. And it hasn’t been done. What’s going on, people?

Regarding copyright, yes, that’s a limit. But I’m not suggesting that Readlang should become a publisher and start publishing copyrighted content (though wouldn’t that be amazing?), just that it provides the tools. That is no different from how we use it now. I buy a book in epub format and I upload it to Readlang. I could buy a book in two languages, upload both to Readlang and read it in interlinear format. It would be just me who has access to it. Sure, it’s inefficient because a bunch of us will process the same books and we’ll all pay for compute separately, but there’s no way around it.

However, with content in the public domain, once processed it could be made available to the community, so they don’t have to process it again. I think copyright in Europe expires 70 years after an author’s death. So for authors who died before 1954, this is doable. We could pair books and translations from the 1930s and 40s, and the language is still fairly close to today’s. You can find the complete works of so many authors in the public domain. There is a large supply of materials.

Going back to human vs AI translations. What I’m suggesting is to read at the intermediate stage. Let’s say I have a vocabulary of only 2,000 words. I could read a book that uses 20,000 words translated by a machine. But have you tried that? It sucks. Why would I do that if I have a human translation? The machine generated interlinear translation is really bad. I don’t want to read like that. And again, looking up 50-75% of the words with Readlang is possible, but not efficient. It’s not the best way to build up vocabulary quickly. There is a lot of wasted time which could be used reading. It’s painful. It doesn’t make me want to come back to it.

What I’m suggesting is reading, for pleasure, in both translation and the original. I’m reading the book in English and I enjoy it because the translation is good. And I’m also reading it in the original and I’m able to connect the words and acquire vocabulary in the process.

This is not what I have in mind:
“I guess only a handful of users would bother looking up a professional translation when the AI generated one is enough for understanding the text”

To me that doesn’t sound like reading for pleasure, reading and fully enjoying a text. It’s not just about “understanding the text”, it’s about enjoying it. And again, that is possible at the advanced level. And I agree that most people using Readlang today are probably at this more advanced level, because Readlang is not the best tool for reading at an intermediate level. So, yeah, if I know 98% of the vocabulary, why would I bother looking up a professional translation? How many people read books for pleasure with Readlang in languages where they only know 2,000 words? If any of you do that, are you reading advanced texts? Are you reading Proust? And are you able to enjoy the text? I tried and I didn’t enjoy it.

I realize that this is a new way of doing things. It’s a new use case for Readlang (intermediate reading and not just advanced). And I mean reading, not choosing words to be memorized with flashcards. And I mean reading in a way that is enjoyable. And using that joy to power the reading and language acquisition process.

I’m suggesting to collapse the reward and the process. Reading a text where 75% of the words are unknown with a machine or dictionary is not enjoyable. That’s not a reward. That’s extremely frustrating. It’s hard work. At least for most people. Not to mention that reading at an early stage with a machine or dictionary will make it hard to understand the text completely. And if you can’t understand it, you miss important plot points, you miss idioms, you miss style, and as a result you can’t enjoy it, which means you won’t get the dopamine kick that will make you want to keep going.

Why not use our dopamine mechanism to power our language acquisition? Use the dopamine you get from your favorite book to learn languages.

I have not heard anyone talking about acquiring languages like this and yet this is what I’m doing. It works folks!

I just find it mystifying that it hasn’t been done already, at scale. That the tools haven’t been created.

Doppeltext has an interlinear translation of Don Quijote. Its English translation is from 1885. That’s 140 years old. I want to read this using the Grossman translation. I can buy the ebook for $3 and the Spanish text is free. What will cost the most will be the compute, but even with the compute, the cost is the equivalent of a physical book.

I should say that this could probably be done using open source models, where the compute is much cheaper.

@Marius Have you considered reading easier texts? I’m at a B2 level in Spanish and I’m reading Goosebumps, which can be read at 98% coverage with knowledge of the 5,000 most frequently occurring words (at least in English). I’m about 40% through, and my lookups are less than 2% of the words read. If this series is too easy, I’ll just move up to books at the 6K word level.

Just a thought.

Yes, of course, that was the plan before I discovered this, do extensive reading with graded readers.

The main downside is that those texts don’t grab me. I would never read those texts in English or a language I know well. Would you? So, why read them? Just to build vocabulary? Reading something that doesn’t engage me just to build vocabulary seems like studying to me. The only reason I would do that is so I can read the fiction or non-fiction that grabs me. Those books have a much wider vocabulary. So, why not do that now?

The question remains, why spend hundreds of hours reading boring graded readers that don’t grab me if I can spend all that time reading books that do grab me? I still haven’t heard a good answer to that which takes into account what I’m proposing. The answer I used to hear is that it’s inefficient to look up words in the dictionary or Readlang or read two print books side by side. But we don’t have to do that.

I should add that this whole debate of intensive reading (normal texts) vs extensive reading (graded readers) is very old. Reading in the interlinear format allows you to engage in intensive reading without the intensive part (no need to look up words), so it’s just reading. We can do things differently. All the answers I hear are making the assumption that those texts don’t exist. And they don’t, which is why I’m proposing that we build the tools to create them.

I do believe, however, that reading graded texts is a phenomenal idea. I wrote another long post suggesting that we start tracking vocabulary and grade the texts we import to Readlang according to the words in them that the reader doesn’t know. That would be a personalized graded reader. Then we can import texts to Readlang that interest us and read the easier ones first. That way we can read efficiently with Readlang. I’d love to do that. I just don’t want to read a text that doesn’t grab me. Life is too short to spend time doing that if I don’t have to. If I don’t have the choice I’ll do it, but if I do have a choice, then I’d rather not. So, until someone builds a way to grade texts based on one’s own vocabulary, I’ll just create my own interlinear texts to read texts that I find interesting.
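Grading imported texts against a reader’s own vocabulary could be sketched in Python roughly like this. Everything here is an assumption for illustration: the function names are made up, the known vocabulary is a plain set of lowercase word forms, and words are counted as they appear (no lemmatization, so “blanca” and “blanco” count as different words).

```python
import re


def unknown_ratio(text: str, known_words: set[str]) -> float:
    """Fraction of running words in the text that are not in known_words."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 0.0
    unknown = sum(1 for word in words if word not in known_words)
    return unknown / len(words)


def rank_texts(texts: dict[str, str],
               known_words: set[str]) -> list[tuple[str, float]]:
    """Return (title, unknown ratio) pairs, easiest text first."""
    graded = [(title, unknown_ratio(body, known_words))
              for title, body in texts.items()]
    return sorted(graded, key=lambda item: item[1])


if __name__ == "__main__":
    known = {"la", "pared", "es"}
    library = {
        "hard": "Gente peligrosa.",
        "easy": "La pared es blanca.",
    }
    for title, ratio in rank_texts(library, known):
        print(f"{title}: {ratio:.0%} unknown")
```

With a ranking like this, a reader could pick the texts where the unknown share is small enough to read with Readlang lookups and keep the interlinear format for the rest.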

Remember, most people who start learning a language don’t learn it to an advanced level. Why? Because Instagram and TikTok are grabbing their attention more than the boring readers they’re being told to read. It doesn’t have to be that way.

@NMTK Just curious, is the book you’re reading part of a graded series in Spanish? What is the name of the series? Also, is it available in electronic format so I can read it on Readlang? I looked for graded readers in Spanish but I could only find printed versions, and I prefer to read on Readlang.

Thank you!

Thanks for this! Plenty of food for thought!

Offering something like an interlinear text is interesting and I agree with you that LLMs make it very doable by either of the approaches you suggest: getting the LLM to translate or getting it to align an existing human-written translation. It would be fun to create this but my concerns would be around the added technical complexity and surface area of the product. It feels like quite a departure from the way the current reading interface works and the code behind the existing UI is already quite old and messy so it wouldn’t be easy. I am intrigued though so I’ll bear this in mind as I plan future work.

I don’t think that’s necessary though. It’s more of a nice to have.

One way is to split the text into lines shorter than a certain number of words. I’m using 15 words. And after the three lines I add a blank row. Maybe allow the user to select the number of words.

I am currently using this with the current product, as is. The only thing that would be nice is to show all three lines on one page (or two, or four, or whatever the number is). Currently it sometimes shows one line on one page and the rest on the next, though I can easily push them all to the next page by tapping on a few words above.

Obviously this is a workaround and I wrote the prompt to make the current product work. Sure, ideally we would have the interface take a whole paragraph and then break it up, but I’d rather have this than no interlinear at all just because we can’t get it to be ideal. :smile:

Here’s a screenshot.

That’s cool!

I was imagining a more streamlined presentation that would be possible if this feature was built into Readlang (which would be hard and add complexity) but if what you’ve got there works for you that’s great :slight_smile:

A more streamlined version would definitely be better though :smiley: