Extra text in audio transcription

Hi there,

Unsure if this is specifically a bug or a “feature”, but I’ll note that in auto transcription from a French mp3 I’m getting the following text multiple times in the transcription - mostly during musical interludes:

“Sous-titres réalisés para la communauté d’Amara.org Abonnez-vous ! Traduit avec :heart: par Lyn Abonnez-vous !”

which in English is effectively “subtitles by the Amara.org community” etc.

It’s not a big issue or anything but I’m assuming it’s not expected.

Wow, that is strange! I’m assuming those words weren’t spoken during the musical interlude?

Do you mind sharing the mp3 with me (send to steve@readlang.com) so that I can try it myself?

Also worth noting: you can freely edit the text after the auto transcription has done its job, and Readlang will do its best to keep the edited version in sync with the audio based on the timing info from the initial auto-transcription.

Correct, those words aren’t spoken at all. I’ll send through the mp3 shortly.

Thanks for sharing the file. I can confirm it happened for me too.

Seems like it’s also happening for other people who use the OpenAI Whisper API: Dataset bias ("❤️ Translated by Amara.org Community") · openai/whisper · Discussion #928 · GitHub. Hopefully they’ll improve the model to avoid this. In the meantime it’s an annoying quirk, but it shouldn’t be too disruptive as long as the rest of the spoken text is transcribed correctly. Feel free to delete this extra text in your transcriptions using the “Edit” tab in the reader page sidebar.
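
If you have a lot of transcriptions to clean up and would rather not delete each occurrence by hand, something like the following rough sketch could help. To be clear, this is not part of Readlang or the Whisper API: `strip_artifacts` and `ARTIFACT_PATTERNS` are hypothetical names, and the patterns only cover the phrases reported in this thread. Note that "Abonnez-vous !" could also be genuinely spoken in some recordings, so check the result before trusting it.

```python
import re

# Hypothetical patterns covering only the phrases reported in this thread.
# Extend or trim this list for your own transcriptions.
ARTIFACT_PATTERNS = [
    # "par" or "para", straight or curly apostrophe
    r"Sous-titres réalisés para? la communauté d['’]Amara\.org",
    # the heart may appear as a ":heart:" shortcode or as the emoji itself
    r"Traduit avec (?::heart:|\u2764\ufe0f?) par Lyn",
    r"Abonnez-vous\s*!",
]


def strip_artifacts(text, patterns=ARTIFACT_PATTERNS):
    """Remove known boilerplate phrases and tidy up the leftover spaces."""
    for pattern in patterns:
        text = re.sub(pattern, " ", text)
    # Collapse runs of spaces/tabs left behind by the removals,
    # leaving line breaks untouched.
    return re.sub(r"[ \t]{2,}", " ", text).strip()
```

You would run it over the transcription text before pasting it back in, for example `cleaned = strip_artifacts(transcript)`.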
