This blog post is also available in German.
At INNOQ, new podcast episodes are constantly being created. Transcription requires quite a bit of effort. This should be automatable in the new AI world!
Manual experiments showed that Google’s Gemini models solve the task best. Whisper by OpenAI, for example, couldn’t keep up.
The Gemini models are well-suited for transcribing conversations due to their large context window and audio capabilities. In this blog post, I will share how I overcame some obstacles to find a solution.
Naive Approach and First Setback
What’s the big deal? Upload an audio file in Google AI Studio, write a prompt, and Bob’s your uncle!
It turns out there are not enough output tokens for this. An hour-long podcast produces approximately 20,000 to 30,000 tokens in the transcript. The non-reasoning Gemini models can output a maximum of 8,192 tokens. As much as we appreciate the generous 1-million-token context window, when we need to produce a lot of text, the output tokens are the bottleneck.
Can’t Do Without Programming
Manual work will not suffice. We need to write a tool ourselves. This already presents the next set of obstacles.
- The API limits the audio file size in a request to 20MB. An MP3 of a one-hour podcast is about 70MB.
- Larger files can be uploaded beforehand, but they only remain available for 48 hours. So I'm supposed to manage the lifecycle of these files? No thanks: the whole thing is too fragile, with too much state to manage. I need a tool that can resume at any point, ensuring restartability. AI-based tools can be unpredictable, and you need to be able to discard parts of the results and quickly regenerate them without always starting from scratch.
Chunking to the Rescue - New Problems
The audio file must be split into chunks, and the pieces transcribed individually with separate API calls. When chunking, care must be taken to cut at a pause in speech, not in the middle of a sentence. I had ChatGPT write the algorithm in Python because I had no idea how to do any of that. It worked right away with the pydub library, and I learned something new. To complicate matters, I also had to even out the volume with audio compression (again with pydub), because pause detection could fail on quiet recordings.
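A rough sketch of this chunking step with pydub could look like the following. The file name, silence thresholds, and target chunk length are illustrative assumptions, not the exact values used in the tool:

from pydub import AudioSegment, effects
from pydub.silence import detect_silence

# Load the episode (placeholder file name).
audio = AudioSegment.from_mp3("episode.mp3")

# Even out the volume for pause detection; quiet recordings otherwise
# lead to missed pauses.
leveled = effects.compress_dynamic_range(audio)

# Find pauses of at least 700 ms that are quieter than -40 dBFS
# (illustrative values).
pauses = detect_silence(leveled, min_silence_len=700, silence_thresh=-40)

# Cut at the first pause after the target chunk length so that no chunk
# ends in the middle of a sentence.
target_ms = 10 * 60 * 1000  # aim for roughly 10-minute chunks (assumption)
chunks, start = [], 0
while start < len(audio):
    candidates = [p for p in pauses if p[0] >= start + target_ms]
    end = candidates[0][0] if candidates else len(audio)
    chunks.append(audio[start:end])
    start = end

for i, chunk in enumerate(chunks):
    chunk.export(f"chunk_{i:03d}.mp3", format="mp3")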
Who are you?
Now, difficulties arose with speaker recognition. If the introduction round is missing at the beginning, how is the model supposed to know who is speaking?
The idea: Cut out an intro, about 2–3 minutes long, where everyone gets a chance to speak, and attach the intro to each chunk.
This trick works well for speaker recognition, but now each chunk contains the intro, which I then have to remove from the transcript. This is more complicated than expected because you can’t just search for text, not even with similarity search. The model sometimes combines sections of the same person, and it becomes tricky if the last text block in the intro is an “um.” I speak from experience…
Solution: Record a concise, simple separator sentence that will never naturally occur and is always transcribed clearly. The audio of the separator sentence is inserted between the intro and the chunk with a few seconds of pause in between. This method is very reliable.
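In pydub terms, assembling a chunk for transcription might look roughly like this (file names and the pause length are placeholders):

from pydub import AudioSegment

# Placeholders: the intro cut from the episode and the pre-recorded
# separator sentence.
intro = AudioSegment.from_mp3("intro.mp3")
separator = AudioSegment.from_mp3("separator.mp3")
pause = AudioSegment.silent(duration=3000)  # a few seconds of silence

def with_intro(chunk: AudioSegment) -> AudioSegment:
    # intro + pause + separator + pause + chunk: in the transcript,
    # everything up to and including the separator sentence is discarded.
    return intro + pause + separator + pause + chunk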
Transcript Format
I designed a JSON format to stay flexible in the downstream process.
[
  {
    "speaker": "Hermann",
    "text": "Hello and welcome to the INNOQ podcast. Today, I am with Somebody.",
    "timestamp": "00:00"
  },
  {
    "speaker": "Somebody",
    "text": "Hello, I am Somebody, thanks for having me.",
    "timestamp": "00:05"
  }
  ...
]
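To give an idea of what transcribing a single chunk looks like, here is a sketch using the google-generativeai Python SDK. The prompt wording and request details are my assumptions, not necessarily those of the actual tool:

import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="...")  # in practice read from the environment
model = genai.GenerativeModel("gemini-2.0-flash-lite-preview-02-05")

# Chunks stay under the 20 MB inline limit, so they can be sent
# directly with the request instead of being uploaded beforehand.
chunk = {"mime_type": "audio/mp3", "data": Path("chunk_000.mp3").read_bytes()}
prompt = (
    "Transcribe this audio. Return a JSON array of objects with the "
    'fields "speaker", "text" and "timestamp" (mm:ss).'
)

response = model.generate_content([prompt, chunk])
print(response.text)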
Models are Fickle
The first model I tried was Gemini 2.0 Flash experimental. The texts were very precise, but the timestamps were badly off. Additionally, the model often became lazy as it approached 4000 output tokens: speakers were no longer separated, and at the end of a chunk I found large blocks of text. It also no longer bothered to properly complete the defined JSON structure.
Surprise! We cannot use the full allowed amount of output tokens.
With the Pro version, everything took far too long, and the result wasn’t better.
After a Google update, to my surprise, gemini-2.0-flash-lite-preview-02-05 could output timestamps accurate to the second and made no errors in the text. However, it inherited the laziness.
Models Can’t Do Everything
To get the correct spelling of difficult names, I created a glossary file that is inserted into the transcription prompt. This prevents glitches like “Jason” instead of “JSON” or “Chat GPT” instead of “ChatGPT.” This works very well.
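The glossary is plain text that is simply inserted into the transcription prompt. A minimal sketch, with the file name and wording as assumptions:

from pathlib import Path

glossary = Path("glossary.txt").read_text(encoding="utf-8")

prompt = (
    "Transcribe this audio. Return a JSON array of objects with the "
    'fields "speaker", "text" and "timestamp" (mm:ss).\n'
    "Use the following spellings for names and technical terms:\n"
    + glossary  # e.g. lines like "JSON (not Jason)", "ChatGPT (not Chat GPT)"
)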
Speaker recognition hits its limits when the recording quality is mediocre or when people talk over each other. The model also has an easier time telling speakers apart when their voices differ in pitch.
Other Implementation Details
All intermediate results are cached in files. A restart is always possible without having to begin from scratch.
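The caching can be as simple as a small helper that writes each intermediate result to a file and reuses it on the next run; a sketch under that assumption, where transcribe() stands in for the actual API call:

import json
from pathlib import Path

def cached(path: Path, produce):
    # Return the cached result if the file exists; otherwise compute it,
    # write it to disk, and return it. Deleting the file forces a redo
    # of just that step.
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = produce()
    path.write_text(json.dumps(result, ensure_ascii=False, indent=2),
                    encoding="utf-8")
    return result

# Usage (hypothetical):
# transcript = cached(Path("cache/chunk_000.json"),
#                     lambda: transcribe("chunk_000.mp3"))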
A postprocessing step reads prompts from files and applies them to the raw transcript. This allows for fully automatic translation, removal of “uhh” from sentences, simplification of convoluted text, creation of summaries of different lengths and types, etc. The possibilities are endless. Due to caching, experiments with postprocessing take little time.
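A sketch of such a postprocessing loop, assuming one prompt per text file in a prompts/ directory (the directory layout and model name are assumptions):

import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="...")
model = genai.GenerativeModel("gemini-2.0-flash-lite-preview-02-05")

raw_transcript = Path("transcript.json").read_text(encoding="utf-8")
Path("out").mkdir(exist_ok=True)

for prompt_file in sorted(Path("prompts").glob("*.txt")):
    prompt = prompt_file.read_text(encoding="utf-8")
    # Each prompt (translate, remove filler words, summarize, ...) is
    # applied to the raw transcript independently.
    response = model.generate_content([prompt, raw_transcript])
    out_path = Path("out") / f"{prompt_file.stem}.txt"
    out_path.write_text(response.text, encoding="utf-8")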
Flowchart
Recap
The prerequisite for good transcripts is high recording quality.
The precision of the transcript is remarkably high. Every word and sound is recognized. Post-editing is only necessary when new names appear that the model does not recognize or cannot spell correctly. These are added to the glossary file, and the output improves next time.
With timestamps accurate to the second, I can spot-check suspicious passages to verify that speakers are correctly assigned. We consider it acceptable if someone is occasionally mixed up, as long as the content is completely correct.
The preparation of the audio material took the most development time. The second most time-consuming task was understanding the peculiarities of the model.
It was worthwhile to build a custom tool for this, because it lets us address our specific requirements much more precisely.