This blog post is also available in German

At INNOQ, new podcast episodes are constantly being created. Transcription requires quite a bit of effort. This should be automatable in the new AI world!

Manual experiments showed that Google’s Gemini models solve the task best. Whisper by OpenAI, for example, couldn’t keep up.

The Gemini models are well-suited for transcribing conversations due to their large context window and audio capabilities. In this blog post, I will share how I overcame some obstacles to find a solution.

Naive Approach and First Setback

What’s the big deal? Upload an audio file in Google AI Studio, write a prompt, and Bob’s your uncle!

It turns out there are not enough output tokens for this. An hour-long podcast produces approximately 20,000 to 30,000 tokens in the transcript. The non-reasoning Gemini models can output a maximum of 8,192 tokens. As much as we appreciate the generous 1-million-token context window, when we need to produce a lot of text, the output tokens are the bottleneck.

Can’t Do Without Programming

Manual work will not suffice. We need to build ourselves a tool. This already presents the next set of obstacles.

Chunking to the Rescue - New Problems

The audio file must be split into chunks, and the pieces transcribed individually with each API call. When chunking, care must be taken to separate at a pause in speech, not in the middle of a sentence. I had ChatGPT write the algorithm in Python because I had no idea how to do any of that. It worked right away with the pydub library, and I learned something new. To make it more challenging, I had to equalize the volume with audio compression (again with pydub), because pause detection could fail with quiet recordings.
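
To illustrate the idea, here is a minimal sketch of such a chunker with pydub; the silence threshold, pause length, and target chunk length are illustrative assumptions, not the settings from our tool.

# Sketch of silence-aware chunking with pydub (parameter values are illustrative).
from pydub import AudioSegment
from pydub.effects import compress_dynamic_range
from pydub.silence import detect_silence

def chunk_audio(path, target_minutes=10, min_silence_ms=700, silence_thresh_db=-40):
    audio = AudioSegment.from_file(path)
    # Even out loud and quiet passages so pause detection also works on quiet recordings.
    audio = compress_dynamic_range(audio)

    # All pauses long enough to count as a break in speech, as [start_ms, end_ms] pairs.
    silences = detect_silence(
        audio, min_silence_len=min_silence_ms, silence_thresh=silence_thresh_db
    )

    chunks, start, target_ms = [], 0, target_minutes * 60 * 1000
    while start < len(audio):
        target_end = start + target_ms
        # Cut at the first pause after the target length, never in the middle of a sentence.
        cut = next((s for s, _ in silences if s >= target_end), len(audio))
        chunks.append(audio[start:cut])
        start = cut
    return chunks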

Who are you?

Now difficulties arose with speaker recognition. In every chunk except the first, the introduction round from the beginning of the episode is missing, so how is the model supposed to know who is speaking?
The idea: Cut out an intro, about 2–3 minutes long, where everyone gets a chance to speak, and attach the intro to each chunk.

This trick works well for speaker recognition, but now each chunk contains the intro, which I then have to remove from the transcript. This is more complicated than expected because you can’t just search for text, not even with similarity search. The model sometimes combines sections of the same person, and it becomes tricky if the last text block in the intro is an “um.” I speak from experience…

Solution: Record a concise, simple separator sentence that will never naturally occur and is always transcribed clearly. The audio of the separator sentence is inserted between the intro and the chunk with a few seconds of pause in between. This method is very reliable.
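
Schematically, assembling a chunk and stripping the intro from its transcript could look like this; a sketch in which the separator text and pause length are placeholders, not the actual values.

from pydub import AudioSegment

SEPARATOR_TEXT = "end of the introduction"  # placeholder for the actual separator sentence

def build_chunk_audio(intro, separator, chunk, pause_ms=2000):
    # Intro first, a clear pause, the separator sentence, another pause, then the chunk itself.
    pause = AudioSegment.silent(duration=pause_ms)
    return intro + pause + separator + pause + chunk

def drop_intro(entries, separator_text=SEPARATOR_TEXT):
    # entries is the list of {"speaker", "text", "timestamp"} objects described below.
    # Everything up to and including the entry containing the separator belongs to the intro.
    for i, entry in enumerate(entries):
        if separator_text in entry["text"].lower():
            return entries[i + 1:]
    return entries  # separator not found: keep everything rather than losing content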

Transcript Format

I designed a JSON format to keep the downstream process flexible.

[
  {
    "speaker": "Hermann",
    "text": "Hello and welcome to the INNOQ podcast. Today, I am with Somebody.",
    "timestamp": "00:00"
  },
  {
    "speaker": "Somebody",
    "text": "Hello, I am Somebody, thanks for having me.",
    "timestamp": "00:05"
  }
  ...
]
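
On the Python side, each entry maps onto a small dataclass, roughly like this (a sketch):

import json
from dataclasses import dataclass

@dataclass
class TranscriptEntry:
    speaker: str
    text: str
    timestamp: str  # "MM:SS" as in the example above

def parse_transcript(raw_json: str) -> list[TranscriptEntry]:
    return [TranscriptEntry(**entry) for entry in json.loads(raw_json)]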

Models are Fickle

The first model I tried was Gemini 2.0 Flash experimental. The texts were very precise, but the timestamps were badly off. Additionally, the model often became lazy as it approached 4,000 output tokens: speakers were no longer separated, and at the end of a chunk I found large blocks of text. It also no longer bothered to properly complete the defined JSON structure.

Surprise! We cannot use the full allowed amount of output tokens.
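
With the numbers from above, you can estimate how long a chunk may be before the model drifts into that lazy zone. A back-of-the-envelope sketch; the token figures are the rough estimates from this post, and the safety margin is an assumption.

# Rough estimate: one podcast hour yields about 20,000-30,000 transcript tokens.
TOKENS_PER_HOUR = 25_000
SAFE_OUTPUT_TOKENS = 3_000  # stay well below the ~4,000 tokens where laziness set in

max_chunk_minutes = SAFE_OUTPUT_TOKENS / (TOKENS_PER_HOUR / 60)
print(f"Keep chunks under roughly {max_chunk_minutes:.0f} minutes")  # about 7 minutes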

With the Pro version, everything took far too long, and the result wasn’t better.

After a Google update, to my surprise, Gemini-2.0-flash-lite-preview-02-05 could output timestamps to the second and made no errors in the text. However, it inherited the laziness.

Models Can’t Do Everything

To get the correct spelling of difficult names, I created a glossary file that is inserted into the transcription prompt. This prevents glitches like “Jason” instead of “JSON” or “Chat GPT” instead of “ChatGPT.” This works very well.
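
Schematically, the glossary simply flows into the transcription prompt. Here is a sketch using the google-generativeai SDK; the file name, prompt wording, and generation settings are my assumptions, not the exact ones from the tool.

import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

def transcribe_chunk(chunk_path, glossary_path="glossary.txt",
                     model_name="gemini-2.0-flash-lite-preview-02-05"):
    with open(glossary_path, encoding="utf-8") as f:
        glossary = f.read()

    prompt = (
        "Transcribe this podcast recording as a JSON array of objects with the keys "
        "speaker, text, and timestamp (MM:SS).\n"
        "Use this glossary for the correct spelling of names and terms:\n"
        f"{glossary}"
    )

    audio = genai.upload_file(chunk_path)  # File API upload for larger audio files
    model = genai.GenerativeModel(model_name)
    response = model.generate_content(
        [prompt, audio],
        generation_config={"response_mime_type": "application/json"},
    )
    return response.text  # the raw JSON transcript of this chunk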

Speaker recognition hits its limits when the recording quality is mediocre or when people speak at the same time. The model also has an easier time telling speakers apart when their voices differ in pitch.

Other Implementation Details

All intermediate results are cached in files. A restart is always possible without having to begin from scratch.
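
The caching itself is unspectacular; a sketch in which the file layout and naming are assumptions:

import json
from pathlib import Path

def cached(path, compute):
    # Return the result stored at `path` if it exists; otherwise compute it and store it.
    cache_file = Path(path)
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = compute()
    cache_file.write_text(json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8")
    return result

# Usage: the transcript for chunk 3 is requested only once; a restart picks it up from disk.
# entries = cached("cache/chunk_03.json",
#                  lambda: json.loads(transcribe_chunk("cache/chunk_03.mp3")))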

A postprocessing step reads prompts from files and applies them to the raw transcript. This allows for fully automatic translation, removal of “uhh” from sentences, simplification of convoluted text, creation of summaries of different lengths and types, etc. The possibilities are endless. Due to caching, experiments with postprocessing take little time.
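
The postprocessing step is essentially a loop over prompt files; a sketch in which the directory layout is an assumption and model stands for a configured genai.GenerativeModel:

from pathlib import Path

def postprocess(raw_transcript: str, model, prompt_dir="prompts"):
    results = {}
    for prompt_file in sorted(Path(prompt_dir).glob("*.txt")):
        # Each file holds one instruction, e.g. "Translate to English",
        # "Remove filler words", or "Write a 200-word summary".
        prompt = prompt_file.read_text(encoding="utf-8")
        response = model.generate_content([prompt, raw_transcript])
        results[prompt_file.stem] = response.text
    return results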

Flowchart

Flowchart of the process from podcast recording through compression, MP3 conversion, transcription, chunking, use of glossaries and prompt files, to translation, cleanup, and summarization

Recap

The prerequisite for good transcripts is high recording quality.

The precision of the transcript is remarkably high. Every word and sound is recognized. Post-editing is only necessary when new names appear that the model does not recognize or cannot spell correctly. These are added to the glossary file, and the output improves next time.

With second-accurate timestamps, I can spot-check suspicious passages to verify that speakers are correctly assigned. We believe it's acceptable if someone is occasionally mixed up, as long as the content is completely correct.

The preparation of the audio material took the most development time. The second most time-consuming task was understanding the peculiarities of the model.

It was worthwhile to build a custom tool for this, because it lets us address our special requirements far more precisely.