Diarization with Whisper …

Alagu Prakalya P
5 min read · Dec 8, 2022

Before moving on to diarization with Whisper, a word for those venturing into the world of Automatic Speech Recognition for the first time: fasten your seat belts! You are in for a trivia ride 🤠…

In a galaxy far, far away, a group of rebels were fighting against the evil empire. As they prepared for their next mission, they knew that their success would depend on their ability to communicate effectively. That’s where automatic speech recognition, or ASR, came in.

ASR was a technology that allowed computers to understand and interpret human speech. This was particularly useful for the rebels, as it allowed them to easily communicate with each other and with their droids, even in the midst of battle.

One of the rebels, a young Jedi named Luke Skywalker, was particularly skilled at using ASR. He had spent years training with the technology, learning to control it with his mind. This allowed him to use ASR to quickly and accurately relay commands to his comrades, even in the heat of battle. Only he knew how it worked: by analyzing the sound waves of human speech and converting them into a machine-readable format for the droids.

But the empire was not to be outdone. They too had access to ASR technology, and they used it to great effect in their own communications. With ASR, the empire was able to quickly and efficiently coordinate their forces, making them a formidable enemy for the rebels.

Despite this, the rebels continued to fight, using their ASR technology to outmaneuver the empire and carry out their missions. And with each successful operation, the empire’s grip on the galaxy began to loosen. So they wrote down the foundations on which this technology is built, to make it eternal:

  1. Signal processing algorithms: These are used to analyze the sound waves of human speech and convert them into a machine-readable format.
  2. Machine learning models: These are trained on large datasets of human speech, allowing them to learn the patterns and nuances of human language.
  3. Acoustic models: These are used to model the sounds of human speech, allowing ASR systems to more accurately interpret speech.
  4. Language models: These are used to model the structure and grammar of human language, allowing ASR systems to better understand the meaning of words and sentences.
  5. Vocabulary: The set of words and phrases that an ASR system is able to recognize is called its vocabulary. The larger an ASR system’s vocabulary, the more words and phrases it will be able to understand.
  6. Noise reduction: In noisy environments, ASR systems may have difficulty accurately interpreting speech. To overcome this, ASR systems often use algorithms to reduce background noise and improve speech recognition accuracy (a minimal sketch of this step follows the list).
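To make that last ingredient concrete, here is a minimal sketch of a noise-reduction preprocessing step. It assumes the third-party noisereduce package alongside librosa and soundfile; the file names are illustrative, not from the original article.

```python
import librosa
import noisereduce as nr
import soundfile as sf

# Load the recording at its native sampling rate.
audio, sr = librosa.load("noisy_briefing.wav", sr=None)

# Estimate a noise profile from the signal itself and subtract it.
cleaned = nr.reduce_noise(y=audio, sr=sr)

# Write the cleaned audio back out for the ASR system to consume.
sf.write("clean_briefing.wav", cleaned, sr)
```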

But the war was not yet won. While these foundations were being laid, the empire still had a powerful weapon in their arsenal: the Death Star, a massive space station with the ability to destroy entire planets. The rebels knew that they would need to use all of their skills, including their mastery of ASR technology, if they were to have any hope of destroying the Death Star and defeating the empire once and for all.

As the rebels prepared for the final battle, they knew that the fate of the galaxy would be decided by their ability to communicate and coordinate their forces. And with ASR technology on their side, they were confident that they could emerge victorious.

In the end, the rebels were successful. With the rise of Whisper by OpenAI 🤩…

Whisper is a speech recognition model developed by OpenAI. It is a transformer model, a neural network architecture that has been used extensively in natural language processing (NLP) tasks.

The architecture of Whisper is based on the transformer, which consists of multiple “layers” of interconnected nodes, each performing a different function, such as processing input data or generating output. In Whisper, these layers are arranged in an encoder-decoder stack and are made up of a combination of self-attention mechanisms and feed-forward neural networks.

One of the key features of Whisper is its ability to process long sequences of data without losing information. This is achieved through the use of self-attention mechanisms, which allow the model to “attend” to different parts of the input data simultaneously and integrate information from across the entire input sequence.
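To illustrate, here is a toy sketch of the scaled dot-product self-attention that the paragraph above describes. This is not Whisper’s actual implementation; the shapes, names, and random weights are purely illustrative.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)  # every position scores every other
    weights = F.softmax(scores, dim=-1)      # one attention distribution per position
    return weights @ v                       # weighted sum mixes in the whole sequence

x = torch.randn(10, 64)                      # 10 time steps, 64-dim features
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)       # shape: (10, 64)
```

This is what lets each time step “attend” to the entire input sequence at once rather than only its neighbors.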

Another key feature of Whisper is its ability to generate high-quality text. This is achieved by training on a large dataset of audio paired with transcripts, which allows the model to learn the patterns and nuances of human language and to generate text that is natural and easy to understand.

It combines the power of transformer models with the latest advances in speech and language research to provide a state-of-the-art system for processing human speech.

Now let’s get to the climax. Given an audio file, Whisper can auto-detect the language and translate or transcribe it. Here we have used this audio. We now need a friend to help label these transcripts, and Whisper does give us a head start here:
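Here is a minimal sketch of that transcription call, assuming the openai-whisper package; the model size and file name are our own choices, not from the original article.

```python
import whisper

model = whisper.load_model("medium")        # model size is a choice
result = model.transcribe("interview.wav")  # hypothetical file name
print(result["language"])                   # auto-detected language

# Each segment carries timestamps and a silence (no-speech) probability.
for seg in result["segments"]:
    print(f'[{seg["start"]:7.2f} -> {seg["end"]:7.2f}] '
          f'(no_speech_prob={seg["no_speech_prob"]:.2f}) {seg["text"]}')
```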

As one can observe, Whisper generates audio segments with the transcribed text, the start and end times, and the probability that a segment is silence. Though this segmentation is pitch-perfect, the labels, i.e., the identification of each speaker, are missing.

Whisper can now call on its friends pyannote and speechbrain to help with this identification task. Here we will dive into how these friends lend a hand.
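Below is a minimal sketch of the diarization step with pyannote.audio. The pretrained pipeline name is the one pyannote publishes; depending on the library version, loading it may require a Hugging Face access token, and the file names are illustrative.

```python
from pyannote.audio import Pipeline

# Recent pyannote.audio versions may need use_auth_token=... here.
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("interview.wav")  # hypothetical file name

# Dump the speaker turns (start time, duration, label) in RTTM format.
with open("interview.rttm", "w") as f:
    diarization.write_rttm(f)

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.2f}s -> {turn.end:.2f}s  {speaker}")
```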

The above script will generate an .rttm file with the start time and duration of each speaker turn, along with a speaker label. Two approaches can then be taken to get the final result:

  1. Merge consecutive turns of the same speaker; then, using the start and end times, generate audio chunks via either ffmpeg (as a subprocess) or librosa, and transcribe each chunk, since its speaker label is now pre-determined (see the sketch after this list).
  2. Since there is only a slight difference between the timestamps generated by Whisper and pyannote, a simple range merge of the two outputs can also give the desired result.
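Here is a sketch of the former approach, assuming `turns` holds the (start, end, speaker) tuples parsed from the .rttm file above; the slicing is done with librosa rather than ffmpeg, and the sample values are illustrative.

```python
import librosa
import whisper

# Assumed to come from parsing the .rttm file produced earlier:
# a list of (start_seconds, end_seconds, speaker_label) tuples.
turns = [(0.0, 4.2, "SPEAKER_00"), (4.2, 9.8, "SPEAKER_01")]  # illustrative

def merge_consecutive(turns):
    """Collapse back-to-back turns by the same speaker into one span."""
    merged = []
    for start, end, speaker in turns:
        if merged and merged[-1][2] == speaker:
            merged[-1] = (merged[-1][0], end, speaker)  # extend previous span
        else:
            merged.append((start, end, speaker))
    return merged

model = whisper.load_model("medium")
audio, sr = librosa.load("interview.wav", sr=16000)  # Whisper expects 16 kHz

for start, end, speaker in merge_consecutive(turns):
    chunk = audio[int(start * sr):int(end * sr)]
    text = model.transcribe(chunk)["text"].strip()
    print(f"{speaker}: {text}")
```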

The results of the former approach can be found here. And voilà! A friend in need is a friend indeed.

The demo of diarization can be found here. The results may not be perfectly accurate yet; exploration of the other diarization modules is still left to do. Also check out the Whisper V2 release, which was trained for more epochs and has a lower Word Error Rate than V1.

Coming up with more ventures in Natural Language Processing… and the Death Star will now sleep for a few more centuries!
