Whisper-large-v3 is a Transformer-based speech-to-text model showing a 10-20% error reduction compared to Whisper-large-v2. It is trained on 1 million hours of weakly labeled audio and can be used for translation and transcription tasks.
Whisper-large-v3 is a pre-trained model for Automatic Speech Recognition (ASR) and Speech Translation. It is a Transformer-based encoder-decoder model developed by OpenAI, designed to handle a variety of languages and domains without the need for fine-tuning.
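The Python example below uses the Clarifai SDK to translate a Spanish audio file to English and then transcribe it: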
from clarifai.client.model import Model
from pydub import AudioSegment
from scipy.io import wavfile
# files
wav_file = "test.wav"
AUDIO_FILE_LOCATION = 'record_out.mp3'

# convert any audio file format to a .wav audio file
sound = AudioSegment.from_file(AUDIO_FILE_LOCATION)
sound.export(wav_file, format="wav")
samplerate, data = wavfile.read(wav_file)

with open(wav_file, "rb") as f:
    file_bytes = f.read()

# translate the audio to English
inference_params = dict(task="translate", sample_rate=samplerate, language="spanish")

whisper_model = Model("https://clarifai.com/openai/whisper/models/whisper-large-v3", pat="YOUR PAT")
model_prediction = whisper_model.predict_by_bytes(file_bytes, "audio", inference_params=inference_params)

# Print the translated English output
print("Translated to English: ", model_prediction.outputs[0].data.text.raw)

# Transcribing the audio
spanish_audio_url = "https://s3.amazonaws.com/samples.clarifai.com/featured-models/record_out+(3).wav"
inference_params = dict(task="transcribe", sample_rate=samplerate)
model_prediction = whisper_model.predict_by_url(spanish_audio_url, "audio", inference_params=inference_params)

# Print the transcribed output
print("Transcribed audio: ", model_prediction.outputs[0].data.text.raw)
You can also run Whisper using other Clarifai Client Libraries, such as Java, cURL, NodeJS, and PHP.
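As a minimal sketch of what those client libraries wrap, the snippet below calls the model over the REST API using Python's requests. The endpoint path and the shape of the params payload are assumptions inferred from the model URL above, not confirmed by this page:

import requests

PAT = "YOUR PAT"  # personal access token, as in the SDK example above
# Assumed endpoint, derived from the model URL (user: openai, app: whisper)
url = "https://api.clarifai.com/v2/users/openai/apps/whisper/models/whisper-large-v3/outputs"

payload = {
    "inputs": [{"data": {"audio": {
        "url": "https://s3.amazonaws.com/samples.clarifai.com/featured-models/record_out+(3).wav"
    }}}],
    # Assumption: inference params are passed via the model version's output_info
    "model": {"model_version": {"output_info": {"params": {"task": "transcribe"}}}},
}

response = requests.post(url, json=payload, headers={"Authorization": f"Key {PAT}"})
print("Transcribed audio: ", response.json()["outputs"][0]["data"]["text"]["raw"])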
Whisper-large-v3 vs Whisper-large-v2
Improvements: Whisper-large-v3 shows a 10% to 20% reduction in errors compared to Whisper-large-v2.
Training Data: Whisper-large-v3 is trained on a more extensive dataset, including 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio.
Use Cases
The Whisper-large-v3 model is versatile and can be applied to various scenarios, including speech recognition and speech translation. It exhibits strong generalization capabilities across different languages and domains.
Evaluation
Performance: The model shows improved accuracy across various languages, achieving a 10% to 20% reduction in errors compared to Whisper-large-v2.
Benchmarking: Performance metrics include error rates on the Common Voice 15 and FLEURS datasets; a sketch of the underlying metric follows.
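The error rates cited here are word error rates (WER), the standard ASR metric. As a minimal illustration of how WER is computed, here is an example using the third-party jiwer package (the transcripts are made-up strings, not model output):

from jiwer import wer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / words in the reference
print(f"WER: {wer(reference, hypothesis):.2%}")  # 2 substitutions over 9 words ≈ 22.22%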
Dataset
Training Data Size: 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio.
Data Collection: The pseudolabeled audio was collected using Whisper-large-v2.
Data Variety: The model demonstrates the ability to generalize across diverse datasets and domains.
Advantages
Generalization: The model demonstrates strong generalization to diverse datasets and domains.
Improved Performance: Whisper-large-v3 exhibits a significant reduction in errors compared to its predecessor.
Multilingual Support: The model is trained on multilingual data, enabling speech recognition and translation across different languages, as sketched below.
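For instance, a minimal sketch of transcribing audio in another source language, mirroring the Spanish example earlier; it assumes the language parameter accepts other language names, and the French audio URL is a placeholder you would replace:

from clarifai.client.model import Model

# Hypothetical example: transcribe French audio by hinting the source language
whisper_model = Model("https://clarifai.com/openai/whisper/models/whisper-large-v3", pat="YOUR PAT")
french_audio_url = "https://example.com/french_sample.wav"  # placeholder URL
inference_params = dict(task="transcribe", language="french")
prediction = whisper_model.predict_by_url(french_audio_url, "audio", inference_params=inference_params)
print("French transcript: ", prediction.outputs[0].data.text.raw)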
ID: whisper-large-v3
Model Type ID: Audio To Text
Input Type: audio
Output Type: text
Description: Whisper-large-v3 is a Transformer-based speech-to-text model showing a 10-20% error reduction compared to Whisper-large-v2. It is trained on 1 million hours of weakly labeled audio and can be used for translation and transcription tasks.