Whisper-large-v3 is a Transformer-based speech-to-text model showing a 10-20% error reduction compared to Whisper-large-v2. It is trained on 1 million hours of weakly labeled audio and can be used for translation and transcription tasks.
Whisper-large-v3 is a pre-trained model for Automatic Speech Recognition (ASR) and Speech Translation. It is a Transformer-based encoder-decoder model developed by OpenAI, designed to handle a variety of languages and domains without the need for fine-tuning.
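The Python example below uses the Clarifai SDK to translate a Spanish audio file to English and then transcribe it: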
from clarifai.client.model import Model
from pydub import AudioSegment
from scipy.io import wavfile
# files
wav_file = "test.wav"
AUDIO_FILE_LOCATION = 'record_out.mp3'

# convert any audio file format to a .wav audio file
sound = AudioSegment.from_file(AUDIO_FILE_LOCATION)
sound.export(wav_file, format="wav")
samplerate, data = wavfile.read(wav_file)

with open(wav_file, "rb") as f:
    file_bytes = f.read()

# translate the audio to English
inference_params = dict(task="translate", sample_rate=samplerate, language="spanish")

whisper_model = Model("https://clarifai.com/openai/whisper/models/whisper-large-v3", pat="YOUR PAT")
model_prediction = whisper_model.predict_by_bytes(file_bytes, "audio", inference_params=inference_params)

# Print the translated English output
print("Translated to English: ", model_prediction.outputs[0].data.text.raw)

# Transcribing the audio
spanish_audio_url = "https://s3.amazonaws.com/samples.clarifai.com/featured-models/record_out+(3).wav"
inference_params = dict(task="transcribe", sample_rate=samplerate)
model_prediction = whisper_model.predict_by_url(spanish_audio_url, "audio", inference_params=inference_params)

# Print the transcribed output
print("Transcribed audio: ", model_prediction.outputs[0].data.text.raw)
You can also run Whisper using other Clarifai Client Libraries, such as Java, cURL, NodeJS, and PHP.
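As a minimal sketch of what those client libraries wrap, the snippet below calls the model over the REST API using Python's requests. The endpoint path and the shape of the params payload are assumptions inferred from the model URL above, not confirmed by this page:

import requests

PAT = "YOUR PAT"  # personal access token, as in the SDK example above
# Assumed endpoint, derived from the model URL (user: openai, app: whisper)
url = "https://api.clarifai.com/v2/users/openai/apps/whisper/models/whisper-large-v3/outputs"

payload = {
    "inputs": [{"data": {"audio": {
        "url": "https://s3.amazonaws.com/samples.clarifai.com/featured-models/record_out+(3).wav"
    }}}],
    # Assumption: inference params are passed via the model version's output_info
    "model": {"model_version": {"output_info": {"params": {"task": "transcribe"}}}},
}

response = requests.post(url, json=payload, headers={"Authorization": f"Key {PAT}"})
print("Transcribed audio: ", response.json()["outputs"][0]["data"]["text"]["raw"])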
Whisper-large-v3 vs Whisper-large-v2
Improvements: Whisper-large-v3 shows a 10% to 20% reduction in errors compared to Whisper-large-v2.
Training Data: Whisper-large-v3 is trained on a more extensive dataset, including 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio.
Use Cases
The Whisper-large-v3 model is versatile and can be applied to various scenarios, including speech recognition and speech translation. It exhibits strong generalization capabilities across different languages and domains.
Evaluation
Performance: The model shows improved accuracy across various languages, achieving a 10% to 20% reduction in errors compared to Whisper-large-v2.
Benchmarking: Performance metrics include error rates on the Common Voice 15 and FLEURS datasets; a sketch of the underlying metric follows.
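The error rates cited here are word error rates (WER), the standard ASR metric. As a minimal illustration of how WER is computed, here is an example using the third-party jiwer package (the transcripts are made-up strings, not model output):

from jiwer import wer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / words in the reference
print(f"WER: {wer(reference, hypothesis):.2%}")  # 2 substitutions over 9 words ≈ 22.22%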
Dataset
Training Data Size: 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio.
Data Collection: The pseudolabeled audio was collected using Whisper-large-v2.
Data Variety: The model demonstrates the ability to generalize across diverse datasets and domains.
Advantages
Generalization: The model demonstrates strong generalization to diverse datasets and domains.
Improved Performance: Whisper-large-v3 exhibits a significant reduction in errors compared to its predecessor.
Multilingual Support: The model is trained on multilingual data, enabling speech recognition and translation across different languages, as sketched below.
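For instance, a minimal sketch of transcribing audio in another source language, mirroring the Spanish example earlier; it assumes the language parameter accepts other language names, and the French audio URL is a placeholder you would replace:

from clarifai.client.model import Model

# Hypothetical example: transcribe French audio by hinting the source language
whisper_model = Model("https://clarifai.com/openai/whisper/models/whisper-large-v3", pat="YOUR PAT")
french_audio_url = "https://example.com/french_sample.wav"  # placeholder URL
inference_params = dict(task="transcribe", language="french")
prediction = whisper_model.predict_by_url(french_audio_url, "audio", inference_params=inference_params)
print("French transcript: ", prediction.outputs[0].data.text.raw)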
ID: whisper-large-v3
Model Type ID: Audio To Text
Input Type: audio
Output Type: text
Description: Whisper-large-v3 is a Transformer-based speech-to-text model showing a 10-20% error reduction compared to Whisper-large-v2. It is trained on 1 million hours of weakly labeled audio and can be used for translation and transcription tasks.