- whisper-large-v2
Introduction
Whisper-large-v2 is a speech recognition model developed by OpenAI that uses large-scale weak supervision to predict transcripts of audio collected from the internet. The model is designed to generalize well to standard benchmarks and is often competitive with prior fully supervised results, approaching the accuracy and robustness of humans.
Whisper
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning.
The models were trained on multilingual data for both speech recognition and speech translation. For speech recognition, the model predicts transcriptions in the same language as the audio. For speech translation, the model predicts transcriptions in a different language from the audio.
Compared to the Whisper large model, the large-v2 model is trained for 2.5x more epochs with added regularization for improved performance.
Run Whisper with an API
Running the API with Clarifai's Python SDK
You can run the Whisper model API using Clarifai’s Python SDK.
Find your PAT in your security settings and export it as an environment variable. Then, import and initialize the API client.
export CLARIFAI_PAT={your personal access token}
import os
from clarifai.client.model import Model

# Initialize the model from its Clarifai URL, authenticating with the PAT
# exported above
whisper_model = Model("https://clarifai.com/openai/whisper/models/whisper-large-v2", pat=os.environ["CLARIFAI_PAT"])

spanish_audio_url = "https://s3.amazonaws.com/samples.clarifai.com/featured-models/record_out+(3).wav"

# Translate the Spanish audio to English
inference_params = dict(task="translate")
model_prediction = whisper_model.predict_by_url(spanish_audio_url, "audio", inference_params=inference_params)

# Print the translated English output
print("Translated to English: ", model_prediction.outputs[0].data.text.raw)

# Transcribe the audio in its original language
inference_params = dict(task="transcribe")
model_prediction = whisper_model.predict_by_url(spanish_audio_url, "audio", inference_params=inference_params)

# Print the transcribed output
print("Transcribed audio: ", model_prediction.outputs[0].data.text.raw)
You can also run Whisper using other Clarifai client libraries, such as Java, cURL, NodeJS, and PHP; see the documentation here.
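For example, an equivalent call can be made against Clarifai's REST API directly with cURL. This is a sketch of the request shape, reusing the CLARIFAI_PAT environment variable exported earlier; inference parameters such as task are omitted, so the model's defaults apply:

curl -X POST "https://api.clarifai.com/v2/users/openai/apps/whisper/models/whisper-large-v2/outputs" \
  -H "Authorization: Key $CLARIFAI_PAT" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {
        "data": {
          "audio": {
            "url": "https://s3.amazonaws.com/samples.clarifai.com/featured-models/record_out+(3).wav"
          }
        }
      }
    ]
  }'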
Use Cases
Whisper-Large can be used for various speech recognition tasks, including transcription of audio recordings, voice commands, and speech-to-text translation. The model can be applied to different languages and accents, making it useful for multilingual applications.
Dataset
Whisper-Large is trained on a large-scale weakly supervised dataset that includes 680,000 hours of audio, covering 96 languages. The dataset also includes 125,000 hours of X→en translation data. Models trained on this dataset transfer well to existing datasets zero-shot, removing the need for any dataset-specific fine-tuning to achieve high-quality results.
Advantages:
Whisper-Large has several advantages over traditional speech recognition models. It is trained on a large-scale weakly supervised dataset, which reduces the need for expensive and time-consuming manual annotation. It can also be applied to different languages and accents, making it useful for multilingual applications. Finally, the model can be fine-tuned on the subset of transcripts that do not include speaker annotations, which helps it avoid undesirable behaviors such as getting stuck in repeat loops.
Limitations:
Whisper-Large's performance is limited by the quality and quantity of its training data. The model still struggles with many languages and accents, and there are remaining errors that need to be addressed. Its performance may also be affected by the quality of the audio recordings.
- Name: whisper-large-v2
- Model Type ID: Audio To Text
- Description: Whisper is a versatile pre-trained ASR and speech translation model trained on multilingual data without requiring fine-tuning.
- Last Updated: Oct 17, 2024
- Privacy: PUBLIC