- asr-wav2vec2-large-robust-ft-swbd-300h-english
Notes
Introduction
Facebook Automatic Speech Recognition provides models and workflows built by Facebook (now known as Meta) that you can use in your apps to carry out automatic speech recognition (ASR).
You can use this model to easily and quickly convert noisy English telephone audio to English text (speech-to-text transcription). Simply upload an English audio file from your local computer or add a publicly accessible audio URL, and the model will output the transcribed text.
Facebook ASR models will help you to effectively transcribe audio content into written words without having to type them manually. These models are also valuable to persons with disabilities who cannot use a keyboard.
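As a minimal sketch of that workflow, the checkpoint linked under More Info below can be loaded with the Hugging Face transformers library. The file path phone_call.wav and the torchaudio loading step are illustrative assumptions, not part of this model card.

```python
# Minimal transcription sketch; assumes a local 16 kHz mono recording at a
# placeholder path ("phone_call.wav") -- see the sampling-rate note below.
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-robust-ft-swbd-300h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-robust-ft-swbd-300h")

# Load the audio file (placeholder path) and drop the channel dimension.
waveform, sample_rate = torchaudio.load("phone_call.wav")
inputs = processor(waveform.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: take the most likely token at each frame.
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```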
Wav2Vec2-Large-Robust fine-tuned on Switchboard
This model is a fine-tuned version of the wav2vec2-large-robust model. It has been pretrained on:
- Libri-Light: Open-source audio books from the LibriVox project – clean read-out audio data.
- CommonVoice: Crowd-sourced audio data – read-out text snippets.
- Switchboard: Telephone speech corpus – noisy telephone data.
- Fisher: Conversational telephone speech – noisy telephone data.
The model was subsequently fine-tuned on 300 hours of:
- Switchboard: Telephone speech corpus – noisy telephone data.
Given that this model was trained on noisy telephone English audio, it is best suited to that kind of input. If you want to transcribe clean audio, or if this model is not returning the expected output, consider using the standard English audio-to-text model instead.
The audio data used to train and fine-tune this model was sampled at 16 kHz. When using the model, make sure that your speech input is also sampled at 16 kHz (see the resampling sketch below).
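Since raw telephone recordings are often captured at 8 kHz, a resampling step may be needed first. The sketch below uses torchaudio; the input path call_8khz.wav is a placeholder assumption.

```python
# Resample a (possibly 8 kHz) telephone recording to the 16 kHz rate the
# model expects; "call_8khz.wav" is a placeholder path.
import torchaudio
import torchaudio.functional as F

waveform, sample_rate = torchaudio.load("call_8khz.wav")
if sample_rate != 16_000:
    waveform = F.resample(waveform, orig_freq=sample_rate, new_freq=16_000)
    sample_rate = 16_000
```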
More Info
- Meta AI Research post: Wav2vec 2.0: Learning the structure of speech from raw audio
- Hugging Face docs: facebook/wav2vec2-large-robust-ft-swbd-300h
Paper
Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training
Authors: Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli
Abstract
Self-supervised learning of speech representations has been a very active research area but most work is focused on a single domain such as read audio books for which there exist large quantities of labeled and unlabeled data. In this paper, we explore more general setups where the domain of the unlabeled data for pre-training differs from the domain of the labeled data for fine-tuning, which in turn may differ from the test data domain. Our experiments show that using target domain data during pre-training leads to large performance improvements across a variety of setups. On a large-scale competitive setup, we show that pre-training on unlabeled in-domain data reduces the gap between models trained on in-domain and out-of-domain labeled data by 66%-73%. This has obvious practical implications since it is much easier to obtain unlabeled target domain data than labeled data. Moreover, we find that pre-training on multiple domains improves generalization performance on domains not seen during training.
Additional Info
Language: en
Data sets:
- libri_light
- common_voice
- switchboard
- fisher
- ID
- Name: wav2vec2-large-robust-ft-swbd-300
- Model Type ID: Audio To Text
- Description: Audio transcription model trained on noisy phone data for converting English audio to English text
- Last Updated: Aug 04, 2022
- Privacy: PUBLIC