
asr-wav2vec2-large-xlsr-53-thai

Audio transcription model for converting Thai audio to Thai text

Notes

Hugging Face model ID: airesearch/wav2vec2-large-xlsr-53-th

wav2vec2-large-xlsr-53-th

Finetuning wav2vec2-large-xlsr-53 on Thai Common Voice 7.0

Read more on our blog

We finetune wav2vec2-large-xlsr-53 on the Thai examples of Common Voice Corpus 7.0, following the approach in Fine-tuning Wav2Vec2 for English ASR. The notebooks and scripts can be found in vistec-ai/wav2vec2-large-xlsr-53-th, and the pretrained model and processor at airesearch/wav2vec2-large-xlsr-53-th.
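
The checkpoint can be used for inference with the Transformers library; the snippet below is a minimal sketch (the audio file path and the resampling step are illustrative assumptions, only the model ID comes from this card):

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
# Load the released processor and checkpoint from the Hugging Face Hub.
processor = Wav2Vec2Processor.from_pretrained("airesearch/wav2vec2-large-xlsr-53-th")
model = Wav2Vec2ForCTC.from_pretrained("airesearch/wav2vec2-large-xlsr-53-th")
# "example.wav" is a placeholder path; resample to the 16 kHz the model expects.
speech, sample_rate = torchaudio.load("example.wav")
speech = torchaudio.functional.resample(speech, sample_rate, 16_000).squeeze(0)
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))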

Eval results on Common Voice 7 "test":

|                                 | WER PyThaiNLP 2.3.1 | WER deepcut | SER     | CER     |
|---------------------------------|---------------------|-------------|---------|---------|
| Only Tokenization               | 0.9524%             | 2.5316%     | 1.2346% | 0.1623% |
| Cleaning rules and Tokenization | TBD                 | TBD         | TBD     | TBD     |

Datasets

Common Voice Corpus 7.0 contains 133 validated hours of Thai (255 total hours) at 5 GB. We pre-tokenize with pythainlp.tokenize.word_tokenize. We preprocess the dataset using the cleaning rules described in notebooks/cv-preprocess.ipynb by @tann9949. We then deduplicate and split as described in ekapolc/Thai_commonvoice_split in order to 1) avoid data leakage due to random splits after cleaning in Common Voice Corpus 7.0 and 2) preserve the majority of the data for the training set. The dataset loading script is scripts/th_common_voice_70.py. You can use this script together with train_cleand.tsv, validation_cleaned.tsv, and test_cleaned.tsv to obtain the same splits as we do. The resulting dataset is as follows:

DatasetDict({
    train: Dataset({
        features: ['path', 'sentence'],
        num_rows: 86586
    })
    test: Dataset({
        features: ['path', 'sentence'],
        num_rows: 2502
    })
    validation: Dataset({
        features: ['path', 'sentence'],
        num_rows: 3027
    })
})
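
As an illustration of the pre-tokenization step, the splits could be processed roughly as follows; the load_dataset invocation and the space-joining of tokens are assumptions, while the script path and the sentence column come from the description above:

from datasets import load_dataset
from pythainlp.tokenize import word_tokenize
# Assumed invocation of the custom loading script; the exact arguments depend on
# scripts/th_common_voice_70.py and the *_cleaned.tsv files described above.
datasets = load_dataset("scripts/th_common_voice_70.py")
def pre_tokenize(batch):
    # Join pythainlp word tokens with spaces so downstream WER sees word boundaries.
    batch["sentence"] = " ".join(word_tokenize(batch["sentence"], engine="newmm"))
    return batch
datasets = datasets.map(pre_tokenize)
print(datasets["train"][0]["sentence"])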

Training

We finetuned on a single V100 GPU and chose the checkpoint with the lowest validation loss.
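
The exact hyperparameters are not reproduced here; as a hedged sketch, checkpoint selection by validation loss can be expressed with Transformers TrainingArguments as below (all values shown are placeholders, not the configuration actually used for this model):

from transformers import TrainingArguments
# Hypothetical configuration illustrating "keep the checkpoint with the lowest
# validation loss"; every numeric value is a placeholder.
training_args = TrainingArguments(
    output_dir="./wav2vec2-large-xlsr-53-th",
    per_device_train_batch_size=8,       # placeholder
    gradient_accumulation_steps=2,       # placeholder
    learning_rate=1e-4,                  # placeholder
    num_train_epochs=30,                 # placeholder
    fp16=True,                           # mixed precision on a single V100
    evaluation_strategy="steps",
    eval_steps=500,                      # placeholder
    save_steps=500,
    save_total_limit=3,
    load_best_model_at_end=True,         # reload the best checkpoint after training
    metric_for_best_model="eval_loss",   # "best" = lowest validation loss
    greater_is_better=False,
)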

Evaluation

We benchmark on the test set using WER, with words tokenized by PyThaiNLP 2.3.1 and deepcut, and CER. We also measure performance when spell correction using TNC n-grams is applied. Evaluation code can be found in notebooks/wav2vec2_finetuning_tutorial.ipynb. The benchmark is performed on the test-unique split.

|                                 | WER PyThaiNLP 2.3.1 | WER deepcut | CER      |
|---------------------------------|---------------------|-------------|----------|
| Kaldi from scratch              | 23.04               |             | 7.57     |
| Ours without spell correction   | 13.634024           | 8.152052    | 2.813019 |
| Ours with spell correction      | 17.996397           | 14.167975   | 5.225761 |
| Google Web Speech API※          | 13.711234           | 10.860058   | 7.357340 |
| Microsoft Bing Speech API※      | 12.578819           | 9.620991    | 5.016620 |
| Amazon Transcribe※              | 21.86334            | 14.487553   | 7.077562 |
| NECTEC AI for Thai Partii API※  | 20.105887           | 15.515631   | 9.551027 |

※ APIs are not finetuned with Common Voice 7.0 data
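
To make the tokenizer-dependent WER concrete, a minimal sketch of the metric computation is shown below; the example strings and the use of the jiwer package are illustrative assumptions, and the actual evaluation code lives in the notebook referenced above:

from jiwer import cer, wer
from pythainlp.tokenize import word_tokenize
# Thai text has no spaces, so WER depends on the word tokenizer: tokenize both
# reference and hypothesis, then score them as space-separated words.
reference = "สวัสดีครับ"   # illustrative reference transcript
hypothesis = "สวัสดีคับ"   # illustrative model output
def to_words(text: str) -> str:
    # engine="newmm" is the PyThaiNLP default; swap in engine="deepcut"
    # (with the deepcut package installed) for the second WER column.
    return " ".join(word_tokenize(text, engine="newmm"))
print("WER:", wer(to_words(reference), to_words(hypothesis)))
print("CER:", cer(reference, hypothesis))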

  • Name
    wav2vec2-large-xlsr-53-thai
  • Model Type ID
    Audio To Text
  • Last Updated
    Jun 28, 2022
  • Privacy
    PUBLIC