general-asr-nemo_jasper

--

Notes

Jasper: An End-to-End Convolutional Neural Acoustic Model

The Jasper Model

Jasper (“Just Another Speech Recognizer”) [ASR-MODELS6] is a deep time delay neural network (TDNN) composed of blocks of 1D-convolutional layers. Models in the Jasper family are denoted Jasper_[BxR], where B is the number of blocks and R is the number of convolutional sub-blocks within each block. Each sub-block contains a 1D convolution, batch normalization, ReLU, and dropout:

Figure 1: Jasper BxR model (B: number of blocks, R: number of sub-blocks)

Figure 2: Jasper Dense Residual
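To make the BxR naming concrete, the following is a minimal sketch that only enumerates the layer layout implied by the description above. It is not the NeMo implementation: the names `SubBlock`, `jasper_layout`, and `model_name` are hypothetical, the values B=10 and R=5 are example inputs, and residual connections and any layers outside the B blocks are deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Operations inside one sub-block, in the order listed in the description.
SUB_BLOCK_OPS: Tuple[str, ...] = ("conv1d", "batch_norm", "relu", "dropout")


@dataclass(frozen=True)
class SubBlock:
    """One convolutional sub-block (hypothetical helper, for illustration)."""
    block_index: int       # 1-based index of the enclosing block
    sub_block_index: int   # 1-based index within the block
    ops: Tuple[str, ...] = SUB_BLOCK_OPS


def jasper_layout(num_blocks: int, num_sub_blocks: int) -> List[List[SubBlock]]:
    """Return B blocks, each holding R sub-blocks."""
    if num_blocks < 1 or num_sub_blocks < 1:
        raise ValueError("B and R must both be positive")
    return [
        [SubBlock(b, r) for r in range(1, num_sub_blocks + 1)]
        for b in range(1, num_blocks + 1)
    ]


def model_name(num_blocks: int, num_sub_blocks: int) -> str:
    """Format a name following the Jasper_[BxR] convention."""
    return f"Jasper_{num_blocks}x{num_sub_blocks}"


# Example values only: B = 10 blocks, R = 5 sub-blocks per block.
layout = jasper_layout(10, 5)
total_sub_blocks = sum(len(block) for block in layout)
print(model_name(10, 5), "has", total_sub_blocks, "sub-blocks inside its blocks")
```

Under these assumptions the number of 1D convolutions inside the blocks is simply B times R, since each sub-block contributes one convolution.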

Performance

The following table reports the word error rate (WER, in %) of the acoustic model with greedy decoding on all LibriSpeech dev and test datasets for mixed precision training.

| Number of GPUs | Batch size per GPU | Precision | dev-clean WER | dev-other WER | test-clean WER | test-other WER |
|-----|-----|-------|-------|-------|------|-------|
| 8 | 64 | mixed | 3.20 | 9.78 | 3.41 | 9.71 |
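For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below is a generic illustration of that definition, not the evaluation script used to produce the numbers above; the function name `word_error_rate` and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent for a single reference/hypothesis pair."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous DP row.
    prev_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr_row.append(min(
                prev_row[j] + 1,                      # deletion
                curr_row[j - 1] + 1,                  # insertion
                prev_row[j - 1] + substitution_cost,  # match or substitution
            ))
        prev_row = curr_row

    return 100.0 * prev_row[-1] / len(ref_words)


# One missing word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{example_wer:.2f}")
```

Corpus-level figures are usually obtained by summing the edit distances and the reference word counts over all utterances before dividing, rather than averaging per-utterance values.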

Note:

This Jasper model was trained on a combination of seven English speech datasets totaling 7,133 hours of audio. Samples were limited to a minimum duration of 0.1 s and a maximum duration of 16.7 s. The model was trained for 600 epochs with Apex/Amp optimization level O1.
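The duration limits can be pictured as a simple filter over a list of training samples. This is a sketch only: the entry layout with `audio_filepath` and `duration` keys, the function names, and the choice to treat both bounds as inclusive are assumptions, not details taken from the training recipe.

```python
from typing import Dict, List

MIN_DURATION_S = 0.1   # minimum sample duration stated above
MAX_DURATION_S = 16.7  # maximum sample duration stated above


def keep_sample(duration_s: float) -> bool:
    """True if the duration falls within the stated limits (inclusive, by assumption)."""
    return MIN_DURATION_S <= duration_s <= MAX_DURATION_S


def filter_by_duration(entries: List[Dict]) -> List[Dict]:
    """Keep only the entries whose 'duration' field is within the limits."""
    return [entry for entry in entries if keep_sample(entry["duration"])]


# Hypothetical entries for illustration.
samples = [
    {"audio_filepath": "a.wav", "duration": 0.05},
    {"audio_filepath": "b.wav", "duration": 3.2},
    {"audio_filepath": "c.wav", "duration": 16.7},
    {"audio_filepath": "d.wav", "duration": 21.0},
]
kept = filter_by_duration(samples)
print([entry["audio_filepath"] for entry in kept])
```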

The model will work for relatively short audio files (under 25 seconds).

  • ID
  • Model Type ID
    Audio To Text
  • Description
    --
  • Last Updated
    Nov 23, 2022
  • Privacy
    PUBLIC
  • License