general-asr-nemo_jasper
Notes
Jasper: An End-to-End Convolutional Neural Acoustic Model
The Jasper Model
Jasper (“Just Another Speech Recognizer”) [ASR-MODELS6] is a deep time delay neural network (TDNN) composed of blocks of 1D-convolutional layers. Models in the Jasper family are denoted Jasper_[BxR], where B is the number of blocks and R is the number of convolutional sub-blocks within each block. Each sub-block contains a 1D convolution, batch normalization, ReLU, and dropout.
Figure 1: Jasper BxR model (B: number of blocks, R: number of sub-blocks)
Figure 2: Jasper Dense Residual
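The BxR naming can be made concrete with a small sketch that enumerates the sub-blocks of a Jasper_[BxR] body. The helper `jasper_layout` and the `SubBlock` class are hypothetical names introduced here for illustration; the sketch only records the four operations listed above and does not model residual connections, kernel sizes, or any layers outside the B blocks, none of which are specified in this description.

```python
from dataclasses import dataclass
from typing import List, Tuple

# The four operations the text lists for each sub-block, in order.
SUB_BLOCK_OPS: Tuple[str, ...] = ("conv1d", "batch_norm", "relu", "dropout")


@dataclass(frozen=True)
class SubBlock:
    """One sub-block of a Jasper block (illustrative, not a real layer)."""
    block: int   # 1-based index of the enclosing block
    index: int   # 1-based index of the sub-block inside its block
    ops: Tuple[str, ...] = SUB_BLOCK_OPS


def jasper_layout(num_blocks: int, num_sub_blocks: int) -> List[SubBlock]:
    """Enumerate the B x R sub-blocks of a Jasper_[BxR] body."""
    if num_blocks < 1 or num_sub_blocks < 1:
        raise ValueError("B and R must both be positive")
    return [
        SubBlock(block=b, index=r)
        for b in range(1, num_blocks + 1)
        for r in range(1, num_sub_blocks + 1)
    ]


# B=10 and R=5 are chosen only as an example configuration.
layout = jasper_layout(10, 5)
print(len(layout))       # number of sub-blocks, B * R
print(layout[0].ops)     # operations inside the first sub-block
```

Under these assumptions the number of sub-blocks is simply B times R, so the two numbers in the model name give a quick sense of network depth.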
Performance
The following table reports the word error rate (WER) of the acoustic model with greedy decoding on all LibriSpeech dev and test datasets for mixed precision training.

| Number of GPUs | Batch size per GPU | Precision | dev-clean WER | dev-other WER | test-clean WER | test-other WER |
|-----|-----|-------|-------|-------|------|-------|
| 8 | 64 | mixed | 3.20 | 9.78 | 3.41 | 9.71 |
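For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration of that definition, assuming whitespace tokenization and assuming the table values are percentages; `edit_distance` and `word_error_rate` are names chosen here and are not the scoring code used to produce the table.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: 100 * (S + D + I) / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    errors = edit_distance(ref_words, hypothesis.split())
    return 100.0 * errors / len(ref_words)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Corpus-level figures like those in the table are normally obtained by summing errors and reference words over all utterances before dividing, rather than by averaging per-utterance rates.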
Note:
This Jasper model was trained on a combination of seven datasets of English speech, totaling 7,133 hours of audio. Samples were limited to durations between 0.1 s and 16.7 s. The model was trained for 600 epochs with Apex/Amp optimization level O1.
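A duration limit of this kind is usually applied as a filter over the dataset manifest. The sketch below assumes a JSON-lines manifest in which each entry has `audio_filepath`, `duration` (in seconds), and `text` fields; that layout, the helper name `filter_by_duration`, the sample entries, and the choice of inclusive bounds are all assumptions made for illustration, not details given in this description.

```python
import json
from typing import Iterable, Iterator

MIN_DURATION_S = 0.1    # lower bound stated above
MAX_DURATION_S = 16.7   # upper bound stated above


def filter_by_duration(lines: Iterable[str],
                       min_s: float = MIN_DURATION_S,
                       max_s: float = MAX_DURATION_S) -> Iterator[dict]:
    """Yield manifest entries whose duration lies in [min_s, max_s]."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the manifest
        entry = json.loads(line)
        if min_s <= float(entry["duration"]) <= max_s:
            yield entry


# Made-up manifest entries, one JSON object per line.
manifest = [
    '{"audio_filepath": "a.wav", "duration": 0.05, "text": "hi"}',
    '{"audio_filepath": "b.wav", "duration": 4.2, "text": "hello world"}',
    '{"audio_filepath": "c.wav", "duration": 21.0, "text": "a long utterance"}',
]
kept = list(filter_by_duration(manifest))
print([entry["audio_filepath"] for entry in kept])
```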
The model works on relatively short audio files (under 25 seconds).
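One simple way to stay under that limit is to transcribe longer recordings in fixed-length pieces. The sketch below only plans the segment boundaries; `plan_chunks` is a hypothetical helper, and the 20-second default is an arbitrary value chosen here to sit below the 25-second limit. Cutting at fixed times can split a word in half, so a real pipeline would more likely cut at silences.

```python
from typing import List, Tuple

DEFAULT_CHUNK_S = 20.0  # assumed chunk length, chosen to stay below 25 s


def plan_chunks(total_s: float,
                chunk_s: float = DEFAULT_CHUNK_S) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds that cover [0, total_s]."""
    if total_s <= 0 or chunk_s <= 0:
        raise ValueError("durations must be positive")
    chunks = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)  # last chunk may be shorter
        chunks.append((start, end))
        start = end
    return chunks


# A 65-second recording is covered by three full chunks and one short one.
for start, end in plan_chunks(65.0):
    print(f"{start:5.1f} s -> {end:5.1f} s")
```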
- Model Type ID: Audio To Text
- Last Updated: Nov 23, 2022
- Privacy: PUBLIC