Improving Accuracy with ScribeTrain
Obtaining the most accurate transcriptions requires high-quality audio and a model suited to your audio and vocabulary. Below are some of the factors that affect transcription accuracy, along with methods for improving each.
The quality of the input audio file has a direct effect on transcription accuracy.
Methods for improving audio quality:
- Use a higher sample rate when recording audio (16 kHz or higher recommended)
- Minimize the amount of noise in the file
- Normalize the volume / gain of the file
- Reduce the recording volume so that the signal does not clip
- Keep separate speakers on separate channels of an audio file
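Several of the checks above can be automated before a file is sent for transcription. The following is a minimal sketch, not part of ScribeTrain, that uses only the Python standard library. It assumes 16-bit PCM WAV input; the function name inspect_wav and the choice to treat full-scale samples as clipped are illustrative assumptions.

```python
import array
import sys
import wave


def inspect_wav(path, min_rate=16000, clip_level=32767):
    """Report basic quality indicators for a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM audio")
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.readframes(wav.getnframes())

    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Samples at (or beyond) full scale are treated as clipped (assumption).
    peak = max((abs(s) for s in samples), default=0)
    clipped = sum(1 for s in samples if abs(s) >= clip_level)
    return {
        "sample_rate": rate,
        "sample_rate_ok": rate >= min_rate,
        "channels": channels,
        "peak": peak,
        "clipped_samples": clipped,
    }
```

A low peak suggests the file would benefit from gain normalization, a non-zero clipped count suggests the recording volume was too high, and a channel count of 1 for a multi-speaker call means the speakers were not kept on separate channels.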
Voice Onboarding and training data generation
GreenKey has a voice onboarding process that you can participate in. Visit this page, then simply press start, and read the phrases shown (hitting next between each phrase). This page also allows you to upload your own text files. Each line of the file will be treated as a single prompt. At the end of onboarding, you will need to download both the
Add these two files to your training data to improve accuracy for this voice and accent. The files are completely anonymized. Remember to convert all wav files to sph format for training.
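The documentation does not say which tool to use for the wav to sph conversion. One possible approach, sketched below, is to call SoX, which selects the output format from the file extension. This assumes SoX is installed with NIST SPHERE support and that its default sph output is acceptable for training; the helper names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def sph_command(wav_path):
    """Build a SoX command that writes a .sph file next to the .wav file."""
    wav_path = Path(wav_path)
    return ["sox", str(wav_path), str(wav_path.with_suffix(".sph"))]


def convert_directory(directory):
    """Convert every .wav file in a directory, assuming SoX is on PATH."""
    if shutil.which("sox") is None:
        raise RuntimeError("sox not found on PATH")
    for wav_path in sorted(Path(directory).glob("*.wav")):
        # check=True raises CalledProcessError if SoX reports a failure.
        subprocess.run(sph_command(wav_path), check=True)
```

Keeping the command construction separate from the directory walk makes it easy to print the commands for review before running any conversion.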
New Data Collection
A browser tool for cleaning poorly transcribed audio is provided in the scribetrainSDK under sdk/correction. Add your files to the data folder, then open index.html in your browser and follow the instructions inside. Once you correct your transcript, the wav files can be added to your training data to systematically improve accuracy. Remember to convert all wav files to sph format for training.
The language model that Scribe uses will have an impact on the words and phrases recognized from an audio sample. If your audio contains words or phrases that Scribe's language model does not know (such as proper names), these will not transcribe accurately.
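To get a rough idea of which words may be unknown to the language model, you can compare your corrected transcripts against a word list. The sketch below is illustrative only: it assumes you already have a plain list of known words, and this documentation does not describe how to obtain one from Scribe. The function name out_of_vocabulary and the simple tokenization are assumptions.

```python
import re


def out_of_vocabulary(transcript_lines, vocabulary):
    """Return words in the transcripts that are missing from the vocabulary."""
    known = {word.lower() for word in vocabulary}
    missing = set()
    for line in transcript_lines:
        # Simple tokenization: runs of letters and apostrophes, case-folded.
        for word in re.findall(r"[a-z']+", line.lower()):
            if word not in known:
                missing.add(word)
    return sorted(missing)
```

Words reported by a check like this, such as proper names, are candidates to add when adapting the language model.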
If you are using the full training process, the language model will automatically be adapted to your training data.
If you are using the extension process, you may want to adapt the language model after you have packaged the acoustic model. Follow the documentation on language model customization to add words or improve the language model.
Contact us for more information about extending the language model.
Tailoring model parameters using a target file
Several model parameters used in decoding may need to be tailored based on your target domain and model type. GreenKey offers a short tool called scan_params which repeatedly invokes SVTServer for a target file with a reference transcript. To use this tool, do the following:
Step 1 - Download docker.greenkeytech.com/scan_params and docker.greenkeytech.com/svtserver
Please contact your GreenKey Tech representative if this step is unclear.
Step 2 - Load local files
This tool needs two reference files: a corrected transcript [transcript] and an audio file [audio_file]. Store both files inside your working directory.
After you prepare this data, also copy your customized model to the custom directory in the working directory.
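Before launching the container, it can help to confirm that the working directory has the layout described in this step. The following is a minimal sketch; the function name missing_inputs is hypothetical, and the file names you pass in are whatever you chose for [transcript] and [audio_file].

```python
from pathlib import Path


def missing_inputs(workdir, transcript, audio_file):
    """List the expected scan_params inputs that are absent from workdir."""
    root = Path(workdir)
    missing = [name for name in (transcript, audio_file)
               if not (root / name).is_file()]
    # The customized model is expected in a directory named "custom".
    if not (root / "custom").is_dir():
        missing.append("custom/")
    return missing
```

An empty list means the transcript, the audio file, and the custom directory are all in place.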
Step 3 - Launch the container
MODE accepts e2e or gmm; please consult the ScribeTrain documentation for details. Per the SVT documentation, NPROCS should be set to half the number of logical cores.
docker run \
  --rm \
  -it \
  -e MODE=e2e \
  -e GKT_USERNAME=$GKT_USERNAME \
  -e GKT_API=$GKT_API \
  -e GKT_SECRETKEY=$GKT_SECRETKEY \
  -e AUDIO_FILE=[audio_file] \
  -e REFERENCE_TRANSCRIPT=[transcript] \
  -e NPROCS=[NPROCS] \
  -v $(pwd)/custom:/custom \
  -v $(pwd):/files \
  -v /var/run/docker.sock:/var/run/docker.sock \
  docker.greenkeytech.com/scan_params
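If you launch the container from a script, the NPROCS value and the argument list can be derived programmatically. The sketch below is one way to do that in Python and is not an official GreenKey helper. It mirrors the command above, reads the GKT_* credentials from the current environment, and computes NPROCS as half the logical core count; the function names are hypothetical.

```python
import os


def default_nprocs():
    """Half the logical core count, but never less than one."""
    return max(1, (os.cpu_count() or 2) // 2)


def scan_params_command(audio_file, transcript, mode="e2e", nprocs=None):
    """Assemble the docker run arguments for scan_params as a list."""
    if mode not in ("e2e", "gmm"):
        raise ValueError("mode must be 'e2e' or 'gmm'")
    nprocs = default_nprocs() if nprocs is None else nprocs
    cwd = os.getcwd()
    env = {
        "MODE": mode,
        "GKT_USERNAME": os.environ.get("GKT_USERNAME", ""),
        "GKT_API": os.environ.get("GKT_API", ""),
        "GKT_SECRETKEY": os.environ.get("GKT_SECRETKEY", ""),
        "AUDIO_FILE": audio_file,
        "REFERENCE_TRANSCRIPT": transcript,
        "NPROCS": str(nprocs),
    }
    cmd = ["docker", "run", "--rm", "-it"]
    for key, value in env.items():
        cmd += ["-e", f"{key}={value}"]
    cmd += [
        "-v", f"{cwd}/custom:/custom",
        "-v", f"{cwd}:/files",
        "-v", "/var/run/docker.sock:/var/run/docker.sock",
        "docker.greenkeytech.com/scan_params",
    ]
    return cmd
```

The returned list can be passed to subprocess.run from the working directory that holds your audio file, transcript, and custom directory. Passing the arguments as a list avoids shell quoting problems when the working directory path contains spaces.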