ScribeTrain Experimentation

How to launch ScribeTrain


ScribeTrain has three modes of execution, depending on which stages of the training process have already been performed: full model training, acoustic model customization, and language model customization.

In general, the full training process proceeds through the following:

1) Data Preparation

Set up your data as described in Preparing your data.

2) Language model training

The language model that Scribe uses affects which words and phrases are recognized from an audio sample. By providing representative training files in train and dev, you ensure that your ScribeTrain model can predict the appropriate n-grams.

3) Nnet acoustic model training

A neural network acoustic model is trained on GPUs using the provided training data. Model training scripts are available in sdk/cluster and sdk/singleuse for cluster or single-machine usage, respectively.

Available training scripts:

An overview of these processes is provided below:

| Process | Amount of Data | Use case |
| --- | --- | --- |
| Full training | 200+ hours | Train a completely new model using a custom base data source |
| Neural net extension | 5+ hours | Adapt GreenKey Scribe's base neural net with custom acoustic data |

After you launch the container using one of the provided scripts, you can view training progress in its logs (docker logs [model-name], where [model-name] is the name given to the model at runtime). A log.txt file is also created in the working directory from which you launched the container, along with a results.txt file that holds the final model performance on the test database.
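A minimal sketch of waiting for the final results: poll the working directory until results.txt appears, then print it. The filename comes from the text above; the polling interval and helper name are arbitrary choices for illustration.

```shell
# Poll for results.txt in the working directory, then print it once it exists.
wait_for_results() {
  results_file="${1:-results.txt}"
  until [ -f "$results_file" ]; do
    sleep 60   # training runs for a long time; check once a minute
  done
  cat "$results_file"
}
```

While waiting, docker logs [model-name] or tail -f log.txt shows live progress.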


A large number of parameters are configurable for neural net customization. These can be specified using nvidia-docker run ... -e VAR="$VAR" ... for single-machine usage, or through docker exec train-master bash -c "echo \"$VAR\" > /VAR" for cluster usage.
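The single-machine form above can be sketched as follows: collect a set of overrides (variable names are from the table below; the values are example choices, not recommendations) and build the matching -e flags. The command is echoed as a dry run, since the real invocation also needs the image name, mounts, and so on.

```shell
# Example overrides; each entry is NAME=VALUE.
overrides="EPOCHS=4 BATCH_SIZE=32 RESTART_NNET=true"

# Turn each override into a -e NAME="VALUE" flag.
env_flags=""
for kv in $overrides; do
  name="${kv%%=*}"
  value="${kv#*=}"
  env_flags="$env_flags -e $name=\"$value\""
done

# Dry run: print the command rather than launching a container.
echo "nvidia-docker run$env_flags ..."
```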

ScribeTrain's default values have been tuned to provide optimal out-of-the-box accuracy.

Skipping alignment phases when extending

To experiment with neural net extension parameters, run until the container finishes, then rerun with RESTART_NNET="true" and USE_SAVED_EGS="true" to facilitate rapid experimentation. In particular, different values for EPOCHS, LM_WEIGHTING, and DATA_SUBSET may be beneficial for larger training datasets. Note that this should not be done for experiments restarting from previous extension runs. Further, the saved examples cannot be re-used when the number of layers changes between runs.
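A sketch of such an experimentation loop, sweeping EPOCHS after one completed extension run. The docker command is echoed as a dry run; the epoch values and elided image details are placeholders, not recommendations.

```shell
# Re-launch with saved examples while sweeping EPOCHS (values are examples).
for epochs in 2 4 8; do
  echo "nvidia-docker run -e RESTART_NNET=\"true\" -e USE_SAVED_EGS=\"true\" -e EPOCHS=\"$epochs\" ..."
done
```

Remember that the saved examples are only valid while the neural net configuration, including the number of layers, stays fixed across runs.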

The SDK script sdk/shared/ can be sourced to set these variables in your environment.

Learning rate recommendations

Note that the default learning rates are optimal for most sets of training data. For data sets that are augmented with synthetic speech data, lower learning rates (e.g. 1e-6) are necessary to prevent fitting to artifacts in the synthesized data.
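For instance, the reduced rate could be passed like this (1e-6 comes from the text above; the command is echoed as a dry run with the remaining arguments elided):

```shell
# Lower initial learning rate for data sets augmented with synthetic speech.
LR_INIT="1e-6"
cmd="nvidia-docker run -e LR_INIT=\"$LR_INIT\" ..."
echo "$cmd"
```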

| Variable | Default | Description |
| --- | --- | --- |
| MODE | gmm_extend | Mode for executing ScribeTrain. By default, this assumes you will extend a base model using a Gaussian mixture model (GMM) as part of training. Alternatively, one could employ end-to-end (e2e) training. Possible values are e2e, e2e_extend, gmm, and gmm_extend. |
| NNET_STACK | tdnn_lstm | CNN, TDNN, LSTM, and OPGRU refer to neural network layers. Possible values for this parameter are cnn_tdnn, tdnn, tdnn_lstm, and tdnn_opgru. These can dramatically affect the batch sizes which can run on conventional GPUs. Alternatively, set to customlm to perform only language model customization. See Language model customization for more details. |
| CUSTOMIZE_LM | True | Controls whether language model customization is automatically performed during acoustic model customization. Set to False if not using the English language model. |
| LR_INIT | 0.001 (initial training), 0.0001 (extension) | The initial learning rate for neural net training. |
| LR_FINAL | 0.0001 | The final learning rate for neural net training. |
| DROPOUT_SCHEDULE | '0,0@0.20,0.3@0.50,0' | Determines which layers have dropout applied. Please see the Kaldi docs for more information. |
| EPOCHS | 8 (initial training), 2 (extension) | Number of training epochs. |
| MIN_EPOCHS | 3 (initial training), 1 (extension) | If early stopping is active, training cannot stop before this number of epochs has passed. |
| LM_STATES | 2000 (initial training), 200 (extension) | The number of states used to make the language model. |
| LR_FACTOR | 5.0 | This multiplicative factor increases the size of the learning rate. |
| CLEANUP | "false" | A value of "true" triggers the removal of intermediate model files to reduce disk usage. |
| PRIMARY_LR_FACTOR | 0.0 | This multiplicative factor reduces how much early layers can adapt to new data during neural net extension. |
| DATA_SUBSET | 150 (initial training), 50 (extension) | Kaldi creates validation and training subsets of this size for shrinkage and diagnostics. |
| SAVE_EGS | "true" | By default, save neural net training examples for future use. Set to "false" if disk space is limited. |
| USE_SAVED_EGS | "false" | A value of "true" allows the re-use of previously saved training examples for experiments on neural net extension. Must be "false" if the neural net configuration has changed. |
| RESTART_NNET | "false" | A value of "true" skips many of the steps for neural net extension, except neural net initialization, language model adaptation, and neural net training. |
| BATCH_SIZE | 64 (initial training), 32 (extension) | Batch size for neural net training. Decrease this value if out-of-memory errors occur during training. |
| RETRY_BEAM | 40 (initial training), 200 (extension) | Fallback beam size for decoding. Only configurable for neural net extension when using GMMs. To decrease neural net extension time with large corpora, set to a lower value. |
| CHUNK_LEFT_CONTEXT | 40 (end-to-end models), 0 (otherwise) | How many frames before the training sample the neural net stack is exposed to. |
| CHUNK_RIGHT_CONTEXT | 0 | How many frames after the training sample the neural net stack is exposed to. |
| DO_FINAL_COMBINATION | True | Combine all candidate models to [hopefully] improve overall model quality. |
| EARLY_STOPPING | True | Stop after MIN_EPOCHS if development set error increases. |
| EARLY_STOPPING_THRESHOLD | 0.5 | Threshold on the development set error increase that triggers early stopping. |
| FRAMES_PER_CHUNK | 150 | Number of frames per example. |
| FRAMES_PER_ITER | 15000000 | Number of frames in a single iteration of training. |
| NEW_LM_WEIGHT | 0.02 | Weighting used for mixing the old LM with the new LM. |
| SRILM | False | Set to True to ignore the dev set and generate a new LM using SRILM. |

Single-machine usage

NOTE Before starting any job, you will likely want to run the sdk/shared/ script to clear the data and exp directories.

1) Set up your data as described in Preparing your data.

2) Start the training process using bash sdk/singleuse/ [model-name] [mode] [license-key], where [mode] depends on the kind of training you are performing.

The resulting model will be in the models directory.

3) [optional] Package the nnet model when done with initial training:

./sdk/shared/ [name]

4) Reset the workspace:


Cluster usage

NOTE Before starting any job, you will likely want to run the sdk/shared/ script to clear the data and exp directories.

1) If you haven't already, run bash [LICENSE-KEY] after copying the file to your working directory.

2) Set up your data as described in Preparing your data.

3) Check the health of the nodes by executing the following from the master instance:

docker exec slurm-master sinfo

You should see all nodes reporting as idle.
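This check can be scripted. In the sketch below, SINFO_OUTPUT would come from docker exec slurm-master sinfo; a sample in the default sinfo column layout is hard-coded so the parsing can be shown standalone.

```shell
# Sample sinfo output (normally: SINFO_OUTPUT="$(docker exec slurm-master sinfo)").
SINFO_OUTPUT='PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      4   idle node[1-4]'

# STATE is the fifth column; flag any data row that is not idle.
if echo "$SINFO_OUTPUT" | awk 'NR > 1 && $5 != "idle" { bad = 1 } END { exit bad }'; then
  STATUS="all nodes idle"
else
  STATUS="some nodes are not idle"
fi
echo "$STATUS"
```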

4) Check to make sure NFS is working by making a tmp file and checking that nodes can access it:

$ rm -f data/check
$ touch data/check
$ docker exec slurm-master srun -N2 ls /shared/data/check

Replace -N2 with -N followed by the total number of nodes (including the master) in the cluster. You should see one line listing the check file for each node.
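The steps above can be sketched with the node count parameterized. NODES is the total node count including the master (4 is an example); the srun command is echoed as a dry run since it must be issued on the master instance.

```shell
# Total nodes in the cluster, including the master (example value).
NODES=4

# Create the temp file in the shared data directory of the working directory.
mkdir -p data
rm -f data/check
touch data/check

# Dry run: every node should list the file via NFS.
echo "docker exec slurm-master srun -N${NODES} ls /shared/data/check"
```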

5) Run bash sdk/cluster/ [model-name], where the script used depends on the kind of training job you are performing.

6) You can check the logs of train-master for status using the sdk/cluster/ script.

7) [optional] Package the nnet model when done with initial training:


8) Reset the workspace: