Installing and Using ScribeTrain
Version 3.2.0 Documentation
Custom acoustic and language model training based on Kaldi
Step 1: Obtain the ScribeTrain SDK and a License Key
All of Scribe's services require an authorized account. Contact us if you are interested in obtaining an account.
All shell scripts referred to below are contained within the ScribeTrain SDK. Base neural net models are available on request.
Step 2: Install Docker and nvidia-docker (if using GPU)
GPU-based neural net training with this container requires Docker and nvidia-docker (version 1).
Obtain a machine with at least the following specifications:
8 CPUs
10 GB RAM
200 GB disk
1 CUDA-compatible GPU
Ubuntu 16.04
Use the sdk/setup/nvidia_docker_install.sh script provided by the SDK to install them on a GPU-attached VM with CUDA installed. The example script works for Ubuntu 16.04.
CPU-based extension of models is available without an NVIDIA GPU for small data sets of less than 1 hour in total length. Note that CPU training can take a couple of days even on a small data set.
To use CPU-based extension, ignore the nvidia-docker install instructions above, and use the
sdk/singleuse/extend_cpu.sh script in the directions below in place of
NUM_JOBS_INITIAL in the script to the number of CPU cores you wish to use. Note that if you only have an hour or so of data, using more than 8 CPUs will reduce model accuracy due to an effectively smaller minibatch size.
Step 3: Download the ScribeTrain Docker image and base model
$ docker login -u [repository-user] -p [repository-password] docker.greenkeytech.com
Login Succeeded
$ docker pull docker.greenkeytech.com/scribetrain
The credentials provided by GreenKey should include your
Along with the Docker image, you will also need to download the base model, either for the traditional training:
$ wget https://storage.googleapis.com/gk-data/nnet_gk_base.tar.gz
or for end-to-end training, which is currently in beta, but much faster:
$ wget https://storage.googleapis.com/gk-data/nnet_full_e2e.tar.gz
Step 4: Prepare your data
ScribeTrain requires text (stm-style transcripts) and audio (sph-formatted audio files) that have been carefully time-aligned, with segments under 30 seconds.
Filenames can only contain lowercase letters, uppercase letters, numbers, and underscores. Other symbols (including hyphens) are not permitted.
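The naming rule above can be checked before training starts. The following is a minimal sketch and not part of the SDK: the function names are hypothetical, and it assumes the rule applies to the base name with its final extension removed.

```shell
# Hypothetical helper: succeeds when the base name (final extension removed)
# contains only letters, digits, and underscores.
is_valid_name() {
    local base
    base="$(basename "$1")"
    base="${base%.*}"
    case "$base" in
        '' | *[!A-Za-z0-9_]*) return 1 ;;   # empty, or contains a forbidden character
        *) return 0 ;;
    esac
}

# Hypothetical helper: list every file under a directory that breaks the rule.
# The default directory is the input-data folder used in this guide.
find_bad_names() {
    local f
    find "${1:-input-data}" -type f | while IFS= read -r f; do
        is_valid_name "$f" || echo "rename needed: $f"
    done
}
```

Running find_bad_names with no arguments scans input-data and prints one line per file that needs renaming.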
Transcript files (ending with .stm) should have the following format:
[file-name] [channel] [speaker] [start-time] [stop-time] <o,f0,[gender]> [transcript]
These files may contain any number of lines, but the format is strictly enforced. Set the channel to 1 for a single-channel file. If speakers have not been identified for your files, setting the speaker to a unique identifier per file will aid the model-training process. Note that [file-name] should not have an extension and that [gender] should be either male or female. Transcripts should be all lowercase without punctuation.
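Because the format is strictly enforced, it can help to lint transcripts before training. The sketch below is an illustration only and does not reproduce ScribeTrain's own validation: the function name check_stm is hypothetical, the accepted labels are assumed to be male and female, and the 30-second limit is taken from the segment requirement stated earlier in this step.

```shell
# Hypothetical linter for one .stm file; prints a message per problem found
# and returns a non-zero status if any line fails a check.
check_stm() {
    awk '
        function fail(msg) { printf "%s:%d: %s\n", FILENAME, NR, msg; bad = 1 }
        NF < 7                          { fail("expected at least 7 fields"); next }
        $1 ~ /\./                       { fail("file name should not carry an extension") }
        $4 + 0 >= $5 + 0                { fail("start time is not before stop time") }
        $5 - $4 >= 30                   { fail("segment is 30 seconds or longer") }
        $6 !~ /^<o,f0,(male|female)>$/  { fail("unexpected label field: " $6) }
        {
            # Fields 7 onward form the transcript, which should be lowercase.
            for (i = 7; i <= NF; i++)
                if ($i ~ /[A-Z]/) { fail("transcript contains uppercase letters"); break }
        }
        END { exit bad + 0 }
    ' "$1"
}
```

A line such as `call_01 1 spk_call_01 0.00 4.25 <o,f0,female> hello world` passes every check in this sketch.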
You can generate STM files several ways:
1) Manual generation
2) Cleanup via SVTServer
If you have audio that you plan to transcribe, you can use our transcription engine to do a first pass and output directly in stm format, then clean up the transcription using the data cleanup tool provided in the ScribeTrain SDK under sdk/correction. Simply place your STM files and the original WAV audio files in the data folder, then open index.html and follow the instructions.
3) Generate from a directory of WAV and text files
Download our ASR toolkit and run the script
python /scripts/WAVTXTtoSTM.py /folder/containing/audio/and/text
Audio files must be in .sph format with a sample rate of 16 kHz and a single channel of audio. These can be readily prepared using the command line utility sox. If you are utilizing our ASR toolkit, sox is included within the image.
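As an illustration of the sox step, the sketch below converts one WAV file to a 16 kHz, single-channel .sph file. The function names and directory layout are hypothetical; the sox options used are -r (output sample rate) and -c (output channel count), and the output format is inferred by sox from the .sph extension.

```shell
# Hypothetical helper: map audio/call_01.wav to <out_dir>/call_01.sph.
sph_target() {
    local src="$1" out_dir="$2" base
    base="$(basename "$src")"
    printf '%s/%s.sph\n' "$out_dir" "${base%.*}"
}

# Hypothetical helper: convert one file with sox.
wav_to_sph() {
    local src="$1" out_dir="$2"
    mkdir -p "$out_dir"
    # -r 16000 resamples to 16 kHz; -c 1 mixes down to a single channel.
    sox "$src" -r 16000 -c 1 "$(sph_target "$src" "$out_dir")"
}

# Example: wav_to_sph recordings/call_01.wav input-data/train
```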
More info on data preparation using sox can be found here
Create a directory called input-data, and divide your data into training and testing sets as shown below:
input-data
├── dev
│   ├── dev_file_1.stm
│   └── dev_file_2.sph
├── test
│   ├── test_file_1.stm
│   └── test_file_2.sph
└── train
    ├── train_file_1.sph
    ├── train_file_1.stm
    ├── train_file_2.sph
    └── train_file_2.stm
An 80 / 20 split of training and testing data is a good starting point for small corpora. As your corpus grows, the ratio can lean slightly more towards training data. A small development dataset should also be set aside to enable training diagnostics and to indicate when to stop training your model.
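One simple way to get an 80 / 20 split is to send every fifth recording to the test set. The sketch below is an assumption-laden illustration, not an SDK script: split_corpus is a hypothetical name, it expects matching .stm and .sph files side by side in one source directory, and it leaves the dev folder empty for you to populate by hand.

```shell
# Hypothetical helper: copy .stm/.sph pairs from a flat source directory into
# the train/test layout, sending every fifth pair to test.
split_corpus() {
    local src="$1" dest="$2" i=0 stm base split
    mkdir -p "$dest/train" "$dest/test" "$dest/dev"
    for stm in "$src"/*.stm; do
        base="${stm%.stm}"
        # Skip transcripts without audio (this also covers an unmatched glob).
        [ -f "$base.sph" ] || continue
        if [ $((i % 5)) -eq 4 ]; then split=test; else split=train; fi
        cp "$base.stm" "$base.sph" "$dest/$split/"
        i=$((i + 1))
    done
}

# Example: split_corpus raw_recordings input-data
```

Because shell globs expand in sorted order, the assignment is deterministic for a given set of file names.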
Step 5: Train the model on your data
5.0 Export your license key
$ export LICENSE_KEY="your.license.key"
5.1 Extract the base model environment
$ tar -zxvf nnet_gk_base.tar.gz
Note that if you are using the end-to-end model, you must extract nnet_full_e2e.tar.gz here instead.
5.2 Preprocess your input data
Execute the following to copy your training data in place for ScribeTrain.
5.3 Set a model name and start training
$ export MODEL_NAME=some_model_name
$ export MODE=gmm_extend
$ bash sdk/singleuse/extend.sh $MODEL_NAME $MODE $LICENSE_KEY some_docker_ID
$ docker logs -f $MODEL_NAME
You can check the status of the model using the docker logs command or by tailing log.txt in the working directory.
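The training script takes its settings as positional arguments, so an empty variable would silently shift the remaining arguments by one position. A small guard can catch that before launch. This is a sketch only; require_vars is a hypothetical helper and not part of the SDK.

```shell
# Hypothetical guard: fail if any named variable is unset or empty.
require_vars() {
    local name value missing=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name; names are hard-coded by the caller.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing or empty variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example with the step 5.3 command:
# require_vars MODEL_NAME MODE LICENSE_KEY &&
#     bash sdk/singleuse/extend.sh "$MODEL_NAME" "$MODE" "$LICENSE_KEY" some_docker_ID
```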
5.4 Check the results once training is complete
Once model training is complete, you can check the word error rate (WER) of the model before and after training by reading results.txt in the working directory:
$ cat results.txt
Save your model environment
$ bash sdk/shared/save_env.sh $MODEL_NAME
You will now have a file called $MODEL_NAME.tar.gz containing a snapshot of your model training environment. In future training, you can extract this tar file instead of nnet_full_e2e.tar.gz in step 5.1.
Restart training with the same data and new parameters
After editing the relevant parameters in sdk/shared/extend.sh, you can source the following script to automatically set the environment variables required to carry the data and alignments over from one model run to the next:
$ source sdk/shared/use_saved.sh
From here, you can proceed to step 5.3 to train with your new parameters.
Add additional data or change data
After you've modified the input-data folder, run the following:
$ bash sdk/shared/data_clean.sh
$ bash sdk/shared/data_preprocess.sh
From here, you can proceed to step 5.3 to train with your new data.
Reset the entire environment
$ bash sdk/shared/full_clean.sh
From here, proceed to step 5.1 to extract the base model and train again.
Please consult Training for additional model training modes.
Step 6: Launch SVTServer with your custom model
When model training is complete, an svt.tar.gz file will appear in the local model directory. To launch SVTServer with your custom model, untar svt.tar.gz. A folder named svt will appear containing your custom model. Mount this folder to an SVTServer container as follows:
$ docker run \
    ...
    -v $(pwd)/svt:/custom \
    -e CUSTOM_MODEL="True" \
    ...
    docker.greenkeytech.com/svtserver
For more details on using SVTServer, please consult the relevant documentation.
Detailed system requirements
ScribeTrain can be launched on any Docker-compatible machine with at least 8 CPUs, 10 GB RAM, 200 GB disk, and 1 CUDA-compatible GPU. We strongly recommend Ubuntu 16.04. Adding more GPUs and CPUs will decrease training time.
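The CPU, memory, and disk minimums can be checked with a short script before installing anything. The sketch below is Linux-specific and not part of the SDK: the function names are hypothetical, it treats "200 GB disk" as free space on the current filesystem (an interpretation, since the requirement may refer to total disk size), and it does not check for a GPU.

```shell
# Hypothetical check: succeed only if CPU count, RAM (GB), and disk (GB)
# meet the documented minimums of 8 CPUs, 10 GB RAM, and 200 GB disk.
meets_minimum() {
    [ "$1" -ge 8 ] && [ "$2" -ge 10 ] && [ "$3" -ge 200 ]
}

# Gather the values for the current host and evaluate them.
check_host() {
    local cpus ram_gb disk_gb
    cpus="$(nproc)"
    # MemTotal is reported in kB; convert to GB, rounded to the nearest integer.
    ram_gb="$(awk '/^MemTotal:/ { print int($2 / 1048576 + 0.5) }' /proc/meminfo)"
    # Available space on the current filesystem, in whole GB.
    disk_gb="$(df -Pk . | awk 'NR == 2 { print int($4 / 1048576) }')"
    echo "cpus=$cpus ram_gb=$ram_gb disk_gb=$disk_gb"
    meets_minimum "$cpus" "$ram_gb" "$disk_gb"
}
```

Run check_host from the intended working directory so that the disk figure reflects the filesystem that will hold the training data.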
If you are using a computing cluster and not using the provided install script for Ubuntu, you will need to make sure the NFS common libraries are installed. Our scripts will automatically take care of this for you on Ubuntu.
Note that ScribeTrain models are forwards-compatible with SVTServer images. As of version 3.0, SVTServer versions 3.2.X and later are supported. Models made with previous versions of ScribeTrain are typically forwards-compatible with future versions of ScribeTrain.
Setting up ScribeTrain on a cluster of machines
Cluster machines should ideally have identical CPU, memory, and GPU configurations. All cluster machines must have GPUs. If the machines cannot be matched, the master instance should be the one with the fewest CPUs and the least memory.
Configure your cluster for the first time
1) Run the install script sdk/setup/nvidia_docker_install.sh on the cluster master and every node.
2) On the master instance, copy the sdk/cluster/initialize_master.sh script to the working directory in which you would like to conduct model training. Check the variables in initialize_master.sh and run bash initialize_master.sh [LICENSE-KEY] from the working directory with your GreenKey license key.
3) On the master instance, copy the sdk/cluster/launch_master.sh script to the working directory and run bash launch_master.sh from the working directory. The script will wait until you launch all nodes. The present working directory will be shared with the nodes via NFS.
4) On the node instances, copy the sdk/cluster/launch_node.sh script to the working directory on the node. Then, run bash sdk/cluster/launch_node.sh master.ip.address where master.ip.address is the local vnet address of the master instance.
5) Once the node instances have launched, press any key in the master instance terminal to continue the launch_master.sh script. You should see a result that shows the status of the slurm cluster, with all nodes reporting as idle.
What if I don't see the nodes reporting as idle, or the launch_master script hangs?
In some cases, the procedure above does not result in the proper cluster configuration. If you do not see your nodes reporting, execute the following command on your master instance:
docker restart slurm-master
Then, check the node status again by executing:
docker exec slurm-master sinfo
If you still have issues, please contact us.
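The node check above can also be scripted so that it fails loudly when a node is not idle. This is a sketch under assumptions: all_idle is a hypothetical helper, and it expects one "node state" pair per line, which is what sinfo prints with the options -h (no header), -N (one line per node), and -o "%N %T" (node name and state).

```shell
# Hypothetical helper: read "node state" lines on stdin and succeed only if
# at least one node is listed and every listed node is idle.
all_idle() {
    awk '
        { seen = 1 }
        $2 != "idle" { print "not idle: " $1 " (" $2 ")"; bad = 1 }
        END { exit (seen && !bad) ? 0 : 1 }
    '
}

# Example on the master instance:
# docker exec slurm-master sinfo -h -N -o "%N %T" | all_idle
```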
We maintain tagged versions of all Scribe services with major, minor, and incremental version numbers. The
latest tag should always point to the most recent version, which is currently
Presently, the following ScribeTrain versions are available on our docker repo: 3.2.0, 3.1.9, 3.1.8, 3.1.6, 3.1.5, 3.1.4, 3.1.3, 3.1.2, 3.1.14, 3.1.13, 3.1.12, 3.1.11, 3.1.10, 3.1.1, 3.1.0, 3.0.5, 2.1.0, 1.20190307, 1.20190228
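The list above is in text order rather than numeric order, so 3.1.14 appears after 3.1.2 even though it is presumably the later release. When scripting against a list like this, compare the components numerically. The sketch below assumes that tags are dot-separated numbers and that numeric order reflects release order; latest_version is a hypothetical helper.

```shell
# Hypothetical helper: read a comma- or space-separated version list on stdin
# and print the numerically highest entry.
latest_version() {
    tr -s ', ' '\n\n' | sed '/^$/d' | sort -t. -k1,1n -k2,2n -k3,3n | tail -n 1
}

# Example: pin an explicit tag instead of relying on latest.
# docker pull docker.greenkeytech.com/scribetrain:3.2.0
```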