GreenKey

Installing and Using ScribeTrain

Version 3.1.11 Documentation

Custom acoustic and language model training based on Kaldi




Quickstart Guide

Step 1: Obtain the ScribeTrain SDK and a License Key

All of Scribe's services require an authorized account. Contact us if you are interested in obtaining an account.

All shell scripts referred to below are contained within the ScribeTrain SDK. Base neural net models are available on request.


Step 2: Install Docker and nvidia-docker (if using GPU)

GPU-based neural net training in this container requires Docker and nvidia-docker (version 1).

Obtain a machine with at least the following specifications:

8 CPUs
10 GB RAM
200 GB disk
1 CUDA-compatible GPU
Ubuntu 16.04
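
One way to check a candidate machine against these minimums is sketched below. The meets_minimum helper is a hypothetical name and is not part of the SDK; the commented commands that read the CPU count, memory, and disk assume a GNU/Linux host with the usual coreutils and procps tools.

```shell
# Hypothetical helper (not part of the ScribeTrain SDK): succeeds when the
# given CPU count, RAM in GB, and free disk in GB meet the documented minimums.
meets_minimum() {
  cpus=$1
  ram_gb=$2
  disk_gb=$3
  [ "$cpus" -ge 8 ] && [ "$ram_gb" -ge 10 ] && [ "$disk_gb" -ge 200 ]
}

# Example on a GNU/Linux host. Note that free -g rounds down, so a machine
# sold as "10 GB" may report 9 here and deserves a manual look.
#   cpus=$(nproc)
#   ram_gb=$(free -g | awk '/^Mem:/ {print $2}')
#   disk_gb=$(df -BG --output=avail . | tail -n 1 | tr -dc '0-9')
#   meets_minimum "$cpus" "$ram_gb" "$disk_gb" || echo "below minimum specification"
#   nvidia-smi -L    # lists the GPUs visible to the NVIDIA driver
```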

Use the sdk/setup/nvidia_docker_install.sh script provided by the SDK to install Docker and nvidia-docker on a GPU-attached VM that already has CUDA installed. The script is written for Ubuntu 16.04.

CPU-based extension

CPU-based extension of models does not require an NVIDIA GPU and is available for small data sets of less than 1 hour in total length. Note that CPU training can take a couple of days even on a small data set.

To use CPU-based extension, ignore the nvidia-docker install instructions above, and use the sdk/singleuse/extend_cpu.sh script in the directions below in place of sdk/singleuse/extend.sh.

Set NUM_JOBS_INITIAL in the script to the number of CPU cores you wish to use. Note that if you only have an hour or so of data, using more than 8 CPUs will reduce model accuracy because of the smaller effective minibatch size.
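
As a rough sketch of the guidance above, the value for NUM_JOBS_INITIAL can be derived from the core count and capped at 8 when you have only about an hour of data. The pick_num_jobs helper is a hypothetical name, not something shipped with the SDK.

```shell
# Hypothetical helper: cap the available core count at 8, the limit suggested
# above for data sets of roughly one hour.
pick_num_jobs() {
  cores=$1
  if [ "$cores" -gt 8 ]; then
    echo 8
  else
    echo "$cores"
  fi
}

# Example: print a candidate value for the current machine.
#   pick_num_jobs "$(nproc)"
```

You would then copy the printed value into NUM_JOBS_INITIAL in sdk/singleuse/extend_cpu.sh by hand.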


Step 3: Download the ScribeTrain Docker image and base model

$ docker login -u [repository-user] -p [repository-password] docker.greenkeytech.com
Login Succeeded
$ docker pull docker.greenkeytech.com/scribetrain

The credentials provided by GreenKey should include your repository-user and repository-password.

Along with the Docker image, you will also need to download the base model, either for the traditional training:

$ wget https://storage.googleapis.com/gk-data/nnet_gk_base.tar.gz

or for end-to-end training, which is currently in beta, but much faster:

$ wget https://storage.googleapis.com/gk-data/nnet_full_e2e.tar.gz


Step 4: Prepare your data

ScribeTrain requires text (stm-style transcripts) and audio files (sph-formatted audio files) which have been carefully time-aligned with segments under 30 seconds.

Filenames can only contain lowercase letters, uppercase letters, numbers, and underscores. Other symbols (including hyphens) are not permitted.
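
The naming rule can be checked before training with a short script such as the following. Both function names are hypothetical, and the sketch assumes the rule applies to the part of the filename before the .stm or .sph extension.

```shell
# Hypothetical helper: succeeds only if the name (without extension) contains
# nothing but letters, digits, and underscores.
valid_name() {
  case $1 in
    ''|*[!A-Za-z0-9_]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Print every .stm or .sph file in the given directory whose name breaks the rule.
list_invalid_names() {
  for f in "$1"/*.stm "$1"/*.sph; do
    [ -e "$f" ] || continue          # skip unmatched glob patterns
    base=$(basename "$f")
    valid_name "${base%.*}" || echo "$base"
  done
}

# Example: list_invalid_names input-data/train
```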

Transcripts

Transcript files (ending with .stm) should be of the following format:

[file-name] [channel] [speaker] [start-time] [stop-time] <o,f0,[gender]> [transcript]

These files may contain any number of lines, but the format is strictly enforced. Please set the channel to 1 if this is a single channel file. If speakers have not been identified for your files, setting the speaker to a unique identifier per file will aid in the model-training process. Note that [file-name] should not have an extension and that [gender] should be either male or female. Transcripts should be all lowercase without punctuation.
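
Because the format is strictly enforced, it can help to check each transcript before training. The sketch below is not the validator ScribeTrain uses; check_stm is a hypothetical name, and the sketch assumes start and stop times are written in seconds as plain decimal numbers. It also applies the 30-second segment limit mentioned at the start of this step.

```shell
# Hypothetical checker: prints each line of an .stm file that does not match
# the documented field layout, and returns non-zero if any such line exists.
check_stm() {
  awk '
    {
      ok = (NF >= 7)                                  # six fields plus a transcript
      ok = ok && ($1 ~ /^[A-Za-z0-9_]+$/)             # file name, no extension
      ok = ok && ($4 ~ /^[0-9]+(\.[0-9]+)?$/)         # start time (assumed seconds)
      ok = ok && ($5 ~ /^[0-9]+(\.[0-9]+)?$/)         # stop time (assumed seconds)
      ok = ok && ($5 + 0 > $4 + 0)                    # stop comes after start
      ok = ok && ($5 - $4 < 30)                       # segment under 30 seconds
      ok = ok && ($6 ~ /^<o,f0,(male|female)>$/)      # gender tag
      for (i = 7; i <= NF; i++) if ($i ~ /[A-Z]/) ok = 0   # transcript is lowercase
      if (!ok) { print "line " NR ": " $0; bad = 1 }
    }
    END { exit bad }
  ' "$1"
}

# Example of a line that passes (illustrative content only):
#   train_file_1 1 speaker_1 0.00 4.25 <o,f0,male> thanks for calling
```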

You can generate STM files several ways:

1) Manual generation

2) Cleanup via SVTServer

If you have audio that you are planning on transcribing, you can use our transcription engine to do a first pass and output directly in stm format, then clean up the transcription using our data cleanup tool provided in the ScribeTrain SDK under sdk/correction. Simply place your STM files and the original WAV audio files in the data folder, then open index.html and follow the instructions.

3) Generate from a directory of WAV and text files

Download our ASR toolkit and run the WAVTXTtoSTM.py script on the folder containing your audio and text files:

$ python /scripts/WAVTXTtoSTM.py /folder/containing/audio/and/text

Audio files

Audio files must be in .sph format with a sample rate of 16 kHz and a single channel of audio. These can be readily prepared using the command-line utility sox. If you are utilizing our ASR toolkit, sox is included within the image.

More info on data preparation using sox can be found here
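
As a minimal sketch of the conversion described above, the wrapper below resamples one file to 16 kHz, mixes it to a single channel, and writes a .sph file next to the input. The to_sph name and the DRY_RUN switch are hypothetical; the sox options -r (sample rate) and -c (channel count) are standard, and sox picks the output format from the .sph extension.

```shell
# Hypothetical wrapper around sox: converts one audio file to a 16 kHz,
# single-channel .sph file. Set DRY_RUN=1 to print the command instead of
# running it.
to_sph() {
  src=$1
  dst="${src%.*}.sph"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "sox $src -r 16000 -c 1 $dst"
  else
    sox "$src" -r 16000 -c 1 "$dst"
  fi
}

# Example: convert every WAV file in the current directory.
#   for f in *.wav; do to_sph "$f"; done
```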

Directory format

Create a directory called input-data, and divide your data into training, development, and testing sets as shown below:

input-data
├── dev
│   ├── dev_file_1.stm
│   └── dev_file_2.sph
├── test
│   ├── test_file_1.stm
│   └── test_file_2.sph
└── train
    ├── train_file_1.sph
    ├── train_file_1.stm
    ├── train_file_2.sph
    └── train_file_2.stm

An 80 / 20 split of training and testing data is a starting point for small corpora. As your corpus grows, the ratio can lean slightly more towards training data. A small development dataset should be used to enable training diagnostics and to know when to stop training your model.
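
Before preprocessing, you may want to confirm that the layout is in place and see how the files are distributed. The following sketch uses a hypothetical summarize_input_data function and assumes the three folder names from the tree above.

```shell
# Hypothetical check: confirm that train, dev, and test exist under the given
# root and report how many .stm and .sph files each one holds.
summarize_input_data() {
  root=$1
  rc=0
  for split in train dev test; do
    if [ ! -d "$root/$split" ]; then
      echo "$split: missing"
      rc=1
      continue
    fi
    stm=$(find "$root/$split" -maxdepth 1 -name '*.stm' | wc -l | tr -d ' ')
    sph=$(find "$root/$split" -maxdepth 1 -name '*.sph' | wc -l | tr -d ' ')
    echo "$split: $stm stm, $sph sph"
  done
  return $rc
}

# Example: summarize_input_data input-data
```

Comparing the train and test counts gives a quick sense of how close you are to the 80 / 20 starting point.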


Step 5: Train the model on your data

5.0 Export your license key

$ export LICENSE_KEY="your.license.key"

5.1 Extract the base model environment

$ tar -zxvf nnet_gk_base.tar.gz

Note that if you are using the end-to-end model, you must extract nnet_full_e2e.tar.gz here instead.

5.2 Preprocess your input data

Execute the following to copy your training data in place for ScribeTrain.

$ bash sdk/shared/data_preprocess.sh

5.3 Set a model name and start training

$ export MODEL_NAME=some_model_name
$ export MODE=gmm_extend
$ bash sdk/singleuse/extend.sh $MODEL_NAME $MODE $LICENSE_KEY
some_docker_ID
$ docker logs -f $MODEL_NAME

You can check the status of the model using the docker logs command or by tailing log.txt in the working directory.
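
Since the training command takes all three values as positional arguments, an unset variable silently shifts the remaining ones. A small pre-flight check, sketched here with a hypothetical require_vars function, can catch that before the container starts.

```shell
# Hypothetical pre-flight check: fail early if any named variable is unset
# or empty.
require_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      return 1
    fi
  done
}

# Example:
#   require_vars MODEL_NAME MODE LICENSE_KEY &&
#     bash sdk/singleuse/extend.sh "$MODEL_NAME" "$MODE" "$LICENSE_KEY"
```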

5.4 Check the results once training is complete

Once model training is complete, you can check the word error rate (WER) of the model before and after training by reading results.txt in the working directory:

$ cat results.txt

[Optional steps]

Save your model environment

$ bash sdk/shared/save_env.sh $MODEL_NAME

You will now have a file called $MODEL_NAME.tar.gz containing a snapshot of your model training environment. In future training runs, you can extract this tar file in step 5.1 instead of the base model archive (nnet_gk_base.tar.gz or nnet_full_e2e.tar.gz).

Restart training with the same data and new parameters

After editing the relevant parameters in sdk/shared/extend.sh, you can source the following script to automatically set the required environment variables that will maintain the data and alignments from one model run to another:

$ source sdk/shared/use_saved.sh

From here, you can proceed to step 5.3 to train with your new parameters.

Add additional data or change data

After you've modified the input-data folder, run the following:

$ bash sdk/shared/data_clean.sh
$ bash sdk/shared/data_preprocess.sh

From here, you can proceed to step 5.3 to train with your new data.

Reset the entire environment

$ bash sdk/shared/full_clean.sh

From here, proceed to step 5.1 to extract the base model and train again.

Please consult Training for additional model training modes.


Step 6: Launch SVTServer with your custom model

When model training is complete, an svt.tar.gz file will appear in the local model directory. To launch SVTServer with your custom model, untar your model (svt.tar.gz). A folder svt will appear containing your custom model. Mount this folder to an SVTServer container as follows:

$ docker run \
  ...
  -v $(pwd)/svt:/custom \
  -e CUSTOM_MODEL="True" \
  ...
  docker.greenkeytech.com/svtserver

For more details on using SVTServer, please consult the relevant documentation.
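
The unpacking step that precedes the docker run command can be sketched as follows. The unpack_model name is hypothetical; the sketch only assumes, as described above, that svt.tar.gz expands to a folder named svt.

```shell
# Hypothetical helper: unpack svt.tar.gz inside the given directory and
# confirm that the svt folder was created.
unpack_model() {
  dir=$1
  tar -zxf "$dir/svt.tar.gz" -C "$dir" && [ -d "$dir/svt" ]
}

# Example: unpack in the current directory, which is the one mounted with
#   -v $(pwd)/svt:/custom
# in the command above.
#   unpack_model "$(pwd)" || echo "svt folder not found after extraction"
```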




Detailed system requirements

ScribeTrain can be launched on any Docker-compatible machine with at least 8 CPUs, 10 GB RAM, 200 GB of disk, and 1 CUDA-compatible GPU. We strongly recommend Ubuntu 16.04. Further increasing the number of GPUs and CPUs will decrease training time.

If you are using a computing cluster and not using the provided install script for Ubuntu, you will need to make sure the NFS common libraries are installed. Our scripts will automatically take care of this for you on Ubuntu.

Note that ScribeTrain models are forwards-compatible with SVTServer images. As of version 3.0, SVTServer versions 3.2.X and later are supported. Models made with previous versions of ScribeTrain are typically forwards-compatible with future versions of ScribeTrain.




Setting up ScribeTrain on a cluster of machines

Cluster machines should ideally have identical CPU, memory, and GPU configurations. All cluster machines must have GPUs. If it is not possible to balance the numbers, then the master instance should have the fewest CPUs and the least memory.

Configure your cluster for the first time

1) Run the install script sdk/setup/nvidia_docker_install.sh on the cluster master and every node.

2) On the master instance, copy the sdk/cluster/initialize_master.sh script to the working directory that you would like to conduct model training in. Check the variables in initialize_master.sh and run bash initialize_master.sh [LICENSE-KEY] from the working directory with your GreenKey [LICENSE-KEY].

3) On the master instance, copy the sdk/cluster/launch_master.sh script to the working directory and run bash launch_master.sh from the working directory. The script will wait until you launch all nodes. The present working directory will be shared with the nodes via NFS.

4) On the node instances, copy the sdk/cluster/launch_node.sh script to the working directory on the node. Then, run bash launch_node.sh master.ip.address from the working directory, where master.ip.address is the local vnet address of the master instance.

5) Once the node instances have launched, press any key in the master instance terminal to continue the launch_master.sh script. You should see a result that shows the status of the slurm cluster, with all nodes reporting idle.

What if I don't see the nodes reporting as idle, or the launch_master script hangs?

In some cases, the procedure above does not result in the proper cluster configuration. If you do not see your nodes reporting, execute the following command on your master instance:

docker restart slurm-master

Then, check the node status again by executing:

docker exec slurm-master sinfo
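
If you prefer to wait for the nodes in a script rather than re-running the command by hand, a polling loop along these lines may help. The all_idle function is a hypothetical name; the sinfo options -h (no header) and -o "%T" (print only the state) are standard Slurm options.

```shell
# Hypothetical helper: reads one state per line on stdin and succeeds only if
# there is at least one line and every line is exactly "idle".
all_idle() {
  awk 'BEGIN { n = 0; bad = 0 } { n++; if ($1 != "idle") bad = 1 } END { exit (n == 0 || bad) }'
}

# Example: poll every 10 seconds until every reported state is idle.
#   until docker exec slurm-master sinfo -h -o "%T" | all_idle; do
#     sleep 10
#   done
```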

If you still have issues, please contact us.



ScribeTrain Versioning

We maintain tagged versions of all Scribe services with major, minor, and incremental version numbers x.y.z. The latest tag should always point to the most recent version, which is currently 3.1.11.

Presently, the following ScribeTrain versions are available on our Docker repository: 3.1.11, 3.1.10, 3.1.9, 3.1.8, 3.1.6, 3.1.5, 3.1.4, 3.1.3, 3.1.2, 3.1.1, 3.1.0, 3.0.5, 2.1.0
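
Version numbers of the form x.y.z do not sort correctly as plain text (3.1.9 would come after 3.1.11), so a version-aware sort is needed when picking the newest one from a list. The sketch below uses a hypothetical latest_version helper built on sort -V from GNU coreutils. The image tag format in the commented example is an assumption based on the tagged versions described above.

```shell
# Hypothetical helper: print the highest x.y.z version from a list on stdin.
# Relies on sort -V (version sort) from GNU coreutils.
latest_version() {
  sort -V | tail -n 1
}

# Example: pin the image to an explicit tag instead of relying on latest
# (the name:tag form is assumed, not confirmed by this guide).
#   tag=$(printf '%s\n' 3.1.9 3.1.10 3.1.11 | latest_version)
#   docker pull "docker.greenkeytech.com/scribetrain:$tag"
```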