ScribeTrain Data Preparation

Formatting data for ScribeTrain




Expected file formats

ScribeTrain requires transcripts (STM-style text files) and audio (SPH-formatted files) that have been carefully time-aligned, with every segment under 30 seconds long.

Filenames can only contain lowercase letters, uppercase letters, numbers, and underscores. Other symbols (including hyphens) are not permitted.
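The naming rule above can be checked before any data is uploaded. The following is a minimal sketch, not part of the ScribeTrain SDK: the function name is_valid_name is hypothetical, and it assumes the rule applies to the name without its final extension (since .stm and .sph files necessarily contain a dot).

```python
import re
from pathlib import Path

# Letters, digits, and underscores only, per the rule stated above.
_VALID_STEM = re.compile(r"[A-Za-z0-9_]+")


def is_valid_name(filename: str) -> bool:
    """Return True if the name, ignoring its final extension, uses only permitted characters.

    Assumption: the extension itself (.stm, .sph) is exempt from the rule.
    """
    stem = Path(filename).stem
    return _VALID_STEM.fullmatch(stem) is not None


if __name__ == "__main__":
    for name in ("train_file_1.sph", "train-file-1.sph"):
        print(name, is_valid_name(name))
```

Names containing hyphens, spaces, or extra dots are rejected by this check, which makes it easy to catch problem files in bulk before training.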

Transcripts

Transcript files (ending with .stm) should use the following format, with one segment per line:

[file-name] [channel] [speaker] [start-time] [stop-time] <o,f0,[gender]> [transcript]

These files may contain any number of lines, but the format is strictly enforced. For single-channel files, set the channel to 1. If speakers have not been identified for your files, set the speaker to an identifier that is unique per file; this aids the model-training process. Note that [file-name] must not include an extension and that [gender] must be either male or female. Transcripts should be all lowercase, without punctuation.
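Because the format is strictly enforced, it can help to check each STM line locally first. The sketch below is illustrative only and is not the validator ScribeTrain itself uses: StmSegment and parse_stm_line are hypothetical names, the sketch assumes every line carries a non-empty transcript, and it applies the 30-second segment limit and the lowercase rule described in this document. It does not check punctuation, since the text does not spell out exactly which characters count.

```python
from typing import NamedTuple

MAX_SEGMENT_SECONDS = 30.0  # segments must be under 30 seconds
ALLOWED_LABELS = ("<o,f0,male>", "<o,f0,female>")


class StmSegment(NamedTuple):
    file_name: str
    channel: str
    speaker: str
    start: float
    stop: float
    label: str
    transcript: str


def parse_stm_line(line: str) -> StmSegment:
    """Parse one STM line and raise ValueError if it breaks a documented rule."""
    # Split into the six leading fields; everything after them is the transcript.
    fields = line.strip().split(maxsplit=6)
    if len(fields) < 7:
        raise ValueError(f"expected 7 fields, found {len(fields)}")
    file_name, channel, speaker, start_s, stop_s, label, transcript = fields

    if "." in file_name:
        raise ValueError("[file-name] should not have an extension")
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label field: {label}")

    start, stop = float(start_s), float(stop_s)  # float() raises ValueError on bad input
    if stop <= start:
        raise ValueError("stop-time must be greater than start-time")
    if stop - start >= MAX_SEGMENT_SECONDS:
        raise ValueError("segment is not under 30 seconds")
    if transcript != transcript.lower():
        raise ValueError("transcript should be all lowercase")

    return StmSegment(file_name, channel, speaker, start, stop, label, transcript)


if __name__ == "__main__":
    print(parse_stm_line("call_01 1 spk_a 0.00 4.25 <o,f0,female> hello and welcome"))
```

Running the parser over every line of a file and collecting the raised errors gives a quick report of which segments need attention.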

You can generate STM files in several ways:

1) Manual generation

2) Cleanup via SVTServer

If you have audio that you plan to transcribe, you can use our transcription engine for a first pass that outputs directly in STM format, then clean up the transcription with the data cleanup tool provided in the scribetrainSDK under sdk/correction. Place your STM files and the original WAV audio files in the data folder, then open index.html and follow the instructions.

3) Generate from a directory of WAV and text files

Download our ASR toolkit and run the following script:

python /scripts/WAVTXTtoSTM.py /folder/containing/audio/and/text

Audio files

Audio files must be in .sph format with a sample rate of 16 kHz and a single channel of audio. These can be readily prepared using the command-line utility sox. If you are using our asrtoolkit, sox is included in the image.

To split a multi-channel file into single-channel files, use the following commands:

sox [input-file] [output-left] remix 1
sox [input-file] [output-right] remix 2

Note that each of these files requires its own STM file containing only that channel's transcription.

To set the sample rate of a file and convert to sph, use the following command:

sox [input-file] [output.sph] rate 16k
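When many files need converting, the sox commands above can be generated and run from a script. The following is a rough sketch under stated assumptions: the function names, the _left/_right suffixes, and the intermediate .wav files are all hypothetical, and the only sox arguments used are the ones shown in this document.

```python
import shutil
import subprocess
from pathlib import Path


def split_channel_cmd(src: str, dst: str, channel: int) -> list:
    """Build `sox [input-file] [output] remix N` for one channel."""
    return ["sox", src, dst, "remix", str(channel)]


def to_sph_16k_cmd(src: str, dst: str) -> list:
    """Build `sox [input-file] [output.sph] rate 16k`."""
    return ["sox", src, dst, "rate", "16k"]


def stereo_to_sph_cmds(src: str, out_dir: str) -> list:
    """Commands that turn one two-channel file into two single-channel .sph files."""
    stem = Path(src).stem
    out = Path(out_dir)
    cmds = []
    for channel, side in ((1, "left"), (2, "right")):
        mono = str(out / f"{stem}_{side}.wav")  # hypothetical intermediate file
        sph = str(out / f"{stem}_{side}.sph")   # underscores keep the name valid
        cmds.append(split_channel_cmd(src, mono, channel))
        cmds.append(to_sph_16k_cmd(mono, sph))
    return cmds


def run_all(cmds: list) -> None:
    """Run each command in order, stopping at the first failure."""
    if shutil.which("sox") is None:
        raise RuntimeError("sox was not found on PATH")
    for cmd in cmds:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    for cmd in stereo_to_sph_cmds("call_01.wav", "converted"):
        print(" ".join(cmd))
```

Building the commands separately from running them makes it possible to print and review them before any audio is touched.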




Full model and neural net training

For neural net training, the data need to be split into training, testing, and development sets.

Create a folder named input-data on the file system you will use for training, with subfolders train, test, and dev.

For example, you can run the following command:

for set in train test dev; do mkdir -p input-data/$set; done

You will have to fill each folder with the appropriate .stm and .sph files.
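One way to fill the three folders is to shuffle the file pairs and copy them out by proportion. The sketch below is only an illustration: this document does not prescribe split ratios, so the 10% test and 10% dev fractions are assumptions, as are the function names assign_splits and copy_pairs. It also assumes every recording exists as a .stm and .sph pair sharing the same base name.

```python
import random
import shutil
from pathlib import Path


def assign_splits(stems, test_fraction=0.1, dev_fraction=0.1, seed=0):
    """Assign each file stem to train, test, or dev (fractions are assumptions)."""
    shuffled = sorted(stems)               # sort first so the result is reproducible
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    return {
        "test": shuffled[:n_test],
        "dev": shuffled[n_test:n_test + n_dev],
        "train": shuffled[n_test + n_dev:],
    }


def copy_pairs(source_dir, splits, root="input-data"):
    """Copy each stem's .stm and .sph files into root/<set>/."""
    source = Path(source_dir)
    for set_name, stems in splits.items():
        target = Path(root) / set_name
        target.mkdir(parents=True, exist_ok=True)
        for stem in stems:
            for ext in (".stm", ".sph"):
                shutil.copy2(source / f"{stem}{ext}", target / f"{stem}{ext}")


if __name__ == "__main__":
    stems = [f"file_{i}" for i in range(10)]
    for set_name, members in assign_splits(stems).items():
        print(set_name, members)
```

This splits by file. If the same speaker appears in several files, grouping by speaker before splitting may give a more honest test set; with very few files the train list can end up empty, so the result is worth inspecting before copying.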

Advanced configuration settings

A configuration file for the neural net may be placed in input-data as nnet.xconfig, where the container will read it.

Sample .xconfig files are provided in sdk/samples/. By default, a configuration matching sdk/samples/tdnn_lstm.xconfig is used.

Example directory structure

input-data
├── dev
│   ├── dev_file_1.sph
│   └── dev_file_1.stm
├── nnet.xconfig
├── test
│   ├── test_file_1.stm
│   └── test_file_1.sph
└── train
    ├── train_file_1.sph
    ├── train_file_1.stm
    ├── train_file_2.sph
    └── train_file_2.stm
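A layout like the one above can be checked before training starts. This is a minimal sketch, not an SDK utility: check_layout is a hypothetical name, and the sketch assumes, as the example tree suggests, that every .sph file is paired with an .stm file of the same base name.

```python
from pathlib import Path


def check_layout(root="input-data"):
    """Return a list of problems found in the train/test/dev folder layout."""
    problems = []
    for set_name in ("train", "test", "dev"):
        folder = Path(root) / set_name
        if not folder.is_dir():
            problems.append(f"missing folder: {set_name}")
            continue
        stm = {p.stem for p in folder.glob("*.stm")}
        sph = {p.stem for p in folder.glob("*.sph")}
        if not stm and not sph:
            problems.append(f"empty folder: {set_name}")
        # Assumption: transcripts and audio are paired by base name.
        for stem in sorted(stm - sph):
            problems.append(f"{set_name}/{stem}.stm has no matching .sph")
        for stem in sorted(sph - stm):
            problems.append(f"{set_name}/{stem}.sph has no matching .stm")
    return problems


if __name__ == "__main__":
    for problem in check_layout():
        print(problem)
```

An empty list means all three folders exist and every file has its counterpart; the optional nnet.xconfig is deliberately not checked.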




Neural net extension

When extending a previously trained neural net, input-data/train, input-data/test, and input-data/dev must all be present.

You will have to fill these folders with the appropriate .stm and .sph files. The test folder will be used by our container to evaluate your acoustic model after training completes, and a subset of the dev folder will be used for early stopping, if configured.

Advanced configuration settings

By default, the neural net configuration is unchanged and output layers are updated at 20 times the rate of previous layers.

A configuration file for the neural net extension may be placed in input-data as nnet.xconfig, where the container will read it.

Sample .xconfig files are provided in sdk/samples/ (see sdk/samples/tdnn_lstm_add_layer.xconfig for how to add a layer).

Example directory structure

input-data
├── nnet.xconfig
├── dev
│   ├── dev_file_1.sph
│   └── dev_file_1.stm
├── test
│   ├── test_file_1.stm
│   └── test_file_1.sph
└── train
    ├── train_file_1.sph
    ├── train_file_1.stm
    ├── train_file_2.sph
    └── train_file_2.stm




Data preprocessing

After your training datasets have been prepared in input-data, execute bash sdk/shared/data_preprocess.sh to copy your training data into place for ScribeTrain.