GreenKey

Model Customization

Scribe enables complete customization of its language model and acoustic model using custom data.




Customizing the Language Model

The language model that Scribe uses affects which words and phrases are recognized from an audio sample. If your audio contains words or phrases that Scribe's language model does not know (such as proper names), they will not be transcribed accurately.

GreenKey Scribe's ScribeTrain container allows you to customize the transcription language model from any word set. You can use this feature to add completely new words or phrases or provide representative text to increase the accuracy of target phrases. Language model customization requires launching both the transcription container and the ScribeTrain container.

We recommend a machine with at least 8 GB of RAM and 4 CPUs for building custom language models. The method below describes how to train a custom language model given a text file or set of text files with representative text.


(1) Compiling a Corpus

The corpus for building a custom language model should contain one or more text files with representative text (the training set). Desired words, phrases and sentences from a specific domain should be added to the training text files.

Text should be all lowercase, with one phrase or sentence per line, and contain only the following characters:

abcdefghijklmnopqrstuvwxyz' -

The text file(s) can have any name. The ScribeTrain container will assume all text files in the directory are part of the corpus. An example valid text file would have the following format:

this is line one of my custom corpus
this is line two of my custom corpus that i 'll be training my language model with
sometimes i may have custom words that are hyphen-ated or completelymadeup

The custom words in this case are "hyphen-ated" and "completelymadeup"; custom words like these can be added individually to the corpus training files.
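
If your source text is mixed case or contains punctuation, a simple shell pipeline can help normalize it. The commands below are a minimal sketch (raw_text.txt and corpus/normalized.txt are illustrative file names, not part of ScribeTrain); note that digits and punctuation are deleted outright, so spell out numbers beforehand.

# keep only lowercase letters, apostrophes, hyphens, spaces, and newlines
$ tr '[:upper:]' '[:lower:]' < raw_text.txt \
  | tr -cd "a-z' \n-" \
  > corpus/normalized.txt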


Creating a corpus from spreadsheets and/or formatted text

Check out the GreenKey ASRToolkit for more information on how to generate a corpus from spreadsheets or formatted text.


Advanced: adding an in-domain development set

You can also train a language model using a large corpus of text and an in-domain "development" set. The development set will not be included in the language model; instead, it is used to optimize the weighting of the training text to minimize perplexity. See this description of language model training.

To add a development set, name it "dev.txt" and place it in the same corpus directory. Note that adding a list of individual words or phrases (on separate lines) to the dev.txt file runs the risk of the language model training process removing those words from the training text files. The container performs this removal to prevent dev lines from also appearing in the training set. It is recommended that individual custom words and phrases be added to the training files only.

Below is an example dev.txt file:

welcome to this apple incorporated first quarter fiscal year two
m cook and he'll be followed by c f o luca maestri
future business outlook actual results
i'm very happy to share with you the outstanding results

This process is not recommended if your corpus is small or contains many repeated phrases.
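
Putting this together, a corpus directory might contain something like the following (file names other than dev.txt are illustrative and can be anything):

$ ls corpus/
dev.txt  earnings_calls.txt  product_names.txt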


(2) Training a Custom Language Model

You can train a custom language model from the base model in SVTServer or from a custom acoustic model created via ScribeTrain.


Training from an SVTServer base model

First, pull the latest version of SVTServer and ScribeTrain:

$ docker pull docker.greenkeytech.com/svtserver && \
  docker pull docker.greenkeytech.com/scribetrain

Then, launch SVTServer and expose the current transcription model:

$ docker run -d \
  -e GKT_API="https://scribeapi.greenkeytech.com/" \
  -e GKT_USERNAME="[scribe-username]" \
  -e GKT_SECRETKEY="[scribe-secretkey]" \
  -v /scribe/gkt/models/svt \
  --name gk_svtserver \
  docker.greenkeytech.com/svtserver

GKT_API, GKT_USERNAME, and GKT_SECRETKEY can be replaced with LICENSE_KEY as shown in the "Deploying" section of the documentation.
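
For example, a license-key launch might look like the following sketch (the placeholder value is illustrative):

$ docker run -d \
  -e LICENSE_KEY="[license-key]" \
  -v /scribe/gkt/models/svt \
  --name gk_svtserver \
  docker.greenkeytech.com/svtserver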

Note: As of version 3.3.0, multiple models can be exposed and customized from the container. Two common choices are:

Fast model: -v /scribe/gkt/models/svt

More accurate model: -v /scribe/gkt/models/svt-lstm

Note which model you expose here, as you will need to mount it with the correct label later.

Finally, launch the ScribeTrain container:

$ docker run -d --rm \
  --name gk_customlm \
  -e MODE=CUSTOMLM \
  -e GKT_API="https://scribeapi.greenkeytech.com/" \
  -e GKT_USERNAME="username" \
  -e GKT_SECRETKEY="secretkey" \
  -e NGRAM_NUMBER=5 \
  -e PRUNE=True \
  -e NEW_LM_WEIGHT="0.02" \
  -v $(pwd)/[corpus]:/corpus \
  -v $(pwd)/[model]:/model \
  --volumes-from gk_svtserver \
  docker.greenkeytech.com/scribetrain


Training from a custom acoustic model

Launch the ScribeTrain container specifying the path to your custom acoustic model:

$ docker run -d --rm \
  --name gk_customlm \
  -e MODE=CUSTOMLM \
  -e GKT_API="https://scribeapi.greenkeytech.com/" \
  -e GKT_USERNAME="username" \
  -e GKT_SECRETKEY="secretkey" \
  -e NGRAM_NUMBER=5 \
  -e PRUNE=True \
  -e NEW_LM_WEIGHT="0.02" \
  -v $(pwd)/[corpus]:/corpus \
  -v $(pwd)/[model]:/model \
  -v [path-to-custom-acoustic-model]:/scribe/gkt/models/svt \
  docker.greenkeytech.com/scribetrain

Note that extracting the svt.tar.gz file created by ScribeTrain will result in a folder svt/ containing the custom model files. The path to this folder should be the [path-to-custom-acoustic-model] shown above.
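
For example, assuming svt.tar.gz is in your current directory, it can be extracted before launching:

$ tar -xzf svt.tar.gz

The resulting $(pwd)/svt directory can then be used as [path-to-custom-acoustic-model] in the command above.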

Configuration variables for all ScribeTrain runs with mode CUSTOMLM

[corpus] is the directory in your local path containing the input text.

[model] is the directory in your local path that will contain the output model files once the container finishes.

The NGRAM_NUMBER is a required parameter denoting the maximum length in words of a phrase in the new language model.

The PRUNE parameter denotes whether the new language model should be pruned before being interpolated with the old one. This parameter is optional and defaults to False.

The NEW_LM_WEIGHT parameter specifies the weight at which the new language model is interpolated with the old one. For example, a weight of 0.1 means that the old and new language models comprise 90% and 10% of the interpolated model, respectively. This parameter is optional; by default, the two models are simply joined together without any weighting.

If you have provided a development set, Scribe will automatically merge your development set and corpus with a small amount of sample text. You can override either of these samples by setting the environment variables -e OVERRIDE_DEV="True" and -e OVERRIDE_TRAIN="True".

You can track the progress of the language model training using docker logs:

$ docker logs gk_customlm

The container will exit once the model training is complete.
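
If you prefer to follow the training output live until the container exits, the standard Docker -f flag can be added (general Docker behavior, not specific to ScribeTrain):

$ docker logs -f gk_customlm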


(3) Use Your Custom Language Model

Add the following global parameters to the launch of your container:

docker run \
  ...
  -v $(pwd)/[model]:/custom \
  -e CUSTOM_MODEL="True" \
  ...

[model] is the same model directory as above.
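
For example, a full SVTServer launch that uses only the custom language model might look like the following sketch (credentials shown here via LICENSE_KEY; the placeholder is illustrative):

$ docker run -d \
  -e LICENSE_KEY="[license-key]" \
  -v $(pwd)/[model]:/custom \
  -e CUSTOM_MODEL="True" \
  --name gk_svtserver \
  docker.greenkeytech.com/svtserver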

If you are using a custom acoustic model, add the MODEL_LABEL and FORCE_MODEL variables as well:

  ...
  -e MODEL_LABEL=svt-lstm-e2e \
  -e FORCE_MODEL=greenkey_svt_lstm_e2e \
  ...

If you used a version of ScribeTrain older than 3.1.0, or you are using a mode other than e2e_extend, use MODEL_LABEL='svt-lstm' and FORCE_MODEL=greenkey_svt_lstm instead.

Note: we recommend having an additional 2 GB of RAM available when using a custom model.


(optional) Querying a Language Model

You can query the base or your custom language model to find the probability of a certain phrase.

(1) Launch the base SVTServer container from which to grab the model:

$ docker run -d \
  -e GKT_API="https://scribeapi.greenkeytech.com/" \
  -e GKT_USERNAME="[scribe-username]" \
  -e GKT_SECRETKEY="[scribe-secretkey]" \
  -e KEEP_ALIVE="True" \
  -v /scribe/gkt/models/svt \
  --name gk_svtserver \
  docker.greenkeytech.com/svtserver

GKT_API, GKT_USERNAME, and GKT_SECRETKEY can be replaced with LICENSE_KEY as shown in the "Deploying" section of the documentation.

(2) If you would like to test a custom model, launch the ScribeTrain container in CUSTOMLM mode as shown below:

$ docker run -it --rm \
  --entrypoint /bin/bash \
  -w /customlm \
  --name gk_customlm \
  -e MODE=CUSTOMLM \
  -e GKT_API="https://scribeapi.greenkeytech.com/" \
  -e GKT_USERNAME="username" \
  -e GKT_SECRETKEY="secretkey" \
  -e NGRAM_NUMBER=5 \
  -e PRUNE=True \
  -e NEW_LM_WEIGHT="0.1" \
  -v $(pwd)/[corpus]:/corpus \
  --volumes-from gk_svtserver \
  docker.greenkeytech.com/scribetrain

or, launch with this reduced command to query the base model:

$ docker run -it --rm \
  --entrypoint /bin/bash \
  -w /customlm \
  --name gk_customlm \
  -e MODE=CUSTOMLM \
  -e GKT_API="https://scribeapi.greenkeytech.com/" \
  -e GKT_USERNAME="username" \
  -e GKT_SECRETKEY="secretkey" \
  --volumes-from gk_svtserver \
  docker.greenkeytech.com/scribetrain

(3) The container should launch and expose a bash terminal. Initialize the model by running the following command:

$ bash init.sh
Scribe CustomLM Version 3.2.0
Waiting for license..

If you are interested in using the custom model, conduct the language model merging process by running the following command:

$ bash merge.sh

The custom model training process will commence.

(4) Query the model using the following command:

$ python query.py [model-type] "some custom phrase"

where [model-type] is either base or custom. For example:

$ python query.py base "this is a test"
Log10 probability: -5.84754225612
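
If you merged a custom model in the previous step, you can query it the same way to compare scores:

$ python query.py custom "this is a test"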


Combining acoustic and language models

Models produced by ScribeTrain contain both an acoustic and language model. The CUSTOMLM mode also allows for cross-combination of the acoustic model from one model and the language model from another.

(1) Run the command below to open up an interactive session:

docker run -it --rm \
  --entrypoint /bin/bash \
  -w /customlm \
  --name gk_customlm \
  -e MODE=CUSTOMLM \
  -e LICENSE_KEY=$LICENSE_KEY \
  -v $(pwd)/[source-ac-model]:/acmodel \
  -v $(pwd)/[source-lm-model]:/scribe/gkt/models/svt \
  -v $(pwd)/[output-model]:/model \
  docker.greenkeytech.com/scribetrain

where [source-ac-model] and [source-lm-model] are the models from which the acoustic and language models, respectively, will be sourced. Note that both of these directories should contain the full set of files for the compiled model. The [output-model] directory will contain the new language model files, which can then be combined with the [source-ac-model].

(2) In the shell that opens, initialize the session:

$ bash init.sh
Scribe CustomLM Version 3.2.0
Waiting for license..

(3) Finally, cross-combine the acoustic and language models:

$ bash mkgraph.sh

Once the script finishes, you can exit the shell session within the container.