GreenKey

Deploying Scribe Discovery Engine

An overview and best practices for using Scribe Discovery to find and interpret patterns in voice transcripts




The preferred way to deploy Discovery is with a Docker container. If you are able to use Docker, follow the Docker instructions below.

If you are unable to use Docker in your development, you can request compiled binaries for your preferred operating system, and follow the instructions for getting started with compiled binaries.

Getting Started - Docker Image

Scribe's Discovery Engine can be launched on any Docker-compatible machine with at least 1 CPU and 1 GB RAM, provided you have valid credentials. Contact us if you are interested in obtaining an account. Discovery can also be launched inside the Scribe Dictation Server with the same configuration options described here.


(1) Install Docker

Follow the instructions here to install Docker on your machine.


(2) Download the Scribe Discovery Engine Docker image

$ docker login -u [repository-user] -p [repository-password] docker.greenkeytech.com
Login Succeeded
$ docker pull docker.greenkeytech.com/discovery

The credentials provided by GreenKey should include your repository-user and repository-password.


Transferring the image to a machine without internet access

If you would like to install the image later on a machine without internet access, you can save the image to a tar archive.

$ docker save -o greenkeyscribe.tar docker.greenkeytech.com/discovery:latest

Then, on the machine without internet access, load the archived image (no repository login required).

$ docker load -i greenkeyscribe.tar
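
Because the archive is copied between machines by hand, it can be worth confirming that it arrived intact before loading it. The following is a minimal sketch, assuming sha256sum (GNU coreutils) is available on both machines; the helper names write_checksum and verify_checksum are illustrative and are not part of Docker or any GreenKey tooling.

```shell
# Hypothetical helpers (not part of Docker or GreenKey tooling).

# Record the archive's SHA-256 hash in "<archive>.sha256" beside the archive.
write_checksum() {
  ( cd "$(dirname "$1")" && sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# Return 0 only if the archive still matches the recorded hash.
verify_checksum() {
  ( cd "$(dirname "$1")" && sha256sum -c "$(basename "$1").sha256" >/dev/null 2>&1 )
}
```

On the connected machine, run write_checksum greenkeyscribe.tar and copy both files across. On the offline machine, verify_checksum greenkeyscribe.tar && docker load -i greenkeyscribe.tar loads the image only if the hash matches.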


(3) Launch a Scribe Discovery Engine

Scribe can be launched with or without internet access. GreenKey will provide you with either a scribe-username and scribe-secretkey or a scribe-licensekey.

A scribe-licensekey can be used for deployments with or without internet access. A scribe-username and scribe-secretkey can only be used with internet access.

With a username and secret key (internet required):

$ docker run --rm -d \
  -e GKT_API="https://scribeapi.greenkeytech.com/" \
  -e GKT_USERNAME="[scribe-username]" \
  -e GKT_SECRETKEY="[scribe-secretkey]" \
  -p [target-port]:1234 \
  -v "$PWD/[upload-directory]":/uploads \
  docker.greenkeytech.com/discovery

With a license key (no internet required):

$ docker run --rm -d \
  -e LICENSE_KEY="[scribe-licensekey]" \
  -p [target-port]:1234 \
  -v "$PWD/[upload-directory]":/uploads \
  -v "$PWD/[license-directory]":/scribe/gktlicense \
  docker.greenkeytech.com/discovery

target-port is an open port on the machine that you will use to access the service.

upload-directory is a directory where you want uploaded and augmented transcript files to be stored.

license-directory is a directory where your license files and the usagelog file are stored. Usage is automatically posted to our servers if your container is launched without a license key.
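
The choice between the two launch commands above can be scripted. The sketch below only assembles and prints the docker run arguments, one per line, so they can be inspected before anything is started. The function name build_discovery_args and the variables TARGET_PORT, UPLOAD_DIR, and LICENSE_DIR are names invented for this example, not options defined by Discovery.

```shell
# Hypothetical helper: print the docker run arguments, one per line.
# LICENSE_KEY, GKT_USERNAME and GKT_SECRETKEY match the commands above;
# TARGET_PORT, UPLOAD_DIR and LICENSE_DIR are names invented for this sketch.
build_discovery_args() {
  # Reuse the function's positional parameters as the argument list.
  set -- docker run --rm -d

  if [ -n "${LICENSE_KEY:-}" ]; then
    # License key: works with or without internet access.
    set -- "$@" -e "LICENSE_KEY=${LICENSE_KEY}" \
      -v "${LICENSE_DIR:-$PWD/license}:/scribe/gktlicense"
  elif [ -n "${GKT_USERNAME:-}" ] && [ -n "${GKT_SECRETKEY:-}" ]; then
    # Username and secret key: internet access required.
    set -- "$@" -e "GKT_API=https://scribeapi.greenkeytech.com/" \
      -e "GKT_USERNAME=${GKT_USERNAME}" \
      -e "GKT_SECRETKEY=${GKT_SECRETKEY}"
  else
    echo "set LICENSE_KEY, or both GKT_USERNAME and GKT_SECRETKEY" >&2
    return 1
  fi

  set -- "$@" -p "${TARGET_PORT:-1234}:1234" \
    -v "${UPLOAD_DIR:-$PWD/uploads}:/uploads" \
    docker.greenkeytech.com/discovery
  printf '%s\n' "$@"
}
```

Replacing the final printf line with "$@" would run the assembled command instead of printing it.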


NOTE: Some versions of Docker require the following changes to the launch commands above:

1) Remove the --rm flag. This will result in containers remaining on disk after they stop. You can manually remove them by running docker ps -a to get the container ID numbers and docker rm [container-ID] to remove the containers.

2) Add the flag --net host to allow access to the service port.


Renewing License Keys

Your scribe-licensekey will be provided along with its expiration date. To receive a new key, please report your usage as described in the SVTServer documentation.


(4) Confirm that the container is running

$ curl -X GET localhost:1234/status
{
  "message": "No interpreter jobs currently in progress.",
  "status": 0
}

Replace 1234 with whatever target-port you previously specified.
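
Because the container is started in detached mode, the status route may not answer immediately after docker run returns. A small polling loop, sketched below, can wait for it. The function name wait_for_discovery and its retry defaults are assumptions for illustration, and the loop only checks that the request succeeds; it does not interpret the status value.

```shell
# Hypothetical helper: poll a status URL until it answers or retries run out.
# Usage: wait_for_discovery [url] [attempts] [delay-seconds]
wait_for_discovery() {
  local url attempts delay i
  url="${1:-http://localhost:1234/status}"
  attempts="${2:-10}"
  delay="${3:-2}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s: quiet, -f: treat HTTP errors as failures, --max-time: do not hang.
    if curl -sf --max-time 5 "$url" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "no response from $url after $attempts attempts" >&2
  return 1
}
```

For example, wait_for_discovery http://localhost:1234/status 15 2 tries up to 15 times, two seconds apart, before giving up.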


(5) If not already done, transcribe an audio file using SVTServer

Transcribe an audio file using SVTServer with "wordConfusions":"True" to get the full word lattice. This configuration option is required to prepare the SVTServer output JSON for the Scribe Discovery Engine.

$ curl -X POST localhost:5000/upload \
  -H "Content-type: multipart/form-data" \
  -F 'data={"wordConfusions":"True"};type=application/json' \
  -F "file=@[audio-file]"

Please consult the SVTServer docs for more details on running that Scribe product.

Alternatively, a basic transcript file can be artificially constructed and sent to Discovery. Here is a template file which could be used in development:

{
  "transcript": "this is a word perfect transcript for interpretation which lacks any capitalization or punctuation"
}
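
For quick experiments, a template like the one above can be generated from ordinary text. The sketch below lowercases the text and replaces every run of punctuation or whitespace with a single space, which matches the template's lack of capitalization and punctuation. The function name make_transcript_json is illustrative, and the normalization is deliberately crude (for example, apostrophes and hyphens also become spaces).

```shell
# Hypothetical helper: turn free text into a minimal transcript JSON object.
make_transcript_json() {
  local text
  # Lowercase, collapse every run of non-alphanumerics into one space,
  # then trim leading and trailing spaces.
  text=$(printf '%s' "$1" \
    | LC_ALL=C tr '[:upper:]' '[:lower:]' \
    | LC_ALL=C tr -cs 'a-z0-9' ' ' \
    | sed 's/^ *//; s/ *$//')
  # Only [a-z0-9 ] remain, so the text needs no JSON escaping.
  printf '{"transcript": "%s"}\n' "$text"
}
```

A call such as make_transcript_json "Unit 12: copy that." > sample.json writes a file that can be posted to Discovery during development.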


(6) Run the Scribe Discovery Engine on a file

Check that the container is working by running Discovery on a test file.

If you do not have the utility jq, install it with the following:

sudo apt-get install jq

Then add the intent to the transcript JSON and send it to Discovery:

# read json string from file into shell variable
TEN_CODE_JSON_STRING=$(cat sample_ten_code.json)
# add key "intents" with value ["ten_code"] to json string
TEN_CODE_JSON_STRING=$(jq '. + { "intents": ["ten_code"] }' <<<"$TEN_CODE_JSON_STRING")

curl -X POST http://localhost:1234/discover \
     -H "Content-Type: application/json" \
     -d "$TEN_CODE_JSON_STRING"
{
  "intents": [
    {
      "entities": [
        {
          "label": "ten_code",
          "matches": [
            [
              {
                "end_time": 9.93,
                "probability": 0.13,
                "start_time": 9.39,
                "value": "10-4"
              }
            ]
          ]
        }
      ],
      "label": "ten_code",
      "probability": 1
    }
  ]
}

Note that with "retainLattice":"True", the full lattice is returned along with the extra keys that the Scribe Discovery Engine adds (shown above).
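
A response like the one above is easier to scan as one line per match. The following sketch uses jq (installed earlier in this step) to print the intent label, entity label, and matched value as tab-separated columns. The function name list_matches is illustrative, and the filter assumes only the keys visible in the sample response. It flattens matches first, so it handles both the grouped form shown here and a flat array.

```shell
# Hypothetical helper: read a Discovery response on stdin and print
# "intent<TAB>entity<TAB>value", one line per match.
list_matches() {
  jq -r '
    .intents[]
    | .label as $intent
    | .entities[]
    | .label as $entity
    | (.matches | flatten | .[])
    | [$intent, $entity, .value]
    | @tsv
  '
}
```

The output of the curl command can be piped straight into it, for example by appending | list_matches to the request.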


(7) Shut down the service

curl -X GET localhost:1234/shutdown




Getting Started - Compiled Binary Files

Requirements

Installation - Windows

First, download the Discovery zip file. Right-click on the zip file, and extract the contents of the file into the SDK directory.

Navigate into the extracted folder, and install the requirements.

cd greenkey-discovery-sdk\discovery_binaries_windows_10_64bit__python_32bit
pip install -r requirements.txt

Discovery also requires pycrypto, whose binaries need to be installed from a URL with easy_install:

easy_install http://www.voidspace.org.uk/python/pycrypto-2.6.1/pycrypto-2.6.1.win32-py2.7.exe

If the above commands were successful, you are now ready to begin development.

Installation - Mac/Linux

First, download the Discovery tar file. Move this file into the Discovery SDK directory and extract the contents of the file. Then, navigate into the newly created bundle folder.

# your file's name will depend on your os
cd greenkey-discovery-sdk
tar -xvzf discovery_binaries_[os_name]__python_64_bit.tar.gz
cd discovery_binaries_[os_name]__python_64_bit

Next, install the requirements with pip.

pip install -r requirements.txt

If the above command was successful, you are now ready to begin development.

Starting Discovery

See the Discovery SDK for the latest instructions on how to open the binaries using the SDK tools.

When Discovery starts, the terminal that Discovery was started from will be used for logging. To send requests, you must open a new terminal.




Supported interpreters

Scribe's Discovery Engine provides three example interpreters: license_plates, ten_codes, and addresses. These interpreters rely on ontological tags from the Natural Language Toolkit (NLTK) and on tokenization algorithms to find numbers, letters, and other features in transcript lattices from the Scribe Voice Transcription Server (SVTServer).

These interpreters tag and tokenize the word lattice from a transcription job with wordConfusions=True. Scribe Discovery Engine finds and returns patterns within this lattice, applying fuzzy logic behind the scenes to find these patterns despite any transcription errors.

start  stop  ontological tags  tokens      words
1.3    1.8   CD                NUM         one
1.8    2.4   CD | TO           NUM | None  two | to
2.5    3.1   CD                NUM         three
3.1    3.8   CD                NUM         four
3.9    4.9   CD                NUM         five
7.7    8.3   CD                NUM         six
8.4    9.1   CD                NUM         seven
9.2    9.7   CD                NUM         eight
9.8    10.3  CD                NUM         nine
10.3   10.9  NN                NUM         ten
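
To make the idea of finding a pattern in such a lattice concrete, the sketch below groups consecutive number tokens into candidate sequences, starting a new group whenever the silence between two words exceeds a threshold. This is only an illustration, not Discovery's actual algorithm: it expects one top candidate per time slot as five whitespace-separated columns (start, stop, tag, token, word), it ignores alternatives and probabilities, and the function name group_numbers and the 0.5 second default are assumptions.

```shell
# Hypothetical helper: read "start stop tag token word" rows on stdin and
# print each run of consecutive NUM tokens on its own line.
# $1: maximum silence (seconds) allowed between words of one group.
group_numbers() {
  awk -v max_gap="${1:-0.5}" '
    $4 == "NUM" {
      # A long pause closes the current group before this word is added.
      if (n > 0 && $1 - prev_stop > max_gap) { print line; line = ""; n = 0 }
      line = (n > 0 ? line " " : "") $5
      prev_stop = $2
      n++
      next
    }
    # Any non-number token also ends the current group.
    n > 0 { print line; line = ""; n = 0 }
    END { if (n > 0) print line }
  '
}
```

Applied to the rows above (using "two" for the second slot), the pause between 4.9 and 7.7 seconds splits the words into two groups, one through five and six through ten.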

Configuration Options

There are additional configurations that can be passed into Discovery via environment variables. To modify these options, simply add the desired parameter to your run command as an environment variable. The following example shows how to change the maximum number of entities that are returned from Discovery.

$ docker run --rm -d \
  -e GKT_API="https://scribeapi.greenkeytech.com/" \
  -e GKT_USERNAME="[scribe-username]" \
  -e GKT_SECRETKEY="[scribe-secretkey]" \
  -e MAX_NUMBER_OF_ENTITIES=5 \
  -p [target-port]:1234 \
  -v "$PWD/[upload-directory]":/uploads \
  docker.greenkeytech.com/discovery

By default, Discovery returns up to three entities. Because this command sets MAX_NUMBER_OF_ENTITIES=5, Discovery will return up to five entities (provided that there are five to be found).

environment variable | description | default
NUMBER_OF_INTENTS | Sets how many intents Discovery looks for when it detects the intent automatically. | 2
MAX_NUMBER_OF_ENTITIES | Sets how many entities Discovery returns. If GROUP_ENTITIES is set to True, this is how many entities are returned for each group of entities competing for a given slot. Otherwise, it is the total number of entities returned for a given entity type. | 3
GROUP_ENTITIES | Groups entities by time, so that entities competing for the same slot are grouped into sub-arrays of an entity array. If set to False, all entities of a given type are returned in one flat array regardless of their timestamps. | True
SORT_ENTITIES_BY_LENGTH | By default, entities are sorted by probability first and length second (e.g. if two candidates for a given entity have equal probability of being correct, the longer of the two is ranked higher). Set SORT_ENTITIES_BY_LENGTH to True to swap these priorities, so that longer entities are always ranked first. | False
HIDE_EMPTY_ENTITIES | By default, Discovery omits an entity from an intent's results if the entity has no matches. Set this to False to see every entity that Discovery searched for under a particular intent in the returned JSON, even if no matches were found for the entity. | True
ENTITY_PROBABILITY_THRESHOLD | Sets the probability threshold for rejecting entity candidates from the results. Set this higher to keep less likely candidates out of the output returned by Discovery. The value can be a float between 0 and 1. | 0.01
STRUCTURE_CONFIDENCE_THRESHOLD | Sets the minimum allowable structure confidence for entities to be considered valid. This option only takes effect if the entity being analyzed has "structure_enforcement": "True" in the intents.json. | 0.01
WORD_PROBABILITY_THRESHOLD | Sets the relative probability threshold for ignoring words. Set this higher to reject less likely possible words. This strongly affects Discovery speed. The value can be a float between 0 and 1. | 5e-3
MAX_WORDS_PER_TIMESLOT | Sets how many possible words are considered for each perceived word in the transcript. This weakly affects Discovery speed. The value can be any integer greater than 1. | 50
DISCOVERY_INTENTS | Creates a 'white list' of intents that are turned on in the Discovery instance. Use commas to separate multiple values. | None (allows all)
DISCOVERY_DOMAINS | Creates a 'white list' of domains that are turned on in the Discovery instance. Use commas to separate multiple values. | None (allows all)
USE_CUSTOM_JSON_SCHEMA | Uses a custom JSON schema from the mounted /custom folder to add new keys to the JSON output. | False
SUPPRESS_DEFAULT_OUTPUT | Suppresses the default output from Discovery in favor of the custom schema. Default output is only suppressed if USE_CUSTOM_JSON_SCHEMA is True. | True
PORT | Changes the port that Discovery is exposed on inside of Docker. Useful if the default port of 1234 is taken. Note: This option is only valid if Discovery is launched standalone, and not inside of SQC or SVT. | 1234
FILE_LOG_LEVEL | Changes the verbosity of Discovery logging. Options for more detailed logs are payload, debug, and verbose (listed in order of increasing verbosity). | info
SHUTDOWN_SECRET | Used to add security around the /shutdown route for Discovery. The user must submit "secret_key": "[shutdown secret]" to the /shutdown route to enable shutdown. | greenkeytech
SCHEMA_ENTITY_REPLACEMENT_POLICY | Chooses how entities are populated in the schema. Available options are First, Last, and Best. First chooses the best option from the first occurrence of the entity. Last chooses the best option from the last occurrence of the entity. Best chooses the most recent of the most probable options for the entity. | Best
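
Since several of these options expect a probability, a launch script might check the values before passing them to docker run. The sketch below is one way to do that with awk. The function name is_probability is illustrative, and the accepted number syntax (plain decimals and exponents such as 5e-3) is an assumption rather than a statement of what Discovery itself parses.

```shell
# Hypothetical helper: succeed only if $1 is a number between 0 and 1.
is_probability() {
  awk -v v="$1" 'BEGIN {
    numeric = (v ~ /^[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/)
    exit !(numeric && v + 0 >= 0 && v + 0 <= 1)
  }'
}

# Example: validate two thresholds before building the -e flags.
for setting in "ENTITY_PROBABILITY_THRESHOLD=0.05" "WORD_PROBABILITY_THRESHOLD=5e-3"; do
  if is_probability "${setting#*=}"; then
    echo "-e $setting"
  else
    echo "invalid value: $setting" >&2
  fi
done
```

Checking values up front means a typo such as 1.5 or an empty string is caught before a container is started with it.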

Continue reading to learn about Discovery Entities.

To learn more about creating your own intents, read more about them here.