GreenKey

Creating Entities




The Discovery Engine comes with some built-in entity recognition, such as addresses and US license plates. You also have the ability to define your own entities, which can give you the power to extract any information you need from a given command.

Let's examine how this works.

Composing Entities out of Tokens

For entities to be recognized, they must be broken down into individual "tokens." A token is a kind of tag that Discovery attaches to words or phrases in a sentence. For its most basic implementation, a token might be a number, or even a single letter. Because of the wide application of letters and numbers as entities, Discovery comes with these tokens implemented by default.

For a list of all tokens that come with Discovery, click here.

What this means in practice is that a command like "Dial five" would, by default, already have the word "five" tokenized as a number. Similarly, the command "Choose option c" would have the letter "c" tagged as a letter.

Once we have the necessary tokens in place, we can begin to construct our entities out of these tokens. We call a specific ordering of tokens a "pattern." Every entity is made up of at least one pattern, but you can have multiple patterns that make up a single entity. The simplest pattern is made of only one token.
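The nesting of tokens, token slots, and patterns can be pictured with a toy sketch. This is illustrative Python only, not Discovery's internal implementation:

```python
# Illustrative only: an entity is a list of patterns; a pattern is a list
# of token slots; a token slot is a list of acceptable token labels.
def matches_pattern(token_labels, pattern):
    """Return True if a sequence of token labels satisfies one pattern."""
    if len(token_labels) != len(pattern):
        return False
    return all(label in slot for label, slot in zip(token_labels, pattern))


def matches_entity(token_labels, entity):
    """An entity matches if any one of its patterns matches."""
    return any(matches_pattern(token_labels, p) for p in entity["patterns"])


digit_entity = {"patterns": [[["NUM"]]]}
print(matches_entity(["NUM"], digit_entity))     # True
print(matches_entity(["LETTER"], digit_entity))  # False
```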

A Simple Example

To try this out, consider the aforementioned example of a command to dial a single digit. We need to let Discovery know that a single digit represents a given entity. We represent this entity as a python dictionary. The code for a single digit entity would look like this:

{
  "patterns": [[["NUM"]]]
}

Let's break this down.

There is only one key, patterns. This must contain an array of arrays of arrays. That is a lot of arrays, but there is a reason for this. Each entity can be composed of many patterns (i.e. the outer array). An individual pattern can be composed of multiple token slots (i.e. the middle array). Each token slot could be one of multiple tokens.

In this example, there is only one token in the token slot, only one token slot in the pattern, and only one pattern in the entity, so we have an array of one element inside an array of one element inside another array of one element.

To find a single digit, we use the built-in "NUM" token type. We put the string "NUM" into an array to let Discovery know that we are looking for an entity that is made of one token, and that token must be a number. Later examples will demonstrate more complex implementations of the patterns array.

Save this dictionary in a file named digit.py, like the following:

digit.py

ENTITY_DEFINITION = {
  "patterns": [[["NUM"]]]
}

Notice that we named the file digit.py and we named the dictionary ENTITY_DEFINITION. All entity definition files will contain a dictionary with the name ENTITY_DEFINITION.

Let's say we fed Discovery a transcription file of someone saying, "dial one eight".

The return value from Discovery will be a JSON object in the following format.

# results
{
  "intents": [
    {
      "entities": [
        {
          "label": "digit",
          "matches": [
            [
              {
                "endTimeSec": 0.0,
                "lattice_path": [[1, 0]],
                "probability": 1.0,
                "startTimeSec": 0.0,
                "value": "1"
              }
            ],
            [
              {
                "endTimeSec": 0.0,
                "lattice_path": [[2, 0]],
                "probability": 1.0,
                "startTimeSec": 0.0,
                "value": "8"
              }
            ]
          ]
        }
      ],
      "label": "room_dialing",
      "probability": 1.0
    }
  ],
  "interpreted_quote": {}
}

We can see that the value of the intents key is an array of objects. This is because Discovery matches the most probable intents and returns them to you in an array. For the discussion at hand, we will focus on this inner object:

{
  "label": "digit",
  "matches": [
    [
      {
        "endTimeSec": 0.0,
        "lattice_path": [[1, 0]],
        "probability": 1.0,
        "startTimeSec": 0.0,
        "value": "1"
      }
    ],
    [
      {
        "endTimeSec": 0.0,
        "lattice_path": [[2, 0]],
        "probability": 1.0,
        "startTimeSec": 0.0,
        "value": "8"
      }
    ]
  ]
}

We will call this object an entity object. This entity object is given a "label" that corresponds to the name of the file that the entity definition was stored in.

We also have a "matches" portion of our return. This value is an array because multiple instances of an entity can be detected in a single transcript; each instance is itself an array of candidate interpretations.

For each entity it finds, there can be multiple candidate interpretations. For example, if the speaker pronounced the word "one" in a way that could also have been heard as the word "nine," both values would be returned as objects, each with a probability score for how likely that interpretation is.

In such a case, the engine lets us know that the number "one" is the much more likely candidate. If you know the person is never going to say "one," however, having the alternate interpretation could be useful for handling a slight misspeak.

Adding Complexity

Now let's imagine we need to find an entity that is a combination of exactly one number and one letter. For example, we receive the command, "Call room five a." We can extract such a pattern with the following entity definition.

room_number.py

ENTITY_DEFINITION = {
  "patterns": [[["NUM"], ["LETTER"]]]
}

What if we realize that the room number can have two numeric digits? All we need to do is add a second pattern to our entity definition.

room_number.py

ENTITY_DEFINITION = {
  "patterns": [
    [["NUM"], ["LETTER"]],
    [["NUM"], ["NUM"], ["LETTER"]]
  ]
}

Now, Discovery will locate any entities in our spoken text that match the pattern of a number followed by a letter, or two numbers followed by a letter.

If we feed the transcript, "five a is calling thirteen c" to the instance of Discovery Engine that is defined above, we might see the following returned.

# results
{
  "intents": [
    {
      "entities": [
        {
          "label": "room_number",
          "matches": [
            [
              {
                "endTimeSec": 2.3,
                "lattice_path": [[0, 0], [1, 0]],
                "probability": 1.0,
                "startTimeSec": 1.0,
                "value": "5 A"
              }
            ],
            [
              {
                "endTimeSec": 5.9,
                "lattice_path": [[4, 0], [5, 0]],
                "probability": 1.0,
                "startTimeSec": 4.1,
                "value": "13 C"
              },
              {
                "endTimeSec": 5.9,
                "lattice_path": [[4, 0], [5, 0]],
                "probability": 0.32,
                "startTimeSec": 4.1,
                "value": "30 C"
              },
              {
                "endTimeSec": 5.9,
                "lattice_path": [[4, 0], [5, 0]],
                "probability": 0.12,
                "startTimeSec": 4.1,
                "value": "13 Z"
              }
            ]
          ]
        }
      ],
      "label": "calling_room",
      "probability": 1.0
    }
  ],
  "interpreted_quote": {}
}

We will be looking at the "matches" array for the "room_number" entity object. The speaker pronounced "five a" very clearly, so Discovery only returned one object with the transcription string for this room number. The speaker was not as clear in pronouncing "thirteen c," and so we are given other options for what they may have been intending to say.

Enabling Multiple Possible Tokens Per Slot

Sometimes it will be necessary to compose an entity definition where a given token could be either of two options. Look at the following example.

either_example.py

ENTITY_DEFINITION = {
  "patterns": [
    [["LETTER"], ["NUM", "LETTER"]],
  ]
}

Previously, each token was contained in its own array. In this example, we can see that the second token array in this pattern has two tokens in it: ["NUM", "LETTER"]. This signals to Discovery to look for patterns that start with a letter and are followed by either a number or a letter.

Creating Custom Tokens

Eventually you will want to extract entities that are made of things more complex than simple letters and numbers. Consider an interpreter that is looking for a time duration in a string. A transcript might involve someone saying, "I'll be there in one hour." The pattern we are looking for is going to start with one number, and then follow with one word that denotes the unit of time measurement.

We will need to create a custom token to represent a unit of time measure. We accomplish this by adding an "extraTokens" property to our entity object.

time.py

ENTITY_DEFINITION = {
  "extraTokens": ({
    "label": "TIME_UNIT",
    "values": ("day", "days", "hour", "hours", "month", "months", "week", "weeks", "year", "years")
  },),
  "patterns": [[["NUM"], ["TIME_UNIT"]]]
}

Notice that the "extraTokens" property is a tuple of objects.

Every custom token is an object containing the possible values for that token. The above example will look for any single digit that is immediately followed by one of the "TIME_UNIT" words that we defined. If we wanted to allow up to 3 digits in our search, our entity object would change to the following.

time.py

ENTITY_DEFINITION = {
  "extraTokens": ({
    "label": "TIME_UNIT",
    "values": ("day", "days", "hour", "hours", "month", "months", "week", "weeks", "year", "years")
  },),
  "patterns": [
    [["NUM"], ["TIME_UNIT"]],
    [["NUM"], ["NUM"], ["TIME_UNIT"]],
    [["NUM"], ["NUM"], ["NUM"], ["TIME_UNIT"]]
  ]
}
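Rather than writing each digit-count variation out by hand, the patterns list can be generated programmatically. This is a sketch using plain Python list operations; MAX_DIGITS is a hypothetical name, not a Discovery parameter:

```python
MAX_DIGITS = 3  # hypothetical limit on the number of leading NUM tokens

# Build one pattern per digit count, each ending in a TIME_UNIT slot.
TIME_PATTERNS = [[["NUM"]] * n + [["TIME_UNIT"]]
                 for n in range(1, MAX_DIGITS + 1)]

# Equivalent to the hand-written patterns above:
# [[["NUM"], ["TIME_UNIT"]],
#  [["NUM"], ["NUM"], ["TIME_UNIT"]],
#  [["NUM"], ["NUM"], ["NUM"], ["TIME_UNIT"]]]
```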

Custom Formatting

Let's look at a more robust example of making custom entities. We will be parsing an address. An address at its simplest is a number combined with a street name. There can also be directions and street types. Taking these factors into consideration, we will need three custom token definitions to find an address, in addition to the built-in "NUM" token.


DIRECTIONS = {
  "label": "DIR",
  "values": ("north", "south", "east", "west"),
}

STREET_NAMES = {
  "label": "STREETNAME",
  "values": (
    "abbott", "aberdeen", "academy", "access", "ada", "adams", "addison",
    "administration", "agatite", "airport", "albany", "albion", "aldine",
    "alexander", "algonquin", "allen", "allport", "alta", "altgeld",
    "anchor", "ancona", "anderson", "anson", "anthon", "anthony",
    "arbor", "arcade", "archer", "argyle", "armitage", "armon", "armour",
    "armstrong", "artesian", "arthington", "arthur", "asher", "ashland",
    "astor", "augusta", "austin", "avalon", "avers", "avondale", "baker",
    "wrightwood", "yale", "yates", "young"
  )
}

STREET_TYPES = {
  "label": "STREET",
  "values": (
    "alley", "avenue", "street", "boulevard", "way", "ave", "place",
    "highway", "lane", "drive", "route", "square", "road", "expressway"
  )
}

We then need to iterate over all the combinations of these tokens that can make an address.

ADDRESS_PATTERNS = []
for i in range(1, 6):
  for j in range(2):
    for k in range(1, 3):
      for l in range(2):
        pattern = [["NUM"] for _ in range(i)] + \
                  [["DIR"] for _ in range(j)] + \
                  [["STREETNAME", "NUM_ORD"] for _ in range(k)] + \
                  [["STREET"] for _ in range(l)]
        ADDRESS_PATTERNS.append(pattern)
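The nested loops produce every combination of one to five digits, an optional direction, one or two street-name words (a street name can also be an ordinal like "53rd"), and an optional street type, for 5 × 2 × 2 × 2 = 40 patterns. An equivalent formulation using itertools.product, a matter of taste rather than a Discovery requirement:

```python
from itertools import product

# Same 40 patterns as the nested loops, built from the Cartesian
# product of the four repetition counts.
ADDRESS_PATTERNS = [
    [["NUM"]] * i + [["DIR"]] * j
    + [["STREETNAME", "NUM_ORD"]] * k + [["STREET"]] * l
    for i, j, k, l in product(range(1, 6), range(2), range(1, 3), range(2))
]

print(len(ADDRESS_PATTERNS))  # 40
```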

When we get the matching addresses back, it would be nice if they were formatted in a human-readable way. To do that, we add "cleaning functions." Here are some useful functions for formatting an address.


def capitalize(wordList, spacer):
  """ Capitalize all words for Discovery output """
  if not isinstance(wordList, list):
    return wordList.capitalize()
  return spacer.join(word.capitalize() for word in wordList)


def format_ordinal(wordList, spacer):
  """ Add suffix to ordinals to make output more human-readable """
  from cleanText import text2int

  def add_suffix(n):
    return str(n) + 'tsnrhtdd'[n % 5 * (n % 100 ^ 15 > 4 > n % 10)::4]

  return add_suffix(int(text2int(wordList, spacer=spacer)))


def format_number(wordList, spacer):
  """ Overwrite number formatting to eliminate interstitial spacing"""
  from cleanText import text2int
  return text2int(wordList, spacer)
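The terse indexing expression in add_suffix is a known idiom for choosing between "st", "nd", "rd", and "th", handling the 11/12/13 exceptions. A standalone demonstration of the same logic, without the text2int dependency:

```python
def ordinal_suffix_demo(n):
    """Append the English ordinal suffix to an integer.
    The slice picks 'st'/'nd'/'rd' only for numbers ending in 1-3
    that are not in the 11-13 range; everything else gets 'th'."""
    return str(n) + 'tsnrhtdd'[n % 5 * (n % 100 ^ 15 > 4 > n % 10)::4]


print(ordinal_suffix_demo(1))   # 1st
print(ordinal_suffix_demo(2))   # 2nd
print(ordinal_suffix_demo(3))   # 3rd
print(ordinal_suffix_demo(11))  # 11th
print(ordinal_suffix_demo(21))  # 21st
```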

You will notice two of the above functions use a function called text2int. This is one of the included cleaning functions that you can use for formatting your output from Discovery.

For a complete list of the included cleaning functions, click here.

By default, some included tokens are automatically formatted with some of the built-in cleaning functions. For example, consecutive numbers will automatically be formatted to numerals. You can override this behavior by customizing the "cleaning functions."

To implement these functions, we pass them into our entity definition with the parameter "extraCleaning".

address.py


ENTITY_DEFINITION = {
  "patterns": ADDRESS_PATTERNS,
  "extraTokens": (STREET_NAMES, STREET_TYPES, DIRECTIONS),
  "extraCleaning": {
    "STREETNAME": capitalize,
    "STREET": capitalize,
    "DIR": capitalize,
    "NUM_ORD": format_ordinal,
  },
}

Custom Spacing of Consecutive Tokens

Another thing you may want to alter is how entities are formatted in regard to their spacing. For an address, we probably want any consecutive numbers displayed with no spaces between them ("1600" instead of "1 6 0 0"), but spaces between all of the other tokens. We achieve this by adding a "spacing" parameter to our ENTITY_DEFINITION dictionary, containing the following object.

address.py


ENTITY_DEFINITION = {
  "patterns": ADDRESS_PATTERNS,
  "extraTokens": (STREET_NAMES, STREET_TYPES, DIRECTIONS),
  "extraCleaning": {
    "STREETNAME": capitalize,
    "STREET": capitalize,
    "DIR": capitalize,
    "NUM_ORD": format_ordinal,
  },
  "spacing": {
    "NUM": "",
    "default": " ",
    "NUM_ORD": "",
  }
}

This will remove any spaces between consecutive numbers, but add spaces in between any other consecutive tokens in the entity pattern.
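A sketch of how per-token spacing might be applied when joining a matched pattern's values. This is illustrative only; Discovery applies these rules internally:

```python
SPACING = {"NUM": "", "NUM_ORD": "", "default": " "}


def join_with_spacing(tagged_values, spacing):
    """Join (token_label, word) pairs. Consecutive tokens sharing a
    label with an explicit rule use that rule; all other consecutive
    pairs are separated by the default spacer."""
    parts = []
    for i, (label, word) in enumerate(tagged_values):
        if i > 0:
            prev_label = tagged_values[i - 1][0]
            if prev_label == label and label in spacing:
                parts.append(spacing[label])
            else:
                parts.append(spacing["default"])
        parts.append(word)
    return "".join(parts)


address = [("NUM", "1"), ("NUM", "6"), ("NUM", "0"), ("NUM", "0"),
           ("DIR", "West"), ("STREETNAME", "Monroe"), ("STREET", "Street")]
print(join_with_spacing(address, SPACING))  # 1600 West Monroe Street
```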


Composite Entities

For more advanced use cases, Discovery allows users to create "composite entities." As the name implies, a "composite entity" is an entity that is made up of other entities that have been combined to become one new entity. The entities that are used to make up a "composite entity" are called "component entities."

Composite Entity Use Cases

You may wonder why we would want to combine entities together to form new entities. After all, an entity is already composed of smaller pieces called tokens. Why not just create a large entity out of many tokens?

Code Reuse

The first advantage of this method is that it allows the user to reuse code. Some professional dialects rely on specific key phrases that can be used in many different ways. Take, for example, dialects that use fractions of numbers frequently. A person in finance might discuss the cost of a product with the words, "Ninety nine and a half." They might then discuss an interest rate with the words, "Two and a quarter." The token combinations needed to locate a valid fraction in a transcription would have to be duplicated for a product entity and for a rate entity. Composite entities instead let the user create one fraction entity that can be combined with a product_base_number and a rate_base_number.

Increased Error Tolerance

Another advantage of composite entities is increased tolerance for bad recordings and unintended user utterances in the audio. If a user says, "55, um, West Monroe Street", this will cause a problem for the address interpreter discussed above, in the "Entities" portion of this documentation. The previously created address entity will not be detected, because it searches the audio for a group of numbers followed by a direction or street. The "um" is not among the tokens the pattern looks for, which disqualifies the audio from being categorized as a valid address.

Composite entities come with a spacing_threshold feature. The spacing_threshold quantifies how many extra syllables are allowed between component entities. This allows the user to specify how many extra utterances ("uh"s, "um"s, "hmms", etc.) can come between two pieces of an entity for Discovery to still recognize they are part of a composite entity. The extra utterances will be discarded, and only the valid composite entity will be reported back to the user.


Creating Composite Entities

The first thing you will need to add to your intent configuration file is a "composite_entities" property. This property will contain an array of objects.

intents.json

{
  "label": "address",
  "composite_entities": [
    {
      "label": "complete_address",
      "component_entity_patterns": [
        ["address_number", "street_name"],
        ["address_number", "direction", "street_name"],
        ["address_number", "direction", "street_name", "street_type"]
      ]
    }
  ],
  "examples": [
    "I live at {complete_address}"
  ]
}

In the example above, look at the new "composite_entities" property. It contains a single object. The two required properties for this composite entity object are "label" and "component_entity_patterns".

The "label" property will contain the label for our composite entity that will be used elsewhere in the intent definition file. In the above example you can see that "complete_address" is used in the examples property.

The "component_entity_patterns" property is an array of arrays. Each inner array lists the component entities that are combined to make the composite entity. The order in which the component entities appear matters: if Discovery finds entities in a transcript that correspond to the entities in a valid component entity pattern, but the sequence of the discovered entities does not match a listed pattern, then Discovery will not return any results.

The component entities used to create the component_entity_patterns must all have valid definition files. For the above example to work, the user would need an address_number.py, a street_name.py, a direction.py, and a street_type.py to be in the entities directory of the mounted volume for the container.

Adding A Spacing Threshold

Now, let's add in the aforementioned "spacing_threshold" to add tolerance for unclear speaking in the audio file.

{
  "label": "address",
  "composite_entities": [
    {
      "label": "complete_address",
      "spacing_threshold": 2,
      "component_entity_patterns": [
        ["address_number", "street_name"],
        ["address_number", "street_name", "street_type"],
        ["address_number", "street_name", "street_name", "street_name"],
        ["address_number", "direction", "street_name"],
        ["address_number", "direction", "street_name", "street_type"],
        ["address_number", "direction", "street_name", "street_name", "street_name"]
      ]
    }
  ],
  "examples": [
    "I live at {complete_address}"
  ]
}

In the above example, you can see that the complete_address composite entity now has a "spacing_threshold" of 2. This means that a two-syllable phrase can be uttered in the audio file between any of the detected entities, and Discovery will still recognize the result as a valid complete_address.

If an audio file contained a recording of someone saying, "The destination is 55 ... um ... West ... madis ... Monroe Street", Discovery would return the following JSON.


{
  "intents": [
    {
      "entities": [
        {
          "label": "complete_address",
          "matches": [
            [
              {
                "endTimeSec": 8.05,
                "lattice_path": [[4, 0], [5, 0], [7, 0], [9, 0], [10, 0]],
                "probability": 1.0,
                "startTimeSec": 2.16,
                "value": "55 West Monroe Street"
              }
            ]
          ]
        }
      ],
      "label": "address",
      "probability": 1.0
    }
  ],
  "interpreted_quote": {}
}

The "spacing_threshold" allowed Discovery to correctly parse the address and to remove the garbage words from the transcript as a match for the returned intent.

Adding Custom Spacing To Your Composite Entity

For formatting purposes, you can also control the character used to separate the component entities that make up the composite entity. This is done with the "spacing" property.

{
  "label": "us_phone_number",
  "composite_entities": [
    {
      "label": "formatted_phone_number",
      "component_entity_patterns": [
        ["three_digit_number", "three_digit_number", "four_digit_number"]
      ],
      "spacing_threshold": 1,
      "spacing": "-"
    }
  ],
  "examples": [
    "You can reach me at {formatted_phone_number}"
  ]
}

The example above is used to find telephone numbers in the United States. It would require that the user have a three_digit_number.py entity definition file and a four_digit_number.py entity definition file in the entities directory of the custom mounted folder in the Discovery container. Because the composite entity definition has a "spacing" property of "-", Discovery would return any found phone numbers in the format 555-456-7890.
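The final formatting step can be pictured as joining the matched component entity values with the configured spacing string. A minimal illustrative sketch, not Discovery's actual implementation:

```python
def join_components(component_values, spacing="-"):
    """Hypothetical sketch: combine matched component entity values
    using the composite entity's "spacing" string."""
    return spacing.join(component_values)


# The three component matches for a US phone number:
print(join_components(["555", "456", "7890"]))  # 555-456-7890
```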

Entity Cleaning

You also have the ability to specify a function that can be applied to an entire entity. This is accomplished with the "entityCleaning" parameter. The following example demonstrates a function to clean fractions into numbers.

fraction.py


ONE_THROUGH_SEVEN = {
  'label': 'ONE_THROUGH_SEVEN',
  'values': (
    'a',
    'an',
    'one',
    'two',
    'three',
    'four',
    'five',
    'six',
    'seven',
  ),
}

DENOMINATOR = {
  'label': 'DENOMINATOR',
  'values': (
    'quarter',
    'fourth',
    'half',
    'eighth',
    'eight',
  ),
}

FRACTION_PATTERNS = [
  [['ONE_THROUGH_SEVEN'], ['DENOMINATOR']],
]


def clean_fractions(transcript):
  replacements = [
    ("a half", ".5"),
    ("a quarter", ".25"),
    ("an eighth", ".125"),
    # etc
  ]
  for spoken, numeric in replacements:
    transcript = transcript.replace(spoken, numeric)

  return transcript


ENTITY_DEFINITION = {
  'patterns': [[['ONE_THROUGH_SEVEN'], ['DENOMINATOR']]],
  'extraTokens': (ONE_THROUGH_SEVEN, DENOMINATOR),
  'entityCleaning': clean_fractions,
}
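The replacement logic can be exercised on its own. A standalone sketch with a shortened replacement list:

```python
# Standalone demo of string-replacement cleaning for spoken fractions.
REPLACEMENTS = [
    ("a half", ".5"),
    ("a quarter", ".25"),
    ("an eighth", ".125"),
]


def clean_fractions_demo(transcript):
    """Rewrite spoken fraction phrases as numeric suffixes."""
    for spoken, numeric in REPLACEMENTS:
        transcript = transcript.replace(spoken, numeric)
    return transcript


print(clean_fractions_demo("ninety nine and a half"))   # ninety nine and .5
print(clean_fractions_demo("two and a quarter"))        # two and .25
```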

Entity Validation

Discovery provides the ability to validate entities before they are returned. The validation function is passed into an entity definition via the "entityValidation" parameter. For example, if we want a number to be in a certain range, we can supply a function that returns a Boolean to validate it.

def valid_range(number):
  return 5 < number <= 40


ENTITY_DEFINITION = {
  'patterns': [[['NUM']], [['NUM'], ['NUM']]],
  'entityValidation': valid_range,
}
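The validator simply gates which candidate values survive. A sketch, assuming Discovery passes the cleaned numeric value to the function:

```python
def valid_range(number):
    """Accept only numbers greater than 5 and at most 40."""
    return 5 < number <= 40


# Hypothetical candidate values for a two-digit NUM entity:
candidates = [3, 12, 40, 55]
print([n for n in candidates if valid_range(n)])  # [12, 40]
```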

Multiword Phrases

Sometimes you will be looking for tokens that are multiple words or abbreviations. The way to make Discovery locate these is to use "collapsiblePatterns", which take words that will be transcribed as multiple words and combine them into one. For example, the transcription engine will hear "BBC" as "b b c". To get Discovery to accurately identify the station, we use "collapsiblePatterns" as in the following example.

STATION_NAME = {
  'label': 'station_name',
  'values': ('CNN', 'NBC', 'BBC')
}


STATION_ABBREVIATIONS = (
  ('c n n', 'CNN'),
  ('n b c', 'NBC'),
  ('b b c', 'BBC'),
  # etc
)

ENTITY_DEFINITION = {
  'patterns': [
     [['STATION_NAME']],
  ],
  'extraTokens': (STATION_NAME,),
  'collapsiblePatterns': STATION_ABBREVIATIONS,
}
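The collapsing step can be pictured as a set of string replacements applied to the transcript before tokenization. An illustrative sketch, not Discovery's internals:

```python
STATION_ABBREVIATIONS = (
    ('c n n', 'CNN'),
    ('n b c', 'NBC'),
    ('b b c', 'BBC'),
)


def collapse(transcript, collapsible_patterns):
    """Rewrite multi-word spellings into single collapsed tokens."""
    for spoken, collapsed in collapsible_patterns:
        transcript = transcript.replace(spoken, collapsed)
    return transcript


print(collapse("please turn on b b c one", STATION_ABBREVIATIONS))
# please turn on BBC one
```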

Using Ontological Tags from NLTK

Custom entities can use ontological tags referring to the part of speech of a word. These part of speech tags are the same as those used by the Penn Treebank. For instance, a noun might be "NN" if it is a simple, singular noun like "dog" or "cat." This can be used to find certain patterns in spoken text and extract them, as shown below for simple sentences.

Entity for a basic sentence

noun_list = ["PRP", "WP", "NN", "NNS", "NNP", "NNPS"]
verb_list = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]
modifier_list = ["JJ", "JJR", "JJS", "PRP$", "PDT", "DT"]
adverb_list = ["RB", "RBR", "RBS"]
preposition_list = ["IN"]

SENTENCE_PATTERNS = []
# he ate pizza
SENTENCE_PATTERNS.append([noun_list, verb_list, noun_list])
# he ate good pizza
SENTENCE_PATTERNS.append([noun_list, verb_list, modifier_list, noun_list])
# he ate his good pizza
SENTENCE_PATTERNS.append([noun_list, verb_list, modifier_list, modifier_list, noun_list])
# he ate my very good pizza
SENTENCE_PATTERNS.append([noun_list, verb_list, modifier_list, adverb_list, modifier_list, noun_list])

ENTITY_DEFINITION = {
  'patterns': SENTENCE_PATTERNS,
  'spacing': {
    'default': ' ',
  },
  'ontological_tags': True
}

Note that if ontological_tags is set to True, no other tokenization is applied.
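With ontological tags enabled, each word carries a Penn Treebank part-of-speech tag, and matching works the same way as with ordinary tokens. A self-contained sketch using hand-tagged words (the tagging itself is done by Discovery, not by this code):

```python
# Slots accept any tag from the corresponding tag list.
noun_list = ["PRP", "WP", "NN", "NNS", "NNP", "NNPS"]
verb_list = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]
modifier_list = ["JJ", "JJR", "JJS", "PRP$", "PDT", "DT"]

# "he ate my pizza", tagged by hand for illustration.
tagged = [("he", "PRP"), ("ate", "VBD"), ("my", "PRP$"), ("pizza", "NN")]

pattern = [noun_list, verb_list, modifier_list, noun_list]
matched = len(tagged) == len(pattern) and all(
    tag in slot for (_, tag), slot in zip(tagged, pattern)
)
print(matched)  # True
```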

Using this on a simple transcript file of {"transcript": "he ate my pizza"}, we receive the following JSON object out of Discovery.

{
  "intents": [
    {
      "entities": [
        {
          "label": "basic_sentence",
          "matches": [
            [
              {
                "end_time": "0.00",
                "probability": "1.00",
                "start_time": "0.00",
                "value": "he ate my pizza"
              }
            ]
          ]
        }
      ],
      "label": "basic_sentence",
      "probability": 1.0
    }
  ],
  "transcript": "he ate my pizza"
}

Last thoughts

After you create a custom entity, you will want to save it in the entities folder in the volume that you mount to Discovery. All of your entities must be present in the entities folder when you first run your container for them to be loaded into Discovery.

You can read more about how to extend Discovery here.