How can slot values returned by the speech recognition be matched to a predefined list of values based on phonetic similarity?

Our solution (Executive Summary):

Both the slot values and the predefined list are converted into a phonetic representation by an algorithm. For English we use “Double Metaphone”; for German, the “Kölner Phonetik” (Cologne phonetics) is recommended. Within this phonetic representation, the most similar-sounding word is identified using the Jaro-Winkler distance and then selected.

Introduction:

With their start-up Future of Voice, Tilmann Böhme, Stefan Oswald and Malte Kosub have been developing skills for Alexa since the end of 2015. In 2016, Tilmann Böhme was awarded the title Alexa Champion for his activities. Here he introduces one of the problems that came up during skill development, along with its solution.

Problem:

One of our skills, named “Teach Funlexa”, is currently enjoying great popularity in the American and English skill stores. With this skill, a user can teach Alexa his own predetermined answers, for instance secrets about friends, which Alexa can then read out loud. One simply asks Alexa: “Alexa, tell everyone around a secret about {{friendname}}”. The technical problem is that Alexa's voice recognition returns slot values to the skill that are not among the defined target values.

Here is an example of the problem: for the slot “friendname”, we chose the type AMAZON.US_FIRST_NAME. Because users register arbitrary names through a web interface, these names cannot be part of the speech model as a custom slot. This means that the name Stefan is not included in the American list of names and will therefore not be recognized by the automatic speech recognition. For this reason, we developed a feature that compares the incoming slot value with the set of first names the user has provided and picks the most similar one.

String distance as the solution?

To map Stephen to Stefan, a similarity measure between strings is required. For this we use the Jaro-Winkler distance, a string metric for measuring the edit distance between two character strings.

The Jaro-Winkler value lies between 0 and 1 and expresses how closely the two strings match (1 means a strong match, 0 means little or no match).

If the list contains the entry Steph in addition to Stefan, the recognized value Stephen will be mapped to Steph instead of Stefan. The reason is that Stephen and Steph are more similar as strings than Stephen and Stefan. In general: if the Jaro-Winkler distance is computed directly between the recognized input string and the stored first names, errors in the voice recognition have a strong effect on the distance, because similar-sounding words can be spelled in several ways.
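The effect can be reproduced with a few lines of Python. This is a minimal sketch, not the code of the skill itself: `jaro_winkler` is a hand-written version of the standard formula (prefix scaling 0.1, common prefix capped at four characters), and the candidate list is a hypothetical example.

```python
def jaro_winkler(s1: str, s2: str, scaling: float = 0.1) -> float:
    """Jaro-Winkler similarity: 1.0 for identical strings, 0.0 for no overlap."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters only count as matching within this distance of each other.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    m1 = [c for c, u in zip(s1, used1) if u]
    m2 = [c for c, u in zip(s2, used2) if u]
    transpositions = sum(a != b for a, b in zip(m1, m2)) // 2
    jaro = (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3
    # Winkler bonus for a shared prefix of up to four characters.
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * scaling * (1 - jaro)


heard = "Stephen"                # value returned by the speech recognition
candidates = ["Steph", "Stefan"]  # hypothetical names stored by the user

for name in candidates:
    print(f"{heard} vs {name}: {jaro_winkler(heard, name):.3f}")
# Stephen vs Steph: 0.943
# Stephen vs Stefan: 0.822

best = max(candidates, key=lambda name: jaro_winkler(heard, name))
print(best)  # Steph, although the user meant Stefan
```

On the raw spelling, Steph wins clearly, which is exactly the mismatch described above.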

Our solution: Distance between phonetic approximations

To keep the effects of recognition errors small, we convert the words being compared into a phonetic form before measuring the distance. The appropriate algorithm depends on the language: for English, we use the Double Metaphone algorithm; for German, the Cologne phonetics (Kölner Phonetik) algorithm is the right choice. These algorithms convert written words into an approximate, simplified phonetic form, in which the influence of recognition errors is significantly reduced. We therefore calculate the Jaro-Winkler distance in this phonetic space to map slot values to the target values, which yields the most similar-sounding word from the target list.

For our previous example, this has the following effect: with the Double Metaphone transformation, Stefan becomes [“STFN”, “STFN”], exactly the same as Stephen. Steph, by contrast, becomes [“STF”, “STF”]. Because the strings for Stefan and Stephen are identical in this phonetic space, the value returned by the ASR, Stephen, is correctly mapped to Stefan.
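The following Python sketch illustrates matching in the phonetic space. It rests on several assumptions: Double Metaphone is too long to reproduce here, so the sketch uses a compact, simplified rendering of the Kölner Phonetik rules instead (for an English skill, a Double Metaphone implementation would be plugged in at the same place). The function names `cologne_phonetics`, `jaro_winkler` and `best_phonetic_match` are illustrative and not taken from the skill's code, and the block carries its own small Jaro-Winkler implementation so that it runs standalone.

```python
from typing import Iterable


def cologne_phonetics(word: str) -> str:
    """Simplified Kölner Phonetik: encodes a word as a string of digits 0-8."""
    word = word.upper().replace("Ä", "A").replace("Ö", "O").replace("Ü", "U")
    word = "".join(ch for ch in word if "A" <= ch <= "Z")
    fixed = {"B": "1", "F": "3", "V": "3", "W": "3", "G": "4", "K": "4",
             "Q": "4", "L": "5", "M": "6", "N": "6", "R": "7", "S": "8", "Z": "8"}
    raw = []
    for i, ch in enumerate(word):
        prev = word[i - 1] if i > 0 else ""
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in fixed:
            raw.append(fixed[ch])
        elif ch in {"A", "E", "I", "J", "O", "U", "Y"}:
            raw.append("0")
        elif ch == "P":
            raw.append("3" if nxt == "H" else "1")
        elif ch in {"D", "T"}:
            raw.append("8" if nxt in {"C", "S", "Z"} else "2")
        elif ch == "C":
            before = set("AHKLOQRUX") if i == 0 else set("AHKOQUX")
            hard = nxt in before and prev not in {"S", "Z"}
            raw.append("4" if hard else "8")
        elif ch == "X":
            raw.append("8" if prev in {"C", "K", "Q"} else "48")
        # "H" carries no code and is skipped.
    digits = "".join(raw)
    # Collapse runs of the same digit, then drop every "0" except a leading one.
    collapsed = "".join(d for i, d in enumerate(digits)
                        if i == 0 or d != digits[i - 1])
    return collapsed[:1] + collapsed[1:].replace("0", "")


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1) -> float:
    """Jaro-Winkler similarity: 1.0 for identical strings, 0.0 for no overlap."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1, used2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    m1 = [c for c, u in zip(s1, used1) if u]
    m2 = [c for c, u in zip(s2, used2) if u]
    transpositions = sum(a != b for a, b in zip(m1, m2)) // 2
    jaro = (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * scaling * (1 - jaro)


def best_phonetic_match(heard: str, candidates: Iterable[str]) -> str:
    """Return the candidate whose phonetic code is closest to that of `heard`.

    Raises ValueError if `candidates` is empty.
    """
    heard_code = cologne_phonetics(heard)
    return max(candidates,
               key=lambda name: jaro_winkler(heard_code, cologne_phonetics(name)))


for name in ("Stefan", "Stephen", "Steph"):
    print(name, cologne_phonetics(name))
# Stefan 8236
# Stephen 8236
# Steph 823

print(best_phonetic_match("Stephen", ["Steph", "Stefan"]))  # Stefan
```

As with the Double Metaphone codes quoted above, Stefan and Stephen collapse to the same code while Steph gets a shorter one, so the comparison in the phonetic space picks Stefan.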

With this trick we were able to increase the recognition rate for names and fix several small nuisances that users had.

If you have a question, feel free to contact me on Twitter @tilmannb
