What we know about the user today
Voice interfaces open up a new field of user interfaces, but they do not have the best reputation. I often hear the following two reasons:
The interactive voice response (IVR) systems we are used to when calling a hotline provide an annoying experience. Everyone remembers the good old times: “Press 1 if you are calling about your last bill, press 2 if there is a connection error, press 3 if you have questions about our products, press 4 to activate a SIM card, … and so on.” That is not how we have conversations.
The second aspect is that voice recognition has often struggled to understand users. Words were misrecognized, and the systems simply failed to interpret the user’s intent correctly.
Artificial intelligence, along with improved microphones and noise cancellation, has dramatically improved how these problems are tackled. Spoken dialogs with machines are now possible and actually quite pleasant. Mainstream devices like the Amazon Echo bring these possibilities to everybody.
Amazon was the first to open its Echo ecosystem to developers. Just like Apple did with the App Store, Amazon is creating a platform with its skill store. Sources like Business Insider call the Echo the next billion-dollar device, and Jeff Bezos says it could become the fourth pillar of Amazon.
The ecosystem around the Echo is growing rapidly, and more than 1,000 skills are available now.
A few months ago, when we were building a skill ourselves, we realized it is hard to predict what users are actually going to say. It was us and our families who were testing the skill, and we got great results. Some might think that this is the right moment to stop testing and launch the skill, but we were sure that this was just a fraction of what end users would actually be saying.
How to find out what users are actually saying to Alexa
We were trying to figure out how to get this information. Caution, the next paragraph might get a little technical: we were thinking about adding a lot of intents with different utterances, or an intent with a catch-all slot (AMAZON.LITERAL) to capture everything the users said. But these approaches don’t work out very well: the recognition quality for the LITERAL slot is not high enough.
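To give a rough idea of the catch-all approach (intent and slot names here are hypothetical, and this reflects the legacy Alexa Skills Kit interaction model of that era), such an intent declared a single slot of type AMAZON.LITERAL:

```json
{
  "intents": [
    {
      "intent": "CatchAllIntent",
      "slots": [
        { "name": "Anything", "type": "AMAZON.LITERAL" }
      ]
    }
  ]
}
```

The LITERAL type also required example phrases embedded in the sample utterances, roughly in the form `CatchAllIntent {please quote my portfolio|Anything}`, and recognition quality dropped sharply for anything that strayed far from those examples.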
Unfortunately, Amazon does not give you access to recordings of how users interacted with your skill. So we knew we had to find a way around it. We decided to crowdsource this information.
Alexa knows that different people say different things.
We tried to figure out how to improve that process, and we built a tool that lets you generate utterances from real people. You give us an example sentence like “Please quote my portfolio”, and we ask the crowd for alternatives they can think of. You can then take these utterances back into the skill as triggers for the correct intent.
The more utterances you provide for your Alexa intents, the better Alexa’s voice recognition becomes. And you would not believe how differently users interact with your skill. Can you think of 25 ways to greet someone, or 99 ways to say “I love you”?
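As a sketch of what feeding crowdsourced alternatives back into a skill looks like (the intent name and phrasings are illustrative, in the line-based sample-utterances format the Alexa Skills Kit used at the time), each alternative simply becomes another line mapped to the same intent:

```text
QuotePortfolioIntent please quote my portfolio
QuotePortfolioIntent what is my portfolio worth
QuotePortfolioIntent how are my stocks doing
QuotePortfolioIntent give me an update on my investments
```

Every additional line gives Alexa one more pattern to match against, which is why a broad, crowd-generated set of phrasings beats the handful a small team can invent on its own.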
What is on our roadmap?
We are considering expanding this service into a full-blown voice interface testing tool. We want to help you expose your skill to real people and collect their full responses, not the trimmed version you get from Amazon.
What do you think? Would that be helpful for you?