“Alexa: My Dog’s Pause Need Washing” — Why Homophones Matter
Building for voice is great, and as the technology evolves, the experience is getting better by the day! The potential voice interfaces have to revolutionise certain sectors is a really exciting prospect, one that many of us early adopters and developers are eagerly waiting to see take off.
There is, however, an elephant in the room: while voice tech is far better at understanding speech than most of us could have imagined five years ago, it is still not great at understanding homophones, and when two appear together it struggles with context. For a truly freeform conversation, in which you the user can say as much or as little as you like while an AI backend analyses what you have said, adds some sentiment analysis, and returns something that feels like a perfectly natural exchange, there needs to be a better understanding of homophones.
Whilst this is a mild irritation, it does not hinder the majority of skills available or being produced as we speak. But it is fun to play with. To understand homophones, it is first important to understand the differences between similar-sounding and similar-looking words.
Homophones are words that sound the same, have different meanings, and are spelled differently. This is the type of word that causes the most trouble when capturing freeform text. Examples include: which/witch, cell/sell, boar/bore, toad/toed, Neil/kneel.
Homographs are words that are spelled the same, sound different, and have different meanings. These are an issue for the response: they can be captured accurately, but the AI may miss the context. Examples include: bow (of a ship)/bow (tie), refuse (rubbish)/refuse (decline), row (the boat)/row (argue).
Homonyms are words that are spelled the same, sound the same, but have different meanings. As with homographs, these can really mess up the context of a sentence if not understood correctly. Examples include: bear (animal)/bear (put up with), bark (of a tree)/bark (of a dog), rose (get up)/rose (flower), spring (season)/spring (bouncy metal coil).
All of the above are particularly tricky when designing for voice interfaces, especially with freeform text. When tested with a simple ‘display what I have just said’ Skill, the logs show exactly what Alexa, in this case, thought she heard.
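An echo Skill like this needs only a few lines. The sketch below is illustrative rather than the exact Skill used for these tests: the intent name "EchoIntent" and slot name "utterance" are hypothetical, and it assumes an AMAZON.SearchQuery slot is used to capture the freeform speech. It shows the general shape of a Lambda-style handler that repeats back whatever transcript Alexa produced:

```python
# Minimal sketch of an "echo what I heard" Alexa skill handler.
# Assumptions (hypothetical, not from the original article): a custom intent
# "EchoIntent" with a single AMAZON.SearchQuery slot named "utterance".
def handler(event, context=None):
    request = event.get("request", {})
    if request.get("type") == "IntentRequest":
        slots = request.get("intent", {}).get("slots", {})
        heard = slots.get("utterance", {}).get("value", "")
        speech = f"I heard: {heard}"
    else:
        speech = "Say something and I will repeat what I heard."
    # Standard Alexa response envelope: speaking the raw transcript back
    # makes homophone mix-ups ("paws" vs "pause") show up immediately.
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": True,
        },
    }

# Example event, shaped like the JSON Alexa POSTs to the skill endpoint
event = {
    "request": {
        "type": "IntentRequest",
        "intent": {"slots": {"utterance": {"value": "my dog's pause need washing"}}},
    }
}
print(handler(event)["response"]["outputSpeech"]["text"])
# prints: I heard: my dog's pause need washing
```

Because the handler simply reflects the transcript, the CloudWatch logs (or the spoken reply) reveal which homophone the speech recogniser actually chose.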
“Neil Armstrong kneels on the moon.” An interesting example: Alexa appears to have picked up the lunar context and stayed with it throughout the phrase.
“My dog’s paws need washing.” Unlike the previous example, this has not taken the context of the dog into account, and Alexa decided on the other meaning.
“I like to rap while wrapping presents.” Here the context has stayed with the art of rapping rather than being adjusted by the mention of presents.
“The Marshall knew martial arts.” This phrase goes wrong in several places: Alexa understood the difference between ‘marshal’ and ‘martial’, but produced ‘new’ instead of ‘knew’. My original intention for ‘marshal’ was the kind who polices an aircraft; Alexa instead understood it as the name Marshall.
I don’t want to imply that the system is fundamentally flawed; it is not. It is the nuances of the English language that cause these issues, and I have deliberately chosen examples that highlight them. There are many occasions where Alexa gets it right.
“I would like a pair of pears.” Something you can imagine saying when doing your shopping via a voice device.
“I tried in vain to find a vein.” Freeform text that I thought would end up in the category above, and something that may one day be useful for doctors or nurses dictating to a device. However, Alexa seems to have understood the context here better than in the examples above.
“I told my son to go out in the sun.” The context has again been understood, and therefore my son will be going out in the sunshine.
Homophones can present a challenge when attempting to understand freeform text, and they are one of the key reasons account linking is the way forward if you want to accurately capture someone’s name or email address. Because of the way the English language works, and the sounds of its letters, spelling something out is impossible for one of these devices. Trust me, we’ve tried: homophones strike again! However, this technology is adapting every day, new features are being released, and the machine learning behind the platform is developing a better awareness of context. The key for us as developers is to keep this in mind when designing our conversations and when considering the actionable word values.
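One defensive pattern when thinking about actionable word values is to normalise known homophone variants to a single canonical word before matching against your skill's logic. The sketch below is purely illustrative: the variant table and `normalise` helper are hypothetical names, and collapsing ‘pause’ into ‘paws’ only makes sense in a domain (say, a pet-grooming skill) where one reading is overwhelmingly more likely:

```python
# Hypothetical homophone-normalisation table for a skill whose domain makes
# one reading of each pair far more likely. Maps every spelling the speech
# engine might produce onto the single "actionable word" the skill acts on.
HOMOPHONE_VARIANTS = {
    "pause": "paws",
    "paws": "paws",
    "wrapping": "wrapping",
    "rapping": "wrapping",
    "pear": "pair",
    "pair": "pair",
}

def normalise(transcript: str) -> str:
    """Replace known homophone variants with their canonical form."""
    words = transcript.lower().split()
    return " ".join(HOMOPHONE_VARIANTS.get(w, w) for w in words)

print(normalise("my dog's pause need washing"))
# prints: my dog's paws need washing
```

The same effect can often be achieved declaratively by listing the variant spellings as slot-value synonyms in the interaction model, but an explicit table keeps the design decision, and its domain assumption, visible in code.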
So, if Hal refuses to open the pod bay doors, it is probably because he assumes we are talking about peas.
It is only a matter of time before I can tell Alexa that my dog’s paws need washing, or that my interests include rapping while wrapping presents.
Veni Loqui are voice design specialists formed to create bespoke solutions for Alexa and Google Home. We are working to leverage Voice in the Health and Social Care sectors. For more information contact info@veni-loqui.com