Human language is a wonderful thing. Late last year, a team in Microsoft's Artificial Intelligence and Research division reported a speech recognition system that made the same or fewer errors than humans transcribing the same conversation. This is an impressive and thought-provoking achievement.
The company announced this on its blog, with the headline "Historic Achievement: Microsoft researchers reach parity in conversational human speech recognition", which seemed, at first glance, to be more of a disparaging comment on Microsoft's hiring policy.
One can't help wondering how long this breakthrough will take to filter down to consumer use and application. There are a number of well-documented shortcomings with the current crop of voice-controlled "virtual assistants".
"Sorry, I didn't get that…"
From Apple's Newton through to auto-correct fails, the difficulties machines have interpreting human communication have provided fodder for humour and derision. Maybe for good reason. It's possible that our ability to communicate is the last bastion we have in a world where we may be starting to feel more and more surplus to requirements.
As the race towards artificial intelligence gathers pace, it is striking to what extent learning to parse speech correctly and then respond appropriately turns out to be one of the hardest human intellectual achievements to emulate.
It is perhaps significant that the first time a robot passed one of the Turing tests – a phrase now used to describe a range of interrogations conducted to determine whether a machine counts as "intelligent" – it involved giving the robot the personality of a 13-year-old to provide a more plausible explanation for mistakes and misspelling.
"Sorry, I missed that…"
So, why is Automatic Speech Recognition (ASR) so challenging?
ASR tries to identify a sound and associate it with a defined unit – often a word. Essentially, the problem with human speech is the huge amount of variation, vocal and environmental, that occurs.
The sound is assigned to represent the word that has the highest probability value. This approach is workable for small vocabularies, but as they grow, it becomes less feasible to cater for all the possibilities.
There is also the task of finding enough data to adequately learn all the possibilities. And once you add in things such as background noise and echoes, things get really tricky.
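The "highest probability wins" idea above can be sketched in a few lines. The words and scores here are invented for illustration – a real ASR system derives these probabilities from trained acoustic and language models rather than a hand-written table:

```python
# Toy sketch of probability-based word selection, as described above.
# The scores are made up for illustration only.

def recognise(acoustic_scores):
    """Pick the vocabulary word with the highest probability score."""
    return max(acoustic_scores, key=acoustic_scores.get)

# Hypothetical scores for one chunk of audio:
scores = {"speech": 0.62, "beach": 0.25, "peach": 0.13}
print(recognise(scores))  # -> speech
```

The catch the article points to is visible even here: the table must contain an entry for every word the system might hear, and every entry needs enough training data behind it – which is what stops this simple scheme scaling to large vocabularies.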
"I'm not sure what you said there…"
Phonetic analysis – breaking words down into smaller units (there are roughly 50 phonemes needed to produce any English word, for instance) and then basing understanding on combination and context – has its own challenges, many of them related to differentiation and overlap. The example of "Let's recognise speech" and "Let's wreck a nice beach" illustrates the challenge nicely.
This combination and context approach may also explain why an eight-year-old has no problem when I go all Yoda on him while the most advanced virtual assistants seem stymied. That hard for humans, seems to be, it does not.
"I didn't quite get that… "
All of which makes me think a couple of things. Firstly, people's linguistic accomplishments are actually quite astonishing. The fact that we all do it only enhances the achievement and calling it instinct – to paraphrase Douglas Adams – simply labels the phenomenon but doesn't explain it.
Secondly, the fact that we have effective speech recognition integrated into a hand-held device, even just for basic tasks, is pretty extraordinary – and it will surely only improve. Maybe our expectations to date have simply been too high.