How do I apply Precision and Recall to Google Home?

Target labels are needed for wake words: The foundation for all virtual assistants is a large body of labeled audio data that teaches them when their name (Siri, Google, Alexa, Mercedes) is called, so they wake up and respond to your question.

What are wake words?

Have you ever called your voice assistant, whether it’s Google Home or Alexa, only to be ignored? That happens to me sometimes when I’m at home with my kids talking loudly at the dining table and my “Hey Google!” gets ignored. It also happens when I’m driving: despite having my Android device plugged in to the car microphone, I’m ignored. It also happens the other way around, where Alexa wakes up uncalled and a solid blue light appears on the device while I’m mentioning salmonella or some other word that rhymes with Alexa.

A wake word, also known as a trigger word or a hotword, is a special word or sequence of words that “wakes” up or activates a voice assistant to start listening to the rest of the sentence. It’s essentially like being at a boring meeting where you ignore whatever people are saying until your name is mentioned. And that’s when your ears perk up to listen to the comment about you. Examples of wake words for voice assistants are ‘Alexa’, ‘Hey Google’ and ‘Hey Mercedes’.

While you may always respond when your name is called, voice assistants may not always respond when they are called. Why is that? Because humans are intelligent enough to recognize their names when called by speakers of different accents, genders and ages, at different noise levels and in different contexts.

Unfortunately, voice assistants need to be trained for all those different scenarios. For example, a voice assistant may be very good at waking up when someone calls it in a quiet environment. But in a situation where I’m driving, it gets confused by the noisy street sounds (especially when I have my window open while driving) and ignores my multiple requests to wake up. So how do you train a voice assistant? By providing hundreds of thousands of audio clips of words in all these different scenarios and paying human annotators to label or annotate these audio clips with a text description that states ‘yes’ for wake words and ‘no’ for non-wake words.

Labels are also known as target labels, and they’re just text descriptions that a deep learning model uses to learn whether a wake word was mentioned or not. After being trained on hundreds of thousands (if not millions) of positive and negative examples of wake words, the model will generalize to new instances. This is essentially a binary classification problem where your deep learning model is trying to identify if a wake word was mentioned or not.
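To connect these labels to precision and recall, here is a minimal sketch. It assumes the annotators' 'yes'/'no' labels and the model's guesses are available as two plain Python lists; the function name `precision_recall` and the sample data are made up for illustration and do not come from any real voice assistant. A false positive is the assistant waking up uncalled (the salmonella case), and a false negative is a genuine “Hey Google!” that gets ignored.

```python
def precision_recall(labels, predictions):
    """Compute precision and recall for 'yes'/'no' wake word labels.

    labels      -- what the human annotators wrote for each clip
    predictions -- what the model guessed for the same clips
    """
    pairs = list(zip(labels, predictions))
    # True positive: a real wake word, and the model woke up.
    tp = sum(1 for y, p in pairs if y == "yes" and p == "yes")
    # False positive: no wake word, but the model woke up anyway.
    fp = sum(1 for y, p in pairs if y == "no" and p == "yes")
    # False negative: a real wake word that the model ignored.
    fn = sum(1 for y, p in pairs if y == "yes" and p == "no")

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: eight clips, four of which contain the wake word.
labels      = ["yes", "yes", "yes", "yes", "no",  "no",  "no", "no"]
predictions = ["yes", "yes", "yes", "no",  "yes", "yes", "no", "no"]

precision, recall = precision_recall(labels, predictions)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy data the model woke up five times but only three of those were real calls, so precision is 3/5. It caught three of the four real calls, so recall is 3/4.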

What are examples of wake words?

As you go through some of the popular voice assistants, you’ll notice an interesting thing. There are not a lot of wake words for voice assistants. For example, here’s a list of voice assistants and their wake words.

| Voice Assistant | Wake Words |
| Alexa | Alexa, Echo, Amazon, Computer |
| Google Home | Ok Google, Hey Google |
| Mercedes | Hey Mercedes |
| Siri | Hey Siri |

In case you’re wondering, “Computer”, as a wake word, was added to Alexa as a nod to the TV series Star Trek. If you’ve ever seen the show, you’ll see people address their spaceship’s computer as “Computer”. There are other references to wake words in pop culture, such as in Iron Man, where the chief protagonist uses the trigger words “Hey Jarvis!” in a request like “Hey Jarvis! Locate Spiderman and Thor.” Or if you watched the decades-old 2001: A Space Odyssey, you’ll hear the trigger word in the famous request “Open the pod bay doors, HAL.”

Of course, they put the trigger word in the wrong place. In that movie, the wake word was at the end of the sentence rather than at the beginning. Why should it be at the beginning of a sentence and not the end? Because a voice assistant’s onboard memory only holds a 3-second audio buffer, which is just enough to understand a short phrase but not an entire sentence. Oh well. That was just a movie.
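Here is a rough sketch of why a short buffer forces the wake word to come first. It models the 3-second memory as a fixed-size rolling buffer that silently drops the oldest audio as new audio arrives. The 16 kHz sample rate, the class name `RollingAudioBuffer` and the fake integer “samples” are all assumptions for illustration, not details of any real device.

```python
from collections import deque

SAMPLE_RATE_HZ = 16_000  # assumption: a common rate for speech audio
BUFFER_SECONDS = 3       # the buffer length mentioned in the article


class RollingAudioBuffer:
    """Keeps only the most recent few seconds of audio samples."""

    def __init__(self, seconds=BUFFER_SECONDS, sample_rate=SAMPLE_RATE_HZ):
        # A deque with maxlen discards the oldest items once it is full.
        self._samples = deque(maxlen=seconds * sample_rate)

    def push(self, chunk):
        """Append newly recorded samples, evicting the oldest if needed."""
        self._samples.extend(chunk)

    def snapshot(self):
        """Return the audio currently held, oldest sample first."""
        return list(self._samples)


# Simulate a 5-second sentence using the sample index as a fake sample value.
buffer = RollingAudioBuffer()
five_seconds = range(5 * SAMPLE_RATE_HZ)
buffer.push(five_seconds)

held = buffer.snapshot()
print(len(held) / SAMPLE_RATE_HZ, "seconds held")
print("oldest sample still in memory:", held[0])
```

After five seconds of speech, the first two seconds are already gone. If the wake word only arrives at the end of a long sentence, the start of the request is no longer in memory for the assistant to act on.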

Why aren’t there more wake words?

You’ll notice an interesting thing about all these voice assistants. The most wake words any of them has is four, and that is Alexa. Why is that? That’s because it takes a lot of work to create more wake words. Let’s pretend you wanted to add the wake word Jarvis to Google Home. What would that take?

Well, you’ll need four groups of people to get the quality high enough that the assistant neither wakes up prematurely nor completely ignores your call:

  1. annotators to label hundreds of thousands of examples of wake words and millions of non-wake words
  2. data scientists to build the deep learning model
  3. software engineers to deploy the model into a production environment
  4. TPMs (technical program managers) like yours truly to coordinate the entire process

Yes, I know that on YouTube, you’ll find lots of people showing how to create new trigger words in less than an hour. Notice, though, that their precision and recall are only around 89%, and that is in a well-controlled, quiet environment. Once you deal with hundreds of millions of people in different environments, accents, contexts, ages and genders, that performance degrades considerably. For a hobby scenario, 89% is acceptable. For a consumer product, it has to be over 99%.
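To get a feel for the gap between 89% and 99%, here is a small back-of-the-envelope sketch. The function names and the figure of 1,000 events are made up for illustration; only the 89% and 99% values come from the discussion above.

```python
def missed_calls(recall, genuine_calls):
    """How many genuine wake word calls get ignored at a given recall."""
    return round(genuine_calls * (1 - recall))


def false_wakeups(precision, activations):
    """How many activations were not real calls at a given precision."""
    return round(activations * (1 - precision))


# Hypothetical volume: 1,000 genuine calls and 1,000 activations.
for score in (0.89, 0.99):
    print(
        f"{score:.0%}: "
        f"{missed_calls(score, 1000)} ignored calls per 1,000 genuine calls, "
        f"{false_wakeups(score, 1000)} false wake-ups per 1,000 activations"
    )
```

At 89%, roughly one call in nine is ignored. At 99%, it is one in a hundred, which is the difference between a frustrating device and one you mostly stop noticing.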
