Speaking of security

Voiceprint authentication a la "Star Trek" could be poised to become reality

Now, what was that password? This was the system that needed eight digits, right? Did it require symbols, or just letters and numbers? 

Today's password-driven electronic security systems have a face only a system security administrator could love. Better to use some kind of biometric--something physically a part of us, marking us as ourselves. Some systems already use fingerprints.

What about a voiceprint? Wouldn't it be great if our computers could recognize us, the way the Enterprise recognized Captain Kirk saying, "Zero-zero-zero destruct zero"?

Simply recognizing human speech isn't all that hard for today's computers. Take Apple's voice-activated Siri assistant, for example. Matching voiceprints is possible, too, and is already being done by commercially available software from companies such as Nuance and Auraya. 

The problem is making the voiceprints themselves secure and difficult to steal, says Bhiksha Raj, associate professor and non-tenured faculty chair at the School of Computer Science's Language Technologies Institute. But Raj and his colleagues may have found the way forward. 

"Where [the work] really began was when we realized that every time we use our voice to authenticate [ourselves], we put ourselves at risk," Raj explains. "Your voice is supposed to be a viable biometric, but once you've given it away it becomes just another bit of data out there." 

Just as a stolen password can leave your email or financial data wide open, a cracker could potentially steal your voiceprint from a database and take over part of your life, at least as effectively as if he'd stolen your Social Security number.

The question that stumped Raj and his colleagues was: Could they somehow get "Siri" to respond to voice commands without sending the actual voiceprint over the network and into the cloud, where it would be vulnerable to theft? 

Old-school encryption 

The obvious answer to the problem was to encrypt the voice recording. A system would store an encrypted version of your voice, identifying you without actually having access to your voice. Each system would have its own encrypted version, impossible to connect with each other or with your original voiceprint. 

To accomplish this, Raj and his colleagues had to solve two problems. The first was authentication itself. 

"Speech is a noisy signal," says Shantanu Rane, a principal research scientist at Mitsubishi Electric Research Laboratories, who collaborated with Raj on his early work. "If you say something now, and then say it again five minutes later, the two signals are not going to be identical." There's simply no way to make a person's voice input as stable as a typed password.

To solve this issue, the voice authentication process employed by Raj's team used Gaussian mixture models, or GMMs. GMMs are a way to statistically match up a given pattern to a standard sample. In this case, Raj says, the researchers used the parameters of the GMMs they calculated from individual voice recordings to represent the actual recordings. Using GMMs, their system achieved excellent results both in terms of recall (the ability to recognize a matching voice sample) and precision (the ability to avoid accepting a non-matching sample).
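To make the idea concrete, here is a minimal sketch of GMM-based matching. It is not the team's system: real speaker models work on multidimensional acoustic features, while this toy version scores one-dimensional "frames," and the two models, their parameters and the threshold are all made-up assumptions. It compares how well a speaker's GMM explains a new sample against a generic background model.

```python
import math

def gmm_log_density(x, weights, means, variances):
    """Log of a 1-D Gaussian mixture density at x (log-sum-exp for stability)."""
    terms = [
        math.log(w) - 0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def average_llr(frames, speaker, background):
    """Mean log-likelihood ratio of the frames: speaker model vs. background."""
    total = sum(
        gmm_log_density(x, *speaker) - gmm_log_density(x, *background)
        for x in frames
    )
    return total / len(frames)

# Hypothetical models as (weights, means, variances); the numbers are invented.
SPEAKER = ([0.5, 0.5], [1.0, 3.0], [0.25, 0.25])
BACKGROUND = ([1.0], [0.0], [9.0])

def accept(frames, threshold=0.0):
    """Accept the claim if the speaker model explains the frames better."""
    return average_llr(frames, SPEAKER, BACKGROUND) > threshold

print(accept([0.9, 1.1, 3.2, 2.8]))   # frames close to the speaker's model
print(accept([-2.0, -1.5, -3.0]))     # frames far from the speaker's model
```

Because the decision rests on a statistical score rather than an exact match, two slightly different recordings of the same phrase can both pass.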

The second problem was more of an obstacle. Encrypting and decrypting voice samples is straightforward--if you have plenty of time to burn. 

"When you encrypt data, the size of the data increases," says Rane. "With the thousand-bit keys typically considered suitable for encrypted-domain processing, the overhead for storage, communication and computation increases a thousand-fold." 
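The article does not name the encryption scheme, so the sketch below assumes one well-known option for encrypted-domain processing, the additively homomorphic Paillier cryptosystem, using toy primes that offer no real security. It shows where the overhead comes from: every encrypted number, however small, becomes a value modulo n squared, so a 1024-bit modulus n yields ciphertexts of up to 2048 bits each. All function names here are illustrative.

```python
import math
import secrets

def keygen(p, q):
    """Build a toy Paillier key pair from two distinct primes p and q."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt an integer 0 <= m < n; the ciphertext lives modulo n squared."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random blinding value in [1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, private_key, c):
    """Recover the plaintext from ciphertext c."""
    lam, mu = private_key
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the hidden plaintexts (modulo n)."""
    return (c1 * c2) % (n * n)

n, private_key = keygen(1009, 1013)  # toy primes, far too small for real use
total = add_encrypted(n, encrypt(n, 20), encrypt(n, 22))
print(decrypt(n, private_key, total))              # sum computed while encrypted
print(n.bit_length(), (n * n).bit_length())        # modulus bits vs. ciphertext bits
```

A voice sample is made up of many thousands of such numbers, and each one must be expanded and processed this way, which is why the costs pile up so quickly.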

Unlike text passwords, voice samples are complex, and encrypting them generates huge amounts of data. Just moving that data back and forth from machine to machine becomes time-prohibitive. In one experiment, Raj, then-CMU graduate student Manas Pathak, and colleagues from Portugal's Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento tried encrypting, exchanging and decoding a 4.4-second voice recording, using a 1024-bit security key. It took more than 14 hours, or more than 11,000 times the length of the recording itself.

That wasn't going to work--so how could they get around the limitations of encrypted data? 

Taking a trick from text passwords 

The answer, ironically, was to make voiceprints behave more like text-only passwords. 

Most laypeople don't realize that when they type in a password, their computer does not transmit the password to the system. Instead, it uses the password to generate a "hash"--essentially, a mathematically devised "password for your password" that only one particular system possesses and uses. It's almost impossible for someone intercepting the hashed password to decode it. Hashes are as secure as encrypted data, but far smaller, and easy to transmit quickly. 
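For readers who want to see the mechanics, here is a minimal sketch of salted password hashing using Python's standard library. The function names and the iteration count are assumptions chosen for illustration, not a description of any particular system. The random salt is what makes the stored hash specific to one system: the same password hashed elsewhere with a different salt produces an unrelated value.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor; real deployments tune this

def hash_password(password, salt=None):
    """Return (salt, digest); a fresh random salt is drawn if none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest

def verify_password(password, salt, expected_digest):
    """Re-derive the digest from the attempt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)

salt, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored))
print(verify_password("wrong guess", salt, stored))
```

Note that verification demands an exact match. Change one character of the password and the digest changes completely, which is exactly why an ordinary hash cannot be applied directly to a noisy signal like speech.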

Raj and his colleagues developed "secure binary embeddings," or SBEs, to convert key features of complex voice signals into simplified hashes. The hashes of a given recording can then be compared with the original voiceprint's hashes in a way that maintains recall and precision without any way for the hashes to be used to reconstruct the original voiceprints. 
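The article does not spell out how the embeddings are built. The sketch below follows one construction from the research literature on binary embeddings, as an assumption: project the feature vector with a secret random matrix, add a secret random offset, quantize with step size delta, and keep only whether each result is even or odd. The names, dimensions and parameter values are all illustrative.

```python
import math
import random

def make_embedding(dim, n_bits, delta, seed):
    """Return an embed(x) function; the seed stands in for the secret key."""
    rng = random.Random(seed)
    matrix = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
    dither = [rng.uniform(0.0, delta) for _ in range(n_bits)]

    def embed(x):
        bits = []
        for row, offset in zip(matrix, dither):
            projection = sum(a * xi for a, xi in zip(row, x)) + offset
            # Keep only the parity of the quantization cell the value lands in.
            bits.append(math.floor(projection / delta) % 2)
        return bits

    return embed

def hamming_fraction(a, b):
    """Fraction of bit positions where two hashes disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

embed = make_embedding(dim=8, n_bits=512, delta=1.0, seed=7)
enrolled = [0.3, -1.2, 0.8, 0.1, 2.0, -0.5, 1.1, 0.4]   # made-up features
same_voice = [v + 0.01 for v in enrolled]               # slightly noisy repeat
other_voice = [v + 5.0 for v in enrolled]               # very different vector

print(hamming_fraction(embed(enrolled), embed(same_voice)))
print(hamming_fraction(embed(enrolled), embed(other_voice)))
```

In this sketch, nearby vectors disagree in only a few bits, while distant ones disagree in roughly half of them, about what two unrelated random bit strings would do. That is the property that lets a server compare hashes without learning much about voices that do not match.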

In a paper written with Jose Portelo of INESC-ID/IST, Raj and colleagues showed that an SBE system could deliver recall and precision in voice authentication of over 93 percent--slightly under that of encryption-based authentication. Since then, they have improved this performance further. 

"You can actually achieve more or less the same performance that you get with a conventional [encrypted] system, with a fraction of a percent error," Raj says.

They've already achieved recall and precision with less than 0.18 percent error--a level that suggests SBE is a workable identification system for comparing voiceprint information through the cloud. 
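For reference, recall and precision are simple ratios over counts of authentication outcomes. The sketch below uses the standard definitions; the counts in the example are invented and are not results from the research.

```python
def precision_recall(true_accepts, false_accepts, false_rejects):
    """Standard definitions over counts of authentication outcomes."""
    # Precision: of all the attempts accepted, how many were genuine?
    precision = true_accepts / (true_accepts + false_accepts)
    # Recall: of all the genuine attempts, how many were accepted?
    recall = true_accepts / (true_accepts + false_rejects)
    return precision, recall

# Invented counts: 90 genuine accepts, 10 impostors accepted, 30 genuine rejects.
print(precision_recall(90, 10, 30))
```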

Rane, who was not involved in the SBE research, says the results Raj and his colleagues are reporting are "much faster" than encryption. "You take a small hit in accuracy in return for a very large increase in speed," he says. 

A few questions need to be resolved before SBE can be deployed, Raj says. Most importantly, the researchers need to decide how much of the conversion is performed on the user's device and how much is done in the system. The more of the conversion done on the end user's device, the easier it would be for an intruder to use that device to simply reset everything for his or her purposes--for example, if you lost a smartphone. On the other hand, the more that resides in the system, the more the user is at the mercy of that system's security and goodwill.

Possibly, different levels of voice authentication would be best served by different formulations of conversion. 

Once these issues are decided, Raj says, voice-authentication passwords are "computationally feasible and plausible, and can be implemented today."