Revolutionary AI Headphones Can Isolate a Single Voice in a Crowd
Noise-cancelling headphones have become very good at creating an auditory blank slate. But letting certain sounds from a wearer's environment back through that erasure remains a hurdle for researchers. The latest version of Apple's AirPods Pro, for example, automatically adjusts sound levels for wearers, sensing when they are in conversation, but the user has no control over whom to listen to or when this happens.
A team at the
University of Washington created an artificial intelligence system that allows
a user to "enroll" someone by looking at them for three to five
seconds while wearing headphones. The device, named "Target Speech
Hearing," then eliminates all other noises in the environment and plays
only the enrolled speaker's voice in real time, even when the listener walks
around in noisy places and no longer faces the speaker.
The team's findings
were presented on May 14 in Honolulu at the ACM CHI Conference on Human Factors
in Computing Systems. The code for the proof-of-concept device is available for others to build on. The system is not commercially available.
"We now think of AI as web-based chatbots that answer queries," said senior author Shyam
Gollakota, a professor at the University of Washington's Paul G. Allen School
of Computer Science and Engineering. "However, in our study, we use AI to
adjust the audio impression of everyone using headphones based on their
preferences. Our technologies allow you to clearly hear a single speaker even
in a crowded area with several other individuals conversing."
To use the system, a person wearing off-the-shelf headphones fitted with microphones points their head at a speaker and presses a button. The speaker's voice should then reach the microphones on both sides of the headset simultaneously, within a 16-degree margin of error. The headphones send that signal to an onboard embedded computer, where the team's machine learning software learns the target speaker's vocal patterns. The system then latches onto that speaker's voice and keeps playing it back to the listener, even as the two move around. The more the enrolled speaker talks, the more training data the system gathers and the better it becomes at focusing on that voice.
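To give a rough sense of how such an enroll-then-filter loop fits together, below is a minimal sketch in plain NumPy. It is illustrative only: the actual Target Speech Hearing system learns the speaker's vocal patterns with a neural network running on the embedded computer, whereas this stand-in uses a crude spectral average as the speaker signature, and every constant and function name here is hypothetical.

    import numpy as np

    SAMPLE_RATE = 16_000   # assumed speech sampling rate
    ENROLL_SECONDS = 4     # the team reports a three-to-five-second enrollment glance
    N_FFT = 512            # analysis frame size (hypothetical)

    def spectral_signature(audio: np.ndarray) -> np.ndarray:
        """Stand-in for the learned speaker embedding: average the
        magnitude spectrum of a (2, samples) binaural clip."""
        mono = audio.mean(axis=0)                    # collapse left/right channels
        usable = len(mono) // N_FFT * N_FFT
        frames = mono[:usable].reshape(-1, N_FFT)
        spectrum = np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)
        return spectrum / (np.linalg.norm(spectrum) + 1e-8)

    def filter_frame(frame: np.ndarray, signature: np.ndarray) -> np.ndarray:
        """Stand-in for the separation network: apply a spectral mask
        shaped by the enrolled signature to one (2, N_FFT) frame."""
        spectrum = np.fft.rfft(frame.mean(axis=0))
        mask = signature / (signature.max() + 1e-8)  # emphasize the enrolled bands
        return np.fft.irfft(spectrum * mask, n=N_FFT)

    # Enrollment: the wearer faces the speaker for a few seconds.
    rng = np.random.default_rng(0)                   # synthetic audio for the demo
    enrollment = rng.standard_normal((2, SAMPLE_RATE * ENROLL_SECONDS))
    signature = spectral_signature(enrollment)

    # Streaming: each incoming frame is filtered, and the signature is
    # refined as the target keeps talking (more speech, more training data).
    for _ in range(10):
        frame = rng.standard_normal((2, N_FFT))
        cleaned = filter_frame(frame, signature)     # played back to the wearer
        signature = 0.95 * signature + 0.05 * spectral_signature(frame)

The running update in the loop mirrors the behavior described above: the longer the enrolled speaker talks, the more data accumulates and the tighter the filter can lock onto that voice.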
The researchers evaluated the system with 21 participants, who on average rated the clarity of the enrolled speaker's voice nearly twice as high as that of the unfiltered audio.
This work builds on the team's earlier "semantic hearing" research, which let users choose which classes of sounds they wanted to hear, such as voices or birds, while muffling other background noise.
For now, the TSH system can enroll only one speaker at a time, and only when no other loud voice is coming from the same direction as the target speaker's. If a user isn't satisfied with the sound quality, they can run another enrollment on the speaker to improve clarity.

In the future, the team hopes to expand the system to earbuds and hearing aids.
Takuya Yoshioka, director of research at AssemblyAI, and doctoral students Bandhav Veluri, Malek Itani, and Tuochao Chen from the University of Washington's Allen School also contributed to the paper as co-authors. The Thomas J. Cable Endowed Professorship, a Moore Inventor Fellow award, and the UW CoMotion Innovation Gap Fund provided funding for this study.
For more information, contact tsh@cs.washington.edu.