
Understanding Speech in Complex Acoustic Environments

This presentation focuses on the cocktail party problem and understanding why listeners with sensorineural hearing loss experience great difficulty in solving this problem.

Gerald Kidd

DOI: 10.1044/cred-pvd-c16005

The following is a transcript of the presentation video, edited for clarity.

Thank you, Karen, and thank you for inviting me here; I appreciate it. And thanks to ASHA for giving me the opportunity to talk with you today. The slightly more focused title of my talk is: Solving the Cocktail Party Problem by Enhancing Auditory Selective Attention: Initial Findings Using a Visually Guided Hearing Aid.

I'd first like to acknowledge the members of my lab group who have contributed to this project: Jen Best, Lorraine Delhorne, Todd Jennings, Chris Mason, and Ellen Roverud, as well as the rest of the group that has also made contributions, including former lab members and affiliates and some folks at Sensimetrics Corporation: Kameron Clayton, Joseph Desloge, Sylvain Favrot, Michel Jackson, Tom von Wiegand, Tim Streeter, Ganesh Swaminathan, and Pat Zurek.

A quick overview of what I’m going to cover today. I’m going to talk about a very simple model of selective attention, by way of providing an analogy really. I’m going to discuss how sensorineural hearing loss affects source selection, and I’m going to mostly describe our efforts at designing a new type of hearing aid that can enhance auditory selective attention.

The cocktail party problem

Now the analogy that attention acts like a light that shines on a thought, an idea, or a physical object is a very old one. This is a quote from the famous psychologist William James back in 1890, “The things we attend to come to us by their own laws. Attention creates no idea; an idea must already be there before we can attend to it. Attention only fixes and retains what the ordinary laws of association bring before the footlights of consciousness.” So the idea of shining a light on something with our attention volitionally was formalized to some degree by Michael Posner and his group around 1980 through a series of vision perception experiments that led them to conclude, “These findings are consonant with the idea of attention as an internal eye or spotlight…that enhances the detection of events within its beam.”

Now the spotlight theory of attention has a number of caveats and exceptions, but as just an analogy to perception it remains of interest and useful to us even today.

Now according to the spotlight analogy of selective attention, the beam enhances the source on which we choose to focus our attention. The simple concept being conveyed by this analogy is portrayed in this next very simple figure. We see a series of indistinct images corresponding to an unfocused attentional state. When we shine the flashlight of attention on one particular image, it brings the features of that image forward so that they're distinct from the background. And then, by an act of will, we can shift the focus of attention to various other sources and separate them from the others.

Now the illustration and the analogy are most commonly given for the sense of vision, while we're interested primarily in audition and speech recognition. That naturally leads to the question: Is there a parallel to the visual spotlight of attention for the sense of hearing? And I would argue: yes, there is. We can see this analogy if we consider the task of attempting to listen to one voice in a mixture of voices.

People with normal hearing can normally select and attend to one talker among other competing talkers with ease. It's a very impressive ability, and one that, for example, is very difficult to achieve through automatic speech recognition: it's very difficult to find a device that can accurately pick out one talker in the midst of other talkers and recognize that speech. Among the first scientists to really consider this very complex ability was Colin Cherry. He not only considered it, he raised the question and provided some possible answers. Here's a quote from his famous article in 1953: “How do we recognize what one person is saying when others are speaking at the same time (the ‘cocktail party problem’)? On what logical basis could one design a machine (‘filter’) for carrying out such an operation?”

Now, Cherry listed about five or six different factors in that article that could help human listeners solve the cocktail party problem, but the one that has probably gotten the most attention over the years is that the sources vary in spatial location. That variation creates differences at the two ears in the time of arrival and the level of the different sounds, and that's illustrated here in this slide. For a person speaking from the right, the sound reaches the right ear sooner than it reaches the left ear, which creates an interaural time difference. The sound is also attenuated at the far ear, particularly at the high frequencies, which creates an interaural level difference. The converse happens with the speaker on the left: the sound reaches the left ear first and is a little bit higher in level there, particularly in the high frequencies.
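As a rough worked example of the interaural time difference just described, the classic Woodworth spherical-head approximation can be sketched in a few lines. The head radius and speed of sound below are illustrative textbook defaults, not values from this presentation.

```python
import math

def woodworth_itd(azimuth_deg, head_radius_m=0.0875, speed_of_sound=343.0):
    """Approximate interaural time difference (seconds) for a spherical head.

    Uses the classic Woodworth formula ITD = (r/c) * (theta + sin(theta)),
    where theta is the source azimuth in radians. The default head radius
    and speed of sound are common textbook values, not from the talk.
    """
    theta = math.radians(azimuth_deg)
    return (head_radius_m / speed_of_sound) * (theta + math.sin(theta))

# A talker directly ahead produces no ITD; a talker at 90 degrees produces
# roughly two-thirds of a millisecond of interaural delay.
itd_90 = woodworth_itd(90)
```

The high-frequency level difference (head shadow) is frequency dependent and not captured by this simple time-of-arrival model.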

Now a listener can use these interaural differences, and the sets of interaural differences from different sources, to essentially focus a spatial filter toward one of the talkers. This is the flashlight analogy again: it's our way of shining an auditory flashlight on the source we want to listen to while excluding other sources.

Now this idea of using our own ability to filter out different sources with spatial information was recognized by Donald Broadbent early on about the time of Cherry and led to an early theory of attention based on speech recognition called the filter theory. And Broadbent said, “…the main advantage of spatial separation [of speech sources] comes in the case when one or more irrelevant messages have to be ignored… [selective attention makes it] easier to pass one [message] and reject another message…this is effectively a filter theory.”

So here’s our analogy, our spotlight analogy, that’s been transformed into the auditory domain. But again we start with a visual representation of indistinct figures.

So what you should have heard in that case is that the speech of the person who was being highlighted was louder relative to the others, and that person was asking a series of questions. The idea is that once we're able to focus the spotlight of attention (this auditory selective attention using interaural differences) on one talker, we're able to enhance that particular talker and suppress the talkers in competition with it. And this is, as Broadbent would say, an example of a type of filter.

Now in the laboratory we want to be able to measure this and quantify it. One of the ways we do that is to compare situations where we have a target talker that's identified in a group of talkers, and then we move the orientation of the talkers around from being spatially separated, as they are here to begin with, to being co-located so that all the talkers originate from the same location. We then measure the ability to recognize the speech of that one designated target talker. In this particular case, with all talkers at equal level, we measured about forty percent correct intelligibility for that speech.

However, if we take the talkers and separate them spatially, the intelligibility of the target increases to ninety-seven percent correct. That improvement in performance is referred to as a spatial release from masking.

Now more commonly, the way that we measure spatial release from masking is not as a difference in percent correct performance (although we can do that); we often want to look at a dB difference between the two conditions. What I'm working up toward is characterizing the properties of this attentional filter, and to do that we need dB values for certain situations. So this is a schematic illustrating how we would conduct an experiment to measure spatial release from masking in terms of a dB value; it is actually from an experiment in my lab. In this case we're asking how much spatial release from masking listeners with normal hearing achieve. We used, as you've heard from previous speakers, multiple same-sex talkers, which is a particularly difficult case. Because it's difficult to start out with, listeners can take advantage of segregation cues and achieve a large advantage. The three female talkers are speaking similar sentences. The target is always directly in front. The maskers are either co-located, as you saw in the last schematic, or spatially separated at plus or minus 15 degrees. The masker levels are fixed, and we vary the level of the target to find the relative level that yields fifty percent correct, a masked speech reception threshold, and we express that value in dB.
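The tracking procedure just described, varying the target level to find the fifty-percent-correct point, can be sketched with a simple 1-up/1-down adaptive staircase, which converges on 50% correct. This is an illustrative sketch with an invented simulated listener; it is not the exact tracking rule or psychometric function used in these experiments.

```python
import random

def one_up_one_down_srt(respond, start_level_db=0.0, step_db=2.0, n_reversals=8):
    """Simple 1-up/1-down adaptive track converging on the level that gives
    50% correct (a masked speech reception threshold, in relative dB).

    `respond(level_db)` is a callable returning True for a correct trial.
    """
    level = start_level_db
    last_direction = None
    reversal_levels = []
    while len(reversal_levels) < n_reversals:
        # Go down after a correct response, up after an incorrect one.
        direction = -1 if respond(level) else +1
        if last_direction is not None and direction != last_direction:
            reversal_levels.append(level)  # record each change of direction
        last_direction = direction
        level += direction * step_db
    # Average the reversal points as the threshold estimate.
    return sum(reversal_levels) / len(reversal_levels)

# Toy psychometric function: a listener whose 50% point sits at -11 dB
# target-to-masker ratio (an invented example, not measured data).
def simulated_listener(level_db, srt_db=-11.0, slope=0.5):
    p_correct = 1.0 / (1.0 + pow(10, -slope * (level_db - srt_db)))
    return random.random() < p_correct

random.seed(0)
estimate = one_up_one_down_srt(simulated_listener)
```

Real speech tests typically use larger step sizes early on and more trials, but the converging logic is the same.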

So in computing spatial release from masking we normally measure two thresholds. One is in the co-located condition, where we obtain a threshold of about 3 dB. The second is in the spatially separated condition, where we obtain a threshold of about minus 11 dB. The difference between the two thresholds is our estimate of spatial release from masking.
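The computation itself is just the difference between the two masked speech reception thresholds. A minimal sketch, using the approximate values quoted above:

```python
def spatial_release_from_masking(colocated_srt_db, separated_srt_db):
    """Spatial release from masking (dB): the improvement in the masked
    speech reception threshold when maskers move away from the target."""
    return colocated_srt_db - separated_srt_db

# Approximate thresholds from the talk: +3 dB co-located, -11 dB separated.
srm_db = spatial_release_from_masking(3.0, -11.0)  # 3 - (-11) = 14 dB
```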

Now in order to estimate the properties of this attentional filter tuned in the spatial dimension, we measure thresholds and spatial release from masking for a number of different target masker separations. And that’s what I’m going to illustrate next. This filter, we think, passes sounds from the direction that it’s pointed and inhibits or attenuates sounds from other directions.

And here's an illustration of the concept. There are several things to see on this slide as I fill it in.

So this is our listener, situated in a semi-circle of loudspeakers, and the listener is always listening for the target that's being played from directly ahead. We're going to measure thresholds for the target speech in the presence of two other same-sex masker talkers. The maskers can be co-located with the target, as you've seen before, and we're going to use the threshold obtained in the co-located case as the reference, arbitrarily assigning it to 0 dB. Then we're going to play pairs of maskers, different maskers, from symmetrically placed locations around the target: plus or minus 15 degrees, plus or minus 45 degrees, and plus or minus 90 degrees. As we separate the maskers from the target, the threshold is going to improve; it's going to get lower, and there's going to be less masking. We take that spatial release from masking as an estimate of the attenuation that's produced by our internally generated attentional filter.

So this will show you exactly how the measurements are made. First, these are the thresholds that we obtain: this is for the co-located case, and these are for the symmetric plus or minus 15 degree and plus or minus 45 degree cases. Relative to 0, the thresholds go down; they improve by about 12 dB. I'm plotting two thresholds around zero, even though there's really just one data point, because the maskers are symmetrically placed around center, and also because that reflection around zero allows us to estimate a tuned filter. Here is an example at 15 degrees of spatial separation: we measure the threshold for that condition, and it's about eight and a half dB down. Then, taking these data points, we fit a filter to them, the same type of filter that people use to fit auditory filters in the frequency domain. And this is, we believe, a good estimate of the spatial tuning that's achieved for the task of speech-on-speech masking.
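Fitting a tuned filter to the symmetric threshold data can be sketched as follows. The rounded-exponential ("roex") shape, the 90-degree normalization, and the grid-search fit are assumptions for illustration; the actual fitting procedure used in the lab may differ.

```python
import math

def roex_attenuation_db(theta_deg, p, ref_deg=90.0):
    """Attenuation (dB) of a rounded-exponential ('roex') filter shape,
    W(g) = (1 + p*g) * exp(-p*g), where g is the normalized distance from
    the filter center. Normalizing by 90 degrees is an assumption here."""
    g = abs(theta_deg) / ref_deg
    w = (1.0 + p * g) * math.exp(-p * g)
    return -10.0 * math.log10(w)

def fit_roex(angles_deg, attenuations_db):
    """Grid-search the sharpness parameter p that best matches measured
    spatial-release (attenuation) values, in a least-squares sense."""
    best_p, best_err = None, float("inf")
    for i in range(1, 400):
        p = i * 0.1
        err = sum((roex_attenuation_db(a, p) - d) ** 2
                  for a, d in zip(angles_deg, attenuations_db))
        if err < best_err:
            best_p, best_err = p, err
    return best_p

# Illustrative values like those in the talk: 0 dB at the co-located
# reference, ~8.5 dB of release at 15 degrees, ~12 dB at 45 degrees.
p_hat = fit_roex([0, 15, 45], [0.0, 8.5, 12.0])
```

A larger fitted p means a sharper spatial tuning; the reduced spatial release seen in listeners with hearing loss would show up as a much smaller p.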

Now if you play sounds other than speech as the masker, you can get a very different result. We've done the same experiment with two independent envelope-modulated, speech-shaped noises that we play co-located with the target, measure a threshold, set it to 0, and then spatially separate the noises, and we get thresholds out here. The reason we get so little attenuation is that your attentional filter works very well when the masking produced is informational in nature, which same-sex maskers speaking similar sentences tend to produce a high degree of. The noise maskers, by contrast, are already very distinct from the target, and the focus of attention in a symmetric masking case actually buys you very little. So when I talk about the properties of an attentional filter, be aware that the type of masking that's produced makes a difference.

The problem due to hearing loss

Now what about the problem of hearing loss? Listeners with hearing loss often experience great difficulty understanding speech in noise. And you should be aware that when people say noise, they don't always mean noise from an engineering or acoustic perspective; they're just talking about unwanted sources of sound. That creates a lot of confusion, because noise has a very particular meaning when you're talking about signals and systems. But in any case, we all know that listeners with sensorineural hearing loss in particular experience a great deal of difficulty in noisy situations, especially where there are competing talkers.

To show what happens with listeners who have sensorineural hearing loss, I'm showing you again the same schematic of our loudspeaker array and our listener. I'm going to plot on this figure data from a group of listeners with sensorineural hearing loss, for exactly the same task. What you see is that their ability to attenuate the sounds on the sides, relative to the sound they're trying to listen to, is much reduced compared with normal. And that's very consistent with the subjective impression of great difficulty picking out one talker to attend to and suppressing competing talkers.

Okay, that's the problem, and I think it's one of the major problems, on a functional level, that listeners with sensorineural hearing loss experience. The question is, what can we do about it? Hearing aids are certainly the most common means of remediating hearing loss. What can hearing aids do to restore normal spatial tuning?

Well, hearing aids typically provide frequency-dependent gain, to improve audibility and make the information accessible; amplitude compression, to restore normal loudness perception; noise reduction, which has most commonly been successful for steady-state, lower-frequency noise; and directionality. Noise reduction cannot make the decision of which talker you want to attend to; only the person who is the actual listener can make that selection. Directionality, however, can provide a significant benefit, and that's the feature of hearing aids that I'm going to focus on now.

Possible benefits of a beam-forming microphone array

So we’ve been exploring the possible benefits of a highly directional beam-forming microphone array that could ultimately be incorporated into a hearing aid, that we believe can enhance sound source selection for people with hearing loss. The device that we have come up with and I’m going to talk about is called the visually guided hearing aid.

These are a few pictures of the visually guided hearing aid. It's currently a laboratory prototype, and my intuition is we may need to do a little bit of work on the cosmetic aspect of it before we start selling it as a product. Although, who knows. There are several things to see here. In this figure is a flexible printed circuit board, and on the circuit board are four rows of omnidirectional microphones. The circuit board has been mounted on a flexible band that spans the head from ear to ear. In addition to the microphone array that does the beam-forming, we have a component that steers the beam using eye gaze, and for that we have the subject wear an eye tracker. So the visually guided hearing aid has two major functional components: a microphone array that does beam-forming and an eye tracker that steers the beam without the need for head turns. We also have insert phones that take the output from the microphone array and route it to the ears of the listener.
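The core idea behind a beam-forming microphone array, delaying and summing the microphone channels so that sound from the steered direction adds coherently while off-axis sound partially cancels, can be sketched as follows. This is a minimal delay-and-sum illustration, not the actual processing in the prototype; the array geometry, sample rate, and test tone are invented.

```python
import numpy as np

def delay_and_sum(mic_signals, mic_positions_m, steer_deg,
                  fs=16000, speed_of_sound=343.0):
    """Steer a linear microphone array toward `steer_deg` by delaying and
    averaging the channels (a basic delay-and-sum beamformer).

    Delays are applied in the frequency domain so they can be fractional
    samples. Returns a single (mono) output channel.
    """
    n = mic_signals.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    out = np.zeros(n)
    for sig, x in zip(mic_signals, mic_positions_m):
        # Compensating delay for a plane wave arriving from `steer_deg`.
        delay = x * np.sin(np.radians(steer_deg)) / speed_of_sound
        spectrum = np.fft.rfft(sig) * np.exp(2j * np.pi * freqs * delay)
        out += np.fft.irfft(spectrum, n)
    return out / len(mic_signals)

# Demo: a 2 kHz tone from straight ahead reaches four mics (spanning ~14 cm,
# like a head-worn band) simultaneously.
fs = 16000
t = np.arange(1024) / fs
tone = np.sin(2 * np.pi * 2000 * t)
positions = np.linspace(-0.07, 0.07, 4)
mics = np.tile(tone, (4, 1))
on_target = delay_and_sum(mics, positions, steer_deg=0, fs=fs)    # coherent sum
off_target = delay_and_sum(mics, positions, steer_deg=60, fs=fs)  # partial cancellation
```

Steering toward the source preserves it at full level, while steering 60 degrees away attenuates it substantially, which is the behavior the spatial tuning plots quantify.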

Now what kind of attenuation, what kind of spatial filtering, can you achieve with a beam-forming microphone array like this? Back to the same reference plot: this shows our spatial tuning filter from listeners with normal hearing for a speech-on-speech masking experiment. And in a way, this is what we're aiming for: to restore normal spatial tuning. If you remember, the hearing-impaired listeners had thresholds that were around halfway between 0 and twelve-and-a-half dB, and what we'd like to do is push that down.

Here are the attenuation characteristics of the beam-forming microphone array pointed directly ahead. In general it works pretty well; the width of the beam is a little bit wider than this particular spatial filter that we measured, but the attenuation once you get 20 degrees or so away is actually pretty good.

Now here is a demo of what the beam-former sounds like. I had him turn up the volume, so I'll stop this quickly if it's too loud. Basically, the listener here is going to be sweeping the beam created by the beam-former from left to right. The speech sounds, which are a bunch of digits, are only going to come from directly ahead. So it will start out outside of the beam of amplification, and you won't be able to hear it very well. Then, as the beam is swept across the front, the speech will get louder, and you'll hear about how much of a difference this beam-former can give you. Notice also that our listener is going to keep his head still and steer the beam with his eyes, even though he's not wearing an eye tracker.

Okay, so you get the idea from that, I think. That was really a schematic of how the beam-forming microphone array works. Now I'm going to play you a video, recorded in my lab, of members of my lab group sitting around a table having a typical conversation. The beam-former is not going to be turned on at the beginning; you're going to be listening to all these people talking through just an omnidirectional microphone. About halfway through, the beam-former is going to be turned on and the spotlight will come on the person at the focus of the beam. Then, as the beam is shifted around, the spotlight will move too.

So that should give you an example of how the beam-former sounds. It was steered in this case by physically turning the head; it wasn't steered by eye gaze.

Now I'm going to present some data from several experiments addressing, first of all, how much of a benefit you can get from the beam-former in a speech-on-speech masking experiment. The first thing I'm going to show you is the benefit you can get if you're a listener who only has one viable ear. These are a couple of unilaterally deafened subjects who obviously do not have access to normal binaural cues and therefore have a great deal of difficulty with selective listening.

So the panel on the left shows the two conditions, the co-located condition and the spatially separated condition, for symmetrically placed maskers: two maskers that are either speech or noise. The difference between the thresholds for the noise and the speech is due primarily to the difference between energetic masking and informational masking; you get more informational masking in this case from the speech maskers. What you'll see is that, listening through natural cues, meaning they are only getting input to their one good ear, moving the maskers from the co-located case to the spatially separated case not only doesn't give them any benefit, it actually makes things slightly worse, because acoustically it's a little bit worse for them. However, if you have them listen through the beam-former, they achieve a big benefit in their speech reception threshold. The co-located thresholds are about the same, but they get about a 10 dB benefit for both noise and speech by listening through the beam-former.

Now these data I've shown you previously. These are normal hearing listeners listening binaurally to two independent noises that are either co-located in the zero degree case, the reference case, or spatially separated by plus and minus 90 degrees. The question is what kind of benefit normal hearing listeners get for this speech-in-noise task if they listen through the beam-forming microphone array, and those are the thresholds that we found for that case. Again, there is about a 10 dB difference between the thresholds in the spatially separated condition for listening with natural binaural cues and listening through the beam-former.

We did the same thing with a group of listeners with sensorineural hearing loss. In the spatially separated case (this was actually simulated through earphones with HRTFs), the thresholds were a little bit higher; they were about 3 dB higher than for the normal hearing listeners, which is a very common finding. But listening through the beam-former, they also got the same benefit of about 10 dB.

Now it's good that you can achieve this kind of benefit for spatially separated sources by listening through a beam-former under certain conditions. However, the problem with beam-forming is that it's single channel. If you play the beam-former output to the two ears, listeners hear an image right in the middle of the head regardless of where they're looking or where the beam is focused. They can hear some attenuated sounds, but they get no sense of spatial hearing; they don't hear sounds located out in the environment. So the downside of listening through this kind of system is that it limits our ability to monitor the environment and to localize other sounds, which is one of the main functions of the sense of hearing. That leads to the question: Would it be possible to adapt the system so that we could retain the benefits of beam-forming but also provide awareness of the environment for the listener? We've been exploring that possibility.

We did this by combining the output of the beam-former with natural binaural cues, in this case recorded through the KEMAR manikin, the manikin that had the microphone array on it a few slides back. That manikin has natural ear canals with microphones in them, so we can record from it and reproduce natural binaural hearing. We took the recordings from the two ears and low-pass filtered them, so that in the low portion of the spectrum you still get natural spatial cues, and combined that with a high-pass version of the beam-former output (the beam-former actually works better at higher frequencies). We called the combination BEAMAR. We then ran speech-on-speech masking experiments for three different microphone conditions: listening to the microphone array alone, the beam-former; listening to KEMAR with full-band natural cues; and listening to the hybrid BEAMAR condition.
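The BEAMAR combination just described, low-pass natural binaural signals plus a high-pass beam-former output, can be sketched as a simple crossover. The 800 Hz crossover frequency and the hard frequency-domain split are assumptions for illustration; the talk does not specify the actual cutoff or filter design.

```python
import numpy as np

def beamar_mix(binaural_lr, beam_mono, fs=16000, crossover_hz=800.0):
    """Combine low-pass natural binaural signals with a high-pass beamformer
    output, in the spirit of the BEAMAR condition.

    `binaural_lr` is a (2, n) array of left/right ear signals; `beam_mono`
    is the single-channel beamformer output. The split is a hard cut in the
    frequency domain, chosen for simplicity in this sketch.
    """
    n = binaural_lr.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    low = freqs < crossover_hz
    beam_spec = np.fft.rfft(beam_mono)
    out = np.empty_like(binaural_lr)
    for ch in range(2):
        ear_spec = np.fft.rfft(binaural_lr[ch])
        # Low band keeps the natural ear signal (spatial cues);
        # high band takes the beamformer output (directional gain).
        mixed = np.where(low, ear_spec, beam_spec)
        out[ch] = np.fft.irfft(mixed, n)
    return out

# Demo: a 200 Hz tone stands in for the natural binaural (low-band) cues,
# and a 3000 Hz tone stands in for the beamformer (high-band) output.
fs, n = 16000, 1600
t = np.arange(n) / fs
low_tone = np.sin(2 * np.pi * 200 * t)
high_tone = np.sin(2 * np.pi * 3000 * t)
binaural = np.stack([low_tone, low_tone])
mixed = beamar_mix(binaural, high_tone, fs=fs)
```

The output at each ear carries the 200 Hz content from the natural recording and the 3000 Hz content from the beam, which is the division of labor BEAMAR relies on.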

We tested this with both a group of normal hearing young adults and young adults with sensorineural hearing loss. Here are the results; these are group mean data. In the left panel you see the data from the normal hearing listeners, and in the right panel the data from the hearing-impaired listeners. Within a panel we have thresholds for the co-located condition and for the spatially separated condition (I think it was plus or minus 90 degrees). The bars are for the different microphone conditions: natural binaural cues with KEMAR, listening through the beam-former alone, and listening through BEAMAR, the combination of the beam and natural cues. The ordinate is the masked speech reception threshold, specified as the level of the target relative to the level of the masker in dB.

Now, the things to take away from this. First of all, the normal hearing listeners, as expected, were better in virtually every condition than the listeners with sensorineural hearing loss, as a group. The differences in the co-located condition were very small, while the differences in the spatially separated condition were pretty substantial. For both groups of subjects, though, the best thresholds in the separated condition were obtained listening in the hybrid BEAMAR condition. That was true, and of particular interest, for the listeners with sensorineural hearing loss: the advantage was only a couple of dB, but it was still significant.

The other question about these data that is of interest is that if this were ever to become a clinical device, you wouldn't be dealing with group mean data; you'd be dealing with individuals. So you want to know which individuals would benefit from this kind of approach versus from, say, just binaural amplification. And by the way, in the cases where I've shown you hearing-impaired listeners listening with natural cues, we amplified the sound for them so that it's equally intelligible to the normal hearing listeners.

So this is part of the data that you just saw, but for the individuals that contributed to those group means. What we're plotting is the amount of benefit that the subjects received as a function of how well they performed with natural binaural cues. The farther along they are in this direction, the less well they were able to use their own binaural hearing to separate sounds and achieve a masking release. On the ordinate, all of the values above the dashed line are for individuals who achieved better thresholds using either the beam or the hybrid BEAMAR condition than with natural binaural cues. What you see is that, typically, as people get poorer at using natural binaural hearing to reduce masking, they tend to be more likely to benefit from either the beam or the BEAMAR. The filled symbols are for listeners with hearing loss, and just about all of them received some benefit from one of those two conditions, principally the BEAMAR condition. Interestingly, even some of the normal hearing listeners achieved some benefit listening through the beam-former, in this case six or seven dB. That raises a rather interesting possibility: persons with normal hearing who still exhibit selective attention difficulties, particularly as manifested in multiple-talker listening situations, might benefit from a beam-forming approach even though they don't have a hearing loss. We're going to explore that at some point.

I wanted to also add here, for reference, the benefits that you get with two noise maskers. We tested only three normal hearing subjects and one subject with a hearing loss, but virtually all of them achieved a significant benefit when listening through the beam-former when the masking was noise. All of the other data that I showed you were for maskers that were speech.

In all the cases I've shown you so far, the microphone array has basically been pointed directly ahead, toward the talker that was the target. But of course, naturally we have to select sources that are in different locations and transition from one place to another. If we were to use a beam-forming microphone array that's fixed to the head and tried to follow turn-taking in conversation, for example, we'd be turning our head all the time. While that's possible to do, there are problems with a fixed microphone array that depends on head turns to be steered. Head turns are slow relative to eye gaze, they're limited in range, and they often lag the focus of vision. So what you're hearing may not be caught up with where your eyes are moving and where your visual attention is focused. And obviously it may not be socially acceptable to be turning your head that much to try to follow sources.

So that's why we have incorporated eye-tracking as a means of steering the beam. This is just a photograph of a person wearing one of the eye trackers that we've worked with. This eye tracker has a camera that's pointed inward to detect eye gaze, and it can also give you other things like pupil size and tell you when the eyes blink. It has an outward-facing camera as well, so that you can calibrate where eye gaze is in absolute coordinates.

Measuring dynamic speech intelligibility or comprehension

This approach is appealing because when you couple gaze and highly directional amplification it would seem to be very well suited for the situation of dynamic listening. However once we began exploring this the question arose: How are you going to measure how well the system works? How are you going to develop a test for measuring dynamic speech intelligibility or comprehension?


So the next thing I'm going to talk about is one of several efforts (this one is a little further along in my lab group) to develop measures of speech understanding that are suitable for this more realistic, dynamic kind of listening situation. One of the things we're doing, as an experimental variable, is controlling the degree of predictability of when a source transition is about to occur.


So this is a new test that's just about to be published; Jen Best is the first author on the paper. It's trial-based, and every trial consists of a question-and-answer pair. The question and the answer would typically be uttered by different persons who could be located at different places. So one trial consists of a question and an answer, and you have to transition from one place to the other to fully understand both. An important part of this is that the listener doesn't just repeat back the words they heard. Many of the tests that we use don't really involve any degree of comprehension of the message; they only require us to repeat back the words that were spoken. In this case the listener does something very different: the listener responds yes or no, by a very simple button press, as to whether or not the answer to the question was correct. In order to do that, the speech has to be intelligible and they have to comprehend the response. And you can set the probability of a correct answer to be anything you want: fifty percent, ten percent, and so on.


So, more details about the experiment and examples of the questions. They could be things like: What is two plus three? Correct answer: five; incorrect answer: seven. What month comes after February? Correct answer: March; incorrect answer: August. The listener is monitoring these, trying to follow the back and forth, and then pressing a button. Another nice thing about this test is that because it only requires a button press, you don't need listeners to look at a GUI on a screen; their vision can be directed elsewhere. They can use their eyes to follow visual cues, for example for source transitions, and all they have to do is press a button. So that's a lot better for our purposes than certain other kinds of speech tests.
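The structure of the question-and-answer test can be sketched as follows. The item list, the trial format, and the scoring helper are hypothetical illustrations of the design described above, not materials from the published test.

```python
import random

# Hypothetical items in the style described in the talk:
# (question, correct answer, incorrect answer).
ITEMS = [
    ("What is two plus three?", "five", "seven"),
    ("What month comes after February?", "March", "August"),
    ("How many legs does a spider have?", "eight", "six"),
]

def make_trial(rng, p_correct=0.5):
    """Build one question-and-answer trial. With probability `p_correct`
    the spoken answer is the true one; the listener presses yes/no."""
    question, correct, wrong = rng.choice(ITEMS)
    answer_is_correct = rng.random() < p_correct
    spoken = correct if answer_is_correct else wrong
    return {"question": question, "answer": spoken, "key": answer_is_correct}

def score(trials, responses):
    """Proportion of trials where the yes/no response matched whether
    the spoken answer was actually correct."""
    hits = sum(t["key"] == r for t, r in zip(trials, responses))
    return hits / len(trials)

rng = random.Random(1)
trials = [make_trial(rng) for _ in range(10)]
perfect = score(trials, [t["key"] for t in trials])  # an ideal listener
```

Because the response is a single yes/no judgment, chance performance sits at fifty percent, and the probability of a correct answer can be set independently of the masking conditions.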


In addition to the target talkers, who produce the question and the answer, you can have masker talkers. In the experiment I’m going to talk about, the maskers were three separate conversations between two people located at different places. So it was a very rich sound field: three two-person conversations, plus the question and answer that the listener was to attend to, going back and forth. There was a fixed condition, focusing on one location, because the question and answer came from fixed locations. In contrast, there was a condition where the question-and-answer locations were unpredictable, and the listener had to monitor the whole sound field, find the question — actually they were visually cued — focus on it, and then find the answer. When they were using the beam-former they had to direct their eyes to the correct location to get the beam-former placed at the right place. So imagine you’re guiding this beam of amplification using eye gaze: you need a visual cue for where to steer it in time to get the question, then steer it somewhere else in time to get the answer, and then you make a decision and press yes or no. So the question was: would this task actually work, and would subjects be able to do it?


Here are our data, first for the case where the question and answer locations were fixed, for a group of normal-hearing subjects and a group of subjects with hearing loss. These are the three microphone conditions that I described earlier: natural binaural cues with KEMAR, listening just through the beam-former, and listening through the hybrid BEAMAR condition. These are preliminary data; the subjects are very different, and you can see the error bars are large, so I wouldn’t make too much of the fact that the differences between the groups are so large here. But in the fixed condition they were able to do the task. And in the fully dynamic condition they were still able to do the task, even in the beam and BEAMAR conditions. They took a hit relative to the fixed case, partly because this is a very complex experiment: the first time we tried it, we started the visual cue for the answer, for example, at the same time that the answer was being spoken, so it didn’t really give them time to get their eye gaze over there. But it was interesting that there was no cost for the natural binaural cues, which implies we can refocus our attention using normal spatial cues almost immediately, whereas with eye gaze we have to take some time to get it over there. In any event, we were very encouraged that this type of approach seemed to be successful, and we intend to explore other methods of assessing this ability to follow dynamic source transitions and conversations in multiple-talker situations.


So, to conclude. In general, we found that listeners with sensorineural hearing loss show reduced spatial filtering. Highly directional amplification, as embodied by a beam-forming microphone array, can in many cases provide a significant advantage for these listeners. Steering the beam using eye tracking allows rapid adjustment of the focus of amplification, which could potentially assist the listener in solving the cocktail party problem. And of the microphone conditions we have tested, the hybrid BEAMAR condition appears to be the most promising, because it provides the benefit of a beam-former in terms of improved signal-to-noise ratio while preserving awareness of the environment and the ability to localize sounds in it.

I’d like to thank all the other members of the Psychoacoustics Laboratory and the Hearing Research Center at Boston University, and our grant support from NIDCD and AFOSR. And I’m done. Thank you.

Questions and discussion

Audience Question:

My name is Greer Bailey, I’m an AuD student from West Virginia University, and I was just wondering if you’ve looked at all into how this technology would work in reverberant environments.

Yes, we have actually. One of the colleagues I mentioned at the beginning, Sylvain Favrot, has a paper that he presented, I believe at the German acoustics conference DAGA, where he manipulated reverberation. So there is a paper with some data on that; I’d be happy to send it to you or anybody else if you just email me. But it was kind of interesting. As you would expect, things get worse as reverberation is increased. But for the beam-former there actually was a little bit of benefit relative to natural binaural hearing in the co-located case, and it appears that because the beam-former is focused directly ahead, you get a little bit of attenuation of the echoes. So while everything tends to go up in reverberation, the co-located case with the beam-former didn’t go up very much; relative to natural hearing there was actually a little bit of a benefit. But for the spatially separated case, both for natural hearing and for the beam-former, things get a little worse.

Audience Question:

I’m Adam Bosen at Boys Town. The dynamic task, I think, is really interesting. My question, though, is about the BEAMAR algorithm; I’m not sure I understood it fully. What I think you’re doing, correct me if I’m wrong, is taking the dichotic spatial cues from the KEMAR and mixing them with the diotic cues from the beam. Is that correct? I’m a little confused as to why you wouldn’t try, on the beam-forming half of the equation, to reintroduce some of those spatial cues. Because now, if you’re listening to this mix of two separate sources, you’ve got one source, the dominant one because it’s amplified relative to everything else, telling you it’s straight ahead, whereas the KEMAR cues are telling you the source you’re trying to attend to is off to the side. I’m wondering if there’s an opportunity for confusion there, and whether you’ve thought about trying to introduce dichotic cues into the beam portion of the mix.

Well, the simple answer to the last part is yes; that’s one of the things we’re trying to do, and there are several approaches to doing it. From just an empirical perspective, it was an interesting question what people would do with these two images. The low-frequency part gives you a spatial sense, and the sounds can be separated, whereas no matter where you look you hear the high-frequency half as a single image in the middle of the head. Would people fuse it, or would it seem weird, or what? Now, a simple thing to do about that, which we explored early on, is that once we know where they’re looking, we can impose an ITD or an ILD that fits where the beam is focused and move the beam image around. So, depending on how well we match it up, we could attempt to move the image according to the information we’re getting from the eye tracker. But there are other things you can do. We do a lot of work with reducing the spectro-temporal representation of sounds into small time-frequency units, and you can do manipulations in those units. In fact, one of the students in biomedical engineering, one of Steve Colburn’s students, Jane Mead, is doing something very much like that: she’s looking at the dominant ITD, for example, to find the tiles that correspond to a particular talker and then getting rid of the others. So there are a lot of things you can do to try to do that. But interestingly, just the perception of the beam in the middle and the natural cues around it, as they would naturally appear in the low frequencies, doesn’t seem to disrupt people’s ability to use that information. In some cases it does sound a little weird, but it doesn’t seem to affect performance very much.
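As a rough sketch of the hybrid idea being discussed (natural binaural KEMAR cues in the low frequencies, the diotic beam-former output in the highs), one could cross the two signals over per ear. Everything here (the first-order filter, the 800 Hz crossover, the function names) is a simplified assumption of mine, not the lab’s actual BEAMAR processing.

```python
import math

# Illustrative sketch of a BEAMAR-style hybrid: keep natural binaural
# (KEMAR) cues below a crossover frequency and substitute the diotic
# beam-former output above it. Filter order, crossover frequency, and
# names are assumptions, not the actual implementation.

def one_pole_lowpass(x, fc, fs):
    """First-order low-pass; the matching high-pass is x - lowpass(x)."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * fc / fs)
    out, acc = [], 0.0
    for sample in x:
        acc += alpha * (sample - acc)
        out.append(acc)
    return out

def beamar_mix(kemar_left, kemar_right, beam, fc=800.0, fs=16000.0):
    """Per ear: natural low frequencies plus the beam's high frequencies."""
    beam_high = [b - lp for b, lp in zip(beam, one_pole_lowpass(beam, fc, fs))]
    left = [lp + bh for lp, bh in
            zip(one_pole_lowpass(kemar_left, fc, fs), beam_high)]
    right = [lp + bh for lp, bh in
             zip(one_pole_lowpass(kemar_right, fc, fs), beam_high)]
    return left, right
```

Because the high-frequency beam component is identical at the two ears, its fused image sits in the middle of the head, which is exactly the percept the questioner is asking about; imposing an eye-gaze-derived ITD or ILD on that component before the sum is one way to move the image, as described in the answer.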

Audience Question:

I think it’s a great idea to combine the noise reduction and maintain some spatial hearing. It seems this would be wonderful for many hearing-impaired people, and many normal-hearing people, including police officers, the military, and people directing traffic during a marathon. I do wonder, however, whether you’ve considered some sort of training mechanism, because it seemed like some people could benefit, and I would expect everybody to benefit in all situations if they knew what to do. So the question is: is training part of this process?

Well, it hasn’t been so far, except to the extent that we train subjects to a certain performance level before we test them in the lab. We haven’t thought about any way of training them to use the sort of spatial information they’re getting from this, but I think that could be very beneficial. Right now we basically have a grant to work on this project, and we’re interested in the underlying scientific questions: When would beam-forming be useful? How well does the steering work? That sort of thing. But if we find enough evidence that it’s worth putting some effort into, then we’re going to think more toward: What would it take to miniaturize it? What would it take to make it portable? What would it take to actually get people to use it and learn to use it effectively? And I believe there would be a significant training component to that.

Audience Question:

Jacob from Louisiana Tech. With making it more user-friendly, smaller, and everything: How would you go about making it more cost-effective? I would assume that having 16 microphones and two eye trackers would cost a fair bit of money.

Well, the eye tracker that I showed you, the one Sylvain was wearing, cost thirty-something thousand dollars, so that’s probably on the high side. Right now it’s a long way from being a real device, and cost is certainly one of the issues. What you didn’t see is that there’s a computer running outside of the picture to do the filter computation and the eye-gaze control. So there’s a lot going on that involves other components, and miniaturizing it and making it something palatable to the general public involves a lot of steps that we’re not really equipped to take. We’re not in the business of trying to make products; we’re just trying to figure out whether this is worth it for a hearing aid company to really get involved in. And I think there are pros and cons to it. But I’ve talked with enough people with serious hearing loss to believe that if this is something that really works (as opposed to just something that gives you more bells and whistles), if it really gives them a significant benefit with the cocktail party problem, they would pay a significant amount of money for it. They would put up with things that are less cosmetically appealing for function. So we hope we can determine whether or not the functional benefits really are there.

Audience Question:

When we’re looking at the results for the hearing-impaired individuals and comparing them to the other individuals: is that with their natural hearing or with hearing aids on? That’s my first part.

And the reason I was asking is that it would probably be even worse with hearing aids, because hearing aids usually only go down to about 250 Hz. So if you’re going to use your interaural cues, the cues below 250 Hz are being cut out by the hearing aids; that’s another reason why hearing aids don’t help in some ways. In some ways it would be better if the hearing aid could go down to 60 or 70 Hz, but the problem with trying to add that is it amplifies too much noise; on the other hand, you’re losing some of those low-frequency cues.

It’s natural hearing, but we use, I think, an NAL-RP gain rule to provide gain for them through the headphones, or however we play it to them. But it’s not their own hearing aid.

It’s a complication. At some point we’d like to compare subjects wearing their own hearing aids to wearing this device, but we’re not quite there yet.

Audience Question:

What is the connection with the Air Force since you acknowledged them as a contributor?

We’ve had a long relationship with AFOSR. It started with a grant that Doug Brungart, Nat Durlach, and I wrote maybe 12 years ago, basically on simple spatial hearing and informational masking, and it’s continued up to this time. I’ve had talks with the program officer at AFOSR about potential applications for this. They’re very interested in spoken communication in any situation; that’s the Air Force’s interest: point-to-point radio. They already have headgear, so you can put electronics in it pretty easily. So no, that’s a good point.

Gerald Kidd
Boston University

Presented at the 26th Annual Research Symposium at the ASHA Convention (November 2016).
The Research Symposium is hosted by the American Speech-Language-Hearing Association, and is supported in part by grant R13DC003383 from the National Institute on Deafness and Other Communication Disorders (NIDCD) of the National Institutes of Health (NIH).
Copyrighted Material. Reproduced by the American Speech-Language-Hearing Association in the Clinical Research Education Library with permission from the author or presenter.