Paper Translation: Learning Rate Adjustment in Double-Talk Frequency-Domain Echo Cancellation

Original title: "On Adjusting the Learning Rate in Frequency Domain Echo Cancellation With Double-Talk"


Abstract: One of the main difficulties in echo cancellation is that the learning rate needs to be varied according to conditions such as double-talk and echo path change. This paper proposes a new method for varying the learning rate of a frequency-domain echo canceller. The method is derived from the optimal learning rate of the normalized least mean square (NLMS) algorithm in the presence of noise. It is evaluated in combination with an adaptive filter, and we demonstrate that it outperforms current double-talk detection techniques while remaining easy to implement.

Index terms: Acoustic echo cancellation, adaptive learning rate, echo path variation, multi-delay block frequency domain (MDF) algorithm, normalized least mean square (NLMS) algorithm.

I. Introduction

  Robust echo cancellation requires a way to adjust the learning rate to account for the presence of noise and/or interference in the signal. Most echo cancellation algorithms try to detect double talk and then react by freezing the adaptation of the adaptive filter.

  The most commonly used double-talk detection algorithm, proposed by Gansler [1], is based on the coherence between the far-end and near-end signals. It has two main disadvantages. First, the detection threshold depends on the echo path loss and on the energy ratio between the speakers and the noise. Second, estimating the coherence requires good knowledge (or estimation) of the echo delay. The double-talk detector proposed by Benesty [2] removes the need for explicit delay estimation and generally reduces the required complexity. However, the simplification used to compute the decision variable $\xi$ is based on the assumption that the filter has converged, and this assumption does not hold when the echo path changes.

  In this paper, we propose a novel approach to make adaptive echo cancellation robust to double-talk. We use a continuously varying learning rate instead of trying to explicitly detect the double-talk condition as in [1], [2]. The learning rate is adjusted as a function of the disturbances (noise and double-talk) as well as the filter misalignment. This is achieved by deriving the optimal learning rate of a normalized least mean square (NLMS) filter in the presence of noise and applying the result to a multi-delay block frequency-domain (MDF) adaptive filter [3]. While techniques for updating the learning rate with gradient-adaptive algorithms have been proposed in the past [4], [5], in this paper we focus on designing the learning rate to react quickly to double-talk conditions.

  In Section II, we derive the optimal learning rate for the NLMS algorithm in the presence of noise. In Section III, we propose a technique for adjusting the learning rate of the MDF algorithm based on the result derived for the NLMS filter. Experimental results and discussion are given in Section IV, and Section V concludes the paper.


Figure 1: Block Diagram of Echo Cancellation System

II. Optimal NLMS Learning Rate in a Noisy Environment

  From an information-theoretic point of view, we know that as long as the adaptive filter is not perfectly adjusted, the error signal always carries some information about the exact (time-varying) filter weights $w_k(n)$. However, the amount of new information about $w_k(n)$ decreases as the amount of noise in the microphone signal $d(n)$ increases. In the case of the NLMS filter, this means that the stochastic gradient becomes less reliable when the noise increases or when the filter misalignment decreases (as the filter converges). In this section, we derive the optimal learning rate for the general case of the complex NLMS algorithm.

  A complex NLMS filter of length $N$ (Fig. 1) is defined by the filter output

$$\hat{y}(n) = \sum_{k=0}^{N-1} \hat{w}_k(n)\,x(n-k) \tag{1}$$

the error signal

$$e(n) = d(n) - \hat{y}(n) \tag{2}$$

and the weight update

$$\hat{w}_k(n+1) = \hat{w}_k(n) + \mu\,\frac{e(n)\,x^*(n-k)}{\sum_{i=0}^{N-1}|x(n-i)|^2} \tag{3}$$

where $x(n)$ is the far-end signal, $\hat{w}_k(n)$ are the estimated filter weights at time $n$, and $\mu$ is the learning rate.
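The recursion (1)–(3) can be written as a short NumPy sketch. This is illustrative only, not the paper's implementation; the sparse echo path, the filter length, and $\mu = 0.5$ are made-up values for the demonstration:

```python
import numpy as np

def nlms(x, d, N=64, mu=0.5, eps=1e-8):
    """NLMS echo canceller following eqs. (1)-(3) (real-valued case).

    x : far-end signal, d : microphone signal, N : filter length,
    mu : learning rate.  Returns the error signal and final weights.
    """
    w = np.zeros(N)                      # estimated weights w_hat_k
    e = np.zeros(len(d))
    for n in range(N - 1, len(d)):
        xv = x[n - N + 1:n + 1][::-1]    # [x(n), x(n-1), ..., x(n-N+1)]
        y_hat = w @ xv                   # eq. (1): estimated echo
        e[n] = d[n] - y_hat              # eq. (2): error signal
        norm = xv @ xv + eps             # sum_i |x(n-i)|^2
        w = w + mu * e[n] * xv / norm    # eq. (3): weight update
    return e, w

# Toy run: identify a sparse echo path from white noise.  There is no
# near-end noise here, so even mu close to 1 would converge.
rng = np.random.default_rng(0)
x = rng.standard_normal(20000)
h = np.zeros(64); h[3], h[10] = 0.8, -0.4    # "true" echo path (made up)
d = np.convolve(x, h)[:len(x)]               # microphone signal = echo only
e, w = nlms(x, d, N=64, mu=0.5)
```

After convergence, `w` approaches the true echo path `h` and the error signal dies out, which is the regime the derivation below analyzes.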

  Considering the filter weight error $\delta_k(n) = \hat{w}_k(n) - w_k(n)$, and knowing that $d(n) = v(n) + \sum_k w_k(n)\,x(n-k)$, (3) can be rewritten (assuming a locally static echo path, $w_k(n+1) = w_k(n)$) as

$$\delta_k(n+1) = \delta_k(n) + \mu\,\frac{\left[v(n) - \sum_j \delta_j(n)\,x(n-j)\right]x^*(n-k)}{\sum_{i=0}^{N-1}|x(n-i)|^2} \tag{4}$$

  At each time step, the filter misalignment $\Lambda(n) = \sum_k \delta_k^*(n)\,\delta_k(n)$ can then be obtained through

$$\Lambda(n+1) = \Lambda(n) - \frac{2\mu\,\mathrm{Re}\{e(n)\,r^*(n)\}}{\sum_{i=0}^{N-1}|x(n-i)|^2} + \frac{\mu^2\,|e(n)|^2}{\sum_{i=0}^{N-1}|x(n-i)|^2} \tag{5}$$

where $r(n) = -\sum_k \delta_k(n)\,x(n-k)$ is the residual echo, so that $e(n) = v(n) + r(n)$.

Under the (strong) assumption that $x(n)$ and $v(n)$ are white noise signals uncorrelated with each other, we find

$$E\{\Lambda(n+1)\} = \Lambda(n) - \frac{2\mu}{N}\Lambda(n) + \frac{\mu^2}{N}\left[\Lambda(n) + \frac{N\sigma_v^2}{\sum_{i=0}^{N-1}|x(n-i)|^2}\right] \tag{6}$$

where the expectation operator $E\{\cdot\}$ is taken over $v(n)$ only, and $\sigma_v^2 = E\{|v(n)|^2\}$. Because (6) is a convex function of $\mu$, solving $\partial E\{\Lambda(n+1)\}/\partial\mu = 0$ with $\Lambda(n) \neq 0$,

$$\frac{\partial E\{\Lambda(n+1)\}}{\partial\mu} = -\frac{2}{N}\Lambda(n) + \frac{2\mu}{N}\left[\Lambda(n) + \frac{N\sigma_v^2}{\sum_{i=0}^{N-1}|x(n-i)|^2}\right] = 0 \tag{7}$$

yields the value of $\mu$ that minimizes the expected misalignment. This results in a conditionally optimal learning rate (conditioned on the current misalignment and the far-end signal):

$$\mu_{opt}(n) = \frac{\Lambda(n)\sum_{i=0}^{N-1}|x(n-i)|^2}{\Lambda(n)\sum_{i=0}^{N-1}|x(n-i)|^2 + N\sigma_v^2} \tag{8}$$

In the absence of near-end noise ($\sigma_v^2 = 0$), we see that (8) simplifies to $\mu_{opt}(n) = 1$, consistent with [7]. Now, observing that the expected value of $\Lambda(n)\sum_{i=0}^{N-1}|x(n-i)|^2/N$ over $x(n)$ is equal to the variance $\sigma_r^2(n)$ of the residual echo $r(n) = y(n) - \hat{y}(n)$, and that the output signal variance is $\sigma_e^2(n) = \sigma_v^2(n) + \sigma_r^2(n)$, we obtain the approximate optimal learning rate (the approximation becomes exact as $N$ goes to infinity)

$$\mu_{opt}(n) \approx \frac{\sigma_r^2(n)}{\sigma_e^2(n)} \tag{9}$$

This means that the optimal learning rate is approximately equal to the residual-to-error power ratio. Note that $\sigma_e^2(n)$ is easy to estimate, while the residual echo power $\sigma_r^2(n)$ is difficult to estimate, as discussed in the next section. For now, assuming that estimates $\hat{\sigma}_r^2(n)$ and $\hat{\sigma}_e^2(n)$ are available, we can choose the learning rate

$$\mu(n) = \min\left(\frac{\hat{\sigma}_r^2(n)}{\hat{\sigma}_e^2(n)},\ 1\right) \tag{10}$$

where the upper bound of 1 is the optimal rate in the noiseless case and reflects the fact that $\sigma_e^2(n)$ is always greater than or equal to $\sigma_r^2(n)$.
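As a concrete illustration (ours, not the paper's code; the helper name is hypothetical), the rule above is a one-liner whose behavior in the two extreme regimes is easy to see:

```python
def learning_rate(sigma_r2_hat, sigma_e2_hat, eps=1e-12):
    """Eq. (10): residual-to-error power ratio, clipped at 1."""
    return min(sigma_r2_hat / (sigma_e2_hat + eps), 1.0)

# Error power dominated by noise/double-talk: the rate becomes small,
# softly freezing adaptation instead of using a hard detector.
slow = learning_rate(0.01, 1.0)
# Error dominated by residual echo (little noise): full-speed adaptation.
fast = learning_rate(1.0, 1.0)
```

The soft clipping is what replaces the hard double-talk decision of [1], [2].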

  Another result that follows from (6) is that adaptation stalls ($E\{\Lambda(n+1)\} = \Lambda(n)$) when

$$\Lambda(n) = \frac{\mu\,\sigma_v^2}{(2-\mu)\,\sigma_x^2(n)} \tag{11}$$

where $\sigma_x^2(n)$ is the variance of the filter input (far-end) signal. Substituting the value of $\hat{\mu}_{opt}$ from (10) into (11), we find that when filter adaptation stalls, the residual echo is

$$\sigma_r^2(n) = \min\left(\frac{\hat{\sigma}_r^2(n)\,\sigma_v^2}{2\hat{\sigma}_e^2(n) - \hat{\sigma}_r^2(n)},\ \sigma_v^2\right) \tag{12}$$

where the first argument of $\min(\cdot)$ is obtained by solving

$$\sigma_r^2(n)\left(2 - \frac{\hat{\sigma}_r^2(n)}{\hat{\sigma}_e^2(n)}\right) = \frac{\hat{\sigma}_r^2(n)}{\hat{\sigma}_e^2(n)}\,\sigma_v^2 \tag{13}$$

The result in (12) implies that the residual echo is bounded both by the background noise and by half of the estimated residual echo, whichever is lower. It is therefore important not to overestimate the residual echo by more than 3 dB, at least during double-talk.
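A quick numeric sanity check (our own, under the assumption that the estimates satisfy $\hat{\sigma}_e^2 = \hat{\sigma}_r^2 + \sigma_v^2$) confirms the two bounds just stated — the stalled residual never exceeds the noise floor, nor half of the estimated residual echo:

```python
def stalled_residual(sr2_hat, sv2):
    """Residual echo power at which adaptation stalls (cf. eq. (12)),
    assuming the error-power estimate is sigma_e^2 = sigma_r^2 + sigma_v^2."""
    se2_hat = sr2_hat + sv2
    return min(sr2_hat * sv2 / (2.0 * se2_hat - sr2_hat), sv2)

# Check the 3-dB margin over a range of estimated residual powers.
for sr2 in (0.1, 1.0, 10.0, 100.0):
    r = stalled_residual(sr2, sv2=1.0)
    assert r <= 1.0          # bounded by the noise floor
    assert r <= sr2 / 2.0    # bounded by half the estimated residual
```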

III. Effects of background noise and double-talk on the MDF algorithm

  The derivation in Section II assumes that $x(n)$ and $v(n)$ are white noise signals. Although this assumption clearly does not hold when performing acoustic echo cancellation of speech signals with the NLMS algorithm, we propose to apply the result to adaptive filters operating in the frequency domain. In this section, we focus on the multi-delay block frequency-domain (MDF) adaptive filter [3]. Adaptation in the MDF algorithm (and other block-frequency algorithms) is similar to applying the NLMS algorithm independently in each frequency bin, and it has been observed that the input signal in each bin is less correlated in time (across successive FFT frames) than the original time-domain signal. In addition, the learning rate $\mu(k,\varrho)$ can be made frequency-dependent. In this section, $\hat{Y}(k,\varrho)$ and $E(k,\varrho)$ denote the frequency-domain counterparts of $\hat{y}(n)$ and $e(n)$, where $k$ is the frequency index and $\varrho$ is the frame index.

  Assuming that the signals in each frequency bin of the MDF algorithm are uncorrelated in time, we approximate the optimal frequency-dependent learning rate as

$$\mu_{opt}(k,\varrho) \approx \frac{\sigma_r^2(k,\varrho)}{\sigma_e^2(k,\varrho)} \tag{14}$$

where $k$ is the discrete frequency and $\varrho$ is the frame index. To estimate the residual echo $\sigma_r^2(k,\varrho)$, we assume that the adaptive filter has a frequency-independent leakage coefficient $\eta(\varrho)$ that represents the misalignment of the filter. This leads to the estimate

$$\hat{\sigma}_r^2(k,\varrho) = \hat{\eta}(\varrho)\,\hat{\sigma}_{\hat{Y}}^2(k,\varrho) \tag{15}$$

where $\hat{\eta}(\varrho)$ is the estimated leakage coefficient. The advantage of this formulation is that it decomposes the residual echo estimate $\hat{\sigma}_r^2(k,\varrho)$ into a slowly-evolving but hard-to-estimate term ($\hat{\eta}(\varrho)$) and a fast-moving but easy-to-estimate term ($\hat{\sigma}_{\hat{Y}}^2(k,\varrho)$). The leakage coefficient $\eta(\varrho)$ is in fact the inverse of the echo return loss enhancement (ERLE) of the filter.

  For the filter not to diverge when double-talk starts, the learning rate must respond quickly to it. To that end, we use the instantaneous estimates $\hat{\sigma}_{\hat{Y}}^2(k,\varrho) = |\hat{Y}(k,\varrho)|^2$ and $\hat{\sigma}_e^2(k,\varrho) = |E(k,\varrho)|^2$, which, based on (10), leads to the learning rate

$$\mu(k,\varrho) = \min\left(\mu_{max},\ \hat{\eta}(\varrho)\,\frac{|\hat{Y}(k,\varrho)|^2}{|E(k,\varrho)|^2}\right) \tag{16}$$

where $\mu_{max}$ is a design parameter (always less than or equal to 1) that bounds the learning rate in practice and ensures that the adaptive filter remains stable.

  We see from (16) that the effects of filter misalignment and double-talk are decoupled. The learning rate can thus react quickly to double-talk even though the estimation of the residual echo (through the leakage coefficient) requires a longer time period.

  An important aspect that needs to be addressed is the initial conditions. When the filter is initialized, all weights are zero, so the $\hat{Y}(k,\varrho)$ signal is also zero, which makes the learning rate computed by (16) zero as well. To kick-start the adaptation process, the learning rate $\mu(k,\varrho)$ is set to a fixed constant (we use $\mu(k,\varrho) = 0.25$) for a short period equal to twice the filter length (counting only the portions where the $x(n)$ signal is non-zero). This procedure is only required when the filter is initialized, not when the echo path changes.
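The per-bin rule (16), together with the start-up constant just described, can be sketched per frame as follows. This is a simplified sketch, not the Speex implementation; the `warmup` flag and the default values are illustrative:

```python
import numpy as np

def mdf_learning_rate(Y_hat, E, eta_hat, mu_max=0.5, warmup=False, eps=1e-12):
    """Per-frequency learning rate for one MDF frame, eq. (16).

    Y_hat, E : spectra of the estimated echo and of the error for this frame;
    eta_hat  : scalar leakage estimate; mu_max : design upper bound.
    During initialization, all-zero weights make Y_hat (and hence (16))
    zero, so a fixed constant rate is used instead.
    """
    if warmup:
        return np.full(len(E), 0.25)               # fixed start-up rate
    ratio = eta_hat * np.abs(Y_hat) ** 2 / (np.abs(E) ** 2 + eps)
    return np.minimum(mu_max, ratio)               # eq. (16)

# Double-talk makes |E|^2 jump, so the rate collapses instantly in those bins.
quiet = mdf_learning_rate(np.ones(4), np.ones(4), eta_hat=0.5)
double_talk = mdf_learning_rate(np.ones(4), 10.0 * np.ones(4), eta_hat=0.5)
```

Note that the reaction to double-talk needs no leakage update at all: only the instantaneous error power in the denominator changes.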

A. Leakage Estimation

  We see from (16) that the adaptive learning rate depends heavily on the estimated leakage coefficient $\hat{\eta}(\varrho)$. We propose to estimate $\eta(\varrho)$ by exploiting the non-stationarity of the signal, using a linear regression between the power spectra of the estimated echo and of the output signal. This choice is based on the fact that the spectrum of the residual echo is highly correlated with the spectrum of the estimated echo, while there is no correlation between the spectrum of the echo and the spectrum of the noise.

  First, zero-mean versions of the power spectra are obtained using a first-order DC rejection filter:

$$P_{Y}(k,\varrho) = |\hat{Y}(k,\varrho)|^2 - \overline{P}_{Y}(k,\varrho), \qquad P_{E}(k,\varrho) = |E(k,\varrho)|^2 - \overline{P}_{E}(k,\varrho) \tag{17}$$

where $\overline{P}_{Y}$ and $\overline{P}_{E}$ are first-order recursive averages of $|\hat{Y}(k,\varrho)|^2$ and $|E(k,\varrho)|^2$, respectively.

Then $\hat{\eta}(\varrho)$ is the linear regression coefficient between the estimated echo power $P_Y(k,\varrho)$ and the output power $P_E(k,\varrho)$:

$$\hat{\eta}(\varrho) = \frac{\sum_k R_{EY}(k,\varrho)}{\sum_k R_{YY}(k,\varrho)} \tag{18}$$

where $R_{EY}(k,\varrho)$ and $R_{YY}(k,\varrho)$ are recursively averaged as

$$R_{EY}(k,\varrho) = (1-\beta(\varrho))\,R_{EY}(k,\varrho-1) + \beta(\varrho)\,P_{Y}(k,\varrho)\,P_{E}(k,\varrho)$$
$$R_{YY}(k,\varrho) = (1-\beta(\varrho))\,R_{YY}(k,\varrho-1) + \beta(\varrho)\,P_{Y}^2(k,\varrho) \tag{19}$$

with $\beta(\varrho) = \beta_0 \min\!\left(\hat{\sigma}_{\hat{Y}}^2(\varrho)/\hat{\sigma}_e^2(\varrho),\ 1\right)$.

Here $\beta_0$ is the base averaging rate for the leakage estimate, and $\hat{\sigma}_{\hat{Y}}^2(\varrho)$ and $\hat{\sigma}_e^2(\varrho)$ are the total (across frequency) powers of the estimated echo and of the output signal, respectively. The variable averaging parameter $\beta(\varrho)$ prevents the estimate from adapting when no echo is present.
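The estimator (17)–(19) can be sketched as a small stateful class — a simplified reconstruction, not the Speex code; the class name and the $\gamma$, $\beta_0$ defaults are assumptions for illustration:

```python
import numpy as np

class LeakageEstimator:
    """Sketch of the leakage estimate eta_hat, following eqs. (17)-(19).

    Zero-mean power spectra are obtained by subtracting running means
    (a first-order DC rejection), then a regression between estimated-echo
    and error spectra is recursively averaged.
    """
    def __init__(self, nbins, gamma=0.1, beta0=0.05):
        self.gamma, self.beta0 = gamma, beta0
        self.mean_Y = np.zeros(nbins)   # running mean of |Y_hat|^2
        self.mean_E = np.zeros(nbins)   # running mean of |E|^2
        self.R_EY = np.zeros(nbins)
        self.R_YY = np.zeros(nbins)

    def update(self, Y_hat, E, eps=1e-12):
        PY_inst = np.abs(Y_hat) ** 2
        PE_inst = np.abs(E) ** 2
        # eq. (17): zero-mean spectra via DC rejection
        P_Y = PY_inst - self.mean_Y
        P_E = PE_inst - self.mean_E
        self.mean_Y += self.gamma * (PY_inst - self.mean_Y)
        self.mean_E += self.gamma * (PE_inst - self.mean_E)
        # variable averaging rate: freezes when little echo is present
        beta = self.beta0 * min(PY_inst.sum() / (PE_inst.sum() + eps), 1.0)
        # eq. (19): recursive correlation averages
        self.R_EY = (1 - beta) * self.R_EY + beta * P_Y * P_E
        self.R_YY = (1 - beta) * self.R_YY + beta * P_Y ** 2
        # eq. (18): regression coefficient, clamped to [0, 1]
        return min(max(self.R_EY.sum() / (self.R_YY.sum() + eps), 0.0), 1.0)
```

When the error spectrum tracks the estimated echo spectrum (echo path change), the returned $\hat{\eta}$ approaches 1; when the error is uncorrelated with the echo (double-talk, noise), it stays near 0.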

B. Double Talk, Background Noise, and Echo Path Changes

  The adaptive learning rate described above can handle both double-talk and echo path changes without explicit modeling of either. From (16), we see that when double-talk occurs, the denominator $|E(k,\varrho)|^2$ increases rapidly, producing an instantaneous decrease in the learning rate that lasts only for the duration of the double-talk. In the presence of background noise, the learning rate depends on the presence of the echo signal as well as on the leakage estimate: as the filter misalignment becomes smaller, so does the learning rate.

  A major difficulty in double-talk detection is the need to distinguish double-talk from echo path changes, both of which cause a sudden increase in the filter error. Here, this distinction is made through the leakage estimate. During double-talk, there is little correlation between the power spectrum of the error and that of the estimated echo, so $\hat{\eta}(\varrho)$ remains small, and so does the learning rate. When the echo path changes, on the other hand, the two power spectra become strongly correlated, which causes $\hat{\eta}(\varrho)$ to increase rapidly; if the change is large and there is no double-talk, $\hat{\eta}(\varrho)$ can quickly bring the learning rate close to 1.


Figure 2: (a) Convolved far-end signal $y(n)$ (top), near-end signal $v(n)$ (middle), and captured signal $d(n)$ (bottom). (b) Short-term ERLE as a function of time. Gaps in the curves are due to the fact that the ERLE is undefined when the far-end signal is zero.


Figure 3: Estimated echo return loss enhancement (ERLE), computed as the inverse of the estimated leakage coefficient ($1/\hat{\eta}(\varrho)$), compared with the measured ERLE.

IV. Results and Discussion

  The proposed system is evaluated in an acoustic echo cancellation context with background noise, double-talk, and an echo path change. The two different impulse responses used are 1024 samples long and were measured from real recordings in a small office, with the microphones and loudspeakers placed on a table.

  The proposed algorithm¹ is compared with the Gansler double-talk detector [1], the normalized cross-correlation method [2], and a baseline without double-talk detection ("no DTD"). In the implementation of Gansler's algorithm, delay estimation is performed offline and the coherence threshold is set to 0.3, as this value was found to be optimal for the conditions at hand; we expect the performance of Gansler's algorithm to degrade if automatic estimation of these parameters were used. The optimal threshold for the normalized cross-correlation algorithm is also 0.3. Choosing $\mu_{max} = 0.5$ as the upper bound on the learning rate was found to give good results for our algorithm; in fact, finding $\mu_{max}$ is not difficult, since the algorithm is not sensitive to this parameter. For the other algorithms tested, a learning rate of $\mu = 0.2$ gave the best results.

  For a typical 32-s scenario, the near-end and far-end signals are shown in Fig. 2(a); the echo path changes after 16 s. The measured echo return loss enhancement (ERLE) is shown in Fig. 2(b) for all algorithms. Because of natural variations in algorithm behavior, the best-performing algorithm cannot be determined from this figure alone; we present it here to illustrate the behavior of the proposed algorithm. For example, it can be observed that when the echo path changes after 16 s, the proposed algorithm re-adapts faster than the other algorithms with double-talk detection, and almost as fast as the echo canceller with no double-talk detection at all.

  An estimate of the ERLE (computed as $1/\hat{\eta}(\varrho)$) is provided in Figure 3. It can be observed that the estimate roughly follows the measured ERLE, although it is clearly noisy. Most importantly, it almost never overestimates the residual echo (underestimates the ERLE) by more than 3 dB, as required by (12). Also, when the echo path changes, the estimate quickly drops to 0 dB, which is the desired behavior. Figure 4 shows how the learning rate varies as a function of time for all three algorithms. The effect of the leakage estimate can be clearly observed when the echo path changes at $t = 16$ s: the learning rate rises rapidly and remains much higher than that of the other algorithms for about 5 s. It can also be observed that the learning rate decreases as the filter becomes better adapted, unlike the Gansler and normalized cross-correlation algorithms, which do not take the filter misalignment into account.

  Figure 5 shows the average steady-state ERLE (ignoring the first 2 seconds of adaptation) for the data of Figure 2 at different near-end-to-echo ratios. Clearly, the proposed algorithm performs better than the Gansler and normalized cross-correlation algorithms in all conditions, with an average improvement of more than 4 dB over both. The perceptual quality of the output speech signal is also evaluated by comparing it with the near-end signal $v(n)$ using the Perceptual Evaluation of Speech Quality (PESQ), ITU-T Recommendations P.862 [8] and P.862.1 [9]. The perceptual quality shown in Figure 6 is evaluated over the entire file, including the adaptation time. Again, the proposed algorithm clearly performs better than the reference double-talk detectors. It is worth noting that the results in Figure 6 improve with more double-talk (unlike Figure 5) because the signal of interest is the near-end speech $v(n)$: the louder the near-end speech, the (relatively) weaker the echo.


Figure 4: Learning rate for all three algorithms, averaged over a 600-ms moving window. For the proposed algorithm, the learning rate is also averaged across all frequencies. The learning rates of the Gansler and normalized cross-correlation algorithms appear to take a continuum of values only because of the temporal averaging. The learning rate sometimes drops to zero when there is no far-end signal.

Figure 5: Steady-state ERLE (ignoring the first 2 seconds of adaptation) as a function of the near-end-to-echo ratio ($20\log_{10}(\sigma_v^2/\sigma_y^2)$). When the ratio is at or above 10 dB, the filter fails to converge in the "no DTD" case.

Figure 6: PESQ objective listening quality measure (LQO-MOS) as a function of the near-end-to-echo ratio ($20\log_{10}(\sigma_v^2/\sigma_y^2)$). When the ratio is at or above 10 dB, the filter provides no improvement in the "no DTD" case.

V. Conclusion

  We have demonstrated a novel method for adjusting the learning rate of a frequency-domain adaptive filter based on the current misalignment as well as on the amount of noise and double-talk present. The proposed method performs better than coherence-based double-talk detectors, uses no hard detection thresholds, and does not require explicit estimation of the echo path delay. Although the demonstration uses the MDF algorithm, we believe the technique is general enough to be applicable to other frequency-domain adaptive filtering algorithms.

  In future work, the residual echo estimate in (15) could be used as a residual echo estimator for further echo suppression, as proposed in [10]. In addition, more accurate methods of estimating the leakage coefficient should be investigated.

References

(The reference list [1]–[10] appears as an image in the original post.)

About the author:

  Jean-Marc Valin was born in Montreal, Canada, in 1976. He received the B.S., M.S., and Ph.D. degrees in electrical engineering from the University of Sherbrooke in 1999, 2001, and 2005, respectively. His doctoral research focused on bringing auditory capabilities to a mobile robotics platform, including sound source localization and separation.

  Since 2005, he has been a postdoctoral fellow at the CSIRO ICT Center in Marsfield, NSW, Australia. His research topics include acoustic echo cancellation and microphone array processing. He is the author of the Speex speech codec.

Footnote:

  1. The complete source code for the algorithm proposed in this paper is available at http://www.speex.org/ as part of the Speex package (version 1.1.12 or later).

Note: This article was translated by the blogger for his own study records, hopefully without hurting readability. Formulas and images are screenshots taken directly from the paper. If there is an error, please contact the blogger to correct it!


Origin blog.csdn.net/qq_44085437/article/details/127517171