Minimize instantaneous squared error
e
k
2w=
y
k
-xkTw2
e
k
w
2
y
k
x
k
w
2
(1)
ŵk=ŵk-1+μxk
e
k
w
k
w
k
1
μ
x
k
e
k
(2)
Where
wk
w
k
is the new weight vector,
wk-1
w
k
1
is the old weight vector, and
μxk
e
k
μ
x
k
e
k
is a small step in the instantaneous error gradient
direction.
Define
vk=wk-wopt
v
k
w
k
w
opt
(3)
Where
wopt
w
opt
is the optimal weight vector and
ε
k
=
y
k
-xkTwopt
ε
k
y
k
x
k
w
opt
(4)
where
ε
k
ε
k
is the minimum error. The stochastic difference
equation is:
vk=I-μxkxkTvk-1+μxk
ε
k
v
k
I
μ
x
k
x
k
v
k
1
μ
x
k
ε
k
(5)
Show that (tightness)
limB→∞max{Pr∥vk∥≥B}=0
B
v
k
B
0
(6)
With probability 1, the weight error vector is bounded for all
kk.
Chebyshev's inequality is
Pr∥vk∥≥B≤E∥vk∥2B2
v
k
B
v
k
2
B
2
(7)
and
Pr∥vk∥≥B=1B2∥Evk∥2+σvk2
v
k
B
1
B
2
v
k
2
v
k
(8)
where
∥Evk∥2
v
k
2
is the squared bias. If
∥Evk∥2+σvk2
v
k
2
v
k
is finite for all
kk,
then
limB→∞Pr∥vk∥≥B=0
B
v
k
B
0
for all
kk.
Also,
σvk2=trEvkvkT
v
k
tr
v
k
v
k
(9)
Therefore
σvk2
v
k
is finite if the diagonal elements of
Γ
k
≡EvkvkT
Γ
k
v
k
v
k
are bounded.
∥Evk∥→0
v
k
0
as
k→∞
k
. Take expectation of Equation 5 using
smoothing property to simplify the calculation. We have
convergence in mean if
-
Rxx
R
xx
is positive definite (invertible).
-
μ<2
λ
max
Rxx
μ
2
λ
max
R
xx
.
Show that
Γ
k
=EvkvkT
Γ
k
v
k
v
k
, the weight vector error covariance is bounded for
all kk.
We
could have
Evk→0
v
k
0
, but
σvk2→∞
v
k
; in which case the algorithm would not be
stable.
Recall that it is fairly straightforward to show that the
diagonal elements of the transformed covariance
Ck=UΓkUT
C
k
U
Γ
k
U
tend to zero if
μ<1
λ
max
Rxx
μ
1
λ
max
R
xx
(
U
U is the eigenvector matrix of
Rxx
R
xx
;
Rxx=UDUT
R
xx
U
D
U
). The diagonal elements of
Ck
C
k
were denoted by
γ
k
,
i
∀i,i=1…p
γ
k
,
i
i
i
1
…
p
.
σvk2=trΓk=trUTCkU=trCk=∑i=1p
γ
k
,
i
v
k
tr
Γ
k
tr
U
C
k
U
tr
C
k
i
1
p
γ
k
,
i
Thus, to guarantee boundedness of
σvk2
v
k
we need to show that the "steady-state" values
γ
k
,
i
→
γ
i
<∞
γ
k
,
i
γ
i
.
We showed that
γ
i
=μα+
σ
ε
221-μ
λ
i
γ
i
μ
α
σ
ε
2
2
1
μ
λ
i
(10)
where
σ
ε
2=E
ε
k
2
σ
ε
2
ε
k
2
,
λ
i
λ
i
is the
i
th
i
th
eigenvalue of
Rxx
R
xx
(
Rxx=U
λ
1
…0⋮⋱⋮0…
λ
p
UT
R
xx
U
λ
1
…
0
⋮
⋱
⋮
0
…
λ
p
U
), and
α=c
σ
ε
21-c
α
c
σ
ε
2
1
c
.
0<c=12∑i=1pμ
λ
i
1-μ
λ
i
<1
0
c
1
2
i
1
p
μ
λ
i
1
μ
λ
i
1
(11)
We found a sufficient condition for
μμ that guaranteed that the
steady-state
γ
i
γ
i
's (and hence
σvk2
v
k
) are bounded:
μ<23∑i=1p
λ
i
μ
2
3
i
1
p
λ
i
Where
∑i=1p
λ
i
=trRxx
i
1
p
λ
i
tr
R
xx
is the input vector energy.
With this choice of
μμ we have:
- convergence in mean
- bounded steady-state variance
This implies
limB→∞max{Pr∥vk∥≥B}=0
B
v
k
B
0
(12)
In other words, the LMS algorithm is stable about the optimum
weight vector
wopt
w
opt
.
Recall that
e
k
=
y
k
-xkTwk-1
e
k
y
k
x
k
w
k
1
(13)
and
Equation 4. These imply
e
k
=
ε
k
-xkTvk-1
e
k
ε
k
x
k
v
k
1
(14)
where
vk=wk-wopt
v
k
w
k
w
opt
. So the MSE
E
e
k
2=
σ
ε
2+Evk-1TxkxkTvk-1=
σ
ε
2+EEvk-1TxkxkTvk-1|
x
n
ε
n
∀n,n<k=
σ
ε
2+Evk-1TRxxvk-1=
σ
ε
2+EtrRxxvk-1vk-1T=
σ
ε
2+trRxxΓk-1
e
k
2
σ
ε
2
v
k
1
x
k
x
k
v
k
1
σ
ε
2
x
n
ε
n
n
n
k
v
k
1
x
k
x
k
v
k
1
σ
ε
2
v
k
1
R
xx
v
k
1
σ
ε
2
tr
R
xx
v
k
1
v
k
1
σ
ε
2
tr
R
xx
Γ
k
1
(15)
Where
trRxxΓk-1≡
α
k
1
→α=c
σ
ε
21-c
tr
R
xx
Γ
k
1
α
k
1
α
c
σ
ε
2
1
c
. So the limiting MSE is
ε
=limk→∞E
e
k
2=
σ
ε
2+c
σ
ε
21-c=
σ
ε
21-c
ε
k
e
k
2
σ
ε
2
c
σ
ε
2
1
c
σ
ε
2
1
c
(16)
Since
0<c<1
0
c
1
was required for convergence,
ε
>
σ
ε
2
ε
σ
ε
2
so that we see noisy adaptation leads to an MSE
larger than the optimal
E
ε
k
2=E
y
k
-xkTwopt2=
σ
ε
2
ε
k
2
y
k
x
k
w
opt
2
σ
ε
2
(17)
To quantify the increase in the MSE, define the so-called
misadjustment:
M=
ε
-
σ
ε
2
σ
ε
2=
ε
σ
ε
2-1=α
σ
ε
2=c1-c
M
ε
σ
ε
2
σ
ε
2
ε
σ
ε
2
1
α
σ
ε
2
c
1
c
(18)
We would of course like to keep
MM as small as possible.
Fast adaptation and quick convergence require
that we take steps as large as possible. In other words,
learning speed is proportional to
μμ; larger
μμ means faster convergence. How
does μμ affect the
misadjustment?
To guarantee convergence/stability we require
μ<23∑i=1p
λ
i
Rxx
μ
2
3
i
1
p
λ
i
R
xx
Let's assume that in fact
μ≪1∑i=1p
λ
i
≪
μ
1
i
1
p
λ
i
so that there is no problem with convergence. This condition
implies
μ≪1
λ
i
≪
μ
1
λ
i
or
μ
λ
i
≪1
∀i,i=1…p
≪
μ
λ
i
1
i
i
1
…
p
. From here we see that
c=12∑i=1pμ
λ
i
1-μ
λ
i
≈12μ∑i=1p
λ
i
≪1
≪
c
1
2
i
1
p
μ
λ
i
1
μ
λ
i
1
2
μ
i
1
p
λ
i
1
(19)
This misadjustment
M=c1-c≈c=12μ∑i=1p
λ
i
M
c
1
c
c
1
2
μ
i
1
p
λ
i
(20)
This shows that larger step size
μμ leads to larger
misadjustment.
Since we still have convergence in mean, this
essentially means that with a larger step size we "converge"
faster but have a larger variance (rattling) about
wopt
w
opt
.
small μμ implies
- small misadjustment in steady-state
- slow adaptation/tracking
large
μμ implies
- large misadjustment in steady-state
- fast adaptation/tracking
wopt=11
w
opt
1
1
xk∼01001
x
k
0
1
0
0
1
y
k
=xkTwopt+
ε
k
y
k
x
k
w
opt
ε
k
ε
k
∼00.01
ε
k
0
0.01
initialization
w0=00
w
0
0
0
and
wk=wk-1+μxk
e
k
∀k,k≥1
w
k
w
k
1
μ
x
k
e
k
k
k
1
, where
e
k
=
y
k
-xkTwk-1
e
k
y
k
x
k
w
k
1