If y(t)y(t) is an empirically recorded signal with and underlying description, g(t)g(t), a model for the noise addition
process transforming g(t) into y(t) is described by Equation Equation 5.
y
i
=
g
(
t
i
)
+
σ
ϵ
i
,
i
=
1
,
...
,
n
y
i
=
g
(
t
i
)
+
σ
ϵ
i
,
i
=
1
,
...
,
n
(5)where ϵiϵi are independent normal random variables N(0,1)N(0,1) and σσ represents the intensity of the noise
in y(t)y(t). Using this model, it follows that the objective of noise removal is, given a finite set of yiyi values,
reconstruct the original signal gg without assuming a particular structure for the signal.
The usual approach to noise removal models noise as a high frequency signal added to an original signal. Fourier
transform could be used to track this high frequency, ultimately removing it by adequate filtering. This noise removal
strategy is conceptually clear and efficient since depends only on calculating the DFT of a given signal. However, there
are some issues that must be taken into account. The most prominent of such issues ocurrs when the original signal has
important information associated to the same frequency as the noise. When a frequency domain representation of the signal
is obtained, filtering out this frequency will induce noticeable loss of information of the target signal.
In cases as the one described, the wavelets approach is more appropiated due to the fact that the signal will be
studied using a “dual” frequency-time representation, which allows separating noise frequencies from valuable signal
frequencies. Under this approach, noise will be represented as a consistent high frequency signal in the entire time scope
and so its identification will be easier than using Fourier analysis.
Once the noise representation is identified, the removal process starts. It has been proven that a suitable strategy
for noise removal consists in making the coefficients associated to the noise frequency equal to zero. This statement
represents a global perspective for noise removal, different methods for denoising differ in the way these coefficients are
tracked and taken out from the representation. The conceptual details of several of these methods are presented in the next
sections. The main reference for the methods discussed here is antoniadis2001
Before attempting to describe the methods is convenient to discuss an alternative definition for wavelet representation used
for noise removal. First, the description assumes that the representation is achieved using periodised wavelets bases on [0,1][0,1].
Also, the basis functions are generated by dilation and translation of a compactely supported scaling function φφ, also
called father wavelet and a familiar mother wavelet function, ψψ. ψψ must be associated with an r-regular
multiresolution analysis of L2(R)L2(R). An advantage of this approach is that generated wavelets families allow
integration of different kinds of smoothness and vanishing moments. This features lead to the fact that many signals in practice
can be represented sparsely (with few wavelets coefficients) and uniquely under wavelets decomposition. The decomposition
expresion using a father and a mother wavelet is depicted in Equation Equation 6.
g
(
t
)
=
∑
k
=
0
2
j
0
-
1
α
j
0
k
φ
j
0
k
(
t
)
+
∑
j
=
j
0
∞
∑
k
=
0
2
j
-
1
β
j
k
ψ
j
k
(
t
)
,
j
0
≥
0
,
t
∈
[
0
,
1
]
g
(
t
)
=
∑
k
=
0
2
j
0
-
1
α
j
0
k
φ
j
0
k
(
t
)
+
∑
j
=
j
0
∞
∑
k
=
0
2
j
-
1
β
j
k
ψ
j
k
(
t
)
,
j
0
≥
0
,
t
∈
[
0
,
1
]
(6)where j0j0 is a primary resolution level, and αj0kαj0k and βjkβjk are calculated as the inner products shown in Equations
Equation 7 and Equation 8
α
j
0
k
=
〈
g
,
φ
j
0
k
〉
=
∫
0
1
g
(
t
)
φ
j
0
k
(
t
)
d
t
,
j
0
≥
0
,
k
=
0
,
1
,
...
,
2
j
0
-
1
α
j
0
k
=
〈
g
,
φ
j
0
k
〉
=
∫
0
1
g
(
t
)
φ
j
0
k
(
t
)
d
t
,
j
0
≥
0
,
k
=
0
,
1
,
...
,
2
j
0
-
1
(7)
β
j
k
=
〈
g
,
ψ
j
k
〉
=
∫
0
1
g
(
t
)
ψ
j
k
(
t
)
d
t
,
j
≥
j
0
≥
0
,
k
=
0
,
1
,
...
,
2
j
-
1
β
j
k
=
〈
g
,
ψ
j
k
〉
=
∫
0
1
g
(
t
)
ψ
j
k
(
t
)
d
t
,
j
≥
j
0
≥
0
,
k
=
0
,
1
,
...
,
2
j
-
1
(8)When the discrete wavelet transform is used, the coefficients cj0kcj0k, discrete scaling coefficients
and djkdjk, discrete wavelet coefficients are used instead of the continous parameters αj0kαj0k and
βjkβjk. The discrete parameters can be approximately calculated by applying a nn factor to the continous
coefficients.
Finally, when the DWT is applied to Equation Equation 5, these expressions are obtained (Equations Equation 9 and Equation 10):
c
^
j
0
k
=
c
j
0
k
+
σ
ϵ
j
k
,
k
=
0
,
1
,
...
,
2
j
0
-
1
c
^
j
0
k
=
c
j
0
k
+
σ
ϵ
j
k
,
k
=
0
,
1
,
...
,
2
j
0
-
1
(9)
d
^
j
k
=
d
j
k
+
σ
ϵ
j
k
,
j
=
j
0
,
...
,
J
-
1
,
k
=
0
,
...
,
2
j
-
1
d
^
j
k
=
d
j
k
+
σ
ϵ
j
k
,
j
=
j
0
,
...
,
J
-
1
,
k
=
0
,
...
,
2
j
-
1
(10)The original and simpler way to remove noise from a contaminated signal consists in modifying the wavelets coefficients
in a smart way such that the “small” coefficients associated to the noise are basically neglected. The updated
coefficients can thus be used to reconstruct the original underlying function free from the effects of noise. It is
implicit in the strategy that only a few “large” wavelets coefficients djkdjk are associated with the original signal,
and that their identification and elimination of any other coefficients will allow a perfect reconstruction of the
underlying signal gg. Several methods use this idea and implement it in different ways. When attempting to descrease
the influence of noise wavelets coefficients it is possible to do it in particular ways, also the need of information
of the underlying signal leads to different statistical treatments of the available information.
In the linear penalization method every wavelet coefficient is affected by a linear shrinkage particular associated to
the resolution level of the coefficient. A mathematical expression for this type of approach using linear
shrinkage is shown in Equation Equation 11.
d
˜
j
k
=
d
^
j
k
1
+
λ
2
2
j
s
d
˜
j
k
=
d
^
j
k
1
+
λ
2
2
j
s
(11)In Equation Equation 11, parameter ss is the known smoothness index of the underlying signal gg, while parameter
λλ is a smooting factor whose determination is critical for this type of analysis.
It must be said that linear thresholding is adequate only for spatially homogenous signal with important levels of regularity.
When homegeneity and regularity conditions are not met nonlinear wavelet thresholding or shrinkage methods are usually
more suitable.
donoho1995 and donoho1995b proposed a nonlinear strategy for thresholding.
Under their approach, the thresholding can be done by implementing either a hard or a soft thresholding rule. Their mathematical
expressions are shown in Equation Equation 12 and Equation Equation 13 respectively.
In both methods, the role of the parameter λλ as a threshold value is critical as the estimator leading to destruction,
reduction, or increase in the value of a wavelet coefficient.
δ
λ
H
(
d
^
j
k
)
=
0
i
f
|
d
^
j
k
|
≤
λ
d
^
j
k
i
f
|
d
^
j
k
|
>
λ
δ
λ
H
(
d
^
j
k
)
=
0
i
f
|
d
^
j
k
|
≤
λ
d
^
j
k
i
f
|
d
^
j
k
|
>
λ
(12)
δ
λ
S
(
d
^
j
k
)
=
0
i
f
|
d
^
j
k
|
≤
λ
d
^
j
k
-
λ
i
f
d
^
j
k
>
λ
d
^
j
k
+
λ
i
f
d
^
j
k
<
-
λ
δ
λ
S
(
d
^
j
k
)
=
0
i
f
|
d
^
j
k
|
≤
λ
d
^
j
k
-
λ
i
f
d
^
j
k
>
λ
d
^
j
k
+
λ
i
f
d
^
j
k
<
-
λ
(13)Several authors have discussed the properties and limitations of these two strategies; hard thresholding, due to its
induced discontinuity, can be unstable and sensitive even to small changes in the data. On the other hand, soft thresholding
can create unnecessary bias when the true coefficients are large. Although more sophisticated methods has been introduced
to account for the drawbacks of the described nonlinear strategies, the discussion in this report is limited to the hard
and soft approaches.
One apparent problem in applying wavelet thresholding methods is the way of selecting an appropriate value for
the threshold, λλ. There are indeed several approaches for specifying the value of the parameter in question.
In a general sense, these strategies can be classified in two groups: global thresholds and
level-dependent thresholds. Global threshold implies the selection of one λλ value, applied
to all the wavelet coefficients. Level-dependent thresholds implies that a (possibly) different threshold value
lambdajlambdaj is applied for each resolution level. All the alternatives require an estimate of the noise level σσ.
The standard deviation of the data values is clearly not a good estimator, unless the underlying response function gg
is reasonably flat. donoho1995 considered estimating σσ in the wavelet domain by using the expression in
Equation Equation 14.
σ
^
=
m
e
d
i
a
n
(
|
d
^
J
-
1
,
k
|
)
0
.
6745
,
k
=
0
,
1
,
...
,
2
J
-
1
-
1
σ
^
=
m
e
d
i
a
n
(
|
d
^
J
-
1
,
k
|
)
0
.
6745
,
k
=
0
,
1
,
...
,
2
J
-
1
-
1
(14)donoho1995 obtained an optimal threshold value λMλM by minimizing the risk involved in estimating a
function. The porposed minimax threshold depends of the available data and also takes into account the noise
level contaminating the signal (Equation Equation 15).
λ
M
=
σ
^
λ
n
*
λ
M
=
σ
^
λ
n
*
(15)Where, λn*λn* is equal to the value of λλ satisfying Equation Equation 16
λ
n
*
=
inf
λ
sup
d
R
λ
(
d
)
n
-
1
+
R
o
r
a
c
l
e
(
d
)
λ
n
*
=
inf
λ
sup
d
R
λ
(
d
)
n
-
1
+
R
o
r
a
c
l
e
(
d
)
(16)In Equation Equation 16, Rλ(d)Rλ(d) is calculated following Equation Equation 17.
R
λ
(
d
)
=
E
(
δ
λ
d
^
)
2
R
λ
(
d
)
=
E
(
δ
λ
d
^
)
2
(17)while Roracle(d)Roracle(d) is an operators called oracle used to account for the risk associated to the modification
of the value of a given wavelet coeficient. Two of these oracles were introduced by donoho1995:
diagonal linear projection (DLP), and diagonal linear shrinker (DLS). The Equations Equation 18 and Equation 19
show the expressions for the two oracles.
R
o
r
a
c
l
e
D
L
P
(
d
)
=
m
i
n
(
d
2
,
1
)
R
o
r
a
c
l
e
D
L
P
(
d
)
=
m
i
n
(
d
2
,
1
)
(18)
R
o
r
a
c
l
e
D
L
S
(
d
)
=
d
2
d
2
+
1
R
o
r
a
c
l
e
D
L
S
(
d
)
=
d
2
d
2
+
1
(19)antoniadis2001 provided values of the minimax threshold for both the hard and soft nonlinear thresholding rules.
For the soft rule, 1.669 and 2.226 for nn equal to 128 and 1024; for the hard rule, 2.913 and 3.497 again for nn
equal to 128 and 1024.
donoho1995 proposed this threshold as an alternative to the minimax thresholds, applied to all the wavelet
coefficients. The universal threshold is defined in Equation Equation 20.
λ
U
=
σ
^
2
log
n
λ
U
=
σ
^
2
log
n
(20)This threshold is easy to remember and its implementation in software is simpler and the optimization problem implicit
in the minimax method are avoided. Also, the universal threshold ensures, with high probability, that every sample
in the wavelet transform in which the underlying function is exactly zero will be estimated as zero, although the
convergance rate (depending in the size of the sample) is slow.
It has been noted that wavelet thresholding with either minimax thresholds or the universal threshold presents some
inconvenient features. In particular, in the vicinity of discontinuities, these wavelet thresholds can exhibit
pseudo-Gibbs phenomena. While these phenomena are less pronounced than in the case of Fourier analysis
and also are present in a local scale, this situation represents a challenge for the thresholding methods.
coifman1995 proposed the use of the translation invariant wavelet thresholding scheme.
The idea is to correct mis-alignments between features in the studied signal and features in the basis used
for the decomposition. When the signal contains an important number of discontinuities, the method applies
a range of shifts to it, and average the results obtained after such transformations.
If a empirical contaminated signal y[i],(i=1,...,n)y[i],(i=1,...,n) is provided, the tranlation invariant wavelet
thresholding estimator is calculated as (Equation Equation 21):
g
^
T
I
=
1
n
∑
k
=
1
n
(
W
S
k
)
'
δ
λ
(
W
S
k
y
)
g
^
T
I
=
1
n
∑
k
=
1
n
(
W
S
k
)
'
δ
λ
(
W
S
k
y
)
(21)where δλδλ is either the hard of soft thresholding rule, WW is the size nn orthogonal matrix
associated to the DWT, and SkSk is the shift kernel defined as:
S
k
=
O
k
×
(
n
-
k
)
I
k
×
k
I
(
n
-
k
)
×
(
n
-
k
)
O
(
n
-
k
)
×
k
S
k
=
O
k
×
(
n
-
k
)
I
k
×
k
I
(
n
-
k
)
×
(
n
-
k
)
O
(
n
-
k
)
×
k
(22)In Equation Equation 22, II is the identity matrix and OO stands for a zero matrix with dimensions as indicated
in the expression.
donoho1995 proposed a procedure to select a threshold value λjλj for every resolution level jj.
The method uses Steinâs unbiased risk criterion citep([7]) to get an unbiased estimate of the l2l2-risk.
In mathematical terms, given a set X1,...,XsX1,...,Xs of variables distributed as N(μi,1)N(μi,1) with i=1,...,si=1,...,s, the problem consists in estimate the vector of means with minimum l2-l2-risk. It turns out an estimator of μμ,
can be describes as μ^(X)=X+g(X)μ^(X)=X+g(X), with gg a function from RsRs to RsRs is weakly
differentiable. With this information, the risk of the estimation can be described as (Equation Equation 23):
E
μ
|
|
μ
^
(
X
)
-
μ
|
|
2
=
s
+
E
μ
|
|
g
(
X
)
|
|
2
+
2
∇
g
(
X
)
E
μ
|
|
μ
^
(
X
)
-
μ
|
|
2
=
s
+
E
μ
|
|
g
(
X
)
|
|
2
+
2
∇
g
(
X
)
(23)with,
∇
g
≡
∑
i
=
1
s
∂
g
i
∂
x
i
∇
g
≡
∑
i
=
1
s
∂
g
i
∂
x
i
(24)If an operator SURE is defined as (Equation Equation 25):
S
U
R
E
(
λ
;
X
)
=
s
-
2
·
#
{
i
:
|
X
i
|
≤
λ
}
+
[
min
(
|
X
i
|
,
λ
)
]
2
S
U
R
E
(
λ
;
X
)
=
s
-
2
·
#
{
i
:
|
X
i
|
≤
λ
}
+
[
min
(
|
X
i
|
,
λ
)
]
2
(25)where the operator #{A}#{A} returns the cardinality of the set A, it is found that SURE is an unbiased estimate
of the l2-l2-risk, i.e,
E
μ
|
|
μ
^
(
X
)
-
μ
|
|
2
=
S
U
R
E
(
λ
;
X
)
E
μ
|
|
μ
^
(
X
)
-
μ
|
|
2
=
S
U
R
E
(
λ
;
X
)
(26)Now, the threshold λλ is found by minimizing SURE over the set XX of give data. Extending this principle to
the entire set of resolution levels, an expression for λjsλjs is found (Equation Equation 27):
λ
j
s
=
arg
min
0
≤
λ
≤
λ
U
S
U
R
E
(
λ
;
d
^
j
k
σ
^
)
,
j
=
j
0
,
...
,
J
-
1
;
k
=
0
,
...
,
2
j
-
1
λ
j
s
=
arg
min
0
≤
λ
≤
λ
U
S
U
R
E
(
λ
;
d
^
j
k
σ
^
)
,
j
=
j
0
,
...
,
J
-
1
;
k
=
0
,
...
,
2
j
-
1
(27)where λUλU is the universal threshold, and σ^σ^ is the estimator of the noise level (Equation Equation 14).
Thresholding approaches resorting to term-by-term modification on the wavelets coefficients attempt to balance
variance and bias contribution to the mean squared error in the estimation of the underlying signal gg. However,
it has been proven that such balance is not optimal. Term-by-term thresholding end sup removing to many terms leading
to estimation prone to bias and with a slower convergance rate due to the number of operations involved.
A useful resource to improve the quality of the aforementioned balanced is by using information of the set of data
associated to a particular wavelet coefficient. In order to do so, a block strategy for thresholded is proposed.
The main idea consists in isolating a block of wavelet coefficients and based upon the information collected about
the entire set make a decision about decreasing or even entirely discard the group. This procedure will allow faster
manipulation of the information and accelerated convergence rates.
cai2001 considered an overlapping block thresholding estimator by modifying the nonoverlapping block
thresholding estimator citep([2]). The effect is that the treatment of empirical wavelet coefficients
in the middle of each block depends on the data in the whole block. At each resolution level, this method packs
wavelet coefficients d^jkd^jk into nonoverlapping blocks (jbjb) of length L0L0. Following this, the blocks
are extended in each direction an amount L1=max(1;[L02])L1=max(1;[L02]), generating overlapping blocks
(jBjB) of augmented length L=L0+2L1L=L0+2L1.
If S(jB)2S(jB)2 is the L2-L2-energy of the empirical signal in the augmented block jBjB, the wavelet
coefficients in the blocks jbjb will be estimated simultaneously using the expression in Equation Equation 28
d
˘
j
k
(
j
b
)
=
max
0
,
S
(
j
B
)
2
-
λ
L
σ
^
2
S
(
j
B
)
2
d
^
j
k
d
˘
j
k
(
j
b
)
=
max
0
,
S
(
j
B
)
2
-
λ
L
σ
^
2
S
(
j
B
)
2
d
^
j
k
(28)Once the estimated wavelet coefficients d˘jk(jb)d˘jk(jb) have been calculated, an estimation of the underlying
signal gg can be obtained through using the new wavelet coefficients and the unmodified scaling coefficients
c^j0kc^j0k in the IDWT.
The results from this method (NeigBlock) presented in this document used a value of L0=[log(n2)]L0=[log(n2)]
and a value of λ=4.50524λ=4.50524 as suggested by cai2001.
From Equations Equation 9 and Equation 10 it can be established that the empirical scaling and wavelet coefficients,
conditional on their respective underlying coefficients, are independently distributed, i.e:
c
^
j
0
k
/
c
j
0
k
,
σ
2
∼
N
(
c
j
0
k
,
σ
2
)
c
^
j
0
k
/
c
j
0
k
,
σ
2
∼
N
(
c
j
0
k
,
σ
2
)
(29)
d
^
j
k
/
c
j
k
,
σ
2
∼
N
(
d
j
k
,
σ
2
)
d
^
j
k
/
c
j
k
,
σ
2
∼
N
(
d
j
k
,
σ
2
)
(30)The Bayesian approach imposes an apriori model for the wavelets coefficients designed to capture the sparseness
of the wavelet expansion common to most applications. An usual prior model for each wavelet coefficient d^jkd^jk
is a mixture of two distributions, one of them associated to negligable coefficients and the other to significant
coefficients. Two types of mixtures have been widely used. One of them employs two normal distributions while the
other uses one normal distribution and one point mass at zero.
After mathematical manipulation, it can be shown that an estimator for the underlying signal can be written as
(Equation Equation 31):
g
^
B
R
(
t
)
=
∑
k
=
0
2
j
0
-
1
c
^
j
0
k
n
φ
j
0
k
(
t
)
+
∑
j
=
j
0
J
-
1
∑
k
=
0
2
j
-
1
B
R
(
d
j
k
|
(
d
j
k
,
σ
2
)
)
n
ψ
j
k
(
t
)
g
^
B
R
(
t
)
=
∑
k
=
0
2
j
0
-
1
c
^
j
0
k
n
φ
j
0
k
(
t
)
+
∑
j
=
j
0
J
-
1
∑
k
=
0
2
j
-
1
B
R
(
d
j
k
|
(
d
j
k
,
σ
2
)
)
n
ψ
j
k
(
t
)
(31)i.e. the scaling coefficients are estimated by the empirical scaling coefficients while the wavelet coefficients are
estimated by a Bayesian rule (BR), taking into account the obtained empirical wavelet coefficient and the noise level.
huang2000 proposed a method that takes into account the value of the prior mean for each wavelet
coefficient, by introducing a estimator for the parameter into the general wavelet shrinkage model. These authors
assumed thatthe undelying signal is composed of a piecewise deterministic portion with an added zero mean stochastic
part.
If c^j0c^j0 is the vector of empirical scaling coefficients, d^jd^j the
vector of empirical wavelet coefficients, cj0cj0 the vector of underlying scaling coefficients, and
djdj the vector of underlying wavelet coefficients, then the Bayesian model (Equation Equation 32):
ω
/
(
β
,
σ
2
)
∼
N
(
β
,
σ
2
I
)
ω
/
(
β
,
σ
2
)
∼
N
(
β
,
σ
2
I
)
(32)with ω=(c^j0,d^j0,...,d^J-1')'ω=(c^j0,d^j0,...,d^J-1')' and the underlying
signal β=(cj0',dj0',...,dJ-1')'β=(cj0',dj0',...,dJ-1')' is assumed to follow an
apriori distribution (Equation Equation 33)
β
/
(
μ
,
θ
)
∼
N
(
μ
,
Σ
(
θ
)
)
β
/
(
μ
,
θ
)
∼
N
(
μ
,
Σ
(
θ
)
)
(33)where μμ is the deterministic mean structure and Σ(θ)Σ(θ) accounts for the uncertainty and value correlation in
the underlying signal. Notice that if ηη following a distribution N(0,Σ(θ))N(0,Σ(θ)) is defined as the stochastic
component representing small variation (high frequency) in the signal, then μμ can be interpretated as the stochastic
component accounting for the large-scale variation in ββ. So, it is possible to rewrite ββ as (Equation Equation 34),
Using this model, a shrinkage rule can be established by calculating the mean of ββ conditional on σ2σ2 which
is expressed as (Equation Equation 35),
E
β
/
(
ω
,
σ
2
)
=
μ
+
Σ
(
θ
)
(
Σ
(
θ
)
+
σ
2
I
)
(
ω
-
μ
)
E
β
/
(
ω
,
σ
2
)
=
μ
+
Σ
(
θ
)
(
Σ
(
θ
)
+
σ
2
I
)
(
ω
-
μ
)
(35)