In Lecture 4 and 15, we investigated the problem of denoising a
smooth signal in additive white noise. In Lecture 4, we considered
Lipschitz functions and showed that by filling constants on a
uniform partition of width n-1/3n-1/3 we can achieve an n-2/3n-2/3
rate of MSE convergence.
In Lecture 15, we considered Holder-αα smooth functions, and
we demonstrated that by automatically selecting partition width
and using polynomial fits we can obtain a MSE convergence rate of
n-2α/2α+1n-2α/2α+1, substantially better when α>1α>1. Also important is the fact that we don't need to know the
value of αα a priori. The estimator f^nf^n is
fundamentally different than its counterpart in Lecture 4.
In both cases f^n(t)f^n(t) is a linear function (polynomial on
constant fit) of the data in each interval of the underlying
partition. In Lecture 4, the partition was independent of the
data, and so the overall estimator is a linear function of
the data
.
However, in Lecture 15 the partition itself was selected based on
the data. Consequently, f^n(t)f^n(t) is a non-linear function
of the data
. Linear estimators (linear functions of the data)
cannot adapt to unknown degrees of smoothness. In this lecture, we
lay the groundwork for one more important extension in the
denoising application - spatial adaptivity. That is, we would like
to construct estimators that not only adapt to unknown degrees of
global smoothness, but that also adapt to spatially varying
degrees of smoothness.
We will focus on the approximation theoretic aspects of the
problem in this lecture, considering tree-based approximations and
wavelet expansions. In the next lecture, we will apply these
results to the denoising problem, this will bring us up to date
with the current state-of-the-art in denoising and non-parametric
estimation.
Recall that Holder spaces contain smooth functions that are well
approximated with polynomials or piecewise polynomial functions.
Holder spaces are quite large and contain many interesting
signals. However, Holder spaces are still inadequate in many
applications. Often, we encounter functions that are not smooth
everywhere; they contain discontinuities, jumps, spikes, etc.
Indeed, the "singularities" (or non-smooth points) can be the most
interesting and informative aspects of the functions.
Example 1
Functions not smooth everywhere.
Furthermore, functions of interest may possess different degrees
of smoothness in different regions.
Example 2
Functions with different degrees of smoothness.
Let Bα(Cα)Bα(Cα) denote the set of all functions that are
Hα(Cα)Hα(Cα) everywhere except on a set of measure zero.
To simplify the notation, we won't explicitly identify the domain
(e.g., [0,1][0,1] or [0,1]d[0,1]d); that will be clear from the context.
Example 3
Sets of measure zero.
Let's consider a 1-D case first.
Let f∈Bα(Cα)f∈Bα(Cα) and consider approximating ff by
a piecewise polynomial function on a uniform partition.
If ff is Holder-αα smooth everywhere, then by using an
appropriate partition width k-1k-1 and fitting degree
⌈α⌉⌈α⌉ polynomials on each interval we have an
approximation fkfk satisfying
|
f
(
t
)
-
f
k
(
t
)
|
≤
C
α
k
-
α
|
f
(
t
)
-
f
k
(
t
)
|
≤
C
α
k
-
α
(1)
and
|
|
f
-
f
k
|
|
L
2
2
=
O
(
k
-
2
α
)
|
|
f
-
f
k
|
|
L
2
2
=
O
(
k
-
2
α
)
(2)
.
However, if there is a discontinuity then for tt in the interval
containing the discontinuity the difference
|
f
(
t
)
-
f
k
(
t
)
|
|
f
(
t
)
-
f
k
(
t
)
|
(3)
will not be small.
Example 4
Suppose ff is piecewise Lipschitz and fkfk ia a piecewise
constant.
|
f
(
t
)
-
f
k
(
t
)
|
≈
Δ
|
f
(
t
)
-
f
k
(
t
)
|
≈
Δ
(4)
where ΔΔ is a constant
equal to average of ff on right and left side of discontinuity in
this interval.
⇒
|
|
f
-
f
k
|
|
L
2
2
=
O
(
k
-
1
⇒
|
|
f
-
f
k
|
|
L
2
2
=
O
(
k
-
1
(5)
where k-1k-1 is
the width of the interval. Notice this rate is quite slow.
This problem naturally suggests the following remedy: use very
small intervals near discontinuities and larger intervals in
smooth regions. Specifically, suppose we use intervals of width
k-2αk-2α to contain the discontinuities and the intervals of
width k-1k-1 elsewhere. Then accordingly piecewise polynomial
approximation f˜kf˜k satisfies
|
|
f
-
f
˜
k
|
|
L
2
2
=
O
(
k
-
2
α
)
|
|
f
-
f
˜
k
|
|
L
2
2
=
O
(
k
-
2
α
)
(6)
.
We can accomplish this need for "adaptive resolution" or
"multiresolution" using recursive partitions and trees.
We discussed this idea already in our examination of
classification trees. Here is the basic idea again, graphically.
Consider a function f∈Bα(Cα)f∈Bα(Cα) that contains no
more than m points of discontinuity, and is Hα(Cα)Hα(Cα)
away from these points.
Lemma 1
Consider a complete RDP with n intervals, then there exists an
associated pruned RDP with O(klogn)O(klogn) intervals, such that an
associated piecewise degree ⌈α⌉⌈α⌉ polynomial
approximation (˜f)k(˜f)k, has a squared approximation error of
O(min(k-2α,n-1))O(min(k-2α,n-1)).
Proof:Proof: Assume n>k>mn>k>m. Divide [0,1][0,1] into kk intervals. If ff
is smooth on a particular interval II, then
|
f
(
t
)
-
f
˜
k
(
t
)
|
=
O
(
k
-
2
α
)
∀
t
∈
I
|
f
(
t
)
-
f
˜
k
(
t
)
|
=
O
(
k
-
2
α
)
∀
t
∈
I
(7)
.
In intervals that contain a discontinuity, recursively subdivide
into two until the discontinuity is contained in an interval of
width n-1n-1. This process results in at most log2nlog2n addition
subintervals per discontinuity, and the squared approximation
error is O(k-2α)O(k-2α) on all of them accept the mm intervals
of width n-1n-1 containing the discontinuities where the error
is O(1)O(1) at each point.
Thus, the overall squared L2L2 norm is
|
|
f
-
f
˜
k
|
|
L
2
2
=
O
(
m
i
n
(
k
-
2
α
,
n
-
1
)
)
|
|
f
-
f
˜
k
|
|
L
2
2
=
O
(
m
i
n
(
k
-
2
α
,
n
-
1
)
)
(8)
and there are at most k+log2nk+log2n
intervals in the partition. Since k>m, we can upperbound the
number of intervals by 2klog2n2klog2n.
Note that if the initial complete RDP has n≈k2αn≈k2α
intervals, then the squared error is O(k-2α)O(k-2α).
Thus, we only incur a factor of 2αlogk2αlogk additional leafs
and achieve the same overall approximation error as in the
Hα(Cα)Hα(Cα) case. We will see that this is a small price
to pay in order to handle not only smooth functions, but also
piecewise smooth functions.
Let f∈L2([0,1])f∈L2([0,1]); ∫f2(t)dt<∞∫f2(t)dt<∞.
A wavelet approximation is a series of the form
f
=
c
o
+
∑
j
≥
0
∑
k
=
1
2
j
<
f
,
ψ
j
,
k
>
ψ
j
,
k
f
=
c
o
+
∑
j
≥
0
∑
k
=
1
2
j
<
f
,
ψ
j
,
k
>
ψ
j
,
k
(9)
where coco is a constant (co=∫01f(t)dt)(co=∫01f(t)dt),
<
f
,
ψ
j
,
k
>
=
∫
0
1
f
(
t
)
ψ
j
,
k
(
t
)
d
t
<
f
,
ψ
j
,
k
>
=
∫
0
1
f
(
t
)
ψ
j
,
k
(
t
)
d
t
(10)
and the basis
functions ψj,kψj,k are orthonormal, oscillatory signals,
each with an associated scale 2-j2-j and position k2-jk2-j.
ψj,kψj,k is called the wavelet at scale 2-j2-j and position
k2-jk2-j.
Example 5
Haar Wavelets
ψ
j
,
k
(
t
)
=
2
j
/
2
(
1
{
t
∈
[
2
-
j
(
k
-
1
)
,
2
-
j
(
k
-
1
/
2
)
]
}
-
1
{
t
∈
[
2
-
j
(
k
-
1
/
2
)
,
2
-
j
k
]
}
)
ψ
j
,
k
(
t
)
=
2
j
/
2
(
1
{
t
∈
[
2
-
j
(
k
-
1
)
,
2
-
j
(
k
-
1
/
2
)
]
}
-
1
{
t
∈
[
2
-
j
(
k
-
1
/
2
)
,
2
-
j
k
]
}
)
(11)
∫
0
1
ψ
j
,
k
(
t
)
d
t
=
0
∫
0
1
ψ
j
,
k
(
t
)
d
t
=
0
(12)
∫
0
1
ψ
j
,
k
2
(
t
)
d
t
=
∫
(
k
-
1
)
2
-
j
k
2
-
j
2
j
d
t
=
1
∫
0
1
ψ
j
,
k
2
(
t
)
d
t
=
∫
(
k
-
1
)
2
-
j
k
2
-
j
2
j
d
t
=
1
(13)
∫
0
1
ψ
j
,
k
(
t
)
ψ
l
,
m
(
t
)
d
t
=
δ
j
,
l
.
δ
k
,
m
∫
0
1
ψ
j
,
k
(
t
)
ψ
l
,
m
(
t
)
d
t
=
δ
j
,
l
.
δ
k
,
m
(14)
Note:
If ff is constant on [2-j(k-1),2-jk][2-j(k-1),2-jk], then
∫
f
ψ
j
,
k
(
t
)
=
0
∫
f
ψ
j
,
k
(
t
)
=
0
(15)
.
Suppose ff is piecewise constant with at most mm
discontinuities. Let
f
J
=
c
o
+
∑
j
=
0
J
-
1
∑
k
=
1
2
j
<
f
,
ψ
j
,
k
>
ψ
j
,
k
f
J
=
c
o
+
∑
j
=
0
J
-
1
∑
k
=
1
2
j
<
f
,
ψ
j
,
k
>
ψ
j
,
k
(16)
. Then, fJfJ has at
most mJmJ non-zero wavelet coefficients; i.e., <f,ψj,k>=0<f,ψj,k>=0 for all but mJmJ terms, since at most one Haar Wavelet at each
scale senses each point of discontinuity. Said another way, all
but at most mm of the wavelets at each scale have support over
constant regions of ff.
fJfJ itself will be piecewise constant with discontinuities only
possible occurring at end points of the intervals
[2-J(k-1),2-Jk][2-J(k-1),2-Jk]. Therefore, in this case
|
|
f
-
f
J
|
|
L
2
2
=
O
(
2
-
J
)
.
|
|
f
-
f
J
|
|
L
2
2
=
O
(
2
-
J
)
.
(17)
Daubechies wavelets are the extension of the Haar wavelet idea.
Haar wavelets have one "vanishing moment":
∫
0
1
ψ
j
,
k
=
0
.
∫
0
1
ψ
j
,
k
=
0
.
(18)
Daubechies wavelets are "smoother" basis functions with extra
vanishing moments. The Daubechies-NN wavelet has NN vanishing
moments.
∫
0
1
t
l
ψ
j
,
k
d
t
=
0
f
o
r
l
=
0
,
1
,
.
.
.
,
N
-
1
.
∫
0
1
t
l
ψ
j
,
k
d
t
=
0
f
o
r
l
=
0
,
1
,
.
.
.
,
N
-
1
.
(19)
The Daubechies-1 wavelet is just the Haar case.
If ff is a piecewise degree ≤N≤N polynomial with at most m
pieces, then using the Daubechies-NN wavelet system.
|
|
f
-
f
J
|
|
L
2
2
=
O
(
2
-
J
)
;
|
|
f
-
f
J
|
|
L
2
2
=
O
(
2
-
J
)
;
(20)
and
f
J
(
t
)
=
c
o
+
∑
j
=
0
J
-
1
∑
k
=
1
2
j
<
f
,
ψ
j
,
k
>
ψ
j
,
k
(
t
)
f
J
(
t
)
=
c
o
+
∑
j
=
0
J
-
1
∑
k
=
1
2
j
<
f
,
ψ
j
,
k
>
ψ
j
,
k
(
t
)
(21)
has at most
O(mJ)O(mJ) non-zero wavelet coefficients. fJfJ is called the Discrete Wavelet Transform (DWT)
approximation of ff. The key
idea is the same as we saw with trees.
We can also use DWT's to analyze and represent discrete, sampled
functions. Suppose,
f
̲
=
[
f
(
1
/
n
)
,
f
(
2
/
n
)
,
.
.
.
,
f
(
n
/
n
)
]
f
̲
=
[
f
(
1
/
n
)
,
f
(
2
/
n
)
,
.
.
.
,
f
(
n
/
n
)
]
(22)
then we can write f̲f̲ as
f
̲
=
c
o
+
∑
j
=
0
l
o
g
2
n
-
1
∑
k
=
1
2
j
<
f
̲
,
ψ
̲
j
,
k
>
ψ
̲
j
,
k
f
̲
=
c
o
+
∑
j
=
0
l
o
g
2
n
-
1
∑
k
=
1
2
j
<
f
̲
,
ψ
̲
j
,
k
>
ψ
̲
j
,
k
(23)
where
ψ
̲
j
,
k
=
[
ψ
j
,
k
(
1
)
,
ψ
j
,
k
(
2
)
,
.
.
.
,
ψ
j
,
k
(
n
)
]
ψ
̲
j
,
k
=
[
ψ
j
,
k
(
1
)
,
ψ
j
,
k
(
2
)
,
.
.
.
,
ψ
j
,
k
(
n
)
]
(24)
is a discrete time analog of the continuous time
wavelets we considered before. In particular,
∑
i
=
1
n
i
l
ψ
j
,
k
(
i
)
=
0
,
l
=
0
,
1
,
.
.
.
,
N
-
1
∑
i
=
1
n
i
l
ψ
j
,
k
(
i
)
=
0
,
l
=
0
,
1
,
.
.
.
,
N
-
1
(25)
for the Daubechies-NN
discrete wavelets.
<
f
̲
,
ψ
̲
j
,
k
>
=
f
̲
T
ψ
̲
j
,
k
<
f
̲
,
ψ
̲
j
,
k
>
=
f
̲
T
ψ
̲
j
,
k
(26)
Thus, we also have an
analogous approximation result: If f̲f̲ are samples
from a piecewise degree ≤N≤N polynomial function with a finite
number mm of discontinuities, then f̲f̲ has O(mJ)O(mJ)
non-zero wavelet coefficients.
Suppose f∈Bα(Cα)f∈Bα(Cα) and has a finite number of
discontinuities. Let fpfp denote piecewise degree-N(N=⌈α⌉)N(N=⌈α⌉) polynomial approximation to ff with
O(k)O(k) pieces; a uniform partition into kk equal length intervals
followed by addition splits at the points of discontinuity.
Then