To this point, we have discussed signal representations and models
as basic tools for signal processing. In the following modules, we discuss the actual application of these tools to tasks
such as approximation and compression, and we continue to discuss
the geometric implications.
One common prototypical problem in signal processing is to find
the best linear approximation to a signal xx. By “best linear
approximation,” we mean the best approximation to xx from among
a class of signals comprising a linear (or affine) subspace. This
situation may arise, for example, when we have a noisy observation
of a signal believed to obey a linear model. If we choose an
ℓ2ℓ2 error criterion, the solution to this optimization
problem has a particularly strong geometric interpretation.
To be more concrete, suppose SS is a KK-dimensional
linear subspace of RNRN. (The case of an affine subspace
follows similarly.) If we seek
s
*
:
=
arg
min
s
∈
S
s
-
x
2
,
s
*
:
=
arg
min
s
∈
S
s
-
x
2
,
(1)
standard linear algebra results state that the minimizer is given
by
s
*
=
A
T
A
x
,
s
*
=
A
T
A
x
,
(2)
where
AA is a
K×NK×N matrix whose rows form an
orthonormal basis for
SS. Geometrically, one can easily see that
this solution corresponds to an orthogonal projection of
xx onto
the subspace
SS (see
Figure 1(a)).
The linear approximation problem arises frequently in settings
involving signal dictionaries. In some settings, such as the case
of an oversampled bandlimited signal, certain coefficients in the
vector αα may be assumed to be fixed at zero. In the case
where the dictionary ΨΨ forms an orthonormal basis, the linear
approximation estimate of the unknown coefficients has a
particularly simple form: rows of the matrix AA in
Equation 2 are obtained by selecting and transposing the
columns of ΨΨ whose expansion coefficients are unknown, and
consequently, the unknown coefficients can be estimated simply by
taking the inner products of xx against the appropriate columns
of ΨΨ.
For example, in choosing a fixed subset of the Fourier or wavelet
dictionaries, one may rightfully choose the lowest frequency
(coarsest scale) basis functions for the set SS because, as
discussed in Linear Models from Low-Dimensional Signal Models , the coefficients
generally tend to decay at higher frequencies (finer scales). For
smooth functions, this strategy is appropriate and effective;
functions in Sobolev smoothness spaces are well-approximated using
linear approximations from the Fourier or wavelet
dictionaries [11].
For piecewise smooth functions, however, even the wavelet-domain
linear approximation strategy would miss out on significant
coefficients at fine scales. Since the locations of such
coefficients are unknown a priority, it is impossible to propose a
linear wavelet-domain approximation scheme that could
simultaneously capture all piecewise smooth signals. As an example, Figure 2(a) shows the linear approximation of the Cameraman test image obtained by keeping only the lowest-frequency scaling and wavelet coefficients. No high-frequency information is available to clearly represent features such as edges.
A related question often arises in settings involving signal
dictionaries. Rather than finding the best approximation to a
signal ff using a fixed collection of KK elements from
the dictionary ΨΨ, one may often seek the best
KK-term representation to ff among all possible
expansions that use KK terms from the dictionary.
Compared to linear approximation, this type of nonlinear
approximation [8], [4] utilizes the ability of
the dictionary to adapt: different elements may be important for
representing different signals.
The KK-term nonlinear approximation problem corresponds
to the optimization
s
K
,
p
*
:
=
arg
min
s
∈
Σ
K
∥
s
-
f
∥
p
.
s
K
,
p
*
:
=
arg
min
s
∈
Σ
K
∥
s
-
f
∥
p
.
(3)
(For the sake of generality, we consider general
LpLp and
ℓpℓp norms in this section.) Due to the nonlinearity of the
set
ΣKΣK for a given dictionary, solving this
problem can be difficult. Supposing
ΨΨ is an orthonormal basis
and
p=2p=2, the solution to
Equation 3 is easily obtained by
thresholding: one simply computes the coefficients
αα and keeps the
KK largest (setting the remaining coefficients to zero).
The approximation error is then given simply
by the coefficients that are discarded:
∥
s
K
,
2
*
-
f
∥
2
=
∑
k
>
K
α
˜
k
2
1
/
2
.
∥
s
K
,
2
*
-
f
∥
2
=
∑
k
>
K
α
˜
k
2
1
/
2
.
(4)
When
ΨΨ is a redundant dictionary, however, the situation is
much more complicated. We mention more on this below (see also
Figure 1(b)).
One common measure for the quality of a dictionary ΨΨ in
approximating a signal class is the fidelity of its
KK-term representations. Often one examines the
asymptotic rate of decay of the KK-term approximation
error as KK grows large. Defining
σ
K
(
f
)
p
:
=
∥
s
K
,
p
*
-
f
∥
p
,
σ
K
(
f
)
p
:
=
∥
s
K
,
p
*
-
f
∥
p
,
(5)
for a given signal
ff we may consider the asymptotic decay of
σK(f)pσK(f)p as
K→∞K→∞. (We recall
the dependence of
Equation 3 and hence
Equation 5 on the
dictionary
ΨΨ.) In many cases, the function
σK(f)pσK(f)p will decay as
K-rK-r for some
rr, and when
ΨΨ represents a harmonic dictionary, faster
decay rates tend to correspond to smoother functions. Indeed, one
can show that when
ΨΨ is an orthonormal basis, then
σK(f)2σK(f)2 will decay as
K-rK-r if and only
if
α˜kα˜k decays as
k-r+1/2k-r+1/2 [7].
Let f∈CHf∈CH be a 1-D function. Supposing the
wavelet dictionary has more than HH vanishing moments,
then ff can be well approximated using its KK largest
coefficients (most of which are at coarse scales). As KK
grows large, the nonlinear approximation error will
decay as σK(f)2≲K-HσK(f)2≲K-H.
Supposing that ff is piecewise smooth, however, with a finite
number of discontinuities, then (as discussed in
Sparse (Nonlinear) Models from Low-Dimensional Signal Models) ff will have a limited number of
significant wavelet coefficients at fine scales. Because of the
concentration of these significant coefficients within each scale,
the nonlinear approximation rate will remain
σK(f)2≲K-HσK(f)2≲K-H as if
there were no discontinuities present [11].
Unfortunately, this resilience of wavelets to discontinuities does
not extend to higher dimensions. Suppose, for example, that ff is
a CHCH smooth 2-D signal. Assuming the proper
number of vanishing moments, a wavelet representation will achieve
the optimal nonlinear approximation rate σK(f)2≲K-H/2σK(f)2≲K-H/2 [3], [11]. As in the 1-D case, this approximation rate is maintained when a
finite number of point discontinuities are introduced into ff.
However, when ff contains 1-D discontinuities (edges separating
the smooth regions), the approximation rate will fall to
σK(f)2≲K-1/2σK(f)2≲K-1/2 [11]. The problem actually arises due to the isotropic, dyadic supports
of the wavelets; instead of O(1)O(1) significant wavelets at each
scale, there are now O(2j)O(2j) wavelets overlapping the
discontinuity. We revisit this important issue in
Compression.
Despite the limited approximation capabilities for images with edges, nonlinear approximation in the wavelet domain typically offers a superior approximation to an image compared to linear approximation in the wavelet domain. As an example, Figure 2(b) shows the nonlinear approximation of the Cameraman test image obtained by keeping the largest scaling and wavelet coefficients. In this case, a number of high-frequency coefficients are selected, which gives an improved ability to represent features such as edges. Better concise transforms, which capture the image information in even fewer coefficients, would offer further improvements in terms of nonlinear approximation quality.
As mentioned above, in the case where ΨΨ is an orthonormal
basis and p=2p=2, the solution to Equation 3 is easily obtained
by thresholding: one simply computes the coefficients αα and keeps the
KK largest (setting the remaining coefficients to zero). Thresholding can also be shown to be optimal
for arbitrary ℓpℓp norms in the special case where ΨΨ is
the canonical basis. While the optimality of thresholding does not
generalize to arbitrary norms and bases, thresholding can be shown
to be a near-optimal approximation strategy for wavelet bases with
arbitrary LpLp norms [7].
In the case where ΨΨ is a redundant dictionary, however, the
expansion coefficients αα are not unique, and the
optimization problem Equation 3 can be much more difficult to
solve. Indeed, supposing even that an exact
KK-term
representation exists for ff in the dictionary ΨΨ, finding
that KK-term approximation is NP-hard in general,
requiring a combinatorial enumeration of the
ZKZK possible sparse
subspaces [6]. This search can be recast as the
optimization problem
α
^
=
arg
min
∥
α
∥
0
s.t.
f
=
Ψ
α
.
α
^
=
arg
min
∥
α
∥
0
s.t.
f
=
Ψ
α
.
(6)
While solving
Equation 6 is prohibitively complex, a variety
of algorithms have been proposed as alternatives. One approach
convexifies the optimization problem by replacing the
ℓ0ℓ0
fidelity criterion by an
ℓ1ℓ1 criterion
α
^
=
arg
min
∥
α
∥
1
s.t.
f
=
Ψ
α
.
α
^
=
arg
min
∥
α
∥
1
s.t.
f
=
Ψ
α
.
(7)
This problem, known as Basis Pursuit
[5], is
significantly more approachable and can be solved with traditional
linear programming techniques whose computational complexities are
polynomial in
ZZ. The
ℓ1ℓ1 criterion has the advantage of yielding a convex optimization problem while still encouraging sparse solutions due to the polytope geometry of the
ℓ1ℓ1 unit ball (see for example
[9] and
[10]).
Iterative greedy algorithms such as
Matching Pursuit (MP) and Orthogonal Matching Pursuit
(OMP)
[11] have also been suggested to find sparse
representations
αα for a signal
ff. Both MP and OMP
iteratively select the columns from
ΨΨ that are most
correlated with
ff, then subtract the contribution of each
column, leaving a residual. OMP includes an additional step at
each iteration where the residual is orthogonalized against the
previously selected columns.
We also consider the problem of finding the best manifold-based
approximation to a signal (see Figure 1(c)).
Suppose that F={fθ:θ∈Θ}F={fθ:θ∈Θ} is a
parametrized KK-dimension manifold and that we are given a
signal II that is believed to approximate fθfθ for an
unknown θ∈Θθ∈Θ. From II we wish to recover an
estimate of θθ. Again, we may formulate this parameter
estimation problem as an optimization, writing the objective
function (here we concentrate solely on the L2L2 or ℓ2ℓ2
case)
D
(
θ
)
=
f
θ
-
I
2
2
D
(
θ
)
=
f
θ
-
I
2
2
(8)
and solving for
θ
*
=
arg
min
θ
∈
Θ
D
(
θ
)
.
θ
*
=
arg
min
θ
∈
Θ
D
(
θ
)
.
(9)
We suppose that the minimum is uniquely defined.
Standard nonlinear parameter estimation [2] tells us
that, if DD is differentiable, we can use Newton's method to
iteratively refine a sequence of guesses θ(0),θ(1),θ(2),⋯θ(0),θ(1),θ(2),⋯ to θ*θ* and rapidly
convergence to the true value. Supposing that FF is a differentiable manifold, we would let
J
=
[
∂
D
/
∂
θ
0
∂
D
/
∂
θ
1
⋯
∂
D
/
∂
θ
K
-
1
]
T
J
=
[
∂
D
/
∂
θ
0
∂
D
/
∂
θ
1
⋯
∂
D
/
∂
θ
K
-
1
]
T
(10)
be the gradient of
DD, and let
HH be the
K×KK×K
Hessian,
Hij=∂2D∂θi∂θjHij=∂2D∂θi∂θj. Assuming
DD is differentiable, Newton's method
specifies the following update step:
θ
(
k
+
1
)
←
θ
(
k
)
+
[
H
(
θ
(
k
)
)
]
-
1
J
(
θ
(
k
)
)
.
θ
(
k
+
1
)
←
θ
(
k
)
+
[
H
(
θ
(
k
)
)
]
-
1
J
(
θ
(
k
)
)
.
(11)
To relate this method to the structure of the manifold, we can
actually express the gradient and Hessian in terms of signals,
writing
D
(
θ
)
=
∥
f
θ
-
I
∥
2
2
=
∫
(
f
θ
-
I
)
2
d
x
=
∫
f
θ
2
-
2
I
f
θ
+
I
2
d
x
.
D
(
θ
)
=
∥
f
θ
-
I
∥
2
2
=
∫
(
f
θ
-
I
)
2
d
x
=
∫
f
θ
2
-
2
I
f
θ
+
I
2
d
x
.
(12)
Differentiating with respect to component
θiθi, we obtain
∂
D
∂
θ
i
=
J
i
=
∂
∂
θ
i
∫
f
θ
2
-
2
I
f
θ
+
I
2
d
x
=
∫
∂
∂
θ
i
(
f
θ
2
)
-
2
I
∂
∂
θ
i
f
θ
d
x
=
∫
2
f
θ
τ
θ
i
-
2
I
τ
θ
i
d
x
=
2
〈
f
θ
-
I
,
τ
θ
i
〉
,
∂
D
∂
θ
i
=
J
i
=
∂
∂
θ
i
∫
f
θ
2
-
2
I
f
θ
+
I
2
d
x
=
∫
∂
∂
θ
i
(
f
θ
2
)
-
2
I
∂
∂
θ
i
f
θ
d
x
=
∫
2
f
θ
τ
θ
i
-
2
I
τ
θ
i
d
x
=
2
〈
f
θ
-
I
,
τ
θ
i
〉
,
(13)
where
τθi=∂fθ∂θiτθi=∂fθ∂θi is a tangent signal. Continuing, we examine the
Hessian,
∂
2
D
∂
θ
i
∂
θ
j
=
H
i
j
=
∂
∂
θ
j
∂
D
∂
θ
i
=
∫
∂
∂
θ
j
2
f
θ
τ
θ
i
-
2
I
τ
θ
i
d
x
=
∫
2
τ
θ
i
τ
θ
j
+
2
f
θ
τ
θ
i
j
-
2
I
τ
θ
i
j
d
x
=
2
〈
τ
θ
i
,
τ
θ
j
〉
+
2
〈
f
θ
-
I
,
τ
θ
i
j
〉
,
∂
2
D
∂
θ
i
∂
θ
j
=
H
i
j
=
∂
∂
θ
j
∂
D
∂
θ
i
=
∫
∂
∂
θ
j
2
f
θ
τ
θ
i
-
2
I
τ
θ
i
d
x
=
∫
2
τ
θ
i
τ
θ
j
+
2
f
θ
τ
θ
i
j
-
2
I
τ
θ
i
j
d
x
=
2
〈
τ
θ
i
,
τ
θ
j
〉
+
2
〈
f
θ
-
I
,
τ
θ
i
j
〉
,
(14)
where
τθij=∂2fθ∂θi∂θjτθij=∂2fθ∂θi∂θj denotes a second-derivative signal. Thus, we
can interpret Newton's method geometrically as (essentially) a
sequence of successive projections onto tangent spaces on the
manifold.
Again, the above discussion assumes the manifold to be
differentiable. Many interesting parametric signal manifolds are in fact nowhere
differentiable — the tangent spaces demanded by Newton's method
do not exist. However, in [1] we have identified a type of multiscale tangent structure to the manifold that permits a coarse-to-fine technique for parameter estimation.
-
M. B. Wakin and D. L. Donoho and H. Choi, and R. G. Baraniuk. (2005). High-resolution navigation on non-differentiable image manifolds. In Proc. Int. Conf. Acoustics, Speech, Signal Processing (ICASSP). IEEE.
-
Bates, D. M. and Watts, D. G. (1988). Nonlinear Regression Analysis and Its Applications. New York: John Wiley and Sons.
-
Cohen, A. and Dahmen, W. and Daubechies, I. and DeVore, R. (2001). Tree Approximation and Optimal Encoding. Appl. Comput. Harmon. Anal., 11, 192-226.
-
Cohen, A. and Daubechies, I. and Guleryuz, O. G. and Orchard, M. T. (2002, July). On the importance of combining wavelet-based nonlinear approximation with coding strategies. IEEE Trans. Inform. Theory, 48(7), 1895-1921.
-
Chen, S. and Donoho, D. and Saunders, M. (1998). Atomic decomposition by basis pursuit. SIAM J. on Sci. Comp., 20(1), 33-61.
-
Candès, E. and Tao, T. (2005). Error correction via linear programming. [Preprint]. Found. of Comp. Math..
-
DeVore, R. A. (Spring 2006). Lecture notes on Compressed Sensing. Rice University ELEC 631 Course Notes.
-
DeVore, R. A. (1998). Nonlinear Approximation. Acta Numerica, 7, 51-150.
-
Donoho, D. and Tanner, J. (2005). Neighborliness of randomly-projected simplices in high dimensions. [Preprint].
-
Donoho, D. L. and Tanner, J. (2006). Counting faces of randomly-projected polytopes when then projection radically lowers dimension. (2006-11). Technical report. Stanford University Department of Statistics.
-
Mallat, S. (1999). A wavelet tour of signal processing. San Diego, CA, USA: Academic Press.