To better understand the role of the penalty function, it is informative to consider the scalar case.
That is, given y∈Ry∈R,
find x∈Rx∈R so as to minimize the function
F
(
x
)
=
1
2
(
y
-
x
)
2
+
φ
(
x
)
.
F
(
x
)
=
1
2
(
y
-
x
)
2
+
φ
(
x
)
.
(7)To emphasize the dependence of the minimizing value xx on yy,
we define
s
(
y
)
:
=
arg
min
x
1
2
(
y
-
x
)
2
+
φ
(
x
)
.
s
(
y
)
:
=
arg
min
x
1
2
(
y
-
x
)
2
+
φ
(
x
)
.
(8)The function s(y)s(y) is called the shrinkage function because it shrinks yy towards zero.
The exact form of s(y)s(y) depends on the penalty function φ(x)φ(x).
Two particular examples of φ(x)φ(x) will be analyzed in this section.
Example 1:
Define the penalty function φ1(x)φ1(x) as
φ
1
(
x
)
=
λ
|
x
|
φ
1
(
x
)
=
λ
|
x
|
(9)where λ>0λ>0,
as illustrated in Figure 1.
Note that φ1(x)φ1(x) is a convex function.
This corresponds to the ℓ1ℓ1 norm.
Example 2:
Define
φ
2
(
x
)
=
λ
a
log
(
1
+
a
|
x
|
)
φ
2
(
x
)
=
λ
a
log
(
1
+
a
|
x
|
)
(10)where λ>0λ>0 and a>0a>0,
as illustrated in Figure 2.
Note that φ2(x)φ2(x) is not a convex function.
There are many other penalty functions that have been used for sparse signal/image processing [4], [10], [13].
But these are two common, representative examples.
For both of these two examples, we will find the shrinkage function s(y)s(y).
To find the shrinkage function s(y)s(y), we minimize F(x)F(x) with respect to xx.
The derivative of F(x)F(x) is given by
d
F
(
x
)
d
x
=
x
-
y
+
φ
'
(
x
)
.
d
F
(
x
)
d
x
=
x
-
y
+
φ
'
(
x
)
.
(11)Setting the derivative of F(x)F(x) to zero gives
y
=
x
+
φ
'
(
x
)
y
=
x
+
φ
'
(
x
)
(12)The value xx minimizing F(x)F(x) can be found by solving Equation 12.
First we need φ'(x)φ'(x).
Example 1:
The derivative of φ1(x)φ1(x) is
φ
1
'
(
x
)
=
λ
,
x
>
0
-
λ
,
x
<
0
,
φ
1
'
(
x
)
=
λ
,
x
>
0
-
λ
,
x
<
0
,
(13)which is illustrated in Figure 3.
Note that φ1(x)φ1(x) is not differentiable at x=0x=0 and we have omitted defining φ1'(x)φ1'(x) at x=0x=0.
In fact, there is a mathematically rigorous method to define the derivative using convex analysis.
(In convex analysis, the sub-differential is a generalization of the derivative that
is defined as a set-valued function.)
However, here we defer convex analysis and simply graph φ1'(x)φ1'(x) using a straight vertical line at x=0x=0
as illustrated in Figure 3.
The function φ1'(x)φ1'(x) can be written compactly as
φ
1
'
(
x
)
=
λ
sign
(
x
)
φ
1
'
(
x
)
=
λ
sign
(
x
)
(14)where the function sign (x) sign (x) is defined as
sign
(
x
)
=
1
,
x
>
0
-
1
,
x
<
0
.
sign
(
x
)
=
1
,
x
>
0
-
1
,
x
<
0
.
(15)Example 2:
The derivative of φ2(x)φ2(x) is given by
φ
2
'
(
x
)
=
λ
1
+
a
x
,
x
>
0
-
λ
1
-
a
x
,
x
<
0
φ
2
'
(
x
)
=
λ
1
+
a
x
,
x
>
0
-
λ
1
-
a
x
,
x
<
0
(16)which can be written compactly as
φ
2
'
(
x
)
=
λ
1
+
a
|
x
|
sign
(
x
)
.
φ
2
'
(
x
)
=
λ
1
+
a
|
x
|
sign
(
x
)
.
(17)As in Example 1, the function φ2(x)φ2(x) is not differentiable at x=0x=0.
In the graph of φ2'(x)φ2'(x) shown in Figure 4,
we simply use a straight vertical line at x=0x=0.
Now that φ'(x)φ'(x) is found,
we obtain the shrinkage function s(y)s(y) by solving Equation 12 for xx.
Example 1:
To find the shrinkage function, write
y
=
x
+
φ
1
'
(
x
)
y
=
x
+
λ
sign
(
x
)
.
y
=
x
+
φ
1
'
(
x
)
y
=
x
+
λ
sign
(
x
)
.
(18)
Solving Equation 18 for xx can be done graphically.
Figure 5a shows
the functions xx and λ sign (x)λ sign (x) individually.
Figure 5b shows
the function y=x+λ sign (x)y=x+λ sign (x).
Solving for xx entails simply exchanging the xx and yy axes of the graph
in Figure 5b.
The shrinkage function s1(y)s1(y) is given by the graph of xx versus yy in Figure 5c.
From the graph, we can write
s
1
(
y
)
=
y
-
λ
,
y
≥
λ
0
,
|
y
|
≤
λ
y
+
λ
,
y
≤
-
λ
s
1
(
y
)
=
y
-
λ
,
y
≥
λ
0
,
|
y
|
≤
λ
y
+
λ
,
y
≤
-
λ
(19)which can be written compactly as
s
1
(
y
)
=
max
(
|
y
|
-
λ
,
0
)
sign
(
y
)
.
s
1
(
y
)
=
max
(
|
y
|
-
λ
,
0
)
sign
(
y
)
.
(20)The function s1(y)s1(y) is known as the soft threshold function.
It arises frequently in sparse signal processing and compressed sensing.
It sets small values (less than λλ in absolute value) to zero;
and it shrinks large values (greater than λλ in absolute value) toward zero.
Note that the soft threshold function is continuous and it has a maximum derivative of 1.
Therefore, small changes in yy do not produce large changes in x=s1(y)x=s1(y).
This is in contrast with shrinkage functions produced using some other penalty functions.
Example 2:
To find the shrinkage function, write
y
=
x
+
φ
2
'
(
x
)
y
=
x
+
λ
1
+
a
|
x
|
sign
(
x
)
.
y
=
x
+
φ
2
'
(
x
)
y
=
x
+
λ
1
+
a
|
x
|
sign
(
x
)
.
(21)
As for Example 1,
solving Equation 21 for xx can be done graphically.
Figures Figure 6a and Figure 6b
show
the functions xx, φ2'(x)φ2'(x) individually, and their sum.
The shrinkage function s2(y)s2(y), obtained by exchanging the xx and yy axes of the graph
in Figure 6b, is shown in Figure 6c.
A mathematical equation for s2(y)s2(y) can be found by solving Equation 21 for xx.
For y≥λy≥λ, Equation 21 is written as
y
=
x
+
λ
1
+
a
x
y
=
x
+
λ
1
+
a
x
(22)from which we write
y
-
x
=
λ
1
+
a
x
(
y
-
x
)
(
1
+
a
x
)
=
λ
a
x
2
+
(
1
-
a
y
)
x
+
(
T
-
y
)
=
0
.
y
-
x
=
λ
1
+
a
x
(
y
-
x
)
(
1
+
a
x
)
=
λ
a
x
2
+
(
1
-
a
y
)
x
+
(
T
-
y
)
=
0
.
(23)Using the quadratic formula gives
x
=
1
2
a
(
a
y
-
1
)
+
(
a
y
-
1
)
2
-
4
a
(
T
-
y
)
,
y
≥
λ
x
=
1
2
a
(
a
y
-
1
)
+
(
a
y
-
1
)
2
-
4
a
(
T
-
y
)
,
y
≥
λ
(24)A complete formula for the shrinkage formula s2(y)s2(y) is given by
s
2
(
y
)
=
1
2
a
(
a
|
y
|
-
1
)
+
(
a
|
y
|
-
1
)
2
-
4
a
(
T
-
|
y
|
)
sign
(
y
)
,
|
y
|
≥
λ
0
,
|
y
|
≤
λ
s
2
(
y
)
=
1
2
a
(
a
|
y
|
-
1
)
+
(
a
|
y
|
-
1
)
2
-
4
a
(
T
-
|
y
|
)
sign
(
y
)
,
|
y
|
≥
λ
0
,
|
y
|
≤
λ
(25)Like the soft-threshold function,
it sets small values (less than λλ) to zero;
and it shrinks large values (greater than λλ) toward zero.
But compared with the soft-threshold rule,
large values are not as attenuated.
For both examples, we see that the shrinkage function sets all values of yy less than λλ to zero.
Both shrinkage functions are thresholding functions.
The threshold value is λλ.
But φ2(x)φ2(x) has an additional parameter, aa, which also affects the shape of the shrinkage function.
This parameter can be used to adjust the shrinkage function.
How does aa influence the shape of the shrinkage function s2(y)s2(y)?
First, we note that some values of aa will lead to a complication in the derivation of the shrinkage function.
In particular, when aa is too large, then the graphs in Figure 6 take on different characteristics.
For example, with λ=2λ=2 and a=3a=3, the new graphs are shown in Figure 7.
Note that the graph of x+φ2'(x)x+φ2'(x) in Figure 7b is not a strictly increasing function of xx.
The graph shown in Figure 7c does not define a function of yy.
There is not a unique xx for each yy.
Inspection of F(x)F(x) reveals that two of the three values of xx on the graph (where the
graph provides non-unque xx) are only local maxima and minima of F(x)F(x).
Only one of the three values is the global minimizer of F(x)F(x).
Taking into consideration the value of F(x)F(x) so as to identify the minimizing xx as
a function of yy, we obtain the shrinkage function illustrated in Figure 7.
(The darker line in the graph represents the shrinkage function.)
The shrinkage function in Figure 7 is discontinuous.
Some frequently used shrinkage functions are discontinuous like this (for example, the hard threshold
function).
However, sometimes a discontinuous shrinkage function is considered undesirable.
A small change in yy can produce a large change in s(y)s(y).
Such a shrinkage function is sensitive to small changes in the data, yy.
To ensure the shrinkage function s2(y)s2(y) is continuous, how should aa be chosen?
Note that the discontinuity of the shrinkage function is due to x+φ'(x)x+φ'(x) not being a strictly increasing
function of xx.
To avoid the discontinuity, x+φ'(x)x+φ'(x) should be increasing, i.e.
d
d
x
(
x
+
φ
'
(
x
)
)
>
0
for
all
x
>
0
.
d
d
x
(
x
+
φ
'
(
x
)
)
>
0
for
all
x
>
0
.
(26)Equivalently, the penalty function φ(y)φ(y) should satisfy
φ
'
'
(
x
)
>
-
1
for
all
x
>
0
.
φ
'
'
(
x
)
>
-
1
for
all
x
>
0
.
(27)Note that, in Example 2,
φ
2
'
'
(
x
)
=
-
λ
a
(
1
+
a
x
)
2
,
x
>
0
φ
2
'
'
(
x
)
=
-
λ
a
(
1
+
a
x
)
2
,
x
>
0
(28)which is most negative at x=0+x=0+,
at which point we have
φ
2
'
'
(
0
+
)
=
-
λ
a
.
φ
2
'
'
(
0
+
)
=
-
λ
a
.
(29)From Equation 27 and Equation 28,
we obtain the condition
-λa>-1-λa>-1
or
a<1/λa<1/λ.
Recalling that a>0a>0,
we obtain the range
0
<
a
<
1
λ
.
0
<
a
<
1
λ
.
(30)Regarding the sensitivity of the shrinkage function — note that
the maximum value of derivative of the shrinkage function s2(y)s2(y) occurs at y=λy=λ.
Let us calculate s2'(λ+)s2'(λ+).
It can be found in two ways,
by differentiating the shrinkage function,
or indirectly as the reciprocal of the right-sided derivative of x+φ'(x)x+φ'(x) at x=0x=0.
For Example 2, we have
s
2
'
(
λ
)
=
1
1
-
λ
a
s
2
'
(
λ
)
=
1
1
-
λ
a
(31)so aa can be set as
a
=
1
λ
1
-
1
s
2
'
(
λ
+
)
a
=
1
λ
1
-
1
s
2
'
(
λ
+
)
(32)where s2'(λ+)s2'(λ+) is the maximum slope of the shrinkage function.
The penalty function and shrinkage function for several values of aa are shown in
Figure 8.
In the limit as aa approaches zero, the shrinkage function s2(y)s2(y) approaches the soft-threshold rule,
which has a slope equal to 1 for all |y|>λ|y|>λ.