# Penalty and Shrinkage Functions for Sparse Signal Processing

Module by: Ivan Selesnick

Summary: This tutorial describes the use of penalty functions for sparse signal processing. Several penalty functions are illustrated, both convex and non-convex. It is shown how a nonlinear shrinkage function can be derived from the penalty function and vice versa. An algorithm for solving a linear inverse problem with a sparse penalty function is derived using majorization-minimization (MM). As an example, the algorithm is applied to deconvolution of a sparse spike signal. Two penalty functions are used in the example to show the influence of the penalty function.

## Introduction

The use of sparsity in signal processing frequently calls for the solution to the minimization problem

$$\arg\min_x \; \frac{1}{2}\|y - Ax\|_2^2 + \lambda \|x\|_1$$
(1)

where

$$x = [x(1), \dots, x(N)]^T,$$
(2)
$$y = [y(1), \dots, y(M)]^T,$$
(3)

$A$ is a matrix of size $M \times N$, and $\|x\|_2^2$ denotes the energy of $x$, defined as

$$\|x\|_2^2 = \sum_n |x(n)|^2,$$
(4)

and $\|x\|_1$ denotes the $\ell_1$ norm of $x$, defined as

$$\|x\|_1 = \sum_n |x(n)|.$$
(5)

One approach in sparse signal processing calls for the solution to the more general problem:

$$\arg\min_x \; \frac{1}{2}\|y - Ax\|_2^2 + \sum_n \varphi(x(n)).$$
(6)

The function $\varphi(\cdot)$ is referred to as the penalty function (or regularization function). If $\varphi(x) = \lambda|x|$, then Equation 6 is the same as Equation 1. For sparse signal processing, $\varphi(x)$ should be chosen so as to promote sparsity of $x$. It is common to set $\varphi(x) = \lambda|x|$, especially because it is a convex function, unlike many other sparsity-promoting penalty functions. Convex functions are attractive because they can be minimized more reliably than non-convex functions. However, non-convex penalty functions can lead to enhanced sparsity of sparse signals.

Several questions arise:

- How should the penalty function $\varphi(\cdot)$ be chosen?
- What is the influence of the penalty function?
- How can the function in Equation 6 be minimized with respect to $x$?

## Scalar Case

To better understand the role of the penalty function, it is informative to consider the scalar case. That is, given $y \in \mathbb{R}$, find $x \in \mathbb{R}$ so as to minimize the function

$$F(x) = \frac{1}{2}(y - x)^2 + \varphi(x).$$
(7)

To emphasize the dependence of the minimizing value $x$ on $y$, we define

$$s(y) := \arg\min_x \; \frac{1}{2}(y - x)^2 + \varphi(x).$$
(8)

The function $s(y)$ is called the shrinkage function because it shrinks $y$ towards zero. The exact form of $s(y)$ depends on the penalty function $\varphi(x)$. Two particular examples of $\varphi(x)$ will be analyzed in this section.

Example 1: Define the penalty function $\varphi_1(x)$ as

$$\varphi_1(x) = \lambda |x|$$
(9)

where $\lambda > 0$, as illustrated in Figure 1. Note that $\varphi_1(x)$ is a convex function. This corresponds to the $\ell_1$ norm.

Example 2: Define

$$\varphi_2(x) = \frac{\lambda}{a} \log(1 + a|x|)$$
(10)

where $\lambda > 0$ and $a > 0$, as illustrated in Figure 2. Note that $\varphi_2(x)$ is not a convex function.

There are many other penalty functions that have been used for sparse signal/image processing [4], [10], [13]. But these are two common, representative examples. For both of these examples, we will find the shrinkage function $s(y)$.

To find the shrinkage function $s(y)$, we minimize $F(x)$ with respect to $x$. The derivative of $F(x)$ is given by

$$\frac{dF(x)}{dx} = x - y + \varphi'(x).$$
(11)

Setting the derivative of $F(x)$ to zero gives

$$y = x + \varphi'(x).$$
(12)

The value $x$ minimizing $F(x)$ can be found by solving Equation 12. First we need $\varphi'(x)$.

Example 1: The derivative of $\varphi_1(x)$ is

$$\varphi_1'(x) = \begin{cases} \lambda, & x > 0 \\ -\lambda, & x < 0, \end{cases}$$
(13)

which is illustrated in Figure 3. Note that $\varphi_1(x)$ is not differentiable at $x = 0$ and we have omitted defining $\varphi_1'(x)$ at $x = 0$. In fact, there is a mathematically rigorous method to define the derivative using convex analysis. (In convex analysis, the sub-differential is a generalization of the derivative that is defined as a set-valued function.) However, here we defer convex analysis and simply graph $\varphi_1'(x)$ using a straight vertical line at $x = 0$, as illustrated in Figure 3.

The function $\varphi_1'(x)$ can be written compactly as

$$\varphi_1'(x) = \lambda \, \mathrm{sign}(x)$$
(14)

where the function $\mathrm{sign}(x)$ is defined as

$$\mathrm{sign}(x) = \begin{cases} 1, & x > 0 \\ -1, & x < 0. \end{cases}$$
(15)

Example 2: The derivative of $\varphi_2(x)$ is given by

$$\varphi_2'(x) = \begin{cases} \dfrac{\lambda}{1 + ax}, & x > 0 \\[4pt] \dfrac{-\lambda}{1 - ax}, & x < 0 \end{cases}$$
(16)

which can be written compactly as

$$\varphi_2'(x) = \frac{\lambda}{1 + a|x|} \, \mathrm{sign}(x).$$
(17)

As in Example 1, the function $\varphi_2(x)$ is not differentiable at $x = 0$. In the graph of $\varphi_2'(x)$ shown in Figure 4, we simply use a straight vertical line at $x = 0$.

Now that $\varphi'(x)$ is found, we obtain the shrinkage function $s(y)$ by solving Equation 12 for $x$.

Example 1: To find the shrinkage function, write

$$y = x + \varphi_1'(x) = x + \lambda \, \mathrm{sign}(x).$$
(18)

Solving Equation 18 for $x$ can be done graphically. Figure 5a shows the functions $x$ and $\lambda\,\mathrm{sign}(x)$ individually. Figure 5b shows the function $y = x + \lambda\,\mathrm{sign}(x)$. Solving for $x$ entails simply exchanging the $x$ and $y$ axes of the graph in Figure 5b. The shrinkage function $s_1(y)$ is given by the graph of $x$ versus $y$ in Figure 5c. From the graph, we can write

$$s_1(y) = \begin{cases} y - \lambda, & y \ge \lambda \\ 0, & |y| \le \lambda \\ y + \lambda, & y \le -\lambda \end{cases}$$
(19)

which can be written compactly as

$$s_1(y) = \max(|y| - \lambda, 0) \, \mathrm{sign}(y).$$
(20)

The function $s_1(y)$ is known as the soft threshold function. It arises frequently in sparse signal processing and compressed sensing. It sets small values (less than $\lambda$ in absolute value) to zero, and it shrinks large values (greater than $\lambda$ in absolute value) toward zero.

Note that the soft threshold function is continuous and has a maximum derivative of 1. Therefore, small changes in $y$ do not produce large changes in $x = s_1(y)$. This is in contrast with shrinkage functions produced using some other penalty functions.
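The soft threshold rule of Equation 20 is a one-liner in code. The sketch below is a Python/NumPy illustration (the module's own code is MATLAB; the function name `soft_threshold` is mine):

```python
import numpy as np

def soft_threshold(y, lam):
    """Soft threshold function: s1(y) = max(|y| - lam, 0) * sign(y) (Equation 20)."""
    return np.maximum(np.abs(y) - lam, 0.0) * np.sign(y)

# With lam = 2: values with |y| <= 2 map to zero; larger values shrink toward zero.
y = np.array([-3.0, -1.0, 0.5, 2.0, 4.0])
print(soft_threshold(y, 2.0))   # [-1.  0.  0.  0.  2.]
```

Note that the function is continuous in `y`, consistent with the remark above.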

Example 2: To find the shrinkage function, write

$$y = x + \varphi_2'(x) = x + \frac{\lambda}{1 + a|x|}\,\mathrm{sign}(x).$$
(21)

As for Example 1, solving Equation 21 for $x$ can be done graphically. Figures 6a and 6b show the functions $x$ and $\varphi_2'(x)$ individually, and their sum. The shrinkage function $s_2(y)$, obtained by exchanging the $x$ and $y$ axes of the graph in Figure 6b, is shown in Figure 6c.

A mathematical equation for $s_2(y)$ can be found by solving Equation 21 for $x$. For $y \ge \lambda$, Equation 21 is written as

$$y = x + \frac{\lambda}{1 + ax}$$
(22)

from which we write

$$y - x = \frac{\lambda}{1 + ax} \;\Longrightarrow\; (y - x)(1 + ax) = \lambda \;\Longrightarrow\; a x^2 + (1 - ay)\,x + (\lambda - y) = 0.$$
(23)

$$x = \frac{1}{2a}\left[(ay - 1) + \sqrt{(ay - 1)^2 - 4a(\lambda - y)}\right], \quad y \ge \lambda.$$
(24)

A complete formula for the shrinkage function $s_2(y)$ is given by

$$s_2(y) = \begin{cases} \dfrac{1}{2a}\left[(a|y| - 1) + \sqrt{(a|y| - 1)^2 - 4a(\lambda - |y|)}\right]\mathrm{sign}(y), & |y| \ge \lambda \\[6pt] 0, & |y| \le \lambda \end{cases}$$
(25)

Like the soft threshold function, it sets small values (less than $\lambda$) to zero, and it shrinks large values (greater than $\lambda$) toward zero. But compared with the soft threshold rule, large values are not as attenuated.

For both examples, we see that the shrinkage function sets all values of $y$ less than $\lambda$ in absolute value to zero. Both shrinkage functions are thresholding functions, with threshold value $\lambda$. But $\varphi_2(x)$ has an additional parameter, $a$, which also affects the shape of the shrinkage function. This parameter can be used to adjust the shrinkage function.
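The closed form in Equation 25 can be checked numerically against Equation 21, which it is supposed to invert. Below is a Python/NumPy sketch (illustrative only; `shrink_log` is my name, and the discriminant is clipped at zero since it is negative only where the output is zero anyway):

```python
import numpy as np

def shrink_log(y, lam, a):
    """Shrinkage s2(y) for the log penalty (Equation 25); assumes 0 < a < 1/lam."""
    y = np.asarray(y, dtype=float)
    b = a * np.abs(y) - 1.0
    disc = np.maximum(b**2 - 4.0 * a * (lam - np.abs(y)), 0.0)  # negative only when |y| < lam
    x = (b + np.sqrt(disc)) / (2.0 * a)
    return np.where(np.abs(y) >= lam, x * np.sign(y), 0.0)

lam, a = 2.0, 0.25          # a < 1/lam, so s2 is continuous (see below)
x = shrink_log(6.0, lam, a)
print(x + lam / (1.0 + a * x))   # recovers y = 6, verifying Equation 21
```

Substituting the output back into $x + \varphi_2'(x)$ recovers $y$, confirming that Equation 25 inverts Equation 21.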

How does $a$ influence the shape of the shrinkage function $s_2(y)$?

First, we note that some values of $a$ lead to a complication in the derivation of the shrinkage function. In particular, when $a$ is too large, the graphs in Figure 6 take on different characteristics. For example, with $\lambda = 2$ and $a = 3$, the new graphs are shown in Figure 7. Note that the graph of $x + \varphi_2'(x)$ in Figure 7b is not a strictly increasing function of $x$. The graph shown in Figure 7c does not define a function of $y$: there is not a unique $x$ for each $y$.

Inspection of $F(x)$ reveals that two of the three values of $x$ on the graph (where the graph provides non-unique $x$) are only local maxima and minima of $F(x)$. Only one of the three values is the global minimizer of $F(x)$. Taking into consideration the value of $F(x)$ so as to identify the minimizing $x$ as a function of $y$, we obtain the shrinkage function illustrated in Figure 7. (The darker line in the graph represents the shrinkage function.)

The shrinkage function in Figure 7 is discontinuous. Some frequently used shrinkage functions are discontinuous like this (for example, the hard threshold function). However, a discontinuous shrinkage function is sometimes considered undesirable: a small change in $y$ can produce a large change in $s(y)$, so such a shrinkage function is sensitive to small changes in the data $y$.

To ensure the shrinkage function $s_2(y)$ is continuous, how should $a$ be chosen?

Note that the discontinuity of the shrinkage function is due to $x + \varphi'(x)$ not being a strictly increasing function of $x$. To avoid the discontinuity, $x + \varphi'(x)$ should be increasing, i.e.

$$\frac{d}{dx}\left(x + \varphi'(x)\right) > 0 \quad \text{for all } x > 0.$$
(26)

Equivalently, the penalty function $\varphi(x)$ should satisfy

$$\varphi''(x) > -1 \quad \text{for all } x > 0.$$
(27)

Note that, in Example 2,

$$\varphi_2''(x) = \frac{-\lambda a}{(1 + ax)^2}, \quad x > 0$$
(28)

which is most negative at $x = 0^+$, at which point we have

$$\varphi_2''(0^+) = -\lambda a.$$
(29)

From Equation 27 and Equation 29, we obtain the condition $-\lambda a > -1$, i.e., $a < 1/\lambda$. Recalling that $a > 0$, we obtain the range

$$0 < a < \frac{1}{\lambda}.$$
(30)

Regarding the sensitivity of the shrinkage function, note that the maximum value of the derivative of the shrinkage function $s_2(y)$ occurs at $y = \lambda$. Let us calculate $s_2'(\lambda^+)$. It can be found in two ways: by differentiating the shrinkage function, or indirectly as the reciprocal of the right-sided derivative of $x + \varphi'(x)$ at $x = 0$. For Example 2, we have

$$s_2'(\lambda^+) = \frac{1}{1 - \lambda a}$$
(31)

so $a$ can be set as

$$a = \frac{1}{\lambda}\left(1 - \frac{1}{s_2'(\lambda^+)}\right)$$
(32)

where $s_2'(\lambda^+)$ is the maximum slope of the shrinkage function.

The penalty function and shrinkage function for several values of $a$ are shown in Figure 8. In the limit as $a$ approaches zero, the shrinkage function $s_2(y)$ approaches the soft threshold rule, which has a slope equal to 1 for all $|y| > \lambda$.

## From Shrinkage to Penalty Functions

In the preceding section, it was shown how the shrinkage function $s(y)$ can be derived from the penalty function $\varphi(x)$. In this section, the reverse problem is considered. Given a shrinkage function $s(y)$, how can the corresponding penalty function $\varphi(x)$ be derived? For example, consider the shrinkage function in Example 3.

Example 3: Define the shrinkage function

$$s_3(y) = \begin{cases} y - \dfrac{\lambda^2}{y}, & |y| \ge \lambda \\[6pt] 0, & |y| \le \lambda \end{cases}$$
(33)

as illustrated in Figure 9. This shrinkage function is sometimes referred to as the 'nonnegative garrote' [7], [9], [2]. The graph of the shrinkage function is similar to that in Figure 6c; however, the formula is simpler; compare Equation 33 with Equation 25.
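A quick Python/NumPy sketch of Equation 33 (illustrative; `shrink_garrote` is my name). Note that the garrote is continuous at the threshold, since $s_3(\lambda) = \lambda - \lambda^2/\lambda = 0$:

```python
import numpy as np

def shrink_garrote(y, lam):
    """Nonnegative-garrote shrinkage s3(y) (Equation 33)."""
    y = np.asarray(y, dtype=float)
    out = np.zeros_like(y)
    big = np.abs(y) >= lam
    out[big] = y[big] - lam**2 / y[big]
    return out

# With lam = 2: s3(4) = 4 - 4/4 = 3; values with |y| <= 2 map to zero.
print(shrink_garrote(np.array([-3.0, 1.5, 2.0, 4.0]), 2.0))
```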

How can we find a penalty function $\varphi(x)$ corresponding to the shrinkage function $s(y)$?

From Equation 12 we have $y = x + \varphi'(x)$ where $x = s(y)$. Hence, $\varphi'(x)$ can be found by solving $x = s(y)$ for $y$ and equating with $x + \varphi'(x)$. For the shrinkage function in Equation 33, for $y \ge \lambda$,

$$x = y - \frac{\lambda^2}{y}$$
(34)
$$y^2 - xy - \lambda^2 = 0.$$
(35)

$$y = \frac{x}{2} + \frac{1}{2}\sqrt{x^2 + 4\lambda^2}.$$
(36)

Equating Equation 36 with x+φ'(x)x+φ'(x) gives

$$\frac{x}{2} + \frac{1}{2}\sqrt{x^2 + 4\lambda^2} = x + \varphi'(x)$$
(37)

or

$$\varphi'(x) = \frac{1}{2}\left(\sqrt{x^2 + 4\lambda^2} - x\right), \quad x > 0.$$
(38)

A complete formula for $\varphi'(x)$ is

$$\varphi_3'(x) = \frac{1}{2}\left(\sqrt{x^2 + 4\lambda^2} - |x|\right)\mathrm{sign}(x),$$
(39)

as illustrated in Figure 9b.

For large $x$, the expression for $\varphi'(x)$ in Equation 39 involves the difference between two large numbers, which can be numerically inaccurate. The numerical calculation of $\varphi'(x)$ for large $x$ can be made more accurate, and its asymptotic behavior made more clear, using the identity

$$\left(\sqrt{x^2 + 4\lambda^2} - x\right)\left(\sqrt{x^2 + 4\lambda^2} + x\right) = 4\lambda^2$$
(40)

which is verified by multiplying the terms on the left-hand side. Using Equation 40, the expression for $\varphi'(x)$ in Equation 39 can be written as

$$\varphi_3'(x) = \frac{2\lambda^2}{\sqrt{x^2 + 4\lambda^2} + |x|}\,\mathrm{sign}(x).$$
(41)

From Equation 41, it can be seen that $\varphi_3'(x) \approx \lambda^2/x$ for large $x$.
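The advantage of the rationalized form can be seen numerically. A Python sketch (the function names are mine): the naive form of Equation 39 subtracts two nearly equal quantities for large $x$, while Equation 41 does not.

```python
import numpy as np

lam = 2.0

def dphi3_naive(x):
    """phi3'(x) via Equation 39: difference of two nearly equal terms for large x."""
    return 0.5 * (np.sqrt(x**2 + 4.0 * lam**2) - np.abs(x)) * np.sign(x)

def dphi3_stable(x):
    """phi3'(x) via the rationalized Equation 41: no cancellation."""
    return 2.0 * lam**2 / (np.sqrt(x**2 + 4.0 * lam**2) + np.abs(x)) * np.sign(x)

print(dphi3_naive(10.0), dphi3_stable(10.0))  # the two forms agree at moderate x
print(dphi3_stable(1e12))                     # ~ lam^2 / x = 4e-12, as Equation 41 predicts
```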

The penalty function $\varphi(x)$ can be obtained by integration,

$$\varphi(x) = \int_0^{|x|} \varphi'(\alpha)\, d\alpha$$
(42)

which yields

$$\varphi_3(x) = \lambda^2 \,\mathrm{asinh}\!\left(\frac{|x|}{2\lambda}\right) + \frac{1}{4}|x|\left(\sqrt{x^2 + 4\lambda^2} - |x|\right) = \lambda^2 \,\mathrm{asinh}\!\left(\frac{|x|}{2\lambda}\right) + \frac{\lambda^2 |x|}{\sqrt{x^2 + 4\lambda^2} + |x|}$$
(43)

illustrated in Figure 9c.

### Comparison

Figure 10 shows the penalty functions and shrinkage functions for each of the three examples on the same axes. For each example, the parameter $\lambda$ is set to 2; hence each shrinkage function exhibits a threshold value of 2. For the log penalty function (Example 2), the parameter $a$ is set so that the (right-sided) derivative of the shrinkage function at $y = 2$ is equal to the slope of the n-garrote shrinkage function at that point. Namely, both the log and n-garrote shrinkage functions have a (right-sided) derivative of 2 at $y = 2$, which is emphasized by the dashed line with slope 2. It can be seen that for this value of $a$ (i.e., $a = 0.25$), the log shrinkage function is quite similar to the n-garrote shrinkage function, but it attenuates large $y$ slightly more than the n-garrote shrinkage function.

## Vector Case

Consider the vector case,

$$F(x) = \frac{1}{2}\|y - Ax\|_2^2 + \sum_n \varphi(x(n))$$
(44)

where

$$x = [x(1), \dots, x(N)]^T$$
(45)
$$y = [y(1), \dots, y(M)]^T$$
(46)

and $A$ is a matrix of size $M \times N$.

The function $F(x)$ cannot be easily minimized unless the penalty function $\varphi(x)$ is quadratic. When $\varphi(x)$ is quadratic, the derivative of $F(x)$ with respect to $x$ gives linear equations, and the minimizer $x$ can be obtained by solving a linear system. However, when $\varphi(x)$ is quadratic, the solution $x$ will generally not be sparse; hence other penalty functions are preferred for sparse signal processing, as described in the preceding sections.

To minimize $F(x)$, an iterative numerical algorithm must be used. One approach to derive suitable algorithms is based on the majorization-minimization (MM) method from optimization theory [6]. Instead of minimizing $F(x)$ directly, the MM method solves a sequence of simpler minimization problems

$$x^{k+1} = \arg\min_x G_k(x)$$
(47)

where $k$ is the iteration counter, $k = 0, 1, 2, \dots$. With initialization $x^0$, the expression in Equation 47 is a rule for updating $x^k$. The MM method asks that each function $G_k(x)$ be a majorizer (upper bound) of $F(x)$ and that it coincide with $F(x)$ at $x = x^k$. That is,

$$G_k(x) \ge F(x) \;\text{ for all } x, \qquad G_k(x^k) = F(x^k).$$
(48)

To apply the MM method to minimize $F(x)$ in Equation 44, one may majorize either the data fidelity term $\|y - Ax\|_2^2$ (as in ISTA [8], [5], FISTA [3], etc.), or the penalty term $\sum_n \varphi(x(n))$, or both terms. In the following, we majorize just the penalty term.

To majorize the penalty term, it is useful to consider the scalar case first. Specifically, we first find a majorizer for $\varphi(x)$ with $x \in \mathbb{R}$. The majorizer $g(x)$ should be an upper bound for $\varphi(x)$ that coincides with $\varphi(x)$ at a specified point $v \in \mathbb{R}$, as shown in Figure 11. That is,

$$g(x) \ge \varphi(x) \;\text{ for all } x \in \mathbb{R}, \qquad g(v) = \varphi(v) \;\text{ where } v \text{ is a specified point.}$$
(49)

We will find a quadratic majorizer $g(x)$ of the form

$$g(x) = m x^2 + b.$$
(50)

For this quadratic majorizer, the conditions in Equation 49 are equivalent to:

$$g(v) = \varphi(v), \qquad g'(v) = \varphi'(v).$$
(51)

That is, $g(x)$ and its derivative should agree with $\varphi(x)$ at $x = v$. Using Equation 50, these two conditions can be written as

$$m v^2 + b = \varphi(v), \qquad 2 m v = \varphi'(v).$$
(52)

Solving for $m$ and $b$ gives

$$m = \frac{\varphi'(v)}{2v}, \qquad b = \varphi(v) - \frac{v}{2}\varphi'(v).$$
(53)

The majorizer $g(x)$ in Equation 50 is therefore given by

$$g(x) = \frac{\varphi'(v)}{2v} x^2 + \varphi(v) - \frac{v}{2}\varphi'(v).$$
(54)

With this function $g(x)$ we have $g(x) \ge \varphi(x)$ with equality at $x = v$, as illustrated in Figure 11. To emphasize the dependence of $g(x)$ on the point $v$, we write

$$g(x, v) = \frac{\varphi'(v)}{2v} x^2 + \varphi(v) - \frac{v}{2}\varphi'(v).$$
(55)
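As a sanity check of Equation 55, the sketch below (Python/NumPy, illustrative; names are mine) builds the quadratic majorizer for $\varphi_1(x) = \lambda|x|$ at a point $v$ and verifies that it upper-bounds $\varphi_1$ and touches it at $x = v$:

```python
import numpy as np

lam = 2.0
phi  = lambda x: lam * np.abs(x)      # phi_1(x) = lam |x|
dphi = lambda x: lam * np.sign(x)     # phi_1'(x)

def g(x, v):
    """Quadratic majorizer of phi at the point v (Equation 55); requires v != 0."""
    return dphi(v) / (2.0 * v) * x**2 + phi(v) - 0.5 * v * dphi(v)

v = 1.5
x = np.linspace(-5.0, 5.0, 1001)
print(bool(np.all(g(x, v) >= phi(x) - 1e-12)))   # True: g majorizes phi
print(bool(np.isclose(g(v, v), phi(v))))         # True: g touches phi at x = v
```

For this penalty, $g(x, v) = \lambda x^2 / (2|v|) + \lambda|v|/2$, which dominates $\lambda|x|$ by the AM-GM inequality, with equality exactly at $|x| = |v|$.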

The scalar majorizer can be used to obtain a majorizer for $F(x)$ in Equation 44. If $x$ and $v$ denote vectors

$$x = [x(1), \dots, x(N)]^T$$
(56)
$$v = [v(1), \dots, v(N)]^T$$
(57)

then

$$\sum_n g(x(n), v(n)) \ge \sum_n \varphi(x(n))$$
(58)

with equality if $x = v$. That is, the left-hand side of Equation 58 is a majorizer for $\sum_n \varphi(x(n))$.

Note that the left-hand side of Equation 58 can be written compactly as

$$\sum_n g(x(n), v(n)) = \frac{1}{2} x^T W x + c$$
(59)

where

$$W = \mathrm{diag}\!\left(\frac{\varphi'(v(1))}{v(1)}, \dots, \frac{\varphi'(v(N))}{v(N)}\right)$$
(60)

and

$$c = \sum_n \left[\varphi(v(n)) - \frac{v(n)}{2}\varphi'(v(n))\right].$$
(61)

As an abbreviation, we denote the diagonal matrix $W$ as

$$W = \mathrm{diag}(\varphi'(v)./v)$$
(62)

where the notation `./` denotes component-wise division.

Therefore, using Equation 58, a majorizer for $F(x)$ is given by

$$G(x, v) = \frac{1}{2}\|y - Ax\|_2^2 + \frac{1}{2} x^T W x + c, \quad \text{where } W = \mathrm{diag}(\varphi'(v)./v).$$
(63)

Minimizing $G(x, v)$ with respect to $x$ gives

$$x = (A^T A + W)^{-1} A^T y$$
(64)

where $W$ depends on $v$.

Therefore, the MM update in Equation 47 with $G_k(x) = G(x, x^k)$ produces the sequence

$$x^{k+1} = (A^T A + W)^{-1} A^T y, \quad \text{where } W = \mathrm{diag}(\varphi'(x^k)./x^k).$$
(65)

Note that if components of $x$ go to zero, the corresponding entries of $W$ go to infinity. Hence, the update in Equation 65 may become numerically inaccurate. Moreover, because the solution $x$ is expected to be sparse, components of $x^k$ will go to zero as the iteration progresses. This problem is avoided, as described in Ref. [6], by using the matrix inverse lemma to write

$$(A^T A + W)^{-1} = W^{-1} - W^{-1} A^T (I + A W^{-1} A^T)^{-1} A W^{-1}.$$
(66)

Note that as components of $x^k$ go to zero, the corresponding entries of $W^{-1}$ go to zero instead of to infinity.

For convenience, we define

$$\Lambda := W^{-1} = \mathrm{diag}(x^k ./ \varphi'(x^k)).$$
(67)

Using Equation 66, the quadratic MM update in Equation 65 is given by

$$x^{k+1} = \Lambda A^T y - \Lambda A^T (I + A \Lambda A^T)^{-1} A \Lambda A^T y.$$
(68)

An implementation of the quadratic MM algorithm based on this update equation is:

```
x = y                                       (initialization)
g = Aᵀy
repeat
    Λ(n,n) = x(n) / φ'(x(n)),   n = 1, …, N
    F = I + A Λ Aᵀ
    x = Λ (g − Aᵀ F⁻¹ A Λ g)
until convergence
```
(69)

A MATLAB program for this algorithm is given in "MATLAB Program". This iterative algorithm may be considered a form of the FOCUSS algorithm [12].

Note that if $A$ is banded, then the matrix $I + A\Lambda A^T$ is banded, and fast algorithms for banded systems can be used to implement the algorithm.
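The update of Equation 68 can be prototyped directly. The following Python/NumPy sketch (dense matrices for clarity, rather than the banded solvers just mentioned; the function name `quad_mm` and the small test problem are mine) runs the iteration for $\varphi_1(x) = \lambda|x|$, for which $x/\varphi'(x) = |x|/\lambda$, and checks that the cost function is non-increasing, as the MM construction guarantees:

```python
import numpy as np

def quad_mm(y, A, lam, Nit=50):
    """Quadratic MM iteration (Equation 68) for phi(x) = lam*|x|."""
    M, N = A.shape
    g = A.T @ y
    x = g.copy()                          # initialization
    cost = []
    for _ in range(Nit):
        lam_diag = np.abs(x) / lam        # entries of Lam: x / phi'(x) = |x| / lam
        ALam = A * lam_diag               # A @ diag(lam_diag)
        F = np.eye(M) + ALam @ A.T        # F = I + A Lam A^T
        x = lam_diag * (g - A.T @ np.linalg.solve(F, ALam @ g))
        cost.append(0.5 * np.sum((y - A @ x)**2) + lam * np.sum(np.abs(x)))
    return x, np.array(cost)

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 30))
x_true = np.zeros(30)
x_true[[3, 11]] = [2.0, -1.5]
y = A @ x_true + 0.01 * rng.standard_normal(20)
x, cost = quad_mm(y, A, lam=0.5)
print(bool(np.all(np.diff(cost) <= 1e-9)))   # True: MM descent
```

Because the weights enter only through $x/\varphi'(x) = |x|/\lambda$, no division by $x$ occurs, mirroring the point made in the examples below.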

Note that the penalty function $\varphi(\cdot)$ arises in the update equations only through the entries of $\Lambda$, via the expression $x/\varphi'(x)$. Hence, it is informative to analyze and simplify the function $x/\varphi'(x)$.

Example 1: The function $x/\varphi_1'(x)$ is given by

$$\frac{x}{\varphi_1'(x)} = \frac{x}{\lambda\,\mathrm{sign}(x)}$$
(70)

or more simply as

$$\frac{x}{\varphi_1'(x)} = \frac{1}{\lambda}|x|.$$
(71)

Notice that using Equation 70 leads to a numerical issue when $x = 0$ because the function $\mathrm{sign}(x)$ is undefined at $x = 0$. On the other hand, the simplified form in Equation 71 is well defined for all $x$.

Example 2:

$$\frac{x}{\varphi_2'(x)} = \frac{x}{\dfrac{\lambda}{1 + a|x|}\,\mathrm{sign}(x)}$$
(72)

or more simply as

$$\frac{x}{\varphi_2'(x)} = \frac{1}{\lambda}|x|\left(1 + a|x|\right).$$
(73)

For large $x$, we have $x/\varphi_2'(x) \approx a x^2 / \lambda$.

Example 3:

$$\frac{x}{\varphi_3'(x)} = \frac{1}{2\lambda^2}|x|\left(\sqrt{x^2 + 4\lambda^2} + |x|\right).$$
(74)

As for Example 2, for large $x$ we have $x/\varphi_3'(x) \approx x^2/\lambda^2$.

For these three penalty functions, it can be seen that if $x^k(m)$ becomes equal to zero at some iteration $k$ for some $1 \le m \le N$, then it remains zero for all remaining iterations, because $x(m)/\varphi'(x(m))$ will equal zero. This would appear to interfere with convergence of the algorithm to the minimizer of Equation 6. However, detailed investigation in [11] suggests that this phenomenon does not inhibit the reliable operation of the iterative algorithm.

## Optimality Conditions

When the penalty function $\varphi(x)$ is convex, the minimizer of the cost function Equation 6 must satisfy specific conditions [1]. These conditions can be used to verify the optimality of a solution produced by a numerical algorithm. When the penalty function $\varphi(x)$ is not convex, the conditions can be used to determine if a local minimum is attained.

If $\varphi(x)$ is convex and differentiable, and if $x$ minimizes Equation 6, then $x$ must satisfy:

$$[A^T(y - Ax)]_n = \varphi'(x(n)), \quad n = 1, \dots, N$$
(75)

where $[A^T(y - Ax)]_n$ denotes the $n$-th component of the vector $A^T(y - Ax)$. However, this condition cannot be applied directly when $\varphi(x)$ is not differentiable, as is the case for the three example penalty functions in the preceding sections.

If $\varphi(x)$ is convex but not differentiable, then the optimality condition is given by

$$[A^T(y - Ax)]_n \in \partial\varphi(x(n)), \quad n = 1, \dots, N$$
(76)

where $\partial\varphi(\cdot)$ is the subdifferential, a set-valued generalization of the derivative [1]. For Example 1, with $\varphi_1(x) = \lambda|x|$, the subdifferential is given by

$$\partial\varphi_1(x) = \begin{cases} \{\lambda\}, & x > 0 \\ [-\lambda, \lambda], & x = 0 \\ \{-\lambda\}, & x < 0. \end{cases}$$
(77)

In this case, the condition Equation 76 is given by

$$[A^T(y - Ax)]_n \begin{cases} = \lambda, & x(n) > 0 \\ \in [-\lambda, \lambda], & x(n) = 0 \\ = -\lambda, & x(n) < 0 \end{cases}$$
(78)

as illustrated in Figure 13 below.
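In the scalar case ($A = 1$, so that Equation 6 reduces to Equation 7), the condition of Equation 78 reads $y - x \in \partial\varphi_1(x)$. A small Python check that the soft threshold function satisfies it (illustrative; the naming is mine):

```python
import numpy as np

lam = 2.0
soft = lambda y: float(np.maximum(abs(y) - lam, 0.0) * np.sign(y))   # s1(y), Equation 20

for y in [-5.0, -1.0, 0.0, 1.7, 6.0]:
    x = soft(y)
    r = y - x                        # plays the role of [A^T(y - Ax)]_n with A = 1
    if x > 0:
        assert np.isclose(r, lam)    # r must equal  lam where x(n) > 0
    elif x < 0:
        assert np.isclose(r, -lam)   # r must equal -lam where x(n) < 0
    else:
        assert abs(r) <= lam         # r must lie in [-lam, lam] where x(n) = 0
print("Equation 78 satisfied at every test point")
```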

### Setting the threshold

The condition Equation 76 can sometimes be used as a guideline to set the threshold value for the penalty function $\varphi(x)$. Suppose the data $y$ is modeled as $Ax$, where $x$ is sparse, plus additive noise $w$,

$$y = Ax + w.$$
(79)

Note that if $x = 0$, then $y$ consists of noise only (i.e., $y = w$). Then Equation 76 gives

$$[A^T w]_n \in \partial\varphi(0).$$
(80)

For $\varphi_1(x) = \lambda|x|$, that is

$$-\lambda \le [A^T w]_n \le \lambda, \quad n = 1, \dots, N.$$
(81)

This implies that $\lambda$ should be chosen so that $\lambda \ge \max_n |[A^T w]_n|$, where $w$ is the additive noise. On the other hand, the larger $\lambda$ is, the more the signal $x$ will be attenuated. (If $\lambda$ is too large, then the solution to Equation 6 will be $x = 0$.) Hence, it is reasonable to set $\lambda$ to the smallest value satisfying Equation 81, namely,

$$\lambda = \max_n \left|[A^T w]_n\right|.$$
(82)

Although Equation 82 assumes availability of the noise signal $w$, which is not realistic in practice, Equation 82 can often be estimated based on knowledge of the statistics of the noise $w$. For example, if the noise $w$ is white Gaussian, then the '$3\sigma$' rule can be used to estimate the maximum value of $|A^T w|$.

## Deconvolution Example

The deconvolution problem is one of finding the input signal $x(n)$ to a linear time-invariant system when the output signal $y(n)$ and the impulse response of the system $h(n)$ are both known. The problem is sometimes ill-conditioned, depending on $h(n)$, and the output signal is usually noisy. When it is known that the input signal is sparse, a suitable approach for estimating the input signal is to solve the optimization problem Equation 6 where the penalty function is chosen so as to promote sparsity, for example, any of the three penalty functions in the preceding sections.

The noisy output signal $y(n)$ is given by

$$y(n) = (h * x)(n) + w(n), \quad w(n) \sim \mathcal{N}(0, \sigma^2)$$
(83)

where $*$ denotes convolution. This can be written as

$$y = Hx + w$$
(84)

where $H$ is the appropriately defined convolution matrix. If the impulse response $h(n)$ is finite in length, then the matrix $H$ will be banded. Most of the entries of a banded matrix are zero, so it can be stored with less memory; also, linear algebra operations, like matrix-vector multiplication and Gaussian elimination for solving linear systems, are fast for banded matrices compared with full (dense) matrices.

A solution to the sparse deconvolution problem can be obtained using a specified penalty function $\varphi(x)$,

$$x_\varphi = \arg\min_x \; \frac{1}{2}\|y - Hx\|_2^2 + \sum_n \varphi(x(n)).$$
(85)

In the preceding examples, $\varphi(x)$ has a parameter $\lambda$ that determines the threshold of the shrinkage function.

How should $\lambda$ be set? Combining Equation 82 with the '3-sigma' rule gives

$$\lambda = \max_n \left|[H^T w]_n\right| \approx 3\,\mathrm{std}(H^T w)$$
(86)

where $\mathrm{std}(\cdot)$ denotes the standard deviation. If the noise $w(n)$ is zero-mean white Gaussian with variance $\sigma^2$, then the standard deviation of $H^T w$ is given by

$$\mathrm{std}(H^T w) = \sigma \sqrt{\sum_n |h(n)|^2} = \sigma \|h\|_2.$$
(87)

Hence, we get a 'rule of thumb':

$$\lambda \approx 3\sigma \sqrt{\sum_n |h(n)|^2} = 3\sigma \|h\|_2.$$
(88)
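For the 4-point moving-average impulse response and noise level used in the example below, this rule of thumb evaluates as follows (a Python sketch; the values are taken from the example):

```python
import numpy as np

h = np.ones(4) / 4.0                      # impulse response: h(n) = 1/4, n = 0,...,3
sigma = 0.03                              # AWGN standard deviation
lam = 3.0 * sigma * np.linalg.norm(h)     # Equation 88: lam = 3*sigma*||h||_2
print(lam)                                # 0.045  (since ||h||_2 = 0.5)
```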

Figure 12a shows a sparse signal $x(n)$. This signal is convolved with the 4-point impulse response

$$h(n) = \begin{cases} \frac{1}{4}, & n = 0, 1, 2, 3 \\ 0, & \text{otherwise} \end{cases}$$
(89)

and white Gaussian noise $w(n)$ is added to obtain the observed signal $y(n)$ illustrated in Figure 12b. The impulse response has $\|h\|_2 = 0.5$. The standard deviation of the AWGN is $\sigma = 0.03$. Using Equation 88, we set $\lambda = 0.045$. Using the penalty functions $\varphi_1(x) = \lambda|x|$ (i.e., the $\ell_1$ norm solution) and $\varphi_3(x)$ in Equation 43 (i.e., the n-garrote solution), and using the quadratic MM algorithm, we obtain the two sparse solutions shown in Figure 12c,d. Both recover the sparse input signal quite accurately.

It is interesting to compare the sparse solution with the least square solution,

$$x_{\mathrm{LS}} = \arg\min_x \; \frac{1}{2}\|y - Hx\|_2^2 + \lambda \|x\|_2^2$$
(90)

which is shown in Figure 12e. Note that the least square solution is unable to recover the sparse property of the input signal, and unable to suppress the noise as effectively as the sparse solution.

The optimality of the two sparse solutions is verified in Figure 13. For the n-garrote solution, which corresponds to a non-convex penalty function, the solution might not be a global minimizer of Equation 85, so it is desirable to verify that the solution is a local minimum. The graphs in Figure 13 show $[H^T(y - Hx)]_n$ versus $x(n)$ as scatter plots for each of the two sparse solutions. Overlaid on each graph is the function $\varphi'(x)$. In each case, the points of the scatter plot coincide with $\varphi'(x)$, which for a convex penalty function indicates the optimality of the solution.

To compare the $\ell_1$ norm and n-garrote solutions in more detail, Figure 14 shows both solutions on the same axes. The original sparse signal is also shown. For clarity, only the non-zero values of each signal are displayed. It can be seen that the $\ell_1$ norm solution is slightly attenuated compared with the n-garrote solution. This is to be expected: recall that the shrinkage function corresponding to the $\ell_1$ norm penalty is the soft threshold function, which subtracts $\lambda$ from every value greater than the threshold; hence this penalty function has the effect of attenuating the large signal values. In contrast, the n-garrote penalty function attenuates large values much less. The difference between the two solutions is due to the different behavior of the two penalty functions $\varphi_1(x)$ and $\varphi_3(x)$: the n-garrote penalty function $\varphi_3(x)$ increases more slowly than the $\ell_1$ norm penalty function $\varphi_1(x)$ (see Figure 10).

If further adjustment of the penalty function is desired, one may use the log penalty function $\varphi_2(x)$ in Equation 10 and adjust the parameter $a$, or use other penalty functions from the literature.

A MATLAB program for implementing the quadratic MM algorithm for sparse deconvolution is given in "MATLAB Program".

## MATLAB Program

```matlab
function [x, cost] = deconv_MM(y, phi, wfun, H, Nit)
% [x, cost] = deconv_MM(y, phi, wfun, H, Nit)
% Sparse deconvolution
% Cost function : 0.5 * sum(abs(y-H(x)).^2) + sum(phi(x));
%
% INPUT
%   y - noisy data
%   H - banded convolution matrix [sparse data structure]
%   phi - cost function
%   wfun - function x / phi'(x) - needed for quadratic MM
%   Nit - number of iterations
%
% OUTPUT
%   x - deconvolved sparse signal
%   cost - cost function history

% Reference: Penalty and Shrinkage Functions for Sparse Signal Processing
% Ivan Selesnick, selesi@poly.edu, 2012
% http://eeweb.poly.edu/iselesni/lecture_notes/

% Algorithm: majorization-minimization with banded system solver.

y = y(:);                                            % convert to column vector
cost = zeros(1, Nit);                                % cost function history
[M, N] = size(H);
x = y;                                               % initialization
g = H'*y;                                            % H'*y
for k = 1:Nit
    Lam = spdiags(wfun(x), 0, N, N);                 % Lam : diagonal matrix
    F = speye(M) + H*Lam*H';                         % F : banded matrix
    x = Lam * (g - (H'*(F\(H*(Lam*g)))));            % update x (solve banded system)
    cost(k) = 0.5*sum(abs(y-H*x).^2) + sum(phi(x));  % cost function value
end
```

## References

1. Bach, F. and Jenatton, R. and Mairal, J. and Obozinski, G. (2012). Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1), 1-106.
2. Breiman, L. (1995). Better subset selection using the nonnegative garrote. Technometr., 37(4), 373-384.
3. Beck, A. and Teboulle, M. (2009). A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM J. Imag. Sci., 2(1), 183-202.
4. Combettes, P. L. and Pesquet, J.-C. (2010). Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer-Verlag (New York).
5. Daubechies, I. and Defrise, M. and De Mol, C. (2004). An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math., LVII, 1413-1457.
6. Figueiredo, M. and Bioucas-Dias, J. and Nowak, R. (2007, December). Majorization-minimization algorithms for wavelet-based image restoration. IEEE Trans. Image Process., 16(12), 2980-2991.
7. Figueiredo, M. A. T. and Nowak, R. D. (2001, September). Wavelet-Based Image Estimation: An Empirical Bayes Approach Using Jeffrey's Noninformative Prior. IEEE Trans. Image Process., 10(9), 1322-1331.
8. Figueiredo, M. and Nowak, R. (2003, August). An EM algorithm for wavelet-based image restoration. IEEE Trans. Image Process., 12(8), 906-916.
9. Gao, H. (1998). Wavelet shrinkage denoising using the nonnegative garrote. J. Comput. Graph. Statist., 7, 469-488.
10. Hyvärinen, A. (1999). Sparse Code Shrinkage: Denoising of Non-Gaussian Data by Maximum Likelihood Estimation. Neural Computation, 11, 1739-1768.
11. Oliveira, J. and Bioucas-Dias, J. and Figueiredo, M. A. T. (2009, September). Adaptive total variation image deblurring: A majorization-minimization approach. Signal Processing, 89(9), 1683-1693.
12. Rao, B. D. and Engan, K. and Cotter, S. F. and Palmer, J. and Kreutz-Delgado, K. (2003, March). Subset selection in noise based on diversity measure minimization. IEEE Trans. Signal Process., 51(3), 760-770.
13. Vidakovic, B. (1999). Statistical Modeling by Wavelets. John Wiley & Sons.
