Skip to content Skip to navigation

Connexions

You are here: Home » Content » Maximum Likelihood Estimation

Navigation

Content Actions

  • Download module PDF
  • Add to ...
    Add the module to:
    • My Favorites
    • A lens
    • An external social bookmarking service
    • My Favorites (What is 'My Favorites'?)
      'My Favorites' is a special kind of lens which you can use to bookmark modules and collections directly in Connexions. 'My Favorites' can only be seen by you, and collections saved in 'My Favorites' can remember the last module you were on. You need a Connexions account to use 'My Favorites'.
    • A lens (What is a lens?)

      Definition of a lens

      Lenses

      A lens is a custom view of Connexions content. You can think of it as a fancy kind of list that will let you see Connexions through the eyes of organizations and people you trust.

      What is in a lens?

      Lens makers point to Connexions materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

      Who can create a lens?

      Any individual Connexions member, a community, or a respected organization.

    • External bookmarks
  • E-mail the author

Recently Viewed

This feature requires Javascript to be enabled.

Maximum Likelihood Estimation

Module by: Robert Nowak

In the last lecture we derived a risk (MSE) bound for regression problems; i.e., select an fFfF so that E[(f(X)-Y)2]-E[(f*(X)-Y)2]E[(f(X)-Y)2]-E[(f*(X)-Y)2] is small, where f*(x)=E[Y|X=x]f*(x)=E[Y|X=x]. The result is summarized below.

Theorem 1 (Complexity Regularization with Squared Error Loss) Let X=RdX=Rd, Y=[-b/2,b/2]Y=[-b/2,b/2], {Xi,Yi}i=1n{Xi,Yi}i=1n iid, PXYPXY unknown, FF = {collection of candidate functions},

f : R d Y , R ( f ) = E [ ( f ( X ) - Y ) 2 ] . f : R d Y , R ( f ) = E [ ( f ( X ) - Y ) 2 ] . (1)

Let c(f)c(f), fFfF, be positive numbers satisfying fF2-c(f)1fF2-c(f)1, and select a function from FF according to

f ^ n = arg min R ^ n ( f ) + 1 ϵ c ( f ) log 2 n , f ^ n = arg min R ^ n ( f ) + 1 ϵ c ( f ) log 2 n , (2)

with ϵ35b2ϵ35b2 and R^n(f)=1ni=1n(f(Xi)-Yi)2R^n(f)=1ni=1n(f(Xi)-Yi)2. Then,

E [ R ( f ^ n ) ] - R ( f * ) 1 + α 1 - α min f F R ( f ) - R ( f * ) + 1 ϵ c ( f ) log 2 n + O ( n - 1 ) E [ R ( f ^ n ) ] - R ( f * ) 1 + α 1 - α min f F R ( f ) - R ( f * ) + 1 ϵ c ( f ) log 2 n + O ( n - 1 ) (3)

where α=ϵb21-2b2ϵ/3α=ϵb21-2b2ϵ/3

Maximum Likelihood Estimation

The focus of this lecture is to consider another approach to learning based on maximum likelihood estimation. Consider the classical signal plus noise model:

Y i = f i n + W i , i = 1 , , n Y i = f i n + W i , i = 1 , , n (4)

where WiWi are iid zero-mean noises. Furthermore, assume that WiP(w)WiP(w) for some known density P(w)P(w). Then

Y i P y - f i n P f i ( y ) Y i P y - f i n P f i ( y ) (5)

since Yi-fin=WiYi-fin=Wi.

A very common and useful loss function to consider is

R ^ n ( f ) = 1 n i = 1 n ( - log P f i ( Y i ) ) . R ^ n ( f ) = 1 n i = 1 n ( - log P f i ( Y i ) ) . (6)

Minimizing R^nR^n with respect to ff is equivalent to maximizing

1 n i = 1 n log P f i ( Y i ) 1 n i = 1 n log P f i ( Y i ) (7)

or

i = 1 n P f i ( Y i ) . i = 1 n P f i ( Y i ) . (8)

Thus, using the negative log-likelihood as a loss function leads to maximum likelihood estimation. If the WiWi are iid zero-mean Gaussian r.v.s then this is just the squared error loss we considered last time. If the WiWi are Laplacian distributed e.g. P(w)e-|w|P(w)e-|w|, then we obtain the absolute error, or L1L1, loss function. We can also handle non-additive models such as the Poisson model

Y i P y | f i / n = e - f ( i / n ) [ f ( i / n ) ] y y ! Y i P y | f i / n = e - f ( i / n ) [ f ( i / n ) ] y y ! (9)

In this case

- log P Y i | f i / n = f i / n - Y i log f i / n + constant - log P Y i | f i / n = f i / n - Y i log f i / n + constant (10)

which is a very different loss function, but quite appropriate for many imaging problems.

Before we investigate maximum likelihood estimation for model selection, let's review some of the basis concepts. Let ΘΘ denote a parameter space (e.g., Θ=RΘ=R), and assume we have observations

Y i i i d P θ * ( y ) , i = 1 , , n Y i i i d P θ * ( y ) , i = 1 , , n (11)

where θ*Θθ*Θ is a parameter determining the density of the {YiYi}. The ML estimator of θ*θ* is

θ ^ n = arg max θ Θ i = 1 n P θ ( Y i ) = arg max θ Θ i = 1 n log P θ ( Y i ) = arg min θ Θ i = 1 n - log P θ ( Y i ) . θ ^ n = arg max θ Θ i = 1 n P θ ( Y i ) = arg max θ Θ i = 1 n log P θ ( Y i ) = arg min θ Θ i = 1 n - log P θ ( Y i ) . (12)

θ^θ^ maximizes the expected log-likelihood. To see this, let's compare the expected log-likelihood of θ*θ* with any other θΘθΘ.

E [ log P θ * ( Y ) - log P θ ( Y ) ] = E log P θ * ( Y ) P θ ( Y ) = log P θ * ( y ) P θ ( y ) P θ * ( y ) d y = K ( P θ , P θ * ) the KL divergence 0 with equality iff P θ * = P θ . E [ log P θ * ( Y ) - log P θ ( Y ) ] = E log P θ * ( Y ) P θ ( Y ) = log P θ * ( y ) P θ ( y ) P θ * ( y ) d y = K ( P θ , P θ * ) the KL divergence 0 with equality iff P θ * = P θ . (13)

Why?

- E log P θ * ( y ) P θ ( y ) = E log P θ ( y ) P θ * ( y ) log E P θ ( y ) P θ * ( y ) = log P θ ( y ) d y = 0 K ( P θ , P θ * ) 0 - E log P θ * ( y ) P θ ( y ) = E log P θ ( y ) P θ * ( y ) log E P θ ( y ) P θ * ( y ) = log P θ ( y ) d y = 0 K ( P θ , P θ * ) 0 (14)

On the other hand, since θ^nθ^n maximizes the likelihood over θΘθΘ, we have

i = 1 n log P θ * ( Y i ) P θ ^ n ( Y i ) = i = 1 n log P θ * ( Y i ) - log P θ ^ n ( Y i ) 0 i = 1 n log P θ * ( Y i ) P θ ^ n ( Y i ) = i = 1 n log P θ * ( Y i ) - log P θ ^ n ( Y i ) 0 (15)

Therefore,

1 n i = 1 n log P θ * ( Y i ) P θ ^ n ( Y i ) - K ( P θ ^ n , P θ * ) + K ( P θ ^ n , P θ * ) 0 1 n i = 1 n log P θ * ( Y i ) P θ ^ n ( Y i ) - K ( P θ ^ n , P θ * ) + K ( P θ ^ n , P θ * ) 0 (16)

or re-arranging

K ( P θ ^ n , P θ * ) 1 n i = 1 n log P θ * ( Y i ) P θ ^ n ( Y i ) - K ( P θ ^ n , P θ * ) K ( P θ ^ n , P θ * ) 1 n i = 1 n log P θ * ( Y i ) P θ ^ n ( Y i ) - K ( P θ ^ n , P θ * ) (17)

Notice that the quantity

1 n i = 1 n log P θ * ( Y i ) P θ ( Y i ) 1 n i = 1 n log P θ * ( Y i ) P θ ( Y i ) (18)

is an empirical average whose mean is K(Pθ,Pθ*)K(Pθ,Pθ*). By the law of large numbers, for each θΘθΘ,

1 n i = 1 n log P θ * ( Y i ) P θ ( Y i ) - K ( P θ , P θ * ) a . s . 0 1 n i = 1 n log P θ * ( Y i ) P θ ( Y i ) - K ( P θ , P θ * ) a . s . 0 (19)

. If this also holds for the sequence {θ^n}{θ^n}, then we have

K ( P θ ^ n , P θ * ) 1 n log P θ * ( Y i ) P θ ^ n ( Y i ) - K ( P θ ^ n , P θ * ) 0 as n K ( P θ ^ n , P θ * ) 1 n log P θ * ( Y i ) P θ ^ n ( Y i ) - K ( P θ ^ n , P θ * ) 0 as n (20)

which implies that

P θ ^ n P θ * P θ ^ n P θ * (21)

which often implies that

θ ^ n θ * θ ^ n θ *