Conditional Expectation, Regression

Module by: Paul E. Pfeiffer

Summary: Conditional expectation, given a random vector, plays a fundamental role in much of modern probability theory. Various types of “conditioning” characterize some of the more important random sequences and processes. The notion of conditional independence is expressed in terms of conditional expectation. Conditional independence plays an essential role in the theory of Markov processes and in much of decision theory.


We first consider an elementary form of conditional expectation with respect to an event. Then we consider two highly intuitive special cases of conditional expectation, given a random variable. In examining these, we identify a fundamental property which provides the basis for a very general extension. We discover that conditional expectation is a random quantity. The basic property for conditional expectation and properties of ordinary expectation are used to obtain four fundamental properties which imply the “expectation-like” character of conditional expectation. An extension of the fundamental property leads directly to the solution of the regression problem which, in turn, gives an alternate interpretation of conditional expectation.

Conditioning by an event

If a conditioning event C occurs, we modify the original probabilities by introducing the conditional probability measure P(·|C). In making the change from

P(A) \quad \text{to} \quad P(A|C) = \frac{P(AC)}{P(C)}
(1)

we effectively do two things:

  • We limit the possible outcomes to event C
  • We “normalize” the probability mass by taking P(C) as the new unit

It seems reasonable to make a corresponding modification of mathematical expectation when the occurrence of event C is known. The expectation E[X] is the probability weighted average of the values taken on by X. Two possibilities for making the modification are suggested.

  • We could replace the prior probability measure P(·) with the conditional probability measure P(·|C) and take the weighted average with respect to these new weights.
  • We could continue to use the prior probability measure P(·) and modify the averaging process as follows:
    • Consider the values X(ω) for only those ω ∈ C. This may be done by using the random variable I_C X, which has value X(ω) for ω ∈ C and zero elsewhere. The expectation E[I_C X] is the probability weighted sum of those values taken on in C.
    • The weighted average is obtained by dividing by P(C).

These two approaches are equivalent. For a simple random variable X = \sum_{k=1}^{n} t_k I_{A_k} in canonical form

E[I_C X]/P(C) = \sum_{k=1}^{n} E[t_k I_C I_{A_k}]/P(C) = \sum_{k=1}^{n} t_k P(C A_k)/P(C) = \sum_{k=1}^{n} t_k P(A_k | C)
(2)

The final sum is expectation with respect to the conditional probability measure. Arguments using basic theorems on expectation and the approximation of general random variables by simple random variables allow an extension to a general random variable X. The notion of a conditional distribution, given C, and taking weighted averages with respect to the conditional probability is intuitive and natural in this case. However, this point of view is limited. In order to display a natural relationship with the more general concept of conditioning with respect to a random vector, we adopt the following

Definition. The conditional expectation of X, given event C with positive probability, is the quantity

E[X|C] = \frac{E[I_C X]}{P(C)} = \frac{E[I_C X]}{E[I_C]}
(3)

Remark. The product form E[X|C] P(C) = E[I_C X] is often useful.

Example 1: A numerical example

Suppose X ∼ exponential (λ) and C = {1/λ ≤ X ≤ 2/λ}. Now I_C = I_M(X), where M = [1/λ, 2/λ].

P(C) = P(X \ge 1/\lambda) - P(X > 2/\lambda) = e^{-1} - e^{-2} \quad \text{and}
(4)
E[I_C X] = \int I_M(t)\, t \lambda e^{-\lambda t}\, dt = \int_{1/\lambda}^{2/\lambda} t \lambda e^{-\lambda t}\, dt = \frac{1}{\lambda}\left(2 e^{-1} - 3 e^{-2}\right)
(5)

Thus

E[X|C] = \frac{2 e^{-1} - 3 e^{-2}}{\lambda (e^{-1} - e^{-2})} \approx \frac{1.418}{\lambda}
(6)
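The two expectations in this definition are easily checked numerically with base MATLAB quadrature (no special toolbox needed). This is only a sketch; the choice λ = 1 is arbitrary, and integral may be replaced by quad on older releases.

lambda = 1;                                 % arbitrary choice for the check
f   = @(t) lambda*exp(-lambda*t);           % exponential density
PC  = integral(f, 1/lambda, 2/lambda);      % P(C) = e^-1 - e^-2
EIX = integral(@(t) t.*f(t), 1/lambda, 2/lambda);   % E[I_C X]
EXC = EIX/PC                                % approximately 1.418/lambda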

Conditioning by a random vector—discrete case

Suppose X = \sum_{i=1}^{n} t_i I_{A_i} and Y = \sum_{j=1}^{m} u_j I_{B_j} in canonical form. We suppose P(A_i) = P(X = t_i) > 0 and P(B_j) = P(Y = u_j) > 0, for each permissible i, j. Now

P(Y = u_j | X = t_i) = \frac{P(X = t_i, Y = u_j)}{P(X = t_i)}
(7)

We take the expectation relative to the conditional probability P(·|X = t_i) to get

E[g(Y) | X = t_i] = \sum_{j=1}^{m} g(u_j) P(Y = u_j | X = t_i) = e(t_i)
(8)

Since we have a value for each t_i in the range of X, the function e(·) is defined on the range of X. Now consider any reasonable set M on the real line and determine the expectation

E[I_M(X) g(Y)] = \sum_{i=1}^{n} \sum_{j=1}^{m} I_M(t_i) g(u_j) P(X = t_i, Y = u_j)
(9)
= \sum_{i=1}^{n} I_M(t_i) \sum_{j=1}^{m} g(u_j) P(Y = u_j | X = t_i) P(X = t_i)
(10)
= \sum_{i=1}^{n} I_M(t_i) e(t_i) P(X = t_i) = E[I_M(X) e(X)]
(11)

We have the pattern

(A) \quad E[I_M(X) g(Y)] = E[I_M(X) e(X)] \quad \text{where } e(t_i) = E[g(Y) | X = t_i]
(12)

for all t_i in the range of X.

We return to examine this property later. But first, consider an example to display the nature of the concept.

Example 2: Basic calculations and interpretation

Suppose the pair {X, Y} has the joint distribution

P(X = t_i, Y = u_j)
(13)
Table 1
          X =     0      1      4      9
  Y =  2        0.05   0.04   0.21   0.15
       0        0.05   0.01   0.09   0.10
      -1        0.10   0.05   0.10   0.05
  P_X           0.20   0.10   0.40   0.30

Calculate E[Y|X = t_i] for each possible value t_i taken on by X.

  • E[Y|X = 0] = -1(0.10/0.20) + 0(0.05/0.20) + 2(0.05/0.20) = (-1·0.10 + 0·0.05 + 2·0.05)/0.20 = 0
  • E[Y|X = 1] = (-1·0.05 + 0·0.01 + 2·0.04)/0.10 = 0.30
  • E[Y|X = 4] = (-1·0.10 + 0·0.09 + 2·0.21)/0.40 = 0.80
  • E[Y|X = 9] = (-1·0.05 + 0·0.10 + 2·0.15)/0.30 = 0.8333

The pattern of operation in each case can be described as follows:

  • For the ith column, multiply each value u_j by P(X = t_i, Y = u_j), sum, then divide by P(X = t_i).

The following interpretation helps visualize the conditional expectation and points to an important result in the general case.

  • For each t_i we use the mass distributed “above” it. This mass is distributed along a vertical line at values u_j taken on by Y. The result of the computation is to determine the center of mass for the conditional distribution above t = t_i. As in the case of ordinary expectations, this should be the best estimate, in the mean-square sense, of Y when X = t_i. We examine that possibility in the treatment of the regression problem in the section "The Regression Problem" below.

Although the calculations are not difficult for a problem of this size, the basic pattern can be implemented simply with MATLAB, making the handling of much larger problems quite easy. This is particularly useful in dealing with the simple approximation to an absolutely continuous pair.

X = [0 1 4 9];             % Data for the joint distribution
Y = [-1 0 2];
P = 0.01*[ 5  4 21 15; 5  1  9 10; 10  5 10  5];
jcalc                      % Setup for calculations
Enter JOINT PROBABILITIES (as on the plane)  P
Enter row matrix of VALUES of X  X
Enter row matrix of VALUES of Y  Y
 Use array operations on matrices X, Y, PX, PY, t, u, and P
EYX = sum(u.*P)./sum(P);   % sum(P) = PX  (operation sum yields column sums)
disp([X;EYX]')             % u.*P = u_j P(X = t_i, Y = u_j) for all i, j
         0         0
    1.0000    0.3000
    4.0000    0.8000
    9.0000    0.8333

The calculations extend to E[g(X,Y)|X = t_i]. Instead of values of u_j we use values of g(t_i, u_j) in the calculations. Suppose Z = g(X,Y) = Y^2 - 2XY.

G = u.^2 - 2*t.*u;         % Z = g(X,Y) = Y^2 - 2XY
EZX = sum(G.*P)./sum(P);   % E[Z|X=x]
disp([X;EZX]')
         0    1.5000
    1.0000    1.5000
    4.0000   -4.0500
    9.0000  -12.8333
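For readers without the author's m-procedures, the same conditional expectations can be obtained with base MATLAB alone. The sketch below reorders the joint probability matrix so that its rows match Y = [-1 0 2] (the "as on the plane" matrix flipped upside down) and then forms the grids directly with meshgrid.

X = [0 1 4 9];
Y = [-1 0 2];
P = 0.01*[10  5 10  5;      % row for Y = -1
           5  1  9 10;      % row for Y =  0
           5  4 21 15];     % row for Y =  2
[t,u] = meshgrid(X,Y);      % t(i,j) = X(j), u(i,j) = Y(i)
PX  = sum(P);               % marginal for X (column sums)
EYX = sum(u.*P)./PX;        % E[Y|X = t_i] for each t_i
disp([X; EYX]')
G   = u.^2 - 2*t.*u;        % Z = g(X,Y) = Y^2 - 2XY
EZX = sum(G.*P)./PX;        % E[Z|X = t_i]
disp([X; EZX]')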

Conditioning by a random vector — absolutely continuous case

Suppose the pair {X, Y} has joint density function f_XY. We seek to use the concept of a conditional distribution, given X = t. The fact that P(X = t) = 0 for each t requires a modification of the approach adopted in the discrete case. Intuitively, we consider the conditional density

f_{Y|X}(u|t) = \begin{cases} f_{XY}(t,u)/f_X(t) & \text{for } f_X(t) > 0 \\ 0 & \text{elsewhere} \end{cases}
(14)

The condition f_X(t) > 0 effectively determines the range of X. The function f_Y|X(·|t) has the properties of a density for each fixed t for which f_X(t) > 0.

f_{Y|X}(u|t) \ge 0, \qquad \int f_{Y|X}(u|t)\, du = \frac{1}{f_X(t)} \int f_{XY}(t,u)\, du = f_X(t)/f_X(t) = 1
(15)

We define, in this case,

E[g(Y) | X = t] = \int g(u) f_{Y|X}(u|t)\, du = e(t)
(16)

The function e(·) is defined for f_X(t) > 0, hence effectively on the range of X. For any reasonable set M on the real line,

E[I_M(X) g(Y)] = \int I_M(t) \int g(u) f_{XY}(t,u)\, du\, dt = \int I_M(t) \left[\int g(u) f_{Y|X}(u|t)\, du\right] f_X(t)\, dt
(17)
= \int I_M(t) e(t) f_X(t)\, dt, \quad \text{where } e(t) = E[g(Y) | X = t]
(18)

Thus we have, as in the discrete case, for each t in the range of X.

(A) \quad E[I_M(X) g(Y)] = E[I_M(X) e(X)] \quad \text{where } e(t) = E[g(Y) | X = t]
(19)

Again, we postpone examination of this pattern until we consider a more general case.

Example 3: Basic calculation and interpretation

Suppose the pair {X, Y} has joint density f_XY(t, u) = (6/5)(t + 2u) on the triangular region bounded by t = 0, u = 1, and u = t (see Figure 1). Then

f_X(t) = \frac{6}{5} \int_t^1 (t + 2u)\, du = \frac{6}{5}\left(1 + t - 2t^2\right), \quad 0 \le t \le 1
(20)

By definition, then,

f_{Y|X}(u|t) = \frac{t + 2u}{1 + t - 2t^2} \quad \text{on the triangle (zero elsewhere)}
(21)

We thus have

E[Y | X = t] = \int u f_{Y|X}(u|t)\, du = \frac{1}{1 + t - 2t^2} \int_t^1 (tu + 2u^2)\, du = \frac{4 + 3t - 7t^3}{6(1 + t - 2t^2)}, \quad 0 \le t < 1
(22)

Theoretically, we must rule out t = 1 since the denominator is zero for that value of t. This causes no problem in practice.
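A spot check of (22) at a single point is easy with base MATLAB quadrature. The sketch below uses the (arbitrary) value t = 0.5; both numbers come out approximately 0.7708.

t   = 0.5;                                      % arbitrary point in (0,1)
fX  = (6/5)*(1 + t - 2*t^2);                    % marginal density at t
fYX = @(u) (6/5)*(t + 2*u)/fX;                  % conditional density f_Y|X(u|t)
num = integral(@(u) u.*fYX(u), t, 1)            % center of mass above t
closed = (4 + 3*t - 7*t^3)/(6*(1 + t - 2*t^2))  % closed form (22)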

Figure 1: The density function for Example 3.
[Figure: the triangular region in the (t, u) plane bounded by t = 0, u = 1, and u = t, with the label f_XY(t, u) = (6/5)(t + 2u).]

We are able to make an interpretation quite analogous to that for the discrete case. This also points the way to practical MATLAB calculations.

  • For any t in the range of X (between 0 and 1 in this case), consider a narrow vertical strip of width Δt with the vertical line through t at its center. If the strip is narrow enough, then f_XY(t, u) does not vary appreciably with t for any u.
  • The mass in the strip is approximately
    \text{Mass} \approx \Delta t \int f_{XY}(t,u)\, du = \Delta t\, f_X(t)
    (23)
  • The moment of the mass in the strip about the line u = 0 is approximately
    \text{Moment} \approx \Delta t \int u f_{XY}(t,u)\, du
    (24)
  • The center of mass in the strip is
    \text{Center of mass} = \frac{\text{Moment}}{\text{Mass}} \approx \frac{\Delta t \int u f_{XY}(t,u)\, du}{\Delta t\, f_X(t)} = \int u f_{Y|X}(u|t)\, du = e(t)
    (25)

This interpretation points the way to the use of MATLAB in approximating the conditional expectation. The success of the discrete approach in approximating the theoretical value in turn supports the validity of the interpretation. Also, this points to the general result on regression in the section "The Regression Problem".

In the MATLAB handling of joint absolutely continuous random variables, we divide the region into narrow vertical strips. Then we deal with each of these by dividing the vertical strips to form the grid structure. The center of mass of the discrete distribution above each value of t chosen for the approximation must lie close to the actual center of mass of the probability in the strip. Consider the MATLAB treatment of the example under consideration.

f = '(6/5)*(t + 2*u).*(u>=t)';                  % Density as string variable
tuappr
Enter matrix [a b] of X-range endpoints  [0 1]
Enter matrix [c d] of Y-range endpoints  [0 1]
Enter number of X approximation points  200
Enter number of Y approximation points  200
Enter expression for joint density  eval(f)     % Evaluation of string variable
Use array operations on X, Y, PX, PY, t, u, and P
EYx = sum(u.*P)./sum(P);                        % Approximate values
eYx = (4 + 3*X - 7*X.^3)./(6*(1 + X - 2*X.^2)); % Theoretical expression
plot(X,EYx,X,eYx)
% Plotting details             (see Figure 2)

Figure 2: Theoretical and approximate conditional expectation for above.
[Figure: approximate (solid) and theoretical (dashed) curves of E[Y | X = t] versus t for f_XY(t, u) = (6/5)(t + 2u), 0 ≤ t ≤ u ≤ 1; the two curves are nearly indistinguishable, dipping slightly near t = 0.1 and then rising to 1 at t = 1.]

The agreement of the theoretical and approximate values is quite good enough for practical purposes. It also indicates that the interpretation is reasonable, since the approximation determines the center of mass of the discretized mass which approximates the center of the actual mass in each vertical strip.

Extension to the general case

Most examples for which we make numerical calculations will be one of the types above. Analysis of these cases is built upon the intuitive notion of conditional distributions. However, these cases and this interpretation are rather limited and do not provide the basis for the range of applications—theoretical and practical—which characterize modern probability theory. We seek a basis for extension (which includes the special cases). In each case examined above, we have the property

(A) \quad E[I_M(X) g(Y)] = E[I_M(X) e(X)] \quad \text{where } e(t) = E[g(Y) | X = t]
(26)

for all t in the range of X.

We have a tie to the simple case of conditioning with respect to an event. If C = {X ∈ M} has positive probability, then using I_C = I_M(X) we have

(B) \quad E[I_M(X) g(Y)] = E[g(Y) | X \in M]\, P(X \in M)
(27)

Two properties of expectation are crucial here:

  1. By the uniqueness property (E5), since (A) holds for all reasonable (Borel) sets, then e(X) is unique a.s. (i.e., except for a set of ω of probability zero).
  2. By the special case of the Radon-Nikodym theorem (E19), the function e(·) always exists and is such that the random variable e(X) is unique a.s.

We make a definition based on these facts.

Definition. The conditional expectation E[g(Y)|X = t] = e(t) is the a.s. unique function defined on the range of X such that

(A) \quad E[I_M(X) g(Y)] = E[I_M(X) e(X)] \quad \text{for all Borel sets } M
(28)

Note that e(X) is a random variable and e(·) is a function. Expectation E[g(Y)] is always a constant. The concept is abstract. At this point it has little apparent significance, except that it must include the two special cases studied in the previous sections. Also, it is not clear why the term conditional expectation should be used. The justification rests in certain formal properties which are based on the defining condition (A) and other properties of expectation.
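To make the defining condition more concrete, the sketch below verifies (A) numerically for the discrete pair of Example 2, with the arbitrarily chosen Borel set M = [2, 10] (base MATLAB only).

X = [0 1 4 9];  Y = [-1 0 2];
P = 0.01*[10 5 10 5; 5 1 9 10; 5 4 21 15];    % rows ordered to match Y
[t,u] = meshgrid(X,Y);
PX  = sum(P);
e   = sum(u.*P)./PX;                          % e(t_i) = E[Y|X = t_i]
IM  = (X >= 2) & (X <= 10);                   % indicator of M on the range of X
lhs = sum(sum(((t >= 2) & (t <= 10)).*u.*P)); % E[I_M(X) Y]
rhs = (IM.*e)*PX';                            % E[I_M(X) e(X)]
disp([lhs rhs])                               % the two are equal (0.57)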

In Appendix F we tabulate a number of key properties of conditional expectation. The condition (A) is called property (CE1). We examine several of these properties. For a detailed treatment and proofs, any of a number of books on measure-theoretic probability may be consulted.

(CE1) Defining condition. e(X) = E[g(Y)|X] a.s. iff

E[I_M(X) g(Y)] = E[I_M(X) e(X)] \quad \text{for each Borel set } M \text{ on the codomain of } X
(29)

Note that X and Y do not need to be real valued, although g(Y) is real valued. This extension to possibly vector-valued X and Y is extremely important. The next condition is just the property (B) noted above.

(CE1a) If P(X ∈ M) > 0, then E[I_M(X) e(X)] = E[g(Y) | X ∈ M] P(X ∈ M)

The special case which is obtained by setting M to include the entire range of X, so that I_M(X(ω)) = 1 for all ω, is useful in many theoretical and applied problems.

(CE1b) Law of total probability. E[g(Y)] = E{E[g(Y)|X]}

It may seem strange that we should complicate the problem of determining E[g(Y)] by first getting the conditional expectation e(X) = E[g(Y)|X] and then taking the expectation of that function. Frequently, the data supplied in a problem makes this the expedient procedure.

Example 4: Use of the law of total probability

Suppose the time to failure of a device is a random quantity X ∼ exponential (u), where the parameter u is the value of a parameter random variable H. Thus

f_{X|H}(t|u) = u e^{-ut} \quad \text{for } t \ge 0
(30)

If the parameter random variable H ∼ uniform (a, b), determine the expected life E[X] of the device.

SOLUTION

We use the law of total probability:

E[X] = E\{E[X|H]\} = \int E[X | H = u]\, f_H(u)\, du
(31)

Now by assumption

E[X | H = u] = 1/u \quad \text{and} \quad f_H(u) = \frac{1}{b - a}, \quad a < u < b
(32)

Thus

E[X] = \frac{1}{b - a} \int_a^b \frac{1}{u}\, du = \frac{\ln(b/a)}{b - a}
(33)

For a = 1/100, b = 2/100, E[X] = 100 ln(2) ≈ 69.31.
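The result is easy to confirm with base MATLAB, both by quadrature and by simulating the two-stage model directly. The sample size and seed below are arbitrary.

a = 0.01;  b = 0.02;
EX = integral(@(u) (1./u)/(b - a), a, b)    % = log(b/a)/(b-a), about 69.31
rng(0);                                     % fix the seed for repeatability
H  = a + (b - a)*rand(1,1e6);               % H ~ uniform(a,b)
X  = -log(rand(1,1e6))./H;                  % X|H = u ~ exponential(u), by inverse transform
mean(X)                                     % close to 69.31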

The next three properties, linearity, positivity/monotonicity, and monotone convergence, along with the defining condition, provide the “expectation-like” character. These properties for expectation yield most of the other essential properties for expectation. A similar development holds for conditional expectation, with some reservation for the fact that e(X) is a random variable, unique a.s. This restriction causes little problem for applications at the level of this treatment.

In order to get some sense of how these properties root in basic properties of expectation, we examine one of them.

(CE2) Linearity. For any constants a, b

E[a g(Y) + b h(Z) | X] = a E[g(Y) | X] + b E[h(Z) | X] \quad a.s.
(34)

VERIFICATION

Let e_1(X) = E[g(Y)|X], e_2(X) = E[h(Z)|X], and e(X) = E[a g(Y) + b h(Z)|X] a.s.

\begin{aligned}
E[I_M(X) e(X)] &= E\{I_M(X)[a g(Y) + b h(Z)]\} && \text{a.s. by (CE1)} \\
 &= a E[I_M(X) g(Y)] + b E[I_M(X) h(Z)] && \text{a.s. by linearity of expectation} \\
 &= a E[I_M(X) e_1(X)] + b E[I_M(X) e_2(X)] && \text{a.s. by (CE1)} \\
 &= E\{I_M(X)[a e_1(X) + b e_2(X)]\} && \text{a.s. by linearity of expectation}
\end{aligned}

Since the equalities hold for any Borel M, the uniqueness property (E5) for expectation implies

e(X) = a e_1(X) + b e_2(X) \quad a.s.
(35)

This is property (CE2). An extension to any finite linear combination is easily established by mathematical induction.
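A quick numerical illustration of (CE2) on the discrete data of Example 2: the conditional expectation of a linear combination, computed directly from the joint distribution, agrees with the same linear combination of the separate conditional expectations. The constants a and b are arbitrary, and Z = Y^2 - 2XY as before (base MATLAB only).

X = [0 1 4 9];  Y = [-1 0 2];
P = 0.01*[10 5 10 5; 5 1 9 10; 5 4 21 15];    % rows ordered to match Y
[t,u] = meshgrid(X,Y);
PX  = sum(P);
a = 2;  b = -3;                               % arbitrary constants
EYX = sum(u.*P)./PX;
EZX = sum((u.^2 - 2*t.*u).*P)./PX;
EWX = sum((a*u + b*(u.^2 - 2*t.*u)).*P)./PX;  % E[aY + bZ | X = t_i]
max(abs(EWX - (a*EYX + b*EZX)))               % essentially zero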

Property (CE5) provides another condition for independence.

(CE5) Independence. {X, Y} is an independent pair

  • iff E[g(Y)|X] = E[g(Y)] a.s. for all Borel functions g
  • iff E[I_N(Y)|X] = E[I_N(Y)] a.s. for all Borel sets N on the codomain of Y

Since knowledge of X does not affect the likelihood that Y will take on any set of values, the conditional expectation should not be affected by the value of X. The resulting constant value of the conditional expectation must be E[g(Y)] in order for the law of total probability to hold. A formal proof utilizes uniqueness (E5) and the product rule (E18) for expectation.

Property (CE6) forms the basis for the solution of the regression problem in the next section.

(CE6) e(X) = E[g(Y)|X] a.s. iff E[h(X) g(Y)] = E[h(X) e(X)] a.s. for any Borel function h

Examination shows this to be the result of replacing I_M(X) in (CE1) with arbitrary h(X). Again, to get some insight into how the various properties arise, we sketch the ideas of a proof of (CE6).

IDEAS OF A PROOF OF (CE6)

  1. For h(X) = I_M(X), this is (CE1).
  2. For h(X) = \sum_{i=1}^{n} a_i I_{M_i}(X), the result follows by linearity.
  3. For h ≥ 0, g ≥ 0, there is a sequence of nonnegative, simple h_n ↗ h. Now by positivity, e(X) ≥ 0. By monotone convergence (CE4),
    E[h_n(X) g(Y)] \to E[h(X) g(Y)] \quad \text{and} \quad E[h_n(X) e(X)] \to E[h(X) e(X)]
    (36)
    Since corresponding terms in the sequences are equal, the limits are equal.
  4. For h = h^+ - h^-, g ≥ 0, the result follows by linearity (CE2).
  5. For g = g^+ - g^-, the result again follows by linearity.

Properties (CE8) and (CE9) are peculiar to conditional expectation. They play an essential role in many theoretical developments. They are essential in the study of Markov sequences and of a class of random sequences known as submartingales. We list them here (as well as in Appendix F) for reference.

(CE8) E[h(X) g(Y)|X] = h(X) E[g(Y)|X] a.s. for any Borel function h

This property says that any function of the conditioning random vector may be treated as a constant factor. This, combined with (CE10) below, provides useful aids to computation.

(CE9) Repeated conditioning

\text{If } X = h(W), \text{ then } E\{E[g(Y)|X] | W\} = E\{E[g(Y)|W] | X\} = E[g(Y)|X] \quad a.s.
(37)

This somewhat formal property is highly useful in many theoretical developments. We provide an interpretation after the development of regression theory in the next section.

The next property is highly intuitive and very useful. It is easy to establish in the two elementary cases developed in previous sections. Its proof in the general case is quite sophisticated.

(CE10) Under conditions on g that are nearly always met in practice

  1. E[g(X,Y)|X = t] = E[g(t,Y)|X = t] a.s. [P_X]
  2. If {X, Y} is independent, then E[g(X,Y)|X = t] = E[g(t,Y)] a.s. [P_X]

It certainly seems reasonable to suppose that if X = t, then we should be able to replace X by t in E[g(X,Y)|X = t] to get E[g(t,Y)|X = t]. Property (CE10) assures this. If {X, Y} is an independent pair, then the value of X should not affect the value of Y, so that E[g(t,Y)|X = t] = E[g(t,Y)] a.s.

Example 5: Use of property (CE10)

Consider again the distribution for Example 3. The pair {X, Y} has density

f_{XY}(t,u) = \frac{6}{5}(t + 2u) \quad \text{on the triangular region bounded by } t = 0, \; u = 1, \text{ and } u = t
(38)

We show in Example 3 that

E[Y | X = t] = \frac{4 + 3t - 7t^3}{6(1 + t - 2t^2)}, \quad 0 \le t < 1
(39)

Let Z = 3X^2 + 2XY. Determine E[Z|X = t].

SOLUTION

By linearity, (CE8), and (CE10)

E[Z | X = t] = 3t^2 + 2t\, E[Y | X = t] = 3t^2 + \frac{4t + 3t^2 - 7t^4}{3(1 + t - 2t^2)}
(40)
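As a check on (40), E[Z|X = t] can also be computed directly from the conditional density of Example 3, without invoking (CE8) and (CE10). The sketch below does this at the arbitrary point t = 0.5; both numbers come out approximately 1.521.

t   = 0.5;                                      % arbitrary point in (0,1)
fX  = (6/5)*(1 + t - 2*t^2);                    % marginal density at t
fYX = @(u) (6/5)*(t + 2*u)/fX;                  % conditional density f_Y|X(u|t)
direct = integral(@(u) (3*t^2 + 2*t*u).*fYX(u), t, 1)      % E[3t^2 + 2tY | X = t]
closed = 3*t^2 + (4*t + 3*t^2 - 7*t^4)/(3*(1 + t - 2*t^2)) % expression (40)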

Conditional probability

In the treatment of mathematical expectation, we note that probability may be expressed as an expectation

P(E) = E[I_E]
(41)

For conditional probability, given an event, we have

E[I_E | C] = \frac{E[I_E I_C]}{P(C)} = \frac{P(EC)}{P(C)} = P(E | C)
(42)

In this manner, we use conditional expectation to extend the concept of conditional probability.

Definition. The conditional probability of event E, given X, is

P(E | X) = E[I_E | X]
(43)

Thus, there is no need for a separate theory of conditional probability. We may define the conditional distribution function

F_{Y|X}(u | X) = P(Y \le u | X) = E[I_{(-\infty, u]}(Y) | X]
(44)

Then, by the law of total probability (CE1b),

F_Y(u) = E[F_{Y|X}(u | X)] = \int F_{Y|X}(u | t)\, F_X(dt)
(45)

If there is a conditional density f_Y|X such that

P(Y \in M | X = t) = \int_M f_{Y|X}(r | t)\, dr
(46)

then

F_{Y|X}(u | t) = \int_{-\infty}^{u} f_{Y|X}(r | t)\, dr \quad \text{so that} \quad f_{Y|X}(u | t) = \frac{\partial}{\partial u} F_{Y|X}(u | t)
(47)

A careful, measure-theoretic treatment shows that it may not be true that F_Y|X(·|t) is a distribution function for all t in the range of X. However, in applications, this is seldom a problem. Modeling assumptions often start with such a family of distribution functions or density functions.

Example 6: The conditional distribution function

As in Example 4, suppose X ∼ exponential (u), where the parameter u is the value of a parameter random variable H. If the parameter random variable H ∼ uniform (a, b), determine the distribution function F_X.

SOLUTION

As in Example 4, take the assumption on the conditional distribution to mean

f_{X|H}(t | u) = u e^{-ut}, \quad t \ge 0
(48)

Then

F_{X|H}(t | u) = \int_0^t u e^{-us}\, ds = 1 - e^{-ut}, \quad t \ge 0
(49)

By the law of total probability

F_X(t) = \int F_{X|H}(t | u) f_H(u)\, du = \frac{1}{b - a} \int_a^b (1 - e^{-ut})\, du = 1 - \frac{1}{b - a} \int_a^b e^{-ut}\, du
(50)
= 1 - \frac{1}{t(b - a)}\left[e^{-at} - e^{-bt}\right]
(51)

Differentiation with respect to t yields the expression for f_X(t):

f_X(t) = \frac{1}{b - a}\left[\left(\frac{1}{t^2} + \frac{a}{t}\right) e^{-at} - \left(\frac{1}{t^2} + \frac{b}{t}\right) e^{-bt}\right], \quad t > 0
(52)
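The closed forms (50)-(52) can be checked numerically against the law of total probability with base MATLAB; the values of a, b, and t below are arbitrary.

a = 1;  b = 2;  t = 1.5;                                  % arbitrary choices
FXquad = integral(@(u) (1 - exp(-u*t))/(b - a), a, b);    % law of total probability
FXform = 1 - (exp(-a*t) - exp(-b*t))/(t*(b - a));         % closed form (51)
fXquad = integral(@(u) u.*exp(-u*t)/(b - a), a, b);       % density from f_X|H
fXform = ((1/t^2 + a/t)*exp(-a*t) - (1/t^2 + b/t)*exp(-b*t))/(b - a);  % (52)
disp([FXquad FXform; fXquad fXform])                      % rows agree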

The following example uses a discrete conditional distribution and marginal distribution to obtain the joint distribution for the pair.

Example 7: A random number N of Bernoulli trials

A number N is chosen by a random selection from the integers from 1 through 20 (say by drawing a card from a box). A pair of dice is thrown N times. Let S be the number of “matches” (i.e., both ones, both twos, etc.). Determine the joint distribution for {N, S}.

SOLUTION

N ∼ uniform on the integers 1 through 20, so that P(N = i) = 1/20 for 1 ≤ i ≤ 20. Since there are 36 pairs of numbers for the two dice and six possible matches, the probability of a match on any throw is 1/6. Since the i throws of the dice constitute a Bernoulli sequence with probability 1/6 of a success (a match), we have S conditionally binomial (i, 1/6), given N = i. For any pair (i, j), 0 ≤ j ≤ i,

P(N = i, S = j) = P(S = j | N = i) P(N = i)
(53)

Now E[S | N = i] = i/6, so that

E[S] = \frac{1}{6} \cdot \frac{1}{20} \sum_{i=1}^{20} i = \frac{20 \cdot 21}{6 \cdot 20 \cdot 2} = \frac{7}{4} = 1.75
(54)

The following MATLAB procedure calculates the joint probabilities and arranges them “as on the plane.”

% file randbern.m
p  = input('Enter the probability of success  ');
N  = input('Enter VALUES of N  ');
PN = input('Enter PROBABILITIES for N  ');
n  = length(N);
m  = max(N);
S  = 0:m;
P  = zeros(n,m+1);
for i = 1:n
  P(i,1:N(i)+1) = PN(i)*ibinom(N(i),p,0:N(i));
end
PS = sum(P);
P  = rot90(P);
disp('Joint distribution N, S, P, and marginal PS')
randbern                           % Call for the procedure
Enter the probability of success  1/6
Enter VALUES of N  1:20
Enter PROBABILITIES for N  0.05*ones(1,20)
Joint distribution N, S, P, and marginal PS
ES = S*PS'
ES =  1.7500                          % Agrees with the theoretical value
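The procedure randbern.m calls ibinom from the author's toolbox. A self-contained sketch of the same computation in base MATLAB, with the binomial probabilities written out explicitly, is shown below.

p  = 1/6;
N  = 1:20;  PN = 0.05*ones(1,20);
m  = max(N);  S = 0:m;
P  = zeros(length(N), m+1);
for i = 1:length(N)
  k = 0:N(i);
  P(i,k+1) = PN(i)*arrayfun(@(j) nchoosek(N(i),j), k).*p.^k.*(1-p).^(N(i)-k);
end
PS = sum(P);                 % marginal distribution for S
ES = S*PS'                   % 1.75, agreeing with E[S] = E[N]/6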

The regression problem

We introduce the regression problem in the treatment of linear regression. Here we are concerned with more general regression. A pair {X, Y} of real random variables has a joint distribution. A value X(ω) is observed. We desire a rule for obtaining the “best” estimate of the corresponding value Y(ω). If Y(ω) is the actual value and r(X(ω)) is the estimate, then Y(ω) - r(X(ω)) is the error of estimate. The best estimation rule (function) r(·) is taken to be that for which the average square of the error is a minimum. That is, we seek a function r such that

E[(Y - r(X))^2] \quad \text{is a minimum.}
(55)

In the treatment of linear regression, we determine the best affine function, u = at + b. The optimum function of this form defines the regression line of Y on X. We now turn to the problem of finding the best function r, which may in some cases be an affine function, but more often is not.

We have some hints of possibilities. In the treatment of expectation, we find that the best constant to approximate a random variable in the mean square sense is the mean value, which is the center of mass for the distribution. In the interpretive Example 2 for the discrete case, we find the conditional expectation E[Y|X = t_i] is the center of mass for the conditional distribution at X = t_i. A similar result, considering thin vertical strips, is found in Example 3 for the absolutely continuous case. This suggests the possibility that e(t) = E[Y|X = t] might be the best estimate for Y when the value X(ω) = t is observed. We investigate this possibility. The property (CE6) proves to be key to obtaining the result.

Let e(X) = E[Y|X]. We may write (CE6) in the form E[h(X)(Y - e(X))] = 0 for any reasonable function h. Consider

E[(Y - r(X))^2] = E[(Y - e(X) + e(X) - r(X))^2]
(56)
= E[(Y - e(X))^2] + E[(e(X) - r(X))^2] + 2 E[(Y - e(X))(r(X) - e(X))]
(57)

Now e(X) is fixed (a.s.) and for any choice of r we may take h(X) = r(X) - e(X) to assert that

E[(Y - e(X))(r(X) - e(X))] = E[(Y - e(X)) h(X)] = 0
(58)

Thus

E[(Y - r(X))^2] = E[(Y - e(X))^2] + E[(e(X) - r(X))^2]
(59)

The first term on the right hand side is fixed; the second term is nonnegative, with a minimum at zero iff r(X) = e(X) a.s. Thus, r = e is the best rule. For a given value X(ω) = t the best mean square estimate of Y is

u = e(t) = E[Y | X = t]
(60)

The graph of u = e(t) vs t is known as the regression curve of Y on X. This is defined for argument t in the range of X, and is unique except possibly on a set N such that P(X ∈ N) = 0. Determination of the regression curve is thus determination of the conditional expectation.
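To see how the regression curve differs from the regression line of the linear-regression treatment, the sketch below computes both for the discrete pair of Example 2 with base MATLAB. The line uses slope Cov[X,Y]/Var[X] and intercept E[Y] minus slope times E[X]; the curve is E[Y|X = t_i].

X = [0 1 4 9];  Y = [-1 0 2];
P = 0.01*[10 5 10 5; 5 1 9 10; 5 4 21 15];    % rows ordered to match Y
[t,u] = meshgrid(X,Y);
PX  = sum(P);  EX = X*PX';  EY = sum(sum(u.*P));
VX  = X.^2*PX' - EX^2;                        % Var[X]
CXY = sum(sum(t.*u.*P)) - EX*EY;              % Cov[X,Y]
aL  = CXY/VX;  bL = EY - aL*EX;               % regression line u = aL*t + bL
EYX = sum(u.*P)./PX;                          % regression curve
disp([X; EYX; aL*X + bL]')                    % the curve and the line differ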

Example 8: Regression curve for an independent pair

If the pair {X, Y} is independent, then u = E[Y|X = t] = E[Y], so that the regression curve of Y on X is the horizontal line through u = E[Y]. This, of course, agrees with the regression line, since Cov[X, Y] = 0 and the regression line is u = 0·t + E[Y].

The result extends to functions of X and Y. Suppose Z = g(X,Y). Then the pair {X, Z} has a joint distribution, and the best mean square estimate of Z given X = t is E[Z|X = t].

Example 9: Estimate of a function of {X, Y}

Suppose the pair {X, Y} has joint density f_XY(t, u) = 60 t^2 u for 0 ≤ t ≤ 1, 0 ≤ u ≤ 1 - t. This is the triangular region bounded by t = 0, u = 0, and u = 1 - t (see Figure 3). Integration shows that

f_X(t) = 30 t^2 (1 - t)^2, \quad 0 \le t \le 1, \quad \text{and} \quad f_{Y|X}(u | t) = \frac{2u}{(1 - t)^2} \quad \text{on the triangle}
(61)

Consider

Z = \begin{cases} X^2 & \text{for } X \le 1/2 \\ 2Y & \text{for } X > 1/2 \end{cases} = I_M(X)\, X^2 + I_N(X)\, 2Y
(62)

where M = [0, 1/2] and N = (1/2, 1]. Determine E[Z|X = t].

Figure 3: The density function for Example 9.
[Figure: the triangular region in the (t, u) plane with vertices (0, 0), (1, 0), and (0, 1), bounded by the axes and the line u = 1 - t, labeled f_XY(t, u) = 60 t^2 u.]

SOLUTION By linearity and (CE8),

E[Z | X = t] = E[I_M(X) X^2 | X = t] + E[I_N(X)\, 2Y | X = t] = I_M(t)\, t^2 + I_N(t)\, 2 E[Y | X = t]
(63)

Now

E[Y | X = t] = \int u f_{Y|X}(u | t)\, du = \frac{1}{(1 - t)^2} \int_0^{1-t} 2u^2\, du = \frac{2}{3} \cdot \frac{(1 - t)^3}{(1 - t)^2} = \frac{2}{3}(1 - t), \quad 0 \le t < 1
(64)

so that

E[Z | X = t] = I_M(t)\, t^2 + I_N(t)\, \frac{4}{3}(1 - t)
(65)

Note that the indicator functions separate the two expressions. The first holds on the interval M = [0, 1/2] and the second holds on the interval N = (1/2, 1]. The two expressions t^2 and (4/3)(1 - t) must not be added, for this would give an expression incorrect for all t in the range of X.

APPROXIMATION

tuappr
Enter matrix [a b] of X-range endpoints  [0 1]
Enter matrix [c d] of Y-range endpoints  [0 1]
Enter number of X approximation points  100
Enter number of Y approximation points  100
Enter expression for joint density  60*t.^2.*u.*(u<=1-t)
Use array operations on X, Y, PX, PY, t, u, and P
G = (t<=0.5).*t.^2 + 2*(t>0.5).*u;
EZx = sum(G.*P)./sum(P);                       % Approximation
eZx = (X<=0.5).*X.^2 + (4/3)*(X>0.5).*(1-X);   % Theoretical
plot(X,EZx,'k-',X,eZx,'k-.')
% Plotting details                             % See Figure 4

The fit is quite sufficient for practical purposes, in spite of the moderate number of approximation points. The difference in expressions for the two intervals of X values is quite clear.

Figure 4: Theoretical and approximate regression curves for Example 9.
[Figure: theoretical (dashed) and approximate (solid) regression curves E[Z | X = t] versus t for Example 9; the curves rise from 0 to about 0.25 at t = 0.5, jump to about 0.65, then fall linearly to 0 at t = 1.]

Example 10: Estimate of a function of {X, Y}

Suppose the pair {X, Y} has joint density f_XY(t, u) = (6/5)(t^2 + u) on the unit square 0 ≤ t ≤ 1, 0 ≤ u ≤ 1 (see Figure 5). The usual integration shows

f_X(t) = \frac{3}{5}(2t^2 + 1), \quad 0 \le t \le 1, \quad \text{and} \quad f_{Y|X}(u | t) = \frac{2(t^2 + u)}{2t^2 + 1} \quad \text{on the square}
(66)

Consider

Z = \begin{cases} 2X^2 & \text{for } X \le Y \\ 3XY & \text{for } X > Y \end{cases} = I_Q(X,Y)\, 2X^2 + I_{Q^c}(X,Y)\, 3XY, \quad \text{where } Q = \{(t, u) : u \ge t\}
(67)

Determine E[Z|X = t].

SOLUTION

E[Z | X = t] = 2t^2 \int I_Q(t, u) f_{Y|X}(u | t)\, du + 3t \int I_{Q^c}(t, u)\, u f_{Y|X}(u | t)\, du
(68)
= \frac{4t^2}{2t^2 + 1} \int_t^1 (t^2 + u)\, du + \frac{6t}{2t^2 + 1} \int_0^t (t^2 u + u^2)\, du = \frac{-t^5 + 4t^4 + 2t^2}{2t^2 + 1}, \quad 0 \le t \le 1
(69)
Figure 5: The density and regions for Example 10
[Figure: the unit square in the (t, u) plane divided by the dashed line u = t into the regions Q (above, u ≥ t) and Q^c (below), with f_XY(t, u) = (6/5)(t^2 + u).]

Note the different role of the indicator functions here compared with Example 9. There they provide a separation of two parts of the result. Here they serve to set the effective limits of integration, but the sum of the two parts is needed for each t.

Figure 6: Theoretical and approximate regression curves for Example 10.
[Figure: theoretical (dashed) and approximate (solid) regression curves E[Z | X = t] versus t for Example 10; the two curves are nearly indistinguishable, rising from 0 at t = 0 to about 1.65 at t = 1.]

APPROXIMATION

tuappr
Enter matrix [a b] of X-range endpoints  [0 1]
Enter matrix [c d] of Y-range endpoints  [0 1]
Enter number of X approximation points  200
Enter number of Y approximation points  200
Enter expression for joint density  (6/5)*(t.^2 + u)
Use array operations on X, Y, PX, PY, t, u, and P
G = 2*t.^2.*(u>=t) + 3*t.*u.*(u<t);
EZx = sum(G.*P)./sum(P);                        % Approximate
eZx = (-X.^5 + 4*X.^4 + 2*X.^2)./(2*X.^2 + 1);  % Theoretical
plot(X,EZx,'k-',X,eZx,'k-.')
% Plotting details                              % See Figure 6

The theoretical and approximate curves are barely distinguishable on the plot. Although the same number of approximation points is used as in Figure 4 (Example 9), the fact that the entire region is included in the grid means a larger number of effective points in this example.

Given our approach to conditional expectation, the fact that it solves the regression problem is a matter that requires proof using properties of conditional expectation. An alternate approach is simply to define the conditional expectation to be the solution to the regression problem, then determine its properties. This yields, in particular, our defining condition (CE1). Once that is established, properties of expectation (including the uniqueness property (E5)) show the essential equivalence of the two concepts. There are some technical differences which do not affect most applications. The alternate approach assumes the second moment E[X^2] is finite. Not all random variables have this property. However, those ordinarily used in applications at the level of this treatment will have a variance, hence a finite second moment.

We use the interpretation of e(X) = E[g(Y)|X] as the best mean square estimator of g(Y), given X, to interpret the formal property (CE9). We examine the special form

(CE9a) E{E[g(Y)|X] | X, Z} = E{E[g(Y)|X, Z] | X} = E[g(Y)|X]

Put e_1(X, Z) = E[g(Y)|X, Z], the best mean square estimator of g(Y), given (X, Z). Then (CE9a) can be expressed

E[e(X) | X, Z] = e(X) \;\; a.s. \quad \text{and} \quad E[e_1(X, Z) | X] = e(X) \;\; a.s.
(70)

In words, if we take the best mean square estimate of g(Y), given X, and then take the best mean square estimate of that, given (X, Z), we do not change the estimate of g(Y). On the other hand, if we first get the best mean square estimate of g(Y), given (X, Z), and then take the best mean square estimate of that, given X, we get the best mean square estimate of g(Y), given X.
