# Matlab Procedures for Markov Decision Processes

Module by: Paul E Pfeiffer

Summary: We first summarize certain essentials in the analysis of homogeneous Markov chains with costs and rewards associated with states, or with transitions between states. Then we consider three cases: a. Gain associated with a state; b. One-step transition gains; and c. Gains associated with a demand under certain reasonable conditions. Matlab implementations of the results of analysis provide machine solutions to a variety of problems.

In order to provide the background for Matlab procedures for Markov decision processes, we first summarize certain essentials in the analysis of homogeneous Markov chains with costs and rewards associated with states, or with transitions between states. References are to Pfeiffer: PROBABILITY FOR APPLICATIONS.

1. Some cost and reward patterns

Consider a finite, ergodic (i.e., recurrent, positive, aperiodic, irreducible) homogeneous Markov chain $X_N$, with state space $E = \{1, 2, \ldots, M\}$. To facilitate use of matrix computation programs, we number states from one, except in certain examples in which state zero has a natural meaning. Associated with this chain is a cost or reward structure belonging to one of the three general classes described below. The one-step transition probability matrix is $P = [p(i,j)]$, where $p(i,j) = P(X_{n+1} = j \mid X_n = i)$. The distribution for $X_n$ is represented by the row matrix $\pi(n) = [p_1(n), p_2(n), \ldots, p_M(n)]$, where $p_i(n) = P(X_n = i)$. The long-run distribution is represented by the row matrix $\pi = [\pi_1, \pi_2, \ldots, \pi_M]$.

1. Type 1. Gain associated with a state. A reward or gain in the amount $g(i)$ is realized in the next period if the current state is $i$. The gain sequence $\{G_n : 1 \le n\}$ of random variables $G_{n+1} = g(X_n)$ is the sequence of gains realized as the chain $X_N$ evolves. We represent the gain function $g$ by the column matrix $g = [g(1)\ g(2)\ \cdots\ g(M)]^T$.
2. Type 2. One-step transition gains. A reward or gain in the amount $g(i,j) = g_{ij}$ is realized in period $n+1$ if the system goes from state $i$ in period $n$ to state $j$ in period $n+1$. The corresponding one-step transition probability is $p(i,j) = p_{ij}$. The gain matrix is $g = [g(i,j)]$. The gain sequence $\{G_n : 1 \le n\}$ of random variables $G_{n+1} = g(X_n, X_{n+1})$ is the sequence of gains experienced as the chain $X_N$ evolves.
3. Type 3. Gains associated with a demand. In this case, the gain random variables are of the form
$$G_{n+1} = g(X_n, D_{n+1}) \qquad (1)$$
If $n$ represents the present, the random vector $U_n = (X_0, X_1, \ldots, X_n)$ represents the "past" at $n$ of the chain $X_N$. We suppose $\{D_n : 1 \le n\}$ is iid and each pair $\{D_{n+1}, U_n\}$ is independent. Again, the gain sequence $\{G_n : 1 \le n\}$ represents the gains realized as the process evolves. Standard results on Markov chains show that in each case the sequence $G_N = \{G_n : 1 \le n\}$ is Markov.
A recurrence relation. We are interested in the expected gains in future stages, given the present state of the process. For any $n > 0$ and any $m > 1$, the gain $G_n(m)$ in the $m$ periods following period $n$ is given by
$$G_n(m) = G_{n+1} + G_{n+2} + \cdots + G_{n+m} \qquad (2)$$
If the process is in state $i$, the expected gain in the next period, $q_i$, is
$$q_i = v_i(1) = E[G_{n+1} \mid X_n = i] \qquad (3)$$
and the expected gain in the next $m$ periods is
$$v_i(m) = E[G_n(m) \mid X_n = i] \qquad (4)$$
We utilize a recursion relation that allows us to begin with the transition matrix $P$ and the next-period expected gain matrix $q = [q_1\ q_2\ \cdots\ q_M]^T$ and calculate the $v_i(m)$ for any $m > 1$. Although the analysis is somewhat different for the various gain structures, the result exhibits a common pattern. In each case
$$v_i(m) = E[G_n(m) \mid X_n = i] = E[G_{n+1} \mid X_n = i] + \sum_{k=1}^{m-1} E[G_{n+k+1} \mid X_n = i] \qquad (5)$$
$$= q_i + \sum_{k=1}^{m-1} \sum_{j \in E} E[G_{n+k+1} \mid X_{n+1} = j]\, p(i,j) \qquad (6)$$
$$= q_i + \sum_{j \in E} E\Big[\sum_{k=1}^{m-1} G_{n+k} \,\Big|\, X_n = j\Big]\, p(i,j) \qquad (7)$$
We thus have the fundamental recursion relation
$$(*) \qquad v_i(m) = q_i + \sum_{j \in E} v_j(m-1)\, p(i,j) \qquad (8)$$
The recursion relation $(*)$ shows that the transition matrix $P = [p(i,j)]$ and the vector of next-period expected gains, which we represent by the column matrix $q = [q_1, q_2, \ldots, q_M]^T$, determine the $v_i(m)$. The difference between the cases is the manner in which the $q_i$ are obtained.

• Type 1: $q_i = E[g(X_n) \mid X_n = i] = g(i)$
• Type 2: $q_i = E[g(X_n, X_{n+1}) \mid X_n = i] = E[g(i, X_{n+1}) \mid X_n = i] = \sum_{j \in E} g(i,j)\, p(i,j)$
• Type 3: $q_i = E[g(X_n, D_{n+1}) \mid X_n = i] = E[g(i, D_{n+1})] = \sum_k g(i,k)\, P(D = k)$
• Matrix form: For computational purposes, we utilize these relations in matrix form. If
$$v(n) = [v_1(n)\ v_2(n)\ \cdots\ v_M(n)]^T \quad \text{and} \quad q = [q_1\ q_2\ \cdots\ q_M]^T \qquad (9)$$
then
$$(*) \qquad v(m+1) = q + P\, v(m) \quad \text{for all } m > 0, \text{ with } v(0) = 0 \qquad (10)$$
Examination of the expressions for $q_i$, above, shows that the next-period expected gain matrix $q$ takes the following forms. In each case, $g$ is the gain matrix. In the Type 3 case, $p_D$ is the column matrix whose elements are $P(D = k)$.

• Type 1: $q = g$
• Type 2: $q$ = the diagonal of $P g^T$ (as a column matrix)
• Type 3: $q = g\, p_D$
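
The following sketch shows how these formulas translate into Matlab. It is not part of the original module; it assumes the workspace already contains $P$ and a Type 2 gain matrix g (for Type 1 or Type 3 only the line forming q changes).

% Form the next-period expected gain matrix q and iterate the recursion (*).
q = diag(P*g');            % Type 2: q_i = sum_j g(i,j) p(i,j)
% Type 1: q = g;   Type 3: q = g*pD, with pD the column of P(D = k)
m = 10;                    % horizon, chosen arbitrarily for illustration
v = zeros(size(q));        % v(0) = 0
for j = 1:m
    v = q + P*v;           % v(j) = q + P*v(j-1)
end
disp(v)                    % expected gains v_i(m) for i = 1, ..., M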
2. Some long-run averages

Consider the average expected gain for $m$ periods
$$E\Big[\frac{1}{m} G_n(m)\Big] = \frac{1}{m}\sum_{k=1}^{m} E[G_{n+k}] = \frac{1}{m}\sum_{k=1}^{m} E\big\{E[G_{n+k} \mid X_{n-1}]\big\} \qquad (11)$$
Use of the Markov property and the fact that for an ergodic chain
$$\frac{1}{m}\sum_{k=1}^{m} p^k(i,j) \to \pi_j \quad \text{as } m \to \infty \qquad (12)$$
yields the result that
$$\lim_{m \to \infty} E\Big[\frac{1}{m} G_n(m)\Big] = \sum_i P(X_{n-1} = i) \sum_j q_j \pi_j = \sum_j q_j \pi_j = g \qquad (13)$$
and for any state $i$,
$$\lim_{m \to \infty} E\Big[\frac{1}{m} G_n(m) \,\Big|\, X_n = i\Big] = \lim_{m \to \infty} \frac{1}{m} v_i(m) = g \qquad (14)$$
The expression for $g$ may be put into matrix form. If the long-run probability distribution is represented by the row matrix $\pi = [\pi_1\ \pi_2\ \cdots\ \pi_M]$ and $q = [q_1\ q_2\ \cdots\ q_M]^T$, then
$$g = \pi q \qquad (15)$$
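
In Matlab this is a one-line computation once the long-run distribution is available. The sketch below is not part of the original module; it assumes $P$ and q are already in the workspace and, as in the procedures later in the module, approximates $\pi$ by a row of a high power of $P$.

% Approximate the long-run distribution and the gain rate g = pi*q.
% Assumptions: P and q are in the workspace and the chain is ergodic.
n  = 16;              % power of P used to approximate the long-run matrix
P0 = P^n;             % each row of P0 approximates the long-run distribution
PI = P0(1,:);         % long-run distribution (row matrix)
G  = PI*q             % long-run expected gain per period, as in (15)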
3. Discounting and potentials

Suppose in a given time interval the value of money increases by a fraction $a$. This may represent the potential earning if the money were invested. One dollar now is worth $1 + a$ dollars at the end of one period. It is worth $(1+a)^n$ dollars after $n$ periods. Set $\alpha = 1/(1+a)$, so that $0 < \alpha \le 1$. It takes $\alpha$ dollars to be worth one dollar after one period. It takes $\alpha^n$ dollars at present to be worth one dollar $n$ periods later. Thus the present worth or discounted value of one dollar $n$ periods in the future is $\alpha^n$ dollars. This is the amount one would need to invest presently, at interest rate $a$ per unit of time, to have one dollar $n$ periods later.

If $f$ is any function defined on the state space $E$, we designate by $f$ the column matrix $[f(1)\ f(2)\ \cdots\ f(M)]^T$. We make an exception to this convention in the case of the distributions of the state probabilities $\pi(n) = [p_1(n)\ p_2(n)\ \cdots\ p_M(n)]$, their generating function, and the long-run probabilities $\pi = [\pi_1\ \pi_2\ \cdots\ \pi_M]$, which we represent as row matrices. It should be clear that much of the following development extends immediately to an infinite state space $E = \mathbf{N} = \{0, 1, 2, \ldots\}$.

We assume one of the gain structures introduced in Sec 1 and the corresponding gain sequence $\{G_n : 1 \le n\}$. The value of the random variable $G_{n+1}$ is the gain or reward realized during period $n+1$. We let each transition time be at the end of one period of time (say a month or a quarter). If $n$ corresponds to the present, then $G_{n+k}$ is the gain in period $n+k$. If we do not discount for the period immediately after the present, then $\alpha^{k-1} G_{n+k}$ is the present value of that gain. Hence
$$\sum_{k=1}^{\infty} \alpha^{k-1} G_{n+k} \quad \text{is the total discounted future gain} \qquad (16)$$
Definition. The α-potential of the gain sequence $\{G_n : 1 \le n\}$ is the function $\varphi$ defined by
$$\varphi(i) = E\Big[\sum_{n=0}^{\infty} \alpha^n G_{n+1} \,\Big|\, X_0 = i\Big] \qquad i \in E \qquad (17)$$
Remark. $\varphi(i)$ is the expected total discounted gain, given the system starts in state $i$. We next define a concept which is related to the α-potential in a useful way.

Definition. The α-potential matrix $R_\alpha$ for the process $X_N$ is the matrix
$$R_\alpha = \sum_{n=0}^{\infty} \alpha^n P^n \quad \text{with } P^0 = I \qquad (18)$$

## Theorem 3.1

Let $X_N$ be an ergodic, homogeneous Markov chain with state space $E$ and gain sequence $\{G_n : 1 \le n\}$. Let $q = [q_1\ q_2\ \cdots\ q_M]^T$, where $q_i = E[G_{n+1} \mid X_n = i]$ for $i \in E$. For $\alpha \in (0,1)$, let $\varphi$ be the α-potential for the gain sequence. That is,
$$\varphi(i) = E\Big[\sum_{n=0}^{\infty} \alpha^n G_{n+1} \,\Big|\, X_0 = i\Big] \qquad i \in E \qquad (19)$$
Then, if $R_\alpha$ is the α-potential matrix for $X_N$, we have
$$\varphi = R_\alpha q \qquad (20)$$

## Theorem 3.2

Consider an ergodic, homogeneous Markov chain $X_N$ with gain sequence $\{G_n : 1 \le n\}$ and next-period expected gain matrix $q$. For $\alpha \in (0,1)$, the α-potential $\varphi$ and the α-potential matrix $R_\alpha$ satisfy
$$[I - \alpha P]\, R_\alpha = I \quad \text{and} \quad [I - \alpha P]\, \varphi = q \qquad (21)$$
If the state space $E$ is finite, then
$$R_\alpha = [I - \alpha P]^{-1} \quad \text{and} \quad \varphi = [I - \alpha P]^{-1} q = R_\alpha q \qquad (22)$$

## Example 1: A numerical example

Suppose the transition matrix $P$ and the next-period expected gain matrix $q$ are
$$P = \begin{bmatrix} 0.2 & 0.3 & 0.5 \\ 0.4 & 0.1 & 0.5 \\ 0.6 & 0.3 & 0.1 \end{bmatrix} \quad \text{and} \quad q = \begin{bmatrix} 2 \\ 5 \\ 3 \end{bmatrix} \qquad (23)$$
For $\alpha = 0.9$, we have
$$R_{0.9} = (I - 0.9P)^{-1} = \begin{bmatrix} 4.4030 & 2.2881 & 3.3088 \\ 3.5556 & 3.1356 & 3.3088 \\ 3.6677 & 2.2881 & 4.0441 \end{bmatrix} \quad \text{and} \quad \varphi = R_{0.9}\, q = \begin{bmatrix} 30.17 \\ 32.72 \\ 30.91 \end{bmatrix} \qquad (24)$$
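
These numbers are easily checked in Matlab. The commands below are not part of the original module; they simply reproduce (24).

P = [0.2 0.3 0.5; 0.4 0.1 0.5; 0.6 0.3 0.1];
q = [2; 5; 3];
alpha = 0.9;
R = inv(eye(3) - alpha*P)      % alpha-potential matrix R_0.9, as in (22)
phi = R*q                      % expected total discounted gain: approx [30.17; 32.72; 30.91]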

The next example is of Type 3 (gains associated with a demand), but the demand random variable is Poisson, which does not have finite range. The simple pattern for calculating the $q_i$ must be modified.

## Example 2: Costs associated with inventory under an (m, M) order policy

If $k$ units are ordered, the cost in dollars is
$$c(k) = \begin{cases} 0 & k = 0 \\ 10 + 25k & 0 < k \le M \end{cases} \qquad (25)$$

For each unit of unsatisfied demand, there is a penalty of 50 dollars. We use the $(m, M)$ inventory policy described in Ex 23.1.3. $X_0 = M$, and $X_n$ is the stock at the end of period $n$, before restocking. Demand in period $n$ is $D_n$. The cost for restocking at the end of period $n+1$ is
$$g(X_n, D_{n+1}) = \begin{cases} 10 + 25(M - X_n) + 50\max\{D_{n+1} - M,\, 0\} & 0 \le X_n < m \\ 50\max\{D_{n+1} - X_n,\, 0\} & m \le X_n \le M \end{cases} \qquad (26)$$
For $m = 1$, $M = 3$, we have
$$g(0, D_{n+1}) = 85 + 50\, I_{[3,\infty)}(D_{n+1})\,(D_{n+1} - 3) \qquad (27)$$
$$g(i, D_{n+1}) = 50\, I_{[3,\infty)}(D_{n+1})\,(D_{n+1} - i) \qquad 1 \le i \le 3 \qquad (28)$$
We take $m = 1$, $M = 3$ and assume $D$ is Poisson (1). From Ex 23.1.3 in PA, we obtain
$$P = \begin{bmatrix} 0.0803 & 0.1839 & 0.3679 & 0.3679 \\ 0.6321 & 0.3679 & 0 & 0 \\ 0.2642 & 0.3679 & 0.3679 & 0 \\ 0.0803 & 0.1839 & 0.3679 & 0.3679 \end{bmatrix} \qquad (29)$$

The largest eigenvalue other than one has magnitude $|\lambda| \approx 0.26$, so $n = 10$ should be sufficient for convergence. We use $n = 16$.

Taking any row of $P^{16}$, we approximate the long-run distribution by
$$\pi = [0.2858\ \ 0.2847\ \ 0.2632\ \ 0.1663] \qquad (30)$$

We thus have
$$q_0 = 85 + 50\, E[I_{[3,\infty)}(D_{n+1})(D_{n+1} - 3)] = 85 + 50\sum_{k=4}^{\infty} (k-3)\, p_k \quad \text{(the term for } k = 3 \text{ is zero)} \qquad (31)$$

For the Poisson distribution, $\sum_{k=n}^{\infty} k\, p_k = \lambda \sum_{k=n-1}^{\infty} p_k$ (since $k\, p_k = \lambda\, p_{k-1}$). Hence, with $\lambda = 1$,
$$q_0 = 85 + 50\Big[\sum_{k=3}^{\infty} p_k - 3\sum_{k=4}^{\infty} p_k\Big] \approx 85 + 50[0.0803 - 3 \times 0.0190] = 86.1668$$
$$q_1 = 50\sum_{k=3}^{\infty} (k-1)\, p_k = 50\Big[\sum_{k=2}^{\infty} p_k - \sum_{k=3}^{\infty} p_k\Big] = 50\, p_2 = 9.1970$$
$$q_2 = 50\sum_{k=3}^{\infty} (k-2)\, p_k = 50[0.2642 - 2 \times 0.0803] = 5.1819$$
$$q_3 = q_0 - 85 = 1.1668$$
so that
$$q = \begin{bmatrix} 86.1668 \\ 9.1970 \\ 5.1819 \\ 1.1668 \end{bmatrix} \qquad (32)$$

Then, for $\alpha = 0.8$ we have
$$R_{0.8} = (I - 0.8P)^{-1} = \begin{bmatrix} 1.9608 & 1.0626 & 1.1589 & 0.8178 \\ 1.4051 & 2.1785 & 0.8304 & 0.5860 \\ 1.1733 & 1.2269 & 2.1105 & 0.4893 \\ 0.9608 & 1.0626 & 1.1589 & 1.8178 \end{bmatrix} \quad \text{and} \quad \varphi = R_{0.8}\, q = \begin{bmatrix} 185.68 \\ 146.09 \\ 123.89 \\ 100.68 \end{bmatrix} \qquad (33)$$

Recall that we are dealing with expected discounted costs. Since we usually consider starting with a full stock, the amount $\varphi(M) = \varphi(3) = 100.68$ is the quantity of interest. Note that this is smaller than the other values of $\varphi(j)$, which correspond to smaller beginning stock levels. We expect greater restocking costs in such cases.
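
A short Matlab computation (not part of the original module) reproduces these results. It assumes the Poisson(1) demand distribution and truncates the demand at 25, where the tail probability is negligible.

k  = (0:25)';                          % demand values; truncation at 25 is an assumption
pD = exp(-1)./factorial(k);            % Poisson(1) probabilities p_k
q0 = 85 + 50*sum(max(k-3,0).*pD);      % q_0 = 85 + 50 E[max(D-3,0)]
q1 = 50*sum(max(k-1,0).*(k>=3).*pD);   % q_1 = 50*sum_{k>=3} (k-1) p_k
q2 = 50*sum(max(k-2,0).*(k>=3).*pD);   % q_2 = 50*sum_{k>=3} (k-2) p_k
q3 = 50*sum(max(k-3,0).*(k>=3).*pD);   % q_3 = q_0 - 85
q  = [q0; q1; q2; q3]                  % approx [86.1668; 9.1970; 5.1819; 1.1668]
P  = [0.0803 0.1839 0.3679 0.3679; 0.6321 0.3679 0 0; ...
      0.2642 0.3679 0.3679 0;      0.0803 0.1839 0.3679 0.3679];
alpha = 0.8;
phi = (eye(4) - alpha*P)\q             % approx [185.68; 146.09; 123.89; 100.68]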

4. Evolution of chains with costs and rewards

1. No discounting. A generating function analysis of
$$(*) \qquad v(m+1) = q + P\, v(m) \quad \text{for all } m > 0, \text{ with } v(0) = 0 \qquad (34)$$
shows that
$$v(n) = n\, g\, \mathbf{1} + v + \text{transients}, \quad \text{where } v = B_0 q \qquad (35)$$
Here $g = \pi q$, $\mathbf{1}$ is a column matrix of ones, $P_0 = \lim_n P^n$, and $B_0$ is a matrix which the analysis also shows we may approximate by
$$B_0 = B(1) \approx I + P + P^2 + \cdots + P^n - (n+1)\, P_0 \qquad (36)$$
The magnitude of the largest eigenvalue with $|\lambda_i| < 1$ gives an indication of how many terms are needed.
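
The approximation in (36) is straightforward to carry out numerically; the same computation appears later in the procedure newpol.m. The sketch below is not part of the original module and assumes an ergodic $P$ and the matrix q are already in the workspace.

% Approximate P0, the gain rate g, and the constant term v = B0*q of (35)-(36).
n  = 16;                   % power of P used for P0; adjust using the eigenvalues of P
s  = 6;                    % highest power of P retained in the approximation of B0
P0 = P^n;                  % rows approximate the long-run distribution
PI = P0(1,:);
G  = PI*q                  % long-run expected gain per period
C  = eye(size(P,1));
for j = 1:s
    C = C + P^j;           % C = I + P + P^2 + ... + P^s
end
v  = (C - (s+1)*P0)*q      % v(n) is approximately n*G + v for large n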

## Example 3: The inventory problem (continued)

We consider again the inventory problem. We have
$$P = \begin{bmatrix} 0.0803 & 0.1839 & 0.3679 & 0.3679 \\ 0.6321 & 0.3679 & 0 & 0 \\ 0.2642 & 0.3679 & 0.3679 & 0 \\ 0.0803 & 0.1839 & 0.3679 & 0.3679 \end{bmatrix} \quad \text{and} \quad q = \begin{bmatrix} 86.1668 \\ 9.1970 \\ 5.1819 \\ 1.1668 \end{bmatrix} \qquad (37)$$

The eigenvalues are $1$, $0.0920 + 0.2434i$, $0.0920 - 0.2434i$, and $0$. Thus, the decay of the transients is controlled by $|\lambda^*| = (0.0920^2 + 0.2434^2)^{1/2} = 0.2602$. Since $|\lambda^*|^4 \approx 0.0046$, the transients die out rather rapidly. We obtain the following approximations
$$P_0 \approx P^{16} = \begin{bmatrix} 0.2858 & 0.2847 & 0.2632 & 0.1663 \\ 0.2858 & 0.2847 & 0.2632 & 0.1663 \\ 0.2858 & 0.2847 & 0.2632 & 0.1663 \\ 0.2858 & 0.2847 & 0.2632 & 0.1663 \end{bmatrix} \quad \text{so that} \quad \pi \approx [0.2858\ \ 0.2847\ \ 0.2632\ \ 0.1663] \qquad (38)$$

The approximate value of $B_0$ is found to be
$$B_0 \approx I + P + P^2 + P^3 + P^4 - 5 P_0 \approx \begin{bmatrix} 0.4834 & -0.3766 & -0.1242 & 0.0174 \\ 0.0299 & 0.7537 & -0.5404 & -0.2432 \\ -0.2307 & -0.1684 & 0.7980 & -0.3989 \\ -0.5166 & -0.3766 & -0.1242 & 1.0174 \end{bmatrix} \qquad (39)$$

The value $g = \pi q \approx 28.80$ is found in the earlier treatment. From the values above, we have
$$v = B_0 q \approx \begin{bmatrix} 37.6 \\ 6.4 \\ -17.8 \\ -47.4 \end{bmatrix} \quad \text{so that} \quad v(n) \approx \begin{bmatrix} 28.80n + 37.6 \\ 28.80n + 6.4 \\ 28.80n - 17.8 \\ 28.80n - 47.4 \end{bmatrix} + \text{transients} \qquad (40)$$

The average gain per period is clearly $g \approx 28.80$. This soon dominates the constant terms represented by the entries in $v$.

2. With discounting. Let the discount factor be $\alpha \in (0,1)$. If $n$ represents the present period, then $G_{n+1}$ is the reward in the first succeeding period and $G_{n+k}$ is the reward in the $k$th succeeding period. If we do not discount the first period, then
$$G_n(m) = G_{n+1} + \alpha G_{n+2} + \alpha^2 G_{n+3} + \cdots + \alpha^{m-1} G_{n+m} = G_{n+1} + \alpha\, G_{n+1}(m-1) \qquad (41)$$
Thus
$$v_i(m) = E[G_n(m) \mid X_n = i] = q_i + \alpha \sum_{j=1}^{M} p(i,j)\, v_j(m-1) \qquad (42)$$
In matrix form, this is
$$v(n) = q + \alpha P\, v(n-1) \qquad (43)$$
A generating function analysis shows that
$$v_i(n) = v_i + \text{transients} \qquad 1 \le i \le M \qquad (44)$$
Hence, the steady-state part of
$$v_i(m) = q_i + \alpha \sum_{j=1}^{M} p(i,j)\, v_j(m-1) \quad \text{is} \quad v_i = q_i + \alpha \sum_{j=1}^{M} p(i,j)\, v_j \qquad 1 \le i \le M \qquad (45)$$
In matrix form,
$$v = q + \alpha P v \quad \text{which yields} \quad v = [I - \alpha P]^{-1} q \qquad (46)$$
Since the $q_i = E[G_{n+1} \mid X_n = i]$ are known, we can solve for the $v_i$. Also, since $v_i(m)$ is the present value of the expected gain $m$ steps into the future, the $v_i$ represent the present value for an unlimited future, given that the process starts in state $i$.
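
Computationally this is a single linear solve; it is the same step carried out by the procedure newpold.m later in the module. A minimal sketch, assuming P, q, and alpha are already in the workspace:

% Solve the steady-state equation v = q + alpha*P*v of (46).
v = (eye(size(P,1)) - alpha*P)\q     % present values for an unlimited future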
5. Stationary decision policies

We suppose throughout this section that $X_N$ is ergodic, with finite state space $E = \{1, 2, \ldots, M\}$. For each state $i \in E$, there is a class $A_i$ of possible actions which may be taken when the process is in state $i$. A decision policy is a sequence of decision functions $d_1, d_2, \ldots$ such that
$$\text{the action at stage } n \text{ is } d_n(X_0, X_1, \ldots, X_n) \qquad (47)$$
The action selected is in $A_i$ whenever $X_n = i$. We consider a special class of decision policies. A stationary decision policy is one for which
$$d_n(X_0, X_1, \ldots, X_n) = d(X_n) \quad \text{with } d \text{ invariant with } n \qquad (48)$$
The possible actions depend upon the current state. That is, $d(X_n) \in A_i$ whenever $X_n = i$. Analysis shows the Markov character of the process is preserved under stationary policies. Also, if $E$ is finite and every stationary policy produces an irreducible chain, then a stationary policy which is optimal among stationary policies is optimal among all policies. We suppose each policy yields an ergodic chain.

Since the transition probabilities from any state depend in part on the action taken when in that state, the long-run probabilities will depend upon the policy. In the no-discounting case, we want to determine a stationary policy that maximizes the gain $g = \pi q$ for the corresponding long-run probabilities. We say "a stationary policy," since we do not rule out the possibility that there may be more than one policy which maximizes the gain. In the discounting case, the goal is to maximize the steady-state part $v_i$ in the expressions
$$v_i(n) = v_i + \text{transients} \qquad 1 \le i \le M \qquad (49)$$
In simple cases, with a small state space and a small number of possible actions in each state, it may be feasible to obtain the gain or present values for each permissible policy and then select the optimum by direct comparison. However, for most cases this is not an efficient approach. In the next two sections, we consider iterative procedures for step-by-step optimization which usually greatly reduce the computational burden.
6. Policy iteration method – no discounting

We develop an iterative procedure for finding an optimal stationary policy. The goal is to determine a stationary policy that maximizes the long-run expected gain per period $g$. In the next section, we extend this procedure to the case where discounting is in effect. We assume that each policy yields an ergodic chain. Suppose that under a given policy we have established

(1) $v_i(n) = q_i + \sum_j p(i,j)\, v_j(n-1)$, where $q_i = E[G_{n+1} \mid X_n = i]$.

In this case, the analysis in Sec 4 shows that for large $n$,

(2) $v_i(n) \approx n g + v_i$, where $g = \sum_i \pi_i q_i$.

We note that $v_i$ and $g$ depend upon the entire policy, while $q_i$ and $p(i,j)$ depend on the individual actions $a_i$. Using (1) and (2), we obtain
$$n g + v_i = q_i + \sum_j p(i,j)\,[(n-1) g + v_j] = q_i + (n-1) g + \sum_j p(i,j)\, v_j \qquad (50)$$
From this we conclude

(3) $g + v_i = q_i + \sum_j p(i,j)\, v_j$ for all $i \in E$.

Suppose a policy $d$ has been used. That is, action $d(i)$ is taken whenever the process is in state $i$. To simplify writing, we drop the indication of the action and simply write $p(i,j)$ for $p_{ij}(d(i))$, etc. Associated with this policy, there is a gain $g$. We should like to determine whether or not this is the maximum possible gain, and if it is not, to find a policy which does give the maximum gain. The following two-phase procedure has been found to be effective. It is essentially the procedure developed originally by Ronald Howard, who pioneered the treatment of these problems. The new feature is the method of carrying out the value-determination step, utilizing the approximation method noted above.

1. Value-determination procedure. We calculate $g = \sum_i \pi_i q_i = \pi q$. As in Sec 4, we calculate
$$v = B_0 q \approx [I + P + P^2 + \cdots + P^s - (s+1) P_0]\, q \quad \text{where } P_0 = \lim_n P^n \qquad (51)$$
2. Policy-improvement procedure. We suppose policy $d$ has been used through period $n$. Then, by (3), above,
$$g + v_i = q_i + \sum_j p(i,j)\, v_j \qquad (52)$$
We seek to improve policy $d$ by selecting policy $d^*$, with $d^*(i) = a_{ik}^*$, to satisfy
$$q_i^* + \sum_j p_{ij}^*\, v_j = \max\Big\{ q_i(a_{ik}) + \sum_j p_{ij}(a_{ik})\, v_j : a_{ik} \in A_i \Big\}, \qquad 1 \le i \le M \qquad (53)$$
Remarks

• In the procedure for selecting $d^*$, we use the "old" $v_j$.
• It has been established that $g^* \ge g$, with equality iff $g$ is optimal. Since there is a finite number of policies, the procedure must converge to an optimum stationary policy in a finite number of steps.

We implement this two-step procedure with Matlab. To demonstrate the procedure, we consider the following example.

## Example 4: A numerical example

A Markov decision process has three states: state space $E = \{1, 2, 3\}$.

Actions: state 1, $A_1 = \{1, 2, 3\}$; state 2, $A_2 = \{1, 2\}$; state 3, $A_3 = \{1, 2\}$.

Transition probabilities and rewards are:

| State $i$ | Action $a$ | $p_{ij}(a)$, $j = 1, 2, 3$ | $g_{ij}(a)$, $j = 1, 2, 3$ |
|---|---|---|---|
| 1 | 1 | [1/3 1/3 1/3] | [1 3 4] |
| 1 | 2 | [1/4 3/8 3/8] | [2 2 3] |
| 1 | 3 | [1/3 1/3 1/3] | [2 2 3] |
| 2 | 1 | [1/8 3/8 1/2] | [2 1 2] |
| 2 | 2 | [1/2 1/4 1/4] | [1 4 4] |
| 3 | 1 | [3/8 1/4 3/8] | [2 3 3] |
| 3 | 2 | [1/8 1/4 5/8] | [3 2 2] |

Use the policy iteration method to determine the policy which gives the maximum gain $g$.

A computational procedure utilizing Matlab. We first put the data in an m-file. Since we have several cases, it is expedient to include the case number. This example belongs to Type 2. Data in file md61.m:

type = 2
PA = [1/3 1/3 1/3; 1/4 3/8 3/8; 1/3 1/3 1/3; 0 0 0;
1/8 3/8 1/2; 1/2 1/4 1/4; 0 0 0;
3/8 1/4 3/8; 1/8 1/4 5/8]
GA = [1 3 4; 2 2 3; 2 2 3; 0 0 0;        % Zero rows are separators
2 1 2; 1 4 4; 0 0 0;
2 3 3; 3 2 2]
A = [1 2 3 0 1 2 0 1 2]


md61
type = 2
PA = 0.3333    0.3333    0.3333
0.2500    0.3750    0.3750
0.3333    0.3333    0.3333
0         0         0
0.1250    0.3750    0.5000
0.5000    0.2500    0.2500
0         0         0
0.3750    0.2500    0.3750
0.1250    0.2500    0.6250

GA =  1     3     4
2     2     3
2     2     3
0     0     0
2     1     2
1     4     4
0     0     0
2     3     3
3     2     2

A =   1     2     3     0     1     2     0     1     2


Once the data are entered into Matlab by the call for file “md61,” we make preparation for the policy-improvement step. The procedure is in the file newpolprep.m

% file newpolprep.m
% version of 3/23/92
disp('Data needed:')
disp('  Matrix PA of transition probabilities for states and actions')
disp('  Matrix GA of gains for states and actions')
disp('  Type number to identify GA matrix types')
disp('    Type 1.  Gains associated with a state')
disp('    Type 2.  One-step transition gains')
disp('    Type 3.  Gains associated with a demand')
disp('  Row vector of actions')
disp('  Value of alpha (= 1 for no discounting)')
c  = input('Enter type number to show gain type ');
a  = input('Enter value of alpha (= 1 for no discounting) ');
PA = input('Enter matrix PA of transition probabilities ');
GA = input('Enter matrix GA of gains ');
if c == 3
PD = input('Enter row vector of demand probabilities ');
end
if c == 1
QA = GA';
elseif c ==2
QA = diag(PA*GA');     % (qx1)
else
QA = GA*PD';
end

A  = input('Enter row vector A of possible actions ');  % (1xq)
m  = length(PA(1,:));
q  = length(A);
n  = input('Enter n, the power of P  to approximate P0 ');
s  = input('Enter s, the power of P  in the approximation of V ');
QD = [(1:q)' A' QA];    % First col is index; second is A; third is QA
DIS = ['    Index      Action    Value'];
disp(DIS)
disp(QD)
if a == 1
call = 'Call for newpol';
else
call = 'Call for newpold';
end
disp(call)

newpolprep   % Call for preparatory program in file newpolprep.m
Data needed:
Matrix PA of transition probabilities for states and actions
Matrix GA of gains for states and actions
Type number to identify GA matrix types
Type 1.  Gains associated with a state
Type 2.  One-step transition gains
Type 3.  Gains associated with a demand
Row vector of actions
Value of alpha (= 1 for no discounting)

Enter type number to show gain type 2
Enter value of alpha (=1 for no discounting) 1
Enter matrix PA of transition probabilities  PA
Enter matrix GA of gains GA
Enter row vector A of possible actions  A
Enter n, the power of P  to approximate P0  16
Enter s, the power of P  in the approximation of V  6
Index      Action    Value
1.0000    1.0000    2.6667
2.0000    2.0000    2.3750
3.0000    3.0000    2.3333
4.0000         0         0
5.0000    1.0000    1.6250
6.0000    2.0000    2.5000
7.0000         0         0
8.0000    1.0000    2.6250
9.0000    2.0000    2.1250

Call for newpol


The procedure has prompted for newpol (in file newpol.m).

% file:  newpol.m  (without discounting)
% version of 3/23/92
d  = input('Enter policy as row matrix of indices ');
D  = A(d);               % policy in terms of actions
P  = PA(d',:);           % selects probabilities for policy
Q  = QA(d',:);           % selects next-period gains for policy
P0 = P^n;                % Display to check convergence
PI = P0(1,:);            % Long-run distribution
G = PI*Q                 % Long-run expected gain per period
C = P + eye(length(P));
for j=2:s
C = C + P^j;           % C = I + P + P^2 + ... + P^s
end
V = (C - (s+1)*P0 )*Q;  % V = B0*Q
disp(' ')
disp('Approximation to P0; rows are long-run dbn')
disp(P0)
disp('Policy in terms of indices')
disp(d)
disp('Policy in terms of actions')
disp(D)
disp('Values for the policy selected')
disp(V)
disp('Long-run expected gain per period G')
disp(G)
T = [(1:q)' A' (QA + PA*V)];  % Test values for determining new policy
DIS =['    Index     Action   Test Value'];
disp(DIS)
disp(T)
disp('Check current policy against new test values.')
disp('--If new policy needed, call for newpol')
disp('--If not, use policy, values V, and gain G, above')

newpol
Enter policy as row matrix of indices [2 6 9]  % A deliberately poor choice

Approximation to P0; rows are long-run dbn
0.2642    0.2830    0.4528
0.2642    0.2830    0.4528
0.2642    0.2830    0.4528

Policy in terms of indices
2     6     9

Policy in terms of actions
2     2     2

Long-run expected gain per period G
2.2972

Index     Action   Test Value
1.0000    1.0000    2.7171
2.0000    2.0000    2.4168
3.0000    3.0000    2.3838
4.0000         0         0
5.0000    1.0000    1.6220
6.0000    2.0000    2.5677
7.0000         0         0
8.0000    1.0000    2.6479
9.0000    2.0000    2.0583

Check current policy against new test values.
--If new policy needed, call for newpol
--If not, use policy and gain G, above      % New policy is needed
newpol

Enter policy as row matrix of indices [1 6 8]

Approximation to P0; rows are long-run dbn
0.3939    0.2828    0.3232
0.3939    0.2828    0.3232
0.3939    0.2828    0.3232

Policy in terms of indices
1     6     8

Policy in terms of actions
1     2     1

Values for selected policy
0.0526
-0.0989
0.0223

Long-run expected gain per period G
2.6061

Index     Action   Test Value
1.0000    1.0000    2.6587
2.0000    2.0000    2.3595
3.0000    3.0000    2.3254
4.0000         0         0
5.0000    1.0000    1.6057
6.0000    2.0000    2.5072
7.0000         0         0
8.0000    1.0000    2.6284
9.0000    2.0000    2.1208

Check current policy against new test values.
--If new policy needed, call for newpol
--If not, use policy, values V, and gain G, above


Since the policy selected on this iteration is the same as the previous one, the procedure has converged to an optimal policy. The first of the first three rows is maximum; the second of the next two rows is maximum; and the first of the final two rows is maximum. Thus, we have selected rows 1, 6, 8, corresponding to the optimal policy $d^*$ with actions $(1, 2, 1)$. The expected long-run gain per period is $g = 2.6061$.

The value matrix is
$$v = \begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix} = \begin{bmatrix} 0.0527 \\ -0.0989 \\ 0.0223 \end{bmatrix} \qquad (54)$$

We made a somewhat arbitrary choice of the powers of $P$ used in the approximation of $P_0$ and $B_0$. As we note in the development of the approximation procedures in Sec 4, the convergence of $P^n$ is governed by the magnitude of the largest eigenvalue other than one. We can always get a check on the reliability of the calculations by checking the eigenvalues of $P$ corresponding to the presumed optimal policy. For the choice above, we find the eigenvalues to be $1$, $-0.1250$, and $0.0833$. Since $0.1250^4 \approx 0.0002$, the choices of exponents should be quite satisfactory. In fact, we could probably use $P^8$ as a satisfactory approximation to $P_0$, if that were desirable. The margin allows for the possibility that for some policies the eigenvalues may not be so small.
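
With the data of md61.m in the workspace, this eigenvalue check takes two lines (a sketch, not part of the original session):

P = PA([1 6 8],:);     % transition matrix under the optimal policy, actions (1,2,1)
ev = eig(P)            % expect eigenvalues 1, -0.1250, 0.0833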

7. Policy iteration – with discounting

It turns out that the policy iteration procedure is simpler in the case of discounting, as suggested by the evolution analysis in Sec 4. We have the following two-phase procedure, based on that analysis.

1. Value-determination procedure. Given the $q_i$ and transition probabilities $p(i,j)$ for the current policy, solve $v = q + \alpha P v$ to get the corresponding $v_i$:
$$v = [I - \alpha P]^{-1} q \qquad (55)$$
2. Policy-improvement procedure. Given $\{v_i : 1 \le i \le M\}$ for the current policy, determine a new policy $d^*$, with each $d^*(i) = a_i^*$ determined as that action for which
$$q_i^* + \alpha \sum_{j=1}^{M} p^*(i,j)\, v_j = \max_k \Big\{ q_i(a_{ik}) + \alpha \sum_{j=1}^{M} p_{ij}(a_{ik})\, v_j : a_{ik} \in A_i \Big\} \qquad (56)$$

We illustrate the Matlab procedure with the same numerical example as in the previous case, using discount factor $\alpha = 0.8$. The data file is the same, so we call for it, as before.

## Example 5

Assume data in file md61.m is in Matlab; if not, call for md61. We use the procedure newpolprep to prepare for the iterative procedure by making some initial choices.

newpolprep
Data needed:
Matrix PA of transition probabilities for states and actions
Matrix GA of gains for states and actions
Type number to identify GA matrix types
Type 1.  Gains associated with a state
Type 2.  One-step transition gains
Type 3.  Gains associated with a demand
Row vector of actions
Value of alpha (= 1 for no discounting)

Enter type number to show gain type 2
Enter value of alpha (= 1 for no discounting) 0.8
Enter matrix PA of transition probabilities PA
Enter matrix GA of gains GA
Enter row vector A of possible actions A
Enter n, the power of P  to approximate P0 16
Enter s, the power of P  in the approximation of V 6

Index      Action  Test Value
1.0000    1.0000    2.6667
2.0000    2.0000    2.3750
3.0000    3.0000    2.3333
4.0000         0         0
5.0000    1.0000    1.6250
6.0000    2.0000    2.5000
7.0000         0         0
8.0000    1.0000    2.6250
9.0000    2.0000    2.1250

Call for newpold


The procedure for policy iteration is in the file newpold.m.

% file newpold.m  (with discounting)
% version of 3/23/92
d = input('Enter policy as row matrix of indices ');
D = A(d);
P = PA(d',:);        % transition matrix for policy selected
Q = QA(d',:);        % average next-period gain for policy selected

V = (eye(length(P)) - a*P)\Q;    % Present values for unlimited future
T = [(1:q)' A' (QA + a*PA*V)];   % Test values for determining new policy
disp(' ')
disp('Approximation to P0; rows are long-run dbn')
disp(P0)
disp('    Policy in terms of indices')
disp(d)
disp('    Policy in terms of actions')
disp(D)
disp('  Values for policy')
disp(V)
DIS =['    Index     Action    Test Value'];
disp(DIS)
disp(T)
disp('Check current policy against new test values.')
disp('--If new policy needed, call for newpold')
disp('--If not, use policy and values above')

newpold
Enter policy as row matrix of indices [3 5 9]  % Deliberately poor choice

Approximation to P0; rows are long-run dbn
0.3939    0.2828    0.3232
0.3939    0.2828    0.3232
0.3939    0.2828    0.3232

Policy in terms of indices
3     5     9

Policy in terms of actions
3     1     2

Values for policy
10.3778
9.6167
10.1722

Index     Action  Test Value
1.0000    1.0000   10.7111
2.0000    2.0000   10.3872
3.0000    3.0000   10.3778
4.0000         0         0
5.0000    1.0000    9.6167
6.0000    2.0000   10.6089
7.0000         0         0
8.0000    1.0000   10.7133
9.0000    2.0000   10.1722

Check current policy against new test values.
--If new policy needed, call for newpold
--If not, use policy and values above

newpold
Enter policy as row matrix of indices [1 6 8]

Approximation to P0; rows are long-run dbn
0.3939    0.2828    0.3232
0.3939    0.2828    0.3232
0.3939    0.2828    0.3232

Policy in terms of indices
1     6     8

Policy in terms of actions
1     2     1

Values for policy
13.0844
12.9302
13.0519

   Index     Action  Test Value
1.0000    1.0000   13.0844
2.0000    2.0000   12.7865
3.0000    3.0000   12.7511
4.0000         0         0
5.0000    1.0000   12.0333
6.0000    2.0000   12.9302
7.0000         0         0
8.0000    1.0000   13.0519
9.0000    2.0000   12.5455

Check current policy against new test values.
--If new policy needed, call for newpold
--If not, use policy and values above


Since the policy indicated is the same as the previous policy, we know this is an optimal policy. Rows 1, 6, 8 indicate the optimal policy to be $d^*$ with actions $(1, 2, 1)$. It turns out in this example that the optimal policies are the same for the discounted and undiscounted cases. That is not always true. The matrix of values $V$ gives the present values for unlimited futures, for each of the three possible starting states.
