Inside Collection (Textbook): Topics in Applied Probability
Summary: We first summarize certain essentials in the analysis of homogeneous Markov chains with costs and rewards associated with states, or with transitions between states. Then we consider three cases: a. Gain associated with a state; b. One-step transition gains; and c. Gains associated with a demand under certain reasonable conditions. Matlab implementations of the results of analysis provide machine solutions to a variety of problems.
In order to provide the background for Matlab procedures for Markov decision processes, we first summarize certain essentials in the analysis of homogeneous Markov chains with costs and rewards associated with states, or with transitions between states. References are to Pfeiffer: PROBABILITY FOR APPLICATIONS.
Let XN be an ergodic, homogeneous Markov chain with state
space E and gain sequence
Consider an ergodic, homogeneous Markov chain XN with gain sequence
Suppose the transition matrix P and the next-period expected gain matrix q are
If k units are ordered, the cost in dollars is
For each unit of unsatisfied demand, there is a penalty of 50 dollars. We use
the
The largest eigenvalue
Taking any row of
We thus have
For the Poisson distribution
Hence
Then, for
Recall that we are dealing with expected discounted costs. Since we
usually consider starting with a full stock, the amount
We consider again the inventory problem. We have
The eigenvalues are
The approximate value of B0 is found to be
The value
The average gain per period is clearly
A Markov decision process has three states: State space
Actions: State 1:
Transition probabilities and rewards are:
|
| |
A computational procedure utilizing Matlab We first put the data in an m-file. Since we have several cases, it is expedient to include the case number. This example belongs to type 2. Data in file md61.m
type = 2
PA = [1/3 1/3 1/3; 1/4 3/8 3/8; 1/3 1/3 1/3; 0 0 0;
1/8 3/8 1/2; 1/2 1/4 1/4; 0 0 0;
3/8 1/4 3/8; 1/8 1/4 5/8]
GA = [1 3 4; 2 2 3; 2 2 3; 0 0 0; % Zero rows are separators
2 1 2; 1 4 4; 0 0 0;
2 3 3; 3 2 2]
A = [1 2 3 0 1 2 0 1 2]
md61
type = 2
PA = 0.3333 0.3333 0.3333
0.2500 0.3750 0.3750
0.3333 0.3333 0.3333
0 0 0
0.1250 0.3750 0.5000
0.5000 0.2500 0.2500
0 0 0
0.3750 0.2500 0.3750
0.1250 0.2500 0.6250
GA = 1 3 4
2 2 3
2 2 3
0 0 0
2 1 2
1 4 4
0 0 0
2 3 3
3 2 2
A = 1 2 3 0 1 2 0 1 2
Once the data are entered into Matlab by the call for file “md61,” we make preparation for the policy-improvement step. The procedure is in the file newpolprep.m
% file newpolprep.m
% version of 3/23/92
disp('Data needed:')
disp(' Matrix PA of transition probabilities for states and actions')
disp(' Matrix GA of gains for states and actions')
disp(' Type number to identify GA matrix types')
disp(' Type 1. Gains associated with a state')
disp(' Type 2. One-step transition gains')
disp(' Type 3. Gains associated with a demand')
disp(' Row vector of actions')
disp(' Value of alpha (= 1 for no discounting)')
c = input('Enter type number to show gain type ');
a = input('Enter value of alpha (= 1 for no discounting) ');
PA = input('Enter matrix PA of transition probabilities ');
GA = input('Enter matrix GA of gains ');
if c == 3
PD = input('Enter row vector of demand probabilities ');
end
if c == 1
QA = GA';
elseif c ==2
QA = diag(PA*GA'); % (qx1)
else
QA = GA*PD';
end
A = input('Enter row vector A of possible actions '); % (1xq)
m = length(PA(1,:));
q = length(A);
n = input('Enter n, the power of P to approximate P0 ');
s = input('Enter s, the power of P in the approximation of V ');
QD = [(1:q)' A' QA]; % First col is index; second is A; third is QA
DIS = [' Index Action Value'];
disp(DIS)
disp(QD)
if a == 1
call = 'Call for newpol';
else
call = 'Call for newpold';
end
disp(call)
newpolprep % Call for preparatory program in file npolprep.m
Data needed:
Matrix PA of transition probabilities for states and actions
Matrix GA of gains for states and actions
Type number to identify GA matrix types
Type 1. Gains associated with a state
Type 2. One-step transition gains
Type 3. Gains associated with a demand
Row vector of actions
Value of alpha (= 1 for no discounting)
Enter type number to show gain type 2
Enter value of alpha (=1 for no discounting) 1
Enter matrix PA of transition probabilities PA
Enter matrix GA of gains GA
Enter row vector A of possible actions A
Enter n, the power of P to approximate P0 16
Enter s, the power of P in the approximation of V 6
Index Action Value
1.0000 1.0000 2.6667
2.0000 2.0000 2.3750
3.0000 3.0000 2.3333
4.0000 0 0
5.0000 1.0000 1.6250
6.0000 2.0000 2.5000
7.0000 0 0
8.0000 1.0000 2.6250
9.0000 2.0000 2.1250
Call for newpol
The procedure has prompted for the procedure newpol (in file newpol.m)
% file: newpol.m (without discounting)
% version of 3/23/92
d = input('Enter policy as row matrix of indices ');
D = A(d); % policy in terms of actions
P = PA(d',:); % selects probabilities for policy
Q = QA(d',:); % selects next-period gains for policy
P0 = P^n; % Display to check convergence
PI = P0(1,:); % Long-run distribution
G = PI*Q % Long-run expected gain per period
C = P + eye(P);
for j=2:s
C = C + P^j; % C = I + P + P^2 + ... + P^s
end
V = (C - (s+1)*P0 )*Q; % B = B0*Q
disp(' ')
disp('Approximation to P0; rows are long-run dbn')
disp(P0)
disp('Policy in terms of indices')
disp(d)
disp('Policy in terms of actions')
disp(D)
disp('Values for the policy selected')
disp(V)
disp('Long-run expected gain per period G')
disp(G)
T = [(1:q)' A' (QA + PA*V)]; % Test values for determining new policy
DIS =[' Index Action Test Value'];
disp(DIS)
disp(T)
disp('Check current policy against new test values.')
disp('--If new policy needed, call for newpol')
disp('--If not, use policy, values V, and gain G, above')
newpol
Enter policy as row matrix of indices [2 6 9] % A deliberately poor choice
Approximation to P0; rows are long-run dbn
0.2642 0.2830 0.4528
0.2642 0.2830 0.4528
0.2642 0.2830 0.4528
Policy in terms of indices
2 6 9
Policy in terms of actions
2 2 2
Long-run expected gain per period G
2.2972
Index Action Test Value
1.0000 1.0000 2.7171
2.0000 2.0000 2.4168
3.0000 3.0000 2.3838
4.0000 0 0
5.0000 1.0000 1.6220
6.0000 2.0000 2.5677
7.0000 0 0
8.0000 1.0000 2.6479
9.0000 2.0000 2.0583
Check current policy against new test values.
--If new policy needed, call for newpol
--If not, use policy and gain G, above % New policy is needed
newpol
Enter policy as row matrix of indices [1 6 8]
Approximation to P0; rows are long-run dbn
0.3939 0.2828 0.3232
0.3939 0.2828 0.3232
0.3939 0.2828 0.3232
Policy in terms of indices
1 6 8
Policy in terms of actions
1 2 1
Values for selected policy
0.0526
-0.0989
0.0223
Long-run expected gain per period G
2.6061
Index Action Test Value
1.0000 1.0000 2.6587
2.0000 2.0000 2.3595
3.0000 3.0000 2.3254
4.0000 0 0
5.0000 1.0000 1.6057
6.0000 2.0000 2.5072
7.0000 0 0
8.0000 1.0000 2.6284
9.0000 2.0000 2.1208
Check current policy against new test values.
--If new policy needed, call for newpol
--If not, use policy, values V, and gain G, above
Since the policy selected on this iteration is the same as the
previous one, the procedure has converged to an optimal policy.
The first of the first three rows is maximum; the second of the next two
rows is maximum; and the first of the final two rows is maximum. Thus,
we have selected rows 1, 5, 6, corresponding to the optimal policy
The value matrix is
We made a somewhat arbitrary choice of the powers of P used in the
approximation of P0 and B0. As we note in the development
of the approximation procedures in Sec 4, the convergence of
Assume data in file md61.m is in Matlab; if not, call for md61. We use the procedure newpolprep to prepare for the iterative procedure by making some initial choices.
newpolprep
Data needed:
Matrix PA of transition probabilities for states and actions
Matrix GA of gains for states and actions
Type number to identify GA matrix types
Type 1. Gains associated with a state
Type 2. One-step transition gains
Type 3. Gains associated with a demand
Row vector of actions
Value of alpha (= 1 for no discounting)
Enter type number to show gain type 2
Enter value of alpha (= 1 for no discounting) 0.8
Enter matrix PA of transition probabilities PA
Enter matrix GA of gains GA
Enter row vector A of possible actions A
Enter n, the power of P to approximate P0 16
Enter s, the power of P in the approximation of V 6
Index Action Test Value
1.0000 1.0000 2.6667
2.0000 2.0000 2.3750
3.0000 3.0000 2.3333
4.0000 0 0
5.0000 1.0000 1.6250
6.0000 2.0000 2.5000
7.0000 0 0
8.0000 1.0000 2.6250
9.0000 2.0000 2.1250
Call for newpold
The procedure for policy iteration is in the file newpold.m.
% file newpold.m (with discounting)
% version of 3/23/92
d = input('Enter policy as row matrix of indices ');
D = A(d);
P = PA(d',:); % transition matrix for policy selected
Q = QA(d',:); % average next-period gain for policy selected
V = (eye(P) - a*P)\Q; % Present values for unlimited future
T = [(1:q)' A' (QA + a*PA*V)]; % Test values for determining new policy
disp(' ')
disp('Approximation to P0; rows are long-run dbn')
disp(P0)
disp(' Policy in terms of indices')
disp(d)
disp(' Policy in terms of actions')
disp(D)
disp(' Values for policy')
disp(V)
DIS =[' Index Action Test Value'];
disp(DIS)
disp(T)
disp('Check current policy against new test values.')
disp('--If new policy needed, call for newpold')
disp('--If not, use policy and values above')
newpold
Enter policy as row matrix of indices [3 5 9] % Deliberately poor choice
Approximation to P0; rows are long-run dbn
0.3939 0.2828 0.3232
0.3939 0.2828 0.3232
0.3939 0.2828 0.3232
Policy in terms of indices
3 5 9
Policy in terms of actions
3 1 2
Values for policy
10.3778
9.6167
10.1722
Index Action Test Value
1.0000 1.0000 10.7111
2.0000 2.0000 10.3872
3.0000 3.0000 10.3778
4.0000 0 0
5.0000 1.0000 9.6167
6.0000 2.0000 10.6089
7.0000 0 0
8.0000 1.0000 10.7133
9.0000 2.0000 10.1722
Check current policy against new test values.
--If new policy needed, call for newpold
--If not, use policy and values above
newpold
Enter policy as row matrix of indices [1 6 8]
Approximation to P0; rows are long-run dbn
0.3939 0.2828 0.3232
0.3939 0.2828 0.3232
0.3939 0.2828 0.3232
Policy in terms of indices
1 6 8
Policy in terms of actions
1 2 1
Values for policy
13.0844
12.9302
13.0519
Index Action Test Value
1.0000 1.0000 13.0844
2.0000 2.0000 12.7865
3.0000 3.0000 12.7511
4.0000 0 0
5.0000 1.0000 12.0333
6.0000 2.0000 12.9302
7.0000 0 0
8.0000 1.0000 13.0519
9.0000 2.0000 12.5455
Check current policy against new test values.
--If new policy needed, call for newpold
--If not, use policy and values above
Since the policy indicated is the same as the previous policy,
we know this is an optimal policy. Rows 1, 6, 8, indicate the
optimal policy to be