Killed Markov Decision Processes for Countable Models for Crash Function Assessment

In this article, killed Markov decision processes for countable models on a finite time interval are considered. The existence of a uniform ε-optimal policy is proved. The correctness of the fundamental equation is shown. The optimal control problem is reduced to a similar problem for the derived model. The optimality equation and a method for constructing simple optimal policies are also obtained. A sufficient condition for simple policies in countable models is proved. The correctness of the Markovian property is shown. Additionally, the dynamic programming principle is considered.


Introduction
Markov¹ or similar decision processes arise in many different areas of economics, in particular in the economic planning of a single business, an economic sector or an entire economy. Typically, at the beginning of each period we can build the plan for the next period knowing the last achieved state. In such cases the development of the system can be described mathematically as a deterministic process, under the mild assumption that the system state at the end of each period is uniquely determined by the state at the beginning of that period and by the plan for the period.
In many cases, however, it is also necessary to consider the influence of factors such as meteorological conditions, demographic transition, demand fluctuations, imperfect coordination of compound production processes, scientific discoveries and inventions. Stochastic models are much better able to take these factors into account: if we know the state at the beginning of the period and the plan, we can only calculate the probability distribution of the states for the next period. Therefore, leaving aside the system states in past periods, we come to the idea of a Markov decision process ("the future depends not on the past, but only on the present").
Markov decision processes are described in detail by, for example, Dynkin and Yushkevich². There the definition of a Markov decision process is given, the concept of a "model" Z is presented, and the definitions of a policy and of the assessment of a policy are given.

Remark 1. In other words, the system moves to the initial (home) state when it hits a killed state (the process is killed).
From the definition of killed state it follows:

Definition 3 (Killed Markov decision process). A killed Markov decision process on a time interval [m, n] is defined through the following objects:

Our goal is to find a decision method that maximizes the mathematical expectation of the assessment (1) of the way l. By a decision method we mean a policy.

Remark 3.

Policies
The notions below are not well defined without the following assumption.

Assumption 1. The reward function q and the terminal reward function r have finite suprema (are bounded above).

A distribution is associated with a probability distribution P* on the space L, with the following notation:

Remark 4. Once the measure P* is defined, the way l can be interpreted as a stochastic process. This process is called a Markov process if the policy π is a Markov policy.
For every function on the space L, its mathematical expectation is defined with respect to the measure P*. The assessment (1) of the way l is an example of such a function, and its expectation is denoted as in (4).

Definition 9 (Assessment of policy). The value from (4) is called the assessment of the policy π for the killed Markov decision process Z*. The goal of this research is to maximize this assessment.

Definition 11 (ε-optimal policy). A killed policy
Definition 12 (Uniform ε-optimal policy). A killed policy is called uniform ε-optimal or

Existence of a uniform ε-optimal policy
Let π_x be an ε-optimal policy for the process Z*_x. Its existence follows from the definition of the supremum. Our aim is to build a single killed policy π that is ε-optimal for the model Z*, using the family of killed policies π_x. It is natural to use the policy π_x when x is the starting state, that is, to set π(· | h) = π_{x(h)}(· | h) (5), where x(h) is the initial state of the history h.
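The construction of a single policy from the family of per-state policies can be illustrated by a minimal Python sketch. It assumes that a history is represented as a tuple whose first entry is the initial state x(h) and that a policy is a function from histories to distributions over actions. The state names, action names and probabilities are hypothetical and serve only as an illustration.

```python
from typing import Callable, Dict, Tuple

# A history (x_m, a_{m+1}, x_{m+1}, ...) is stored as a tuple of labels.
# Its first entry is the initial state x(h). This representation is an assumption.
History = Tuple[str, ...]
# A policy maps a history to a probability distribution over actions.
Policy = Callable[[History], Dict[str, float]]


def stitch(per_state: Dict[str, Policy]) -> Policy:
    """Return the policy that follows per_state[x] on every history starting at x."""
    def combined(history: History) -> Dict[str, float]:
        initial_state = history[0]  # x(h)
        return per_state[initial_state](history)
    return combined


# Hypothetical per-state policies for two starting states.
def policy_from_s0(history: History) -> Dict[str, float]:
    return {"repair": 1.0}


def policy_from_s1(history: History) -> Dict[str, float]:
    return {"repair": 0.25, "run": 0.75}


pi = stitch({"s0": policy_from_s0, "s1": policy_from_s1})
```

For a history that starts in "s1", for example ("s1", "run", "s0"), the combined policy returns the distribution prescribed by the policy built for "s1", whatever states are visited later.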
It is clear that formula (5) defines a policy π and that this policy is ε-optimal.

Proposition 1 (Existence of a uniform ε-optimal killed policy). Every ε-optimal killed policy π from (5) is uniform ε-optimal, that is:

Remark 6. Formulas (6) and (7)

Derived model and fundamental equation
The decision process consists of a finite number of consecutive steps. The first step is the choice of a probability distribution on A_{m+1}, which depends on the initial state. Once this choice is made, every initial distribution on X_m corresponds to a probability distribution on X_{m+1}. We now consider the latter as the initial distribution at time m+1.
As a result, we divide our maximization problem into two subproblems:
1. Choose the optimal policy for the subsequent moments of time for every initial distribution on X_{m+1}.
2. Choose the first step so as to maximize the sum of the reward and the optimal policy assessment over the subsequent moments of time for the resulting initial distribution.
Equation (8) is called fundamental. It expresses the assessment of an arbitrary policy π in the model Z* in terms of the assessment of some policies in the model Z*'.
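The two subproblems can be illustrated by backward induction on a small finite model. The following Python sketch is not the paper's construction: the states, actions, rewards and transition probabilities are hypothetical, and the treatment of the killed state (the value of the home state is used whenever the killed state is hit, with no additional cost) is an assumption based on Remark 1.

```python
# Hypothetical two-step model. All names and numbers are illustrative assumptions.
STATES = ["home", "ok", "crash"]
KILLED = {"crash"}   # hitting a killed state returns the system to HOME
HOME = "home"
ACTIONS = ["safe", "risky"]
HORIZON = 2          # number of decision steps between times m and n

# Running reward q(x, a) and terminal reward r(x) on the non-killed states.
REWARD = {("home", "safe"): 1.0, ("home", "risky"): 3.0,
          ("ok", "safe"): 1.0, ("ok", "risky"): 3.0}
TERMINAL = {"home": -4.0, "ok": 2.0}

# Transition function p(y | x, a).
P = {("home", "safe"): {"ok": 1.0},
     ("home", "risky"): {"ok": 0.5, "crash": 0.5},
     ("ok", "safe"): {"ok": 1.0},
     ("ok", "risky"): {"ok": 0.5, "crash": 0.5}}


def redirected(state):
    """Assumed killing rule: a killed state is replaced by the home state."""
    return HOME if state in KILLED else state


def backward_induction():
    live = [x for x in STATES if x not in KILLED]
    value = {x: TERMINAL[x] for x in live}  # values at the final time n
    plan = []                               # plan[k][x] is the action at step k
    for _ in range(HORIZON):
        new_value, choice = {}, {}
        for x in live:
            best_action, best_total = None, float("-inf")
            for a in ACTIONS:
                # First-step reward plus expected value of the remaining steps.
                total = REWARD[(x, a)] + sum(
                    prob * value[redirected(y)] for y, prob in P[(x, a)].items())
                if total > best_total:
                    best_action, best_total = a, total
            new_value[x], choice[x] = best_total, best_action
        value = new_value
        plan.insert(0, choice)              # earlier steps go to the front
    return value, plan


value, plan = backward_induction()
```

In this example the computed plan takes the risky action at the first step and the safe action at the last step, since a crash at the very end leads to the unfavourable terminal reward of the home state.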

Definition 14 (Derived model). The model obtained from the model Z* by deleting X_m and
Remark 7. The fundamental equation is correct even without Assumption 1.

Reducing the optimal decision problem to the analogous problem for the derived model
The fundamental equation (8) implies the estimate (9), where y denotes the non-killed states and y* the killed states.
Let the operator V transform functions on A into functions on the non-killed, non-terminal states of X according to the formula: Let us rewrite inequality (9) using the operator V:

Z  and any
If we choose an action a at the first step and use the killed policy π' at all subsequent steps, we obtain a killed policy π in the model Z*. This policy is called the product of the first-step policy and π'. It has the expression:

Corollary 1. The assessment in the model Z* is expressed in terms of the assessment in the model Z*' in the following way: where the operators U and V are defined in (10) and (11).
Here the distribution (· | x) can be concentrated at a single point. The reward function q and the transition function p are denoted by q_t and p_t.
According to the results from section 5 we get: where: p y a f y p y a c y a A y X

Sufficient condition of simple policies for countable models
A question remains: do we lose anything by using only simple policies? The previous result does not answer it. It only shows that the losses can be made arbitrarily small.

Theorem 1 (Sufficient condition of simple policies). Let the initial distribution be fixed and let an arbitrary killed policy be given. Then there exists a simple policy such that:
Proof. This follows directly from Propositions 5 and 6.
Proposition 5. For every initial distribution and every killed policy there exists a Markov policy such that: These two policies are called equivalent.
Proposition 6. For every Markov policy there exists a simple policy such that: We say that the simple policy dominates the Markov policy uniformly.
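Uniform domination of a Markov policy by a simple policy can be checked numerically on a small example. The Python sketch below assumes a finite model without killed states (for brevity) and a time-independent randomized Markov policy SIGMA. All names and numbers are hypothetical. The simple policy is built backwards in time by picking, in every state, the best action among those that SIGMA uses with positive probability. This is one possible construction and not necessarily the one used in the proof.

```python
# Hypothetical finite model. All names and numbers are illustrative assumptions.
STATES = ["s0", "s1"]
HORIZON = 2
Q = {("s0", "a"): 1.0, ("s0", "b"): 0.0,
     ("s1", "a"): 0.0, ("s1", "b"): 2.0}          # running reward q(x, a)
R = {"s0": 0.0, "s1": 1.0}                        # terminal reward r(x)
P = {("s0", "a"): {"s0": 1.0},
     ("s0", "b"): {"s1": 1.0},
     ("s1", "a"): {"s0": 0.5, "s1": 0.5},
     ("s1", "b"): {"s1": 1.0}}                    # transition function p(y | x, a)
# Randomized Markov policy: SIGMA[x][a] is the probability of action a in state x.
SIGMA = {"s0": {"a": 0.5, "b": 0.5}, "s1": {"a": 0.5, "b": 0.5}}


def action_value(x, a, next_value):
    """Reward of action a in state x plus the expected value of the next state."""
    return Q[(x, a)] + sum(p * next_value[y] for y, p in P[(x, a)].items())


def evaluate_and_dominate():
    v_sigma = dict(R)   # assessment of the Markov policy, computed backwards
    v_phi = dict(R)     # assessment of the simple policy, computed backwards
    phi = []            # phi[k][x] is the single action chosen at step k
    for _ in range(HORIZON):
        new_sigma, new_phi, choice = {}, {}, {}
        for x in STATES:
            new_sigma[x] = sum(
                w * action_value(x, a, v_sigma) for a, w in SIGMA[x].items())
            support = [a for a, w in SIGMA[x].items() if w > 0]
            choice[x] = max(support, key=lambda a: action_value(x, a, v_phi))
            new_phi[x] = action_value(x, choice[x], v_phi)
        v_sigma, v_phi = new_sigma, new_phi
        phi.insert(0, choice)
    return v_sigma, v_phi, phi


v_sigma, v_phi, phi = evaluate_and_dominate()
```

In this example the assessment of the simple policy is at least as large as that of the randomized policy in every starting state, which is the sense of uniform domination.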

Markovian property
Let 0 < < In particular, according to (24), if one policy is a uniform ε-optimal killed policy for Z_t^{*n} with terminal reward r and another is a uniform ε-optimal policy for

Notes
1 Markov processes are described in Feinberg, Shwartz (2002).
2 Dynkin, Yushkevich (1975).
3 Dynkin, Yushkevich (1975).
4 Some related ideas on this subject appear in Pakes (1997).
5 Elements of dynamic programming can be found in Bellman (1977).