The Markov Property
- A process has the Markov property when its future evolution depends only on the current state, not on the past.
- For decision processes, the current state has to be informative enough to act on, i.e., it must summarize everything relevant from the history.
- Markov Processes can be represented using a transition graph, where each node is a state and each edge is weighted by the corresponding transition probability (see the sketch below).
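As a minimal illustration (the three-state weather chain, its probabilities, and all names below are made-up assumptions, not from these notes), a transition graph can be stored as a row-stochastic matrix and simulated by sampling the next state from the current state's row:

```python
import numpy as np

# Hypothetical 3-state weather chain: each row is a state, each entry the
# probability of moving to the column's state (rows sum to 1).
states = ["sunny", "cloudy", "rainy"]
P = np.array([
    [0.8, 0.15, 0.05],
    [0.3, 0.4,  0.3 ],
    [0.2, 0.3,  0.5 ],
])

def simulate(P, start, steps, rng=np.random.default_rng(0)):
    """Sample a state sequence; the next state depends only on the current one."""
    path = [start]
    for _ in range(steps):
        path.append(rng.choice(len(states), p=P[path[-1]]))
    return [states[s] for s in path]

print(simulate(P, start=0, steps=5))
```

Each sampled step uses only the current state, which is exactly the Markov property.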
Markov Decision Processes
- Caveat: Not all processes need to be Markovian to be analyzed effectively.
- Markov Decision Processes are described using actions, rewards, states, and transitions.
- An agent and environment interact at discrete time steps $t = 0, 1, 2, \dots$
    - Each time step $t$ is associated with an environment state $S_t$.
    - An agent performs an action $A_t$ in response to state $S_t$.
    - A reward $R_{t+1}$ is given for taking $A_t$ in $S_t$.
    - The sequence $S_0, A_0, R_1, S_1, A_1, R_2, \dots$ is called the trajectory.
- An MDP is finite if the sets of states, actions, and rewards are all finite.
- The dynamics of the MDP are given by a probability distribution over next states and rewards, conditioned on the current state and action: $p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$. In an MDP, this probability completely characterizes the dynamics of the environment (a sampling sketch follows this list).
- These are essential in the field of Reinforcement Learning (see here). Almost all Reinforcement Learning tasks can be framed as Markov Decision Processes.
- The value function of an MDP under a given policy can be computed or approximated in several ways:
    - Simulation (e.g., Monte Carlo rollouts).
    - Analytical derivation (solving the Bellman equation directly; this is cubic in the number of states, and a worked sketch follows this list).
    - Dynamic Programming.
- A Markov Reward Process is a Markov Decision Process without actions.
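As a minimal sketch of the four-argument dynamics $p(s', r \mid s, a)$ from the list above, here is a tiny hand-made MDP (all states, actions, rewards, and probabilities are illustrative assumptions, not from these notes) together with a sampled trajectory $S_0, A_0, R_1, S_1, \dots$:

```python
import numpy as np

# Dynamics of a hypothetical 2-state MDP, stored as
# p[(s, a)] = list of (next_state, reward, probability) triples.
p = {
    ("low", "wait"):     [("low", 0.0, 1.0)],
    ("low", "recharge"): [("high", -1.0, 1.0)],
    ("high", "search"):  [("high", 2.0, 0.7), ("low", 2.0, 0.3)],
    ("high", "wait"):    [("high", 0.5, 1.0)],
}

def step(state, action, rng=np.random.default_rng(0)):
    """Sample (S_{t+1}, R_{t+1}) from p(s', r | s, a)."""
    outcomes = p[(state, action)]
    idx = rng.choice(len(outcomes), p=[prob for _, _, prob in outcomes])
    next_state, reward, _ = outcomes[idx]
    return next_state, reward

# One short trajectory S_0, A_0, R_1, S_1, A_1, R_2, ...
state = "high"
for action in ["search", "search", "wait"]:
    nxt, r = step(state, action)
    print(state, action, r, nxt)
    state = nxt
```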
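As a second sketch, the analytical route for a Markov Reward Process solves the Bellman equation $v = R + \gamma P v$ directly as $v = (I - \gamma P)^{-1} R$; solving this linear system is what makes the approach cubic in the number of states. The transition matrix and rewards below are made up for illustration:

```python
import numpy as np

# Hypothetical 3-state Markov Reward Process: transition matrix P and
# expected immediate reward R for each state.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],   # absorbing terminal state
])
R = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# Bellman equation v = R + gamma * P v  =>  v = (I - gamma * P)^{-1} R.
# Solving the linear system is O(n^3) in the number of states n.
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```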
Markov Decision Process Control
- In a discounted MDP, there always exists a deterministic optimal policy, though it need not be unique. In addition to the policies defined here, the following properties are observed:
- The policy is deterministic.
- The policy is stationary (i.e., it does not depend on the time step; the same policy applies at every step).
- In an MDP, the actions that appear best after a one-step search with respect to the optimal value function are indeed optimal.
- This follows immediately from the fact that the process is memoryless: the optimal value of the next state already accounts for everything that follows.
- The implication is that there is no need to plan more than one step ahead in a Markovian process: acting greedily works, assuming we can compute the optimal value function (see the sketch below).
- The caveat is that computing the optimal value function exactly is rarely feasible in practice due to the computational cost.
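A minimal sketch of the one-step greedy claim, under assumed toy dynamics (every state, action, reward, and number below is an illustrative assumption): value iteration approximates the optimal state values, and a greedy one-step lookahead with respect to those values already yields an optimal policy, with no deeper planning:

```python
import numpy as np

# Toy MDP: p[s][a] = list of (next_state, reward, probability) triples.
n_states, n_actions, gamma = 3, 2, 0.9
p = {
    0: {0: [(0, 0.0, 1.0)],                1: [(1, 1.0, 1.0)]},
    1: {0: [(0, 0.0, 0.5), (2, 2.0, 0.5)], 1: [(2, 2.0, 1.0)]},
    2: {0: [(2, 0.0, 1.0)],                1: [(2, 0.0, 1.0)]},  # absorbing
}

def lookahead(v, s, a):
    """Expected one-step return of taking action a in state s, then using v."""
    return sum(prob * (r + gamma * v[s2]) for s2, r, prob in p[s][a])

# Value iteration: repeatedly back up the best one-step lookahead.
v = np.zeros(n_states)
for _ in range(200):
    v = np.array([max(lookahead(v, s, a) for a in range(n_actions))
                  for s in range(n_states)])

# A greedy one-step search against the (approximate) optimal values
# picks optimal actions directly; no multi-step planning is required.
policy = {s: max(range(n_actions), key=lambda a: lookahead(v, s, a))
          for s in range(n_states)}
print(np.round(v, 3), policy)
```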
Links
- The Setting for Reinforcement Learning - for an overview of the relationship between Markov Processes and Reinforcement Learning.
- Characterizing the Decision Problem - MDPs are a special case of the Decision Problem.