The Markov Property

  • The Markov Property describes the evolution of a process in which the future state depends only on the current state and not on the past history.
  • For decision processes, the current state must be informative enough to act on; it should summarize everything about the history that matters for predicting the future.
  • Markov Processes can be represented using a transition graph,¹ where each node is a state and each edge is weighted by the probability of transitioning between the two states it connects (see the sketch below).
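
A minimal sketch of this representation in Python, assuming NumPy and an entirely made-up three-state weather chain: the transition graph is stored as a row-stochastic matrix, and the next state is sampled from the current state's row alone.

```python
import numpy as np

# Hypothetical three-state Markov chain (made-up numbers).
# P[i, j] is the probability of moving from state i to state j,
# so each row of P sums to 1.
states = ["sunny", "cloudy", "rainy"]
P = np.array([
    [0.8, 0.15, 0.05],
    [0.3, 0.4,  0.3 ],
    [0.2, 0.3,  0.5 ],
])

rng = np.random.default_rng(seed=0)
state = 0  # start in "sunny"
trajectory = [states[state]]
for _ in range(10):
    # Markov property: the next state depends only on the current state,
    # i.e. only on row P[state]; no earlier history is consulted.
    state = rng.choice(len(states), p=P[state])
    trajectory.append(states[state])

print(" -> ".join(trajectory))
```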

Markov Decision Processes

  • Caveat: Not all processes need to be Markovian to be analyzed effectively.

  • Markov Decision Processes are described using actions, rewards, states and transitions.

    • An agent and environment interact at discrete time steps $t = 0, 1, 2, \dots$
    • Each time step $t$ is associated with an environment state $S_t \in \mathcal{S}$.
    • An agent performs an action $A_t \in \mathcal{A}(S_t)$ in response to state $S_t$.
    • A reward $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$ is given for $A_t$.
    • The sequence $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$ is called the trajectory.
  • An MDP is finite if the sets of states, actions, and rewards are all finite.

  • The dynamics of the MDP are given by a probability distribution over the next state and reward, conditioned on the current state and action, defined as

    $$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$

    In an MDP, this probability completely characterizes the dynamics of the environment (see the sketch below).
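
As a concrete illustration (a sketch only, using a hypothetical two-state, two-action MDP with made-up numbers), the dynamics $p(s', r \mid s, a)$ can be stored as a table keyed by (state, action); marginalizing over rewards recovers the state-transition probabilities $p(s' \mid s, a)$, and taking an expectation gives the expected reward $r(s, a)$.

```python
from collections import defaultdict

# Hypothetical MDP dynamics: dynamics[(s, a)] is a list of
# (next_state, reward, probability) triples representing p(s', r | s, a).
dynamics = {
    ("s0", "left"):  [("s0", 0.0, 0.9), ("s1", 1.0, 0.1)],
    ("s0", "right"): [("s1", 1.0, 1.0)],
    ("s1", "left"):  [("s0", 0.0, 1.0)],
    ("s1", "right"): [("s1", 2.0, 0.5), ("s0", 0.0, 0.5)],
}

def state_transition_prob(s, a):
    """p(s' | s, a) = sum over r of p(s', r | s, a)."""
    probs = defaultdict(float)
    for s_next, _reward, prob in dynamics[(s, a)]:
        probs[s_next] += prob
    return dict(probs)

def expected_reward(s, a):
    """r(s, a) = sum over s', r of r * p(s', r | s, a)."""
    return sum(reward * prob for _s_next, reward, prob in dynamics[(s, a)])

print(state_transition_prob("s0", "left"))  # {'s0': 0.9, 's1': 0.1}
print(expected_reward("s1", "right"))       # 1.0
```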

  • MDPs are essential in the field of Reinforcement Learning (see here); almost all Reinforcement Learning tasks can be framed as Markov Decision Processes.

  • Given a policy, the value function of an MDP can be computed or approximated in a variety of ways.

  • A Markov Reward Process is a Markov Decision Process without actions (an exact value-function computation for one is sketched below).
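
To make the value-function bullet above concrete: for a small Markov Reward Process, the state-value function satisfies the linear Bellman equation $v = r + \gamma P v$, which can be solved exactly. The sketch below assumes NumPy and a made-up three-state MRP; it is only one of the "variety of ways" mentioned above.

```python
import numpy as np

# Hypothetical three-state Markov Reward Process (made-up numbers):
# P is the state-transition matrix, r the expected immediate reward
# per state, and gamma the discount factor.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.0, 1.0],  # absorbing terminal state
])
r = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# Bellman equation for an MRP: v = r + gamma * P v,
# rearranged into the linear system (I - gamma * P) v = r.
v = np.linalg.solve(np.eye(len(r)) - gamma * P, r)
print(v)  # exact discounted state values
```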

Markov Decision Process Control

  • In an MDP, the existence of a discounted value function implies the existence of a deterministic optimal policy; this policy need not be unique. In addition to the policies defined here, the following properties are observed:
    • The policy is deterministic.
    • The policy is stationary (i.e., it does not depend on the time step and can be applied at any point in the process).
  • In an MDP, the actions that appear best after a one-step lookahead with respect to the optimal value function are indeed optimal.
    • This follows immediately from the fact that the process is memoryless.
    • The implication is that there is no need to plan more than one step ahead in a Markovian process: greedy action selection works, provided the optimal value function can be found (see the sketch after this list).
    • In practice, however, computing the optimal value function exactly is rarely feasible due to complexity and performance costs.
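
A sketch of that greedy property on a made-up two-state, two-action MDP (assuming NumPy; none of these numbers come from the note): value iteration approximates the optimal value function, after which a single one-step lookahead per state is enough to read off a deterministic, stationary optimal policy.

```python
import numpy as np

# Hypothetical two-state, two-action MDP (made-up numbers).
# p[a][s, s2] is the probability of moving from s to s2 under action a;
# r[a][s] is the expected immediate reward for taking action a in state s.
p = {
    0: np.array([[0.9, 0.1], [1.0, 0.0]]),
    1: np.array([[0.0, 1.0], [0.5, 0.5]]),
}
r = {0: np.array([0.0, 0.0]), 1: np.array([1.0, 2.0])}
gamma = 0.9
n_states = 2

# Value iteration: repeatedly apply the Bellman optimality backup
# until the value estimates stop changing.
v = np.zeros(n_states)
for _ in range(10_000):
    q = np.array([r[a] + gamma * p[a] @ v for a in sorted(p)])  # q[a, s]
    v_new = q.max(axis=0)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

# Greedy one-step lookahead with respect to the (near-)optimal values:
# picking argmax_a q(s, a) in each state yields an optimal policy.
policy = q.argmax(axis=0)
print("v*:", v_new, "greedy policy (action index per state):", policy)
```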

Links

Footnotes

  1. See more here.