The Markov Property

  • The Markov Property describes the evolution of a process in which the future state depends only on the current state and not on the past history.
  • For decision processes, the current state must be informative enough to act on; it should summarize everything about the history that matters for predicting the future.
  • Markov Processes can be represented using a transition graph,¹ where each node is a state and each edge is weighted by the probability of transitioning between the two states it connects (see the sketch below).
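
A minimal sketch of this representation in Python, assuming NumPy and an entirely made-up three-state weather chain: the transition graph is stored as a row-stochastic matrix, and the next state is sampled from the current state's row alone.

```python
import numpy as np

# Hypothetical three-state Markov chain (made-up numbers).
# P[i, j] is the probability of moving from state i to state j,
# so each row of P sums to 1.
states = ["sunny", "cloudy", "rainy"]
P = np.array([
    [0.8, 0.15, 0.05],
    [0.3, 0.4,  0.3 ],
    [0.2, 0.3,  0.5 ],
])

rng = np.random.default_rng(seed=0)
state = 0  # start in "sunny"
trajectory = [states[state]]
for _ in range(10):
    # Markov property: the next state depends only on the current state,
    # i.e. only on row P[state]; no earlier history is consulted.
    state = rng.choice(len(states), p=P[state])
    trajectory.append(states[state])

print(" -> ".join(trajectory))
```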

Markov Decision Processes

  • Caveat: Not all processes need to be Markovian to be analyzed effectively.

  • Markov Decision Processes are described using actions, rewards, states and transitions.

    • An agent and environment interact at discrete time steps $t = 0, 1, 2, \dots$
    • Each time step $t$ is associated with an environment state $S_t \in \mathcal{S}$.
    • An agent performs an action $A_t \in \mathcal{A}(S_t)$ in response to state $S_t$.
    • A reward $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$ is given for $A_t$.
    • The sequence $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$ is called the trajectory.
  • An MDP is finite if the sets of states, actions, and rewards are all finite.

  • The dynamics of the MDP are given by a probability distribution over the next state and reward, conditioned on the current state and action, defined as

    $$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$

    In an MDP, this probability completely characterizes the dynamics of the environment (see the sketch below).
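
As a concrete illustration (a sketch only, using a hypothetical two-state, two-action MDP with made-up numbers), the dynamics $p(s', r \mid s, a)$ can be stored as a table keyed by (state, action); marginalizing over rewards recovers the state-transition probabilities $p(s' \mid s, a)$, and taking an expectation gives the expected reward $r(s, a)$.

```python
from collections import defaultdict

# Hypothetical MDP dynamics: dynamics[(s, a)] is a list of
# (next_state, reward, probability) triples representing p(s', r | s, a).
dynamics = {
    ("s0", "left"):  [("s0", 0.0, 0.9), ("s1", 1.0, 0.1)],
    ("s0", "right"): [("s1", 1.0, 1.0)],
    ("s1", "left"):  [("s0", 0.0, 1.0)],
    ("s1", "right"): [("s1", 2.0, 0.5), ("s0", 0.0, 0.5)],
}

def state_transition_prob(s, a):
    """p(s' | s, a) = sum over r of p(s', r | s, a)."""
    probs = defaultdict(float)
    for s_next, _reward, prob in dynamics[(s, a)]:
        probs[s_next] += prob
    return dict(probs)

def expected_reward(s, a):
    """r(s, a) = sum over s', r of r * p(s', r | s, a)."""
    return sum(reward * prob for _s_next, reward, prob in dynamics[(s, a)])

print(state_transition_prob("s0", "left"))  # {'s0': 0.9, 's1': 0.1}
print(expected_reward("s1", "right"))       # 1.0
```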

  • MDPs are essential in the field of Reinforcement Learning (see here); almost all Reinforcement Learning tasks can be framed as Markov Decision Processes.

  • Given a policy, the value function of an MDP can be computed or approximated in a variety of ways.

  • A Markov Reward Process is a Markov Decision Process without actions (an exact value-function computation for one is sketched below).
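
To make the value-function bullet above concrete: for a small Markov Reward Process, the state-value function satisfies the linear Bellman equation $v = r + \gamma P v$, which can be solved exactly. The sketch below assumes NumPy and a made-up three-state MRP; it is only one of the "variety of ways" mentioned above.

```python
import numpy as np

# Hypothetical three-state Markov Reward Process (made-up numbers):
# P is the state-transition matrix, r the expected immediate reward
# per state, and gamma the discount factor.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.0, 1.0],  # absorbing terminal state
])
r = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# Bellman equation for an MRP: v = r + gamma * P v,
# rearranged into the linear system (I - gamma * P) v = r.
v = np.linalg.solve(np.eye(len(r)) - gamma * P, r)
print(v)  # exact discounted state values
```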

Markov Decision Process Control

  • In an MDP, the existence of a discounted value function implies the existence of a deterministic optimal policy; this policy need not be unique. In addition to the policies defined here, the following properties are observed:
    • The policy is deterministic.
    • The policy is stationary (i.e., it does not depend on the time step and can be applied at any point in the process).
  • In an MDP, the actions that appear best after a one-step lookahead with respect to the optimal value function are indeed optimal.
    • This follows immediately from the fact that the process is memoryless.
    • The implication is that there is no need to plan more than one step ahead in a Markovian process: greedy action selection works, provided the optimal value function can be found (see the sketch after this list).
    • In practice, however, computing the optimal value function exactly is rarely feasible due to complexity and performance costs.
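
A sketch of that greedy property on a made-up two-state, two-action MDP (assuming NumPy; none of these numbers come from the note): value iteration approximates the optimal value function, after which a single one-step lookahead per state is enough to read off a deterministic, stationary optimal policy.

```python
import numpy as np

# Hypothetical two-state, two-action MDP (made-up numbers).
# p[a][s, s2] is the probability of moving from s to s2 under action a;
# r[a][s] is the expected immediate reward for taking action a in state s.
p = {
    0: np.array([[0.9, 0.1], [1.0, 0.0]]),
    1: np.array([[0.0, 1.0], [0.5, 0.5]]),
}
r = {0: np.array([0.0, 0.0]), 1: np.array([1.0, 2.0])}
gamma = 0.9
n_states = 2

# Value iteration: repeatedly apply the Bellman optimality backup
# until the value estimates stop changing.
v = np.zeros(n_states)
for _ in range(10_000):
    q = np.array([r[a] + gamma * p[a] @ v for a in sorted(p)])  # q[a, s]
    v_new = q.max(axis=0)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

# Greedy one-step lookahead with respect to the (near-)optimal values:
# picking argmax_a q(s, a) in each state yields an optimal policy.
policy = q.argmax(axis=0)
print("v*:", v_new, "greedy policy (action index per state):", policy)
```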

Links

Footnotes

  1. See more here.