Problem Statements

  • We formulate the MARL problem using an extension of MDPs 1.
  • A stochastic game is defined by the following elements:

    - $n$ is the number of agents.
    - $S$ is the set of environmental states shared by all agents; $\bar{S} \subset S$ denotes the terminal states of the environment.
    - $A^i$ is the set of actions of the $i$-th agent. The set of all action configurations (joint actions) is denoted $A = A^1 \times \dots \times A^n$.
    - $R^i : S \times A \times S \to \mathbb{R}$ is the reward function of the $i$-th agent at each time step.
    - $T(s' \mid s, a)$ gives the transition probability from state $s$ to state $s'$ given that joint action $a$ was taken.
    - $\mu$ gives the initial state distribution, from which the initial state is sampled.

    As a convention, we use the superscript $i$ for the $i$-th agent and $-i$ for all other agents.
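
To make the tuple concrete, here is a minimal Python sketch of a stochastic game as a sampling interface. The class and method names are illustrative (not from the text), and the transition, reward, and initial-distribution objects are assumed to be supplied by the user.

```python
# Minimal sketch of the stochastic-game tuple above as a Python interface.
from dataclasses import dataclass
from typing import Callable, Sequence

import numpy as np


@dataclass
class StochasticGame:
    n_agents: int                          # n
    n_states: int                          # |S| (states indexed 0..n_states-1)
    terminal_states: set                   # subset of S
    action_sets: Sequence[Sequence[int]]   # A^i for each agent i
    reward_fns: Sequence[Callable]         # R^i(s, joint_a, s_next) for each i
    transition: Callable                   # T(s_next, s, joint_a) -> probability
    initial_dist: np.ndarray               # mu, a distribution over states

    def reset(self, rng: np.random.Generator) -> int:
        # Sample the initial state s_0 ~ mu.
        return int(rng.choice(self.n_states, p=self.initial_dist))

    def step(self, s: int, joint_a: tuple, rng: np.random.Generator):
        # Sample s' ~ T(. | s, joint_a), then give each agent its reward.
        probs = np.array([self.transition(s2, s, joint_a) for s2 in range(self.n_states)])
        s_next = int(rng.choice(self.n_states, p=probs))
        rewards = [R(s, joint_a, s_next) for R in self.reward_fns]
        done = s_next in self.terminal_states
        return s_next, rewards, done
```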

  • In this context, all agents make decisions simultaneously (each time step is a static, simultaneous-move game). Each agent aims to find a policy that maximizes its expected return. The environment, in turn, can change in response to the agents’ actions.

  • The joint action is an element of $A = A^1 \times \dots \times A^n$. That is, it is of the form $a = (a^1, \dots, a^n)$.

  • The policy of agent $i$ is conditioned on its history 2, that is, $\pi^i(a^i_t \mid h_t)$, where the history $h_t = (s_0, a_0, s_1, a_1, \dots, s_t)$ contains the states and the joint actions.

  • As with MDPs, we assume the Markov property for $T$: the next state depends only on the current state and the current joint action.

  • A more general form introduces the notion of partial observability. A partially observable stochastic game (POSG), the multi-agent analogue of a POMDP, is one where each agent receives only incomplete information about the environment. More formally, it adds, for each agent $i$ in a stochastic game,

    - a finite set of observations $O^i$
    - an observation function $\mathcal{O}^i(o^i_t \mid a_{t-1}, s_t)$ that specifies probabilities over the agent’s possible observations, conditioned on the previous joint action and the current state.

    • The probability of a joint observation $o_t = (o^1_t, \dots, o^n_t)$ is calculated as the product of the individual observation probabilities, $\mathcal{O}(o_t \mid a_{t-1}, s_t) = \prod_{i=1}^{n} \mathcal{O}^i(o^i_t \mid a_{t-1}, s_t)$.

    • A special case where all agents have the same reward function is called a Decentralized POMDP (Dec-POMDP)

    • Observability conditions introduce special cases such as

      • The agents do not observe the actions of the other agents.
      • The agents may observe only a subset of the state and joint action due to having a limited view of the environment (i.e. each observation reveals only part of $s_t$ and part of $a_{t-1}$).
      • A noisy environment: for a given state and joint action, multiple possible observations occur with non-zero probability.
    • This necessitates the use of beliefs: a belief state $b^i_t$ encodes a probability distribution over the states the environment may be in at time $t$. It is updated via Bayes’ rule to form a posterior distribution through belief-state filtering (a sketch follows after this list).

      • Note that updating the belief state is a complex task, especially in multi-agent environments.
      • This is mitigated by not assuming that agents have exact knowledge of the elements of the POSG; instead, the belief update can be approximated, e.g. with recurrent neural networks (RNNs).
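
A minimal sketch of one belief-filtering step over a finite state space, assuming the transition and observation models are known; the function name and argument layout are my own.

```python
import numpy as np

def belief_update(belief, joint_action, observation, transition, obs_fn):
    """One step of belief-state filtering over a finite state space.

    belief:      array of shape (n_states,), the prior P(s_t = s)
    transition:  transition(s_next, s, joint_action) -> P(s_next | s, a)
    obs_fn:      obs_fn(observation, joint_action, s_next) -> P(o | a, s_next)
    Returns the Bayes-rule posterior over the next state.
    """
    n_states = len(belief)
    new_belief = np.zeros(n_states)
    for s_next in range(n_states):
        # Predict: push the belief through the transition model.
        predicted = sum(transition(s_next, s, joint_action) * belief[s]
                        for s in range(n_states))
        # Correct: weight by the likelihood of the observation received.
        new_belief[s_next] = obs_fn(observation, joint_action, s_next) * predicted
    return new_belief / new_belief.sum()
```
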
  • The full history up to time $t$ is denoted $\hat{h}_t$. It interleaves the states, joint observations, and joint actions: $\hat{h}_t = (s_0, o_0, a_0, s_1, o_1, a_1, \dots, s_t, o_t)$.

  • We may introduce communication as an action that is observed by the agents but does not affect the environment. In such a case, each agent’s action space is the product of an environment action space and a communication action space, $A^i = A^i_{\text{env}} \times A^i_{\text{comm}}$.

    The actions in $A^i_{\text{comm}}$ do not affect the environment: the state transition function is independent of the communication actions (see the sketch below).

    • A standard assumption: agents do not know the meaning of the actions in $A^i$, including the communication actions.
    • Thus, agents must learn how to interpret the meaning of actions.
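
A small illustrative sketch of a combined action space; the action names and the toy dynamics are placeholders, not from the text.

```python
from itertools import product

# Illustrative: an agent's action is a pair (environment action, message).
ENV_ACTIONS = ("up", "down", "left", "right")   # affect the environment
COMM_ACTIONS = ("msg_0", "msg_1")               # observed by others only
COMBINED_ACTIONS = tuple(product(ENV_ACTIONS, COMM_ACTIONS))

def env_transition_inputs(joint_action):
    # The dynamics only read the environment components, so the state
    # transition is independent of the communication components.
    return tuple(a_env for (a_env, _msg) in joint_action)
```
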
  • In MARL, we assume that agents do not know any reward functions, but instead they only experience the immediate effects of their own actions.

  • Open Multi-agent environments - environments where agents may dynamically enter and leave the environment.

  • A no-conflict game is a game where all agents agree on the most-preferred outcome. More formally, every agent has the same set of outcomes that maximize its expected return.

    • Agents operating in these games tend to be risk-averse, especially when multiple equilibria exist.
  • A common reward game is a game where all agents share the same reward function. That is, $R^1 = R^2 = \dots = R^n$.

Solution Concepts

  • A solution concept defines the notion of “optimal” agent behavior.

  • We first define the notion of return. The following two definitions are equivalent (a sketch of both appears after this list):

    • The history-based expected return of agent $i$ under joint policy $\pi$ is the probability-weighted average of agent $i$’s return over all possible histories.

    • The recursive expected return is analogous to a Bellman backup: the expected return of a history is defined recursively under the transition function $T$ and the joint policy $\pi$, where the history is extended by concatenating the new joint action and state.
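
A sketch of the two definitions in the notation used above; the discount factor $\gamma$, the per-history return $u^i$, and the concatenation notation $\langle \cdot \rangle$ are my assumptions.

```latex
% History-based expected return: average the return over full histories
% generated by the initial distribution, the dynamics, and the joint policy.
U^i(\pi) \;=\; \mathbb{E}_{\hat{h}_T \sim (\mu,\, T,\, \pi)}\!\big[\, u^i(\hat{h}_T) \,\big],
\qquad
u^i(\hat{h}_T) \;=\; \textstyle\sum_{t=0}^{T-1} \gamma^{t}\, r^i_t

% Recursive (Bellman-style) form: expected immediate reward plus the
% discounted expected return of the extended history.
U^i(\hat{h}_t \mid \pi) \;=\; \sum_{a \in A} \pi(a \mid \hat{h}_t)
  \sum_{s' \in S} T(s' \mid s_t, a)
  \Big[ R^i(s_t, a, s') \;+\; \gamma\, U^i\!\big(\langle \hat{h}_t, a, s' \rangle \mid \pi\big) \Big]
```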

  • Note: computing the solution concepts defined below is computationally hard (NP-hard in the general case). The computation of a Nash equilibrium, for example, is PPAD-complete.

Equilibria

  • The best-response policy of agent $i$ to the other agents’ policies $\pi^{-i}$ is the policy that maximizes agent $i$’s expected return: $\mathrm{BR}^i(\pi^{-i}) = \arg\max_{\pi^i} U^i(\pi^i, \pi^{-i})$.

  • In a zero-sum game with two agents, a joint policy $\pi = (\pi^1, \pi^2)$ is a minimax solution if $U^1(\pi) = \max_{\pi^1} \min_{\pi^2} U^1(\pi^1, \pi^2) = \min_{\pi^2} \max_{\pi^1} U^1(\pi^1, \pi^2) = -U^2(\pi)$.

    • Here the agent’s policy is optimized against an opponent that seeks to minimize the agent’s return

    • In the general case, the minimax solution of the (matrix) stage game can be computed with linear programming, as sketched below.
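
A sketch of the linear program for the row player of a two-player zero-sum normal-form (matrix) game, using scipy's linprog; the function name and the matching-pennies example are mine.

```python
import numpy as np
from scipy.optimize import linprog

def minimax_policy(R):
    """Maximin mixed strategy for the row player of a zero-sum matrix game.

    R[j, k] is the row player's reward when it plays action j and the
    column player plays action k. Returns (policy, game_value).
    """
    m, n = R.shape
    # Variables z = (x_1, ..., x_m, v); maximize v  ->  minimize -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # For every column action k:  v - sum_j R[j, k] * x_j <= 0.
    A_ub = np.hstack([-R.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # The probabilities x_j sum to one.
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

# Example: matching pennies has value 0 and a uniform minimax policy.
policy, value = minimax_policy(np.array([[1.0, -1.0], [-1.0, 1.0]]))
```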

  • In a general-sum game with $n$ agents, a joint policy $\pi$ is in Nash equilibrium if no agent can improve its expected return by unilaterally deviating: $U^i(\pi^i, \pi^{-i}) \geq U^i(\hat{\pi}^i, \pi^{-i})$ for every agent $i$ and every alternative policy $\hat{\pi}^i$.

    • Here the agent plays a best response to the policies of the other agents
    • (Fink, 1964) There is always a Nash equilibrium in stationary strategies 3
  • A Nash equilibrium may be relaxed to an $\epsilon$-Nash equilibrium for $\epsilon > 0$, where no agent can improve its expected return by more than $\epsilon$ by unilaterally deviating: $U^i(\hat{\pi}^i, \pi^{-i}) - U^i(\pi) \leq \epsilon$ for every agent $i$ and every alternative policy $\hat{\pi}^i$.

    • Every Nash equilibrium is surrounded by a region of $\epsilon$-Nash equilibria.
    • The return of an $\epsilon$-Nash equilibrium need not be close to the return of an exact Nash equilibrium; large discontinuities in the returns can lead to such a scenario.
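
A numerical check of the (ε-)Nash condition for a normal-form game; a sketch assuming the reward functions are given as tensors and the policies as mixed-strategy vectors, with illustrative names.

```python
import numpy as np

def best_response_gains(reward_tensors, strategies):
    """How much each agent could gain by unilaterally deviating.

    reward_tensors: list of n arrays, R^i with shape (|A^1|, ..., |A^n|)
    strategies:     list of n probability vectors (one mixed strategy per agent)
    The profile is an eps-Nash equilibrium for eps = max(gains), and a
    Nash equilibrium if that maximum is (approximately) zero.
    """
    n = len(strategies)
    gains = []
    for i, R in enumerate(reward_tensors):
        expected = R
        # Average out every other agent's mixed strategy (contract from the
        # last axis backwards so earlier axis indices stay valid).
        for j in range(n - 1, -1, -1):
            if j != i:
                expected = np.tensordot(expected, strategies[j], axes=([j], [0]))
        current = float(np.dot(strategies[i], expected))  # U^i under the profile
        best = float(np.max(expected))                    # best pure deviation
        gains.append(best - current)
    return gains

# Example: uniform strategies in matching pennies give zero gains (a Nash equilibrium).
R1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
gains = best_response_gains([R1, -R1], [np.array([0.5, 0.5])] * 2)
```
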
  • In a general-sum normal-form game with $n$ agents, let $\pi_c$ be a joint policy which assigns probabilities $\pi_c(a)$ to joint actions $a \in A$. Then $\pi_c$ is a correlated equilibrium if, for every agent $i$ and every action modifier $\xi^i : A^i \to A^i$, deviating via $\xi^i$ does not increase agent $i$’s expected reward: $\sum_{a \in A} \pi_c(a)\, R^i(\langle \xi^i(a^i), a^{-i} \rangle) \leq \sum_{a \in A} \pi_c(a)\, R^i(a)$.

    • The action modifier $\xi^i$ represents any possible deviation from the action recommended to agent $i$ by the joint policy.
    • This policy generalizes Nash equilibria to allow for policies where actions can be correlated.
    • Each agent plays a best response and has no incentive to deviate from its recommended action $a^i$: following $a^i$ is a best response to the conditional distribution over the actions recommended to the other agents by $\pi_c$ given $a^i$.
    • A more general case is the coarse correlated equilibrium, which only requires the inequality to hold for unconditional action modifiers, i.e. modifiers that map every recommendation to the same constant action.
      • Agents must decide, before seeing the recommended action, whether to follow the joint policy, assuming the other agents will follow it.
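
A sketch of checking the correlated-equilibrium incentive constraints for a normal-form game with a tabular joint policy; checking single-action deviations covers all action modifiers. The function name and data layout are assumptions.

```python
import numpy as np
from itertools import product

def is_correlated_equilibrium(reward_tensors, pi_c, tol=1e-9):
    """Check the correlated-equilibrium incentive constraints.

    reward_tensors: list of n arrays R^i with shape (|A^1|, ..., |A^n|)
    pi_c:           joint distribution over joint actions, same shape
    For every agent i and every pair of actions (a_i, a_dev), deviating from
    the recommendation a_i to a_dev must not raise the expected reward.
    """
    for i, R in enumerate(reward_tensors):
        n_actions = pi_c.shape[i]
        for a_i, a_dev in product(range(n_actions), repeat=2):
            if a_i == a_dev:
                continue
            gain = 0.0
            # Sum over all joint actions whose i-th component is a_i.
            for joint in product(*(range(k) for k in pi_c.shape)):
                if joint[i] != a_i:
                    continue
                deviated = joint[:i] + (a_dev,) + joint[i + 1:]
                gain += pi_c[joint] * (R[deviated] - R[joint])
            if gain > tol:
                return False
    return True
```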

Other Solution Concepts

  • Equilibria, as solution concepts, are limited because

    • They may be suboptimal: an equilibrium need not maximize expected returns.
    • They are not unique, and different equilibria can entail different returns.
    • They may be incomplete in that they do not specify behaviors for off-equilibrium paths.
  • A joint policy $\pi$ is Pareto dominated by another joint policy $\pi'$ if $U^i(\pi') \geq U^i(\pi)$ for every agent $i$, with strict inequality for at least one agent.

    A joint policy that is not Pareto dominated is Pareto optimal. The returns of a Pareto optimal policy are also referred to as Pareto optimal.

    • No agent can be made better off without making at least one other agent worse off.
    • A joint policy can be Pareto optimal without being an equilibrium solution (and an equilibrium need not be Pareto optimal).
  • The social welfare of a joint policy $\pi$ is defined as the sum of the agents’ expected returns, $W(\pi) = \sum_{i} U^i(\pi)$.

    A joint policy $\pi$ is welfare-optimal if $W(\pi) \geq W(\pi')$ for every joint policy $\pi'$.

  • The social fairness of a joint policy $\pi$ is defined as the product of the agents’ expected returns, $F(\pi) = \prod_{i} U^i(\pi)$.

    A joint policy $\pi$ is fairness-optimal if $F(\pi) \geq F(\pi')$ for every joint policy $\pi'$.

  • Among the policies that achieve equal welfare, the policy with the greatest fairness is the one that gives equal expected returns to all agents.

  • Welfare optimality implies Pareto optimality.
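
A small sketch of these quantities computed from vectors of per-agent expected returns; the function names are mine, and the fairness computation assumes positive returns.

```python
import numpy as np

def social_welfare(returns):
    # Sum of the agents' expected returns U^i(pi).
    return float(np.sum(returns))

def social_fairness(returns):
    # Product of the agents' expected returns (assumes positive returns).
    return float(np.prod(returns))

def pareto_dominates(returns_a, returns_b):
    # Policy a Pareto-dominates policy b: no agent is worse off and at
    # least one agent is strictly better off.
    a, b = np.asarray(returns_a), np.asarray(returns_b)
    return bool(np.all(a >= b) and np.any(a > b))

# Example: among equal-welfare profiles, equal returns maximize fairness.
assert social_fairness([2.0, 2.0]) > social_fairness([3.0, 1.0])  # both have welfare 4
```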

No-Regret

  • The regret of an agent is the difference between the rewards the agent received and the rewards it could have received from following a different action or policy.

    It may be defined in terms of rewards and actions, comparing the rewards the agent received against those of the best fixed action in hindsight, or in terms of the policies chosen across episodes, comparing against the best fixed policy in hindsight (see the sketch at the end of this section).

  • An agent has no-regret if, in the limit of infinitely many episodes, its average regret is at most $0$. More formally, $\limsup_{E \to \infty} \frac{1}{E}\,\mathrm{Regret}_E \leq 0$.

  • An agent has $\epsilon$-no-regret if the above holds with $\epsilon$ in place of $0$, for $\epsilon > 0$.

  • Regret assumes that the actions or policies of the other agents remain fixed in each episode.

  • When the policies of more than one agent change, minimizing regret does not imply maximizing return.
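
A sketch of the action-based (external) regret referenced above; the function name and argument layout are mine, and, per the assumption noted above, the other agents' actions are treated as fixed per time step.

```python
def external_regret(reward_fn, own_actions, other_actions, action_set):
    """Regret against the best fixed action in hindsight.

    reward_fn:     reward_fn(a_i, a_others) -> reward for this agent
    own_actions:   list of the agent's actions, one per time step
    other_actions: list of the other agents' joint actions per time step,
                   assumed unchanged by this agent's choice
    """
    received = sum(reward_fn(a, others)
                   for a, others in zip(own_actions, other_actions))
    best_fixed = max(sum(reward_fn(a, others) for others in other_actions)
                     for a in action_set)
    return best_fixed - received
```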

Miscellaneous

  • We can extend concepts from single-agent RL, in particular the value functions and related quantities.

  • The state value of agent $i$ under joint policy $\pi$ is defined as $V^i_\pi(s) = \mathbb{E}_{\pi}\big[\sum_{t \geq 0} \gamma^{t} r^i_t \mid s_0 = s\big]$

    which, when unrolled one step, resembles the Bellman equation.

  • The $Q$-value, or state-action value, of a subset of agents $\mathcal{I}$ is defined as $Q^{\mathcal{I}}_\pi(s, a^{\mathcal{I}}) = \mathbb{E}_{a^{-\mathcal{I}} \sim \pi^{-\mathcal{I}}}\big[\, Q_\pi(s, a^{\mathcal{I}}, a^{-\mathcal{I}}) \,\big]$

    That is, we average out the actions of the agents not in the set $\mathcal{I}$ under their policies.

  • The multi-agent advantage function of a subset of agents $\mathcal{I}_1$ with respect to a disjoint subset $\mathcal{I}_2$ is defined as

    $A^{\mathcal{I}_1}_\pi(s, a^{\mathcal{I}_2}, a^{\mathcal{I}_1}) = Q^{\mathcal{I}_1 \cup \mathcal{I}_2}_\pi(s, a^{\mathcal{I}_1}, a^{\mathcal{I}_2}) - Q^{\mathcal{I}_2}_\pi(s, a^{\mathcal{I}_2})$

    If $\mathcal{I}_2 = \emptyset$, the above reduces to $A^{\mathcal{I}_1}_\pi(s, a^{\mathcal{I}_1}) = Q^{\mathcal{I}_1}_\pi(s, a^{\mathcal{I}_1}) - V_\pi(s)$.
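
A tabular sketch of the subset Q-value and the multi-agent advantage under the notation above; the data layout (a joint Q-table and per-agent policy tables) and the function names are assumptions.

```python
import numpy as np
from itertools import product

def subset_q_value(q_joint, policies, fixed_actions, s):
    """Q-value of a subset of agents in state s.

    q_joint:       array with shape (n_states, |A^1|, ..., |A^n|), joint Q-values
    policies:      list of arrays, policies[i][s, a_i] = pi^i(a_i | s)
    fixed_actions: dict {agent_index: action} for the agents in the subset
    The actions of the remaining agents are averaged out under their policies.
    """
    n = len(policies)
    others = [i for i in range(n) if i not in fixed_actions]
    total = 0.0
    for combo in product(*(range(policies[i].shape[1]) for i in others)):
        joint = [0] * n
        weight = 1.0
        for i, a in zip(others, combo):
            joint[i] = a
            weight *= policies[i][s, a]
        for i, a in fixed_actions.items():
            joint[i] = a
        total += weight * q_joint[(s, *joint)]
    return total

def multi_agent_advantage(q_joint, policies, s, actions_i1, actions_i2):
    # A^{I1}_pi(s, a^{I2}, a^{I1}) = Q^{I1 u I2}(s, .) - Q^{I2}(s, a^{I2}).
    return (subset_q_value(q_joint, policies, {**actions_i2, **actions_i1}, s)
            - subset_q_value(q_joint, policies, actions_i2, s))
```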

Links

Footnotes

  1. Yang and Wang (2021). An Overview of Multi-agent Reinforcement Learning from Game Theoretical Perspective

  2. Information sets in Game Theory terms.

  3. Fink (1964). Equilibrium in a Stochastic n-person Game