Problem Statements

  • We formulate the MARL problem using an extension of MDPs 1.
  • A stochastic game is defined by the following elements:

    - $n$ is the number of agents.
    - $S$ is the set of environmental states shared by all agents; $\bar{S} \subset S$ denotes the terminal states of the environment.
    - $A^i$ is the set of actions of the $i$-th agent. The set of all action configurations (joint actions) is denoted $A = A^1 \times \dots \times A^n$.
    - $R^i : S \times A \times S \to \mathbb{R}$ is the reward function of the $i$-th agent at each time step.
    - $T(s' \mid s, a)$ gives the transition probability from state $s$ to state $s'$ given that joint action $a$ was taken.
    - $\mu$ gives the initial state distribution, from which the initial state is sampled.

    As a convention, we use the superscript $i$ for the $i$-th agent and $-i$ for all other agents.
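
To make the tuple concrete, here is a minimal Python sketch of a stochastic game as a sampling interface. The class and method names are illustrative (not from the text), and the transition, reward, and initial-distribution objects are assumed to be supplied by the user.

```python
# Minimal sketch of the stochastic-game tuple above as a Python interface.
from dataclasses import dataclass
from typing import Callable, Sequence

import numpy as np


@dataclass
class StochasticGame:
    n_agents: int                          # n
    n_states: int                          # |S| (states indexed 0..n_states-1)
    terminal_states: set                   # subset of S
    action_sets: Sequence[Sequence[int]]   # A^i for each agent i
    reward_fns: Sequence[Callable]         # R^i(s, joint_a, s_next) for each i
    transition: Callable                   # T(s_next, s, joint_a) -> probability
    initial_dist: np.ndarray               # mu, a distribution over states

    def reset(self, rng: np.random.Generator) -> int:
        # Sample the initial state s_0 ~ mu.
        return int(rng.choice(self.n_states, p=self.initial_dist))

    def step(self, s: int, joint_a: tuple, rng: np.random.Generator):
        # Sample s' ~ T(. | s, joint_a), then give each agent its reward.
        probs = np.array([self.transition(s2, s, joint_a) for s2 in range(self.n_states)])
        s_next = int(rng.choice(self.n_states, p=probs))
        rewards = [R(s, joint_a, s_next) for R in self.reward_fns]
        done = s_next in self.terminal_states
        return s_next, rewards, done
```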

  • In this context, all agents make decisions simultaneously (each time step is a static, simultaneous-move game). Each agent aims to find a policy that maximizes its expected return. The environment, in turn, can change in response to the agents’ actions.

  • The joint action is an element of $A = A^1 \times \dots \times A^n$. That is, it is of the form $a = (a^1, \dots, a^n)$.

  • The policy of agent $i$ is conditioned on its history 2, that is, $\pi^i(a^i_t \mid h_t)$, where the history $h_t = (s_0, a_0, s_1, a_1, \dots, s_t)$ contains the states and the joint actions.

  • As with MDPs, we assume the Markov property for $T$: the next state depends only on the current state and the current joint action.

  • A more general form introduces the notion of partial observability. A partially observable stochastic game (POSG), the multi-agent analogue of a POMDP, is one where each agent receives only incomplete information about the environment. More formally, it adds, for each agent $i$ in a stochastic game,

    - a finite set of observations $O^i$
    - an observation function $\mathcal{O}^i(o^i_t \mid a_{t-1}, s_t)$ that specifies probabilities over the agent’s possible observations, conditioned on the previous joint action and the current state.

    • The probability of a joint observation $o_t = (o^1_t, \dots, o^n_t)$ is calculated as the product of the individual observation probabilities, $\mathcal{O}(o_t \mid a_{t-1}, s_t) = \prod_{i=1}^{n} \mathcal{O}^i(o^i_t \mid a_{t-1}, s_t)$.

    • A special case where all agents have the same reward function is called a Decentralized POMDP (Dec-POMDP)

    • Observability conditions introduce special cases such as

      • The agents do not observe the actions of the other agents.
      • The agents may observe only a subset of the state and joint action due to having a limited view of the environment (i.e. each observation reveals only part of $s_t$ and part of $a_{t-1}$).
      • A noisy environment: for a given state and joint action, multiple possible observations occur with non-zero probability.
    • This necessitates the use of beliefs: a belief state $b^i_t$ encodes a probability distribution over the states the environment may be in at time $t$. It is updated via Bayes’ rule to form a posterior distribution through belief-state filtering (a sketch follows after this list).

      • Note that updating the belief state is a complex task, especially in multi-agent environments.
      • This is mitigated by not assuming that agents have exact knowledge of the elements of the POSG; instead, the belief update can be approximated, e.g. with recurrent neural networks (RNNs).
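
A minimal sketch of one belief-filtering step over a finite state space, assuming the transition and observation models are known; the function name and argument layout are my own.

```python
import numpy as np

def belief_update(belief, joint_action, observation, transition, obs_fn):
    """One step of belief-state filtering over a finite state space.

    belief:      array of shape (n_states,), the prior P(s_t = s)
    transition:  transition(s_next, s, joint_action) -> P(s_next | s, a)
    obs_fn:      obs_fn(observation, joint_action, s_next) -> P(o | a, s_next)
    Returns the Bayes-rule posterior over the next state.
    """
    n_states = len(belief)
    new_belief = np.zeros(n_states)
    for s_next in range(n_states):
        # Predict: push the belief through the transition model.
        predicted = sum(transition(s_next, s, joint_action) * belief[s]
                        for s in range(n_states))
        # Correct: weight by the likelihood of the observation received.
        new_belief[s_next] = obs_fn(observation, joint_action, s_next) * predicted
    return new_belief / new_belief.sum()
```
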
  • The full history up to time $t$ is denoted $\hat{h}_t$. It interleaves the states, joint observations, and joint actions: $\hat{h}_t = (s_0, o_0, a_0, s_1, o_1, a_1, \dots, s_t, o_t)$.

  • We may introduce communication as an action that is observed by the agents but does not affect the environment. In such a case, each agent’s action space is the product of an environment action space and a communication action space, $A^i = A^i_{\text{env}} \times A^i_{\text{comm}}$.

    The actions in $A^i_{\text{comm}}$ do not affect the environment: the state transition function is independent of the communication actions (see the sketch below).

    • A standard assumption: agents do not know the meaning of the actions in $A^i$, including the communication actions.
    • Thus, agents must learn how to interpret the meaning of actions.
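
A small illustrative sketch of a combined action space; the action names and the toy dynamics are placeholders, not from the text.

```python
from itertools import product

# Illustrative: an agent's action is a pair (environment action, message).
ENV_ACTIONS = ("up", "down", "left", "right")   # affect the environment
COMM_ACTIONS = ("msg_0", "msg_1")               # observed by others only
COMBINED_ACTIONS = tuple(product(ENV_ACTIONS, COMM_ACTIONS))

def env_transition_inputs(joint_action):
    # The dynamics only read the environment components, so the state
    # transition is independent of the communication components.
    return tuple(a_env for (a_env, _msg) in joint_action)
```
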
  • In MARL, we assume that agents do not know any reward functions, but instead they only experience the immediate effects of their own actions.

  • Open Multi-agent environments - environments where agents may dynamically enter and leave the environment.

  • A no-conflict game is a game where all agents agree on the most-preferred outcome. More formally, every agent has the same set of outcomes that maximize its expected return.

    • Agents operating in these games tend to be risk-averse, especially when multiple equilibria exist.
  • A common reward game is a game where all agents share the same reward function. That is, $R^1 = R^2 = \dots = R^n$.

Solution Concepts

  • A solution concept defines the notion of “optimal” agent behavior.

  • We first define the notion of return. The following two definitions are equivalent (a sketch of both appears after this list):

    • The history-based expected return of agent $i$ under joint policy $\pi$ is the probability-weighted average of agent $i$’s return over all possible histories.

    • The recursive expected return is analogous to a Bellman backup: the expected return of a history is defined recursively under the transition function $T$ and the joint policy $\pi$, where the history is extended by concatenating the new joint action and state.
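
A sketch of the two definitions in the notation used above; the discount factor $\gamma$, the per-history return $u^i$, and the concatenation notation $\langle \cdot \rangle$ are my assumptions.

```latex
% History-based expected return: average the return over full histories
% generated by the initial distribution, the dynamics, and the joint policy.
U^i(\pi) \;=\; \mathbb{E}_{\hat{h}_T \sim (\mu,\, T,\, \pi)}\!\big[\, u^i(\hat{h}_T) \,\big],
\qquad
u^i(\hat{h}_T) \;=\; \textstyle\sum_{t=0}^{T-1} \gamma^{t}\, r^i_t

% Recursive (Bellman-style) form: expected immediate reward plus the
% discounted expected return of the extended history.
U^i(\hat{h}_t \mid \pi) \;=\; \sum_{a \in A} \pi(a \mid \hat{h}_t)
  \sum_{s' \in S} T(s' \mid s_t, a)
  \Big[ R^i(s_t, a, s') \;+\; \gamma\, U^i\!\big(\langle \hat{h}_t, a, s' \rangle \mid \pi\big) \Big]
```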

  • Note: computing the solution concepts defined below is computationally hard (NP-hard in the general case). The computation of a Nash equilibrium, for example, is PPAD-complete.

Equilibria

  • The best-response policy of agent $i$ to the other agents’ policies $\pi^{-i}$ is the policy that maximizes agent $i$’s expected return: $\mathrm{BR}^i(\pi^{-i}) = \arg\max_{\pi^i} U^i(\pi^i, \pi^{-i})$.

  • In a zero-sum game with two agents, a joint policy $\pi = (\pi^1, \pi^2)$ is a minimax solution if $U^1(\pi) = \max_{\pi^1} \min_{\pi^2} U^1(\pi^1, \pi^2) = \min_{\pi^2} \max_{\pi^1} U^1(\pi^1, \pi^2) = -U^2(\pi)$.

    • Here the agent’s policy is optimized against an opponent that seeks to minimize the agent’s return

    • In the general case, the minimax solution of the (matrix) stage game can be computed with linear programming, as sketched below.
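
A sketch of the linear program for the row player of a two-player zero-sum normal-form (matrix) game, using scipy's linprog; the function name and the matching-pennies example are mine.

```python
import numpy as np
from scipy.optimize import linprog

def minimax_policy(R):
    """Maximin mixed strategy for the row player of a zero-sum matrix game.

    R[j, k] is the row player's reward when it plays action j and the
    column player plays action k. Returns (policy, game_value).
    """
    m, n = R.shape
    # Variables z = (x_1, ..., x_m, v); maximize v  ->  minimize -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # For every column action k:  v - sum_j R[j, k] * x_j <= 0.
    A_ub = np.hstack([-R.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # The probabilities x_j sum to one.
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

# Example: matching pennies has value 0 and a uniform minimax policy.
policy, value = minimax_policy(np.array([[1.0, -1.0], [-1.0, 1.0]]))
```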

  • In a general-sum game with $n$ agents, a joint policy $\pi$ is in Nash equilibrium if no agent can improve its expected return by unilaterally deviating: $U^i(\pi^i, \pi^{-i}) \geq U^i(\hat{\pi}^i, \pi^{-i})$ for every agent $i$ and every alternative policy $\hat{\pi}^i$.

    • Here the agent plays a best response to the policies of the other agents
    • (Fink, 1964) There is always a Nash equilibrium in stationary strategies 3
  • A Nash equilibrium may be relaxed to an $\epsilon$-Nash equilibrium for $\epsilon > 0$, where no agent can improve its expected return by more than $\epsilon$ by unilaterally deviating: $U^i(\hat{\pi}^i, \pi^{-i}) - U^i(\pi) \leq \epsilon$ for every agent $i$ and every alternative policy $\hat{\pi}^i$.

    • Every Nash equilibrium is surrounded by a region of $\epsilon$-Nash equilibria.
    • The return of an $\epsilon$-Nash equilibrium need not be close to the return of an exact Nash equilibrium; large discontinuities in the returns can lead to such a scenario.
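
A numerical check of the (ε-)Nash condition for a normal-form game; a sketch assuming the reward functions are given as tensors and the policies as mixed-strategy vectors, with illustrative names.

```python
import numpy as np

def best_response_gains(reward_tensors, strategies):
    """How much each agent could gain by unilaterally deviating.

    reward_tensors: list of n arrays, R^i with shape (|A^1|, ..., |A^n|)
    strategies:     list of n probability vectors (one mixed strategy per agent)
    The profile is an eps-Nash equilibrium for eps = max(gains), and a
    Nash equilibrium if that maximum is (approximately) zero.
    """
    n = len(strategies)
    gains = []
    for i, R in enumerate(reward_tensors):
        expected = R
        # Average out every other agent's mixed strategy (contract from the
        # last axis backwards so earlier axis indices stay valid).
        for j in range(n - 1, -1, -1):
            if j != i:
                expected = np.tensordot(expected, strategies[j], axes=([j], [0]))
        current = float(np.dot(strategies[i], expected))  # U^i under the profile
        best = float(np.max(expected))                    # best pure deviation
        gains.append(best - current)
    return gains

# Example: uniform strategies in matching pennies give zero gains (a Nash equilibrium).
R1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
gains = best_response_gains([R1, -R1], [np.array([0.5, 0.5])] * 2)
```
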
  • In a general-sum normal-form game with $n$ agents, let $\pi_c$ be a joint policy which assigns probabilities $\pi_c(a)$ to joint actions $a \in A$. Then $\pi_c$ is a correlated equilibrium if, for every agent $i$ and every action modifier $\xi^i : A^i \to A^i$, deviating via $\xi^i$ does not increase agent $i$’s expected reward: $\sum_{a \in A} \pi_c(a)\, R^i(\langle \xi^i(a^i), a^{-i} \rangle) \leq \sum_{a \in A} \pi_c(a)\, R^i(a)$.

    • The action modifier $\xi^i$ represents any possible deviation from the action recommended to agent $i$ by the joint policy.
    • This policy generalizes Nash equilibria to allow for policies where actions can be correlated.
    • Each agent plays a best response and has no incentive to deviate from its recommended action $a^i$: following $a^i$ is a best response to the conditional distribution over the actions recommended to the other agents by $\pi_c$ given $a^i$.
    • A more general case is the coarse correlated equilibrium, which only requires the inequality to hold for unconditional action modifiers, i.e. modifiers that map every recommendation to the same constant action.
      • Agents must decide, before seeing the recommended action, whether to follow the joint policy, assuming the other agents will follow it.
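
A sketch of checking the correlated-equilibrium incentive constraints for a normal-form game with a tabular joint policy; checking single-action deviations covers all action modifiers. The function name and data layout are assumptions.

```python
import numpy as np
from itertools import product

def is_correlated_equilibrium(reward_tensors, pi_c, tol=1e-9):
    """Check the correlated-equilibrium incentive constraints.

    reward_tensors: list of n arrays R^i with shape (|A^1|, ..., |A^n|)
    pi_c:           joint distribution over joint actions, same shape
    For every agent i and every pair of actions (a_i, a_dev), deviating from
    the recommendation a_i to a_dev must not raise the expected reward.
    """
    for i, R in enumerate(reward_tensors):
        n_actions = pi_c.shape[i]
        for a_i, a_dev in product(range(n_actions), repeat=2):
            if a_i == a_dev:
                continue
            gain = 0.0
            # Sum over all joint actions whose i-th component is a_i.
            for joint in product(*(range(k) for k in pi_c.shape)):
                if joint[i] != a_i:
                    continue
                deviated = joint[:i] + (a_dev,) + joint[i + 1:]
                gain += pi_c[joint] * (R[deviated] - R[joint])
            if gain > tol:
                return False
    return True
```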

Other Solution Concepts

  • Equilibria, as solution concepts, are limited because

    • They may be suboptimal: an equilibrium need not maximize expected returns.
    • They are not unique, and different equilibria can entail different returns.
    • They may be incomplete in that they do not specify behaviors for off-equilibrium paths.
  • A joint policy $\pi$ is Pareto dominated by another joint policy $\pi'$ if $U^i(\pi') \geq U^i(\pi)$ for every agent $i$, with strict inequality for at least one agent.

    A joint policy that is not Pareto dominated is Pareto optimal. The returns of a Pareto optimal policy are also referred to as Pareto optimal.

    • No agent can be made better off without making at least one other agent worse off.
    • A joint policy can be Pareto optimal without being an equilibrium solution (and an equilibrium need not be Pareto optimal).
  • The social welfare of a joint policy $\pi$ is defined as the sum of the agents’ expected returns, $W(\pi) = \sum_{i} U^i(\pi)$.

    A joint policy $\pi$ is welfare-optimal if $W(\pi) \geq W(\pi')$ for every joint policy $\pi'$.

  • The social fairness of a joint policy $\pi$ is defined as the product of the agents’ expected returns, $F(\pi) = \prod_{i} U^i(\pi)$.

    A joint policy $\pi$ is fairness-optimal if $F(\pi) \geq F(\pi')$ for every joint policy $\pi'$.

  • Among the policies that achieve equal welfare, the policy with the greatest fairness is the one that gives equal expected returns to all agents.

  • Welfare optimality implies Pareto optimality.
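
A small sketch of these quantities computed from vectors of per-agent expected returns; the function names are mine, and the fairness computation assumes positive returns.

```python
import numpy as np

def social_welfare(returns):
    # Sum of the agents' expected returns U^i(pi).
    return float(np.sum(returns))

def social_fairness(returns):
    # Product of the agents' expected returns (assumes positive returns).
    return float(np.prod(returns))

def pareto_dominates(returns_a, returns_b):
    # Policy a Pareto-dominates policy b: no agent is worse off and at
    # least one agent is strictly better off.
    a, b = np.asarray(returns_a), np.asarray(returns_b)
    return bool(np.all(a >= b) and np.any(a > b))

# Example: among equal-welfare profiles, equal returns maximize fairness.
assert social_fairness([2.0, 2.0]) > social_fairness([3.0, 1.0])  # both have welfare 4
```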

No-Regret

  • The regret of an agent is the difference between the rewards the agent received and the rewards it could have received from following a different action or policy.

    It may be defined in terms of rewards and actions, comparing the rewards the agent received against those of the best fixed action in hindsight, or in terms of the policies chosen across episodes, comparing against the best fixed policy in hindsight (see the sketch at the end of this section).

  • An agent has no-regret if, in the limit of infinitely many episodes, its average regret is at most $0$. More formally, $\limsup_{E \to \infty} \frac{1}{E}\,\mathrm{Regret}_E \leq 0$.

  • An agent has $\epsilon$-no-regret if the above holds with $\epsilon$ in place of $0$, for $\epsilon > 0$.

  • Regret assumes that the actions or policies of the other agents remain fixed in each episode.

  • When the policies of more than one agent change, minimizing regret does not imply maximizing return.
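
A sketch of the action-based (external) regret referenced above; the function name and argument layout are mine, and, per the assumption noted above, the other agents' actions are treated as fixed per time step.

```python
def external_regret(reward_fn, own_actions, other_actions, action_set):
    """Regret against the best fixed action in hindsight.

    reward_fn:     reward_fn(a_i, a_others) -> reward for this agent
    own_actions:   list of the agent's actions, one per time step
    other_actions: list of the other agents' joint actions per time step,
                   assumed unchanged by this agent's choice
    """
    received = sum(reward_fn(a, others)
                   for a, others in zip(own_actions, other_actions))
    best_fixed = max(sum(reward_fn(a, others) for others in other_actions)
                     for a in action_set)
    return best_fixed - received
```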

Miscellaneous

  • We can extend concepts from single-agent RL, in particular the value functions and related quantities.

  • The state value of agent $i$ under joint policy $\pi$ is defined as $V^i_\pi(s) = \mathbb{E}_{\pi}\big[\sum_{t \geq 0} \gamma^{t} r^i_t \mid s_0 = s\big]$

    which, when unrolled one step, resembles the Bellman equation.

  • The $Q$-value, or state-action value, of a subset of agents $\mathcal{I}$ is defined as $Q^{\mathcal{I}}_\pi(s, a^{\mathcal{I}}) = \mathbb{E}_{a^{-\mathcal{I}} \sim \pi^{-\mathcal{I}}}\big[\, Q_\pi(s, a^{\mathcal{I}}, a^{-\mathcal{I}}) \,\big]$

    That is, we average out the actions of the agents not in the set $\mathcal{I}$ under their policies.

  • The multi-agent advantage function of a subset of agents $\mathcal{I}_1$ with respect to a disjoint subset $\mathcal{I}_2$ is defined as

    $A^{\mathcal{I}_1}_\pi(s, a^{\mathcal{I}_2}, a^{\mathcal{I}_1}) = Q^{\mathcal{I}_1 \cup \mathcal{I}_2}_\pi(s, a^{\mathcal{I}_1}, a^{\mathcal{I}_2}) - Q^{\mathcal{I}_2}_\pi(s, a^{\mathcal{I}_2})$

    If $\mathcal{I}_2 = \emptyset$, the above reduces to $A^{\mathcal{I}_1}_\pi(s, a^{\mathcal{I}_1}) = Q^{\mathcal{I}_1}_\pi(s, a^{\mathcal{I}_1}) - V_\pi(s)$.
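
A tabular sketch of the subset Q-value and the multi-agent advantage under the notation above; the data layout (a joint Q-table and per-agent policy tables) and the function names are assumptions.

```python
import numpy as np
from itertools import product

def subset_q_value(q_joint, policies, fixed_actions, s):
    """Q-value of a subset of agents in state s.

    q_joint:       array with shape (n_states, |A^1|, ..., |A^n|), joint Q-values
    policies:      list of arrays, policies[i][s, a_i] = pi^i(a_i | s)
    fixed_actions: dict {agent_index: action} for the agents in the subset
    The actions of the remaining agents are averaged out under their policies.
    """
    n = len(policies)
    others = [i for i in range(n) if i not in fixed_actions]
    total = 0.0
    for combo in product(*(range(policies[i].shape[1]) for i in others)):
        joint = [0] * n
        weight = 1.0
        for i, a in zip(others, combo):
            joint[i] = a
            weight *= policies[i][s, a]
        for i, a in fixed_actions.items():
            joint[i] = a
        total += weight * q_joint[(s, *joint)]
    return total

def multi_agent_advantage(q_joint, policies, s, actions_i1, actions_i2):
    # A^{I1}_pi(s, a^{I2}, a^{I1}) = Q^{I1 u I2}(s, .) - Q^{I2}(s, a^{I2}).
    return (subset_q_value(q_joint, policies, {**actions_i2, **actions_i1}, s)
            - subset_q_value(q_joint, policies, actions_i2, s))
```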

Links

Footnotes

  1. Yang and Wang (2021). An Overview of Multi-agent Reinforcement Learning from Game Theoretical Perspective

  2. Information sets in Game Theory terms.

  3. Fink (1964). Equilibrium in a Stochastic n-person Game