Convergence and Learning Performance

  • For the following, let $\pi$ be a joint policy (with $\pi^z$ the joint policy after $z$ learning episodes) and let $\pi^*$ be the solution. The following define convergence criteria.

  • Note: Convergence is hard to check in practice.

  • Infinite Data. Convergence is defined in the limit of infinite training data, i.e., as the number of learning episodes $z \to \infty$.

  • Expected Return. The expected return of the learned joint policy converges to that of the solution: $\lim_{z \to \infty} U_i(\pi^z) = U_i(\pi^*)$ for every agent $i$.

  • Empirical Distribution. The empirical distribution $\hat{\pi}^z$ of the joint actions actually taken at each history converges to the solution: $\lim_{z \to \infty} \hat{\pi}^z(a \mid h) = \pi^*(a \mid h)$.

  • Averaged Joint Policy. This is equivalent to the empirical-distribution criterion except that we use the averaged joint policy $\bar{\pi}^z$,

    where $\bar{\pi}^z$ is computed from the sequence of joint actions sampled from the joint policy conditioned on the history $h$.

  • Convergence of empirical distribution to a set of solutions: $\lim_{z \to \infty} \min_{\pi^* \in \Pi^*} d\big(\hat{\pi}^z, \pi^*\big) = 0$ for a set of solutions $\Pi^*$.

    Let $d$ be some distance metric (such as those used in a Loss Function).

  • Average return: the average return converges to the solution's expected return, $\lim_{z \to \infty} \bar{U}_i^z = U_i(\pi^*)$ for every agent $i$,

    where $\bar{U}_i^z$ denotes agent $i$'s average return across episodes $1, \dots, z$. (A monitoring sketch follows this list.)
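
  • The following is a minimal sketch of how the empirical-distribution and average-return criteria might be monitored in practice; the episode sampler, the solution policy, and the use of total-variation distance for $d$ are illustrative assumptions, not definitions from these notes.

    ```python
    import numpy as np

    def tv_distance(p, q):
        """Total-variation distance, one possible choice for the metric d."""
        return 0.5 * np.abs(p - q).sum()

    def monitor_convergence(sample_episode, solution_policy, n_joint_actions, episodes=10_000):
        """Track two convergence signals over a number of episodes.

        `sample_episode()` is an assumed stand-in returning a list of
        (history, joint_action_index) pairs plus the episode return;
        `solution_policy(history)` returns the solution's joint-action distribution.
        """
        counts = {}      # history -> joint-action counts (empirical distribution)
        returns = []     # per-episode returns (average-return criterion)
        for _ in range(episodes):
            trajectory, episode_return = sample_episode()
            returns.append(episode_return)
            for history, joint_action in trajectory:
                counts.setdefault(history, np.zeros(n_joint_actions))[joint_action] += 1
        # Empirical-distribution criterion: d(pi_hat(.|h), pi*(.|h)) should shrink toward 0.
        distances = [tv_distance(c / c.sum(), solution_policy(h)) for h, c in counts.items()]
        # Average-return criterion: the mean return should approach the solution's expected return.
        return float(np.mean(distances)), float(np.mean(returns))
    ```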

Extensions

  • MARL can be cooperative, where agents work together; competitive, where the game is zero-sum; or a mix of the two.

  • MARL environments can be homogeneous, in which case all agents share the same action space, observation space, and reward function.

    • An environment has weakly homogeneous agents if, for any joint policy $\pi = (\pi_1, \dots, \pi_n)$ and any permutation $\sigma$ between agents, it follows that $U_i(\langle \pi_{\sigma(1)}, \dots, \pi_{\sigma(n)} \rangle) = U_{\sigma(i)}(\pi)$, i.e., permuting the policies among the agents permutes their expected returns accordingly (an illustrative check follows this list).

    • An environment has strongly homogeneous agents if it has weakly homogeneous agents and the optimal joint policy consists of identical policies, that is, $\pi^* = (\pi_1, \dots, \pi_n)$ with $\pi_i = \pi_j$ for all agents $i$ and $j$.
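
    • As a small illustration of the weak-homogeneity condition (the check referenced above), the sketch below permutes policies among agents and compares expected returns; the `expected_returns(joint_policy)` evaluator is a hypothetical stand-in (e.g., Monte Carlo rollout estimates), not something defined in these notes.

      ```python
      import itertools

      import numpy as np

      def is_weakly_homogeneous(expected_returns, joint_policy, atol=1e-6):
          """Check U_i(pi_sigma) == U_{sigma(i)}(pi) for every permutation sigma.

          `expected_returns(joint_policy)` is an assumed evaluator that returns
          one (estimated) expected return per agent.
          """
          base = np.asarray(expected_returns(joint_policy))
          n = len(joint_policy)
          for sigma in itertools.permutations(range(n)):
              permuted_policy = [joint_policy[sigma[i]] for i in range(n)]  # agent i plays pi_{sigma(i)}
              permuted_returns = np.asarray(expected_returns(permuted_policy))
              # Agent i's return under the permuted joint policy must equal
              # agent sigma(i)'s return under the original joint policy.
              if not np.allclose(permuted_returns, base[list(sigma)], atol=atol):
                  return False
          return True
      ```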

Challenges

  • Most algorithms for single-agent RL cannot be directly extended to MARL, which has intrinsic non-stationarity: the environment changes with the agents' actions, and the agents themselves keep changing as they learn.

  • The environment needs to be accurately modeled.

  • Agents' goals may differ significantly from one another, which makes policy sharing more difficult.

  • When the environment has multiple equilibria, equilibrium selection is needed: we must establish which equilibrium the agents should agree on, and why.

    • One approach is to further constrain the solution space with additional criteria such as Pareto optimality (see the sketch after this list).
    • Another is to exploit the structure of the game.
    • Agent modeling can also be used, where agents predict the actions of other agents and act accordingly.
    • Communication can also be used, but this requires more consideration, as it makes the problem more complex.
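
    • As a sketch of Pareto-based equilibrium selection, the snippet below finds the pure Nash equilibria of a Stag Hunt matrix game (standard textbook payoffs, used here only for illustration) and keeps the Pareto-optimal one.

      ```python
      import numpy as np

      # Stag Hunt: action 0 = stag, action 1 = hare, for both players.
      R1 = np.array([[4, 0],
                     [3, 3]])   # row player's payoffs
      R2 = np.array([[4, 3],
                     [0, 3]])   # column player's payoffs

      def pure_nash(R1, R2):
          """Joint actions from which no player can profitably deviate."""
          eqs = []
          for a1 in range(R1.shape[0]):
              for a2 in range(R1.shape[1]):
                  best1 = R1[a1, a2] >= R1[:, a2].max()
                  best2 = R2[a1, a2] >= R2[a1, :].max()
                  if best1 and best2:
                      eqs.append((a1, a2))
          return eqs

      def pareto_optimal(eqs, R1, R2):
          """Keep equilibria whose payoff pair is not weakly dominated by another's."""
          payoffs = {e: (R1[e], R2[e]) for e in eqs}
          return [e for e, p in payoffs.items()
                  if not any(all(q[i] >= p[i] for i in range(2)) and q != p
                             for q in payoffs.values())]

      eqs = pure_nash(R1, R2)             # [(0, 0), (1, 1)]: (stag, stag) and (hare, hare)
      print(pareto_optimal(eqs, R1, R2))  # [(0, 0)]: Pareto optimality selects (stag, stag)
      ```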
  • Multi-Agent Credit Assignment - in MARL, there is the question of which agents' actions contributed to the received rewards.

    • This requires that agents understand the causal relation between the rewards and both their own actions and the other agents' actions.
    • One approach is to assign values to joint actions.
    • Difference Rewards - an approach where an agent considers what collective reward would have been received had it chosen a different (default) action while the other agents' actions stayed fixed; the difference isolates that agent's contribution (see the sketch after this list).
    • We may also learn decompositions of the collective reward.
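
    • A minimal sketch of the difference-rewards idea, assuming a hypothetical `team_reward(joint_action)` function and a per-agent default action; it is illustrative only, not an implementation from these notes.

      ```python
      def difference_reward(team_reward, joint_action, agent, default_action):
          """D_i = G(a) - G(a with agent i's action replaced by a default action)."""
          counterfactual = list(joint_action)
          counterfactual[agent] = default_action      # hold all other agents' actions fixed
          return team_reward(joint_action) - team_reward(tuple(counterfactual))

      # Toy team reward: 1 point for every agent that chooses action 1.
      team_reward = lambda actions: sum(actions)
      print(difference_reward(team_reward, (1, 0, 1), agent=0, default_action=0))  # 1: agent 0 contributed
      print(difference_reward(team_reward, (1, 0, 1), agent=1, default_action=0))  # 0: agent 1 did not
      ```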
  • Scalability. More agents means more work is needed.

    • More agents means a larger joint action space, which grows exponentially (e.g., $n$ agents with $|A|$ actions each give $|A|^n$ joint actions), and potentially a larger state space.
    • More agents also means more non-stationarity in the environment.
  • It is hard to define optimality for a MARL environment.

Human Interaction

  • Zhou, Liu, and Tang (2023) point out various challenges for MARL in the context of different fields. In general, these can be grouped as follows:

    • MARL cannot handle unforeseen circumstances, especially those not seen in training. This can be dangerous (e.g., in an autonomous-vehicle context).
    • Considering the human in the loop — relinquishing control to the human operator as needed or coordinating with them
    • Providing model interpretability and transparency
    • Model robustness. MARL can be sensitive to slight perturbations.
    • In low-data settings, MARL depends heavily on sample efficiency.
  • When a MARL system is required to interact with humans, additional challenges arise:

    • Additional Non-stationarity due to human intervention.
    • The diversity of human behavior with respect to culture and beliefs, and the shifts in behavior that arise from interacting with the MARL system. The MARL system needs to model this behavior to integrate with humans more effectively.
    • Complex Heterogeneity due to diverse human backgrounds, physical system specifications, and differences between the time scales of the simulation and reality.
    • Scalability, especially as the number of agents increases and human factors must be considered.
  • As Zheng et al. (2020) point out, AI and human behavior differ substantially, so extending AI to real-world settings requires accounting for some form of human “suboptimality”.

Safety

  • Zhou, Liu, and Tang (2023) define the following extension of the regular stochastic game:

    A safe MARL game is a stochastic game in which each agent additionally has a set of cost functions, where each cost function maps states and joint actions to a cost, together with corresponding cost-constraining values.

    The goal of the agents is still to maximize reward, subject to the constraint that each expected cumulative cost stays at or below its constraining value (a sketch of the constrained objective follows below).
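
    One common way to write this constrained objective, sketched here with assumed notation (cost functions $C_i^j$, cost-constraining values $c_i^j$, discounted sums) rather than quoted from the source:

    $$\max_{\pi_i} \; \mathbb{E}_{\pi}\!\left[\sum_{t \ge 0} \gamma^t R_i(s^t, a^t)\right] \quad \text{s.t.} \quad \mathbb{E}_{\pi}\!\left[\sum_{t \ge 0} \gamma^t C_i^j(s^t, a^t)\right] \le c_i^j \quad \text{for all } j$$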

  • Meeting the safety requirements should not lead to a degradation in model performance.

  • A state-adversarial stochastic game is a stochastic game with, for each agent $i$, an additional set called the uncertainty set of adversarial states of agent $i$, which constrains how that agent's observed state can be perturbed.

    Given a joint policy $\pi$ and the joint adversarial perturbation $\nu$ under which some agents are attacked, we have a Bellman equation in which each attacked agent's policy is conditioned on its perturbed state rather than the true state (a sketch follows below).

    Here robustness comes from mitigating perturbations of the observed states.
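
    A sketch of the general form such a Bellman equation takes, with assumed notation: each agent $i$ acts on a perturbed state $\nu_i(s)$ drawn from its uncertainty set $B_i(s)$, while the true dynamics and rewards still depend on the unperturbed state $s$, and the adversary typically chooses $\nu$ to minimize the attacked agents' values:

    $$V_i^{\pi,\nu}(s) = \sum_{a \in A} \pi\big(a \mid \nu(s)\big) \sum_{s' \in S} \mathcal{T}(s' \mid s, a)\Big[R_i(s, a, s') + \gamma\, V_i^{\pi,\nu}(s')\Big], \qquad \nu_i(s) \in B_i(s)$$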

  • A model-adversarial stochastic game is a stochastic game where we have uncertainty sets for the reward functions and for the transition probabilities.

    The Bellman equation takes the form of a robust, worst-case backup over these uncertainty sets (a sketch follows below).

    Here robustness comes from mitigating perturbations of the environment's model, i.e., the transition functions and rewards.
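
    A sketch of the corresponding robust, worst-case backup, again with assumed notation ($\mathcal{R}_i$ and $\mathcal{T}$ denote the reward and transition uncertainty sets):

    $$V_i^{\pi}(s) = \min_{\bar{R}_i \in \mathcal{R}_i,\; \bar{T} \in \mathcal{T}} \sum_{a \in A} \pi(a \mid s)\Big[\bar{R}_i(s, a) + \gamma \sum_{s' \in S} \bar{T}(s' \mid s, a)\, V_i^{\pi}(s')\Big]$$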

  • Data Poisoning - an attack where attackers modify the rewards in a dataset so that each agent is pushed toward a harmful target policy with as few modifications as possible. This applies to offline MARL.

  • Adversarial Attacks - involve manipulating agent observations to reduce the overall team reward in a cooperative setting.

  • Backdoor Attack - manipulates both the actions and the rewards of the poisoned agents, using three modules: trigger design, action poisoning, and reward hacking.

  • Robustness comes from improving the following aspects of training

    • State observations
    • Actions - considering how the actions of agents interact with each other.
    • Rewards and Models - estimating the uncertainty sets of a model-adversarial stochastic game.
    • Adversarial Policies - detecting Trojan agents and their triggers, including methods to un-learn detected backdoors.
    • Communication

Links

Footnotes

  1. Zhou, Liu, and Tang (2023). Multi-Agent Reinforcement Learning: Methods, Applications, Visionary Prospects, and Challenges.

  2. Zheng et al. (2020). The AI Economist: Improving Equality and Productivity with AI-Driven Tax Policies. Supplemental blog.