- In the heterogeneous setting, we no longer assume that agents necessarily have the same observation space, action space, or reward functions.
- The core issue with Heterogeneous MARL is that the usual approaches are designed for Homogeneous MARL and do not necessarily extend well to the heterogeneous setting.
- One approach is to allow for heterogeneity among agent policies. That is, $\pi_i$ and $\pi_j$ are different functions when $i \neq j$.
  - Caveat: For joint rewards, we have the problem of Credit Assignment, that is, how to determine an individual agent's contribution to the reward received in a state.
  - Caveat: The Miscoordination Problem. Even with proper credit assignment, policy updates for agent $i$ may interfere with updates for agent $j$, resulting in a poor joint policy overall.
- According to [^1], there are two classes of heterogeneous systems (which are not necessarily mutually exclusive):
  - Physical heterogeneity comes from differences among the agents in terms of hardware (sensors and actuators) or physical constraints. This implies different observation and action spaces.
    - Physical heterogeneity can be addressed through homogeneous solutions, but with added constraints on the agents.
  - Behavioral heterogeneity comes from components producing distinct policy outputs when observing the same input.
    - Same Objective Heterogeneity means that the agents share the same objective function but optimize it through heterogeneous behavior.
    - Different Objective Heterogeneity means that the agents optimize different objective functions.
- The inherent trade-off between homogeneity and heterogeneity is one of sample efficiency (homogeneous) versus resilience and performance (heterogeneous).
HARL
- [^Zhong_2023] proposes Heterogeneous-Agent Reinforcement Learning (HARL) algorithms for the cooperative setting, designed to coordinate agent updates. In particular, the key idea of their scheme is to perform sequential updates on each individual agent's policy rather than update the whole joint policy at once.
- (Zhong 4) Multi-Agent Advantage Decomposition. In any cooperative Markov game, given a joint policy $\boldsymbol{\pi}$, for any state $s$ and any ordered agent subset $i_{1:m}$, the multi-agent advantage decomposes as
  $$A_{\boldsymbol{\pi}}^{i_{1:m}}\left(s, \boldsymbol{a}^{i_{1:m}}\right) = \sum_{j=1}^{m} A_{\boldsymbol{\pi}}^{i_j}\left(s, \boldsymbol{a}^{i_{1:j-1}}, a^{i_j}\right).$$
  That is, a joint policy can be improved sequentially, one agent at a time.
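  For intuition, instantiating the lemma at $m = 2$ (my own unpacking, not an extra result from the paper): the joint advantage of two agents splits into agent 1's advantage plus agent 2's advantage conditioned on agent 1's chosen action.

  ```latex
  % Two-agent case of the advantage decomposition: the second term scores
  % agent 2's action *given* that agent 1 has already committed to a^{i_1}.
  A_{\boldsymbol{\pi}}^{i_{1:2}}\!\left(s, a^{i_1}, a^{i_2}\right)
    \;=\; A_{\boldsymbol{\pi}}^{i_1}\!\left(s, a^{i_1}\right)
    \;+\; A_{\boldsymbol{\pi}}^{i_2}\!\left(s, a^{i_1}, a^{i_2}\right)
  ```

  So if agent 1 picks a positive-advantage action and agent 2 then picks a positive-advantage action given agent 1's choice, the joint action has positive joint advantage, which is what makes sequential improvement possible.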
- (Zhong 6) Let $\boldsymbol{\pi}$ be a joint policy. For any joint policy $\bar{\boldsymbol{\pi}}$ we have
  $$J(\bar{\boldsymbol{\pi}}) \geq J(\boldsymbol{\pi}) + \sum_{m=1}^{n} \left[ L_{\boldsymbol{\pi}}^{i_{1:m}}\left(\bar{\boldsymbol{\pi}}^{i_{1:m-1}}, \bar{\pi}^{i_m}\right) - C\, D_{\mathrm{KL}}^{\max}\left(\pi^{i_m}, \bar{\pi}^{i_m}\right) \right],
  \quad \text{where } C = \frac{4\gamma \max_{s, \boldsymbol{a}} \left|A_{\boldsymbol{\pi}}(s, \boldsymbol{a})\right|}{(1-\gamma)^2}.$$
  We define $L_{\boldsymbol{\pi}}^{i_{1:m}}$ as follows. Let $\boldsymbol{\pi}$ be the joint policy, $\bar{\boldsymbol{\pi}}^{i_{1:m-1}}$ be some other joint policy of agents $i_{1:m-1}$, and $\hat{\pi}^{i_m}$ be some other policy of agent $i_m$. Then
  $$L_{\boldsymbol{\pi}}^{i_{1:m}}\left(\bar{\boldsymbol{\pi}}^{i_{1:m-1}}, \hat{\pi}^{i_m}\right) = \mathbb{E}_{s \sim \rho_{\boldsymbol{\pi}},\, \boldsymbol{a}^{i_{1:m-1}} \sim \bar{\boldsymbol{\pi}}^{i_{1:m-1}},\, a^{i_m} \sim \hat{\pi}^{i_m}}\left[ A_{\boldsymbol{\pi}}^{i_m}\left(s, \boldsymbol{a}^{i_{1:m-1}}, a^{i_m}\right) \right].$$
  The sequential update scheme is given below.
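  Written out, the idealized per-agent step of that scheme (a paraphrase under the notation above, not a verbatim quote of the paper's algorithm) has agent $i_m$ maximize its own term of the lower bound while the agents ordered before it are held at their already-updated policies.

  ```latex
  % Sequential step for agent i_m at iteration k+1: maximize the expected
  % multi-agent advantage given the agents updated earlier in the sequence,
  % penalized by the KL term from the lower bound.
  \pi^{i_m}_{k+1} \;=\; \arg\max_{\pi^{i_m}} \;
    L_{\boldsymbol{\pi}_k}^{i_{1:m}}\!\left(\boldsymbol{\pi}^{i_{1:m-1}}_{k+1},\, \pi^{i_m}\right)
    \;-\; C\, D_{\mathrm{KL}}^{\max}\!\left(\pi^{i_m}_k,\, \pi^{i_m}\right)
  ```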
- In performing the sequential update, we take into account the previous agents' updates.
- (Zhong 7) The Multi-Agent Policy Iteration scheme with Monotonic Improvement Guarantee monotonically improves the joint objective. In fact, (Zhong 8) the joint policy converges to a Nash Equilibrium.
  - The algorithm is not practical, however, since it (1) assumes access to the full state and action spaces and (2) requires computing the maximum KL Divergence term $D_{\mathrm{KL}}^{\max}$.
- The sequential HARL scheme can be made practical via TRPO- and PPO-style versions (HATRPO and HAPPO), as shown below.
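  A hedged sketch of the PPO-style sequential update in PyTorch (assuming precomputed joint advantages and behavior-policy log-probabilities; the `agents`, `obs`, `actions`, `adv`, `old_logp` containers and the per-agent `policy`/`optimizer` attributes are illustrative scaffolding, not the authors' implementation). The point to notice is the running factor `m`, which re-weights the advantage by the importance ratios of the agents that have already been updated in this round.

  ```python
  import torch

  def sequential_policy_update(agents, obs, actions, adv, old_logp,
                               clip_eps=0.2, epochs=5):
      """One round of HAPPO-style sequential policy updates.

      agents   : list of objects exposing .policy(obs) -> torch Distribution
                 and .optimizer (per-agent parameters, no sharing assumed)
      obs      : list of per-agent observation batches
      actions  : list of per-agent action batches from the behavior policy
      adv      : joint advantage estimates, shape [batch]
      old_logp : list of per-agent log-probs of `actions` under the behavior policy
      """
      # Correction factor carrying the importance ratios of agents already
      # updated in this round (how earlier updates are "taken into account").
      m = torch.ones_like(adv)

      # Update agents in a random order so no agent is always first.
      for i in torch.randperm(len(agents)).tolist():
          agent = agents[i]
          corrected_adv = (m * adv).detach()

          for _ in range(epochs):
              dist = agent.policy(obs[i])
              ratio = torch.exp(dist.log_prob(actions[i]) - old_logp[i])
              # Standard PPO clipped surrogate on the corrected advantage.
              surrogate = torch.min(
                  ratio * corrected_adv,
                  torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * corrected_adv,
              )
              loss = -surrogate.mean()
              agent.optimizer.zero_grad()
              loss.backward()
              agent.optimizer.step()

          # Fold this agent's new-vs-old ratio into the correction used by
          # the agents that come after it in the sequence.
          with torch.no_grad():
              new_logp = agent.policy(obs[i]).log_prob(actions[i])
              m = m * torch.exp(new_logp - old_logp[i])
  ```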
Parameter Sharing Methods
UAS
- [^2] introduces the use of a Unified Action Space (UAS), which consists of semantic representations of agent actions drawn from a latent space of all possible agent actions, particularly for the case when agents are physically heterogeneous (i.e., have different constraints or capabilities).
- Action masks are used to generate the agent policies.
- (Yu 1) There exists a fully cooperative, physically heterogeneous MARL problem such that the joint reward achievable with parameter sharing is suboptimal. More formally, let $R^{*}$ be the optimal joint reward, $R^{*}_{\text{ps}}$ the optimal joint reward under parameter sharing, $p(a)$ the probability of action $a$, and $n$ the number of agents; the result bounds $R^{*}_{\text{ps}}$ strictly below $R^{*}$ in terms of $p(a)$ and $n$.
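  A toy illustration of why sharing hurts (my own example, not the construction used in the paper): two agents with identical observations and no agent ID must pick different actions from $\{A, B\}$ to earn a reward of 1. Full parameter sharing forces both to play $A$ with the same probability $p$, so the best they can do is randomize.

  ```latex
  % Shared policy: both agents play A w.p. p, so a successful (different-action)
  % outcome happens with probability 2p(1-p), maximized at p = 1/2.
  \mathbb{E}\!\left[R_{\text{ps}}\right] \;=\; 2p(1-p) \;\le\; \tfrac{1}{2},
  \qquad \text{whereas heterogeneous policies achieve } R^{*} = 1.
  ```

  With $n$ agents that must spread over $n$ distinct actions, the best a shared policy with identical inputs can do drops to $n!/n^{n}$, so the gap grows with the number of agents, which is the spirit of the bound above.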
- Semantic representations come from dividing the action space based on semantics (i.e., what the actions do). For each agent $i$, the locally available action set $\mathcal{A}_i$ is divided into subsets with different action semantics. The Unified Action Space $\mathcal{U}$ is defined as the union of these semantic action subsets over all agents, so clearly $\mathcal{A}_i \subseteq \mathcal{U}$ for every agent. To identify which action semantics a given agent actually has, an Action Mask operator is applied: given an action mask $M_i$ over $\mathcal{U}$, agent $i$'s executable actions are the entries of $\mathcal{U}$ selected by $M_i$.
- By performing action masking, we can use global parameter sharing while maintaining action semantics: the shared policy operates on the Unified Action Space, and action masking recovers each agent's valid semantics.
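  A minimal sketch of the mechanism, assuming a discrete Unified Action Space of size `uas_dim` and per-agent 0/1 masks; the network shape and names (`SharedUASPolicy`, `uas_dim`) are illustrative, not the paper's architecture. All agents run the same weights, and the mask zeroes out the probability of actions an agent cannot physically execute.

  ```python
  import torch
  import torch.nn as nn

  class SharedUASPolicy(nn.Module):
      """One policy network shared by all agents, acting over the Unified Action Space."""

      def __init__(self, obs_dim: int, uas_dim: int, hidden: int = 64):
          super().__init__()
          self.net = nn.Sequential(
              nn.Linear(obs_dim, hidden), nn.ReLU(),
              nn.Linear(hidden, uas_dim),          # logits over the *unified* action space
          )

      def forward(self, obs: torch.Tensor, mask: torch.Tensor) -> torch.distributions.Categorical:
          logits = self.net(obs)
          # Action masking: actions outside this agent's semantic subset get -inf logits,
          # so the resulting categorical places zero probability on them.
          logits = logits.masked_fill(mask == 0, float("-inf"))
          return torch.distributions.Categorical(logits=logits)

  # Usage: two physically different agents share the same weights but see different masks.
  policy = SharedUASPolicy(obs_dim=8, uas_dim=6)
  obs = torch.randn(2, 8)
  masks = torch.tensor([[1, 1, 1, 0, 0, 0],    # agent 0: only the first 3 UAS actions
                        [1, 0, 0, 1, 1, 1]])   # agent 1: a different semantic subset
  actions = policy(obs, masks).sample()
  ```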
- In addition to UAS, we introduce the Cross Group Inverse (CGI) Loss to facilitate learning to predict the policies of other agents. It is calculated as follows. Let $M_{-i}$ be the action mask of the other, physically heterogeneous agents. The CGI loss is the mean squared error between agent $i$'s prediction of those agents' masked policy outputs and the actual outputs. If each agent has an associated independent network branch parameterized by $\phi_i$, the branch's input is $h_i$, the hidden state of an associated GRU that encodes the agent's trajectory, and the mean squared error is averaged over samples from the replay buffer. A similar construction is used for value-based algorithms.
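  A rough sketch of how such an auxiliary branch could be wired up, assuming the prediction target is the other group's masked policy output over the UAS; the module layout, names (`CGIHead`, `cgi_head`), and the exact target are assumptions for illustration rather than the paper's formulation.

  ```python
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class CGIHead(nn.Module):
      """Independent branch that predicts the other group's (masked) policy
      from this agent's GRU-encoded trajectory."""

      def __init__(self, hidden_dim: int, uas_dim: int):
          super().__init__()
          self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
          self.cgi_head = nn.Linear(hidden_dim, uas_dim)

      def cgi_loss(self, traj: torch.Tensor, target_policy: torch.Tensor,
                   other_mask: torch.Tensor) -> torch.Tensor:
          # traj: [batch, time, hidden_dim] trajectory features sampled from the replay buffer
          _, h = self.gru(traj)                      # h: [1, batch, hidden_dim]
          pred = self.cgi_head(h.squeeze(0))         # prediction over the unified action space
          # Mean squared error, restricted to the other group's valid actions.
          return F.mse_loss(pred * other_mask, target_policy * other_mask)
  ```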
- The overall pipeline is shown below.
- A version of MAPPO using the Unified Action Space is shown below. Its objective combines the usual MAPPO losses with the CGI loss, where each $\lambda$ is a hyperparameter that acts as a weighting coefficient.
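  Schematically (the particular split of terms and the $\lambda$ names are placeholders, not the paper's notation), the combined objective has the form:

  ```latex
  % UAS variant of MAPPO: usual actor-critic losses plus the CGI auxiliary
  % term, each scaled by its own hyperparameter coefficient.
  \mathcal{L}_{\text{total}}
    \;=\; \mathcal{L}_{\text{policy}}
    \;+\; \lambda_{1}\,\mathcal{L}_{\text{value}}
    \;+\; \lambda_{2}\,\mathcal{L}_{\text{CGI}}
  ```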
- We can also extend QMIX similarly: the QMIX TD loss is combined with the value-based variant of the CGI loss in the same way.
Communication
- [^1] proposes a GNN-based approach called Heterogeneous GNN PPO (HetGPPO), which enables both inter-agent communication and learning in Dec-POMDP environments.
- Motivation: Prior methods address heterogeneity without considering the use of communication to mitigate partial observability in a Dec-POMDP (CTDE relaxes this constraint only for the critic, and only during training). Moreover, restricting to homogeneous agents prevents them from using heterogeneous actions to achieve their objectives.
- We can extend the regular game-theoretic formulation by introducing a communication graph, which assigns each agent $i$ a neighborhood $\mathcal{N}_i$. At each time step, the observation $o_i$ is communicated to the agents in the neighborhood $\mathcal{N}_i$. The goal is still the same, however: to learn a policy $\pi_i$ for each agent $i$.
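  A minimal sketch of one communication round, assuming each agent encodes its own observation and averages the encodings of its graph neighbors; this is generic message passing in plain PyTorch, not the HetGPPO architecture (and for brevity it shares one encoder across agents, whereas HetGPPO keeps per-agent parameters).

  ```python
  import torch
  import torch.nn as nn

  class CommPolicy(nn.Module):
      """Policy that mixes an agent's own encoding with messages
      aggregated from its neighbors in the communication graph."""

      def __init__(self, obs_dim: int, hidden: int, act_dim: int):
          super().__init__()
          self.encoder = nn.Linear(obs_dim, hidden)
          self.head = nn.Linear(2 * hidden, act_dim)

      def forward(self, obs: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
          # obs: [n_agents, obs_dim]; adj: [n_agents, n_agents] 0/1 communication graph
          h = torch.relu(self.encoder(obs))              # each agent encodes its observation
          deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
          msg = adj @ h / deg                            # mean of neighbors' encodings
          return self.head(torch.cat([h, msg], dim=-1))  # action logits per agent

  # Usage: 3 agents on a line graph (0-1-2); agent 1 hears from both ends.
  adj = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
  logits = CommPolicy(obs_dim=4, hidden=16, act_dim=2)(torch.randn(3, 4), adj)
  ```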
- The model allows for behavioral typing, where environmental conditions nudge agents to behave in particular ways.
- The proposed solution is both performant (compared to homogeneous parameter sharing) and resilient (i.e., even with observation noise, the agents perform well).
Footnotes

[^1]: Bettini, Shankar, and Prorok (2023). Heterogeneous Multi-Robot Reinforcement Learning.

[^2]: Yu et al. (2024). Improving Global Parameter-sharing in Physically Heterogeneous Multi-agent Reinforcement Learning with Unified Action Space.