- Rationale: Regular PGM-based approaches are difficult because (1) learning a centralized value function is hard, since the joint action space grows exponentially with the number of agents; and (2) a centralized value function does not by itself yield efficient decentralized execution. [^valdec_1]
- The goal is to decompose the centralized action-value function
    $$Q(z_t, a_t) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, z_t, a_t\right]$$
    where $r_t$ is the common reward received by all agents.
- A simpler utility function can be used for each agent, conditioned only on its individual observation history and action. We denote this with $Q_i(h_t^i, a_t^i; \theta_i)$.
- The Individual-Global-Max (IGM) property states that the greedy joint action with respect to the centralized action-value function should equal the joint action composed of the greedy individual actions of all agents, i.e., the actions that maximize their individual utilities.
- More formally, define both forms of greedy actions as follows:
    $$a_t^{*} = \arg\max_{a_t} Q(z_t, a_t), \qquad a_t^{i,*} = \arg\max_{a_t^i} Q_i(h_t^i, a_t^i) \quad \text{for each agent } i$$
- The IGM property is satisfied if the following holds for all histories:
    $$\arg\max_{a_t} Q(z_t, a_t) = \left(\arg\max_{a_t^1} Q_1(h_t^1, a_t^1), \ldots, \arg\max_{a_t^n} Q_n(h_t^n, a_t^n)\right)$$
    with joint observation histories $h_t = (h_t^1, \ldots, h_t^n)$, individual observation histories $h_t^i$, and centralized information $z_t$.
- Because of IGM, we can have all agents follow a policy based on their local utility (which is more efficient to compute), and it is equivalent to the policy based on the original centralized value function (see the sketch below).
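A minimal sketch of decentralized greedy action selection under IGM, assuming discrete action spaces and that the individual utilities and centralized action values have already been computed as NumPy arrays; `decentralized_greedy` and `centralized_greedy` are illustrative names, not from the source.

```python
import numpy as np

def decentralized_greedy(individual_utilities: list[np.ndarray]) -> tuple[int, ...]:
    """individual_utilities[i] holds Q_i(h_t^i, a) for every action a of agent i."""
    return tuple(int(np.argmax(q_i)) for q_i in individual_utilities)

def centralized_greedy(q_central: np.ndarray) -> tuple[int, ...]:
    """q_central[a_1, ..., a_n] holds Q(z_t, a) for every joint action a."""
    return tuple(int(i) for i in np.unravel_index(np.argmax(q_central), q_central.shape))

# If the decomposition satisfies IGM, both selections coincide, e.g. for a
# linear (VDN-style) decomposition Q(z_t, a) = sum_i Q_i(h_t^i, a^i):
utils = [np.array([0.1, 0.7]), np.array([0.4, 0.2, 0.9])]
q_sum = utils[0][:, None] + utils[1][None, :]
assert decentralized_greedy(utils) == centralized_greedy(q_sum)
```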
- Value decomposition also helps with credit assignment, since the reward is "common" to all agents.
- The IGM property is not necessarily satisfied for all environments.
Linear Value Decomposition
- Assume a linear decomposition of common rewards. That is,
    $$r_t = \sum_{i=1}^{n} r_t^i$$
    where $r_t^i$ is the (unknown) contribution of agent $i$ to the common reward.
- The decomposition of the centralized action-value function is then
    $$Q(z_t, a_t) = \sum_{i=1}^{n} Q_i(h_t^i, a_t^i; \theta_i)$$
    It is guaranteed to satisfy the IGM property.
- Value Decomposition Networks (VDN) maintain a replay buffer $\mathcal{D}$ containing the experience of all agents and jointly optimize the loss function
    $$\mathcal{L}(\theta) = \mathbb{E}_{\mathcal{D}}\left[\left(r_t + \gamma \sum_{i=1}^{n} \max_{a^i} Q_i(h_{t+1}^i, a^i; \bar{\theta}_i) - \sum_{i=1}^{n} Q_i(h_t^i, a_t^i; \theta_i)\right)^2\right]$$
    where $\bar{\theta}$ are target network parameters (a minimal sketch of this loss follows below).
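A minimal PyTorch sketch of the VDN loss above; the per-agent network lists, the batch layout (`hist`, `actions`, `next_hist`, `reward`, `done`), and the function name are assumptions for illustration, not the reference implementation.

```python
import torch

def vdn_loss(utils, target_utils, batch, gamma=0.99):
    """VDN TD loss for one batch of joint experience.

    utils / target_utils: lists of per-agent networks mapping a history
    encoding to a vector of action values Q_i(h^i, .).
    batch: dict of per-agent histories, actions, next histories, plus the
    common reward and done flag as float tensors (illustrative layout).
    """
    # Sum of chosen-action utilities: Q(h_t, a_t) = sum_i Q_i(h_t^i, a_t^i)
    q_chosen = sum(
        utils[i](batch["hist"][i]).gather(1, batch["actions"][i].unsqueeze(1)).squeeze(1)
        for i in range(len(utils))
    )
    # Because the decomposition is linear, the max over the joint action splits
    # into per-agent maxima: max_a sum_i Q_i = sum_i max_{a^i} Q_i
    with torch.no_grad():
        q_next = sum(
            target_utils[i](batch["next_hist"][i]).max(dim=1).values
            for i in range(len(utils))
        )
        target = batch["reward"] + gamma * (1 - batch["done"]) * q_next
    return torch.mean((target - q_chosen) ** 2)
```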
Monotonic Value Decomposition
- Extends linear value decomposition to cases where the contribution of each agent to the common reward is non-linear.
- QMIX extends VDN by ensuring that the centralized action-value function is monotonically increasing with respect to the individual utilities. That is,
    $$\frac{\partial Q(z_t, a_t)}{\partial Q_i(h_t^i, a_t^i)} \geq 0 \quad \text{for all agents } i$$
- In other words, increasing the utility of any agent for its action cannot decrease the decomposed centralized action-value function.
- QMIX uses a mixing network $f_{mix}$ (a standard feedforward neural network) that combines the individual utilities:
    $$Q(z_t, a_t) = f_{mix}\left(Q_1(h_t^1, a_t^1), \ldots, Q_n(h_t^n, a_t^n); \theta_{mix}\right)$$
- To ensure monotonicity, the mixing network is restricted to non-negative weights. $\theta_{mix}$ is obtained from a hypernetwork, which receives the state $s_t$ as input and applies an absolute value when outputting the mixing weights. Because the hypernetwork receives $s_t$ as input, we add $s_t$ to the replay buffer (a sketch of the mixer follows below).
- The loss function optimized is
    $$\mathcal{L}(\theta) = \mathbb{E}\left[\left(r_t + \gamma \max_{a_{t+1}} Q(z_{t+1}, a_{t+1}; \bar{\theta}) - Q(z_t, a_t; \theta)\right)^2\right]$$
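A sketch of a QMIX-style monotonic mixing network following the description above (hypernetworks conditioned on the state, absolute value applied to the output weights); layer sizes and class/argument names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """Feedforward mixer over individual utilities whose weights are produced
    by hypernetworks conditioned on the state and passed through an absolute
    value, so that dQ/dQ_i >= 0. Sizes and layer choices are illustrative."""

    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # Hypernetworks: map the state to the mixing weights/biases.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents) chosen-action utilities; state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)  # (b, 1, embed)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b)                     # Q(z_t, a_t)
```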
QTRAN
- Monotonic Value Decomposition only provides a sufficient condition for IGM, not a necessary one.
- The following decomposition gives conditions that are both sufficient and necessary (the latter under affine transformations):
    $$\sum_{i=1}^{n} Q_i(h_t^i, a_t^i) - Q(z_t, a_t) + V(z_t) = \begin{cases} 0 & \text{if } a_t = \bar{a}_t \\ \geq 0 & \text{otherwise} \end{cases}$$
    where
    $$V(z_t) = \max_{a} Q(z_t, a) - \sum_{i=1}^{n} Q_i(h_t^i, \bar{a}_t^i)$$
    $\bar{a}_t = \left(\arg\max_{a^1} Q_1(h_t^1, a^1), \ldots, \arg\max_{a^n} Q_n(h_t^n, a^n)\right)$ is the greedy joint action, $Q(z_t, a_t)$ is the unrestricted centralized action-value function, and $Q_i$ denotes a utility function.
- $Q$ and $V$ are conditioned on centralized information $z_t$, since the individual utility functions of each agent may lack information (especially if the environment is partially observable).
- QTRAN trains the following networks:
    - For each agent, their individual utility function $Q_i(h_t^i, a_t^i; \theta_i)$
    - A single network for the global utility function $V(z_t; \theta_v)$
    - A single network for the centralized action-value function $Q(z_t, a_t; \theta_c)$
- For the centralized action-value function, the following loss function (a TD error) is minimized:
    $$\mathcal{L}_{td}(\theta_c) = \mathbb{E}\left[\left(r_t + \gamma Q(z_{t+1}, \bar{a}_{t+1}; \bar{\theta}_c) - Q(z_t, a_t; \theta_c)\right)^2\right]$$
    where $\bar{\theta}_c$ are target network parameters and $\bar{a}_{t+1}$ is the greedy joint action with respect to the individual utilities.
- For the utility functions, we minimize additional regularization terms. For the first case in the condition specified above (where $a_t = \bar{a}_t$):
    $$\mathcal{L}_{opt} = \mathbb{E}\left[\left(\sum_{i=1}^{n} Q_i(h_t^i, \bar{a}_t^i; \theta_i) - \hat{Q}(z_t, \bar{a}_t) + V(z_t; \theta_v)\right)^2\right]$$
    where $\hat{Q}$ indicates that no gradients flow through the centralized action-value function.
- For the second case ($a_t \neq \bar{a}_t$), we apply the following (see the sketch below):
    $$\mathcal{L}_{nopt} = \mathbb{E}\left[\left(\min\left(\sum_{i=1}^{n} Q_i(h_t^i, a_t^i; \theta_i) - \hat{Q}(z_t, a_t) + V(z_t; \theta_v),\ 0\right)\right)^2\right]$$
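A sketch of the three QTRAN loss terms above, assuming the required quantities have already been computed as batched tensors; the function and argument names are illustrative, not from the QTRAN reference implementation.

```python
import torch

def qtran_losses(sum_q_chosen, sum_q_greedy, q_central_chosen, q_central_greedy,
                 v_state, reward, q_central_next_greedy, done, gamma=0.99):
    """QTRAN loss terms for one batch (all arguments are shape (batch,) tensors).

    sum_q_chosen = sum_i Q_i(h^i_t, a^i_t), sum_q_greedy = sum_i Q_i(h^i_t, abar^i_t),
    q_central_* = centralized action-value Q(z_t, .), v_state = V(z_t).
    """
    # TD loss for the unrestricted centralized action-value function.
    target = reward + gamma * (1 - done) * q_central_next_greedy.detach()
    l_td = ((q_central_chosen - target) ** 2).mean()

    # First condition (a_t = abar_t): the expression must equal zero.
    l_opt = ((sum_q_greedy - q_central_greedy.detach() + v_state) ** 2).mean()

    # Second condition (a_t != abar_t): the expression must be >= 0,
    # so only negative values are penalized.
    diff = sum_q_chosen - q_central_chosen.detach() + v_state
    l_nopt = (torch.clamp(diff, max=0.0) ** 2).mean()

    return l_td, l_opt, l_nopt
```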
- QTRAN has a few limitations
- It does not scale well because it relies on the joint action space
- It does not directly enforce the necessary and sufficient conditions for IGM, instead using regularization terms. Hence, IGM is not guaranteed.
MAVEN
- Mahajan et al.[^valdec_1] suggest that representational constraints on the joint action values lead to poor exploration and suboptimality. This is because none of these methods can perform committed exploration, wherein a specific sequence of actions (probabilistically unlikely to be chosen in that exact order) is required to explore further.
- The monotonicity constraint can prevent the Q-network from correctly remembering the true value of the optimal action (currently perceived as suboptimal).
- Multi-Agent Variational Exploration (MAVEN) learns an ensemble of monotonic approximations (QMIX-style joint action-value functions) via a latent space representation.
- Each model can be seen as a mode of committed joint exploration.
- A shared latent variable $z$ (not to be confused with the centralized information $z_t$) is controlled by a hierarchical policy, which offloads exploration from $\epsilon$-greedy action selection to committed exploration over $z$. Fixing $z$ gives a monotonic approximation to the optimal action-value function.
- The loss function becomes
    $$\mathcal{L}_{QL}(\phi) = \mathbb{E}_{z, \mathcal{D}}\left[\left(r_t + \gamma \max_{a_{t+1}} Q(h_{t+1}, a_{t+1}; z, \bar{\phi}) - Q(h_t, a_t; z, \phi)\right)^2\right]$$
- A hierarchical policy objective is obtained through parameter freezing (fixing the value-function parameters):
    $$\mathcal{J}_{RL}(\theta) = \int R(\tau \mid z)\, p_{\theta}(z \mid s_0)\, \rho(s_0)\, dz\, ds_0$$
    where $R(\tau \mid z)$ is the return of the joint trajectory $\tau$ and $\rho(s_0)$ is the initial-state distribution.
- Another objective makes use of mutual information to encourage diverse but identifiable behavior among the policies derived from different values of $z$:
    $$\mathcal{J}_{MI} = MI\left(\sigma(\tau), z\right) = \mathcal{H}\left(\sigma(\tau)\right) - \mathcal{H}\left(\sigma(\tau) \mid z\right)$$
    The actions in the trajectory $\tau$ are encoded by an RNN; $\sigma$ is an operator that returns a per-agent Boltzmann policy with respect to the utilities at each time step $t$.
- A tractable lower bound is obtained using a variational distribution $q_{\nu}(z \mid \sigma(\tau))$, parameterized by $\nu$, as a proxy for the posterior over $z$:
    $$\mathcal{J}_{MI} \geq \mathcal{H}(z) + \mathbb{E}\left[\log q_{\nu}\left(z \mid \sigma(\tau)\right)\right]$$
- The complete objective is the following (see the sketch of the mutual-information term below):
    $$\max_{\theta, \phi, \nu} \ \mathcal{J}_{RL}(\theta) + \lambda_{MI}\, \mathcal{J}_{MI}(\nu, \phi) - \lambda_{QL}\, \mathcal{L}_{QL}(\phi)$$
    where $\lambda_{MI}$ and $\lambda_{QL}$ are positive coefficients.
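A sketch of the variational mutual-information term, assuming a GRU encoder over the per-timestep Boltzmann policies $\sigma(\tau)$ and a discrete latent $z$; maximizing $\mathbb{E}[\log q_{\nu}(z \mid \sigma(\tau))]$ (i.e., minimizing the cross-entropy below) maximizes the lower bound up to the constant $\mathcal{H}(z)$. Class and argument names are illustrative.

```python
import torch
import torch.nn as nn

class VariationalDiscriminator(nn.Module):
    """Sketch of the variational distribution q_nu(z | sigma(tau)): encodes the
    per-timestep per-agent Boltzmann policies with an RNN and predicts which
    latent z generated the trajectory. Shapes/sizes are illustrative."""

    def __init__(self, policy_dim: int, n_latents: int, hidden: int = 64):
        super().__init__()
        self.rnn = nn.GRU(policy_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_latents)

    def forward(self, boltzmann_seq: torch.Tensor) -> torch.Tensor:
        # boltzmann_seq: (batch, time, policy_dim) flattened Boltzmann policies sigma(tau)
        _, h = self.rnn(boltzmann_seq)
        return self.head(h.squeeze(0))  # logits over the latent z

def mi_lower_bound_loss(discriminator, boltzmann_seq, z_labels):
    """Cross-entropy of the discriminator against the sampled latent z;
    minimizing it maximizes the variational lower bound on MI(sigma(tau), z)."""
    logits = discriminator(boltzmann_seq)
    return nn.functional.cross_entropy(logits, z_labels)
```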
Links
- Albrecht, Christianos, and Schafer - Ch. 9
- 9.5.2 - derivation of Linear Value Decomposition as well as a proof of it satisfying the IGM property.
Footnotes
[^valdec_1]: Mahajan, Rashid, Samvelyan, Whiteson (2019). MAVEN: Multi-Agent Variational Exploration.