Training and Execution Paradigms

  • Centralized Training - a training paradigm where we make use of all the information gathered by every agent in the system.
  • Decentralized Training - a training paradigm where we make use of only the local information of each agent.
  • Decentralized Execution - the policies of the agents are conditioned only on their local history of observations.
  • Centralized Execution - the policies of the agents are conditioned on information gathered from all agents (the two execution modes are contrasted in the sketch after this list).
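
To make the distinction concrete, here is a minimal sketch contrasting the two execution modes. The class names are hypothetical and the action choices are random placeholders standing in for a learned policy; the point is only what information each policy is allowed to condition on.

```python
import numpy as np

class DecentralizedPolicy:
    """Acts using only this agent's local observation history."""
    def __init__(self, n_actions, seed=0):
        self.n_actions = n_actions
        self.history = []                          # local observations only
        self.rng = np.random.default_rng(seed)

    def act(self, local_obs):
        self.history.append(local_obs)             # condition on local data
        return int(self.rng.integers(self.n_actions))


class CentralizedPolicy:
    """Selects a joint action from the observations of all agents."""
    def __init__(self, n_agents, n_actions, seed=0):
        self.n_agents = n_agents
        self.n_actions = n_actions
        self.rng = np.random.default_rng(seed)

    def act(self, all_obs):
        assert len(all_obs) == self.n_agents       # needs global information
        return tuple(int(self.rng.integers(self.n_actions))
                     for _ in range(self.n_agents))
```

Under decentralized execution each agent holds its own `DecentralizedPolicy`; centralized execution replaces them with a single `CentralizedPolicy` that must receive every agent's observation at decision time.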

CTCE

  • CTCE - a special case is centralized models, where each agent pools its experience and feeds it back to a single learner. Policies are then returned for local execution.

    • In effect, we have more agents contributing experience to a single learner.
    • All agents have similar policies
    • We can have local independent actors but a centralized critic.
    • This requires that we transform the joint reward into a single scalar reward, which is a non-trivial task.
    • Another challenge is scalability: the joint action space grows exponentially with the number of agents.
    • It also means that observations are not localized and training is based on globally gathered data
  • Centralized Q-learning (CQL) is an approach to learning a policy for stochastic games where we maintain action values over joint actions. It is essentially Q-learning applied to the joint action space; a minimal sketch is given after the figure caption below.

CQL. Image taken from Albrecht, Christianos and Schafer
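
As a rough illustration of centralized Q-learning, the sketch below runs tabular Q-learning over the joint action space. The environment interface (`env.reset()` returning a hashable state, `env.step(joint_action)` returning per-agent rewards), the scalarization of the joint reward as a simple sum, and the hyperparameters are all assumptions made for illustration.

```python
import itertools
from collections import defaultdict
import numpy as np

def centralized_q_learning(env, n_agents, n_actions, episodes=500,
                           alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular Q-learning over the joint action space (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # enumerate every joint action: n_actions ** n_agents of them
    joint_actions = list(itertools.product(range(n_actions), repeat=n_agents))
    Q = defaultdict(lambda: np.zeros(len(joint_actions)))  # Q[state][joint_action_index]

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy over the joint action space
            if rng.random() < eps:
                a_idx = int(rng.integers(len(joint_actions)))
            else:
                a_idx = int(np.argmax(Q[s]))
            s_next, rewards, done = env.step(joint_actions[a_idx])
            # scalarize the joint reward (here: simple sum across agents)
            r = float(np.sum(rewards))
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s][a_idx] += alpha * (target - Q[s][a_idx])
            s = s_next
    return Q, joint_actions
```

Note that the Q-table holds `n_actions ** n_agents` values per state, which is exactly the exponential blow-up in the joint action space mentioned above.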

DTDE

  • A special case is Independent Learning, where each agent learns its own policy using only its own local history of observations, actions and rewards, ignoring the existence of other agents.

    • This allows adaptation to local perception of the environment.

    • A decentralized approach helps when there is an asymmetry in what agents can do.

    • Each agent uses a copy of the same learning algorithm.

    • It avoids the combinatorial explosion of the joint action space.

    • It can be used for problems that require local policies.

    • It does not require the joint reward to be converted into a single scalar reward.

    • From the perspective of a given agent, the other agents’ policies become part of the environment’s state transition function.

    • They serve as strong baselines that are often competitive with state-of-the-art methods.

    • We are unable to use standard experience replay buffers, since the other learning agents make the environment non-stationary and stored experiences become outdated.

    • Agents are unable to distinguish between non-stationarity caused by other agents and non-stationarity inherent to the environment itself.

    • From the perspective of each agent, the environment is non-stationary because the other agents’ decisions (and their changing policies) affect the environment’s dynamics, which makes learning more unstable.

  • Another special case is Agent Modeling, where each agent additionally learns models of the other agents’ behavior.

  • Despite its simplicity, independent learning has been shown to perform competitively with more sophisticated algorithms in multi-agent deep RL.

  • Independent Q-Learning (IQL) is analogous to CQL, but in the context of independent learning: each agent runs Q-learning on its own local observations and actions (a minimal sketch follows the figure caption below).

    • For certain classes of stochastic games, infinitesimal IQL (using infinitely small learning steps) is guaranteed to converge to a Nash equilibrium, while for others it may not converge at all, or only under certain conditions.

IQL. Image taken from Albrecht, Christianos and Schafer
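
A corresponding sketch of IQL, under the same kind of assumed environment interface but with per-agent local observations: each agent keeps its own Q-table (indexed by its own hashable observation) and updates it using only its own observation, action and reward.

```python
from collections import defaultdict
import numpy as np

def independent_q_learning(env, n_agents, n_actions, episodes=500,
                           alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Each agent runs tabular Q-learning independently (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # one Q-table per agent, indexed by that agent's local observation
    Q = [defaultdict(lambda: np.zeros(n_actions)) for _ in range(n_agents)]

    for _ in range(episodes):
        obs, done = env.reset(), False          # obs: list of local observations
        while not done:
            actions = []
            for i in range(n_agents):
                # each agent chooses epsilon-greedily from its own table
                if rng.random() < eps:
                    actions.append(int(rng.integers(n_actions)))
                else:
                    actions.append(int(np.argmax(Q[i][obs[i]])))
            next_obs, rewards, done = env.step(actions)
            for i in range(n_agents):
                # update uses only agent i's local observation, action and reward;
                # the other agents are treated as part of the environment
                target = rewards[i] + (0.0 if done else gamma * np.max(Q[i][next_obs[i]]))
                Q[i][obs[i]][actions[i]] += alpha * (target - Q[i][obs[i]][actions[i]])
            obs = next_obs
    return Q
```

Because each agent ignores the others, the transitions it experiences depend on the other agents’ changing policies, which is exactly the non-stationarity discussed above.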

Centralized Training Decentralized Execution

Fundamental Approaches to Independent Learning

  • Self Play - agents use the same learning algorithm.

  • Mixed Play - agents use different learning algorithms.

    • Ad-hoc teamwork - agents collaborate with previously unknown agents whose behaviors may be unknown.
  • Some algorithms also bridge self-play and mixed play.

Stochastic Game Algorithms
