• (Algorithm) Self-play - agents use the same learning algorithm or the same policy.
    • It helps reduce the non-stationarity that would come from each agent learning with a different algorithm.
  • (Policy-based) Self-play - Algorithms [^self_play] [^self_play_2] where the agent plays directly against itself to exploit its own weakness.
    • These may learn faster than algorithm self-play since the experiences of all agents can be used to train a single policy.
    • It is more restrictive in that it requires agents to have symmetrical roles and egocentric observations (i.e., each agent's observations are relative to itself); a minimal sketch is given after this list.
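
To make policy-based self-play concrete, below is a minimal sketch of a training loop in which one shared policy controls every agent and the pooled egocentric experience updates that single policy. The `env` and `policy` interfaces (`reset`, `step`, `act`, `update`) are hypothetical placeholders, not the API of any specific library.

```python
# Minimal sketch of policy-based self-play: one shared policy acts for every
# agent (symmetric roles), and all agents' egocentric experience is pooled to
# update that single policy.  `env` and `policy` are hypothetical interfaces.
def policy_self_play(env, policy, num_episodes):
    for _ in range(num_episodes):
        observations = env.reset()        # dict: agent id -> egocentric observation
        done = False
        experience = []
        while not done:
            # The same policy chooses an action for every agent.
            actions = {agent: policy.act(obs) for agent, obs in observations.items()}
            next_observations, rewards, done = env.step(actions)
            # Pool every agent's transition into one training buffer.
            for agent, obs in observations.items():
                experience.append((obs, actions[agent], rewards[agent]))
            observations = next_observations
        policy.update(experience)         # one policy learns from everyone's data
    return policy
```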

Population Based Training

  • Population-based training - extends self-play by training policies against a distribution of other policies, including past versions of themselves.

  • It allows self-play to be used in general-sum games with two or more agents, wherein agents may have different (non-symmetric) roles, actions, and observations.

  • The general approach is as follows (sketched in code after this list):

    • Initialize a policy population $\Pi_i^1$ for each agent $i$.
    • In the $k$-th generation, evaluate the current policies with respect to the performance of the other policies.
    • Based on these evaluations, modify the existing policies or add new policies.
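
A schematic of this generational loop is below. `init_population`, `evaluate`, and `update_population` are hypothetical callables standing in for whatever evaluation and policy-update machinery a concrete method uses.

```python
# Schematic of the generational population-based training loop described above.
def population_based_training(num_agents, num_generations,
                              init_population, evaluate, update_population):
    # Initialize a policy population for each agent i.
    populations = [init_population(i) for i in range(num_agents)]

    for k in range(num_generations):
        # Evaluate each agent's policies against the other agents' populations.
        scores = [
            [evaluate(i, policy, populations) for policy in populations[i]]
            for i in range(num_agents)
        ]
        # Modify existing policies and/or add new ones based on the evaluations.
        populations = [update_population(populations[i], scores[i])
                       for i in range(num_agents)]
    return populations
```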

PSRO

  • Policy Space Response Oracles. These involve constructing a meta-game and then applying equilibrium analysis to the meta-game.

PSRO. Image taken from Albrecht, Christianos, and Schäfer
  • The meta-game is a simplified abstraction of the original game. It allows a characterization of the types of policies (the population) that are feasible; this set is then refined as training continues.

    • The action space of the meta-game is the set of policies in the current population.
    • The reward for each meta-agent is estimated empirically by simulating policies sampled from the population. In the limit, this will converge to the expected reward of the agent.
  • The meta-game is then solved using a meta-solver.

    • The meta-solver computes a distribution $\sigma_i$ over each agent's population $\Pi_i$, where $\sigma_i(\pi)$ is the probability assigned to policy $\pi \in \Pi_i$.
    • This is analogous to mixing policies. It determines the probability that an agent will play a certain policy.
  • An oracle computes a new policy to add to each agent’s population: a best-response policy with respect to the other agents' distributions $\sigma_{-i}$. It is added to the population for the next generation (a toy PSRO loop is sketched after this list).

  • PSRO is guaranteed to converge, and when it converges the computed distribution is guaranteed to be a Nash equilibrium. Any stochastic policy can be obtained as a mixture of deterministic policies, so we can restrict attention to deterministic policies. Assuming episodes terminate, there are only finitely many deterministic policies, so the populations cannot grow indefinitely.

  • PSRO may converge to different solution types using different meta-solvers and oracles.
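
The sketch below shows what one PSRO loop can look like on a tiny two-player zero-sum matrix game (rock-paper-scissors), where deterministic policies are simply pure actions and the oracle is an exact best response. For simplicity the meta-solver is a uniform distribution over each population (a fictitious-play-style variant); real PSRO typically solves the meta-game for an equilibrium, which is exactly the meta-solver choice discussed above.

```python
import numpy as np

# Toy PSRO sketch on rock-paper-scissors.  Deterministic policies are pure
# actions, the oracle is an exact best response, and the meta-solver is
# simplified to a uniform distribution over each population.
PAYOFF = np.array([[0, -1, 1],    # row player's payoff for (rock, paper, scissors)
                   [1, 0, -1],
                   [-1, 1, 0]])

def meta_solver(population):
    # Simplifying assumption: uniform mixture over the current population.
    return np.full(len(population), 1.0 / len(population))

def oracle(opponent_population, opponent_sigma, as_row):
    # Expected payoff of each pure action against the opponent's mixture.
    expected = np.zeros(3)
    for action, prob in zip(opponent_population, opponent_sigma):
        expected += prob * (PAYOFF[:, action] if as_row else -PAYOFF[action, :])
    return int(np.argmax(expected))   # best-response deterministic policy

pop_row, pop_col = [0], [0]           # initial populations: a single policy each
for generation in range(20):
    sigma_row, sigma_col = meta_solver(pop_row), meta_solver(pop_col)
    # Each agent's oracle computes a best response to the other agent's
    # distribution and adds it to the population for the next generation.
    br_row = oracle(pop_col, sigma_col, as_row=True)
    br_col = oracle(pop_row, sigma_row, as_row=False)
    pop_row.append(br_row)
    pop_col.append(br_col)

print("row population:", pop_row)
print("col population:", pop_col)
```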

AlphaStar

  • AlphaStar [^1] builds on top of PSRO. It achieved grandmaster-level performance in StarCraft II.
    • Each observation contains an overview map of the environment and all entities in the environment (with their associated attributes).
    • Actions are hierarchical, specifying the action type, the unit to perform that action, the target of the action, and when the agent wants to select its next action.
    • Policies are initialized from human gameplay data (i.e., via imitation learning).
    • The policies are then trained using A2C. The agent is penalized for deviating from human play (a schematic of such a penalized loss is sketched below).
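
To illustrate the "penalized for deviating from human play" idea, here is a rough sketch of an A2C-style loss with an added KL term toward a frozen imitation-learned policy. This is a simplification under stated assumptions: the actual AlphaStar objective uses additional machinery (e.g., off-policy corrections) not shown here, and the coefficients and KL direction below are illustrative, not the published hyperparameters.

```python
import torch
import torch.nn.functional as F

def a2c_loss_with_human_kl(logits, human_logits, actions, advantages,
                           values, returns, value_coef=0.5, kl_coef=0.1):
    """A2C-style loss plus a KL penalty toward a frozen imitation-learned policy.

    `logits` come from the current policy; `human_logits` come from the frozen
    policy obtained by imitation learning on human play.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    chosen_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    policy_loss = -(advantages.detach() * chosen_log_probs).mean()  # policy gradient
    value_loss = F.mse_loss(values, returns)                        # critic regression
    # Penalize divergence of the current policy from the human (imitation) policy.
    kl_to_human = F.kl_div(log_probs, F.log_softmax(human_logits, dim=-1),
                           reduction="batchmean", log_target=True)
    return policy_loss + value_coef * value_loss + kl_coef * kl_to_human
```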

AlphaStar. Image taken from Vinyals et al. (2019)
  • It makes use of League Training - a single league of policies that contains the different types of agents.
    • It uses past policy copies of each agent type.
    • It uses prioritized fictitious self-play (PFSP) to compute the distribution over opponents, where the probability of matching agent $A$ against opponent $B$ from a candidate set $\mathcal{C}$ is
      $$\Pr(B) = \frac{f\big(\mathbb{P}[A \text{ beats } B]\big)}{\sum_{C \in \mathcal{C}} f\big(\mathbb{P}[A \text{ beats } C]\big)}$$
      with a weighting function $f : [0, 1] \to [0, \infty)$ (a small code sketch follows this list).
      • The weighting function is either hard (to focus on difficult opponents, e.g., $f_{\text{hard}}(x) = (1 - x)^p$) or variance-based (to focus on opponents of a similar level, e.g., $f_{\text{var}}(x) = x(1 - x)$).
    • There are three types of agents
      • Main agents - trained with self-play, PFSP and against past policies of main-exploiters.
      • Main exploiter agents - trained to exploit the weakness of the main agent. They are added to the league when they manage to defeat all main agents.
      • League exploiter agents - trained against all policies in the league. They are trained to identify strategies that no policy in the league is effective against.
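
A small, runnable sketch of PFSP opponent sampling; the weighting forms mirror the hard and variance schemes above, and the exponent `p` is an illustrative choice rather than a published setting.

```python
import numpy as np

def pfsp_probabilities(win_rates, weighting="hard", p=2.0):
    """PFSP opponent distribution computed from estimated win rates.

    win_rates[i] is the estimated probability that the learning agent beats
    opponent i.  The weighting forms below follow the hard and variance
    schemes described above; p is an illustrative exponent.
    """
    x = np.asarray(win_rates, dtype=float)
    if weighting == "hard":
        weights = (1.0 - x) ** p      # focus on opponents we rarely beat
    elif weighting == "var":
        weights = x * (1.0 - x)       # focus on opponents of a similar level
    else:
        raise ValueError(f"unknown weighting: {weighting}")
    return weights / weights.sum()

# Three past opponents with estimated win rates of 0.9, 0.5, and 0.1:
# most of the probability mass goes to the hardest opponent.
print(pfsp_probabilities([0.9, 0.5, 0.1], weighting="hard"))
```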

Other Approaches

Fictitious Co-play

  • Strouse et al. [^2] introduce Fictitious Co-Play (FCP), wherein checkpoints of each agent’s model are saved and a partner agent is trained to collaborate with both the fully trained agents and their past checkpoints.

  • Rationale: past checkpoints simulate varying skill levels and promote robustness, especially when the agent faces human collaborators who have their own preferences and skill levels.

  • This allows the agent to better collaborate with humans without having to rely on human data for behavioral cloning.

  • It consists of two stages (sketched in code after this list):

    • A diverse pool of partners is trained independently via self-play to obtain a variety of strategies. Checkpoints of each self-play partner are kept.
    • An FCP agent is trained as a best response to the pool of diverse partners. The partner agents are frozen, so the FCP agent must adapt to them.
  • FCP outperforms previous SOTA methods and is preferred by humans over them.

    • FCP is not biased toward particular choices and can adapt to its collaborator’s preferences (assuming the available choices yield equal value).
    • FCP can achieve zero-shot collaboration.
  • For larger games, FCP would require a larger population. Methods would need to be explored to encourage behavior diversity.

  • It currently relies on a hand-designed reward function, but this need not be the case.
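
A schematic of the two FCP stages described above; `make_policy`, `self_play_update`, and `train_against_pool` are hypothetical callables standing in for the underlying RL machinery.

```python
import copy

def train_fcp(env, make_policy, self_play_update, train_against_pool,
              num_partners=8, num_steps=100_000, checkpoint_every=10_000):
    """Two-stage Fictitious Co-Play sketch with hypothetical training callables."""
    # Stage 1: independently train a diverse pool of self-play partners,
    # keeping periodic checkpoints to represent different skill levels.
    partner_pool = []
    for seed in range(num_partners):
        partner = make_policy(seed)
        for step in range(num_steps):
            self_play_update(env, partner)
            if step % checkpoint_every == 0:
                partner_pool.append(copy.deepcopy(partner))   # frozen checkpoint
        partner_pool.append(copy.deepcopy(partner))            # fully trained partner

    # Stage 2: train the FCP agent as a best response to the frozen pool.
    # The partners are never updated, so the FCP agent must adapt to them.
    fcp_agent = make_policy(num_partners)
    for _ in range(num_steps):
        train_against_pool(env, fcp_agent, partner_pool)
    return fcp_agent
```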

Fictitious Co-Play. Image taken from Strouse et al. (2021)

Footnotes

[^1]: Vinyals et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning.

[^2]: Strouse et al. (2021). Collaborating with Humans without Human Data.