• Joint Action Learning (JAL) uses Temporal Difference Learning to estimate joint-action value models, which predict the expected return of each joint action

  • Here we use something similar to Q-learning, but over joint actions

  • JAL-GT is a variant of this in which, at each state s, we frame the problem as a static (normal-form) game whose reward functions are given by the current joint-action value estimates Q_i(s, a) (see the sketch below the figure).

    • Learning then becomes discovering the right game to play

JAL-GT. Image taken from Albrecht, Christianos, and Schäfer
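The update at the heart of JAL-GT can be written compactly. Below is a minimal sketch, not taken from the source; the names `jal_gt_update`, `solve_stage_game`, `joint_action_space`, and the dictionary-based value models are illustrative assumptions. Each agent's joint-action value model is updated with a TD target whose bootstrap term is the game-theoretic value of the stage game induced at the next state.

```python
def jal_gt_update(Q, s, joint_a, rewards, s_next, joint_action_space,
                  solve_stage_game, alpha=0.1, gamma=0.95):
    """One temporal-difference update of the JAL-GT joint-action value models.

    Q[j] maps (state, joint_action) -> agent j's estimated expected return;
    unseen entries should default to 0 (e.g. a collections.defaultdict(float)).
    `solve_stage_game` stands in for any game-theoretic solution concept
    (minimax, Nash, correlated equilibrium): given the stage game at s_next,
    built from the current value models, it returns each agent's game value
    and the joint policy the agents would sample their actions from.
    """
    n_agents = len(Q)
    # Stage game at the next state: one reward entry per joint action, per agent.
    stage_game_next = [
        {a: Q[j][(s_next, a)] for a in joint_action_space}
        for j in range(n_agents)
    ]
    values_next, _policy_next = solve_stage_game(stage_game_next)
    # TD target: immediate reward plus discounted value of the next stage game.
    for j in range(n_agents):
        td_target = rewards[j] + gamma * values_next[j]
        Q[j][(s, joint_a)] += alpha * (td_target - Q[j][(s, joint_a)])
    return Q
```

The solution concept plugged in as `solve_stage_game` is what distinguishes the variants discussed next (Minimax, Nash, and Correlated Q-Learning).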
  • We solve each stage game using a game-theoretic solution concept to obtain a joint policy, from which each agent samples its action, and a game value that is used in the TD target.
    • Minimax Q-Learning is guaranteed to learn the unique minimax value of a two-agent zero-sum stochastic game, under the assumption that all states and joint actions are tried infinitely often (a linear-programming sketch of the stage-game minimax computation appears after this list).
      • Policies trained with Minimax Q-Learning do not necessarily exploit weaknesses of opponents when such weaknesses exist.
      • At the same time, minimax policies are robust to exploitation, because a worst-case opponent is assumed during training.
    • Nash Q-Learning is guaranteed to learn a Nash equilibrium and is applicable to general-sum games, but its convergence conditions are very restrictive:
      • It requires that all states and joint actions be tried infinitely often, and that every stage game encountered during learning either
        • Has a global optimum, in which each agent individually achieves its maximum possible return; or
        • Has a saddle point, in which if any agent deviates, then all other agents will receive a higher expected return.
      • These assumptions bypass the equilibrium selection problem, since all global optima yield the same expected return for each agent, and likewise all saddle points.
      • Implicitly, it is also assumed that agents consistently choose either global optima or saddle points throughout learning.
    • Correlated Q-Learning uses correlated equilibrium as the solution concept; the algorithm must be modified so that agents sample their actions jointly from the correlated equilibrium (e.g., via a shared source of randomness) rather than independently from their individual policies.
      • This has two advantages over Nash Q-Learning:
        • Correlated equilibria are a superset of Nash equilibria, so more solutions, with potentially higher expected returns, can be explored.
        • A correlated equilibrium can be computed efficiently with Linear Programming.
      • Convergence to a correlated equilibrium is not guaranteed.
      • Equilibrium selection becomes a concern; however, this can be mitigated with a protocol that consistently selects an equilibrium (e.g., the one with highest welfare).
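As a concrete example of the stage-game solution step, here is a hedged sketch (the helper `minimax_value` is an illustrative assumption, not from the source) of computing the row agent's minimax value and policy for a zero-sum stage game with a linear program via scipy. Correlated Q-Learning replaces this with a different LP whose variables form a probability distribution over joint actions subject to the correlated-equilibrium incentive constraints.

```python
import numpy as np
from scipy.optimize import linprog

def minimax_value(R):
    """Maximin value and policy for the row agent in a zero-sum stage game.

    R[a, b] is the row agent's reward for joint action (a, b); the column
    agent receives -R[a, b]. An LP of this form is solved at every visited
    state to obtain the stage-game value used in the TD target.
    """
    m, n = R.shape
    # Decision variables: m action probabilities x, plus the game value v.
    # linprog minimizes, so minimize -v to maximize v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # For every opponent action b:  v - sum_a R[a, b] * x_a <= 0
    A_ub = np.hstack([-R.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Probabilities sum to one; v is unbounded.
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:m]

# Matching pennies: the unique minimax policy is uniform with value 0.
R = np.array([[1.0, -1.0], [-1.0, 1.0]])
value, policy = minimax_value(R)   # value ~ 0.0, policy ~ [0.5, 0.5]
```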

Limitations

  • JAL is not applicable to certain games, because the information in the joint-action value model is insufficient to construct an equilibrium policy. The limitation arises in games that have no deterministic equilibrium, only stochastic (probabilistic) equilibria.

  • A stationary equilibrium is one in which the policy is conditioned only on the current state s, rather than on the full history.

  • No Stationary Deterministic Equilibrium (NoSDE) Theorem (Zinkevich, Greenwald, and Littman, 2005)

Let Q_i^Γ denote the joint-action value functions under the unique equilibrium of game Γ. For any NoSDE game Γ with a unique equilibrium joint policy π, there exists another NoSDE game Γ' which differs from Γ only in the reward functions and has its own unique equilibrium joint policy π' ≠ π, such that Q_i^Γ(s, a) = Q_i^Γ'(s, a) for all agents i, states s, and joint actions a.

  • In such games, the specific action probabilities of the equilibrium policy may not be computable from the joint-action value functions alone.

Links