• Joint Action Learning (JAL) uses Temporal Difference Learning to estimate joint-action value models, which predict the expected return of each joint action

  • Here we use something similar to Q-learning, but over joint actions

  • JAL-GT is a variant of this in which, at each state s, we frame the problem as a static (normal-form) game whose reward functions are given by the current joint-action value estimates Q_i(s, a) (see the sketch below the figure).

    • Learning then becomes discovering the right game to play

JAL-GT. Image taken from Albrecht, Christianos, and Schäfer
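The update at the heart of JAL-GT can be written compactly. Below is a minimal sketch, not taken from the source; the names `jal_gt_update`, `solve_stage_game`, `joint_action_space`, and the dictionary-based value models are illustrative assumptions. Each agent's joint-action value model is updated with a TD target whose bootstrap term is the game-theoretic value of the stage game induced at the next state.

```python
def jal_gt_update(Q, s, joint_a, rewards, s_next, joint_action_space,
                  solve_stage_game, alpha=0.1, gamma=0.95):
    """One temporal-difference update of the JAL-GT joint-action value models.

    Q[j] maps (state, joint_action) -> agent j's estimated expected return;
    unseen entries should default to 0 (e.g. a collections.defaultdict(float)).
    `solve_stage_game` stands in for any game-theoretic solution concept
    (minimax, Nash, correlated equilibrium): given the stage game at s_next,
    built from the current value models, it returns each agent's game value
    and the joint policy the agents would sample their actions from.
    """
    n_agents = len(Q)
    # Stage game at the next state: one reward entry per joint action, per agent.
    stage_game_next = [
        {a: Q[j][(s_next, a)] for a in joint_action_space}
        for j in range(n_agents)
    ]
    values_next, _policy_next = solve_stage_game(stage_game_next)
    # TD target: immediate reward plus discounted value of the next stage game.
    for j in range(n_agents):
        td_target = rewards[j] + gamma * values_next[j]
        Q[j][(s, joint_a)] += alpha * (td_target - Q[j][(s, joint_a)])
    return Q
```

The solution concept plugged in as `solve_stage_game` is what distinguishes the variants discussed next (Minimax, Nash, and Correlated Q-Learning).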
  • We solve each stage game using a game-theoretic solution concept to obtain a joint policy, from which each agent samples its action, and a game value that is used in the TD target.
    • Minimax Q-Learning is guaranteed to learn the unique minimax value of a two-agent zero-sum stochastic game, under the assumption that all states and joint actions are tried infinitely often (a linear-programming sketch of the stage-game minimax computation appears after this list).
      • Policies trained with Minimax Q-Learning do not necessarily exploit weaknesses of opponents when such weaknesses exist.
      • At the same time, minimax policies are robust to exploitation, because a worst-case opponent is assumed during training.
    • Nash Q-Learning is guaranteed to learn a Nash equilibrium and is applicable to general-sum games, but its convergence conditions are very restrictive:
      • It requires that all states and joint actions be tried infinitely often, and that every stage game encountered during learning either
        • Has a global optimum, in which each agent individually achieves its maximum possible return; or
        • Has a saddle point, in which if any agent deviates, then all other agents will receive a higher expected return.
      • These assumptions bypass the equilibrium selection problem, since all global optima yield the same expected return for each agent, and likewise all saddle points.
      • Implicitly, it is also assumed that agents consistently choose either global optima or saddle points throughout learning.
    • Correlated Q-Learning uses correlated equilibrium as the solution concept; the algorithm must be modified so that agents sample their actions jointly from the correlated equilibrium (e.g., via a shared source of randomness) rather than independently from their individual policies.
      • This has two advantages over Nash Q-Learning:
        • Correlated equilibria are a superset of Nash equilibria, so more solutions, with potentially higher expected returns, can be explored.
        • A correlated equilibrium can be computed efficiently with Linear Programming.
      • Convergence to a correlated equilibrium is not guaranteed.
      • Equilibrium selection becomes a concern; however, this can be mitigated with a protocol that consistently selects an equilibrium (e.g., the one with highest welfare).
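As a concrete example of the stage-game solution step, here is a hedged sketch (the helper `minimax_value` is an illustrative assumption, not from the source) of computing the row agent's minimax value and policy for a zero-sum stage game with a linear program via scipy. Correlated Q-Learning replaces this with a different LP whose variables form a probability distribution over joint actions subject to the correlated-equilibrium incentive constraints.

```python
import numpy as np
from scipy.optimize import linprog

def minimax_value(R):
    """Maximin value and policy for the row agent in a zero-sum stage game.

    R[a, b] is the row agent's reward for joint action (a, b); the column
    agent receives -R[a, b]. An LP of this form is solved at every visited
    state to obtain the stage-game value used in the TD target.
    """
    m, n = R.shape
    # Decision variables: m action probabilities x, plus the game value v.
    # linprog minimizes, so minimize -v to maximize v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # For every opponent action b:  v - sum_a R[a, b] * x_a <= 0
    A_ub = np.hstack([-R.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Probabilities sum to one; v is unbounded.
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:m]

# Matching pennies: the unique minimax policy is uniform with value 0.
R = np.array([[1.0, -1.0], [-1.0, 1.0]])
value, policy = minimax_value(R)   # value ~ 0.0, policy ~ [0.5, 0.5]
```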

Limitations

  • JAL is not applicable to certain games, because the information in the joint-action value model is insufficient to construct an equilibrium policy. The limitation arises in games that have no deterministic equilibrium, only stochastic (probabilistic) equilibria.

  • A stationary equilibrium is one in which the policy is conditioned only on the current state s, rather than on the full history.

  • No Stationary Deterministic Equilibrium (NoSDE) Theorem (Zinkevich, Greenwald, and Littman, 2005)

Let Q_i^Γ denote the joint-action value functions under the unique equilibrium of game Γ. For any NoSDE game Γ with a unique equilibrium joint policy π, there exists another NoSDE game Γ' which differs from Γ only in the reward functions and has its own unique equilibrium joint policy π' ≠ π, such that Q_i^Γ(s, a) = Q_i^Γ'(s, a) for all agents i, states s, and joint actions a.

  • In such games, the specific action probabilities of the equilibrium policy may not be computable from the joint-action value functions alone.

Links