As increasingly capable agents are deployed, a central safety challenge is how to retain meaningful human control without modifying the underlying system. We study a minimal control interface in which an agent chooses whether to act autonomously (play) or defer (ask), while a human simultaneously chooses whether to be permissive (trust) or to engage in oversight (oversee), and we model this interaction as a two-player Markov Game. When this game is a Markov Potential Game, we prove an alignment guarantee: any gain in the agent’s utility from acting more autonomously cannot decrease the human’s value. This establishes a form of intrinsic alignment in which the agent’s incentive to seek autonomy is structurally coupled to the human’s welfare. Practically, the framework induces a transparent control layer that encourages the agent to defer when actions are risky and to act when they are safe. While gridworld simulations illustrate the emergence of this collaboration, our primary validation is an agentic tool-use task in which two 30B-parameter language models are fine-tuned via independent policy-gradient updates. We demonstrate that, even as the agents learn to coordinate on the fly, the framework reduces safety violations in realistic, open-ended environments.
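To make the game structure concrete, the following is a minimal sketch of a single-state (stage-game) simplification of the interaction described above: the agent picks play/ask, the human picks trust/oversee, and both payoffs share a potential term. All payoff numbers here are hypothetical, chosen only so that the exact-potential property holds by construction; the paper's actual setting is a full Markov Potential Game, not a one-shot matrix game.

```python
# Toy 2x2 stage game illustrating the exact-potential condition informally
# described in the abstract. All numeric payoffs are hypothetical.
AGENT_ACTIONS = ("play", "ask")       # act autonomously vs. defer
HUMAN_ACTIONS = ("trust", "oversee")  # be permissive vs. engage oversight

# Shared potential Phi(a, h): the term that couples both players' incentives.
PHI = {("play", "trust"): 3.0, ("play", "oversee"): 1.0,
       ("ask", "trust"): 2.0, ("ask", "oversee"): 2.5}

# Each player's utility is Phi plus a term independent of that player's own
# action, which makes the game an exact potential game by construction.
def u_agent(a, h):
    bonus = {"trust": 0.5, "oversee": -0.5}[h]  # depends only on the human's action
    return PHI[(a, h)] + bonus

def u_human(a, h):
    bonus = {"play": 1.0, "ask": 0.0}[a]        # depends only on the agent's action
    return PHI[(a, h)] + bonus

def is_exact_potential():
    """Check that every unilateral deviation changes the deviator's payoff
    by exactly the change in the potential Phi."""
    for h in HUMAN_ACTIONS:
        for a, a2 in [("play", "ask"), ("ask", "play")]:
            if abs((u_agent(a2, h) - u_agent(a, h))
                   - (PHI[(a2, h)] - PHI[(a, h)])) > 1e-9:
                return False
    for a in AGENT_ACTIONS:
        for h, h2 in [("trust", "oversee"), ("oversee", "trust")]:
            if abs((u_human(a, h2) - u_human(a, h))
                   - (PHI[(a, h2)] - PHI[(a, h)])) > 1e-9:
                return False
    return True
```

Because both utilities move in lockstep with the shared potential, any deviation that raises one player's payoff also raises the potential, which is the mechanism behind the alignment guarantee sketched in the abstract.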