Learning a large, variable-size action space for the game of Diplomacy

I am building an OpenAI Gym environment for the board game Diplomacy and training an agent to play it.

In Diplomacy, a player controls many units, and each unit has a number of moves available to it.

Therefore, the player's action space is the Cartesian product of each unit's possible moves, minus the combinations that make no sense together.

What I am doing is constructing a list of all available joint actions for the agent, like so:

(France: FLEET Brest Coast - English Channel, France: TROOP Marseilles - Burgundy, France: TROOP Paris - Burgundy)
(France: FLEET Brest Coast - English Channel, France: TROOP Marseilles - Burgundy, France: TROOP Paris - Brest)
(France: FLEET Brest Coast - English Channel, France: TROOP Marseilles - Burgundy, France: TROOP Paris - Picardy)
(France: FLEET Brest Coast - English Channel, France: TROOP Marseilles - Burgundy, France: TROOP Paris - Gascony)
(France: FLEET Brest Coast - English Channel, France: TROOP Marseilles - Burgundy, France: TROOP Paris - Paris)
(France: FLEET Brest Coast - English Channel, France: TROOP Marseilles - Burgundy, France: TROOP Paris Supports TROOP Marseilles - Burgundy)
(France: FLEET Brest Coast - English Channel, France: TROOP Marseilles - Gascony, France: TROOP Paris - Burgundy)
(France: FLEET Brest Coast - English Channel, France: TROOP Marseilles - Gascony, France: TROOP Paris - Brest)
(France: FLEET Brest Coast - English Channel, France: TROOP Marseilles - Gascony, France: TROOP Paris - Picardy)
(France: FLEET Brest Coast - English Channel, France: TROOP Marseilles - Gascony, France: TROOP Paris - Gascony)
(France: FLEET Brest Coast - English Channel, France: TROOP Marseilles - Gascony, France: TROOP Paris - Paris)
(France: FLEET Brest Coast - English Channel, France: TROOP Marseilles - Gascony, France: TROOP Paris Supports TROOP Marseilles - Gascony)
(France: FLEET Brest Coast - English Channel, France: TROOP Marseilles - Spain, France: TROOP Paris - Burgundy)
... (many hundreds more)
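For context, the enumeration is essentially a filtered Cartesian product. A minimal sketch of how such a list can be built (legal_moves, is_consistent and my_units are placeholders standing in for my actual game logic, not real library calls):

    from itertools import product

    def build_action_list(units):
        # One list of candidate orders per unit (placeholder helper)
        per_unit_moves = [legal_moves(unit) for unit in units]
        # Joint actions = Cartesian product of per-unit orders,
        # keeping only combinations that make sense together
        return [combo for combo in product(*per_unit_moves)
                if is_consistent(combo)]

    all_possible_actions = build_action_list(my_units)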

I then set the action space to be larger than the longest such list I have ever encountered; it is presently 2024.

Previously, I encoded this entire list inside the observation space, but that slowed down learning substantially: I get orders of magnitude higher FPS when the list is not part of the observation.

The agent selects an action, which is simply an index into this list.

If the index is too large, I wrap it around:

    action = action % len(all_possible_actions)

so that the agent always ends up selecting a valid action.
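Put together, the relevant part of my environment looks roughly like the sketch below (DiplomacyEnv, build_action_list and submit_orders are placeholder names for my own code; only gym.Env and spaces.Discrete come from Gym):

    import gym
    from gym import spaces

    class DiplomacyEnv(gym.Env):
        def __init__(self):
            # Fixed upper bound on the length of any action list seen so far
            self.action_space = spaces.Discrete(2024)

        def step(self, action):
            all_possible_actions = self.build_action_list()  # rebuilt every turn
            # Wrap oversized indices so every choice maps to some legal joint order
            orders = all_possible_actions[action % len(all_possible_actions)]
            self.submit_orders(orders)
            # ... resolve the turn, then return (observation, reward, done, info)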

My thought is that the agent will learn the structure of this list (the list is a deterministic function of the game state, so the agent should be able to learn some internal representation of it).

However, my impression is that forcing the agent to learn how this list is produced, how it is indexed, and the rules of the game all at once will substantially slow down learning.

Has anybody dealt with a similar problem and solved it? What would be a good way to give "hints" to the network regarding the structure of this list without exploding the observation space?

Topic: openai-gym, tensorflow, reinforcement-learning, python

Category: Data Science
