MARL Toolbox v.1.0 by Lucian Busoniu
This chapter walks through the steps necessary for creating a set of agents and a world, running learning with these agents in that world, and replaying the behaviour the agents have learned. The agents will be Q-learners, and the world a 5x5 gridworld with two obstacles, as in Figure 1.
The code for this example can be found in the learndemo script.
The agent constructor is described in the chapter on agent behaviour; check the agent help for a reminder. We construct the red agent first. It is initially placed at coordinates 1,1 and wants to get to its goal in cell 5,5. As noted in the tasks chapter, agents can take 5 distinct actions in a gridworld: left, right, up, down, and stay put. The agent is a Q-learner that uses greedy action selection and random exploration. Thus, the red agent is constructed by the Matlab command:
>> red = agent(1, [1 1], [5 5], 5, 'plainq', 'greedyact', 'randomexplore');
The green agent is initially placed at coordinates 5,1 and wants to get to its goal in cell 1,5. Note the ID of the green agent must differ from that of the red agent. The other arguments remain unchanged:
>> green = agent(2, [5 1], [1 5], 5, 'plainq', 'greedyact', 'randomexplore');
We now place the two agents in a cell array:
>> agents = {red, green};
We create a gridworld named "Square". Its size is 5x5 cells, and the obstacles are placed at coordinates 3,1 and 3,5. We leave the reward magnitudes at their default values. The gridworld structure can then be created with the command:
>> gw = gridworld(agents, 'Square', [5 5], [3 1; 3 5]);
The learning mechanism is implemented by the learn function. The signature of this function is:
[world, agents, stats] = learn(world, agents, learnparam, agentlearnparam, freezemask);
with the input arguments:
world
- the world where learning takes place.
agents
- the same cell array of agents with which the world was constructed.
learnparam
- a structure containing configuration parameters for the learning process.
agentlearnparam
- contains the learning parameters that are to be fed to the agents. It may be either a single structure, when the learning parameters are common to all the agents, or a cell array of structures with the same length as the number of agents, in which case each agent has its own set of learning parameters.
freezemask
- a boolean vector with the same length as the number of agents. Agents for which the corresponding elements of the freeze mask are set are not allowed to learn; they are "frozen". This parameter is optional; by default, all agents are allowed to learn. (A sketch illustrating per-agent learning parameters and the freeze mask is given after the output list below.)
and the outputs:
world
- the possibly altered world.
agents
- the agents, with the knowledge gained during the learning process.
stats
- the learning statistics.
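As a quick illustration of the optional arguments, the following sketch (with hypothetical parameter values; the fields of these structures are explained below) passes a separate learning parameter structure to each agent and freezes the green agent, so that only the red agent learns:
>> lp.trials = 100; lp.maxiter = 300; lp.convgoal = .25; lp.avgtrials = 10;
>> alp1.alpha = .3; alp1.gamma = .95; alp1.lambda = .5; alp1.epsilon = .5;
>> alp2 = alp1; alp2.epsilon = .2;   % e.g., let the green agent explore less
>> [gw, agents, stats] = learn(gw, agents, lp, {alp1, alp2}, [0 1]);   % [0 1] freezes the green agent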
The learn function is generic and does not depend on the episodic character of the task; however, only episodic_learning is currently implemented. It requires the following fields on the learnparam structure:
trials
- how many trials to allow learning to run at most.
maxiter
- how many iterations to allow a trial to run at most.
convgoal
- the convergence goal. Convergence is verified by computing the standard deviation over the last few trials. If this deviation falls below the convgoal threshold, the mechanism considers that convergence has been achieved and learning is stopped.
avgtrials
- how many of the most recent trials to consider when checking for convergence.
The stats output structure contains the following fields:
trials
- how many trials were run until convergence (or until the limit was reached).
iter
- how many iterations each trial took to complete (or until the limit was reached).
In the gridworld example, we set up the learning process in the following way: maximum 100 trials, with maximum 300 iterations per trial, convergence goal 0.25, averaging over the last 10 trials. We set up the learning parameters of the Q-learning agents as follows:
>> lp.trials = 100; lp.maxiter = 300; lp.convgoal = .25; lp.avgtrials = 10;
>> alp.alpha = .3; alp.gamma = .95; alp.lambda = .5; alp.epsilon = .5;
In order to see what is happening during learning, a final preparation step is necessary: displaying the gridworld view.
>> gw = gridworld_display(gw);
Finally, learning can be run:
>> [gw, agents, stats] = learn(gw, agents, lp, alp);
The agents converge after about 15 trials. The convergence may be examined by plotting the learning statistics:
>> figure; stairs(1:stats.trials, stats.iter); grid;
The resulting convergence plot should look similar to that presented in Figure 2.
Once the agents have learned, their behaviour can be examined by putting them in the world and letting them act. This is called replaying in the toolbox terminology, and the replaying mechanism is implemented by the replay function. The signature of this function is:
[world, agents] = replay(world, agents, speed, maxsteps);
with the input arguments:
world
- the world where the agents live.
agents
- the agents that have learned.
speed
- (optional) the speed of the replay. Ranges from 1 to 10; 1 is approximately one second per iteration, 10 is full speed. Defaults to 7.
maxsteps
- (optional) how many steps the policy is allowed to run at most. Defaults to 200 steps. May be -1 ('run forever'). (An example using the optional arguments is given at the end of this section.)
and the outputs:
world
- the possibly altered world.
agents
- the possibly altered agents.
Note that, in order for replaying to make sense, the world must implement a visual representation.
The replay function is generic and does not depend on the episodic character of the task; however, only episodic_replay is currently implemented. In addition to the maximum iteration count, the episodic replay function also checks for agents getting stuck and stops the replay in this case. Agents get stuck if they have learned bad policies and keep taking actions that do not succeed in altering the state of the world. The stuck check only works in environments where the state views of the agents do not include random components.
In the gridworld example, the behaviour of the two agents can be replayed with the Matlab command:
>> [gw, agents] = replay(gw, agents);
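To watch the replay more slowly and cap its length, the optional speed and maxsteps arguments can be supplied explicitly; the values below are merely illustrative:
>> [gw, agents] = replay(gw, agents, 3, 50);   % slower replay, at most 50 steps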