
3. Running learning and replaying learned behaviour

This chapter walks through the steps needed to create a set of agents and a world, run learning with these agents in this world, and replay the behaviour the agents have learned. The agents will be Q-learners, and the world a 5x5 gridworld with two obstacles, as shown in Figure 1.

Figure 1. Gridworld example

The code for this example can be found in the learndemo script.

Jump to section:

  1. Creating agents
  2. Creating the world
  3. Running learning
  4. Replaying the learned behaviour

Creating agents

The agent constructor is described in the chapter on agent behaviour; check the agent help for a reminder. We construct the red agent first. It is initially placed at coordinates 1,1 and wants to reach its goal in cell 5,5. As noted in the tasks chapter, agents can take five distinct actions in a gridworld: left, right, up, down, and stay put. The agent is a Q-learner, uses greedy action selection, and explores randomly. Thus, the red agent is constructed with the Matlab command:

>> red = agent(1, [1 1], [5 5], 5, 'plainq', 'greedyact', 'randomexplore');

The green agent is initially placed at coordinates 5,1 and wants to reach its goal in cell 1,5. Note that the ID of the green agent must differ from that of the red agent. The other arguments remain unchanged:

>> green = agent(2, [5 1], [1 5], 5, 'plainq', 'greedyact', 'randomexplore');

We now place the two agents in a cell array:

>> agents = {red, green};
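
For a larger number of agents, the same constructor calls can be written as a loop. The snippet below is only a convenience sketch using the constructor arguments shown above, with the start and goal cells collected in cell arrays:

>> starts = {[1 1], [5 1]}; goals = {[5 5], [1 5]};
>> for i = 1:2, agents{i} = agent(i, starts{i}, goals{i}, 5, 'plainq', 'greedyact', 'randomexplore'); end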

Creating the world

We create a gridworld named "Square". Its size is 5x5 cells, and the obstacles are placed at coordinates 3,1 and 3,5. We leave the reward magnitudes at their default values. The gridworld structure can then be created with the command:

>> gw = gridworld(agents, 'Square', [5 5], [3 1; 3 5]);
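
The obstacle cells should presumably be free cells, i.e. distinct from the agents' start and goal cells. A quick sanity check for the coordinates used above (purely illustrative, not part of the toolbox):

>> obstacles = [3 1; 3 5];
>> cells = [1 1; 5 5; 5 1; 1 5];                    % start and goal cells of the two agents
>> assert(~any(ismember(obstacles, cells, 'rows'))) % no obstacle overlaps a start or goal cell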

Running learning

The learning mechanism is implemented by the learn function. The signature of this function is:

[world, agents, stats] = learn(world, agents, learnparam, agentlearnparam, freezemask);

with the arguments:

  * world: the world structure in which the agents learn (here, the gridworld)
  * agents: the cell array of agents
  * learnparam: a structure with the parameters of the learning process (described below)
  * agentlearnparam: a structure with the learning parameters of the agents
  * freezemask: optional; not used in this example

The function returns:

  * the updated world and agents
  * a structure with learning statistics (described below)

The learn function is generic and does not depend on the episodic character of the task; however, only episodic_learning is currently implemented. It requires the following fields on the learnparam structure:

  * trials: the maximum number of learning trials
  * maxiter: the maximum number of iterations per trial
  * convgoal: the convergence goal
  * avgtrials: the number of most recent trials over which convergence is assessed

Episodic learning returns a statistics structure with two fields:

  * trials: the number of trials actually executed
  * iter: a vector with the number of iterations taken in each trial

In the gridworld example, we set up the learning process in the following way: at most 100 trials, with at most 300 iterations per trial, a convergence goal of 0.25, and averaging over the last 10 trials. We set up the learning parameters of the Q-learning agents as follows:

>> lp.trials = 100; lp.maxiter = 300; lp.convgoal = .25; lp.avgtrials = 10;
>> alp.alpha = .3; alp.gamma = .95; alp.lambda = .5; alp.epsilon = .5;
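
Equivalently, the two parameter structures can be built with single struct calls:

>> lp = struct('trials', 100, 'maxiter', 300, 'convgoal', 0.25, 'avgtrials', 10);
>> alp = struct('alpha', 0.3, 'gamma', 0.95, 'lambda', 0.5, 'epsilon', 0.5);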

To see what is happening during learning, a final preparation step is necessary: showing the gridworld view.

>> gw = gridworld_display(gw);

Finally, learning can be run:

>> [gw, agents, stats] = learn(gw, agents, lp, alp);

The agents converge after about 15 trials. The convergence may be examined by plotting the learning statistics:

>> figure; stairs(1:stats.trials, stats.iter); grid;
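
Axis labels and a title can be added to make the plot easier to read:

>> xlabel('Trial'); ylabel('Iterations per trial'); title('Learning convergence');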

The resulting convergence plot should look similar to that presented in Figure 2.

Figure 2. Example convergence plot

Replaying the learned behaviour

Once the agents have learned, their behaviour can be examined by putting them in the world and letting them act. This is called replaying in the toolbox terminology, and the replaying mechanism is implemented by the replay function. The signature of this function is:

[world, agents] = replay(world, agents, speed, maxsteps);

with the arguments:

  * world: the world structure in which the agents act
  * agents: the cell array of agents
  * speed: optional; controls the speed of the replay
  * maxsteps: optional; the maximum number of replay steps

The function returns:

  * the updated world and agents

Note that, in order for replaying to make sense, the world must implement a visual representation.

The replay function is generic and does not depend on the episodic character of the task; however, only episodic_replay is currently implemented. In addition to enforcing the maximum step count, the episodic replay function also checks whether the agents have become stuck, and stops the replay in that case. Agents get stuck when they have learned bad policies and keep taking actions that fail to alter the state of the world. The stuck check only works in environments where the state views of the agents do not include random components.
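
The following is only a conceptual sketch of such a stuck check, not the toolbox implementation; the accessor world_state, the step function world_step, and the threshold stucklimit are hypothetical placeholders, and maxsteps stands for the maximum step count passed to replay:

% Conceptual sketch of the stuck check (not toolbox code):
% stop the replay early when the world state stops changing.
stucklimit = 3;                      % hypothetical: unchanged steps before declaring the agents stuck
unchanged = 0;
prevstate = [];
for k = 1:maxsteps
    state = world_state(gw);         % hypothetical accessor for the current world state
    if isequal(state, prevstate)
        unchanged = unchanged + 1;
        if unchanged >= stucklimit
            break;                   % agents are stuck; stop the replay
        end
    else
        unchanged = 0;
    end
    prevstate = state;
    gw = world_step(gw, agents);     % hypothetical: let the agents act for one step
end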

In the gridworld example, the behaviour of the two agents can be replayed with the Matlab command:

>> [gw, agents] = replay(gw, agents);
