
1. Defining agent behaviour

The behaviour of a reinforcement learning agent can be divided into three components:

  1. Learning algorithm, by which the agent updates its knowledge on the basis of experience in the world.
  2. Action selection algorithm, which dictates how the agent selects its actions on the basis of its knowledge.
  3. Exploration strategy, which alters the action selected by the mechanism above with the purpose of gathering new information from the world.

The last two components together are referred to in the literature as the policy of the agent. They are kept separate here because two agents using the same action selection procedure might use different exploration strategies on top of it.

The implementation takes care of properly sequencing these components. In order to completely specify the agent behaviour, three functions must be provided, corresponding to the three components. These functions can be implemented separately; however, learning and action selection must agree on the data structures they use in common. In addition, a learning initialization function must be provided, which is responsible for setting up the data structures required by the learning algorithm.
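For orientation, these four behaviour functions follow the signatures detailed in the sections below:

a = lrn_init(a, extraparam);
a = lrn(a, state, action, reward, params);
[a, action] = act(a, state, params);
[a, action] = explore(a, action, state, params);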

The signature of the agent constructor is the following:

a = agent(id, state, goal, actionscount, learnfun, actfun, explorefun);

The input arguments are: id, a numeric identifier of the agent; state, the agent's initial state; goal, the agent's goal; actionscount, the number of actions available to the agent; and learnfun, actfun, and explorefun, the names of the learning, action selection, and exploration functions, respectively.

Detailed information about how the agent behaviour mechanism works can be found in the documentation of agent, agent_learn, and agent_act. A few important pieces of information are maintained by the mechanism, among them the indices map of the agent (a.indices, see agent_initlearn), the agent's previous state (a.state, stored by agent_act), the exploring flag set by the exploration function, and the list of volatile fields (see agent_cleanup). These fields appear in the example functions below.

A set of templates is provided that demonstrates the required signature and functionality of the four behaviour functions: lrn, lrn_init, act, and explore. As an example, the implementation of the Q-learning algorithm (Watkins and Dayan, 1992) with an epsilon-greedy policy is presented.

Jump to section:

  1. Learning and learning initialization
  2. Action selection
  3. Exploration strategy
  4. References

Learning and learning initialization

At each learning iteration, the reinforcement learner receives the changed world state, the joint action that resulted in this state change, and the joint reward the agents received. These quantities may be noisy or incomplete. The signature of the learning function is:

a = lrn(a, state, action, reward, params);

The params argument is a structure used by the learning mechanism to signal various events, such as the beginning of a new trial in episodic environments.
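For illustration, at the start of a new episode the params structure might look like the sketch below; this is illustrative only, as the structure is built internally by the toolbox:

% illustrative only: the toolbox builds this structure internally
params = struct('newtrial', 1);    % a new trial (episode) is just starting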

The signature of the learning initialization function is:

a = lrn_init(a, extraparam);

The extra parameters are used to transmit useful information to the initialization procedure, e.g., the size of the world's state space. They will contain a flag named episodic, signaling whether the task is episodic.

The Q-learning algorithm chosen as an example maintains a table of so-called Q-values, indexed on the agent's state and action: Q(s, a). At each step, this Q-table is updated by:

Q(s, a) = Q(s, a) + alpha·[r + gamma·max_{a'} Q(s', a') - Q(s, a)]

with s the previous state, s' the new state, a the action taken in s, and alpha and gamma learning parameters. The algorithm learns faster when an eligibility trace is used. In this case, more Q-values are updated simultaneously, with the update step alpha scaled for each value by an eligibility weight. The eligibility trace, like the Q-table, is indexed on state and action, E(s, a). The complete formulae are beyond the scope of this document; see (Sutton and Barto, 1998).
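For orientation only, the trace-based update actually implemented by the learning code below can be sketched as follows (Watkins-style, with a replacing trace that is cut on exploratory actions):

delta = r + gamma·max_{a'} Q(s', a') - Q(s, a)
E(s, a) = 1
Q = Q + alpha·delta·E        (applied to every state-action entry)
E = gamma·lambda·E           if the action was greedy; E = 0 otherwise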

Let the Q-learning function be named plainq. Following the naming convention required by the toolbox, the initialization function must be named plainq_init. During initialization, the Q-table and the eligibility trace must be created. If the size of the world's state space is provided to the agent, in the statespacesize field of the parameters, it is used to initialize the tables to their correct sizes; otherwise, they are initialized to a default size. To make the learning updates efficient, flat vector representations are used for both tables, the tables are indexed first by action and then by state, and their (identical) size is cached in the qsize field of the agent.

function a = plainq_init(a, extraparam)
if isfield(extraparam, 'statespacesize') && ~isempty(extraparam.statespacesize),
    % we only care about the agent's state space size
    % we find it out by using the indices map of the agent -
    % see agent_initlearn for more information on this map
    % the action dimension is the first in the tables
    a.qsize = [a.actionscount extraparam.statespacesize(a.indices(1) : a.indices(2))];
else
    % make a guess on the state space size
    a.qsize = [a.actionscount 50 * ones(1, a.indices(2) - a.indices(1) + 1)];
end;
% a "flat" vector representation of the Q-table
a.Q = zeros(prod(a.qsize), 1);
% the eligibility trace has the same shape
a.E = a.Q;
% the eligibility trace is a volatile field, i.e., it is only used during learning
% see agent_cleanup for information on volatile fields
a.volatile = {'E'};
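As a quick illustration, plainq_init could be called directly with a hand-made extraparam structure. The field values, as well as the a.actionscount and a.indices entries below, are hypothetical; the toolbox normally fills them in:

% hypothetical agent fields, normally set up by the toolbox
a.actionscount = 4;                     % four actions
a.indices = [1 2];                      % the agent's state spans world state dimensions 1..2 (assumed)
% illustrative extra parameters: a 10 x 10 world state space, episodic task
extraparam = struct('statespacesize', [10 10], 'episodic', 1);
a = plainq_init(a, extraparam);         % a.qsize becomes [4 10 10]; a.Q and a.E are 400 x 1 vectors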

The learning function implements the update equation of the Q-table and the eligibility trace.

function a = plainq(a, state, action, reward, params)
% the "episodic" field signals whether the task is episodic;
% "newtrial" signals, in an episodic task, if a new episode is just starting
if a.learnparam.episodic && params.newtrial,
    a.E = 0 * a.Q;                      % clear eligibility trace
end;
% linearize multidimensional indices to point into the flat representation of the tables
ix = ndi2lin([1 state(a.indices(1) : a.indices(2))], a.qsize);      % next state
qix = ndi2lin([action(a.indices(5)) a.state], a.qsize);             % previous state
% note the use of the "a.state" field, where agent_act has stored the previous agent state
% perform the updates
delta = reward(a.indices(6)) + a.learnparam.gamma * max(a.Q(ix : ix + a.actionscount - 1)) - a.Q(qix);
a.E(qix) = 1;
a.Q = a.Q + a.learnparam.alpha * delta * a.E;
a.E = a.learnparam.gamma * a.learnparam.lambda * a.E * ~a.exploring;
% the "exploring" field was set by the exploration function if it altered the action choice
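The flat indexing may look opaque at first. The sketch below illustrates the assumed layout: with the action dimension first, the Q-values of all actions in a given state occupy a contiguous block of the flat vector, which is why a.Q(ix : ix + a.actionscount - 1) collects exactly the action values of one state. The sizes are hypothetical and ndi2lin itself is not used here:

% hypothetical sizes: 4 actions, a 3 x 3 agent state space
qsize = [4 3 3];
Q = zeros(prod(qsize), 1);              % 36 entries in total
% the 4 Q-values of each state form one block of the flat vector:
% entries 1..4 belong to one state, 5..8 to the next, and so on
ix = 5;                                 % start of the block for a neighbouring state
stateblock = Q(ix : ix + qsize(1) - 1); % the 4 Q-values of that state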

Action selection

The agent chooses actions on the basis of its current state. As before, a set of parameters is also supplied to the action selection function. The signature of the action function is:

[a, action] = act(a, state, params);

In Q-learning, the agent always chooses the action that corresponds to the best Q-value in the current state:

a = arg max_{a'} Q(s, a')

If several actions share the best Q-value, one of them is chosen randomly. This is called a greedy policy. It is implemented in a function called greedyact.

function [a, action] = greedyact(a, state, params)
% compute a vector of indices spanning the positions corresponding to all actions in the current state
ix = ndi2lin([1 state(a.indices(1) : a.indices(2))], a.qsize);
ix = ix : ix + a.actionscount - 1;
% find the maximum, if not unique break ties randomly
topactions = find(a.Q(ix) == max(a.Q(ix)));
action = topactions(ceil(rand * length(topactions)));

Exploration strategy

The exploration strategy of the agent takes the action chosen by the action selection mechanism and may alter it in order to gather more information about the task. It may discriminate between states in doing so. The signature of the exploration function is:

[a, action] = explore(a, action, state, params);

The exploration function is required to support two special modes, signaled by a mode field on the parameters. The first special mode is 'init', meaning that the exploration behaviour of the agent is being initialized. Typically, initial exploration settings are saved at this point. The second special mode is 'reset', meaning that the agent should fall back to its initial exploratory behaviour. This mode should also be able to perform a partial reset, i.e., given a weight between 0 and 1, the agent should manifest a bias towards exploration equal only to the weight times the initial bias. The weight is supplied in the weight field of the parameters. If no value is supplied, the weight should be assumed equal to 1.

The reason these special modes are not handled by a separate function, as is done for the learning algorithm, is that initializing and resetting exploration are typically simple procedures. In contrast, initializing the learning data structures can be quite involved for some algorithms.

As an example, the simplest and most widely used exploration strategy is implemented. This strategy chooses a random action with probability epsilon, and leaves the original action unchanged with probability (1 - epsilon). The exploration probability epsilon is a learning parameter and decays over time, e.g., with the number of trials. The function is named randomexplore; its implementation is:

function [a, action] = randomexplore(a, action, state, params)
% check for special modes
if isfield(params, 'mode'),
    switch params.mode,
        case 'init',
            % save the exploration probability
            a.learnparam.epsilon0 = a.learnparam.epsilon;
            return;
        case 'reset',
            if isfield(params, 'weight'), w = params.weight;
            else w = 1; end;
            % reset with the given weight
            a.learnparam.epsilon = w * a.learnparam.epsilon0;
            return;
    end;
end;
% alter action with prob. epsilon, setting the "exploring" flag if done so
if rand < a.learnparam.epsilon,
    action = ceil(rand * a.actionscount);
    a.exploring = 1;
else
    a.exploring = 0;
end;
% decay exploration probability either with a decay ratio, if supplied
% on the learning parameters, or else with the number of trials
if isfield(a.learnparam, 'epsilondecay'),
    a.learnparam.epsilon = a.learnparam.epsilon * a.learnparam.epsilondecay;
elseif params.newtrial,
    a.learnparam.epsilon = a.learnparam.epsilon / a.trialsk;
end;
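For illustration, the special modes could be exercised directly as follows; in normal use the toolbox issues these calls itself, and the field values below are hypothetical:

% hypothetical agent fields, normally configured by the toolbox
a.actionscount = 4;
a.learnparam.epsilon = 0.3;
% 'init' mode: remember the initial exploration probability
a = randomexplore(a, 1, [], struct('mode', 'init'));
% ... epsilon has meanwhile decayed during learning ...
a.learnparam.epsilon = 0.05;
% 'reset' mode with weight 0.5: epsilon becomes 0.5 * 0.3 = 0.15
a = randomexplore(a, 1, [], struct('mode', 'reset', 'weight', 0.5));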

At this point we can create an agent that exhibits the defined behaviour by calling the constructor with the learning, action selection, and exploration function names (the first four arguments are placeholders for now):

a2 = agent(1, [0 0], [1 1], 4, 'plainq', 'greedyact', 'randomexplore');

The parameters required by the agent's behaviour (in the example alpha, gamma, epsilon, lambda, and optionally epsilondecay) may be provided to the agents via the learn function, which will then take care of properly initializing the agents.
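For instance, a parameter structure for the behaviour defined above might look like the following sketch; the values are purely illustrative:

% illustrative parameter values for plainq / greedyact / randomexplore
learnparam = struct('alpha', 0.3, 'gamma', 0.95, 'lambda', 0.5, ...
    'epsilon', 0.3, 'epsilondecay', 0.99);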

References

Sutton, R. S., and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.

Watkins, C. J. C. H., and Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4), 279-292.
