MARL Toolbox v.1.0 by Lucian Busoniu
The structure of a multiagent reinforcement learning task is specified by two elements: a transition function, which describes how the state of the world changes as a result of the agents' actions, and a reward function, which describes how the reward signals fed to the agents are constructed. The MARL toolbox requires additional behaviour from a world implementation, such as an interface for resetting the task to its initial state. The transition and reward models need not be fully specified from the start as closed-form mathematical functions; instead, new states and rewards can be computed on the fly by a Matlab function. Many tasks for which the mathematical functions would be very difficult to write down can easily be described in this fashion.
The pieces of functionality that a task (called a "world" in the terminology of the toolbox) needs to provide are listed below. Each of them must be implemented in a separate Matlab function, whose name is obtained by appending a specific suffix to the world name chosen by the programmer. Details for each function are given in dedicated sections of this page.
- Construction (suffix '' - empty).
- Evolution (suffix '_advance').
- Reset (suffix '_reset').
- Index maps generation (suffix '_indexmaps').
- Display (suffix '_display').
- Cleanup (suffix '_destroy').
The learning mechanism takes care of calling the above components in the correct sequence. A set of templates is provided describing their required format and functionality: world, world_advance, world_reset, world_indexmaps, world_display, and world_destroy.
The implementation of a two-dimensional gridworld is presented as an example. Gridworlds are popular test beds for MARL algorithms; they are abstractions of important real domains such as robotic navigation. The gridworld is a rectangular two-dimensional surface divided into square cells of three types: empty, obstacle, and goal cells. Agents can only move one cell at a time in the four compass directions, or stand still. Each agent has a distinct goal cell; agents can only travel through empty cells, at most one agent may occupy a given cell at a given time, and agents cannot swap places. The gridworld is an episodic task. A learning episode starts with the agents in their initial positions and ends when all agents have reached their corresponding goal cells (they may reach them at different moments in time). Typically, agents are rewarded positively when reaching their goal, negatively when they bump into walls or into each other, and zero or a small negative amount in other situations.
An example of a gridworld is given in the figure below. Agents are coloured discs, their corresponding goals are identically coloured diamonds, and grey discs are obstacle cells.
We proceed now to explain step by step how the six components are implemented to generate the functionality of the gridworld. We focus on general aspects rather than on the implementation details specific to the gridworld, which can be quite complex at some points. For the implementation details, please inspect the commented code of the functions.
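Before going into each component, the following minimal sketch (not itself part of the toolbox) illustrates the order in which the learning mechanism invokes the six components, using the gridworld developed below. The agent structures are placeholders containing only the fields that the gridworld constructor reads (id, state, goal); in actual use, agents are created through the toolbox's agent constructors, and an episode runs until all agents reach their goals rather than for a single hand-picked step.
% placeholder agents with only the fields read by the gridworld constructor
agents = {struct('id', 1, 'state', [1 1], 'goal', [10 10]), ...
          struct('id', 2, 'state', [10 1], 'goal', [1 10])};

gw = gridworld(agents, 'demo', [10 10]);        % construction (empty suffix)
[imaps, ssize] = gridworld_indexmaps(gw);       % index maps ('_indexmaps')
[gw, stateviews] = gridworld_reset(gw);         % reset ('_reset')
actions = [2 4];                                % agent 1 moves right, agent 2 moves down
[gw, stateviews, actionviews, rewardviews] = gridworld_advance(gw, actions);   % evolution ('_advance')
gw = gridworld_display(gw, 0);                  % hide the graphical view ('_display')
gw = gridworld_destroy(gw);                     % cleanup ('_destroy')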
Each world must take into account the set of agents it hosts. This must be given, as a cell array of agent structures, as the first argument of the world construction function. Other arguments will typically be required to construct the specific type of world; no restrictions are imposed on these arguments. The standard suffix for the constructor is empty. The signature of the constructor function is thus:
w = world(agents, ...);
The constructor is required to store the following pieces of information in fields of the world structure:
- type. This string is used by the learning mechanism to form the names of the functions implementing the six components.
- tasktype. This should be either 'episodic' or 'continuing'. Note that currently only episodic tasks are supported by the toolbox.
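As a minimal skeleton (for a hypothetical world type called myworld; everything beyond these two fields is up to the implementer), a constructor satisfying these requirements could look like:
function w = myworld(agents, varargin)
% skeleton constructor for a hypothetical world type 'myworld'
w.type = 'myworld';         % used to form the names myworld_advance, myworld_reset, and so on
w.tasktype = 'episodic';    % only 'episodic' is currently supported by the toolbox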
To define a gridworld, its size and the coordinates of the obstacles inside it must be specified. The starting and goal positions of the agents are obtained from the agents themselves, so they do not need to be supplied as arguments. Furthermore, we choose to give each gridworld a name, and to allow the user to customize the magnitude of the rewards the agents receive in the three standard situations: an ordinary move, a bump, and reaching the goal.
We choose the natural name gridworld for the world type. A possible signature for the gridworld constructor is then:
gw = gridworld(agents, name, worldsize, obstacles, rewards);
A graphical representation similar to that in Figure 1 is developed for the gridworld. This representation is implemented separately from the gridworld, in the gridworldview structure; though not mandatory, this separation is recommended when the visual representation of the world is complex.
The main data structure used to implement the gridworld's internal state is a two-dimensional table with one element per gridworld cell. Elements store numerical codes identifying each type of cell. Agent IDs move around in this table as the agents move through the gridworld cells. The code implementing the constructor follows. If the constructor receives the string 'user' instead of the obstacle coordinates, the user can place obstacles interactively with the mouse (see gridworld_obstacles).
function gw = gridworld(agents, name, worldsize, obstacles, rewards)

REQ_ARG_COUNT = 3;
DEFAULT_REWARDS = [-1 -2 10];

% verification of required arguments
if nargin < REQ_ARG_COUNT, error('Incorrect number of inputs'); end;

% world constants initialization
gw.name = name;
gw.size = worldsize;
% action displacements in the 2D space of the grid - left, right, up, down,
% stay put
gw.actoffsets = [-1 0; 1 0; 0 -1; 0 1; 0 0]';
% grid world constants
gw.tasktype = 'episodic';
gw.type = 'gridworld';
gw.viewtype = 'gridworldview';

% handle variable argument lists
if nargin < REQ_ARG_COUNT + 1, obstacles = []; end;
if nargin < REQ_ARG_COUNT + 2, gw.rewards = DEFAULT_REWARDS;
else gw.rewards = rewards;
end;

gw.grid = zeros(worldsize);
gw.acount = length(agents);
% maintain a shortcut to the world state: agent positions on columns
gw.state = zeros([2 gw.acount]);
% agent ids map specifies the id for an agent on a given position in the
% state matrix
gw.aidmap = zeros([gw.acount 1]);
for i = 1:gw.acount,
    a = agents{i};
    gw.grid(a.state(1), a.state(2)) = a.id;
    gw.grid(a.goal(1), a.goal(2)) = 1000 + a.id;
    gw.state(:, i) = a.state';
    gw.aidmap(i) = a.id;
end;
% this is the initial state
gw.initialstate = gw.state;

% place any specified obstacles
if isnumeric(obstacles) && ~isempty(obstacles),
    for i = 1 : size(obstacles, 1),
        gw.grid(obstacles(i, 1), obstacles(i, 2)) = -1;
    end;
end;

gw.view = gridworldview(gw);

if ischar(obstacles),
    if obstacles == 'user',
        % get obstacles interactively
        gw = gridworld_obstacles(gw);
    end;
end;
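For illustration (the size, obstacle coordinates, and reward magnitudes below are arbitrary), the constructor can be called with an explicit obstacle list, or with the string 'user' to place obstacles with the mouse:
gw = gridworld(agents, 'maze', [8 6], [3 3; 3 4; 3 5], [-1 -5 20]);    % explicit obstacles and rewards
gw = gridworld(agents, 'maze', [8 6], 'user');                         % interactive obstacle placement, default rewards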
The function implementing the evolution of the world must alter the state of the world as a result of the agents' actions, and supply the new state to the agents, together with the associated rewards. To allow for simulating different sensors in the agents, the learning mechanism supplies distinct versions of the world state and rewards to distinct agents. These possibly different versions are called (agent) views in the toolbox terminology. As some learning algorithms require the learners to observe the rewards of other agents, the reward view of an agent must include elements for the other agents' rewards. Because some learning algorithms require observations of all agents' actions, the evolution function must return views over the joint action, as well.
The signature of the evolution function is given below. The view matrices contain one agent view per row, with the nth row corresponding to the nth agent. The evolution function name is formed from the world type by adding the suffix '_advance'.
[w, stateviews, actionviews, rewardviews] = world_advance(w, actions)
The gridworld evolution function implements the specific transition and reward models for the gridworld. It also updates the graphical representation of the gridworld. The true values of the world state, joint reward, and joint action, are given to all the agents.
function [gw, stateviews, actionviews, rewardviews] = gridworld_advance(gw, actions)

acount = gw.acount;

% find out which agents tried to act
acting = find(actions); still = find(actions == 5);
moved = zeros(1, acount); putback = moved; atgoal = moved;
moved(acting) = 1; moved(still) = 0;

% construct the candidate state
state = gw.state; cstate = state;
for i = acting,
    castate = cstate(:, i) + gw.actoffsets(:, actions(i));     % candidate state for agent
    % agent can only try to move onto own goal, empty space or another agent
    if sum(castate >= [1;1] & castate <= gw.size') == 2,        % check for out-of-grid
        e = gw.grid(castate(1), castate(2));                    % current destination content
        if e >= 0 && (e < 1000 || e == 1000 + gw.aidmap(i)),
            cstate(:, i) = castate;
            atgoal(i) = e > 1000;
        else putback(i) = 1;
        end;
    else putback(i) = 1;
    end;
end;

% send back agents that try to switch positions "passing one through
% another"
movedix = find(moved & ~putback);
for ii = 1 : length(movedix) - 1,
    i = movedix(ii);
    % check whether destination cell holds an agent
    e = gw.grid(cstate(1, i), cstate(2, i));
    if e == 0 || e > 1000, continue; end;
    % if agent, test whether switching positions
    if all(cstate(:, i) == state(:, e)) && all(cstate(:, e) == state(:, i)),
        % put them back
        cstate(:, [i e]) = state(:, [i e]);
        putback([i e]) = 1;
    end;
end;

% send any agents that bumped into each other back to their original positions
i = 1;
while i < acount,
    overlap = find(sum(abs(cstate(:, i+1:acount) - repmat(cstate(:, i), 1, acount - i))) == 0);
    if isempty(overlap), i = i + 1; continue;
    else
        overlap = [0 overlap] + i;                  % current agent also
        cstate(:, overlap) = state(:, overlap);     % send agents back
        putback(overlap) = 1;
        i = 1;                                      % reset search
    end;
end;

% still agents have not been "put back" since they did not depart in the
% first place
putback(still) = 0;
% reset agents were those that were trying to move and were put back
reset = find(moved & putback);
% the ones that departed their original positions were those that weren't put back
departed = find(moved & ~putback);
% agents that reached their goal disappear from the grid
arrived = find(moved & ~putback & ~atgoal);

% change the positions of the moved agents in the grid
grid = gw.grid(:);
if ~isempty(departed),
    grid(sub2ind(gw.size, state(1, departed), state(2, departed))) = 0;
end;
if ~isempty(arrived),
    grid(sub2ind(gw.size, cstate(1, arrived), cstate(2, arrived))) = gw.aidmap(arrived);
end;
gw.grid(:) = grid;
% promote the (now consistent) candidate state
gw.state = cstate;

% distribute rewards to agents
rewards = gw.rewards(1) * ones(1, acount);
rewards(reset) = gw.rewards(2);
rewards(find(atgoal)) = gw.rewards(3);

% generate views for all agents
stateviews = repmat(cstate(:)', acount, 1);
actionviews = repmat(actions, acount, 1);
rewardviews = repmat(rewards, acount, 1);

% repaint or schedule for repaint depending on visibility state
% efficiency speedups are implemented here, see gridworld_display for details
if gridworldview_isvisible(gw.view),    % update view
    xs = [state(1, departed) cstate(1, arrived)];
    ys = [state(2, departed), cstate(2, arrived)];
    gw = gridworld_display(gw, 1, [min(xs) max(xs) min(ys) max(ys)]);
else gw.view.tainted = 1;
end;
The reset function brings the world's state back to its initial value. The reset function must report this state to the agents, using a views mechanism similar to that described above. Any side-effects of the reset, e.g., changes in the state of the graphical representation of the world, must also be handled by this function. The reset function name is formed from the world type by adding the suffix '_reset'.
The signature of the reset function is given below. The view matrix contains one agent view per row, with the nth row corresponding to the nth agent.
[w, stateviews] = world_reset(w);
The gridworld implementation of this function resets the internal grid to its initial state and updates the gridworld view accordingly. Like the evolution function, it provides the true world state on each agent view.
function [gw, stateviews] = gridworld_reset(gw)

if any(any(gw.state - gw.initialstate)),    % the state has changed
    % flatten grid, remove any agents and place them in their start positions
    grid = gw.grid(:);
    grid(find(grid > 0 & grid < 1000)) = 0;
    grid(sub2ind(gw.size, gw.initialstate(1, :), gw.initialstate(2, :))) = gw.aidmap;
    gw.grid(:) = grid;

    if gridworldview_isvisible(gw.view),    % update view
        xs = [gw.state(1, :) gw.initialstate(1, :)];
        ys = [gw.state(2, :) gw.initialstate(2, :)];
        gw = gridworld_display(gw, 1, [min(xs) max(xs) min(ys) max(ys)]);
    else gw.view.tainted = 1;
    end;

    gw.state = gw.initialstate;
end;

stateviews = repmat(gw.state(:)', gw.acount, 1);
To determine its part of the world state, joint action, and joint reward, each agent uses an index map containing six elements.
The index maps generation function name is formed from the world type by adding the suffix '_indexmaps'. It returns a matrix of index maps, with the nth row corresponding to the nth agent. To help the agents create their learning data structures, it also provides (a view of) the size of the world's state space.
[imaps, size] = world_indexmaps(w);
The gridworld implementation of this function is straightforward. There is no common state, and the agents' states are their two-dimensional coordinates, arranged one after the other in the world state vector.
function [imaps, size] = gridworld_indexmaps(gw)

imaps = [1:2:2*gw.acount-1; 2:2:2*gw.acount; ones([1 gw.acount]); zeros([1 gw.acount]); ...
         1:gw.acount; 1:gw.acount]';
size = repmat(gw.size, 1, gw.acount);
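For instance, for two agents in a 10-by-10 gridworld, the call [imaps, s] = gridworld_indexmaps(gw) yields:
imaps = [1 2 1 0 1 1;       % index map of agent 1: its coordinates are elements 1-2 of the world state vector
         3 4 1 0 2 2];      % index map of agent 2: its coordinates are elements 3-4
s     = [10 10 10 10];      % each element of the state vector ranges over the 10 grid coordinates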
If the world implements a graphical view, it must offer a way to toggle this view on and off. The display function accomplishes this task. It receives an optional flag that indicates the desired state of the view: 1 for "on", 0 for "off". The default is to show the view.
The display function name is formed from the world type by adding the suffix '_display'.
w = world_display(w, visible);
The gridworld implementation of this function, besides hiding and showing the figure that contains the graphical view of the gridworld, also performs the actual update of the figure elements. It implements an efficient repainting mechanism that updates the figure only when the view is actually visible, and then repaints only the rectangular region enclosing the cells altered since the last repaint.
function gw = gridworld_display(gw, visible, update)

% verification of required arguments, defaults for not supplied
REQ_ARG_COUNT = 1;
if nargin < REQ_ARG_COUNT, error('Incorrect number of inputs'); end;
v = gw.view;
if nargin < REQ_ARG_COUNT + 1, visible = 1; end;
if nargin < REQ_ARG_COUNT + 2,
    % update all if tainted and showing
    if visible && v.tainted, update = 'all';
    else update = 0;
    end;
end;

% process update
if update ~= 0,
    if ischar(update), update = [1 v.width 1 v.height]; end;
    for i = update(1):update(2),
        for j = update(3):update(4),
            e = gw.grid(i, j);
            switch gw.grid(i, j),
                case 0,     % empty cell
                    set(v.polys(i, j), 'FaceColor', v.colors.BLANK, 'EdgeColor', v.colors.BLANK);
                case -1,    % obstacle cell
                    set(v.polys(i, j), 'FaceColor', v.colors.OBSTACLE, 'EdgeColor', v.colors.OBSTACLE);
                otherwise,  % agent or goal
                    if e < 1000,    % agent; goals cannot move, so we do not consider them
                        % get the color row containing the agent color
                        colorrow = find(v.colors.AGENTS(:, 1) == e);
                        color = v.colors.AGENTS(colorrow, 2:4);
                        set(v.polys(i, j), 'FaceColor', color, 'EdgeColor', color);
                    end;
            end;            % END cell type switch
        end;                % FOR rows iterator
    end;                    % FOR columns iterator
    v.tainted = 0;          % view is up to date
    gw.view = v;            % store the updated view back into the world structure
end;                        % END process update

% show / hide
if visible,
    set(v.gui, 'Visible', 'on');
    pause(0.001);   % yield some processor for view painting
else set(v.gui, 'Visible', 'off');
end;
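For instance, the view can be hidden while learning runs at full speed, and forced to repaint completely when it is shown again:
gw = gridworld_display(gw, 0);          % hide the view; subsequent advances only mark it as tainted
gw = gridworld_display(gw, 1, 'all');   % show the view again and repaint every cell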
The cleanup function releases any resources that are used by the world and its view. Its name is formed from the world type by adding the suffix '_destroy'.
w = world_destroy(w)
The gridworld implementation of this function closes the figure used by the gridworld view and removes the view field from the world structure.
function gw = gridworld_destroy(gw)

close(gw.view.gui);
gw = rmfield(gw, 'view');   % assign the result so that the field is actually removed