MARL Toolbox v.1.0 by Lucian Busoniu
In addition to the learning and replaying mechanisms, the toolbox provides two functions for running large batch experiments and processing their data: runexp and processexp. An experiment is interpreted here as a set of repetitions of a learning process, where each repetition starts from the same initial configuration. The goal of this repeated execution is to average out the effect of random elements in the agents' algorithms.
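To make the averaging step concrete, here is an illustrative Matlab sketch (not toolbox code; the per-run data and its dimensions are made up for the example) of averaging one learning statistic over repeated runs:
% Illustrative only: average a per-run statistic (e.g., iterations needed
% to finish each trial) over NRUNS independent repetitions, so that the
% randomness of exploration is smoothed out in the resulting curve.
NRUNS = 30; NTRIALS = 80;
iters = 20 + 280 * rand(NRUNS, NTRIALS);   % placeholder per-run data
avgiters = mean(iters, 1);                 % 1-by-NTRIALS averaged learning curve
plot(avgiters); xlabel('trial'); ylabel('average iterations');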
The interface with runexp and processexp is quite complex, and is described in detail in the help text of these functions. We give here only a brief description of the features they offer, and provide an example demonstrating their use.
The code for this example can be found in the batchdemo script.
Experiments can be run using the runexp function. The signature of this function is:
runexp(expconfigs, datafile, datadir, nruns, silent);
A sequence of experiments can be run at once. The experiment configuration data is represented in a cell array, with each cell corresponding to an experiment. For each experiment, the following parameters can be set: the world type and arguments, the learning parameters, the agents' learning parameters, the agent arguments, and an options structure (see the example below). The resulting data are saved in datadir/datafile, and can be later processed using the complementary function processexp. While running, the function outputs progress information at the Matlab console.
The nruns argument sets the default number of runs for experiments that do not set this parameter via their options. The silent argument, if 'on', suppresses all graphical and text output regardless of the experiment option values.
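As a quick orientation before the full example below, the call pattern is sketched here; the data file name 'testbatch' and the run count 25 are placeholder values, and expc stands for the configuration cell array built in the example that follows:
% Call-pattern sketch; 'testbatch' and 25 are placeholders, expc is the
% cell array of experiment configuration structures built below.
runexp(expc, 'testbatch', pwd, 25, 'off');   % save in the current directory, 25 runs by default, output not silenced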
Henceforth, we use the term "batch" to refer to a sequence of experiments as described above. As an example, we design a batch that compares the performance of basic Q-learning and a version of Q-learning that uses a Q-table indexed on the complete world state. The comparison is done by running each algorithm 30 times on the gridworld presented in Figure 1 (the same gridworld as in the previous sections).
The code for setting up and running the batch is given below. Note that all graphical output is suppressed via option settings, which speeds up execution. Even so, be aware that the batch will still take some time to complete.
NRUNS = 30;
expc = {struct(...
    ... % world arguments
    'worldtype', 'gridworld', ...
    'worldargs', {{[], 'Square', [5 5], [3 1; 3 5]}}, ...
    ... % learning parameters and agent learning parameters
    ... % note that learning is not stopped upon convergence
    'lp', struct('trials', 80, 'maxiter', 300, 'convgoal', -1, 'avgtrials', 30), ...
    'alp', struct('alpha', .3, 'gamma', .95, 'lambda', .5, 'epsilon', .3), ...
    ... % agent arguments
    'agentargs', {{...
        {[1 1], [5 5], 5, 'plainq', 'greedyact', 'randomexplore'}, ...
        {[5 1], [1 5], 5, 'plainq', 'greedyact', 'randomexplore'} }}, ...
    ... % options structure
    'options', struct('nruns', NRUNS, 'show', 'off', 'convplot', 'off', 'plotpause', 15, 'label', 'Plain Q') ...
    ), struct(...
    ... % for the second experiment, anything that is not specified
    ... % remains the same as configured for the first experiment
    'lp', struct('trials', 100, 'maxiter', 400, 'convgoal', -1, 'avgtrials', 30), ...
    'agentargs', {{...
        {[1 1], [5 5], 5, 'fullstate_plainq', 'fullstate_greedyact', 'randomexplore'}, ...
        {[5 1], [1 5], 5, 'fullstate_plainq', 'fullstate_greedyact', 'randomexplore'} }}, ...
    'options', struct('nruns', NRUNS, 'show', 'off', 'convplot', 'off', 'plotpause', 15, 'label', 'Complete-state Q') ...
    )};

fname = 'q_fullstateq';   % the data are saved in the current directory
runexp(expc, fname, pwd, NRUNS, 'off');
A sequence of experiments created by runexp can be processed using processexp. This function has the signature:
processexp(datafile, mode, plotcount, plotfields);
The type of information generated by the processing is specified by the mode parameter, which can be one of the following:
mode = 'plot'. The statistics of each experiment, averaged over runs, are plotted.
mode = 'replay'. One run of each experiment is randomly chosen, and the behaviour that the agents learned in that run is replayed. The world must support a visual representation for this option to work.
mode = 'replayeach'. The behaviour that the agents learned in every run of every experiment is replayed. The world must support a visual representation for this option to work. Processing experiments with many runs in this mode can take a long time.
mode = 'manual'. The user interactively selects the processing mode for each experiment, out of the three modes above.
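Summarizing, and assuming a data file produced by runexp (here the fname used in the example above), the four modes are invoked as follows; this is only a usage sketch of the calls described above:
>> processexp(fname, 'plot');        % plot statistics of each experiment, averaged over runs
>> processexp(fname, 'replay');      % replay one randomly chosen run of each experiment
>> processexp(fname, 'replayeach');  % replay every run of every experiment
>> processexp(fname, 'manual');      % choose the processing mode interactively for each experiment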
The 'plot' mode can be further customized by using the plotcount and plotfields arguments:
plotcount indicates how many experiments should be plotted on a single figure. This is useful when the batch is divided into similar sequences, each sequence characterized by certain parameter settings. For example, the two experiments above could be duplicated with a changed value of alpha; in this case, the value of plotcount would be 2, and each generated figure would correspond to a certain setting of alpha (see the call sketch after this list).
plotfields specifies the set of statistics fields that will be plotted. This argument is a cell array of field names. See learn, episodic_learn for information on learning statistics. This option is not currently useful, as only iteration statistics are maintained, but it will become useful as the complexity of the learning statistics increases in future versions of the toolbox.
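As a hedged sketch of the plotcount usage: for the two-experiment batch saved above, setting plotcount to 2 places both experiments on the same figure; in the hypothetical four-experiment batch mentioned above (the two algorithms at two settings of alpha), the same value would yield one figure per alpha setting:
>> processexp(fname, 'plot', 2);   % two experiments per figure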
For example, the following command replays the behaviour of one set of Q-learners, and one set of full-state Q-learners:
>> processexp(fname, 'replay');
To plot the convergence statistics of the two experiments for comparison, we use:
>> processexp(fname, 'plot');
The resulting figure should look similar to Figure 2.