The repository contains useful downloadable material related to my research and teaching, including Matlab software, presentations, and demonstration movies. If an item has a "»" button to its right, clicking it reveals more information; the "«" button then hides this information again (requires JavaScript).
Software

Approximate RL and DP toolbox,
July 2013 release. (13 July 2013, 1.6 MBytes).
»
Since the previous release of the toolbox was getting rather old, I decided to publish a new version. Be warned though: this is very much work in progress, a snapshot of the code that I use for my daily research. So expect undocumented behavior and bugs, but also plenty of new algorithms – hic sunt leones!
New features:
 Online optimistic planning algorithms: for deterministic systems (opd), for discrete Markov decision processes (opss), with continuous actions (sooplp), open-loop optimistic planning (olop), and hierarchical OLOP (holop). The entry point connecting planning to the system is genmpc. OPD and OP-MDP can be used while applying longer sequences of actions / longer tree policies.
 Fitted Q-iteration with local linear regression approximation.
 An extensive mechanism for running batch experiments (testing an algorithm over a grid of parameters and inspecting the results). See the /batch subdirectory, and, as examples, the batch experiment files left in the system directories, such as op_ip.
 A standardized interface for real-time control problems; see e.g. ipsetup_rtproblem. Another example is the implementation for the EdRo robot. Two online RL implementations compatible with this interface are rtapproxqlearn and rtlspionline.
 New simulation tasks, notably a resonating robot arm (where a spring is used to make the motion more energy-efficient) and a simple 2D navigation problem.
 Additional demonstration scripts, including one for planning and another focused on least-squares types of policy iteration.
 For classical, discrete RL: the Monte Carlo and Dyna-Q implementations are new; Q-learning and SARSA now support experience replay. See cleaningrobot_demo for examples. Two new problems: machine replacement (as described by Bertsekas) and gridworld navigation. These are very simple tasks, useful for explaining or experimenting with DP and RL.
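To give a feel for the experience-replay mechanism mentioned for Q-learning and SARSA, here is a minimal sketch (in Python for brevity rather than Matlab; the two-state problem and all names are invented for illustration and are not the toolbox API):

```python
import random

random.seed(0)

# Made-up two-state, two-action deterministic problem (not a toolbox task):
# taking action 1 in state 0 yields reward 1; every other transition yields 0.
def step(s, a):
    if s == 0 and a == 1:
        return 1, 1.0          # next state, reward
    return 0, 0.0

gamma, alpha = 0.9, 0.1
Q = [[0.0, 0.0], [0.0, 0.0]]
replay = []                     # stored transitions (s, a, r, s2)

for t in range(200):
    s, a = random.randint(0, 1), random.randint(0, 1)   # pure exploration
    s2, r = step(s, a)
    replay.append((s, a, r, s2))
    # one fresh Q-learning update, plus a handful of replayed ones
    batch = [replay[-1]] + random.sample(replay, min(5, len(replay)))
    for (bs, ba, br, bs2) in batch:
        Q[bs][ba] += alpha * (br + gamma * max(Q[bs2]) - Q[bs][ba])
```

The point of the replay loop is that each stored transition is reused many times, so far fewer interactions with the system are needed than with plain Q-learning; here Q(0, 1), the only rewarded pair, ends up with the largest value.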
The same standardized task interface is followed as before, with some extensions. Things should be largely backward-compatible with the old version; if you encounter trouble, let me know. I have left functions and scripts for many of the experiments I have run in the system directories, in case they are useful. See also the description and documentation of the previous version of the toolbox.
I have also included code related to my recent forays into cooperative multi-agent control:
 Multi-agent planning (magenmpc, maopd_...), with a specific focus on consensus problems (although generalizable). Multi-agent tasks: linear agents and robot-arm agents. Standard linear consensus and flocking protocols. See the paper OP for Consensus.
 Multi-agent consensus using optimistic optimization (ooconsensus), and, as a side benefit, the DOO and SOO algorithms of Remi Munos (doosoo).
«

Optimistic planning,
a selection of algorithms as a standalone package. (13 July 2013, 79.3 KBytes).
»
This package is a subset of the Approximate RL and DP toolbox, containing only the optimistic planning algorithms, and reorganized to be self-contained. It is helpful if you are interested only in this type of algorithm. The following algorithms are included:
optimistic planning for deterministic systems (opd), for discrete Markov decision processes (opss), and with continuous actions (sooplp); as well as open-loop optimistic planning (olop), with the theoretical variant (olop_theoretical) as described in the paper by Bubeck and Munos.
See the readme file for a more detailed description.
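For readers curious what "optimistic" means here, the core idea behind OPD can be sketched in a few lines: always expand the leaf of the planning tree whose upper bound on the return is highest. This is a Python illustration of the general principle (assuming rewards normalized to [0, 1]), not the toolbox's opd code, and the toy dynamics in the usage lines are invented:

```python
import heapq

def opd(sim, x0, actions, gamma=0.9, budget=100):
    """Sketch of optimistic planning for deterministic systems: repeatedly
    expand the leaf with the highest optimistic bound on the return."""
    # leaf entries: (-upper_bound, tiebreak, state, depth, return_so_far, first_action)
    cnt = 0
    leaves = [(-1.0 / (1 - gamma), cnt, x0, 0, 0.0, None)]
    best_ret, best_first = 0.0, actions[0]
    for _ in range(budget):
        _, _, x, d, ret, first = heapq.heappop(leaves)
        for a in actions:
            x2, r = sim(x, a)                 # deterministic model call
            ret2 = ret + gamma ** d * r       # discounted return along the path
            fa = a if first is None else first
            if ret2 > best_ret:
                best_ret, best_first = ret2, fa
            # optimistic bound: all future rewards could still equal 1
            b = ret2 + gamma ** (d + 1) / (1 - gamma)
            cnt += 1
            heapq.heappush(leaves, (-b, cnt, x2, d + 1, ret2, fa))
    return best_first

# Toy usage: walk on the integers; stepping onto 3 earns reward 1.
sim = lambda x, a: (x + a, 1.0 if x + a == 3 else 0.0)
chosen = opd(sim, 0, [-1, 1])
```

At each call the planner returns only the first action of the best sequence found; in receding-horizon fashion, it is then re-run from the next state.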
«

MARL toolbox ver. 1.3,
a Matlab multi-agent reinforcement learning toolbox (4 August 2010, 336.9 KBytes).
»
The Multi-Agent Reinforcement Learning toolbox is a package of Matlab functions and scripts that I used in my research on multi-agent learning. I prefer Matlab for its ease of use with numeric computations and its rapid prototyping facilities. Since no Matlab toolbox for dynamic multi-agent tasks was available when I started my PhD project, I began writing my own; this is the result. The toolbox is developed with modularity in mind, separating for instance the agent behaviour from the world engine, and the latter from the rendering GUI. Currently the toolbox supports only episodic environments, but hooks are in place for continuing tasks as well. The learning, action selection, and exploration methods can be independently plugged into the agents' behaviour.
Several types of gridworld-based environments are implemented, and agents can learn using a set of algorithms including single-agent Q-learning, team Q-learning, minimax-Q, WoLF-PHC, and an adaptive state expansion algorithm developed by us. Everything is written for the generic n-agent case, except minimax-Q, which is most meaningful in the two-agent case.
The latest version, 1.3, adds the Distributed Q-learning algorithm and the new 'robotic rescue' gridworld environment used in the example of our survey chapter Multi-Agent Reinforcement Learning: An Overview (where the problem was described more generically as 'object transportation'). Also included is a demonstration script illustrating the experiments reported in the chapter.
«

MARL toolbox documentation,
the documentation files for the MARL toolbox (4 August 2010, 223.1 KBytes).
»
This archive accompanies the Multi-Agent Reinforcement Learning toolbox, documenting its features and usage. An up-to-date HTML reference of the functions and scripts in the toolbox is included, but the documentation itself has unfortunately not been updated since version 1.0 of the toolbox.
«

Approximate RL and DP toolbox,
developed in Matlab. (6 June 2010, 967.6 KBytes).
»
This toolbox contains Matlab implementations of a number of approximate reinforcement learning (RL) and dynamic programming (DP) algorithms, notably including the algorithms used in our book Reinforcement Learning and Dynamic Programming Using Function Approximators.
The toolbox features:
 Algorithms for approximate value iteration: grid Q-iteration, fuzzy Q-iteration, and fitted Q-iteration with nonparametric and neural network approximators. In addition, an implementation of fuzzy Q-iteration with cross-entropy (CE) optimization of the membership functions is provided.
 Algorithms for approximate policy iteration: least-squares policy iteration (LSPI), policy iteration with LSPE-Q policy evaluation, online LSPI and online policy iteration with LSPE-Q, as well as online LSPI with explicitly parameterized and monotonic policies. These algorithms all support generic approximators, of which a variety are already implemented.
 Algorithms for approximate policy search: policy search with adaptive basis functions, using the CE or DIRECT methods for global optimization. An additional generic policy search algorithm, with a configurable optimization technique and generic policy approximators, is provided.
 Implementations of several well-known reinforcement learning benchmarks (car-on-the-hill, bicycle balancing, inverted pendulum swing-up), as well as more specialized control-oriented tasks (DC motor, robotic arm control) and a highly challenging HIV infection control task.
 A set of thoroughly commented demonstrations illustrating how all these algorithms can be used.
 A standardized task interface that lets users implement their own tasks. The algorithm functions also follow standardized input-output conventions and use a highly flexible, standardized configuration mechanism.
 Optimized Q-iteration and policy iteration implementations, taking advantage of Matlab's built-in vectorized and matrix operations (many of them exploiting the LAPACK and BLAS libraries) to run extremely fast.
 Extensive result inspection facilities (plotting of policies and value functions, execution and solution performance statistics, etc.).
 Implementations of several classical RL and DP algorithms for discrete problems: Q-learning and SARSA with or without eligibility traces, Q-iteration, and policy iteration with Q-functions.
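To give a flavor of the vectorized implementation style mentioned in the list above, here is how an entire Q-iteration sweep over all states and actions reduces to a single matrix expression (shown in Python/NumPy rather than Matlab; the three-state MDP is invented for the example and is not a toolbox benchmark):

```python
import numpy as np

# Made-up 3-state, 2-action MDP with a known model:
# P[a] is the transition matrix under action a, R[s, a] the expected reward.
P = np.array([[[0.9, 0.1, 0.0],    # action 0
               [0.0, 0.9, 0.1],
               [0.1, 0.0, 0.9]],
              [[0.1, 0.9, 0.0],    # action 1
               [0.0, 0.1, 0.9],
               [0.9, 0.1, 0.0]]])
R = np.array([[0.0, 0.0],
              [0.0, 0.0],
              [1.0, 0.0]])          # only state 2 under action 0 is rewarded
gamma = 0.9

# Each sweep updates every (state, action) pair at once, with no explicit
# loops over states or actions: P @ V is a batched matrix-vector product.
Q = np.zeros((3, 2))
for _ in range(200):
    V = Q.max(axis=1)               # shape (S,)
    Q = R + gamma * (P @ V).T       # (A, S, S) @ (S,) -> (A, S) -> (S, A)

policy = Q.argmax(axis=1)           # greedy policy from the converged Q-function
```

The same pattern carries over directly to Matlab, where such matrix expressions are dispatched to optimized LAPACK/BLAS routines; that is what makes the loop-free implementations in the toolbox fast.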
For more details, see the readme file of the toolbox. Note that you will need two additional functions to make the toolbox work! The readme file describes how these functions can be obtained.
Since June 6th 2010, the archive also includes the regression trees package of Pierre Geurts, redistributed with his kind permission.
«

makepdf,
a Windows XP batch script to automate the creation of PDF files from DVI (21 November 2008, 2.4 KBytes).
Presentations

Optimistic planning for continuous-action deterministic systems,
an overview of the SOOP algorithm (1 July 2013, 1.4 MBytes).
»
This talk describes our continuousaction planning algorithm called SOOP. Like other online planning methods, SOOP has no direct dependence on the state space structure. It explores a solution space consisting of infinite sequences of continuous actions, without requiring knowledge about the smoothness of the system. To this end, it borrows the principle of the simultaneous optimistic optimization method, and develops a nontrivial adaptation of this principle to the planning problem.
«

Optimistic planning for networked control systems,
explaining how the features of planning make it suitable for NCS (25 June 2013, 2.7 MBytes).
»
This work deals with near-optimal control in networked control systems, using optimistic planning (OP) – a model-predictive type of approach borrowed from artificial intelligence. OP produces long, near-optimal sequences of actions for very general nonlinear systems and cost functions. The idea is to transmit a longer sequence of actions to a buffer and wait until this sequence is exhausted, thereby reducing network usage and computation. We analyze the closed-loop performance and the effect of sending longer or shorter subsequences, in theory as well as in numerical experiments.
«

Reinforcement learning with function approximation,
my talk in the Optimal Adaptive Control workshop at the IEEE Conference on Decision and Control (11 December 2011, 5.5 MBytes).
»
Artificial-intelligence techniques for reinforcement learning are introduced, starting from the discrete-time, discrete-valued roots of the field. After motivating and formalizing the problem, several essential classes of basic algorithms will be described: value iteration, policy iteration, and policy search. Using function approximation, these algorithms will be extended to work in continuous-variable systems. Algorithm development is complemented by a study of theoretical questions regarding convergence and solution quality, and by illustrative examples and case studies.
«

Optimistic planning for near-optimal control in MDPs,
an in-depth description of our optimistic planning algorithm and its analysis (1 December 2011, 1.1 MBytes).
»
Markov decision processes (MDPs) describe general problems in which actions must be applied to a system so as to maximize a long-term cumulative reward. Such problems arise in many fields, including automatic control, artificial intelligence, medicine, and economics. Recently, the community has intensified its interest in online planning methods for solving MDPs, due to their relative insensitivity to the state dimensionality. At every interaction step, these methods select an action based on a local exploration of the possible control policies from the current state (so they are also a type of model-predictive control).
In this presentation, we consider a planning algorithm that optimistically explores the space of closed-loop policies, always refining the most promising solution found so far. This is similar to how classical planning works, so the algorithm can be seen, e.g., as an extension of classical AO* to infinite-horizon MDPs. We analyze the quality of the action choices made by optimistic planning, for problems with a finite number of actions and of possible next states for each transition. Performance does not directly depend on these numbers; instead, the algorithm implicitly adapts to the (unknown) problem complexity. In particular, specializing the result to some interesting classes of MDPs shows that the algorithm works better when there are fewer near-optimal policies and the transition probabilities are less uniform. The presentation closes with some promising experimental results, including the online control of a simulated HIV infection.
«

Reinforcement learning lectures,
introducing classical and approximate RL (3 March 2010, 2.1 MBytes).
»
This is a two-part lecture on reinforcement learning (RL) for discrete and continuous-variable tasks.
In the first part, the Markov decision process formalism is introduced and the optimal RL solution is characterized. This is followed by a discussion of classical, discrete online RL algorithms. Eligibility traces and experience replay are introduced.
The second part briefly returns to the classical dynamic programming (DP) algorithms for value and policy iteration, and then extends them to the approximate, continuous-variable case. Throughout the two lectures, simulation and real-time control examples accompany the theoretical developments and algorithm descriptions.
This presentation employs the demonstration movies below, and refers to the RL demos under "Software" above.
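Eligibility traces, one of the topics of the first lecture, can be condensed into a short sketch: every visited state-action pair keeps a decaying trace, and each temporal-difference error updates all traced pairs at once. This is a Python illustration (with replacing traces, on an invented corridor task that is not part of the lecture materials):

```python
import random

random.seed(1)

# Made-up 4-state corridor: start in state 0, actions 0 = left, 1 = right;
# reaching state 3 ends the episode with reward 1, all other rewards are 0.
S, A = 4, 2
gamma, alpha, lam, eps = 0.9, 0.2, 0.8, 0.2
Q = [[0.0] * A for _ in range(S)]

def step(s, a):
    s2 = min(S - 1, s + 1) if a == 1 else max(0, s - 1)
    return s2, (1.0 if s2 == S - 1 else 0.0), s2 == S - 1

def pick(s):                                # epsilon-greedy action selection
    if random.random() < eps:
        return random.randrange(A)
    return max(range(A), key=lambda a: Q[s][a])

for episode in range(100):                  # SARSA(lambda)
    e = [[0.0] * A for _ in range(S)]       # traces reset at episode start
    s, a, done = 0, pick(0), False
    while not done:
        s2, r, done = step(s, a)
        a2 = pick(s2)
        delta = r + (0.0 if done else gamma * Q[s2][a2]) - Q[s][a]
        e[s][a] = 1.0                       # replacing eligibility trace
        for i in range(S):
            for j in range(A):
                Q[i][j] += alpha * delta * e[i][j]
                e[i][j] *= gamma * lam      # decay every trace each step
        s, a = s2, a2
```

The traces let a single reward at the goal update the whole recent trajectory in one step, instead of propagating back one state per episode as in plain SARSA.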
«

Reinforcement learning in continuous state and action spaces,
my defense presentation, with a very gentle introduction to the topic. (13 January 2009, 391.3 KBytes).
»
This presentation introduces the basics of RL and dynamic programming (DP), and the need for approximation in continuous spaces. Very little prior knowledge is required (basic math should be enough), and nearly every concept is illustrated graphically. So, this presentation may be useful to persons unfamiliar with the RL and DP field.
«

Model-based reinforcement learning with fuzzy approximation,
an overview of our fuzzy Q-iteration algorithm, with convergence and consistency results. (9 April 2008, 930.5 KBytes).
»
Reinforcement learning is a widely used paradigm for learning control. Computing exact reinforcement learning solutions is generally possible only when the process states and control actions take values in a small discrete set, so in practice approximate algorithms are necessary. This presentation first introduces the RL problem, and then describes an approximate, model-based reinforcement learning algorithm. This algorithm relies on a fuzzy partition of the state space and on a discretization of the action space. It converges to a solution that lies within a bound of the optimal solution. Under continuity assumptions on the dynamics and the reward function, the algorithm is also consistent, meaning that the optimal solution is asymptotically obtained as the approximation accuracy increases.
The algorithm is applied to an example control problem, where good performance is obtained. The influence of discontinuous reward functions, which do not satisfy the conditions for consistency, is studied. It appears that a continuous reward function is important for a predictable improvement in performance as the approximation accuracy increases. Finally, the algorithm is used to swing up an underactuated inverted pendulum.
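The flavor of the algorithm can be conveyed in a few lines: Q-values are stored only at the membership-function centers for each discrete action, and the value iteration backup interpolates through the memberships. This is a Python sketch on an invented 1-D problem, not the presentation's actual benchmark:

```python
import numpy as np

# Made-up 1-D task: state x in [-1, 1], dynamics x' = clip(x + 0.2 u),
# discrete actions u in {-1, +1}, reward 1 when the next state is near 0.
centers = np.linspace(-1.0, 1.0, 11)        # triangular membership centers
actions = np.array([-1.0, 1.0])
gamma = 0.9

def mu(x):
    """Triangular membership degrees of state x (normalized to sum to 1)."""
    w = np.maximum(0.0, 1.0 - np.abs(x - centers) / 0.2)
    return w / w.sum()

def f(x, u):
    return np.clip(x + 0.2 * u, -1.0, 1.0)

def rho(x, u):
    return 1.0 if abs(f(x, u)) < 0.1 else 0.0

# Fuzzy Q-iteration: theta[i, j] approximates Q at center i, action j.
theta = np.zeros((len(centers), len(actions)))
for _ in range(100):                        # the backup is a contraction
    new = np.empty_like(theta)
    for i, x in enumerate(centers):
        for j, u in enumerate(actions):
            # interpolate the value of the next state through the memberships
            new[i, j] = rho(x, u) + gamma * np.max(mu(f(x, u)) @ theta)
    theta = new

# greedy fuzzy policy at x = -0.5: the learned policy pushes right, toward 0
u_star = actions[np.argmax(mu(-0.5) @ theta)]
```

Because the memberships sum to one, the interpolated backup is a max-norm contraction, which is what yields the convergence guarantee mentioned above; consistency additionally needs the refinement of the partition together with the continuity assumptions.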
«

Reinforcement learning for multiagent systems,
a good overview talk to which I collaborated; this was presented by Prof. Robert Babuska at the CABS colloquium (the link opens a separate download page) (22 June 2006).
Demonstration Movies

Learning to swing up an inverted pendulum,
using online least-squares policy iteration. (8 January 2009, 51.8 MBytes).
»
The inverted pendulum is obtained by placing a weight offcenter on a disk driven by a DC motor. The motor is underactuated, so it cannot push the weight up in one go, but must swing back and forth. Half of the learning trials are started with the weight pointing down, and half in a random initial state obtained by applying a sequence of random actions (that is the reason for the large random actions applied even after the controller has learned to properly swing up the pendulum).
«

Final swingup solution,
after the online LSPI learning experiment was completed. (8 January 2009, 864.9 KBytes).

Robot goalkeeper learning to catch the ball,
using approximate online RL and experience replay (demo by Sander Adam). (1 October 2008, 13.3 MBytes).
