The repository contains useful downloadable material related to my research and teaching, including Matlab software, presentations, and demonstration movies. Presentations are selectively chosen for tutorial value. If an item has a "»" button to its right, this button can be clicked to reveal more information; the "«" button then hides this information again (requires JavaScript).
Software

Approximate RL and DP toolbox,
latest snapshot, including bugfixes and new, work-in-progress algorithms and experiments, possibly with their own new bugs. (9 January 2016, 1.9 MBytes).
»
What's new:
9 Jan 2016
 Compatibility fixes for recent versions of Matlab (changes in graphics handling, the optimization toolbox, etc.)
 Some bugfixes and extra comments
«

Optimistic planning,
a selection of algorithms as a standalone package. (13 July 2013, 79.3 KBytes).
»
This package is a subset of the Approximate RL and DP toolbox, containing only optimistic planning algorithms, and reorganized to be self-contained. It is helpful if you are only interested in this type of algorithm. The following algorithms are included:
optimistic planning for deterministic systems (opd), for discrete Markov decision processes (opss), with continuous actions (sooplp); and open-loop optimistic planning (olop) with the theoretical variant (olop_theoretical) as described in the paper by Bubeck and Munos.
See the readme file for a more detailed description.
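As a rough illustration of the idea behind optimistic planning for deterministic systems, here is a minimal Python sketch. It is not the toolbox's Matlab code: the function name `opd` and its signature, the toy chain task, and the choice to return the first action of the sequence with the largest accumulated discounted reward are all assumptions made for this example. Rewards are assumed to lie in [0, 1], so that a leaf at depth d with discounted return ν has the upper bound b = ν + γ^d / (1 − γ); the planner always expands the leaf with the largest b.

```python
import heapq
import itertools


def opd(x0, actions, f, rho, gamma, budget):
    """Sketch of optimistic planning for a deterministic system.

    f(x, u) gives the next state, rho(x, u) a reward assumed to be in [0, 1].
    Returns the first action of the explored sequence with the best return.
    """
    tie = itertools.count()  # tie-breaker so the heap never compares states
    # Leaf entries: (-b, tie, depth, discounted return, state, first action)
    leaves = [(-1.0 / (1.0 - gamma), next(tie), 0, 0.0, x0, None)]
    best_ret, best_action = -1.0, None
    for _ in range(budget):
        # Expand the leaf with the largest upper bound b (the optimistic choice)
        _, _, depth, ret, x, first = heapq.heappop(leaves)
        for u in actions:
            child_ret = ret + gamma ** depth * rho(x, u)
            child_b = child_ret + gamma ** (depth + 1) / (1.0 - gamma)
            child_first = u if first is None else first
            heapq.heappush(
                leaves,
                (-child_b, next(tie), depth + 1, child_ret, f(x, u), child_first),
            )
            if child_ret > best_ret:
                best_ret, best_action = child_ret, child_first
    return best_action


# Hypothetical toy task: integer positions in [-3, 3], reward 1 when the
# intended move lands at position 2 or beyond, and 0 otherwise.
def chain_step(x, u):
    return max(-3, min(3, x + u))


def chain_reward(x, u):
    return 1.0 if x + u >= 2 else 0.0


first_action = opd(0, (-1, 1), chain_step, chain_reward, gamma=0.9, budget=10)
```

In this toy task the reward only appears after two steps to the right, and the planner still picks +1 as the first action, because the upper bounds steer expansions toward the rewarding branch.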
«

Approximate RL and DP toolbox,
July 2013 release. (13 July 2013, 1.6 MBytes).
»
Since the previous release of the toolbox was getting rather old, I decided to publish a new version. Be warned though: this is very much work-in-progress, a snapshot of the code that I use for my daily research. So expect undocumented behavior, bugs, but also plenty of new algorithms – hic sunt leones!
New features:
 Online, optimistic planning algorithms: for deterministic systems (opd), for discrete Markov decision processes (opss), with continuous actions (sooplp), open-loop optimistic planning (olop), and hierarchical OLOP (holop). The entry point connecting planning to the system is genmpc. OPD and OPMDP can be used while applying longer sequences of actions / longer tree policies.
 Fitted Q-iteration with local linear regression approximation.
 An extensive mechanism for running batch experiments (testing an algorithm with a grid of parameters and inspecting results). See the /batch subdirectory, and as examples the batch experiment files left in system directories, such as op_ip.
 A standardized interface for real-time control problems, see e.g. ipsetup_rtproblem. Another example is the implementation for the EdRo robot. Two online RL implementations compatible with this interface are rtapproxqlearn and rtlspionline.
 New simulation tasks include notably a resonating robot arm (where a spring is used to make the motion more energy efficient), and a simple navigation problem in 2D.
 Additional demonstration scripts, including one for planning and another focused on least-squares types of policy iteration.
 For classical, discrete RL: the Monte Carlo and Dyna-Q implementations are new; Q-learning and SARSA now support experience replay. See cleaningrobot_demo for examples. Two new problems: machine replacement (as described by Bertsekas) and gridworld navigation. These are very simple tasks useful to explain or experiment with DP and RL.
The same standardized task interface is followed as before, with some extensions. Things should be largely backward-compatible with the old version; if you encounter trouble, let me know. I have left in the system directories functions and scripts for many experiments I have run, in case they are useful. See also the description and documentation for the previous version of the toolbox.
I have also included code related to my recent forays into cooperative multiagent control:
 Multiagent planning (magenmpc, maopd_...), with specific focus on consensus problems (although generalizable). Multiagent tasks: linear agents and robot-arm agents. Standard linear consensus and flocking protocols. See the paper OP for Consensus.
 Multiagent consensus using optimistic optimization (ooconsensus), and as a side benefit the DOO and SOO algorithms of Remi Munos (doosoo).
«

MARL toolbox ver. 1.3,
a Matlab multiagent reinforcement learning toolbox (4 August 2010, 336.9 KBytes).
»
The Multi-Agent Reinforcement Learning toolbox is a package of Matlab functions and scripts that I used in my research on multiagent learning. We prefer Matlab for its ease of use with numeric computations and its rapid prototyping facilities. Since no Matlab toolbox for dynamic multiagent tasks was available when I started my PhD project, I started writing one of my own. This is the result. The toolbox is developed with modularity in mind, separating for instance the agent behaviour from the world engine and the latter from the rendering GUI. Currently the toolbox supports only episodic environments, but hooks are in place for continuing tasks as well. The learning, action selection and exploration methods can be independently plugged into the agents' behaviour.
Several types of gridworld-based environments are implemented, and agents can learn using a set of algorithms among which single-agent Q-learning, team Q-learning, minimax-Q, WoLF-PHC and an adaptive state expansion algorithm developed by us. Everything is written for the generic n-agent case, except minimax-Q, which is most meaningful in the two-agent case.
The latest version, 1.3, adds the Distributed Q-learning algorithm and the new 'robotic rescue' gridworld environment used in the example of our survey chapter Multi-Agent Reinforcement Learning: An Overview (where the problem was described more generically as 'object transportation'). Also included is a demonstration script illustrating the experiments reported in the chapter.
«

MARL toolbox documentation,
the documentation files for the MARL toolbox (4 August 2010, 223.1 KBytes).
»
This archive accompanies the Multi-Agent Reinforcement Learning toolbox, and documents its features and usage. An up-to-date HTML reference of the functions and scripts in the toolbox is included, but the documentation itself has unfortunately not been updated since version 1.0 of the toolbox.
«

Approximate RL and DP toolbox,
developed in Matlab. (6 June 2010, 967.6 KBytes).
»
This toolbox contains Matlab implementations of a number of approximate reinforcement learning (RL) and dynamic programming (DP) algorithms, notably including the algorithms used in our book Reinforcement Learning and Dynamic Programming Using Function Approximators.
The toolbox features:
 Algorithms for approximate value iteration: grid Q-iteration, fuzzy Q-iteration, and fitted Q-iteration with nonparametric and neural network approximators. In addition, an implementation of fuzzy Q-iteration with cross-entropy (CE) optimization of the membership functions is provided.
 Algorithms for approximate policy iteration: least-squares policy iteration (LSPI), policy iteration with LSPE-Q policy evaluation, online LSPI and online policy iteration with LSPE-Q, as well as online LSPI with explicitly parameterized and monotonic policies. These algorithms all support generic approximators, of which a variety are already implemented.
 Algorithms for approximate policy search: policy search with adaptive basis functions, using the CE or DIRECT methods for global optimization. An additional generic policy search algorithm, with a configurable optimization technique and generic policy approximators, is provided.
 Implementations of several well-known reinforcement learning benchmarks (car-on-the-hill, bicycle balancing, inverted pendulum swing-up), as well as more specialized control-oriented tasks (DC motor, robotic arm control) and a highly challenging HIV infection control task.
 A set of thoroughly commented demonstrations illustrating how all these algorithms can be used.
 A standardized task interface means that users will be able to implement their own tasks. The algorithm functions also follow standardized input-output conventions, and use a highly flexible, standardized configuration mechanism.
 Optimized Q-iteration and policy iteration implementations, taking advantage of Matlab built-in vectorized and matrix operations (many of them exploiting LAPACK and BLAS libraries) to run extremely fast.
 Extensive result inspection facilities (plotting of policies and value functions, execution and solution performance statistics, etc.).
 Implementations of several classical RL and DP algorithms for discrete problems: Q-learning and SARSA with or without eligibility traces, Q-iteration, and policy iteration with Q-functions.
For more details, see the readme file of the toolbox. Note that you will need two additional functions to make the toolbox work! The readme file describes how these functions can be obtained.
Since June 6th 2010, the archive also includes the regression trees package of Pierre Geurts, redistributed with his kind permission.
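To give a flavor of the simplest algorithm in the list above, here is a minimal Python sketch of Q-iteration for a small deterministic, discrete problem. It is an illustration only and does not reproduce the toolbox's Matlab implementation or its task interface: the names `q_iteration` and `greedy`, and the six-state chain task with its rewards and discount factor, are assumptions chosen for the example. Each sweep applies the update Q(x, u) ← ρ(x, u) + γ max over u' of Q(f(x, u), u').

```python
def q_iteration(states, actions, f, rho, gamma, eps=1e-9, max_iter=1000):
    """Q-iteration for a deterministic MDP with dynamics f and reward rho."""
    Q = {(x, u): 0.0 for x in states for u in actions}
    for _ in range(max_iter):
        Q_new = {}
        for x in states:
            for u in actions:
                x_next = f(x, u)
                # Bellman optimality backup for the pair (x, u)
                Q_new[(x, u)] = rho(x, u) + gamma * max(
                    Q[(x_next, v)] for v in actions
                )
        delta = max(abs(Q_new[k] - Q[k]) for k in Q)
        Q = Q_new
        if delta < eps:  # stop once a full sweep no longer changes Q
            break
    return Q


def greedy(Q, x, actions):
    """Greedy action in state x with respect to Q."""
    return max(actions, key=lambda u: Q[(x, u)])


# Hypothetical chain task: states 0..5, where 0 and 5 are absorbing.
# Reaching state 5 pays 5, reaching state 0 pays 1, everything else pays 0.
STATES = list(range(6))
ACTIONS = (-1, 1)


def f(x, u):
    return x if x in (0, 5) else x + u


def rho(x, u):
    if x in (0, 5):
        return 0.0
    if x + u == 5:
        return 5.0
    if x + u == 0:
        return 1.0
    return 0.0


Q = q_iteration(STATES, ACTIONS, f, rho, gamma=0.5)
policy = [greedy(Q, x, ACTIONS) for x in (1, 2, 3, 4)]
```

With these assumed numbers, the greedy policy moves left only from state 1, where the small nearby reward outweighs the discounted large one, and moves right from states 2 to 4.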
«

makepdf,
a Windows XP batch script to automate the creation of PDF files from DVI (21 November 2008, 2.4 KBytes).
Presentations

Basics of Reinforcement Learning,
a very condensed introduction to basic dynamic programming and RL methods. Taught at the Transylvanian Summer School on Machine Learning, in Cluj-Napoca, Romania (20 July 2018, 4.5 MBytes).

AI Planning with Applications to Switched Systems,
discussing, in addition to some planning techniques, their adaptations for switched system control. Keynote at the IFAC CESCIT conference (6 June 2018, 5.4 MBytes).

Online, Optimistic Planning for Markov Decision Processes,
an in-depth course mainly on my recent research into optimistic planning algorithms, with a practical session. Taught at the ACAI Summer School on RL, in Nieuwpoort, Belgium (10 October 2017).

Approximate Dynamic Programming and Reinforcement Learning for Control,
an invited, three-day intensive Master course at the Polytechnic University of Valencia, Spain (21 June 2017).
»
This course provides methods for controlling systems that are too complex or insufficiently known to apply classical control design techniques. Classical foundations are connected to recent developments. The focus is placed on learning algorithms for control, in particular reinforcement learning (RL). Special attention is also paid to model-based techniques related to RL, as they can be very useful in controlling complex systems even when a model is known. After introducing the RL problem, the dynamic programming algorithms that sit at the foundation of RL are described. Then, classical, discrete-variable RL algorithms are introduced. In the second part of the course, the dynamic programming and RL algorithms are extended with approximation techniques, in order to make them applicable to continuous-variable control, as well as to large-scale discrete-variable problems. Several online planning techniques are discussed.
«
Demonstration Movies

Fall detection using a quadrotor,
A Parrot AR.Drone 2 monitors a person for falls while flying at a set distance and orientation. The location of the person, as well as falls, are detected with deep-learning vision algorithms. With Paul Dragan and Cristi Iuga, see our conference paper for details. (1 December 2017).

Assistive robot demo using online POMDP planning,
A Cyton Gamma 1500 robot arm, with a Pioneer 3-AT mobile base and an end-effector camera, turns off electrical switches that were left on. It uses an online planning algorithm called AEMS2 for partially observable Markov decision processes. With Elod Pall and Levente Tamas, see our IROS paper for details. (7 July 2016).

Planning to swing up a rotary pendulum in real time,
using the continuous-action simultaneous optimistic optimization for planning (SOOP) algorithm. With Elod Pall. (24 November 2014).

Learning to swing up an inverted pendulum,
using online least-squares policy iteration. (8 January 2009, 51.8 MBytes).
»
The inverted pendulum is obtained by placing a weight off-center on a disk driven by a DC motor. The motor is underactuated, so it cannot push the weight up in one go, but must swing back and forth. Half of the learning trials are started with the weight pointing down, and half in a random initial state obtained by applying a sequence of random actions (that is the reason for the large random actions applied even after the controller has learned to properly swing up the pendulum). Also have a look at the final swing-up solution, after learning was completed.
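The need to swing back and forth can be illustrated with a small simulation. The Python sketch below is only a toy model under stated assumptions, not the real setup or the learned controller: it uses a frictionless pendulum with normalized parameters (inertia and gravity torque both equal to 1), a hypothetical torque limit of 0.3, and a hand-written energy-pumping rule in place of least-squares policy iteration. The angle is measured from the downward position, so the weight is upright at ±π.

```python
import math


def simulate(policy, u_max=0.3, t_end=60.0, dt=1e-3):
    """Simulate the normalized pendulum and return the largest |angle| reached.

    Dynamics (assumed, frictionless): theta'' = -sin(theta) + u, |u| <= u_max.
    The simulation stops early once the upright position (pi) is reached.
    """
    theta, omega = 0.0, 0.0  # start at rest, weight pointing down
    max_angle = 0.0
    for _ in range(int(t_end / dt)):
        u = max(-u_max, min(u_max, policy(theta, omega)))  # saturate torque
        omega += dt * (-math.sin(theta) + u)  # semi-implicit Euler step
        theta += dt * omega
        max_angle = max(max_angle, abs(theta))
        if max_angle >= math.pi:
            break
    return max_angle


def constant_push(theta, omega):
    """Always push in the same direction with full torque."""
    return 1.0


def energy_pump(theta, omega):
    """Push in the direction of motion, so every swing adds energy."""
    return 1.0 if omega >= 0 else -1.0


reach_constant = simulate(constant_push)  # stalls far below upright
reach_pumping = simulate(energy_pump)     # builds up swings until upright
```

Under these assumptions, pushing in one direction stalls well below the horizontal, while reversing the torque with the direction of motion accumulates energy over a few swings and brings the weight to the top.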
«

Robot goalkeeper learning to catch the ball,
using approximate online RL and experience replay (demo by Sander Adam). (1 October 2008, 13.3 MBytes).
