The repository contains useful downloadable material related to my research and teaching, including Matlab software, presentations, and demonstration movies. Presentations are selectively chosen for tutorial value.
Software

Approximate RL and DP toolbox,
latest snapshot, including bugfixes and new, work-in-progress algorithms and experiments, possibly with their own new bugs. (9 January 2016, 1.9 MBytes).
What's new:
9 Jan 2016
 Compatibility fixes for recent versions of Matlab (changes in graphics handling, the optimization toolbox, etc.)
 Some bugfixes and extra comments

Optimistic planning,
a selection of algorithms as a standalone package. (13 July 2013, 79.3 KBytes).
This package is a subset of the Approximate RL and DP toolbox, containing only optimistic planning algorithms, and reorganized to be self-contained. It is helpful if you are only interested in this type of algorithm. The following algorithms are included:
optimistic planning for deterministic systems (opd), for discrete Markov decision processes (opss), with continuous actions (sooplp); and open-loop optimistic planning (olop) with the theoretical variant (olop_theoretical) as described in the paper by Bubeck and Munos.
See the readme file for a more detailed description.

Approximate RL and DP toolbox,
July 2013 release. (13 July 2013, 1.6 MBytes).
Since the previous release of the toolbox was getting rather old, I decided to publish a new version. Be warned though: this is very much work in progress, a snapshot of the code that I use for my daily research. So expect undocumented behavior and bugs, but also plenty of new algorithms: hic sunt leones!
New features:
 Online, optimistic planning algorithms: for deterministic systems (opd), for discrete Markov decision processes (opss), with continuous actions (sooplp), open-loop optimistic planning (olop), and hierarchical OLOP (holop). The entry point connecting planning to the system is genmpc. OPD and OPMDP can be used while applying longer sequences of actions / longer tree policies.
 Fitted Q-iteration with local linear regression approximation.
 An extensive mechanism for running batch experiments (testing an algorithm with a grid of parameters and inspecting results). See the /batch subdirectory, and as examples the batch experiment files left in system directories, such as op_ip.
 A standardized interface for real-time control problems, see e.g. ipsetup_rtproblem. Another example is the implementation for the EdRo robot. Two online RL implementations compatible with this interface are rtapproxqlearn and rtlspionline.
 New simulation tasks include notably a resonating robot arm (where a spring is used to make the motion more energy efficient), and a simple navigation problem in 2D.
 Additional demonstration scripts, including one for planning and another focused on least-squares types of policy iteration.
 For classical, discrete RL: the Monte Carlo and Dyna-Q implementations are new; Q-learning and SARSA now support experience replay. See cleaningrobot_demo for examples. Two new problems: machine replacement (as described by Bertsekas) and gridworld navigation. These are very simple tasks useful to explain or experiment with DP and RL.
The same standardized task interface is followed as before, with some extensions. Things should be largely backward-compatible with the old version; if you encounter trouble, let me know. I have left in the system directories functions and scripts for many experiments I have run, in case they are useful. See also the description and documentation for the previous version of the toolbox.
I have also included code related to my recent forays into cooperative multiagent control:
 Multiagent planning (magenmpc, maopd_...), with specific focus on consensus problems (although generalizable). Multiagent tasks: linear agents and robot-arm agents. Standard linear consensus and flocking protocols. See the paper OP for Consensus.
 Multiagent consensus using optimistic optimization (ooconsensus), and as a side benefit the DOO and SOO algorithms of Remi Munos (doosoo).

MARL toolbox documentation,
the documentation files for the MARL toolbox (4 August 2010, 223.1 KBytes).
This archive accompanies the Multi-Agent Reinforcement Learning toolbox, and documents its features and usage. An up-to-date HTML reference of the functions and scripts in the toolbox is included, but the documentation itself has unfortunately not been updated since version 1.0 of the toolbox.

MARL toolbox ver. 1.3,
a Matlab multiagent reinforcement learning toolbox (4 August 2010, 336.9 KBytes).
The Multi-Agent Reinforcement Learning toolbox is a package of Matlab functions and scripts that I used in my research on multiagent learning. We prefer Matlab for its ease of use with numeric computations and its rapid prototyping facilities. Since no Matlab toolbox for dynamic multiagent tasks was available when I started my PhD project, I started writing one of my own. This is the result. The toolbox is developed with modularity in mind, separating for instance the agent behaviour from the world engine and the latter from the rendering GUI. Currently the toolbox supports only episodic environments, but hooks are in place for continuing tasks as well. The learning, action selection and exploration methods can be independently plugged into the agents' behaviour.
Several types of gridworld-based environments are implemented, and agents can learn using a set of algorithms among which single-agent Q-learning, team Q-learning, minimax-Q, WoLF-PHC and an adaptive state expansion algorithm developed by us. Everything is written for the generic n-agent case, except minimax-Q, which is most meaningful in the two-agent case.
The latest version, 1.3, adds the Distributed Q-learning algorithm and the new 'robotic rescue' gridworld environment used in the example of our survey chapter Multi-Agent Reinforcement Learning: An Overview (where the problem was described more generically as 'object transportation'). Also included is a demonstration script illustrating the experiments reported in the chapter.

Approximate RL and DP toolbox,
developed in Matlab. (6 June 2010, 967.6 KBytes).
This toolbox contains Matlab implementations of a number of approximate reinforcement learning (RL) and dynamic programming (DP) algorithms, notably including the algorithms used in our book Reinforcement Learning and Dynamic Programming Using Function Approximators.
The toolbox features:
 Algorithms for approximate value iteration: grid Q-iteration, fuzzy Q-iteration, and fitted Q-iteration with nonparametric and neural network approximators. In addition, an implementation of fuzzy Q-iteration with cross-entropy (CE) optimization of the membership functions is provided.
 Algorithms for approximate policy iteration: least-squares policy iteration (LSPI), policy iteration with LSPE-Q policy evaluation, online LSPI and online policy iteration with LSPE-Q, as well as online LSPI with explicitly parameterized and monotonic policies. These algorithms all support generic approximators, of which a variety are already implemented.
 Algorithms for approximate policy search: policy search with adaptive basis functions, using the CE or DIRECT methods for global optimization. An additional generic policy search algorithm, with a configurable optimization technique and generic policy approximators, is provided.
 Implementations of several well-known reinforcement learning benchmarks (the car-on-the-hill, bicycle balancing, inverted pendulum swing-up), as well as more specialized control-oriented tasks (DC motor, robotic arm control) and a highly challenging HIV infection control task.
 A set of thoroughly commented demonstrations illustrating how all these algorithms can be used.
 A standardized task interface means that users will be able to implement their own tasks. The algorithm functions also follow standardized input-output conventions, and use a highly flexible, standardized configuration mechanism.
 Optimized Q-iteration and policy iteration implementations, taking advantage of Matlab built-in vectorized and matrix operations (many of them exploiting LAPACK and BLAS libraries) to run extremely fast.
 Extensive result inspection facilities (plotting of policies and value functions, execution and solution performance statistics, etc.).
 Implementations of several classical RL and DP algorithms for discrete problems: Q-learning and SARSA with or without eligibility traces, Q-iteration, and policy iteration with Q-functions.
For more details, see the readme file of the toolbox. Note that you will need two additional functions to make the toolbox work! The readme file describes how these functions can be obtained.
Since June 6th 2010, the archive also includes the regression trees package of Pierre Geurts, redistributed with his kind permission.
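To give a flavor of the simplest algorithm in the list above, here is a minimal Python sketch of Q-iteration for a small discrete, deterministic problem. It is not code from the toolbox (which is written in Matlab); the function names, the six-state chain task, and all numerical values are assumptions chosen only for illustration.

```python
# Q-iteration sketch on a hypothetical 6-state chain (not the toolbox code).
# States 0 and 5 are terminal; reaching 0 pays 1, reaching 5 pays 5.

ACTIONS = (-1, 1)  # move left or right along the chain


def step(x, u):
    """Deterministic transition: returns (next state, reward, terminal flag)."""
    if x in (0, 5):  # terminal states are absorbing and pay nothing
        return x, 0.0, True
    x_next = x + u
    reward = 5.0 if x_next == 5 else 1.0 if x_next == 0 else 0.0
    return x_next, reward, x_next in (0, 5)


def q_iteration(states, actions, step, gamma, tol=1e-9, max_iter=1000):
    """Repeat the Bellman optimality backup until the Q-function stops changing."""
    q = {(x, u): 0.0 for x in states for u in actions}
    for _ in range(max_iter):
        new_q = {}
        for x in states:
            for u in actions:
                x_next, reward, terminal = step(x, u)
                future = 0.0 if terminal else max(q[(x_next, a)] for a in actions)
                new_q[(x, u)] = reward + gamma * future
        delta = max(abs(new_q[k] - q[k]) for k in q)
        q = new_q
        if delta < tol:
            break
    return q


q = q_iteration(range(6), ACTIONS, step, gamma=0.5)
# Greedy policy in the non-terminal states
policy = {x: max(ACTIONS, key=lambda a: q[(x, a)]) for x in (1, 2, 3, 4)}
print(policy)
```

With these assumed rewards and discount factor, the greedy policy goes left only from the state next to the small reward and right everywhere else. A vectorized implementation would replace the two inner loops by matrix operations.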

makepdf,
a Windows XP batch script to automate the creation of PDF files from DVI (21 November 2008, 2.4 KBytes).
Presentations

Optimistic planning for continuous actions,
latest talk on an interesting algorithm for near-optimal control in general nonlinear systems with continuous actions (6 July 2016, 759.1 KBytes).

Reinforcement learning and planning algorithms,
a high-level overview talk I gave at the IROS 2015 workshop on Machine Learning in Planning and Control of Robot Motion (2 October 2015, 3.0 MBytes).
Many learning and planning methods for robot motion control are built on a foundation of optimal control in Markov decision processes. In this talk, we describe this basic problem and some fundamental reinforcement learning and planning methods to solve it. We start with dynamic programming algorithms, and then move on to model-free approaches in reinforcement learning. Special attention is paid to function approximation, which is essential in robotics. In the second part of the talk, we describe an online planning framework, identifying its relation to reinforcement learning and detailing a few recent optimistic planning algorithms. We connect to several robotics applications along the way.
See also the MLPC website.

Nonlinear near-optimal control using optimistic planning,
(Algorithms, Networked Control Systems, Real-Time Control), presented at the Italian Institute of Technology (25 September 2014, 4.6 MBytes).
Markov decision processes describe problems in which actions must be applied to a system so as to maximize a long-term cumulative reward. Such problems arise in many fields, including robotics, automatic control, artificial intelligence, medicine, economics, etc. This talk focuses on optimistic planning (OP) methods to solve Markov decision processes. Like predictive control, OP finds action sequences and applies them in receding horizon. However, OP is very different from classical predictive control, and in fact integrates ideas from reinforcement learning, classical AI search, and global optimization. Its main advantage is a high degree of generality, and a tight relationship between computation invested and near-optimality. We start by presenting a few OP methods and outlining their computation versus near-optimality analysis. Then we show how OP can control nonlinear systems over a network. A main challenge to applying OP in real life, such as in robotics, is its high computational cost, so we also describe conditions under which it can be adapted to work in real time.
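The optimistic principle can be illustrated with a minimal Python sketch of optimistic planning for deterministic systems. This is not the toolbox's Matlab opd function; the function signature, the toy system, and the budget are hypothetical, and rewards are assumed to lie in [0, 1] so that the discounted tail of any sequence is bounded by gamma^d / (1 - gamma).

```python
# Optimistic planning sketch for a deterministic system (illustration only).
import heapq
import itertools


def opd(f, rho, x0, actions, gamma, budget):
    """Expand `budget` nodes, always choosing the leaf with the largest upper
    bound (b-value); return the first action of the best sequence found.
    f(x, u) is the next state, rho(x, u) a reward assumed to be in [0, 1]."""
    tie = itertools.count()  # unique tie-breaker so states are never compared
    # Each leaf: (-b_value, tie, state, discounted return so far, depth, first action)
    leaves = [(-1.0 / (1.0 - gamma), next(tie), x0, 0.0, 0, None)]
    best_return, best_action = -1.0, None
    for _ in range(budget):
        _, _, x, ret, depth, first = heapq.heappop(leaves)
        for u in actions:
            child_ret = ret + gamma ** depth * rho(x, u)
            child_first = u if first is None else first
            # Optimistic bound: return so far plus the best possible tail
            b = child_ret + gamma ** (depth + 1) / (1.0 - gamma)
            heapq.heappush(
                leaves, (-b, next(tie), f(x, u), child_ret, depth + 1, child_first)
            )
            if child_ret > best_return:  # track the best lower bound
                best_return, best_action = child_ret, child_first
    return best_action


# Hypothetical toy system: an integer position clipped to [-3, 3];
# a reward of 1 is received whenever the next position is 3.
def move(x, u):
    return max(-3, min(3, x + u))


def reward(x, u):
    return 1.0 if move(x, u) == 3 else 0.0


action = opd(move, reward, x0=0, actions=(-1, 1), gamma=0.9, budget=50)
print(action)
```

In receding horizon, only the returned action is applied, and planning is repeated from the next state. A larger budget deepens the tree along promising sequences, which is where the relationship between computation and near-optimality comes from.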

Optimistic planning for near-optimal control in MDPs,
an in-depth description of the optimistic planning algorithm for MDPs and its analysis (1 December 2011, 1.1 MBytes).
Markov decision processes (MDPs) describe general problems in which actions must be applied to a system so as to maximize a long-term cumulative reward. Such problems arise in many fields, including automatic control, artificial intelligence, medicine, economics, etc. Recently, the community has intensified its interest in online planning methods for solving MDPs, due to their relative independence of the state dimensionality. At every interaction step, these methods select an action based on a local exploration of possible control policies from the current state (so they are also a type of model-predictive control).
In this presentation, we consider a planning algorithm that optimistically explores the space of closed-loop policies, always refining the most promising solution found so far. This is similar to how classical planning works, so the algorithm can be seen e.g. as an extension of classical AO* to infinite-horizon MDPs. We analyze the quality of the action choices made by optimistic planning, for problems with a finite number of actions and possible next states for each transition. Performance does not directly depend on these numbers; instead, the algorithm implicitly adapts to the (unknown) problem complexity. In particular, specializing the result for some interesting classes of MDPs illustrates that the algorithm works better when there are fewer near-optimal policies and less uniform transition probabilities. The presentation closes with some promising experimental results, including the online control of a simulated HIV infection.

Reinforcement learning lectures,
introducing classical and approximate RL (3 March 2010, 2.1 MBytes).
This is a two-part lecture on reinforcement learning (RL) for discrete and continuous-variable tasks.
In the first part, the Markov decision process formalism is introduced and the optimal RL solution is characterized. This is followed by a discussion of classical, discrete online RL algorithms. Eligibility traces and experience replay are introduced.
The second part briefly returns to the classical dynamic programming (DP) algorithms for value and policy iteration, and then extends them to the approximate, continuous-variable case. Throughout the two lectures, simulation and real-time control examples accompany the theoretical developments and algorithm descriptions.
This presentation employs the demonstration movies below, and refers to the RL demos under "Software" above.
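As a rough illustration of the classical online algorithms covered in the first part, the following Python sketch combines tabular Q-learning with a simple form of experience replay. It is not taken from the lecture or from the toolbox; the six-state chain task, the uniform sampling of stored transitions, and all parameter values are assumptions.

```python
# Tabular Q-learning with experience replay on a hypothetical 6-state chain.
import random

ACTIONS = (-1, 1)  # move left or right


def step(x, u):
    """States 0 and 5 are terminal; reaching 0 pays 1, reaching 5 pays 5."""
    x_next = x + u
    reward = 5.0 if x_next == 5 else 1.0 if x_next == 0 else 0.0
    return x_next, reward, x_next in (0, 5)


def q_learning_replay(episodes=500, alpha=0.5, gamma=0.5, epsilon=0.3,
                      replays=5, seed=0):
    rng = random.Random(seed)
    q = {(x, u): 0.0 for x in range(6) for u in ACTIONS}
    buffer = []  # every observed transition is stored for later reuse

    def update(x, u, reward, x_next, terminal):
        # Standard Q-learning temporal-difference update
        future = 0.0 if terminal else max(q[(x_next, a)] for a in ACTIONS)
        q[(x, u)] += alpha * (reward + gamma * future - q[(x, u)])

    for _ in range(episodes):
        x = rng.choice((1, 2, 3, 4))  # random non-terminal start state
        for _ in range(50):  # cap on the episode length
            if rng.random() < epsilon:  # epsilon-greedy exploration
                u = rng.choice(ACTIONS)
            else:
                u = max(ACTIONS, key=lambda a: q[(x, a)])
            x_next, reward, terminal = step(x, u)
            sample = (x, u, reward, x_next, terminal)
            buffer.append(sample)
            update(*sample)  # learn from the new transition
            for old in rng.choices(buffer, k=replays):
                update(*old)  # replay stored transitions
            if terminal:
                break
            x = x_next
    return q


q = q_learning_replay()
greedy_at_4 = max(ACTIONS, key=lambda a: q[(4, a)])
print(greedy_at_4)
```

Replaying stored transitions lets each real interaction be used for several updates, which matters most when interaction with the system is costly, as in real-time control.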
Demonstration Movies

Assistive robot demo using online POMDP planning,
A Cyton Gamma 1500 robot arm, with a Pioneer3AT mobile base and an end-effector camera, flips off electrical switches that were left on. It uses an online planning algorithm called AEMS2 for partially observable Markov decision processes. With Elod Pall and Levente Tamas, see our IROS paper for details. (7 July 2016).

Planning to swing up a rotary pendulum in real time,
using the continuous-action simultaneous optimistic optimization for planning (SOOP) algorithm. With Elod Pall. (24 November 2014).

Final swing-up solution,
after the online LSPI learning experiment was completed. (8 January 2009, 864.9 KBytes).

Learning to swing up an inverted pendulum,
using online least-squares policy iteration. (8 January 2009, 51.8 MBytes).
The inverted pendulum is obtained by placing a weight off-center on a disk driven by a DC motor. The motor is underactuated, so it cannot push the weight up in one go, but must swing back and forth. Half of the learning trials are started with the weight pointing down, and half in a random initial state obtained by applying a sequence of random actions (that is the reason for the large random actions applied even after the controller has learned to properly swing up the pendulum).

Robot goalkeeper learning to catch the ball,
using approximate online RL and experience replay (demo by Sander Adam). (1 October 2008, 13.3 MBytes).
