Approximate RL and DP toolbox,
latest snapshot, including bugfixes and new, work-in-progress algorithms and experiments - possibly with their own, new bugs. (9 January 2016, 1.9 MBytes).
9 Jan 2016
- Compatibility fixes for recent versions of Matlab (changes in graphics handling, the optimization toolbox, etc.)
- Some bugfixes and extra comments
a selection of algorithms as a stand-alone package. (13 July 2013, 79.3 KBytes).
This package is a subset of the Approximate RL and DP toolbox, containing only optimistic planning algorithms, and reorganized to be self-contained. It is helpful if you are only interested in this type of algorithm. The following algorithms are included:
optimistic planning for deterministic systems (opd), optimistic planning for discrete Markov decision processes (opss), optimistic planning with continuous actions (sooplp), and open-loop optimistic planning (olop), together with the theoretical variant (olop_theoretical) described in the paper by Bubeck and Munos.
See the readme file for a more detailed description.
Approximate RL and DP toolbox,
July 2013 release. (13 July 2013, 1.6 MBytes).
Since the previous release of the toolbox was getting rather old, I decided to publish a new version. Be warned though: this is very much work-in-progress, a snapshot of the code that I use for my daily research. So expect undocumented behavior, bugs, but also plenty of new algorithms – hic sunt leones!
- Online, optimistic planning algorithms: for deterministic systems (opd), for discrete Markov decision processes (opss), with continuous actions (sooplp), open-loop optimistic planning (olop), and hierarchical OLOP (holop). The entry point connecting planning to the system is genmpc. OPD and OP-MDP can be used while applying longer sequences of actions / longer tree policies.
- Fitted Q-iteration with local linear regression approximation.
- An extensive mechanism for running batch experiments (testing an algorithm with a grid of parameters and inspecting results). See the /batch subdirectory, and as examples the batch experiment files left in system directories, such as op_ip.
- A standardized interface for real-time control problems, see e.g. ipsetup_rtproblem. Another example is the implementation for the EdRo robot. Two online RL implementations compatible with this interface are rtapproxqlearn and rtlspionline.
- New simulation tasks notably include a resonating robot arm (where a spring is used to make the motion more energy-efficient) and a simple 2D navigation problem.
- Additional demonstration scripts, including one for planning and another focused on least-squares types of policy iteration.
- For classical, discrete RL: the Monte-Carlo and Dyna-Q implementations are new; Q-learning and SARSA now support experience replay. See cleaningrobot_demo for examples. Two new problems: machine replacement (as described by Bertsekas) and gridworld navigation. These are very simple tasks useful to explain or experiment with DP and RL.
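To make the first item above concrete, the core loop of optimistic planning for deterministic systems (opd) can be sketched as follows. This is a simplified Python rendition, not the toolbox's Matlab code, and the model/actions interface is an assumption for illustration only:

```python
def opd(model, actions, gamma, budget):
    """Sketch of optimistic planning for deterministic systems (OPD).

    Assumed interface: model(state, action) -> (next_state, reward),
    with rewards in [0, 1]. The planner repeatedly expands the leaf with
    the highest optimistic upper bound b = nu + gamma^depth / (1 - gamma),
    where nu is the discounted return accumulated along the path to the leaf.
    """
    def plan(state):
        # each leaf: (upper bound b, return nu, depth, state, first action on path)
        leaves = [(1.0 / (1.0 - gamma), 0.0, 0, state, None)]
        for _ in range(budget):
            # pop the most optimistic leaf and generate all its children
            i = max(range(len(leaves)), key=lambda k: leaves[k][0])
            _, nu, d, s, a0 = leaves.pop(i)
            for a in actions:
                s2, r = model(s, a)
                nu2 = nu + gamma ** d * r
                b2 = nu2 + gamma ** (d + 1) / (1.0 - gamma)
                leaves.append((b2, nu2, d + 1, s2, a if a0 is None else a0))
        # act greedily: first action of the leaf with the best accumulated return
        return max(leaves, key=lambda leaf: leaf[1])[4]
    return plan
```

The budget caps the number of node expansions, which is the computational unit the OPD guarantees are stated in.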
The same standardized task interface is followed as before, with some extensions. Things should be largely backward-compatible with the old version; if you encounter trouble, let me know. I have left the functions and scripts for many experiments I have run in the system directories, in case they are useful. See also the description and documentation for the previous version of the toolbox.
I have also included code related to my recent forays into cooperative multiagent control:
- Multiagent planning (magenmpc, maopd_...), with specific focus on consensus problems (although generalizable). Multiagent tasks: linear agents and robot-arm agents. Standard linear consensus and flocking protocols. See the paper OP for Consensus.
- Multiagent consensus using optimistic optimization (ooconsensus), and as a side-benefit the DOO and SOO algorithms of Remi Munos (doosoo).
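For reference, the DOO part of doosoo boils down to the following optimistic loop; this is a one-dimensional Python sketch assuming a known Lipschitz constant, not the toolbox implementation:

```python
def doo(f, lo, hi, lip, budget):
    """Sketch of deterministic optimistic optimization (DOO) on [lo, hi].

    Assumes f is Lipschitz with constant lip, so for any cell with center c
    and radius r containing the optimum x*, f(x*) <= f(c) + lip * r.
    Repeatedly trisects the cell with the highest such upper bound.
    """
    cells = [((lo + hi) / 2.0, (hi - lo) / 2.0)]  # (center, radius) pairs
    for _ in range(budget):
        # expand the cell with the highest optimistic bound f(c) + lip * r
        i = max(range(len(cells)), key=lambda k: f(cells[k][0]) + lip * cells[k][1])
        c, r = cells.pop(i)
        # split the cell into three equal subintervals
        cells += [(c - 2 * r / 3, r / 3), (c, r / 3), (c + 2 * r / 3, r / 3)]
    # return the best evaluated center
    return max(cells, key=lambda cell: f(cell[0]))[0]
```

SOO follows the same pattern but drops the known Lipschitz constant, instead expanding at most one cell per depth level.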
MARL toolbox ver. 1.3,
a Matlab multi-agent reinforcement learning toolbox (4 August 2010, 336.9 KBytes).
The Multi-Agent Reinforcement Learning toolbox is a package of Matlab functions and scripts that I used in my research on multi-agent learning. I prefer Matlab for its ease of use with numeric computations and its rapid prototyping facilities. Since no Matlab toolbox for dynamic multi-agent tasks was available when I started my PhD project, I wrote one of my own; this is the result. The toolbox is developed with modularity in mind, separating, for instance, the agent behaviour from the world engine, and the latter from the rendering GUI. Currently the toolbox supports only episodic environments, but hooks are in place for continuing tasks as well. The learning, action selection, and exploration methods can be independently plugged into the agents' behaviour.
Several types of gridworld-based environments are implemented, and agents can learn using a set of algorithms among which single-agent Q-learning, team Q-learning, minimax-Q, WoLF-PHC and an adaptive state expansion algorithm developed by us. Everything is written for the generic n-agent case, except minimax-Q, which is most meaningful in the two-agent case.
The latest version, 1.3, adds the Distributed Q-learning algorithm and the new 'robotic rescue' gridworld environment used in the example of our survey chapter Multi-Agent Reinforcement Learning: An Overview (where the problem was described more generically as 'object transportation'). Also included is a demonstration script illustrating the experiments reported in the chapter.
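The single-agent Q-learning baseline in the toolbox follows the standard tabular update; a minimal Python sketch (the environment interface here is an assumption for illustration, not the toolbox's actual one):

```python
import random
from collections import defaultdict

def q_learning(env, episodes, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (sketch).

    Assumed interface: env.reset() -> state,
    env.step(action) -> (next_state, reward, done), env.actions a list.
    """
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda u: Q[(s, u)])
            s2, r, done = env.step(a)
            # temporal-difference update toward the greedy bootstrap target
            target = r if done else r + gamma * max(Q[(s2, u)] for u in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```

Team Q-learning applies the same update over joint actions, while minimax-Q replaces the greedy bootstrap with a minimax value.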
MARL toolbox documentation,
the documentation files for the MARL toolbox (4 August 2010, 223.1 KBytes).
This archive accompanies the Multi-Agent Reinforcement Learning toolbox, and documents its features and usage. An up-to-date HTML reference of the functions and scripts in the toolbox is included, but the documentation itself has unfortunately not been updated since version 1.0 of the toolbox.
Approximate RL and DP toolbox,
developed in Matlab. (6 June 2010, 967.6 KBytes).
This toolbox contains Matlab implementations of a number of approximate reinforcement learning (RL) and dynamic programming (DP) algorithms, notably including the algorithms used in our book Reinforcement Learning and Dynamic Programming Using Function Approximators.
The toolbox features:
- Algorithms for approximate value iteration: grid Q-iteration, fuzzy Q-iteration, and fitted Q-iteration with nonparametric and neural network approximators. In addition, an implementation of fuzzy Q-iteration with cross-entropy (CE) optimization of the membership functions is provided.
- Algorithms for approximate policy iteration: least-squares policy iteration (LSPI), policy iteration with LSPE-Q policy evaluation, online LSPI and online policy iteration with LSPE-Q, as well as online LSPI with explicitly parameterized and monotonic policies. These algorithms all support generic approximators, of which a variety are already implemented.
- Algorithms for approximate policy search: policy search with adaptive basis functions, using the CE or DIRECT methods for global optimization. An additional generic policy search algorithm, with a configurable optimization technique and generic policy approximators, is provided.
- Implementations of several well-known reinforcement learning benchmarks (the car-on-the-hill, bicycle balancing, inverted pendulum swingup), as well as more specialized control-oriented tasks (DC motor, robotic arm control) and a highly challenging HIV infection control task.
- A set of thoroughly commented demonstrations illustrating how all these algorithms can be used.
- A standardized task interface that allows users to implement their own tasks. The algorithm functions also follow standardized input-output conventions, and use a highly flexible, standardized configuration mechanism.
- Optimized Q-iteration and policy iteration implementations, taking advantage of Matlab's built-in vectorized and matrix operations (many of which exploit the LAPACK and BLAS libraries) to run extremely fast.
- Extensive result inspection facilities (plotting of policies and value functions, execution and solution performance statistics, etc.).
- Implementations of several classical RL and DP algorithms for discrete problems: Q-learning and SARSA with or without eligibility traces, Q-iteration, and policy iteration with Q-functions.
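Of the classical algorithms in the last item, Q-iteration also illustrates the vectorized style mentioned above. A NumPy sketch (the P/R encoding of the MDP is an assumption for illustration, not the toolbox's format):

```python
import numpy as np

def q_iteration(P, R, gamma, iters=100):
    """Q-iteration for a discrete MDP (sketch).

    Assumed encoding: P[a] is the S x S transition matrix for action a,
    R[a] the S-vector of expected rewards for taking a in each state.
    Iterates the Bellman update Q <- R + gamma * P * max_a' Q.
    """
    n_actions, n_states = len(P), P[0].shape[0]
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        V = Q.max(axis=1)  # greedy value of each state
        # one vectorized Bellman backup per action, no loop over states
        Q = np.stack([R[a] + gamma * P[a] @ V for a in range(n_actions)], axis=1)
    return Q
```

Each sweep is a handful of matrix products, which is exactly where the LAPACK/BLAS speedups come from.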
For more details, see the readme file of the toolbox. Note that you will need two additional functions to make the toolbox work! The readme file describes how to obtain them.
Since June 6th 2010, the archive also includes the regression trees package of Pierre Geurts, redistributed with his kind permission.
a Windows XP batch script to automate the creation of PDF files from DVI (21 November 2008, 2.4 KBytes).
Learning control for a communicating mobile robot,
about our recent research on machine learning for control of a robot that must simultaneously learn a map and optimally transmit a data buffer. A short talk given at the American Control Conference, Philadelphia, US (10 July 2019, 1.2 MBytes).
Basics of Reinforcement Learning,
a very condensed introduction to basic dynamic programming and RL methods. Taught at the Transylvanian Summer School on Machine Learning, in Cluj-Napoca, Romania (20 July 2018, 4.5 MBytes).
AI Planning with Applications to Switched Systems,
discussing, in addition to some planning techniques, their adaptations for switched system control. Keynote at the IFAC CESCIT conference (6 June 2018, 5.4 MBytes).
Online, Optimistic Planning for Markov Decision Processes,
an in-depth course mainly on my recent research into optimistic planning algorithms, with a practical session. Taught at the ACAI Summer School on RL, in Nieuwpoort, Belgium (10 October 2017).
Approximate Dynamic Programming and Reinforcement Learning for Control,
an invited, three-day intensive Master course at the Polytechnic University of Valencia, Spain (21 June 2017).
This course provides methods for controlling systems that are too complex or insufficiently known to apply classical control design techniques. Classical foundations are connected to recent developments. The focus is placed on learning algorithms for control, in particular reinforcement learning (RL). Special attention is also paid to model-based techniques related to RL, as they can be very useful in controlling complex systems even when a model is known. After introducing the RL problem, the dynamic programming algorithms that sit at the foundation of RL are described. Then, classical, discrete-variable RL algorithms are introduced. In the second part of the course, the dynamic programming and RL algorithms are extended with approximation techniques, in order to make them applicable to continuous-variable control, as well as to large-scale discrete-variable problems. Several online planning techniques are discussed.
Learning control of a communicating drone,
A Parrot AR.Drone 2 learns a radio map and transmits a buffer at the same time, with an approach similar to the one in the ACC 2019 talk above. (1 December 2019).
Fall detection using a quadrotor,
A Parrot AR.Drone 2 monitors a person for falls while flying at a set distance and orientation. Both the person's location and any falls are detected with deep-learning vision algorithms. With Paul Dragan and Cristi Iuga; see our conference paper for details. (1 December 2017).
Assistive robot demo using online POMDP planning,
Cyton Gamma 1500 robot arm, with a Pioneer3AT mobile base and end-effector camera, flips off electrical switches that were left on. Uses an online planning algorithm called AEMS2 for partially observable Markov decision processes. With Elod Pall and Levente Tamas; see our IROS paper for details. (7 July 2016).
Planning to swing up a rotary pendulum in real time,
using the continuous-action simultaneous optimistic optimization for planning (SOOP) algorithm. With Elod Pall. (24 November 2014).
Learning to swing up an inverted pendulum,
using online least-squares policy iteration. (8 January 2009, 51.8 MBytes).
The inverted pendulum is obtained by placing a weight off-center on a disk driven by a DC motor. The motor is underactuated, so it cannot push the weight up in one go, but must swing it back and forth. Half of the learning trials are started with the weight pointing down, and half in a random initial state obtained by applying a sequence of random actions (which is the reason for the large random actions applied even after the controller has learned to properly swing up the pendulum). Also have a look at the final swingup solution
, after learning was completed.
Robot goalkeeper learning to catch the ball,
using approximate online RL and experience replay (demo by Sander Adam). (1 October 2008, 13.3 MBytes).