This page contains a representative selection of my publications, in reverse chronological order. Use the "»" button to reveal an abstract of each publication, and "«" to hide the abstract again (requires JavaScript).
Books
-
L. Busoniu, R. Babuska, B. De Schutter, D. Ernst,
Reinforcement Learning and Dynamic Programming Using Function Approximators,
CRC Press, Automation and Control Engineering Series. April 2010, 280 pages, ISBN 978-1439821084.
»
Abstract:
Reinforcement learning (RL) can optimally solve decision and control problems involving complex dynamic systems, without requiring a mathematical model of the system. If a model is available, dynamic programming (DP), the model-based counterpart of RL, can be used. RL and DP are applicable in a variety of disciplines, including automatic control, artificial intelligence, economics, and medicine. Recent years have seen a surge of interest in RL and DP using compact, approximate representations of the solution, which enable algorithms to scale up to realistic problems.
This book provides an in-depth introduction to RL and DP with function approximators. A concise description of classical RL and DP (Chapter 2) builds the foundation for the remainder of the book. This is followed by an extensive review of the state-of-the-art in RL and DP with approximation, which combines algorithm development with theoretical guarantees, illustrative numerical examples, and insightful comparisons (Chapter 3). Each of the final three chapters (4 to 6) is dedicated to a representative algorithm from the three major classes of methods: value iteration, policy iteration, and policy search. The features and performance of these algorithms are highlighted in extensive experimental studies on a range of control applications.
For graduate students and others new to the field, this book offers a thorough introduction to both the basics and emerging methods. And for those researchers and practitioners working in the fields of optimal and adaptive control, machine learning, artificial intelligence, and operations research, this resource offers a combination of practical algorithms, theoretical analysis, and comprehensive examples that they will be able to adapt and apply to their own work.
Access the book's website at http://www.dcsc.tudelft.nl/rlbook for additional information about the book and how to obtain it, as well as free access to a sample chapter and other downloadable material, including errata.
«
Journal papers
-
I. Grondman, M. Vaandrager, L. Busoniu, R. Babuska, E. Schuitema,
Efficient Model Learning Methods for Actor-Critic Control.
IEEE Transactions on Systems, Man, and Cybernetics—Part B: Cybernetics,
2012.
Published online.
»
Abstract: We propose two new actor-critic algorithms for reinforcement learning. Both algorithms use local linear regression (LLR) to learn approximations of the functions involved. A crucial feature of the algorithms is that they also learn a process model and this, in combination with LLR, provides an efficient policy update for faster learning. The first algorithm uses a novel model-based update rule for the actor parameters. The second algorithm does not use an explicit actor, but learns a reference model which represents a desired behaviour, from which desired control actions can be calculated using the inverse of the learned process model. The two novel methods and a standard actor-critic algorithm are applied to the pendulum swing-up problem, in which the novel methods achieve faster learning than the standard algorithm.
Online at IEEEXplore.
«
-
S. Adam, L. Busoniu, R. Babuska,
Experience Replay for Real-Time Reinforcement Learning Control.
IEEE Transactions on Systems, Man, and Cybernetics—Part C: Applications and Reviews,
vol. 42,
no. 2,
pages 201–212,
2012.
»
Abstract: Reinforcement learning (RL) algorithms can automatically learn optimal control strategies for nonlinear, possibly stochastic
systems. A promising approach for RL control is experience replay (ER), which quickly learns from a limited amount of data by
repeatedly presenting these data to an underlying RL algorithm. Despite its benefits, ER RL has been studied only sporadically
in the literature, and its applications have largely been confined to simulated systems. Therefore, in this paper we evaluate
ER RL on real-time control experiments involving a pendulum swing-up problem and the vision-based control of a goalkeeper
robot. These real-time experiments are complemented by simulation studies and comparisons with traditional RL. As a
preliminary, we develop a general ER framework that can be combined with essentially any incremental RL technique, and
instantiate this framework for the approximate Q-learning and SARSA algorithms. The successful real-time learning results
presented here are highly encouraging for the applicability of ER RL in practice.
Online at IEEEXplore.
«
-
L. Busoniu, D. Ernst, B. De Schutter, R. Babuska,
Cross-Entropy Optimization of Control Policies with Adaptive Basis Functions.
IEEE Transactions on Systems, Man, and Cybernetics—Part B: Cybernetics,
vol. 41,
no. 1,
pages 196–209,
2011.
»
Abstract: This paper introduces an algorithm for direct search of control policies in continuous-state,
discrete-action Markov decision processes. The algorithm looks for the best closed-loop policy
that can be represented using a given number of basis functions (BFs), where a discrete action
is assigned to each BF. The type of the BFs and their number are specified in advance and
determine the complexity of the representation. Considerable flexibility is achieved by
optimizing the locations and shapes of the BFs, together with the action assignments. The
optimization is carried out with the cross-entropy method and evaluates the policies by their
empirical return from a representative set of initial states. The return for each representative
state is estimated using Monte Carlo simulations. The resulting algorithm for cross-entropy policy
search with adaptive BFs is extensively evaluated in problems with two to six state variables, for
which it reliably obtains good policies with only a small number of BFs. In these experiments,
cross-entropy policy search requires vastly fewer BFs than value-function techniques with
equidistant BFs, and outperforms policy search with a competing optimization algorithm called DIRECT.
Online at IEEEXplore.
«
-
L. Busoniu, D. Ernst, B. De Schutter, R. Babuska,
Approximate Dynamic Programming with a Fuzzy Parametrization.
Automatica,
vol. 46,
no. 5,
pages 804–814,
2010.
»
Abstract: Dynamic programming (DP) is a powerful paradigm for general, nonlinear optimal control.
Computing exact DP solutions is in general only possible when the process states and the
control actions take values in a small discrete set. In practice, it is necessary to approximate
the solutions. Therefore, we propose an algorithm for approximate DP that relies on a fuzzy
partition of the state space, and on a discretization of the action space. This
fuzzy Q-iteration algorithm works for deterministic processes, under the discounted
return criterion. We prove that fuzzy Q-iteration asymptotically converges to a solution that
lies within a bound of the optimal solution. A bound on the suboptimality of the solution obtained
in a finite number of iterations is also derived. Under continuity assumptions on the dynamics
and on the reward function, we show that fuzzy Q-iteration is consistent, i.e., that it
asymptotically obtains the optimal solution as the approximation accuracy increases. These
properties hold both when the parameters of the approximator are updated in a synchronous
fashion, and when they are updated asynchronously. The asynchronous algorithm is proven to
converge at least as fast as the synchronous one. The performance of fuzzy Q-iteration is
illustrated in a two-link manipulator control problem.
Online at ScienceDirect.
«
-
L. Busoniu, R. Babuska, B. De Schutter,
A Comprehensive Survey of Multi-Agent Reinforcement Learning.
IEEE Transactions on Systems, Man, and Cybernetics — Part C: Applications and Reviews,
vol. 38,
no. 2,
pages 156–172,
2008.
Recipient of the 2009 Andrew P. Sage Award for the best paper published annually in the IEEE Transactions on Systems, Man,
and Cybernetics.
»
Abstract: Multi-agent systems are rapidly finding applications in a variety of domains, including robotics,
distributed control, telecommunications, and economics. The complexity of many tasks arising
in these domains makes them difficult to solve with preprogrammed agent behaviors. The agents
must instead discover a solution on their own, using learning. A significant part of the research
on multi-agent learning concerns reinforcement learning techniques. This paper provides a
comprehensive survey of multi-agent reinforcement learning (MARL). A central issue in the field
is the formal statement of the multi-agent learning goal. Different viewpoints on this issue have
led to the proposal of many different goals, among which two focal points can be distinguished:
stability of the agents' learning dynamics, and adaptation to the changing behavior of the other agents.
The MARL algorithms described in the literature aim—either explicitly or implicitly—at one of these
two goals or at a combination of both, in a fully cooperative, fully competitive, or more general setting.
A representative selection of these algorithms is discussed in detail in this paper, together with the
specific issues that arise in each category. Additionally, the benefits and challenges of MARL are
described along with some of the problem domains where MARL techniques have been applied. Finally,
an outlook for the field is provided.
Keywords: multi-agent systems,
reinforcement learning, game theory, distributed control.
This is an extended and revised version of the ICARCV-06 MARL paper.
Online at IEEEXplore.
«
Contributions to books
-
L. Busoniu, R. Munos, R. Babuska,
A Review of Optimistic Planning in Markov Decision Processes.
In Reinforcement Learning and Adaptive Dynamic Programming for Feedback Control,
F. Lewis, D. Liu, Editors,
Wiley,
2012.
To appear.
»
Abstract: We review a class of online planning algorithms for deterministic and stochastic optimal control problems, modeled as Markov decision processes. At each discrete time step, these algorithms maximize the predicted value of planning policies from the current state, and apply the first action of the best policy found. An overall receding-horizon algorithm results, which can also be seen as a type of model-predictive control. The space of planning policies is explored optimistically, focusing on areas with largest upper bounds on the value – or upper confidence bounds, in the stochastic case. The resulting optimistic planning framework integrates several types of optimism previously used in planning, optimization, and reinforcement learning, in order to obtain several intuitive algorithms with good performance guarantees. We describe in detail three recent such algorithms, outline the theoretical guarantees on their performance, and illustrate their behavior in a numerical example.
«
-
L. Busoniu, A. Lazaric, M. Ghavamzadeh, R. Munos, R. Babuska, B. De Schutter,
Least-Squares Methods for Policy Iteration.
In Reinforcement Learning: State of the Art,
M. Wiering, M. van Otterlo, Editors,
series Adaptation, Learning, and Optimization,
no. 12,
pages 75–109.
Springer,
2012.
»
Abstract: Approximate reinforcement learning deals with the essential problem of applying reinforcement learning in large and continuous state-action spaces, by using function approximators to represent the solution. This chapter reviews least-squares methods for policy iteration, an important class of algorithms for approximate reinforcement learning. We discuss three techniques for solving the core policy evaluation component of policy iteration: least-squares temporal difference, least-squares policy evaluation, and Bellman residual minimization. We introduce these techniques starting from their general mathematical principles and detailing them down to fully specified algorithms. We pay attention to online variants of policy iteration, and provide a numerical example highlighting the behavior of representative offline and online methods. For the policy evaluation component as well as for the overall resulting approximate policy iteration, we provide guarantees on the performance obtained asymptotically, as the number of samples processed and iterations executed grows to infinity. We also provide finite-sample results, which apply when a finite number of samples and iterations are considered. Finally, we outline several extensions and improvements to the techniques and methods reviewed.
Online at SpringerLink.
«
-
L. Busoniu, B. De Schutter, R. Babuska,
Approximate Dynamic Programming and Reinforcement Learning.
In Interactive Collaborative Information Systems,
R. Babuska, F.C.A. Groen, Editors,
series Studies in Computational Intelligence,
no. 281,
pages 3–44.
Springer,
2010.
»
Abstract: DP and RL can be used to address problems from a variety of fields, including automatic control,
artificial intelligence, operations research, and economy. Many problems in these fields are described
by continuous variables, whereas DP and RL can find exact solutions only in the discrete case.
Therefore, approximation is essential in practical DP and RL. This chapter provides an in-depth
review of the literature on approximate DP and RL in large or continuous-space, infinite-horizon
problems. Value iteration, policy iteration, and policy search approaches are presented in turn.
Model-based (DP) as well as online and batch model-free (RL) algorithms are discussed. We review
theoretical guarantees on the approximate solutions produced by these algorithms. Numerical
examples illustrate the behavior of several representative algorithms in practice. Techniques
to automatically derive value function approximators are discussed, and a comparison between
value iteration, policy iteration, and policy search is provided. The chapter closes with a
discussion of open issues and promising research directions in approximate DP and RL.
Online at SpringerLink.
«
-
L. Busoniu, R. Babuska, B. De Schutter,
Multi-Agent Reinforcement Learning: An Overview.
In Innovations in Multi-Agent Systems and Applications,
D. Srinivasan, L. Jain, Editors,
series Studies in Computational Intelligence,
no. 310,
pages 183–221.
Springer,
2010.
»
Abstract: Multi-agent systems can be used to address problems in a variety of domains, including robotics, distributed control,
telecommunications, and economics. The complexity of many tasks arising in these domains makes them difficult
to solve with preprogrammed agent behaviors. The agents must instead discover a solution on their own,
using learning. A significant part of the research on multi-agent learning concerns reinforcement learning
techniques. This chapter reviews a representative selection of MARL algorithms for fully cooperative,
fully competitive, and more general (neither cooperative nor competitive) tasks. The benefits and challenges
of MARL are described. A central challenge in the field is the formal statement of a multi-agent
learning goal; this chapter reviews the learning goals proposed in the literature. The problem domains where
MARL techniques have been applied are briefly discussed. Several MARL algorithms are applied to an illustrative example
involving the coordinated transportation of an object by two cooperative robots. In an outlook for the MARL field,
a set of important open issues is identified, and promising research directions to address these issues are outlined.
The code used in the example is available for download, as part of the MARL toolbox in the Repository section.
This is an extended and revised version of the SMC 2008 paper above.
Online at SpringerLink.
«
-
L. Busoniu, D. Ernst, B. De Schutter, R. Babuska,
Continuous-State Reinforcement Learning with Fuzzy Approximation.
In Adaptive Agents and Multi-Agent Systems III,
K. Tuyls, A. Nowé, Z. Guessoum, D. Kudenko, Editors,
series Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence),
vol. 4865,
pages 27–43.
Springer,
2008.
»
Abstract: Reinforcement learning (RL) is a widely used learning paradigm for adaptive agents. There exist several
convergent and consistent RL algorithms which have been intensively studied. In their original form, these algorithms
require that the environment states and agent actions take values in a relatively small discrete set.
Fuzzy representations for approximate, model-free RL have been proposed in the literature for the
more difficult case where the state-action space is continuous. In this work, we propose a fuzzy
approximation architecture similar to those previously used for Q-learning, but we combine it with
the model-based Q-value iteration algorithm. We prove that the resulting algorithm converges. We also
give a modified, asynchronous variant of the algorithm that converges at least as fast as the original
version. An illustrative simulation example is provided.
This is an extended and revised version of the ALAMAS-07 paper.
Online at SpringerLink.
«
Conference papers
-
M. Vaandrager, R. Babuska, L. Busoniu, G. Lopes,
Imitation Learning with Non-Parametric Regression.
Accepted at
The 2012 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR-12),
Cluj-Napoca, Romania,
24–27 May
2012.
»
Abstract: Humans are very fast learners. Yet, we rarely learn a task completely from scratch. Instead, we usually start with a rough approximation of the desired behavior and take the learning from there. In this paper, we use imitation to quickly generate a rough solution to a robotic task from demonstrations, supplied as a collection of state-space trajectories. Appropriate control actions needed to steer the system along the trajectories are then automatically learned in the form of a (nonlinear) state-feedback control law. The learning scheme has two components: a dynamic reference model and an adaptive inverse process model, both based on a data-driven, non-parametric method called local linear regression. The reference model infers the desired behavior from the demonstration trajectories, while the inverse process model provides the control actions to achieve this behavior and is improved online using learning. Experimental results with a pendulum swing-up problem and a robotic arm demonstrate the practical usefulness of this approach. The resulting learned dynamics are not limited to single trajectories, but capture instead the overall dynamics of the motion, making the proposed approach a promising step towards versatile learning machines such as future household robots, or robots for autonomous missions.
«
-
L. Busoniu, R. Munos,
Optimistic Planning for Markov Decision Processes.
In
Proceedings 15th International Conference on Artificial Intelligence and Statistics (AISTATS-12),
pages 182–189,
La Palma, Canary Islands, Spain,
21–23 April
2012.
»
Abstract: The reinforcement learning community has recently intensified its interest in online planning methods, due to their relative independence of the state space size. However, tight near-optimality guarantees are not yet available for the general case of stochastic Markov decision processes and closed-loop, state-dependent planning policies. We therefore consider an algorithm related to AO* that optimistically explores a tree representation of the space of closed-loop policies, and we analyze the near-optimality of the action it returns after n tree node expansions. While this optimistic planning requires a finite number of actions and possible next states for each transition, its asymptotic performance does not depend directly on these numbers, but only on the subset of nodes that significantly impact near-optimal policies. We characterize this set by introducing a novel measure of problem complexity, called the near-optimality exponent. Specializing the exponent and performance bound for some interesting classes of MDPs illustrates that the algorithm works better when there are fewer near-optimal policies and less uniform transition probabilities.
The PDF includes supplementary material to the paper, containing proofs of the analytical results.
Online at JMLR Proceedings.
«
-
S. Norrouzadeh, L. Busoniu, R. Babuska,
Efficient Knowledge Transfer in Shaping Reinforcement Learning.
In
Proceedings 18th IFAC World Congress (IFAC-11),
Milano, Italy,
22 August–2 September
2011.
»
Abstract: Reinforcement learning is an attractive solution for deriving an optimal control policy by on-line exploration of the control task. Shaping aims to accelerate reinforcement learning by starting from easy tasks and gradually increasing the complexity, until the original task is solved. In this paper, we consider the essential decision on when to transfer learning from an easier task to a more difficult one, so that the total learning time is reduced. We propose two transfer criteria for making this decision, based on the agent's performance. The first criterion measures the agent's performance by the distance between its current solution and the optimal one, and the second by the empirical return obtained. We investigate the learning time gains achieved by using these criteria in a classical gridworld navigation benchmark. This numerical study also serves to compare several major shaping techniques.
Online at IFAC PapersOnLine.
«
-
I. Grondman, M. Vaandrager, L. Busoniu, R. Babuska, E. Schuitema,
Actor-Critic Control with Reference Model Learning.
In
Proceedings 18th IFAC World Congress (IFAC-11),
Milano, Italy,
22 August–2 September
2011.
»
Abstract: We propose a new actor-critic algorithm for reinforcement learning. The algorithm does not use an explicit actor, but learns a reference model which represents a desired behaviour, along which the process is to be controlled by using the inverse of a learned process model. The algorithm uses Local Linear Regression (LLR) to learn approximations of all the functions involved. The online learning of a process and reference model, in combination with LLR, provides an efficient policy update for faster learning. In addition, the algorithm facilitates the incorporation of prior knowledge. The novel method and a standard actor-critic algorithm are applied to the pendulum swing-up problem, in which the novel method achieves faster learning than the standard algorithm.
Online at IFAC PapersOnLine.
«
-
L. Busoniu, R. Munos, B. De Schutter, R. Babuska,
Optimistic Planning for Sparsely Stochastic Systems.
In
Proceedings 2011 IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-11),
pages 48–55,
Paris, France,
11–15 April
2011.
Part of the Special Session on Active Reinforcement Learning.
»
Abstract: We propose an online planning algorithm for finite-action, sparsely stochastic Markov decision processes, in which the random state transitions can only end up in a small number of possible next states. The algorithm builds a planning tree by iteratively expanding states, where each expansion exploits sparsity to add all possible successor states. Each state to expand is actively chosen to improve the knowledge about action quality, and this allows the algorithm to return a good action after a strictly limited number of expansions. More specifically, the active selection method is optimistic in that it chooses the most promising states first, so the novel algorithm is called optimistic planning for sparsely stochastic systems. We note that the new algorithm can also be seen as model-predictive (receding-horizon) control. The algorithm obtains promising numerical results, including the successful online control of a simulated HIV infection with stochastic drug effectiveness.
Online at IEEEXplore.
«
-
L. Busoniu, D. Ernst, B. De Schutter, R. Babuska,
Approximate Reinforcement Learning: An Overview.
In
Proceedings 2011 IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-11),
pages 1–8,
Paris, France,
11–15 April
2011.
»
Abstract: Reinforcement learning (RL) allows agents to learn how to optimally interact with complex environments. Fueled by recent advances in approximation-based algorithms, RL has obtained impressive successes in robotics, artificial intelligence, control, operations research, etc. However, the scarcity of survey papers about approximate RL makes it difficult for newcomers to grasp this intricate field. With the present overview, we take a step toward alleviating this situation. We review methods for approximate RL, starting from their dynamic programming roots and organizing them into three major classes: approximate value iteration, policy iteration, and policy search. Each class is subdivided into representative categories, highlighting among others offline and online algorithms, policy gradient methods, and simulation-based techniques. We also compare the different categories of methods, and outline possible ways to enhance the reviewed algorithms.
Online at IEEEXplore.
«
-
E. Schuitema, L. Busoniu, R. Babuska, P. Jonker,
Control Delay in Reinforcement Learning for Real-Time Dynamic Systems: A Memoryless Approach.
In
Proceedings 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS-10),
pages 3226–3231,
Taipei, Taiwan,
18–22 October
2010.
»
Abstract: Robots controlled by Reinforcement Learning (RL) are still rare. A core challenge to the application of RL to robotic systems
is to learn despite the existence of control delay – the delay between measuring a system's state and acting upon it.
Control delay is always present in real systems. In this work, we present two novel temporal difference (TD) learning
algorithms for problems with control delay. These algorithms improve learning performance by taking the control delay into
account. We test our algorithms in a gridworld, where the delay is an integer multiple of the time step, as well as in the
simulation of a robotic system, where the delay can have any value. In both tests, our proposed algorithms outperform
classical TD learning algorithms, while maintaining low computational complexity.
Online at IEEEXplore.
«
-
L. Busoniu, D. Ernst, B. De Schutter, R. Babuska,
Online Least-Squares Policy Iteration for Reinforcement Learning Control.
In
Proceedings 2010 American Control Conference (ACC-10),
pages 486–491,
Baltimore, United States,
30 June – 2 July
2010.
»
Abstract: Reinforcement learning is a promising paradigm for learning optimal control.
We consider policy iteration (PI) algorithms for reinforcement learning, which iteratively
evaluate and improve control policies. State-of-the-art, least-squares
techniques for policy evaluation are sample-efficient and have relaxed convergence requirements.
However, they are typically used in offline PI, whereas a central goal of reinforcement
learning is to develop online algorithms. Therefore, we propose an online PI algorithm
that evaluates policies with the so-called least-squares temporal difference for Q-functions (LSTD-Q).
The crucial difference between this online least-squares policy iteration (LSPI)
algorithm and its offline counterpart is that, in the online case, policy improvements must be
performed once every few state transitions, using only an incomplete evaluation of the current policy.
In an extensive experimental evaluation, online LSPI is found to work well for a wide range of
its parameters, and to learn successfully in a real-time example. Online LSPI also compares favorably
with offline LSPI and with a different flavor of online PI, which instead of LSTD-Q employs another
least-squares method for policy evaluation.
Online at IEEEXplore.
«
-
L. Busoniu, B. De Schutter, R. Babuska, D. Ernst,
Using Prior Knowledge to Accelerate Online Least-Squares Policy Iteration.
In
Proceedings 2010 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR-10),
Cluj-Napoca, Romania,
28–30 May
2010.
»
Abstract: Reinforcement learning (RL) is a promising paradigm for learning optimal control. Although RL is generally envisioned as
working without any prior knowledge about the system, such knowledge is often available and can be exploited to great
advantage. In this paper, we consider prior knowledge about the monotonicity of the control policy with respect to the
system states, and we introduce an approach that exploits this type of prior knowledge to accelerate a state-of-the-art RL
algorithm called online least-squares policy iteration (LSPI). Monotonic policies are appropriate for important classes of
systems appearing in control applications. LSPI is a data-efficient RL algorithm that we previously extended to online
learning, but that did not provide until now a way to use prior knowledge about the policy. In an empirical evaluation,
online LSPI with prior knowledge learns much faster and more reliably than the original online LSPI.
Online at IEEEXplore.
«
-
L. Busoniu, D. Ernst, B. De Schutter, R. Babuska,
Policy Search with Cross-Entropy Optimization of Basis Functions.
In
Proceedings 2009 IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-09),
pages 153–160,
Nashville, United States,
30 March – 2 April
2009.
»
Abstract: This paper introduces a novel algorithm for approximate policy search
in continuous-state, discrete-action Markov decision processes (MDPs).
Previous policy search approaches have typically used ad-hoc parameterizations
developed for specific MDPs. In contrast, the novel algorithm employs a
flexible policy parameterization, suitable for solving general discrete-action MDPs.
The algorithm looks for the best closed-loop policy that can be represented
using a given number of basis functions, where a discrete action is assigned
to each basis function. The locations and shapes of the basis functions are optimized,
together with the action assignments. This allows a large class of policies to be represented.
The optimization is carried out with the cross-entropy method and evaluates the policies by
their empirical return from a representative set of initial states. We report simulation
experiments in which the algorithm reliably obtains good policies with only a small
number of basis functions, albeit at sizable computational costs.
The SMC-B 2010 journal article above is a heavily extended and revised version of this paper.
Online at IEEEXplore.
«
-
L. Busoniu, D. Ernst, B. De Schutter, R. Babuska,
Fuzzy Partition Optimization for Approximate Fuzzy Q-iteration.
In
Proceedings 17th IFAC World Congress (IFAC-08),
pages 5629–5634,
Seoul, Korea,
6–11 July
2008.
»
Abstract: Reinforcement learning (RL) is a widely used learning paradigm for adaptive agents.
Because exact RL can only be applied to very simple problems, approximate algorithms are
usually necessary in practice. Many algorithms for approximate RL rely on basis-function
representations of the value function (or of the Q-function). Designing a good set of basis
functions without any prior knowledge of the value function (or of the Q-function) can be a
difficult task. In this paper, we propose instead a technique to optimize the shape of a constant
number of basis functions for the approximate, fuzzy Q-iteration algorithm. In contrast to other
approaches to adapt basis functions for RL, our optimization criterion measures the actual
performance of the computed policies in the task, using simulation from a representative set
of initial states. A complete algorithm, using cross-entropy optimization of triangular fuzzy
membership functions, is given and applied to the car-on-the-hill example.
Online at IFAC PapersOnLine.
«
-
X. Yuan, L. Busoniu, R. Babuska,
Reinforcement Learning for Elevator Control.
In
Proceedings 17th IFAC World Congress (IFAC-08),
pages 2212–2217,
Seoul, Korea,
6–11 July
2008.
»
Abstract: Reinforcement learning (RL) comprises an array of techniques that learn a
control policy so as to maximize a reward signal. When applied to the control
of elevator systems, RL has the potential to find better control policies
than classical heuristic, suboptimal policies. On the other hand, elevator systems
offer an interesting benchmark application for the study of RL. In this paper,
RL is applied to a single-elevator system. The mathematical model of the elevator system
is described in detail, making the system easy to re-implement and re-use.
An experimental comparison is made between the performance of the Q-value iteration
and Q-learning RL algorithms, when applied to the elevator system.
Online at IFAC PapersOnLine.
«
-
L. Busoniu, D. Ernst, R. Babuska, B. De Schutter,
Consistency of Fuzzy Model-Based Reinforcement Learning.
In
Proceedings 2008 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE-08),
pages 518–524,
Hong Kong,
1–6 June
2008.
»
Abstract: Reinforcement learning (RL) is a widely used paradigm for learning control. Computing exact RL solutions is generally only
possible when process states and control actions take values in a small discrete set. In practice, approximate algorithms are
necessary. In this paper, we propose an approximate, model-based Q-iteration algorithm that relies on a fuzzy partition of the
state space, and a discretization of the action space. Using assumptions on the continuity of the dynamics and of the reward
function, we show that the resulting algorithm is consistent, i.e., that the optimal solution is obtained asymptotically as
the approximation accuracy increases. An experimental study indicates that a continuous reward function is also important for
a predictable improvement in performance as the approximation accuracy increases.
Online at IEEEXplore.
«
-
L. Busoniu, D. Ernst, R. Babuska, B. De Schutter,
Fuzzy Approximation for Convergent Model-Based Reinforcement Learning.
In
2007 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE-07),
pages 968–973,
London, United Kingdom,
23–26 July
2007.
»
Abstract: Reinforcement learning (RL) is a learning control paradigm that provides well-understood algorithms with good convergence and
consistency properties. Unfortunately, these algorithms require that process states and control
actions take only discrete values. Approximate solutions using fuzzy representations have been
proposed in the literature for the case when the states and possibly the actions are continuous.
However, the link between these mainly heuristic solutions and the larger body of work on approximate
RL, including convergence results, has not been made explicit. In this paper, we propose a fuzzy
approximation structure for the Q-value iteration algorithm, and show that the resulting algorithm is
convergent. The proof is based on an extension of previous results in approximate RL. We then
propose a modified, serial version of the algorithm that is guaranteed to converge at least as fast
as the original algorithm. An illustrative simulation example is also provided.
Online at IEEEXplore.
«
-
L. Busoniu, D. Ernst, R. Babuska, B. De Schutter,
Continuous-State Reinforcement Learning with Fuzzy Approximation.
In
Adaptive Learning Agents and Multi-Agent Systems (ALAMAS-07) Symposium,
pages 21–35,
Maastricht, The Netherlands,
2–3 April
2007.
»
Abstract: Reinforcement learning (RL) is a widely used learning paradigm for adaptive agents. Well-understood RL algorithms
with good convergence and consistency properties exist. In their original form, these
algorithms require that the environment states and agent actions take values in a relatively
small discrete set. Fuzzy representations for approximate, model-free RL have been proposed in
the literature for the more difficult case where the state-action space is continuous. In this
work, we propose a fuzzy approximation structure similar to those previously used for
Q-learning, but we combine it with the model-based Q-value iteration algorithm. We show that
the resulting algorithm converges. We also give a modified, serial variant of the algorithm
that converges at least as fast as the original version. An illustrative simulation example is
provided.
The (downloadable) LNAI 2008 paper above is an extended and revised version of this paper.
«
-
L. Busoniu, B. De Schutter, R. Babuska,
Decentralized Reinforcement Learning Control of a Robotic Manipulator.
In
Proceedings 9th IEEE International Conference on Control, Automation, Robotics and Vision (ICARCV-06),
pages 1347–1352,
Singapore,
5–8 December
2006.
»
Abstract: Multi-agent systems are rapidly finding applications in a variety of domains, including
robotics, distributed control, telecommunications, etc. Learning approaches to multi-agent
control, many of them based on reinforcement learning (RL), are investigated in complex domains
such as teams of mobile robots. However, the application of decentralized RL to low-level
control tasks is not as intensively studied. In this paper, we investigate centralized and
decentralized RL, emphasizing the challenges and potential advantages of the latter. These are
then illustrated on an example: learning to control a two-link rigid manipulator. In closing,
some open issues and future research directions in decentralized RL are outlined.
Keywords: multi-agent learning, decentralized control, reinforcement learning.
Online at IEEEXplore.
«
-
L. Busoniu, R. Babuska, B. De Schutter,
Multi-Agent Reinforcement Learning: A Survey.
In
Proceedings 9th IEEE International Conference on Control, Automation, Robotics and Vision (ICARCV-06),
pages 527–532,
Singapore,
5–8 December
2006.
»
Abstract: Multi-agent systems are rapidly finding applications in a variety of domains, including
robotics, distributed control, telecommunications, etc. Many tasks arising in these domains
require that the agents learn behaviors online. A significant part of the research on
multi-agent learning concerns reinforcement learning techniques. However, due to different
viewpoints on central issues, such as the formal statement of the learning goal, a large number
of different methods and approaches have been introduced. In this paper we aim to present an
integrated survey of the field. First, the issue of the multi-agent learning goal is discussed,
after which a representative selection of algorithms is reviewed. Open issues are identified
and future research directions are outlined.
Keywords: multi-agent systems, reinforcement learning, game theory,
distributed control.
The (downloadable) SMC-C 2008 journal article above is an extended and revised version of this paper.
Online at IEEEXplore.
«
-
L. Busoniu, B. De Schutter, R. Babuska,
Multiagent Reinforcement Learning with Adaptive State Focus.
In
Proceedings 17th Belgian-Dutch Conference on Artificial Intelligence (BNAIC-05),
pages 35–42,
Brussels, Belgium,
17–18 October
2005.
»
Abstract: In realistic multi-agent systems, learning on the basis of complete state information is not
feasible. We introduce adaptive state focus Q-learning, a class of methods derived from
Q-learning that start learning with only the state information that is strictly necessary for a
single agent to perform the task, and that monitor the convergence of learning. If lack of
convergence is detected, the learner dynamically expands its state space to incorporate more state
information (e.g., states of other agents). Learning is faster and takes fewer resources than if the
complete state were considered from the start, while being able to handle situations where agents
interfere in pursuing their goals. We illustrate our approach by instantiating a simple version of
such a method, and by showing that it outperforms learning with full state information without
being hindered by the deficiencies of learning on the basis of a single agent's state.
Keywords: multi-agent learning, adaptive learning, Q-learning, coordination.
Online at PubZone.
«
PhD thesis
-
L. Busoniu,
Reinforcement Learning in Continuous State and Action Spaces,
2008, 190 pages, ISBN 978-90-9023754-1.
»
Abstract:
Reinforcement learning (RL) and dynamic programming (DP) algorithms can be used to solve problems in a variety of fields, including automatic control, artificial intelligence, operations research, and economics. These algorithms find an optimal policy, which maximizes a numerical reward signal measuring the performance. DP algorithms require a model of the problem's dynamics, whereas RL algorithms work without a model. Online RL algorithms do not even require data in advance; they learn from experience. However, DP and RL can find exact solutions only when the states and the control actions take values in a small discrete set. In large discrete spaces and in continuous spaces, approximate solutions have to be used. This is the case, e.g., in automatic control, where the states and actions are usually continuous.
This thesis proposes several novel algorithms for approximate RL and DP, which work in problems with continuous variables: fuzzy Q-iteration, online least-squares policy iteration, and cross-entropy policy search. Fuzzy Q-iteration is a DP algorithm that represents the value function (cumulative rewards) using a fuzzy partition of the state space and a discretization of the action space. The value function is used to compute a near-optimal policy. Fuzzy Q-iteration is provably convergent and consistent. Online least-squares policy iteration is an RL algorithm that efficiently learns from experience an approximate value function and a corresponding policy. It updates the value function parameters by solving linear systems of equations. Cross-entropy policy search represents policies using a highly flexible parameterization, and optimizes the parameters with the cross-entropy method. A representative selection of control problems is used to assess the performance of the proposed algorithms. Additionally, the thesis provides an extensive review of the state-of-the-art in approximate DP and RL, and discusses some fundamental open issues in the field.
To obtain a bound hardcopy of the thesis free of charge, please contact me (preferably by email), mentioning your name and address, together with your interest in the thesis.
«
Disclaimer: The following applies to the papers that are directly available for download as PDF files. This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each copyright holder. In most cases, these works may not be reposted without the explicit permission of the copyright holder. Additionally, the following applies to IEEE material: Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE.
|