This page contains a representative selection of my publications, categorized and arranged in reverse chronological order. Use the "»" button to reveal an abstract of each publication, and "«" to hide the abstract again (requires JavaScript).
Books

L. Busoniu, L. Tamas (editors),
Handling Uncertainty and Networked Structure in Robot Control,
Springer, Series Studies in Systems, Decision and Control. February 2016, ISBN 9783319263274.
»
Abstract:
This book focuses on two challenges posed in robot control by the increasing adoption of robots in the everyday human environment: uncertainty and networked communication. Part I of the book describes learning control to address environmental uncertainty. Part II discusses state estimation, active sensing, and complex scenario perception to tackle sensing uncertainty. Part III completes the book with control of networked robots and multi-robot teams.
Each chapter features in-depth technical coverage and case studies highlighting the applicability of the techniques, with real robots or in simulation. Platforms include mobile ground, aerial, and underwater robots, as well as humanoid robots and robot arms.
The text gathers contributions from academic and industry experts, and offers a valuable resource for researchers or graduate students in robot control and perception. It also benefits researchers in related areas, such as computer vision, nonlinear and learning control, and multiagent systems.
See the book's website at http://rocon.utcluj.ro/roboticsbook/ for additional information about the book and how to obtain it, as well as downloadable material.
«

L. Busoniu, R. Babuska, B. De Schutter, D. Ernst,
Reinforcement Learning and Dynamic Programming Using Function Approximators,
CRC Press, Series Automation and Control Engineering. April 2010, 280 pages, ISBN 9781439821084.
»
Abstract:
Reinforcement learning (RL) can optimally solve decision and control problems involving complex dynamic systems, without requiring a mathematical model of the system. If a model is available, dynamic programming (DP), the model-based counterpart of RL, can be used. RL and DP are applicable in a variety of disciplines, including automatic control, artificial intelligence, economics, and medicine. Recent years have seen a surge of interest in RL and DP using compact, approximate representations of the solution, which enable algorithms to scale up to realistic problems.
This book provides an in-depth introduction to RL and DP with function approximators. A concise description of classical RL and DP (Chapter 2) builds the foundation for the remainder of the book. This is followed by an extensive review of the state of the art in RL and DP with approximation, which combines algorithm development with theoretical guarantees, illustrative numerical examples, and insightful comparisons (Chapter 3). Each of the final three chapters (4 to 6) is dedicated to a representative algorithm from the three major classes of methods: value iteration, policy iteration, and policy search. The features and performance of these algorithms are highlighted in extensive experimental studies on a range of control applications.
For graduate students and others new to the field, this book offers a thorough introduction to both the basics and emerging methods. And for those researchers and practitioners working in the fields of optimal and adaptive control, machine learning, artificial intelligence, and operations research, this resource offers a combination of practical algorithms, theoretical analysis, and comprehensive examples that they will be able to adapt and apply to their own work.
Access the book's website at http://rlbook.busoniu.net for additional information about the book and how to obtain it, as well as free access to a sample chapter and other downloadable material, including errata.
«
Journal papers

L. Busoniu, T. de Bruin, D. Tolic, J. Kober, I. Palunko,
Reinforcement Learning for Control: Performance, Stability, and Deep
Approximators.
Annual Reviews in Control,
2018.
Accepted.
»
Abstract: Reinforcement learning (RL) offers powerful algorithms to search for optimal controllers of systems with nonlinear, possibly stochastic dynamics that are unknown or highly uncertain. This review mainly covers artificial-intelligence approaches to RL, from the viewpoint of the control engineer. We explain how approximate representations of the solution make RL feasible for problems with continuous states and control actions. Stability is a central concern in control, and we argue that while the control-theoretic RL subfield called adaptive dynamic programming is dedicated to it, stability of RL largely remains an open question. We also cover in detail the case where deep neural networks are used for approximation, leading to the field of deep RL, which has shown great success in recent years. With the control practitioner in mind, we outline opportunities and pitfalls of deep RL; and we close the survey with an outlook that, among other things, points out some avenues for bridging the gap between control and artificial-intelligence RL techniques.
«

L. Busoniu, E. Pall, R. Munos,
Continuous-action planning for discounted infinite-horizon nonlinear
optimal control with Lipschitz values.
Automatica,
vol. 92,
pages 100–108,
2018.
»
Abstract: We consider discrete-time, infinite-horizon optimal control problems with discounted rewards. The value function must be Lipschitz continuous over action (input) sequences, the actions are in a scalar interval, while the dynamics and rewards can be nonlinear/nonquadratic. Exploiting ideas from artificial intelligence, we propose two optimistic planning methods that perform an adaptive-horizon search over the infinite-dimensional space of action sequences. The first method optimistically refines regions with the largest upper bound on the optimal value, using the Lipschitz constant to find the bounds. The second method simultaneously refines all potentially optimistic regions, without explicitly using the bounds. Our analysis proves convergence rates to the global infinite-horizon optimum for both algorithms, as a function of computation invested and of a measure of problem complexity. It turns out that the second, simultaneous algorithm works nearly as well as the first, despite not needing to know the (usually difficult to find) Lipschitz constant. We provide simulations showing the algorithms are useful in practice, compare them with value iteration and model predictive control, and give a real-time example.
Online at ScienceDirect.
«

K. Mathe, L. Busoniu, R. Munos, B. De Schutter,
Optimistic planning with an adaptive number of action switches for near-optimal nonlinear control.
Engineering Applications of Artificial Intelligence,
vol. 67,
2018.
»
Abstract: We consider infinite-horizon optimal control of nonlinear systems where the control actions are discrete, and focus on optimistic planning algorithms from artificial intelligence, which can handle general nonlinear systems with nonquadratic costs. With the main goal of reducing computations, we introduce two such algorithms that only search for constrained action sequences. The constraint prevents the sequences from switching between different actions more than a limited number of times. We call the first method optimistic switch-limited planning (OSP), and develop analysis showing that its fixed number of switches S leads to polynomial complexity in the search horizon, in contrast to the exponential complexity of the existing OP algorithm for deterministic systems; and to a correspondingly faster convergence towards optimality. Since tuning S is difficult, we introduce an adaptive variant called OASP that automatically adjusts S so as to limit computations while ensuring that near-optimal solutions keep being explored. OSP and OASP are analytically evaluated in representative special cases, and numerically illustrated in simulations of a rotational pendulum. To show that the algorithms also work in challenging applications, OSP is used to control the pendulum in real time, while OASP is applied for trajectory control of a simulated quadrotor.
Online at ScienceDirect.
«

R. Postoyan, L. Busoniu, D. Nesic, J. Daafouz,
Stability Analysis of Discrete-Time Infinite-Horizon Optimal Control with Discounted Cost.
IEEE Transactions on Automatic Control,
vol. 62,
no. 6,
pages 2736–2749,
2017.
»
Abstract: We analyse the stability of general nonlinear discrete-time systems controlled by an optimal sequence of inputs that minimizes an infinite-horizon discounted cost. First, assumptions related to the controllability of the system and its detectability with respect to the stage cost are made. Uniform semiglobal and practical stability of the closed-loop system is then established, where the adjustable parameter is the discount factor. Stronger stability properties are thereupon guaranteed by gradually strengthening the assumptions. Next, we show that the Lyapunov function used to prove stability is continuous under additional conditions, implying that stability has a certain amount of nominal robustness. The presented approach is flexible and we show that robust stability can still be guaranteed when the sequence of inputs applied to the system is no longer optimal but near-optimal. We also analyse stability for cost functions in which the importance of the stage cost increases with time, opposite to discounting. Finally, we exploit stability to derive new relationships between the optimal value functions of the discounted and undiscounted problems, when the latter is well-defined.
Online at IEEEXplore.
«

L. Busoniu, J. Daafouz, M.C. Bragagnolo, C. Morarescu,
Planning for optimal control and performance certification in nonlinear systems with controlled or uncontrolled switches.
Automatica,
vol. 78,
pages 297–308,
2017.
»
Abstract: We consider three problems for discrete-time switched systems with autonomous, general nonlinear modes. The first is optimal control of the switching rule so as to optimize the infinite-horizon discounted cost. The second and third problems occur when the switching rule is uncontrolled, and we seek either the worst-case cost when the rule is unknown, or respectively the expected cost when the rule is stochastic. We use optimistic planning (OP) algorithms that can solve general optimal control with discrete inputs such as switches. We extend the analysis of OP to provide certification (upper and lower) bounds on the optimal, worst-case, or expected costs, as well as to design switching sequences that achieve these bounds in the deterministic case. In this case, since a minimum dwell time between switching instants is often required, we introduce a new OP variant to handle this constraint, and analyze its convergence rate. We provide consistency and closed-loop performance guarantees for the sequences designed, and illustrate that the approach works well in simulations.
Online at ScienceDirect.
«

L. Busoniu, A. Daniels, R. Babuska,
Online Learning for Optimistic Planning.
Engineering Applications of Artificial Intelligence,
vol. 55,
pages 60–72,
2016.
»
Abstract: Markov decision processes are a powerful framework for nonlinear, possibly stochastic optimal control. We consider two existing optimistic planning algorithms to solve them, which originate in artificial intelligence. These algorithms have provable near-optimal performance when the actions and possible stochastic next-states are discrete, but they wastefully discard the planning data after each step. We therefore introduce a method to learn online, from this data, the upper bounds that are used to guide the planning process. Five different approximators for the upper bounds are proposed, one of which is specifically adapted to planning, and the other four coming from the standard toolbox of function approximation. Our analysis characterizes the influence of the approximation error on the performance, and reveals that for small errors, learning-based planning performs better. In detailed experimental studies, learning leads to improved performance with all five representations, and a local variant of support vector machines provides a good compromise between performance and computation.
Online at ScienceDirect.
«

L. Busoniu, R. Postoyan, J. Daafouz,
Near-optimal Strategies for Nonlinear and Uncertain Networked Control Systems.
IEEE Transactions on Automatic Control,
vol. 61,
no. 8,
pages 2124–2139,
2016.
»
Abstract: We consider problems where a controller communicates with a general nonlinear plant via a network, and must optimize a performance index. The system is modeled in discrete time and may be affected by a class of stochastic uncertainties that can take finitely many values. Admissible inputs are constrained to belong to a finite set. Exploiting some optimistic planning algorithms from the artificial intelligence field, we propose two control strategies that take into account the communication constraints induced by the use of the network. Both strategies send in a single packet long-horizon solutions, such as sequences of inputs. Our analysis characterizes the relationship between computation, near-optimality, and transmission intervals. In particular, the first strategy imposes at each transmission a desired near-optimality, which we show is related to an imposed transmission period; for this setting, we analyze the required computation. The second strategy has a fixed computation budget, and within this constraint it adapts the next transmission instant to the last state measurement, leading to a self-triggered policy. For this case, we guarantee long transmission intervals. Examples and simulation experiments are provided throughout the paper.
Online at IEEEXplore.
«

K. Mathe, L. Busoniu,
Vision and Control for UAVs: A Survey of General Methods and of Inexpensive Platforms for Infrastructure Inspection.
Sensors,
vol. 15,
no. 7,
pages 14887–14916,
2015.
»
Abstract: Unmanned aerial vehicles (UAVs) have gained significant attention in recent years. Low-cost platforms using inexpensive sensor payloads have been shown to provide satisfactory flight and navigation capabilities. In this report, we survey vision and control methods that can be applied to low-cost UAVs, and we list some popular inexpensive platforms and application fields where they are useful. We also highlight the sensor suites used where this information is available. We overview, among others, feature detection and tracking, optical flow and visual servoing, low-level stabilization and high-level planning methods. We then list popular low-cost UAVs, selecting mainly quadrotors. We discuss applications, restricting our focus to the field of infrastructure inspection. Finally, as an example, we formulate two use-cases for railway inspection, a less explored application field, and illustrate the usage of the vision and control techniques reviewed by selecting appropriate ones to tackle these use-cases. To select vision methods, we run a thorough set of experimental evaluations.
Online at MDPI.
«

L. Busoniu, C. Morarescu,
Topology-Preserving Flocking of Nonlinear Agents Using Optimistic Planning.
Control Theory and Technology,
vol. 13,
no. 1,
pages 70–81,
2015.
»
Abstract: We consider the generalized flocking problem in multiagent systems, where the agents must drive a subset of their state variables to common values, while communication is constrained by a proximity relationship in terms of another subset of variables. We build a flocking method for general nonlinear agent dynamics, by using at each agent a near-optimal control technique from artificial intelligence called optimistic planning. By defining the rewards to be optimized in a well-chosen way, the preservation of the interconnection topology is guaranteed, under a controllability assumption. We also give a practical variant of the algorithm that does not require knowing the details of this assumption, and show that it works well in experiments on nonlinear agents.
Online at CTT.
«

L. Busoniu, C. Morarescu,
Consensus for Black-Box Nonlinear Agents Using Optimistic Optimization.
Automatica,
vol. 50,
no. 4,
pages 1201–1208,
2014.
»
Abstract: An important problem in multiagent systems is consensus, which requires the agents to agree on certain controlled variables of interest. We focus on the challenge of dealing in a generic way with nonlinear agent dynamics, represented as a black box with unknown mathematical form. Our approach designs a reference behavior with a classical consensus method. The main novelty is using optimistic optimization (OO) to find controls that closely follow the reference behavior. The first advantage of OO is that it only needs to sample the black-box model of the agent, and so achieves our goal of handling unknown nonlinearities. Secondly, a tight relationship is guaranteed between computation invested and closeness to the reference behavior. Our main results exploit these properties to prove practical consensus. An analysis of representative examples builds additional insight and shows that in some nontrivial problems the optimization is easy to solve by OO. Simulations on these examples accompany the analysis.
Online at ScienceDirect.
«

I. Grondman, L. Busoniu, G. Lopes, R. Babuska,
A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients.
IEEE Transactions on Systems, Man, and Cybernetics—Part C: Applications and Reviews,
vol. 42,
no. 6,
pages 1291–1307,
2012.
»
Abstract: Policy-gradient-based actor-critic algorithms are amongst the most popular algorithms in the reinforcement learning framework. Their advantage of being able to search for optimal policies using low-variance gradient estimates has made them useful in several real-life applications, such as robotics, power control, and finance. Although general surveys on reinforcement learning techniques already exist, no survey is specifically dedicated to actor-critic algorithms in particular. This paper, therefore, describes the state of the art of actor-critic algorithms, with a focus on methods that can work in an online setting and use function approximation in order to deal with continuous state and action spaces. After starting with a discussion on the concepts of reinforcement learning and the origins of actor-critic algorithms, this paper describes the workings of the natural gradient, which has made its way into many actor-critic algorithms over the past few years. A review of several standard and natural actor-critic algorithms is given, and the paper concludes with an overview of application areas and a discussion on open issues.
Online at IEEEXplore.
«

I. Grondman, M. Vaandrager, L. Busoniu, R. Babuska, E. Schuitema,
Efficient Model Learning Methods for Actor-Critic Control.
IEEE Transactions on Systems, Man, and Cybernetics—Part B: Cybernetics,
vol. 42,
no. 3,
pages 591–602,
2012.
»
Abstract: We propose two new actor–critic algorithms for reinforcement learning. Both algorithms use local linear regression
(LLR) to learn approximations of the functions involved. A crucial feature of the algorithms is that they also learn a process
model, and this, in combination with LLR, provides an efficient policy update for faster learning. The first algorithm uses a novel model-based update rule for the actor parameters. The second algorithm does not use an explicit actor but learns a reference model which represents a desired behavior, from which desired control actions can be calculated using the inverse of the learned process model. The two novel methods and a standard actor–critic algorithm are applied to the pendulum swing-up problem, in which the novel methods achieve faster learning than the standard algorithm.
Online at IEEEXplore.
«

S. Adam, L. Busoniu, R. Babuska,
Experience Replay for Real-Time Reinforcement Learning Control.
IEEE Transactions on Systems, Man, and Cybernetics—Part C: Applications and Reviews,
vol. 42,
no. 2,
pages 201–212,
2012.
»
Abstract: Reinforcement learning (RL) algorithms can automatically learn optimal control strategies for nonlinear, possibly stochastic
systems. A promising approach for RL control is experience replay (ER), which quickly learns from a limited amount of data by
repeatedly presenting these data to an underlying RL algorithm. Despite its benefits, ER RL has been studied only sporadically
in the literature, and its applications have largely been confined to simulated systems. Therefore, in this paper we evaluate
ER RL on real-time control experiments involving a pendulum swing-up problem and the vision-based control of a goalkeeper
robot. These real-time experiments are complemented by simulation studies and comparisons with traditional RL. As a
preliminary, we develop a general ER framework that can be combined with essentially any incremental RL technique, and
instantiate this framework for the approximate Q-learning and SARSA algorithms. The successful real-time learning results
presented here are highly encouraging for the applicability of ER RL in practice.
Online at IEEEXplore.
«

L. Busoniu, D. Ernst, B. De Schutter, R. Babuska,
Cross-Entropy Optimization of Control Policies with Adaptive Basis Functions.
IEEE Transactions on Systems, Man, and Cybernetics—Part B: Cybernetics,
vol. 41,
no. 1,
pages 196–209,
2011.
»
Abstract: This paper introduces an algorithm for direct search of control policies in continuous-state,
discrete-action Markov decision processes. The algorithm looks for the best closed-loop policy
that can be represented using a given number of basis functions (BFs), where a discrete action
is assigned to each BF. The type of the BFs and their number are specified in advance and
determine the complexity of the representation. Considerable flexibility is achieved by
optimizing the locations and shapes of the BFs, together with the action assignments. The
optimization is carried out with the cross-entropy method and evaluates the policies by their
empirical return from a representative set of initial states. The return for each representative
state is estimated using Monte Carlo simulations. The resulting algorithm for cross-entropy policy
search with adaptive BFs is extensively evaluated in problems with two to six state variables, for
which it reliably obtains good policies with only a small number of BFs. In these experiments,
cross-entropy policy search requires vastly fewer BFs than value-function techniques with
equidistant BFs, and outperforms policy search with a competing optimization algorithm called DIRECT.
Online at IEEEXplore.
«

L. Busoniu, D. Ernst, B. De Schutter, R. Babuska,
Approximate Dynamic Programming with a Fuzzy Parametrization.
Automatica,
vol. 46,
no. 5,
pages 804–814,
2010.
»
Abstract: Dynamic programming (DP) is a powerful paradigm for general, nonlinear optimal control.
Computing exact DP solutions is in general only possible when the process states and the
control actions take values in a small discrete set. In practice, it is necessary to approximate
the solutions. Therefore, we propose an algorithm for approximate DP that relies on a fuzzy
partition of the state space, and on a discretization of the action space. This
fuzzy Q-iteration algorithm works for deterministic processes, under the discounted
return criterion. We prove that fuzzy Q-iteration asymptotically converges to a solution that
lies within a bound of the optimal solution. A bound on the suboptimality of the solution obtained
in a finite number of iterations is also derived. Under continuity assumptions on the dynamics
and on the reward function, we show that fuzzy Q-iteration is consistent, i.e., that it
asymptotically obtains the optimal solution as the approximation accuracy increases. These
properties hold both when the parameters of the approximator are updated in a synchronous
fashion, and when they are updated asynchronously. The asynchronous algorithm is proven to
converge at least as fast as the synchronous one. The performance of fuzzy Q-iteration is
illustrated in a two-link manipulator control problem.
Online at ScienceDirect.
«

L. Busoniu, R. Babuska, B. De Schutter,
A Comprehensive Survey of Multi-Agent Reinforcement Learning.
IEEE Transactions on Systems, Man, and Cybernetics — Part C: Applications and Reviews,
vol. 38,
no. 2,
pages 156–172,
2008.
Recipient of the 2009 Andrew P. Sage Award for the best paper published annually in the IEEE Transactions on Systems, Man
and Cybernetics.
»
Abstract: Multiagent systems are rapidly finding applications in a variety of domains, including robotics,
distributed control, telecommunications, and economics. The complexity of many tasks arising
in these domains makes them difficult to solve with preprogrammed agent behaviors. The agents
must instead discover a solution on their own, using learning. A significant part of the research
on multiagent learning concerns reinforcement learning techniques. This paper provides a
comprehensive survey of multiagent reinforcement learning (MARL). A central issue in the field
is the formal statement of the multiagent learning goal. Different viewpoints on this issue have
led to the proposal of many different goals, among which two focal points can be distinguished:
stability of the agents' learning dynamics, and adaptation to the changing behavior of the other agents.
The MARL algorithms described in the literature aim—either explicitly or implicitly—at one of these
two goals or at a combination of both, in a fully cooperative, fully competitive, or more general setting.
A representative selection of these algorithms is discussed in detail in this paper, together with the
specific issues that arise in each category. Additionally, the benefits and challenges of MARL are
described along with some of the problem domains where MARL techniques have been applied. Finally,
an outlook for the field is provided.
Keywords: multiagent systems,
reinforcement learning, game theory, distributed control.
This is an extended and revised version of the ICARCV06 MARL paper.
Online at IEEEXplore.
«
Contributions to books

M. Bragagnolo, C. Morarescu, L. Busoniu, P. Riedinger,
Decentralized Formation Control in Fleets of Nonholonomic Robots with a Clustered Pattern.
In Handling Uncertainty and Networked Structure in Robot Control,
L. Busoniu, L. Tamas, Editors,
pages 313–333.
Springer,
2016.

E. Pall, L. Tamas, L. Busoniu,
Vision-Based Quadcopter Navigation in Structured Environments.
In Handling Uncertainty and Networked Structure in Robot Control,
L. Busoniu, L. Tamas, Editors,
pages 265–290.
Springer,
2016.
»
Abstract: Quadcopters are small-sized aerial vehicles with four fixed-pitch propellers. These robots have great potential since they are inexpensive with affordable hardware, and with appropriate software solutions they can accomplish assignments autonomously. They could perform daily tasks in the future, such as package deliveries, inspections, and rescue missions. In this chapter, after an extensive introduction to object recognition and tracking, we present an approach for vision-based autonomous flying of an unmanned quadcopter in various structured environments, such as hallway-like scenes. The desired flight direction is obtained visually, based on perspective clues, in particular the vanishing point. This point is the intersection of parallel lines viewed in perspective, and is sought on the front camera image. For stable guidance, the position of the vanishing point is filtered with different types of probabilistic filters, such as the linear Kalman filter, extended Kalman filter, unscented Kalman filter, and particle filter. These are compared in terms of tracking error and computational time. A switching control method is implemented. Each of the modes focuses on controlling only one state variable at a time and the objective is to center the vanishing point on the image. The selected filtering and control methods are tested successfully, both in simulation and in real indoor and outdoor environments.
Online at SpringerLink.
«

L. Busoniu, R. Munos, R. Babuska,
A Review of Optimistic Planning in Markov Decision Processes.
In Reinforcement Learning and Adaptive Dynamic Programming for Feedback Control,
F. Lewis, D. Liu, Editors,
series Computational Intelligence,
pages 494–516.
Wiley,
2012.
»
Abstract: We review a class of online planning algorithms for deterministic and stochastic optimal control problems, modeled as Markov decision processes. At each discrete time step, these algorithms maximize the predicted value of planning policies from the current state, and apply the first action of the best policy found. An overall receding-horizon algorithm results, which can also be seen as a type of model-predictive control. The space of planning policies is explored optimistically, focusing on areas with largest upper bounds on the value – or upper confidence bounds, in the stochastic case. The resulting optimistic planning framework integrates several types of optimism previously used in planning, optimization, and reinforcement learning, in order to obtain several intuitive algorithms with good performance guarantees. We describe in detail three recent such algorithms, outline the theoretical guarantees on their performance, and illustrate their behavior in a numerical example.
Online at Wiley Online Library.
«

L. Busoniu, A. Lazaric, M. Ghavamzadeh, R. Munos, R. Babuska, B. De Schutter,
Least-Squares Methods for Policy Iteration.
In Reinforcement Learning: State of the Art,
M. Wiering, M. van Otterlo, Editors,
series Adaptation, Learning, and Optimization,
no. 12,
pages 75–109.
Springer,
2012.
»
Abstract: Approximate reinforcement learning deals with the essential problem of applying reinforcement learning in large and continuous state-action spaces, by using function approximators to represent the solution. This chapter reviews least-squares methods for policy iteration, an important class of algorithms for approximate reinforcement learning. We discuss three techniques for solving the core, policy evaluation component of policy iteration, called: least-squares temporal difference, least-squares policy evaluation, and Bellman residual minimization. We introduce these techniques starting from their general mathematical principles and detailing them down to fully specified algorithms. We pay attention to online variants of policy iteration, and provide a numerical example highlighting the behavior of representative offline and online methods. For the policy evaluation component as well as for the overall resulting approximate policy iteration, we provide guarantees on the performance obtained asymptotically, as the number of samples processed and iterations executed grows to infinity. We also provide finite-sample results, which apply when a finite number of samples and iterations are considered. Finally, we outline several extensions and improvements to the techniques and methods reviewed.
Online at SpringerLink.
«

L. Busoniu, R. Babuska, B. De Schutter,
Multi-Agent Reinforcement Learning: An Overview.
In Innovations in Multi-Agent Systems and Applications,
D. Srinivasan, L. Jain, Editors,
series Studies in Computational Intelligence,
no. 310,
pages 183–221.
Springer,
2010.
»
Abstract: Multiagent systems can be used to address problems in a variety of domains, including robotics, distributed control,
telecommunications, and economics. The complexity of many tasks arising in these domains makes them difficult
to solve with preprogrammed agent behaviors. The agents must instead discover a solution on their own,
using learning. A significant part of the research on multiagent learning concerns reinforcement learning
techniques. This chapter reviews a representative selection of MARL algorithms for fully cooperative,
fully competitive, and more general (neither cooperative nor competitive) tasks. The benefits and challenges
of MARL are described. A central challenge in the field is the formal statement of a multiagent
learning goal; this chapter reviews the learning goals proposed in the literature. The problem domains where
MARL techniques have been applied are briefly discussed. Several MARL algorithms are applied to an illustrative example
involving the coordinated transportation of an object by two cooperative robots. In an outlook for the MARL field,
a set of important open issues are identified, and promising research directions to address these issues are outlined.
The code used in the example is available for download, as part of the MARL toolbox in the Repository section.
This is an extended and revised version of the SMC 2008 paper above.
Online at SpringerLink.
«

L. Busoniu, B. De Schutter, R. Babuska,
Approximate Dynamic Programming and Reinforcement Learning.
In Interactive Collaborative Information Systems,
R. Babuska, F.C.A. Groen, Editors,
series Studies in Computational Intelligence,
no. 281,
pages 3–44.
Springer,
2010.
»
Abstract: Dynamic programming (DP) and reinforcement learning (RL) can be used to address problems from a variety of fields, including automatic control,
artificial intelligence, operations research, and economics. Many problems in these fields are described
by continuous variables, whereas DP and RL can find exact solutions only in the discrete case.
Therefore, approximation is essential in practical DP and RL. This chapter provides an in-depth
review of the literature on approximate DP and RL in large or continuous-space, infinite-horizon
problems. Value iteration, policy iteration, and policy search approaches are presented in turn.
Model-based (DP) as well as online and batch model-free (RL) algorithms are discussed. We review
theoretical guarantees on the approximate solutions produced by these algorithms. Numerical
examples illustrate the behavior of several representative algorithms in practice. Techniques
to automatically derive value function approximators are discussed, and a comparison between
value iteration, policy iteration, and policy search is provided. The chapter closes with a
discussion of open issues and promising research directions in approximate DP and RL.
Online at SpringerLink.
«

L. Busoniu, D. Ernst, B. De Schutter, R. Babuska,
Continuous-State Reinforcement Learning with Fuzzy Approximation.
In Adaptive Agents and Multi-Agent Systems III,
K. Tuyls, A. Nowe, Z. Guessoum, D. Kudenko, Editors,
series Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence),
vol. 4865,
pages 27–43.
Springer,
2008.
»
Abstract: Reinforcement learning (RL) is a widely used learning paradigm for adaptive agents. There exist several
convergent and consistent RL algorithms which have been intensively studied. In their original form, these algorithms
require that the environment states and agent actions take values in a relatively small discrete set.
Fuzzy representations for approximate, model-free RL have been proposed in the literature for the
more difficult case where the state-action space is continuous. In this work, we propose a fuzzy
approximation architecture similar to those previously used for Q-learning, but we combine it with
the model-based Q-value iteration algorithm. We prove that the resulting algorithm converges. We also
give a modified, asynchronous variant of the algorithm that converges at least as fast as the original
version. An illustrative simulation example is provided.
This is an extended and revised version of the ALAMAS07 paper.
Online at SpringerLink.
«
Conference papers

M. Granzotto, R. Postoyan, L. Busoniu, D. Nesic, J. Daafouz,
Stability analysis of discrete-time finite-horizon discounted optimal control.
Accepted at
57th IEEE Conference on Decision and Control (CDC18),
Miami, USA,
17–19 December
2018.
»
Abstract: Discounted costs are considered in many fields,
like reinforcement learning, for which various algorithms can
be used to obtain optimal inputs for finite horizons. The related
literature mostly concentrates on optimality and largely ignores
stability. In this context, we study stability of general nonlinear
discrete-time systems controlled by an optimal sequence of
inputs that minimizes a finite-horizon discounted cost computed
in a receding horizon fashion. Assumptions are made related to
the stabilizability of the system and its detectability with respect
to the stage cost. Then, a Lyapunov function for the closed-loop
system with the receding horizon controller is constructed
and a uniform semiglobal stability property is ensured, where
the adjustable parameters are both the discount factor and
the horizon length. Uniform global exponential stability is
guaranteed by strengthening the initial assumptions, in which
case explicit bounds on the discount factor and the horizon
length are provided. We compare the obtained bounds in the
particular cases where there is no discount or the horizon is
infinite, respectively, with related results in the literature and
we show our bounds improve existing ones on the examples
considered.
«

C. Morarescu, V. Varma, L. Busoniu, S. Lasaulce,
Space-time budget allocation for marketing over social networks.
In
Proceedings IFAC Conference on Analysis and Design of Hybrid Systems
(ADHS18),
pages 211–216,
Oxford, UK,
11–13 July
2018.
»
Abstract: We address formally the problem of opinion dynamics when the agents of a social network (e.g., consumers) are not only influenced by their neighbors but also by an external influential entity referred to as a marketer. The influential entity tries to sway the overall opinion to its own side by using a specific influence budget during discrete-time advertising campaigns; consequently, the overall closed-loop dynamics becomes a linear-impulsive (hybrid) one. The main technical issue addressed is finding how the marketer should allocate its budget over time (through marketing campaigns) and over space (among the agents) such that the agents’ opinion is as close as possible to a desired opinion; for instance, the marketer may prioritize certain agents over others based on their influence in the social graph. The corresponding space-time allocation problem is formulated and solved for several special cases of practical interest. Valuable insights can be extracted from our analysis. For instance, for most cases we prove that the marketer has an interest in investing most of its budget at the beginning of the process and that budget should be shared among agents according to the famous water-filling allocation rule. Numerical examples illustrate the analysis.
Online at ScienceDirect.
«

G. Feng, L. Busoniu, T.M. Guerra, S. Mohammad,
Reinforcement Learning for Energy Optimization Under Human Fatigue Constraints of Power-Assisted Wheelchairs.
In
IEEE American Control Conference (ACC18),
Milwaukee, USA,
27–29 June
2018.
»
Abstract: In the last decade, Power-Assisted Wheelchairs (PAWs) have been widely used for improving the mobility of disabled persons. The main advantage of PAWs is that users can maintain a suitable level of physical activity. Moreover, the metabolic-electrical energy hybridization of PAWs provides more flexibility for optimal control design. In this context, we propose an optimal control for minimizing the electrical energy consumption under human fatigue constraints, including a human fatigue model. The electrical motor has to cooperate with the user over a given distance-to-go. As the human fatigue model is unknown in practice, we use model-free Policy Gradient methods to directly learn controllers for a given driving task. We verify that the model-free solution is near-optimal by computing the model-based controller, which is generated by Approximate Dynamic Programming. Simulation results confirm that the model-free Policy Gradient method provides near-optimal solutions.
Online at IEEEXplore.
«

C. Iuga, P. Dragan, L. Busoniu,
Fall monitoring and detection for at-risk persons using a UAV.
In
IFAC Conference on Embedded Systems, Computational Intelligence
and Telematics in Control (CESCIT18),
Faro, Portugal,
6–8 June
2018.
»
Abstract: We describe a demonstrator application that uses a UAV to monitor and detect falls of an at-risk person. The position and state (upright or fallen) of the person are determined with deep-learning-based computer vision, where existing network weights are used for position detection, while for fall detection the last layer is fine-tuned in additional training. A simple visual servoing control strategy keeps the person in view of the drone, and maintains the drone at a set distance from the person. In experiments, falls were reliably detected, and the algorithm was able to successfully track the person indoors.
Online at ScienceDirect.
«

G. Feng, T.M. Guerra, S. Mohammad, L. Busoniu,
Observer-Based Assistive Control Design Under Time-Varying Sampling
for Power-Assisted Wheelchairs.
In
IFAC Conference on Embedded Systems, Computational Intelligence
and Telematics in Control (CESCIT18),
Faro, Portugal,
6–8 June
2018.
»
Abstract: Compared to manual wheelchairs and fully electric powered wheelchairs, power-assisted wheelchairs (PAWs) provide a special structure where the human can use her/his propulsion to interact with the assistive system. In this context, different studies have focused on the assistive control of PAWs in recent years. This paper presents an observer-based assistive control design using only position encoders. With a time-varying sampling induced by these position encoders, the wheelchair is described by a discrete-time Linear Parameter Varying model. Based on a Takagi-Sugeno (TS) representation, an observer is designed by using LMI techniques. According to the estimated human torques, we use the frequencies with which the wheels are pushed to compute the reference velocity of the centre of gravity. The wheelchair turns with a constant yaw velocity when one of two wheels is braked by the human. Reference tracking is accomplished by a PI controller. Simulation results confirm that the proposed assistive control algorithm provides good maneuverability for users to control the velocity of the centre of gravity and the yaw velocity of the wheelchair.
Online at ScienceDirect.
«

G. Feng, T.M. Guerra, L. Busoniu, S. Mohammad,
Unknown input observer in descriptor form via LMIs for power-assisted
wheelchairs.
In
36th Chinese Control Conference (CCC17),
Dalian, China,
26–28 July
2017.
»
Abstract: Power-assisted wheelchairs (PAWs) provide an efficient means of transport for disabled persons. In this human-machine interaction, the human-applied torque is a crucial variable to implement the assistive system. The present paper describes a novel scheme to design PAWs without torque sensors. Instead of using a torque sensor, a discrete-time unknown input observer in descriptor form is applied to estimate the human input torque and the angular velocities of the two wheels via the angular position. Using Finsler's lemma, the observer gains are obtained by solving an LMI problem. Based on the estimation, both a torque-assistance system and a speed controller are introduced. In addition, the Input-to-State Stability (ISS) of the interconnected controller-observer system is analysed for the speed controller. Finally, simulation results validate the observer and the power-assisted algorithms. The methodology follows patent WO2015173094 issued in 2015.
Online at IEEEXplore.
«

J. Xu, L. Busoniu, B. De Schutter,
Near-Optimal Control with Adaptive Receding Horizon for Discrete-Time
Piecewise Affine Systems.
In
Proceedings 20th IFAC World Congress (IFAC17),
pages 4168–4173,
Toulouse, France,
9–14 July
2017.
»
Abstract: We consider the infinite-horizon optimal control of discrete-time, Lipschitz continuous piecewise affine systems with a single input. Stage costs are discounted, bounded, and use a 1- or infinity-norm. Rather than using the usual fixed-horizon approach from model-predictive control, we tailor an adaptive-horizon method called optimistic planning for continuous actions (OPC) to solve the piecewise affine control problem in receding horizon. The main advantage is the ability to solve problems requiring arbitrarily long horizons. Furthermore, we introduce a novel extension that provides guarantees on the closed-loop performance, by reusing data (learning) across different steps. This extension is general and works for a large class of nonlinear dynamics. In experiments with piecewise affine systems, OPC improves performance compared to a fixed-horizon approach, while the data-reuse approach yields further improvements.
Online at ScienceDirect.
«

S. Sabau, I.C. Morarescu, L. Busoniu, A. Jadbabaie,
Decoupled-Dynamics Distributed Control for Strings of Nonlinear Autonomous Agents.
Accepted at
IEEE American Control Conference (ACC17),
Seattle, USA,
24–26 May
2017.
»
Abstract: We introduce a novel distributed control architecture
for a class of nonlinear dynamical agents moving in the
'string' formation, while guaranteeing trajectory tracking and
collision avoidance. Each autonomous agent uses information
and relative measurements only with respect to its predecessor
in the string. The performance of the scheme is entirely
scalable with respect to the number of agents in formation.
The scalability is a consequence of the “decoupling” of a certain
bounded approximation of the closed-loop equations, entailing
that individual, local analyses of the closed-loop stability at
each agent will in turn guarantee the aggregated stability
of the entire formation. An efficient, practical method for
compensating communication-induced delays is also presented.
Online at IEEEXplore.
«

J. Ben Rejeb, L. Busoniu, I.C. Morarescu, J. Daafouz,
Near-Optimal Control of Nonlinear Switched Systems with Non-Cooperative Switching Rules.
In
IEEE American Control Conference (ACC17),
Seattle, USA,
24–26 May
2017.
»
Abstract: This paper presents a predictive planning algorithm for nonlinear switched systems where there are two switching signals, one controlled and the other uncontrolled, both subject to constraints on the dwell time after a switch. The algorithm solves a minimax problem where the controlled signal is chosen to optimize a discounted sum of rewards, while taking into account the worst possible uncontrolled switches. It is an extension of a classical minimax search method, so we call it optimistic minimax search with dwell time constraints, OMSdelta. For any combination of dwell times, OMSdelta returns a sequence of switches that is provably near-optimal, and can be applied in receding horizon for closed-loop control. For the case when the two dwell times are the same, we provide a convergence rate to the minimax optimum as a function of the computation invested, modulated by a measure of problem complexity. We show how the framework can be used to model switched systems with time delays on the control channel, and provide an illustrative simulation for such a system with nonlinear modes.
Online at IEEEXplore.
«

E. Pall, L. Tamas, L. Busoniu,
Analysis and a Home Assistance Application of Online AEMS2 Planning.
In
Proceedings IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS16),
Daejeon, Korea,
9–14 October
2016.
»
Abstract: We consider an online planning algorithm for partially observable Markov decision processes (POMDPs), called Anytime Error Minimization Search 2 (AEMS2). Despite the considerable success it has enjoyed in robotics and other problems, no quantitative analysis exists of the relationship between its near-optimality and the computation invested. Exploiting ideas from fully observable MDP planning, we provide here such an analysis, in which the relationship is modulated via a measure of problem complexity called near-optimality exponent. We illustrate the exponent for some interesting POMDP structures, and examine the role of the informative heuristics used by AEMS2 in the guarantees. In the second part of the paper, we introduce a domestic assistance problem in which a robot monitors partially observable switches and turns them off if needed. AEMS2 successfully solves this task in real experiments, and also works better than several state-of-the-art planners in simulation comparisons.
Online at IEEEXplore.
«

J. Xu, T. van den Boom, L. Busoniu, B. De Schutter,
Model Predictive Control for Continuous Piecewise Affine Systems Using Optimistic Optimization.
In
Proceedings IEEE American Control Conference (ACC16),
Boston, USA,
6–8 July
2016.
»
Abstract: This paper considers model predictive control for continuous piecewise affine (PWA) systems. In general, this leads to a nonlinear, nonconvex optimization problem. We introduce an approach based on optimistic optimization to solve the resulting optimization problem. Optimistic optimization is based on recursive partitioning of the feasible set and is characterized by an efficient exploration strategy seeking the optimal solution. The advantage of optimistic optimization is that one can guarantee bounds on the suboptimality with respect to the global optimum for a given computational budget. The 1-norm and infinity-norm objective functions often considered in model predictive control for continuous PWA systems are continuous PWA functions. We derive expressions for the core parameters required by optimistic optimization for the resulting optimization problem. By applying optimistic optimization, a sequence of control inputs is designed satisfying linear constraints. A bound on the suboptimality of the returned solution is also discussed. The performance of the proposed approach is illustrated with a case study on adaptive cruise control.
«

L. Busoniu, E. Pall, R. Munos,
Discounted Near-Optimal Control of General Continuous-Action Nonlinear Systems Using Optimistic Planning.
In
Proceedings IEEE American Control Conference (ACC16),
Boston, USA,
6–8 July
2016.
»
Abstract: We propose an optimistic planning method to search for near-optimal sequences of actions in discrete-time, infinite-horizon optimal control problems with discounted rewards. The dynamics are general nonlinear, while the action (input) is scalar and compact. The method works by iteratively splitting the infinite-dimensional search space into hyperboxes. Under appropriate conditions on the dynamics and rewards, we analyze the shrinking rate of the range of possible values in each box. When coupled with a measure of problem complexity, this leads to an overall convergence rate of the algorithm to the infinite-horizon optimum, as a function of computation invested. We provide simulation results showing that the algorithm is useful in practice, and comparing it with two alternative planning methods.
«

K. Mathe, L. Busoniu, L. Barabas, L. Miclea, J. Braband, C. Iuga,
Vision-Based Control of a Quadrotor for an Object Inspection Scenario.
In
Proceedings 2016 International Conference on Unmanned Aircraft Systems (ICUAS16),
pages 849–857,
Arlington, USA,
7–10 June
2016.
»
Abstract: Unmanned aerial vehicles (UAVs) have gained special attention in recent years, among others in monitoring and inspection applications. In this paper, a less explored application field is proposed, railway inspection, where UAVs can be used to perform visual inspection tasks such as semaphore, catenary, or track inspection. We focus on lightweight UAVs, which can detect many events in railways (for example missing indicators or cabling, or obstacles on the tracks). An outdoor scenario is developed where a quadrotor visually detects a railway semaphore and flies around it autonomously, recording a video of it for offline postprocessing. For these tasks, we exploit object detection methods from the literature, and develop a visual servoing technique. Additionally, we perform a thorough comparison of several object detection approaches before selecting a preferred method. Then, we show the performance of the presented filtering solutions when they are used in servoing, and conclude our experiments by evaluating real outdoor flight trajectories using an AR.Drone 2.0 quadrotor.
Online at IEEEXplore.
«

J. Xu, L. Busoniu, T. van den Boom, B. De Schutter,
Receding-Horizon Control for Max-Plus Linear Systems with Discrete Actions Using Optimistic Planning.
In
Proceedings 13th International Workshop on Discrete Event Systems (WODES16),
pages 398–403,
Xi'an, China,
30 May – 1 June
2016.
»
Abstract: This paper addresses the infinite-horizon optimal control problem for max-plus linear systems where the considered objective function is a sum of discounted stage costs over an infinite horizon. The minimization problem of the cost function is equivalently transformed into a maximization problem of a reward function. The resulting optimal control problem is solved based on an optimistic planning algorithm. The control variables are the increments of system inputs and the action space is discretized as a finite set. Given a finite computational budget, a control sequence is returned by the optimistic planning algorithm. The first control action or a subsequence of the returned control sequence is applied to the system and then a receding-horizon scheme is adopted. The proposed optimistic planning approach allows us to limit the computational budget and also yields a characterization of the level of near-optimality of the resulting solution. The effectiveness of the approach is illustrated with a numerical example. The results show that the optimistic planning approach results in a lower tracking error compared with a finite-horizon approach when a subsequence of the returned control sequence is applied.
Online at IEEEXplore.
«

L. Busoniu, M.C. Bragagnolo, J. Daafouz, C. Morarescu,
Planning Methods for the Optimal Control and Performance Certification of General Nonlinear Switched Systems.
In
Proceedings 54th IEEE Conference on Decision and Control (CDC15),
Osaka, Japan,
15–18 December
2015.
»
Abstract: We consider two problems for discrete-time switched systems with autonomous, general nonlinear modes. The first is optimal control of the switches so as to minimize the discounted infinite-horizon sum of the costs. The second problem occurs when switches are a disturbance, and the worst-case cost under any sequence of switches is sought. We use an optimistic planning (OP) algorithm that can solve general optimal control with discrete inputs such as switches. We extend the analysis of OP to provide sequences of switches with certification (upper and lower) bounds on the optimal and worst-case costs, and to characterize the convergence rate of the gap between these bounds. Since a minimum dwell time between switches must often be ensured, we introduce a new optimistic planning variant that can handle this case, and analyze its convergence rate. Simulations for linear and nonlinear modes illustrate that the approach works in practice.
Online at IEEEXplore.
«

T. Wensveen, L. Busoniu, R. Babuska,
Real-Time Optimistic Planning with Action Sequences.
In
Proceedings 20th International Conference on Control Systems and Computer Science (CSCS15),
pages 923–930,
Bucharest, Romania,
27–29 May
2015.
»
Abstract: Optimistic planning (OP) is a promising approach
for receding-horizon optimal control of general nonlinear systems.
This generality comes however at large computational costs,
which so far have prevented the application of OP to the control
of nonlinear physical systems in real time. We therefore introduce
an extension of OP to real-time control, which applies open-loop
sequences of actions in parallel with finding the next sequence
from the predicted state at the end of the current sequence.
Exploiting OP guarantees, we provide conditions under which
the algorithm is provably feasible in real time, and we analyze
its performance. We report successful real-time experiments for
the swing-up of an inverted pendulum, as well as simulation results
for an acrobot, where the impact of model errors is studied.
Online at IEEEXplore.
«

R. Postoyan, L. Busoniu, D. Nesic, J. Daafouz,
Stability of Infinite-Horizon Optimal Control with Discounted Cost.
In
Proceedings 53rd IEEE Conference on Decision and Control (CDC14),
pages 3903–3908,
Los Angeles, USA,
15–17 December
2014.
»
Abstract: We investigate the stability of general nonlinear
discrete-time systems controlled by an optimal sequence of
inputs that minimizes an infinite-horizon discounted cost. We
first provide conditions under which a global asymptotic stability
property is ensured for the corresponding undiscounted
problem. We then show that this property is semiglobally
and practically preserved in the discounted case, where the
adjustable parameter is the discount factor. We then focus on
a scenario where the stage cost is bounded and we explain
how our framework applies to guarantee stability in this case.
Finally, we provide sufficient conditions, including boundedness
of the stage cost, under which the value function, which serves
as a Lyapunov function for the analysis, is continuous. As
already shown in the literature, the continuity of the Lyapunov
function is crucial to ensure some nominal robustness for the
closed-loop system.
Online at IEEEXplore.
«

K. Mathe, L. Busoniu, R. Munos, B. De Schutter,
Optimistic Planning with a Limited Number of Action Switches for Near-Optimal Nonlinear Control.
In
Proceedings 53rd IEEE Conference on Decision and Control (CDC14),
pages 3518–3523,
Los Angeles, USA,
15–17 December
2014.
»
Abstract: We consider infinite-horizon optimal control of
nonlinear systems where the actions (inputs) are discrete.
With the goal of limiting computations, we introduce a search
algorithm for action sequences constrained to switch at most
a given number of times between different actions. The new
algorithm belongs to the optimistic planning class originating
in artificial intelligence, and is called optimistic switch-limited
planning (OSP). It inherits the generality of the OP class, so
it works for nonlinear, nonsmooth systems with nonquadratic
costs. We develop analysis showing that the switch constraint
leads to polynomial complexity in the search horizon, in
contrast to the exponential complexity of state-of-the-art OP;
and to a correspondingly faster convergence. The degree of
the polynomial varies with the problem and is a meaningful
measure for the difficulty of solving it. We study this degree
in two representative, opposite cases. In simulations we first
apply OSP to a problem where limited-switch sequences are
near-optimal, and then in a networked control setting where
the switch constraint must be satisfied in closed loop.
Online at IEEEXplore.
«

L. Busoniu, R. Munos, E. Pall,
An Analysis of Optimistic, Best-First Search for Minimax Sequential Decision Making.
In
Proceedings IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL14),
pages 1–8,
Orlando, USA,
10–13 December
2014.
»
Abstract: We consider problems in which a maximizer and a minimizer agent take actions in turn, such as games or optimal control with uncertainty modeled as an opponent. We extend the ideas of optimistic optimization to this setting, obtaining a search algorithm that has been previously considered as the best-first search variant of the B* method. We provide a novel analysis of the algorithm relying on a certain structure for the values of action sequences, under which earlier actions are more important than later ones. An asymptotic branching factor is defined as a measure of problem complexity, and it is used to characterize the relationship between computation invested and near-optimality. In particular, when action importance decreases exponentially, convergence rates are obtained. Throughout, examples illustrate analytical concepts such as the branching factor. In an empirical study, we compare the optimistic best-first algorithm with two classical game tree search methods, and apply it to a challenging HIV infection control problem.
Online at IEEEXplore.
«

L. Busoniu, L. Tamas,
Optimistic Planning for the Near-Optimal Control of General Nonlinear Systems with Continuous Transition Distributions.
In
Proceedings 19th IFAC World Congress (IFAC14),
pages 1910–1915,
Cape Town, South Africa,
24–29 August
2014.
»
Abstract: Optimistic planning is an optimal control approach from artificial intelligence, which can be applied in receding horizon. It works for very general nonlinear dynamics and cost functions, and its analysis establishes a tight relationship between computation invested and near-optimality. However, there is no optimistic planning algorithm that searches for closed-loop solutions in stochastic problems with continuous transition distributions. Such transitions are essential in control, where they arise e.g. due to continuous disturbances. Existing algorithms only search for open-loop input sequences, which are suboptimal. We therefore propose a closed-loop algorithm that discretizes the continuous transition distribution into sigma points, and call it sigma-optimistic planning. Assuming the error introduced by sigma-point discretization is bounded, we analyze the solution returned, showing that it is near-optimal. The algorithm is evaluated in simulation experiments, where it performs better than a state-of-the-art open-loop planning technique; a certainty-equivalence approach also works well.
Online at ScienceDirect.
«

K. Mathe, L. Busoniu, L. Miclea,
Optimistic Planning with Long Sequences of Identical Actions for Near-Optimal Nonlinear Control.
In
Proceedings 2014 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR14),
Cluj-Napoca, Romania,
22–24 May
2014.
»
Abstract: Optimistic planning for deterministic systems (OPD) is an algorithm able to find near-optimal control for very general, nonlinear systems. OPD iteratively builds near-optimal sequences of actions by always refining the most promising sequence; this is done by adding all possible one-step actions. However, OPD has large computational costs, which might be undesirable in real-life applications. This paper proposes an adaptation of OPD for a specific subclass of control problems where control actions do not change often (e.g. bang-bang, time-optimal control). The new algorithm is called Optimistic Planning with K identical actions (OKP), and it refines sequences by adding, in addition to one-step actions, also repetitions of each action up to K times. Our analysis proves that the a posteriori performance guarantees are similar to those of OPD, improving with the length of the explored sequences, though the asymptotic behaviour of OKP cannot be formally predicted a priori. Simulations illustrate that for properly chosen parameter K, in a control problem from the class considered, OKP outperforms OPD.
«

E. Pall, K. Mathe, L. Tamas, L. Busoniu,
Railway Track Following with the AR.Drone Using Vanishing Point Detection.
In
Proceedings 2014 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR14),
Cluj-Napoca, Romania,
22–24 May
2014.
»
Abstract: Unmanned aerial vehicles are increasingly being used and showing their advantages in many domains. However, their application to railway systems has been studied very little. In this paper, we focus on controlling an AR.Drone UAV in order to follow the railway track. The method developed relies on vision-based detection and tracking of the vanishing point of the railway tracks, overhead lines, and other related lines in the image, coupled with a controller that adjusts the yaw so as to keep the vanishing point in the center of the image. Simulation results illustrate that the method is effective, and are complemented by vanishing-point tracking results on real images.
«

L. Busoniu, C. Morarescu,
Consensus for Agents with General Dynamics Using Optimistic Optimization.
In
Proceedings 2013 Conference on Decision and Control (CDC13),
Florence, Italy,
10–13 December
2013.
»
Abstract: An important challenge in multiagent systems is consensus, in which the agents must agree on certain controlled variables of interest. So far, most consensus algorithms for agents with nonlinear dynamics exploit the specific form of the nonlinearity. Here, we propose an approach that only requires a black-box simulation model of the dynamics, and is therefore applicable to a wide class of nonlinearities. This approach works for agents communicating on a fixed, connected network. It designs a reference behavior with a classical consensus protocol, and then finds control actions that drive the nonlinear agents towards the reference states, using a recent optimistic optimization algorithm. By exploiting the guarantees of optimistic optimization, we prove that the agents achieve practical consensus. A representative example is further analyzed, and simulation results on nonlinear robotic arms are provided.
This is an extended version with a more detailed proof of the main result.
«

L. Busoniu, R. Postoyan, J. Daafouz,
Near-Optimal Strategies for Nonlinear Networked Control Systems Using Optimistic Planning.
In
Proceedings 2013 American Control Conference (ACC13),
Washington DC, US,
17–19 June
2013.
»
Abstract: We consider the scenario where a controller communicates with a general nonlinear plant via a network, and must optimize a performance index. The problem is modeled in discrete time and the admissible control inputs are constrained to belong to a finite set. Exploiting a recent optimistic planning algorithm from the artificial intelligence field, we propose two control strategies that take into account communication constraints induced by the use of the network. Both resulting algorithms have guaranteed near-optimality. In the first strategy, input sequences are transmitted to the plant at a fixed period, and we show bounded computation. In the second strategy, the algorithm decides the next transmission instant according to the last state measurement (leading to a self-triggered policy), working within a fixed computation budget. For this case, we guarantee long transmission intervals. Examples and simulation experiments are provided throughout the paper to illustrate the results.
«

L. Busoniu, C. Morarescu,
Optimistic Planning for Consensus.
In
Proceedings 2013 American Control Conference (ACC13),
Washington DC, US,
17–19 June
2013.
»
Abstract: An important challenge in multiagent systems is consensus, in which the agents are required to synchronize certain controlled variables of interest, often using only an incomplete and time-varying communication graph. We propose a consensus approach based on optimistic planning (OP), a predictive control algorithm that finds near-optimal control actions for any nonlinear dynamics and reward (cost) function. At every step, each agent uses OP to solve a local control problem with rewards that express the consensus objectives. Neighboring agents coordinate by exchanging their predicted behaviors in a predefined order. Due to its generality, OP consensus can adapt to any agent dynamics and, by changing the reward function, to a variety of consensus objectives. OP consensus is demonstrated for velocity consensus (flocking) with a time-varying communication graph, where it preserves connectivity better than a classical algorithm; and for leaderless and leader-based consensus of robotic arms, where OP easily deals with the nonlinear dynamics.
«

L. Busoniu, A. Daniels, R. Munos, R. Babuska,
Optimistic Planning for Continuous-Action Deterministic Systems.
In
Proceedings 2013 Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL13),
Singapore,
15–19 April
2013.
»
Abstract: We consider the class of online planning algorithms for optimal control, which compared to dynamic programming are relatively unaffected by large state dimensionality. We introduce a novel planning algorithm called SOOP that works for deterministic systems with continuous states and actions. SOOP is the first method to explore the true solution space, consisting of infinite sequences of continuous actions, without requiring knowledge about the smoothness of the system. SOOP can be used parameter-free at the cost of more model calls, but we also propose a more practical variant tuned by a parameter alpha, which balances finer discretization with longer planning horizons. Experiments on three problems show SOOP reliably ranks among the best algorithms, fully dominating competing methods when the problem requires both long horizons and fine discretization.
«

R. Fonteneau, L. Busoniu, R. Munos,
Optimistic Planning for Belief-Augmented Markov Decision Processes.
In
Proceedings 2013 Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL13),
Singapore,
15–19 April
2013.
»
Abstract: This paper presents the Bayesian Optimistic Planning (BOP) algorithm, a novel model-based Bayesian reinforcement learning approach. BOP extends the planning approach of the Optimistic Planning for Markov Decision Processes (OP-MDP) algorithm to contexts where the transition model of the MDP is initially unknown and progressively learned through interactions within the environment. The knowledge about the unknown MDP is represented with a probability distribution over all possible transition models using Dirichlet distributions, and the BOP algorithm plans in the belief-augmented state space constructed by concatenating the original state vector with the current posterior distribution over transition models. We show that BOP becomes Bayesian optimal when the budget parameter increases to infinity. Preliminary empirical validations show promising performance.
«

I. Grondman, L. Busoniu, R. Babuska,
Model-Learning Actor-Critic Algorithms: Performance Evaluation in a Motion Control Task.
In
Proceedings 51st IEEE Conference on Decision and Control (CDC12),
pages 5272–5277,
Maui, Hawaii,
10–13 December
2012.
»
Abstract: Reinforcement learning (RL) control provides a means to deal with uncertainty and nonlinearity associated with control tasks in an optimal way. The class of actor–critic RL algorithms proved useful for control systems with continuous state and input variables. In the literature, model-based actor–critic algorithms have recently been introduced to considerably speed up the learning by constructing online a model through local linear regression (LLR). However, it has not been analyzed whether the speedup is due to the model-learning structure or the LLR approximator. Therefore, in this paper we generalize the model-learning actor–critic algorithms to make them suitable for use with an arbitrary function approximator. Furthermore, we present the results of an extensive analysis through numerical simulations of a typical nonlinear motion control problem. The LLR approximator is compared with radial basis functions (RBFs) in terms of the initial convergence rate and in terms of the final performance obtained. The results show that LLR-based actor–critic RL outperforms the RBF counterpart: it gives quick initial learning and comparable or even superior final control performance.
Online at IEEEXplore.
«

M. Vaandrager, R. Babuska, L. Busoniu, G. Lopes,
Imitation Learning with Non-Parametric Regression.
In
Proceedings 2012 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR12),
pages 91–96,
Cluj-Napoca, Romania,
24–27 May
2012.
»
Abstract: Humans are very fast learners. Yet, we rarely learn a task completely from scratch. Instead, we usually start with a rough approximation of the desired behavior and take the learning from there. In this paper, we use imitation to quickly generate a rough solution to a robotic task from demonstrations, supplied as a collection of state-space trajectories. Appropriate control actions needed to steer the system along the trajectories are then automatically learned in the form of a (nonlinear) state-feedback control law. The learning scheme has two components: a dynamic reference model and an adaptive inverse process model, both based on a data-driven, non-parametric method called local linear regression. The reference model infers the desired behavior from the demonstration trajectories, while the inverse process model provides the control actions to achieve this behavior and is improved online using learning. Experimental results with a pendulum swing-up problem and a robotic arm demonstrate the practical usefulness of this approach. The resulting learned dynamics are not limited to single trajectories, but capture instead the overall dynamics of the motion, making the proposed approach a promising step towards versatile learning machines such as future household robots, or robots for autonomous missions.
Online at IEEEXplore.
«

L. Busoniu, R. Munos,
Optimistic Planning for Markov Decision Processes.
In
Proceedings 15th International Conference on Artificial Intelligence and Statistics (AISTATS12),
pages 182–189,
La Palma, Canary Islands, Spain,
21–23 April
2012.
»
Abstract: The reinforcement learning community has recently intensified its interest in online planning methods, due to their relative independence of the state space size. However, tight near-optimality guarantees are not yet available for the general case of stochastic Markov decision processes and closed-loop, state-dependent planning policies. We therefore consider an algorithm related to AO* that optimistically explores a tree representation of the space of closed-loop policies, and we analyze the near-optimality of the action it returns after n tree node expansions. While this optimistic planning requires a finite number of actions and possible next states for each transition, its asymptotic performance does not depend directly on these numbers, but only on the subset of nodes that significantly impact near-optimal policies. We characterize this set by introducing a novel measure of problem complexity, called the near-optimality exponent. Specializing the exponent and performance bound for some interesting classes of MDPs illustrates that the algorithm works better when there are fewer near-optimal policies and less uniform transition probabilities.
The PDF includes supplementary material to the paper, containing proofs of the analytical results.
Online at JMLR Proceedings.
«

I. Grondman, M. Vaandrager, L. Busoniu, R. Babuska, E. Schuitema,
Actor-Critic Control with Reference Model Learning.
In
Proceedings 18th IFAC World Congress (IFAC11),
pages 14723–14728,
Milano, Italy,
22 August–2 September
2011.
»
Abstract: We propose a new actor-critic algorithm for reinforcement learning. The algorithm does not use an explicit actor, but learns a reference model which represents a desired behaviour, along which the process is to be controlled by using the inverse of a learned process model. The algorithm uses Local Linear Regression (LLR) to learn approximations of all the functions involved. The online learning of a process and reference model, in combination with LLR, provides an efficient policy update for faster learning. In addition, the algorithm facilitates the incorporation of prior knowledge. The novel method and a standard actor-critic algorithm are applied to the pendulum swing-up problem, in which the novel method achieves faster learning than the standard algorithm.
Online at ScienceDirect.
«

S. Norrouzadeh, L. Busoniu, R. Babuska,
Efficient Knowledge Transfer in Shaping Reinforcement Learning.
In
Proceedings 18th IFAC World Congress (IFAC11),
Milano, Italy,
22 August–2 September
2011.
»
Abstract: Reinforcement learning is an attractive solution for deriving an optimal control policy by online exploration of the control task. Shaping aims to accelerate reinforcement learning by starting from easy tasks and gradually increasing the complexity, until the original task is solved. In this paper, we consider the essential decision on when to transfer learning from an easier task to a more difficult one, so that the total learning time is reduced. We propose two transfer criteria for making this decision, based on the agent's performance. The first criterion measures the agent's performance by the distance between its current solution and the optimal one, and the second by the empirical return obtained. We investigate the learning time gains achieved by using these criteria in a classical gridworld navigation benchmark. This numerical study also serves to compare several major shaping techniques.
Online at ScienceDirect.
«

L. Busoniu, D. Ernst, B. De Schutter, R. Babuska,
Approximate Reinforcement Learning: An Overview.
In
Proceedings 2011 IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL11),
pages 1–8,
Paris, France,
11–15 April
2011.
»
Abstract: Reinforcement learning (RL) allows agents to learn how to optimally interact with complex environments. Fueled by recent advances in approximation-based algorithms, RL has obtained impressive successes in robotics, artificial intelligence, control, operations research, etc. However, the scarcity of survey papers about approximate RL makes it difficult for newcomers to grasp this intricate field. With the present overview, we take a step toward alleviating this situation. We review methods for approximate RL, starting from their dynamic programming roots and organizing them into three major classes: approximate value iteration, policy iteration, and policy search. Each class is subdivided into representative categories, highlighting among others offline and online algorithms, policy gradient methods, and simulation-based techniques. We also compare the different categories of methods, and outline possible ways to enhance the reviewed algorithms.
Online at IEEEXplore.
«

L. Busoniu, R. Munos, B. De Schutter, R. Babuska,
Optimistic Planning for Sparsely Stochastic Systems.
In
Proceedings 2011 IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL11),
pages 48–55,
Paris, France,
11–15 April
2011.
Part of the Special Session on Active Reinforcement Learning.
»
Abstract: We propose an online planning algorithm for finite-action, sparsely stochastic Markov decision processes, in which the random state transitions can only end up in a small number of possible next states. The algorithm builds a planning tree by iteratively expanding states, where each expansion exploits sparsity to add all possible successor states. Each state to expand is actively chosen to improve the knowledge about action quality, and this allows the algorithm to return a good action after a strictly limited number of expansions. More specifically, the active selection method is optimistic in that it chooses the most promising states first, so the novel algorithm is called optimistic planning for sparsely stochastic systems. We note that the new algorithm can also be seen as model-predictive (receding-horizon) control. The algorithm obtains promising numerical results, including the successful online control of a simulated HIV infection with stochastic drug effectiveness.
Online at IEEEXplore.
«

E. Schuitema, L. Busoniu, R. Babuska, P. Jonker,
Control Delay in Reinforcement Learning for Real-Time Dynamic Systems: A Memoryless Approach.
In
Proceedings 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS10),
pages 3226–3231,
Taipei, Taiwan,
18–22 October
2010.
»
Abstract: Robots controlled by Reinforcement Learning (RL) are still rare. A core challenge to the application of RL to robotic systems
is to learn despite the existence of control delay – the delay between measuring a system's state and acting upon it.
Control delay is always present in real systems. In this work, we present two novel temporal difference (TD) learning
algorithms for problems with control delay. These algorithms improve learning performance by taking the control delay into
account. We test our algorithms in a gridworld, where the delay is an integer multiple of the time step, as well as in the
simulation of a robotic system, where the delay can have any value. In both tests, our proposed algorithms outperform
classical TD learning algorithms, while maintaining low computational complexity.
Online at IEEEXplore.
«

L. Busoniu, D. Ernst, B. De Schutter, R. Babuska,
Online Least-Squares Policy Iteration for Reinforcement Learning Control.
In
Proceedings 2010 American Control Conference (ACC10),
pages 486–491,
Baltimore, United States,
30 June – 2 July
2010.
»
Abstract: Reinforcement learning is a promising paradigm for learning optimal control.
We consider policy iteration (PI) algorithms for reinforcement learning, which iteratively
evaluate and improve control policies. State-of-the-art, least-squares
techniques for policy evaluation are sample-efficient and have relaxed convergence requirements.
However, they are typically used in offline PI, whereas a central goal of reinforcement
learning is to develop online algorithms. Therefore, we propose an online PI algorithm
that evaluates policies with the so-called least-squares temporal difference for Q-functions (LSTDQ).
The crucial difference between this online least-squares policy iteration (LSPI)
algorithm and its offline counterpart is that, in the online case, policy improvements must be
performed once every few state transitions, using only an incomplete evaluation of the current policy.
In an extensive experimental evaluation, online LSPI is found to work well for a wide range of
its parameters, and to learn successfully in a real-time example. Online LSPI also compares favorably
with offline LSPI and with a different flavor of online PI, which instead of LSTDQ employs another
least-squares method for policy evaluation.
Online at IEEEXplore.
«

L. Busoniu, B. De Schutter, R. Babuska, D. Ernst,
Using Prior Knowledge to Accelerate Online Least-Squares Policy Iteration.
In
Proceedings 2010 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR10),
Cluj-Napoca, Romania,
28–30 May
2010.
»
Abstract: Reinforcement learning (RL) is a promising paradigm for learning optimal control. Although RL is generally envisioned as
working without any prior knowledge about the system, such knowledge is often available and can be exploited to great
advantage. In this paper, we consider prior knowledge about the monotonicity of the control policy with respect to the
system states, and we introduce an approach that exploits this type of prior knowledge to accelerate a state-of-the-art RL
algorithm called online least-squares policy iteration (LSPI). Monotonic policies are appropriate for important classes of
systems appearing in control applications. LSPI is a data-efficient RL algorithm that we previously extended to online
learning, but that did not provide until now a way to use prior knowledge about the policy. In an empirical evaluation,
online LSPI with prior knowledge learns much faster and more reliably than the original online LSPI.
Online at IEEEXplore.
«

L. Busoniu, D. Ernst, B. De Schutter, R. Babuska,
Policy Search with Cross-Entropy Optimization of Basis Functions.
In
Proceedings 2009 IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL09),
pages 153–160,
Nashville, United States,
30 March – 2 April
2009.
»
Abstract: This paper introduces a novel algorithm for approximate policy search
in continuous-state, discrete-action Markov decision processes (MDPs).
Previous policy search approaches have typically used ad-hoc parameterizations
developed for specific MDPs. In contrast, the novel algorithm employs a
flexible policy parameterization, suitable for solving general discrete-action MDPs.
The algorithm looks for the best closed-loop policy that can be represented
using a given number of basis functions, where a discrete action is assigned
to each basis function. The locations and shapes of the basis functions are optimized,
together with the action assignments. This allows a large class of policies to be represented.
The optimization is carried out with the cross-entropy method and evaluates the policies by
their empirical return from a representative set of initial states. We report simulation
experiments in which the algorithm reliably obtains good policies with only a small
number of basis functions, albeit at sizable computational costs.
The SMCB 2010 journal article above is a heavily extended and revised version of this paper.
Online at IEEEXplore.
«

X. Yuan, L. Busoniu, R. Babuska,
Reinforcement Learning for Elevator Control.
In
Proceedings 17th IFAC World Congress (IFAC08),
pages 2212–2217,
Seoul, Korea,
6–11 July
2008.
»
Abstract: Reinforcement learning (RL) comprises an array of techniques that learn a
control policy so as to maximize a reward signal. When applied to the control
of elevator systems, RL has the potential of finding better control policies
than classical heuristic, suboptimal policies. On the other hand, elevator systems
offer an interesting benchmark application for the study of RL. In this paper,
RL is applied to a single-elevator system. The mathematical model of the elevator system
is described in detail, making the system easy to reimplement and reuse.
An experimental comparison is made between the performance of the Q-value iteration
and Q-learning RL algorithms, when applied to the elevator system.
Online at ScienceDirect.
«

L. Busoniu, D. Ernst, B. De Schutter, R. Babuska,
Fuzzy Partition Optimization for Approximate Fuzzy Q-iteration.
In
Proceedings 17th IFAC World Congress (IFAC08),
pages 5629–5634,
Seoul, Korea,
6–11 July
2008.
»
Abstract: Reinforcement learning (RL) is a widely used learning paradigm for adaptive agents.
Because exact RL can only be applied to very simple problems, approximate algorithms are
usually necessary in practice. Many algorithms for approximate RL rely on basis-function
representations of the value function (or of the Q-function). Designing a good set of basis
functions without any prior knowledge of the value function (or of the Q-function) can be a
difficult task. In this paper, we propose instead a technique to optimize the shape of a constant
number of basis functions for the approximate, fuzzy Q-iteration algorithm. In contrast to other
approaches to adapt basis functions for RL, our optimization criterion measures the actual
performance of the computed policies in the task, using simulation from a representative set
of initial states. A complete algorithm, using cross-entropy optimization of triangular fuzzy
membership functions, is given and applied to the car-on-the-hill example.
Online at ScienceDirect.
«

L. Busoniu, D. Ernst, R. Babuska, B. De Schutter,
Consistency of Fuzzy Model-Based Reinforcement Learning.
In
Proceedings 2008 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE08),
pages 518–524,
Hong Kong,
1–6 June
2008.
»
Abstract: Reinforcement learning (RL) is a widely used paradigm for learning control. Computing exact RL solutions is generally only
possible when process states and control actions take values in a small discrete set. In practice, approximate algorithms are
necessary. In this paper, we propose an approximate, model-based Q-iteration algorithm that relies on a fuzzy partition of the
state space, and a discretization of the action space. Using assumptions on the continuity of the dynamics and of the reward
function, we show that the resulting algorithm is consistent, i.e., that the optimal solution is obtained asymptotically as
the approximation accuracy increases. An experimental study indicates that a continuous reward function is also important for
a predictable improvement in performance as the approximation accuracy increases.
Online at IEEEXplore.
«

L. Busoniu, D. Ernst, R. Babuska, B. De Schutter,
Fuzzy Approximation for Convergent Model-Based Reinforcement Learning.
In
2007 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE07),
pages 968–973,
London, United Kingdom,
23–26 July
2007.
»
Abstract: Reinforcement learning (RL) is a learning control paradigm that provides well-understood algorithms with good convergence and
consistency properties. Unfortunately, these algorithms require that process states and control
actions take only discrete values. Approximate solutions using fuzzy representations have been
proposed in the literature for the case when the states and possibly the actions are continuous.
However, the link between these mainly heuristic solutions and the larger body of work on approximate
RL, including convergence results, has not been made explicit. In this paper, we propose a fuzzy
approximation structure for the Q-value iteration algorithm, and show that the resulting algorithm is
convergent. The proof is based on an extension of previous results in approximate RL. We then
propose a modified, serial version of the algorithm that is guaranteed to converge at least as fast
as the original algorithm. An illustrative simulation example is also provided.
Online at IEEEXplore.
«

L. Busoniu, D. Ernst, R. Babuska, B. De Schutter,
Continuous-State Reinforcement Learning with Fuzzy Approximation.
In
Adaptive Learning Agents and Multi-Agent Systems (ALAMAS07) Symposium,
pages 21–35,
Maastricht, The Netherlands,
2–3 April
2007.
»
Abstract: Reinforcement learning (RL) is a widely used learning paradigm for adaptive agents. Well-understood RL algorithms
with good convergence and consistency properties exist. In their original form, these
algorithms require that the environment states and agent actions take values in a relatively
small discrete set. Fuzzy representations for approximate, model-free RL have been proposed in
the literature for the more difficult case where the state-action space is continuous. In this
work, we propose a fuzzy approximation structure similar to those previously used for
Q-learning, but we combine it with the model-based Q-value iteration algorithm. We show that
the resulting algorithm converges. We also give a modified, serial variant of the algorithm
that converges at least as fast as the original version. An illustrative simulation example is
provided.
The (downloadable) LNAI 2008 paper above is an extended and revised version of this paper.
«

L. Busoniu, B. De Schutter, R. Babuska,
Decentralized Reinforcement Learning Control of a Robotic Manipulator.
In
Proceedings 9th IEEE International Conference on Control, Automation, Robotics and Vision (ICARCV06),
pages 1347–1352,
Singapore,
5–8 December
2006.
»
Abstract: Multiagent systems are rapidly finding applications in a variety of domains, including
robotics, distributed control, telecommunications, etc. Learning approaches to multiagent
control, many of them based on reinforcement learning (RL), are investigated in complex domains
such as teams of mobile robots. However, the application of decentralized RL to low-level
control tasks is not as intensively studied. In this paper, we investigate centralized and
decentralized RL, emphasizing the challenges and potential advantages of the latter. These are
then illustrated on an example: learning to control a two-link rigid manipulator. In closing,
some open issues and future research directions in decentralized RL are outlined.
Keywords: multiagent learning, decentralized control, reinforcement learning.
Online at IEEEXplore.
«

L. Busoniu, R. Babuska, B. De Schutter,
Multi-Agent Reinforcement Learning: A Survey.
In
Proceedings 9th IEEE International Conference on Control, Automation, Robotics and Vision (ICARCV06),
pages 527–532,
Singapore,
5–8 December
2006.
»
Abstract: Multiagent systems are rapidly finding applications in a variety of domains, including
robotics, distributed control, telecommunications, etc. Many tasks arising in these domains
require that the agents learn behaviors online. A significant part of the research on
multiagent learning concerns reinforcement learning techniques. However, due to different
viewpoints on central issues, such as the formal statement of the learning goal, a large number
of different methods and approaches have been introduced. In this paper we aim to present an
integrated survey of the field. First, the issue of the multiagent learning goal is discussed,
after which a representative selection of algorithms is reviewed. Open issues are identified
and future research directions are outlined.
Keywords: multiagent systems, reinforcement learning, game theory,
distributed control.
The (downloadable) SMCC 2008 journal article above is an extended and revised version of this paper.
Online at IEEEXplore.
«

L. Busoniu, B. De Schutter, R. Babuska,
Multiagent Reinforcement Learning with Adaptive State Focus.
In
Proceedings 17th Belgian-Dutch Conference on Artificial Intelligence (BNAIC05),
pages 35–42,
Brussels, Belgium,
17–18 October
2005.
»
Abstract: In realistic multiagent systems, learning on the basis of complete state information is not
feasible. We introduce adaptive state focus Q-learning, a class of methods derived from
Q-learning that start learning with only the state information that is strictly necessary for a
single agent to perform the task, and that monitor the convergence of learning. If lack of
convergence is detected, the learner dynamically expands its state space to incorporate more state
information (e.g., states of other agents). Learning is faster and takes fewer resources than if the
complete state were considered from the start, while being able to handle situations where agents
interfere in pursuing their goals. We illustrate our approach by instantiating a simple version of
such a method, and by showing that it outperforms learning with full state information without
being hindered by the deficiencies of learning on the basis of a single agent's state.
Keywords: multiagent learning, adaptive learning, Q-learning, coordination.
Online at PubZone.
«
Theses

L. Busoniu,
Optimistic Planning for Nonlinear Optimal Control and Networked Systems,
habilitation thesis, 2015, 155 pages.
»
Abstract: This thesis deals with the optimal control of nonlinear systems using artificial intelligence techniques. In particular, we exploit the class of optimistic planning methods from AI, and along a first, fundamental research line we extend them to handle several novel aspects, including stochasticity in the dynamics, disturbance modeled conservatively as an opponent, continuous input actions, etc. The second main line of the work applies optimistic planning to unsolved problems in control, including the networked near-optimal control of general nonlinear systems, and the cooperative control of multiagent systems. All techniques come with validation in simulation or real-life experiments, and most of them are theoretically analyzed to guarantee near-optimality and other properties, as a function of the computational effort invested. Applications in robotics are additionally explored. The thesis concludes with an outline of future plans, aiming to achieve the overall long-term goal of a learning and planning framework for the control of complex systems.
«

L. Busoniu,
Reinforcement Learning in Continuous State and Action Spaces,
PhD thesis, 2008, 190 pages, ISBN 9789090237541.
»
Abstract:
Reinforcement learning (RL) and dynamic programming (DP) algorithms can be used to solve problems in a variety of fields, among which automatic control, artificial intelligence, operations research, and economy. These algorithms find an optimal policy, which maximizes a numerical reward signal measuring the performance. DP algorithms require a model of the problem's dynamics, whereas RL algorithms work without a model. Online RL algorithms do not even require data in advance; they learn from experience. However, DP and RL can find exact solutions only when the states and the control actions take values in a small discrete set. In large discrete spaces and in continuous spaces, approximate solutions have to be used. This is the case, e.g., in automatic control, where the states and actions are usually continuous.
This thesis proposes several novel algorithms for approximate RL and DP, which work in problems with continuous variables: fuzzy Q-iteration, online least-squares policy iteration, and cross-entropy policy search. Fuzzy Q-iteration is a DP algorithm that represents the value function (cumulative rewards) using a fuzzy partition of the state space and a discretization of the action space. The value function is used to compute a near-optimal policy. Fuzzy Q-iteration is provably convergent and consistent. Online least-squares policy iteration is an RL algorithm that efficiently learns from experience an approximate value function and a corresponding policy. It updates the value function parameters by solving linear systems of equations. Cross-entropy policy search represents policies using a highly flexible parameterization, and optimizes the parameters with the cross-entropy method. A representative selection of control problems is used to assess the performance of the proposed algorithms. Additionally, the thesis provides an extensive review of the state-of-the-art in approximate DP and RL, and discusses some fundamental open issues in the field.
To obtain a bound hardcopy of the thesis free of charge, please contact me (preferably by email) mentioning your name and address, together with your interest in the thesis.
«
National journal papers

K. Mathe, L. Busoniu, L. Miclea,
Optimistic Planning with Long Sequences of Identical Actions: An Extended Theoretical and Experimental Study.
Acta Electrotehnica,
vol. 56,
no. 4,
pages 27–34,
2015.
»
Abstract: Optimistic planning for deterministic systems (OPD) finds near-optimal control solutions for general, nonlinear systems. OPD iteratively explores a search tree of action sequences by always further expanding the most promising sequence, where each expansion appends all possible one-step actions. However, the generality of the algorithm comes at a high computational cost. We aim to alleviate this complexity in a subclass of control problems where longer ranges of constant actions are preferred, by adapting OPD to this class of problems. The novel algorithm is called optimistic planning with K identical actions (OKP), and it creates sequences by appending to them up to K repetitions of each possible action. In our analysis we show that OKP indeed offers a posteriori performance similar to that of OPD, and that in certain cases the tree depth reached (a measure of the performance) is increased compared to OPD. Our experiments, performed on the inverted pendulum and HIV infection treatment control, confirm that for suitable control problems OKP can perform better than OPD, provided the parameter K is properly tuned.
«

L. Busoniu, B. De Schutter, R. Babuska, D. Ernst,
Exploiting Policy Knowledge in Online Least-Squares Policy Iteration: An Empirical Study.
Automation, Computers, Applied Mathematics,
vol. 19,
no. 4,
pages 521–529,
2010.
»
Abstract: Reinforcement learning (RL) is a promising paradigm for learning optimal control. Traditional RL works for discrete variables only, so to deal with the continuous variables appearing in control problems, approximate representations of the solution are necessary. The field of approximate RL has expanded tremendously over the last decade, and a wide array of effective algorithms is now available. However, RL is generally envisioned as working without any prior knowledge about the system or the solution, whereas such knowledge is often available and can be exploited to great advantage. Therefore, in this paper we describe a method that exploits prior knowledge to accelerate online least-squares policy iteration (LSPI), a state-of-the-art algorithm for approximate RL. We focus on prior knowledge about the monotonicity of the control policy with respect to the system states. Such monotonic policies are appropriate for important classes of systems appearing in control applications, including for instance nearly linear systems and linear systems with monotonic input nonlinearities. In an empirical evaluation, online LSPI with prior knowledge is shown to learn much faster and more reliably than the original online LSPI.
«
Abstracts, workshop presentations

L. Busoniu, A. Daniels, R. Munos, R. Babuska,
Optimistic Planning for Continuous-Action Deterministic Systems.
Presented at the 8èmes Journées Francophones sur la Planification, la Décision et l'Apprentissage pour la conduite de systèmes (JFPDA-13),
Lille, France,
1–2 July
2013.
»
Abstract: We consider the optimal control of systems with deterministic dynamics, continuous, possibly large-scale state spaces, and continuous, low-dimensional action spaces. We describe an online planning algorithm called SOOP, which, like other algorithms in its class, has no direct dependence on the state space structure. Unlike previous algorithms, SOOP explores the true solution space, consisting of infinite sequences of continuous actions, without requiring knowledge about the smoothness of the system. To this end, it borrows the principle of the simultaneous optimistic optimization method, and develops a nontrivial adaptation of this principle to the planning problem. Experiments on four problems show that SOOP reliably ranks among the best algorithms, fully dominating competing methods when the problem requires both long horizons and fine discretization.
This is an extended version of the ADPRL-13 paper with the same title.
«

L. Busoniu, R. Munos, B. De Schutter, R. Babuska,
Optimistic Planning for Sparsely Stochastic Systems.
Presented at the 2011 Workshop on Monte-Carlo Tree Search: Theory and Applications, within the 21st International Conference on Automated Planning and Scheduling (ICAPS-11),
Freiburg, Germany,
12 June
2011.
»
Abstract: We describe an online planning algorithm for finite-action, sparsely stochastic Markov decision processes, in which the random state transitions can only end up in a small number of possible next states. The algorithm builds a planning tree by iteratively expanding states, where the most promising states are expanded first, in an optimistic procedure aiming to return a good action after a strictly limited number of expansions. The novel algorithm is called optimistic planning for sparsely stochastic systems.
«
Disclaimer: The following applies to the papers that are directly available for download as PDF files. This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each copyright holder. In most cases, these works may not be reposted without the explicit permission of the copyright holder. Additionally, the following applies to IEEE material: Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE.
